Preliminary Programme

Thursday 13 April 2023 11.00 - 13.00

T-6 FAM11 Methodological Advances in Social and Economic History
Victoriagatan 13, Victoriasalen

Network: Family and Demography	Chair: Rick Mourits
Organizer: Rick Mourits	Discussant: Ivo Zandhuis

Trygve Andersen, Narae Park, Bjørn-Richard Pedersen, Hilde Sommerseth & Lars Ailo Bongo : From Rule-based to ML-based Linking of Norwegian Population Censuses from the 19th and 20th Centuries

Currently, the Norwegian Historical Population Register (HPR) is constructed by rule-based record linkage algorithms. This paper presents the first results of machine learning models that link individuals who appear in two successive population censuses. The study uses the current HPR links as a training set, and a final optimal model is constructed in the following steps. First, the data are preprocessed to address the challenges unique to the Norwegian censuses, and additional constructed variables are created. Next, different machine learning models are trained with various settings, such as different feature sets (a time-invariant variable set and an extended variable set) and different methods (logistic regression, random forest, support vector machine, XGBoost). Finally, linking results and quality are evaluated in several respects (difference in results, linking rate, precision and recall), and a final optimal model is derived from this evaluation.
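The evaluation step named in this abstract (linking rate, precision and recall) can be illustrated with a minimal sketch. It assumes that links are stored as pairs of record identifiers and that the existing rule-based HPR links serve as the reference set; the function name `evaluate_links` and all identifiers are hypothetical and are not taken from the paper.

```python
def evaluate_links(predicted, reference, n_records):
    """Compare predicted census-to-census links against reference links.

    predicted, reference: sets of (id_in_first_census, id_in_second_census).
    n_records: number of records in the first census that could be linked.
    """
    true_positives = len(predicted & reference)
    # Precision: share of predicted links that also occur in the reference set.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference links that the model recovered.
    recall = true_positives / len(reference) if reference else 0.0
    # Linking rate: share of first-census records that received any link.
    linked_records = {first_id for first_id, _ in predicted}
    linking_rate = len(linked_records) / n_records if n_records else 0.0
    return {"precision": precision, "recall": recall, "linking_rate": linking_rate}


# Toy, made-up identifiers purely for illustration.
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
metrics = evaluate_links(predicted, reference, n_records=5)
print(metrics)
```

Because the reference links come from the rule-based system, a predicted link missing from the reference is not necessarily wrong; precision and recall measured this way describe agreement with the existing HPR links.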

Bjørn-Richard Pedersen, Hilde Sommerseth & Lars Ailo Bongo : Manual Review and Correction of ML Transcribed Occupational Codes from the Norwegian Population Census of 1950

The use of machine learning (ML) as a tool for transcribing historical data has increased dramatically over the past few years and has made the process more efficient in terms of both time and cost, but no ML model has an accuracy rate of 100%.
This means that there will always be a remainder of data that the model could not transcribe accurately and that requires human intervention, either for manual review or for correction.
In one of our previous projects, we created one such model to transcribe occupational codes from the 1950 Norwegian census. It achieved satisfactory results, transcribing ~97% of our data with a confidence of correctness above the 65% threshold. This means, however, that ~3% of our data still needs manual review and/or correction.
In real numbers, ~90,000 images of occupational codes need manual processing. This is very time-consuming and tedious, so the goal of the current project is to reduce the time and effort needed for such tasks.
We have created a custom labeling tool that allows a reviewer to process up to 60 images at once, updating the ML code assigned to each image where needed or confirming it. We have assumed that human review will be more correct than the ML-assigned codes, since these are results that the model was uncertain about, but we will also have 10% of our images labeled by two reviewers in an effort to measure human inaccuracy. We have also performed a pilot project to preemptively explore ways of reducing the number of inter-reviewer conflicts and human errors. The results of this pilot were presented at the ESHD in Madrid in 2022.
This year we will present the results of the complete manual review and correction, and report on the personal experiences gathered through interviews with our reviewers, which were conducted after the end of the project period.
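The review workflow described in this abstract can be sketched roughly as follows. The sketch assumes each model output is an `(image_id, code, confidence)` tuple; the 65% threshold, the batch size of 60 and the 10% double-review share come from the abstract, while the function names, the occupational code and the data are invented for illustration and do not describe the authors' tool.

```python
import random


def split_by_confidence(predictions, threshold=0.65):
    """Split (image_id, code, confidence) tuples into accepted and to-review lists."""
    accepted = [p for p in predictions if p[2] > threshold]
    review = [p for p in predictions if p[2] <= threshold]
    return accepted, review


def batches(items, size=60):
    """Yield consecutive batches of at most `size` items for one review screen."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def double_review_sample(items, fraction=0.10, seed=0):
    """Draw a reproducible random subset to be labelled by a second reviewer."""
    rng = random.Random(seed)
    return rng.sample(items, round(len(items) * fraction))


def agreement_rate(labels_a, labels_b):
    """Share of doubly reviewed images on which both reviewers chose the same code."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return 0.0
    return sum(labels_a[i] == labels_b[i] for i in shared) / len(shared)


# Made-up predictions: the first 130 images get a low confidence score.
predictions = [(f"img{i}", "123", 0.40 if i < 130 else 0.90) for i in range(1000)]
accepted, review = split_by_confidence(predictions)
batch_sizes = [len(batch) for batch in batches(review)]
second_pass = double_review_sample(review)
print(len(accepted), len(review), batch_sizes, len(second_pass))
```

Raw agreement is the simplest measure of inter-reviewer consistency; a chance-corrected statistic could be substituted if the code distribution is highly skewed.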

Lee Williamson, Eilidh Garrett : Methods and Findings from Creating the Scottish Historic Population Database (SHPD): Auto-coding of Deaths (to ICD-10) and Occupations (to HISCO) from Large Training Datasets

The Digitising Scotland project digitised 25.8 million Scottish civil registration records: births, marriages and deaths from when records began in 1855 to 1973. To use these records effectively for large-scale research, they must not only be made machine-readable but also be coded in a suitable research-ready format, including the broad classification of some of the information captured.
The information on the digitised birth, marriage and death certificates includes textual descriptions of occupations and causes of death. The overall project aims to create the research-ready Scottish Historic Population Database (SHPD), with these textual descriptions coded to widely used standard coding schemes. For SHPD, transcribed occupations are coded to the Historical International Standard Classification of Occupations (HISCO) and causes of death to the International Classification of Diseases, 10th revision (ICD-10).
It is impractical to have domain experts code all the records by hand (31 million occupations and 8 million causes of death) to create SHPD, especially as deaths often have more than one cause given on each record. The coding problem is therefore treated as a text classification task, and the process is automated by applying Natural Language Processing and Machine Learning techniques. To facilitate the auto-coding, a proportion of the records (a random sample of unique strings) was recently coded manually and will be used to train the system (90,000 occupations and 102,000 deaths).
Ahead of the auto-coding, initial pre-processing, cleaning and standardising is done on both the raw transcribed occupations and deaths, along with the training data for each type. For the training data this included removing white space and unreadable characters (deliberately introduced as part of the transcription) and creating dictionaries. For the main data, this reduced the 31M occupations to 2M unique strings (or 1.6M after removing those with unreadable characters) and the 8M causes of death to 2.2M unique strings (or 1.8M after removing those with unreadable characters). Then, following earlier computer science work within the SHPD project, an approach experimenting with Bayesian classifiers is applied.
Preliminary experiments using a relatively small pilot dataset obtained reasonable results from a combination of exact matching and statistical classification. In the pilot, combining exact matching for texts that have been seen in the training data with Bayes classifiers for the rest gave cross-validated accuracy levels of 92% for causes of death and 94-97% for occupations. Further experiments using a larger pilot (50,000 occupations) uncovered that, since some occupations are very common, this set of manually coded strings covers a very large proportion of the SHPD records (almost 88%). This proportion is not as high for deaths, given the different ways a cause of death can be written down, the variety of causes, etc.
This is work in progress to create the research-ready SHPD, and the final paper will include the results from experiments using the full training data (90,000 occupations and 102,000 deaths) covering the whole period from 1855 to 1973.
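The combination of exact matching and a Bayes classifier described in this abstract can be sketched as below. This is a toy illustration under stated assumptions, not the SHPD pipeline: the class name `ExactThenBayes`, the `normalise` helper, the use of a multinomial naive Bayes with add-one smoothing, and the labels `AGRI` and `TEXTILE` (stand-ins, not real HISCO codes) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalise(text):
    """Lower-case and collapse white space, a stand-in for the cleaning step."""
    return " ".join(text.lower().split())


class ExactThenBayes:
    """Exact lookup for strings seen in training, naive Bayes for the rest."""

    def __init__(self, training_pairs):
        self.lookup = {}                          # cleaned string -> code
        self.class_counts = Counter()             # code -> number of training strings
        self.token_counts = defaultdict(Counter)  # code -> token frequencies
        self.vocab = set()
        for text, code in training_pairs:
            text = normalise(text)
            self.lookup[text] = code
            self.class_counts[code] += 1
            for token in text.split():
                self.token_counts[code][token] += 1
                self.vocab.add(token)

    def _log_posterior(self, tokens, code):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[code] / total)
        # Add-one (Laplace) smoothing so unseen tokens do not zero out a class.
        denom = sum(self.token_counts[code].values()) + len(self.vocab)
        for token in tokens:
            score += math.log((self.token_counts[code][token] + 1) / denom)
        return score

    def classify(self, text):
        """Return (code, method), where method is 'exact' or 'bayes'."""
        text = normalise(text)
        if text in self.lookup:
            return self.lookup[text], "exact"
        tokens = text.split()
        best = max(self.class_counts, key=lambda c: self._log_posterior(tokens, c))
        return best, "bayes"


# Made-up training strings with placeholder labels.
training = [
    ("farm servant", "AGRI"), ("farmer", "AGRI"), ("ploughman farm", "AGRI"),
    ("cotton weaver", "TEXTILE"), ("wool weaver", "TEXTILE"),
    ("power loom weaver", "TEXTILE"),
]
coder = ExactThenBayes(training)
print(coder.classify("  Cotton   Weaver "))
print(coder.classify("linen weaver"))
```

Returning the method alongside the code makes it possible to report accuracy separately for exact matches and classifier predictions, and to see what share of records the manually coded strings cover.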

Richard Zijdeman : burgerLinker - Civil Registries Linking Tool

Mass digitisation projects provide historians and social scientists with datasets containing millions of observations on individuals and households. It is extremely valuable to link these records. However, as record linkage moves from thousands to millions of cases, efficient linkage strategies become paramount. In this paper, we present burgerLinker, our open-source tool to match historical records. The tool 1) is extremely fast and scalable, 2) is designed to match complex first names, and 3) requires no blocking (i.e. no restrictions on registration date, location, or parts of names). Moreover, the detected links contain detailed provenance metadata, can be saved in different formats (CSV and RDF are covered in the current version), and allow for family and life course reconstructions by computing the transitive closure over all detected links. We will use the Dutch civil registry to showcase our new matching tool. The birth, marriage, and death certificates from the 19th and early 20th centuries have been digitised to reconstruct families and life courses. This would entail a dataset containing 27.5 million certificates. We describe linkage strategies and software that reduce the computing time on this database to several hours. We also describe a data model to structure the resulting dataset.
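The transitive closure mentioned in this abstract groups certificates that are connected directly or through a chain of links. One common way to compute such groups is a union-find (disjoint-set) structure, sketched below. This is a generic illustration, not burgerLinker's implementation; the function name and the certificate identifiers are hypothetical.

```python
from collections import defaultdict


def connected_components(links):
    """Group certificate identifiers connected directly or indirectly by links."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in links:
        union(a, b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort for a deterministic result.
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


# Made-up pairwise links between certificates.
links = [
    ("birth:1", "marriage:7"),
    ("marriage:7", "death:3"),
    ("birth:2", "death:9"),
]
components = connected_components(links)
print(components)
```

Here `birth:1` and `death:3` end up in the same group although no direct link between them was detected, which is what makes a life course reconstruction possible from pairwise links.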