Preliminary Programme

Wed 18 March
    08.30 - 10.30
    11.00 - 13.00
    14.00 - 16.00
    16.30 - 18.30

Thu 19 March
    08.30 - 10.30
    11.00 - 13.00
    14.00 - 16.00
    16.30 - 18.30

Fri 20 March
    08.30 - 10.30
    11.00 - 13.00
    14.00 - 16.00
    16.30 - 18.30

Sat 21 March
    08.30 - 10.30
    11.00 - 13.00
    14.00 - 16.00
    16.00 - 17.00

Wednesday 18 March 2020 08.30 - 10.30
S-1 FAM20a Building of Demographic Databases I: Sharing of Experiences on Handwriting Text Recognition
P.J. Veth, 1.01
Network: Family and Demography
Chair: Gunnar Thorvaldsen
Organizer: Joana-Maria Pujadas-Mora
Discussant: Martin Dribe
Trent Alexander, Katie Genadek & Jonathan Fisher: Digitizing Handwriting with Automated Methods: a Pilot Project Using the 1990 U.S. Census Manuscripts
The U.S. Census Bureau maintains a large longitudinal research infrastructure that currently includes linked data from the 1940 census, the 2000-2010 censuses, major national surveys going back to 1973, and administrative records dating from the 1990s. These data are accessible to researchers around the U.S. via the Federal Statistical Research Data Centers (FSRDC) network. The major shortcoming of this infrastructure is that it lacks linkable files from the decennial censuses of 1950 through 1990. Full-count microdata from the 1960-1990 censuses are available for research, but datasets from these years do not include respondent names and therefore have not been linked over time. Focusing on the 1990 U.S. census, we describe the results of a project to develop methods for filling this gap. We created digital images from 1990 census microfilm, hand-keyed “truth data” from those images, supported two teams’ attempts to conduct Handwriting Recognition on the images, appended recovered names to already-existing microdata files, and linked the new 1990 census microdata records to previous and subsequent censuses. We describe our processes, the accuracy of the Handwriting Recognition, and the accuracy of the record linkage with the recovered names.
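The linkage step the authors describe — matching recovered names in one census to records in earlier and later censuses — can be sketched in miniature. This is an illustrative exact-match approach only; the field names, records, and blocking key below are invented for the example and are not the Census Bureau's actual linkage method.

```python
# Minimal sketch of name-based record linkage across census years.
# All field names and records are hypothetical illustrations.

def normalize(name: str) -> str:
    """Uppercase and strip punctuation so spelling variants compare equal."""
    return "".join(ch for ch in name.upper() if ch.isalpha() or ch == " ").strip()

def link_records(census_a, census_b):
    """Exact-match linkage, blocking on (normalized name, birth year)."""
    index = {}
    for rec in census_b:
        key = (normalize(rec["name"]), rec["birth_year"])
        index.setdefault(key, []).append(rec)
    links = []
    for rec in census_a:
        key = (normalize(rec["name"]), rec["birth_year"])
        for match in index.get(key, []):
            links.append((rec["id"], match["id"]))
    return links

a = [{"id": 1, "name": "Mary Smith", "birth_year": 1921}]
b = [{"id": 9, "name": "mary smith.", "birth_year": 1921}]
print(link_records(a, b))  # [(1, 9)]
```

Real historical linkage would add fuzzy name matching and tolerance for misreported ages; exact matching is only the simplest baseline.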

Lars Ailo Bongo, Tim Alexander Teige, Nikita Shvetsov, Johan Ravn, Einar Holsbø, Trygve Andersen & Gunnar Thorvaldsen: Automated Approaches for Transcription of 20th Century Norwegian Census Microdata
Automated approaches are necessary to transcribe large handwritten historical source materials such as the 1950 Norwegian Population Census, with 801 000 scanned double-sided questionnaires. We are the first in the world to transcribe such a full-count census (3.3 million inhabitants) without support from genealogists, and not as part of the original production of statistics.

In this paper, we describe and discuss lessons learned developing and using two automated approaches. First, to transcribe handwritten names, we use a clustering method to find similar images. To verify the name in each image, we search for the name in previously transcribed sources, and we manually remove images in the wrong cluster using a graphical user interface. Second, we use a deep learning approach to transcribe handwritten numbers. We split the codes into single digits and train a single-digit model. We combine the classified single digits into a multi-digit number. To verify the transcribed numbers, we check that the number is valid for a given column, and we plan to manually verify numbers using a graphical user interface that sorts the numbers based on the confidence of the classification algorithm.
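The second approach — combining single-digit classifications into a multi-digit number, checking column validity, and sorting by classifier confidence for manual review — can be sketched as follows. The classifier outputs and column rules here are invented placeholders, not the authors' actual models or validation tables.

```python
# Hedged sketch: single-digit predictions are combined into a number,
# checked against a per-column validity rule, and queued for manual
# review in order of rising confidence. All values are illustrative.

def combine_digits(digit_preds):
    """digit_preds: list of (digit, confidence) from a single-digit model.
    Returns the multi-digit number and the weakest per-digit confidence."""
    number = int("".join(str(d) for d, _ in digit_preds))
    confidence = min(c for _, c in digit_preds)
    return number, confidence

def valid_for_column(number, column):
    """Toy validity rule: e.g. an 'age' column only admits 0-120."""
    ranges = {"age": (0, 120), "marital_code": (1, 5)}
    lo, hi = ranges[column]
    return lo <= number <= hi

preds = [(4, 0.98), (2, 0.91)]          # model read "4" then "2"
number, conf = combine_digits(preds)
print(number, valid_for_column(number, "age"))  # 42 True

# Low-confidence numbers go first in the manual-verification queue:
queue = sorted([(number, conf)], key=lambda t: t[1])
```

Taking the minimum per-digit confidence is one conservative way to score a multi-digit prediction; a product of per-digit probabilities would be another.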

These two approaches allow faster transcription of handwritten microdata with quality controls, and may be used to recognize other types of nominative sources such as parish registers and vital records.

Joseph Price & Mark Clement: Using Handwriting Recognition to Auto-Index the US Census Records
Recent breakthroughs in handwriting recognition have the capability to improve the quality of the 1940 Census data and expand the set of fields that are available to use for research. Our handwriting recognition algorithm uses new data augmentation and normalization methods applied to a convolutional neural network that feeds into a Long Short-Term Memory (LSTM) network. We also have a unique advantage in having access to a training set of unprecedented size. Census records consist of a set of rows for each person and columns for each of the fields of information for that person. We’ve developed an algorithm to extract the sub-image in each cell of the census record and match these with the indexed data for that cell. This provides us with a labeled training set of 2.4 billion images from the 1940 census (18 fields x 132 million individuals). We are using our algorithm to re-index the 1940 census, fix mistakes made by the original human indexers, and expand the number of fields that are indexed. We conducted a pilot study on the 1930 census using a small training set and have already achieved a character error rate (CER) of 10.4% for names.
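The character error rate (CER) the authors report is conventionally computed as the Levenshtein edit distance between the predicted and true strings, divided by the length of the true string. A minimal sketch, with illustrative example strings that are not from the authors' data:

```python
# CER as commonly defined for HTR evaluation: edit distance between
# prediction and ground truth, divided by the ground-truth length.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(predicted: str, truth: str) -> float:
    return edit_distance(predicted, truth) / len(truth)

print(cer("Jonh Smith", "John Smith"))  # 0.2 (two substitutions / 10 chars)
```

A corpus-level CER would sum edit distances over all name pairs before dividing by the total ground-truth length, rather than averaging per-name rates.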
We also make use of the FamilySearch Family Tree, a crowdsourced genealogical database which includes a substantial number of individuals linked to the 1940 census. These sources have often been attached to the Family Tree by family members who have access to additional information about these people that improves the accuracy of the linkages to these sources. We use information from these sources to correct mistakes in the index of the 1940 census and identify alternative name spellings and nicknames for the individual.

Helene Vezina, Jean-Sébastien Bournival, Christopher Kermorvant & Marie-Laurence Bonhomme: i-BALSAC: Completing Families with the Help of Automatic Text Recognition
For almost 50 years, BALSAC has been reconstructing the genealogical lines and kinship relations of the Quebec population using data from marriage records. More recently, we have conducted projects aiming at completing family reconstruction through the addition of birth and death records. However, technical limitations have emerged since, as we move forward in time, more events are recorded every year and the task of integrating them into the database grows accordingly. It has become obvious that to pursue the development of the database we can no longer rely exclusively on manual or semi-automatic operations to digitize, integrate and link millions of records.
Progress in machine learning opens up promising avenues for historical databases. Word recognition algorithms, especially handwritten text recognition (HTR), have improved significantly in the past few years. In the first months of 2019 we initiated a new project relying on HTR for the transcription of Quebec civil registers. Ultimately, our goal is to process approximately 1.3 million pages of digitized records and extract about 6 million birth and death certificates from 1850 to 1917. We intend to identify and index various entities contained in each record: names and surnames (of subject, parents, and spouse), dates, places, and occupations. In this paper, we provide an overview of our approach and discuss in depth the difficulties encountered and the choices made to overcome them and achieve the best possible results. One of the key issues we are facing pertains to the quality of the digitized documents (quality of preservation as well as quality of digitization). Moreover, since we cover the whole Quebec territory and a period spanning 70 years, we observe a great diversity across registers in terms of wording and handwriting styles. Last but not least, the majority of records are in French but there is also a significant share written in English. We will conclude with a presentation of our most recent results on HTR operations.


