Preliminary Programme

Wednesday 18 March 2020 08.30 - 10.30
S-1 FAM20a Building of Demographic Databases I: Shortening the Building of Experiences on Handwriting Text Recognition
P.J. Veth, 1.01
Network: Family and Demography Chair: Gunnar Thorvaldsen
Organizer: Joana-Maria Pujadas-Mora Discussant: Martin Dribe
Trent Alexander, Katie Genadek & Jonathan Fisher : Digitizing Handwriting with Automated Methods: a Pilot Project Using the 1990 U.S. Census Manuscripts
The U.S. Census Bureau maintains a large longitudinal research infrastructure that currently includes linked data from the 1940 census, the 2000-2010 censuses, major national surveys going back to 1973, and administrative records dating from the 1990s. These data are accessible to researchers around the U.S. via the Federal Statistical Research Data Centers (FSRDC) network. The major shortcoming of this infrastructure is that it lacks linkable files from the decennial censuses of 1950 through 1990. Full-count microdata from the 1960-1990 censuses are available for research, but datasets from these years do not include respondent names and therefore have not been linked over time. Focusing on the 1990 U.S. census, we describe the results of a project to develop methods for filling this gap. We created digital images from 1990 census microfilm, hand-keyed “truth data” from those images, supported two teams’ attempts to conduct Handwriting Recognition on the images, appended recovered names to already-existing microdata files, and linked the new 1990 census microdata records to previous and subsequent censuses. We describe our processes, the accuracy of the Handwriting Recognition, and the accuracy of the record linkage with the recovered names.

Lars Ailo Bongo, Tim Alexander Teige & Nikita Shvetsov & Johan Ravn & Einar Holsbø & Trygve Andersen & Gunnar Thorvaldsen & Hilde L. Sommerseth : Automated Approaches for Transcription of 20th Century Norwegian Census Microdata
Automated approaches are necessary to transcribe large handwritten historical source materials such as the 1950 Norwegian Population Census, with 801 000 scanned double-sided questionnaires. We are the first in the world to transcribe such a full-count census (3.3 million inhabitants) without support from genealogists, and not as part of the original production of statistics.

In this paper, we describe and discuss lessons learned developing and using two automated approaches. First, to transcribe handwritten names, we use a clustering method to find similar images. To verify the name in each image, we search for the names in previously transcribed sources, and we manually remove images in the wrong cluster using a graphical user interface. Second, we use a deep learning approach to transcribe handwritten numbers. We split the codes into single digits and train a single-digit model. We combine the classified single digits into a multi-digit number. To verify the transcribed numbers, we check that the number is valid for a given column, and we plan to manually verify numbers using a graphical user interface that sorts the numbers based on the confidence of the classification algorithm.

More generally, these two approaches allow faster transcription of handwritten microdata with quality controls, and may be used to recognize other types of nominative sources such as parish registers and vital records.
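
The number-transcription steps in the abstract above (classify single digits, combine them into a multi-digit code, check the code against the values allowed for a column, and order items for manual review by classifier confidence) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `DigitPrediction`, `NumberResult`, `combine_digits` and `review_queue` are hypothetical, and taking the weakest single-digit confidence as the confidence of the whole number is an assumption, since the abstract does not say how confidences are aggregated.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class DigitPrediction:
    digit: int         # predicted class, 0-9 (output of a single-digit model)
    confidence: float  # probability assigned to that class, 0.0-1.0


@dataclass
class NumberResult:
    value: str         # digits joined in reading order, e.g. "042"
    confidence: float  # assumption: weakest single-digit confidence
    valid: bool        # True if the value is an allowed code for the column


def combine_digits(digits: List[DigitPrediction], valid_codes: Set[str]) -> NumberResult:
    """Join per-digit predictions into one code and validate it for a column."""
    if not digits:
        raise ValueError("at least one digit prediction is required")
    value = "".join(str(d.digit) for d in digits)
    # Assumption: a number is only as trustworthy as its least certain digit.
    confidence = min(d.confidence for d in digits)
    return NumberResult(value, confidence, value in valid_codes)


def review_queue(results: List[NumberResult]) -> List[NumberResult]:
    """Order results for manual checking: invalid codes first, then low confidence."""
    return sorted(results, key=lambda r: (r.valid, r.confidence))


if __name__ == "__main__":
    # Hypothetical two-digit column that only allows the codes "12" and "34".
    allowed = {"12", "34"}
    results = [
        combine_digits([DigitPrediction(3, 0.99), DigitPrediction(4, 0.98)], allowed),
        combine_digits([DigitPrediction(1, 0.99), DigitPrediction(2, 0.60)], allowed),
        combine_digits([DigitPrediction(9, 0.95), DigitPrediction(9, 0.97)], allowed),
    ]
    for r in review_queue(results):
        print(r.value, r.valid, r.confidence)
```

Sorting invalid codes to the front and then by ascending confidence means a reviewer sees the most doubtful transcriptions first; other aggregation rules (for example the product of the digit probabilities) would fit the same structure.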

Joseph Price, Mark Clement : Using Hand-writing Recognition to Auto Index the US Census Records
Recent breakthroughs in handwriting recognition have the capability to improve the quality of the 1940 Census data and expand the set of fields that are available to use for research. Our handwriting recognition algorithm uses new data augmentation and normalization methods applied to a convolutional neural network that feeds into a Long Short-Term Memory (LSTM) network. We also have a unique advantage in having access to a training set of unprecedented size. Census records consist of a set of rows for each person and columns for each of the fields of information for that person. We’ve developed an algorithm to extract the sub-image in each cell of the census record and match these with the indexed data for that cell. This provides us with a labeled training set of 2.4 billion images from the 1940 census (18 fields x 132 million individuals). We are using our algorithm to re-index the 1940 census, fix mistakes made by the original human indexers, and expand the number of fields that are indexed. We conducted a pilot study on the 1930 census using a small training set and have already achieved a character error rate (CER) of 10.4% for names.
We also make use of the FamilySearch Family Tree, a crowdsourced genealogical database which includes a substantial number of individuals linked to the 1940 census. These sources have often been attached to the Family Tree by family members who have access to additional information about these people that improves the accuracy of the linkages to these sources. We use information from these sources to correct mistakes in the index of the 1940 census and identify alternative name spellings and nicknames for the individual.
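
For readers unfamiliar with the metric quoted in the abstract above, character error rate is conventionally computed as the edit (Levenshtein) distance between the predicted and reference strings, divided by the number of reference characters. The sketch below shows that conventional definition; it is not the authors' evaluation code, the function names are hypothetical, and pooling edits over all names before dividing is an assumption (averaging per-name rates is another common choice).

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions: List[str], references: List[str]) -> float:
    """Total edit distance divided by total reference length (pooled CER)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return total_edits / total_chars


if __name__ == "__main__":
    # Made-up example names, not data from the census.
    predicted = ["Jon", "Smith"]
    truth = ["John", "Smith"]
    print(f"CER = {character_error_rate(predicted, truth):.3f}")
```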


