The Digitising Scotland project digitised 25.8 million Scottish civil registration records: births, marriages and deaths from when records began in 1855 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a suitable research-ready format, including the broad classification of some of the information captured.
The digitised birth, marriage and death certificates include textual descriptions of occupations and causes of death. The overall project aims to create the research-ready Scottish Historic Population Database (SHPD), with these textual descriptions coded to widely used standard coding schemes. For SHPD, transcribed occupations are to be coded to the Historical International Standard Classification of Occupations (HISCO) and causes of death to the International Classification of Diseases, 10th revision (ICD-10).
It is impractical to have domain experts code all the records by hand (31 million occupations and 8 million causes of death) to create SHPD, especially as death records often give more than one cause. The coding problem is therefore treated as a text classification task and automated using Natural Language Processing and Machine Learning techniques. To facilitate the auto-coding, a proportion of the records (a random sample of unique strings) was recently coded manually and will be used to train the system (90,000 occupations and 102,000 deaths).
Ahead of the auto-coding, initial pre-processing, cleaning and standardising are carried out on both the raw transcribed occupations and causes of death, along with the training data for each type. For the training data this included removing white space and unreadable characters (deliberately introduced as part of the transcription) and creating dictionaries. For the main data, this reduced the 31M occupations to 2M unique strings (or 1.6M after removing those with unreadable characters) and the 8M causes of death to 2.2M unique strings (or 1.8M after removing those with unreadable characters). Then, following earlier computer science work within the SHPD project, an approach experimenting with Bayesian classifiers is applied.
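The cleaning and de-duplication step can be illustrated with a minimal sketch. It is not the project's actual pipeline: the unreadable-character marker (`?`), the case folding, and the function names `normalise` and `unique_strings` are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical marker for a character the transcriber could not read;
# the marker actually used in the transcription is not stated here.
UNREADABLE = "?"

def normalise(text):
    """Collapse runs of white space, trim the ends, and fold case (assumed)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def unique_strings(records, unreadable=UNREADABLE):
    """Return (all unique strings with frequencies, the readable subset)."""
    counts = Counter(normalise(r) for r in records)
    counts.pop("", None)  # drop entries that were only white space
    readable = {s: n for s, n in counts.items() if unreadable not in s}
    return counts, readable

# Toy records, invented for the example.
records = ["Coal  Miner", "coal miner ", "Farm Servant", "Weav?r", "  "]
counts, readable = unique_strings(records)
print(len(records), "records ->", len(counts), "unique,", len(readable), "readable")
```

Keeping the frequency of each unique string, rather than only the set of strings, makes it possible to report later what share of all records a given set of coded strings covers.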
Preliminary experiments using a relatively small pilot dataset obtained reasonable results from a combination of exact matching and statistical classification. In the pilot, combining exact matching for texts that have been seen in the training data with Bayes classifiers for the rest gave cross-validated accuracy of 92% for causes of death and 94-97% for occupations. Further experiments using a larger pilot (50,000 occupations) showed that, because some occupations are very common, this set of manually coded strings covers a very large proportion of the SHPD records (almost 88%). This proportion is not as high for deaths, given the different ways a cause of death can be written down and the variety of causes.
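The combination of exact matching with a Bayes classifier as fallback can be sketched as follows. This is an illustrative assumption rather than the classifier used in the project: it is a small multinomial naive Bayes over word tokens with Laplace smoothing, the class name `HybridCoder` is hypothetical, and the labels are placeholders, not real HISCO or ICD-10 codes.

```python
import math
from collections import Counter, defaultdict

class HybridCoder:
    """Exact lookup for strings seen in training, naive Bayes for the rest."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.lookup = {}                        # training string -> code
        self.class_counts = Counter()           # code -> number of strings
        self.token_counts = defaultdict(Counter)  # code -> token frequencies
        self.vocab = set()

    def fit(self, texts, codes):
        for text, code in zip(texts, codes):
            self.lookup[text] = code
            self.class_counts[code] += 1
            for tok in text.split():
                self.token_counts[code][tok] += 1
                self.vocab.add(tok)
        return self

    def _log_posterior(self, tokens, code):
        total = sum(self.class_counts.values())
        logp = math.log(self.class_counts[code] / total)
        denom = (sum(self.token_counts[code].values())
                 + self.alpha * len(self.vocab))
        for tok in tokens:
            if tok in self.vocab:  # tokens never seen in training are skipped
                logp += math.log(
                    (self.token_counts[code][tok] + self.alpha) / denom)
        return logp

    def predict(self, text):
        """Return (code, method), where method is 'exact' or 'bayes'."""
        if text in self.lookup:
            return self.lookup[text], "exact"
        tokens = text.split()
        best = max(self.class_counts,
                   key=lambda c: self._log_posterior(tokens, c))
        return best, "bayes"

# Toy training data with placeholder labels (not real HISCO codes).
train = [("coal miner", "MINER"), ("iron miner", "MINER"),
         ("farm servant", "FARM"), ("farm labourer", "FARM"),
         ("cotton weaver", "WEAVER")]
coder = HybridCoder().fit([t for t, _ in train], [c for _, c in train])

for text in ["coal miner", "lead miner", "farm hand"]:
    print(text, "->", coder.predict(text))
```

Returning the method alongside the code keeps the two routes separable, so accuracy can be reported for exact matches and for classifier predictions independently.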
This is work in progress to create the research-ready SHPD, and the final paper will include the results from experiments using the full training data (90,000 occupations and 102,000 deaths) covering the whole period from 1855 to 1973.