Preliminary Programme

Wednesday 12 April 2023 14.00 - 16.00
A-3 FAM02a Data & Methods I
SEB salen (Z)
Network: Family and Demography Chair: Glenn Sandström
Organizers: - Discussant: Glenn Sandström
Alan Dearle, Graham Kirby & Özgür Akgün : Linking Swedish Records Posing as Scottish Data: Exploring Linkage Methodologies, the Effects of Missing Data and the Use of Graph Databases
Previously it has been difficult to evaluate the relative effectiveness of different linkage
approaches, due to limited availability of ground truth data. Here we report on an
evaluation of linkage using a large Swedish dataset which has been manipulated to create a
superset of the Scottish vital event records being collected as part of the Digitising Scotland
project. Critically this dataset is large (373,000 records) and includes a high proportion of
ground truth links, captured within the original clerical process. The Scottish data comprises
birth, marriage and death records collected by the General Register Office for Scotland
during 1856-1973. All records include the names of the parties involved (child and parents
on birth records; spouses and their parents on marriage records; and the deceased, their
parents and their spouse on death records), the date and location of the relevant event, and
in some cases, additional information such as occupation and marital status. The extract of
the Swedish data, from CEDAR, contains 227,888 birth records, 42,207 marriage records and
102,655 death records, plus high quality ground truth. To link records we use a probabilistic
distance-based approach, in which two records are deemed to be linked if the difference
between the relevant information on the two records falls below some pre-determined
threshold. We perform two different forms of linkage on records: identity linkage and
indirect linkage. Our system currently includes 22 different linkers, each
of which creates inter-node relationships that are stored in a graph database. The
linker encodes the links between these records as relationships between the stored nodes.
Each of the relationships has attributes which encode the name of the builder used to form
the link (and thus all the provenance in the source code), the distance between the nodes,
and the type of relationship (e.g. mother, father, sibling, or identity, meaning the same actor on two
records). In some cases additional information is stored in the relationship attributes (for
example, in the case of identity links, which actors on the records are involved). Using the
ground truth available from the Swedish data, it is possible to determine the efficacy of
different approaches to linkage with a high degree of confidence. It is also possible to
determine the effect of different linkage metrics and different thresholds. We will briefly
report on our experimental results in this regard. During these experiments we have
observed the high impact of missing data on linkage quality and we will discuss these
observations and approaches to mitigating the effects of missing data. The graph(s) that are
created by the different builders may contain a number of different classes of error. In
general, errors fall into one of four categories: errors of omission, errors of inclusion, uniqueness
constraint errors and semantic errors. Many of these errors only become readily apparent
when the outputs from the different building processes are combined into a single graph. In
this paper we will also fully describe these errors, how they may be detected and how they
may be, at least partially, corrected.
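
The threshold-based linkage described in the abstract above can be illustrated with a minimal sketch. It is not the authors' system: the choice of normalised Levenshtein distance, the field names, the builder name, the threshold value and the in-memory list standing in for a graph database are all assumptions made for illustration. It shows two records being linked when their distance falls below a threshold, each link carrying the builder name, distance and relationship type, and the result being scored against ground truth.

```python
from dataclasses import dataclass
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalised(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; a missing (empty) value scores 1.0."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


@dataclass(frozen=True)
class Link:
    """A relationship between two record nodes, with provenance attributes."""
    source_id: str
    target_id: str
    relationship: str  # e.g. "identity", "mother", "sibling"
    builder: str       # hypothetical name of the linker that made the link
    distance: float


def record_distance(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Mean normalised distance over the compared fields (an assumed metric)."""
    return sum(normalised(rec_a[f], rec_b[f]) for f in fields) / len(fields)


def link_records(births, deaths, fields, threshold, builder="birth_death_identity"):
    """Link every birth/death pair whose distance falls below the threshold."""
    links = []
    for (bid, b), (did, d) in product(births.items(), deaths.items()):
        dist = record_distance(b, d, fields)
        if dist < threshold:
            links.append(Link(bid, did, "identity", builder, dist))
    return links


def precision_recall(links, truth):
    """Score the produced links against a set of ground-truth id pairs."""
    found = {(link.source_id, link.target_id) for link in links}
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy records, invented for illustration only.
births = {
    "b1": {"forename": "anna", "surname": "lindqvist"},
    "b2": {"forename": "karl", "surname": "berg"},
}
deaths = {
    "d1": {"forename": "ana", "surname": "lindqvist"},
    "d2": {"forename": "olof", "surname": "strand"},
}
truth = {("b1", "d1")}

links = link_records(births, deaths, ["forename", "surname"], threshold=0.2)
print(links)
print(precision_recall(links, truth))
```

Re-running the same comparison with different thresholds or metrics, or with fields blanked out to simulate missing data, changes the precision and recall obtained against the ground truth.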

Rick Mourits, Prats López, M. & Van Oort, T. & Ganzevoort, W. & Van Galen C. : Engaging the Crowd: Citizen Science for Historical Demography
Crowdsourcing is increasingly being used in scientific research projects across different disciplines, where it is often referred to as crowd science or online citizen science (Sauermann & Franzoni, 2015). This new way of conducting research entails the online participation of citizens in research projects usually initiated by professional scientists. One important advantage for scientists with limited resources is that it is a resource-efficient method to engage more people by tapping existing knowledge and interest in society. Hence, citizen science allows more people to work on a project at the same time (Sauermann & Franzoni, 2015).
Capturing the interest of citizens to participate in a citizen science project and keeping them engaged for the duration of a (sometimes lengthy) project are important and challenging activities for project organizers (Frensley et al., 2017; West & Pateman, 2016). The citizen science literature provides recommendations for recruitment and engagement strategies (West & Pateman, 2016; Crall et al., 2017), the most important of which is to take into account the different motivations of citizen volunteers and to use engagement strategies accordingly (Ponciano & Brasileiro, 2014). However, research shows that participants have multiple motivations (Rotman et al., 2014) and it is not clear what engagement strategies work best for the different types of citizen participants (Ponciano & Brasileiro, 2014).
Several studies propose measuring volunteers' engagement in terms of their actual participation behaviour, examining engagement patterns and using these to create different participant profiles (Ponciano & Brasileiro, 2014; Jackson et al. 2016; Aristeidou et al., 2017). These engagement patterns and participant profiles have been derived from projects in the fields of astronomy and natural sciences, where the most common task is the annotation of images. There is a need for more research to ensure the generalizability of their findings (Aristeidou et al., 2017; Ponciano & Brasileiro, 2014) and, in particular, to understand engagement patterns and participant profiles in the knowledge-intensive and time-consuming citizen science projects in the humanities (Prats López et al., 2020).
The aim of our research is to study the engagement patterns of citizens participating in a humanities project and to create participation profiles that can be used for further research. To this end, we are using the log data of the citizen science project ‘Historical Database Suriname and Curaçao’ (https://hdsc.ning.com/). The objective of this citizen science project is to create a database of the population of Suriname and Curaçao from the years 1830 to 1950, by digitizing and transcribing civil registers and death certificates. This database will be available as open access to both scientists and the general public, to facilitate genealogical research and the study of social processes and diversity in colonial society, as well as of the repercussions of slavery over multiple generations.

Barbara Revuelta-Eugercios, Asbjørn Thomsen & Nicolai Rask Mathiesen & Olivia Robinson & Anne Løkke & Lise Bødtker Sunde & Anna Lodberg Sparres & Line Hjørt-Moritzsen : Occupation, Position in the Household and Socio-economic Status in 19th Century Denmark: Combining Historical Expertise and Automated Methods to Code Millions of Strings
Socio-economic status is one of the most important variables used in social history to study stratification, inequality in life chances and social mobility. While nowadays it can be measured with different variables (e.g. education, income, wealth), for historical times the operationalization of socio-economic status often relies on the use of occupation, as this is generally the individual-level information that is easiest to obtain for large populations.
A major hurdle for large-scale historical analysis of occupation is that it requires reducing the enormous variability of occupational titles from the sources to a finite number of groups (such as Hisco), which can then be classified and aggregated in different class or status schemes (such as Hisclass, Hiscam, Socpo, etc). Traditionally this coding/classification has been done manually by domain experts, but this approach does not scale to the new projects building large-scale databases for complete populations. New automatic methods are therefore being developed to code occupations at least into broader occupational groups.
The aim of this paper is to present a method to uncover long-term trends in social stratification by coding individual-level occupation data from different sources through automated methods. The chosen context of study is 19th century Denmark, where the project Link-lives is developing a large-scale linked database for the full population, combining data from different sources. First, we discuss the challenges of coding large-scale population data covering long periods of time, and we describe the manually coded occupations available so far. Second, we present a new approach for the automated coding of occupation and position in the household into well-known coding schemes for 19th century Danish terms. Third, we present the preliminary results for 19th century Denmark, showing social stratification and intergenerational social mobility over time.
We use data from Danish censuses and Copenhagen burial records, which contain more than 700,000 unique strings of occupation and position in the household combined. This is an extremely challenging dataset, as in many censuses both types of information were mixed into the same variable (“house father, carpenter”), making any automated process very difficult. Previous standardizations have been done for some of the censuses, but they were done with semi-automatic processes and lack metadata on how each string was coded. This makes it difficult to reuse the old routines for the new data. We re-use part of the material, include newly coded occupations, and present a transparent and replicable pipeline that can be used for any other occupation data from Danish sources, and whose framework can be copied internationally and adapted to other languages.


