The Documentation section documents some of our research processes and provides concrete workflows for data analysis in digital historical sources. This is part of our commitment not just to make the results of research available through publications, but also to lift the veil on how we actually explore, examine, manipulate, and interpret the data extracted from sources. By sharing our workflows, mostly in the form of “scripts” produced with R, we hope to contribute to new ways of sharing knowledge and methods, and to make our experiments with digital historical sources reusable and replicable.
We produced all the “scripts” proposed below in R using the Markdown format. While this may at first sound a bit “techie”, R is in fact a programming language now widely used in the social sciences and humanities. It offers a wide range of libraries for most tasks related to processing data, from very simple calculations to data mining, map making, network analysis, etc. The R Markdown format is particularly well suited to sharing scripts because it combines narrative (e.g., explaining the research operations and presenting interpretations) with ready-made, replicable chunks of code for the operations described in the narrative.
MCBD Design Manual [ENP-China Team]
We wrote the design manual of the Modern China Biographical Database to present the nature, the purpose, and the structure of the database. This document is not strictly a “script” for analytical use. We chose this format for the manual for two main reasons: first, this was a collective exercise in which each of us could write his/her part(s) independently, and the parts were then compiled automatically into a single document; second, we wanted to work with a format that allowed us to update any part at any time and to produce a flexible document for the web seamlessly.
Data Transformation [ENP-China Team]
In processing data extracted from sources, historical or otherwise, we often face the same issues of having to clean, homogenize, group, etc. the data to make it useful for analysis. Data transformation is a tedious task that requires rigor and accuracy. While small data sets can be processed by hand in a spreadsheet, processing larger data sets by hand quickly becomes time-consuming and carries a greater risk of mistakes. An R script enables the systematic processing of data without erasing the original data set to which the transformation is applied.
With the Data Transformation script we mean to provide a series of operations for the transformation of data, including messy data, to fit the requirements of clean tabular data. Our purpose is to facilitate the production of data in Chinese studies according to the norms of academia in Western countries, but the script can be extended to other fields of study. We propose here a wide range of examples, from simple text editing to the extraction of data from complex sentences. The examples we use apply to both English (or any Latin script) and Chinese.
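As an illustration of the kind of operations described above, here is a minimal sketch in base R. It is not taken from the Data Transformation script itself: the sample records, the column names (`name`, `note`, `name_key`, `year`), and the choice of steps are all assumptions made for the example. It assumes the file is saved and read as UTF-8.

```r
# Hypothetical messy records: all names, columns, and values are invented.
raw <- data.frame(
  name = c("  WANG  Zhengting ", "Wang zhengting", "顧維鈞 "),
  note = c("Graduated from Yale University in 1910.",
           "graduated from Yale in 1910",
           "1912年畢業於哥倫比亞大學"),
  stringsAsFactors = FALSE
)

# Work on a copy so that the original data set stays untouched.
clean <- raw

# Collapse runs of whitespace, then trim leading and trailing spaces.
clean$name <- trimws(gsub("\\s+", " ", clean$name))

# Lower-cased key for grouping spelling variants
# (Chinese characters are left unchanged by tolower).
clean$name_key <- tolower(clean$name)

# Extract the first four-digit number from each sentence as a year;
# rows without a match keep NA.
m <- regexpr("[0-9]{4}", clean$note)
clean$year <- NA_integer_
clean$year[m > 0] <- as.integer(regmatches(clean$note, m))

# One row per distinct name/year pair.
persons <- unique(clean[, c("name_key", "year")])
print(persons)
```

Because every step writes to `clean`, the `raw` data frame can be inspected at any time to check a transformation against the source values.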
Industrialist [Christian Henriot]
This is the first part of a multidirectional exploratory study of Shanghai industrialists in the Shenbao from the mid-19th to the mid-20th century. In this document, we examine the terms through which “industrialists” were named in the press. This essay takes two of the most common terms that designated “industrialists” in the Chinese press in the Republican period: 工業家 and 實業家. While 工業家 is an unambiguous term for “industrialist”, 實業家 can refer to both “entrepreneur” (in other sectors, including banking) and “industrialist”. Our purpose in this script is to extract all the texts that refer to either of these terms and to extract all the named entities (person, organization, location) mentioned in these texts. The second stage of this survey is to link these entities to events to which these terms may be related. The next instalment will explore a wider range of terms associated with “industrialists”.
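The first step described above, selecting the texts that mention either term, can be sketched in base R as follows. This is only an illustration under stated assumptions: the sample sentences are invented rather than taken from the Shenbao, the `articles` data frame and its columns are hypothetical, and the script itself may rely on other tools. Named entity extraction is not covered here.

```r
# Invented sample texts standing in for press articles (UTF-8 assumed).
articles <- data.frame(
  id   = 1:4,
  text = c("著名實業家昨日抵滬",
           "工業家聯合會開會",
           "本埠天氣晴朗",
           "實業家與工業家合作辦廠"),
  stringsAsFactors = FALSE
)

# The two terms discussed in the essay.
terms <- c("工業家", "實業家")

# One logical column per term; fixed = TRUE matches the characters literally.
hits <- sapply(terms, function(term) grepl(term, articles$text, fixed = TRUE))

# Keep the texts that mention at least one of the terms.
selected <- articles[rowSums(hits) > 0, ]

# Number of texts mentioning each term.
print(colSums(hits))
print(selected$id)
```

Keeping one column per term makes it possible to compare how often each term occurs, and in which texts both appear together.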
The Rotary Club [Cécile Armand]
Case 1 – The Rotary Club of China in the press
A Practical Guide to the ‘enpchina’ package: The Rotary Club in the Chinese Press: This guide aims to demonstrate how China historians can take advantage of the “enpchina” package to explore massive corpora of historical newspapers, focusing on a major Chinese newspaper – Shenbao 申報 – and a concrete case study – the Rotary Club of Shanghai 上海扶輪社 (Shanghai fulunshe) (Rmd version: https://bookdown.enpchina.eu/Rotary_sb_eng.Rmd).
A Practical Guide to the ‘enpchina’ package: The Rotary Club in the English-language Press: This guide aims to demonstrate how China historians can take advantage of the “enpchina” package to explore massive corpora of historical newspapers – i.e. ProQuest “Chinese Newspapers Collection” – taking the Rotary Club of Shanghai as a case study (Rmd version: https://bookdown.enpchina.eu/Rotary_pq_eng.Rmd).
Case 2 – The Golden Age of the returned students
This collection of tutorials explores the presence of the returned students in the modern Chinese press. The press corpora include a dozen Chinese newspapers spanning the mid-19th to the mid-20th century. They are part of the large collections of historical sources that the ENP China project has acquired and made available in full text for the first time. The potential for exploration is infinite. It may be disturbing too. As humanists trained in the close reading and critical hermeneutics of a limited, human-scale number of documents, we are poorly equipped to face this data deluge. Where to start? How to proceed? These tutorials provide some useful tips for turning historians into data-driven humanists. We will experiment with various techniques and methods to handle massive historical corpora and approach modern Chinese history from new perspectives.
The purpose of this tutorial series is twofold:
- Substantively, to introduce a step change in the history of the returned students and contribute to a new understanding of their role in building a new China after the empire – a much disputed issue in the existing scholarship (Wang, 1966). The corpus-based, data-driven approach we propose will enrich and contextualize the biographical, prosopographical and cultural studies that have prevailed to date.
- Methodologically, we aim to:
  - introduce the enpchina R package – a set of tools relying on the R programming language and tailored specifically for exploring massive, multilingual corpora of Chinese sources – and other R packages we consider useful for historical research;
  - devise on-the-fly yet sustainable solutions for harnessing large collections of historical newspapers;
  - empower historians with various programming skills so that they gain full control over the “datafication” process and escape the black boxes that we inherit from web platforms and off-the-shelf software.
We chose RStudio because it provides an integrated framework for combining a variety of approaches and commanding the complete chain of operations. In R, data-driven historians can conduct the entire research process – from data extraction to the exploration, analysis, interpretation and publishing of their findings and methodology – within a single, unified environment, while ensuring the traceability of the workflow and the replicability of their experiments by sharing the code and emphasizing collaboration. Moreover, it is supported by a large community of users (historians/scholars, data scientists/computing specialists) and it is constantly evolving toward greater integration and accessibility. Altogether, the following tutorials develop a standard workflow that any historian can emulate or transpose to her own research needs. She will be guided step by step from building the corpus to analyzing its textual content, mapping the underlying network of social actors, and many other applications.
- Corpus building with the enpchina package: in this tutorial, you will learn how to use the enpchina package to build a corpus (i.e. a collection of newspaper articles) from a keyword-based query and to conduct a preliminary exploration of this corpus (Rmd version: https://bookdown.enpchina.eu/Liumei/01_Corpus.rmd).
- Text analysis with tidytext: apply basic text analysis techniques to approach the content of articles (tokenisation, word frequency, correlation, co-occurrences) with the package tidytext (Rmd version: https://bookdown.enpchina.eu/Liumei/02_TextAnalysis.Rmd).
- Text statistics with quanteda: learn how to create a corpus object to perform more advanced text analyses (frequency, time series) and visualisations (heat maps) with the package quanteda (Rmd version: https://bookdown.enpchina.eu/Liumei/021_TextStats.Rmd).
- Keyword extraction with quanteda: learn how to handle multi-word units (e.g. “United States”), extract key terms and compare corpora of varying size using more sophisticated metrics (TF-IDF, log-likelihood ratio test) (Rmd version: https://bookdown.enpchina.eu/Liumei/022_KeyTerm.Rmd).
- Text co-occurrences (1): explore relations between words, learn how to find and visualize collocates and to measure their significance (Rmd version: https://bookdown.enpchina.eu/Liumei/023_TextCooc.Rmd).
- Text co-occurrences (2): discover alternative ways of visualizing text collocations (Rmd version: https://bookdown.enpchina.eu/Liumei/024_Collocation.Rmd).
- Concordancing: to bridge the gap between distant and close reading, learn how to analyze words in their original context and apply regular expressions to refine your research (coming soon).
Sentiment analysis (coming soon)
Topic modeling (coming soon)
Named Entity Recognition (coming soon)
- Extraction with the enpchina package
- Processing: clean, homogenize and classify named entities with the tidyverse meta-package and OpenRefine R extensions.
- Network analysis of persons and organizations
- Mapping locations
Corpus forensics (coming soon)
- Text classification
- Text metrics
- Text features
- Text reuse
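
To give a rough idea of what the word frequency and TF-IDF steps mentioned in the text analysis tutorials involve, here is a minimal sketch in base R. It does not use tidytext or quanteda and does not reproduce their implementations: the three sample sentences are invented, and the weighting shown (relative term frequency multiplied by the natural logarithm of the inverse document frequency) is only one common variant, chosen here as an assumption.

```r
# Three invented mini-documents standing in for newspaper articles.
docs <- c(
  d1 = "The returned students founded a club in Shanghai",
  d2 = "The club held a meeting in Shanghai",
  d3 = "Students returned from the United States"
)

# Tokenise: lower-case, split on anything that is not a letter,
# and drop empty strings.
tokens <- strsplit(tolower(docs), "[^a-z]+")
tokens <- lapply(tokens, function(x) x[nzchar(x)])

# Overall word frequency, most frequent first.
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
print(head(freq, 5))

# Term frequency: share of each word within each document
# (rows = terms, columns = documents).
vocab <- sort(unique(unlist(tokens)))
tf <- vapply(
  tokens,
  function(x) as.numeric(table(factor(x, levels = vocab))) / length(x),
  numeric(length(vocab))
)
rownames(tf) <- vocab

# Inverse document frequency: words found in every document get weight 0.
doc_freq <- rowSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# TF-IDF: idf is recycled down each column, one value per term.
tfidf <- tf * idf
print(round(tfidf[c("the", "shanghai", "united"), ], 3))
```

A word such as “the”, which occurs in all three documents, receives a weight of zero, whereas a word confined to a single document is weighted up. This is the intuition behind using TF-IDF to compare corpora rather than relying on raw counts.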