Documentation – Elites, Networks and Power in modern China

Enp Resources

HistText 1.0

HistText is the application developed in R by the ENP-China team to explore, extract, and process data from the historical digital corpora that the project acquired or produced. Initially, this application was designed as an internal tool (a series of search functions) under the name “enpchina R Library”. It provided ready-made functions to mine data in digital corpora. We plan to update all the scripts that we have made available so far as markdown files, but for those who use our initial package, we still maintain the functions in the scripts under the ‘enpchina’ name. Eventually, ‘HistText’ will replace all these mentions.

The HistText application is used by the ENP-China research team on its own corpora, but its field of application is much broader. It can be used with any digital corpus: readily in English, French, and Chinese, and with only light adaptations for other languages. We have released the public version of the code on GitLab. The current version, HistText 1.0, comes with a complete HistText Manual that describes all the functions and provides ready-made examples of scripts.

We have also developed a public search interface that allows anyone to access and search the publicly accessible corpora (as opposed to corpora under copyright) without having to code in R. If your institution happens to have a license for these corpora (ProQuest, Shenbao), please contact us and we will grant you access. The HistText online interface has two access pages: one for the Search and Query functions, one for Named Entity Extraction. The functions of the interface are fully described in the HistText User Guide.

There are two online presentations of the functions of the enpchina package and of HistText.

MCBD Design Manual [ENP-China Team]

We wrote the design manual of the Modern China Biographical Database to present the nature, the purpose, and the structure of the database. This document is not strictly a “script” for analytical use. We chose this format to elaborate this manual for two main reasons: first, this was a collective exercise in which each of us could write his/her part(s) independently, which were then compiled automatically into a single document; second, we wanted to work with a format that allowed us to update any part at any time and to produce seamlessly a flexible document for the web.

Data Transformation [ENP-China Team]

In processing data extracted from sources, historical or otherwise, we often face the same issues: the data must be cleaned, homogenized, grouped, and so on before it is useful for analysis. Data transformation is a tedious task that requires rigor and accuracy. While small data sets can be processed by hand in a spreadsheet, processing larger data sets this way quickly becomes time-consuming, with a greater risk of mistakes. An R script enables the systematic processing of data without erasing the original data set to which the transformation is applied.

With the Data Transformation script we mean to provide a series of operations for the transformation of data, including messy data, to fit the requirements of clean tabular data. Our purpose is to facilitate the production of data in Chinese studies according to the norms of academia in Western countries, but the script can be extended to other fields of study. We propose here a wide range of examples, from simple text editing to the extraction of data from complex sentences. The examples we use apply to both English (or any Latin script) and Chinese.
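The kind of operation described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in R, and it is not taken from the Data Transformation script: the records, the field names, and the helper functions `clean_text`, `first_year`, and `transform` are all hypothetical. It shows two typical steps, tidying a text field and pulling a year out of a sentence in either English or Chinese, while leaving the original rows untouched.

```python
import re
import unicodedata

# Hypothetical messy records, invented for illustration only.
RAW_ROWS = [
    {"name": "  Zhang   San ", "note": "Born in 1883, graduated 1905"},
    {"name": "張三", "note": "生於１８８２年，畢業於１９０５年"},
]


def clean_text(value):
    """Normalize Unicode width variants and collapse runs of whitespace."""
    value = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", value).strip()


def first_year(value):
    """Return the first standalone year between 1800 and 1999, or None."""
    # NFKC turns full-width digits (１８８２) into ASCII digits (1882).
    value = unicodedata.normalize("NFKC", value)
    match = re.search(r"(?<!\d)(1[89]\d{2})(?!\d)", value)
    return int(match.group(1)) if match else None


def transform(rows):
    """Build new rows; the input list and its dicts are not modified."""
    return [
        {
            "name": clean_text(row["name"]),
            "first_year": first_year(row["note"]),
            "note": row["note"],  # the original text is kept alongside
        }
        for row in rows
    ]


CLEAN_ROWS = transform(RAW_ROWS)
```

Keeping the raw `note` next to the derived column makes each transformation easy to check against its source, which is the same reason for working on a copy rather than overwriting the original data set.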

Industrialist [Christian Henriot]

This is the first part of a multidirectional exploratory study of Shanghai industrialists in the Shenbao from the mid-19th to the mid-20th century. In this document, we examine the terms through which “industrialists” were named in the press. This essay takes two of the most common terms that designated “industrialists” in the Chinese press in the Republican period: 工業家 and 實業家. While 工業家 is an unambiguous term for “industrialist”, 實業家 can refer to both “entrepreneur” (in other sectors, including banking) and “industrialist”. Our purpose in this script is to extract all the texts that contain either of these terms and to extract all the named entities (person, organization, location) mentioned in these texts. The second stage of this survey is to link these entities to events to which these terms may be related. The next instalment will explore a wider range of terms associated with “industrialists”.

The Rotary Club [Cécile Armand]

Case 1 – The Rotary Club of China in the press

A Practical Guide to the ‘enpchina’ package: The Rotary Club in the Chinese Press: This guide aims to demonstrate how China historians can take advantage of the “enpchina” package to explore massive corpora of historical newspapers, focusing on a major Chinese newspaper – Shenbao 申報 – and a concrete case study – the Rotary Club of Shanghai 上海扶輪社 (Shanghai fulunshe) (Rmd version: https://bookdown.enpchina.eu/Rotary_sb_eng.Rmd).

A Practical Guide to the ‘enpchina’ package: The Rotary Club in the English-language Press: This guide aims to demonstrate how China historians can take advantage of the “enpchina” package to explore massive corpora of historical newspapers – i.e. ProQuest “Chinese Newspapers Collection” – taking the Rotary Club of Shanghai as a case study (Rmd version: https://bookdown.enpchina.eu/Rotary_pq_eng.Rmd).

Case 2 – American University Men of China

This tutorial series applies a place-based methodology to study Sino-American alumni networks in modern China, based on a directory of the American University Club of Shanghai published in 1936. It is divided into four parts:

1. Find and analyze places using the R package “Places” (html version, Markdown version)

2. From places to networks (a dual approach): Build, visualize and analyze place-based networks using igraph (html version, Markdown version).

3. Community detection in place-based networks: Identify and analyze subgroups of places using igraph

4. Place formation over time: Create period-based subnetworks to analyze the formation of academic places between 1883 and 1935
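To make the idea of a place-based network more concrete, here is a minimal sketch in Python (standard library only, not igraph, and not the tutorials' own code). It assumes one common reading of the method: two places are linked when the same person attended both, and the link weight counts the persons they share. The tutorials' own definitions may differ. The records and the function `place_network` are hypothetical; the optional `start` and `end` arguments illustrate how a period-based subnetwork could be cut from the same data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, place, year) records, invented for illustration.
RECORDS = [
    ("A", "Columbia", 1910),
    ("A", "Yale", 1912),
    ("B", "Columbia", 1911),
    ("B", "Yale", 1913),
    ("C", "Columbia", 1925),
    ("C", "MIT", 1927),
]


def place_network(records, start=None, end=None):
    """Project person-place ties onto places.

    Returns a dict mapping an alphabetically ordered pair of places to
    the number of persons who attended both within the given period.
    """
    places_by_person = defaultdict(set)
    for person, place, year in records:
        if (start is None or year >= start) and (end is None or year <= end):
            places_by_person[person].add(place)

    weights = defaultdict(int)
    for places in places_by_person.values():
        for pair in combinations(sorted(places), 2):
            weights[pair] += 1
    return dict(weights)


FULL = place_network(RECORDS)
EARLY = place_network(RECORDS, start=1883, end=1919)
```

The resulting weighted edge list can be handed to any network library for visualization or community detection. Restricting the years before projecting, rather than filtering edges afterwards, ensures that a tie only appears when both attendances fall inside the period.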

Case 3 – The Golden Age of the returned students

This collection of tutorials explores the presence of the returned students in the modern Chinese press. The press corpora include a dozen Chinese newspapers spanning the mid-19th to the mid-20th century. They are part of the large collections of historical sources that the ENP-China project has acquired and made available in full text for the first time. The potential for exploration is vast. It may be disturbing too. As humanists trained in the close reading and critical hermeneutics of a limited, human-scale number of documents, we are poorly equipped to face this data deluge. Where to start? How to proceed? These tutorials provide some useful tips for turning historians into data-driven humanists. We will experiment with various techniques and methods to handle massive historical corpora and approach modern Chinese history from new perspectives.

The purpose of this tutorial series is twofold:

Substantively, to introduce a step change in the history of the returned students and contribute to a new understanding of their role in building a new China after the empire – a much disputed issue in the existing scholarship (Wang, 1966). The corpus-based, data-driven approach we propose will enrich and contextualize the biographical, prosopographical and cultural studies that have prevailed to date.

Methodologically, we aim to:

- introduce the enpchina R package – a set of tools relying on the R programming language, tailored specifically for exploring massive, multilingual corpora of Chinese sources – and other R packages we consider useful for historical research;
- devise on-the-fly yet sustainable solutions for harnessing large collections of historical newspapers;
- empower historians with various programming skills so that they gain full control over the “datafication” process and escape the black boxes that we inherit from web platforms and off-the-shelf software.

We chose RStudio because it provides an integrated framework for combining a variety of approaches and commanding the complete chain of operations. In R, data-driven historians can conduct the entire research process – from data extraction to the exploration, analysis, interpretation and publishing of their findings and methodology – within a single, unified environment, while ensuring the traceability of the workflow and the replicability of their experiments by sharing the code and emphasizing collaboration. Moreover, it is supported by a large community of users (historians/scholars, data scientists/computing specialists) and it is constantly evolving toward greater integration and accessibility. Altogether, the following tutorials develop a standard workflow that any historian can emulate or transpose to her own research needs. She will be guided step by step from building the corpus to analyzing its textual content, mapping the underlying network of social actors and many other applications.

Corpus building with the enpchina package: in this tutorial, you will learn how to use the enpchina package to build a corpus (i.e. a collection of newspaper articles) from a keyword-based query and to conduct a preliminary exploration of this corpus (Rmd version: https://bookdown.enpchina.eu/Liumei/01_Corpus.rmd).

Text analysis

- Text analysis with tidytext: apply basic text analysis techniques to approach the content of articles (tokenisation, word frequency, correlation, co-occurrences) with the package tidytext (Rmd version: https://bookdown.enpchina.eu/Liumei/02_TextAnalysis.Rmd).
- Text statistics with quanteda: learn how to create a corpus object to perform more advanced text analyses (frequency, time series) and visualisations (heat maps) with the package quanteda (Rmd version: https://bookdown.enpchina.eu/Liumei/021_TextStats.Rmd).
- Keyword extraction with quanteda: learn how to handle multi-word units (e.g. “United States”), extract key terms and compare corpora of varying size using more sophisticated metrics (TF-IDF, log-likelihood ratio test) (Rmd version: https://bookdown.enpchina.eu/Liumei/022_KeyTerm.Rmd).
- Text co-occurrences (1): explore relations between words, learn how to find and visualize collocates and to measure their significance (Rmd version: https://bookdown.enpchina.eu/Liumei/023_TextCooc.Rmd).
- Text co-occurrences (2): discover alternative ways of visualizing text collocations (Rmd version: https://bookdown.enpchina.eu/Liumei/024_Collocation.Rmd).
- Concordancing: to bridge the gap between distant and close reading, learn how to analyze words in their original context and apply regular expressions to refine your research (coming soon).
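As a rough illustration of the first steps listed above (tokenisation, word frequency, TF-IDF), here is a minimal sketch in Python using only the standard library. It is not the tidytext or quanteda code used in the tutorials. The two sample sentences and the functions `tokenize` and `tf_idf` are hypothetical, and the weighting shown (term frequency as a share of the document, inverse document frequency as the natural log of documents over documents containing the term) is one common definition among several.

```python
import math
import re
from collections import Counter

# Hypothetical two-document corpus, invented for illustration.
DOCS = {
    "doc1": "Returned students founded a new club in Shanghai",
    "doc2": "The club welcomed students from Shanghai",
}


def tokenize(text):
    """Lowercase the text and split it into runs of Latin letters."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return {document: {term: tf-idf score}} for a dict of raw texts."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(counts)

    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return scores


FREQ = Counter(tokenize(" ".join(DOCS.values())))
SCORES = tf_idf(DOCS)
```

Terms that occur in every document, such as "students" here, score zero, while terms specific to one document stand out. The regular expression in `tokenize` only suits Latin-script text; Chinese has no spaces between words, so a dedicated word segmenter would be needed before counting.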

Sentiment analysis (coming soon)

Topic modeling (coming soon)

Named Entity Recognition (coming soon)

- Extraction with the enpchina package
- Processing: clean, homogenize and classify named entities with the tidyverse meta-package and OpenRefine R extensions
- Network analysis of persons and organizations
- Mapping locations

Corpus forensics (coming soon)

- Text classification
- Text metrics
- Text features
- Stylometry
- Text reuse

© 2024 All rights reserved ENP China