This text provides a protocol for processing data extracted from sources and provided in spreadsheet format (as well as any data produced manually or automatically).
Rules for data identification
- First, before you start any type of manipulation, add an “ID” column and create an ID number for each line of data (use the “series” function in the Edit menu). This preserves the original order of the data as you entered it from the source. It will come in very handy when you need to double-check the data against the source after you have manipulated the file (e.g. by sorting) for various purposes: after almost any sorting of the data, the original order cannot otherwise be restored. This is not the ID that we may eventually add for more permanent usage. Logically, our permanent IDs will come from the MCBD Heurist database.
- If the data is in English with only non-Chinese names for person, place, etc., nothing further needs to be done at the level of names (except separating Surname and Given name, see below)
- If the data is in transliteration (pinyin or Wade-Giles), Chinese characters need to be added right away whenever possible. This may be done using the MCBD or MCGD or any source that allows the correct identification of the person or place (or any other entity if possible). Pinyin is bound to produce homophones that require disambiguation for data processing. This is especially true for place names. The use of Chinese characters for named entities (person, place, institution, event) should be the norm.
- If two persons have the exact same name in transliteration, it is necessary to add a mark in the following format: Chen Yi [A], Chen Yi [B]. It is best to start with [A], [B], etc. unless there is a particular reason to use a different letter. The same applies to names in Chinese. In this case, use the traditional Chinese ordinal series (the Heavenly Stems): 陳達 [甲], 陳達 [乙]
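The two mechanical rules above (a sequential ID in source order, and [A]/[B] marks for identical names) can be sketched in a few lines of Python. This is only an illustration, not our R workflow: the function names `add_ids` and `mark_homonyms` are hypothetical, and the sketch assumes that every row describes a distinct person, so it should not be run on a file where the same individual appears on several lines.

```python
from collections import Counter
from string import ascii_uppercase


def add_ids(rows):
    """Return new rows with a sequential "ID" (1, 2, 3, ...) in source order."""
    return [{"ID": i, **row} for i, row in enumerate(rows, start=1)]


def mark_homonyms(names, markers=ascii_uppercase):
    """Append [A], [B], ... to every name that occurs more than once.

    Assumes each entry is a different person. Raises IndexError if a name
    occurs more often than there are markers.
    """
    totals = Counter(names)
    seen = Counter()
    marked = []
    for name in names:
        if totals[name] > 1:
            marked.append(f"{name} [{markers[seen[name]]}]")
            seen[name] += 1
        else:
            marked.append(name)
    return marked


rows = add_ids([{"Name": "Chen Yi"}, {"Name": "Liu Shaoqi"}, {"Name": "Chen Yi"}])
print(rows[2])  # {'ID': 3, 'Name': 'Chen Yi'}
print(mark_homonyms([row["Name"] for row in rows]))
# For names in Chinese, pass the Heavenly Stems as markers.
print(mark_homonyms(["陳達", "陳達"], markers="甲乙丙丁"))
```

Because the ID is stored in the row itself, sorting the rows by ID restores the original order after any other sort.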
Rules for field names
When preparing a final file for import into the MCBD, it is very helpful to label the columns in the spreadsheet with the corresponding labels in the MCBD. You can work with your own labels, but when preparing the file for upload to MCBD, all it requires is simply to add a line at the top of the file with the corresponding MCBD labels (see “Preparing files for MCBD” below). The labels will vary depending on whether you handle only names in Western languages or names in Chinese, or names in both Chinese and Western languages.
Names in Western languages: If you create the file yourself, it is preferable to record the Surname and the Given name in two distinct columns, with the labels indicated below. Alternatively, always write the name in the following format in the same cell: Surname, Given name (e.g. Smith, John). It is very easy to concatenate a new column “Surname, Given name” from the distinct columns Surname/Given name. The opposite is not true. If your file contains names in which the Surname and the Given name are not separated, it will be necessary to split the names as discussed below.
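To illustrate why the “Surname, Given name” format is the safe one, here is a minimal Python sketch (the function names `join_name` and `split_name` are hypothetical). Joining two columns is trivial, and splitting works reliably only because the comma marks the boundary; a cell without a comma is rejected rather than guessed at.

```python
def join_name(surname, given_name):
    """Build the single-cell format "Surname, Given name"."""
    return f"{surname}, {given_name}"


def split_name(full_name):
    """Split "Surname, Given name" at the first comma."""
    surname, comma, given_name = full_name.partition(",")
    if not comma:
        # Without a comma there is no safe way to tell surname from given name.
        raise ValueError(f"no comma in name: {full_name!r}")
    return surname.strip(), given_name.strip()


print(join_name("Smith", "John"))  # Smith, John
print(split_name("Smith, John"))   # ('Smith', 'John')
```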
Names in Chinese: Most of the time, if this is data extracted from sources, the Chinese name will come in the 姓名 format. If you create a file manually, you may also want to separate 姓 and 名 in two distinct columns. Yet this may be tedious and name splitting will be your savior here!
Once you have a file in Chinese, with one or two columns, the very next step is to transform all the characters into traditional characters (繁體). This needs to be done right away and any further transformation or extraction from any file shall stick to this original list of names. This is important because some characters exist in different forms in a computer environment (even with Unicode) and this may be a source of trouble later when sorting, analyzing in a spreadsheet or processing in R.
You can of course keep a separate column with the same content in simplified characters (简体) if this is something you need to have. In MCBD, we code our labels in work files with “cbcd_ZhT” for traditional characters and “cbcd_ZhS” for simplified characters.
The transformation between traditional and simplified characters can be done very easily. It can be done in R with various packages (see our “Data transformation” markdown script). But it can also be done in one go (an entire column) with online tools:
Lexilogos works very well: https://www.lexilogos.com/keyboard/chinese_conversion.htm
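For readers who want to see what the conversion amounts to, the sketch below converts a column character by character in Python. It is deliberately a toy: the five-character table and the column names `Name_ZhS` and `Name_ZhT` are made up for the example, and a real conversion needs a complete table such as the ones behind the tools mentioned above. Note also that some simplified characters correspond to more than one traditional character (发 can be 發 or 髮), so the result of any automatic conversion should be checked.

```python
# Toy table for illustration only; a real conversion covers thousands of characters.
SIMPLIFIED_TO_TRADITIONAL = str.maketrans(
    {"刘": "劉", "陈": "陳", "达": "達", "绍": "紹", "兴": "興"}
)


def to_traditional(text):
    """Replace every character found in the table; leave the others untouched."""
    return text.translate(SIMPLIFIED_TO_TRADITIONAL)


def convert_column(rows, source="Name_ZhS", target="Name_ZhT"):
    """Add a traditional-character column next to the simplified one."""
    return [{**row, target: to_traditional(row[source])} for row in rows]


print(to_traditional("刘少奇"))  # 劉少奇
print(convert_column([{"Name_ZhS": "陈达"}]))
```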
Once you have a file with the Chinese characters (with a single Full_name_Vernacular column or two columns for Surname/Given name), you need to add a column (or two) for pinyin. It is mandatory to have a column in alphabetical letters, mostly for sorting the data.
The transformation into pinyin can be done using R (see our “Data transformation” markdown script). If you use an online tool, however, you need to split the names into two columns before transforming into pinyin (otherwise 劉少奇 will be rendered as “liushaoqi”). You will also need to add capital letters to Surname and Name (Liu, not liu; Shaoqi, not shaoqi). This can be done using our “Data transformation” markdown script or regular functions in Excel. For the transformation of Chinese characters into pinyin:
Names in pinyin: The rules for files in pinyin are the same as indicated above: on the one hand, a column with the full name; on the other hand, two distinct columns for Surname/Given name. The latter two can be obtained through name splitting in R.
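The capitalization and column layout described for pinyin names can be sketched as follows in Python. This is only an illustration (the function name `format_pinyin_name` is hypothetical) and it assumes the surname and given name have already been split and transliterated separately.

```python
def format_pinyin_name(surname, given_name):
    """Capitalize both parts and build the full-name column.

    str.capitalize() also lowercases the remaining letters, so "LIU" becomes "Liu".
    """
    surname = surname.strip().capitalize()
    given_name = given_name.strip().capitalize()
    return surname, given_name, f"{surname} {given_name}"


print(format_pinyin_name("liu", "shaoqi"))  # ('Liu', 'Shaoqi', 'Liu Shaoqi')
```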
Place names: Any file with names in pinyin needs to be supplemented with a column in Chinese. Place names in pinyin can create a lot of confusion. Place names may require refining when they come with a suffix for their administrative level (北京市, 紹興縣, etc.). It is fine to keep this information in its original format, but this information needs to be trimmed down to the sole place name (北京, 紹興), first because the sole place name usually refers to a point location (city, town, village), not the whole administrative entity; second because the sole place name can more easily be used for matching names to get the geospatial coordinates (Longitude/Latitude); third because this may be used to label point locations on maps.
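Trimming the administrative suffix can be sketched like this in Python. The suffix list is an assumption made for the example and would need to be extended for real data; the function name `trim_place_name` is likewise hypothetical. The guard on the remaining length keeps two-character names such as 沙市 intact, where the final 市 is part of the name itself.

```python
ADMIN_SUFFIXES = ("市", "縣", "省")  # assumed list, extend as needed


def trim_place_name(place):
    """Drop a trailing administrative suffix, keeping at least two characters."""
    for suffix in ADMIN_SUFFIXES:
        if place.endswith(suffix) and len(place) - len(suffix) >= 2:
            return place[: -len(suffix)]
    return place


for original in ["北京市", "紹興縣", "沙市"]:
    # Keep the original value and store the trimmed name in a separate column.
    print(original, trim_place_name(original))
```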
Name splitting: It is very tedious to do this by hand. We developed a script in R for data transformation that allows the user to do this automatically (see our “Data transformation” markdown script), with various ways to process different forms of labeling in the data. There is a standard script for names separated by a comma and for names in Chinese. The splitting can also be done using Excel (Left/Right functions). Just beware that these functions will apply nicely to standard 3-character names, but they need to be tuned for other formats, especially with compound surnames such as 歐陽, 上官, etc. In such cases, you will need to do the corrections manually.
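The logic of splitting Chinese names, including the compound-surname case, can be sketched in Python as below. This is not our R script: the function name `split_chinese_name` is hypothetical and the list of compound surnames is a short illustrative sample, not an exhaustive one. The sketch treats the first character as the surname unless the first two characters match a known compound surname.

```python
COMPOUND_SURNAMES = {"歐陽", "上官", "司馬", "諸葛"}  # illustrative sample only


def split_chinese_name(full_name):
    """Split a 姓名 string into (姓, 名)."""
    name = full_name.strip()
    if len(name) > 2 and name[:2] in COMPOUND_SURNAMES:
        return name[:2], name[2:]
    return name[:1], name[1:]


print(split_chinese_name("劉少奇"))  # ('劉', '少奇')
print(split_chinese_name("歐陽修"))  # ('歐陽', '修')
```

A lookup of this kind can still be wrong for an individual name, so the results for names starting with a listed compound surname are worth a manual check.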
Labeling columns for import into MCBD
Basically, your column names in your work files should be labelled in this way:
| Surname – vernacular | <-- | Suname_ZH |
| Given name(s) – vernacular | <-- | Gname_ZH |
| Zi 字 | <-- | Zi 字 |
| Hao 號 | <-- | Hao 號 |
Please note that for gender, you need to write explicitly Male or Female (no M or F).
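If your work file uses abbreviations, the gender column can be normalized before import with something like the following Python sketch. The lookup table and the function name `normalize_gender` are assumptions made for the example; values that are not recognized are returned unchanged so that they can be reviewed by hand.

```python
GENDER_LABELS = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def normalize_gender(value):
    """Return "Male" or "Female" for known spellings, else the stripped input."""
    cleaned = value.strip()
    return GENDER_LABELS.get(cleaned.lower(), cleaned)


print([normalize_gender(v) for v in ["M", "f", "Male", "unknown"]])
# ['Male', 'Female', 'Male', 'unknown']
```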
Preparing files for MCBD
Most of the time, you will be working with your own files with the field names you feel most comfortable with or that you find more practical. This is fine. You do not have to change your practice.
To prepare the file for upload into MCBD, however, the field names in your file must correspond to the field names in MCBD. There is a fairly easy way to do this without messing up your original data. For this you need to rely on the templates available on G-Drive: https://docs.google.com/spreadsheets/d/14pAWfyoTX_NBHNq9PzoI8LxYjLvubGzz8RvCfh_PGHU/edit#gid=0
Above the first row of your file, insert three more empty rows. In the middle empty row, just paste the field names (column names) that you can find in the various templates we have created. Rather than reordering your columns to match the order of the template, just copy/paste the template field names in the empty row below the template field row. Just copy each field name one by one above your corresponding columns. This is the safest and quickest way to proceed.
When you are done, you can also decide which columns will be imported and which columns will just remain in your original file. Again, no need to create a specific version for import. Very simply, just use the first empty row and insert “Include” or “Discard” above each column to determine which ones will be imported (Include) and which ones will be left aside.
This is an example of what the final file should look like:
- First row: this is to indicate which columns are to be included for import into MCBD
- Second row: original data labels of the user
- Third row: MCBD data labels
Just save it under its original file name + “_MCBD”. You can email it or drop it in Dropbox: ENP-China -> MCBD_Heurist_Folder -> Files-for-Import
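For those who prepare the import file by script rather than by hand, the three-row layout listed above (Include/Discard flags, your own labels, MCBD labels) can be sketched in Python as follows. The function names and the user labels `xing`, `ming` and `notes` are hypothetical; the actual MCBD labels must be taken from the templates. The sketch assumes that a column is imported exactly when it has an MCBD label.

```python
from pathlib import Path


def add_import_rows(table, mcbd_labels):
    """Build the import layout from a table whose first row holds your labels.

    mcbd_labels maps your own column labels to MCBD field names; columns
    without an entry are flagged "Discard" and get an empty MCBD label.
    """
    user_labels = table[0]
    flags = ["Include" if label in mcbd_labels else "Discard" for label in user_labels]
    mcbd_row = [mcbd_labels.get(label, "") for label in user_labels]
    return [flags, user_labels, mcbd_row, *table[1:]]


def mcbd_filename(filename):
    """Original file name + "_MCBD", keeping the extension."""
    path = Path(filename)
    return f"{path.stem}_MCBD{path.suffix}"


table = [["xing", "ming", "notes"], ["劉", "少奇", "check source"]]
labels = {"xing": "Surname – vernacular", "ming": "Given name(s) – vernacular"}
for row in add_import_rows(table, labels):
    print(row)
print(mcbd_filename("names.xlsx"))  # names_MCBD.xlsx
```

The original columns are neither moved nor renamed, which matches the principle of leaving your working file as it is.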
Data transformation: basic workflow
Proceed in the following order:
- Add an ID column and ID numbers
- Change all simplified characters to traditional characters
- Split Surname and Given name (both for Chinese and Western names)
- Split or extract point location name from place name
- Transliterate named entities (surname, given name, place name) into pinyin
- Institutions and positions: keep original name as in source
- Degrees and disciplines (higher education): keep original name as in source, but create a new column for each with standardized names
- Columns with composite information: extract and distribute data in distinct columns
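As an illustration of the “keep the original, add a standardized column” step for degrees and disciplines, here is a minimal Python sketch. The lookup table, the column name `Degree` and the `_std` suffix are all made up for the example; unmatched values are left blank in the new column so that they can be standardized by hand.

```python
# Hypothetical lookup table; keys are lowercased source spellings.
DEGREE_STANDARD = {
    "b.a.": "Bachelor of Arts",
    "ba": "Bachelor of Arts",
    "ph.d.": "Doctor of Philosophy",
    "phd": "Doctor of Philosophy",
}


def add_standardized_column(rows, source, mapping, suffix="_std"):
    """Keep the source column untouched and add a standardized one beside it."""
    result = []
    for row in rows:
        standardized = mapping.get(row[source].strip().lower(), "")
        result.append({**row, source + suffix: standardized})
    return result


rows = [{"Degree": "B.A."}, {"Degree": "PhD"}, {"Degree": "Licence"}]
for row in add_standardized_column(rows, "Degree", DEGREE_STANDARD):
    print(row)
```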