PCA and Network Data Explorer

I presented the following visualization at the DHAsia summit in late April 2018. It displays nearly 18k chapters, or juan, from historical and fictional genres of texts written in the Ming and Qing dynasties. The Principal Component Analysis is based on the relative frequencies of the 1,000 most common characters in this corpus.
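As a rough illustration of the kind of computation described above, here is a minimal sketch in Python. It is not the code behind the visualization: the function names, the choice to divide counts by total document length, and the use of a plain SVD without further scaling are all assumptions made for the example, and the four toy "documents" are invented.

```python
from collections import Counter

import numpy as np


def relative_frequencies(docs, top_n=1000):
    """Build a documents-by-characters matrix of relative frequencies.

    Assumption: the denominator is the full document length, so a row
    can sum to less than 1 when a document contains rarer characters.
    """
    corpus_counts = Counter()
    for text in docs:
        corpus_counts.update(text)
    vocab = [ch for ch, _ in corpus_counts.most_common(top_n)]
    index = {ch: i for i, ch in enumerate(vocab)}

    matrix = np.zeros((len(docs), len(vocab)))
    for row, text in enumerate(docs):
        for ch in text:
            col = index.get(ch)
            if col is not None:
                matrix[row, col] += 1
        if len(text) > 0:
            matrix[row] /= len(text)
    return matrix, vocab


def pca(matrix, n_components=4):
    """Return document scores and per-character component directions.

    Assumption: columns are mean-centered but not rescaled. "Loadings"
    here are the unscaled principal directions from the SVD.
    """
    centered = matrix - matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components].T
    return scores, loadings


# Invented toy corpus; a real run would use top_n=1000 on the full corpus.
docs = ["天地玄黃天地", "宇宙洪荒天地", "日月盈昃宇宙", "天地日月天地"]
freqs, vocab = relative_frequencies(docs, top_n=5)
scores, loadings = pca(freqs, n_components=2)
```

Each row of `scores` gives one document's position on the chosen axes, and each row of `loadings` gives one character's contribution to those axes, which is roughly what the "Loadings" view plots.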

The primary difficulty with principal component analysis as a method for studying Chinese corpora is identifying how individual texts fit within the larger picture. This visualization is meant to facilitate moving from macro-level analyses, like the ones seen in my 2016 Cultural Analytics paper, to the micro level, where individual documents are identifiable.

I will continue to refine this, and if there is interest in the community, I may release it so others can use it for their own data.

Instructions for Use

Zoom in and out with your mouse or trackpad, and click and drag to pan. You can remove or display a text category by clicking on the corresponding box in the legend. The "Titles" button turns labels on and off (this can be a bit laggy when displaying all documents, as nearly 18k titles render).

You can switch the Principal Component that corresponds with each axis using the "X-Axis" and "Y-Axis" buttons (PCs 1-4 can be viewed). "Loadings" shows the component loadings, the distribution of which can help you interpret the meaning of the axes. "Data" toggles the points on and off (to help you see the loadings better). "Font -" and "Font +" change the font size of the labels and loadings one point at a time.

"Wei/Shi" displays only documents that mention Wei Zhongxian or Shi Kefa. "Network" displays intertextuality edges between the Wei and Shi documents. These edges were detected using my BLAST-like algorithm. Each edge gets a score based on intertextuality: a score of 100 means that at least 100 characters were found within quotes. This could be ten 10-character quotes, one 100-character quote, two 50-character quotes, etc. The scores are also adjusted based on quote similarity scores, hence the possibility of non-integer scores.

"Run" converts the visualization into a force-directed graph and automatically reduces the viz to just the Wei and Shi documents, as those are the only edges I've included (running with all edges is too slow). Once "Run" is pressed, you will have to reload the page to return to the Principal Component view. In the network view you can increase or decrease the threshold for an edge ("-/+ Thresh"), remove or restore the edges ("Edges," so you can see document labels more easily), and show only edges that actually mention Wei or Shi ("Limit").
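To make the edge scoring and thresholding described above more concrete, here is a small hypothetical sketch in Python. The exact formula used by the BLAST-like algorithm is not given here, so the weighting below (quote length multiplied by a similarity between 0 and 1, summed over all quotes shared by two documents) is only an assumption that happens to be consistent with the description: a score of 100 would then require at least 100 quoted characters. The `Quote` class, the function names, and the document labels are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    length: int        # number of characters in the shared passage
    similarity: float  # assumed to lie between 0.0 and 1.0


def edge_score(quotes):
    """Assumed scoring: each quote contributes length * similarity."""
    return sum(q.length * q.similarity for q in quotes)


def filter_edges(edges, threshold):
    """Keep edges whose score meets the threshold, like "-/+ Thresh"."""
    return {pair: score for pair, score in edges.items() if score >= threshold}


# Ten exact 10-character quotes and one exact 100-character quote
# both reach a score of 100 under this assumed formula.
ten_short = edge_score([Quote(10, 1.0) for _ in range(10)])
one_long = edge_score([Quote(100, 1.0)])

# Two 50-character quotes, one of them inexact, fall just short of 100.
two_mixed = edge_score([Quote(50, 0.9), Quote(50, 1.0)])

# Hypothetical document pairs and their scores.
edges = {("doc_a", "doc_b"): ten_short, ("doc_a", "doc_c"): two_mixed}
strong_edges = filter_edges(edges, threshold=100)
```

Under this assumption, raising the threshold drops weakly connected pairs first, which is the behavior the "-/+ Thresh" buttons expose in the network view.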

Note: You can press the "Run" button multiple times to "recharge" the network! If you did not remove non-Wei and non-Shi documents before pressing "Run," the graph will reduce to the Wei and Shi documents but will not run on the first click. If it does not move at first, just try pressing "Run" several more times. You may need to zoom or pan to get the best view of the network. Please let me know if you find any bugs! This is a demo at the moment and may have some flaws.