I am a University Lecturer (Assistant Professor) of Digital Humanities at Leiden University in the Netherlands. My domain expertise is late imperial Chinese literature and print history. I specialize in text mining and natural language processing with a focus on authorship attribution and genre studies. I archive my research as presented in papers and talks on this page. You can find information on my teaching and other activities using the nav bar at the top.


Talk poster

Digital Approaches to Intertextuality and Stylistics in the Plum in the Golden Vase

February 24th, 2017, Arizona State University

Extensive digitization efforts are creating large corpora of imperial Chinese texts, a process that is opening a variety of new avenues for research. Digital corpora are valuable not just for democratizing access to materials; they are also creating the possibility for computational approaches to Chinese literature. Text mining and other digital techniques are beginning to allow scholars to use these materials to study Chinese literature at a large, systematic scale. In this talk, I will discuss my current research into using these large digital corpora to identify and stylistically analyze the wide variety of textual materials that are reproduced within the late Ming novel the Jin Ping Mei 金瓶梅 (The Plum in the Golden Vase). It is an illuminating case, as it is well known for its highly intertextual and experimental nature, making large-scale digital analysis a fruitful exercise. This analysis can situate the Plum in the Golden Vase within the broader stylistic context of its source materials. I begin by identifying where the anonymous author of the novel relies on language that originates in earlier works. Then, I analyze the relationships among these source works using stylometric analysis, a type of analysis that has proven useful for both authorship attribution and genre studies. This allows me to visualize the relationships among the Plum in the Golden Vase, its sources, and its later editions in ways that were not possible in a pre-digital world. The goal of this research is to distill the original voice of the author to facilitate the larger task of determining his or her identity. As part of this talk, I will also discuss some of the pitfalls inherent in digital research on imperial Chinese texts.


t-SNE comparing document vectors representing works of fiction, historical fiction, unofficial histories, and official histories.

Clustering Late Imperial Chinese Texts by Style: Principal Component Analysis and t-SNE.

July 10, 2016, Leiden University

As large corpora of late Imperial Chinese texts become more readily available, they open up exciting new possibilities for digital research. They offer an opportunity to grasp large stylistic trends that are invisible at narrower levels of analysis. However, their number and highly variable content introduce computational and visualization difficulties. Fortunately, a variety of linear algebraic and machine-learning algorithms exist that facilitate these tasks. In this talk, I will compare and contrast several of these algorithms in the context of late Imperial Chinese literature. In the first part of the talk, I will discuss the benefits and drawbacks of using PCA (Principal Component Analysis), a type of linear algebraic transformation of document-term matrices, to analyze a variety of historical and semi-historical texts. In the second portion, I will focus on analyzing these same documents with a related machine-learning technique called t-SNE (t-distributed Stochastic Neighbor Embedding). This technique may offer significant advantages over older clustering methodologies, while producing easy-to-understand, meaningful visualizations. I will finish by discussing the insights these algorithms offer into the nature of late Imperial stylistics.
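To make the PCA step concrete, here is a minimal sketch of projecting a small document-term matrix onto its first principal component. It is an illustration only, not the pipeline used in the talk: the four toy strings, the eight-character vocabulary, and the function names are all invented for this example, and the component is found by simple power iteration rather than a full decomposition.

```python
from collections import Counter
import math
import random


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary character in one document."""
    counts = Counter(text)
    total = sum(counts[ch] for ch in vocabulary) or 1
    return [counts[ch] / total for ch in vocabulary]


def first_principal_component(rows, iterations=200, seed=0):
    """Return (component, scores) for the first principal component.

    rows is a document-term matrix (one list of frequencies per document).
    The component is approximated by power iteration on the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # A seeded random start avoids beginning on a direction of zero variance.
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores


# Toy data (invented): two strings of classical particles, two of vernacular ones.
VOCABULARY = "之乎者也的了是我"
documents = [
    "之乎者也之乎者也之也",
    "之者也乎之之也者",
    "的了是我的了的是",
    "我是的了了的我的",
]
rows = [relative_frequencies(doc, VOCABULARY) for doc in documents]
component, scores = first_principal_component(rows)
for doc, score in zip(documents, scores):
    print(doc, round(score, 3))
```

In this toy case the first two documents land on one side of the axis and the last two on the other. The sign of the axis itself is arbitrary, so only the grouping is meaningful.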


PCA comparing various quasi-historical late Imperial Chinese documents

Fiction and History: Polarity and Stylistic Gradience in Late Imperial Chinese Literature

Appears in CA: Journal of Cultural Analytics. Published May 23, 2016.

In this article I use stylometric analysis to evaluate the stylistic relationships among a collection of "quasi-historical" documents dating from the late Imperial period in China (the works were mostly written from 1550 to 1800). I argue that when I use principal component analysis and hierarchical cluster analysis to analyze a large corpus of Chinese texts that contain historical narratives, I find a gradient of style that runs from purely fictional works through historical romances (novels with historical content) and yeshi 野史 unofficial histories to official historical works vetted by the imperial government.
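As a rough illustration of hierarchical cluster analysis, the sketch below runs average-linkage agglomerative clustering over a handful of feature vectors. It is not the article's code: the labels `text_a` through `text_d` and their two-dimensional vectors are made-up placeholders chosen only to exercise the function, and the choice of Euclidean distance with average linkage is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, labels):
    """Average-linkage agglomerative clustering.

    Returns the merge history as (left_labels, right_labels, distance) tuples,
    ordered from the first (closest) merge to the last.
    """
    clusters = {i: [i] for i in range(len(vectors))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for idx, a in enumerate(keys):
            for b in keys[idx + 1:]:
                # Average pairwise distance between the members of a and b.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(euclidean(vectors[i], vectors[j])
                           for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]),
                       dist))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Placeholder vectors (invented), standing in for per-text style features.
labels = ["text_a", "text_b", "text_c", "text_d"]
vectors = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
merges = agglomerate(vectors, labels)
for left, right, dist in merges:
    print(left, "+", right, round(dist, 3))
```

Reading the merge history from top to bottom gives the same information as reading a dendrogram from its leaves upward: the earliest merges join the most similar texts.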


Principal Component Analysis of the chapters of the cihua edition of the Plum in the Golden Vase.

Who Wrote the Jin Ping Mei? Stylometry and Machine Learning for Chinese Studies.

April 2016, Digital Approaches to Chinese Culture, Part 1: Tools and Methods for Textual and Historical Analysis. Association for Asian Studies Conference, Seattle, WA.

Stylometric analysis, which operates on the premise that authors leave a distinct signature in their writing that can be statistically identified, is an important branch of authorship attribution research. It has proven useful in a number of applications, from confirming James Madison as the author of several disputed Federalist Papers to identifying J.K. Rowling as the author of a pseudonymous mystery novel. As researchers in Chinese studies gain access to an increasing number of digital editions of classical and vernacular Chinese texts, stylometric tools are becoming useful for conducting authorship attribution research and comparative stylistic analysis. In this paper, I introduce several new digital tools (stylometric analysis and machine-learning algorithms) that allow me to explore the probable authorship of the late Ming dynasty novel the Jin Ping Mei 金瓶梅 (Plum in the Golden Vase). The novel was written pseudonymously by the Lanling xiaoxiao sheng 蘭陵笑笑生 (Laughing Scholar of Lanling), and its authorship has been argued over since it first began circulating in manuscript form in the late sixteenth century. A wide variety of probable authors have been suggested, many of whom left behind works that have recently been digitized. I propose a new line of analysis that significantly narrows the range of possible candidates by comparing character usage frequency in the Jin Ping Mei with contemporary texts of known authorship. A machine-learning algorithm then classifies the texts by probable author, determining which works have the most similar character frequencies.
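The idea of comparing character usage frequency can be sketched in a few lines. The abstract does not name the classifier used, so the example below substitutes a plain nearest-profile comparison by cosine similarity. The author names, the sample strings, and the function names are hypothetical and stand in for real texts of known authorship.

```python
from collections import Counter
import math


def char_profile(text):
    """Map each non-whitespace character to its relative frequency."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def most_similar_author(disputed_text, known_texts):
    """Return the candidate whose profile is closest, plus all similarities."""
    disputed = char_profile(disputed_text)
    similarities = {author: cosine(disputed, char_profile(text))
                    for author, text in known_texts.items()}
    return max(similarities, key=similarities.get), similarities


# Hypothetical candidates and toy strings, for illustration only.
known_texts = {
    "author_a": "之乎者也之乎者也",
    "author_b": "的了是我的了是我",
}
best, similarities = most_similar_author("之也者乎之之", known_texts)
print(best, similarities)
```

A real study would need far longer samples, a controlled feature set, and held-out validation before any similarity score could be read as evidence of authorship.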

The focus of this talk was primarily methodological (the talks at Stanford and Peking University were more focused on results).


Wang Shizhen and his proximal social network. Created with China Biographical Database data and Gephi.

Digital Research into the Authorship of the Jin Ping Mei

February 9, 2016, Stanford University

In this talk, Paul Vierthaler will share his research into using computer-aided authorship detection methodologies to offer insight into the authorship of the late Ming novel the Jin Ping Mei 金瓶梅 (Plum in the Golden Vase). Pseudonymously written by the “Laughing Scholar of Lanling” sometime in the late 1500s or early 1600s, the Plum in the Golden Vase has been the subject of intense debate over its author’s identity since its initial circulation. Many candidates have been proposed, argued over, and discarded, but scholars continue to offer new possibilities and to rehash old arguments. Paul offers new insight into the authorship question using two distinct lines of evidence derived from network and stylometric analysis. Scholars currently have a relatively clear, but possibly incomplete, picture of who possessed a Plum in the Golden Vase manuscript prior to the cihua edition’s 1617 publication. From this end-point, modeling manuscript circulation in elite social networks offers some evidence for the work’s initial starting point. Stylometric analysis offers further evidence: by analyzing n-gram frequency in a variety of contemporary texts, and using machine-learning-based classification algorithms, the author’s identity becomes clearer.
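For readers unfamiliar with n-grams, the sketch below counts overlapping character n-grams with a sliding window, which is one common way to build such features for unsegmented Chinese text. The function name and the sample string are illustrative assumptions; the abstract does not specify the n-gram length or any preprocessing used.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


bigrams = char_ngrams("金瓶梅金瓶", 2)
print(bigrams.most_common())
```

Counts like these can be normalized by text length and then passed to a classifier in the same way as single-character frequencies.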


Digital Analysis and the Authorship of the Jin Ping Mei (in Chinese)

January 9, 2016, Peking University

In this talk, Paul Vierthaler will discuss using digital methods to analyze anonymous authorship in late Ming and early Qing novels. This talk will focus on using two distinct lines of evidence to assess the potential authorship of the late Ming novel the Jin Ping Mei. In the first part of his talk, Paul will assess the known circulation of manuscripts of the Jin Ping Mei using the China Biographical Database and social network analysis. In the second portion of his talk, Paul will discuss the use of stylometric and machine-learning analyses in evaluating the most likely candidate author.

数字分析与金瓶梅的作者

2016年1月9日,北京大学

在这次演讲中,李友仁将讨论如何使用数字人文的方法来分析明末清初那些作者不详小说的作者。李友仁会从两个角度来探索金瓶梅的潜在作者:第一,通过中国历代人物传记资料库(China Biographical Database)和社会网络关系分析(Social Network Analysis)来研究金瓶梅抄本的流传。第二,使用 "stylometry" 和机器学习来分析电子化文本,从而评估最有可能的作者。


Textual relationships among a variety of Ming and Qing texts.

Quantitative historical imagination: late Ming and early Qing Chinese unofficial histories, novels, and dramas.

November 20, 2015, University of Chicago

In this talk, Paul Vierthaler will discuss his research in using digital techniques to analyze the differences among texts that transmitted unofficial historical narratives in the late Ming and early Qing periods in China. This talk centers on novels on current events, dramas on current events, and yeshi (unofficial, or wild, histories). These texts, which Paul calls “quasi-histories”, purport to convey information about recent events, but their historical validity and generic nature have been debated by contemporary and modern scholars. In the past, their sheer numbers made systematic analysis difficult. Paul will begin with a meta-analysis of extensive secondary bibliographic information to analyze the claim that late Ming and early Qing quasi-histories were unprecedentedly focused on the recent past. He will finish with a discussion on using stylometric analysis to explore the complex stylistic relationships among texts of these genres, and their relationship with official dynastic histories.

Much of the contents of this talk can be found in the "Fiction and History" article.


Hierarchical cluster analysis of the 100 chapters of the Plum in the Golden Vase.

Who wrote the Jin Ping Mei? Early Results of Quantitative Explorations.

January 2015, Princeton University

This was the first talk I gave on my efforts to use the digital humanities to study the authorship of the Jin Ping Mei. I gave this talk to Paize Keulemans's seminar on the Jin Ping Mei at Princeton. It was mostly exploratory analysis and focused on ways of understanding the relationships among the chapters of the novel. I also presented very early authorship analysis results.


Digital Approaches to Late Imperial Chinese Literature: Exploring Quasi-historical Texts.

September 19, 2014. An Wang Postdoctoral Talk, Fairbank Center for Chinese Studies, Harvard University.

The ever-increasing availability of digital information on pre-modern Chinese texts, from online bibliographic records to fully digitized transcripts, is allowing scholars to adapt mathematical and statistical tools for literary analysis. Paul Vierthaler will address the promise and some of the drawbacks of using digital techniques to analyze broad stylistic differences among late imperial Chinese texts. Stylometry, developed by linguists and widely used in authorship attribution studies, shows promise for illustrating differences in style among various genres of late Imperial writing. This, in turn, provides insight into why traditional bibliographers often classified unofficial histories as novels.