
Digital Humanities : Text Analysis & Data Mining

A resource guide for learning about tools, practices, and projects in the field of digital humanities.

Text Analysis

Text analysis is the process of analyzing a body, or corpus, of natural-language text in order to detect patterns (such as word frequencies or associative links), create visualizations from the text, categorize or annotate the text, or otherwise "mine" it for relevant, novel, or interesting information and interpretation.
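As a minimal illustration of one such pattern, word frequency, counting tokens in a small corpus can be sketched with Python's standard library (the sample sentence and function name here are illustrative, not part of any particular tool):

```python
import re
from collections import Counter

def word_frequencies(text, top_n=5):
    """Lowercase the text, extract alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(top_n)

corpus = "The quick brown fox jumps over the lazy dog. The dog sleeps."
print(word_frequencies(corpus, top_n=3))  # [('the', 3), ('dog', 2), ('quick', 1)]
```

Real projects add tokenization rules suited to their language and corpus, but the core counting step looks much like this.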

Datasets and Tools in the Libraries

HathiTrust
A not-for-profit collaborative of academic and research libraries, including the UW-Madison Libraries, working to preserve 17+ million digitized items. HathiTrust stewards the collection in the interests of scholarship rather than commerce, and it advances its mission and goals through a variety of services and programs for scholars, including:

  • The HathiTrust Digital Library - This digital library provides access to millions of digitized items as well as to a collection builder tool you can use to create your own collections or datasets for analysis. 
  • The HathiTrust Research Center (HTRC) - This research center based out of Indiana University and the University of Illinois at Urbana-Champaign offers services that support use of the HathiTrust corpus as a dataset for analysis via text and data mining research. 
    • HTRC Analytics - provides access to HTRC worksets and to off-the-shelf algorithms for analyzing them.
    • Extracted Features Datasets - page-level features for 17.1 million volumes in the HathiTrust Digital Library, including part-of-speech-tagged token counts, header/footer identification, marginal character counts, and much more.
    • HathiTrust+Bookworm - a visualization tool that lets researchers graph word trends across the HathiTrust corpus and facet their searches by bibliographic metadata.
    • Data Capsules - a secure computing environment in which researchers can create a virtual machine desktop “capsule” for running text analysis on the HathiTrust corpus.

Gale Digital Scholar Lab
An online tool for collecting data sets composed of content from the UW-Madison Libraries’ subscriptions to Gale Primary Sources databases. Those data sets can then be analyzed using text analysis and visualization tools built into the Digital Scholar Lab. Supported digital humanities methods include named entity recognition, topic modeling, part-of-speech tagging, and more.

Clarivate Web of Science Dataset
This data set is based on the content of the UW-Madison Libraries’ subscription to the Web of Science database and the specific data files included in our subscription package. The data covers information published between 1900 and 2017. If you would like to begin using this data set, please contact the UW-Madison Libraries’ Library Technology Group using this Technical Assistance contact form.

Preparing for Text Analysis

When defining your project and preparing your materials for analysis, pay special attention to the copyright, licensing, and permissions associated with your materials so that you understand what you are allowed to do. What kind of analysis do you hope to conduct, what do you aim to publish as the end product of your project, and do the materials you would like to work with allow for this? The open access resource Building Legal Literacies for Text Data Mining can help you learn more.

Depending on the quality of your data and the goals of your project, you may need to clean or standardize your data by removing unwanted characters or by removing (or deliberately retaining) stopwords. You may also need to structure the chosen text. Structuring textual data can include tagging it with metadata (data about data) or with part-of-speech (POS) tags in order to help define relationships between words, describe parts of the corpus, and disambiguate meanings. Structuring can be done manually or algorithmically.
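A rough sketch of this kind of cleaning, using only Python's standard library and a small, purely illustrative stopword list (real projects use curated lists, such as those shipped with NLTK or spaCy, tuned to their corpus):

```python
import re

# A small illustrative stopword list, NOT a standard one; real projects
# choose stopwords to suit their corpus and research question.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def clean_tokens(text):
    """Lowercase, strip non-letter characters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tokens("The analysis of a corpus is the first step."))
# ['analysis', 'corpus', 'first', 'step']
```

Whether to remove stopwords at all depends on the method: frequency studies often drop them, while stylometric analyses may rely on them.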

For more support and expertise on preparing and structuring your text for analysis, consult the Text Encoding Initiative (TEI), "an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research". A good place to start is on their "Learn the TEI" webpage, which introduces the TEI guidelines and maintains a listing of tutorials for learning TEI.
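To give a sense of why structured markup helps, the hypothetical fragment below uses simplified TEI-style tags (a real TEI document declares the TEI namespace and follows a much richer schema), which Python's standard library can then query:

```python
import xml.etree.ElementTree as ET

# A heavily simplified, hypothetical TEI-style fragment for illustration only.
tei_fragment = """
<text>
  <body>
    <p>Call me <persName>Ishmael</persName>.</p>
    <p>It was the best of times.</p>
  </body>
</text>
"""

root = ET.fromstring(tei_fragment)
# Markup lets us pull out tagged entities and plain paragraph text separately.
names = [el.text for el in root.iter("persName")]
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
print(names)
print(paragraphs)
```

Because the personal name is explicitly tagged, a script can retrieve it directly instead of guessing which words are names, which is exactly the kind of disambiguation that structuring provides.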

Choosing a Tool

Your research question will help you determine which text analysis method best fits your project, which in turn can help you choose a tool. Many tools, including open source software, exist to help you process your textual data. Some are geared toward particular purposes, such as clustering and categorization, sentiment analysis, or predictive modeling.