HathiTrust Digital Library : Analyze

An introduction to using the HathiTrust Digital Library and HathiTrust Research Center for computational analysis and text mining.

The HathiTrust Research Center (HTRC)

The HathiTrust Research Center (HTRC) facilitates non-consumptive computational analysis of the HathiTrust Digital Library for non-profit research and educational use. The HTRC is co-located at Indiana University and the University of Illinois at Urbana-Champaign and engages in both research and development for computational text analysis of digital libraries. You can find documentation, tutorials, datasets, and the tools below on the HTRC Analytics page. 

NOTICE: HathiTrust will suspend funding of the HathiTrust Research Center at the end of 2026 to allocate resources toward new programs that leverage emerging technologies to enhance its services and collection. Please see the HTRC Transition FAQ page for updates.

Tools

HathiTrust + Bookworm: Tool that visualizes word trends across the HathiTrust corpus (including copyrighted and public domain volumes), allowing researchers to discover textual use patterns over time. 

Text Analysis Algorithms: Click-and-run tools that perform computational analysis on volumes in the HathiTrust Digital Library. Includes:

  • Extracted Features Download Helper (v3.0.2): Allows you to download extracted features data for your workset of choice.
  • InPhO Topic Model Explorer (v1.0): Trains multiple LDA topic models and allows you to export files containing the word-topic and topic-document distributions, along with an interactive visualization.
  • Named Entity Recognizer (v2.0): Creates a list of all named entities found in a workset, including the names of people and places as well as dates, times, percentages, and monetary terms.
  • Token Count and Tag Cloud Creator (v2.0): Identifies the tokens (words) that occur most often in a workset and the number of times they occur. 
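
To make the idea of token counting concrete, here is a minimal Python sketch of the kind of frequency count the Token Count and Tag Cloud Creator describes. It is not the HTRC tool itself: the function name `top_tokens`, the simple regular-expression tokenizer, and the two sample sentences standing in for a workset are all assumptions made for illustration.

```python
import re
from collections import Counter

def top_tokens(texts, n=10):
    """Return the n most frequent lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Assumption: a token is a run of letters or apostrophes.
        # The HTRC tool may tokenize differently.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Hypothetical stand-in for the text of a small workset.
workset = [
    "The whale swam. The ship followed the whale.",
    "A ship at sea.",
]

for token, count in top_tokens(workset, 3):
    print(token, count)
```

A tag cloud is a visual rendering of the same counts, with more frequent tokens drawn larger.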

Data Capsules: Secure computing environments for performing researcher-driven text analysis on the HathiTrust corpus. All users may access public domain items in the capsule, but computational access to copyrighted items for non-consumptive use is available only to member-affiliated researchers. Capsules are available for the following use cases:

  • Demo capsules: Small virtual computing environments meant for testing Data Capsule features.
  • Research capsules: Larger virtual computing environments intended for robust research.
  • User template capsules: Designed to help educators and collaborators share snapshots of their own research data capsules. 

Datasets

Extracted Features Datasets: Research datasets of downloadable non-consumptive book data. Copyright-protected texts are not downloadable, but these datasets offer features extracted from the full text, such as volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts. Datasets include the main extracted features dataset and two derived datasets:

  • HTRC Extracted Features Dataset: Page-level features from 17.1 million volumes
  • Word Frequencies in English-Language Literature, 1700-1922: Genre-specific word counts for 178,381 volumes
  • Geographic Locations in English-Language Literature, 1701-2011: Geographic locations mentioned in volumes of fiction
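
Because the features are recorded per page, a common first step is to roll page-level token counts up to the volume level. The Python sketch below illustrates that step under stated assumptions: the `pages` list is a simplified, hypothetical stand-in for page records (each token mapped to its part-of-speech tags and counts), not the exact file layout of the HTRC dataset, and `volume_token_counts` is an illustrative name.

```python
from collections import Counter

# Hypothetical, simplified page records: token -> {part-of-speech tag: count}.
pages = [
    {"seq": "00000001", "tokenPosCount": {"whale": {"NN": 2}, "swam": {"VBD": 1}}},
    {"seq": "00000002", "tokenPosCount": {"whale": {"NN": 1}, "ship": {"NN": 3}}},
]

def volume_token_counts(pages):
    """Sum page-level counts into one volume-level count per token."""
    totals = Counter()
    for page in pages:
        for token, pos_counts in page["tokenPosCount"].items():
            # Collapse the part-of-speech breakdown into a single count.
            totals[token] += sum(pos_counts.values())
    return totals

for token, count in volume_token_counts(pages).most_common():
    print(token, count)
```

Only counts are involved at any point, never the running text, which is why data of this kind can be shared for volumes whose full text cannot be downloaded.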

Worksets: User-created collections of HathiTrust volumes treated as data for analysis in HTRC tools and services. You can either create your own worksets or use those curated by other researchers. Worksets can be shared or cited to increase reproducibility.

Custom Datasets: Researchers seeking large numbers of public domain texts for local analysis can submit a request.

Helpful HathiTrust Research Center Pages

The following pages will help you plan how to use the HathiTrust Research Center's tools and services for your own research:

  • Documentation - Here you can find more information on HTRC's datasets, tools, and policies, along with tutorials and guides to get you started. 
  • Non-consumptive use policy - This page provides information about HathiTrust's non-consumptive use policy and how it applies to research.
    • Non-consumptive research covers computational analysis performed on one or more volumes (textual or image objects) in the HathiTrust collection, but not research in which a researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand its expressive content.
  • HTRC Advanced Collaborative Support - The HTRC offers specialized expertise, developer time, and computing resources to researchers who apply for and are awarded support.