HathiTrust Digital Library Expands Access to Copyrighted Works for Research

As I’ve written previously on Rent Check, one of the worst features of copyright law is that copyright holders can prevent access to the copyrighted content itself, rather than just its commercialization. Patent law, by contrast, at least requires disclosure of the innovation.

HathiTrust, which has prevailed in previous legal challenges, is an online repository with data collected from research libraries, Google Books, and the Internet Archive. Previously, the site only allowed its data-mining methodologies (such as checks for how frequently specific words or phrases are used) to be applied to works in the public domain. HathiTrust has recently revised its policy:

With the development of a landmark HathiTrust policy and an updated release of HTRC Analytics, HTRC now provides access to the text of the complete 16.7-million-item HathiTrust corpus for nonconsumptive research, such as data mining and computational analysis, including items protected by copyright.

This extraordinary opportunity to use copyrighted materials for nonconsumptive research purposes expands research access to the entire HathiTrust digital collection, which is sustained by HathiTrust’s 140+ member libraries. Researchers may access HTRC’s easy-to-use computational tools ideal for beginners, as well as more complex tools to meet advanced data analysis needs.

Thanks to recent court rulings that have found “a solid legal basis for nonconsumptive research on copyrighted materials,” countless works (including books and full archives of magazines) published after 1923 and still under copyright protection are now available for analysis. Of course, you won’t be able to read the full works themselves (hence the “nonconsumptive” specification), but this is an important tool for researchers.

Don’t take my word for it. On Twitter, Richard Jean So explained the implications:

This is a huge deal. The Hathitrust has over 18 million digitized books. That includes full runs of magazines like the New Yorker. Any scholar can now access this data including stuff 1923-present and analyze it. Sounds hyperbolic but it’s not: this will transform the humanities. https://t.co/rXRj4a1ri2

— Richard Jean So (@RichardJeanSo) September 22, 2018

For one example of how these data can be used, social scientists often analyze large batches of text to see which words or phrases have been used more or less frequently over time, a useful technique for tracing the development of political thought.
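To make that kind of frequency analysis concrete, here is a minimal sketch in Python. It does not use HTRC Analytics or any HathiTrust API. The function name `phrase_frequency` and the two-entry `corpus` are hypothetical, invented purely for illustration, and a real study would run over far larger texts with more careful tokenization.

```python
import re


def phrase_frequency(text, phrase):
    """Return occurrences of `phrase` per 1,000 words of `text`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if not tokens or n == 0:
        return 0.0
    # Slide a window of n tokens across the text and count exact matches.
    hits = sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
    )
    return 1000 * hits / len(tokens)


# Hypothetical toy corpus keyed by publication year (not real data).
corpus = {
    1950: "The free market and the state. The state must plan.",
    2000: "The free market decides. The free market rewards risk.",
}

# Relative frequency of the phrase in each year's text.
trend = {
    year: phrase_frequency(text, "free market")
    for year, text in sorted(corpus.items())
}

for year, rate in trend.items():
    print(year, round(rate, 1))
```

Normalizing by the length of each text matters here: raw counts would mostly reflect how much was published in a given year rather than how prominent the phrase was.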