Text Mining Resources¶

This work is licensed under a Creative Commons Attribution 4.0 International License.

This guide provides information on text mining resources.

AI Taxonomy¶

NIST Artificial Intelligence Risk Management Framework: Defines an AI system as an engineered or machine-based system that can, for a given set of objectives, generate outputs such as predictions, recommendations, or decisions influencing real or virtual environments. AI systems are designed to operate with varying levels of autonomy.
NIST Trustworthy and Responsible AI: Aims to provide a flexible means of classifying how an AI system contributes to an outcome. The taxonomy sets out 16 AI use “activities” that are independent of AI techniques and domains. Tasks are combinations of one or more AI use activities.
Generalist Repository Ecosystem Initiative (GREI) AI Taxonomy: Funded by the NIH, GREI developed an AI taxonomy tailored to data repository roles to guide AI integration across repository management. It groups the roles into stages (acquisition, validation, organization, enhancement, analysis, sharing, and user support), providing a structured framework for implementing AI in repository workflows.

Text Data Sources¶

Constellate: Constellate was the text analytics service from ITHAKA (JSTOR and Portico). It was a platform for teaching, learning, and performing text analysis using archival repositories of scholarly and primary source content. Constellate was sunset in June 2025.

Access Note: create a free account with your @arizona.edu email address to obtain full functionality of the platform.
Dimensions Plus API: Dimensions Plus includes grants, publications, citations, alternative metrics, clinical trials, patents, and policy documents. Register with your NetID and password, then email support@dimensions.ai to enable API access.
Elsevier API: Elsevier's API program allows you to integrate content and data from Elsevier products into your own website and applications. APIs are free for the products Arizona subscribes to: Scopus, Engineering Village, and subscribed journals in ScienceDirect.
Scopus Search: The Scopus search API supports basic, advanced, and AI-powered queries of the Scopus literature archive.
IEEE API Portal: API portal for IEEE.
JSTOR for Data Research: Data for Research (DfR) provides datasets of content on JSTOR for use in research and teaching. Researchers may use DfR to define and submit their desired dataset to be automatically processed. Data available through the service includes metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets on JSTOR. Datasets are produced at no cost to researchers and may include data for up to 25,000 documents.
LexisNexis Web Services Kit: LexisNexis Web Services Kit is a mediated service that allows bulk download of Nexis Uni content (formerly LexisNexis Academic). Up to 250 documents and 1,000 metadata records can be downloaded from Nexis Uni without using the API. Contact your subject librarian for access to LexisNexis Web Services Kit.
PLOS API: Python tool for downloading/updating/maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of doing web scraping.
ProQuest TDM Studio: ProQuest TDM (Text and Data Mining) Studio allows you to create and analyze datasets from ProQuest content.
RavenPack News Analytics: Use for financial and economic analysis. Access through WRDS.
Web of Science: A collection of databases that index the world’s leading scholarly literature in the sciences, social sciences, arts, and humanities, as published in journals, conference proceedings, symposia, seminars, colloquia, workshops, and conventions across the globe.
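
Several of the services above, such as JSTOR Data for Research, deliver derived features (word counts and n-grams) rather than full text. As a rough illustration of what those features are, the sketch below computes them from a short sample string with the Python standard library. The tokenizer and the function names are illustrative assumptions; they are not the method any of these services uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, used only for illustration.
sample = "Text mining turns text into data, and data into findings."
tokens = tokenize(sample)

word_counts = Counter(tokens)                # unigram frequencies
bigram_counts = Counter(ngrams(tokens, 2))   # frequencies of adjacent word pairs

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Datasets delivered as counts can be loaded into a `Counter` in the same way, which makes it easy to merge counts across documents with `+`.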

Freely Available Text Data Sources¶

arXiv Bulk Data: Our mission is to provide rapid dissemination of scientific results at no cost to authors or readers. Providing free Application Programming Interfaces (APIs) helps us to advance that mission by enabling platforms and projects that extend the discoverability of arXiv e-prints and provide valuable services to scientists and interested readers.
Books to Scrape: Demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
CORE: Open Access Research Papers: CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. Full datasets available by request.
HathiTrust Research Center Analytics: Supports large-scale computational analysis of the works in the HathiTrust Digital Library to facilitate non-profit and educational research.
Internet Archive: Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more.
Library of Congress (LC) for Robots: We hope this list of APIs, bulk downloads, and tutorials will help you begin exploring the many ways the Library of Congress provides machine-readable access to its digital collections.
New York Times Developer Network: All the APIs fit to post.
Project Gutenberg Robot Access: Project Gutenberg is a library of over 60,000 free eBooks. Its robot access page outlines which kinds of automated access to content are allowed.
PubMed APIs: PMC hosts a number of important article datasets and makes its APIs and some code available via public code repositories.
OpenAlex: A fully open catalog of the global research system. It is named after the ancient Library of Alexandria and made by the nonprofit OurResearch.
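
Many of the sources in this list are queried over HTTP and return JSON. As a minimal sketch, assuming the OpenAlex works endpoint and its `search`, `per-page`, and `mailto` query parameters, the code below builds a request URL and pulls titles and years out of a decoded response. The helper names are hypothetical, and the response fields used (`results`, `display_name`, `publication_year`) should be checked against the current OpenAlex documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_url(search, per_page=25, mailto=None):
    """Build a search URL for the OpenAlex works endpoint."""
    params = {"search": search, "per-page": per_page}
    if mailto:
        # Optional contact address; assumed to be accepted as `mailto`.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def summarize(response):
    """Extract (title, year) pairs from a decoded works response."""
    return [
        (work.get("display_name"), work.get("publication_year"))
        for work in response.get("results", [])
    ]


def fetch_works(search, **kwargs):
    """Send the request and return the decoded JSON (needs network access)."""
    with urlopen(build_works_url(search, **kwargs), timeout=30) as resp:
        return json.load(resp)


# Building the URL does not touch the network, so it is safe to inspect first.
print(build_works_url("text mining", per_page=5))
```

Calling `summarize(fetch_works("text mining", per_page=5))` would then run the query; keeping URL construction separate from the network call makes the request easy to review and log.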

Social Media and the Web¶

To collect data from social media, researchers typically use the public APIs that the platforms themselves provide, such as the following:

X API
Access Twitter (X) data for posts, threads, comments, users, and more. Suitable for data mining and analysis.
Google Blogger
API for accessing and managing Blogger content programmatically.
Internet Archive Bulk Download
Download files from archive.org in an automated way using tools like wget.
Reddit API
Access data from posts, threads, comments, users, and more from Reddit and its subreddits.
Pushshift Reddit Data
Historical Reddit data collected as monthly CSV downloads.
Stanford Large Network Dataset Collection (SNAP)
The SNAP library has collected data on large social and information networks since 2004.
Twitter Streaming APIs
Public streams provide access to real-time public data flowing through Twitter. Suitable for following specific users or topics and data mining. You can also access single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.
Wikipedia Data Dumps
Monthly database backups of all Wikimedia wikis in various formats.
Yelp API
Access to business data, including location, photos, Yelp rating, price levels, hours of operation, and types of transactions. Also includes a Review API, which returns up to 3 review excerpts for a business.
Blog Authorship Corpus
Over 600,000 posts from more than 19,000 bloggers.
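
The Internet Archive Bulk Download entry above mentions automated retrieval with tools like wget. A minimal sketch of the preparation step follows: it turns an item's metadata into a list of download URLs that could be written to a file and passed to `wget -i`. It assumes that item metadata (as returned by `https://archive.org/metadata/<identifier>`) contains a `files` list whose entries have a `name`, and that files are served from `https://archive.org/download/<identifier>/<name>`. The item identifier and file names shown are hypothetical.

```python
from urllib.parse import quote

DOWNLOAD_BASE = "https://archive.org/download"


def download_urls(identifier, metadata, suffix=".txt"):
    """List download URLs for the files in an item whose names end with `suffix`.

    `metadata` is the decoded JSON for one item; only its "files" list is used.
    """
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.endswith(suffix):
            # quote() escapes spaces and similar characters but keeps "/" intact,
            # so files inside subdirectories keep their path.
            urls.append(f"{DOWNLOAD_BASE}/{quote(identifier)}/{quote(name)}")
    return urls


# Hypothetical metadata for illustration; a real item lists many more fields.
sample_metadata = {
    "files": [
        {"name": "example_djvu.txt"},
        {"name": "example.pdf"},
    ]
}

for url in download_urls("example-item", sample_metadata):
    print(url)
```

Filtering by suffix before downloading keeps a text mining corpus limited to the plain-text derivatives and avoids fetching page images or PDFs that will not be used.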

Government Documents¶

Congress.gov API
Includes bills, amendments, summaries, Congress members, the Congressional Record, committee reports, nominations, treaties, and House Communications. Over time, hearing transcripts and Senate Communications will be added. Sign up for a free API key to use.
ProQuest Congressional Text 1824-2020
Full text of United States Congressional Hearings (both House and Senate) from 1824-2020 as extracted by ProQuest. Delivered in bulk as XML files with pre-processing completed to extract individual hearing files, rename by hearing ID, and group into folders by decade. By accessing the data, you agree to abide by the included Terms of Use file. Read it thoroughly before use.
CourtListener API / Bulk Legal Data
Access opinions, docket files, and more from 420 courts.
FDSys Bulk Download
Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE), and more.
Harvard Caselaw Access Project
Includes all official, book-published United States case law—every volume designated as an official report of decisions by a court within the United States. Research scholars can qualify for bulk data access by agreeing to certain use and redistribution restrictions. Request a bulk access agreement by creating an account and then visiting your account page.
U.S. Department of the Interior: Bureau of Land Management - General Land Office Records (GLO)
Provides direct access to all of the data behind the glorecords.blm.gov website with a series of web service methods in XML format.
Voxgov
Provides access to real-time documents, press releases, and social media posts from candidates for Congress and governor across the U.S. Options to compare candidates and groups (e.g., Senate Democrats vs. Republicans), filter by geography or demographics, and generate term frequency charts and word clouds.
United States Patent & Trademark Open Data Portal
"Open data" is publicly available data that is structured in a way that enables the data to be fully discoverable and usable by end users. It can be freely used, reused, and redistributed by anyone. Its value lies not only in what it does today but also in what it can do in the future. It is a valuable national resource and a strategic asset to the federal government, its partners, and the public.
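
APIs that require a key, such as the Congress.gov API listed above, usually take the key as a request parameter and return results one page at a time. The sketch below shows one way to build paged request URLs. It assumes the `https://api.congress.gov/v3` base URL and the `api_key`, `format`, `limit`, and `offset` query parameters; the function names and the `CONGRESS_API_KEY` environment variable are hypothetical, so confirm the details against the API documentation.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.congress.gov/v3"


def build_url(endpoint, api_key, limit=20, offset=0):
    """Build a request URL for one page of results from an endpoint."""
    params = {
        "api_key": api_key,
        "format": "json",
        "limit": limit,
        "offset": offset,
    }
    return f"{BASE_URL}/{endpoint.strip('/')}?{urlencode(params)}"


def page_urls(endpoint, api_key, total, limit=250):
    """Build the URLs needed to page through `total` records, `limit` at a time."""
    return [
        build_url(endpoint, api_key, limit=limit, offset=offset)
        for offset in range(0, total, limit)
    ]


# Read the key from the environment so it never lands in shared code.
# "YOUR_KEY" is only a placeholder for illustration.
key = os.environ.get("CONGRESS_API_KEY", "YOUR_KEY")

for url in page_urls("bill", key, total=600, limit=250):
    print(url.replace(key, "***"))  # mask the key when logging
```

Keeping the key in an environment variable and masking it in printed output helps prevent it from leaking into notebooks or version control.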

Original Source: https://libguides.princeton.edu/textmining/sources
