Online Workshop 5: Making More Sense With Machines: AI/ML Methods for Interrogating and Understanding Our Textual Heritage in the Humanities, Natural Sciences, and Social Sciences.

This workshop will be held on Zoom and hosted by the University of Illinois and the HathiTrust Research Center, 29th–30th November 2022, from 10am to 2pm Central Time on both days.
Our cultural heritage includes texts in the widest imaginable variety of subjects, not only in the humanities and arts but also in the natural and social sciences; likewise, our largest digital libraries – including that of the HathiTrust (and its Research Center, which hosts this workshop) – consist of legacy documents in practically all areas of human thought and creativity.
These digitized heritage libraries present special challenges both to computational study in general and to emerging AI/ML approaches in particular: digital library documents are much longer, often by orders of magnitude, and much more diverse than the texts behind most of the training sets and algorithms that have been at the foundation of modern machine learning.
This workshop, the fifth in the series, will focus on the work of interrogating documents of many types and scopes, with the aim of unlocking their data and making it more accessible and more computable. Our shared goal is to make our heritage digital collections in all subject areas richer and more usable through the application and enhancement of computational methods both old and new.
Listed times are in CST (US Central Time); please convert accordingly.
Day 1 (Tuesday 29 November), 10:00 am to 2:00 pm CST
10:00 Glen Layne-Worthey, J. Stephen Downie, UK AEOLIAN Team: Welcoming remarks and general workshop introduction
10:30 Jill Naiman (University of Illinois Urbana-Champaign) “Document Layout Analysis for Scientific Article Figure & Caption Extraction”
11:15 Hema Natarajan (Benetech Corporation) “Making Math Accessible, One Image at a Time”
12:00 Break
12:30 Undergraduate Research Showcase (lightning talks):
- Morgan Cosillo on OCR post-correction with an NLP machine learning model
- Rushdan Jimoh on natural and artificial “page aging” processes for machine learning
12:45 Nikolaus Parulian and Glen Layne-Worthey (University of Illinois) “Machine Learning to Identify Creative Content and Paratext at the Page Level”
1:30 Peter Organisciak (University of Denver) “Neural Nets to Identify Work Relationships in HathiTrust”
Day 2 (Wednesday 30 November), 10:00 am to 2:00 pm CST
10:00 Glen Layne-Worthey: Summary of Day 1 and introduction to Day 2 topics
10:15 Janet Swatscheno (HathiTrust Research Center, University of Michigan) Tutorial: “HathiTrust Extracted Features for Machine Learning”
10:45 Julian Schröter (Universität Würzburg & University of Illinois) “Modeling prototypicality of genre concepts with machine learning and the c@1-score”
11:15 Ben Schmidt (Nomic AI) “How Small Can Big Data Get? HathiTrust Extracted Features in Bits and Browsers”
11:45 Break
12:15 Undergraduate Research Showcase (lightning talks):
- Kiara Balleza on crowd-sourcing the extraction of non-textual page elements
- David Zhu on a Tesseract “parameter sweep” for OCR optimization
12:30 Ming Jiang (Indiana University Indianapolis) and Yuerong Hu (University of Illinois) “The impact of OCR quality on BERT embeddings”
1:00 Ryan Dubnicek and Ted Underwood (University of Illinois) “Piloting a machine-learning approach to identify English-language fiction in the HathiTrust Digital Library”
1:30 Roundtable (all speakers) “Challenges and opportunities with AI/ML methods for understanding our textual heritage across the disciplines”
Speaker Information:
J. Stephen Downie, University of Illinois Urbana-Champaign / HathiTrust Research Center
Ryan Dubnicek, University of Illinois Urbana-Champaign / HathiTrust Research Center
Yuerong Hu, University of Illinois Urbana-Champaign / HathiTrust Research Center
Ming Jiang, Indiana University Indianapolis
Glen Layne-Worthey, University of Illinois Urbana-Champaign / HathiTrust Research Center
Jill Naiman, University of Illinois Urbana-Champaign
Hema Natarajan, Benetech Corporation
Peter Organisciak, University of Denver
Nikolaus Parulian, University of Illinois Urbana-Champaign
Ben Schmidt, Nomic AI
Julian Schröter, Universität Würzburg / University of Illinois
Janet Swatscheno, University of Michigan / HathiTrust Research Center
Ted Underwood, University of Illinois Urbana-Champaign