How We’re Using AI in an Audiovisual Archive

For over three years, we archivists and technologists at WGBH Educational Foundation have been collaborating with computational linguists at Brandeis University to develop, test, and implement a set of tools called CLAMS (Computational Linguistic Applications for Multimedia Services), with the goal of increasing access to our collections through machine-generated metadata.

Scaling Metadata Creation with AI

The American Archive of Public Broadcasting holds around 2.5 million metadata records, of which almost 100,000 have digitized media associated with them, and public media stations add new collections every year. The metadata for some programs is rich, while for others it is sparse. Manually cataloging a collection of this size is a real challenge. By using machine learning/AI, however, we can generate metadata automatically and improve access to the collection.

Kaldi vs Whisper: Automatic Speech Recognition

One way we've increased access to our collections is by creating transcripts with Kaldi, an automatic speech recognition (ASR) application. We run audio or video files through it and get a mediocre transcript. We indexed those mediocre transcripts in Solr while also correcting them through FixIt+, the crowdsourced transcript-correction software originally developed by NYPL. We planned to improve both Kaldi and FixIt+ and were awarded an IMLS National Leadership Grant to do that work. However, machine learning/AI is advancing so quickly that in the time it took for the grant to be submitted, reviewed, and awarded, OpenAI released its own ASR application, Whisper. Anecdotally, transcripts created by Whisper are incredibly good. Whisper's release solved the issues we had planned to address with Kaldi, so we worked with the grant officer to update the deliverables. We're now quantitatively comparing the two ASR applications while continuing to improve FixIt+.
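To give a sense of how little scaffolding Whisper needs, here is a minimal sketch using the open-source openai-whisper Python package. The model size and file name are placeholders for illustration, not our production pipeline.

```python
# A minimal sketch of transcribing one file with the open-source
# openai-whisper package (pip install openai-whisper).
# The model size and file path are placeholders, not our actual setup.
import whisper

model = whisper.load_model("medium")       # smaller models are faster but less accurate
result = model.transcribe("clip.mp4")      # accepts audio or video files via ffmpeg

print(result["text"])                      # the full transcript as one string
for seg in result["segments"]:             # timestamped segments for indexing or captions
    print(f'{seg["start"]:.1f}-{seg["end"]:.1f}: {seg["text"]}')
```

The timestamped segments are what make ASR output useful beyond a flat transcript, since they can be indexed for search or lined up against the media for correction.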

Creating Tools to Process Audiovisual Collections

So what have the computational linguists at Brandeis developed? The tools so far include applications to:

  • recognize if there are bars and tones in media and provide timecodes
  • recognize if there is a slate present in the video and if so, take an image and perform optical character recognition
  • recognize if there are chyrons on the lower third of the video and perform OCR on that text
  • recognize if there are rolling credits at the end of a video and OCR that text
  • create a transcript using Whisper
  • perform named entity recognition on corrected transcripts
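Each of these is packaged as a CLAMS app, and CLAMS apps exchange their input and output as MMIF (Multi-Media Interchange Format) documents, typically with each app running as a small HTTP service. The following is a rough sketch of what calling one such service might look like; the URL, port, and file name are assumptions for illustration, not the configuration of any actual Brandeis app.

```python
# A hypothetical sketch of posting a MMIF document to a locally running CLAMS app.
# The URL/port and input file are placeholders, not a real deployment.
import requests

with open("input.mmif", "r") as f:       # MMIF document that points at the media file
    mmif_in = f.read()

resp = requests.post(
    "http://localhost:5000/",            # CLAMS apps are typically served over HTTP
    data=mmif_in,
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()

mmif_out = resp.json()                   # the app appends a new "view" of annotations
print(len(mmif_out.get("views", [])), "views in the returned MMIF")
```

Because every app reads and writes the same MMIF format, the output of one tool (say, slate detection) can be passed straight into the next (say, OCR) to build up a pipeline.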

From CLAMS to Chowda: Managing the Workflow

One deliverable for us at GBH is taking the tools developed at Brandeis and integrating them into our infrastructure. Our software developers are currently building a web app we've named Chowda. When it's done, archivists will be able to run collections through the various CLAMS tools to create metadata, which we will then either add to our PBCore records or incorporate from the MMIF output as part of the digital object.
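To show what incorporating metadata from a MMIF file could involve, here is a rough sketch of harvesting text annotations from a MMIF document with plain JSON parsing. The field names follow MMIF's general views/annotations layout, but the exact types and properties vary by app, so treat this as illustrative rather than as Chowda's actual code.

```python
# An illustrative sketch of pulling text values out of a MMIF file's views.
# MMIF is JSON; each view holds the annotations produced by one CLAMS app.
# Property names follow the general MMIF layout, not Chowda's real logic.
import json

with open("output.mmif") as f:
    mmif = json.load(f)

for view in mmif.get("views", []):
    app = view.get("metadata", {}).get("app", "unknown app")
    for ann in view.get("annotations", []):
        text = ann.get("properties", {}).get("text")
        if isinstance(text, dict):           # text documents store {"@value": "..."}
            text = text.get("@value")
        if text:
            print(f"{app}: {text}")
```

Text pulled out this way (OCR'd slates, chyrons, credits, transcript passages) is the raw material that would flow into PBCore records or sit alongside the digital object.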

A Few Challenges

One issue with machine-generated metadata is quality control. It's not going to be as good as something cataloged by a human, so how do you define "good enough"? Another issue is offensive language. Human-created metadata might contain a typo, but an application like Kaldi, which works from phonemes, can erroneously transcribe words that are offensive in general or offensive given the context of the media. We've addressed some of these issues with our transcript-correction application, FixIt+. Another possibility we've discussed is not displaying the machine-generated metadata at all and instead using it as a jumping-off point for cataloging; it may be faster to correct flawed metadata than to create metadata from scratch.
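Purely as an illustration of one possible mitigation (not something described above as part of our workflow), machine-generated text could be checked against a curated review list so that a person looks at flagged segments before they are displayed or indexed. The terms and segments below are placeholders.

```python
# An illustrative sketch of flagging machine-generated transcript segments
# that contain terms on a review list, so a human checks them first.
# The review terms and example segments are placeholders, not real data.
REVIEW_TERMS = {"placeholder_term_a", "placeholder_term_b"}

def needs_review(segment_text: str) -> bool:
    """Return True if any review-list term appears in the segment."""
    words = {w.strip(".,!?").lower() for w in segment_text.split()}
    return bool(words & REVIEW_TERMS)

segments = ["an example transcript segment", "another example segment"]
flagged = [s for s in segments if needs_review(s)]
print(f"{len(flagged)} of {len(segments)} segments need human review")
```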

What are your thoughts on using AI to create metadata for libraries and archives?

Tim Lepczyk

Writer, Technologist, and Librarian.
