July 05, 2023

Dionysus: Visualizing TV & Movies with ML

Part 1

A few weeks ago I started on a project to visualize the TV shows and movies I watch. I'm interested in what data I can extract from the raw media and was heavily inspired by the Network of Thrones.

In fact, my interest in this project dates back to my first semester of college, when I attempted to build a similar network graph of Nolan's Batman trilogy. My implementation then was quite naive: a simple parser that took a film script and identified character lines by the all-caps convention for character names. That method required a lot of manual cleaning after the fact, and the script always differs a bit from the actual film transcript. So with fresh eyes I wanted to approach the problem again this year, using SOTA ML models to pull data from video media more accurately. A few of my goals:

  1. Extract data from raw audio/video with minimal pre-processing
  2. Identify and associate transcripts with individual speakers with 90%+ accuracy
  3. Structure transcript data and append additional data (sentiment, entities, vocabulary)
  4. EDA & JS visualizations over structured transcript data
  5. Pair with a retrieval-augmented generation model for naturally querying data
  6. Chat with a character
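
For reference, the naive all-caps parser mentioned above can be sketched roughly like this. It is a reconstruction of the idea, not my original code: the regexes, the `parse_script` name, and the sample script are all made up for illustration, and it assumes a screenplay layout where a character cue sits alone on its own line above the dialogue.

```python
import re
from collections import defaultdict

# Assumption: a character cue is a line made only of capitals, spaces,
# periods, apostrophes, or hyphens, optionally followed by a
# parenthetical such as "(V.O.)".
CUE = re.compile(r"^([A-Z][A-Z .'\-]+?)(?:\s*\(.*\))?$")
# Assumption: scene headings start with INT. or EXT.
SCENE = re.compile(r"^(INT\.|EXT\.)")


def parse_script(text):
    """Group dialogue lines under the most recent all-caps character cue."""
    lines_by_character = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or SCENE.match(line):
            # A blank line or a scene heading ends the current speech.
            current = None
            continue
        cue = CUE.match(line)
        if cue:
            current = cue.group(1).strip()
            continue
        if current:
            lines_by_character[current].append(line)
    return dict(lines_by_character)
```

Even on clean input this breaks easily: a shouted one-word line like "STOP" looks exactly like a character cue, which is the kind of thing that needed manual cleanup.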

For some background on the progress I've already made, here's a thread walking through my first attempt:

Since then I've done some more exploring. While Whisper + Pyannote is mostly OK out of the box, its error rate is roughly 30%, so I need to either supply a better speaker embedding to the model or do some fine-tuning. Fortunately, I found this Hugging Face space, which attempts a similar task and appears to perform better than my initial attempts. It uses the same pipeline I started pursuing (OpenAI Whisper -> Pyannote).
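
To make the Whisper -> Pyannote hand-off more concrete, here is a minimal sketch of the merge step: each transcribed segment gets the speaker whose diarization turns overlap it the most. This is an illustration under assumptions, not the code from the space or from either library. The `(start, end, text)` and `(start, end, speaker)` tuples are a simplified stand-in for the real outputs, and `assign_speakers` is a hypothetical helper name.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with its most-overlapping speaker.

    segments: list of (start, end, text) tuples from the transcriber
    turns:    list of (start, end, speaker) tuples from the diarizer
    Returns a list of (speaker, text) tuples, in segment order.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        totals = {}
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        # Segments that no diarization turn covers get a placeholder label.
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        labeled.append((speaker, text))
    return labeled
```

With this kind of merge, a wrong label can come from either side: a diarization turn attributed to the wrong speaker, or a transcript segment that straddles two turns and gets assigned wholesale to one of them.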

As a side note, if you're using a box with Amazon Linux 2 you might need to upgrade your Python version. 3.8 is sufficient for my purposes.

After trying a few different media sources, the pipeline doesn't seem to work well, even on a relatively clean podcast with more than 2 people. Likely the speaker embedding model (SpeechBrain's ECAPA-TDNN model pretrained on VoxCeleb) is not good enough for novel speakers. While the Whisper transcriptions are fantastic, the speaker diarization is not. I'll need to do some more research on how to improve this. I'm also suspicious of the sklearn clustering algorithm, so I'll need to drum up some visualizations to see what's going on.
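
Before building any plots, one quick sanity check on the clustering is to compare how similar embeddings are within a cluster versus across clusters. The sketch below is a dependency-free illustration of that idea, not something I've run against the real pipeline. It assumes each segment's speaker embedding is a plain list of floats with a cluster label alongside it, and `cluster_separation` is a hypothetical helper name. In practice numpy, or something like sklearn's `silhouette_score`, would be the more natural tool.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def cluster_separation(embeddings, labels):
    """Return (mean within-cluster, mean between-cluster) cosine similarity.

    A healthy clustering should show within-cluster similarity clearly
    above between-cluster similarity.
    """
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)
    return _mean(within), _mean(between)
```

If the two numbers come out close together, that points at the embeddings not separating the speakers in the first place. If the embeddings look well separated under hand-checked labels but not under the predicted ones, the clustering step is the more likely culprit.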

Anyway, I made almost no progress, but it's time to call it a night. Look forward to part 2.