Current Status
The first pass at the ad detector is done. “First pass” might even be too generous a phrase - the code exists and is producing results, but they're nowhere close to good enough.
The current list of the top ads, along with the number of ad placements and total air time, is here. It obviously needs more work as it's still picking up lots of general topics, but there are some bright spots. “Chase Aeroplane Mastercard”, “VPN”, “publicsq.com”, and “Juniper” are valid ad segments. It's just a matter of doing the nitty gritty work of separating ad content from general content.
The Nitty Gritty
The detector roughly works by combining the outputs of three different AI models (transcription, speaker detection, and content tagging) to produce chunks of text that are tied to specific times in a show along with the rough topic of conversation at that time.
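As a rough sketch of that merge step, you can think of it as overlapping three sets of time ranges. The field names and the naive midpoint lookup here are assumptions for illustration, not the actual pipeline's schema:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    start: float    # seconds into the episode
    end: float
    speaker: str
    text: str
    topic: str

def merge_outputs(transcript, speakers, topics):
    """Attach a speaker label and topic tag to each transcript segment
    by looking up which speaker/topic range contains its midpoint.
    Inputs are lists of (start, end, value) tuples."""
    chunks = []
    for start, end, text in transcript:
        mid = (start + end) / 2
        speaker = next((s for s0, s1, s in speakers if s0 <= mid < s1), "unknown")
        topic = next((t for t0, t1, t in topics if t0 <= mid < t1), "unknown")
        chunks.append(Chunk(start, end, speaker, text, topic))
    return chunks
```

A real pipeline would handle segments that straddle range boundaries, but the midpoint trick is enough to get timed, topic-tagged chunks out the other side.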
From here it's the “simple” task of determining what makes a segment an advertisement vs just any other topic of conversation. Simple, yeah?
The basic heuristic is that ad copy tends not to be unique - ads will have a phrase or two that they like to stick to, and the same ad will run several times across a few episodes or even shows. So, it's just a matter of finding the least unique phrases and marking those as ads, right?
Well, the most common sentences in the dataset are:
Yeah.
Okay.
Right.
Not really ad material, yeah? It seems easy enough though: just filter out the broadly common phrases and stick to the sentences that are locally unique to a show - something akin to the small-signal analysis of differential amplifiers - and this does get us one step closer.
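That filtering step can be sketched in a few lines: drop any sentence that shows up in most shows (the “Yeah.” / “Okay.” filler) and keep the ones that repeat within a single show. The thresholds here are invented knobs, not tuned values from the real detector:

```python
from collections import Counter

def candidate_phrases(sentences_by_show, max_show_fraction=0.5, local_min=3):
    """sentences_by_show maps show name -> list of transcript sentences.
    Returns, per show, the sentences that repeat locally (local_min+ times)
    but are NOT ubiquitous filler appearing in most shows."""
    presence = Counter()  # in how many shows does each sentence appear at all?
    for sents in sentences_by_show.values():
        for s in set(sents):
            presence[s] += 1
    n_shows = len(sentences_by_show)
    candidates = {}
    for show, sents in sentences_by_show.items():
        counts = Counter(sents)
        candidates[show] = [
            s for s, c in counts.items()
            if c >= local_min and presence[s] / n_shows < max_show_fraction
        ]
    return candidates
```

This is the small-signal idea from the paragraph above: subtract the dataset-wide “DC bias” of filler sentences and look at what's left per show.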
The next hurdle is catch phrases. I've spent more time than I'd like having to read through Steve Bannon's content because this dumbass has a ton of catch phrases like “we're going to medieval on these people” that are repeated many times but unique to only his show. So now there have to be other checks to make sure that a non-unique phrase is actually selling something.
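One such check might combine two signals: ads tend to travel between shows (the same copy runs everywhere), and they tend to contain ad-style wording like a URL or promo code. Both the marker list and the two-show threshold below are assumptions made up for this sketch:

```python
import re

# Assumed ad-style markers; the real detector's signals may differ.
AD_MARKERS = re.compile(
    r"promo code|use code|\.com\b|percent off|sponsored by",
    re.IGNORECASE,
)

def score_phrase(phrase, shows_seen_in):
    """A repeated phrase is strong ad evidence only if it travels between
    shows or reads like ad copy; a one-show repeat with neither signal is
    probably just a host's catch phrase."""
    cross_show = len(shows_seen_in) >= 2          # same copy ran on 2+ shows
    ad_wording = bool(AD_MARKERS.search(phrase))  # URL, promo code, etc.
    if cross_show and ad_wording:
        return "ad"
    if cross_show or ad_wording:
        return "maybe-ad"
    return "not-ad"
```

Under this scheme a Bannon catch phrase repeated fifty times on one show scores “not-ad”, while the same promo code read on three different shows scores “ad”.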
And on and on and on. A perfect system doesn’t exist, but there is always a less bad one.
At a certain point I'll probably find a knock-off ChatGPT/LLM that I can toss the ad possibility at and ask “is this an ad, and for what?”. Definitely a more expensive route than a rules engine, but it could act as the arbiter for the cases where a rules-based approach is unsure of the result.
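The control flow for that hybrid would be cheap-first: only pay for an LLM call when the rules engine lands in an ambiguous band. The score thresholds and the `llm_arbiter` interface below are placeholders invented for the sketch, not a real API:

```python
def classify_segment(segment, rules_score, llm_arbiter):
    """rules_score(segment) returns 0.0 (not ad) .. 1.0 (ad).
    llm_arbiter(prompt) is a stand-in for whatever model gets wired in;
    it is only invoked when the cheap rules are unsure."""
    score = rules_score(segment)
    if score >= 0.8:
        return ("ad", "rules")
    if score <= 0.2:
        return ("not-ad", "rules")
    # ambiguous band: defer to the expensive arbiter
    verdict = llm_arbiter(f"Is this an ad, and for what?\n\n{segment}")
    return (verdict, "llm")
```

The appeal is cost control: the LLM only sees the slice of segments where the rules engine can't commit either way.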
Coming Up
The next week has three main priorities. First, generally massaging the ad detector to be more accurate. This is the dumb, boring work of “data munging” - basically the engineer's equivalent of manually sorting and ranking topics and continually tweaking the detector to get closer and closer to something that's 90% accurate.
Second, making a web UI for all this data. This will make everything immensely easier to share, talk about, and iterate on. It also breaks up the boring work of the data munging.
Finally, there's the long-running task of trying to build up the backlog of audio content. My goal is to have the top 200 shows indexed by the end of the week, but the transcription step remains the main bottleneck of this whole operation.
With that, I’m off to keep hacking. If you have any questions about the data or just want to play around with some, let me know. We’re all friends here.