Current Status
After some yak-shaving I’ve now stood up an ingestion pipeline that’s able to pull in arbitrary audio content, transcribe it, and run other AI analysis on the resulting text. Currently that analysis is content extraction and sentiment analysis, meaning I can tell you what people are talking about and how they feel about it without you having to sit down and listen to every single minute of a show.
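For the concretely minded, here’s a minimal sketch of what a pipeline like this can look like. The post doesn’t name the actual stack, so the specific tools here (the open-source whisper package for transcription, an OpenAI-style chat API for the analysis step) and all the function and model names are my illustrative assumptions, not necessarily what this pipeline uses:

```python
import whisper            # assumption: the open-source openai-whisper package
from openai import OpenAI  # assumption: an OpenAI-style chat API for analysis

client = OpenAI()

def transcribe(audio_path: str) -> str:
    """Speech-to-text: turn one episode's audio file into a transcript."""
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"]

def analyze(transcript: str) -> str:
    """Ask an LLM for the topics discussed and the sentiment toward each."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "List the main topics in this transcript and the "
                        "speakers' sentiment toward each, one per line."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

def ingest(audio_path: str) -> str:
    """One pass of the pipeline: audio -> transcript -> topics + sentiment."""
    return analyze(transcribe(audio_path))
```

A real version would also need to chunk long transcripts to fit the model’s context window and store the structured output somewhere queryable, but the shape of the work is just those two stages chained together.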
It’s a slow process, but I’ve ingested ~120 of the top podcasts, generally of the News/Politics variety, along with all of their episodes published this year. That’s roughly 3,000 episodes, all transcribed and with their content extracted into a usable form.
For the curious, here’s a quick look at the top 100 topics across these shows since the start of the year.
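A ranking like this needs nothing fancier than a counter over the per-episode topic lists. A sketch, with the messy part (normalizing free-text topic strings into shared labels) hand-waved and the sample data obviously made up:

```python
from collections import Counter

# episode_topics: one list of extracted topic labels per episode,
# e.g. the parsed output of the analysis step above.
episode_topics = [
    ["economy", "election", "immigration"],
    ["election", "foreign policy"],
    # ... ~3,000 episodes in the real dataset
]

# Count each topic at most once per episode, so a show that mentions
# a topic fifty times in one episode doesn't dominate the ranking.
counts = Counter(topic for topics in episode_topics for topic in set(topics))

for topic, n in counts.most_common(100):
    print(f"{n:5d}  {topic}")
```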
From one side, to the other
On one hand, there is nothing surprising in this list. Anyone who’s even marginally keeping up with the news could tell you the top 10 topics without fail. On the other, it also means this system passes the lowest level of a sniff test - in a fully automated way, it spat out exactly what you’d expect.
This is the first step when it comes to “bridging the gap”.
The “gap” I refer to here is the space of tasks (manual and intellectual) that a human still feels the need to do themselves, because the machines they have can’t perform those tasks consistently and accurately enough for the human to feel comfortable offloading the work. AI is certainly cool and the new hotness - but if it’s used in ways humans can’t feel confident handing work to, it will remain stuck in the world of hype cycles and Balenciaga memes.
The pipeline has now cleared that lowest level of trust - and bridging the rest of the gap requires continued cajoling and tweaking to find the trustworthy (and untrustworthy!) parts of the machine.
The work continues
The upcoming week is going to involve a much deeper exploration of this data, and hopefully some interesting insights into how shows differ. The datapoint I still have my eyes set on is determining which ads run against which content, and that’s a little ways out. I can already do this pretty well for insert ads - those that are recorded once and re-used across many episodes. Sponsored ads, where the hosts themselves take a break and talk about some product, are harder to pin down, but now that I have a more robust dataset to play with, I should at least be able to evaluate what’s missing in order to make that distinction.
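That “recorded once, re-used” property suggests one plausible way to catch insert ads (not necessarily the method used here): look for transcript spans that repeat verbatim across otherwise unrelated episodes. A sketch, assuming transcripts come in as a dict of episode id to text:

```python
from collections import defaultdict

def shingles(transcript: str, size: int = 12):
    """Yield overlapping word n-grams ("shingles") from a transcript."""
    words = transcript.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])

def find_insert_ads(transcripts: dict[str, str], min_episodes: int = 3):
    """Flag shingles that recur verbatim across many episodes.

    An insert ad is spliced into many shows from one recording, so its
    text shows up word-for-word in multiple transcripts. A host-read
    sponsorship is re-phrased every time, so it won't match this way -
    which is exactly why that case is the harder one.
    """
    seen = defaultdict(set)  # shingle -> set of episode ids it appears in
    for episode_id, text in transcripts.items():
        for sh in set(shingles(text)):
            seen[sh].add(episode_id)
    return {sh: eps for sh, eps in seen.items() if len(eps) >= min_episodes}
```

At ~3,000 episodes you’d hash the shingles (or use MinHash) rather than keep the raw strings in memory, and then merge adjacent flagged shingles back into full ad spans, but the core idea is just duplicate detection.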
If you have any questions about the data or just want to play around with some, let me know. We’re all friends here.