After a week of learning about LLMs, it's now time to actually find a use case that is constrained enough to offer some sort of value. I earlier hinted that LLMs would be useful for determining which content inside a podcast transcript was an advertisement or sponsor read, as opposed to all of the other text.
With some cajoling I now have a system that can reliably do this, which speaks to both the power and the limitations of LLMs. This builds off of the work that I mentioned in this earlier post.
Locked in place
In an ideal world I’d be able to just provide a human-readable transcript of a podcast (of any length, any language) and the LLM would return the data in a computer-usable format. That is, we want a machine that behaves like this:
Input:
... long podcast transcript ...
Output:
Ads:
Topic: North Carolina, Text: ... text of the ad
Topic: Indeed, Text: ... text of the ad
ChatGPT is magic, right? Why not just ask for this and bing-bang-boom, you get the answers you want?
Well, there are two limitations. First, ChatGPT has a limit on the amount of text you can actually send it - and it's far lower than the length of a podcast transcript. Second, ChatGPT charges for each word1 that you send to and receive from their API. Podcast transcripts generally have more non-ad content than ads, so brute-forcing it this way ends up wasting a significant amount of money.2
More subtly, more text fed into the system also means more places for things to go wrong - more chances for LLM hallucination. What we want to do is to constrain the system such that we can actually use ChatGPT within its limits while also being confident in the results that are being returned.
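To make the cost concern concrete, here's a back-of-the-envelope sketch using the rough 4-characters-per-token ratio from the footnote. The price per 1,000 tokens below is a placeholder for illustration, not a real quote:

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count using the rough ~4 characters per token
    heuristic. Real tokenizers give exact counts; this is just for
    ballparking."""
    return len(text) // 4


def estimate_cost(text: str, usd_per_1k_tokens: float = 0.002) -> float:
    """Approximate input cost. The default price is a placeholder."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens


# A ~40,000-character transcript is ~10,000 tokens of input,
# most of which is non-ad content we'd be paying to scan.
transcript = "word " * 8000
print(estimate_tokens(transcript))  # → 10000
```

The asymmetry is the point: nearly all of those tokens are ordinary conversation, so most of the spend goes toward text we already suspect isn't an ad.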
A smaller problem
With previous ad-detection efforts, I was able to build a system that was pretty good at detecting when specific text looked like an ad. Generally, this was content that used phrases that were not unique - since advertisers generally have specific ad-copy that gets reused.
The limitation is that it would also pick up segments of text that were non-unique but not advertisements. Opening lines, closing lines, and catch phrases all got picked up by this heuristic, and separating this type of text from the true ads proved very difficult with a rule-based approach.
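To illustrate why that heuristic over-collects, here's a toy sketch of the "non-unique text" idea - flagging segments that repeat verbatim across episodes. The function and sample data are hypothetical, not the actual siev implementation:

```python
from collections import Counter


def repeated_segments(episodes: list[list[str]], min_episodes: int = 2) -> set[str]:
    """Return segments that appear verbatim in at least `min_episodes`
    different episodes. Reused ad copy gets flagged - but so do intros,
    outros, and catch phrases, which is the problem described above."""
    counts: Counter[str] = Counter()
    for segments in episodes:
        for seg in set(segments):  # count each segment once per episode
            counts[seg] += 1
    return {seg for seg, n in counts.items() if n >= min_episodes}


episodes = [
    ["Welcome back to the show.",
     "This episode is brought to you by Indeed.",
     "Today we discuss the news."],
    ["Welcome back to the show.",
     "This episode is brought to you by Indeed.",
     "Our guest today is a reporter."],
]
# Both the ad AND the intro line are flagged - hence the LLM pass.
print(repeated_segments(episodes))
```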
And this ends up being a perfect use case for LLMs - rather than trying to build an ever-more-complex rules-based approach to determining if something is an ad, let's just find the content that looks like an ad and use the LLM to select what we're looking for. That is, we now have a machine like this:
Input:
... long podcast transcript ...
Prompt for the LLM:
Extract the topic and text of advertisements from a bulleted list of possible advertisements from a podcast transcript
* This podcast is brought to you by the state of North Carolina. ...
* This episode is brought to you by Indeed. ...
* That's all for today. Thursday, May 18th. ...
Output:
Ads:
Topic: North Carolina, Text: ... text of the ad
Topic: Indeed, Text: ... text of the ad
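Turning that reply back into a computer-usable structure is then a small parsing step. A sketch, assuming the LLM follows the one-ad-per-line `Topic: ..., Text: ...` format the prompt asks for (the parser is illustrative, not the actual code):

```python
import re


def parse_ads(llm_output: str) -> list[dict]:
    """Parse lines like 'Topic: Indeed, Text: ...' from the LLM reply
    into a list of dicts. Assumes one ad per line in the format the
    prompt requested; anything else is ignored."""
    ads = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*Topic:\s*(.+?),\s*Text:\s*(.+)", line)
        if match:
            ads.append({"topic": match.group(1), "text": match.group(2)})
    return ads


reply = """Ads:
Topic: North Carolina, Text: This podcast is brought to you by the state of North Carolina.
Topic: Indeed, Text: This episode is brought to you by Indeed."""
print(parse_ads(reply))
```

In practice you'd also want to handle the LLM drifting from the requested format - malformed lines here simply fall through unparsed.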
And this works - it works really well. Rather than having to stand up a series of complex weights and tests to determine whether something is an ad, what it's about, and what its exact text is, the LLM is able to bridge that gap. Moreover, because the actual problem space is constrained to just the text that might be an advertisement, the number of ways the system can provide a wrong answer is much smaller than if we dumped the entire transcript into the LLM.
Generally speaking, this is where LLMs probably have the most immediate and direct use. They're able to take a human-usable format (e.g. free-form text), convert it into a computer-usable format (e.g. JSON), and do so from human-usable instructions. The LLM interface here isn't code that I've written, but text instructions fed into the machine to generate something that's useful.
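That "interface is text" point is worth underlining: the whole integration can be as small as assembling a prompt string from the candidate segments. A minimal sketch - the wording mirrors the prompt shown above, and the function name is made up:

```python
def build_ad_prompt(candidates: list[str]) -> str:
    """Assemble the instruction plus candidate segments as a bulleted
    list. This string - not code - is the interface to the LLM."""
    instruction = (
        "Extract the topic and text of advertisements from a bulleted "
        "list of possible advertisements from a podcast transcript\n\n"
    )
    return instruction + "\n".join(f"* {seg}" for seg in candidates)


prompt = build_ad_prompt([
    "This podcast is brought to you by the state of North Carolina. ...",
    "That's all for today. Thursday, May 18th. ...",
])
print(prompt)
```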
My previous attempts at solving this problem took 2 weeks and produced middling results. This effort took 3 hours and produced something I'm confident in - it's a real game changer in terms of extracting information from unstructured content.
There’s still more to build, and this upcoming week will focus on getting this new system wired up and running. There’s also a pile of not-interesting-to-talk-about cleanup work in the code base that will make future development easier. The current goal is to make siev easier to navigate and generally more useful for exploring what’s actually happening in the podcast-news space.
Technically it’s “tokens”, not “words”, where a token is roughly 4 characters. I’m being sloppy on purpose, as jargon can be more correct but tends to reduce the actual meaning of what you’re trying to convey.
For now at least, I’d expect the cost of these systems to drop over time and for the number of possible tokens to be fed in to increase.