I love when people care about their subtitles. Not enough people on YouTube do it, and it’s wildly important from a basic accessibility perspective. What’s more, tons of people (myself included) prefer to watch movies with the subtitles on. And while the process of transcription itself can be tedious, you can have a lot of metatextual fun with authoring subtitles for dramatic effect, in particular with descriptive subtitles. So I wanna talk a little bit about a tool I use in DaVinci Resolve called StoryToolkitAI, which not only simplifies the process but actually has some rudimentary translation services built into it.
What are SRTs?
Before we get to that, though, we gotta talk about subtitle formats. SRTs (aka SubRip Subtitle files) are one of the most common subtitle formats out there. An SRT is a simple text file of dialogue and timecodes that can be easily understood by YouTube, VLC, and more. There are tons of other formats, including one that supports advanced formatting, color, and positioning and is mainly used by anime and Japanese TV fansubbers (shoutout to the appropriately named Advanced SubStation Alpha, or .ass format), but SRT files are easy to deal with and understood by programs like DaVinci Resolve.
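To give a sense of how simple the format really is, here is a minimal Python sketch that reads SRT text into a list of cues. Each cue is a running number, a timecode line in the form HH:MM:SS,mmm --> HH:MM:SS,mmm, one or more lines of text, and a blank line. The names Cue, to_seconds, parse_srt, and SAMPLE are made up for illustration, and this skips the edge cases a real subtitle library would handle.

```python
import re
from dataclasses import dataclass

# Matches an SRT timecode such as 00:01:02,500 (hours:minutes:seconds,millis).
TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def to_seconds(timecode: str) -> float:
    """Convert an SRT timecode string into seconds."""
    match = TIMECODE.fullmatch(timecode.strip())
    if match is None:
        raise ValueError(f"not an SRT timecode: {timecode!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Split SRT text into cues; blocks are separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip anything that is not number + timecodes + text
        start, end = lines[1].split("-->")
        text = "\n".join(lines[2:])
        cues.append(Cue(int(lines[0]), to_seconds(start), to_seconds(end), text))
    return cues


# A hypothetical two-cue file, including a descriptive subtitle.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
[dramatic music swells]

2
00:00:04,000 --> 00:00:06,250
I love when people care
about their subtitles.
"""

for cue in parse_srt(SAMPLE):
    print(cue.index, cue.start, cue.end, repr(cue.text))
```

That is more or less the whole format, which is a big part of why so many players and editors can read it.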
Premiere vs. Resolve
As a general rule, I am not a fan of the video editing software Adobe Premiere. I think it’s a broken, expensive piece of software with a very annoying and predatory subscription model that is (almost) trumped feature-for-feature by DaVinci Resolve at this point. The main feature that initially sold me on Resolve and got me to move my whole workflow over was actually how efficient and easy it was to author subtitles in it, versus Premiere, which was a nightmare and crashed all the time.
Since then, Premiere has rapidly improved at doing one thing: transcribing interviews. Even if I still don’t like using and especially paying for the product, Adobe has put some real time into improving it here. Hands down, the most interesting work being done on Premiere has to do with its cloud features involving transcription. It’s a joy to use, and when it works, it works. You are going to have to babysit it a lot of the time, and any result will obviously need another pass with an editor, but it cuts the work of timing these things by a massive margin. Adobe is also iterating on it in really interesting ways, in particular “text-based editing,” even though I find the actual process of dealing with subtitles far less enjoyable than in DaVinci. DaVinci’s subtitle editor has just been consistently far more intuitive, less laggy, laid out better, and way more flexible than Premiere’s.
Adding a very handy feature
So currently, there is a transcription-shaped hole in Resolve’s feature set vs. Premiere. In the past, I filled it with a Python-powered transcriber called pyTranscriber, which runs your audio through the Google Speech Recognition API. Luckily, there is now a better solution in the form of Whisper, a project by OpenAI. We have talked before about Whisper for transcription. Since then, a few people have applied the code to multiple projects and added several UI frontends. The most recent and interesting one is StoryToolkitAI.
StoryToolkitAI is not a piece of software that is built by Blackmagic Design. It is a GitHub project by developer Octimot that runs on OpenAI’s Whisper and Python and uses Resolve’s API. As a result, it’s a bit finicky to install. I personally was having real trouble installing until I checked the issues page on the repo, realized I had conflicting versions of Python installed, uninstalled, and reinstalled the correct versions and got it to work.
In order to get it to run, you need to make sure Resolve is running with scripting on and then install and open the software. It will run some setup, install its dependencies, and then be up and running. From there, StoryToolkitAI will need to export a rough version of your timeline in Resolve to a folder of your choosing, where it will use Whisper to transcribe it with one of the many language models available. Once that is done, you can look at and search your transcript, have that transcript sync up to your timeline in Resolve, drop in the SRT file, and more.
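The last step of that pipeline, turning timed transcript segments into an SRT file, can be sketched in a few lines of Python. This is not StoryToolkitAI's actual code. It assumes segments shaped like the ones Whisper's Python package returns (dictionaries with "start" and "end" in seconds plus a "text" string), and the function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds into the SRT timecode form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from a list of {"start", "end", "text"} dictionaries."""
    blocks = []
    index = 1
    for segment in segments:
        text = segment["text"].strip()
        if not text:
            continue  # skip empty segments so cue numbering stays continuous
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
        index += 1
    # Each block already ends in a newline; joining on "\n" adds the blank line.
    return "\n".join(blocks)


# Hypothetical segments standing in for real transcription output.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " Welcome back to the channel."},
]
print(segments_to_srt(example_segments))
```

If you have the openai-whisper package installed, something like whisper.load_model("base").transcribe("rough_cut.wav")["segments"] should give you a list in this shape, where rough_cut.wav is a placeholder file name. Treat that as a starting point to check against Whisper's own documentation rather than a recipe.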
StoryToolkitAI has two huge advantages for me: it is free, and it runs locally, meaning you do not need to pay Adobe or use their servers or their machine learning software they call Adobe Sensei. I will admit that Adobe’s product is currently smoother and slicker to work with, but for something offered for free on a GitHub repository, StoryToolkitAI runs very well. In my tests, StoryToolkitAI does a pretty solid job at figuring out speaker timing, transcribing, recognizing proper nouns, and placing those subtitles at the correct moment, although there are almost always some errors. In particular, you do need to babysit when the clip begins and ends, as sometimes the subtitle will hang longer than it should. I found that it has difficulty with multiple speakers, crosstalk, and, on occasion, gets thrown off by background noises and long silences. You will always need to clean up, which is thankfully a joy to do in Resolve, but as a first draft, it does great.
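If you want help spotting the subtitles that hang too long before you start cleaning up, a small script can at least point you at the suspects. The sketch below is a hypothetical helper, not a feature of StoryToolkitAI or Resolve. It assumes cues are (start, end, text) tuples in seconds, sorted by start time, and the six-second limit is an arbitrary number you would tune to taste.

```python
def find_hanging_cues(cues, max_duration=6.0):
    """Return the positions of cues that look like they stay up too long.

    A cue is flagged when it lasts longer than max_duration seconds or
    when it is still on screen after the next cue has already started.
    """
    flagged = []
    for position, (start, end, _text) in enumerate(cues):
        too_long = (end - start) > max_duration
        has_next = position + 1 < len(cues)
        overlaps_next = has_next and end > cues[position + 1][0]
        if too_long or overlaps_next:
            flagged.append(position)
    return flagged


# Hypothetical cues: the first lingers for 12 seconds, the second overlaps the third.
example_cues = [
    (0.0, 12.0, "So today we're talking about subtitles."),
    (13.0, 15.0, "Specifically, SRT files."),
    (14.5, 16.0, "They're everywhere."),
]
print(find_hanging_cues(example_cues))
```

This only flags cues. It does not fix them, because deciding where a line should actually end is still a job for a person watching the footage.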
On top of that, StoryToolkitAI also has the ability to take that transcription, search it, and turn individual portions of the transcript into markers. This means you can search and notate the timeline based on times when a speaker mentions a specific word or topic, a very handy feature to have that works comparably to Adobe’s toolset. Even outside of Adobe, comparable services like Trint are going to cost you way more, although reliability is a factor at this point.
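The idea behind search-to-markers is simple enough to sketch in Python. This is an illustration of the concept and not how StoryToolkitAI implements it. It assumes transcript segments are dictionaries with a "start" time in seconds and a "text" string, that the timeline starts at frame zero, and that the frame rate is 24 fps unless you say otherwise. The function name is hypothetical.

```python
def find_mentions(segments, keyword, fps=24.0):
    """Return a marker (frame number plus note) for each segment that
    mentions the keyword, ignoring upper and lower case."""
    needle = keyword.casefold()
    markers = []
    for segment in segments:
        if needle in segment["text"].casefold():
            markers.append(
                {
                    # Convert the segment's start time to the nearest frame.
                    "frame": round(segment["start"] * fps),
                    "note": segment["text"].strip(),
                }
            )
    return markers


# Hypothetical transcript segments.
example_segments = [
    {"start": 1.0, "end": 2.0, "text": " We love Yakuza."},
    {"start": 5.5, "end": 7.0, "text": " Nothing to see here."},
    {"start": 10.0, "end": 12.0, "text": " yakuza Kiwami is great."},
]
for marker in find_mentions(example_segments, "yakuza"):
    print(marker["frame"], marker["note"])
```

Inside Resolve, frame numbers like these are what you would hand to the scripting API to place markers on the timeline. That part is left out here because it needs a running copy of Resolve and a timeline whose starting timecode you know.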
StoryToolkitAI also has one other feature worth noting: translation.
It knows Japanese. Sort of. Well, nouns and verbs, at least
It’s important to have sober expectations when it comes to what AI currently can and should do. I constantly see people overselling and overhyping AI, which is not only annoying but does a disservice to what is actually possible with the technology. What’s more, I think many current pitches for AI are lazy, lack the basic intent of a human hand, and aspire to a tremendously bleak future.
Translation is a very nuanced process — many would say an art — that requires a person to ensure it’s done correctly. Machine learning is getting much better for sure, but the results you get can vary wildly from model to model, so you need a human being to make sure your results are accurate. The same goes for any English subtitles involving descriptive text. Machine learning cannot properly understand what is going on in a scene just by listening. With that said, StoryToolkitAI seems like a decent tool for assisting in timing and translating subs, depending on the language.
I first noticed this while attempting to transcribe a mostly English-language timeline that included footage from the game Yakuza Kiwami. StoryToolkitAI not only flagged that the speaker was speaking Japanese, but it also took a stab at translating it, and it got it right. I also tried running a few already-translated clips through it, and it seemed smart enough to get many of the basic nouns and verbs correct, minus the context. Does it match the nuance and robustness of a real translation? Absolutely not, especially not for a language like Japanese, where context is vital. But I could see it saving a seasoned translator hours of work timing subtitles.
As a tool for first drafting subtitles and adding a little accessibility to your videos, I can’t recommend StoryToolkitAI enough. It’s a little wonky and rough around the edges, installation is a little tricky, and it does not have the finesse of Premiere’s transcriber, but that is to be expected. I also don’t have to give Adobe money, and it’s only a matter of time before something like this is added to Resolve either way. StoryToolkitAI’s developers also say they’re adding new features down the line, like integration with other AI and machine learning tools, and I would love to see a tool that allows you to custom-select a specific language model for transcription. And as far as machine learning goes, Whisper and competing models are only getting more robust. As a translator, though? It’s fun, and for small things, it’s pretty useful, but you should get someone who isn’t a machine.