This video walks through practical strategies to improve the accuracy of auto-generated captions – saving time, cutting costs, and creating a better experience for all viewers.
Why captions matter
- About 85% of Netflix viewers use captions, and studies of other platforms show that more than half of users do too.
- Viewers expect captions to be available and accurate.
- Inaccurate captions hurt user experience and can lead to legal risks.
- Captions are legally required for both live and recorded videos.
What makes good captions
- Accurate wording, punctuation, and speaker identification.
- Speaker names or relevant identifiers should be included during changes.
- Recorded captions should appear 1–2 lines at a time for readability.
- Recorded captions should include full punctuation and correct spelling.
- Speaker changes must be clearly indicated, especially when multiple speakers are present or not visible on screen.
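The line-count and speaker-identification points above can be sketched in a few lines of Python. This is only an illustration: the 37-character line width, the bracketed speaker label, and the `build_cues` helper are assumptions made for this example and do not come from the video.

```python
import textwrap

MAX_CHARS_PER_LINE = 37  # assumed width; the video does not give a character limit
MAX_LINES_PER_CUE = 2    # the "1-2 lines at a time" guideline for recorded captions


def build_cues(turns):
    """Turn (speaker, text) pairs into caption cues of at most two lines.

    The speaker label is placed at the start of each turn, which is
    where the speaker change happens.
    """
    cues = []
    for speaker, text in turns:
        lines = textwrap.wrap(f"[{speaker}] {text}", width=MAX_CHARS_PER_LINE)
        # Group the wrapped lines into cues of one or two lines each.
        for i in range(0, len(lines), MAX_LINES_PER_CUE):
            cues.append("\n".join(lines[i:i + MAX_LINES_PER_CUE]))
    return cues


if __name__ == "__main__":
    panel = [
        ("Host", "Sally, what does ACME Company do?"),
        ("Sally", "At ACME Company, we do XYZ."),
    ]
    for cue in build_cues(panel):
        print(cue)
        print()
```

Whether the label should be a name, a company, or a product depends on what is most relevant to the audience, as the transcript explains.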
How to speak for better auto-captions
- Avoid filler words like “um”, “uh”, and “like” – they clutter the captions.
- Speak clearly and separate words to help software recognize them.
- Maintain practical pacing – 120 to 150 words per minute is ideal.
- Use tapping techniques off-screen to stay on rhythm and pace.
- Pause briefly at commas, longer between sentences and list items, and longest between paragraphs.
- Avoid slang, which often doesn’t caption or translate well.
- Announce acronyms clearly and define them the first time.
- Add variety through tone, not volume – sudden loudness is disruptive.
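The pacing and pause guidance above comes down to simple arithmetic, sketched below. The 120 to 150 words-per-minute range and the one, two, and three tap pauses at about 120 taps per minute are taken from the transcript; the function names and the sample recording length are hypothetical.

```python
# Taps per pause type, as described in the transcript.
TAPS = {"comma": 1, "sentence": 2, "paragraph": 3}


def words_per_minute(word_count, duration_seconds):
    """Average speaking pace over a recording."""
    return word_count / (duration_seconds / 60)


def in_target_range(wpm, low=120, high=150):
    """True when the pace falls inside the 120-150 words-per-minute range."""
    return low <= wpm <= high


def pause_seconds(kind, taps_per_minute=120):
    """Length of a pause in seconds, given the tapping tempo."""
    return TAPS[kind] * 60 / taps_per_minute


if __name__ == "__main__":
    # Hypothetical recording: 675 words spoken in 5 minutes.
    pace = words_per_minute(675, 300)
    print(pace, in_target_range(pace))
    for kind in TAPS:
        print(kind, pause_seconds(kind))
```

At 120 taps per minute each tap lasts half a second, so the three pause lengths work out to 0.5, 1.0, and 1.5 seconds.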
Caption editing tools
- YouTube’s auto-captions are often inaccurate and lack punctuation.
- Tools like Happy Scribe, Kapwing, and Descript help you edit and export accurate captions (e.g., SRT files).
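For readers who have not opened an SRT file before, the sketch below writes one by hand. It is not how Happy Scribe, Kapwing, or Descript produce their exports; it only shows the commonly used SubRip layout of a cue number, a `start --> end` timestamp line, the caption text, and a blank line. The helper names and sample cues are made up for illustration.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Joining with a newline leaves one blank line between cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [
        (0.0, 2.5, "Filler words are your foes."),
        (2.5, 5.0, "Speak clearly and\nseparate your words."),
    ]
    print(to_srt(sample))
```

A file in this layout is what gets uploaded alongside the video as its caption track.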
Final thoughts
Higher accuracy saves hours of cleanup and enhances accessibility for all viewers.
Small adjustments to your speaking and workflow can dramatically improve caption quality.
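To put the cleanup savings in numbers, here is a rough calculation using the two accuracy figures quoted in the transcript, about 80% for typical AI tools and 98% with the speaking tips applied. The 1,000-word script is a hypothetical example, and the sketch assumes accuracy is measured per word.

```python
def words_to_fix(word_count, accuracy):
    """Estimated number of words needing manual correction.

    Assumes accuracy is the fraction of words captioned correctly.
    """
    return round(word_count * (1 - accuracy))


if __name__ == "__main__":
    script_length = 1000  # hypothetical script length in words
    for accuracy in (0.80, 0.98):
        print(f"{accuracy:.0%} accurate: {words_to_fix(script_length, accuracy)} words to fix")
```

Under these assumptions, a 1,000-word script goes from about 200 corrections to about 20.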
Transcript
A recent study showed that 85% of Netflix viewers are using closed captions. Studies on other platforms have also shown that more than half of users are using closed captions. That means that your visitors not only expect captions, they expect good captions.
If you are failing to bring those captions, you are failing your users. Unfortunately, captions can be time-consuming, and most AI tools are only hitting about 80% accuracy. That means the captions can end up costing a bit of money.
But there is good news. By making some changes to how you speak, you can actually improve your caption accuracy a lot. I frequently hit 98% accuracy on my auto captions. That means instead of 2 in 10 words having a mistake, it’s 2 in 100.
This saves a lot of manual cleanup later. I’m Gen Harris from the Easy A11y Guide, and in this video, we’re going to be talking about speaker tips for better captions, saving you both time and money, and improving your user experiences for all of your viewers.
Before we get into these specific speaker tips, just a few notes about captions. Captions are a legal requirement, and they are required for both live videos and recorded media.
Captions need to be accurate or they are of very little use to your visitors. For live video, captions will normally appear one or two words at a time, keeping about a sentence’s worth of information on the screen at once. This is normally two to four lines.
I’ve got an example of live captions running on the screen right now so that you can see what they are supposed to look like. Also, if you watch any news broadcast, you’ll see how live captions look. For a recorded video, it’s expected to have one to two lines of caption content available at a time.
This is sometimes a whole sentence or it may just be parts of a sentence. And punctuation is absolutely essential. When doing captions, you need to announce speaker changes, especially if the speaker is not visible on the screen or there are multiple speakers and it may be confusing as to which one is speaking.
This is especially important when you have a panel discussion with several different people. There are people with poor vision, people with poor attention to detail, and people who may stick your video off to the side and look at other content while listening to your audio.
It’s important for them to know who is currently speaking. When identifying the speaker, you want to talk about what is relevant. That could be the person’s name, or it could be the person’s company or their specific product that they’re talking about. You want to identify this when changing speakers.
For example, if I were the host interviewing a panelist, I would say, “Hey, Sally, at ACME Company, what do you...” That helps people identify the speaker or the company, depending on which is most relevant to the situation.
If I were a guest on a panel, let’s say I was Sally from ACME Company, I would probably reply to an open question with, “At ACME Company, we do XYZ.” That introduces the company, which is most likely what’s relevant. Now, let’s get into the actual speaking tips for better captions.
It’s important to think about where captions come from, especially these automatically generated ones. They come from huge training data sets. That means a lot of professional speakers reading out content.
So let’s try and make our speaking match what those professional readers are doing. Filler words are your foes. You want to omit “uh”, “um”, “er”, and all of those other filler words from your vocabulary.
They are not part of normal speech, and we want to drop them, as they just create a lot of noise and clutter in captions. Speak clearly and separate your words. You want to enunciate your individual words, and you also want to keep them separate from each other.
Some people have a tendency to blur their words together, and that makes it a lot more difficult for caption software to separate out the individual words.
Practical pacing. Most professional speakers will speak at around 120 to 150 words per minute. That’s something that people are normally very comfortable listening to. They don’t need to speed it up or slow it down. If they do prefer it faster or slower, they can adjust your speed. But this is what is expected by most captioning software.
It’s important if you are reading out content to keep your pace the same. A lot of people have a habit of increasing their pace when they are reading out content. Typically, the read out content is absolutely essential for the listeners to hear. So make sure that you keep your pace consistent.
Taps for Timing. I will frequently tap my fingers together somewhere off screen.
That helps me keep my pacing. That also helps me time out my pauses. You’ll notice that a lot of musicians will tap their foot while they’re singing or playing their instruments. That keeps their timing consistent.
A lot of professional speakers will adopt some small tapping system to help keep them on pace until they can deliver really consistently.
Pauses are your pals. Pauses are so important for comprehension. You want to have a short pause or one tap when there’s a comma. You want to have a longer pause or two taps between sentences or between list items, if you are reading out different items on a list, and you want to have a still longer pause between paragraphs or about three taps.
If you’re tapping at a speed of about 120 beats per minute, that will work out really nicely.
Stop the slang. It can be so easy for many of us to fall into slang habits. Unfortunately, slang translates terribly. Broadly, we really want to reduce the amount of slang, as many of our listeners may not be familiar with the context. Speaking of context, people are constantly switching context all over the internet.
Announce your acronyms. SQL could mean a lot of different things to a lot of different people. If someone has a developer background, they may have just been working with MySQL queries. That’s what they’re going to think of when you say SQL.
Whereas if you’re talking to a salesperson, they probably think it’s a sales qualified lead. Make sure to announce your acronyms the first time you use them. Practice variety without volume change. You can put a lot of emphasis behind words without getting a lot of volume change.
People really dislike big changes in volume as suddenly they’re trying to play with their computer settings, or what was a normal comfortable volume to not disturb others suddenly becomes disruptive. If you just woke up someone’s sleeping baby, they hate you right now.
Finally, when you’re getting your captions ready, use dedicated caption software. YouTube is actually quite poor on its automatic captions. They do not use much punctuation or capitalization, and they can be very difficult to read.
This may be acceptable for your YouTube live captions, but it’s definitely not acceptable for your recorded captions. Personally, I use the software Happy Scribe as it keeps a database of my preferred dictionary words.
There are also a lot of video tools like Kapwing or Descript, which will allow you to create a transcript alongside your video editing. You can then export this transcript as an SRT file to upload for your captions.
To wrap up our speaker tips, filler words are your foes. Eliminate them. Speak clearly and separate your words. Practical pacing, 120 to 150 words per minute. Taps for timing. Just like musicians tap to keep their rhythm, you want to tap to keep your pacing.
Pauses are your pals. Pauses are a natural part of speech, and they help caption software to understand where different types of punctuation go. They also improve your listeners’ comprehension. Stop the slang. Slang words translate between languages and regions very poorly. Try to avoid them as much as possible.
Announce acronyms on their first use. People are constantly context switching on the internet. Make sure that you are specific about what your acronyms mean in your context. Variety without volume change. You can put a lot of inflection and emphasis on your words without big volume changes. People really dislike large volume changes.
Thanks so much for watching this video. If you found it helpful, please subscribe and share the video with your friends.
For more tips and information on accessibility, visit easya11yguide.com, and I hope to see you in the next video.
