On this page we'll look at ways of transcribing spoken content.
We'll consider the different options available, including not transcribing at all, and we'll look at the media formats we should be using. We'll explore options for live captions, as well as for automatically and manually generating and editing subtitles and transcripts after the fact. And we'll see how best to share a video with subtitles.
Subtitles (or captions — we'll use the terms interchangeably in this guide) are blocks of transcribed text that appear at the bottom of a video. Whether you're deaf, struggling with an accent, watching the video in a distracting environment, or just don't have the sound on, subtitles allow you to read the speech and sounds of the video.
Subtitles might be 'burnt in' to the video: part of the actual video image and impossible to remove. This is particularly common in videos for social media, where video players don't support 'closed' captioning (more on which in the next paragraph). You'll also often see it in television news reports or documentaries, where a speaker's words are being translated for the audience.
Generally preferable to burnt-in subtitles are 'closed captions' — captions that are overlaid onto the video, and can be turned on or off. These captions are stored in a separate file from the video itself. They're supported online in more-elaborate media players such as those found on YouTube or Google Drive, and in the University's Panopto lecture captures. The optional subtitles you get on television or on a DVD are closed captions.
A transcript is a written account of material originally presented in another medium (for this Guide's purposes, usually speech). A subtitle file will be a form of transcript, or it could be a stand-alone text document (for instance, a transcript of a speech as an alternative to watching it). If you're conducting research you might be transcribing an interview. In such cases you might not need the same level of accuracy or coverage as would be used for subtitling.
Creating a written transcript of a spoken text can be useful (as we'll discuss below), but it also requires a lot of hard work (as we'll also discuss below!). A question you should ask yourself very early on, ideally even before you go recording any audio or video, is...
Now perhaps you've got an audio or video file that you're intending to share, and you're wanting to add captions and/or an accompanying transcript in an effort to make the content more accessible. In that case you're going to need to fully transcribe what's being said. There are tools that can help you with this, and we'll take a look at them in the section on automatically generating subtitles and transcripts after the fact, but it's also going to require some work to get the transcript in a fit state for public consumption.
As well as for captioning and accessibility, there are other good reasons for creating a transcript, not least the simple detail that written text is not encumbered by the dimension of time in the same way that sound is. Yes, you can play back audio at double speed or whatever, but you're still only ever hearing a single instant. With written text you can see whole passages at once and take in the broader context; you can skim-read, search for key passages, and quickly re-read anything you didn't quite follow. It can therefore be a lot quicker to take in a read text than a heard one.
Transcripts are a lot like the Eventide H910 Harmonizer in terms of what they can do with the fabric of time...
The original audio, however, may contain details and meaning that are lost in a transcript. The way things are spoken; the rhythm of a sentence; the stress on particular words — all of this may imbue specific meaning that might be lost when written down...
"Oh yes, the geese at York are very friendly!"
Depending on how the above sentence is stressed, it could be a genuine remark about the geese at York being friendly, it could be a sarcastic remark about how the geese at York are very much un-friendly, or it could perhaps even be a euphemistic allusion to some sort of over-familiarity by the geese!
A simple transcript cannot capture the nuance that the original speech contains, and even 'stage direction'-style notes can only carry so much detail. A lot of meaning is lost when we aren't hearing the emotional tone (and even seeing the body-language) of the person speaking. Audio or video, then, while lacking some of the speed or convenience of reference of a written text, might give a more complete picture of what is being said. To work with a transcript alone is to be working with only a fraction of the original data. And modern technology makes working with audio/visual media a lot easier than it used to be.
For example, the qualitative data analysis tool NVivo can be used to analyse more than just text-based files: images, audio and video can also be marked up. If you have an audio file of, say, an interview, you can annotate it directly in NVivo, without the need for transcription:
So it may be that you don't need to make a transcript at all. But even if you feel you do...
What is it you're trying to do, and can what you're trying to do be achieved with summary notes rather than a full transcript? Is everything that is being said relevant to your needs, or can you get away with just transcribing certain passages?
How accurate does the transcription need to be? Can you work adequately with the usually rather messy and inaccurate output of an automatic transcription or does it need to be word-perfect?
The process of transcription can be very beneficial to your project: listening back to your recordings and typing them out will help further familiarise you with the content and may help focus your work. However, exact transcription can be a painstaking process, so transcribing materials entirely yourself (even with automated help) is going to be time-consuming. What you choose to transcribe is likely to depend on the nature of your recordings, as well as how many minutes (or hours) you've got to get through!
One of the things transcripts do with the fabric of time is eat great chunks of it.
If you're making a digital recording that you'll ultimately want to transcribe, think about what format you're recording to. Some sound formats and video formats are more universal than others (they'll work with more programs). Do a test recording to see what format you're getting, and change the options in the settings if necessary. Apple devices in particular have a habit of defaulting to Apple-specific formats that may be hard to use in other tools. Whatever format you're using, and whatever device you're using, test that you can get the file out of the device and into any other applications you'll want to use.
Zoom's default recording options save meetings as an impressively well-compressed .mp4 video format or .m4a audio format, both of which are easily used in other programs. If you record to the cloud, you can also download an automated transcript as a .vtt caption file (again, a common format). Whether you're recording research interviews or making a quick video for public use, Zoom is definitely one of the more convenient tools you could use.
If you're presenting live, be it in an online setting, or potentially even face to face, there are options available for automatic live subtitling. The quality of automatic subtitling is variable, but for those of your audience who need it, it will be better than no subtitles at all.
So long as you're logged into it with your York account, Zoom can provide live captions. These need to be enabled by the host during the meeting:
If you record your presentation to the cloud, captions will be included so long as 'Audio Transcript' has been enabled in your recording settings.
Captions are not retained when recording to your computer.
PowerPoint and Google Slides can both provide live subtitles. A quick-and-dirty way to record subtitles for a screencast is to piggyback on a PowerPoint presentation.
Be aware that, since this relies on having PowerPoint running beneath your other windows, you won't be able to show your desktop.
This approach is OK for quick, dirty subtitles, but they'll contain errors, there's a delay in them appearing (by which time the video has moved on), and they're burnt into the video. So really you should look at making some proper stand-alone subtitles...
If you're making a video that has sound, you'll need some subtitles to ensure that your video is as accessible as possible to its audience. If you're recording an interview for research purposes, you might also want to get a transcript of what was said.
The only foolproof way of getting such a transcript is to get a human to type it out, and even that might not be 100% successful (you might not be able to make out some of what is being said). But computers are getting better and better at understanding the human voice, and so may be able to do some of the work for you. Computers are pretty stupid, though, and will make all sorts of ridiculous mistakes. It can sometimes take as long to correct an automated transcript as it would to write it from scratch. Don't expect miracles!
Of course, if the transcript is just for your own use, transcription errors won't be such an issue; but if you're making something for an audience, it will require some tidying up.
With that in mind, here are some of the ways you might go about generating closed caption subtitles (and transcripts) for a pre-existing video recording:
Zoom can use the built-in Otter.ai speech recognition service to create caption files for any meeting that has been recorded to the cloud. The process is quicker if you had live captioning turned on during the meeting (otherwise it may take hours or even days), and, as with all automated systems, accuracy is variable. But the generated file is in a good format (.vtt) and is relatively easy to work with.
An easy way to auto-generate subtitles for a video is to upload your video to YouTube. YouTube will generate automatic captions which you can then edit.
Here's a video we uploaded to YouTube: you can use the subtitle options in the cog menu to switch between the automatic captions and a tidied-up version.
While you can make private videos in YouTube, you will probably want to avoid this method if working with sensitive content!
You can download your subtitle files from YouTube in the YouTube Studio Video Manager: open the video, go to the "Subtitles" section, hover over the subtitles you want to download and use the vertical dots menu (⋮):
Google Drive uses a similar playback engine to YouTube, so is a great alternative for hosting videos, especially if you're wanting to control access.
If you've already got a transcript of your video as a text file, and the video is stored in Google Drive, you can automatically generate subtitles simply by uploading the transcript.
The above options will give you caption files which are essentially time-coded text files, and it's relatively straightforward to convert a caption file into a simple transcript.
There are a number of other free subtitling and transcription tools online, but you'll need to consider questions of safety and security regarding their use, and you should certainly not use such tools with sensitive data such as research interviews.
A far safer way to get a transcript is simply to play the sound of whatever you're wanting to transcribe through your microphone and into a Google Doc, using Google Docs' built-in voice typing tool. It's an inelegant method, but it just about works!
If you've got the Office 365 version of Word, you could do the same thing using Home > Voice > Dictate. Again it will only work through the microphone, and the dictation will periodically time out.
Artificial intelligence isn't very intelligent, so the best way to get a transcript is to do it yourself. Or get another human to do it for you.
Transcription is a slow process. The quicker you can type (or even write, if you're just transcribing for your own use) the less you'll have to stop and rewind.
If you've got a lot of transcription to do, it could be worth investing in a foot-pedal controller: that way you can operate the playback with your feet, while typing with your hands. Alternatively, you could play back the recording at a slower speed to give you a chance to keep up. Most players can handle different speeds these days, including VLC Player:
Also check out if there are any keyboard shortcuts you can use to control the player.
Don't be afraid to use abbreviations for commonly used words or phrases as you transcribe. You can then use the search and replace tool in your text editor to expand your shorthand.
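If you're comfortable with a little scripting, the expansion step can even be automated. Here's a rough Python sketch — the abbreviations are invented for illustration, so substitute your own shorthand:

```python
import re

# Hypothetical shorthand you might use while typing, expanded afterwards
shorthand = {
    "UoY": "University of York",
    "ivr": "Interviewer",
    "rsp": "Respondent",
}

def expand(text, mapping):
    """Replace each whole-word abbreviation with its expansion."""
    for abbrev, full in mapping.items():
        text = re.sub(r"\b" + re.escape(abbrev) + r"\b", full, text)
    return text

print(expand("ivr: Are the UoY geese friendly?", shorthand))
# Interviewer: Are the University of York geese friendly?
```

Matching on whole words (the `\b` boundaries) stops an abbreviation from accidentally rewriting the middle of a longer word.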
If you're going down the route of getting another human to do your transcription for you, be that someone in your department or a paid-for transcription service, you'd need to have told any participants in your recording that data may be shared with a third party for the purposes of transcription. You should include this in any disclaimer signed by participants.
As with AI transcription, you will need to be mindful of privacy agreements if you get involved with any third party transcription service. The way these services store and access data may breach university security recommendations (and the law).
Oh, and bear in mind that if you get someone else to transcribe, you'll still have to proof-read it, because humans also make mistakes, especially with technical terms or jargon they don't understand.
Whether you've got an automatically generated set of subtitles, or even just a transcript, it's relatively straightforward to edit those captions, convert those captions to a transcript, or change that transcript into a subtitle file.
For more involved editing, it may be helpful to download the caption file to work on it outside of the host system before re-uploading. Options for this are usually found in the same location as the other captioning tools.
In Zoom, for instance, you can download generated captions from the relevant Recordings page (although there's no option to re-upload them afterwards).
In YouTube, subtitles can be downloaded via YouTube Studio Video Manager: open the video, go to the "Subtitles" section, hover over the subtitles you want to download and use the vertical dots menu (⋮):
In Google Drive you can download captions from "Manage caption tracks" on the right-click context menu for the file in Drive, or on the three-dots menu (⋮) in the file itself. This opens the Captions side-panel, where you can use the three-dots for the caption track you want to download:
There's a load of different file formats for caption files but they're usually pretty straightforward to write and edit. They're all basically text files and so can be edited in a basic text editor like Notepad on Windows or TextEdit on a Mac.
They tend to follow a similar sort of format.
For instance, .srt files (a particularly common type) look like this:
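Here's a minimal illustrative example (the wording is made up for this guide):

```
1
00:00:01,000 --> 00:00:04,250
Hello, and welcome to this
short video about geese.

2
00:00:04,500 --> 00:00:07,000
The geese at York are very friendly!
```

Note that .srt timecodes use a comma before the milliseconds.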
...while .sbv files (the type you get from a YouTube auto-transcript) look like this:
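Again, a minimal made-up example:

```
0:00:01.000,0:00:04.250
Hello, and welcome to this
short video about geese.

0:00:04.500,0:00:07.000
The geese at York are very friendly!
```

.sbv files drop the caption numbers: each caption is just a pair of comma-separated timecodes followed by the text.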
...and .vtt files (one of the more customisable options, and the type you get from Zoom) look like this:
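A minimal made-up example:

```
WEBVTT

1
00:00:01.000 --> 00:00:04.250
Hello, and welcome to this
short video about geese.

2
00:00:04.500 --> 00:00:07.000
The geese at York are very friendly!
```

A .vtt file must begin with the line WEBVTT, and its timecodes use a full stop before the milliseconds rather than .srt's comma.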
In all of the above examples, each caption is separated from the next by an empty line. Be sure to only have one empty line between captions.
.srt files require caption numbers and these will need to follow consecutively. Caption numbers can be used in .vtt files too, but they're much less fussy about them.
All of the examples above include a pair of timecodes: the time the caption comes on screen and the time it leaves. These times can be accurate to a thousandth of a second, but a quarter of a second is probably the most detail you'd ever need. Just remember that if the start time of one caption is before the finish time of the preceding caption, the two will superimpose, which you probably don't want! And pay attention to punctuation on the timecode line: if you get it wrong, your caption file won't work properly.
The above examples also include line-breaks within the captions, just to demonstrate that that's possible. But for online use you would generally avoid adding line-breaks, since people may be watching your video with custom display settings, and the lines will be wrapped automatically, so may end up looking weird. As a rule of thumb, captions should be one or two lines of under 32 characters per line. So if your caption is longer than 64 characters, it's probably too long and should be broken over two captions.
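If you've got a lot of captions to check, a few lines of Python can flag the overlong ones for you. This is just a rough sketch — the 64-character limit is the rule of thumb above, not a hard standard:

```python
def overlong_captions(captions, limit=64):
    """Return any captions whose text runs past the limit once
    line-breaks are collapsed (i.e. longer than two 32-character lines)."""
    return [text for text in captions if len(text.replace("\n", " ")) > limit]

captions = [
    "The geese at York are very friendly!",
    "This caption rambles on for rather a long time and really "
    "ought to be broken over two separate captions instead.",
]
print(overlong_captions(captions))  # flags only the second caption
```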
.vtt files (the type Zoom uses) have a lot of flexibility in terms of how they can be formatted, styled, and positioned, although not all of these features will be supported in every player. YouTube and Google Drive support some features of .vtt, including caption positioning, bold, italics, and split timings. Other applications such as VLC Player support other features such as speaker colours. The World Wide Web Consortium have a breakdown of all the features available in the .vtt format, and how to apply them.
Good captioning takes time, even if you have an automatic transcript to work with. And there are certain principles of best practice to consider (although there will often be reason not to follow them directly).
Keep in mind that the main purpose for subtitling is to provide audio information to people who cannot hear the sound. But also consider that reading text from a screen is different to hearing it spoken. While verbatim transcription is generally the best approach, it may be necessary to make some changes so that subtitles are more easily read and digested.
The World Wide Web Consortium's Web Accessibility Initiative has some useful (and relatively brief) guidance on transcription:
Another useful style guide is that used by the BBC for its television subtitles:
If you're wanting to provide a transcript as an alternative format, you probably don't want it cluttered with timecodes. Short of going through and manually deleting each one, you could make use of search and replace tools. For instance, with a Zoom .vtt file opened in Microsoft Word, you could use a 'wildcard' search (Home > Editing > Replace > More > Use wildcards) for the pattern...
...replacing with a space character or a new line (\n) according to taste.
You'll need to tweak the above pattern to match other file-types (or to get it working in different text editors), but the principle of searching for new lines and for patterns of characters in timecode formats should generally hold. Watch out for hidden space characters, and be sure to test with "Find" before committing to "Replace"!
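If you'd rather not wrestle with Word's wildcards, a short script can do the same job. Here's a rough Python sketch that strips the header, caption numbers and timecode lines out of a .vtt file, leaving just the spoken text (the sample captions are invented for illustration):

```python
import re

def vtt_to_transcript(vtt_text):
    """Strip the WEBVTT header, caption numbers and timecode lines,
    leaving just the spoken text joined into a single passage."""
    kept = []
    for line in vtt_text.splitlines():
        line = line.strip()
        if not line or line == "WEBVTT":
            continue  # skip blank lines and the file header
        if line.isdigit():
            continue  # skip caption numbers
        if re.match(r"\d{2}:\d{2}:\d{2}\.\d{3} --> ", line):
            continue  # skip timecode lines
        kept.append(line)
    return " ".join(kept)

sample = """WEBVTT

1
00:00:01.000 --> 00:00:04.250
Hello, and welcome.

2
00:00:04.500 --> 00:00:07.000
The geese at York are very friendly!"""

print(vtt_to_transcript(sample))
# Hello, and welcome. The geese at York are very friendly!
```

Other caption formats would need the timecode pattern adjusting (for instance, .srt uses commas before the milliseconds), but the approach is the same.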
If you create a transcript that you're wanting to analyse in NVivo, be sure to make use of Styles in your document. This will allow you to do some basic auto-coding (e.g. to identify the interviewer and interviewee). You can also sync a transcript to a video, though this might require some preparation if you're working from a conventional subtitle file.
There are a number of video hosting options for closed caption videos. The most obvious one is YouTube, but if you want more control over who has access to your video you could choose instead to upload it to Google Drive.
Another way of distributing a video with captions is to offer the files for download. VLC Media Player can display captions on a video if it finds a caption file with the same name as the video (not including the file extension) in the same folder location on the computer.
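For example (these filenames are hypothetical), VLC would automatically pick up the subtitles from a folder containing:

```
goose-documentary.mp4
goose-documentary.srt
```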
If you're sharing a video on social media, it may be necessary to burn in captions. At a basic level this can be done with VLC Media Player. For more attractive results you may have to use a video editor, although it's possible to do surprisingly good things with PowerPoint (video can be embedded into a slide, captions animated over the top, and the whole thing exported as a new video).