How to use sound representation
Sound representation in a transcript is meant to enable deaf and hard-of-hearing viewers (as well as viewers watching the talk without the sound on) to understand all the non-spoken auditory information that is necessary to comprehend the talk to the same degree that a hearing audience potentially would. Sound information should be enclosed in parentheses, with the first word starting with a capital letter (e.g. (Music)). There are generally two types of sound information used in TEDx transcripts: sound representation and speaker identification.
- 1 Phrasing the sound representations
- 2 Duration of the sound representation
- 3 Common sound representation
- 4 Uncommon sound representation
- 5 Speaker identification
Phrasing the sound representations
Note that sound representations are not like stage directions (in a script or play): they represent sounds, not the actions that cause the sounds. For example, the sound label should be (Gunshot), not (Dog fires gun).
The sound representations should also be short and have a simple grammatical structure - subject + active verb. For example, the sound representation should say: (Glasses clink), and NOT (Clinking of glasses) or (Glasses clinking).
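The formatting convention for sound labels (enclosed in parentheses, first word capitalized) can be checked mechanically. The following is a minimal sketch, not an official tool; the function name is hypothetical, and it only checks the form of the label, not whether the wording follows the subject + active verb pattern. It also assumes a script that distinguishes upper and lower case.

```python
def is_well_formed_label(label: str) -> bool:
    """Return True if the label is wrapped in parentheses and starts with a capital letter."""
    # A valid label needs at least "(", one character, and ")".
    if len(label) < 3 or not (label.startswith("(") and label.endswith(")")):
        return False
    inner = label[1:-1]
    # The first character must be uppercase, and no nested parentheses are allowed.
    return inner[0].isupper() and "(" not in inner and ")" not in inner


for candidate in ["(Glasses clink)", "(music)", "Gunshot"]:
    print(candidate, is_well_formed_label(candidate))
```

A check like this could flag labels such as (music) or Gunshot for manual review; whether the label is phrased well still has to be judged by the transcriber.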
Duration of the sound representation
The line-length and duration rules for subtitles with sound representation are generally the same as for any subtitle. However, even if there is a longer piece of music playing, or a longer bit of audience applause, don't make the sound representation stay on the screen for more than 3 seconds. It's enough to indicate that the music or applause has started.
If a video consists of more than one music piece and no talk at all, indicate the beginning and end of each piece, with (Music) and (Music ends), respectively, so that the viewer knows what is going on. Place the (Music ends) subtitle about 1.5-2 seconds BEFORE the end of the given piece of music (not after) and leave it onscreen until the music ends. Note that this only applies if there is a pause between the different pieces of music - if they flow into one another continuously, you do not need to indicate their boundaries.
Similarly, if the video combines some speaking from the stage, some music, then no music for a while, and then the music comes back, you need to signify again that the music has come back.
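The timing rules in this section can be sketched as two small calculations. This is an illustrative sketch only: the function names are hypothetical, times are assumed to be in seconds, and the 2-second lead for (Music ends) is an assumption taken from the upper end of the 1.5-2 second range given above.

```python
from typing import Tuple

MAX_CUE_SECONDS = 3.0  # from the guideline: a sound cue stays on screen at most 3 seconds
MUSIC_ENDS_LEAD = 2.0  # assumption: upper end of the 1.5-2 second range


def sound_cue_times(sound_start: float, sound_end: float) -> Tuple[float, float]:
    """Cue appears when the sound starts and is capped at MAX_CUE_SECONDS."""
    return sound_start, min(sound_end, sound_start + MAX_CUE_SECONDS)


def music_ends_times(music_start: float, music_end: float,
                     lead: float = MUSIC_ENDS_LEAD) -> Tuple[float, float]:
    """(Music ends) appears `lead` seconds before the music stops and stays until it stops."""
    # Never start the cue before the piece itself has started.
    start = max(music_start, music_end - lead)
    return start, music_end


print(sound_cue_times(10.0, 25.0))   # long applause: cue is capped
print(music_ends_times(0.0, 60.0))   # cue placed just before the piece ends
```

For a 15-second stretch of applause starting at 10 seconds, the cue would be shown from 10 to 13 seconds; for a 60-second piece of music, (Music ends) would be shown from 58 to 60 seconds.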
Common sound representation
The most common sound representations in TEDx transcripts are:
- (Laughter) - for laughter that fills any time in the talk where the speaker is not saying anything
- (Applause) - for applause (clapping) that fills any time in the talk where the speaker is not saying anything
- (Music) - for music that fills any time in the talk where the speaker is not saying anything (to identify a specific song, use (Music: "Name of song"))
Try to look at some other transcripts in your language to see what people have been using as the equivalents of these most common sounds, and use the most common one (ideally, there should be one sound label for one type of sound throughout the transcripts in one language, and not a few different versions, like (Applause) and (Clapping)).
Note: As much as possible, sound information reflecting audience sounds (e.g. (Applause)) should be placed in a separate subtitle.
Uncommon sound representation
In addition to (Laughter), (Music) and (Applause), you will sometimes encounter other sounds that also need to be represented in your transcript.
Important sound information can also include sounds made by the speaker, e.g. (Gasping), (Hooting). It is necessary to represent these sounds if they are not made accidentally, but instead constitute an important part of the talk, e.g.:
Do you know how I felt after talking the whole day? (Gasping) I had to take a day off after that.
These types of speaker sounds must also be represented in the transcript if they are later referred to in some way, even if the sound was produced accidentally (e.g. if the speaker clears her throat and says "I wish they gave us more water").
Sometimes, it may be important to indicate that the speaker is intentionally raising their voice or whispering. In such cases, use sound cues like (Screaming), (Shouting) or (Whispering). Do not use capitalization to indicate shouting (e.g. I AM SHOUTING!) or intonation (e.g. I'll stress THIS word in this sentence).
Note that (Screaming) or (Whispering) refers to the whole subtitle, and represents the way the whole subtitle was spoken. If the speaker screams or whispers incoherently (makes an exclamation), use (Screams) or (Whispers).
Indicating a change of language
If a speaker speaks in a language different from the main language of the talk, you should indicate the language but translate the text:
(Arabic) This is my idea.
There may be cases when the foreign language phrase was meant to be misunderstood by the audience. For example, the speaker may be quoting something she heard in a foreign language and originally did not understand, and then proceed to explain what the phrase meant a few minutes later. In this case, you should consider leaving the foreign phrase in the transcript.
You can reach out to other volunteers in the OTP community to help you identify parts of the talk in a language that you don't understand (for example, through the "I transcribe TEDx talks" or "I translate TEDTalks" Facebook groups, or by contacting one of the Language Coordinators for the given language, using this list to find them).
Indicating sentence stress/emphasis
Do not indicate sentence stress (the way a certain word is emphasized in a sentence) with capital letters ("This is NOT what I'm talking about") or italics.
There are sounds that are not an important part of the talk and elicit no visible reaction from the speaker or the audience (e.g. a shutter sound from somebody taking a picture in the audience), and so, they do not need to be represented in the transcript. The only exception to this rule is when a coincidental sound causes the speaker or the audience to react in a visible way. For example, if somebody in the audience drops a plastic bottle and the speaker jumps and then laughs, the sound of the bottle falling needs to be represented, in order to give the non-hearing viewers an idea of why the speaker reacted in this manner.
Speaker identification
Speaker changes need to be represented in the transcript. Additional speakers may appear if the speaker who began the talk is joined by another speaker on stage (e.g. for a question-and-answer session), or if video or audio material featuring spoken utterances is included in the talk. Speakers should be indicated by their full names and a colon the first time they appear, and by their initials (no periods) when they appear again in the same conversation. Consider this example:
Oh, you've got a question for me? Okay.
(Applause)
Chris Anderson: Thank you so much for that. You know, you once wrote, I like this quote, "If by some magic, autism had been eradicated from the face of the Earth, then men would still be socializing in front of a wood fire at the entrance to a cave."
Temple Grandin: Because who do you think made the first stone spears? The Asperger guy. (...)
CA: So, I wanted to ask you a couple other questions. (...) But if there is someone here who has an autistic child, or knows an autistic child and feels kind of cut off from them, what advice would you give them?
TG: Well, first of all, you've got to look at age. (...)
Source: Temple Grandin: The world needs all kinds of minds
If some time has passed since a given speaker was introduced, when they start speaking again, they need to be re-identified by their full name, not just the initials. For example, if a talk by speaker X features a short video with speaker Y, and the video is paused and then continued five minutes later into the talk, speaker Y must be identified again by their full name when they start speaking in the video again, because without access to sound information, a non-hearing viewer may not be able to tell that it is the same speaker as in the first part of the video.
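The labeling rule (full name on first appearance, initials afterwards, full name again after a longer gap) can be sketched as follows. This is only an illustration: the class and function names are hypothetical, and the 60-second threshold is an assumption, since the guideline says only that "some time" has passed. The initials helper simply takes the first letter of each space-separated name part, which will not suit every name.

```python
from typing import Dict

REINTRODUCE_AFTER = 60.0  # hypothetical threshold in seconds; the guideline gives no number


def initials(full_name: str) -> str:
    """Initials without periods, e.g. 'Chris Anderson' -> 'CA'."""
    return "".join(part[0].upper() for part in full_name.split())


class SpeakerLabeler:
    """Tracks when each speaker last started a turn and picks the label to show."""

    def __init__(self, reintroduce_after: float = REINTRODUCE_AFTER) -> None:
        self.reintroduce_after = reintroduce_after
        self._last_turn: Dict[str, float] = {}

    def label(self, full_name: str, start: float) -> str:
        last = self._last_turn.get(full_name)
        self._last_turn[full_name] = start
        # Full name on first appearance or after a long gap, initials otherwise.
        if last is None or start - last > self.reintroduce_after:
            return f"{full_name}:"
        return f"{initials(full_name)}:"


labeler = SpeakerLabeler()
for name, time in [("Chris Anderson", 0.0), ("Temple Grandin", 10.0),
                   ("Chris Anderson", 20.0), ("Temple Grandin", 400.0)]:
    print(labeler.label(name, time))
```

In practice the decision to re-identify a speaker is a judgment call about whether a non-hearing viewer could still tell who is speaking, so a fixed threshold like this can only serve as a rough reminder.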
Identifying off-camera voices
Any comment from off-camera also needs to be identified by the speaker's name. If the comment comes from the audience, it can be identified generically with just the word "Audience" used as a sound representation cue, i.e.:
(Audience) I want to add something!