Speech synthesis

March 5, 2021

Contents

2. Speech synthesis
3. Input type
4. Text-to-speech
5. Architecture of TTS systems
6. Text normalization
7. Grapheme-to-phoneme conversion
8. Grapheme-to-phoneme conversion
9. Syntactic (etc.) analysis
10. Architecture of TTS systems
11. Prosody modelling
12. Acoustic synthesis
13. Articulatory synthesis
14. Formant synthesis
15. Formant synthesis
16. Concatenative synthesis
17. Diphone synthesis
18. Diphone synthesis
19. Concatenative synthesis
20. Unit selection synthesis (USS)
21. Speech synthesis demo

Slide 2

Speech synthesis
What is the task?
Generating natural sounding speech on the fly, usually from text
What are the main difficulties?
What to say and how to say it
How is it approached?
Two main approaches, both with pros and cons
How good is it?
Excellent, almost unnoticeable at its best
How much better could it be?
Marginally

Slide 3

Input type
Concept-to-speech (CTS) vs text-to-speech (TTS)
In CTS, content of message is determined from internal representation, not by reading out text
E.g. database query system
No problem of text interpretation

Slide 4

Text-to-speech
What to say: text-to-phoneme conversion is not straightforward
Dr Smith lives on Marine Dr in Chicago IL. He got his PhD from MIT. He earns $70,000 p.a.
Have you read that book? No I’m still reading it. I live in Reading.
How to say it: not just choice of phonemes, but allophones, coarticulation effects, as well as prosodic features (pitch, loudness, length)
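As a minimal sketch of why "Dr" in the example above cannot be expanded by a plain lexicon look-up, the Python below uses a toy positional heuristic. The function name `expand_dr` and the rule itself are assumptions made for illustration only; a real system would need far more context than the neighbouring word.

```python
def expand_dr(text: str) -> str:
    """Expand the abbreviation 'Dr' using a toy context heuristic.

    Assumed rule (illustrative only):
      - 'Dr' after a capitalised word  -> 'Drive'  (Marine Dr)
      - otherwise, before a capitalised word -> 'Doctor' (Dr Smith)
      - otherwise the token is left unchanged
    """
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok == "Dr":
            prev_tok = tokens[i - 1] if i > 0 else ""
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
            if prev_tok[:1].isupper():
                tok = "Drive"
            elif next_tok[:1].isupper():
                tok = "Doctor"
        out.append(tok)
    return " ".join(out)


print(expand_dr("Dr Smith lives on Marine Dr in Chicago IL."))
```

The heuristic breaks easily (for example on a sentence that begins with a street name), which is the point of the slide: abbreviation expansion depends on context.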

Slide 5

Architecture of TTS systems
Text-to-phoneme module: Normalization, Grapheme-to-phoneme conversion, Prosodic modelling
Resources: Abbreviation lexicon, Exceptions lexicon, Orthographic rules, Grammar rules, Prosodic model
Phoneme-to-speech module: Acoustic synthesis (various methods)

Slide 6

Text normalization
Any text that has a special pronunciation should be stored in a lexicon
Abbreviations (Mr, Dr, Rd, St, Middx)
Acronyms (UN but UNESCO)
Special symbols (&, %)
Particular conventions (£5, $5 million, 12°C)
Numbers are especially difficult
1995 2001 1,995 ?236 3017 233 4488
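To illustrate why numbers are difficult, here is a small sketch that reads the same digits differently depending on their form: "1995" as a year, "1,995" as a quantity. The function names, the year range 1000 to 2099, and the British-style "and" are assumptions chosen for this example, not part of the slides.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n: int) -> str:
    """Spell out 0..9999 as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if low or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(low))
    return " ".join(parts)


def year(n: int) -> str:
    """Spell out 1000..2099 the way years are usually read."""
    high, low = divmod(n, 100)
    if 2000 <= n <= 2009:
        return cardinal(n)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def read_number(token: str) -> str:
    """Choose a reading from the written form of the token (toy heuristic)."""
    if "," in token:
        return cardinal(int(token.replace(",", "")))
    if len(token) == 4 and token.isdigit() and 1000 <= int(token) <= 2099:
        return year(int(token))
    return cardinal(int(token))


for tok in ["1995", "2001", "1,995"]:
    print(tok, "->", read_number(tok))
```

Telephone numbers, which are read digit by digit, would need yet another reading that this sketch does not cover.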

Slide 7

Grapheme-to-phoneme conversion
English spelling is complex but largely regular, other languages more (or less) so
Gross exceptions must be in lexicon
Lexicon or rules?
If look-up is quick, may as well store them
But you need rules anyway for unknown words
MANY words have multiple pronunciations
Free variation (eg controversy, either)
Conditioned variation (eg record, import, weak forms)
Genuine homographs
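The "lexicon or rules?" trade-off can be sketched as a look-up with a rule-based fallback. Everything in this example is a toy assumption: the two lexicon entries, the rule table, and the ARPAbet-like phoneme symbols are invented for illustration and cover only a handful of letters.

```python
# Exceptions lexicon: words whose pronunciation the rules would get wrong.
EXCEPTIONS = {
    "one": "w ah n",
    "two": "t uw",
}

# Letter-to-sound rules, tried in order, so digraphs come before single letters.
RULES = [
    ("sh", "sh"), ("ch", "ch"), ("th", "th"), ("ee", "iy"),
    ("a", "ae"), ("b", "b"), ("d", "d"), ("e", "eh"), ("i", "ih"),
    ("m", "m"), ("n", "n"), ("o", "aa"), ("p", "p"), ("s", "s"), ("t", "t"),
]


def letter_to_sound(word: str) -> str:
    """Apply the first matching rule at each position, left to right."""
    phones = []
    i = 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            i += 1  # no rule for this letter: skip it
    return " ".join(phones)


def pronounce(word: str) -> str:
    """Look the word up first; fall back to rules for unknown words."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    return letter_to_sound(key)


print(pronounce("one"))   # taken from the lexicon
print(pronounce("ship"))  # produced by the rules
```

Note that neither the lexicon nor the rules can choose between multiple pronunciations of the same spelling; that needs information from outside the word.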

Slide 8

Grapheme-to-phoneme conversion
Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean)
Much harder for others (English, French)
Especially if writing system is only partially alphabetic (Arabic, Urdu)
Or not alphabetic at all (Chinese, Japanese)

Slide 9

Syntactic (etc.) analysis
Homograph disambiguation requires syntactic analysis
He makes a record of everything they record.
I read a lot. What have you read recently?
Analysis also essential to determine appropriate prosodic features
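A minimal sketch of how syntactic information could drive homograph disambiguation, assuming the text has already been tagged by some analyser. The tag names (`NOUN`, `VERB`, `VERB_PRES`, `VERB_PAST`) and the table are hypothetical, and the ARPAbet-style transcriptions are illustrative.

```python
# (word, tag) -> pronunciation; tags are assumed to come from a prior
# syntactic analysis step that is not shown here.
HOMOGRAPHS = {
    ("record", "NOUN"): "R EH1 K ER0 D",
    ("record", "VERB"): "R IH0 K AO1 R D",
    ("read", "VERB_PRES"): "R IY1 D",
    ("read", "VERB_PAST"): "R EH1 D",
}


def pronounce_tagged(tagged_words):
    """Return one entry per word: a transcription for known homographs,
    or the lower-cased word itself when no entry applies."""
    result = []
    for word, tag in tagged_words:
        key = (word.lower(), tag)
        result.append(HOMOGRAPHS.get(key, word.lower()))
    return result


sentence = [("He", "PRON"), ("makes", "VERB"), ("a", "DET"),
            ("record", "NOUN"), ("of", "PREP"), ("everything", "PRON"),
            ("they", "PRON"), ("record", "VERB")]
print(pronounce_tagged(sentence))
```

The hard part, deciding the tags in the first place, is exactly the syntactic analysis the slide refers to.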

Slide 10

Architecture of TTS systems
Text-to-phoneme module: Normalization, Grapheme-to-phoneme conversion, Prosodic modelling
Resources: Abbreviation lexicon, Exceptions lexicon, Orthographic rules, Grammar rules, Prosodic model
Phoneme-to-speech module: Acoustic synthesis (various methods)

Slide 11

Prosody modelling
Pitch, length, loudness
Intonation (pitch)
Essential to avoid monotonous robot-like voice
Linked to basic syntax (e.g. statement vs question), but also to thematization (stress)
Pitch range is a sensitive issue
Rhythm (length)
Has to do with pace (natural tendency to slow down at end of utterance)
Also need to pause at appropriate place
Linked (with pitch and loudness) to stress
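The tendency to slow down at the end of an utterance can be sketched as a simple scaling of the last few segment durations. The function name, the number of affected segments, and the factor 1.3 are assumptions picked for illustration, not measured values.

```python
def apply_final_lengthening(durations_ms, n_final=2, factor=1.3):
    """Return a copy of the durations with the last n_final segments
    stretched by the given factor."""
    out = list(durations_ms)
    start = max(0, len(out) - n_final)
    for i in range(start, len(out)):
        out[i] = out[i] * factor
    return out


print(apply_final_lengthening([80, 120, 90, 110]))
```

A fuller prosodic model would also set pitch and loudness and insert pauses, all of which depend on the syntactic analysis.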

Slide 12

Acoustic synthesis
Alternative methods:
Articulatory synthesis
Formant synthesis
Concatenative synthesis
Unit selection synthesis

Slide 13

Articulatory synthesis
Simulation of physical processes of human articulation
Wolfgang von Kempelen (1734-1804) and others used bellows, reeds and tubes to construct mechanical speaking machines
Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.
Too much like hard work

Slide 14

Formant synthesis
Reproduce the relevant characteristics of the acoustic signal
In particular, amplitude and frequency of formants
But also other resonances and noise, eg for nasals, laterals, fricatives etc.
Values of acoustic parameters are derived by rule from phonetic transcription
Result is intelligible, but too “pure” and sounds synthetic
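A rough sketch of the idea, assuming a source-filter setup in which an impulse train (standing in for the voice source) is passed through a cascade of second-order digital resonators, one per formant. The formant frequencies and bandwidths below are approximate illustrative values for an "ah"-like vowel, and all names are hypothetical; this is not the method of any particular synthesiser.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator:
    y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Impulse train at pitch f0_hz filtered by one resonator per formant."""
    n = round(duration_s * sample_rate)
    period = int(sample_rate / f0_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# (frequency, bandwidth) pairs in Hz: illustrative values only.
samples = synth_vowel([(730, 90), (1090, 110), (2440, 170)])
print(len(samples), "samples generated")
```

Because every parameter is set by rule rather than taken from a recording, the output is perfectly regular, which is one reason such speech sounds synthetic.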

Slide 15

Formant synthesis
Demo:
In control panel select “Speech” icon
Type in your text and Preview voice
You may have a choice of voices

Slide 16

Concatenative synthesis
Concatenate segments of pre-recorded natural human speech
Requires database of previously recorded human speech covering all the possible segments to be synthesised
Segment might be phoneme, syllable, word, phrase, or any combination
Or, something else more clever ...

Slide 17

Diphone synthesis
Most important for natural sounding speech is to get the transitions right (allophonic variation, coarticulation effects)
These are found at the boundary between phoneme segments
“Diphones” are fragments of speech signal cutting across phoneme boundaries
If a language has P phones, then the number of diphones is ~P² (some combinations are impossible), e.g. 800 for Spanish, 1200 for French, 2500 for German

[Figure: segmentation example, m y n u m b er]
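As a small sketch, the segment sequence from the example above can be turned into diphones by pairing each segment with its neighbour, using "_" as an assumed silence marker at both ends. The function names are hypothetical.

```python
def to_diphones(phones):
    """Pair each segment with the next one, padding with silence ('_')."""
    padded = ["_"] + list(phones) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_diphone_count(num_phones):
    """Upper bound P squared; real inventories are smaller because some
    combinations never occur."""
    return num_phones ** 2


print(to_diphones(["m", "y", "n", "u", "m", "b", "er"]))
print(max_diphone_count(40))
```

Each diphone runs from the middle of one segment to the middle of the next, so the transition itself is stored inside the recorded unit rather than created at a join.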

Slide 18

Diphone synthesis
Most systems use diphones because they are
Manageable in number
Can be automatically extracted from recordings of human speech
Capture most inter-allophonic variants
But they do not capture all coarticulatory effects, so some systems include triphones, as well as fixed phrases and other larger units (= USS)

Slide 19

Concatenative synthesis
Input is phonemic representation + prosodic features
Diphone segments can be digitally manipulated for length, pitch and loudness
Segment boundaries need to be smoothed to avoid distortion
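One simple way to smooth a boundary is a linear crossfade over a short overlap, sketched below on plain lists of samples. This is an assumed, simplified stand-in for the smoothing real systems apply, and `crossfade_concat` is a hypothetical name.

```python
def crossfade_concat(a, b, overlap):
    """Join two sample sequences, blending the last `overlap` samples of
    `a` with the first `overlap` samples of `b` using linear weights."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter length")
    out = list(a[:-overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        out.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    out.extend(b[overlap:])
    return out


joined = crossfade_concat([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2)
print(joined)
```

The result is shorter than the two inputs combined by `overlap` samples, since the overlapping region is shared.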

Slide 20

Unit selection synthesis (USS)
Same idea as concatenative synthesis, but database contains bigger variety of “units”
Multiple examples of phonemes (under different prosodic conditions) are recorded
Selection of appropriate unit therefore becomes more complex, as there are in the database competing candidates for selection
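Choosing among competing candidates is commonly framed as minimising a target cost (how well a unit matches the requested prosody) plus a join cost (how well neighbouring units fit together). The sketch below assumes each unit is described by a single pitch value and uses dynamic programming over the candidates; the cost functions and names are simplifications made for illustration.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one candidate per position, minimising total cost.

    targets:    desired pitch per position
    candidates: for each position, a list of candidate pitch values
    Returns the index of the chosen candidate at each position.
    """
    # best[i][j] = (cheapest total cost ending in candidate j, backpointer)
    best = [[(abs(c - targets[0]), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            target_cost = abs(c - targets[i])
            cost, back = min(
                (best[i - 1][k][0] + join_weight * abs(c - p), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return path[::-1]


choice = select_units(
    targets=[100, 110, 120],
    candidates=[[100, 140], [90, 112], [121, 180]],
)
print(choice)
```

With more recorded examples per phoneme the search space grows, which is the added complexity the slide mentions.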
