Speech synthesis

Slide 2

Speech synthesis

What is the task?
Generating natural-sounding speech on the fly, usually from text
What are the main difficulties?
What to say and how to say it
How is it approached?
Two main approaches, both with pros and cons
How good is it?
Excellent: at its best, almost indistinguishable from human speech
How much better could it be?
Marginally

Slide 3

Input type

Concept-to-speech vs text-to-speech
In CTS, the content of the message is determined from an internal representation, not by reading out text
E.g. database query system
No problem of text interpretation

Slide 4

Text-to-speech

What to say: text-to-phoneme conversion is not straightforward
Dr Smith lives on Marine Dr in Chicago IL. He got his PhD from MIT. He earns $70,000 p.a.
Have you read that book? No, I’m still reading it. I live in Reading.
How to say it: not just choice of phonemes, but allophones, coarticulation effects, as well as prosodic features (pitch, loudness, length)

Slide 5

Architecture of TTS systems

[Diagram]
Text-to-phoneme module: Normalization, Grapheme-to-phoneme conversion, Prosodic modelling
Resources: Abbreviation lexicon, Exceptions lexicon, Orthographic rules, Grammar rules, Prosodic model
Phoneme-to-speech module: Acoustic synthesis (various methods)
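
The stages named on this slide can be read as a chain of functions, each adding information to a shared utterance record. The sketch below only shows that data flow. The `Utterance` class, the stage functions and the stage order (normalization, then grapheme-to-phoneme conversion, then prosodic modelling) are assumptions made for illustration, and every stage body is a deliberately trivial placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Utterance:
    """Hypothetical record passed from stage to stage."""
    text: str
    words: List[str] = field(default_factory=list)
    phonemes: List[str] = field(default_factory=list)
    prosody: Dict[str, str] = field(default_factory=dict)


def normalize(utt: Utterance) -> Utterance:
    # Placeholder: a real stage would expand abbreviations, numbers and symbols.
    utt.words = utt.text.lower().rstrip(".?!").split()
    return utt


def graphemes_to_phonemes(utt: Utterance) -> Utterance:
    # Placeholder: one "phoneme" per letter; real systems use a lexicon plus rules.
    utt.phonemes = [ch for word in utt.words for ch in word]
    return utt


def model_prosody(utt: Utterance) -> Utterance:
    # Placeholder: choose a contour from the final punctuation mark only.
    utt.prosody["contour"] = "rise" if utt.text.endswith("?") else "fall"
    return utt


# Assumed stage order for the text-to-phoneme module.
TEXT_TO_PHONEME: List[Callable[[Utterance], Utterance]] = [
    normalize,
    graphemes_to_phonemes,
    model_prosody,
]


def text_to_phoneme(text: str) -> Utterance:
    utt = Utterance(text)
    for stage in TEXT_TO_PHONEME:
        utt = stage(utt)
    return utt
```

The output of `text_to_phoneme` (phonemes plus prosodic features) is what the phoneme-to-speech module would take as input.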

Slide 6

Text normalization

Any text that has a special pronunciation should be stored in a lexicon
Abbreviations (Mr, Dr, Rd, St, Middx)
Acronyms (UN but UNESCO)
Special symbols (&, %)
Particular conventions (£5, $5 million, 12°C)
Numbers are especially difficult
1995 2001 1,995 ?236 3017 233 4488
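
A minimal sketch of lexicon-driven normalization, assuming whitespace-separated tokens. The lexicon entries, the function names, the year range and the "Dr" heuristic (doctor before a capitalised word, drive otherwise) are all illustrative assumptions, not a description of any particular system. It shows why numbers are hard: 1995 and 1,995 need different readings, and telephone-style digit strings are not handled here at all.

```python
import re

# Toy lexicons (assumed); a real system would need far larger ones.
ABBREVIATIONS = {"mr": "mister", "rd": "road", "st": "street", "middx": "middlesex"}
SYMBOLS = {"&": "and", "%": "percent"}

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def cardinal(n: int) -> str:
    """Spell out a non-negative integer (intended for n below one million)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} and {cardinal(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{cardinal(thousands)} thousand"
    if rest == 0:
        return head
    return f"{head} and {cardinal(rest)}" if rest < 100 else f"{head} {cardinal(rest)}"


def year(n: int) -> str:
    """Read a four-digit number as a year, e.g. 1995 as two pairs of digits."""
    century, rest = divmod(n, 100)
    if century % 10 == 0:           # e.g. 2001 is read as a cardinal
        return cardinal(n)
    if rest == 0:
        return f"{cardinal(century)} hundred"
    if rest < 10:
        return f"{cardinal(century)} oh {ONES[rest]}"
    return f"{cardinal(century)} {cardinal(rest)}"


def normalize_token(token: str, next_token: str) -> str:
    bare = token.rstrip(".")
    if bare == "Dr":
        # Crude assumption: "Dr Smith" is a title, "Marine Dr" is a street.
        return "doctor" if next_token[:1].isupper() else "drive"
    if bare.lower() in ABBREVIATIONS:
        return ABBREVIATIONS[bare.lower()]
    if token in SYMBOLS:
        return SYMBOLS[token]
    if re.fullmatch(r"[0-9]{1,3}(,[0-9]{3})+", bare):
        return cardinal(int(bare.replace(",", "")))
    if re.fullmatch(r"[0-9]+", bare):
        n = int(bare)
        # Assumed heuristic: four digits in this range are read as a year.
        if len(bare) == 4 and 1100 <= n <= 2099:
            return year(n)
        return cardinal(n)
    return token


def normalize(text: str) -> str:
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        out.append(normalize_token(tok, nxt))
    return " ".join(out)
```

The "Dr" rule fails as soon as a street name ends a sentence and the next sentence starts with a capital, which is one reason the slide on syntactic analysis matters.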

Slide 7

Grapheme-to-phoneme conversion

English spelling is complex but largely regular; other languages are more (or less) so
Gross exceptions must be in lexicon
Lexicon or rules?
If look-up is quick, may as well store them
But you need rules anyway for unknown words
MANY words have multiple pronunciations
Free variation (e.g. controversy, either)
Conditioned variation (e.g. record, import, weak forms)
Genuine homographs
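
The "lexicon or rules?" trade-off can be sketched as a lookup with a rule-based fallback. Everything below is a toy assumption: the exceptions lexicon has three entries, the letter-to-sound rules are a deliberately naive longest-match table, and the phoneme symbols are informal ARPAbet-like labels chosen for readability.

```python
# Toy exceptions lexicon (assumed): words the rules would get badly wrong.
EXCEPTIONS = {
    "yacht": ["y", "aa", "t"],
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
}

# Naive letter-to-sound rules (assumed). Digraphs come first so that the
# longest match wins; single letters follow.
RULES = [
    ("sh", ["sh"]), ("ch", ["ch"]), ("th", ["th"]), ("ph", ["f"]),
    ("ee", ["iy"]), ("oo", ["uw"]),
    ("a", ["ae"]), ("b", ["b"]), ("c", ["k"]), ("d", ["d"]), ("e", ["eh"]),
    ("f", ["f"]), ("g", ["g"]), ("h", ["hh"]), ("i", ["ih"]), ("j", ["jh"]),
    ("k", ["k"]), ("l", ["l"]), ("m", ["m"]), ("n", ["n"]), ("o", ["aa"]),
    ("p", ["p"]), ("q", ["k"]), ("r", ["r"]), ("s", ["s"]), ("t", ["t"]),
    ("u", ["ah"]), ("v", ["v"]), ("w", ["w"]), ("x", ["k", "s"]),
    ("y", ["y"]), ("z", ["z"]),
]


def rules_g2p(word: str) -> list:
    """Apply the first matching rule at each position, left to right."""
    phones, i = [], 0
    while i < len(word):
        for grapheme, output in RULES:
            if word.startswith(grapheme, i):
                phones.extend(output)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this character: skip it
    return phones


def g2p(word: str) -> list:
    """Look the word up first; fall back to rules for unknown words."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    return rules_g2p(word)
```

Words with several pronunciations need more than this: the lexicon would have to store alternatives, and something else has to choose between them.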

Slide 8

Grapheme-to-phoneme conversion

Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean)
Much harder for others (English, French)
Especially if writing system is only partially alphabetic (Arabic, Urdu)
Or not alphabetic at all (Chinese, Japanese)

Slide 9

Syntactic (etc.) analysis

Homograph disambiguation requires syntactic analysis
He makes a record of everything they record.
I read a lot. What have you read recently?
Analysis also essential to determine appropriate prosodic features
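
As a rough illustration of why context is needed, the sketch below picks a pronunciation for "record" and "read" from the neighbouring words. It is a stand-in for real syntactic analysis, not an example of it: the word lists, the two-word look-back window and the informal respellings are all assumptions, and "I read a lot" is genuinely ambiguous between present and past, so the default chosen here is arbitrary.

```python
DETERMINERS = {"a", "an", "the", "this", "that", "his", "her", "my", "our", "their"}
PERFECT_AUX = {"have", "has", "had"}

# Informal respellings (assumed notation): capitals mark the stressed syllable.
RECORD = {"NOUN": "REK-ord", "VERB": "ri-KORD"}


def disambiguate(sentence: str) -> list:
    """Return (word, pronunciation) pairs for the homographs in the sentence."""
    words = [w.strip(".,?!").lower() for w in sentence.split()]
    result = []
    for i, word in enumerate(words):
        prev = words[i - 1] if i >= 1 else ""
        if word == "record":
            # After a determiner we guess a noun, otherwise a verb.
            tag = "NOUN" if prev in DETERMINERS else "VERB"
            result.append((word, RECORD[tag]))
        elif word == "read":
            # A perfect auxiliary within the two preceding words suggests
            # the past participle ("red"); otherwise default to "reed".
            window = words[max(0, i - 2):i]
            is_participle = any(w in PERFECT_AUX for w in window)
            result.append((word, "red" if is_participle else "reed"))
    return result
```

For example, `disambiguate("He makes a record of everything they record.")` tags the first occurrence as the noun and the second as the verb.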

Slide 10

Architecture of TTS systems

[Diagram]
Text-to-phoneme module: Normalization, Grapheme-to-phoneme conversion, Prosodic modelling
Resources: Abbreviation lexicon, Exceptions lexicon, Orthographic rules, Grammar rules, Prosodic model
Phoneme-to-speech module: Acoustic synthesis (various methods)

Slide 11

Prosody modelling

Pitch, length, loudness
Intonation (pitch)
Essential to avoid monotonous robot-like voice
Linked to basic syntax (e.g. statement vs question), but also to thematization (stress)
Pitch range is a sensitive issue
Rhythm (length)
Has to do with pace (natural tendency to slow down at end of utterance)
Also need to pause at appropriate places
Linked (with pitch and loudness) to stress
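
A toy prosody model, assuming one pitch target and one duration per unit (say, per syllable). It encodes three of the tendencies listed above: pitch drifts downward through the utterance, the last unit is lengthened, and the final pitch falls for a statement or rises for a question. The function name and every number in it (base pitch, slope, scaling factors) are made-up illustrative values.

```python
def prosody_targets(n_units: int, is_question: bool,
                    base_f0: float = 120.0, base_dur: float = 0.08) -> list:
    """Return a list of (f0_hz, duration_s) targets, one per unit."""
    targets = []
    for i in range(n_units):
        # Relative position in the utterance, from 0.0 (start) to 1.0 (end).
        pos = i / (n_units - 1) if n_units > 1 else 1.0
        # Gradual downward drift: from 110% to 90% of the base pitch.
        f0 = base_f0 * (1.1 - 0.2 * pos)
        dur = base_dur
        if i == n_units - 1:
            dur *= 1.5  # slow down at the end of the utterance
            # Final rise for a question, final fall for a statement.
            f0 = base_f0 * (1.3 if is_question else 0.8)
        targets.append((round(f0, 1), round(dur, 3)))
    return targets
```

A model this simple ignores stress and thematization entirely, which is exactly where the syntactic analysis from the earlier slides would have to feed in.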

Slide 12

Acoustic synthesis

Alternative methods:
Articulatory synthesis
Formant synthesis
Concatenative synthesis
Unit selection synthesis

Slide 13

Articulatory synthesis

Simulation of physical processes of human articulation
Wolfgang von Kempelen (1734-1804) and others used bellows, reeds and tubes to construct mechanical speaking machines
Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.
Too much like hard work

Slide 14

Formant synthesis

Reproduce the relevant characteristics of the acoustic signal
In particular, amplitude and frequency of formants
But also other resonances and noise, e.g. for nasals, laterals, fricatives etc.
Values of acoustic parameters are derived by rule from phonetic transcription
Result is intelligible, but too “pure” and sounds synthetic
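
To make the idea concrete, here is a minimal sketch of a steady vowel built from formant values: a pulse train at the pitch frequency is passed through a cascade of two-pole digital resonators, one per formant. The function names are assumptions, the formant frequencies for an /a/-like vowel are approximate, and the bandwidths are illustrative guesses. A real formant synthesiser would derive time-varying parameters by rule and add noise sources, none of which is attempted here.

```python
import math


def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # gives unit gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0, duration, sample_rate):
    """Very crude voicing source: one unit pulse per pitch period."""
    period = int(sample_rate / f0)
    n = round(duration * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0=120.0, duration=0.3, sample_rate=16000):
    """Filter the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0, duration, sample_rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # normalise to the range [-1, 1]


# Approximate first three formants for an /a/-like vowel (Hz); bandwidths assumed.
A_VOWEL = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
```

`synthesize_vowel(A_VOWEL)` returns a list of floats that could be scaled to 16-bit integers and written out with the standard `wave` module. The unvarying pitch and perfectly regular pulses are part of why output of this kind sounds "pure" and synthetic.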

Slide 15

Formant synthesis

Demo:
In Control Panel, select the “Speech” icon
Type in your text and choose “Preview Voice”
You may have a choice of voices

Slide 16

Concatenative synthesis

Concatenate segments of pre-recorded natural human speech
Requires database of previously recorded human speech covering all the possible segments to be synthesised
Segment might be phoneme, syllable, word, phrase, or any combination
Or, something else more clever ...

Slide 17

Diphone synthesis

Most important for natural-sounding speech is to get the transitions right (allophonic variation, coarticulation effects)
These are found at the boundary between phoneme segments
“Diphones” are fragments of speech signal cutting across phoneme boundaries
If a language has P phones, then the number of diphones is ~P² (some combinations are impossible), e.g. 800 for Spanish, 1200 for French, 2500 for German

m y n u m b er
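
A short sketch of how a phone sequence maps onto diphone names, and of the P² upper bound on inventory size. The function names and the use of "_" for the silence at utterance edges are assumptions for illustration. Each diphone here is only a label; in a real database it would point to a stretch of recorded audio running from the middle of one phone to the middle of the next.

```python
def to_diphones(phones, boundary="_"):
    """Pair each phone with its successor, padding with silence at both ends."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def max_diphone_inventory(n_phones: int) -> int:
    """Upper bound on distinct diphones; real inventories are smaller
    because some phone combinations never occur."""
    return n_phones ** 2
```

For example, `to_diphones(["k", "ae", "t"])` gives `["_-k", "k-ae", "ae-t", "t-_"]`, and a language with 40 phones has at most `max_diphone_inventory(40) == 1600` diphones.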

Slide 18

Diphone synthesis

Most systems use diphones because they
Are manageable in number
Can be automatically extracted from recordings of human speech
Capture most inter-allophonic variants
But they do not capture all coarticulatory effects, so some systems include triphones, as well as fixed phrases and other larger units (i.e. unit selection synthesis, USS)

Slide 19

Concatenative synthesis

Input is phonemic representation + prosodic features
Diphone segments can be digitally manipulated for length, pitch and loudness
Segment boundaries need to be smoothed to avoid distortion
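
One simple way to smooth a join is a linear crossfade over a short overlap, sketched below on plain lists of samples. This is an assumed stand-in for whatever smoothing a given system actually uses, and the function name and the choice of linear weights are illustrative.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample sequences, blending the last `overlap` samples of
    `left` with the first `overlap` samples of `right`."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return list(left) + list(right)
    head = list(left[:-overlap])
    tail = list(right[overlap:])
    blended = []
    for i in range(overlap):
        # Weight moves from mostly-left to mostly-right across the overlap.
        w = (i + 1) / (overlap + 1)
        blended.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + tail
```

The result is shorter than the two inputs combined by `overlap` samples, so durations assigned by the prosody stage would need to allow for that.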

Slide 20

Unit selection synthesis (USS)

Same idea as concatenative synthesis, but the database contains a bigger variety of “units”
Multiple examples of phonemes (under different prosodic conditions) are recorded
Selection of the appropriate unit therefore becomes more complex, as the database contains competing candidates for each position
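
The slides do not say how the competing candidates are ranked. One common formulation, assumed here, scores each candidate with a target cost (how well it matches the requested prosody) and a join cost (how smoothly it follows its predecessor), then finds the cheapest sequence by dynamic programming. The `Unit` and `Target` classes, both cost functions and their weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str
    pitch: float      # Hz
    duration: float   # seconds


@dataclass(frozen=True)
class Target:
    phone: str
    pitch: float
    duration: float


def target_cost(t: Target, u: Unit) -> float:
    # Assumed weights: 100 Hz of pitch error costs as much as 0.1 s of duration error.
    return abs(t.pitch - u.pitch) / 100.0 + abs(t.duration - u.duration) * 10.0


def join_cost(a: Unit, b: Unit) -> float:
    # Assumed: only pitch discontinuity at the join is penalised.
    return abs(a.pitch - b.pitch) / 100.0


def select_units(targets, database):
    """Return the candidate sequence with the lowest total cost."""
    if not targets:
        return []
    lattice = [[u for u in database if u.phone == t.phone] for t in targets]
    if any(not candidates for candidates in lattice):
        raise ValueError("no candidate unit for at least one target phone")
    # best[i][j] = (cheapest cost of any path ending in lattice[i][j], backpointer)
    best = [[(target_cost(targets[0], u), -1) for u in lattice[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in lattice[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(lattice[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    return path[::-1]
```

With two recordings each of "a" and "b" at different pitches, a high-pitched target sequence selects the two high-pitched units, because mixing registers would incur a large join cost.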

Slide 21

Speech synthesis demo
