Using a Computer in Foreign Language Pronunciation Training: What Advantages?
Maxine Eskenazi
Carnegie Mellon University
Abstract:
This paper looks at how speech-interactive CALL can help the classroom teacher carry out recommendations from immersion-based approaches to language instruction. Emerging methods for pronunciation tutoring are demonstrated from Carnegie Mellon University's FLUENCY project, addressing not only phone articulation but also speech prosody, responsible for the intonation and rhythm of utterances. New techniques are suggested for eliciting freely constructed yet specifically targeted utterances in speech-interactive CALL. In addition, pilot experiments are reported that demonstrate new methods for detecting and correcting errors by mining the speech signal for information about learners' deviations from native speakers' pronunciation.
KEYWORDS
Speech Recognition, Pronunciation, CALL, Immersion, Direct Approach, Phonemes, Prosody, Pitch, Intensity, Duration
INTRODUCTION
The ever-growing speed and memory of commercially available computers, coupled with decreasing prices, are making speech-interactive computer-assisted language learning (CALL) feasible. Even though the hardware for an ideal automatic training system exists, can the same be said of state-of-the-art automatic speech recognition (ASR) and of our knowledge of the variability of the speech signal, the main stumbling block to higher quality speech recognition? Has the technology come far enough for systems to be able to teach pronunciation effectively?
To answer these questions, we will first specify what is believed to contribute to successful language learning under a direct approach, drawing largely from principles described by Celce-Murcia and Goodwin (1991). We will then list pedagogical recommendations following from this approach, such as providing language samples from many different speakers. Next, we will look at what speech-interactive CALL can do to help the classroom teacher carry out these recommendations. We illustrate with emerging methods for pronunciation tutoring from the FLUENCY project at Carnegie Mellon University (CMU) (Eskenazi, 1996), methods that support both articulation of phonemes and use of prosody—the intonation and rhythm of speech. The emphasis here is on pronunciation in the context of overall language learning. Proficient pronunciation is essential because, below a certain level of pronunciation accuracy, communication cannot take place even when grammar and vocabulary have been mastered.
WHAT CONTRIBUTES TO SUCCESS IN TARGET LANGUAGE PRONUNCIATION?
Conditions for Success and Pedagogical Recommendations Based on Immersion
Many foreign language instructors agree (Celce-Murcia & Goodwin, 1991) that living in a country where the target language is spoken is the best way to become fluent—a total immersion situation. They also generally agree (Kenworthy, 1987; Laroy, 1995; Richards & Rodgers, 1986) on which conditions of living abroad are critical to effective language learning:
• Learners hear large quantities of speech.
• Learners hear many different native speakers.
• Learners produce large quantities of utterances on their own.
• Learners receive pertinent feedback.
• The context in which the language is practiced has significance.
These conditions cover the external environment of language learning. From each we can extract recommendations for how to learn language under less than total immersion conditions.1 These recommendations cannot always be carried out in classroom contexts, thus presenting opportunity and motivation for ASR technology to complement teaching.
Recommendation 1
Learners hear large quantities of speech. For language learners who are not living in the country of the target language, immersion
courses consisting of six to eight hours daily are often the best alternative for exposing learners to the language. An ideal one-to-one student-teacher ratio would provide maximum speaking and feedback time, but this is not always feasible. On the one hand, most students have other daily activities; on the other, employing human teachers for eight hours a day is expensive (Bernstein, 1994). Moreover, immersion classes usually have five to ten students, and attending to individual needs reduces the amount of time the teacher speaks to the class.
Recommendation 2
Learners hear many different native speakers. This recommendation implies employing many native teachers with a diversity of voice types and dialects. However, the variety of native speakers available locally is limited, as is the number of people that a school can afford to hire. Traditional educational materials that promote wider exposure, such as audio and video cassettes, tend to be non-interactive, and their audio quality can degrade over time.
Recommendation 3
Learners produce large quantities of utterances on their own. Ideally, the student is in a one-on-one setting where the teacher encourages short conversations, constantly eliciting the student's speech. In reality, students in the classroom share the teacher's attention. The amount of time they spend individually producing speech and participating in conversation is thus reduced.
Recommendation 4
Learners receive pertinent feedback. In immersion contexts feedback that leads to correction of form or content may occur in two ways. Implicit feedback comes when speaker and listener realize that the message did not get across. A clarification dialogue usually takes place ("I beg your pardon?" "What did you say?"), ending with a corrected message that is understood. Less often, when culture and interpersonal context permit, the listener offers explicit correction, such as pointing out the error or repeating what the speaker said but with correction. In the ideal classroom, teachers offer implicit and explicit feedback at just the right times, keeping a balance between not intervening too often, to avoid discouraging the student, and intervening often enough to keep an error from becoming a hard-to-break habit. Expert teachers adapt the pace of correction—how often they intervene—to fit the student's personality. In reality, however, not all teachers use the same techniques and, in the classroom, are not always able to adapt these techniques to individuals. When class size increases, the amount of feedback to the individual student decreases.
Recommendation 5
The context in which the language is practiced has significance. Living in the country where the target language is spoken gives learners the practical need to speak. Their utterances have immediate significance. To accommodate this recommendation, the ideal language classroom includes fast-paced games and everyday conversations that create meaningful contexts (Bowen, 1975; Brumfit, 1984; Crookall & Carpenter, 1990). The student has to respond rapidly and utter new terms in these contexts. In reality, classroom size again reduces the individual learner's time for participating in such activities.
Conditions for Success and Pedagogical Recommendations Based on Structured Intervention
There are two additional conditions that appear critical for learning pronunciation but that do not follow from immersion—indeed, they follow from an assumption of structured intervention that departs from pure immersion:
1) Learners feel at ease in the language learning situation. Whereas the very young language learner perceives and tries out new sounds easily, older learners lose this ability. Embarrassment or fear may inhibit the learner from trying new sounds or even from speaking, whether in a total immersion or a classroom environment (Laroy, 1995).
2) There is ongoing assessment of learners' progress. Language learning appears most efficient when the teacher constantly monitors progress to guide appropriate remediation or advancement.
These conditions lead to pedagogical recommendations that may be particularly hard to carry out in the classroom.
Recommendation 6
Learners feel at ease. A key dimension of the learner's "internal" environment is self-confidence and motivation. Although there are techniques to boost student confidence in the classroom (Laroy, 1995; Krashen, 1982)—such as correcting only when necessary, reinforcing good pronunciation, and avoiding negative feedback—these may not overcome learners' inhibitions. Laroy (1995) finds that when students are asked in front of peers to make sounds that do not exist in their native language, these students tend to feel ill at ease. As a result, they may stop trying completely or may only make sounds from their native language. One-on-one teaching is important at this point, allowing students to "perform" in front of the teacher alone, not in front of a whole class, until they are comfortable with the newly learned sounds.
In reality, there is often little time for such one-on-one sessions. When correction must keep pace with a whole class, the poorer and less confident speakers suffer.
Recommendation 7
There is ongoing assessment. To adapt training to individual needs, the teacher ideally monitors each student's moment-by-moment progress, assessing strong and weak points, and judges where to focus effort next. The effective teacher takes into account what the student feels is useful, thus keeping students involved in their own progress (Celce-Murcia & Goodwin, 1991; Laroy, 1995). In reality, classroom teachers cannot maintain steady monitoring of each student at this level of detail.
WHERE CAN SPEECH-INTERACTIVE CALL MAKE A CONTRIBUTION?
It is not feasible to carry out these seven recommendations fully in the traditional language classroom, given constraints on teaching time and materials. The ideal CALL system could help toward realizing these recommendations by providing individualized practice and feedback in a safe environment and sending back regular progress reports to the teacher (Wyatt, 1988). The human teacher must still do the high-level, subtle work of creating a positive atmosphere for the production of new sounds and stress patterns, explaining fine conceptual differences between a student's native language and the target language, and exploring cultural differences (Bernstein, 1994).
For each of our recommendations we will consider where automatic functions, in the form of both ASR and CD-ROM, can support the classroom. We draw examples from the FLUENCY project and from other systems featured in this volume.
CALL Can Help Learners Hear Large Quantities of Speech
With the decreasing cost and increasing capacity of computer memory and storage, CALL can offer users a choice of many prerecorded utterances. CD-ROMs afford high-quality sound and video clips of speakers, giving learners a chance to see articulatory movements used in producing new sounds (e.g., LaRocca, 1994). The teacher no longer has to find or record native speakers, although tools can be provided for teachers to add new speakers to the data set. Digitized speech, available on demand, supplements the teacher's speech without incurring additional cost at each use. It also allows individualized access to particular samples of speech.
CALL Can Help Learners Hear Many Different Native Speakers
Increased memory enables presentation of a variety of different speakers from different regions and dialects. Different speakers can be sampled to find one "golden" voice that the learner would like to imitate. The choice can, for example, center on finding a voice that has characteristics closest to the learner's, as suggested by Wachowicz and Scott (this issue). Speakers' voices can be sped up or slowed down if students wish. Many speakers can be made to repeat utterances over and over.
The Learn to Speak Spanish course (Duncan, Bruno, & Rice, 1995) takes advantage of CD-ROM storage to present speech utterances from a variety of speakers. Videos of different speakers appear as the course exercises progress. Each utterance can be heard as many times as desired, although for a given sentence only one native speaker is available. Rypa and Price (this issue) demonstrate advances for exploiting recorded speech from a variety of speakers in the service of listening practice.
ASR-Based CALL Can Help Learners Produce Large Quantities of Utterances on Their Own
Limitations of Traditional ASR-based CALL
A major problem in speech-interactive CALL, in commercial products especially, is that learners remain relatively passive (Wachowicz & Scott, this issue). Although learners may be asked to voice an answer to a question, this by design involves either parroting an utterance just presented or reading one of a small set of written choices (Bernstein, 1994; Bernstein & Franco, 1995). Learners get no practice in constructing their own utterances (i.e., choosing vocabulary and assembling syntax). The commercially available AuraLang package (Auralog, 1995), for example, is an appealing language teaching system that feeds the user's pronunciation of one of three written sentences to the recognizer. Each choice leads the dialogue along a different path. A certain degree of realism is attained, but students do not actively construct utterances.
Constructing an utterance means putting it together at many levels; the syntax and lexicon are being readied at the same time as the pronunciation. Readying pronunciation alone (as in minimal pair exercises) is only one step toward the end goal of being able to participate actively in a conversation. Current speech-interactive language tutors do not let learners freely create utterances because current ASR requires a high degree of predictability to recognize reliably what is said. CALL developers look for ways to compensate for imperfect recognition for two reasons: so that the system does not often interrupt students to tell them they were wrong when, in fact, they were right, and so that errors are not overlooked and allowed to go uncorrected.
Techniques for Extending the Limitations: Sentence Elicitation
The FLUENCY project has developed a technique that enables users of speech-interactive CALL to participate more actively in constructing utterances (Eskenazi, 1996). In traditional speech-interactive CALL, ASR works well because the system "knows" what a speaker will say and matches exemplars of the phones it expects (pre-stored in memory) against the incoming signal (what was actually said). The technique developed in FLUENCY, by contrast, makes it possible to predict enough of what the speaker will say to satisfy the needs of the recognizer while giving speakers apparent freedom to construct utterances on their own. The technique is based on sentence elicitation, modeled on the drills used in the once prevalent Audio-Lingual Method (Modern Language Materials, 1964) and the British Broadcasting Corporation tutorials (Allen, 1968).
Several studies have addressed whether specifically targeted speech data can be collected using sentence elicitation (Hansen, Novick, & Sutton, 1996; Isard & Eskenazi, 1991; Pean, Williams, & Eskenazi, 1993). Results confirm that a given prompt sentence in a carefully constructed exercise elicits only one to three distinct response sentences from typical speakers (provided speakers are cooperative and follow the examples given). Below is a sample exercise from the FLUENCY project designed for automatic tutoring of sentence structure and prosody:
System: When did you meet her? (yesterday) -I met her yesterday.
When did you find it?
Student: I found it yesterday.
System: Last Thursday.
Student: I found it last Thursday.
System: When did they find it?
Student: They found it last Thursday.
System: When did they introduce him?
Student: They introduced him last Thursday.
The exercise screen is free of written prompts and practice is completely oral, with students doing the sentence construction work themselves. The technique provides large amounts of fast-moving practice, making students active rather than passive speakers. Later, when they need to build an utterance during a real conversation, they will have acquired some of the necessary speaking experience and automatic reflexes. The goal is to enable them to speak rapidly, in pace with the conversation.
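To make the mechanics concrete, the sketch below pairs each elicitation prompt with the one to three responses it is expected to evoke, so that the recognizer only has to choose among a few predictable sentences. This is an illustrative sketch in Python, not the FLUENCY implementation; the exercise content and the recognize() interface are assumptions.

```python
# Hypothetical pairing of prompts with the responses they are expected to
# elicit; recognize(expected) stands in for an ASR pass constrained to the
# listed sentences, returning (best_match, score).
ELICITATION_EXERCISE = [
    ("When did you find it?",        ["I found it yesterday."]),
    ("Last Thursday.",               ["I found it last Thursday."]),
    ("When did they find it?",       ["They found it last Thursday."]),
    ("When did they introduce him?", ["They introduced him last Thursday."]),
]

def run_exercise(recognize):
    """Play each prompt, recognize the learner's answer against the small
    expected set, and collect results for end-of-exercise correction."""
    results = []
    for prompt, expected in ELICITATION_EXERCISE:
        print(f"System: {prompt}")                # in practice, spoken aloud
        best, score = recognize(expected)
        results.append((prompt, best, score))
    return results
```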
A database called ICY was created to study speakers' strategies when changing speaking styles (Isard & Eskenazi, 1991; Eskenazi, 1992). In studies that led up to ICY, a set of specific, pretargeted syntactic structures in French, British English, and German was successfully elicited
without telling students ahead of time what to say. In the French studies, using both oral sentence prompts and unambiguous visual cues for elicitation, we succeeded in eliciting the chosen structures over 85% of the time (varying from 70% for one sentence to 100% for several sentences). In vocabulary choice, too, a careful search for unambiguous target nouns and adjectives with few synonyms yielded a highly predictable answer over 95% of the time. For example, we targeted manche bleue 'blue sleeve' and succeeded in eliciting that structure (noun followed by adjective) as opposed to (a) noun followed by verb and adjective (manche est bleue 'sleeve is blue') or (b) noun followed by preposition, article, noun, verb, and adjective (manche de la robe est bleue 'sleeve of the dress is blue'). These studies can be considered pretests in the context of FLUENCY. They allow us to forecast which elicitation sentences and visual cues will evoke the most predictable responses and to adopt these sentences and cues for ASR-based tutoring exercises.
In FLUENCY, given that we can predict what will be said, we can use the method of "forced alignment" in ASR. In other words, we can automatically align the predicted text to the incoming speech signal, as is done in systems that impose multiple choice responses. Once the recognition results are obtained, the system can correct pronunciation errors immediately, breaking into the rhythm of the exercise, or it can hold correction until the end of the exercise. We have observed that waiting until the end of the exercise ensures a higher level of success in elicitation; however, our design will allow teachers to intervene earlier if a student's level and personality warrant it.
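A minimal sketch of this forced-alignment data flow follows; the lexicon fragment and the align() function are hypothetical stand-ins for a recognizer such as SPHINX II run in forced alignment mode, not actual project code.

```python
# Hypothetical lexicon fragment in CMU phone notation.
LEXICON = {
    "i": ["AY"],
    "found": ["F", "AW", "N", "DD"],
    "it": ["IH", "TD"],
    "yesterday": ["Y", "EH", "S", "T", "ER", "D", "EY"],
}

def expected_phones(sentence):
    """Expand the predicted response into the phone sequence to align."""
    return [ph for word in sentence.lower().rstrip(".").split()
            for ph in LEXICON[word]]

def tutor_turn(audio, sentence, align):
    """One exercise turn: align the predicted phones to the learner's audio.
    Correction can break in immediately or be held until the end of the
    exercise, so this simply returns the per-phone alignment results."""
    phones = expected_phones(sentence)    # e.g., for "I found it yesterday."
    return align(audio, phones)           # [(phone, start_s, end_s, score), ...]
```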
Students can practice constructing answers to the same elicitation sentences as often as they wish, at no additional cost in teacher time and materials. Availability and patience are other qualities that enable the system to support our recommendation of having learners produce large quantities of utterances on their own.
ASR-Based CALL Can Provide Learners With Pertinent Corrective Feedback
Teachers often ask what type of corrective feedback speech recognition can furnish. This section will address two aspects of the question: whether and what types of errors can be detected successfully, and what methods are effective in telling students about errors and showing them how to make corrections.
Can Errors Be Detected? Phone Errors Versus Prosody Errors
Language learners make pronunciation errors of two types: those involving the articulation of phones (phonemes) and those involving the use of prosody. Prosody is represented by three distinct components in the acoustic signal: (a) fundamental frequency (pitch), (b) duration (speaking rate and timing), and (c) intensity (amplitude or loudness). These components underlie the rhythm and intonation of speech. Phone correction is important during the first year of language study because proper articulatory habits enhance the intelligibility of students' speech. But intelligible speech does not rest solely on correct phones. After the first year of study, pronunciation correction typically shifts to prosody. Appropriate prosody guides the flow of speech in a way that improves intelligibility even when phone targets are not reached (Celce-Murcia & Goodwin, 1991). As discussed below, the two types of pronunciation errors differ in origin, and their detection and correction imply different procedures.
Learners make phone errors because the number and nature of phonemes differ between native language (L1) and target language (L2) or because the acceptable pronunciation space of a given phone may differ between L1 and L2. In prosody, by contrast, the components are the same in all languages; speakers vary fundamental frequency, duration, and intensity along the same dimensions. However, the relative importance of each component, the meanings linked to each, and how they vary may differ from language to language. For example, variations of intensity are used much less often and with less contrast in French than in Spanish.
Error detection procedures differ as follows. Phone-based errors are identified in forced alignment mode. Given an expected utterance, the recognizer takes the actual utterance and returns the placement in time of phones and words on the speech signal. By this method the learner's recognition scores can be compared to the mean recognition scores for native speakers—all uttering the same sentence in the same speaking style—and the learner's errors can thereby be identified and located (Bernstein & Franco, 1995). For prosodic errors, however, only duration can be obtained from the output of the recognizer. That is, when the recognizer returns the phones and their scores, it can also return the duration of the phones. Frequency and intensity, on the other hand, are measured on the speech signal before it is sent to the recognizer but after it is preprocessed. Intensity is usually obtained by using a technique known as cepstral analysis. Fundamental frequency is obtained from an algorithm that detects peaks in the signal and measures the distance between them. Speakers as individuals vary greatly on the three components of prosody. For example, some people speak louder or faster in general than do others. Thus, it is important that measures of the three be expressed in relative terms, such as the duration of one syllable compared to the next.
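To illustrate the peak-distance idea for fundamental frequency, the sketch below estimates F0 for one voiced stretch of waveform by locating prominent peaks and converting the mean peak-to-peak distance to Hertz. It is a minimal sketch with assumed thresholds, not the detector used in FLUENCY; practical pitch trackers add voicing decisions, smoothing, and octave-error correction.

```python
import numpy as np

def f0_by_peak_picking(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Rough F0 estimate: find prominent positive peaks in the waveform and
    convert the mean distance between them to Hz. Thresholds are assumptions."""
    min_gap = int(sample_rate / f0_max)        # minimum samples between peaks
    threshold = 0.5 * np.max(np.abs(frame))    # ignore low-amplitude ripple
    peaks, last = [], -min_gap
    for i in range(1, len(frame) - 1):
        is_peak = frame[i - 1] < frame[i] >= frame[i + 1]
        if is_peak and frame[i] > threshold and i - last >= min_gap:
            peaks.append(i)
            last = i
    if len(peaks) < 2:
        return None                            # too few peaks: treat as unvoiced
    f0 = sample_rate / float(np.mean(np.diff(peaks)))
    return f0 if f0_min <= f0 <= f0_max else None
```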
Phone Error Detection: A Pilot Study of ASR-based Comparisons of Native and Nonnative Speakers
Although researchers have been cautious about using ASR to pinpoint phone errors, recent work in the FLUENCY project shows that the recognizer can be used in this task if the context is well chosen (Eskenazi, 1996). Demonstrating this is a pilot study of native and nonnative speakers uttering responses in elicitation exercises.
Method
Ten native speakers of American English (five male and five female) were recorded, along with 20 nonnative speakers (one male and one female from each of the following L1s: French, German, Hebrew, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish).2 Expert language teachers were asked to listen to the sentences recorded by each speaker and to judge where there was an error, what it was, and how (and when) they would intervene to correct it. Teachers marked these judgments on phonemically labeled copies of the target sentences. The agreement between human teachers and ASR detection was used as a preliminary indication of the validity of automatic error detection.
Figure 1
SPHINX II Recognition Scores for Native and Nonnative Speakers
Note: The recognition scores for native speakers are represented by solid lines and for nonnative speakers by dotted lines for each phone in the utterance, "I did want to have extra insurance." The horizontal axis shows the sequence of phones in the target utterance using CMU phone notation. Individual speakers are referenced by four-letter labels in which the first letter indicates gender (m or f) and the second letter indicates the language of origin: French, German, Hebrew, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, or Spanish.
Results
Figure 1 shows the recognition results for native and nonnative male speakers when the speech was processed by CMU's SPHINX II automatic speech recognizer (Ravishankar, 1996) in forced alignment mode. The sentence recorded was, "I did want to have extra insurance," elicited from "You didn't want to have extra insurance!" Phones for this utterance are listed in sequence on the horizontal axis using CMU phone notation (e.g., the first phone /AY/ represents American English pronunciation of "I"). Native speakers' values are traced with solid lines, nonnatives with dotted lines. The speaker labels indicate gender and language of origin for each
speaker. (The following phonological variants, common to native speakers of English, were taken into account: for the /TD/ of "want" and the /AXR/ of "insurance," not all speakers show values, since "want" can be pronounced /W AA N/ in this context (geminate) and "sur" of "insurance" can be /SH AXR/ or /SH AO R/.)
The vertical axis in Figure 1 represents the phone score given by SPHINX II normalized over the total duration of the phone. As past experience has shown, information based on absolute thresholds of phone scores is of little use. However, a nonnative's input can have a score (for a given phone in a given context) that is noticeably distant from the cluster of scores for native speakers in the same context, thus defining a deviation from normal pronunciation. By this index, Figure 1 shows that the scores of nonnatives and natives sometimes coincide. That is, nonnatives sometimes sound like natives, especially when the phone in L2 is similar to the one in L1. In other cases, nonnatives are consistent outliers, for example, in the case of /DD/ (final stop in "did"). Underlying the case of /DD/ is the fact that natives did not release the stop whereas nonnatives did. Releasing final stops in this position contributes to perceived accent, although it is not a feature that teachers noted as needing correction. This feature can be considered a minor deviation, one that causes listeners to "hear an accent" but not to misunderstand what was said.
The phone scores also indicate noticeable distance between natives and certain of the nonnatives for /HH AE V/ ("have"). These nonnatives tend to say /EH V/ or /EH F/ instead of /AE V/ because their L1 does not contain the /AE/ sound. For /IH/ of "insurance," the German speaker's score is very far from the rest. For these two examples, the teachers noted the same outliers as indicated by the phone scores.
Therefore, for this small sample of nonnative speakers, diverse in terms of L1, SPHINX II confirms human observations of incorrectly pronounced phones and does so independently of the speaker's L1. More work is now being done to validate these measures over a larger population of speakers and utterances. The potential ability of ASR to spot outliers could serve to guide ASR-based CALL in deciding which phones need training.3
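A sketch of this outlier test follows, assuming duration-normalized phone scores have already been computed for the learner and for each native speaker; the threshold k is an assumed parameter, not a value from the study.

```python
import statistics

def flag_phone_outliers(learner_scores, native_scores, k=2.0):
    """Flag phones whose duration-normalized recognition score lies far from
    the cluster of native scores for the same phone in the same sentence.
    learner_scores: {phone: score}; native_scores: {phone: [native scores]}."""
    flagged = []
    for phone, score in learner_scores.items():
        natives = native_scores[phone]
        mean = statistics.mean(natives)
        spread = statistics.stdev(natives)
        if spread > 0 and abs(score - mean) > k * spread:
            flagged.append((phone, score, mean))   # candidate for correction
    return flagged
```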
Prosody Error Detection: Frequency, Duration, Intensity
In speech-interactive CALL we posit that correcting prosody is at least as important as correcting phones. When listening to a foreign speaker, it is not uncommon to hear a sentence with correct phones and syntax that is hard to understand because of prosody errors. Yet we also hear sentences with correct prosody and faulty phones or syntax that we understand perfectly well.
Automatic detection of prosodic features is starting to be used successfully
in speech-interactive CALL. The SPELL foreign language teaching system (Rooney, Hiller, Laver, & Jack, 1992) addresses both fundamental frequency, or pitch, and duration. Pitch detection, like speech recognition, is by no means a perfected technique. But the work of Bagshaw, Hiller, and Jack (1993) on better pitch detectors for SPELL shows that algorithms can be made more precise within a specific application. This work compared the student's pitch contours to those of native speakers to demonstrate the informativeness of pitch detection. Pitch detection was incorporated into SPELL and the output interpreted in visual and auditory feedback for the student. SPELL assumes that suprasegmental (prosodic) aspects of speech should be tied to segmental (phonemic) information—for example, by showing pitch trajectories (contours over segments) and pitch anchor points (centers of stressed vowels). SPELL also addresses speech rhythm, showing segmental duration and acoustic features of vowel quality (predicting strong vs. weak vowels).
Tajima, Port, and Dalby (1994) and Tajima, Dalby, and Port (1996) have addressed duration. They studied how timing changes in speech affect the intelligibility of nonnative speakers and created remedial training supported by ASR. By using speech that is practically devoid of segmental content (ma ma ma Ö), they separate the segmental and suprasegmental aspects of the speech signal to focus on one aspect—temporal pattern training.
The FLUENCY project has looked at how to detect changes in duration, pitch, and intensity to find where a nonnative speaker deviates from acceptable native values. Prosody training in FLUENCY is linked to segmental aspects, with students producing meaningful phones. We aim to detect deviations independently of L1 and L2 so that if a learning system is ported to a new target language, its prosody detection does not have to be changed fundamentally. We have promising results from a pilot study, reported below, using hand-labeled features of the spectrogram.
Prosody Error Detection: A Pilot Study of ASR-based Comparisons of Native and Nonnative Speakers
Method
For the English sentence data recorded in the pilot study on phones, we additionally asked human teachers to mark the location and type of prosodic errors of each speaker on transcriptions of the sentences. We first examined the speech signal to determine whether the information used by teachers to detect errors could be characterized in the spectrogram. After examining phone-, syllable-, and word-sized segments, we developed three measures, one for each component of prosody. We compared these with
human teachers' judgments of places where prosody needed improvement in each sentence and refined the measures until they showed close agreement with human judgments. These measures then define the features we want to extract automatically from the speech signal to diagnose where students need improvement.
Duration Results
The first measure was duration of the speech signal, measured on the waveform. The results of the duration comparisons are given in Figure 2. The duration of one voiced segment was compared to the duration of the preceding one ("ratio of seg1/seg2" on the vertical axis) to make the observations independent of individual variations in speaking rate. Note that a voiced segment starts at the onset of voicing after silence or after an unvoiced consonant; it ends when voicing stops at the onset of silence or at the onset of an unvoiced consonant (independently of the number or nature of phones the voiced segment contains).
Figure 2
Two-by-Two Comparison of Duration of Voiced Segments
Note: The comparison is expressed as the ratio of segment 1 to segment 2 (same sentence and speakers as in Figure 1). Segment notations on the horizontal axis include neighboring unvoiced consonants for clarity.
The spectrographic measures point to outliers that matched the judgments of the teachers. For example, take two cases of outliers in Figure 2. First, for the word "extra" in "I did want to have extra insurance," the segment /EHK/ is unusually long compared to the following vocalic segment (/STRA/) for the speakers labeled mfrc and miwa (these speakers' vowel quality was closer to /IY/ here). This deviation may be due to a poor attempt at the lax vowel /EH/ and was noted independently by teachers. Indeed, tense/lax vowel quality differences do not exist in French and Italian and are a pervasive problem for speakers of those languages learning English. Second, mkjp's "to" is about equal in length to his "have," departing from the other speakers. This deviation was also noted by the teachers. It is interesting to observe the extremely small spread of native and nonnative values at "want"/"to." "Want" may be longer than "to" for everyone, in the same relative proportions, because function words are marked by shortened duration cues in most languages.
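The duration measure itself is simple to state in code. The sketch below computes the consecutive-segment ratios plotted in Figure 2; the segment tuple format is an assumed representation of the voiced segments found on the waveform.

```python
def duration_ratios(segments):
    """Ratio of each voiced segment's duration to the next segment's
    ("ratio of seg1/seg2"), which removes individual differences in
    overall speaking rate. segments: [(label, start_s, end_s), ...]."""
    durations = [(label, end - start) for label, start, end in segments]
    return [(f"{a}/{b}", da / db)
            for (a, da), (b, db) in zip(durations, durations[1:])]

# e.g., duration_ratios([("EHK", 0.90, 1.10), ("STRA", 1.12, 1.30)])
# -> [("EHK/STRA", 1.11...)]: /EHK/ noticeably longer than /STRA/.
```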
Pitch Results
The second measure we developed was the total number of pitch peaks present in the speech signal, calculated for each segment.4 Again, results were compared between neighboring segments. We were able to detect pitch deviations related to duration as well as independent of it. For example, mfrc raised pitch much higher on /EHK/ in "extra" than on the following vocalic segment /STRA/, probably because /EHK/ is also longer (see Figure 2). However, the speaker mpeg varied pitch independently of duration.
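A sketch of this peak-counting measure, assuming an F0 track sampled at a hypothetical 100 frames per second with None marking unvoiced frames:

```python
def pitch_peak_counts(f0_track, segments, frame_rate=100):
    """Count local maxima in the F0 contour within each voiced segment,
    then compare neighboring segments, as with duration. f0_track holds
    one F0 value (or None when unvoiced) per frame."""
    counts = []
    for label, start_s, end_s in segments:
        lo, hi = int(start_s * frame_rate), int(end_s * frame_rate)
        seg = [f for f in f0_track[lo:hi] if f is not None]
        peaks = sum(1 for i in range(1, len(seg) - 1)
                    if seg[i - 1] < seg[i] >= seg[i + 1])
        counts.append((label, peaks))
    return [(f"{a}/{b}", ca - cb)
            for (a, ca), (b, cb) in zip(counts, counts[1:])]
```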
Intensity Results
The third measure developed, for intensity, was the average of all the cepstral values over a given vocalic segment (a code sketch follows Figure 3). To address relative rather than absolute intensity, we compared these values segment-to-segment with those of neighboring vocalic segments, as with duration and pitch. The resulting curves and spread of speaker space, shown in Figure 3, differ in general aspect from the results in Figures 1 and 2. Outliers were indicated that matched teachers' judgments about relative stress centers in utterances. For example, msjh shows stress displaced within the "I/did/want" region, mbob shows displaced stress within "did/want/to," and msjh, among others, shows it within the "ex/tra/in" region. The speakers' changes in amplitude appeared to be independent of duration and pitch.
Figure 3
Two-by-Two Comparison of Average Intensity (Amplitude) on Voiced Segments
Note: The data in this figure are for the same sentence and speakers as in Figure 1.
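The intensity measure follows the same relative scheme. In the sketch below, c0 denotes the zeroth cepstral coefficient, which tracks log energy; the per-frame track format and the 100 frames-per-second rate are assumptions.

```python
import numpy as np

def relative_intensity(c0_track, segments, frame_rate=100):
    """Average c0 over each vocalic segment and compare neighboring
    segments, so that only relative, not absolute, loudness matters."""
    means = []
    for label, start_s, end_s in segments:
        lo, hi = int(start_s * frame_rate), int(end_s * frame_rate)
        means.append((label, float(np.mean(c0_track[lo:hi]))))
    return [(f"{a}/{b}", ma - mb)
            for (a, ma), (b, mb) in zip(means, means[1:])]
```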
Implications
Our pilot study suggests that the spectrogram can be mined for measures of speech prosody that have diagnostic value and are consistent with what expert teachers say they would detect and correct. We are now working to extract these measures automatically. Because the three measures are computed separately, once an utterance is analyzed each could be expressed in a visual display for the learner showing pitch, duration, or amplitude. A learner's utterance could then be compared with a native speaker's utterance on each dimension to illustrate differences. At the same time, our results suggest that the components of prosody are not totally independent of one another; we saw this particularly in the dependency of pitch on duration. We therefore suggest that correction first address the three components separately, then address their combined effect. Instruction could begin by exercising pitch and duration changes independently, then give practice on changing pitch and duration together.
An Argument for Early Prosody Instruction
Early prosody instruction, starting in the first year of language study, could be a boon to learning both syntax and phone articulation. Because speakers prepare the syntax of a sentence they want to say at about the same point as they prepare prosody, incorrect word order will not fit the "song" that the sentence is to be sung to. Self-correction then comes into play as students rearrange syntax to give a better fit to prosody. (Because the "song" is considered as a whole and the syntax as a concatenation of elements, the student should tend to rearrange syntax and not prosody.) Phones may benefit from early prosody training, for example, in the case of stressed and unstressed vowels in English. If a target vowel is unstressed and the Spanish speaker uses a tense (stressed) vowel that is close to the target in articulatory space, self-correction should follow because the speaker's longer tense vowel will not "fit the song" well. For example, the unstressed "this" in the sentence "I want this present" is shorter and softer than the surrounding stressed syllables. Practice of correct prosody in this sentence should aid pronunciation of "this" by lessening emphasis on and shortening the /IH/ sound. Follow-up exercises could put "this" into new contexts, such as "This is yours," where the word is not so short and the speaker must make more effort to retain the shortened form just learned.
Effective Correction in Speech-interactive CALL
Learners' difficulties with phones and prosody, which our pilot studies suggest can be readily detected in the speech waveform, become targets for focused correction in CALL. A system that only detects pronunciation errors (e.g., parts of TriplePlayPlus by Syracuse Language Systems, 1994) is of limited aid. Learners will make random, trial-and-error attempts to correct the reported error. There may be little true improvement and even negative effects if learners make a series of poor attempts at a sound. Such unsupervised repetitions could reinforce poor pronunciation to the point of becoming a hard-to-correct habit (Morley, 1994).
Effective correction requires that recognizer results be interpreted, for example, by putting them into a visually comprehensible form and comparing them to native speech. Our work in FLUENCY suggests that the best way to interpret recognizer results for instruction differs between phone correction and prosody correction. This suggestion stems from the fact that phones differ from one language to another while prosody is produced in the same way across languages. Whereas students must be guided as to tongue and teeth placement for a new phone, they do not need instruction on how to increase pitch if they have normal hearing: they only need to be shown when to increase and decrease it, and by how much.
Correcting Phone Errors
There has been some success in using minimal pairs—contrasting sounds in context in the target language, such as "I want a beet"/"I want a bit" (see Dalby & Kewley-Port, this issue; Wachowicz & Scott, this issue). Effective teachers often go further, with instructions on how to change articulator position and duration. This kind of instruction is important because if a sound does not already belong to a learner's phonetic repertory, the learner will associate it with a close speech sound that is in the repertory. For example, anglophones beginning to speak French typically hear and pronounce the French sound /y/ (in tu) as the English sound /u/ (in "too"), but they can be taught to use lip rounding to approximate French /y/.
Automatic systems can teach articulator placement for new sounds, adding graphical views, for example, of the inside of the mouth (LaRocca, 1994). This instruction can be likened to gymnastics; the learner "feels" when the articulators are correctly in place and practices with the recognizer to confirm this. Learners can train their ears to recognize the new sounds and relate them to what they feel their muscles doing. Akahane-Yamada et al. (1996) suggest that learning to perceive sound distinctions helps in their production.
Phone articulation training can be L1-independent. A target vowel, for example, can be taught by starting with a close cardinal vowel (e.g., /a/, /i/, and /u/ have a high probability of existing in most L1s). A better solution, requiring more computer memory and linguistic knowledge, is to start with a close vowel in the learner's particular L1. Taking into account the learner's L1 can help anticipate errors and point to pertinent articulatory hints (Kenworthy, 1987). Thus, knowing that French has no lax vowels lets the teacher of French-speaking learners of English focus on how to move from a tense vowel to a neighboring lax vowel ("peat" to "pit").
Correcting Prosody Errors
Based on work in FLUENCY, we propose that the visual display, more than oral instructions, will be critical to prosody correction. The key is for learners to see where the curve representing their production differs from the native speaker's curve. Prosody displays can benefit from the wealth of work on automated systems that teach the deaf to speak. For example, Video Voice (Micro Video, 1989) uses histograms to represent intensity (over time) and xy curves for pitch (over time). Duration is implicit in the time axis of the intensity histogram. Video Voice compares what the student says to a native speaker's prerecorded exemplar. For pitch, the student sees the two frequency curves and, guided by hints, tries to increase
or decrease pitch at relevant points to come closer to the exemplar. Trials within the FLUENCY project confirm the importance of visual details to help learners understand the display, for example, using a continuous line as opposed to a divided contour for pitch.
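A minimal sketch of such a display, using matplotlib and assuming precomputed pitch tracks for the learner and a native exemplar:

```python
import matplotlib.pyplot as plt

def plot_pitch_comparison(t_native, f0_native, t_learner, f0_learner):
    """Overlay the learner's pitch curve on a native exemplar so the learner
    can see where to raise or lower pitch; continuous lines are used rather
    than broken contours, following the observation above."""
    plt.plot(t_native, f0_native, linewidth=2, label="native exemplar")
    plt.plot(t_learner, f0_learner, linestyle="--", label="learner")
    plt.xlabel("time (s)")
    plt.ylabel("fundamental frequency (Hz)")
    plt.legend()
    plt.show()
```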
ASR-Based CALL Can Provide Significant Contexts for Language Practice
CALL can simulate authentic contexts using multimedia and multimodal displays in ways discussed elsewhere in this volume (e.g., Rypa & Price; Wachowicz & Scott). Learners can participate in one-to-one conversations with one or more simulated or videotaped interlocutors. The cue for the student to speak can be realistic, such as having a character on the screen turn head and eyes toward the user (or the camera).
ASR-Based CALL Can Put Learners at Ease
The computer can prove the ideal partner for putting a language learner at ease in speaking. Whereas the human teacher judges the student's production, the computer can be viewed as neutral. It can support continual practice of unusual sounds until students have enough confidence to go before others. The system becomes what Wyatt (1988) calls a collaborative tool rather than a facilitative one, with students assuming the role of judges of their own productions. This role not only has pedagogical backing (Celce-Murcia & Goodwin, 1991) but can also benefit system performance. For example, if an exercise requires making a fine phonetic distinction that the recognizer detects poorly, the system can mislead and frustrate the student by giving errant pronunciation scores and, on that basis, deciding what to present next. However, if the system simply displays recognition results without pronunciation scores and allows students to decide whether they did well or need further practice, then ASR-derived error is less problematic. The student gains a sense of control over the chain of events but the teacher can still intervene to insist on more practice.
ASR-Based CALL Can Provide Ongoing Assessment
CALL today can enable rapid, constant assessment of the learner. The system can provide more details more rapidly than a teacher grading tests (Bernstein & Franco, 1995). The feedback given to the teacher can go beyond pronunciation scoring. In traditional computer-aided instruction, learners are scored right or wrong on a given question and the scores
tallied at the end of the session. But for a system that gives visual data to help learners decide where to correct themselves, feedback to the teacher can include learners' own decisions as to their strong and weak points. For example, in a lesson on how to emphasize content words in utterances, if the learner decides to work on duration rather than pitch or amplitude, we can assume either that duration presented more of a problem or that the learner did not have time for the other two aspects. In any case, the teacher who receives the system's report can immediately test progress in the aspect the learner worked on and recommend what to work on in the next session.
Latency of response can also be measured (Bernstein & Franco, 1995) to obtain an even clearer view of where learners are having difficulties. Responses that took more time to formulate can be noted, as can progress in decreasing latencies over a session.
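A sketch of how such latencies might be logged over a session follows; the hook for detecting speech onset is hypothetical.

```python
import time

class LatencyLog:
    """Record how long each response took to begin, so a session report can
    show which items required the most thought and whether latencies
    decrease over the session."""

    def __init__(self):
        self.entries = []                  # (prompt, latency in seconds)

    def time_response(self, prompt, wait_for_speech_onset):
        """wait_for_speech_onset blocks until the learner starts speaking."""
        start = time.monotonic()
        wait_for_speech_onset()
        self.entries.append((prompt, time.monotonic() - start))

    def slowest(self, n=3):
        """Items the learner took longest to formulate."""
        return sorted(self.entries, key=lambda e: -e[1])[:n]
```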
CONCLUSION
Speech-interactive CALL brings to pronunciation instruction a wealth of new, sometimes unforeseen, techniques. Increases in computer memory and storage for expanded exposure to many speakers and for multimedia corrective feedback can reproduce some of the advantages of total immersion learning. There is still much to be done. Teachers and computer scientists need to collaborate more closely to refine ASR-based tools and to invent and validate new teaching methods to build on the advantages of the new medium.
NOTES
1 Although there is not, to our knowledge, quantitative proof of the effectiveness of these recommendations, they are important in teaching methodologies based on immersion (Celce-Murcia & Goodwin, 1991; Krashen, 1982).
2 The nonnative speakers in this study had varying degrees of proficiency in English.
3 If knowledge of the difference between the speaker's L1 and L2 were also used in the form of post-processing heuristics, the system could home in on only the errors that would be relevant to correct for the given speaker.
4 The total number of pitch peaks is defined as the number of local maxima in the fundamental frequency contour over the whole utterance.
REFERENCES
Akahane-Yamada, R., Tohkura, Y., Bradlow, A., & Pisoni, D. (1996). Does training in speech perception modify speech production? In Proceedings of the international conference on spoken language processing. Philadelphia, PA.
Allen, W.S. (1968). Walter and Connie, parts 1-3. British Broadcasting Corporation.
Auralog (1995). AURA-LANG user manual. Voisins le Bretonneux, France: Author.
Bagshaw, P., Hiller, S., & Jack, M. (1993). Computer aided intonation teaching. In Proceedings of Eurospeech '93.
Bernstein, J. (1994). Speech recognition in language education. In F. L. Borchardt & E. Johnson (Eds.), Proceedings of the 1994 annual CALICO symposium: Human factors (pp. 37-41). Durham, NC: CALICO.
Bernstein, J., & Franco, H. (1995). Speech recognition by computer. In N. Lass (Ed.), Principles of experimental phonetics. St. Louis: Mosby.
Bowen, J. D. (1975). Patterns of English pronunciation. Rowley, MA: Newbury House.
Brumfit, C. (1984). Communicative methodology in second language teaching. Cambridge, UK: Cambridge University Press.
Celce-Murcia, M., & Goodwin, J. (1991). Teaching pronunciation. In M. Celce-Murcia (Ed.), Teaching English as a second language. Boston: Heinle & Heinle.
Crookall, D., & Carpenter, R. (Eds.). (1990). Simulation, gaming, and language learning. New York: Harper Collins.
Duncan, C., Bruno, C., & Rice, M. (1995). Learn to speak Spanish: Text and workbook. Hyperglot Software Co. Inc.
Eskenazi, M. (1992). Changing speech styles, speakers' strategies in read speech and careful and casual spontaneous speech. In Proceedings of the international conference on spoken language processing. Banff.
Eskenazi, M. (1996). Detection of foreign speakers' pronunciation errors for second language training—preliminary results. In Proceedings of the international conference on spoken language processing '96.
Hansen, B., Novick, D., & Sutton, S. (1996). Systematic design of spoken prompts. Proceedings of computer human interaction (CHI) '96 (pp. 157-164).
Isard, A., & Eskenazi, M. (1991). Characterizing the change from casual to careful style in spontaneous speech. Journal of the Acoustical Society of America, 89(4), Pt. 2.
Kenworthy, J. (1987). Teaching English pronunciation. New York: Longman.
Krashen, S. (1982). Principles and practice in second language acquisition. New York: Pergamon.
LaRocca, S. (1994). Exploiting strengths and avoiding weaknesses in the use of speech recognition for language learning. CALICO Journal, 12(1), 102-105.
Laroy, C. (1995). Pronunciation. In Resource books for teachers. Oxford: Oxford University Press.
Micro Video Corporation. (1989). Getting started with Video Voice: A follow-along tutorial. Ann Arbor, MI: Author.
Modern Language Materials Development Center. (1964). French 8, audio-lingual materials. New York: Harcourt, Brace and World.
Morley, J. (Ed.). (1994). Pronunciation pedagogy and theory: New views, new directions. Alexandria, VA: TESOL.
Omaggio, A. (1993). Teaching language in context (2nd ed.). Boston: Heinle & Heinle.
Pean, V., Williams, S., & Eskenazi, M. (1993). The design and recording of ICY, a corpus for the study of intraspeaker variability and the characterization of speaking styles. Proceedings of Eurospeech '93 (pp. 627-630).
Ravishankar, M. (1996). Efficient algorithms for speech recognition (Doctoral dissertation, Carnegie Mellon University). Technical Report CMU-CS-96-143.
Richards, J., & Rodgers, T. (1986). Approaches and methods in language teaching. Cambridge: Cambridge University Press.
Rooney, E., Hiller, S., Laver, J., & Jack, M. (1992). Prosodic features for automated pronunciation improvement in the SPELL system. Proceedings of the international conference on spoken language processing (pp. 413-416).
Syracuse Language Systems. (1994). TriplePlayPlus! User's Manual. Random House.
Tajima, K., Dalby, J., & Port, R. (1996). Foreign-accented rhythm and prosody in reiterant speech. Journal of the Acoustical Society of America, 99, 2493.
Tajima, K., Port, R., & Dalby, J. (1994). Influence of timing on intelligibility of foreign-accented English. Journal of the Acoustical Society of America (paper 5pSP2).
Wyatt, D. (1988). Applying pedagogical principles to CALL. In W. F. Smith (Ed.), Modern media in foreign language education. Lincolnwood, IL: National Textbook Company.
AUTHOR'S BIODATA
Dr. Maxine Eskenazi is a Systems Scientist at Carnegie Mellon University. She has had the dual experience of working in the field of automatic speech processing and extensively teaching both French and English as foreign languages, having obtained foreign language teaching accreditation from the state of Pennsylvania. She obtained her Doctorate in Computer Science from the University of Paris 11 and worked for over 15 years at the LIMSI-CNRS laboratory in France as a Chargée de Recherche.