Chinese text: 3,230 characters

Foreign-language translation

Source: Computer Speech and Language (2014)

Translated text:

Web-based possibilistic language models for automatic speech recognition

This paper describes a new kind of language model based on possibility theory. The purpose of these new models is to make better use of the data available on the Web for language modeling. The models aim to integrate information about impossible word sequences. We address the two main problems raised by this kind of model: how to estimate the measure for word sequences, and how to integrate such a model into an automatic speech recognition (ASR) system.

We propose a possibilistic measure for word sequences and a practical estimation method based on word-sequence statistics, which is particularly well suited to estimation from Web data. For using these models in a classical ASR engine, which relies on a probabilistic model of the speech recognition process, we propose several strategies and formulations. This work is evaluated on two typical usage scenarios: broadcast news transcription, with very large training sets, and the transcription of medical videos in a specialized domain, with only very limited training data.

The results show that the possibilistic models yield a significantly lower word error rate on the specialized-domain task, where classical n-gram models fail because of the lack of training material. On broadcast news, the probabilistic models remain better than the possibilistic ones. However, a log-linear combination of the two kinds of models outperforms every model used individually, which indicates that the possibilistic models bring information that the probabilistic models do not capture.

1. Introduction

State-of-the-art large-vocabulary continuous speech recognition (LVCSR) systems are based on n-gram language models estimated on text collections composed of billions of words. These models have proven their efficiency across a wide range of applications, but their accuracy depends on the availability of huge, relevant training corpora, and such large datasets cannot be guaranteed for low-resource languages or for specific domains.

One of the most popular approaches for dealing with this lack of training data consists in collecting text material on the Internet and estimating n-gram statistics on these automatically collected datasets. This approach benefits from two interesting characteristics of the Internet: large coverage and continuous updating.

Coverage relies on the fact that the Web can be viewed as a close-to-infinite corpus in which most linguistic realizations can be found. The Internet provides a linguistic coverage far larger than the text collections usually used for LM training.

Updating is provided by users, who continuously add documents containing new words and new idiomatic forms. This last point has been widely exploited for various aspects of statistical language modeling, typically for new-word discovery, n-gram model adaptation, and the scoring of unseen n-grams.

However, technical issues related to the size and instability of Internet content limit the exploitation of this large coverage and continuous updating for statistical language modeling. The standard approach would be to collect, at regular intervals, all the data available on the Internet and to estimate n-gram models on the resulting corpus. Such a technique is clearly unfeasible; some authors have proposed solutions intended to make the estimation of huge LMs tractable: Guthrie and Hepple (2010) tackled the problem of reducing the memory occupied by sparse n-gram models; fast smoothing techniques were proposed in Brants et al. (2007); and approaches based on distributed data storage and processing were published in Ghemawat et al. (2003), Chang et al. (2006), and others. In the end, even though software and hardware technologies keep advancing, training an up-to-date LM on the entire Web content remains a challenging problem.

Another problem is related to the distribution of word sequences on the Web. These distributions are poorly reliable, owing to the diversity of document sources, the variability of production and usage contexts, and other factors. They are not only unreliable; they would also fail to match a targeted application context, which determines the potential topics, speaking styles, language register, and so on.

Considering these practical and theoretical limits of using the whole Web, most previous studies extracted relevant, tractable Web subsets, which were then used as conventional corpora for estimating n-gram statistics. Such corpora are obtained by automatically querying search engines. The query-composition technique determines the accuracy of the corpus in terms of coverage, language style, etc. Unfortunately, queries are based either on prior knowledge or on the automatic extraction of a domain-related description, which may be incomplete or inaccurate. Moreover, independently of the query-composition technique, the retrieved data depends on the search strategies used inside the commercial engines, which may be fully or partially confidential.

Even though these methods have been applied successfully in various application contexts, some authors have tried to draw further benefit from the specificities of the Web by using dynamic n-gram estimation approaches. In Berger and Miller (1998), a just-in-time adaptation process was proposed, based on an on-line analysis of the document topic and fast LM updating. In Zhu and Rosenfeld (2001), the authors proposed a back-off technique that estimates the probability of a word sequence by counting the number of Web documents containing it. This count is the number of hits returned by a search engine queried with the targeted word sequence. That paper focused on adapting an LM to a specialized domain, but it introduced the idea of using a search engine for the dedicated estimation of linguistic scores. We extended this idea in Oger et al. (2009a), where we proposed an efficient method for using Web search engine hit counts as probabilities in an automatic speech recognition system. Such dedicated n-gram estimation provides up-to-date statistics, but it does not solve the problem of the reliability of Web statistics. To solve this problem, we proposed in Oger et al. (2009b) language models that consider whether word sequences exist at all rather than how often they occur. These models are based on possibility theory, which provides a theoretical framework for handling uncertainty, and we propose to estimate them through Web queries.

In most situations, probability-based language models perform well, especially on high- and medium-frequency events. The estimation of low-frequency event probabilities essentially relies on a back-off or smoothing strategy, which leads to less reliable probabilities. The proposed possibilistic language models operate only on these low-frequency events, by measuring their plausibility, something that the smoothing and back-off techniques usually employed to estimate the probabilities of such events do not actually measure. The possibilistic language models are therefore not proposed as a replacement for probability-based language models, but as a complement for the situations in which probabilistic models are unreliable, namely low-frequency events. The goal of the possibilistic language models is to estimate the plausibility of these low-frequency events, in order to filter them out when the main language model erroneously assigns them a higher probability than it should.
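To make this filtering role concrete, here is a minimal Python sketch, not the formulation developed in this paper: the possibility measure is reduced to a binary observed/unobserved test, the occurrence counts come from a toy table standing in for a corpus index or Web hit counts, and the penalty value is an arbitrary placeholder.

# Minimal sketch: penalizing ASR hypotheses that contain n-grams
# judged impossible by a binary possibility measure.

# Toy occurrence counts; a real system would query a corpus index
# or a Web search engine (hypothetical data source).
COUNTS = {
    ("the", "cat", "sat"): 42,
    ("cat", "sat", "on"): 17,
}

def possibility(ngram):
    # Binary possibility: 1.0 if the n-gram was ever observed, else 0.0.
    return 1.0 if COUNTS.get(tuple(ngram), 0) > 0 else 0.0

def rescore(words, lm_logprob, order=3, penalty=-20.0):
    # Keep the smoothed n-gram LM score, but add a fixed log-domain
    # penalty for every n-gram the possibilistic measure rejects.
    score = lm_logprob
    for i in range(len(words) - order + 1):
        if possibility(words[i:i + order]) == 0.0:
            score += penalty
    return score

# The unseen trigram ("sat", "on", "hte") triggers the penalty.
print(rescore(["the", "cat", "sat", "on", "hte"], lm_logprob=-12.3))

The binary variant only shows where such a measure intervenes: on events that the smoothed probabilistic model still scores, but cannot score reliably.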

This paper presents an in-depth study of possibilistic language models. We state our motivations and the theoretical foundations of these models, present a method for empirically estimating possibilities, and describe new ways of integrating them into an ASR system. The possibilistic models are compared and combined with classical n-gram language models estimated both on the Web and on conventional text corpora.

We conducted experiments on two tasks: broadcast news transcription, for which large amounts of training material are available, and the transcription of medical videos, which are typically dedicated to training surgeons. The latter application context corresponds to a very specialized domain with very few resources available.

The rest of this paper is organized as follows. Section 2 provides a step-by-step description of possibilistic Web language models, starting from classical corpus-based probabilistic models. Section 3 presents various strategies for integrating possibilistic language models into a statistical ASR system. Section 4 describes the experimental setup and the comparative experiments. Finally, Section 5 concludes and offers some perspectives.

2. From corpus probabilities to Web possibilities

In this section, we present new approaches for improving language modeling by using a new data source, the Web, and a new theoretical framework, possibility theory. We first describe the classical corpus-based probabilistic language models used in most state-of-the-art speech recognition systems. We then introduce a new approach for estimating these probabilities from the Web. Finally, we propose to use concepts from possibility theory to build a new measure that can be estimated on the Web as well as on conventional closed corpora: the possibilistic measure.
2.1 Corpus-based probabilities

In the ASR domain, language models are mainly designed to estimate the prior probability P(W) of a word sequence W:

W = (w_1, w_2, \dots, w_n), \quad w_i \in V

This probability can be decomposed into a product of conditional probabilities:

P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1})    (1)

This formula assumes that a word w_i can be predicted only from the preceding word sequence. Globally, an n-gram model consists of a collection of conditional probabilities that the ASR engine uses to predict a word given a partial transcription hypothesis.

As expressed in Eq. (1), a word's probability depends on its whole linguistic history. In practice, such long-term dependencies cannot be estimated, owing to complexity and to the limits of the corpus: the amount of training data required to estimate such long word sequences would be huge, and a direct estimate of high-order n-gram statistics (n > 6) is usually impossible. Therefore, most state-of-the-art ASR systems use only 4-gram or 5-gram models.

Some alternative approaches to linguistic scoring have been proposed to make the estimation of long-sequence probabilities feasible, mainly using neural networks, which provide efficient (but implicit) inference and smoothing mechanisms. Nevertheless, the ideal situation would be to estimate accurate probabilities directly on an exhaustive corpus in which all possible sentences could be found. The ASR problem could then be viewed as the search for the correct transcription within a closed collection of text documents.

Original text:

Web-based possibilistic language models for automatic speech recognition

This paper describes a new kind of language model based on the possibility theory. The purpose of these new models is to better use the data available on the Web for language modeling. These models aim to integrate information relative to impossible word sequences. We address the two main problems of using this kind of model: how to estimate the measures for word sequences and how to integrate this kind of model into the ASR system.

We propose a word-sequence possibilistic measure and a practical estimation method based on word-sequence statistics, which is particularly suited for estimating from Web data. We develop several strategies and formulations for using these models in a classical automatic speech recognition engine, which relies on a probabilistic modeling of the speech recognition process. This work is evaluated on two typical usage scenarios: broadcast news transcription with very large training sets, and transcription of medical videos in a specialized domain with only very limited training data.

The results show that the possibilistic models provide significantly lower word error rates on the specialized domain task, where classical n-gram models fail due to the lack of training materials. For the broadcast news, the probabilistic models remain better than the possibilistic ones. However, a log-linear combination of the two kinds of models outperforms all the models individually, which indicates that possibilistic models bring information that is not modeled by probabilistic ones.

Keywords: speech processing; language modeling; possibility theory

1. Introduction

State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems rely on n-gram language models that are estimated on text collections composed of billions of words. These models have demonstrated their efficiency in a wide scope of applications, but their accuracy depends on the availability of huge and relevant training corpora that may not be available, for instance for low-resource languages or for specific domains.

One of the most popular approaches for dealing with this lack of training data consists in collecting text material on the Internet and in estimating classical n-gram statistics on these automatically collected datasets (Kemp and Waibel, 1998; Bulyko et al., 2003). This approach benefits from two interesting characteristics of the Internet: large coverage and continuous updating.

Coverage relies on the fact that the Web may be viewed as a close-to-infinite corpus, where most of the linguistic realizations may be found. The Internet provides a linguistic coverage significantly larger than the text corpora usually involved in LM training (Keller and Lapata, 2003).

Updating is provided by users who continuously add documents containing new words and new idiomatic forms. This last point was largely exploited for various aspects of statistical language modeling, typically for new-word discovery (Asadi et al., 1990; Bertoldi and Federico, 2001; Allauzen and Gauvain, 2005), n-gram model adaptation, unseen n-gram scoring (Keller and Lapata, 2003), etc.

Nevertheless, exploiting the large coverage and the updating for statistical language modeling is limited by technical issues related to the size and instability of the Internet contents. The standard approach would be to regularly collect all data available on the Internet and to estimate n-gram models on the resulting corpus. Such a technique is clearly unfeasible; some authors proposed solutions that are supposed to enable the estimation of huge LMs: Guthrie and Hepple (2010) tackled the problem of reducing the memory occupied by sparse n-gram models; fast smoothing techniques were proposed in Brants et al. (2007); and approaches based on distributed data storage and processing were published in Ghemawat et al. (2003), Chang et al. (2006), etc. Finally, even though software and hardware technologies keep progressing, training an up-to-date LM on the whole Web content remains a challenging problem.

Another issue is related to word-sequence distributions on the Web. They are poorly reliable due to the diversity of the document sources, the variability of production and usage contexts, etc. Distributions are not only unreliable, but they would also not match a targeted application context, which determines the potential topics, speech styles, language level, etc.

Considering these practical and theoretical limits of using the whole Web, most of the previous studies consisted in extracting relevant and tractable Web subsets, which are used as classical corpora for estimating n-gram statistics. Corpora are obtained by automatically querying search engines (Monroe et al., 2002; Wan and Hain, 2006; Lecorve et al., 2008). The query-composing technique determines the corpus accuracy in terms of coverage, language styles, etc. Unfortunately, querying is based on prior knowledge or on the automatic extraction of a domain-related description, which may be incomplete or inaccurate. Moreover, independently of the query-composing technique, the retrieved data depends on the search strategies used in the commercial engines, which may be fully or partially confidential.

Even if these methods were successfully applied in various application contexts, some authors tried to get further benefit from the Web specificities by using dynamic approaches of n-gram estimation. In Berger and Miller (1998), a just-in-time adaptation process, based on an on-line analysis of the document topic and fast LM updating, is proposed. In Zhu and Rosenfeld (2001), the authors proposed a back-off technique which estimates a word-sequence probability by counting the number of Web documents that contain the sequence. This count is the number of hits returned by a search engine queried with the targeted word sequence. That paper focused on adapting an LM to a specialized domain, but it introduced the idea of using a search engine for the dedicated estimation of linguistic scores. We extended this idea in Oger et al. (2009a), where we proposed an efficient method for using Web search engine hit counts as probabilities in an ASR system. Such dedicated n-gram estimation provides up-to-date statistics, but it does not solve the problem of the reliability of Web statistics. To tackle this problem, we proposed in Oger et al. (2009b) language models that consider the existence of word sequences rather than their frequency. These models are based on the possibility theory, which provides a theoretical framework for handling uncertainty, and we propose to estimate them through Web querying.
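As a rough illustration of this counting idea, the sketch below estimates a conditional probability from document hit counts, backing off to a shorter history when a sequence is unseen. This is a minimal sketch under simplifying assumptions: the hit counts come from a toy table standing in for search-engine queries, and the fixed back-off discount is a placeholder rather than the actual scheme of Zhu and Rosenfeld (2001).

# Minimal sketch: P(word | history) from Web-style document counts,
# with a naive back-off to shorter histories for unseen sequences.

# Toy hit counts standing in for search-engine document counts.
HITS = {
    "new york": 1_000_000,
    "new york city": 350_000,
    "york city": 400_000,
}

def hits(phrase):
    return HITS.get(phrase, 0)

def web_prob(word, history, backoff=0.4):
    # Relative counts: hits(history + word) / hits(history), with a
    # fixed discount applied each time the history is shortened.
    discount = 1.0
    while history:
        joint = hits(" ".join(history + [word]))
        marg = hits(" ".join(history))
        if joint > 0 and marg > 0:
            return discount * joint / marg
        history = history[1:]  # back off: drop the oldest history word
        discount *= backoff
    return discount * 1e-9     # floor for sequences never seen at all

print(web_prob("city", ["new", "york"]))  # 350000 / 1000000 = 0.35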

Probability-based language models perform well in most situations, especially on high- and medium-frequency events. Low-frequency event probability estimation generally relies on a back-off or smoothing strategy, which leads to less reliable probabilities. The proposed possibilistic language models only operate on these low-frequency events, by measuring their plausibility, which is not actually measured by the smoothing and back-off techniques used to estimate the probabilities of these events. The possibilistic language models are therefore not proposed as a replacement for probability-based language models, but as a complement for the situations in which probabilistic models are unreliable, namely low-frequency events. The goal of the possibilistic language models is to estimate the plausibility of these low-frequency events, in order to filter them out when the main language model erroneously assigns them a higher probability than it should.

This paper presents an in-depth study of possibilistic language models. We will state motivations and theoretical foundations of these models, as well as present a method for empirically estimating possibilities and new ways to integrate them into an ASR system. Possibilistic models are compared and combined with classical n-gram probabilities estimated on both Web and classical text corpora.

Experiments are conducted on two tasks: broadcast news transcription, for which large training materials are available, and transcription of medical videos that are dedicated to training surgeons. The latter application context corresponds to a very specialized domain with only low resources available.

The rest of the paper is organized as follows. The next section proposes a step-by-step description of possibilistic Web models, starting from classical corpus probabilistic models. Section 3 presents various strategies for the integration of possibilistic language models into a statistical ASR system. Section 4 describes the experimental setup and the comparative experiments that were conducted. Finally, Section 5 concludes and proposes some perspectives.

2. From corpus probabilities to Web possibilities

In this section, we present new approaches for improving language modeling by using a new data source, the Web, and a new theoretical framework, the possibility theory. We first describe the classical corpus-based probabilistic language models, as used in most state-of-the-art speech recognition systems. Then, we introduce a new approach for estimating these probabilities from the Web. Finally, we propose to use concepts from the possibility theory for building a new measure that can be estimated on the Web as well as on classical closed corpora: the possibilistic measure.

2.1 Corpus-based probabilities

In the ASR domain, language models are mainly designed with the purpose of estimating the prior probability P(W) of a word sequence W:

W = (w_1, w_2, \dots, w_n), \quad w_i \in V

This probability may be decomposed as the product of conditional probabilities:

P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1})    (1)

This formula assumes that a word w_i can be predicted only from the preceding word sequence. Globally, n-gram models consist in a collection of conditional probabilities that will be used, in the ASR engine, for the prediction of a word given a partially transcribed hypothesis.
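The decomposition in Eq. (1), with each history truncated to the n-1 preceding words as in an n-gram model, can be illustrated in a few lines of Python. This is a minimal sketch with an invented probability table, not an actual estimation procedure:

import math

# Toy conditional probabilities P(w | truncated history); in a real
# system these come from n-gram estimation on a training corpus.
PROBS = {
    (("<s>",), "the"): 0.20,
    (("the",), "cat"): 0.01,
    (("cat",), "sat"): 0.05,
}

def sentence_logprob(words, order=2):
    # log P(W) via the chain rule, keeping only the last order-1
    # words of the history, as an n-gram model does.
    logp = 0.0
    history = ["<s>"]
    for w in words:
        h = tuple(history[-(order - 1):]) if order > 1 else ()
        logp += math.log(PROBS.get((h, w), 1e-6))  # tiny floor for unseen events
        history.append(w)
    return logp

print(sentence_logprob(["the", "cat", "sat"]))  # ln(0.20) + ln(0.01) + ln(0.05)

The truncation of the history is what keeps the collection of conditional probabilities finite and estimable, as the next paragraph explains.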

As expressed in Eq. (1), a word probability depends on the whole linguistic history. In practice, such long-term dependencies cannot be estimated, due to complexity and to the limits of the corpus: the amount of training data required for estimating such long sequences would be huge, and it is usually impossible to perform a direct estimate of n-gram statistics of high order (n > 6). Therefore, most state-of-the-art ASR systems use only 4- or 5-gram models.

Some alternative approaches for linguistic scoring were proposed to enable the estimation of long-sequence probabilities, mainly with neural networks that offer efficient (but implicit) inference and smoothing mechanisms (Bengio et al., 2006; Mnih and Hinton, 2007). Nevertheless, the ideal situation would be to estimate directly accurate probabilities on an exhaustive corpus in which all possible sentences would be found. The ASR problem could then be viewed as the search for the correct transcription within a closed collection of text documents.
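The abstract reports that a log-linear combination of the probabilistic and possibilistic scores outperforms each model used alone. As a rough illustration of what such a combination could look like, here is a minimal sketch; the weight, the floor, and the exact functional form are placeholder assumptions, not the formulation the paper develops for its integration strategies.

import math

def loglinear_score(lm_logprob, possibility, lam=0.8, floor=1e-9):
    # Weighted log-linear combination of the n-gram LM log probability
    # and a possibilistic measure pi(W) in [0, 1]; `lam` would be tuned
    # on development data.
    return lam * lm_logprob + (1.0 - lam) * math.log(max(possibility, floor))

# A hypothesis judged impossible (possibility ~ 0) is heavily demoted
# even when the smoothed n-gram model gives it a decent probability.
print(loglinear_score(-9.2, possibility=1.0))
print(loglinear_score(-8.5, possibility=0.0))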
