<p> 2,600 words; 13,000 English characters; 4,400 Chinese characters</p>
<p> Source: Kumar A, Raj B. Features and kernels for audio event recognition[J]. arXiv preprint arXiv:1607.05765, 2016.</p>
<p><b> Graduation Project: Foreign-Language Literature Translation</b></p>
<p><b> (Original Text and Translation)</b></p>
<p> Original title: Features and Kernels for Audio Event Recognition</p>
<p> Student name </p>
<p> Student ID
 </p>
<p> Supervisor </p>
<p> Department </p>
<p> Major </p>
<p> 7 March 2017</p>
<p> Features and Kernels for Audio Event Recognition</p>
<p> A. Kumar, B. Raj</p>
<p><b> Abstract</b></p>
<p> One of the most important problems in audio event detection research is the absence of benchmark results against which a proposed method can be compared. Different works consider different sets of events and datasets, which makes it difficult to comprehensively compare a novel method with an existing one. In this paper we propose to establish results for audio event recognition on two recent publicly available datasets. In particular, we use Gaussian mixture model based feature representations and combine th</p>
<p> Index Terms: Audio Event Detection, Audio Content Analysis.</p>
<p> 1. Introduction</p>
<p> In recent years automatic
content analysis of audio recordings has been gaining attention in the audio research community. The goal is to develop methods which can automatically detect the presence of different kinds of audio events in a recording. Audio event detection (AED) research is driven by its application in several areas. These include areas such as multimedia information retrieval or multimedia event detection [1], where the audio component contains important information about the con</p>
<p> All of the methods have been shown to be reasonably effective, up to an extent, on the datasets and events on which they were evaluated. However, a major problem is
that we fail to understand how they compare against each other. Most of them use different datasets or sets of sound events, and given the variations we encounter across datasets and, more importantly, across sound events, it is difficult to assess different methods. Another factor lacking in several of the current works is analysis of the </p>
<p> Recently, the audio community has attempted to address this problem by creating standard sound event databases meant for audio event detection tasks [22] [23] [24]. The UrbanSounds8k [22] dataset contains 10 different sound events with a total of about 8.75 hours of data. The ESC-50 [23] dataset contains labeled data for 50 sound events and a total of 2.8 hours of audio. An important goal of our work is to perform a comprehensive analysis of these two datasets for audio event detection. These t</p>
<p> The ESC-50 dataset also allowed us to study audio event detection at different levels of the event hierarchy. The 50 events of ESC-50 broadly belong to 5 different higher semantic categories. We report and analyze results for both lower-level events and higher semantic categories. </p>
<p> It is important to note that our analysis on these two datasets is very different from the simple multi-class classification analysis done by the authors of these datasets. In particular, we are interested in audio event detection, where our goal is to detect the presence or absence (binary) of an event in a given recording. Our analysis is in terms of a ranked measure (AP)
for each event as well as the area under the detection error curve for each event.</p>
<p> The rest of the paper is organized as follows. In Section 2 we provide a brief description of the ESC-50 and Urban Sounds8k datasets. We describe the features and kernels used for audio event detection in Section 3 and report results in Section 4. Section 5 concludes our work.</p>
<p> 2. Datasets</p>
<p> Urban Sounds8k: This dataset contains a total of 8732 audio clips over 10 different sound event classes. The event classes are air conditioner (AC), car horn (CH), children playing (CP), dog barking (DB), drilling (DR), engine idling (EI), gun shot (GS), jackhammer (JK), siren (SR) and street music (SM). The total duration of the recordings is about 8.75 hours. The length of each audio clip is less than or equal to 4 seconds. The dataset comes presorted into 10 folds, and we use the same fold structure in our experiments.</p>
<p> ESC-50: ESC-50 contains a total of 2000 audio clips over 50 sound events. Example sound classes
are dog barking, rain, crying baby, fireworks, clock ticking, etc. The full list of sound events can be found in [23]. The 50 events in the dataset belong to 5 groups, namely animals (e.g. dog, cat), natural soundscapes (e.g. rain, sea waves), non-speech human sounds (e.g. sneezing, clapping), interior domestic sounds (e.g. door knock, clock), and exterior sounds (e.g. helicopter, siren). The t</p>
<p> For both datasets we convert all audio to a single channel and resample it to a 44.1 kHz sampling frequency.</p>
<p> 3. Features and Kernels</p>
<p> 3.1. Features</p>
<p> Before we can go
into the details of the different feature representations, we need a low-level characterization of audio signals. We use Mel-Frequency Cepstral Coefficients (MFCC) as low-level representations of audio signals. Let these D-dimensional MFCC vectors for a recording be represented as $\vec{x}_t$, where t = 1 to T, T being the total number of MFCC vectors for the recording. </p>
<p> Broadly, we employ two higher-level feature representations for characterizing audio events. Both of these features attempt to capture the distribution of the MFCC vectors of a recording. We will refer to these as α and β features, and the subtypes will be represented using appropriate
subscripts and superscripts.</p>
<p> The first step to obtain a higher-level fixed-dimensional representation is to train a GMM on the MFCC vectors of the training data. Let us represent this GMM by $G = \{w_k, \mathcal{N}(\vec{\mu}_k, \Sigma_k),\ k = 1 \text{ to } M\}$, where $w_k$, $\vec{\mu}_k$ and $\Sigma_k$ are the mixture weight, mean and covariance parameters of the $k$-th Gaussian in $G$. We will assume diagonal covariance matrices for all Gaussians, and $\vec{\sigma}_k$ will represent the diagonal vector of $\Sigma_k$.</p>
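<p> As a rough illustration of how a diagonal-covariance GMM can turn a variable-length sequence of MFCC vectors into a fixed-dimensional vector, the sketch below computes a soft-count histogram: the posterior probability of each Gaussian, averaged over all frames of a recording. It is a sketch under stated assumptions and not the authors' implementation: the function names are hypothetical, the GMM parameters are assumed to be already trained, and dividing by the number of frames is only one plausible normalization.</p>

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def soft_count_histogram(frames, weights, means, variances):
    """Average posterior of each of the M Gaussians over all frames.

    Assumption: normalizing by the frame count T is an illustrative choice.
    """
    num_components = len(weights)
    hist = [0.0] * num_components
    for x in frames:
        # log(w_k * N(x; mu_k, sigma_k)) for every component k
        log_joint = [
            math.log(w) + log_gauss_diag(x, mu, var)
            for w, mu, var in zip(weights, means, variances)
        ]
        # log-sum-exp keeps the normalization numerically stable
        top = max(log_joint)
        log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint))
        for k in range(num_components):
            hist[k] += math.exp(log_joint[k] - log_norm)
    return [h / len(frames) for h in hist]

# Toy data: a 2-component GMM over 2-dimensional stand-ins for MFCC frames
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [4.9, 5.3], [5.2, 4.8], [0.0, 0.3]]
hist = soft_count_histogram(frames, weights, means, variances)
```

<p> The result always has M entries regardless of the number of frames T, which is what allows recordings of different lengths to be compared by a single classifier.</p>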
<p> 3.2. Kernels</p>
<p> As stated before, we use SVM [29] classifiers. We analyze different kernel functions for training SVMs on the above features. The first one is the linear kernel (LK). </p>
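<p> The linear kernel, together with the RBF and exponential Chi-square kernels that this subsection goes on to use, can be sketched in a few lines. The code below is a minimal sketch under stated assumptions: the function names are hypothetical, the scale parameter gamma and its default of 1.0 are assumptions, and the Chi-square distance is written without a factor of 1/2, which is one common convention and may differ from the one used by the authors.</p>

```python
import math

def linear_kernel(f_i, f_j):
    """Inner product of two feature vectors (LK)."""
    return sum(a * b for a, b in zip(f_i, f_j))

def squared_euclidean(f_i, f_j):
    """Squared L2 distance; plugged into exp(-gamma * D) it gives the RBF kernel."""
    return sum((a - b) ** 2 for a, b in zip(f_i, f_j))

def chi_square_distance(f_i, f_j, eps=1e-12):
    """Sum over dimensions of (a - b)^2 / (a + b) for non-negative histograms.

    Dimensions where a + b is (numerically) zero are skipped.
    """
    total = 0.0
    for a, b in zip(f_i, f_j):
        s = a + b
        if s > eps:
            total += (a - b) ** 2 / s
    return total

def exp_distance_kernel(f_i, f_j, distance, gamma=1.0):
    """General form exp(-gamma * D(f_i, f_j)); gamma is an assumed parameter."""
    return math.exp(-gamma * distance(f_i, f_j))

def gram_matrix(features, kernel):
    """Pairwise kernel values for a list of feature vectors."""
    return [[kernel(a, b) for b in features] for a in features]

# Two toy 2-bin histograms
h1 = [0.5, 0.5]
h2 = [1.0, 0.0]
K = gram_matrix(
    [h1, h2],
    lambda a, b: exp_distance_kernel(a, b, chi_square_distance, gamma=1.0),
)
```

<p> A Gram matrix such as K could then be handed to any SVM implementation that accepts precomputed kernels. Keeping the distance function separate from the exponential wrapper means the RBF and Chi-square variants share one code path.</p>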
<p> The other two kernels can be described in a general form for feature vectors $\vec{f}_i$ and $\vec{f}_j$ by $K(\vec{f}_i, \vec{f}_j) = \exp(-\gamma D(\vec{f}_i, \vec{f}_j))$. For $D(\vec{f}_i, \vec{f}_j) = \lVert \vec{f}_i - \vec{f}_j \rVert_2^2$ we obtain the radial basis function kernel (RK). For histogram features (α features), exponential Chi-square distance kernels (CK) are known to work well for vision tasks [30]. The Chi-square distance [31] is given by $D(\vec{f}_i, \vec{f}_j) = \sum_{p=1}^{P} \frac{(f_i^p - f_j^p)^2}{f_i^p + f_j^p}$, where $P$ is the dimensionality of the feature vectors $\vec{f}$. We use this kernel only for histogram features (α). Linear and RBF kernels are used for all features. It is worth noting that β features have been designed specifically for linear kernels. In our experiments we try β with RBF kernels as</p>
<p> Table 1: Mean Average Precision for different cases (ESC-50)</p>
<p> Table 2: Mean AUC for different cases (ESC-50)</p>
<p> 4. Experiments and Results</p>
<p> We report results on both Urban Sounds8k and
ESC-50 datasets. First, we divide the datasets into 10 folds. Urban Sounds8k already comes with presorted folds and we follow the same fold structure. Nine folds are used as training data and the 10th fold is used for testing. This is done in all 10 ways, such that each fold becomes test data once. This allows us to obtain results on the entire dataset, and aggregate results over the entire dataset are reported here. 20-dimensional MFCC vectors are extracted for e</p>
<p> ESC-50: Table 1 shows MAP values over all 50 events of ESC-50. It shows MAP for different M values, features and kernels. Table 2 shows the mean AUC values for different cases. We
can observe that for α features, the exponential Chi-square kernel (CK) is much superior to the linear (LK) and RBF (RK) kernels. There is an absolute improvement of 20-30% in MAP using the exponential Chi-square kernel. The MAP also improves with increasing M. Among the […] features, […] seems to be the best variant. Adding adapt</p>
<p> Figure 1: ESC-50: Event AP values for α + CK (BLUE) and […] + LK (RED); Bottom Right Corner - MAP over Higher Semantic Categories</p>
<p> Figure 1 (bottom right) also shows MAP over each higher semantic category. It can be observed that animal sounds are the easiest to detect and interior sounds are the hardest. The maximum difference between […] and […] features is for non-speech human sounds. Mean AUC results show a more or less similar trend.</p>
<p> Table 3: Mean AP for different cases (UrbanSounds8k)</p>
<p>
 Table 4: Mean AUC for different cases (UrbanSounds8k)</p>
<p> Urban Sounds8k: Table 3 and Table 4 show MAP and MAUC values, respectively, on the Urban Sounds8k dataset for different features and kernels. […] and […] have been removed from the analysis due to their inferior performance. Once again we notice that α features with the exponential Chi-square kernel are in general the best feature and kernel pair. Interestingly, in this case α with CK is better than all […] features for lower M values as well. At low M the MAP gain is between 3-8%, whereas for large M the difference goes up to 16-17% in absolute terms. The primary reason for this is the fact that for […] features performance goes down as we increase M, whereas for […] features the performance goes up. Among the […] features, […] is in general better than […], though for M = 32 and M = 64, under a suitable kernel, both features
give more or less similar performance. However, we had previously observed for the ESC-50 dataset that […] is always better than […]. Another new observation for […] on the Urban Sounds8k dat</p>
<p> Figure 2: Urban Sounds8k: Left - Event AP values, Right - Event AUC values (α + CK (BLUE) and […] + LK (RED)).</p>
<p> Figure 3: ESC-50 and Urban Sounds8k: Left - Event AP values, Right - Event AUC values (α + CK (BLUE) and […] + LK (RED)).</p>
<p> Event-wise results (AP and AUC) for the 10 events of Urban Sounds8k are shown in Figure 2. The event names have been encoded in the figure as per the description in Section 2. Events like Dog Barking, Siren and Gun Shot are again easy to detect. The source of the original audio for both datasets is Freesound [34]. This allows us to make general comments on events which are common to both datasets, namely Car Horn, Dog
Bark, Engine and Siren. The results are shown in Figure 3. We observe that we do </p>
<p> 5. Conclusions</p>
<p> In this work we have performed a comprehensive analysis of audio event detection using two publicly available datasets. Overall, we presented results for 56 unique events across the two datasets. Our analysis included two broad sets of features and overall 5 different feature types. The features are
GMM-based representations. The classifier side included SVMs with 3 different kernel types. </p>
<p> The ESC-50 dataset also allowed us to study audio event detection at lower and higher semantic levels. We observed that events in the Animal group are in general easier to detect. Interior sound events, which include events such as Washing Machine, Vacuum Cleaner and Clock Ticking, are relatively harder to detect. Together, both datasets allowed us to study audio event detection in an exhaustive manner for the features and kernels used in this work. This is potentially very helpful in standardizing AED m</p>
<p> 6. References</p>
<p> [1] P. Over, G. Awad, M. Michel, J. Fiscus, W. Kraaij, A. Smeaton, G. Quenot, and R. Ordelman, “TRECVID 2015 – an overview of the goals, tasks, data, evaluation mechanisms and metrics,” in Proceedings of TRECVID 2015. NIST, USA, 2015.</p>
<p> [2] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and gunshot detection and localization for audio-surveillance systems,” in Advanced Video and Signal Based Surveillance, IEEE Conference on, 2007, pp. 21–26.</p>
<p> [3] D. Stowell and M. D. Plumbley, “Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning,” PeerJ, 2014.</p>
<p> [4] J. Ruiz-Muñoz, M. Orozco-Alzate, and G. Castellanos-Dominguez, “Multiple instance learning-based birdsong classification using unsupervised recording segmentation,” in Proceedings of the 24th Intl. Conf. on Artificial Intelligence, 2015.</p>
<p> [5] D. Battaglino, A. Mesaros, L. Lepauloux, L. Pilati, and N. Evans, “Acoustic context recognition for mobile devices
using a reduced complexity SVM,” in Signal Processing Conference (EUSIPCO), 2015 23rd European. IEEE, 2015, pp. 534–538.</p>
<p> [6] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, “Audio-based context recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, no. 1, pp. 321–329, 2006.</p>
<p> [7] X. Zhuang, X. Zhou, M. A. Hasegawa-Johnson, and T. S. Huang, “Real-world acoustic event detection,” Pattern Recognition Letters, vol. 31, no. 12, pp. 1543–1551, 2010.</p>
<p> [8] S. Pancoast and M. Akbacak, “Bag-of-audio-words approach for multimedia event classification,” in Interspeech, 2012.</p>
<p> [9] A. Plinge, R. Grzeszick, G. Fink et al., “A bag-of-features approach to acoustic event detection,” in IEEE ICASSP, 2014.</p>
<p> [10] A. Kumar, R. Hegde, R. Singh, and B. Raj, “Event detection in short duration audio using Gaussian mixture model and random forest classifier,” in Signal Processing Conference (EUSIPCO), 2013 Proceedings of the 21st European. IEEE, 2013, pp. 1–5.</p>
<p> [11] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, “Reliable detection of audio events in highly noisy environments,” Pattern Recognition Letters, vol. 65, pp. 22–28, 2015.</p>
<p> [12] K. Ashraf, B. Elizalde, F. Iandola, M. Moskewicz, J. Bernd, G. Friedland, and K. Keutzer, “Audio-based multimedia event detection with DNNs and sparse sampling,” in Proc. of the 5th ACM International Conference on Multimedia Retrieval, 2015.</p>
<p> [13] O. Gencoglu, T. Virtanen, and H. Huttunen, “Recognition of acoustic events using deep neural networks,”
in Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European. IEEE, 2014, pp. 506–510.</p>
<p> [14] A. Kumar, P. Dighe, R. Singh, S. Chaudhuri, and B. Raj, “Audio event detection from acoustic unit occurrence patterns,” in IEEE ICASSP, 2012, pp. 489–492.</p>
<p> [15] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Sparse representation based on a bag of spectral exemplars for acoustic event detection,” in IEEE ICASSP, 2014, pp. 6255–6259.</p>
<p> [16] A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, “Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations,” in ICASSP. IEEE, 2015, pp. 606–618.</p>
<p> [17] A. Kumar and B. Raj, “Audio event detection using weakly labeled data,” in 24th ACM International Conference on Multimedia. ACM Multimedia, 2016.</p>
<p> [18] A. Kumar and B. Raj, “Weakly supervised scalable audio content analysis,” in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016.</p>
<p> [19] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results,” http://www.pascalnetwork.org/challenges/VOC/voc2011/workshop/index.html.</p>
<p> [20] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.</p>
<p> [21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.</p>
<p> [22] J. Salamon, C. Jacoby, and J. P. Bello,
“A dataset and taxonomy for urban sound research,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 1041–1044.</p>
<p> [23] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. ACM, 2015, pp. 1015–1018.</p>
<p> [24] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 24th European Signal Processing Conference, 2016.</p>
<p> [25] A. Kumar. AED. http://www.cs.cmu.edu/%7Ealnu/AED.htm Copy and paste in browser if clicking does not work.</p>
<p> [26] J. Gauvain and C. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” Speech and audio processing, IEEE Trans. on, 1994.</p>
<p> [27] F. Bimbot et al., “A tutorial on text-independent speaker verification,” EURASIP journal on applied signal processing, vol. 2004, pp. 430–451, 2004.</p>
<p> [28] W. Campbell, D. Sturim, and A. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” Signal Processing Letters, IEEE, pp. 308–311, 2006.</p>
<p> [29] T. Hastie, R. Tibshirani, J. Friedman, and J.
Franklin, “The elements of statistical learning: data mining, inference and prediction,” The Mathematical Intelligencer, vol. 27, no. 2, pp. 83–85, 2005.</p>
<p> [30] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: A comprehensive study,” International journal of computer vision, vol. 73, no. 2, pp. 213–238, 2007.</p>
<p> [31] P. Li, G. Samorodnitsky, and J. Hopcroft, “Sign Cauchy projections and chi-square kernel,” in Advances in Neural Information Processing Systems, 2013, pp. 2571–2579.</p>
<p> [32] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.</p>
<p> [33] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and
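M. Przybocki, “The DET curve in assessment of detection task performance,” DTIC Document, Tech. Rep., 1997.</p>
<p> [34] Free Sound, “https://freesound.org/,” 2015.</p>
<p><b> Features and Kernels for Audio Event Recognition</b></p>
<p><b> Abstract</b></p>
<p> One of the most important problems in audio event detection research is the absence of benchmark results against which a proposed method can be compared. Different works consider different events and datasets, which makes it difficult to comprehensively compare a new method with existing ones. In this paper we propose to establish results for audio event recognition on two recent publicly available datasets. In particular, we use Gaussian mixture model based feature representations and combine them with linear as well as non-linear kernel support vector machines. </p>
<p> Index Terms: Audio Event Detection, Audio Content Analysis.</p>
<p><b> 1. Introduction</b></p>
<p> In recent years, automatic content analysis of audio recordings has been gaining attention in the audio research community. The goal is to develop methods which can automatically detect the presence of different kinds of audio events in a recording. Audio event detection (AED) research is driven by its application in several areas. These include areas such as multimedia information retrieval or multimedia event detection [1], where the audio component contains important information about the content of the multimedia data. This is particularly important for supporting content-based search and retrieval of multimedia data on the web. Other applications such as surveillance [2], wildlife monitoring [3], context-aware systems [5] [6] and health monitoring are also motivating factors behind AED research. A variety of methods have been proposed for audio event detection in the past few years. A GMM-HMM architecture similar to that used in automatic speech recognition was presented in [7]. A simple yet effective approach is the bag-of-words representation of [8] [9] [10]. This approach also features in AED in noisy environments [11]. Given the complexity of AED, deep neural networks, which are known for modelling highly non-linear functions, have also been explored for AED [12] [13]. Some other interesting approaches involve the use of acoustic unit descriptors for AED [14], spectral exemplars for AED [15] and matrix factorization of spectral representations [16]. The lack of supervised data for audio events has also led to some works on weakly supervised audio event detection [17] [18].</p>
<p> All of the methods have been shown to be reasonably effective on the datasets and events on which they were evaluated. However, a major problem is that we fail to understand how they compare against each other. Most methods use different datasets or sets of sound events, and given the variations we encounter across datasets and, more importantly, across sound events, it is difficult to assess different methods. Another factor lacking in several current works is analysis of the proposed method on a reasonably large vocabulary of sound events. Usually, only small sets of audio events are considered. An important reason for these limitations in the current AED literature is the absence of publicly available standard sound event databases which can serve as a common testbed for any proposed method. More importantly, the dataset should be of reasonable size in terms of the total duration of audio and the vocabulary of events. Open-source datasets help in standardizing research efforts on a problem and give a better understanding of the results reported by any proposed method. For example, in image recognition and classification, results are prominently reported on well-known standard datasets such as PASCAL [19], CIFAR [20] and ImageNet [21].</p>
<p> Recently, the audio community has attempted to address this problem by creating standard sound event databases for audio event detection tasks [22] [23] [24]. The UrbanSounds8k [22] dataset contains 10 different sound events with a total of about 8.75 hours of data. The ESC-50 [23] dataset contains labeled data for 50 sound events and a total of 2.8 hours of audio. An important goal of our work is to perform a comprehensive analysis of these two datasets for audio event detection. These two public datasets address the concerns regarding the total duration and the vocabulary of sound events. The total duration of the audio is reasonably large and, overall, they allow us to present results on 56 unique sound events. Specifically, we analyze Gaussian mixture model (GMM) based feature representations. We examine two forms of feature representation using GMMs. The first is a soft-count histogram representation obtained using a GMM, and the second uses maximum a posteriori (MAP) adaptation of the GMM parameters to obtain a fixed-dimensional feature representation for audio segments. The first feature is similar to the bag-of-audio-words representation. The second feature, obtained by MAP adaptation, attempts to capture the actual distribution of the low-level features of a recording. On the classifier side, we use support vector machines (SVMs) and explore different kernels for audio event detection.</p>
<p> The ESC-50 dataset also allows us to study audio event detection at different levels of the event hierarchy. The 50 events of ESC-50 broadly belong to 5 different higher semantic categories. We report and analyze results for both lower-level events and higher semantic categories.</p>
<p> It is important to note that our analysis on these two datasets is very different from the simple multi-class classification analysis done by the authors of these datasets. In particular, we are interested in audio event detection, where the goal is to detect the presence or absence (binary) of an event in a given recording. Our analysis is in terms of a ranked measure (AP) for each event as well as the area under the detection error curve for each event.</p>
<p> The rest of the paper is organized as follows. In Section 2 we provide a brief description of the ESC-50 and Urban Sounds8k datasets. Section 3 describes the features and kernels used for audio event detection, and Section 4 reports the experimental results. Section 5 concludes our work.</p>
<p><b> 2. Datasets</b></p>
<p> Urban Sounds8k: This dataset contains a total of 8732 audio clips over 10 different sound event classes. The event classes are air conditioner (AC), car horn (CH), children playing (CP), dog barking (DB), drilling (DR), engine idling (EI), gun shot (GS), jackhammer (JK), siren (SR) and street music (SM). The total duration of the recordings is about 8.75 hours. The length of each audio clip is less than or equal to 4 seconds. The dataset comes presorted into 10 folds, and we use the same fold structure in our experiments.</p>
<p> ESC-50: ESC-50 contains a total of 2000 audio clips over 50 sound events. Example sound classes are dog barking, rain, crying baby, fireworks, clock ticking, etc. The full list of sound events can be found in [23]. The 50 events in the dataset belong to 5 groups, namely animals (e.g. dog, cat), natural soundscapes (e.g. rain, sea waves), non-speech human sounds (e.g. sneezing, clapping), interior domestic sounds (e.g. door knock, clock) and exterior sounds (e.g. helicopter, siren). The total duration of the recordings in this dataset is about 2.8 hours. We divide the dataset into 10 folds for our experiments. For ease of future use of our setup by others, the details of this fold structure can be found on the web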