<p> Word count: 2,291 English words, 12,196 characters; 3,868 Chinese characters</p><p> Source: V. H. Shastri, V. Sreeprada. A Study of Data Mining with Big Data [J]. International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103
</p><p><b> Original English text: </b></p><p> A Study of Data Mining with Big Data</p>
<p> Abstract Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets, typically those whose size exceeds that of a typical database. Big Data introduces unique computational and statistical challenges. Big Data is at present expanding in most domains of engineering and science. Because of the volume, variability and velocity of these huge data sets, data mining helps to extract useful data from them. This paper presents the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.</p>
<p> Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.</p>
<p> I. Introduction</p>
<p> Big Data refers to the enormous amount of structured and unstructured data that overflows the organization. If this data is properly used, it can lead to meaningful information. Big Data includes a large amount of data that requires a lot of processing in real time. It provides room to discover new values, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.</p>
<p> Big Data
includes three V's as its characteristics: volume, velocity and variety. Volume refers to the amount of data generated every second; here the data is considered at rest, and volume is also known as the scale characteristic. Velocity is the speed at which the data is generated; Big Data involves high-speed data, such as the data generated by social media. Variety means that different types of data can be included, such as audio, video or documents, as well as numerals, images, time series, arrays and so on.</p>
<p> Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.</p>
<p> Big Data is expanding in all domains, including science and engineering fields such as the physical, biological and biomedical sciences.</p>
<p> II. BIG DATA with DATA MINING</p>
<p> Generally, Big Data refers to a collection of large volumes of data, and these
data are generated from various sources such as the internet, social media, business organizations and sensors. We can extract useful information from them with the help of data mining, a technique for discovering patterns as well as descriptive, understandable models from large-scale data.</p>
<p> Volume is the size of the data, which exceeds terabytes and petabytes. The scale and growth in size make the data difficult to store and analyse using traditional tools. Big Data techniques should be able to mine large amounts of data within a predefined period of time. Traditional database systems were designed to address small amounts of data which were structured and consistent, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on.</p>
<p> Big Data mining refers to the activity of going through big data sets to look for relevant information. Hadoop is used to process large volumes of data from different sources quickly. Hadoop is a free, Java-based programming framework that supports the processing of large
data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted when a node fails. It runs MapReduce for distributed data processing and works with structured and unstructured data.</p>
<p> III. BIG DATA characteristics: HACE THEOREM</p>
<p> We have a large volume of heterogeneous data, and complex relationships exist among the data. We need to discover useful information from this voluminous data.</p>
<p> Let us imagine a scenario in which blind men are asked to draw an elephant. From the information he collects, each blind man may think of the trunk as a wall, a leg as a tree, the body as a wall and the tail as a rope. The blind men can exchange information with each other.</p>
<p> Figure 1: Blind men and the giant elephant</p>
<p> Some of the characteristics included are:</p>
<p> i. Vast data with heterogeneous and diverse sources: One of the fundamental characteristics of Big Data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world, a single human being is represented by name, age, gender, family history and so on, while images and videos are used for X-ray and CT scans. Heterogeneity refers to the different types of representation of the same individual, and diversity refers to the variety of features used to represent a single piece of information.</p>
<p> ii. Autonomous with distributed and decentralized control: The sources are autonomous, i.e., automatically generated; they generate information without any centralized control. We can compare this with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.</p>
<p> iii. Complex and evolving relationships: As the size of the data becomes extremely large, the relationships that exist within it also grow. In the early stages, when data is small, there is no complexity in relationships
among the data. Data generated from social media and other sources have complex relationships.</p>
<p> IV. TOOLS: OPEN SOURCE REVOLUTION</p>
<p> Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to open source projects. In Big Data mining, there are many open source initiatives. The most popular of them are:</p>
<p> Apache Mahout: Scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.</p>
<p> R: Open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.</p>
<p> MOA: Stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.</p>
<p> SAMOA: A new, upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.</p>
<p> Vowpal Wabbit: Open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. When doing linear learning, it can exceed the throughput of any single machine's network interface via parallel learning.</p>
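<p> To make the idea of scalable linear learning more concrete, the following is a minimal sketch of an online linear learner in plain Python. It is not Vowpal Wabbit's code or API. The class name HashedLinearLearner and the parameters num_bits and learning_rate are hypothetical, and mapping feature names into a fixed-size weight table by hashing is an assumption that the text above does not state. The sketch only shows why such a learner can process examples one at a time with memory that does not grow with the number of distinct features.</p>

```python
import zlib


class HashedLinearLearner:
    """Online least-squares linear learner over hashed feature names.

    Illustrative sketch only: names and defaults are hypothetical.
    """

    def __init__(self, num_bits=18, learning_rate=0.1):
        # Fixed-size weight table: memory use is independent of how many
        # distinct feature names appear in the stream.
        self.size = 1 << num_bits
        self.weights = [0.0] * self.size
        self.learning_rate = learning_rate

    def _index(self, name):
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        return zlib.crc32(name.encode("utf-8")) % self.size

    def predict(self, features):
        # features maps a feature name to its numeric value.
        return sum(self.weights[self._index(n)] * v for n, v in features.items())

    def update(self, features, target):
        # One stochastic gradient step on the squared error of this example.
        error = self.predict(features) - target
        for name, value in features.items():
            self.weights[self._index(name)] -= self.learning_rate * error * value
        return error * error


if __name__ == "__main__":
    # Toy stream in which the target equals 2*a + 3*b.
    stream = [({"a": 1.0}, 2.0), ({"b": 1.0}, 3.0), ({"a": 1.0, "b": 1.0}, 5.0)]
    learner = HashedLinearLearner()
    for _ in range(200):
        for features, target in stream:
            learner.update(features, target)
    print(learner.predict({"a": 1.0, "b": 1.0}))
```

<p> Each example is seen once per pass and then discarded, so the same loop could read from a file or a network stream instead of a list. Two feature names may collide in the weight table; a larger num_bits makes this less likely at the cost of more memory.</p>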
<p> V. DATA MINING for BIG DATA</p>
<p> Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining comprises several algorithms, which fall into four categories:</p>
<p> 1. Association Rule</p>
<p> 2. Clustering</p>
<p> 3. Classification</p>
<p> 4. Regression</p>
<p> Association is used to search for relationships between variables; for example, it is applied in searching for frequently visited items. In short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure with a known structure. Regression finds a function to model the data.</p>
<p> The different data mining algorithms are:</p>
<p> Table 1. Classification of Algorithms</p>
<p> Data mining algorithms can be converted into MapReduce algorithms on a parallel computing basis.</p>
<p> Table 2. Differences between Data Mining and Big Data</p>
<p> VI. Challenges in BIG DATA</p>
<p> Meeting the challenges of Big Data is difficult. The volume is increasing every day. The velocity is being increased by internet-connected devices. The variety is also expanding, while organizations' capability to capture and process the data is limited.</p>
<p> The following are the challenges in the area of Big Data when it is handled:</p>
<p> 1. Data capture and storage</p>
<p> 2. Data transmission</p>
<p> 3. Data curation</p>
<p> 4. Data analysis</p>
<p> 5. Data visualization</p>
<p> The challenges of Big Data mining are divided into three tiers.</p>
<p> The first tier is the setup of data mining algorithms. The second tier includes</p>
<p> 1. Information sharing and data privacy.</p>
<p> 2. Domain and application knowledge.</p>
<p> The third tier includes local learning and model fusion for multiple information sources.</p>
<p> 3. Mining from sparse, uncertain and incomplete data.</p>
<p> 4. Mining complex and dynamic data.</p>
<p> Figure 2: Phases of Big Data Challenges</p>
<p> Generally, mining data from different
data sources is tedious because the data is so large. Big Data is stored in different places, so collecting it is a tedious task, and applying basic data mining algorithms to it is an obstacle. Next, we need to consider the privacy of the data. The third issue is the mining algorithms: when we apply data mining algorithms to these subsets of data, the results may not be very accurate.</p>
<p> VII. Forecast of the future</p>
<p> There are some challenges that researchers and practitioners will have to deal with during the coming years:</p>
<p> Analytics Architecture: It is not yet clear what an optimal architecture for analytics systems should look like in order to deal with historic data and real-time data at the same time. An interesting proposal is the Lambda Architecture of Nathan Marz. The Lambda Architecture solves the problem
of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general and extensible, allows ad hoc queries, minimal maintenance, and debuggable.</p>
<p> Statistical significance: It is important to achieve significant statistical results and not be fooled by randomness. As Efron explains in his book on large-scale inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.</p>
<p> Distributed mining: Many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.</p>
<p> Time evolving data: Data may evolve over time, so it is important that Big Data mining techniques be able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task.</p>
<p> Compression: When dealing with Big Data, the quantity of
space needed to store it is very relevant. There are two main approaches: compression, where we don't lose anything, or sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it a transformation from time to space. Using sampling, we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems. Coresets are small sets that approximate the original data for a given problem. Using merge-reduce, the small sets can then be used to solve hard machine learning problems in parallel.</p>
<p> Visualization: A main task of Big Data analysis is how to visualize the results. Because the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, such as the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".</p>
<p> Hidden Big Data: Large quantities of useful data are getting lost, since new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if
tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.</p>
<p> VIII. CONCLUSION</p>
<p> The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new area for scientific data research and for business applications.</p>
<p> Data mining techniques can be applied to Big Data to acquire useful information from large datasets. Used together, they yield a useful picture of the data.</p>
<p> Big Data analysis tools such as MapReduce over Hadoop and HDFS help organizations.</p>
<p><b> Translation of the Chinese version:</b></p>
<p><b> A Study of Big Data Mining</b></p>
<p> Abstract Data has become an important component of every economy, industry, organization, enterprise, function and individual. Big Data is a term used to identify large data sets, usually larger in size than a typical database. Big Data introduces unique computational and statistical challenges. Big Data is currently extending into most areas of engineering and science. Given the sheer volume, speed and variety of Big Data, data mining can be used to help extract useful data from huge data sets. This paper introduces the HACE theorem, which describes the characteristics of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.</p>
<p> Keywords: Big Data, data mining, HACE theorem, structured and unstructured.</p>
<p><b> I. Introduction</b></p>
<p> Big Data refers to large amounts of structured and unstructured data spread throughout an organization. If used correctly, this data yields meaningful information. Big Data comprises a large amount of data that requires a great deal of real-time processing. It offers two kinds of room: one for discovering new value and gaining in-depth knowledge from hidden value, and another for managing data effectively. A database is a logically organized collection of related data that can be conveniently managed, updated and accessed. Data mining is the process of discovering interesting knowledge (such as associations, patterns, changes, anomalies and significant structures) from large amounts of data stored in databases or other repositories.</p>
<p> Big Data has the characteristics of the 3 V's: volume, velocity and variety. Volume means the amount of data generated every second. The data is static, and its scale characteristics are also well known. Velocity is the speed at which data is generated. Big Data should involve high-speed data, of which the data produced by social media is an example. Variety means that different types of data can be taken, such as audio, video or documents. It can be numbers, images, time series, arrays and so on.</p>
<p> Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large amounts of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time. Depending on the type of patterns to be mined, data mining tasks can be divided into summarization, classification, clustering, association and trend analysis.</p>
<p> Big Data is extending into all fields, including science and engineering fields such as physics, biology and biomedicine.</p>
<p><b> II. Big Data Mining</b></p>
<p> Generally speaking, Big Data refers to a collection of large amounts of data that come from various sources such as the internet, social media, business organizations and sensors. We can extract useful information with the help of data mining techniques. Data mining is a technique for discovering patterns as well as descriptive, understandable models from large amounts of data.</p>
<p> Volume is the size of the data, which is larger than petabytes and terabytes. The increase in scale and volume makes the data difficult to store and analyse with traditional tools. Big Data should be used to mine large amounts of data within a predefined period of time. Traditional database systems were designed to handle small amounts of structured and consistent data, whereas Big Data includes all kinds of data, such as geospatial data, audio, video and unstructured text.
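</p><p> Big Data mining refers to the activity of going through big data sets to find relevant information. Hadoop is used to process large amounts of data from different sources quickly. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates between nodes and allows the system to keep running without interruption when a node fails. It runs MapReduce for distributed data processing, for both structured and unstructured data.</p>
<p> III. Big Data Characteristics: the HACE Theorem</p>
<p> We have a large amount of heterogeneous data. Complex relationships exist among the data. We need to discover useful information from this huge body of data.</p>
<p> Let us imagine a scenario in which blind men are asked to draw an elephant. From the information he collects, each blind man may think that the trunk is like a wall, a leg is like a tree, the body is like a wall and the tail is like a rope. The blind men can exchange information with each other.</p>
<p><b> Figure 1: The blind men and the elephant</b></p>
<p> Some of the characteristics include:</p>
<p> 1. Massive data with heterogeneous and diverse sources: One of the basic characteristics of Big Data is the large amount of heterogeneous and diverse data. For example, in the biomedical world, an individual is represented by name, age, gender, family medical history and so on, while images and videos are used for X-ray and CT scans. Heterogeneity refers to the different forms of representation of the same individual, and diversity refers to the use of various features to represent a single piece of information.</p>
<p> 2. Autonomy with distributed and decentralized control: The sources are autonomous, that is, automatically generated; they generate information without any centralized control. We can compare this with the World Wide Web (WWW), in which each server provides a certain amount of information without depending on other servers.</p>
<p> 3. Complex and evolving relationships: As the amount of data becomes infinitely large, the relationships that exist also become large. In the early stages, when the data is small, the relationships among the data are not complex. Data generated by social media and other sources have complex relationships.</p>
<p> IV. Tools: the Open Source Revolution</p>
<p> Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from open source projects and contribute to them. In Big Data mining, there are many open source initiatives. The most popular of them are:</p>
<p> Apache Mahout: Open source software for scalable machine learning and data mining, based mainly on Hadoop. It implements a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.</p>
<p> R: An open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.</p>
<p> MOA: Open source stream data mining software that can perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It began as a project of the Machine Learning group at the University of Waikato, New Zealand, which is known for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions, and is able to use MOA, Android and Storm.</p>
<p> SAMOA: A new, upcoming software project for distributed stream mining that combines S4 and Storm with MOA.</p>
<p> Vowpal Wabbit: An open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable and useful learning algorithm. VW is able to learn from data sets with a very large number of features. When doing linear learning, it can exceed the throughput of any single machine's network interface through parallel learning.</p>
<p> V. Data Mining for Big Data</p>
<p> Data mining is the process of discovering useful information by analysing data from different sources. Data mining comprises a variety of algorithms, which fall into four categories. They are:</p>
<p><b> 1. Association rules</b></p>
<p><b> 2. Clustering</b></p>
<p><b> 3. Classification</b></p>
<p><b> 4. Regression</b></p>
<p> Association is used to search for relationships between variables. It is used to search for frequently visited items. In short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure with a known structure. Regression finds a function to model the data.</p>
<p> The different data mining algorithms are:</p>
<p><b> Table 1. Classification of algorithms</b></p>
<p> Data mining algorithms can be converted into MapReduce algorithms based on parallel computing.</p>
<p> Table 2. Differences between Big Data and data mining</p>
<p><b> VI. Big Data Challenges</b></p>
<p>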
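Facing the challenges of Big Data is difficult. The volume is increasing every day. The velocity is being increased by network-connected devices. The variety is also expanding, while the ability of organizations to capture and process data is limited.</p>
<p> The following are the challenges faced when handling Big Data:</p>
<p><b> 1. Data capture and storage</b></p>
<p><b> 2. Data transmission</b></p>
<p><b> 3. Data curation</b></p>
<p><b> 4. Data analysis</b></p>
<p><b> 5. Data visualization</b></p>
<p> It is understood that the challenges facing Big Data mining are divided into three tiers.</p>
<p> The first tier is the setup of data mining algorithms. The second tier includes</p>
<p> 1. Information sharing and data privacy.</p>
<p><b> 2. Domain and application knowledge.</b></p>
<p> The third tier includes local learning and model fusion for multiple information sources.</p>
<p> 3. Mining from sparse, uncertain and incomplete data.</p>
<p> 4. Mining complex and dynamic data.</p>
<p> Figure 2: Phases of Big Data challenges</p>
<p> Because the amount of data is large, mining data from different data sources is usually tedious. Big Data is stored in different places, so collecting the data is a tedious task, and applying basic data mining algorithms becomes an obstacle. Next, we need to consider the privacy of the data. The third issue is the mining algorithms. When we apply data mining algorithms to these subsets of data, the results may not be very accurate.</p>
<p><b> VII. Forecast of the Future</b></p>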
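<p> Researchers and practitioners will face some challenges in the coming years:</p>
<p> Analytics architecture: It is not yet clear what the optimal architecture of an analytics system should be in order to handle historic data and real-time data at the same time. An interesting proposal is the Lambda Architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robustness and fault tolerance, scalability, generality and extensibility, support for ad hoc queries, minimal maintenance, and debuggability.</p>
<p> Statistical significance: It is important to obtain statistically significant results and not be fooled by randomness. As Efron explains in his book on large-scale inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.</p>
<p> Distributed mining: Many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a great deal of practical and theoretical analysis is needed in order to provide new methods.</p>
<p> Time evolving data: Data may evolve over time, so it is important that Big Data mining techniques be able to adapt and, in some cases, to detect the change first. For example, the data stream mining field offers very powerful techniques for this task.</p>
<p> Compression: The amount of space needed to handle Big Data is very important. There are two main approaches: compression, where we do not give up any data, or sampling, where we select the more representative data. Using compression, we may need more time and less space, so we can regard it as a transformation from time to space. Using sampling, we lose information, but the gain in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems. A coreset is a small set that approximately represents the original data for a given problem. Using merge-reduce, the small sets can then be used to solve hard machine learning problems in parallel.</p>
<p> Visualization: A main task of Big Data analysis is how to visualize the results. Because the data is so large, it is difficult to find user-friendly visualizations. New techniques and frameworks will be needed to tell and show stories, such as the photographs, infographics and essays in the book "The Human Face of Big Data".</p>
<p> Hidden Big Data: Because new data is mostly untagged, file-based and unstructured, large quantities of useful data are being lost. The 2012 IDC study on Big Data explains that in 2012, 23% (643 EB) of the digital universe would be useful for Big Data if it were tagged and analyzed. However, at present only 3% of the potentially useful data is tagged, and even less is analyzed.</p>
<p><b> VIII. Conclusion</b></p>
<p> The amount of data is growing exponentially because of social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming a new area for scientific data research and business applications.</p>
<p> Data mining techniques can be applied to Big Data to obtain useful information from large data sets. They can be used together to obtain a useful picture from the data.</p>
<p> Big Data analysis tools such as MapReduce, Hadoop and HDFS can help organizations.</p>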