數(shù)據(jù)庫外文翻譯--mapreduce的簡(jiǎn)化數(shù)據(jù)處理大型集群(節(jié)選)_第1頁
已閱讀1頁,還剩16頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、<p>  中文3700字,2000英文單詞,10500英文字符</p><p>  文獻(xiàn)出處:Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.</p><p>  MapR

2、educe: Simplified Data Processing on Large Clusters</p><p>  Jeffrey Dean and Sanjay Ghemawat</p><p>  jeff@google.com, sanjay@google.com</p><p>  Google, Inc.</p><p>&

3、lt;b>  Abstract</b></p><p>  MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value

4、pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown i

5、n the paper.</p><p>  Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitio

6、ning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine ommunication. This allows programmers without any experience with p

7、arallel and distributed systems to easily utilize the resources of a large distributed system.</p><p>  Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a

8、 typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers and the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce

9、 jobs are executed on Google's clusters every day.</p><p>  1 Introduction</p><p>  Over the past five years, the authors and many others at Google have implemented hundreds of that process

10、large amounts of raw data,such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries

11、of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input dat</p><p>  As a reaction to this c

12、omplexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing i

13、n a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical ?

14、record? in our input in order to comp</p><p>  The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, co

15、mbined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.</p><p>  Section 2 describes the basic programming model and gives several examples.</p&g

16、t;<p>  Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. </p><p>  Section 4 describes several renements of the programming mo

17、del that we have found useful. </p><p>  Section 5 has performance measurements of our implementation for a variety of tasks. </p><p>  Section 6 explores the use of MapReduce within Google incl

18、uding our experiences in using it as the basis for a rewrite of our production indexing system. </p><p>  Section 7 discusses related and future work.</p><p>  2 Programming Model</p><

19、;p>  The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.</p><p&g

20、t;  Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes t

21、hem to the Reduce function. </p><p>  The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smalle

22、r set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values

23、 that are too large to fit in memory.</p><p>  2.1 Example</p><p>  Consider the problem of counting the number of occurrences of each word in a large collection of documents.The user would writ

24、e code similar to the following pseudo-code:</p><p>  The map function emits each word plus an associated count of occurrences (just `1' in this simple example). The reduce function sums together all cou

25、nts emitted for a particular word.</p><p>  In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The use

26、r then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this examp

27、le.</p><p><b>  2.2 Types</b></p><p>  Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by th

28、e user have associated types:</p><p>  I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore,the intermediate keys and values are from the same domain

29、 as the output keys and values.</p><p>  Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.</p>

30、;<p>  2.3 More Examples</p><p>  Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.</p><p>  Distributed Grep: The map functio

31、n emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.</p><p>  Count of URL Access Frequency: The map fun

32、ction processes logs of web page requests and outputs The reduce function adds together all values for the same URL and emits <URL, total count> pair.</p><p>  ReverseWeb-Link Graph: The map function o

33、utputs <target,source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair<target,

34、list(source)></p><p>  Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of hword; frequencyi pairs. The map function emits a

35、 hhostname; term vector pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vector

36、s together, throwing away infrequent terms, and then emits a final <hostname,term vector> pair.</p><p>  Inverted Index: The map function parses each document,and emits a sequence of <word, document

37、 ID>pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word; list(document ID)> pair. The set of all output pairs forms a simple inverted index. It i

38、s easy to augment this computation to keep track of word positions.</p><p>  Distributed Sort: The map function extracts the key from each record, and emits a <key, record> pair. The reduce function em

39、its all pairs unchanged. This computation depends on the partitioning facilities described inSection 4.1 and the ordering properties described in Section 4.2.</p><p>  3 Implementation</p><p>  

40、Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be</p><p>  suitable for a small shared-memory machine

41、, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.</p><p>  This section describes an implementation targeted to the computing environment in wid

42、e use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:</p><p>  (1) Machines are typically dual-processor x86 processors running Linux, with 2-4 G

43、B of memory per machine.</p><p>  (2) Commodity networking hardware is used . typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisecti

44、on bandwidth.</p><p>  (3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.</p><p>  (4) Storage is provided by inexpensive IDE disks attached

45、directly to individual machines. A distributed _le system [8] developed in-house is used to manage the data</p><p>  stored on these disks. The _le system uses replication to provide availability and reliabi

46、lity on top of unreliable hardware.</p><p>  (5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

47、</p><p>  3.1 Execution Overview</p><p>  The Map invocations are distributed across multiplemachines by automatically partitioning the input data into a set of M splits. The input splits can be

48、 processed</p><p>  in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The n

49、umber of partitions (R) and the partitioning function are speci_ed by the user. </p><p>  Figure 1 shows the overall _ow of a MapReduce operation in our implementation. When the user program calls the MapRed

50、uce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):</p><p>  1. The MapReduce library in the user program _rst splits the inp

51、ut _les into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.</p><p

52、>  2. One of the copies of the program is special . the master. The rest are workers that are assigned work by the master. There areM map tasks and R reduce tasks to assign. The master picks idle workers and assigns e

53、ach one a map task or a reduce task.</p><p>  3. A worker who is assigned a map task reads thecontents of the corresponding input split. It parseskey/value pairs out of the input data and passes each pair to

54、 the user-de_ned Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.</p><p>  4. Periodically, the buffered pairs are written to local disk, partitioned into R

55、 regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.</p><p> 

56、 5. When a reduce worker is noti_ed by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate da

57、ta, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediat

58、e data is too large to fit in memory, an external sort is use</p><p>  6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and

59、the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a _nal output _le for this reduce partition.</p><p>  7. When all map tasks an

60、d reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. </p><p>  After successful completion, the outpu

61、t of the mapreduce execution is available in the R output files (one per reduce task, with _le names as speci_ed by the user).</p><p>  Typically, users do not need to combine these R output files into one f

62、iles -they often pass these _les as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.</p><p>  MapRedu

63、ce的:簡(jiǎn)化數(shù)據(jù)處理大型集群</p><p><b>  摘要</b></p><p>  MapReduce是一個(gè)編程模型,也是一個(gè)處理和生成超大數(shù)據(jù)集的算法模型的相關(guān)實(shí)現(xiàn)。用戶首先創(chuàng)建一個(gè)Map函數(shù)處理一個(gè)基于key/value pair的數(shù)據(jù)集合,輸出中間的基于key/value pair的數(shù)據(jù)集合;然后再創(chuàng)建一個(gè)Reduce函數(shù)用來合并所有的具有相同中間key

64、值的中間value值?,F(xiàn)實(shí)世界中有很多滿足上述處理模型的例子,本論文將詳細(xì)描述這個(gè)模型。</p><p>  MapReduce架構(gòu)的程序能夠在大量的普通配置的計(jì)算機(jī)上實(shí)現(xiàn)并行化處理。這個(gè)系統(tǒng)在運(yùn)行時(shí)只關(guān)心:如何分割輸入數(shù)據(jù),在大量計(jì)算機(jī)組成的集群上的調(diào)度,集群中計(jì)算機(jī)的錯(cuò)誤處理,管理集群中計(jì)算機(jī)之間必要的通信。采用MapReduce架構(gòu)可以使那些沒有并行計(jì)算和分布式處理系統(tǒng)開發(fā)經(jīng)驗(yàn)的程序員有效利用分布式系統(tǒng)的豐

65、富資源。</p><p>  我們的MapReduce實(shí)現(xiàn)運(yùn)行在規(guī)??梢造`活調(diào)整的由普通機(jī)器組成的集群上:一個(gè)典型的MapReduce計(jì)算往往由幾千臺(tái)機(jī)器組成、處理以TB計(jì)算的數(shù)據(jù)。程序員發(fā)現(xiàn)這個(gè)系統(tǒng)非常好用:已經(jīng)實(shí)現(xiàn)了數(shù)以百計(jì)的MapReduce程序,在Google的集群上,每天都有1000多個(gè)MapReduce程序在執(zhí)行。</p><p><b>  1 介紹 </b&

66、gt;</p><p>  在過去的5年里,包括本文作者在內(nèi)的Google的很多程序員,為了處理海量的原始數(shù)據(jù),已經(jīng)實(shí)現(xiàn)了數(shù)以百計(jì)的、專用的計(jì)算方法。這些計(jì)算方法用來處理大量的原始數(shù)據(jù),比如,文檔抓?。愃凭W(wǎng)絡(luò)爬蟲的程序)、Web請(qǐng)求日志等等;也為了計(jì)算處理各種類型的衍生數(shù)據(jù),比如倒排索引、Web文檔的圖結(jié)構(gòu)的各種表示形勢(shì)、每臺(tái)主機(jī)上網(wǎng)絡(luò)爬蟲抓取的頁面數(shù)量的匯總、每天被請(qǐng)求的最多的查詢的集合等等。大多數(shù)這樣的數(shù)據(jù)

67、處理運(yùn)算在概念上很容易理解。然而由于輸入的數(shù)據(jù)量巨大,因此要想在可接受的時(shí)間內(nèi)完成運(yùn)算,只有將這些計(jì)算分布在成百上千的主機(jī)上。如何處理并行計(jì)算、如何分發(fā)數(shù)據(jù)、如何處理錯(cuò)誤?所有這些問題綜合在一起,需要大量的代碼處理,因此也使得原本簡(jiǎn)單的運(yùn)算變得難以處理。</p><p>  為了解決上述復(fù)雜的問題,我們?cè)O(shè)計(jì)一個(gè)新的抽象模型,使用這個(gè)抽象模型,我們只要表述我們想要執(zhí)行的簡(jiǎn)單運(yùn)算即可,而不必關(guān)心并行計(jì)算、容錯(cuò)、數(shù)據(jù)分

68、布、負(fù)載均衡等復(fù)雜的細(xì)節(jié),這些問題都被封裝在了一個(gè)庫里面。設(shè)計(jì)這個(gè)抽象模型的靈感來自Lisp和許多其他函數(shù)式語言的Map和Reduce的原語。我們意識(shí)到我們大多數(shù)的運(yùn)算都包含這樣的操作:在輸入數(shù)據(jù)的“邏輯”記錄上應(yīng)用Map操作得出一個(gè)中間key/value pair集合,然后在所有具有相同key值的value值上應(yīng)用Reduce操作,從而達(dá)到合并中間的數(shù)據(jù),得到一個(gè)想要的結(jié)果的目的。使用MapReduce模型,再結(jié)合用戶實(shí)現(xiàn)的Map和R

69、educe函數(shù),我們就可以非常容易的實(shí)現(xiàn)大規(guī)模并行化計(jì)算;通過MapReduce模型自帶的“再次執(zhí)行”(re-execution)功能,也提供了初級(jí)的容災(zāi)實(shí)現(xiàn)方案。</p><p>  這個(gè)工作(實(shí)現(xiàn)一個(gè)MapReduce框架模型)的主要貢獻(xiàn)是通過簡(jiǎn)單的接口來實(shí)現(xiàn)自動(dòng)的并行化和大規(guī)模的分布式計(jì)算,通過使用MapReduce模型接口實(shí)現(xiàn)在大量普通的PC機(jī)上高性能計(jì)算。</p><p>  第

70、二部分描述基本的編程模型和一些使用案例。</p><p>  第三部分描述了一個(gè)經(jīng)過裁剪的、適合我們的基于集群的計(jì)算環(huán)境的MapReduce實(shí)現(xiàn)。</p><p>  第四部分描述我們認(rèn)為在MapReduce編程模型中一些實(shí)用的技巧。</p><p>  第五部分對(duì)于各種不同的任務(wù),測(cè)量我們MapReduce實(shí)現(xiàn)的性能。</p><p>  

71、第六部分揭示了在Google內(nèi)部如何使用MapReduce作為基礎(chǔ)重寫我們的索引系統(tǒng)產(chǎn)品,包括其它一些使用MapReduce的經(jīng)驗(yàn)。</p><p>  第七部分討論相關(guān)的和未來的工作。</p><p><b>  2 編程模型 </b></p><p>  MapReduce編程模型的原理是:利用一個(gè)輸入key/value pair集合來產(chǎn)生

72、一個(gè)輸出的key/value pair集合。MapReduce庫的用戶用兩個(gè)函數(shù)表達(dá)這個(gè)計(jì)算:Map和Reduce。</p><p>  用戶自定義的Map函數(shù)接受一個(gè)輸入的key/valuepair值,然后產(chǎn)生一個(gè)中間key/value pair值的集合。MapReduce庫把所有具有相同中間key值I的中間value值集合在一起后傳遞給reduce函數(shù)。</p><p>  用戶自定義

73、的Reduce函數(shù)接受一個(gè)中間key的值I和相關(guān)的一個(gè)value值的集合。Reduce函數(shù)合并這些value值,形成一個(gè)較小的value值的集合。一般的,每次Reduce函數(shù)調(diào)用只產(chǎn)生0或1個(gè)輸出value值。通常我們通過一個(gè)迭代器把中間value值提供給Reduce函數(shù),這樣我們就可以處理無法全部放入內(nèi)存中的大量的value值的集合。</p><p><b>  2.1 例子</b><

74、;/p><p>  例如,計(jì)算一個(gè)大的文檔集合中每個(gè)單詞出現(xiàn)的次數(shù),下面是偽代碼段:</p><p>  Map函數(shù)輸出文檔中的每個(gè)詞、以及這個(gè)詞的出現(xiàn)次數(shù)(在這個(gè)簡(jiǎn)單的例子里就是1)。Reduce函數(shù)把Map函數(shù)產(chǎn)生的每一個(gè)特定的詞的計(jì)數(shù)累加起來。</p><p>  另外,用戶編寫代碼,使用輸入和輸出文件的名字、可選的調(diào)節(jié)參數(shù)來完成一個(gè)符合MapReduce模型規(guī)范

75、的對(duì)象,然后調(diào)用MapReduce函數(shù),并把這個(gè)規(guī)范對(duì)象傳遞給它。用戶的代碼和MapReduce庫鏈接在一起(用C++實(shí)現(xiàn))。附錄A包含了這個(gè)實(shí)例的全部程序代碼。</p><p><b>  2.2 類型</b></p><p>  盡管在前面例子的偽代碼中使用了以字符串表示的輸入輸出值,但是在概念上,用戶定義的Map和Reduce函數(shù)都有相關(guān)聯(lián)的類型:</p&

76、gt;<p>  比如,輸入的key和value值與輸出的key和value值在類型上推導(dǎo)的域不同。此外,中間key和value值與輸出key和value值在類型上推導(dǎo)的域相同。</p><p>  我們的C++中使用字符串類型作為用戶自定義函數(shù)的輸入輸出,用戶在自己的代碼中對(duì)字符串進(jìn)行適當(dāng)?shù)念愋娃D(zhuǎn)換。</p><p><b>  2.3 更多的例子</b&g

77、t;</p><p>  這里還有一些有趣的簡(jiǎn)單例子,可以很容易的使用MapReduce模型來表示:</p><p>  分布式的Grep:Map函數(shù)輸出匹配某個(gè)模式的一行,Reduce函數(shù)是一個(gè)恒等函數(shù),即把中間數(shù)據(jù)復(fù)制到輸出。</p><p>  計(jì)算URL訪問頻率:Map函數(shù)處理日志中web頁面請(qǐng)求的記錄,然后輸出(URL,1)。Reduce函數(shù)把相同URL的

78、value值都累加起來,產(chǎn)生(URL,記錄總數(shù))結(jié)果。</p><p>  倒轉(zhuǎn)網(wǎng)絡(luò)鏈接圖:Map函數(shù)在源頁面(source)中搜索所有的鏈接目標(biāo)(target)并輸出為(target,source)。Reduce函數(shù)把給定鏈接目標(biāo)(target)的鏈接組合成一個(gè)列表,輸出(target,list(source))。</p><p>  每個(gè)主機(jī)的檢索詞向量:檢索詞向量用一個(gè)(詞,頻率)列

79、表來概述出現(xiàn)在文檔或文檔集中的最重要的一些詞。Map函數(shù)為每一個(gè)輸入文檔輸出(主機(jī)名,檢索詞向量),其中主機(jī)名來自文檔的URL。Reduce函數(shù)接收給定主機(jī)的所有文檔的檢索詞向量,并把這些檢索詞向量加在一起,丟棄掉低頻的檢索詞,輸出一個(gè)最終的(主機(jī)名,檢索詞向量)。</p><p>  倒排索引:Map函數(shù)分析每個(gè)文檔輸出一個(gè)(詞,文檔號(hào))的列表,Reduce函數(shù)的輸入是一個(gè)給定詞的所有(詞,文檔號(hào)),排序所有的

80、文檔號(hào),輸出(詞,list(文檔號(hào)))。所有的輸出集合形成一個(gè)簡(jiǎn)單的倒排索引,它以一種簡(jiǎn)單的算法跟蹤詞在文檔中的位置。</p><p>  分布式排序:Map函數(shù)從每個(gè)記錄提取key,輸出(key,record)。Reduce函數(shù)不改變?nèi)魏蔚闹?。這個(gè)運(yùn)算依賴分區(qū)機(jī)制(在4.1描述)和排序?qū)傩?在4.2描述)。</p><p><b>  3 實(shí)現(xiàn) </b></p

81、><p>  MapReduce模型可以有多種不同的實(shí)現(xiàn)方式。如何正確選擇取決于具體的環(huán)境。例如,一種實(shí)現(xiàn)方式適用于小型的共享內(nèi)存方式的機(jī)器,另外一種實(shí)現(xiàn)方式則適用于大型NUMA架構(gòu)的多處理器的主機(jī),而有的實(shí)現(xiàn)方式更適合大型的網(wǎng)絡(luò)連接集群。</p><p>  本章節(jié)描述一個(gè)適用于Google內(nèi)部廣泛使用的運(yùn)算環(huán)境的實(shí)現(xiàn):用以太網(wǎng)交換機(jī)連接、由普通PC機(jī)組成的大型集群。在我們的環(huán)境里包括:&l

82、t;/p><p>  1. x86架構(gòu)、運(yùn)行Linux操作系統(tǒng)、雙處理器、2-4GB內(nèi)存的機(jī)器。</p><p>  2. 普通的網(wǎng)絡(luò)硬件設(shè)備,每個(gè)機(jī)器的帶寬為百兆或者千兆,但是遠(yuǎn)小于網(wǎng)絡(luò)的平均帶寬的一半。</p><p>  3. 集群中包含成百上千的機(jī)器,因此,機(jī)器故障是常態(tài)。</p><p>  4. 存儲(chǔ)為廉價(jià)的內(nèi)置IDE硬盤。一個(gè)內(nèi)部分

83、布式文件系統(tǒng)用來管理存儲(chǔ)在這些磁盤上的數(shù)據(jù)。文件系統(tǒng)通過數(shù)據(jù)復(fù)制來在不可靠的硬件上保證數(shù)據(jù)的可靠性和有效性。</p><p>  5. 用戶提交工作(job)給調(diào)度系統(tǒng)。每個(gè)工作(job)都包含一系列的任務(wù)(task),調(diào)度系統(tǒng)將這些任務(wù)調(diào)度到集群中多臺(tái)可用的機(jī)器上。</p><p><b>  3.1 執(zhí)行概括</b></p><p>  通

84、過將Map調(diào)用的輸入數(shù)據(jù)自動(dòng)分割為M個(gè)數(shù)據(jù)片段的集合,Map調(diào)用被分布到多臺(tái)機(jī)器上執(zhí)行。輸入的數(shù)據(jù)片段能夠在不同的機(jī)器上并行處理。使用分區(qū)函數(shù)將Map調(diào)用產(chǎn)生的中間key值分成R個(gè)不同分區(qū)(例如,hash(key) mod R),Reduce調(diào)用也被分布到多臺(tái)機(jī)器上執(zhí)行。分區(qū)數(shù)量(R)和分區(qū)函數(shù)由用戶來指定。</p><p>  圖1展示了我們的MapReduce實(shí)現(xiàn)中操作的全部流程。當(dāng)用戶調(diào)用MapReduce

85、函數(shù)時(shí),將發(fā)生下面的一系列動(dòng)作(下面的序號(hào)和圖1中的序號(hào)一一對(duì)應(yīng)):</p><p>  1. 用戶程序首先調(diào)用的MapReduce庫將輸入文件分成M個(gè)數(shù)據(jù)片度,每個(gè)數(shù)據(jù)片段的大小一般從16MB到64MB(可以通過可選的參數(shù)來控制每個(gè)數(shù)據(jù)片段的大小)。然后用戶程序在機(jī)群中創(chuàng)建大量的程序副本。</p><p>  2. 這些程序副本中的有一個(gè)特殊的程序–master。副本中其它的程序都是wo

86、rker程序,由master分配任務(wù)。有M個(gè)Map任務(wù)和R個(gè)Reduce任務(wù)將被分配,master將一個(gè)Map任務(wù)或Reduce任務(wù)分配給一個(gè)空閑的worker。</p><p>  3. 被分配了map任務(wù)的worker程序讀取相關(guān)的輸入數(shù)據(jù)片段,從輸入的數(shù)據(jù)片段中解析出key/value pair,然后把key/value pair傳遞給用戶自定義的Map函數(shù),由Map函數(shù)生成并輸出的中間key/value

87、pair,并緩存在內(nèi)存中。</p><p>  4. 緩存中的key/value pair通過分區(qū)函數(shù)分成R個(gè)區(qū)域,之后周期性的寫入到本地磁盤上。緩存的key/value pair在本地磁盤上的存儲(chǔ)位置將被回傳給master,由master負(fù)責(zé)把這些存儲(chǔ)位置再傳送給Reduce worker。</p><p>  5. 當(dāng)Reduce worker程序接收到master程序發(fā)來的數(shù)據(jù)存儲(chǔ)位

88、置信息后,使用RPC從Map worker所在主機(jī)的磁盤上讀取這些緩存數(shù)據(jù)。當(dāng)Reduce worker讀取了所有的中間數(shù)據(jù)后,通過對(duì)key進(jìn)行排序后使得具有相同key值的數(shù)據(jù)聚合在一起。由于許多不同的key值會(huì)映射到相同的Reduce任務(wù)上,因此必須進(jìn)行排序。如果中間數(shù)據(jù)太大無法在內(nèi)存中完成排序,那么就要在外部進(jìn)行排序。</p><p>  6. Reduce worker程序遍歷排序后的中間數(shù)據(jù),對(duì)于每一個(gè)唯

89、一的中間key值,Reduce worker程序?qū)⑦@個(gè)key值和它相關(guān)的中間value值的集合傳遞給用戶自定義的Reduce函數(shù)。Reduce函數(shù)的輸出被追加到所屬分區(qū)的輸出文件。</p><p>  7. 當(dāng)所有的Map和Reduce任務(wù)都完成之后,master喚醒用戶程序。在這個(gè)時(shí)候,在用戶程序里的對(duì)MapReduce調(diào)用才返回。</p><p>  在成功完成任務(wù)之后,MapRedu

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 眾賞文庫僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論