How to quickly sort numerically in Hadoop

Reposted from: http://stackoverflow.com/questions/13331722/how-to-sort-numerically-in-hadoops-shuffle-sort-phase

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

  1. Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.
  2. Specify the type of sorting required with mapred.text.key.comparator.options. Two useful options are -n (numeric sort) and -r (reverse sort).
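
To see why the comparator options matter, here is a minimal local sketch in plain Python (not Hadoop code). It assumes the keys reach the sort phase as text, and uses Python's sorted() only as a stand-in for the comparator: the default ordering compares keys as strings, while key=int approximates -n and reverse=True approximates -r.

```python
# Keys from the example input, as the text strings a streaming job would see.
keys = ["1", "11", "2", "20", "7", "3", "40"]

# Text comparison: "11" sorts before "2" because "1" < "2" character-wise.
text_order = sorted(keys)

# Approximation of -n: compare keys by their numeric value.
numeric_order = sorted(keys, key=int)

# Approximation of -n combined with -r: numeric, descending.
reverse_numeric_order = sorted(keys, key=int, reverse=True)

print(text_order)
print(numeric_order)
print(reverse_numeric_order)
```

The first list comes out as 1, 11, 2, 20, 3, 40, 7, which is the ordering a text comparison gives when no numeric comparator is configured.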

EXAMPLE:

Create an identity mapper and reducer with the following code.

mapper.py and reducer.py both contain the same code:

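#!/usr/bin/env python
import sys

# Identity step: echo every input line, minus leading and trailing whitespace.
for line in sys.stdin:
    # print() with parentheses runs under both Python 2 and Python 3.
    print("%s" % line.strip())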

Note: cat would actually work here as well :-)

This is the input.txt

1
11
2
20
7
3
40

This is the Streaming command

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options=-n \
-input /user/input.txt \
-output /user/output.txt \
-file ~/mapper.py \
-mapper ~/mapper.py \
-file ~/reducer.py \
-reducer ~/reducer.py

This produces the required output:

1   
2   
3   
7   
11  
20  
40

NOTE:

  1. I have used a simple single-key input. If you have multiple key fields and/or partitions, you will have to adjust mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.
  2. The identity mapper is needed because an MR job requires at least one mapper to run.
  3. The identity reducer is needed because the shuffle/sort phase does not run in a pure map-only job.
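
For the multi-field case in note 1, the comparator accepts sort-style field specs such as -k1,1n -k2,2nr, typically combined with stream.num.map.output.key.fields so that Streaming treats more than one field as the key; check the exact property names against your Hadoop version. The sketch below is a local Python approximation of that ordering only, not Hadoop code. The records and the sort_key helper are hypothetical, made up for illustration: field 1 is sorted numerically ascending, field 2 numerically descending.

```python
# Hypothetical tab-separated records: field 1 is a group id, field 2 a score.
records = ["2\t5", "1\t30", "10\t7", "1\t4", "2\t40"]

def sort_key(record):
    # Split into the two key fields and compare both as numbers:
    # field 1 ascending, field 2 descending (hence the negation).
    group, score = record.split("\t")
    return (int(group), -int(score))

ordered = sorted(records, key=sort_key)
for record in ordered:
    print(record)
```

Running the intended ordering locally on a small sample like this is a cheap way to confirm the field specs before submitting the job.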

 
