Requirement: have the Reduce phase output its results in a special format
Example: pack the Reducer's results into a Guava BloomFilter
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.h[......]
Category Archives: Big Data Technology
How to Quickly Sort Numerically in Hadoop
Reposted from: http://stackoverflow.com/questions/13331722/how-to-sort-numerically-in-hadoops-shuffle-sort-phase
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
- -D mapred.output.key.comparator.class=org.apach[......]
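To see why a numeric comparator is needed at all, here is a minimal sketch in plain Java (no Hadoop dependency) that contrasts text ordering with numeric ordering of the same keys. The class and method names (`NumericKeySortSketch`, `sortAsText`, `sortAsNumbers`) are hypothetical and only illustrate the idea; this is not the KeyFieldBasedComparator implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NumericKeySortSketch {
    // Sort keys as plain strings (lexicographic order), which is what
    // text keys get when no numeric comparator is configured.
    static List<String> sortAsText(List<String> keys) {
        List<String> out = new ArrayList<>(keys);
        out.sort(Comparator.naturalOrder());
        return out;
    }

    // Sort the same keys by their numeric value instead.
    // Assumption: every key parses as a long.
    static List<String> sortAsNumbers(List<String> keys) {
        List<String> out = new ArrayList<>(keys);
        out.sort(Comparator.comparingLong(Long::parseLong));
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "100", "2");
        System.out.println(sortAsText(keys));    // [10, 100, 2, 9]
        System.out.println(sortAsNumbers(keys)); // [2, 9, 10, 100]
    }
}
```

The text ordering places "100" before "2" because comparison stops at the first differing character, which is the behavior the numeric comparator option is meant to avoid.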
Writing Hive Custom Aggregate Functions (UDAF)
Reposted from: "Writing Hive Custom Aggregate Functions (UDAF): Part II"
Now that we have Eclipse configured (see Part I) for UDAF development, it's time to write our first UDAF. Searching for custom UDAF, most people might have already come across the followi[......]
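How to Obtain a List ObjectInspector in a Custom Hive UDF/UDAF/UDTF
In Hive, when implementing a custom UDF/UDAF/UDTF with the GenericU**F classes, you often need to specify the output type, which requires obtaining an ObjectInspector.
For primitive types:
PrimitiveObjectInspectorFactory.javaStringObjectInspector
For composite types such as List, two steps are needed: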
ObjectInspectorFactory
.getStandardListObjectInspector(PrimitiveObjectInspectorFa[......]
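How to Extend Hadoop's InputFormat to Use Other Delimiters
In Hadoop, the commonly used TextInputFormat treats the newline character as the record delimiter.
In practice, a single record often spans multiple lines, for example: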
<doc>
....
</doc>
In this case, TextInputFormat needs to be extended to support such records.
First, let's look at the original implementation:
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {[......]
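As a rough illustration of what delimiter-based record splitting means, here is a minimal sketch in plain Java that cuts a character stream into records ending at `</doc>` rather than at each newline. It is not Hadoop's RecordReader and does not handle input splits; the class and method names (`DelimitedRecordSketch`, `readRecords`) are hypothetical, and a non-empty delimiter is assumed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class DelimitedRecordSketch {
    // Read characters until the buffer ends with the delimiter, then emit
    // the buffered text (delimiter included) as one record.
    // Assumption: delimiter is non-empty.
    static List<String> readRecords(Reader in, String delimiter) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                current.append((char) c);
                int start = current.length() - delimiter.length();
                // A match starting at 'start' means the buffer ends with the delimiter.
                if (start >= 0 && current.indexOf(delimiter, start) == start) {
                    records.add(current.toString().trim());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Keep trailing text that was not closed by a delimiter, if any.
        String tail = current.toString().trim();
        if (!tail.isEmpty()) {
            records.add(tail);
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "<doc>\nline a\nline b\n</doc>\n<doc>\nline c\n</doc>\n";
        for (String record : readRecords(new StringReader(input), "</doc>")) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

Each multi-line `<doc>...</doc>` block comes back as a single record, which is the behavior a customized InputFormat would need to provide on top of Hadoop's split handling.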