Category Archives: Big Data Technology

Hadoop / Hive / HBase / Mahout

How to Quickly Sort Numerically in Hadoop

Reposted from: http://stackoverflow.com/questions/13331722/how-to-sort-numerically-in-hadoops-shuffle-sort-phase

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

-D mapred.output.key.comparator.class=org.apach[......]
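As a small illustration of why a numeric comparator matters here: keys compared as plain text sort lexicographically, so "10" lands before "9". The sketch below is plain Java and does not use Hadoop at all; the class `NumericSortDemo` and its method names are hypothetical, and it assumes the keys are integer strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of Hadoop: contrasts text order with numeric order.
class NumericSortDemo {
    // Sorts keys as plain strings (lexicographic order), as happens when keys are treated as text.
    static List<String> sortAsText(List<String> keys) {
        List<String> out = new ArrayList<>(keys);
        Collections.sort(out);
        return out;
    }

    // Sorts keys by their numeric value; assumes every key parses as a long.
    static List<String> sortAsNumbers(List<String> keys) {
        List<String> out = new ArrayList<>(keys);
        out.sort((a, b) -> Long.compare(Long.parseLong(a), Long.parseLong(b)));
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "100", "2");
        System.out.println(sortAsText(keys));    // [10, 100, 2, 9]
        System.out.println(sortAsNumbers(keys)); // [2, 9, 10, 100]
    }
}
```

The second ordering is the one a numeric sort in the shuffle phase is meant to produce.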

Writing Hive Custom Aggregate Functions (UDAF)

Reposted from: "Writing Hive Custom Aggregate Functions (UDAF): Part II"

Now that we have Eclipse configured (see Part I) for UDAF development, it's time to write our first UDAF. Searching for custom UDAF, most people might have already come across the followi[......]

How to Extend Hadoop's InputFormat to Support Other Delimiters


In Hadoop, the commonly used TextInputFormat treats the newline character as the record delimiter.

In practice, a single record often spans multiple lines, for example:
<doc>
....
</doc>
In this case, TextInputFormat needs to be extended to handle such records.

First, let's look at the original implementation:
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {[......]
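The core idea behind a custom-delimiter input format can be sketched outside Hadoop: scan the input for a delimiter other than the newline and emit everything up to it as one record. The following is only a minimal illustration under that assumption; `DelimitedRecordSplitter` is a hypothetical helper, not a Hadoop class, and a real job would implement the same logic inside a RecordReader that also handles input split boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, independent of Hadoop: shows delimiter-based record splitting only.
class DelimitedRecordSplitter {
    // Splits input into records, each ending at (and including) the given delimiter.
    static List<String> split(String input, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        while (true) {
            int end = input.indexOf(delimiter, start);
            if (end < 0) {
                break; // no further delimiter found
            }
            int next = end + delimiter.length();
            String record = input.substring(start, next).trim();
            if (!record.isEmpty()) {
                records.add(record);
            }
            start = next;
        }
        // Trailing text without a closing delimiter is kept as a final record.
        String tail = input.substring(start).trim();
        if (!tail.isEmpty()) {
            records.add(tail);
        }
        return records;
    }
}
```

For example, calling `DelimitedRecordSplitter.split(text, "</doc>")` on a file of `<doc>` blocks yields one multi-line record per block instead of one record per line.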

Finding the Actual Paths of Tables and Partitions in Hive

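In Hive, if you use an External Table or Partition, its path does not have to be under your own Hive warehouse directory.
-- Get the actual HDFS path of a table
desc formatted my_table;

-- Get the actual HDFS path of a partition
desc formatted my_table partition (pt='20140804');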
[......]