如何拓展Hadoop的InputFormat为其他分隔符

在Hadoop中，常用的TextInputFormat是以换行符作为Record分隔符的。

在实际应用中，我们经常会出现一条Record中包含多行的情况，例如：

<doc>
....
</doc>

此时，需要拓展TextInputFormat以完成这个功能。

先来看一下原始实现：

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
 
  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
// By default,textinputformat.record.delimiter = ‘/n’(Set in configuration file)
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }
 
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

根据上面的代码，不难发现，换行符实际上是由"textinputformat.record.delimiter"这个配置决定的。

所以我们有种解决方案：
(1) 在Job中直接配置textinputformat.record.delimiter为"</doc>\n"，这种方案是比较Hack的，很容易影响到其他代码的正常执行。
(2) 继承TextInputFormat，在return LineRecordReader时，使用自定义的分隔符。

本文采用第二种方案，代码如下：

public class DocInputFormat extends TextInputFormat {

	private static final String RECORD_DELIMITER = "</doc>\n";

	@Override
	public RecordReader<LongWritable, Text> createRecordReader(
			InputSplit split, TaskAttemptContext tac) {
		byte[] recordDelimiterBytes = null;
		recordDelimiterBytes = RECORD_DELIMITER.getBytes();
		return new LineRecordReader(recordDelimiterBytes);
	}

	@Override
	public boolean isSplitable(JobContext context, Path file) {
		CompressionCodec codec = new CompressionCodecFactory(
				context.getConfiguration()).getCodec(file);
		return codec == null;
	}
}

需要指出的是，InputFormat只是把原始HDFS文件分割成String的记录，如果你的<doc> </doc>内有其他结构化数据，那么需要在map中自己实现deserilize的相关业务逻辑来处理。

四号程序员

Keep It Simple and Stupid

如何拓展Hadoop的InputFormat为其他分隔符

One thought on “如何拓展Hadoop的InputFormat为其他分隔符”

Leave a Reply Cancel reply