Hive中如何强制UDF在Reducer执行

转载自:http://kernel-panik.blogspot.com/2013/05/force-udf-execution-to-happen-in-hive.html

Doing quick and dirty URL fetch from hive, I wanted for URL"s to be ditributed among 5 jobs. Input is small it's very hard to tune up on mapper side things to heppen on 5 mappers say.

Regular:

insert overwrite table url_raw_contant partition(dt = 20130606)
select
    full_url,
    priority,
    regexp_replace(curl_url(full_url), '\n|\r', ' ') as raw_html
from
    url_queue_table_sharded_temp;

Forced UDF execution to Reducer (5 reducers):

set mapred.reduce.tasks=5;

insert overwrite table url_raw_contant_table partition(dt = 20130606)
select
    full_url,
    priority,
    regexp_replace(curl_url(full_url), '\n|\r', ' ') as raw_html
from (
    select
        full_url, 
        priority
    from
        url_queue_table_sharded_temp
    distribute by
        md5(full_url) % 5
    sort by
        md5(full_url) % 5, priority desc
) d
distribute by
    md5(full_url) % 5;

Leave a Reply

Your email address will not be published. Required fields are marked *