Wednesday, May 13, 2015

DISTRIBUTE BY in Hive

DISTRIBUTE BY
=============
1. DISTRIBUTE BY controls how map output is divided among the reducers.
2. In the example below, DISTRIBUTE BY ename ensures that all records with the same ename go to the same reducer. SORT BY then
   orders the rows within each reducer the way we want (ascending ename, then ascending esal).
3. DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls which reducer receives which rows for processing.

NOTE: Hive requires the DISTRIBUTE BY clause to come BEFORE the SORT BY clause when both are used in the same query.
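
As a minimal sketch of the points above, the general shape of such a query is shown here against the disttab table used in this post. The second statement uses CLUSTER BY, which Hive offers as a shorthand when the DISTRIBUTE BY and SORT BY columns are the same and the order is ascending. Note that it sorts on ename only, so it is not an exact replacement for the query in this post, which also sorts on esal.

```sql
-- Rows with the same ename are sent to the same reducer;
-- each reducer then sorts only the rows it received.
SELECT empid, ename, esal
FROM disttab
DISTRIBUTE BY ename
SORT BY ename ASC, esal ASC;

-- Shorthand for DISTRIBUTE BY ename SORT BY ename (ascending only).
SELECT empid, ename, esal
FROM disttab
CLUSTER BY ename;
```

SORT BY orders rows per reducer, not globally, so with more than one reducer the combined output is not guaranteed to be in one overall order.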
-----------------------------------
hive> select * from disttab;
OK
108   Sandeep     22000
109   Sandeep     22500
110   Sandeep     23500
111   Sandeep     24340
112   Karan 45000
113   Karan 45600
101   Ravi  46000
102   Prakash     34000
103   Murali      23000
104   Madan 45555
105   Kanth 56000
106   Varma 33333
Time taken: 0.409 seconds
hive>
-------------------------------------
hive> select empid , ename , esal from disttab DISTRIBUTE BY ename SORT BY ename asc , esal asc;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201304160610_0008, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201304160610_0008
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201304160610_0008
2013-04-16 07:05:36,346 Stage-1 map = 0%,  reduce = 0%
2013-04-16 07:05:40,480 Stage-1 map = 100%,  reduce = 0%
2013-04-16 07:05:49,639 Stage-1 map = 100%,  reduce = 33%
2013-04-16 07:05:50,765 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201304160610_0008
OK
108   Sandeep     22000
109   Sandeep     22500
110   Sandeep     23500
111   Sandeep     24340
105   Kanth 56000
112   Karan 45000
113   Karan 45600
104   Madan 45555
103   Murali      23000
102   Prakash     34000
101   Ravi  46000
106   Varma 33333
Time taken: 24.721 seconds
hive>
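
The job above ran with a single reducer ("Estimated from input data size: 1"), so every row went to the same reducer. To see the distribution itself, one option is to force more than one reducer and write the result to a directory. This is only a sketch: the reducer count of 2 and the path /tmp/disttab_out are assumptions chosen for illustration, and which ename values land on which reducer depends on Hive's hashing.

```sql
-- Assumption: use 2 reducers (property name as printed in the job log above).
set mapred.reduce.tasks=2;

-- Assumption: /tmp/disttab_out is a hypothetical local output path.
-- Each reducer writes its own output, sorted by ename and then esal.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/disttab_out'
SELECT empid, ename, esal
FROM disttab
DISTRIBUTE BY ename
SORT BY ename ASC, esal ASC;
```

All rows for a given ename should then appear together in one reducer's output, sorted within that output only.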

 
