Hive Partitioning and Bucketing Example on Twitter Data

Overview of Hive Partitioning:

Hive organizes tables into partitions, a way of dividing a table into related parts based on the values of partition columns such as date, city, or department. With partitions, it is easy to query only the relevant portion of the data.
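For example, a query that filters on a partition column reads only the matching partition's directory instead of scanning the whole table. A sketch, using the table and columns defined later in this post:

```sql
-- Only the profileLocation='Chicago' partitions are scanned
select username, txt
from partitiontwitter
where profileLocation = 'Chicago';
```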

Overview of Hive Bucketing:

A Hive partition can be further subdivided into clusters, or buckets. Bucketing is simply another technique for decomposing data into more manageable, roughly equal parts.
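Hive assigns each row to a bucket by hashing the clustering column, roughly `hash(column) mod number_of_buckets`. A hedged sketch of inspecting that mapping with Hive's built-in `hash` and `pmod` functions, using the table defined below:

```sql
-- Which of the 2 buckets each tweet would land in
select tweetId, pmod(hash(tweetId), 2) as bucket_no
from twitter
limit 10;
```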

The dataset:
The input file contains one tweet per line with the following tab-separated fields:
Tweet ID

Username

Text

Created Date

Profile Location

Favc

Retweet

Retweet Count

Count of Followers
Script:
Create the table for the Twitter data:

create table twitter(tweetId BIGINT, username STRING, txt STRING, CreatedAt STRING,
profileLocation STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
row format delimited
fields terminated by '\t'
stored as textfile;

Load data from the input file (Twitterdata.txt) into the twitter table:
load data local inpath '/home/hadoop/hive/Twitterdata.txt' overwrite into table twitter;
-- If the file is already on HDFS, omit the "local" keyword.
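As a sketch, loading the same file from an HDFS path instead would look like the following (the path here is an assumption; adjust it to your cluster layout):

```sql
-- No "local": Hive moves the file from this HDFS location into the table
load data inpath '/user/hadoop/Twitterdata.txt' overwrite into table twitter;
```

Note that when loading from HDFS, Hive moves (rather than copies) the source file into the table's warehouse directory.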

Create a table with partitioning:

create table partitiontwitter(tweetId BIGINT, username STRING, txt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
partitioned by(CreatedAt STRING, profileLocation STRING)
row format delimited
fields terminated by '\t'
stored as textfile;
Load data from the twitter table into the partitioned table:

insert overwrite table partitiontwitter
partition (CreatedAt='26 04:50:56 UTC 2014', profileLocation='Chicago')
select tweetId, username, txt, favc, retweet, retcount, followerscount
from twitter where profileLocation='Chicago' limit 50;
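To confirm that the insert created the expected partition, Hive can list a table's partitions directly:

```sql
-- Each partition corresponds to a directory under the table's location
show partitions partitiontwitter;
```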

Create a table with bucketing:
create table buckettwitter(tweetId BIGINT, username STRING, txt STRING,
CreatedAt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
partitioned by(profileLocation STRING)
clustered by(tweetId) into 2 buckets
row format delimited
fields terminated by '\t'
stored as textfile;

set hive.enforce.bucketing = true;              -- required so Hive populates the buckets on insert

Load data from the twitter table into the bucketed table:

insert overwrite table buckettwitter partition(profileLocation='Chicago')
select tweetId, username, txt, CreatedAt, favc, retweet, retcount, followerscount
from twitter
where profileLocation = 'Chicago' limit 100;
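Once the data is bucketed, individual buckets can be sampled efficiently, which is one of the main benefits of bucketing. A sketch using Hive's TABLESAMPLE clause against the table above:

```sql
-- Read only the first of the 2 buckets instead of the full partition
select tweetId, username
from buckettwitter
tablesample(bucket 1 out of 2 on tweetId);
```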

Downloads:
Sample Twitter Data

I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick..
