Social Media (Twitter) Data Analysis PIG Case Study

This post contains examples of social media analysis using Pig. Individual examples are described in detail below.

The datasets:

  • full_text.txt: Contains geo-tagged Twitter data with the following fields:
    • Twitter user ID
    • Timestamp of the tweet
    • Location of the tweet
    • Latitude of the tweet
    • Longitude of the tweet
    • Tweet content
  • cities15000.txt: Contains information on cities around the world with the following fields:
    • Record ID
    • City name
    • Country code
    • Latitude of the city
    • Longitude of the city
    • Timezone ID
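
The scripts in this post only read full_text.txt, so here is a rough sketch of how cities15000.txt could be loaded with the six fields listed above. The path, the alias names, the field names, and the tab delimiter are all assumptions rather than details given in this post; adjust them to match your copy of the file.

```pig
-- Sketch only: the path is hypothetical; point it at your copy of cities15000.txt in HDFS.
-- PigStorage('\t') assumes tab-delimited fields in the order listed above.
cities = LOAD '/path/to/cities15000.txt' USING PigStorage('\t')
         AS (record_id:chararray, city:chararray, country_code:chararray,
             lat:float, lon:float, timezone:chararray);
-- Look at a handful of rows to confirm the fields parsed as expected
cities_preview = LIMIT cities 5;
DUMP cities_preview;
```

If a field fails to convert to its declared type, Pig loads it as null instead of stopping the job, so a quick preview like this is a cheap way to catch a wrong delimiter or column order.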


This script finds the top 5 hashtags in full_text.txt. For this example, a hashtag is defined as any string that starts with ‘#’ followed by one or more letters, digits, or underscores.
-- Load data
data = LOAD '/hadoopgyaan/user/popularhashtags/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Convert all tweets to lowercase to allow for accurate grouping
lowercase = FOREACH data GENERATE LOWER(tweet) as tweet;
-- Separate all tweets into individual words
tweetwords = FOREACH lowercase GENERATE FLATTEN(TOKENIZE(tweet)) as token;
-- Extract only hashtags from the collection of words ('#' followed by letters, digits, or underscores)
hashtags = FOREACH tweetwords GENERATE REGEX_EXTRACT(token, '#[a-z0-9_]+', 0) as hashtag;
-- Group identical hashtags together and create an ordered list of aggregate counts
grouphashtags = GROUP hashtags BY hashtag;
counthashtags = FOREACH grouphashtags GENERATE group as hashtag, COUNT(hashtags) as cnt;
orderhashtags = ORDER counthashtags BY cnt desc;
limithashtags = LIMIT orderhashtags 5;
DUMP limithashtags;
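
One detail the script above relies on implicitly: REGEX_EXTRACT returns null for any token that does not match, so every non-hashtag word ends up in a single null group, which COUNT then reports as 0 because it ignores nulls. A variant that drops those tokens before grouping might look like the sketch below. The alias names and the pattern '#[a-z0-9_]+' are my own choices for illustration, based on the hashtag definition given above.

```pig
-- Sketch only: same input path and schema as the script above.
data = LOAD '/hadoopgyaan/user/popularhashtags/full_text.txt'
       AS (id:chararray, ts:chararray, location:chararray,
           lat:float, lon:float, tweet:chararray);
-- Lowercase each tweet and split it into one token per row
tokens = FOREACH data GENERATE FLATTEN(TOKENIZE(LOWER(tweet))) AS token;
-- Assumed pattern: '#' followed by one or more letters, digits, or underscores
candidates = FOREACH tokens GENERATE REGEX_EXTRACT(token, '#[a-z0-9_]+', 0) AS hashtag;
-- Drop tokens where the pattern did not match (REGEX_EXTRACT yields null)
hashtags_only = FILTER candidates BY hashtag IS NOT NULL;
-- Count each hashtag and keep the five most frequent
grouped = GROUP hashtags_only BY hashtag;
counts = FOREACH grouped GENERATE group AS hashtag, COUNT(hashtags_only) AS cnt;
ordered = ORDER counts BY cnt DESC;
top5 = LIMIT ordered 5;
DUMP top5;
```

The top 5 should be the same either way; the filter mainly shrinks the data that has to be shuffled for the GROUP step.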


This script file will find the user that tweeted from the greatest number of locations, i.e. the greatest number of distinct latitude and longitude pairs.
-- Load data
data = LOAD '/hadoopgyaan/user/mobiletweeter/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Join latitude and longitude coordinates into a tuple
locations = FOREACH data GENERATE id, TOTUPLE(lat, lon) as loc_tuple:tuple(lat:float, lon:float);
-- Find only unique locations for each user
distinct_locations = DISTINCT locations;
-- Create an ordered list of the counts of unique locations for each user, returning the top result
group_locations = GROUP distinct_locations BY id;
count_locations = FOREACH group_locations GENERATE group as id, COUNT(distinct_locations) as cnt;
ordered_counts = ORDER count_locations BY cnt desc;
limit_counts = LIMIT ordered_counts 1;
DUMP limit_counts;
I hope this tutorial helps you. If you have any questions or problems, please let me know.
Happy Hadooping with Patrick.

2 thoughts on “Social Media (Twitter) Data Analysis PIG Case Study”

  1. Sir, I like your content on Hadoop very much, and I don’t think there is any other website with this many helpful case studies.
    Thank you for your efforts in updating your site.
    I have a request: could you please re-upload the data sets and the code links, as some of them are not working?

    1. Hi Tushar,

      Sorry for the late reply, and thanks for appreciating my efforts to make this website useful for others. Thanks for the feedback. As you suggest, I’ll upload the links again. If there is anything specific you need, please let me know.
