“The Daily Show” of Comedy Central Data Analysis PIG Case Study

In this post, we will analyse guest data from “The Daily Show” using Apache Pig. The Daily Show (known as The Daily Show with Jon Stewart from 1999 until 2015, and The Daily Show with Trevor Noah as of 2015) is an American news satire and talk show television program, which airs each Monday through Thursday on Comedy Central and on The Comedy Network in Canada.

The Dataset :

We have historical data on The Daily Show guests from 1999 to 2015, with the following fields:

YEAR – The year the episode aired.

GoogleKnowlege_Occupation – The guest's occupation or office, according to Google’s Knowledge Graph. If the guest is not listed there, this is how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest.

Group – A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under “politicians”.

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.
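
As a quick sanity check before running the scripts below, the fields above can be mapped onto a Pig schema and inspected. This is only a minimal sketch: it assumes the file sits at the same path used in the scripts in this post, has no header row, and contains no commas inside field values (PigStorage splits on every delimiter and does not understand quoted CSV fields). The alias names are illustrative.

```pig
-- Load the guest list as comma-separated text.
-- Field names are illustrative; 'grp' is used instead of 'group'
-- because 'group' is a reserved word in Pig Latin.
A = LOAD '/home/hadoopgyaan/dialy_shows' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, date:chararray,
        grp:chararray, guestlist:chararray);

-- Print the schema Pig has assigned to the relation.
DESCRIBE A;

-- Look at a few rows to confirm the columns line up.
first_rows = LIMIT A 5;
DUMP first_rows;
```

If the dumped rows show values shifted into the wrong columns, the likely cause is a comma inside a guest list entry, and the file would need cleaning before the counts below can be trusted.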

Problem Statements :

1. Find the top five GoogleKnowlege_Occupation values among guests who appeared on the show in a particular time period.

PIG SCRIPT :
A = load '/home/hadoopgyaan/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,guestlist:chararray);
B = foreach A generate occupation,date;
C = foreach B generate occupation,ToDate(date,'MM/dd/yy') as date;
-- The date range below can be modified by the user
D = filter C by ((date > ToDate('1/11/99','MM/dd/yy')) AND (date < ToDate('6/11/99','MM/dd/yy')));
E = group D by occupation;
F = foreach E generate group, COUNT(D) as cnt;
G = order F by cnt desc;
H = limit G 5;
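
Since the date range is meant to be changed by the user, one option is Pig's parameter substitution, which avoids editing the script each time. The sketch below is an assumption about how that could look, not the script that produced the output in this post. The parameter names start_date and end_date and the alias names are hypothetical.

```pig
-- Default date range, used when no -param values are passed on the command line.
%default start_date '1/11/99';
%default end_date '6/11/99';

A = LOAD '/home/hadoopgyaan/dialy_shows' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, date:chararray,
        grp:chararray, guestlist:chararray);

-- Convert the air date string to a datetime so it can be compared.
C = FOREACH A GENERATE occupation, ToDate(date, 'MM/dd/yy') AS air_date;

-- Keep only the episodes strictly inside the requested range.
D = FILTER C BY (air_date > ToDate('$start_date', 'MM/dd/yy'))
            AND (air_date < ToDate('$end_date', 'MM/dd/yy'));

-- Count guests per occupation and keep the five largest counts.
E = GROUP D BY occupation;
F = FOREACH E GENERATE group AS occupation, COUNT(D) AS cnt;
G = ORDER F BY cnt DESC;
H = LIMIT G 5;
DUMP H;
```

Saved under a hypothetical name such as top_occupations.pig, the script could be run for another period with: pig -param start_date=1/1/04 -param end_date=12/31/04 top_occupations.pig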

OUTPUT :

Relation H holds the five most common GoogleKnowlege_Occupation values among guests in the chosen period.

When we dump H, we get the result below.

(actor,28)
(actress,20)
(comedian,4)
(television actress,3)
(singer,2)

2. Find the number of politicians who appeared on the show each year.

PIG SCRIPT:

A = load '/home/hadoopgyaan/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,guestlist:chararray);
B = foreach A generate year,grp;
C = filter B by grp == 'Politician';
D = group C by year;
E = foreach D generate group, COUNT(C) as cnt;
F = order E by cnt desc;

OUTPUT:

When we dump relation F, we get the number of politicians who were guests on the show each year, sorted by count in descending order, as displayed below.

(2004,32)
(2012,29)
(2008,27)
(2009,26)
(2006,25)
(2010,25)
(2011,23)
(2005,22)
(2007,21)
(2015,14)
(2003,14)
(2014,13)
(2000,13)
(2013,11)
(2002,8)
(2001,3)
(1999,2)

3. Find the number of guests in each group, across all GoogleKnowlege_Occupation types, who have been on the show.

PIG SCRIPT:
A = load '/home/hadoopgyaan/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,guestlist:chararray);
B = foreach A generate occupation,grp;
C = group B by grp;
D = foreach C generate group, COUNT(B) as cnt;
E = order D by cnt desc;

OUTPUT:

(Acting,930)
(Media,751)
(Politician,308)
(Comedy,150)
(Musician,123)
(Academic,103)
(Athletics,52)
(Misc,45)
(Government,40)
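
The counts above are the number of guest rows per group. If the number of distinct occupation types per group is wanted instead, a nested DISTINCT inside FOREACH is one way to get it. This is a sketch under the same loading assumptions as the script above. The aliases occs, distinct_occupations, and appearances are illustrative, and no output is shown because the sketch was not run against the dataset.

```pig
A = LOAD '/home/hadoopgyaan/dialy_shows' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, date:chararray,
        grp:chararray, guestlist:chararray);
B = FOREACH A GENERATE occupation, grp;
C = GROUP B BY grp;

-- For each group, count the distinct occupation strings
-- alongside the total number of guest rows.
D = FOREACH C {
    occs = DISTINCT B.occupation;
    GENERATE group AS grp,
             COUNT(occs) AS distinct_occupations,
             COUNT(B) AS appearances;
};

E = ORDER D BY appearances DESC;
DUMP E;
```

Note that DISTINCT compares the strings exactly, so spellings that differ only in case or spacing (for example "actor" and "Actor") would be counted as separate occupation types.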

4. To verify problem statement 3, we will find every combination of group and GoogleKnowlege_Occupation among guests on the show, with a count for each.

PIG SCRIPT :
A = load '/home/hadoopgyaan/dialy_shows' using PigStorage(',') AS (year:chararray,occupation:chararray,date:chararray,grp:chararray,guestlist:chararray);
B = foreach A generate occupation,grp;
C = group B by (grp,occupation);
D = foreach C generate group, COUNT(B) as cnt;
E = order D by group;

OUTPUT:

Relation E holds the count for each combination of group and GoogleKnowlege_Occupation among guests on the show. A sample of the result is displayed below.

((Acting,Film actor),9)
((Acting,Film actress),9)
((Acting,actor),596)
((Acting,actress),271)
((Acting,film actor),10)
((Acting,film actress),12)
((Acting,stunt performer),5)
((Acting,television Actor),2)
((Acting,television actor),1)
((Acting,television actress),13)
((Acting,television actor),1)
((Acting,television actor),1)

If you add up the counts for all the Acting combinations, you get a total of 930, which matches the count shown for Acting in problem statement 3.
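
Rather than adding the counts by hand, the check can also be done in Pig by regrouping the combination counts by group and summing them. This is a minimal sketch with illustrative alias names. If it agrees with problem statement 3, the Acting row should show 930.

```pig
A = LOAD '/home/hadoopgyaan/dialy_shows' USING PigStorage(',')
    AS (year:chararray, occupation:chararray, date:chararray,
        grp:chararray, guestlist:chararray);
B = FOREACH A GENERATE occupation, grp;

-- Count each (group, occupation) combination, as in problem statement 4,
-- and flatten the grouping key back into two plain fields.
C = GROUP B BY (grp, occupation);
D = FOREACH C GENERATE FLATTEN(group) AS (grp, occupation), COUNT(B) AS cnt;

-- Sum the combination counts within each group.
E = GROUP D BY grp;
F = FOREACH E GENERATE group AS grp, SUM(D.cnt) AS total;

G = ORDER F BY total DESC;
DUMP G;
```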

DOWNLOADS :

The Daily Show Guest List Dataset

 

I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick..
