Chicago Crime HIVE and PIG Case Study



Crime Data with HIVE and PIG Using the Chicago Crime data. Here I will answer a few simple questions to illustrate the use of some common big data tools.


The dataset :


The data set contains a little over 90 plus records, perhaps not really on the scale of big data, however the tools and code used in this document (HIVE and PIG) will be unchanged if we were to handle this data set with tens of millions of records.


Questions to Answer: 

1. The most frequently occurring primary type (i.e. theft, narcotics etc..) 

2. Districts with the most reported incidents 
3. Blocks with the most reported incidents 
4. Blocks with the most reported incidents, grouped by primary type 
5. A look at the date and time when the highest number of incidents where reported 
6. Arrests by primary type 
7. Arrests by district 
8. A look at the date and time when the highest number of arrests took place.

In each instance we will restrict the reporting in this document to 10 lines of data, simply to preserve space.
 

The intention at a high level is to use historical data to assist law enforcement in answering, WHAT has been taking place (primary type i.e. narcotics, motor theft etc.), WHERE has it been taking place (district, block etc.), WHEN has it been taking place (month, day, hour). With this information law enforcement could operate in a more effective and efficient manner. In addition when combining this data with additional variables from other data sets/sources, law enforcement could possibly develop predictive models, further improving the effectiveness and efficiency of its operations.


1. The most frequently occurring primary type (i.e. theft, narcotics etc..)?


HIVE QUERY:


SELECT primarytype,
COUNT(*) AS cnt FROM crime GROUP BY primarytype
ORDER BY cnt DESC
 
 
PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv
(its path in which you have store your Chicago crime “csv” file. Path could be change as per your requirement)
 
crime_grp_type = GROUP crime BY primarytype;
crime_grp_type_cntd = FOREACH crime_grp_type GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_grp_type_cntd BY cnt;

DUMP srtd; 

RESULT:


2. Districts with the most reported incidents?

HIVE QUERY:
 
SELECT district,
COUNT(*) AS cntdistrict FROM crime GROUP BY district
ORDER BY cntdistrict DESC
 
 
PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_dist = GROUP crime BY district;
crime_grp_dist_cntd = FOREACH crime_grp_dist GENERATE COUNT(crime) AS cnt;
 srtd = ORDER crime_grp_dist_cntd BY cnt;

DUMP srtd;

RESULT:


3. Blocks with the most reported incidents?

HIVE QUERY:
 
SELECT block,
COUNT(*) AS cntblock FROM crime
GROUP BY block
ORDER BY cntblock DESC
 
 
PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_block = GROUP crime BY block;
 crime_grp_block_cntd = FOREACH crime_grp_block GENERATE COUNT(crime) AS cnt;
 srtd = ORDER crime_grp_block_cntd BY cnt;

 DUMP srtd; 

RESULT: 



4. Blocks with the most reported incidents, grouped by primary type?


 HIVE QUERY:


SELECT block,
primarytype, COUNT(*) AS cntblocktype FROM crime GROUP BY block,
primarytype ORDER BY cntblocktype DESC
 
 
PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_cogrp_block_type = COGROUP crime BY (block, primarytype);
crime_ cogrp_block_type _cntd = FOREACH crime_ cogrp_block_type GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_ cogrp_block_type _cntd BY cnt;
DUMP srtd;
 
RESULT:
 
5. A look at the date and time when the highest number of incidents where reported?
 
HIVE QUERY:
 
SELECT date,
COUNT(*) AS cnt FROM crime
GROUP BY date
ORDER BY cnt DESC
 
 
PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_date = GROUP crime BY date;
crime_grp_date_cntd = FOREACH crime_grp_date GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_grp_date_cntd BY cnt;
DUMP srtd;
 
RESULT:
 
6. Arrests by primary type?
 
HIVE QUERY:
 
SELECT primarytype,
COUNT(*) AS cnt FROM crime WHERE arrest = True
GROUP BY primarytype
ORDER BY cnt DESC
 
 


PIG SCRIPT:
 
crime = LOAD ”/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter = FILTER crime BY ( UPPER (arrest) matches ‘.*TRUE.*’ );
crime_grp_type = GROUP crime_filter BY primarytype;
crime_grp_type_cntd = FOREACH crime_grp_type GENERATE COUNT(crime_filter) AS cnt;
srtd = ORDER crime_grp_type_cntd BY cnt;
DUMP srtd; 
 
RESULT:
 
7. Arrests by district?
 
HIVE QUERY:
 
SELECT district,
COUNT(*) AS cntdistrictarrest FROM crime WHERE arrest = True
GROUP BY district
ORDER BY cntdistrictarrest DESC
 
 
PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter = FILTER crime BY ( UPPER (arrest) matches ‘.*TRUE.*’ );
crime_grp_dist = GROUP crime_filter BY district;
crime_grp_dist_cntd = FOREACH crime_grp_dist GENERATE COUNT(crime_filter) AS cnt;
srtd = ORDER crime_grp_dist_cntd BY cnt;
DUMP srtd; 
 
RESULT:
 
8. A look at the date and time when the highest number of arrests took place?
 
HIVE QUERY:
 
SELECT date,
COUNT(*) AS cnt_arrest FROM crime WHERE arrest = True
GROUP BY date
ORDER BY cnt_arrest DESC
 
 


PIG SCRIPT:
 
crime = LOAD ‘/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter = FILTER crime BY ( UPPER (arrest) matches ‘.*TRUE.*’ );
crime_grp_date = GROUP crime_filter BY date;
crime_grp_date_cntd = FOREACH crime_grp_date GENERATE COUNT(crime_filter) AS cnt;
srtd = ORDER crime_grp_date_cntd BY cnt;
DUMP srtd 
 
RESULT:
 
Downloads:
 

I hope this tutorial will surely help you. If you have any questions or problem let me know.

 
Happy Hadooping with Patrick..

Leave a Reply

Your email address will not be published. Required fields are marked *