Crime Data with HIVE and PIG
Using the Chicago Crime data, I will answer a few simple questions to illustrate the use of some common big data tools.
The dataset: The data set contains a little over 90 records, which is hardly on the scale of big data. However, the tools and code used in this document (HIVE and PIG) would remain unchanged if the data set held tens of millions of records.
Questions to Answer:
1. The most frequently occurring primary type (e.g. theft, narcotics etc.)
2. Districts with the most reported incidents
3. Blocks with the most reported incidents
4. Blocks with the most reported incidents, grouped by primary type
5. A look at the date and time when the highest number of incidents were reported
6. Arrests by primary type
7. Arrests by district
8. A look at the date and time when the highest number of arrests took place
In each case, the output shown in this document is limited to 10 lines of data, simply to save space.
At a high level, the intention is to use historical data to help law enforcement answer three questions: WHAT has been taking place (primary type, e.g. narcotics, motor theft etc.), WHERE has it been taking place (district, block etc.), and WHEN has it been taking place (month, day, hour). With this information, law enforcement could operate more effectively and efficiently. In addition, by combining this data with variables from other data sets and sources, law enforcement could possibly develop predictive models, further improving the effectiveness and efficiency of its operations.
1. The most frequently occurring primary type (e.g. theft, narcotics etc.)?
HIVE QUERY: