New York City Bike Data Analysis MapReduce Case Study

Citi Bike, a public bicycle sharing system that serves parts of New York City, is the largest bike-sharing program in the United States. The wealth of existing data on Citi Bike users makes it possible to segment the customer base by factors such as age, sex and occupation. This helps the company identify potential customers to target, as well as weak customer segments that need improvement. The prime objective of this project is to identify long-term customers and potential audiences that can be targeted to grow the company’s customer base and drive more revenue.

Other deliverables of this project are to identify hot-spot locations and peak-demand hours. This information can help the company manage its supply better, which in turn serves more customers. Knowing which locations are hot-spots at peak hours helps the company understand market demand, and increasing bike availability at those hot-spots during peak hours should help drive more revenue. Our design relies largely on the existing Citi Bike trip histories data and the Citi Bike Daily Ridership and Membership Data.


The Dataset:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
  • Gender (0 = unknown; 1 = male; 2 = female)
  • Year of Birth
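
Before any of the analyses can run, each line of the trip file has to be split into these fields. Below is a minimal sketch in plain Java (no Hadoop dependency) of how a mapper might do that. The class name `TripLineParser`, the column positions, the 15-column layout and the sample row are all assumptions made for illustration, so check them against the header of your own file. The sketch also returns null for the header row, because passing a column name such as `start_station_latitude` to `Double.parseDouble` throws a `NumberFormatException`.

```java
final class TripLineParser {
    // Assumed 0-based column positions; verify against the header of your own file.
    static final int TRIP_DURATION = 0;
    static final int USER_TYPE = 12;
    static final int BIRTH_YEAR = 13;
    static final int GENDER = 14;
    static final int EXPECTED_COLUMNS = 15;

    /**
     * Splits one CSV line into trimmed fields.
     * Returns null for blank lines, the header row, and rows with an unexpected column count.
     */
    static String[] parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        // Simplification: assumes no commas inside field values.
        String[] fields = line.split(",", -1);
        if (fields.length != EXPECTED_COLUMNS) {
            return null;
        }
        for (int i = 0; i < fields.length; i++) {
            // Drop surrounding double quotes, if the export uses them.
            fields[i] = fields[i].replace("\"", "").trim();
        }
        // A data row starts with a numeric trip duration; the header row starts with a column name.
        if (!fields[TRIP_DURATION].matches("\\d+")) {
            return null;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical header and data row, made up for this example.
        String header = "tripduration,starttime,stoptime,start_station_id,start_station_name,"
                + "start_station_latitude,start_station_longitude,end_station_id,end_station_name,"
                + "end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender";
        String row = "634,2015-07-01 00:00:03,2015-07-01 00:10:37,72,W 52 St & 11 Ave,40.767,-73.994,"
                + "79,Franklin St & W Broadway,40.719,-74.007,17131,Subscriber,1985,1";

        System.out.println("header skipped: " + (parse(header) == null));
        String[] fields = parse(row);
        System.out.println(fields[USER_TYPE] + " " + fields[BIRTH_YEAR] + " " + fields[GENDER]);
    }
}
```

In a real mapper, a null result would simply mean "emit nothing for this line".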

Problem Statements:

  • 1. Number of Subscribers and Customers
  • 2. Number of Subscribers and Customers for each gender
  • 3. Number of Subscribers and Customers for each gender in every age category
  • 4. Average Trip Distance
  • 5. Week Stats: Identify the day with the most trips
  • 6. Most Popular Stations: Identify the stations with the most originating trips and the most popular destination stations.
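
Most of these problems follow the same counting pattern: the map step turns each trip into a key, and the reduce step adds up a count of 1 per key. The sketch below shows that pattern in plain Java, without Hadoop, for problems 1 and 5. It is not the code from the MR download. The class name `TripCounts`, its method names and the `yyyy-MM-dd HH:mm:ss` start-time format are assumptions for illustration, so adjust the format to match your file.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TripCounts {
    // Assumed start-time format; adjust to match your own file.
    static final DateTimeFormatter START_TIME_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** "Map" step for problem 5: the key is the day of the week, e.g. "WEDNESDAY". */
    static String weekdayKey(String startTime) {
        return LocalDateTime.parse(startTime.trim(), START_TIME_FORMAT)
                .getDayOfWeek()
                .toString();
    }

    /** "Reduce" step: adds a count of 1 for every occurrence of a key. */
    static Map<String, Integer> countByKey(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the key with the highest count; ties go to the alphabetically first key. */
    static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Problem 1: the key is the user type itself (hypothetical values).
        List<String> userTypes = Arrays.asList("Subscriber", "Customer", "Subscriber");
        System.out.println(countByKey(userTypes));

        // Problem 5: the key is derived from the start time (hypothetical values).
        List<String> weekdays = Arrays.asList(
                weekdayKey("2015-07-01 00:00:03"),
                weekdayKey("2015-07-01 08:15:00"),
                weekdayKey("2015-07-02 09:30:00"));
        System.out.println(mostFrequent(countByKey(weekdays)));
    }
}
```

The other counting problems can reuse the same idea by changing only the key. For example, combining user type and gender into one key gives problem 2, and a start station name as the key gives the originating-trips half of problem 6.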



New York City Bike Data Analysis MR Codes

New York City Bike Data Analysis Dataset and Output


I hope this tutorial helps you. If you have any questions or problems, please let me know.

Happy Hadooping with Patrick.


5 thoughts on “New York City Bike Data Analysis MapReduce Case Study”

  1. hdfs@localhost:/opt/ecosystems/citibike$ $HADOOP_HOME/bin/hadoop jar genderAnalytic.jar genderAnalyticDriver /user/citibike/ /user/citout1
    Exception in thread “main” java.lang.ClassNotFoundException: genderAnalyticDriver
    at Method)
    at java.lang.ClassLoader.loadClass(
    at java.lang.ClassLoader.loadClass(
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(
    at org.apache.hadoop.util.RunJar.main(

    1. Hi Nancy,

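      Please share the command you are using to run the CitiBike jar file, including the jar file name, the input path and the output path. This looks like a compile or packaging error. Also check the “genderAnalytic Driver” Java code: if the driver class is declared inside a package, the class name passed to hadoop jar must include that package.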

  2. Hi,
    I am getting the error below during the Map phase itself. Could you please advise?

    Error: java.lang.NumberFormatException: For input string: “start_station_latitude”
    at sun.misc.FloatingDecimal.readJavaFormatString(
    at java.lang.Double.parseDouble(
    at org.apache.hadoop.mapred.MapTask.runNewMapper(
    at org.apache.hadoop.mapred.YarnChild$
    at Method)
    at org.apache.hadoop.mapred.YarnChild.main(

    1. Silly one. Got it. Skipped the first line while parsing the input file. Works fine now. 😀
