Banking Transaction PIG Case Study





we will analyze a banking domain dataset, which contains several files with details of its customers. 

The dataset :


We will refer to the Transaction dataset throughout this analysis. It;s a collection of financial information from a unknown bank. The dataset deals with over 5,300 bank clients with approximately 68,614 transactions. 


Q1. Find how many records are in the data set?


— Load data from Transactions.csv

transactions = LOAD ‘/home/cloudera/datasets/hadoopgyaan/Transactions.csv’;

— Group each record by itself

tGroup = GROUP transactions ALL;

–Count number of groups

tCount = FOREACH tGroup GENERATE COUNT(transactions);


DUMP tCount;

Q2. Show the top 5 customers with largest total sales?


Load data from Transactions.csv

transactions = LOAD ‘/home/cloudera/datasets/hadoopgyaan/Transactions.csv’
        USING PigStorage(‘,’) AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

Group all entries by Customer_Number

customer = GROUP transactions BY Customer_Number;

— Add each sale by Customer_Number

sales = FOREACH customer GENERATE group,SUM(transactions.Sales_Amount) AS totalSales;

— Rank the sum of sales from largest to smallest

rankedSales = RANK sales BY totalSales DESC;

— Show only top 5 largest numbers

top5 = FILTER rankedSales BY $0 <= 5;

DUMP top5;


Q3. Count customers who made sales in System_Period 200401,200402 and 200403?


— Load data from Transactions.csv

transactions = LOAD ‘/home/cloudera/datasets/hadoopgyaan/Transactions.csv’
        USING PigStorage(‘,’) AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int, Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

— Selects transactions that contain 200401, 200402, 200403 from System_Period

sysFilter = FILTER transactions BY System_Period == 200401 OR
        System_Period == 200402 OR System_Period == 200403;

DUMP sysFilter;

Q4. Show top 3 employees (using employee number)who have processed highest average sales?

— Load data from Transactions.csv

transactions = LOAD ‘/home/cloudera/datasets/hadoopgyaan/Transactions.csv’
        USING PigStorage(‘,’) AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

— Group each record by Employee_number

employeeGroup = GROUP transactions BY Employee_Number;

— Average sales by employee number

employeeSales = FOREACH employeeGroup GENERATE group,
        AVG(transactions.Sales_Amount) AS avgSales;

— Rank average sales

rankedSales = RANK employeeSales BY avgSales DESC;

— Show top three from rankedSales

top3 = FILTER rankedSales BY $0<=3;

DUMP top3;


Q5. Show how many transactions were made during system periods 20040?


— Load data from Transactions.csv

transactions = LOAD ‘/home/cloudera/datasets/hadoopgyaan/Transactions.csv’
        USING PigStorage(‘,’) AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:chararray);

— Filter transactions by column System_Period starting with 20040

sysFilter = FILTER transactions BY STARTSWITH(System_Period, ‘20040’);

— Group all of sysFilter

sysGroup = GROUP sysFilter ALL;

— Count entries after being filtered

sysCount = FOREACH sysGroup GENERATE COUNT(sysFilter);

DUMP sysCount;


Q6. Display each unique Sales_Amount by Product_Number?


— Load data from Transactions.csv

transactions = LOAD ‘/home/cloudera/datasets/hadoopgyaan/Transactions.csv’
        USING PigStorage(‘,’) AS (Branch_Number:int, Contract_Number:int,
        Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
        Product_Number:int, Sales_Amount:double, Employee_Number:int,
        Service_Date:chararray, System_Period:int);

— Creates an alias for columns Product_Number and Sales_Amount

prices = FOREACH transactions GENERATE $5 as col0, $6 as col1;

— Groups groups columns

priceGroup = GROUP prices BY (col0,col1);

— Displays each unique Sales_Amount for each Product_Number

priceFilter = FOREACH priceGroup {
        sort = ORDER prices BY col0 DESC;
        topRec = LIMIT prices 1;
        GENERATE FLATTEN(topRec);
        };

DUMP priceFilter;


Downloads:

1.Sample Input file (Transaction.csv)

I hope this tutorial will surely help you. If you have any questions or problems please let me know.
Happy Hadooping with Patrick..

Leave a Reply

Your email address will not be published. Required fields are marked *