LexMainye/Matatu-Income-Data-Analysis

Analysis of Potential Matatu Income Data

Brief History of Matatus

Matatus are the go-to mode of public transportation in Kenya, originating in the 1960s. The term "matatu" comes from the Swahili word for "three," referring to the three cents charged per trip when they first began operating. Initially, they were informal, privately owned minibuses or vans that filled a gap in public transportation. Over the years, matatus have become a cornerstone of Kenya's transport sector, evolving into a more structured yet still vibrant and culturally significant industry.

Background on Analyzing Hypothetical Matatu Income Data

In this project, I analyze hypothetical income data for matatus, generated synthetically to simulate real scenarios. Using hypothetical data keeps the focus on analytical methods, data modeling, and interpretation without requiring access to actual operational data, which may be sensitive or unavailable.

What the Hypothetical Data Includes:

  • The dataset simulates one week of transactions and includes the following fields:

Daily Income:

  • The fare paid by passengers.

Time Period:

  • The income data is categorized into specific time periods to better analyze trends throughout the day. Below is a breakdown of the time periods:

| Time Period | Description | Approximate Time Range |
| --- | --- | --- |
| Early Morning | The first part of the day, often busy with commuters starting their day. | 4:00 AM - 6:59 AM |
| Morning | The regular morning rush with peak passenger traffic. | 7:00 AM - 9:59 AM |
| Late Morning | A quieter period as the morning rush subsides. | 10:00 AM - 11:59 AM |
| Early Afternoon | Transition period with moderate passenger flow. | 12:00 PM - 1:59 PM |
| Afternoon | Midday activity with commuters. | 2:00 PM - 3:59 PM |
| Late Afternoon | Increasing activity as people begin their commute home. | 4:00 PM - 5:59 PM |
| Evening | High traffic as the majority return home from work. | 6:00 PM - 6:59 PM |
| Night | Reduced activity with occasional late-night travelers. | 7:00 PM - 11:59 PM |

This breakdown helps identify peak and off-peak periods, providing insights into revenue fluctuations and operational efficiency.
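
As a minimal sketch of how a transaction time could be assigned to one of these periods, the snippet below encodes the ranges from the table as a lookup. The names `TIME_PERIODS` and `time_period` are hypothetical and are not taken from the project's code; times outside 4:00 AM - 11:59 PM are assumed to have no period.

```python
from datetime import time

# (label, first minute, last minute inclusive), copied from the time-period table.
TIME_PERIODS = [
    ("Early Morning", time(4, 0), time(6, 59)),
    ("Morning", time(7, 0), time(9, 59)),
    ("Late Morning", time(10, 0), time(11, 59)),
    ("Early Afternoon", time(12, 0), time(13, 59)),
    ("Afternoon", time(14, 0), time(15, 59)),
    ("Late Afternoon", time(16, 0), time(17, 59)),
    ("Evening", time(18, 0), time(18, 59)),
    ("Night", time(19, 0), time(23, 59)),
]


def time_period(t):
    """Return the period label for a datetime.time, or None if no period covers it."""
    # Compare at minute resolution so that e.g. 18:59:30 still counts as Evening.
    t = t.replace(second=0, microsecond=0)
    for label, start, end in TIME_PERIODS:
        if start <= t <= end:
            return label
    return None


print(time_period(time(7, 30)))
```

Keeping the boundaries in one list makes it easy to adjust a range without touching the lookup logic.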

Date:

  • The simulated date on which each transaction took place.

Payment Channels:

  • Cash and mobile money channels (M-PESA, Airtel Money and T-Kash).

Gender:

  • Male or female.

The generated matatu data can be accessed here: Download the Excel file

Data Findings

Total Revenue per Day

  • The above plot breaks down the total revenue earned per day; the highest revenue was earned on 07/11/2024.
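
A daily revenue total of this kind could be computed roughly as follows. This is only a sketch: the sample rows are hypothetical and are not taken from the spreadsheet, and `revenue_per_day` is an illustrative name rather than a function from the project.

```python
from collections import defaultdict

# Hypothetical sample rows: (date string, fare in KES). Not real project data.
transactions = [
    ("06/11/2024", 100.0),
    ("06/11/2024", 120.0),
    ("07/11/2024", 150.0),
    ("07/11/2024", 110.0),
    ("08/11/2024", 90.0),
]


def revenue_per_day(rows):
    """Sum the fares for each date and return a {date: total} dictionary."""
    totals = defaultdict(float)
    for date, fare in rows:
        totals[date] += fare
    return dict(totals)


daily = revenue_per_day(transactions)
best_day = max(daily, key=daily.get)  # date with the highest total revenue
print(best_day, daily[best_day])
```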

Number of Transactions per Time Period

  • The graph shows the average number of transactions in each time period, with the highest number occurring in the late morning period.

Gender Distribution

  • The above plot shows that females appear to use matatus more than males.

Payment Distribution by Time Period

  • The above plot shows which payment channels are used in each time period.

Average Fare Per Time Period

  • The above plot shows the average fare paid during each time period.

Payment Channel by Gender

  • The above plot shows the breakdown of payment channels by gender.

Average Weekly Fare Comparison

  • The above plot compares the average fares paid during off-peak and peak periods.

Transaction Load vs Vehicle Availability

  • The above plot compares the transaction load with the availability of a single matatu.

Transaction Volume Trend

  • The above plot shows how transaction volume changes across the different days.

Profitability by Payment Channel

  • The above plot shows profitability across the payment channels that matatu passengers may prefer.

Forecast of Transactions

  • The above plot shows the observed and predicted fare rates.

Linear Regression of Actual vs Predicted Fare Amounts

  • The above plot compares the model's predicted fare amounts with the actual values and shows how well they align.
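
How well predicted and actual fares align can also be summarised numerically. The sketch below computes the mean absolute error and the coefficient of determination (R²) by hand; the fare values are hypothetical, and `regression_report` is an illustrative helper, not the project's own evaluation code.

```python
def regression_report(actual, predicted):
    """Return the mean absolute error and R^2 for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: how far predictions are from the actual values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: how far actual values are from their own mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "r2": r2}


# Hypothetical fares in KES, used only to exercise the function.
actual = [100.0, 110.0, 90.0, 105.0]
predicted = [102.0, 108.0, 93.0, 104.0]
report = regression_report(actual, predicted)
print(report)
```

An R² close to 1 means the points lie near the diagonal of an actual-versus-predicted plot, while a value near 0 means the model does little better than predicting the mean fare.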

Findings from the Apriori Analysis

The analysis conducted using the Apriori algorithm yielded a set of association rules that highlight patterns in the hypothetical matatu income data.

Link to the association rule spreadsheet: Download the Excel file

Below is a summary of the key findings from the spreadsheet:


Key Metrics Overview:

  1. Support:

    • Support values across the rules are relatively low (<5%), indicating that the specific combinations of antecedents (e.g., time periods) and consequents (e.g., payment methods) are not very frequent.
    • Example: The rule "Early Morning → Cash" has a support of 3.1%, meaning this combination is present in 3.1% of all transactions.
  2. Confidence:

    • Confidence values indicate the likelihood of a consequent occurring given the antecedent.
    • Higher confidence rules include "Morning → Cash" with 28.8%, meaning that about 29% of morning transactions are paid in cash. This is a modest rather than strong association between the morning time period and cash payments.
  3. Lift:

    • Lift values mostly hover slightly above 1 (e.g., 1.05–1.12), indicating weak but positive associations.
    • For example, the rule "Afternoon → Cash" has a lift of 1.079, showing that afternoon transactions are only slightly more likely to involve cash than would be expected if time period and payment method were independent.
  4. Jaccard, Leverage, and Conviction:

    • These secondary metrics further validate the associations:
      • Jaccard values are low (<0.05), reinforcing that the intersections of antecedents and consequents occur infrequently.
      • Leverage values are close to zero, confirming that the discovered rules do not drastically deviate from random co-occurrences.
      • Conviction values (>1) suggest modest dependency between antecedents and consequents.
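
The metrics listed above follow standard definitions, which can be sketched in plain Python for a single-item rule. The transactions below are hypothetical and far smaller than the project's dataset, and `rule_metrics` is an illustrative helper; it is not the code used to produce the spreadsheet.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute common association-rule metrics for the rule antecedent -> consequent.

    transactions is a list of sets of items; both sides of the rule are single items.
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    support_a = n_a / n            # P(A)
    support_c = n_c / n            # P(C)
    support = n_both / n           # P(A and C)
    confidence = support / support_a          # P(C | A)
    lift = confidence / support_c             # 1.0 means A and C are independent
    leverage = support - support_a * support_c  # 0.0 means A and C are independent
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    jaccard = n_both / (n_a + n_c - n_both)   # overlap of A and C over their union
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "jaccard": jaccard,
    }


# Hypothetical transactions, each a set containing a time period and a payment channel.
transactions = [
    {"Morning", "Cash"}, {"Morning", "Cash"},
    {"Morning", "M-PESA"}, {"Morning", "M-PESA"},
    {"Night", "Cash"}, {"Night", "M-PESA"},
    {"Night", "M-PESA"}, {"Night", "M-PESA"},
]
metrics = rule_metrics(transactions, "Morning", "Cash")
print(metrics)
```

Because lift divides confidence by the overall share of the consequent, a rule can have a seemingly meaningful confidence and still have a lift near 1 if the consequent is similarly common everywhere.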

Notable Patterns:

  1. Cash as a Predominant Payment Method:

    • Many rules have Cash as the consequent across various time periods, suggesting it is the most common payment channel. The confidence values cited here (below 30%) show that most transactions in any period still use other channels.
  2. Time Period Influence:

    • Rules highlight specific patterns, such as:
      • Early Morning → Cash (28.8% confidence)
      • Late Morning → Cash (27.1% confidence)
  3. Weak Relationships Overall:

    • While the confidence levels suggest some predictable trends, the low support and lift values close to 1 indicate that these rules are not strongly definitive for most transactions.

Practical Applications:

  1. Optimize Payment Methods:

    • Knowing when cash payments are most likely can help target digital payment incentives at periods where the cash rules have lower confidence (e.g., late afternoon or night).
  2. Peak Time Adjustments:

    • High-confidence rules for busy periods like Morning can guide scheduling and resource allocation strategies.
  3. Future Data Needs:

    • The weak associations suggest potential benefits in expanding the dataset or including additional variables, such as passenger demographics or route types, to uncover more actionable patterns.

Cluster Analysis Findings

The cluster analysis conducted on the transaction amounts grouped the data into three clusters. Each cluster represents a group of transactions with similar patterns in terms of the average transaction amount. Below is a summary of the findings:


Cluster Grouping

| Cluster | Average Transaction Amount (KES) | Description |
| --- | --- | --- |
| 0 | 103.38 | Transactions with slightly lower-than-average amounts. |
| 1 | 104.42 | Cluster with the highest average amount, possibly representing peak times or premium routes. |
| 2 | 103.43 | Similar to Cluster 0 but with a marginally higher average amount. |

Key Observations

  1. Marginal Differences:

    • The average transaction amounts across clusters range from KES 103.38 to KES 104.42, showing only slight variations.
  2. Cluster 1 Dominance:

    • Cluster 1 stands out with the highest average amount, suggesting potential alignment with high-demand periods or specific routes.
  3. Homogeneity in Data:

    • The minimal variation in amounts suggests a relatively uniform dataset in terms of pricing or income patterns.
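
To make the size of these differences concrete, the sketch below shows one way per-cluster averages could be computed from labelled amounts, and then measures the spread between the averages reported in the table. The labelled sample is hypothetical, `cluster_means` is an illustrative helper, and the clustering algorithm itself is not shown because the source does not name it.

```python
from collections import defaultdict


def cluster_means(labels, amounts):
    """Return {cluster label: mean amount} for paired labels and amounts."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, amount in zip(labels, amounts):
        sums[label] += amount
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sorted(sums)}


# Hypothetical labelled amounts, only to exercise the function.
sample_means = cluster_means(
    [0, 0, 1, 1, 2, 2],
    [100.0, 106.0, 104.0, 105.0, 103.0, 104.0],
)

# Cluster averages as reported in the cluster grouping table (KES).
reported = {0: 103.38, 1: 104.42, 2: 103.43}
spread = max(reported.values()) - min(reported.values())
relative_spread = spread / min(reported.values())
print(f"spread: {spread:.2f} KES, relative: {relative_spread:.1%}")
```

The reported averages differ by about KES 1.04, roughly 1% of the lowest cluster mean, which is consistent with the observation that the clusters are nearly indistinguishable on transaction amount.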

Random Forest Classifier Results

The Random Forest Classifier was applied to the income data and produced perfect scores on every metric. Below is a summary and interpretation of the findings:


Performance Metrics

| Metric | Value |
| --- | --- |
| Precision | 1.00 |
| Recall | 1.00 |
| F1-Score | 1.00 |
| Accuracy | 1.00 |

Key Observations:

  1. Perfect Classification:

    • The classifier achieved 100% accuracy, with precision, recall, and F1-score all at their maximum values of 1.00. This means the model made no errors in classifying the dataset.
  2. Support:

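    • A total of 421 data points were used, all belonging to a single class (labeled as "0"). This suggests uniformity or imbalance in the dataset, and it means the perfect scores should be read with caution: when only one class is present, a model that always predicts that class scores 1.00 on every metric.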
  3. Feature Importance:

    • Random Forest provides a measure of feature importance, which can identify the most influential factors in income classification.
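
The effect of a single-class evaluation set on these metrics can be illustrated with a small sketch. It scores a constant "always predict 0" baseline on 421 labels that are all 0, matching the support described above; `binary_report` is an illustrative helper written for this example and is not the project's evaluation code.

```python
def binary_report(y_true, y_pred, positive):
    """Compute precision, recall, F1 and accuracy, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


y_true = [0] * 421  # every data point belongs to class 0
y_pred = [0] * 421  # constant baseline: always predict class 0
report = binary_report(y_true, y_pred, positive=0)
print(report)
```

Since a baseline with no learned structure reaches the same scores, comparing the classifier on a dataset that contains more than one class would give a more informative picture.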

Insights from Feature Importance:

The most critical features influencing classification should be analyzed to guide practical decision-making. For example:

  • Time of Day: Which time periods generate the highest or lowest income?
  • Payment Method: Is cash or digital payment a more significant contributor to income patterns?

This information can help optimize operations or pricing strategies.

