Divvy Bikeshare 2023 Q3 Rides Analysis

Gon · case study · sql · python · tableau

Source: Divvy Instagram

Objective

In my analysis of the Divvy bike-sharing service, I identified patterns and trends in the behavior of annual members and casual (non-member) users. The analysis aims to inform strategies for increasing memberships and enhancing the overall user experience.

This write-up is an in-depth breakdown of the methodology behind the analysis. For the insights and conclusions, refer to the Q3 rides presentation and visualization story:

  1. Presentation
  2. Visualization

This case study outlines the tools and methodologies employed in processing, analyzing, and visualizing Divvy ride data. Data retrieval is carried out with Python, while data cleaning and transformation are handled with Pandas. Google BigQuery facilitates the SQL analysis, and Tableau is used to craft the data visualizations.

Process

Divvy offers anonymized ride data publicly on their website, accessible here. The data is organized in monthly releases, each a zip file linked from an index page with over 100 URLs. To streamline file retrieval, I developed a Python script to process these URLs.

# divvy_trip_scraper.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://divvy-tripdata.s3.amazonaws.com/"


def download_file(url, filename):
    # Stream the response to disk in chunks so large zip files
    # are never held entirely in memory.
    with requests.get(url, stream=True) as resp, open(filename, "wb") as file:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=8192):
            file.write(chunk)


def parse_xml(xml_content):
    # The S3 bucket listing is XML; each <Contents><Key> element
    # holds the path of one downloadable file.
    soup = BeautifulSoup(xml_content, "xml")
    return [
        urljoin(BASE_URL, content.find("Key").text.strip())
        for content in soup.find_all("Contents")
    ]


def main():
    resp = requests.get(BASE_URL)
    resp.raise_for_status()

    for url in parse_xml(resp.content):
        print(f"Downloading: {url}")
        download_file(url, url.split("/")[-1])


if __name__ == "__main__":
    main()

The case study concentrates on the 2023 Q3 trip data, which covers the peak summer season with the highest ride volume. Each zip file unpacks to a CSV containing hundreds of thousands of recorded trips. Given the size of these datasets, Pandas is used for the initial processing; a Jupyter notebook detailing the data preparation steps is available on Kaggle. Each monthly CSV is then loaded into a Pandas DataFrame with the columns shown below.
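
The loading step, roughly (the filename follows Divvy's naming convention but is illustrative here; each month is processed separately):

import pandas as pd

# load one month of trip data into a DataFrame
df = pd.read_csv("202307-divvy-tripdata.csv")
df.info()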

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 771693 entries, 0 to 771692
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   ride_id             771693 non-null  object
 1   rideable_type       771693 non-null  object
 2   started_at          771693 non-null  object
 3   ended_at            771693 non-null  object
 4   start_station_name  652774 non-null  object
 5   start_station_id    652774 non-null  object
 6   end_station_name    646125 non-null  object
 7   end_station_id      646125 non-null  object
 8   start_lat           771693 non-null  float64
 9   start_lng           771693 non-null  float64
 10  end_lat             770436 non-null  float64
 11  end_lng             770436 non-null  float64
 12  member_casual       771693 non-null  object
dtypes: float64(4), object(9)
memory usage: 76.5+ MB

To identify the key attributes influencing ride behavior, I focus on features such as bicycle type, ride start/end times, and start/end locations. Using Pandas, I sample and validate column values during the exploratory phase to gain insight into the dataset.

Column          Values
rideable_type   electric, classic, docked
member_casual   member, casual
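
A quick way to validate these categorical values (a minimal sketch, assuming the DataFrame is named df):

# list the distinct values of each categorical column of interest
for col in ["rideable_type", "member_casual"]:
    print(col, df[col].unique())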

Subsequently, data cleaning is performed to prevent analysis errors. This involves removing unwanted observations, addressing structural data errors, and identifying and handling null values that could impact the analysis.
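
The per-column counts below come from a simple null check, e.g.:

# count missing values in each column
df.isnull().sum()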

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    122943
start_station_id      122943
end_station_name      130304
end_station_id        130304
start_lat                  0
start_lng                  0
end_lat                 1254
end_lng                 1254
member_casual              0
dtype: int64

A notable number of null values appears in the station names and IDs. Closer investigation shows that whenever a station name is null, the rideable_type is electric. The Divvy website explains why: electric bikes can be docked at public bike racks, which are not Divvy stations, hence the null station names. These entries remain valid for the analysis, as they still represent real bike rides.

To ensure accurate data representation, it's crucial to standardize the string values. Pandas facilitates this by cleaning the station names: removing leading, trailing, and doubled whitespace. Pandas also handles converting the timestamp strings into datetime objects, enabling the necessary calculations and extractions.
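
A sketch of both steps, using the column names from the schema above:

# trim leading/trailing whitespace and collapse doubled spaces in station names
for col in ["start_station_name", "end_station_name"]:
    df[col] = df[col].str.strip().str.replace(r"\s+", " ", regex=True)

# convert the timestamp strings into proper datetime objects
df["started_at"] = pd.to_datetime(df["started_at"])
df["ended_at"] = pd.to_datetime(df["ended_at"])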

 2   started_at          767650 non-null  datetime64[ns]
 3   ended_at            767650 non-null  datetime64[ns]

To facilitate the analysis, I compute the duration of each ride from the started_at and ended_at datetimes and append the results to a new column. To streamline future queries, I also link each trip record to its day of the week, extracted from started_at into a new day_of_week column.
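
Roughly, the two derived columns look like this (duration in seconds, matching the SQL filter used later; rows with missing timestamps are assumed to have been dropped):

# ride length in seconds, from the parsed datetimes
df["duration"] = (df["ended_at"] - df["started_at"]).dt.total_seconds().astype(int)

# weekday name of the ride start, e.g. "Monday"
df["day_of_week"] = df["started_at"].dt.day_name()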

 13  duration            666371 non-null  int64
 14  day_of_week         666371 non-null  object

The refined and processed data is now ready for analysis. I use Pandas to write the DataFrames back to CSV files for the next stage.
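
For example (the filename is illustrative, chosen to match the BigQuery table names used later):

# write the cleaned month back to disk for upload to BigQuery
df.to_csv("202307_trips.csv", index=False)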

Analyze

The CSV files are uploaded to Google BigQuery, allowing me to leverage SQL for comprehensive dataset querying. My analysis is concentrated on discerning disparities between member and casual rides, aiming to uncover meaningful distinctions in ride timing, duration, and popular locations.

WITH combined_data AS (
  SELECT * FROM `divvy_rides.202307_trips`
  UNION ALL
  SELECT * FROM `divvy_rides.202308_trips`
  UNION ALL
  SELECT * FROM `divvy_rides.202309_trips`
)

SELECT COUNT(*)
FROM combined_data
WHERE duration > 60 AND member_casual = 'member'

The foundational query unifies the monthly records into a combined dataset for Q3. To eliminate potential errors, rides shorter than one minute are filtered out (duration is stored in seconds). Subsequent queries are run separately for member and casual records. A crucial aspect of data analysis is unbiased exploration: considering every attribute of the dataset, even when assumptions suggest it holds minimal insight. BigQuery makes it easy to construct multiple queries to identify variations in ride behavior.
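
As an illustration, a grouped variant of the base query surfaces the member/casual split directly (a sketch reusing the combined_data CTE defined above, not the exact query used):

SELECT
  member_casual,
  COUNT(*) AS rides,
  ROUND(AVG(duration) / 60, 1) AS avg_duration_min  -- duration assumed to be in seconds
FROM combined_data
WHERE duration > 60
GROUP BY member_casual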

Ride time block SQL query

In line with the business context, I systematically analyze each data attribute's correlation with bike ride behavior. Formulating specific questions is essential to achieving the project objectives: for the comparison between casual and member rides, I ask about popular bike types, typical ride durations, peak and off-peak days, and bike station utilization. The query shown above calculates the average number of rides in distinct time blocks on weekdays. A complete list of the SQL queries executed on the dataset is available here.
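
Because that query appears above only as a screenshot, here is a sketch of what it might look like; the time-block boundaries are illustrative, and combined_data refers to the Q3 CTE defined earlier:

SELECT
  member_casual,
  CASE
    WHEN EXTRACT(HOUR FROM started_at) < 6  THEN 'night'
    WHEN EXTRACT(HOUR FROM started_at) < 12 THEN 'morning'
    WHEN EXTRACT(HOUR FROM started_at) < 18 THEN 'afternoon'
    ELSE 'evening'
  END AS time_block,
  -- total rides divided by the number of distinct days observed
  COUNT(*) / COUNT(DISTINCT EXTRACT(DATE FROM started_at)) AS avg_rides_per_day
FROM combined_data
WHERE duration > 60
  AND day_of_week NOT IN ('Saturday', 'Sunday')
GROUP BY member_casual, time_block
ORDER BY member_casual, time_block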

After obtaining the SQL results, I conducted a thorough review to select the key indicators. Results showing insignificant differences between member and casual rides were excluded, in favor of those with clear and substantial distinctions. This curated set of results forms the basis for visualizing the data story and addressing the project's objectives.

Visualization

Tableau, a robust and versatile data visualization tool, empowers me to craft descriptive charts and maps to showcase my insights. By incorporating the CSV dataset as a data source in Tableau and validating each field type, I create worksheets to visualize key indicators efficiently.

Tableau worksheet chart

In Tableau, I add categorical columns like member type or day of the week as dimensions, while quantifiable data such as counts, sums, or median values come in as measures. Tableau organizes these columns and rows into the desired charts and graphs. Using the Marks card, I fine-tune the appearance of each chart by adjusting bar sizes and line colors and adding labels, ensuring a polished and insightful presentation.

Tableau time blocks chart

To replicate the outputs obtained in SQL, I use Tableau's Create Calculated Field feature, which generates outputs incorporating aggregated and transformed results. Through multiple conditional statements, I define new columns that facilitate visualizing the disparities in ride durations and start times. Tableau can also plot data points on a map, enhancing the geographical representation of insights.

Tableau map casual

The dataset includes latitude and longitude information, allowing ride locations to be mapped effectively. Using the station name field as labels, I aggregate a calculated ride-count field, giving a clear picture of the ride totals at each location. A dedicated worksheet visualizes the top starting locations for both casual and member rides. To present a cohesive overview of the key insights, all relevant worksheets are combined into a Tableau Story, accessible here.

Tableau story

The final insights and business implications were synthesized into a presentation. I provided stakeholders with logical, understandable context, outlining the analysis objective, the dataset background, and the acquisition process. Each key insight was explained alongside its visualization, offering a comprehensive understanding of the data. The presentation concluded with a summary and actionable recommendations grounded in the data. The complete Divvy Q3 2023 rides analysis presentation can be accessed here.

Google slide presentation

This case study underscores the importance of meticulous data analysis. The pivotal first step is crafting a precise, clear objective, using a solid understanding of the business to formulate relevant questions; the subsequent analysis queries are informed by that understanding. The study also highlights the necessity of thorough data preparation, exploring every dataset attribute to ensure accurate, standardized data. The quality of the data directly shapes the effectiveness of the final visualizations, reinforcing the idea that vibrant visuals are most impactful when anchored in precise, meaningful analysis.