Divvy Bikeshare 2023 Q3 Rides Analysis
Objective
In this analysis of the Divvy bike-sharing service, I identify patterns and trends in the behavior of annual members and casual (non-member) users. The analysis aims to inform strategies for increasing memberships and enhancing the overall user experience.
This write-up is an in-depth breakdown of the methodology behind the analysis. For the insights and conclusions, please refer to the Q3 rides presentation and visualization story.
This case study outlines the tools and methodologies employed in processing, analyzing, and visualizing Divvy ride data. Data retrieval is carried out using Python, while data cleaning and transformation tasks are executed with Pandas. Google BigQuery facilitates SQL analysis, and Tableau is employed for crafting data visualizations.
Process
Divvy offers anonymized ride data publicly on their website. The data is organized in monthly releases, each represented by a zip file linked on an index page with over 100 URLs. To streamline file retrieval, I developed a Python script for processing these URLs.
# divvy_trip_scraper.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://divvy-tripdata.s3.amazonaws.com/"

def download_file(url, filename):
    # Stream the zip file to disk in chunks instead of buffering it in memory.
    with requests.get(url, stream=True) as resp, open(filename, "wb") as file:
        for chunk in resp.iter_content(chunk_size=8192):
            file.write(chunk)

def parse_xml(xml_content):
    # The S3 bucket listing is XML; each <Contents><Key> element holds a file path.
    soup = BeautifulSoup(xml_content, "xml")
    return [
        urljoin(BASE_URL, content.find("Key").text.strip())
        for content in soup.find_all("Contents")
    ]

def main():
    resp = requests.get(BASE_URL)
    if resp.status_code == 200:
        for url in parse_xml(resp.content):
            print(f"Downloading: {url}")
            download_file(url, url.split("/")[-1])

if __name__ == "__main__":
    main()
The case study concentrates on the 2023 Q3 trip data, encompassing the peak summer season with the highest ride volume. Unpacking each zip file reveals CSV files containing hundreds of thousands of recorded trips. Due to the extensive datasets, Pandas is employed for initial processing. A Jupyter notebook detailing the data preparation steps is accessible on Kaggle. The CSV files are then loaded into a Pandas DataFrame with specified columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 771693 entries, 0 to 771692
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ride_id 771693 non-null object
1 rideable_type 771693 non-null object
2 started_at 771693 non-null object
3 ended_at 771693 non-null object
4 start_station_name 652774 non-null object
5 start_station_id 652774 non-null object
6 end_station_name 646125 non-null object
7 end_station_id 646125 non-null object
8 start_lat 771693 non-null float64
9 start_lng 771693 non-null float64
10 end_lat 770436 non-null float64
11 end_lng 770436 non-null float64
12 member_casual 771693 non-null object
dtypes: float64(4), object(9)
memory usage: 76.5+ MB
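The loading step can be sketched as follows. The real inputs are the three monthly CSVs (named along the lines of `202307-divvy-tripdata.csv`); tiny in-memory stand-ins are used here so the snippet runs on its own:

```python
import io
import pandas as pd

# Hypothetical stand-ins for two monthly CSV files extracted from the zips.
july = io.StringIO("ride_id,member_casual\nA1,member\n")
august = io.StringIO("ride_id,member_casual\nB2,casual\n")

# Concatenate the monthly files into a single DataFrame, resetting the index.
df = pd.concat((pd.read_csv(f) for f in (july, august)), ignore_index=True)
print(len(df))  # 2
```

With the actual files, `pd.read_csv` would receive the three Q3 file paths instead of the `StringIO` objects.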
Identifying key attributes influencing ride behavior, I focus on features such as bicycle types, ride start/end times, and starting/ending locations. Leveraging Pandas, I sample and validate column values in the exploratory phase to gain insights into the dataset.
| Column | Values |
|---|---|
| rideable_type | electric, classic, docked |
| member_casual | member, casual |
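A minimal sketch of that validation step, using a tiny stand-in frame (the real dataset has 13 columns and roughly 770k rows):

```python
import pandas as pd

# Stand-in frame with the two categorical columns of interest.
df = pd.DataFrame({
    "rideable_type": ["electric", "classic", "docked", "electric"],
    "member_casual": ["member", "casual", "casual", "member"],
})

# List each column's distinct values to confirm they match expectations.
print(sorted(df["rideable_type"].unique()))  # ['classic', 'docked', 'electric']
print(sorted(df["member_casual"].unique()))  # ['casual', 'member']
```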
Subsequently, data cleaning is performed to prevent analysis errors. This involves removing unwanted observations, addressing structural data errors, and identifying and handling null values that could impact the analysis.
ride_id 0
rideable_type 0
started_at 0
ended_at 0
start_station_name 122943
start_station_id 122943
end_station_name 130304
end_station_id 130304
start_lat 0
start_lng 0
end_lat 1254
end_lng 1254
member_casual 0
dtype: int64
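The per-column null counts above come from a straightforward Pandas idiom, sketched here on a small stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in frame with some missing station names, mimicking the real data.
df = pd.DataFrame({
    "ride_id": ["A", "B", "C"],
    "start_station_name": ["Clark St", np.nan, np.nan],
})

# Count null values in every column at once.
nulls = df.isna().sum()
print(nulls["start_station_name"])  # 2
print(nulls["ride_id"])             # 0
```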
A notable number of null values appears in the station name and ID columns. On closer investigation, whenever a station name is null, the rideable_type is "electric bike." Further exploration on the Divvy website reveals that electric bikes can be docked at public bike racks, which are not Divvy stations, explaining the null station names. These entries remain valid for the analysis, as they still represent bike rides.
To ensure accurate data representation, string values must be standardized. Pandas facilitates this by cleaning station names, removing leading, trailing, and doubled whitespace. Transforming timestamp strings into datetime objects is also handled efficiently with Pandas, enabling the calculations and extractions that follow.
2 started_at 767650 non-null datetime64[ns]
3 ended_at 767650 non-null datetime64[ns]
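The string standardization and datetime conversion can be sketched like this (column names from the dataset; the sample values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "start_station_name": ["  Clark St  &  Elm St "],
    "started_at": ["2023-07-01 08:15:00"],
})

# Trim leading/trailing whitespace and collapse internal runs of spaces.
df["start_station_name"] = (
    df["start_station_name"].str.strip().str.replace(r"\s+", " ", regex=True)
)

# Parse timestamp strings into datetime64 values.
df["started_at"] = pd.to_datetime(df["started_at"])

print(df["start_station_name"].iloc[0])  # "Clark St & Elm St"
print(df["started_at"].dt.year.iloc[0])  # 2023
```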
To facilitate the analysis, I compute the duration of each ride from the started_at and ended_at datetimes, appending the results to a new column. To streamline future queries, I also link each trip record with its corresponding day of the week, extracted from the started_at column into a new day_of_week column.
13 duration 666371 non-null int64
14 day_of_week 666371 non-null object
The refined and processed data is now prepared for analysis. Subsequently, I employ Pandas to write the DataFrames back into CSV files for further utilization.
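The duration and weekday derivation, plus the CSV write-back, can be sketched as follows (the output filename is hypothetical; duration is assumed to be in seconds, consistent with the later `duration > 60` one-minute filter):

```python
import pandas as pd

df = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-07-03 08:00:00"]),
    "ended_at": pd.to_datetime(["2023-07-03 08:12:30"]),
})

# Ride duration in whole seconds, plus the weekday name of the start time.
df["duration"] = (df["ended_at"] - df["started_at"]).dt.total_seconds().astype(int)
df["day_of_week"] = df["started_at"].dt.day_name()

print(df["duration"].iloc[0])     # 750
print(df["day_of_week"].iloc[0])  # "Monday"

# Write the processed frame back out for upload to BigQuery
# (hypothetical filename):
# df.to_csv("202307_trips_clean.csv", index=False)
```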
Analyze
The CSV files are uploaded to Google BigQuery, allowing me to leverage SQL for comprehensive dataset querying. My analysis is concentrated on discerning disparities between member and casual rides, aiming to uncover meaningful distinctions in ride timing, duration, and popular locations.
WITH combined_data AS (
  SELECT * FROM `divvy_rides.202307_trips`
  UNION ALL
  SELECT * FROM `divvy_rides.202308_trips`
  UNION ALL
  SELECT * FROM `divvy_rides.202309_trips`
)
SELECT COUNT(*)
FROM combined_data
WHERE duration > 60 AND member_casual = 'member'
The foundational query unifies monthly records into a combined dataset for Q3. To eliminate potential errors, rides with durations less than one minute are filtered out. Subsequent queries are executed separately for member and casual records. A crucial aspect of data analysis involves unbiased exploration, considering every attribute of the dataset, even when assumptions suggest minimal insights. BigQuery facilitates the construction of multiple queries to identify variations in ride behavior.
In aligning with the business context, I systematically analyze each data attribute's correlation with bike ride behavior. Formulating specific questions is essential for achieving the project objectives. For the comparison between casual and member rides, I inquire about popular bike types, typical ride durations, peak and off-peak days, and the utilization of bike stations. One query, for example, calculates the average number of rides in distinct time blocks on weekdays. A comprehensive list of SQL queries executed on the dataset is available in the project repository.
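The time-block logic can be illustrated in Pandas on a tiny hypothetical sample (the actual analysis runs the equivalent grouping in BigQuery SQL, and the block boundaries here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset of ride start times and rider types.
df = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2023-07-03 07:30:00",
        "2023-07-03 17:45:00",
        "2023-07-04 08:10:00",
    ]),
    "member_casual": ["member", "casual", "member"],
})

# Bucket start hours into coarse time blocks, then count rides
# per block and rider type.
blocks = pd.cut(
    df["started_at"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)
counts = df.groupby([blocks, "member_casual"], observed=True).size()
print(counts)
```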
After obtaining SQL results, a thorough review was conducted to select key indicators. Results showing insignificant differences between member and casual rides were excluded, focusing on those with clear and substantial distinctions. The curated set of results forms the basis for visualizing the data story, effectively addressing the project's objectives.
Visualization
Tableau, a robust and versatile data visualization tool, empowers me to craft descriptive charts and maps to showcase my insights. By incorporating the CSV dataset as a data source in Tableau and validating each field type, I create worksheets to visualize key indicators efficiently.
In Tableau, I incorporate categorical columns like member type or day of the week as dimensions, while quantifiable data such as counts, sums, or median values are added as measures. Tableau organizes these columns and rows into the desired charts or graphs. Using the marks options, I fine-tune the appearance of charts by adjusting bar sizes, line colors, and labels, ensuring a polished and insightful presentation.
To replicate the outputs obtained in SQL, I use Tableau's calculated fields. These enable the generation of outputs that incorporate aggregated and transformed results. Through multiple conditional statements, I define new columns, facilitating the visualization of disparities in ride durations and start times. Additionally, Tableau can display data points on a map, enhancing the geographical representation of insights.
The dataset includes latitude and longitude information, allowing for effective mapping of ride locations. Using the station name field as labels, I aggregate a calculated field for ride counts, providing a clear representation of ride totals at each location. A dedicated worksheet visualizes the most popular starting locations for both casual and member rides. To present a cohesive overview of key insights, all relevant worksheets are integrated into a Tableau Story.
The final insights and business implications were synthesized into a comprehensive presentation. I provided stakeholders with logical and understandable context, outlining the analysis objective, dataset background, and acquisition process. Each key insight was explained in the context of the visualizations, offering a comprehensive understanding of the data. The presentation concluded with a summary and actionable recommendations grounded in the data.
This case study underscores the significance of meticulous data analysis. The pivotal first step involves crafting a precise and clear objective, leveraging a solid understanding of the business to formulate relevant questions. The subsequent analysis queries are informed by this understanding. Emphasizing the importance of data preparation, the study highlights the necessity of a thorough exploration of dataset attributes, ensuring accurate and standardized data. The quality of data directly influences the effectiveness of the final visualizations, reinforcing the idea that vibrant visuals are most impactful when anchored in precise and meaningful analysis.