Background
In this project, we analyze NYC Subway data to determine whether more people ride the subway when it is raining than when it is not. We wrangle the NYC subway data and use statistical methods and data visualization to draw conclusions about the dataset. The dataset used can be downloaded using the following link.
Data Overview
The dataset contains 21 variables and 131,951 observations. UNIT is a categorical variable with 552 categories, which are converted into 552 dummy variables for the analysis, each taking a value of 0 or 1.
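For reference, this dummy-variable expansion can be done with pandas.get_dummies. The sketch below is illustrative rather than the project's original loading code, and the CSV filename is an assumption:

import pandas

# Assumed filename for the turnstile/weather CSV; adjust to the actual download
turnstile_weather = pandas.read_csv('turnstile_data_master_with_weather.csv')
# One 0/1 indicator column per UNIT category
dummy_units = pandas.get_dummies(turnstile_weather['UNIT'], prefix='unit')
print(dummy_units.shape)  # expected: (131951, 552) given the counts above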
Statistical Testing
Test Used
The distributions of hourly entries under the two conditions - rainy and non-rainy - are not normal, so the Mann-Whitney U test, which does not assume normality, is appropriate. The Mann-Whitney U test was used to analyze the NYC subway data with a two-tailed p-value. The null hypothesis is that the mean number of hourly entries with rain is equal to the mean number of hourly entries without rain. The p-critical value is 0.05.
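The non-normality is visible in the histograms discussed in the Visualizations section; as an additional check (not part of the original analysis), a numerical normality test could be run on each group. A minimal sketch, assuming the turnstile_weather DataFrame with the 'rain' and 'ENTRIESn_hourly' columns used below:

import scipy.stats

# Split the hourly entries by the rain indicator, as in the test code below
with_rain = turnstile_weather[turnstile_weather.rain == 1]['ENTRIESn_hourly']
without_rain = turnstile_weather[turnstile_weather.rain == 0]['ENTRIESn_hourly']

# D'Agostino-Pearson normality test: a tiny p-value is strong evidence against normality
for label, sample in [('rain', with_rain), ('no rain', without_rain)]:
    stat, p = scipy.stats.normaltest(sample)
    print(label, stat, p)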
Statistical test results
import numpy as np
import scipy.stats

def mann_whitney_plus_means(turnstile_weather):
    # Split hourly entries into rainy and non-rainy observations
    with_rain = turnstile_weather[turnstile_weather.rain == 1]['ENTRIESn_hourly']
    without_rain = turnstile_weather[turnstile_weather.rain == 0]['ENTRIESn_hourly']
    with_rain_mean = np.mean(with_rain)
    without_rain_mean = np.mean(without_rain)
    # Two-tailed Mann-Whitney U test (older SciPy versions return a one-sided
    # p-value by default, which would need to be doubled)
    U, p = scipy.stats.mannwhitneyu(with_rain, without_rain, alternative='two-sided')
    return with_rain_mean, without_rain_mean, U, p
Results from the statistical test are as follows:
- P-value: 0.0498
- Mean with rain: 1105.4464
- Mean without rain: 1090.2788
The two-tailed p-value is less than the 0.05 critical value, so we reject the null hypothesis that the mean number of hourly entries with rain is equal to the mean without rain.
Linear Regression
Linear regression was performed using ordinary least squares (OLS) from the statsmodels library. Gradient descent was not used because OLS computes the exact closed-form solution rather than an iterative approximation. Besides the UNIT dummy variables, 'rain' and 'Hour' were used as features, as can be seen in the code below.
import numpy as np
import pandas
import statsmodels.api as sm

def linear_regression(features, values):
    """
    Perform linear regression given a data set with an arbitrary number of features.
    """
    # Add an intercept term and fit ordinary least squares
    features = sm.add_constant(features, prepend=True)
    model = sm.OLS(values, features).fit()
    print(model.summary())
    params = model.params
    intercept = params.iloc[0]
    params = params.iloc[1:]
    return intercept, params

def predictions(dataframe):
    # Non-dummy features
    features = dataframe[['rain', 'Hour']]
    # UNIT dummy variables, selected as explained below
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units[['unit_R170','unit_R022','unit_R179','unit_R029','unit_R012',
        'unit_R046','unit_R023','unit_R047','unit_R293','unit_R055','unit_R084','unit_R011',
        'unit_R178','unit_R452','unit_R001','unit_R108',
        'unit_R010','unit_R195','unit_R021','unit_R102','unit_R018','unit_R033',
        'unit_R175','unit_R177','unit_R051','unit_R138','unit_R017','unit_R057',
        'unit_R168','unit_R080','unit_R050','unit_R127','unit_R025',
        'unit_R141','unit_R163','unit_R044','unit_R019','unit_R031','unit_R248','unit_R032',
        'unit_R083','unit_R461','unit_R240','unit_R235','unit_R208','unit_R014','unit_R053',
        'unit_R020','unit_R142','unit_R081','unit_R028','unit_R222','unit_R131','unit_R105',
        'unit_R113','unit_R144','unit_R167','unit_R079','unit_R463','unit_R035',
        'unit_R132','unit_R176','unit_R300','unit_R158','unit_R164','unit_R097',
        'unit_R101','unit_R116','unit_R027','unit_R110','unit_R201','unit_R160','unit_R041'
        ]])
    # Values
    values = dataframe['ENTRIESn_hourly']
    # Perform linear regression
    intercept, params = linear_regression(features, values)
    predictions = intercept + np.dot(features, params)
    return predictions
Choice of Predictors Explained
- ‘rain’ is the most obvious candidate for predicting subway usage: if it rains, people might decide to take the subway. ‘Hour’ of the day also helps predict usage, since people ride the subway more in the mornings and evenings than late at night or in the afternoon. The UNIT dummy variables were included because they distinguish the individual units, which differ widely in ridership.
- To decide which dummy variables to keep, I first fit a full model with all of the dummy variables. The resulting model summary provided t-values and p-values for each dummy variable, and including the dummies with the highest t-values improved the R-squared significantly. I therefore sorted the dummy variables in descending order of t-value and added them to the model until the R-squared value crossed the 0.4 threshold; a sketch of this selection step appears after this list.
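The following is a minimal sketch of that selection procedure, not the original filtering code; variable names are illustrative, it assumes the same dataframe used in predictions(), and it ranks the dummies by absolute t-value:

import pandas
import statsmodels.api as sm

# Assumes `dataframe` has the 'UNIT', 'rain', 'Hour' and 'ENTRIESn_hourly' columns
values = dataframe['ENTRIESn_hourly']
dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
base = dataframe[['rain', 'Hour']]

# Full model with every dummy variable, used only to rank the dummies
full = sm.OLS(values, sm.add_constant(base.join(dummy_units))).fit()
ranked = full.tvalues.drop('const').abs().sort_values(ascending=False)
ranked = [name for name in ranked.index if name.startswith('unit_')]

# Add dummies in order of |t| until R-squared crosses the 0.4 threshold
selected = []
for name in ranked:
    selected.append(name)
    model = sm.OLS(values, sm.add_constant(base.join(dummy_units[selected]))).fit()
    if model.rsquared > 0.4:
        break
print(len(selected), model.rsquared)

Refitting after every addition is slow with 552 dummies, but it makes the stopping rule explicit.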
Non-dummy coefficient values and R^2
Coefficients of the non-dummy variables:
- const: -18.9345
- rain: 62.6158
- Hour: 59.1683
An R^2 value of 0.401 means that the model explains 40.1% of the variance in hourly entries. Given this R^2 value, the linear model is not a good fit for predicting ridership on this dataset. It may be that some variables in the dataset that would increase the R^2 have not been identified, or that an important variable was never captured during data collection.
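As a quick check (not part of the original write-up), the R^2 above can be recomputed directly from the model's predictions:

import numpy as np

# Assumes turnstile_weather and the predictions() function defined earlier
predicted = predictions(turnstile_weather)
values = turnstile_weather['ENTRIESn_hourly']

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((values - predicted) ** 2)
ss_tot = np.sum((values - np.mean(values)) ** 2)
print(1 - ss_res / ss_tot)  # should be close to the 0.401 reported above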
Visualizations
The first visualization is a histogram of the hourly number of entries when it rains and when it does not. Neither histogram is normally distributed, and both peak near 0.
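The histogram code itself is not reproduced in this report; a minimal sketch using matplotlib (the other visualizations here use ggplot), assuming the same turnstile_weather DataFrame, would look like:

import matplotlib.pyplot as plt

# Hourly entries split by the rain indicator
with_rain = turnstile_weather[turnstile_weather.rain == 1]['ENTRIESn_hourly']
without_rain = turnstile_weather[turnstile_weather.rain == 0]['ENTRIESn_hourly']

# Overlay the two histograms; the range is clipped because the data are heavily right-skewed
plt.hist(without_rain, bins=50, range=(0, 6000), alpha=0.5, label='No rain')
plt.hist(with_rain, bins=50, range=(0, 6000), alpha=0.5, label='Rain')
plt.xlabel('ENTRIESn_hourly')
plt.ylabel('Frequency')
plt.title('Histogram of ENTRIESn_hourly')
plt.legend()
plt.show()

The next code block aggregates the total entries for each time of day and plots the log of the highest sums: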
import numpy as np
from pandas import DataFrame, Series
from ggplot import *

def plot_top_entry_times(turnstile_weather):
    # Total entries for each recorded time of day
    df = turnstile_weather.groupby('TIMEn')
    dfSum = df['ENTRIESn_hourly'].sum()
    TIMEn = []
    ENTRIESn_hourlySum = []
    for key, value in dfSum.items():
        TIMEn.append(key)
        ENTRIESn_hourlySum.append(value)
    # Log-scale the totals so the bars are comparable
    ENTRIESn_hourlySum = np.log(ENTRIESn_hourlySum)
    ENTRIESn_hourlySum = ENTRIESn_hourlySum.astype(int)
    dfPlot = DataFrame({'TIMEn': Series(TIMEn), 'ENTRIESn_hourlySum': Series(ENTRIESn_hourlySum)})
    # Filter out the -inf values produced by log(0)
    dfPlot = dfPlot[dfPlot['ENTRIESn_hourlySum'] > 0]
    # Keep only the highest ENTRIESn_hourly sums
    dfPlot = dfPlot[dfPlot['ENTRIESn_hourlySum'] > (3 * np.mean(dfPlot['ENTRIESn_hourlySum']))]
    plot = ggplot(dfPlot, aes('TIMEn', 'ENTRIESn_hourlySum')) + geom_bar(stat="bar") + \
        ggtitle('Highest entry counts with times during the day') + xlab('Times in Day') + ylab('Log of Hourly Entries')
    return plot
The graph below plots the mean pressure against the date, showing how the daily mean pressure varies over the observation period.
import pandas
from ggplot import *

def plot_mean_pressure(turnstile_weather):
    pandas.options.mode.chained_assignment = None  # default='warn'
    # Parse the date strings so the x-axis is ordered chronologically
    turnstile_weather.loc[:, 'DATEn'] = pandas.to_datetime(turnstile_weather.loc[:, 'DATEn'], format="%Y-%m-%d")
    # Keep one representative row per date
    df = turnstile_weather.groupby('DATEn')
    index = [gp_keys[0] for gp_keys in df.groups.values()]
    unique_df = turnstile_weather.reindex(index).sort_values('DATEn')
    plot = ggplot(unique_df, aes(x='DATEn', y='meanpressurei')) + geom_line(color='red') + \
        ggtitle('Mean Pressure variation') + xlab('Date') + ylab('Mean Pressure') + scale_x_date(labels=date_format("%Y-%m-%d"))
    return plot
Conclusion
More people ride the NYC subway when it rains than when it does not. The two-tailed p-value of 0.0498 is below the 0.05 critical value, so the null hypothesis that the mean number of hourly entries with rain is equal to the mean without rain is rejected. The linear model also indicates that rain is an important variable for predicting hourly entries: the positive coefficient of 62.6158 for ‘rain’ suggests that, on average, hourly ridership increases by about 63 entries when it rains.
Reflection on Shortcomings
The shortcomings of the methods used are as follows:
- The data were not captured for every hourly interval for each unit, so the dataset is incomplete. This makes it harder to build reliable models for the analysis.
- The linear regression model uses a large number of dummy variables; ideally, a model would use fewer variables. However, removing some of the dummy variables reduced the R-squared value, which is undesirable.
Author
This project was completed as part of the 'Data Analyst Nanodegree' by Roshan Shetty. Please contact me at rosshanabshetty@gmail.com with any doubts, queries, or requests for information.