Background

In this project, we look at the NYC Subway data and figure out if more people ride the subway when it is raining versus when it is not raining. We will wrangle the NYC subway data, use statistical methods and data visualization to draw an interesting conclusion about the subway dataset that we've analyzed. The dataset used can be downloaded using the following link

Data Overview

The dataset contains 21 variables and 131951 observations. UNIT is a categorical variable that contains 552 categories. The 552 categories are converted into 552 dummy variables for analysis - each having either a value 0 or 1.

Statistical Testing

Test Used

The distributions of the data sets for the two conditions - rainy and non-rainy - are not normal, hence the Mann-Whitney test is applicable. The Mann-Whitney U test was used to analyze the NYC subway data. I used a two-tailed P value. The null hypothesis is that the mean of the entries with rain is equal to the mean of the entries without rain. The p-critical value is 0.0498.

Statistical test results

import numpy as np
import scipy
import scipy.stats
import pandas

def mann_whitney_plus_means(turnstile_weather):
    with_rain = turnstile_weather[turnstile_weather.rain == 1]['ENTRIESn_hourly']
    without_rain = turnstile_weather[turnstile_weather.rain == 0]['ENTRIESn_hourly']
    with_rain_mean = np.mean(with_rain)
    without_rain_mean = np.mean(without_rain)
    U,p = scipy.stats.mannwhitneyu(with_rain,without_rain)
    return with_rain_mean, without_rain_mean, U, p

Results from the statistical test are as follows:

Linear Regression

OLS using Statsmodels library was the approach used to perform linear regression. Gradient descent was not used as OLS is more accurate. Rain and Hour were the features used apart from the dummy variables which can be seen from the code below.

import numpy as np
import pandas
import statsmodels.api as sm

def linear_regression(features, values):
    """
    Perform linear regression given a data set with an arbitrary number of features.
    """

    features = sm.add_constant(features,prepend=True)
    model = sm.OLS(values,features).fit()
    print model.summary()
    params = model.params
    intercept = params[0]
    params = params[1:]
    return intercept, params

def predictions(dataframe):
    features = dataframe[['rain', 'Hour']]
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units[['unit_R170','unit_R022','unit_R179','unit_R029','unit_R012',
                            'unit_R046','unit_R023','unit_R047','unit_R293','unit_R055','unit_R084','unit_R011',
                            'unit_R178','unit_R452','unit_R001','unit_R108',
                            'unit_R010','unit_R195','unit_R021','unit_R102','unit_R018','unit_R033',
                            'unit_R175','unit_R177','unit_R051','unit_R138','unit_R017','unit_R057',
                            'unit_R168','unit_R080','unit_R050','unit_R127','unit_R025',
                            'unit_R141','unit_R163','unit_R044','unit_R019','unit_R031','unit_R248','unit_R032',
                            'unit_R083','unit_R461','unit_R240','unit_R235','unit_R208','unit_R014','unit_R053',
                            'unit_R020','unit_R142','unit_R081','unit_R028','unit_R222','unit_R131','unit_R105',
                            'unit_R113','unit_R144','unit_R167','unit_R079','unit_R463','unit_R035',
                            'unit_R132','unit_R176','unit_R300','unit_R158','unit_R164','unit_R097',
                            'unit_R101','unit_R116','unit_R027','unit_R110','unit_R201','unit_R160','unit_R041'
                            ]])

    # Values
    values = dataframe['ENTRIESn_hourly']

    # Perform linear regression
    intercept, params = linear_regression(features, values)

    predictions = intercept + np.dot(features, params)
    return predictions

Choice of Predictors Explained

Non-dummy coefficient values and R^2

Coefficients of the non-dummy variables: ‘const’: -18.9345 ‘rain’: 62.6158 ‘hour’: 59.1683

R^2 value of 0.401 means that the model explained 40.1% of the data fit the regression model line. This linear model to predict ridership is not appropriate for this dataset, given the R^2 value. It could be that some variables given in the dataset which could increase the R^2 value have not been identified. There is also a chance that some important variable has not been captured during data collection too.

Visualizations

A histogram of the hourly number of entries is plotted when there is rain and when there is not. The histograms are not normally distributed and have their peaks near 0.

    df = turnstile_weather.groupby('TIMEn')
    dfSum = df['ENTRIESn_hourly'].sum()

    TIMEn = []
    ENTRIESn_hourlySum = []
    for key, value in dfSum.iteritems():
        TIMEn.append(key)
        ENTRIESn_hourlySum.append(value)
    ENTRIESn_hourlySum = np.log(ENTRIESn_hourlySum)    
    ENTRIESn_hourlySum = ENTRIESn_hourlySum.astype(int)
    dfPlot = DataFrame({'TIMEn':Series(TIMEn),'ENTRIESn_hourlySum':Series(ENTRIESn_hourlySum)})    
    #Filtering out the -inf values
    dfPlot = dfPlot[dfPlot['ENTRIESn_hourlySum'] > 0]  
    #Getting the highest ENTRIESn_hour sum values 
    dfPlot = dfPlot[dfPlot['ENTRIESn_hourlySum'] > (3*np.mean(dfPlot['ENTRIESn_hourlySum']))]    
    plot = ggplot(dfPlot,aes('TIMEn','ENTRIESn_hourlySum')) + geom_bar(stat="bar") + \
            ggtitle('Highest entry counts with times during the day') + xlab('Times in Day') + ylab('Log of Hourly Entries') 
    return plot

Histogram of subway entry frequency v/s times

The below graph shows the plot of the mean pressure variation versus dates. It shows how the mean pressure variation varies on different days.

    pandas.options.mode.chained_assignment = None  # default='warn'
    turnstile_weather.iscopy = False
    turnstile_weather.loc[:,('DATEn')] = pandas.to_datetime(turnstile_weather.loc[:,('DATEn')], format="%Y-%m-%d")
    df = turnstile_weather.groupby('DATEn')
    index = [gp_keys[0] for gp_keys in df.groups.values()]
    unique_df = turnstile_weather.reindex(index).sort('DATEn')   
    plot = ggplot(unique_df, aes(x='DATEn', y='meanpressurei')) + geom_line(color = 'red') + \
    ggtitle('Mean Pressure variation') + xlab('Date') + ylab('Mean Pressure') + scale_x_date(labels = date_format("%Y-%m-%d"))
    return plot

Mean Pressure Variation v/s Date

Conclusion

More people ride the NYC subway when it rains as compared to when it does not. The low two-sided p-value of 0.0498 indicates that the null hypothesis that the mean subway number of hourly entries with rain is equal to that without rain is rejected. The linear model indicates that rain is an important variable in deciding the number of hourly entries. The positive coefficient value of 63.64 for variable ‘rain’ shows that subway ridership increases by 63.64 on average when it rains.

Reflection on Shortcomings

The shortcomings of the methods used are as follows:

Author

The above project was done as part of the 'Data Analayst Nanodegree' by Roshan Shetty. Please contact me on rosshanabshetty@gmail.com for any doubts, queries or information.