The Red Wine Quality data set describes red variants of the Portuguese “Vinho Verde” wine. Experts graded each wine's quality on a scale from 0 (very bad) to 10 (excellent). The aim of this project is to explore the data, identify interesting trends and attempt to build a model that predicts red wine quality. An initial summary of the variables follows.

## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ qualityCategory     : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality      qualityCategory
##  Min.   :3.000   3: 10          
##  1st Qu.:5.000   4: 53          
##  Median :6.000   5:681          
##  Mean   :5.636   6:638          
##  3rd Qu.:6.000   7:199          
##  Max.   :8.000   8: 18

With 1599 observations the data set is not large, but it is sufficient for further exploration.
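
The structure and summary listings above have the shape of what base R's `str()` and `summary()` print for a data frame. A minimal sketch follows, assuming the data sit in a data frame (named `wine` here, a hypothetical name) and that `qualityCategory` is simply `quality` converted to a factor. Only the first three rows shown in the listing are reproduced, not the full data set.

```r
# First three rows copied from the str() listing; the full data has 1599 rows.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  alcohol          = c(9.4, 9.8, 9.8),
  quality          = c(5L, 5L, 5L)
)

# Assumption: the categorical copy of quality is a plain factor conversion.
wine$qualityCategory <- factor(wine$quality)

str(wine)      # column types and leading values
summary(wine)  # quartiles and mean for numeric columns, counts for factors
```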

The histogram of fixed acidity is right skewed while that of volatile acidity is left skewed.

The histogram of citric acid is left skewed even on a log scale. It seems to have multiple modes as well. A new plot with a lower binwidth might be needed. On the other hand, the histogram of residual sugar is right skewed on a log scale.

Transforming the chlorides and free.sulfur.dioxide attributes to a log scale makes their histograms approximately normal.

The log transform of the total.sulfur.dioxide attribute also makes its histogram approximately normal. The density attribute is approximately normally distributed, with a few more outliers on the right.
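
To illustrate why a log scale helps with right-skewed attributes such as these, the sketch below compares the sample skewness of a variable before and after a log transform. It uses synthetic log-normal data rather than the wine measurements, and the `skewness` helper is defined here for illustration only.

```r
set.seed(1)  # synthetic stand-in data, not the wine measurements

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# A log-normal variable is right skewed on the raw scale by construction.
x <- rlnorm(1599, meanlog = -2.5, sdlog = 0.5)

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
print(round(c(raw = skew_raw, log10 = skew_log), 2))

# Histogram on the log scale; plot = FALSE returns the bin counts without drawing.
h <- hist(log10(x), breaks = 30, plot = FALSE)
```

A skewness near zero after the transform is consistent with a roughly symmetric histogram, though it does not by itself establish normality.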

The histogram of pH attribute seems normal with a few outliers to the right. The histogram of sulphates on the log scale is a bit right skewed.

Even on a log scale, the histogram of the alcohol attribute seems right skewed and bi-modal, which could be explored with a smaller bin width. The majority of red wines are of quality 5 or 6 (1319 of the 1599 observations), which represents moderate quality. The data set is therefore not evenly distributed across the quality levels.
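
The imbalance across quality scores can be quantified directly from the counts in the summary listing. A small sketch, using only those counts:

```r
# Counts per quality score, copied from the summary listing.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total  <- sum(quality_counts)               # number of observations
shares <- round(quality_counts / total, 3)  # proportion per score
print(shares)

# Combined share of the two middle scores, 5 and 6.
mid_share <- unname(sum(quality_counts[c("5", "6")]) / total)
print(round(mid_share, 3))
```

Scores 5 and 6 together account for roughly 82% of the wines, while scores 3 and 8 have only 10 and 18 observations.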

The histogram of the citric.acid attribute is plotted with a smaller bin width.

The histogram seems highly right skewed and it has at least 3 modes, if not more.
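
The effect of bin width on visible modes can be sketched with base R's `hist()`. The data below are synthetic (two clusters on the interval 0 to 1), not the citric.acid values, and the bin widths are chosen only for illustration.

```r
set.seed(2)  # synthetic two-cluster data, not the citric.acid values
x <- c(rnorm(500, mean = 0.10, sd = 0.03), rnorm(500, mean = 0.50, sd = 0.05))
x <- pmin(pmax(x, 0), 1)  # keep values inside the histogram range

# Same data, two bin widths; plot = FALSE returns the counts without drawing.
coarse <- hist(x, breaks = seq(0, 1, length.out = 11), plot = FALSE)  # width 0.1
fine   <- hist(x, breaks = seq(0, 1, length.out = 51), plot = FALSE)  # width 0.02

print(length(coarse$counts))  # number of coarse bins
print(length(fine$counts))    # number of fine bins
```

Narrower bins resolve more local structure, at the cost of noisier counts per bin.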

The histogram of alcohol seems bi-modal on a log scale. This can be explored in a bit more detail.

The histogram of alcohol clearly has 2 modes. The first mode also has 2 peaks very close together. It seems that a large proportion of the red wines have a lower alcohol content.

Box plots give a better sense of where most of the data lie. They are plotted for the variables below.

The box plots of fixed.acidity and volatile.acidity show high skewness to the right. Most of the values lie in a small range.

The box plot of citric.acid is relatively normal, although a bit right skewed. On the other hand, residual.sugar is highly right skewed. The range of residual.sugar containing most of the values is also extremely small.

The box plot of chlorides is also extremely right skewed and most of the values fall in a very narrow range. There is one very distant outlier at a chloride value of around 0.6. Sulfur dioxide is also right skewed, with the majority of its values in a small range too.

The box plots of density and pH are pretty normal. Density seems to be a bit more tightly distributed though.

The box plot of sulphates is also highly right skewed and shows two clusters of points at the extreme end. The range in which most of the points lie is also pretty small. The box plot of alcohol is also right skewed, but the range containing the majority of the points is a bit larger.

To summarize, the box plots reveal skewed distributions and a large number of outliers in the variables residual.sugar, chlorides, density and sulphates.
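
The points drawn individually on a box plot are those beyond 1.5 times the interquartile range from the hinges, and base R exposes them through `boxplot.stats()`. The sketch below counts them per column; `count_outliers` is a hypothetical helper and the small data frame is made up for illustration.

```r
# Count box-plot outliers (default 1.5 * IQR rule) in each numeric column.
count_outliers <- function(df) {
  sapply(df, function(col) length(boxplot.stats(col)$out))
}

# Made-up example: column a has one extreme value, column b has none.
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(1, 2, 3, 4, 5))
n_out <- count_outliers(toy)
print(n_out)
```

Applied to the wine data frame, the same helper would give a per-variable outlier count to back up the visual impression from the box plots.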

The bi-variate relationships can now be explored. A pairs plot will help in that respect.

The pairs plot reveals many relationships, some expected and some not:

  1. The variable pairs with relatively large correlation magnitudes are fixed.acidity and citric.acid; fixed.acidity and density; pH and citric.acid; pH and fixed.acidity; and citric.acid and volatile.acidity. Scatter plots with smoothers are plotted to explore these relationships further.

The second plot shows that a slightly linear relationship exists between fixed.acidity and density although the slope of the line seems small. The linear relationship does not seem as pronounced as the one between citric.acid and fixed.acidity in terms of the slope. But the correlations are pretty similar as shown below.

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034
## 
##  Pearson's product-moment correlation
## 
## data:  density and fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

It could be that the range of the density attribute is pretty small but there is a definite increasing trend within the range. Similar plots and correlation coefficients are also produced for the other variable pairs.
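
The t statistics in the correlation output follow from the correlation coefficient and the degrees of freedom alone, via t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes them from the reported values; `t_from_r` is a helper name introduced here.

```r
# t statistic for a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Reported estimates for citric.acid and density against fixed.acidity.
t_citric  <- t_from_r(0.6717034, df = 1597)
t_density <- t_from_r(0.6680473, df = 1597)
print(round(c(citric.acid = t_citric, density = t_density), 3))
```

Both values agree with the reported t statistics of 36.234 and 35.877, which confirms that the two pairs are almost equally strongly correlated even though the fitted slopes look different on the plots.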

The scatter plot of pH and fixed.acidity with the regression line superimposed clearly shows a decreasing trend. A similar decreasing trend is observed between pH and citric.acid in a similar second plot. The scatter plot between volatile.acidity and citric.acid shows a steady decreasing trend which flattens out at the end.

## 
##  Pearson's product-moment correlation
## 
## data:  pH and fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782
## 
##  Pearson's product-moment correlation
## 
## data:  pH and citric.acid
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5756337 -0.5063336
## sample estimates:
##        cor 
## -0.5419041
## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

The correlation between pH and fixed.acidity is a fairly large negative value of -0.68. pH and citric.acid, as well as volatile.acidity and citric.acid, have somewhat smaller correlations of -0.54 and -0.55 respectively.
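
The 95 percent confidence intervals in the output can be reproduced with the Fisher z-transformation, which is the method R's `cor.test()` documents for Pearson correlations. A sketch using the reported estimate for pH and fixed.acidity; `fisher_ci` is a helper name introduced here.

```r
# Confidence interval for a Pearson correlation via the Fisher z-transformation.
fisher_ci <- function(r, n, level = 0.95) {
  z  <- atanh(r)               # transform r to the z scale
  se <- 1 / sqrt(n - 3)        # approximate standard error on the z scale
  q  <- qnorm((1 + level) / 2) # two-sided normal quantile
  tanh(z + c(-1, 1) * q * se)  # back-transform the interval bounds
}

# pH and fixed.acidity: r = -0.6829782 with n = 1599 (df + 2).
ci <- fisher_ci(-0.6829782, n = 1599)
print(round(ci, 4))
```

The bounds match the reported interval of -0.708 to -0.656, and the narrow width reflects the sample size of 1599.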

  2. Quality is also positively correlated with alcohol and sulphates while it is negatively correlated with volatile.acidity.
## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Box plots of quality with these variables are plotted below.

For lower-quality red wines the alcohol content is similar, but it increases with quality for values above 5. Sulphates increase slightly with increasing quality. Conversely, the volatile acidity of the wine decreases with increasing quality.
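
The trend read off the box plots can also be checked numerically by computing the median of a variable within each quality score. The sketch below uses a small made-up data frame (`toy`), not the wine data, and shows both `tapply()` and the formula interface of `boxplot()`.

```r
# Made-up values for illustration only.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.0, 10.5, 11.0, 11.0, 12.0, 12.5)
)

# Median alcohol within each quality score.
med <- tapply(toy$alcohol, toy$quality, median)
print(med)

# Box plot statistics per quality score; drop plot = FALSE to draw the plot.
bp <- boxplot(alcohol ~ quality, data = toy, plot = FALSE)
```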

Although volatile acidity decreases with higher wine quality, fixed acidity and citric acid content increase slightly with increasing wine quality. These relationships are plotted below.

The correlations between quality and citric.acid as well as quality and fixed.acidity are also slightly positive as shown below.

## 
##  Pearson's product-moment correlation
## 
## data:  quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

The box plots of citric acid with quality seem interesting. As quality increases the citric acid proportion increases as the peaks move more towards the right. There could be other features in its distribution plot which can be explored with the density plots.

For higher-quality wines, the density of citric acid is concentrated at higher values. The density plots are multimodal for many of the quality categories.

The distribution doesn’t seem normal due to multiple modes. It can be tested using the Shapiro-Wilk test.

## 
##  Shapiro-Wilk normality test
## 
## data:  redwine_input_scatter[redwine_input_scatter$quality == 8, ]$density
## W = 0.95729, p-value = 0.5502
## 
##  Shapiro-Wilk normality test
## 
## data:  redwine_input_scatter[redwine_input_scatter$quality == 7, ]$density
## W = 0.98743, p-value = 0.07556
## 
##  Shapiro-Wilk normality test
## 
## data:  redwine_input_scatter[redwine_input_scatter$quality == 6, ]$density
## W = 0.99331, p-value = 0.006122
## 
##  Shapiro-Wilk normality test
## 
## data:  redwine_input_scatter[redwine_input_scatter$quality == 5, ]$density
## W = 0.9757, p-value = 3.316e-09
## 
##  Shapiro-Wilk normality test
## 
## data:  redwine_input_scatter[redwine_input_scatter$quality == 4, ]$density
## W = 0.97947, p-value = 0.4908
## 
##  Shapiro-Wilk normality test
## 
## data:  redwine_input_scatter[redwine_input_scatter$quality == 3, ]$density
## W = 0.96475, p-value = 0.8384

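The Shapiro-Wilk test rejects normality at the 5% level only for quality categories 5 and 6 (p = 3.3e-09 and p = 0.006), which are also the categories with the most observations. For categories 3, 4, 7 and 8 the test does not reject normality, although with only 10 and 18 wines in categories 3 and 8 it has little power there. Note that the output above was computed on the density column of the data frame rather than on citric.acid, so it describes the distribution of wine density within each category.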
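
Running the test once per quality category can be written compactly with `split()` and `sapply()`. The sketch below uses synthetic data with one unimodal and one clearly bimodal group; `p_by_group` is a hypothetical helper, and any numeric column could be passed in place of `values`.

```r
set.seed(3)  # synthetic data for illustration only

# Shapiro-Wilk p-value of x within each level of the grouping variable g.
p_by_group <- function(x, g) {
  sapply(split(x, g), function(v) shapiro.test(v)$p.value)
}

values <- c(rnorm(200), rnorm(100, mean = -3), rnorm(100, mean = 3))
group  <- rep(c("unimodal", "bimodal"), each = 200)

p_values <- p_by_group(values, group)
print(signif(p_values, 3))
```

A small p-value is evidence against normality. A large one only means the test found no evidence, which is common for small groups.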

Since alcohol, volatile.acidity and citric.acid have skewed distributions, it might be useful to plot the log of these attributes against quality.

The relationships between the variables remain the same, although the skewness in the continuous attributes is reduced. Alcohol increases with increasing quality while volatile.acidity decreases with it.

## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The correlation of quality with alcohol is 0.48 while that of quality with volatile.acidity is -0.39, reflecting the trends in the plots.

It would be interesting to explore the variability of the different variables with wine category. This could help give insights on any interesting trends.

The variation of pH is similar in all categories except the last one, where it is a bit right skewed.

The variation of pH is similar in all categories where there are enough data points like 5, 6 and 7.

The variation of density with alcohol is similar in all categories.

The variation of total.sulfur.dioxide is similar in all categories.

The variation of citric.acid is similar in all categories except 7 and 8.

The variation of fixed.acidity is similar in all categories.

The variation of volatile.acidity is similar in all categories.

The plots seem to follow the same trends across all categories. There are very few data points at the lowest and highest ends of the wine quality scale, so little can or should be inferred from those histograms.

Since quality seems to be most strongly associated with alcohol and volatile.acidity, a plot incorporating both has been attempted.

It looks like quality reduces with volatile.acidity while it increases with alcohol.

It seems that pH has a slightly decreasing trend with citric.acid but does not get affected by fixed.acidity.

It looks like volatile.acidity, alcohol, sulphates, citric.acid and fixed.acidity could be decent predictors for red wine quality. This can be checked using a linear model.
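
A sketch of how such nested linear models can be fitted and compared in base R follows. The data frame `sim` is simulated (with coefficients chosen arbitrarily and variable ranges taken from the summary listing), so the numbers it produces are not the wine results.

```r
set.seed(4)  # simulated stand-in data, not the wine data set
n <- 500
sim <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.0)
)
# Arbitrary "true" relationship used only to generate the example response.
sim$quality <- 2 + 0.3 * sim$alcohol - 0.7 * log(sim$volatile.acidity) +
  rnorm(n, sd = 0.7)

# Fit a small model, then extend it by one predictor.
m1 <- lm(quality ~ log(volatile.acidity) + alcohol, data = sim)
m2 <- update(m1, . ~ . + sulphates)

print(summary(m1)$r.squared)  # share of variance explained by m1
print(AIC(m1, m2))            # lower AIC indicates a better trade-off
print(anova(m1, m2))          # F-test for the added predictor
```

Building models incrementally in this way makes it possible to see how much each added variable contributes, which is the comparison the model table reports.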

## 
## Calls:
## m1: lm(formula = quality ~ I(log(volatile.acidity)) + alcohol, data = redwine_input)
## m2: lm(formula = quality ~ I(log(volatile.acidity)) + alcohol + chlorides + 
##     sulphates, data = redwine_input)
## m3: lm(formula = quality ~ I(log(volatile.acidity)) + alcohol + chlorides + 
##     sulphates + free.sulfur.dioxide, data = redwine_input)
## m4: lm(formula = quality ~ I(log(volatile.acidity)) + alcohol + chlorides + 
##     sulphates + free.sulfur.dioxide + citric.acid + fixed.acidity, 
##     data = redwine_input)
## m5: lm(formula = quality ~ I(log(volatile.acidity)) + alcohol + chlorides + 
##     sulphates + free.sulfur.dioxide + citric.acid + fixed.acidity + 
##     density, data = redwine_input)
## m6: lm(formula = quality ~ I(log(volatile.acidity)) + alcohol + chlorides + 
##     sulphates + free.sulfur.dioxide + citric.acid + fixed.acidity + 
##     density + total.sulfur.dioxide, data = redwine_input)
## 
## ================================================================================================
##                                m1         m2         m3         m4          m5          m6      
## ------------------------------------------------------------------------------------------------
##   (Intercept)                1.938***   1.791***   1.849***   1.398***   25.592      25.454     
##                             (0.165)    (0.177)    (0.181)    (0.225)    (15.381)    (15.310)    
##   I(log(volatile.acidity))  -0.680***  -0.568***  -0.568***  -0.599***   -0.577***   -0.520***  
##                             (0.049)    (0.050)    (0.050)    (0.060)     (0.062)     (0.063)    
##   alcohol                    0.309***   0.288***   0.286***   0.297***    0.279***    0.263***  
##                             (0.016)    (0.016)    (0.016)    (0.017)     (0.020)     (0.021)    
##   chlorides                            -1.650***  -1.674***  -1.473***   -1.491***   -1.654***  
##                                        (0.396)    (0.396)    (0.411)     (0.411)     (0.411)    
##   sulphates                             0.892***   0.904***   0.880***    0.908***    0.928***  
##                                        (0.111)    (0.111)    (0.112)     (0.113)     (0.113)    
##   free.sulfur.dioxide                             -0.003     -0.002      -0.001       0.004*    
##                                                   (0.002)    (0.002)     (0.002)     (0.002)    
##   citric.acid                                                -0.301*     -0.287*     -0.112     
##                                                              (0.143)     (0.143)     (0.149)    
##   fixed.acidity                                               0.045***    0.062***    0.051**   
##                                                              (0.013)     (0.017)     (0.017)    
##   density                                                               -24.227     -23.787     
##                                                                         (15.400)    (15.329)    
##   total.sulfur.dioxide                                                               -0.003***  
##                                                                                      (0.001)    
## ------------------------------------------------------------------------------------------------
##   R-squared                      0.3        0.3        0.3        0.3        0.3         0.4    
##   adj. R-squared                 0.3        0.3        0.3        0.3        0.3         0.3    
##   sigma                          0.7        0.7        0.7        0.7        0.7         0.7    
##   F                            360.0      203.5      163.5      119.1      104.6        95.6    
##   p                              0.0        0.0        0.0        0.0        0.0         0.0    
##   Log-likelihood             -1629.0    -1596.8    -1595.5    -1589.7    -1588.5     -1580.6    
##   Deviance                     718.2      689.9      688.7      683.8      682.7       676.0    
##   AIC                         3265.9     3205.6     3204.9     3197.4     3196.9      3183.1    
##   BIC                         3287.4     3237.9     3242.5     3245.8     3250.7      3242.3    
##   N                           1599       1599       1599       1599       1599        1599      
## ================================================================================================

The final linear model (m6) reaches an R-squared of 0.4 as rounded in the table, while its adjusted R-squared stays at 0.3. Only after total.sulfur.dioxide is added to the other variables does the rounded R-squared reach 0.4. The variables apart from volatile.acidity and alcohol appear to be weak predictors. The correlation of quality with total.sulfur.dioxide is even weaker than its correlation with sulphates, which is shown below, so total.sulfur.dioxide on its own is not a strong predictor.

## 
##  Pearson's product-moment correlation
## 
## data:  redwine_input$quality and redwine_input$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
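
Because the model table rounds R-squared to one decimal place, it can help to recover a more precise value from the reported F statistic using R^2 = F*k / (F*k + n - k - 1), where k is the number of predictors and n the number of observations. A sketch using the table's values for m1 and m6; `r2_from_f` is a helper name introduced here.

```r
# Recover R-squared from the overall F statistic of a linear model
# with k predictors (excluding the intercept) and n observations.
r2_from_f <- function(f, k, n) (f * k) / (f * k + (n - k - 1))

r2_m1 <- r2_from_f(360.0, k = 2, n = 1599)  # m1: two predictors
r2_m6 <- r2_from_f(95.6,  k = 9, n = 1599)  # m6: nine predictors
print(round(c(m1 = r2_m1, m6 = r2_m6), 3))
```

Under this calculation the R-squared rises from about 0.31 for m1 to about 0.35 for m6, so the step from 0.3 to 0.4 in the table is largely a rounding effect.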

Final Plots

Plot 1

Plot 1 description

The histogram of alcohol clearly has 2 modes. The first mode also has 2 peaks very close together. It seems that a large proportion of the red wines have a lower alcohol content.

Plot 2

Plot description

For higher-quality wines, the density of citric acid is concentrated at higher values. As quality increases, the citric acid content increases. The density plots are multimodal for many of the quality categories.

Plot 3

Plot description

Red wine quality shows a decreasing, roughly linear trend with volatile acidity. On the other hand, quality increases steadily with alcohol content. Red wine therefore tends to be rated better when its alcohol content is higher and its volatile acidity is lower.

Reflection

The red wine dataset consists of 1599 observations with 13 variables. I initially computed the summary of the variables in the dataset like mean, median and range. Then, I went about exploring the relationships between the various variables to find out interesting trends using bi-variate and multi-variate plots. I finally built a linear model to predict red wine quality.

When I initially plotted quality and alcohol with volatile.acidity as the legend, I was able to clearly see the linear trend of quality and alcohol, but the trend of quality with volatile.acidity was not clear. I then realized that highly correlated variables can be detected easily using color, whereas trends between weakly correlated variables are more clearly discerned by superimposing a regression line on the plot. On doing that, the trends became clearer. I used this technique in the multivariate plots with good success.

The faceted histograms of the variables, on the other hand, did not give me much information on the varying trends within each quality category, except for citric acid, whose density plot was rich in terms of differing numbers of modes and skewness among the different quality categories.

The linear model gave an R-squared value of about 0.4. The log of volatile.acidity and alcohol turned out to be pretty good predictors in the model. Since there are so few observations, particularly at the low and high ends of the quality scale, the model has limited predictive power. With more data, the model could likely be improved.