Wine quality prediction with L1 regression#

# install dependencies and select solver
%pip install -q amplpy matplotlib

SOLVER = "highs"

from amplpy import AMPL, ampl_notebook

ampl = ampl_notebook(
    modules=["highs"],  # modules to install
    license_uuid="default",  # license to use
)  # instantiate AMPL object and register magics
Using default Community Edition License for Colab. Get yours at: https://ampl.com/ce
Licensed to AMPL Community Edition License for the AMPL Model Colaboratory (https://colab.ampl.com).

Problem description#

Regression is the task of fitting a model to data. If things go well, the model might provide useful predictions in response to new data. This notebook shows how linear programming and least absolute deviation (LAD) regression can be used to create a linear model for predicting wine quality based on physical and chemical properties. The example uses a well-known data set from the machine learning community.

Physical, chemical, and sensory quality properties were collected for a large number of red and white wines produced in Portugal and then donated to the UCI Machine Learning Repository (Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality. UCI Machine Learning Repository.) The following cell reads the data for red wines directly from the UCI Machine Learning Repository.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

Let us first download the data:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

wines = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep=";",
)
display(wines)
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0               7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
1               7.8             0.880         0.00             2.6      0.098                 25.0                  67.0  0.99680  3.20       0.68      9.8        5
2               7.8             0.760         0.04             2.3      0.092                 15.0                  54.0  0.99700  3.26       0.65      9.8        5
3              11.2             0.280         0.56             1.9      0.075                 17.0                  60.0  0.99800  3.16       0.58      9.8        6
4               7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
...             ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...      ...
1594            6.2             0.600         0.08             2.0      0.090                 32.0                  44.0  0.99490  3.45       0.58     10.5        5
1595            5.9             0.550         0.10             2.2      0.062                 39.0                  51.0  0.99512  3.52       0.76     11.2        6
1596            6.3             0.510         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0        6
1597            5.9             0.645         0.12             2.0      0.075                 32.0                  44.0  0.99547  3.57       0.71     10.2        5
1598            6.0             0.310         0.47             3.6      0.067                 18.0                  42.0  0.99549  3.39       0.66     11.0        6

1599 rows × 12 columns

Model objective: Mean Absolute Deviation (MAD)#

Given \(n\) repeated observations of a response variable \(y_i\) (in this case, wine quality), the mean absolute deviation (MAD) of \(y_i\) from the mean value \(\bar{y}\) is

\[\text{MAD}\,(y) = \frac{1}{n} \sum_{i=1}^n | y_i - \bar{y}|\]
def MAD(df):
    # mean absolute deviation of a series from its own mean
    return (df - df.mean()).abs().mean()


print("MAD = ", MAD(wines["quality"]))

fig, ax = plt.subplots(figsize=(12, 4))
wines["quality"].plot(ax=ax, alpha=0.6, title="wine quality")
ax.axhline(wines["quality"].mean(), color="r", ls="--", lw=3)

mad = MAD(wines["quality"])
ax.axhline(wines["quality"].mean() + mad, color="g", lw=3)
ax.axhline(wines["quality"].mean() - mad, color="g", lw=3)
ax.legend(["y", "mean", "mad"])
ax.set_xlabel("observation")

plt.show()
MAD =  0.6831779242889846
[Figure: wine quality by observation, with the mean (dashed red) and mean ± MAD (green) reference lines]

A preliminary look at the data#

The data consists of 1,599 measurements of eleven physical and chemical characteristics plus an integer measure of sensory quality that takes values from 3 to 8 in this data set. Histograms provide insight into the values and variability of each variable.

fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True)

for ax, column in zip(axes.flatten(), wines.columns):
    wines[column].hist(ax=ax, bins=30)
    ax.axvline(wines[column].mean(), color="r", label="mean")
    ax.set_title(column)
    ax.legend()

plt.tight_layout()
[Figure: histograms of the twelve variables, each with its mean marked in red]

Which features influence reported wine quality?#

The art of regression is to identify the features that have explanatory value for a response of interest. This is where a person with deep knowledge of the application area, in this case an experienced oenologist, has a head start over the naive data scientist. In the absence of that experience, we proceed by examining the correlations among the variables in the data set.

_ = wines.corr()["quality"].plot(kind="bar", grid=True)
[Figure: bar chart of the correlation of each variable with quality]
wines[["volatile acidity", "density", "alcohol", "quality"]].corr()
volatile acidity density alcohol quality
volatile acidity 1.000000 0.022026 -0.202288 -0.390558
density 0.022026 1.000000 -0.496180 -0.174919
alcohol -0.202288 -0.496180 1.000000 0.476166
quality -0.390558 -0.174919 0.476166 1.000000

Collectively, these figures suggest that alcohol is the strongest single correlate of quality, and they identify several additional factors as candidates for explanatory variables.
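As a quick visual check of the strongest correlate, the following short sketch (not part of the original analysis) plots alcohol against quality; the transparency makes the integer quality levels easier to see.

# scatter plot of the strongest single correlate against reported quality
wines.plot(x="alcohol", y="quality", kind="scatter", alpha=0.2, figsize=(8, 4))
plt.show()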

LAD line fitting to identify features#

An alternative approach is to perform a series of single-feature LAD regressions to determine which features do the most to reduce the mean absolute deviation of the residuals.

\[ \begin{align*} \min_{a, \, b} \quad \frac{1}{|I|} \sum_{i\in I} \left| y_i - a x_i - b \right| \end{align*} \]

This computation has been presented in a prior notebook.

%%writefile lad_fit_1.mod

set I;

param y{I};
param X{I};

var a;
var b;

# split each residual into nonnegative positive and negative parts
var e_pos{I} >= 0;
var e_neg{I} >= 0;

var prediction{i in I} = a * X[i] + b;
s.t. prediction_error{i in I}: e_pos[i] - e_neg[i] == prediction[i] - y[i];

# at the optimum, e_pos[i] + e_neg[i] = |prediction[i] - y[i]|
minimize mean_absolute_deviation: sum{i in I}(e_pos[i] + e_neg[i]) / card(I);
Writing lad_fit_1.mod
def lad_fit_1(df, y_col, x_col):
    m = AMPL()
    m.read("lad_fit_1.mod")

    m.set["I"] = df.index.values

    m.param["y"] = df[y_col]
    m.param["X"] = df[x_col]

    m.option["solver"] = SOLVER
    m.solve()

    return m


m = lad_fit_1(wines, "quality", "alcohol")

print(
    f'The mean absolute deviation for a single-feature regression is {m.obj["mean_absolute_deviation"].value():0.5f}'
)
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.5411666021
1578 simplex iterations
0 barrier iterations
 
The mean absolute deviation for a single-feature regression is 0.54117
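To inspect the fitted line itself, the coefficients can be read back from the solved model; a minimal sketch using amplpy's value() accessor for scalar variables:

# read back the fitted slope and intercept of the single-feature model
a = m.var["a"].value()
b = m.var["b"].value()
print(f"quality ≈ {a:0.3f} * alcohol + {b:0.3f}")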

This calculation is repeated for each of the eleven candidate features to determine which are the best candidates to explain deviations in wine quality.

mad = MAD(wines["quality"])  # benchmark: MAD of quality with no regressors
mads = {}
for i in wines.columns.drop("quality"):  # regress quality on each feature in turn
    m = lad_fit_1(wines, "quality", i)
    mads[i] = m.obj["mean_absolute_deviation"].value()

fig, ax = plt.subplots()
pd.Series(mads).plot(kind="bar", ax=ax, grid=True)
ax.axhline(mad, color="r", lw=3)
ax.set_title("MADs for single-feature regressions")
plt.show()
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6579111945
1494 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.5980365332
1642 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6457762063
1584 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6579111945
1508 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6517416082
1628 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6579111945
1543 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6172771025
1706 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6544712022
1648 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6579111945
1489 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.6320777409
1636 simplex iterations
0 barrier iterations
 
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.5411666021
1578 simplex iterations
0 barrier iterations
 
[Figure: MADs for single-feature regressions, with the no-regressor MAD shown as a red line]
wines["prediction"] = [i[1] for i in m.var["prediction"].get_values()]
wines["quality"].hist(label="data")

wines.plot(x="quality", y="prediction", kind="scatter")
plt.show()
[Figures: histogram of quality; scatter plot of prediction vs. quality for the single-feature fit]

Multivariate \(L_1\)-regression#

Let us now perform a full multivariate \(L_1\)-regression on the wine data set to predict the wine quality \(y\) from the provided wine features. We aim to find the coefficients \(m_j\) and the intercept \(b\) that minimize the mean absolute deviation (MAD) by solving the following problem:

\[\begin{split} \begin{align*} \text{MAD}\,(\hat{y}) = \min_{m, \, b} \quad & \frac{1}{n} \sum_{i=1}^n | y_i - \hat{y}_i| \\ \\ \text{s. t.}\quad & \hat{y}_i = \sum_{j=1}^J x_{i, j} m_j + b & \forall i = 1, \dots, n, \end{align*} \end{split}\]

where \(x_{i, j}\) are the values of the 'explanatory' variables, i.e., the 11 physical and chemical characteristics of the wines. The absolute value in the objective is handled by splitting each residual into nonnegative parts, \(e^+_i - e^-_i = \hat{y}_i - y_i\) with \(e^+_i, e^-_i \geq 0\); since minimization drives at least one of the two parts to zero, \(e^+_i + e^-_i = |\hat{y}_i - y_i|\) at the optimum. The problem can therefore be implemented in AMPL as an LP (the coefficients \(m_j\) appear as a[j] in the model):

%%writefile l1_fit.mod

set I;
set J;

param y{I};
param X{I, J};

var a{J};
var b;

var e_pos{I} >= 0;
var e_neg{I} >= 0;

var prediction{i in I} = sum{j in J}(a[j] * X[i, j]) + b;

s.t. prediction_error{i in I}: e_pos[i] - e_neg[i] == prediction[i] - y[i];

minimize mean_absolute_deviation: sum{i in I}(e_pos[i] + e_neg[i]) / card(I);
Writing l1_fit.mod
def l1_fit(df, y_col, x_cols):
    m = AMPL()
    m.read("l1_fit.mod")

    m.set["I"] = df.index.values
    m.set["J"] = x_cols

    m.param["y"] = df[y_col]
    m.param["X"] = df[x_cols]

    m.option["solver"] = SOLVER
    m.solve()

    return m


m = l1_fit(
    wines,
    "quality",
    [
        "alcohol",
        "volatile acidity",
        "citric acid",
        "sulphates",
        "total sulfur dioxide",
        "density",
        "fixed acidity",
    ],
)
print(f"MAD = {m.obj['mean_absolute_deviation'].value():0.5f}\n")

for k, v in m.var["a"].get_values():
    print(f"{k} {v}")
print("\n")

wines["prediction"] = [i[1] for i in m.var["prediction"].get_values()]
wines["quality"].hist(label="data")

wines.plot(x="quality", y="prediction", kind="scatter")
plt.show()
HiGHS 1.5.3: HiGHS 1.5.3: optimal solution; objective 0.4997972064
1631 simplex iterations
0 barrier iterations
 
MAD = 0.49980

alcohol 0.3424249748641086
citric acid -0.2892764146613565
density -18.50082905729104
fixed acidity 0.06381837845852499
sulphates 0.9060911918549712
total sulfur dioxide -0.002187357846776872
volatile acidity -0.9806174627416928
[Figures: histogram of quality; scatter plot of prediction vs. quality for the multivariate fit]
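For closer inspection, the fitted coefficients can be collected into a pandas Series; a minimal sketch, assuming amplpy's DataFrame.to_pandas() conversion is available:

# gather the fitted coefficients into a pandas Series, sorted by value
coef = m.var["a"].get_values().to_pandas().iloc[:, 0]
print(coef.sort_values())

Note that the raw coefficients are not directly comparable across features, since the features live on very different scales; density, for instance, varies only in the third decimal place, which is why its coefficient is so large in magnitude.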

How do these models perform?#

A successful regression model would demonstrate a substantial reduction from \(\text{MAD}\,(y)\) to \(\text{MAD}\,(\hat{y})\); the value of \(\text{MAD}\,(y)\) sets the benchmark for the regression. The linear regression models clearly have some capability to explain the observed deviations in wine quality. Tabulating the results using the MAD statistic, we find:

Regressors        MAD
none             0.683
alcohol only     0.541
seven features   0.500
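These tabulated values can be checked directly from the stored predictions; a quick sanity check for the multivariate fit:

# recompute the MAD of the multivariate fit from the stored predictions
residual = wines["quality"] - wines["prediction"]
print(f"MAD of residuals = {residual.abs().mean():0.5f}")  # should print ~0.49980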

Are these models good enough to replace human judgment of wine quality? The reader can be the judge.