Abalone age prediction¶

Here, I tried playing with a simple dataset about abalone shells. It consists of several physical measurement. The goal is to predict the age of the shell given the measurements. Age is determined with the number of rings the shell has, but sounting the number of rings is tidious. I will try making a linear model to predict the number of rings.

Importing libraries and the dataset¶
In [81]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns 
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

from scipy.stats import chi2
sns.set()
In [82]:
df = pd.read_csv("./abalone.csv")

Sneak peek into the dataset.

In [83]:
df
Out[83]:
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
... ... ... ... ... ... ... ... ... ...
4172 F 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 11
4173 M 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 10
4174 M 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 9
4175 F 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 10
4176 M 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 12

4177 rows × 9 columns

Data Analysis¶

Checking missing values¶

In [84]:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB
None

We have one column with categorical data ($\texttt{Sex}$), seven with continuous data ($\texttt{Length}$, $\texttt{Diameter}$, $\texttt{Height}$, $\texttt{Whole weight}$, $\texttt{Shucked weight}$, $\texttt{Viscera weight}$, $\texttt{Shell weight}$) and one column with discrete data ($\texttt{Rings}$).

Note about the different weight variables - $\texttt{Shucked weight}$ is the weight of meat, $\texttt{Viscera weight}$ is the gut weight after bleeding and $\texttt{Shell weight}$ is the weight after drying.

In [85]:
df.duplicated().sum()
Out[85]:
0

We see there are no missing nor duplicates values in the data set.

Univariate analysis¶

Let us now look into summary statistics.

In [86]:
df[['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']].describe()
Out[86]:
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
count 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000
mean 0.523992 0.407881 0.139516 0.828742 0.359367 0.180594 0.238831 9.933684
std 0.120093 0.099240 0.041827 0.490389 0.221963 0.109614 0.139203 3.224169
min 0.075000 0.055000 0.000000 0.002000 0.001000 0.000500 0.001500 1.000000
25% 0.450000 0.350000 0.115000 0.441500 0.186000 0.093500 0.130000 8.000000
50% 0.545000 0.425000 0.140000 0.799500 0.336000 0.171000 0.234000 9.000000
75% 0.615000 0.480000 0.165000 1.153000 0.502000 0.253000 0.329000 11.000000
max 0.815000 0.650000 1.130000 2.825500 1.488000 0.760000 1.005000 29.000000

Notice that the minimun value in $\texttt{Height}$ column is 0. We will remove those data points, since essentially they have a missing value.

In [87]:
print(df[df['Height'] == 0])
df = df[df['Height'] > 0]
     Sex  Length  Diameter  Height  Whole weight  Shucked weight  \
1257   I   0.430      0.34     0.0         0.428          0.2065   
3996   I   0.315      0.23     0.0         0.134          0.0575   

      Viscera weight  Shell weight  Rings  
1257          0.0860        0.1150      8  
3996          0.0285        0.3505      6  

For more succinct code, we will define the list of continuous variables.

In [88]:
num_vars = ['Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight']
Sex¶
In [89]:
sns.countplot(df, x='Sex')
Out[89]:
<Axes: xlabel='Sex', ylabel='count'>

Notice we have two types of speciments in the dataset, infants and adults (male and female)

Rings¶
In [90]:
sns.displot(df, x='Rings', discrete=True, kde=True, hue='Sex')
Out[90]:
<seaborn.axisgrid.FacetGrid at 0x31e0c1650>

Observe that sexual maturity comes at different age for different shells in the datasets. We can see that from this graph of distribution of $\texttt{Rings}$ variable stratified by sex ($\texttt{Infant}$, $\texttt{Female}$, $\texttt{Male}$)

Next,let's see the box plots to familiarize ouselves with the state of outliers

In [91]:
sns.set(rc={"figure.figsize":(8, 28)}) 
fig, axes = plt.subplots(7, 1)

for i, var in zip(range(len(num_vars)), num_vars):
    sns.boxplot(df, x=var, ax = axes[i])

sns.set(rc={"figure.figsize":(8, 4)})

Notice that quite a few data points fall into the outlier category.

Visualizing the relationship with the response variable - number of rings

In [92]:
sns.set(rc={"figure.figsize":(8, 28)}) 
fig, axes = plt.subplots(7, 1)

for i, var in zip(range(len(num_vars)), num_vars):
    sns.scatterplot(df, x=var, y=df.Rings, ax = axes[i])

sns.set(rc={"figure.figsize":(8, 4)})
Correlation¶
In [93]:
sns.heatmap(df[num_vars].corr())
Out[93]:
<Axes: >

Several features are higly correlated, namely $\texttt{Length}$ and $\texttt{Diameter}$ (as expected).

Bivariate analysis¶

Let us also plot the pairwise relationships between the explanatory variables.

In [94]:
sns.pairplot(data=df[num_vars + ['Sex']], hue='Sex', corner=True)
Out[94]:
<seaborn.axisgrid.PairGrid at 0x323f57050>

There are several observations we can make from these plots:

  • Notice that $\texttt{Diameter}$, $\texttt{Length}$ and $\texttt{Height}$ are strongly correlated, almost perfect positive correlation
  • Similar for $\texttt{Weigth}$s explanatory variables
  • Distribution of variables among males and females are approximately the same
  • Distribution of variables for infants is to the left of those of adults (as expected)
  • There is a greater variance in distribution among infants for $\texttt{Diameter}$, $\texttt{Height}$ and $\texttt{Length}$, but among the adults for $\texttt{Weigth}$s variables

Preprocessing¶

Outliers¶

In the plots above, we can notice there are some outliers, especially for the $\texttt{Height}$ variable. There are several ways to approach the outliers. We will determine outliers by using Mahalanobis distance

Mahalanobis distance¶
In [95]:
cov = np.cov(df[num_vars], rowvar=False)
cov_inv = np.linalg.inv(cov)
means = np.mean(df[num_vars], axis=0)

distances = []
for i, example in df[num_vars].iterrows():
    distances.append((example - means).T @ cov_inv @ (example - means))
    
df['M-Distance'] = distances
cutoff = chi2.ppf(0.95, len(num_vars))
sum((distances > cutoff)/df.shape[0])
Out[95]:
0.08694610778443086

Based on Mahalanobis distance, we would label around 8.7% of datapoints as outliers. Let us see the feature distributions and bivariate plots again.

In [109]:
df_in = df[distances <= cutoff]

sns.pairplot(data=df_in[num_vars+ ['Sex']], hue='Sex', corner=True)
Out[109]:
<seaborn.axisgrid.PairGrid at 0x30e54f910>

One hot fix¶

Considering we have three categories for attribute $\texttt{Sex}$, we will one-hot encode them.

In [97]:
df_in = pd.get_dummies(df_in)

Normalization¶

Since I have decided to use Linear Regression, we will apply the z-score normalization. Recall that the parameters mean and variance have to be fitted on the training data, not on the whole data. Otherwise, we will have a data leakage.

Splitting the data¶

In [98]:
y = df_in['Rings']
X = df_in
X = X.drop(['Rings', 'M-Distance'], axis=1)
In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

Now, we will fit the mean and variance estimations on the training set and transform both the training and test data

In [100]:
mean = np.mean(X_train[num_vars], axis=0)
std = np.std(X_train[num_vars], axis=0)

X_train[num_vars] = (X_train[num_vars] - mean.T) / std.T
X_test[num_vars] = (X_test[num_vars] - mean.T) / std.T
X_train
Out[100]:
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Sex_F Sex_I Sex_M
3620 1.108662 1.069035 0.962771 1.148178 1.176251 1.720532 0.765672 True False False
2244 -1.454991 -1.323513 -1.038572 -1.268759 -1.240889 -1.264593 -1.143890 False True False
139 -1.275085 -1.269136 -1.181525 -1.209326 -1.228073 -1.141917 -1.081622 False False True
3214 0.478993 1.014659 0.676865 0.744966 0.596957 1.081592 0.765672 True False False
3757 0.029229 0.090265 0.105052 -0.169833 0.020227 -0.410970 -0.234773 False True False
... ... ... ... ... ... ... ... ... ... ...
2969 1.513449 1.123411 1.534583 1.773971 1.699152 2.415698 1.554403 False False True
639 -0.060724 -0.018487 -0.180854 -0.462336 -0.597515 -0.334297 -0.230621 True False False
1922 0.838804 0.851530 0.105052 0.603959 0.522623 0.943581 0.392062 False False True
2682 0.928756 0.851530 0.390958 0.768273 0.825086 0.754455 0.682647 False False True
3405 -1.589920 -1.649769 -1.753338 -1.387624 -1.317787 -1.315708 -1.434475 False True False

2859 rows × 10 columns

Training¶

In [101]:
model = LinearRegression()
model.fit(X_train, y_train)
model.coef_
Out[101]:
array([-0.05974412,  0.91302825,  0.74466083,  7.13957947, -5.63901424,
       -2.0749007 ,  0.22610161,  0.24733179, -0.55284362,  0.30551183])
In [102]:
y_pred = model.predict(X_test)
sns.histplot(y_pred, discrete=True)
Out[102]:
<Axes: ylabel='Count'>

Evaluation¶

Let us check the regressions score:

$R^2$¶

In [103]:
print("R2 score: ", 1 - sum((y_test-y_pred) ** 2) / sum((y_test - np.mean(y_test)) ** 2))
R2 score:  0.5222381351571055

We see that prediction model has rather bad performance.

Principle Component Analysis¶

Noticing that the "size" features and the "weight" features are correlated ammong themselves, let us see if we the data can "nicely" be captured by just two features. The data is already scaled and standardized. Let us start by computing the covarince matrix.

In [104]:
m = X_train.shape[0]
Sigma = (1 / m) * np.dot( X_train[num_vars].T, X_train[num_vars])
U = np.linalg.eig(Sigma)[1]
Lambda = np.linalg.eig(Sigma)[0]
np.round(U, 4)
np.round(Lambda, 3)
Out[104]:
array([6.549e+00, 1.810e-01, 1.230e-01, 1.100e-02, 3.000e-03, 5.600e-02,
       7.700e-02])

Notice that the first two eigenvalues are the biggest ones. Looking at the values, we can notice that even only one component explains a lot of variance in the data compared to others, but we will stick with two. Now that we got the eigenvectors, let's test if the first two will suffice for the decomposition.

In [105]:
U_r = U[:, :2]
print(U_r)
[[-0.38042627 -0.17376591]
 [-0.3808758  -0.21901104]
 [-0.36482612 -0.69848051]
 [-0.38668157  0.28072917]
 [-0.3755808   0.49316573]
 [-0.37785864  0.33224088]
 [-0.37914464 -0.03948732]]
In [106]:
Z_train = np.matmul(U_r.T,  X_train[num_vars].T.to_numpy())
plt.scatter(Z_train[0, :], Z_train[1, :])
Out[106]:
<matplotlib.collections.PathCollection at 0x30eab6e10>

Let us check if the reconstruction error in the data is small enough.

In [107]:
X_train_approx = (U_r @ Z_train).T

numerator = (X_train_approx - X_train[num_vars]).to_numpy()
np.square(numerator)
numerator = np.sum(numerator)
numerator

denominator = X_train[num_vars].to_numpy()
denominator = np.square(denominator)
denominator = np.sum(denominator)
denominator
1-numerator/denominator
Out[107]:
1.0

In the cell above, we have first calculated matrix $X_{trainapprox}$ which is the reconstruction of the $x^{(i)}$'s using $z^{(i)}$'s and the reduces eigenvector matrix $U_r$. Recall that $x^{(i)}$ is the $i$th training example and corresponds to the $i$th row in the $X_{train}$ matrix, while $z^{(i)}$ is the projection of the given example onto the subspace spanned by the columns of the matrix $U_r$.

Next, we go on to calculate $\|x^{(i)} - x^{(i)}_{approx}\|^2$ and $\|x^{(i)}\|^2$ to calculate the percentage of retained variance defined as $1-\frac{\sum \|x^{(i)} - x^{(i)}_{approx}\|^2}{\sum \|x^{(i)}\|^2}$. Then we sum over $i$'s and calculate the ratio. Since in practice we put the threshold of variance retained at 99%, we can say that the Principal Component Analysis was successful.

To validate the calculations, we will check the results of the PCA done by the $\texttt{sklearn}$ library.

In [108]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_train[num_vars])
pca.components_.T
Out[108]:
array([[ 0.38042627,  0.17376591],
       [ 0.3808758 ,  0.21901104],
       [ 0.36482612,  0.69848051],
       [ 0.38668157, -0.28072917],
       [ 0.3755808 , -0.49316573],
       [ 0.37785864, -0.33224088],
       [ 0.37914464,  0.03948732]])

It gives the inverses of vectors we found, so our calculation was correct. We can concluded that it suffices to use only the two extracted features to have the same results, reducing computational requirements.

Further work¶

One peculiarity of the dataset that could be exploited is unrelatedness between sex of the shell and age. Then, stratifying the data by maturity would gives us datapoints that follow Gaussian distribution closer.