This dataset from Kaggle.com (https://www.kaggle.com/ydalat/lifestyle-and-wellbeing-data) contains 12,757 survey responses with 23 attributes describing how we live our lives. How can we reinvent our lifestyles to optimize our individual wellbeing while supporting the UN Sustainable Development Goals?
The Work-Life Balance survey evaluates how we thrive in both our professional and personal lives: it reflects how well we shape our lifestyle, habits and behaviors to maximize our overall life satisfaction along the following five dimensions:
This is a description of the attributes in the dataset:
%run wellbeing_description.py
We are looking for findings that might help us succeed and improve our quality of life. In particular, we'll try to answer these questions:
From the conclusions we reach, we can modify some habits to maximize our achievements and reduce our stress levels.
This is the general guideline followed for the EDA in this project:
2.1 Data Exploration: Exploration of the values and data types.
2.2 Data cleaning: Clean the data, handling possible wrong values, NaN and outliers, and changing the data types to numeric (integers) when possible to perform numerical data analysis.
2.3 Feature engineering: Create encodings for the categorical variables, such as ordinal encoding for age, and binary encoding / dummy variables for gender
2.4 Data statistics: Compute summary statistics with the describe method: mean, standard deviation, max, min and quartiles. Plot histograms and boxplots for all the variables
2.5 Correlations between variables: Look for correlations between variables
2.6 Summary of the EDA: Key Findings and insights
We are going to explore the raw dataset. I'll check the values and data types, looking for possible wrong values, NaNs or missing values.
From the description, we saw that the variable stress_level is numerical, but it is of type object. There might be some values which are not numbers, and therefore the data type of the variable is "object". We'll see this in the exploration of unique values with the unique method of pd.Series. Next, during the data cleaning (section 2.2), I'll handle these wrong values, either dropping the entire rows if there are not too many of them, or imputing them from the variable mean.
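As a minimal sketch of those two options (assuming the CSV is already loaded into a DataFrame called data and the column name lowercased to daily_stress, as done later in the notebook):
import pandas as pd

# Coerce non-numeric entries (e.g. '1/1/00') to NaN
stress_numeric = pd.to_numeric(data['daily_stress'], errors='coerce')

# Option 1: drop the affected rows
data_dropped = data[stress_numeric.notna()]

# Option 2: impute the wrong values with the column mean
data_imputed = data.assign(daily_stress=stress_numeric.fillna(stress_numeric.mean()))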
The variables age and gender are objects. I'll look at the categories into which age is divided, and then, during the feature engineering (section 2.3), I'll perform an ordinal encoding to be able to plot a histogram of age in the data statistics section (2.4). With respect to gender, a binary encoding can be done with a view to using it in a future predictive model.
Once all the variables are cleaned and the categorical features appropriately treated, I am ready to plot histograms and boxplots to look for possible outliers, which we'll handle. Finally, I'll look for the highest positive and negative correlations, especially between the variable stress level and the others, and between achievement and the others. In this last section, and looking at the description of the variables, we can expect negative correlations with stress level in general and positive correlations with achievement.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
import math
data = pd.read_csv('Wellbeing_and_lifestyle_data.csv')
data.head()
# Is there any NaN?
data.isnull().values.any()
#Number of rows and columns
print("Number of data points: {}".format(data.shape[0]), "\nNumber of variables: {}".format(data.shape[1]))
#Column names, data types and counts
data_types = data.dtypes
data_counts = data.count()
df = pd.concat([data_types, data_counts], axis=1).reset_index()
df = df.rename(columns={'index': 'Column names' , 0: 'Types', 1: 'Counts'})
print("\nData types:\n{}".format(df))
The dataset has 12756 data points (rows) and 23 variables (columns). There are no missing values or NaNs.
#Let's change the column names to lowercase
data.columns = pd.Index(data.columns.str.lower())
data.columns
Unique values
for col in data.columns:
print(col, data[col].unique())
(data['daily_stress']=='1/1/00').sum()
Daily stress: There is a wrong value for daily stress (1/1/00), but as it is only one, I'll just drop the entire row (next section)
Age: There are 4 categories of age. In the section "feature engineering" we'll perform the ordinal encoding.
As there are no nulls, missing or strange values except the one in daily stress, the data cleaning will consist only of eliminating the row corresponding to the value 1/1/00 of 'daily_stress' and changing its type to int64.
NOTE: The visualizations of the data (histograms and boxplots) will be done in the data statistics section (2.4), once we perform the initial cleaning and encoding; a later cleaning will be done for outliers, if required.
There is only one strange value of daily_stress, so we can just drop the entire row from the dataset. Then I am going to change the data type from object to integer:
data = data.drop(data[data['daily_stress']=='1/1/00'].index)
data.shape
# Changing data type from object to int64
data['daily_stress'] = data['daily_stress'].astype('int64')
In this section we have to handle 2 categorical variables:
Age: We are going to perform an ordinal encoding, changing the categories to ordered integers
Gender: We are going to perform a binary encoding, changing the categories to 0/1 integers:
We are going to perform an ordinal encoding, changing the categories to ordered integers:
age_categories = ['Less than 20','21 to 35','36 to 50','51 or more'] #data['age'].unique(), but in the order we want
enc = OrdinalEncoder(categories=[age_categories])
enc.fit(np.array(data['age']).reshape(-1,1)) #Alternatively: Method fit_transform = method fit + method transform
#An array of shape (n_samples, n_features) should be passed to fit --> reshape(-1,1)
enc.categories
age_encoded = enc.transform(np.array(data['age']).reshape(-1,1))
age_encoded
np.unique(age_encoded)
age_original = enc.inverse_transform(age_encoded)
age_original
# Verification
enc.inverse_transform([[code] for code in [0,1,2,3]])
data['age'] = age_encoded
We are going to perform a binary encoding, changing the categories to 0/1 integers:
data['gender'].unique()
pd.get_dummies(data['gender'], drop_first=True)
gender_translator = {'Female': 0, 'Male': 1}
print(gender_translator)
data['gender'] = pd.get_dummies(data['gender'], drop_first=True)
# Display with a more descriptive column name (not assigned back; the column keeps the name 'gender')
d = {'gender': 'Gender (Female: 0, Male: 1)'}
data.rename(columns=d)
description = data.describe()
description.loc["range"] = description.loc['max'] - description.loc['min']
out_fields = ["mean", "25%", "50%", "75%", "range"]
description = description.loc[out_fields]
description.transpose()
Boxplots
# Boxplots
fig, axes = plt.subplots(figsize=(16,5))
data.boxplot()
for tick in axes.get_xticklabels():
tick.set_rotation(45)
tick.set_ha('right') #Horizontal alignment
plt.show()
n_outliers = data['daily_shouting'][data['daily_shouting']>8].shape[0]
p = n_outliers/data.shape[0]
print("{} values above the upper whisker (Q3 + 1.5*IQR) = 8 for daily_shouting \n{}% of data".format(n_outliers, round(p*100,2)))
n_outliers = data['flow'][data['flow']>8].shape[0]
p = n_outliers/data.shape[0]
print("{} values above the upper whisker (Q3 + 1.5*IQR) = 8 hours of daily flow \n{}% of data".format(n_outliers, round(p*100,2)))
n_outliers = data['sleep_hours'][data['sleep_hours']<3].shape[0]
p = n_outliers/data.shape[0]
print("{} values below the lower whisker (Q1 - 1.5*IQR) = 3 hours of daily sleep \n{}% of data".format(n_outliers, round(p*100,2)))
We decide to keep the outliers, as they might be meaningful, representing the reality of the population. Besides, they are not so extreme as to make us think of them as errors.
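As a small sketch (not in the original notebook) of how the Q1/Q3 ± 1.5·IQR fences mentioned above can be computed for any column:
# Sketch: boxplot-style fences; values outside them are the outliers counted above
def iqr_fences(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for col in ['daily_shouting', 'flow', 'sleep_hours']:
    low, high = iqr_fences(data[col])
    print("{}: lower fence = {}, upper fence = {}".format(col, low, high))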
# Histograms numerical variables
axs = data.hist(figsize=(16, 16))
for ax in axs.flatten():
if ax.is_first_col():
        ax.set_ylabel('Frequency')
plt.suptitle('Histograms', y=0.95, fontsize=20)
plt.show()
Correlation coefficients tell us how well the relation between two variables is described by a straight line. However, they don't give any information about how much one variable influences the other. For that purpose, we would have to fit a linear regression model, which we'll do later. Now let's look at the (Pearson) correlations.
data.corr()
# Heatmap correlations
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(data.corr(), xticklabels=True, yticklabels=True, ax=ax, center=0, cmap='seismic') #see "diverging colormaps"
Negative correlations:
Poor correlations
Positive correlations:
The rest of the variables are in general positively correlated with each other, and these positive correlations are the strongest ones. From the numerical analysis, I saw that achievement and supporting others are the variables positively correlated with the greatest number of the others. In particular, achievement is the variable positively correlated with the largest number of other variables. For a predictive model, we found that flow and personal awards could be good predictive variables, as they are not too correlated with each other. We could try with only these two, but we could also include core circle and social network in case we want more than two. However, that would probably increase model complexity unnecessarily and therefore favour overfitting.
There are other unsurprising strong correlations, such as:
sorted_corr = np.unique(np.sort(data.corr().values.flatten()))[::-1] #Correlations in descending order
array2D_corr = np.array(data.corr())
n = 12
nth_highest = sorted_corr[:n]
nth_highest
for col in data.corr().columns:
df_pos_corr = pd.DataFrame(data.corr()[col].nlargest(2)[1:]).transpose()
df_neg_corr = pd.DataFrame(data.corr()[col].nsmallest(1)).transpose()
df = pd.concat([df_pos_corr, df_neg_corr], axis=1)
display(df)
Positive correlations:
The strongest correlations are positive. Let's see which pairs of variables have the highest correlations.
# 20 pairs of variables with the highest correlation
sorted_corr = np.unique(np.sort(data.corr().values.flatten()))[::-1] #Correlations in descending order
array2D_corr = np.array(data.corr())
n = 20
nth_highest = sorted_corr[:n]
fig, axs = plt.subplots(5,4, figsize=(16,16))
fig.tight_layout(pad=5) #padding among subplots
axs = axs.flatten()
rows = []
columns = []
print("Highest correlations:\n\n")
for i in range(n):
position = np.where(array2D_corr == nth_highest[i])
row = data.corr().index[position[0][0]]
column = data.corr().columns[position[1][0]]
rows.append(row)
columns.append(column)
print("{} - {}:".format(row, column))
print(nth_highest[i], "\n")
frecuency = pd.crosstab(data[row],data[column]).iloc[::-1]
ax = fig.sca(axs[i])
ax.set(title='{} VS {}'.format(column,row))
sns.heatmap(frecuency, ax=ax, cmap="YlGnBu")
# axs[i].imshow(frecuency, cmap='Blues')
# axs[i].hist2d(data[row],data[column], cmap='Blues')
# Number of variables that each one is highly correlated with
rows_cols = rows + columns
for variable in list(set(rows_cols)):
print(variable, rows_cols.count(variable))
achievement and supporting others are the variables positively correlated with the greatest number of the others.
Let's see the correlations of achievement with the others, and which pairs of these are least correlated with each other:
# Correlations with achievement
data.corr()['achievement'].sort_values(ascending=False)
# What pair of them are less correlated between them?
achievement_corrs = data.corr()['achievement'].sort_values().index.tolist()
data.corr()[achievement_corrs].loc[achievement_corrs]
The further down and to the right, the more correlated the variables are with achievement. For a predictive model, we are interested in predictive variables with a low correlation between them, so we should look for small values in the bottom-right part of this dataframe. We see a 0.217978 corresponding to flow and personal awards, so choosing only these two might be a good option. core circle and social network would be the next ones to choose in case we want more.
# The lowest values of correlation in absolute value
np.unique(np.sort(np.array(abs(data.corr()[achievement_corrs[11:]].loc[achievement_corrs[11:]])).flatten()))
Negative correlations:
Let's see the most negatively correlated pairs of variables:
# 20 pairs of variables with the most negative correlation
sorted_corr = np.unique(np.sort(data.corr().values.flatten()))
array2D_corr = np.array(data.corr())
n = 20
nth_lowest = sorted_corr[:n]
fig, axs = plt.subplots(5,4, figsize=(16,16))
fig.tight_layout(pad=5) #padding between subplots
axs = axs.flatten()
rows = []
columns = []
print("Lowest correlations:\n\n")
for i in range(n):
position = np.where(array2D_corr == nth_lowest[i])
row = data.corr().index[position[0][0]]
column = data.corr().columns[position[1][0]]
rows.append(row)
columns.append(column)
print("{} - {}:".format(row, column))
print(nth_lowest[i], "\n")
frecuency = pd.crosstab(data[row],data[column]).iloc[::-1]
ax = fig.sca(axs[i])
ax.set(title='{} VS {}'.format(column,row))
sns.heatmap(frecuency, ax=ax, cmap="YlGnBu")
# axs[i].imshow(frecuency, cmap='Blues')
# axs[i].hist2d(data[row],data[column], cmap='Blues')
# Number of the other variables that each one is highly negatively correlated with
rows_cols = rows + columns
for variable in list(set(rows_cols)):
print(variable, rows_cols.count(variable))
As we showed above in the heatmap of the correlation matrix, daily stress is negatively correlated with a number of other variables. Let's see the most negative correlations of daily stress with the others, and which pairs of these others are least correlated with each other:
# Correlations with daily stress
data.corr()['daily_stress'].sort_values()
# What pair of them are less correlated between them?
stress_corrs = abs(data.corr()['daily_stress']).sort_values().index.tolist()
data.corr()[stress_corrs[11:]].loc[stress_corrs[11:]]
The further down and to the right, the more correlated the variables are with daily stress. For a predictive model, we are interested in predictive variables with a low correlation (in absolute value) between them, so we should look for small values in the bottom-right part of this dataframe. We see a -0.096076 corresponding to daily shouting and daily meditation, so choosing these two might be a good option. Other ones we can choose that are poorly correlated with daily shouting and daily meditation are lost vacation and sufficient income.
# The lowest values of correlation in absolute value
np.unique(np.sort(np.array(abs(data.corr()[stress_corrs[11:]].loc[stress_corrs[11:]])).flatten()))
From the EDA we can conclude some key findings:
Hypothesis 1: The population sleeps fewer than 7 hours (the World Health Organization recommendation is 7-8 hours). We can perform a t-test to check whether people sleep, on average, fewer hours than the WHO recommends, which might be indicative of a poor quality of life. The next step would be to see if those who sleep few hours also have negative indicators such as high stress levels or low scores in achievement, personal awards, etc.
Hypothesis 2: Data points with high values for positive attributes are grouped together. It is said that one good habit positively affects other areas, producing a kind of "virtuous cycle". Because of this, and the similar behaviour of the positive attributes (reds in the correlation matrix) with respect to the other variables, we might expect the data to be grouped by positive and negative attributes. We could perform clustering or classification algorithms, along with a significance test of the results, to test this hypothesis (see the sketch after this list).
Hypothesis 3: There is a correlation between achievement and the variables flow, daily_meditation and time for passion significantly different from zero. If so, we can try a predictive model such as a linear model to predict achievement, or just use the findings for interpretation and to increase our achievements.
Hypothesis 4: There is a correlation between daily stress and daily meditation significantly different from zero. The same approach as in Hypothesis 3 can be applied to daily stress.
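As a rough, hypothetical sketch of the clustering idea in Hypothesis 2 (it is not performed in this notebook; the choice of KMeans with 2 clusters is an assumption of the sketch):
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Use only the numeric/encoded attributes (this excludes the timestamp column)
numeric_features = data.select_dtypes(include=['number', 'bool']).astype(float)
X_scaled = StandardScaler().fit_transform(numeric_features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=2021)
labels = kmeans.fit_predict(X_scaled)

# Compare cluster means to see whether positive attributes tend to group together
print(numeric_features.assign(cluster=labels).groupby('cluster').mean().T)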
We assume that the sleep hours follow a normal distribution, and we are going to treat the variable as continuous, although our data only take integer values between 1 and 10. Our null ($H_{0}$) and alternative ($H_{1}$) hypotheses are:
$H_{0}: \mu \leq 7h$
$H_{1}: \mu > 7h$
The population standard deviation is unknown, so we are going to estimate it from our data and perform a t test:
1. We calculate the t statistic:
$t_0 = \large \frac{\overline{X} - \mu}{\Large \frac{S} {\sqrt n}} \hspace{0.5cm} \small [1]$
where,
$S = \large \sqrt{\frac{\sum_{i=1}^n (x_{i}-\overline x)^2}{n-1}}$
is the standard deviation of our sample, and
$\mu$: population mean
$\overline X$: sample mean
2. The p-value is the probability of $t$ being greater than the value obtained from our sample, $t_0$. This is the area under the curve of the Student's t distribution, whose expression is:
$f(t) = \large \frac{\Gamma(\frac{\Large \nu + 1}{2})}{\sqrt{\nu\pi}\hspace{1mm} \Gamma(\nu/2)} (1+\frac{t^2}{\nu})^{-\frac{\nu+1}{2}} \hspace{0.5cm} \small [2]$,
where
$\hspace{2cm}\Gamma(p) = \int_0^\infty x^{p-1}e^{-x}dx \hspace{0.5cm} \small [3]$
The area under the curve is the integral of $f(t)$ between two values of t. When integrating between $t_0$ and $\infty$ we get the p-value, which is the probability of $t > t_0$ and, therefore, the probability of the mean of another sample being greater than our $\overline X$.
What is the probability of the population mean being less than 7h, given our sample?
Note in [1] that, for the variable t, keeping $\mu$ fixed and varying $\overline X$ is equivalent to keeping $\overline X$ fixed and varying $\mu$. To calculate the probability of the population mean being less than 7h given our sample, we would keep our $\overline X$ fixed (from our sample) and traverse $\mu$ from $7 \rightarrow -\infty$ (to 0 actually), which is the same as keeping $\mu = 7$ and traversing $\overline X$ from $7.03 \rightarrow \infty$. When integrating $f(t)$ we would traverse t in opposite directions, but we can see in [1] that $dt$ also has the opposite sign, so the result of the integral would be the same.
Thus, the probability of the population mean being less than 7h given our sample is exactly the same as the probability of any sample mean being greater than ours (7.03h) under the hypothesis that the population mean is 7h, i.e., given $\mu = 7$.
from scipy.stats import t
mu = 7 # Population mean if the hypothesis is true
sample_mean = data['sleep_hours'].mean()
sample_std = data['sleep_hours'].std()
n = data['sleep_hours'].shape[0]
t_stat = (sample_mean - mu) / (sample_std/np.sqrt(n))
t_stat
# Probability of mu < 7h
p_value = 1-t.cdf(t_stat, df=n-1)
print("p-value: ", p_value)
print("1-p-pavlue: ", 1-p_value)
My confidence level was 99%, so p-values below 0.01 lead us to reject the null hypothesis. I obtained a $p-value \approx 0.0005$, so we can reject the null hypothesis with (at least) 99% confidence.
$H_{1}$ accepted: the population sleeps, on average, more than 7 hours
Extra: Bayesian interpretation
NOTE: I think the expressions below are not valid for continuous variables, so this may all be wrong.
Define:
$r = \large \frac{p(H_{0}|x)}{p(H_{1}|x)} = \frac{p(H_{0})·p(x|H_{0})}{p(H_{1})·p(x|H_{1})} = \frac{p(H_{0})}{p(H_{1})}·\frac{p_{value}}{1 - p_{value}}$,
# Giving the same prior probabilities to both hypothesis, p(H0) = p(H1):
r = p_value/(1-p_value)
r
1/r
The alternative hypothesis is 2114 times more likely than the null hypothesis. The sum of both probabilities is 1, so:
p_H1 = (1/r) / ((1/r)+1)
p_H1
Posterior probability: $p(H_{1}) = 99.95 \% $
What is the value of the population mean?
mu = 7.03
t_stat = (sample_mean - mu) / (sample_std/np.sqrt(n))
t_stat
t.cdf(t_stat, df=n-1)
$\mu = \overline{X} \pm t\cdot\frac{S}{\sqrt{n}}$
t.ppf(0.975, df=n-1) * sample_std/math.sqrt(n) # two-sided 95% CI -> 0.975 quantile
The mean is $(7.03 \pm 0.02)h$ with a confidence level of 95%
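As a cross-check (a sketch, not part of the original analysis), scipy's t.interval returns the two-sided 95% interval directly, using the sample statistics computed above:
ci_low, ci_high = t.interval(0.95, df=n-1, loc=sample_mean, scale=sample_std/np.sqrt(n))
print("95% CI for the mean: ({:.3f}, {:.3f})".format(ci_low, ci_high))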
The steps may be:
Pearson correlation coefficients, r, give us an idea of how well a straight line fits the data, but they say nothing about how much one variable influences another. Thus, the magnitude of the linear regression coefficients and the Pearson correlations are not necessarily related. What tells us which features influence stress the most is the magnitude of the linear regression coefficients.
Now we are going to build some linear regression models, focusing on the interpretation of the main factors behind daily stress, so we'll see which features have the strongest influence by comparing the corresponding coefficients. These can probably be sources of stress, and thus the findings may be of great interest.
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error
First, we'll remove the variable timestamp, as we are not interested in it, and bmi_range, as its values are 1 or 2, which seem wrong ("body mass index" typically takes values around 18 - 30).
data_original = data.copy()
data = data.drop(['timestamp', 'bmi_range'], axis=1)
OLS from statsmodels.api
First, we will fit a linear model with the class OLS (Ordinary Least Squares) from statsmodels.api, which provides a more statistical treatment, for example the t value from a significance test on whether each coefficient is non-zero. We will pass the training sets X_train and y_train to the OLS object. We will do the train-test split with a test size of 0.3 (30%), which results in 8928 data points for the training set and 3827 for the test set. We choose a seed for convenience, for example random_state=2021, to ensure the splits are the same when comparing with other models. Once we have a baseline, we will try adding regularization and polynomial features and see whether they shed light on which factors have more impact on stress.
# Baseline
import statsmodels.api as sm
y = data['daily_stress']
X = data.drop('daily_stress', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=2021)
lr = sm.OLS(y_train, sm.add_constant(X_train))
OLSresults = lr.fit()
print("R2 on training set:", r2_score(y_train, OLSresults.predict(sm.add_constant(X_train))))
print("R2 on test set:", r2_score(y_test, OLSresults.predict(sm.add_constant(X_test))))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults.predict(sm.add_constant(X_test))))
print("Root mean squared error: ", rmse)
display(OLSresults.summary())
rmse_cv_ols = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X, y)))
print(rmse_cv_ols)
The features with coefficients significantly different from 0 are shown next. Let's choose a 99% confidence level as the criterion (p-value < 0.01), so we will keep the features whose "P>|t|" column is less than 0.01:
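As a sketch (not in the original), the same selection could be derived programmatically from the baseline model's p-values instead of hard-coding the list of dropped features; the manually selected version follows below:
pvals = OLSresults.pvalues.drop('const')
features_nonzero_auto = pvals[pvals < 0.01].index.tolist()
print(sorted(features_nonzero_auto))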
features_nonzero = X.columns.drop(['fruits_veggies','donation', 'daily_steps', 'age'])
X_train2 = sm.add_constant(X_train[features_nonzero])
X_test2 = sm.add_constant(X_test[features_nonzero])
lr2 = sm.OLS(y_train, X_train2)
OLSresults2 = lr2.fit() #OLSResults object
print("R2 on training set:", r2_score(y_train, OLSresults2.predict(X_train2)))
print("R2 on test set:", r2_score(y_test, OLSresults2.predict(X_test2)))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults2.predict(X_test2)))
print("Root mean squared error: ", rmse)
coefs2 = OLSresults2.params.sort_values()
print("-"*50)
print("\nCoefficients: \n")
print(coefs2)
rmse_cv_ols2 = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X[features_nonzero], y)))
print("\nRmse CV: ", rmse_cv_ols2)
These coefficients are significantly different from zero*. The variables gender, sufficient income and daily shouting (positive coefficient) seem to be the ones with the most influence on stress.
\*Note: The residuals are not normally distributed, and no data transformation (log, root, Box-Cox) yields more normally distributed residuals. As the significance tests performed by `statsmodels.api.OLS` assume normally distributed residuals, we should be prudent and critical with the results before reaching a conclusion. However, one fact in favour is that the independence $\chi^2$ tests show a dependency between the categories of *stress* and, at least, the variables *gender, sufficient income* and *daily shouting*. These only ensure dependency between unordered categories, not correlation, but taking a look at the data, it is noticeable that this dependency follows a definite direction. Everything points to significant non-null correlations. See the Statistical Hypothesis Testing section for dependencies and correlations for details.
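A minimal sketch (not in the original) of how that residual-normality check can be run on the fitted model:
from scipy.stats import normaltest

residuals = OLSresults2.resid
stat, p = normaltest(residuals)
print("normaltest statistic: {:.2f}, p-value: {:.2e}".format(stat, p))  # a small p-value indicates non-normal residuals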
We have a number of features, but some of their coefficients are close to zero. We can now try reducing the number of features and see what coefficients we obtain for the remaining ones. We will do this with two approaches: Lasso regression and RFE (Recursive Feature Elimination):
We see some features that stand out against the others: gender, sufficient income and daily shouting. However, we didn't obtain a great $R^2$ score, so it might not be significantly different from zero. We can't be sure for now whether there is a real relation between those variables and stress.
We can try simplifying the model by adding regularization and see if we obtain a better fit that gives us more confidence in the real influence of the predictive features. For this purpose, Lasso is usually a better choice than Ridge in terms of interpretation, as coefficients are more likely to vanish with Lasso, keeping only those with a stronger impact on the output variable and improving the interpretation of this simplified model. In addition, we tried an ElasticNetCV and obtained that the best l1_ratio parameter was 1, so we will try different alpha values for a Lasso regression. We will create a Lasso model using LassoCV in order to try several alpha values (the parameter with which we control the amount of regularization) and compare the resulting models through the $R^2$ score obtained after cross validation, to ensure the particular choice of the training-validation split doesn't determine which one is better. We keep the parameter cv=None, which uses the default 5-fold cross validation, and we use the X_train subset used previously with LinearRegression, which, in turn, will be split into 5 folds to perform the cross validation. We keep X_test to check the performance on unseen data.
alphas = np.geomspace(1e-5, 1, num=24)
lassoCV = LassoCV(alphas=alphas)
lassoCV.fit(X_train, y_train)
print("Best model:\n")
print("Alpha: ", lassoCV.alpha_)
print("R2 on test set: ", lassoCV.score(X_test, y_test)) #r2_score(y_test, lassoCV.predict(X_test))
rmse = np.sqrt(mean_squared_error(y_test, lassoCV.predict(X_test)))
print("Root mean squared error: ", rmse)
rmse_cv_lasso = np.sqrt(mean_squared_error(y, cross_val_predict(Lasso(alpha=lassoCV.alpha_), X, y)))
print(rmse_cv_lasso)
pd.DataFrame(list(zip(X.columns, lassoCV.coef_)), columns=['feature','coef']).set_index('feature').sort_values(by='coef')
None of the coefficients has become zero. In fact, the results are very similar to the ones from plain linear regression, both the $R^2$ score and the coefficient values. However, when trying a higher regularization (alpha greater than 1), the $R^2$ score becomes practically zero.
We are going to be stricter, keeping only 13 features, the number of features whose coefficients are significantly different from zero. Let's create a model using only these 13, selected with RFE, and see if we obtain the same ones as in features_nonzero.
LinearRegression with the top 13 features selected by Recursive Feature Elimination (RFE)
rfe = RFE(LinearRegression(), n_features_to_select=12)
rfe.fit(X, y)
features = X.columns[rfe.support_]
features
X_train_rfe_1 = sm.add_constant(X_train[features])
X_test_rfe_1 = sm.add_constant(X_test[features])
lr_rfe = sm.OLS(y_train, X_train_rfe_1)
OLSresults_rfe = lr_rfe.fit() #OLSResults object
print("R2 on training set:", r2_score(y_train, OLSresults_rfe.predict(X_train_rfe_1)))
print("R2 on test set:", r2_score(y_test, OLSresults_rfe.predict(X_test_rfe_1)))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults_rfe.predict(X_test_rfe_1)))
print("Root mean squared error: ", rmse)
coefs_rfe = OLSresults_rfe.params.sort_values()
print("-"*50)
print("Coefficients:\n")
print(coefs_rfe)
rmse_cv_rfe = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X[features], y)))
print(rmse_cv_rfe)
There might be interactions between variables. Let's try adding polynomial features and see if we get a better performance:
We can drop gender to look at the interactions between the others, as it doesn't make much sense to include interaction terms with gender, and then generate the polynomial features. daily_shouting has negative correlations with the others, so we tried including ratio features like $\frac{meditation}{shouting}$, but no good results were obtained. Adding polynomial features might shed light on possible interactions between positively correlated variables.
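For reference, a hypothetical sketch of such a ratio feature; the column names daily_meditation and daily_shouting are as referenced in the text, and the +1 in the denominator (to avoid division by zero) is an assumption of this sketch, not part of the original analysis:
X_ratio = X_train.copy()
X_ratio['meditation_over_shouting'] = X_ratio['daily_meditation'] / (X_ratio['daily_shouting'] + 1)
X_ratio[['daily_meditation', 'daily_shouting', 'meditation_over_shouting']].head()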
Let's try polynomial features, keeping only the interaction terms, and creating a Lasso model feeding it with these new features:
pf = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = pf.fit_transform(X_train.drop('gender', axis=1))
X_test_poly = pf.transform(X_test.drop('gender', axis=1))
lassoCV_poly = LassoCV(alphas = alphas, max_iter=100000)
lassoCV_poly.fit(X_train_poly, y_train)
print("Best model with polynomial features:\n")
print("Alpha: ", lassoCV_poly.alpha_)
print("R2 on test set: ", lassoCV_poly.score(X_test_poly, y_test))
rmse = np.sqrt(mean_squared_error(y_test, lassoCV_poly.predict(X_test_poly)))
print("Root mean squared error: ", rmse)
# Cross-validated rmse of the Lasso model on the polynomial features (the full X, transformed as above)
X_poly_full = pf.transform(X.drop('gender', axis=1))
rmse_cv_poly = np.sqrt(mean_squared_error(y, cross_val_predict(Lasso(alpha=lassoCV_poly.alpha_, max_iter=100000), X_poly_full, y)))
print(rmse_cv_poly)
# Note: pf was fitted on X_train without 'gender', so pass those column names
selected_features = [(a, b) for a, b in zip(pf.get_feature_names(input_features=X_train.drop('gender', axis=1).columns), lassoCV_poly.coef_)
                     if b != 0]
df_selected_features = pd.DataFrame(selected_features, columns=['Polynomial Feature', 'Coef']).sort_values(by='Coef')
print("Polynomial Features coefficients:")
display(df_selected_features.head())
display(df_selected_features.tail())
Coefficients are very close to 0 because, despite the regularization, there are still many features. However, we can't increase the regularization (the alpha parameter) because, as we have seen, in that case $R^2$ drops to almost zero and the linear coefficients lose their reliability. In addition, the interaction terms obtained are fairly similar to each other and don't make much sense, so they don't provide a good interpretation.
After exploring several possibilities, we discard adding polynomial features. We saw that regularization does not improve the results reached with simple linear regression, giving very similar coefficients without eliminating any of them. In addition, it doesn't provide a better fit that could have made it more trustworthy. In conclusion, we can keep the simple linear regression.
print("rmse with cross validation predict:\n")
print("OLS: {}\nOLS with significantly different from zero coefficients: {}\nLasso: {}\nLasso with polynomial features: {}\nOLS with RFE: {}".format(
rmse_cv_ols, rmse_cv_ols2, rmse_cv_lasso, rmse_cv_poly, rmse_cv_rfe))
The model with the lowest rmse will be the most reliable one and thus the one we will use to explain which features have the strongest influence on stress. We discard the polynomial features because they did not make sense and there were many features with very similar coefficients, which is not useful for the explainability of stress. Plain linear regression (OLS) is the one that best fits our objective of interpretation, using only the following features:
print(coefs2)
Daily stress depends mainly on 2 features: gender and sufficient income. On average, women suffer more stress than men, and the same happens with people who do not have sufficient income. These findings are not surprising, since other studies also support these hypotheses. Other variables having an impact on stress, although considerably smaller, are the number of sleep hours, and to an even lesser extent daily meditation, lost vacation, etc. These have very little influence, but it is not zero.
Comparison: Here we have a comparison between the gender coefficient and the others. We can use the following ratios to compare the influence of each variable on stress: the higher the value, the less influence the variable has.
print("Ratios coefficients\n")
for i,coef in enumerate(coefs2[:2]):
ratios = coef/coefs2[:-1].drop(coefs2.index[i])
ratios.index = ratios.index.map(lambda x: coefs2.index[i] + "/" + x)
print(ratios)
print("\n")
Interpretation: These coefficient ratios say how many units of the corresponding feature (sleep hours, daily meditation, flow and daily shouting) would have the same effect on daily stress as gender. So we see that, for example, being a woman has roughly the same effect on stress as about 2.6 points on the shouting scale 0-10 (0: not at all, 10: very often), and each sleep hour would be comparable to meditating 2 more times or dedicating 3 more hours to your passion, etc.
Here are some ideas to improve the model built up to now:
I'll perform some tests for some pairs of variables to check whether there is a dependency and, going one step further, whether there is a correlation between them.
I'll calculate the statistic: $\chi^2 =\large \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}}$, where:
$O_{i}$ is each of the observed (real) values
$E_{i}$ is each of the expected values, i.e., the value we would expect from the category proportions if no dependency took place (e.g., the female/male proportion in the gender test).
I'll create a function that shows the crosstab counts of 2 variables, both observed and expected (from the proportion of each category within the variable it belongs to), and calculates the $\chi^2$ statistic:
def independence_test(variable1, variable2):
'''
Return: DataFrame comparing observed vs expected data, chi2, chi2_matrix (addends, contributions to the value of chi2)
Example:
    >> df, chi2, chi2_matrix = independence_test('feature1', 'feature2')
'''
# variable1 in Rows
# variable2 in Columns
# Proportions
proportions = (data[variable2].value_counts()/data[variable2].count()).sort_index()
# Expected values
value_counts_var1 = (data[variable1].value_counts()).sort_index()
expected_df = pd.DataFrame()
for index,value in zip(proportions.index, proportions.values):
expected_df[index] = value_counts_var1*value
# Observed values
observed_df = pd.crosstab(data[variable1], data[variable2])
observed_df.index.name = None
observed_df.columns.name = None
# Dataframe Observed vs Expected values
reality = ['Observed','Expected']
df_concat = pd.concat([observed_df, expected_df], axis=1, keys=reality)
df_concat = df_concat.sort_index(axis=1, level=1)
df_concat.columns = df_concat.columns.map(lambda x: x[::-1])
df_concat = df_concat.sort_index(axis=0)
print('Index: Values of {} \nColumns: Values of {}'.format(variable1, variable2))
display(df_concat)
# Chi2
print("-"*100)
print("\nCalculation of chi2:")
print("\nAddends (contributions to the values of chi2):")
chi2_matrix = (((expected_df - observed_df)**2)/expected_df).sort_index(axis=1)
chi2_matrix.index.name = variable1
chi2_matrix.columns.name = variable2
display(chi2_matrix)
chi2 = chi2_matrix.sum(axis=1).sum(axis=0)
print("chi2:", chi2)
return df_concat, chi2, chi2_matrix
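As a cross-check sketch (not part of the original), scipy.stats.chi2_contingency computes the same Pearson $\chi^2$ statistic, its p-value and the expected counts directly from the observed crosstab:
from scipy.stats import chi2_contingency

observed = pd.crosstab(data['daily_stress'], data['gender'])
chi2_value, p_value, dof, expected = chi2_contingency(observed, correction=False)
print("chi2: {:.3f}, p-value: {:.2e}, degrees of freedom: {}".format(chi2_value, p_value, dof))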
Let's perform a significance test to check whether there is a correlation significantly different from 0. We'll use $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. Explanations about this expression and the significance test for correlations are given in the "observations" section.
def correlation_test(variable1, variable2):
'''
Return: t statistic
'''
r = data.corr()[variable1][variable2]
n = data.shape[0]
std_r = np.sqrt((1-r**2)/(n-2))
print("Correlation coefficient r: ", r)
print("Standard error of correlation coefficient: ", std_r)
#t statistic
t_statistic = (r*np.sqrt(n-2))/np.sqrt(1-r**2)
print("\nt statistic: ", t_statistic)
return t_statistic
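As a quick cross-check (a sketch, not in the original), scipy.stats.pearsonr returns the same correlation coefficient together with a two-sided p-value equivalent to this t-test:
from scipy.stats import pearsonr

r_check, p_check = pearsonr(data['daily_stress'], data['gender'])
print("r: {:.5f}, two-sided p-value: {:.2e}".format(r_check, p_check))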
$H_{0}$: Stress levels do not depend on gender
$H_{1}$: Stress levels do depend on gender
1. $\chi^2$ test to check independency:
I'll perform a $\chi^2$ test. There are $n_{1} = 6$ stress levels and $n_{2} = 2$ possible values for gender, so we have $(n_{1} - 1)\cdot(n_{2} - 1) = 5$ degrees of freedom. The $\chi^2$ critical values for 5 degrees of freedom are (verified with the scipy sketch after this list):
$\chi^2_{0.01, 5} = 15.087$ for 99% of confidence level
$\chi^2_{0.005, 5} = 16.75$ for 99.5% of confidence level
$\chi^2_{0.001, 5} = 20.515$ for 99.9% of confidence level
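These critical values can be verified with scipy (a small sketch, not in the original):
from scipy.stats import chi2 as chi2_dist

for alpha in [0.01, 0.005, 0.001]:
    print("chi2 critical value for alpha = {}: {:.3f}".format(alpha, chi2_dist.ppf(1 - alpha, df=5)))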
The test itself is done by the function created above, independence_test:
print(gender_translator, "\n")
df_concat_stress_gender, chi2_stress_gender, chi2_matrix_stress_gender = independence_test('daily_stress', 'gender')
The $\chi^2$ critical value for 5 degrees of freedom and a significance level of 0.001 is $\chi^2_{5,0.001} = 20.515$. We obtained $248.174$, so we can reject the null hypothesis "stress level is independent of gender" with (at least) a 99.9% confidence level.
We can accept the alternative hypothesis "stress level depends on gender". This doesn't necessarily imply correlation, although we can appreciate that one does exist. We see the differences between women and men: for low stress levels (0 and 1) there are fewer women (Observed values) than expected and more men than expected from the male/female proportion, while for the high stress level (5) there are more women and fewer men than expected. At a glance, women suffer more stress than men, generally speaking. But we have to perform a significance test for the correlation, which we'll do further below, to make sure it really exists.
If we look at the $\chi^2$ contribution of each stress level separately, we see that the answers 2 and 4 in stress level follow the male/female proportion, so no dependence on gender shows up among these middle values. The differences come from the extreme values: 0, 1 and 5:
from matplotlib import colors
norm=colors.TwoSlopeNorm(vmin=0, vcenter=6.63/2, vmax=11)
sns.heatmap(chi2_matrix_stress_gender, norm=norm, cmap='RdBu_r');
The $\chi^2$ critical value for 1 degree of freedom and a 0.01 significance level is $\chi^2_{1,0.01} = 6.63$, so in the heatmap, the reds are the cells that push the row sum above 6.63 and thus contribute to that particular stress level being significantly dependent on gender.
df_diff = pd.DataFrame()
for col in df_concat_stress_gender.columns.levels[0]:
df_diff[col] = (df_concat_stress_gender[(col,'Observed')] - df_concat_stress_gender[(col,'Expected')])
norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
sns.heatmap(df_diff, norm=norm, cmap='RdBu_r', annot=True);
We clearly see a correlation: there are more observed cases than expected for the pairs men (1) - low stress (0, 1, 2) and women (0) - high stress (3, 4, 5). From the previous heatmap we can see where these differences are significant, and further below we'll perform a t-test to confirm the significance of the correlation.
# Plots
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.plot(df_concat_stress_gender[0])
plt.legend(df_concat_stress_gender[0].columns);
plt.title('Female');
plt.subplot(1,2,2)
plt.plot(df_concat_stress_gender[1])
plt.legend(df_concat_stress_gender[1].columns);
plt.title('Male');
2. t-test to check correlation
t_statistic = correlation_test('daily_stress', 'gender')
# p-value for n-2 degrees of freedom
from scipy.stats import t
p_value = t.cdf(t_statistic, n-2)*2 #2 tails into account, to obtain P>|t|
print(p_value)
Strictly speaking, the residuals should be normally distributed. However, the dependency is ensured by the $\chi^2$ test, and taking a look at the data, it is noticeable that this dependency follows a definite direction. Although this t-test is not definitive, everything points to the hypothesis that women suffer more stress than men, and in fact there are other studies on this topic reaching the same conclusions.
1. $\chi^2$ test to check independency
variable1, variable2 = 'daily_stress', 'sufficient_income'
df_concat_stress_income, chi2_stress_income, chi2_matrix_stress_income = independence_test(variable1, variable2)
chi2_matrix_stress_income
norm=colors.TwoSlopeNorm(vmin=0, vcenter=6.63/2, vmax=11)
sns.heatmap(chi2_matrix_stress_income, norm=norm, cmap='RdBu_r');
There are particularly large divergences between expected and observed counts for a stress level of 5, and also for 1. As in the previous case of stress vs gender, the $\chi^2$ critical value for 1 degree of freedom and a 0.01 significance level is $\chi^2_{1,0.01} = 6.63$, so in the heatmap, the reds are the cells that push the row sum above 6.63 and thus contribute to that particular stress level being significantly dependent on income.
df_diff = pd.DataFrame()
for col in df_concat_stress_income.columns.levels[0]:
df_diff[col] = (df_concat_stress_income[(col,'Observed')] - df_concat_stress_income[(col,'Expected')])
norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
sns.heatmap(df_diff, norm=norm, cmap='RdBu_r', annot=True);
Once again, we appreciate a negative correlation between these two variables. Let's confirm this correlation is significant:
2. t-test to check correlation
t_statistic = correlation_test(variable1, variable2)
p_value = t.cdf(t_statistic, n-2)*2
print("\np-value: ", p_value)
Again, strictly speaking the residuals should be normally distributed. However, the dependency is ensured by the $\chi^2$ test, and taking a look at the data, it is noticeable that this dependency follows a definite direction. Although this t-test is not definitive, everything points to the alternative hypothesis: stress and sufficient income are negatively correlated; the lower the income sufficiency, the more stress. Other studies also support this hypothesis, talking about the so-called "financial stress".
1. $\chi^2$ test to check independency
variable1, variable2 = 'daily_shouting', 'daily_stress'
output = independence_test(variable1, variable2)
chi2_matrix_stress_shouting = output[2]
df_concat_stress_shouting = output[0]
norm=colors.TwoSlopeNorm(vmin=0, vcenter=15/5, vmax=15)
sns.heatmap(chi2_matrix_stress_shouting, norm=norm, cmap='RdBu_r');
Once again, the blues are the cells not pushing the row sum above 15, which is approximately the $\chi^2$ critical value for 5 degrees of freedom and a 0.01 significance level (99% confidence level), so we can interpret them as the ones that are not so unexpected. The reds correspond to the pairs of categories with more observed cases than would be "normal" following the proportions' distribution. Let's see the differences between the observed and expected values.
# Differences observed-expected
df_diff = pd.DataFrame()
for col in df_concat_stress_shouting.columns.levels[0]:
df_diff[col] = (df_concat_stress_shouting[(col,'Observed')] - df_concat_stress_shouting[(col,'Expected')])
norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
print("\nHeatmap differences data observed - data expected")
sns.heatmap(df_diff, norm=norm, cmap="RdBu_r", annot=True);
Here the correlation is easy to recognize: there are more cases of low stress for low shouting than expected from the proportions' distribution (positive observed-expected differences), and more cases of high stress for high shouting scores than expected. On the other hand, there are negative differences (blues) for the crossed categories (low stress with high shouting, high stress with low shouting), indicating fewer highly stressed people among those who don't shout much, and also fewer not-so-stressed people among those who shout a lot. In general, the more shouting, the more stress.
2. t-test to check correlation
t_statistic = correlation_test(variable1, variable2)
p_value = (1 - t.cdf(t_statistic, n-2))*2
print("\n p-value: ", p_value)
Again, strictly speaking the residuals should be normally distributed, but everything points to the alternative hypothesis: stress and daily shouting are positively correlated; the more shouting, the more stress.
Below is a series of observations and notes to myself, from ideas and questions that arose and were solved during this study. Take a look if you find it useful for a better understanding of the approaches carried out in this project.
sm.OLS VS LinearRegression:
import statsmodels.api as sm
lr = sm.OLS(y_train, X_train)
result = lr.fit()
result.summary2()
Here we see coefficients different from the ones given by LinearRegression. Moreover, there are features with very little correlation with daily_stress and yet a high t, meaning a low probability of the corresponding coefficient being 0 (for example, social_network has t=10.1 while its correlation with daily_stress is 0.012720, so they are not correlated). What's happening? This is because sm.OLS does not include the intercept by default, so in the least-squares fit, the function it minimizes is:
$f = \sum (y_{i}^{pred}-y_{i})^2 = \sum (\beta_{1} x_{i} - y_{i})^2$, where $\beta_{1}, x_{i}$ can be vectors. The thing is that the intercept $\beta_{0}$ is not included.
This is $\sum (\beta_{0} + \beta_{1} x_{i} - y_{i})^2$ with $\beta_{0} = 0$. Thus, the resulting regression line passes through the origin (0,0) and the slope is different from the one obtained with an intercept. Adding X = sm.add_constant(X) before lr = sm.OLS(y, X), the variable X will include a column of ones, and the corresponding $\beta$ coefficient will be $\beta_{0}$, as it is multiplied by ones.
X_train_ones = sm.add_constant(X_train)
ols = sm.OLS(y_train, X_train_ones)
ols = ols.fit()
Doing this, we will obtain the intercept, and the result is the same as fitting LinearRegression with our original X (without the column of ones, since LinearRegression already accounts for the intercept through its default argument fit_intercept=True).
x = np.linspace(0,100,1000).reshape(-1,1)
y = np.random.rand(1000)*1000000
coeficientes = np.array([])
lin = LinearRegression()
for i in range(1000):
y = np.random.rand(1000)*100
lin.fit(x,y)
coeficientes = np.hstack([coeficientes, lin.coef_])
plt.hist(coeficientes);
from scipy.stats import normaltest
normaltest(coeficientes)
For random samples, the mean is the best model, the one that minimizes the squared errors. Here we can see that linear regression on random samples gives a coefficient of 0 with the highest probability, with the coefficients following a normal distribution: $coef \sim N(0,\sigma)$
We performed the analysis by stress level, i.e., given a particular stress level, how many women and men answered. Let's now perform some hypothesis tests by gender. Given the male answers, are men less stressed than the population on average? Are women more stressed than the population on average?
H0: Men are as stressed as the average of the population
df_concat_stress_gender
real_df = df_concat_stress_gender[[(0,'Observed'),(1,'Observed')]]
expected_df = df_concat_stress_gender[[(0,'Expected'),(1,'Expected')]]
real_df.columns = real_df.columns.droplevel(1)
expected_df.columns = expected_df.columns.droplevel(1)
real_df
chi2 = sum((real_df[1] - expected_df[1])**2/expected_df[1])
chi2
Rejected.
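As a sketch of the comparison behind this rejection (not in the original), the p-value of the statistic above under a $\chi^2$ distribution with 5 degrees of freedom (6 stress levels minus 1) is:
from scipy.stats import chi2 as chi2_dist

print("p-value:", 1 - chi2_dist.cdf(chi2, df=5))  # chi2 is the statistic computed above for the male column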
H0: Women are as stressed as the average of the population
sum((real_df[0] - expected_df[0])**2/expected_df[0])
Rejected.
We saw that the categories of daily stress and gender are not independent. Now we would have to show that, besides this, there is a correlation, with women being more stressed than men. The fact that they are dependent doesn't say anything about correlation.
About the correlation, we obtained a value of data.corr()['daily_stress']['gender'] = -0.12867. Now we have to check whether this is significantly different from zero to conclude that there is not only a dependence but also a (negative) tendency. To obtain the significance, the test we have to perform is a t-test, where we have $H_{0}: \beta_{1}=0$ (where $\beta_{1}$ is the slope), and we estimate the standard deviation of the population slopes by the standard deviation of the slope obtained from our sample.
Note that the slope estimator's variance comes from the variance of the residuals, $\sigma$, assuming the residuals are normally distributed. The meaning of the slope estimator's variance is not so much the uncertainty about the value of the slope (the value that minimizes the cost function), but rather the small differences in the slope we would have to use to predict each data point. So we have a single, general $\beta_1$, the one used for all predictions, and the variance says that you might need a slightly (or very) different one to obtain $y_i$ as $y_i = \beta_1 \cdot x_i$.
The variance of $\beta_1$ should be interpreted as taking the residuals $\mathbf{\epsilon \sim N(0,\sigma)}$ into $\beta_1$ and leaving only $\mathbf{y = \beta_1 \cdot x}$, with $\beta_1$ having a variance, instead of $y = \beta_1 \cdot x + \epsilon$.
$\small \sigma^2 = MSE$, so:
$Std(\beta_1) = \large \frac{\sqrt{MSE}}{\sqrt{S_{xx}}} \hspace{.5cm}\normalsize [0]$
The calculated value of the statistic t is:
$t =\Large \frac{\beta_1 - 0}{Std(\beta_1)} \hspace{.5cm}\normalsize [1]$, where
$\beta_1 =\frac{\sum(x_i - \overline{x})(y_i - \overline{y})}{\sum(x_i - \overline{x})^2} \hspace{.5cm}\normalsize [2]$
is the expression for $\beta_1$ obtained by minimum squares and $Std(\beta_1)$ is given by [0]
We can also obtain an expression of t from the Pearson correlation coefficient, r. We need the relation between $\beta_1$ and r:
$\beta_1 = r\cdot\sqrt{\frac{S_{yy}}{S_{xx}}} \hspace{.5cm}\normalsize [3]$
which is deduced from [2] and the expression for the Pearson correlation coefficient, r: $r=\frac{S_{xy}}{\sqrt{S_{xx}\cdot S_{yy}}}$. Taking [3] into account and proceeding similarly to the derivation above of the standard error of the slope in a linear regression, we obtain $std(r) = std(\beta_1)\cdot\frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}} = ... = \sqrt{\frac{1-r^2}{N-2}} \hspace{.5cm}\normalsize [4]$ (where we use $r^2 = 1 - \frac{SSE}{SST}$), and finally:
$\large t = \frac{r - 0}{std(r)} = \frac{r}{\sqrt{\frac{1-r^2}{N-2}}} = ... = \frac{r·\sqrt{N-2}}{\sqrt{1-r^2}} \hspace{.5cm}\normalsize [5]$
sm.OLS(y, X).fit().summary2()
Let's check that the values of t obtained by hand from [1] or from [5], as well as the value of $Std(\beta_1)$ from [0], match those shown automatically by the summary2() method of the sm.OLS object, the OLS linear regressor of statsmodels.api:
# t with Eq.[5]
r = data.corr(method='pearson')['daily_stress']['gender']
N = data.shape[0]
std_r = np.sqrt((1-r**2)/(N-2)) #Equation [4]
t = (r-0)/std_r #Equation [5]
print(t)
import statsmodels.api as sm
y = data['daily_stress']
X = sm.add_constant(data['gender'])
lr = sm.OLS(y, X)
result = lr.fit()
rmse = np.sqrt(mean_squared_error(y, result.predict(X)))
Sxx = ((X['gender'] - X['gender'].mean())**2).sum()
std_b1 = rmse/np.sqrt(Sxx) #Equation [0]
print(std_b1)
# t with Eq.[1] (uses the slope and std_b1 from the fit above)
t = result.params['gender'] / std_b1 #Equation [1]
print(t)
$t = -14.653189$ and $std\_b1 = 0.02465$ match the t and std.err given by result.summary2():
result.summary2()
The expression for the slope estimator's variance comes from assuming normally distributed residuals. However, this is not our case (I tried transforming the data, but I couldn't obtain normally distributed residuals in any way), so we should be prudent with the conclusions reached here. However, there are other studies affirming that women suffer more stress than men, as well as studies about the so-called "financial stress" that relate stress to low income. Besides, in our study we can easily see the lower number of observed cases of women with low stress levels, compared to the expected cases, and the higher number of women with high stress levels; and since the dependency of the categories is guaranteed by the $\chi^2$ test of independence (which is not affected by the distribution of the residuals, as it has nothing to do with descriptive trend models and only deals with unordered categories), a real correlation is very likely.
The Pearson correlation coefficient (r) is an indirect measure of the t-statistic, which determines whether the linear regression coefficient ($\beta_1$) is 0 (t~0) or not (|t|>>0). The relation between the two is $t = \frac{r\cdot\sqrt{N-2}}{\sqrt{1-r^2}}$, and as we can see, $r \rightarrow 0 \Rightarrow t \rightarrow 0$ and $|r| \rightarrow 1 \Rightarrow |t| >> 0$. The magnitude of r tells how well the data fit a straight line; the higher |r|, the less $\beta_1$ will vary from its estimated value ($\Delta \beta_1 \sim 0$) and thus the less likely it is to be 0 (strictly, the less likely it is that, if it were 0 in reality, we would have obtained such well-aligned, low-variance data).
Coefficients VS Correlations (Pearson)
# X was overwritten above with the gender-only design matrix; rebuild the full feature set used in the baseline OLS fit
X_full = data.drop('daily_stress', axis=1)
df_coef = pd.DataFrame(list(zip(X_full.columns, OLSresults.params[1:])), columns=['Feature', 'Coefficient LR'])
df_coef['Pearson Correlation r'] = data.corr()['daily_stress'].drop('daily_stress').values
df_coef.sort_values(by='Coefficient LR')
The linear regression coefficients do not follow the same ascending/descending order as the Pearson correlation coefficients (r). As we mentioned above, the Pearson correlation coefficients tell how well a straight line fits the data, but not how much each variable influences stress. What tells us which features influence stress the most is the magnitude of the linear regression coefficients.