Wellbeing and lifestyle data

1. Description and objectives

This dataset from Kaggle.com (https://www.kaggle.com/ydalat/lifestyle-and-wellbeing-data) contains 12,757 survey responses with 23 attributes describing how we live our lives. How can we reinvent our lifestyles to optimize our individual wellbeing while supporting the UN Sustainable Development Goals?

The Work-Life Balance survey evaluates how we thrive in both our professional and personal lives: it reflects how well we shape our lifestyle, habits and behaviors to maximize our overall life satisfaction along the following five dimensions:

  1. Healthy body, reflecting your fitness and healthy habits;
  2. Healthy mind, indicating how well you embrace positive emotions;
  3. Expertise, measuring the ability to grow your expertise and achieve something unique;
  4. Connection, assessing the strength of your social network and your inclination to discover the world;
  5. Meaning, evaluating your compassion, generosity and how much 'you are living the life of your dream'.

Description

This is a description of the attributes in the dataset:

In [1]:
%run wellbeing_description.py
Variable Description Values Type of attribute
0 Timestamp Date of the survey 7/7/15 - 24/2/2020 Ordinal, object
1 Fruits_Veggies How many fruits or veggies do you eat everyday? 0 - 5 Numerical, int 64
2 Daily_stress How much stress do you typically experience every day? On average, over 12 months 0 - 5 Numerical, object
3 Places_visited How many places do you visit? Over a period of 12 months. Include new states, new cities, etc. 0 - 10 Numerical, int 64
4 Core_circle How many people are very close to you? i.e. close family and friends 0 - 10 Numerical, int 64
5 Supporting others How many people do you help achieve a better life? A reflection of your altruism. Over a period of 12 months. 0 - 10 Numerical, int 64
6 Social_network How many people do you interact with during a typical day? At home, at work... Average of workdays and weekends 0 - 10 Numerical, int 64
7 Achievement How many remarkable achievements are you proud of? Over the last 12 months, such as running a marathon, success at work, opening a new business, etc. 0 - 10 Numerical, int 64
8 Donation How many times do you donate your time or money to good causes? Over a period of 12 months 0 - 5 Numerical, int 64
9 BMI_Range What is your body mass index range? Your body mass in kg divided by the square of your height in meters. 1 - 2 Numerical, int 64
10 Todo_completed How well do you complete your weekly to-do list? Include work- and personal-related tasks. On a scale of 0 = not at all to 10 = very well 0 - 10 Numerical, int 64
11 Flow How many hours do you experience 'flow'? 'Flow' is defined as the mental state, in which you are fully immersed in performing an activity 0 - 10 Numerical, int 64
12 Daily_steps How many steps (in thousands) do you typically walk everyday? Thousand steps, daily average over multiple days including work days and week-end 1 - 10 Numerical, int 64
13 Live_vision For how many years ahead is your life vision very clear for? For instance, illustrated in a vision board or discussed with a friend 0 - 10 Numerical, int 64
14 Sleep_hours About how long do you typically sleep? Over the course of a typical working week, including week-end. 1 - 10 Numerical, int 64
15 Lost_vacation How many days of vacation do you typically lose every year? Unused vacation days, lost or carried forward into the following year, etc. 0 - 10 Numerical, int 64
16 Daily_shouting How often do you shout or sulk at somebody? In a typical week. Expressing your negative emotions in an active or passive manner. 0 - 10 Numerical, int 64
17 Sufficient_income How sufficient is your income to cover basic life expenses? Such as the costs of housing, food, health care, car and education. 1 - 2 Numerical, int 64
18 Personal_awards How many recognitions have you received in your life? E.g. diploma, degree, certificate, accreditation, award, prize, etc. 0 - 10 Numerical, int 64
19 Time_for_passion How many hours do you spend everyday doing what you are passionate about? Daily hours spent doing what you are passionate and dreaming about 0 - 10 Numerical, int 64
20 Daily_meditation How many times during the day do you think about yourself in a typical week? Include meditation, praying and relaxation activities such as fitness, walking in a park or lunch breaks. 0 - 10 Numerical, int 64
21 Age Age groups 21 to 35 - Less than 20 Ratio Scaled, object
22 Gender Male or female Female - Male Nominal, object

Objectives

We are looking for findings that might help us succeed and improve our quality of life. In particular, we'll try to answer these questions:

  1. Achievements: What indicators have the greatest influence on our achievements and how well can we predict them?
  2. Stress levels: What indicators have the greatest influence on our stress levels and how well can we predict them? For this variable we will fit some linear models with interpretation as the main goal: we are interested in the largest coefficients in the predictive model, as well as in their statistical significance, so that we can identify the features with the most influence on our daily stress and take action to hopefully reduce it.

From the conclusions we reach, we can modify some habits to maximize our achievements and reduce our stress levels.

2. Exploratory Data Analysis

This is the general guideline followed for the EDA in this project:

2.1 Data Exploration: Exploration of the values and data types.

2.2 Data cleaning: Clean the data, handling possible wrong values, NaN and outliers, and changing the data types to numeric (integers) when possible to perform numerical data analysis.

2.3 Feature engineering: Create encodings for the categorical variables, such as an ordinal encoding for age and a binary encoding / dummy variables for gender

2.4 Data statistics: See statistical measures with the describe method: mean, standard deviation, max, min and quartiles. Plot histograms and boxplots for all the variables

2.5 Correlations between variables: Look for correlations between variables

2.6 Summary of the EDA: Key Findings and insights

2.1 Data Exploration

Initial plan:

We are going to explore the raw dataset. I'll check the values and data types, looking for possible wrong values, NaNs or missing values.

From the description, we saw that the variable daily_stress is numerical, but it is of type object. There might be some values which are not numbers, which is why the data type of the variable is "object". We'll see it in the exploration of unique values with the unique method of pd.Series. Next, during the data cleaning (section 2.2), I'll handle these wrong values, either dropping the entire rows if there are not too many of them, or imputing them from the variable mean.

The variables age and gender are objects. I'll see the categories in which age is divided, and next, during the feature engineering (section 2.3), I'll perform an ordinal encoding to be able to plot a histogram of age while performing data statistics (section 2.4). With respect to gender, a binary encoding can be done with a view to using it in a future predictive model.

Once I have all the variables cleaned and appropriately treated in the case of categorical features, I am ready to plot histograms and boxplots to search for possible outliers that we'll handle. Finally, I'll look for the highest positive and negative correlations, especially between daily stress and the other variables, and between achievement and the other variables. In this last section, and looking at the description of the variables, we can expect negative correlations with daily stress in general and positive correlations with achievement.

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import OrdinalEncoder

import math
In [3]:
data = pd.read_csv('Wellbeing_and_lifestyle_data.csv')
In [4]:
data.head()
Out[4]:
Timestamp FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE SUPPORTING_OTHERS SOCIAL_NETWORK ACHIEVEMENT DONATION BMI_RANGE ... LIVE_VISION SLEEP_HOURS LOST_VACATION DAILY_SHOUTING SUFFICIENT_INCOME PERSONAL_AWARDS TIME_FOR_PASSION DAILY_MEDITATION AGE GENDER
0 7/7/15 3 2 2 5 0 5 2 0 1 ... 0 7 5 5 1 4 0 5 36 to 50 Female
1 7/7/15 2 3 4 3 8 10 5 2 2 ... 5 8 2 2 2 3 2 6 36 to 50 Female
2 7/7/15 2 3 3 4 4 10 3 2 2 ... 5 8 10 2 2 4 8 3 36 to 50 Female
3 7/7/15 3 3 10 3 10 7 2 5 2 ... 0 5 7 5 1 5 2 0 51 or more Female
4 7/7/15 5 1 3 3 10 4 2 4 2 ... 0 7 0 0 2 8 1 5 51 or more Female

5 rows × 23 columns

In [5]:
# Is there any NaN?
data.isnull().values.any()
Out[5]:
False
In [6]:
#Number of rows and columns
print("Number of data points: {}".format(data.shape[0]), "\nNumber of variables: {}".format(data.shape[1]))

#Column names, data types and counts
data_types = data.dtypes
data_counts = data.count()
df = pd.concat([data_types, data_counts], axis=1).reset_index()
df = df.rename(columns={'index': 'Column names' , 0: 'Types', 1: 'Counts'})
print("\nData types:\n{}".format(df))
Number of data points: 12756 
Number of variables: 23

Data types:
         Column names   Types  Counts
0           Timestamp  object   12756
1      FRUITS_VEGGIES   int64   12756
2        DAILY_STRESS  object   12756
3      PLACES_VISITED   int64   12756
4         CORE_CIRCLE   int64   12756
5   SUPPORTING_OTHERS   int64   12756
6      SOCIAL_NETWORK   int64   12756
7         ACHIEVEMENT   int64   12756
8            DONATION   int64   12756
9           BMI_RANGE   int64   12756
10     TODO_COMPLETED   int64   12756
11               FLOW   int64   12756
12        DAILY_STEPS   int64   12756
13        LIVE_VISION   int64   12756
14        SLEEP_HOURS   int64   12756
15      LOST_VACATION   int64   12756
16     DAILY_SHOUTING   int64   12756
17  SUFFICIENT_INCOME   int64   12756
18    PERSONAL_AWARDS   int64   12756
19   TIME_FOR_PASSION   int64   12756
20   DAILY_MEDITATION   int64   12756
21                AGE  object   12756
22             GENDER  object   12756

It is a dataset with 12756 data points (rows) and 23 variables (columns). There are no missing values or NaNs.

In [7]:
#Let's change the column names to lowercase
data.columns = pd.Index(data.columns.str.lower())
data.columns
Out[7]:
Index(['timestamp', 'fruits_veggies', 'daily_stress', 'places_visited',
       'core_circle', 'supporting_others', 'social_network', 'achievement',
       'donation', 'bmi_range', 'todo_completed', 'flow', 'daily_steps',
       'live_vision', 'sleep_hours', 'lost_vacation', 'daily_shouting',
       'sufficient_income', 'personal_awards', 'time_for_passion',
       'daily_meditation', 'age', 'gender'],
      dtype='object')

Unique values

In [8]:
for col in data.columns:
    print(col, data[col].unique())
timestamp ['7/7/15' '7/8/15' '7/9/15' ... '2/24/2020 10:35:02' '2/24/2020 12:54:10'
 '2/24/2020 12:55:21']
fruits_veggies [3 2 5 4 1 0]
daily_stress ['2' '3' '1' '4' '5' '0' '1/1/00']
places_visited [ 2  4  3 10  5  6  7  0  8  1  9]
core_circle [ 5  3  4  9  6  7  8 10  2  1  0]
supporting_others [ 0  8  4 10  5  3  1  2  6  7  9]
social_network [ 5 10  7  4  3  1  2  8  6  9  0]
achievement [ 2  5  3  4  0  1  6 10  8  7  9]
donation [0 2 5 4 3 1]
bmi_range [1 2]
todo_completed [ 6  5  2  3  8 10  7  4  1  0  9]
flow [ 4  2  5  0  1  8  7  6  3 10  9]
daily_steps [ 5  4  7  8  1  3  6  2 10  9]
live_vision [ 0  5 10  4  2  1  6  3  8  9  7]
sleep_hours [ 7  8  5  6 10  9  4  3  2  1]
lost_vacation [ 5  2 10  7  0  3  1  4  8  6  9]
daily_shouting [ 5  2  0  3  1  7  6  4 10  8  9]
sufficient_income [1 2]
personal_awards [ 4  3  5  8 10  1  2  7  6  0  9]
time_for_passion [ 0  2  8  1  3  6  5  4 10  9  7]
daily_meditation [ 5  6  3  0 10  2  1  7  4  8  9]
age ['36 to 50' '51 or more' '21 to 35' 'Less than 20']
gender ['Female' 'Male']
In [9]:
(data['daily_stress']=='1/1/00').sum()
Out[9]:
1

Daily stress: There is a wrong value for daily stress (1/1/00), but as it is only one, I'll just drop the entire row (next section)
Age: There are 4 categories of age. In the section "feature engineering" we'll perform the ordinal encoding.

2.2 Data cleaning

As there are no nulls, missing or strange values except the one in daily stress, the data cleaning will consist only of eliminating the row corresponding to the value 1/1/00 of daily_stress and changing the column to type int64.

NOTE: The visualizations of the data (histograms and boxplots) will be done in the data statistics section (2.4), once we have performed an initial cleaning and encoding; a later cleaning will be done for outliers, if required.

Cleaning variables: daily stress

There is only one strange value of daily_stress, so we can just drop the entire row from the dataset. Then I am going to change the data type from object to integer:

In [10]:
data = data.drop(data[data['daily_stress']=='1/1/00'].index)
data.shape
Out[10]:
(12755, 23)
In [11]:
# Changing data type from object to int64
data['daily_stress'] = data['daily_stress'].astype('int64')

2.3 Feature engineering

In this section we have to handle 2 categorical variables:

  • Age: We are going to perform an ordinal encoding, changing the categories to ordered integers

  • Gender: We are going to perform a binary encoding, mapping the two categories to 0 and 1

Processing categorical variables: age

We are going to perform an ordinal encoding, changing the categories to ordered integers:

In [12]:
age_categories = ['Less than 20','21 to 35','36 to 50','51 or more'] #data['age'].unique(), but in the order we want

enc = OrdinalEncoder(categories=[age_categories])
enc.fit(np.array(data['age']).reshape(-1,1)) #Alternatively: method fit_transform = method fit + method transform
                                #An array of shape (n_samples, n_features) should be passed to fit --> reshape(-1,1)
enc.categories
Out[12]:
[['Less than 20', '21 to 35', '36 to 50', '51 or more']]
In [13]:
age_encoded = enc.transform(np.array(data['age']).reshape(-1,1))
age_encoded
Out[13]:
array([[2.],
       [2.],
       [2.],
       ...,
       [2.],
       [1.],
       [3.]])
In [14]:
np.unique(age_encoded)
Out[14]:
array([0., 1., 2., 3.])
In [15]:
age_original = enc.inverse_transform(age_encoded)
age_original
Out[15]:
array([['36 to 50'],
       ['36 to 50'],
       ['36 to 50'],
       ...,
       ['36 to 50'],
       ['21 to 35'],
       ['51 or more']], dtype=object)
In [16]:
# Verification
enc.inverse_transform([[code] for code in [0,1,2,3]])
Out[16]:
array([['Less than 20'],
       ['21 to 35'],
       ['36 to 50'],
       ['51 or more']], dtype=object)
In [17]:
data['age'] = age_encoded

Processing categorical variables: gender

We are going to perform a binary encoding, mapping Female to 0 and Male to 1:

In [18]:
data['gender'].unique()
Out[18]:
array(['Female', 'Male'], dtype=object)
In [19]:
pd.get_dummies(data['gender'], drop_first=True)
Out[19]:
Male
0 0
1 0
2 0
3 0
4 0
... ...
12751 0
12752 0
12753 0
12754 0
12755 0

12755 rows × 1 columns

In [20]:
gender_translator = {'Female': 0, 'Male': 1}
print(gender_translator)
{'Female': 0, 'Male': 1}
In [21]:
data['gender'] = pd.get_dummies(data['gender'], drop_first=True)
d = {'Male': 'Gender (Female:0, Male: 1)'}
data.rename(columns=d)
Out[21]:
timestamp fruits_veggies daily_stress places_visited core_circle supporting_others social_network achievement donation bmi_range ... live_vision sleep_hours lost_vacation daily_shouting sufficient_income personal_awards time_for_passion daily_meditation age gender
0 7/7/15 3 2 2 5 0 5 2 0 1 ... 0 7 5 5 1 4 0 5 2.0 0
1 7/7/15 2 3 4 3 8 10 5 2 2 ... 5 8 2 2 2 3 2 6 2.0 0
2 7/7/15 2 3 3 4 4 10 3 2 2 ... 5 8 10 2 2 4 8 3 2.0 0
3 7/7/15 3 3 10 3 10 7 2 5 2 ... 0 5 7 5 1 5 2 0 3.0 0
4 7/7/15 5 1 3 3 10 4 2 4 2 ... 0 7 0 0 2 8 1 5 3.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12751 2/23/2020 22:03:56 3 4 10 8 10 8 6 5 1 ... 3 7 0 1 1 10 6 7 1.0 0
12752 2/24/2020 2:44:30 3 3 6 5 2 5 1 0 2 ... 0 7 0 0 2 3 0 2 2.0 0
12753 2/24/2020 10:35:02 4 4 7 5 3 3 4 2 1 ... 5 6 0 2 2 6 3 5 2.0 0
12754 2/24/2020 12:54:10 3 3 10 4 8 10 3 3 1 ... 1 6 0 1 1 10 1 10 1.0 0
12755 2/24/2020 12:55:21 2 2 6 1 6 1 5 2 1 ... 0 7 0 1 2 3 2 10 3.0 0

12755 rows × 23 columns

2.4 Data statistics and visualizations

Description

In [22]:
description = data.describe()
description.loc["range"] = description.loc['max'] - description.loc['min']
out_fields = ["mean", "25%", "50%", "75%", "range"]
description = description.loc[out_fields]
description.transpose()
Out[22]:
mean 25% 50% 75% range
fruits_veggies 2.930223 2.0 3.0 4.0 5.0
daily_stress 2.782673 2.0 3.0 4.0 5.0
places_visited 5.339240 3.0 5.0 8.0 10.0
core_circle 5.484908 3.0 5.0 8.0 10.0
supporting_others 5.577421 3.0 5.0 10.0 10.0
social_network 6.551156 4.0 6.0 10.0 10.0
achievement 3.963230 2.0 3.0 6.0 10.0
donation 2.700745 1.0 3.0 5.0 5.0
bmi_range 1.400706 1.0 1.0 2.0 1.0
todo_completed 5.706233 4.0 6.0 8.0 10.0
flow 3.126382 1.0 3.0 4.0 10.0
daily_steps 5.704822 3.0 5.0 8.0 9.0
live_vision 3.712034 1.0 3.0 5.0 10.0
sleep_hours 7.035202 6.0 7.0 8.0 9.0
lost_vacation 2.832536 0.0 0.0 5.0 10.0
daily_shouting 2.920659 1.0 2.0 4.0 10.0
sufficient_income 1.728185 1.0 2.0 2.0 1.0
personal_awards 5.702626 3.0 5.0 9.0 10.0
time_for_passion 3.266484 1.0 2.0 5.0 10.0
daily_meditation 6.253156 4.0 7.0 10.0 10.0
age 1.581968 1.0 1.0 2.0 3.0
gender 0.395218 0.0 0.0 1.0 1.0

Boxplots

In [23]:
# Boxplots

fig, axes = plt.subplots(figsize=(16,5))

data.boxplot()

for tick in axes.get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right') #Horizontal alignment

plt.show()
In [24]:
n_outliers = data['daily_shouting'][data['daily_shouting']>8].shape[0]
p = n_outliers/data.shape[0]
print("{} values above 8 (beyond Q3 + 1.5·IQR) for daily shouting \n{}% of data".format(n_outliers, round(p*100,2)))
753 values above 8 (beyond Q3 + 1.5·IQR) for daily shouting 
5.9% of data
In [25]:
n_outliers = data['flow'][data['flow']>8].shape[0]
p = n_outliers/data.shape[0]
print("{} values above 8 hours (beyond Q3 + 1.5·IQR) of daily flow \n{}% of data".format(n_outliers, round(p*100,2)))
374 values above 8 hours (beyond Q3 + 1.5·IQR) of daily flow 
2.93% of data
In [26]:
n_outliers = data['sleep_hours'][data['sleep_hours']<3].shape[0]
p = n_outliers/data.shape[0]
print("{} values below 3 hours (beyond Q1 - 1.5·IQR) of daily sleep \n{}% of data".format(n_outliers, round(p*100,2)))
32 values below 3 hours (beyond Q1 - 1.5·IQR) of daily sleep 
0.25% of data

We decide to keep the outliers, as they might be meaningful and represent the reality of the population. Besides, they are not so extreme as to make us think of them as errors.
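The cutoffs used above (8 and 3) were read off the describe output and boxplots by hand. A small sketch, assuming the cleaned `data` DataFrame from section 2.2, that computes the Tukey fences Q1 - 1.5·IQR and Q3 + 1.5·IQR for every numeric column instead of hard-coding them:

# Sketch: share of points outside the Tukey fences, per numeric column
numeric = data.select_dtypes(include='number')
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr

# Percentage of rows falling outside the fences for each variable
outlier_share = ((numeric.lt(lower) | numeric.gt(upper)).mean() * 100).round(2)
print(outlier_share.sort_values(ascending=False))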

Histograms

In [27]:
# Histograms numerical variables

axs = data.hist(figsize=(16, 16))

for ax in axs.flatten():
    if ax.is_first_col():
        ax.set_ylabel('Frequency')
        
plt.suptitle('Histograms', y=0.95, fontsize=20)
plt.show()

2.5 Correlations between variables

Correlation coefficients tell us how well two variables are related by a straight line. However, they don't give any information about how much one influences the other. For that purpose we would have to fit a linear regression model, which we'll do later. For now, let's look at the (Pearson) correlations.
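For reference, the Pearson coefficient that pandas computes by default for two variables $x$ and $y$ is:

$r_{xy} = \large \frac{\sum_{i=1}^n (x_{i}-\overline x)(y_{i}-\overline y)}{\sqrt{\sum_{i=1}^n (x_{i}-\overline x)^2} \sqrt{\sum_{i=1}^n (y_{i}-\overline y)^2}}$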

In [28]:
data.corr()
Out[28]:
fruits_veggies daily_stress places_visited core_circle supporting_others social_network achievement donation bmi_range todo_completed ... live_vision sleep_hours lost_vacation daily_shouting sufficient_income personal_awards time_for_passion daily_meditation age gender
fruits_veggies 1.000000 -0.095132 0.248520 0.153606 0.207907 0.105804 0.166643 0.200787 -0.091937 0.230350 ... 0.113970 0.105838 -0.069845 -0.071099 0.151236 0.165587 0.171531 0.195067 0.183554 -0.118868
daily_stress -0.095132 1.000000 -0.131707 -0.115973 -0.035373 0.012720 -0.120786 -0.038291 0.084938 -0.166975 ... -0.135016 -0.152862 0.192353 0.309264 -0.144872 -0.044362 -0.161858 -0.213672 -0.020649 -0.128677
places_visited 0.248520 -0.131707 1.000000 0.259982 0.255008 0.157615 0.264964 0.212655 -0.103696 0.234049 ... 0.149852 0.129648 -0.124751 -0.089066 0.173262 0.279604 0.198304 0.203315 0.003419 -0.054253
core_circle 0.153606 -0.115973 0.259982 1.000000 0.338630 0.304941 0.292359 0.220173 -0.025774 0.222466 ... 0.216494 0.060685 -0.083369 -0.076232 0.122687 0.250223 0.227560 0.103687 0.002869 -0.096980
supporting_others 0.207907 -0.035373 0.255008 0.338630 1.000000 0.311285 0.359009 0.394205 0.027601 0.252200 ... 0.233593 0.013629 -0.026155 -0.064187 0.115812 0.333632 0.322820 0.149154 0.190202 -0.131287
social_network 0.105804 0.012720 0.157615 0.304941 0.311285 1.000000 0.252225 0.147615 0.017123 0.212253 ... 0.184818 -0.029174 0.011538 -0.010305 0.132226 0.214703 0.202677 -0.006766 -0.058707 -0.050410
achievement 0.166643 -0.120786 0.264964 0.292359 0.359009 0.252225 1.000000 0.236727 -0.034303 0.309286 ... 0.316496 0.054920 -0.010751 -0.066264 0.117023 0.400330 0.373957 0.172801 0.002239 -0.006039
donation 0.200787 -0.038291 0.212655 0.220173 0.394205 0.147615 0.236727 1.000000 0.059614 0.189849 ... 0.158218 0.005471 -0.025449 -0.054656 0.123999 0.274397 0.188767 0.150520 0.233210 -0.129959
bmi_range -0.091937 0.084938 -0.103696 -0.025774 0.027601 0.017123 -0.034303 0.059614 1.000000 -0.067473 ... -0.009084 -0.095543 0.032417 0.054456 -0.014296 0.016309 -0.023930 -0.070900 0.200395 -0.009803
todo_completed 0.230350 -0.166975 0.234049 0.222466 0.252200 0.212253 0.309286 0.189849 -0.067473 1.000000 ... 0.271881 0.117985 -0.080262 -0.145587 0.203476 0.249857 0.278042 0.181956 0.096548 -0.089833
flow 0.136223 -0.142187 0.154917 0.243300 0.266881 0.240043 0.389727 0.162248 0.011586 0.302141 ... 0.300045 0.036995 0.000539 -0.080099 0.086671 0.217978 0.482637 0.145943 -0.002399 -0.016699
daily_steps 0.243607 -0.057925 0.196342 0.151113 0.145220 0.213596 0.193328 0.107843 -0.128937 0.204796 ... 0.119280 0.010364 -0.045349 -0.031981 0.101667 0.150016 0.146962 0.141684 -0.020511 0.032735
live_vision 0.113970 -0.135016 0.149852 0.216494 0.233593 0.184818 0.316496 0.158218 -0.009084 0.271881 ... 1.000000 0.047949 -0.028527 -0.081179 0.130777 0.206593 0.311015 0.152180 0.041089 0.022373
sleep_hours 0.105838 -0.152862 0.129648 0.060685 0.013629 -0.029174 0.054920 0.005471 -0.095543 0.117985 ... 0.047949 1.000000 -0.093058 -0.078230 0.047208 0.018537 0.067811 0.164169 -0.056300 -0.048616
lost_vacation -0.069845 0.192353 -0.124751 -0.083369 -0.026155 0.011538 -0.010751 -0.025449 0.032417 -0.080262 ... -0.028527 -0.093058 1.000000 0.085184 -0.083065 -0.038252 -0.026963 -0.113086 -0.052863 0.025472
daily_shouting -0.071099 0.309264 -0.089066 -0.076232 -0.064187 -0.010305 -0.066264 -0.054656 0.054456 -0.145587 ... -0.081179 -0.078230 0.085184 1.000000 -0.076268 -0.041729 -0.104970 -0.096076 -0.084847 -0.064035
sufficient_income 0.151236 -0.144872 0.173262 0.122687 0.115812 0.132226 0.117023 0.123999 -0.014296 0.203476 ... 0.130777 0.047208 -0.083065 -0.076268 1.000000 0.153977 0.079585 0.071580 0.161993 0.006567
personal_awards 0.165587 -0.044362 0.279604 0.250223 0.333632 0.214703 0.400330 0.274397 0.016309 0.249857 ... 0.206593 0.018537 -0.038252 -0.041729 0.153977 1.000000 0.239808 0.154547 0.130737 -0.020684
time_for_passion 0.171531 -0.161858 0.198304 0.227560 0.322820 0.202677 0.373957 0.188767 -0.023930 0.278042 ... 0.311015 0.067811 -0.026963 -0.104970 0.079585 0.239808 1.000000 0.196000 0.000118 0.015833
daily_meditation 0.195067 -0.213672 0.203315 0.103687 0.149154 -0.006766 0.172801 0.150520 -0.070900 0.181956 ... 0.152180 0.164169 -0.113086 -0.096076 0.071580 0.154547 0.196000 1.000000 0.099682 0.094050
age 0.183554 -0.020649 0.003419 0.002869 0.190202 -0.058707 0.002239 0.233210 0.200395 0.096548 ... 0.041089 -0.056300 -0.052863 -0.084847 0.161993 0.130737 0.000118 0.099682 1.000000 -0.089236
gender -0.118868 -0.128677 -0.054253 -0.096980 -0.131287 -0.050410 -0.006039 -0.129959 -0.009803 -0.089833 ... 0.022373 -0.048616 0.025472 -0.064035 0.006567 -0.020684 0.015833 0.094050 -0.089236 1.000000

22 rows × 22 columns

In [29]:
# Heatmap correlations

fig, ax = plt.subplots(figsize=(8,6))

sns.heatmap(data.corr(), xticklabels=True, yticklabels=True, ax=ax, center=0, cmap='seismic') #see "diverging colormaps"
Out[29]:
<AxesSubplot:>

Negative correlations:

  • daily stress is the variable with the greatest number of negative correlations with other variables (as the others in general describe desirable characteristics for a good life), especially with daily meditation, as we might expect, but also with to-do completed, time for passion and sleep hours. The strongest positive correlation with daily stress is with daily shouting. There might be a real influence of these variables on the level of stress people experience. For a predictive model, daily shouting and daily meditation could be good predictor variables.

Poor correlations

  • Variables like bmi_range, sleep hours, lost vacation and daily shouting seem to be poorly correlated with the others, or only slightly negatively. For example, pairs like bmi range and daily steps, daily shouting and to-do completed, or lost vacation and places visited are slightly negative, as we might expect.

Positive correlations:

  • The rest of the variables are in general positively correlated with each other, and these positive correlations are the strongest ones. From the numerical analysis, achievement and supporting others are the variables positively correlated with the greatest number of other variables, achievement being the leader. For a predictive model, we found that flow and personal awards could be good predictor variables, as they are not too correlated with each other. We could try only these two, but we could include core circle and social network in case we want more than two. However, that would probably increase model complexity unnecessarily and therefore the risk of overfitting.

    Other unsurprising strong correlations include:

    • time for passion with flow
    • supporting others with donation
    • achievement with flow and personal awards
In [30]:
sorted_corr = np.unique(np.sort(data.corr().values.flatten()))[::-1] #Correlations in descendent order
array2D_corr = np.array(data.corr())

n = 12
nth_highest = sorted_corr[:n]
In [31]:
nth_highest
Out[31]:
array([1.        , 0.48263733, 0.40032959, 0.39420465, 0.38972744,
       0.37395656, 0.35900949, 0.33862994, 0.33363161, 0.32281997,
       0.3164962 , 0.31128491])
In [32]:
for col in data.corr().columns:
    df_pos_corr = pd.DataFrame(data.corr()[col].nlargest(2)[1:]).transpose()
    df_neg_corr = pd.DataFrame(data.corr()[col].nsmallest(1)).transpose()
    df = pd.concat([df_pos_corr, df_neg_corr], axis=1)
    display(df)
places_visited gender
fruits_veggies 0.24852 -0.118868
daily_shouting daily_meditation
daily_stress 0.309264 -0.213672
personal_awards daily_stress
places_visited 0.279604 -0.131707
supporting_others daily_stress
core_circle 0.33863 -0.115973
donation gender
supporting_others 0.394205 -0.131287
supporting_others age
social_network 0.311285 -0.058707
personal_awards daily_stress
achievement 0.40033 -0.120786
supporting_others gender
donation 0.394205 -0.129959
age daily_steps
bmi_range 0.200395 -0.128937
achievement daily_stress
todo_completed 0.309286 -0.166975
time_for_passion daily_stress
flow 0.482637 -0.142187
fruits_veggies bmi_range
daily_steps 0.243607 -0.128937
achievement daily_stress
live_vision 0.316496 -0.135016
daily_meditation daily_stress
sleep_hours 0.164169 -0.152862
daily_stress places_visited
lost_vacation 0.192353 -0.124751
daily_stress todo_completed
daily_shouting 0.309264 -0.145587
todo_completed daily_stress
sufficient_income 0.203476 -0.144872
achievement daily_stress
personal_awards 0.40033 -0.044362
flow daily_stress
time_for_passion 0.482637 -0.161858
places_visited daily_stress
daily_meditation 0.203315 -0.213672
donation gender
age 0.23321 -0.089236
daily_meditation supporting_others
gender 0.09405 -0.131287

Positive correlations:

The strongest correlations are positive. Let's see which pairs of variables have the highest correlations.

In [33]:
# Pairs of variables with the highest correlations

sorted_corr = np.unique(np.sort(data.corr().values.flatten()))[::-1] #Correlations in descendent order
array2D_corr = np.array(data.corr())

n = 20
nth_highest = sorted_corr[:n]

fig, axs = plt.subplots(5,4, figsize=(16,16))
fig.tight_layout(pad=5) #padding among subplots

axs = axs.flatten()
rows = []
columns = []
print("Highest correlations:\n\n")
for i in range(n):
    position = np.where(array2D_corr == nth_highest[i]) 
    row = data.corr().index[position[0][0]]
    column = data.corr().columns[position[1][0]]
    rows.append(row)
    columns.append(column)
    print("{} - {}:".format(row, column))
    print(nth_highest[i], "\n")
    frecuency = pd.crosstab(data[row],data[column]).iloc[::-1]
    ax = fig.sca(axs[i])
    ax.set(title='{} VS {}'.format(column,row))
    sns.heatmap(frecuency, ax=ax, cmap="YlGnBu")
    
    # axs[i].imshow(frecuency, cmap='Blues')
    # axs[i].hist2d(data[row],data[column], cmap='Blues')
Highest correlations:


fruits_veggies - fruits_veggies:
1.0 

flow - time_for_passion:
0.4826373253512499 

achievement - personal_awards:
0.40032959107956373 

supporting_others - donation:
0.39420465220667267 

achievement - flow:
0.38972743712470126 

achievement - time_for_passion:
0.37395655829229923 

supporting_others - achievement:
0.35900949207041805 

core_circle - supporting_others:
0.33862994082240555 

supporting_others - personal_awards:
0.33363160903780187 

supporting_others - time_for_passion:
0.32281997227051246 

achievement - live_vision:
0.31649619770451914 

supporting_others - social_network:
0.3112849098938094 

live_vision - time_for_passion:
0.31101500793287484 

achievement - todo_completed:
0.30928570200317784 

daily_stress - daily_shouting:
0.30926429602231414 

core_circle - social_network:
0.3049413472087288 

todo_completed - flow:
0.30214098153931707 

flow - live_vision:
0.30004494164840867 

core_circle - achievement:
0.29235878020232786 

places_visited - personal_awards:
0.2796037154559053 

In [34]:
# Number of variables that each one is highly correlated with

rows_cols = rows + columns
for variable in list(set(rows_cols)):
    print(variable, rows_cols.count(variable))
live_vision 3
social_network 2
fruits_veggies 2
achievement 7
personal_awards 3
places_visited 1
time_for_passion 4
supporting_others 6
core_circle 3
daily_stress 1
daily_shouting 1
donation 1
flow 4
todo_completed 2

achievement and supporting others are the variables positively correlated with the greatest number of other variables.

Variable 'Achievement':

Let's see the correlations of achievement with the others, and which pairs of these are least correlated with each other:

In [35]:
# Correlations with achievement
data.corr()['achievement'].sort_values(ascending=False)
Out[35]:
achievement          1.000000
personal_awards      0.400330
flow                 0.389727
time_for_passion     0.373957
supporting_others    0.359009
live_vision          0.316496
todo_completed       0.309286
core_circle          0.292359
places_visited       0.264964
social_network       0.252225
donation             0.236727
daily_steps          0.193328
daily_meditation     0.172801
fruits_veggies       0.166643
sufficient_income    0.117023
sleep_hours          0.054920
age                  0.002239
gender              -0.006039
lost_vacation       -0.010751
bmi_range           -0.034303
daily_shouting      -0.066264
daily_stress        -0.120786
Name: achievement, dtype: float64
In [36]:
# What pair of them are less correlated between them?
achievement_corrs = data.corr()['achievement'].sort_values().index.tolist()
data.corr()[achievement_corrs].loc[achievement_corrs]
Out[36]:
daily_stress daily_shouting bmi_range lost_vacation gender age sleep_hours sufficient_income fruits_veggies daily_meditation ... social_network places_visited core_circle todo_completed live_vision supporting_others time_for_passion flow personal_awards achievement
daily_stress 1.000000 0.309264 0.084938 0.192353 -0.128677 -0.020649 -0.152862 -0.144872 -0.095132 -0.213672 ... 0.012720 -0.131707 -0.115973 -0.166975 -0.135016 -0.035373 -0.161858 -0.142187 -0.044362 -0.120786
daily_shouting 0.309264 1.000000 0.054456 0.085184 -0.064035 -0.084847 -0.078230 -0.076268 -0.071099 -0.096076 ... -0.010305 -0.089066 -0.076232 -0.145587 -0.081179 -0.064187 -0.104970 -0.080099 -0.041729 -0.066264
bmi_range 0.084938 0.054456 1.000000 0.032417 -0.009803 0.200395 -0.095543 -0.014296 -0.091937 -0.070900 ... 0.017123 -0.103696 -0.025774 -0.067473 -0.009084 0.027601 -0.023930 0.011586 0.016309 -0.034303
lost_vacation 0.192353 0.085184 0.032417 1.000000 0.025472 -0.052863 -0.093058 -0.083065 -0.069845 -0.113086 ... 0.011538 -0.124751 -0.083369 -0.080262 -0.028527 -0.026155 -0.026963 0.000539 -0.038252 -0.010751
gender -0.128677 -0.064035 -0.009803 0.025472 1.000000 -0.089236 -0.048616 0.006567 -0.118868 0.094050 ... -0.050410 -0.054253 -0.096980 -0.089833 0.022373 -0.131287 0.015833 -0.016699 -0.020684 -0.006039
age -0.020649 -0.084847 0.200395 -0.052863 -0.089236 1.000000 -0.056300 0.161993 0.183554 0.099682 ... -0.058707 0.003419 0.002869 0.096548 0.041089 0.190202 0.000118 -0.002399 0.130737 0.002239
sleep_hours -0.152862 -0.078230 -0.095543 -0.093058 -0.048616 -0.056300 1.000000 0.047208 0.105838 0.164169 ... -0.029174 0.129648 0.060685 0.117985 0.047949 0.013629 0.067811 0.036995 0.018537 0.054920
sufficient_income -0.144872 -0.076268 -0.014296 -0.083065 0.006567 0.161993 0.047208 1.000000 0.151236 0.071580 ... 0.132226 0.173262 0.122687 0.203476 0.130777 0.115812 0.079585 0.086671 0.153977 0.117023
fruits_veggies -0.095132 -0.071099 -0.091937 -0.069845 -0.118868 0.183554 0.105838 0.151236 1.000000 0.195067 ... 0.105804 0.248520 0.153606 0.230350 0.113970 0.207907 0.171531 0.136223 0.165587 0.166643
daily_meditation -0.213672 -0.096076 -0.070900 -0.113086 0.094050 0.099682 0.164169 0.071580 0.195067 1.000000 ... -0.006766 0.203315 0.103687 0.181956 0.152180 0.149154 0.196000 0.145943 0.154547 0.172801
daily_steps -0.057925 -0.031981 -0.128937 -0.045349 0.032735 -0.020511 0.010364 0.101667 0.243607 0.141684 ... 0.213596 0.196342 0.151113 0.204796 0.119280 0.145220 0.146962 0.143280 0.150016 0.193328
donation -0.038291 -0.054656 0.059614 -0.025449 -0.129959 0.233210 0.005471 0.123999 0.200787 0.150520 ... 0.147615 0.212655 0.220173 0.189849 0.158218 0.394205 0.188767 0.162248 0.274397 0.236727
social_network 0.012720 -0.010305 0.017123 0.011538 -0.050410 -0.058707 -0.029174 0.132226 0.105804 -0.006766 ... 1.000000 0.157615 0.304941 0.212253 0.184818 0.311285 0.202677 0.240043 0.214703 0.252225
places_visited -0.131707 -0.089066 -0.103696 -0.124751 -0.054253 0.003419 0.129648 0.173262 0.248520 0.203315 ... 0.157615 1.000000 0.259982 0.234049 0.149852 0.255008 0.198304 0.154917 0.279604 0.264964
core_circle -0.115973 -0.076232 -0.025774 -0.083369 -0.096980 0.002869 0.060685 0.122687 0.153606 0.103687 ... 0.304941 0.259982 1.000000 0.222466 0.216494 0.338630 0.227560 0.243300 0.250223 0.292359
todo_completed -0.166975 -0.145587 -0.067473 -0.080262 -0.089833 0.096548 0.117985 0.203476 0.230350 0.181956 ... 0.212253 0.234049 0.222466 1.000000 0.271881 0.252200 0.278042 0.302141 0.249857 0.309286
live_vision -0.135016 -0.081179 -0.009084 -0.028527 0.022373 0.041089 0.047949 0.130777 0.113970 0.152180 ... 0.184818 0.149852 0.216494 0.271881 1.000000 0.233593 0.311015 0.300045 0.206593 0.316496
supporting_others -0.035373 -0.064187 0.027601 -0.026155 -0.131287 0.190202 0.013629 0.115812 0.207907 0.149154 ... 0.311285 0.255008 0.338630 0.252200 0.233593 1.000000 0.322820 0.266881 0.333632 0.359009
time_for_passion -0.161858 -0.104970 -0.023930 -0.026963 0.015833 0.000118 0.067811 0.079585 0.171531 0.196000 ... 0.202677 0.198304 0.227560 0.278042 0.311015 0.322820 1.000000 0.482637 0.239808 0.373957
flow -0.142187 -0.080099 0.011586 0.000539 -0.016699 -0.002399 0.036995 0.086671 0.136223 0.145943 ... 0.240043 0.154917 0.243300 0.302141 0.300045 0.266881 0.482637 1.000000 0.217978 0.389727
personal_awards -0.044362 -0.041729 0.016309 -0.038252 -0.020684 0.130737 0.018537 0.153977 0.165587 0.154547 ... 0.214703 0.279604 0.250223 0.249857 0.206593 0.333632 0.239808 0.217978 1.000000 0.400330
achievement -0.120786 -0.066264 -0.034303 -0.010751 -0.006039 0.002239 0.054920 0.117023 0.166643 0.172801 ... 0.252225 0.264964 0.292359 0.309286 0.316496 0.359009 0.373957 0.389727 0.400330 1.000000

22 rows × 22 columns

The further down and to the right, the more correlated the variables are with achievement. For a prediction model, we are interested in predictor variables with a low correlation between them, so we should look for small values in the lower-right part of this dataframe. We see 0.217978 for flow and personal awards, so choosing only these two might be a good option. core circle and social network would be the next ones to choose in case we want more.
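A small sketch (assuming the `data` DataFrame from above) that automates this search: among the variables most correlated with achievement, pick the pair whose mutual correlation is lowest.

# Sketch: among the top-k variables correlated with achievement,
# find the pair with the lowest mutual correlation (k=4 is an arbitrary choice)
corr = data.corr()
top = corr['achievement'].drop('achievement').sort_values(ascending=False).head(4).index

best_pair, best_value = None, 1.0
for i, a in enumerate(top):
    for b in top[i+1:]:
        if abs(corr.loc[a, b]) < best_value:
            best_pair, best_value = (a, b), abs(corr.loc[a, b])

print(best_pair, round(best_value, 3))

With the top 4 correlates (personal_awards, flow, time_for_passion, supporting_others), this should pick personal_awards and flow (about 0.218), in line with the choice discussed above.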

In [37]:
# The lowest values of correlation in absolute value
np.unique(np.sort(np.array(abs(data.corr()[achievement_corrs[11:]].loc[achievement_corrs[11:]])).flatten()))
Out[37]:
array([0.14761525, 0.14985247, 0.15491703, 0.15761513, 0.15821833,
       0.16224798, 0.18481764, 0.18876745, 0.18984902, 0.19830437,
       0.20267701, 0.20659311, 0.21225279, 0.21265549, 0.21470312,
       0.21649415, 0.21797815, 0.22017343, 0.22246642, 0.22756044,
       0.23359329, 0.23404912, 0.23672727, 0.23980767, 0.24004297,
       0.24329982, 0.2498565 , 0.25022277, 0.25219994, 0.25222473,
       0.25500816, 0.25998208, 0.26496444, 0.26688059, 0.271881  ,
       0.27439719, 0.27804196, 0.27960372, 0.29235878, 0.30004494,
       0.30214098, 0.30494135, 0.3092857 , 0.31101501, 0.31128491,
       0.3164962 , 0.32281997, 0.33363161, 0.33862994, 0.35900949,
       0.37395656, 0.38972744, 0.39420465, 0.40032959, 0.48263733,
       1.        ])

Negative correlations:

Let's see the most negatively correlated pairs of variables:

In [38]:
# Pairs of variables with the most negative correlations

sorted_corr = np.unique(np.sort(data.corr().values.flatten()))
array2D_corr = np.array(data.corr())

n = 20
nth_lowest = sorted_corr[:n]

fig, axs = plt.subplots(5,4, figsize=(16,16))
fig.tight_layout(pad=5) #padding between subplots

axs = axs.flatten()
rows = []
columns = []
print("Lowest correlations:\n\n")
for i in range(n):
    position = np.where(array2D_corr == nth_lowest[i]) 
    row = data.corr().index[position[0][0]]
    column = data.corr().columns[position[1][0]]
    rows.append(row)
    columns.append(column)
    print("{} - {}:".format(row, column))
    print(nth_lowest[i], "\n")
    frecuency = pd.crosstab(data[row],data[column]).iloc[::-1]
    ax = fig.sca(axs[i])
    ax.set(title='{} VS {}'.format(column,row))
    sns.heatmap(frecuency, ax=ax, cmap="YlGnBu")
    
    # axs[i].imshow(frecuency, cmap='Blues')
    # axs[i].hist2d(data[row],data[column], cmap='Blues')
Lowest correlations:


daily_stress - daily_meditation:
-0.21367242460175634 

daily_stress - todo_completed:
-0.16697526420745487 

daily_stress - time_for_passion:
-0.1618577321177887 

daily_stress - sleep_hours:
-0.15286158450655735 

todo_completed - daily_shouting:
-0.14558711896131799 

daily_stress - sufficient_income:
-0.14487168163684583 

daily_stress - flow:
-0.142187370687171 

daily_stress - live_vision:
-0.1350155981795351 

daily_stress - places_visited:
-0.13170694011317127 

supporting_others - gender:
-0.13128712907783338 

donation - gender:
-0.12995859803876536 

bmi_range - daily_steps:
-0.1289368441372705 

daily_stress - gender:
-0.128676852685506 

places_visited - lost_vacation:
-0.12475119803990595 

daily_stress - achievement:
-0.12078578179755883 

fruits_veggies - gender:
-0.11886752387580067 

daily_stress - core_circle:
-0.11597261409797491 

lost_vacation - daily_meditation:
-0.11308591354343855 

daily_shouting - time_for_passion:
-0.10497038387085024 

places_visited - bmi_range:
-0.10369634596618633 

In [39]:
# Number of other variables that each one is most negatively correlated with

rows_cols = rows + columns
for variable in list(set(rows_cols)):
    print(variable, rows_cols.count(variable))
places_visited 3
core_circle 1
daily_shouting 2
bmi_range 2
flow 1
sufficient_income 1
live_vision 1
time_for_passion 2
supporting_others 1
daily_stress 11
sleep_hours 1
daily_meditation 2
gender 4
lost_vacation 2
fruits_veggies 1
achievement 1
daily_steps 1
donation 1
todo_completed 2

Variable 'daily stress'

As we showed above in the heatmap of the correlation matrix, daily stress is negatively correlated with a number of other variables. Let's see the most negative correlations of daily stress with the others, and which pairs of those variables are least correlated with each other:

In [40]:
# Correlations with daily stress
data.corr()['daily_stress'].sort_values()
Out[40]:
daily_meditation    -0.213672
todo_completed      -0.166975
time_for_passion    -0.161858
sleep_hours         -0.152862
sufficient_income   -0.144872
flow                -0.142187
live_vision         -0.135016
places_visited      -0.131707
gender              -0.128677
achievement         -0.120786
core_circle         -0.115973
fruits_veggies      -0.095132
daily_steps         -0.057925
personal_awards     -0.044362
donation            -0.038291
supporting_others   -0.035373
age                 -0.020649
social_network       0.012720
bmi_range            0.084938
lost_vacation        0.192353
daily_shouting       0.309264
daily_stress         1.000000
Name: daily_stress, dtype: float64
In [41]:
# What pair of them are less correlated between them?
stress_corrs = abs(data.corr()['daily_stress']).sort_values().index.tolist()
data.corr()[stress_corrs[11:]].loc[stress_corrs[11:]]
Out[41]:
places_visited live_vision flow sufficient_income sleep_hours time_for_passion todo_completed lost_vacation daily_meditation daily_shouting daily_stress
places_visited 1.000000 0.149852 0.154917 0.173262 0.129648 0.198304 0.234049 -0.124751 0.203315 -0.089066 -0.131707
live_vision 0.149852 1.000000 0.300045 0.130777 0.047949 0.311015 0.271881 -0.028527 0.152180 -0.081179 -0.135016
flow 0.154917 0.300045 1.000000 0.086671 0.036995 0.482637 0.302141 0.000539 0.145943 -0.080099 -0.142187
sufficient_income 0.173262 0.130777 0.086671 1.000000 0.047208 0.079585 0.203476 -0.083065 0.071580 -0.076268 -0.144872
sleep_hours 0.129648 0.047949 0.036995 0.047208 1.000000 0.067811 0.117985 -0.093058 0.164169 -0.078230 -0.152862
time_for_passion 0.198304 0.311015 0.482637 0.079585 0.067811 1.000000 0.278042 -0.026963 0.196000 -0.104970 -0.161858
todo_completed 0.234049 0.271881 0.302141 0.203476 0.117985 0.278042 1.000000 -0.080262 0.181956 -0.145587 -0.166975
lost_vacation -0.124751 -0.028527 0.000539 -0.083065 -0.093058 -0.026963 -0.080262 1.000000 -0.113086 0.085184 0.192353
daily_meditation 0.203315 0.152180 0.145943 0.071580 0.164169 0.196000 0.181956 -0.113086 1.000000 -0.096076 -0.213672
daily_shouting -0.089066 -0.081179 -0.080099 -0.076268 -0.078230 -0.104970 -0.145587 0.085184 -0.096076 1.000000 0.309264
daily_stress -0.131707 -0.135016 -0.142187 -0.144872 -0.152862 -0.161858 -0.166975 0.192353 -0.213672 0.309264 1.000000

The further down and to the right, the more correlated the variables are with daily stress. For a prediction model, we are interested in predictor variables with a low correlation (in absolute value) between them, so we should look for small values in the lower-right part of this dataframe. We see -0.096076 for daily shouting and daily meditation, so choosing these two might be a good option. Other variables we could choose that are poorly correlated with daily shouting and daily meditation are lost vacation and sufficient income.

In [42]:
# The lowest values of correlation in absolute value
np.unique(np.sort(np.array(abs(data.corr()[stress_corrs[11:]].loc[stress_corrs[11:]])).flatten()))
Out[42]:
array([5.39244520e-04, 2.69626501e-02, 2.85272804e-02, 3.69947788e-02,
       4.72083003e-02, 4.79493603e-02, 6.78111005e-02, 7.15803146e-02,
       7.62676096e-02, 7.82295707e-02, 7.95853327e-02, 8.00993712e-02,
       8.02622588e-02, 8.11793283e-02, 8.30652481e-02, 8.51840221e-02,
       8.66714470e-02, 8.90661821e-02, 9.30579567e-02, 9.60761312e-02,
       1.04970384e-01, 1.13085914e-01, 1.17985233e-01, 1.24751198e-01,
       1.29647517e-01, 1.30777052e-01, 1.31706940e-01, 1.35015598e-01,
       1.42187371e-01, 1.44871682e-01, 1.45587119e-01, 1.45942796e-01,
       1.49852474e-01, 1.52180100e-01, 1.52861585e-01, 1.54917034e-01,
       1.61857732e-01, 1.64169500e-01, 1.66975264e-01, 1.73262396e-01,
       1.81955930e-01, 1.92352962e-01, 1.96000331e-01, 1.98304373e-01,
       2.03315316e-01, 2.03476252e-01, 2.13672425e-01, 2.34049116e-01,
       2.71880996e-01, 2.78041959e-01, 3.00044942e-01, 3.02140982e-01,
       3.09264296e-01, 3.11015008e-01, 4.82637325e-01, 1.00000000e+00])

2.6 Summary of the EDA: Key Findings and insights

From the EDA we can conclude some key findings:

  • Regarding the distribution of the data, we see that the variables are not normally distributed. Most of them are left-skewed, in particular those related to the social area: core circle, social network, supporting others or donation. This means that there are a few people with a poor social life compared with the majority. On the other hand, other variables are right-skewed, so for example a few people stand out from the rest in time for passion, flow and achievement.
  • There is a clear division between variables positively (red) and negatively (blue) correlated with the others in general. achievement and daily stress are the ones with the greatest potential to be predicted, as they have the strongest positive and negative correlations respectively.
  • achievement is the variable positively correlated with the greatest number of other variables. For a predictive model, we found that flow and personal awards could be good predictor variables, as they are not too correlated with each other. We could try only these two, but we could include core circle and social network in case we want more than two. However, that would probably increase model complexity unnecessarily and therefore the risk of overfitting.
  • daily stress is the variable with the greatest number of negative correlations with other variables, such as daily meditation, to-do completed, time for passion or sleep hours. There might be a real influence of these variables on the level of stress people experience. For a predictive model, daily shouting and daily meditation could be good predictor variables. Other good options to include are lost vacation, although it doesn't seem to have much of a real relationship with stress, and sufficient income.

3. Hypothesis

Hypothesis 1: The population sleeps less than 7 hours (the World Health Organization recommendation is 7-8 hours). We can perform a t-test and check whether people sleep on average fewer hours than the WHO recommends, which might be indicative of a poor quality of life. The next step would be to see if those who sleep few hours also have negative indicators such as high levels of stress or low marks in achievement, personal awards, etc.

Hypothesis 2: Data points with high values for positive attributes are grouped together. It is said that one good habit positively affects other areas, producing a kind of "virtuous cycle". Because of this, and the similar behaviour of the positive attributes (reds in the correlation matrix) with respect to the other variables, we might expect the data to be grouped by positive and negative attributes. We could perform clustering or classification algorithms, along with a significance test of the results, to test this hypothesis.

Hypothesis 3: The correlations between achievement and the variables flow, daily_meditation and time for passion are significantly different from zero. If so, we can try a predictive model such as a linear model to predict achievement, or just use the findings for interpretation and to increase our achievements.

Hypothesis 4: The correlation between daily stress and daily meditation is significantly different from zero. The same analysis as for hypothesis 3 can be performed with daily stress.

Significance test for hypothesis 1: the population sleeps less than 7 hours

We assume that the sleep hours follow a normal distribution and we are going to treat the variable as continuous, although our data only takes integer values between 1 and 10. Our null ($H_{0}$) and alternative ($H_{1}$) hypotheses are:

$H_{0}: \mu < 7h$
$H_{1}: \mu \geq 7h$

The population standard deviation is unknown, so we are going to estimate it from our data and perform a t test:

1. We calculate the t statistic:

$t_0 = \large \frac{\overline{X} - \mu}{\Large \frac{S} {\sqrt n}} \hspace{0.5cm} \small [1]$

where,

$S = \large \sqrt{\frac{\sum_{i=1}^n (x_{i}-\overline x)^2}{n-1}}$

is the standard deviation of our sample, and

$\mu$: population mean
$\overline X$: sample mean


2. The p-value is the probability that $t$ is greater than the value we got from our sample, $t_0$. This is the area under the curve of the Student's t distribution, whose expression is:

$f(t) = \large \frac{\Gamma(\frac{\Large \nu + 1}{2})}{\sqrt{\nu\pi}\hspace{1mm} \Gamma(\nu/2)} (1+\frac{t^2}{\nu})^{-\frac{\nu+1}{2}} \hspace{0.5cm} \small [2]$,

where

$\hspace{2cm}\Gamma(p) = \int_0^\infty x^{p-1}e^{-x}dx \hspace{0.5cm} \small [3]$


The area under the curve is the integral of $f(t)$ between two values of t. Integrating between $t_0$ and $\infty$ we get the p-value, which is the probability that $t > t_0$, and therefore the probability that the mean of another sample is greater than our $\overline X$.

What is the probability of the population mean being less than 7h, given our sample?
We can see in [1] that, for the variable t, keeping $\mu$ fixed and varying $\overline X$ is equivalent to keeping $\overline X$ fixed and varying $\mu$. To calculate the probability of the population mean being less than 7h given our sample, we would keep our $\overline X$ fixed (from our sample) and traverse $\mu$ from $7 \rightarrow -\infty$ (to 0 actually), which is the same as keeping $\mu = 7$ and traversing $\overline X$ from $7.03 \rightarrow \infty$. When integrating $f(t)$ we would traverse t in opposite directions, but we can also see in [1] that $dt$ has the opposite sign, so the result of the integral is the same.

Thus, the probability of the population mean being less than 7h given our sample is exactly the same as the probability of any sample mean being greater than ours (7.03h) under the hypothesis that the population mean is 7h, i.e., given $\mu = 7$.

In [43]:
from scipy.stats import t

mu = 7 # Population mean if the hypothesis is true
sample_mean = data['sleep_hours'].mean()
sample_std = data['sleep_hours'].std()
n = data['sleep_hours'].shape[0]

t_stat = (sample_mean - mu) / (sample_std/np.sqrt(n))
t_stat
Out[43]:
3.3069605551230157
In [44]:
# Probability of mu < 7h

p_value = 1-t.cdf(t_stat, df=n-1)
print("p-value: ", p_value)
print("1-p-pavlue: ", 1-p_value)
p-value:  0.0004728744186621725
1-p-pavlue:  0.9995271255813378

Our confidence level is 99%, so p-values below 0.01 lead us to reject the null hypothesis. I obtained a $p$-value $\approx 0.0005$, so we can reject the null hypothesis with (at least) 99% confidence.

$H_{1}$ accepted: the population sleeps on average 7h or more
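As a cross-check, the same one-sided test can be obtained directly from scipy (a sketch; the alternative argument assumes scipy >= 1.6):

from scipy.stats import ttest_1samp

# One-sided, one-sample t-test: the alternative is that the mean is greater than 7
t_stat, p_value = ttest_1samp(data['sleep_hours'], popmean=7, alternative='greater')
print(t_stat, p_value)  # should reproduce t ≈ 3.307 and p ≈ 0.00047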


Extra: Bayesian interpretation

NOTE: I think the expressions below are not valid for continuous variables, so this could all be wrong

Define:

$r = \large \frac{p(H_{0}|x)}{p(H_{1}|x)} = \frac{p(H_{0})·p(x|H_{0})}{p(H_{1})·p(x|H_{1})} = \frac{p(H_{0})}{p(H_{1})}·\frac{p_{value}}{1 - p_{value}}$,

In [45]:
# Giving the same prior probabilities to both hypotheses, p(H0) = p(H1):

r = p_value/(1-p_value)
r
Out[45]:
0.00047309813466757356
In [46]:
1/r
Out[46]:
2113.726363987164

The alternative hypothesis is 2114 times more likely than the null hypothesis. The sum of both probabilities is 1, so:

In [47]:
p_H1 = (1/r) / ((1/r)+1)
p_H1
Out[47]:
0.9995271255813378

Posterior probability: $p(H_{1}) = 99.95 \% $

What is the value of the population mean?

In [48]:
mu = 7.03
t_stat = (sample_mean - mu) / (sample_std/np.sqrt(n))
t_stat
Out[48]:
0.48867891499420146
In [49]:
t.cdf(t_stat, df=n-1)
Out[49]:
0.6874612820197958

$\mu = \overline{X} \pm \large t \cdot \frac{S}{\sqrt{n}}$

In [50]:
t.ppf(0.95, df=n-1) * sample_std/math.sqrt(n)
Out[50]:
0.01751038383904761

The mean is $(7.03 \pm 0.02)\,h$, using the one-sided 95% quantile of the t distribution (equivalent to a two-sided 90% confidence interval)

4. Next steps

The steps may be:

  • Create a linear model for daily stress with daily meditation and daily shouting as predictor variables, and see if they really have a correlation with stress significantly different from zero. We can try a log transformation in case we don't get normally distributed residuals, as the variable daily shouting is right-skewed and this can lead to skewed residuals. With daily meditation a polynomial transformation might be required to fix the left skew.
  • Create a predictive model for achievement with flow and personal awards as predictor variables, and see if they really have a correlation with achievement significantly different from zero. Again, we can try a log transformation with the variable flow in case we don't get normally distributed residuals, as it is right-skewed, and we can also try a polynomial transformation with personal awards.
  • Look for clusters to see the arrangement of the data. We could perform clustering or classification algorithms to test hypothesis 2. Scaling the variables may be necessary, as these algorithms are based on distances (see the sketch below).
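A minimal sketch of how the clustering of hypothesis 2 could start (assumptions: an arbitrary k=2, all numeric columns, standard scaling; a starting point, not the final analysis):

# Sketch: scale the features and fit a simple KMeans with an arbitrary k=2
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = data.drop('timestamp', axis=1)             # timestamp is the only non-numeric column here
X_scaled = StandardScaler().fit_transform(features)   # scaling matters because KMeans is distance-based

kmeans = KMeans(n_clusters=2, random_state=2021).fit(X_scaled)

# If the hypothesis holds, the cluster means of the "positive" attributes should separate clearly
print(features.assign(cluster=kmeans.labels_)
              .groupby('cluster')[['achievement', 'flow', 'supporting_others', 'daily_stress']]
              .mean())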

5. Linear Regression for daily stress

Pearson correlation coefficients, r, gave us an idea of how well a linear model fits the data, but they don't say anything about how much each variable influences stress. The magnitude of the linear regression coefficients and the Pearson correlations are different things and are not necessarily related. What tells us which features influence stress the most is the magnitude of the linear regression coefficients.

Now we are going to fit some linear regression models, focusing on the interpretation of the main factors that cause daily stress, so we'll see which features have the strongest influence by comparing the corresponding coefficients. These features can probably be sources of stress, and the findings may therefore be of great interest.

In [51]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error

First we'll remove the variable timestamp, as we are not interested in it, and bmi_range, as its values are only 1 or 2 and thus seem to be wrong ("body mass index" has typical values of 18 - 30).

In [52]:
data_original = data.copy()
data = data.drop(['timestamp', 'bmi_range'], axis=1)
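An aside before fitting the models: the features are on different scales (0-10 counts, a 1-2 income indicator, a 0/1 gender flag), so the magnitudes of raw coefficients are not directly comparable across features. A minimal sketch of standardized coefficients, assuming the `data` DataFrame as it stands after the cell above; this is only a sanity check for the interpretation, not the baseline model we fit next:

# Sketch: linear regression on standardized features, so coefficient magnitudes are comparable
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

y = data['daily_stress']
X = data.drop('daily_stress', axis=1)

scaled_lr = Pipeline([('scaler', StandardScaler()), ('lr', LinearRegression())])
scaled_lr.fit(X, y)

std_coefs = pd.Series(scaled_lr.named_steps['lr'].coef_, index=X.columns)
print(std_coefs.sort_values())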

5.1 Simple Linear Regression with OLS from statsmodels.api

First, we will fit a linear model with the class OLS (Ordinary Least Squares) from statsmodels.api, which provides a more statistical treatment, for example the t value from a significance test for the coefficients being non-zero. We will pass the training sets X_train and y_train to the OLS object. We will do the train-test split with a test size of 0.3 (30%), which results in 8928 data points for the training set and 3827 for the test set. We choose a seed for convenience, for example random_state=2021, to ensure the splits are the same when comparing with other models. Once we have a baseline, we will try adding regularization and polynomial features and see if they shed more light on which factors have the most impact on stress.

In [53]:
# Baseline

import statsmodels.api as sm

y = data['daily_stress']
X = data.drop('daily_stress', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=2021)

lr = sm.OLS(y_train, sm.add_constant(X_train))
OLSresults = lr.fit()

print("R2 on training set:", r2_score(y_train, OLSresults.predict(sm.add_constant(X_train))))
print("R2 on test set:", r2_score(y_test, OLSresults.predict(sm.add_constant(X_test))))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults.predict(sm.add_constant(X_test))))
print("Root mean squared error: ", rmse)

display(OLSresults.summary())
R2 on training set: 0.20751448027951147
R2 on test set: 0.19640373519899001
Root mean squared error:  1.2239315195417808
OLS Regression Results
Dep. Variable: daily_stress R-squared: 0.208
Model: OLS Adj. R-squared: 0.206
Method: Least Squares F-statistic: 116.6
Date: Sun, 18 Apr 2021 Prob (F-statistic): 0.00
Time: 12:05:44 Log-Likelihood: -14475.
No. Observations: 8928 AIC: 2.899e+04
Df Residuals: 8907 BIC: 2.914e+04
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.1133 0.105 39.351 0.000 3.908 4.318
fruits_veggies -0.0194 0.010 -1.955 0.051 -0.039 5.48e-05
places_visited -0.0124 0.004 -2.777 0.006 -0.021 -0.004
core_circle -0.0234 0.005 -4.515 0.000 -0.034 -0.013
supporting_others 0.0192 0.005 3.879 0.000 0.009 0.029
social_network 0.0234 0.005 4.933 0.000 0.014 0.033
achievement -0.0167 0.006 -2.852 0.004 -0.028 -0.005
donation -0.0009 0.008 -0.112 0.910 -0.016 0.015
todo_completed -0.0233 0.006 -4.060 0.000 -0.034 -0.012
flow -0.0296 0.007 -4.383 0.000 -0.043 -0.016
daily_steps 0.0042 0.005 0.862 0.388 -0.005 0.014
live_vision -0.0155 0.004 -3.498 0.000 -0.024 -0.007
sleep_hours -0.0959 0.011 -8.602 0.000 -0.118 -0.074
lost_vacation 0.0487 0.004 13.530 0.000 0.042 0.056
daily_shouting 0.1253 0.005 25.168 0.000 0.116 0.135
sufficient_income -0.2563 0.031 -8.346 0.000 -0.317 -0.196
personal_awards 0.0224 0.005 4.641 0.000 0.013 0.032
time_for_passion -0.0328 0.006 -5.658 0.000 -0.044 -0.021
daily_meditation -0.0488 0.005 -10.535 0.000 -0.058 -0.040
age 0.0210 0.015 1.420 0.156 -0.008 0.050
gender -0.3315 0.028 -11.950 0.000 -0.386 -0.277
Omnibus: 103.262 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 64.643
Skew: -0.030 Prob(JB): 9.18e-15
Kurtosis: 2.588 Cond. No. 169.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [54]:
rmse_cv_ols = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X, y)))
print(rmse_cv_ols)
1.2263339068499821

Features with coefficients significantly different from 0 are shown next. Let's take a 99% confidence level as our criterion (p-value < 0.01), so we will keep the features whose "P>|t|" column is below 0.01:
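
Instead of listing the columns by hand, the significant features can also be picked programmatically from the p-values reported by statsmodels; a small sketch, assuming `OLSresults` from the previous cell (it should reproduce the hand-picked list used in the next cell):

In [ ]:
# Keep the features whose coefficient has p-value < 0.01, dropping the intercept term
pvalues = OLSresults.pvalues.drop('const')
features_auto = pvalues[pvalues < 0.01].index
print(list(features_auto))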

In [57]:
features_nonzero = X.columns.drop(['fruits_veggies','donation', 'daily_steps', 'age'])
X_train2 = sm.add_constant(X_train[features_nonzero])
X_test2 = sm.add_constant(X_test[features_nonzero])

lr2 = sm.OLS(y_train, X_train2)
OLSresults2 = lr2.fit() #OLSResults object

print("R2 on training set:", r2_score(y_train, OLSresults2.predict(X_train2)))
print("R2 on test set:", r2_score(y_test, OLSresults2.predict(X_test2)))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults2.predict(X_test2)))
print("Root mean squared error: ", rmse)

coefs2 = OLSresults2.params.sort_values()
print("-"*50)
print("\nCoefficients: \n")
print(coefs2)

rmse_cv_ols2 = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X[features_nonzero], y)))
print("\nRmse CV: ", rmse_cv_ols2)
R2 on training set: 0.2070379500456585
R2 on test set: 0.19625620857542925
Root mean squared error:  1.224043860904083
--------------------------------------------------

Coefficients: 

gender              -0.327591
sufficient_income   -0.253618
sleep_hours         -0.098262
daily_meditation    -0.048974
time_for_passion    -0.033616
flow                -0.029732
core_circle         -0.023774
todo_completed      -0.023402
achievement         -0.016938
live_vision         -0.015428
places_visited      -0.013766
supporting_others    0.019710
personal_awards      0.022859
social_network       0.023035
lost_vacation        0.048548
daily_shouting       0.125046
const                4.135555
dtype: float64

Rmse CV:  1.2260816886645598

These coefficients are significantly different from zero*. The variables gender, sufficient income and daily shouting (positive coefficient) seem to be the ones with most influence on stress.

\*Note: Residuals are not normally distributed, and no data transformation (log, root, Box-Cox) yields more normally distributed residuals. Since the significance tests performed by `statsmodels.api.OLS` assume normally distributed residuals, we should be prudent and critical with these results before reaching a conclusion. However, one fact in their favour is that the independence $\chi^2$ tests show a dependency between the categories of *stress* and, at least, the variables *gender, sufficient income* and *daily shouting*. These tests only guarantee a dependency between unordered categories, not a correlation, but looking at the data it is noticeable that this dependency follows a definite direction. Everything points to significant non-null correlations. See the section Statistical Hypothesis Testing for dependencies and correlations for details.

We have a number of features, but some of them are close to zero. We can now try reducing the number of features and see which coefficients we obtain for the remaining ones. We will do this with two approaches: Lasso regression and RFE (Recursive Feature Elimination):

5.2 Reducing features: Lasso

We see some features that stand out against the others: gender, sufficient income and daily shouting. However, we didn't obtain a great $R^2$ score, so the fit might not be good enough to fully trust these coefficients; we can't yet be sure whether there is a real relation between those variables and stress.

We can try simplifying the model by adding regularization and see if we obtain a better fit that gives us more confidence in the real influence of the predictive features. For this purpose, Lasso is usually a better choice than Ridge in terms of interpretation, as coefficients are more likely to vanish with Lasso, keeping only those with a stronger impact on the output variable and improving the interpretability of the simplified model. In addition, we tried an ElasticNetCV and obtained that the best l1_ratio parameter was 1, so we will try different alpha values for a Lasso regression. We will therefore create a Lasso model using LassoCV in order to try several values of alpha, the parameter that controls the amount of regularization, and compare the resulting models through the $R^2$ score obtained after cross-validation, to ensure that the particular choice of training-validation split doesn't determine which one is better. We keep the parameter cv=None, which uses the default 5-fold cross-validation, and we reuse the X_train subset used previously with LinearRegression, which in turn will be split into 5 folds to perform the cross-validation. We keep X_test to check the performance on unseen data.

In [58]:
alphas = np.geomspace(1e-5, 1, num=24)
lassoCV = LassoCV(alphas=alphas)
lassoCV.fit(X_train, y_train)

print("Best model:\n")
print("Alpha: ", lassoCV.alpha_)
print("R2 on test set: ", lassoCV.score(X_test, y_test))  #r2_score(y_test, lassoCV.predict(X_test))
rmse = np.sqrt(mean_squared_error(y_test, lassoCV.predict(X_test)))
print("Root mean squared error: ", rmse)
Best model:

Alpha:  0.0014924955450518291
R2 on test set:  0.1964983336215841
Root mean squared error:  1.223859477520299
In [59]:
rmse_cv_lasso = np.sqrt(mean_squared_error(y, cross_val_predict(Lasso(alpha=lassoCV.alpha_), X, y)))
print(rmse_cv_lasso)
1.226344366252987
In [60]:
pd.DataFrame(list(zip(X.columns, lassoCV.coef_)), columns=['feature','coef']).set_index('feature').sort_values(by='coef')
Out[60]:
coef
feature
gender -0.324783
sufficient_income -0.248169
sleep_hours -0.095088
daily_meditation -0.048859
time_for_passion -0.032768
flow -0.029366
core_circle -0.023177
todo_completed -0.023167
fruits_veggies -0.018384
achievement -0.016536
live_vision -0.015557
places_visited -0.012463
donation -0.000072
daily_steps 0.003708
age 0.018553
supporting_others 0.019050
personal_awards 0.022077
social_network 0.023043
lost_vacation 0.048645
daily_shouting 0.125290

None of the coefficients has been driven to zero. In fact, the results are very similar to the ones from plain linear regression, both the $R^2$ score and the coefficient values. However, with stronger regularization (alpha above 1), the $R^2$ score becomes practically null.
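
The effect of a much stronger regularization can be checked quickly; a minimal sketch, assuming the X_train/X_test split from above (exact numbers may vary):

In [ ]:
# With a large alpha, Lasso shrinks every coefficient to zero and the test R2 collapses,
# which is why we do not push the regularization further.
lasso_strong = Lasso(alpha=5.0)
lasso_strong.fit(X_train, y_train)
print("Non-zero coefficients:", np.sum(lasso_strong.coef_ != 0))
print("R2 on test set:", lasso_strong.score(X_test, y_test))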

5.3 Reducing features: RFE

We are going to be stricter and keep only 12 features. Let's fit a model with the 12 features selected by RFE and see whether they coincide with the significant ones in features_nonzero.

LinearRegression with the top 12 features selected by Recursive Feature Elimination (RFE)

In [61]:
rfe = RFE(LinearRegression(), n_features_to_select=12)
rfe.fit(X, y)

features = X.columns[rfe.support_]
features
Out[61]:
Index(['core_circle', 'social_network', 'todo_completed', 'flow',
       'sleep_hours', 'lost_vacation', 'daily_shouting', 'sufficient_income',
       'time_for_passion', 'daily_meditation', 'age', 'gender'],
      dtype='object')
In [62]:
X_train_rfe_1 = sm.add_constant(X_train[features])
X_test_rfe_1 = sm.add_constant(X_test[features])

lr_rfe = sm.OLS(y_train, X_train_rfe_1)
OLSresults_rfe = lr_rfe.fit() #OLSResults object

print("R2 on training set:", r2_score(y_train, OLSresults_rfe.predict(X_train_rfe_1)))
print("R2 on test set:", r2_score(y_test, OLSresults_rfe.predict(X_test_rfe_1)))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults_rfe.predict(X_test_rfe_1)))
print("Root mean squared error: ", rmse)

coefs_rfe = OLSresults_rfe.params.sort_values()
print("-"*50)
print("Coefficients:\n")
print(coefs_rfe)
R2 on training set: 0.2018642122384604
R2 on test set: 0.19128926477826702
Root mean squared error:  1.2278201843907077
--------------------------------------------------
Coefficients:

gender              -0.336682
sufficient_income   -0.265136
sleep_hours         -0.101721
daily_meditation    -0.050159
time_for_passion    -0.033970
flow                -0.033181
todo_completed      -0.025641
core_circle         -0.022373
social_network       0.026723
age                  0.035751
lost_vacation        0.049624
daily_shouting       0.126775
const                4.161317
dtype: float64
In [63]:
rmse_cv_rfe = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X[features], y)))
print(rmse_cv_rfe)
1.2294077096448424

There might be interactions between variables. Let's try adding polynomial features and see if we get better performance:

5.4 Polynomial Features

We can drop gender to look at the interactions between the other variables, as interaction terms with gender don't make much sense, and then generate the polynomial features. Daily_shouting has negative correlations with the others, so we tried including ratio features like $\frac{meditation}{shouting}$, but no good results were obtained. Adding polynomial features might shed light on possible interactions between positively correlated variables.

Let's try polynomial features, keeping only the interaction terms, and creating a Lasso model feeding it with these new features:

In [64]:
pf = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = pf.fit_transform(X_train.drop('gender', axis=1))
X_test_poly = pf.transform(X_test.drop('gender', axis=1))
lassoCV_poly = LassoCV(alphas = alphas, max_iter=100000)
lassoCV_poly.fit(X_train_poly, y_train)

print("Best model with polynomial features:\n")
print("Alpha: ", lassoCV_poly.alpha_)
print("R2 on test set: ", lassoCV_poly.score(X_test_poly, y_test))
rmse = np.sqrt(mean_squared_error(y_test, lassoCV_poly.predict(X_test_poly)))
print("Root mean squared error: ", rmse)
Best model with polynomial features:

Alpha:  0.13503140378698722
R2 on test set:  0.18848756892228846
Root mean squared error:  1.2299451744264844
In [65]:
# Note: this cross-validated rmse is computed on the original features X (not the
# polynomial ones), so it coincides with rmse_cv_ols above.
rmse_cv_poly = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X, y)))
print(rmse_cv_poly)
1.2263339068499821
In [66]:
selected_features = [(a,b) for a,b in zip(pf.get_feature_names(input_features=X_train.drop('gender', axis=1).columns), lassoCV_poly.coef_) 
                     if b!=0]
df_selected_features = pd.DataFrame(selected_features, columns=['Polynomial Feature', 'Coef']).sort_values(by='Coef')
print("Polynomial Features coefficients:")
display(df_selected_features.head())
display(df_selected_features.tail())
Polynomial Features coefficients:
Polynomial Feature Coef
46 sleep_hours sufficient_income -0.027884
48 sleep_hours daily_meditation -0.003442
20 supporting_others daily_meditation -0.003251
42 live_vision sleep_hours -0.002931
55 time_for_passion daily_meditation -0.002821
Polynomial Feature Coef
13 supporting_others social_network 0.002591
38 daily_steps lost_vacation 0.002792
24 social_network age 0.003054
53 daily_shouting daily_meditation 0.003471
45 sleep_hours daily_shouting 0.012051

The coefficients are very close to 0 because, despite the regularization, there is still a large number of features. However, we can't increase the regularization (parameter alpha), because we have seen that in that case $R^2$ drops to almost zero and the linear coefficients lose their reliability. In addition, the interaction terms obtained have fairly similar coefficients and don't make much sense, so they don't provide a good interpretation.

5.5 Best model for interpretation: conclusion

After exploring several possibilities, we discard adding polynomial features. We saw that regularization does not improve the results reached with the simple linear regression, giving very similar coefficients without eliminating any of them. In addition, it doesn't provide a better fit that could have made it more trustworthy. In conclusion, we can keep the simple linear regression.

In [67]:
print("rmse with cross validation predict:\n")
print("OLS: {}\nOLS with significantly different from zero coefficients: {}\nLasso: {}\nLasso with polynomial features: {}\nOLS with RFE: {}".format(
    rmse_cv_ols, rmse_cv_ols2, rmse_cv_lasso, rmse_cv_poly, rmse_cv_rfe))
rmse with cross validation predict:

OLS: 1.2263339068499821
OLS with significantly different from zero coefficients: 1.2260816886645598
Lasso: 1.226344366252987
Lasso with polynomial features: 1.2263339068499821
OLS with RFE: 1.2294077096448424

The model with the lowest rmse will be the most reliable one and thus the one we will use to explain which features have the strongest influence on stress. We discard polynomial features because those features did not make sense and there were many of them with very similar coefficients, which is not useful for explaining stress. Plain linear regression (OLS) restricted to the significant features best fits our interpretation objective, using only the following features:

In [68]:
print(coefs2)
gender              -0.327591
sufficient_income   -0.253618
sleep_hours         -0.098262
daily_meditation    -0.048974
time_for_passion    -0.033616
flow                -0.029732
core_circle         -0.023774
todo_completed      -0.023402
achievement         -0.016938
live_vision         -0.015428
places_visited      -0.013766
supporting_others    0.019710
personal_awards      0.022859
social_network       0.023035
lost_vacation        0.048548
daily_shouting       0.125046
const                4.135555
dtype: float64

5.6 Key Findings and Insights

Daily stress depends mainly on two features: gender and sufficient income. On average, women suffer more stress than men, and the same happens with people who do not have sufficient income. These findings are not surprising, since other studies also support these hypotheses. Other variables having an impact on stress, although considerably smaller, are the number of sleep hours, and to an even lesser degree daily meditation, lost vacation, etc. These have only a small influence, but it is not zero.

Comparison: here we compare the gender coefficient (and the sufficient income coefficient) with the others. We can use the following ratios to compare the influence of each variable on stress: the larger the ratio in absolute value, the smaller the influence of the variable in the denominator.

In [72]:
print("Ratios coefficients\n")
for i,coef in enumerate(coefs2[:2]):
    ratios = coef/coefs2[:-1].drop(coefs2.index[i])
    ratios.index = ratios.index.map(lambda x: coefs2.index[i] + "/" + x)
    print(ratios)
    print("\n")
Ratios coefficients

gender/sufficient_income     1.291672
gender/sleep_hours           3.333870
gender/daily_meditation      6.689066
gender/time_for_passion      9.744976
gender/flow                 11.018036
gender/core_circle          13.779386
gender/todo_completed       13.998338
gender/achievement          19.340617
gender/live_vision          21.233975
gender/places_visited       23.796995
gender/supporting_others   -16.620255
gender/personal_awards     -14.331076
gender/social_network      -14.221421
gender/lost_vacation        -6.747794
gender/daily_shouting       -2.619754
dtype: float64


sufficient_income/gender                0.774190
sufficient_income/sleep_hours           2.581049
sufficient_income/daily_meditation      5.178610
sufficient_income/time_for_passion      7.544465
sufficient_income/flow                  8.530056
sufficient_income/core_circle          10.667866
sufficient_income/todo_completed       10.837376
sufficient_income/achievement          14.973317
sufficient_income/live_vision          16.439136
sufficient_income/places_visited       18.423401
sufficient_income/supporting_others   -12.867239
sufficient_income/personal_awards     -11.094979
sufficient_income/social_network      -11.010085
sufficient_income/lost_vacation        -5.224076
sufficient_income/daily_shouting       -2.028188
dtype: float64


Interpretation: these coefficient ratios say how many units of the corresponding feature (sleep hours, daily meditation, flow, daily shouting...) would have the same effect on daily stress as gender. So we see, for example, that being a woman has roughly the same effect on stress as scoring about 2.6 points higher on the shouting scale 0-10 (0: not at all, 10: very often), and that each sleep hour is comparable to about 2 more points on the meditation scale or 3 more hours dedicated to your passion, etc.

5.7 Next steps

Here are some ideas to improve the model built so far:

  • Better performance of the linear model could be achieved by adding some features. Some ideas are the level of noise people are exposed to, how many hours they spend in nature, or how many negative people respondents have in their social circle (according to their own criteria, which is what matters). On the other hand, a more advanced model could come from a more technical study, where bodily indicators related to stress taken from medical analyses could be used along with noise measurements in decibels and other measurable features.
  • We could apply clustering or classification algorithms, which might perform better than the linear model since we are dealing with discrete categories within intervals instead of continuous variables (a minimal sketch follows below). Additional data could be gathered for this model. For example, new features such as the living area (big city/town/village) might have an impact on stress: it is said that people in cities are in general more stressed, and in fact the noise of modern life is a well-known source of stress. We could also add information about the type of job (physical/sedentary) or the working hours and, regarding the social network, it would be insightful to know what the people we are surrounded by are like, and not only how many they are, which is all this dataset records. We are strongly influenced by our social circle; we could test to what extent this is true by dividing people into clusters or running a classification algorithm, and see, for example, whether stressed or successful people are usually surrounded by other stressed or successful people respectively.
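
As a starting point for the classification idea above, daily_stress can be treated directly as a class label; a minimal sketch, assuming `X` and `y` as defined in section 5 (the extra features mentioned in the bullets are not available in this dataset):

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Treat the 0-5 stress levels as discrete classes instead of a continuous target
clf = RandomForestClassifier(n_estimators=200, random_state=2021)
scores = cross_val_score(clf, X, y.astype(int), cv=5, scoring='accuracy')
print("Mean CV accuracy:", scores.mean())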

6. Statistical Hypothesis Testing for dependencies and correlations between variables

I'll perform tests on some pairs of variables to check whether there is a dependency and, going one step further, whether there is a correlation between them.

$\chi^2$ test to check independence:

I'll calculate the statistic $\chi^2 =\large \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}}$, where:
$O_{i}$ is each of the observed (real) values
$E_{i}$ is each of the expected values, i.e. the count we would expect from the marginal proportions if the two variables were independent.

I'll create a function that shows the crosstab counts of 2 variables, both observed and expected (from the proportion of each category within the variable it belongs to), and calculates the $\chi^2$ statistic:

In [66]:
def independence_test(variable1, variable2):
    
    '''
    Return: DataFrame comparing observed vs expected data, chi2, chi2_matrix (addends, contributions to the value of chi2)
    
    Example: 
    
    >> df, chi2, chi2_matrix = independence_test('feature1', 'feature2')
    '''

    # variable1 in Rows
    # variable2 in Columns

    # Proportions
    proportions = (data[variable2].value_counts()/data[variable2].count()).sort_index()

    # Expected values
    value_counts_var1 = (data[variable1].value_counts()).sort_index()
    expected_df = pd.DataFrame()
    for index,value in zip(proportions.index, proportions.values):
        expected_df[index] = value_counts_var1*value

    # Observed values
    observed_df = pd.crosstab(data[variable1], data[variable2])
    observed_df.index.name = None
    observed_df.columns.name = None

    # Dataframe Observed vs Expected values
    reality = ['Observed','Expected']
    df_concat = pd.concat([observed_df, expected_df], axis=1, keys=reality)
    df_concat = df_concat.sort_index(axis=1, level=1)
    df_concat.columns = df_concat.columns.map(lambda x: x[::-1])
    df_concat = df_concat.sort_index(axis=0)
    print('Index: Values of {} \nColumns: Values of {}'.format(variable1, variable2))
    display(df_concat)

    # Chi2
    print("-"*100)
    print("\nCalculation of chi2:")
    print("\nAddends (contributions to the values of chi2):")
    chi2_matrix = (((expected_df - observed_df)**2)/expected_df).sort_index(axis=1)
    chi2_matrix.index.name = variable1
    chi2_matrix.columns.name = variable2
    display(chi2_matrix)
    chi2 = chi2_matrix.sum(axis=1).sum(axis=0)
    print("chi2:", chi2)
    
    return df_concat, chi2, chi2_matrix
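
As a cross-check, scipy.stats.chi2_contingency computes the same statistic directly from the contingency table and also returns the p-value and degrees of freedom; a small sketch, assuming `data` as above:

In [ ]:
from scipy.stats import chi2_contingency

# Same chi2 as independence_test, plus p-value, degrees of freedom and expected counts
table = pd.crosstab(data['daily_stress'], data['gender'])
chi2_value, p_value, dof, expected = chi2_contingency(table)
print("chi2:", chi2_value, " p-value:", p_value, " dof:", dof)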

t-test to check correlation

Let's perform a significance test to check whether there is a correlation significantly different from 0. We'll use $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. Explanations of this expression and of the significance test for correlations are given in the attached observations at the end of this notebook.

In [67]:
def correlation_test(variable1, variable2):
    
    '''
    Return: t statistic 
    
    '''
    
    r = data.corr()[variable1][variable2]
    n = data.shape[0]
    std_r = np.sqrt((1-r**2)/(n-2))
    print("Correlation coefficient r: ", r)
    print("Standard error of correlation coefficient: ", std_r)

    #t statistic
    t_statistic = (r*np.sqrt(n-2))/np.sqrt(1-r**2)
    print("\nt statistic: ", t_statistic)
    
    return t_statistic
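
The same r and its two-sided p-value can also be obtained with scipy.stats.pearsonr, which is a convenient sanity check for the function above; a minimal sketch, assuming `data` as above:

In [ ]:
from scipy.stats import pearsonr

# pearsonr returns the Pearson correlation coefficient and the two-sided p-value
r, p = pearsonr(data['daily_stress'], data['gender'])
print("r:", r, " p-value:", p)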

$H_{0}$: Stress levels are independent of gender: women are as stressed as men

$H_{1}$: Stress levels do depend on gender

1. $\chi^2$ test to check independence:

I'll perform a $\chi^2$ test. There are $n_{1} = 6$ stress levels and $n_{2} = 2$ possible values for gender, so we have $(n_{1} - 1)*(n_{2} - 1) = 5$ degrees of freedom. The critical $\chi^2$ values for 5 degrees of freedom are:

$\chi^2_{0.01, 5} = 15.087$ for 99% of confidence level
$\chi^2_{0.005, 5} = 16.75$ for 99.5% of confidence level
$\chi^2_{0.001, 5} = 20.515$ for 99.9% of confidence level
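
These critical values can be recovered with scipy.stats.chi2.ppf (the inverse CDF); a quick check:

In [ ]:
from scipy.stats import chi2 as chi2_dist

# Critical values of chi2 with 5 degrees of freedom for each significance level
for alpha in [0.01, 0.005, 0.001]:
    print(alpha, "->", chi2_dist.ppf(1 - alpha, df=5))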

This is done by the function created above, independence_test:

In [68]:
print(gender_translator, "\n")
df_concat_stress_gender, chi2_stress_gender, chi2_matrix_stress_gender = independence_test('daily_stress', 'gender')
{'Female': 0, 'Male': 1} 

Index: Values of daily_stress 
Columns: Values of gender
0 1
Expected Observed Expected Observed
0 339.887730 243 222.112270 319
1 1207.750529 1006 789.249471 991
2 1631.703018 1592 1066.296982 1106
3 2137.905919 2270 1397.094081 1265
4 1403.095257 1448 916.904743 872
5 993.657546 1155 649.342454 488
----------------------------------------------------------------------------------------------------

Calculation of chi2:

Addends (contributions to the values of chi2):
gender 0 1
daily_stress
0 27.618627 42.263457
1 33.701725 51.572130
2 0.966064 1.478321
3 8.161653 12.489385
4 1.437134 2.199177
5 26.197544 40.088843
chi2: 248.17405935784157

The critical $\chi^2$ value for 5 degrees of freedom and a significance level of 0.001 is $\chi^2_{5,0.001} = 20.515$. We obtained $248.174$, so we can reject the null hypothesis "stress level is independent of gender" with (at least) a 99.9% confidence level.

We can accept the alternative hypothesis "stress level depends on gender". This doesn't necessarily imply correlation, although we can see that it does exist. We see the differences between women and men: for low stress levels (0 and 1) there are fewer women ('Observed' values) than expected and more men than expected from the male/female proportion, while for the highest stress level (5) there are more women and fewer men than expected. By eye, women suffer more stress than men, generally speaking. But we have to perform a significance test for the correlation, which we'll do below, to back this up with numbers.

If we look at the $\chi^2$ contribution of each stress level separately, we see that answers of 2 and 4 in "stress level" follow the male/female proportion, so no dependence on gender shows up among these middle values. The differences come from the extreme values: 0, 1 and 5:

In [69]:
from matplotlib import colors
norm=colors.TwoSlopeNorm(vmin=0, vcenter=6.63/2, vmax=11)
sns.heatmap(chi2_matrix_stress_gender, norm=norm, cmap='RdBu_r');

The critical $\chi^2$ value for 1 degree of freedom is $\chi^2_{1, 0.01} = 6.63$ at the 0.01 significance level, so in the heatmap the red cells are the ones pushing the row sum towards 6.63 and thus contributing to that particular stress level being significantly dependent on gender.

In [70]:
df_diff = pd.DataFrame()
for col in df_concat_stress_gender.columns.levels[0]:
    df_diff[col] = (df_concat_stress_gender[(col,'Observed')] - df_concat_stress_gender[(col,'Expected')])

norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
sns.heatmap(df_diff, norm=norm, cmap='RdBu_r', annot=True);

We clearly see a correlation: there are more observed cases than expected for the pairs men (1) - low stress (0, 1, 2), and for the pairs women (0) - high stress (3, 4, 5). The previous heatmap shows where these differences are significant, and below we'll perform a t-test to confirm the significance of the correlation.

In [71]:
# Plots

plt.figure(figsize=(8,4))

plt.subplot(1,2,1)
plt.plot(df_concat_stress_gender[0])
plt.legend(df_concat_stress_gender[0].columns);
plt.title('Female');

plt.subplot(1,2,2)
plt.plot(df_concat_stress_gender[1])
plt.legend(df_concat_stress_gender[1].columns);
plt.title('Male');

2. t-test to check correlation

In [72]:
t_statistic = correlation_test('daily_stress', 'gender')

# p-value for n-2 degrees of freedom
from scipy.stats import t
p_value = t.cdf(t_statistic, n-2)*2 #2 tails into account, to obtain P>|t|
print(p_value)
Correlation coefficient r:  -0.128676852685506
Standard error of correlation coefficient:  0.00878149089059826

t statistic:  -14.653189793007869
3.1680727107627955e-48

Strictly, the residuals should be normally distributed. However, the dependency is guaranteed by the $\chi^2$ test, and looking at the data it is noticeable that this dependency follows a definite direction. Although this t-test is not definitive, everything points to the hypothesis that women suffer more stress than men, and in fact there are other studies on this topic reaching the same conclusion.

$𝐻_0$ : Stress levels are independent of sufficient income

1. $\chi^2$ test to check independence

In [73]:
variable1, variable2 = 'daily_stress', 'sufficient_income'
df_concat_stress_income, chi2_stress_income, chi2_matrix_stress_income = independence_test(variable1, variable2)
Index: Values of daily_stress 
Columns: Values of sufficient_income
1 2
Expected Observed Expected Observed
0 152.760016 111 409.239984 451
1 542.814504 381 1454.185496 1616
2 733.356801 614 1964.643199 2084
3 960.865935 957 2574.134065 2578
4 630.610741 724 1689.389259 1596
5 446.592003 680 1196.407997 963
----------------------------------------------------------------------------------------------------

Calculation of chi2:

Addends (contributions to the values of chi2):
sufficient_income 1 2
daily_stress
0 11.415938 4.261311
1 48.237351 18.005910
2 19.425805 7.251213
3 0.015554 0.005806
4 13.830329 5.162548
5 121.988958 45.535715
chi2: 295.1364376469751
In [74]:
chi2_matrix_stress_income
Out[74]:
sufficient_income 1 2
daily_stress
0 11.415938 4.261311
1 48.237351 18.005910
2 19.425805 7.251213
3 0.015554 0.005806
4 13.830329 5.162548
5 121.988958 45.535715
In [75]:
norm=colors.TwoSlopeNorm(vmin=0, vcenter=6.63/2, vmax=11)
sns.heatmap(chi2_matrix_stress_income, norm=norm, cmap='RdBu_r');

There are especially large divergences between expected and observed values for a stress level of 5, and also for 1. As in the previous case of stress vs gender, the critical $\chi^2$ value for 1 degree of freedom is $\chi^2_{1,0.01} = 6.63$ at the 0.01 significance level, so in the heatmap the red cells are the ones pushing the row sum towards 6.63 and thus contributing to that particular stress level being significantly dependent on income.

In [76]:
df_diff = pd.DataFrame()
for col in df_concat_stress_income.columns.levels[0]:
    df_diff[col] = (df_concat_stress_income[(col,'Observed')] - df_concat_stress_income[(col,'Expected')])

norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
sns.heatmap(df_diff, norm=norm, cmap='RdBu_r', annot=True);

Once again, we see a negative correlation between these two variables. Let's confirm this correlation is significant:

2. t-test to check correlation

In [77]:
t_statistic = correlation_test(variable1, variable2)
p_value = t.cdf(t_statistic, n-2)*2
print("\np-value: ", p_value)
Correlation coefficient r:  -0.14487168163684583
Standard error of correlation coefficient:  0.008761689756615417

t statistic:  -16.534673751425867

p-value:  8.849120787406811e-61

Again, strictly the residuals should be normally distributed. However, the dependency is guaranteed by the $\chi^2$ test, and looking at the data it is noticeable that this dependency follows a definite direction. Although this t-test is not definitive, everything points to the alternative hypothesis: stress and sufficient income are negatively correlated; the less sufficient the income, the more stress. Other studies also support this hypothesis, under the name of "financial stress".

$𝐻_0$ : Stress levels are independent of daily shouting

1. $\chi^2$ test to check independence

In [78]:
variable1, variable2 = 'daily_shouting', 'daily_stress'
output = independence_test(variable1, variable2)
chi2_matrix_stress_shouting = output[2]
df_concat_stress_shouting = output[0]
Index: Values of daily_shouting 
Columns: Values of daily_stress
0 1 2 3 4 5
Expected Observed Expected Observed Expected Observed Expected Observed Expected Observed Expected Observed
0 87.241082 244 310.000784 460 418.819287 436 548.749510 416 360.141121 264 255.048216 160
1 132.668130 142 471.420384 673 636.901450 699 834.487260 806 547.669149 434 387.853626 257
2 93.894316 82 333.642258 379 450.759545 538 590.598589 596 387.606429 354 274.498863 182
3 71.379067 34 253.637005 214 342.670325 370 448.976872 495 294.660917 328 208.675813 179
4 43.840847 12 155.783222 81 210.467268 210 275.760486 387 180.980008 181 128.168169 124
5 44.413642 20 157.818581 95 213.217091 192 279.363387 274 183.344571 221 129.842728 206
6 18.285378 5 64.974912 28 87.782830 68 115.015680 131 75.484124 115 53.457076 68
7 24.013328 4 85.328499 24 115.281066 77 151.044688 171 99.129753 157 70.202666 112
8 13.086162 2 46.500118 14 62.822893 27 82.312426 105 54.021168 76 38.257232 73
9 4.890788 2 17.378832 6 23.479263 13 30.763230 25 20.189730 33 14.298158 32
10 28.287260 15 100.515406 23 135.798981 68 177.927871 129 116.773030 157 82.697452 250
----------------------------------------------------------------------------------------------------

Calculation of chi2:

Addends (contributions to the values of chi2):
daily_stress 0 1 2 3 4 5
daily_shouting
0 281.671867 72.579703 0.704783 32.113801 25.665259 35.421394
1 0.656403 86.195555 6.054673 0.972482 23.592119 44.147251
2 1.506745 6.166260 16.884605 0.049399 2.913760 31.169673
3 19.574291 6.194255 2.179678 4.717678 3.772114 4.220201
4 23.125455 35.899439 0.001037 44.873106 0.000002 0.135553
5 13.419884 25.004496 2.111299 0.102970 7.733697 44.668886
6 9.652591 21.041107 4.458279 2.221423 20.686528 3.956383
7 16.679625 44.078881 12.711888 2.636402 33.783656 24.885339
8 9.391829 22.715161 20.426943 6.253321 8.942218 31.551156
9 1.708652 7.450317 4.677104 1.079692 8.128045 21.915776
10 6.241371 59.778281 33.849310 13.454534 13.857730 338.464389
chi2: 1712.8777027227065
In [79]:
norm=colors.TwoSlopeNorm(vmin=0, vcenter=15/5, vmax=15)
sns.heatmap(chi2_matrix_stress_shouting, norm=norm, cmap='RdBu_r');

Once again, the blue cells are the ones not pushing the row sum towards 15, which is the critical $\chi^2$ value for 5 degrees of freedom and a significance level of 0.01 (99% confidence level), so we can interpret them as not being so unexpected. The red cells correspond to pairs of categories with more observed cases than would be "normal" under the proportions' distribution. Let's see the differences between the observed and expected values:

In [80]:
# Differences observed-expected

df_diff = pd.DataFrame()
for col in df_concat_stress_shouting.columns.levels[0]:
    df_diff[col] = (df_concat_stress_shouting[(col,'Observed')] - df_concat_stress_shouting[(col,'Expected')])

norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
print("\nHeatmap differences data observed - data expected")
sns.heatmap(df_diff, norm=norm, cmap="RdBu_r", annot=True);
Heatmap differences data observed - data expected

Here the correlation is easily recognizable: there are more cases of low stress among people with low shouting scores than expected from the proportions' distribution (positive observed-expected differences), and more cases of high stress among people with high shouting scores than expected. On the other hand, there are negative differences (blues) for the crossed categories (low stress and high shouting, high stress and low shouting), indicating fewer highly stressed people among those who don't shout much, and also fewer relatively unstressed people among those who shout a lot. In general, the more shouting, the more stress.

2. t-test to check correlation

In [81]:
t_statistic = correlation_test(variable1, variable2)
p_value = (1 - t.cdf(t_statistic, n-2))*2 
print("\n p-value: ", p_value)
Correlation coefficient r:  0.30926429602231414
Standard error of correlation coefficient:  0.008420995496599783

t statistic:  36.72538432625792

 p-value:  0.0

Again, strictly the residuals should be normally distributed, but everything points to the alternative hypothesis: stress and daily shouting are positively correlated; the more shouting, the more stress.

Attached:

Below is a series of observations and notes to myself about ideas and questions that arose and were resolved during this study. Take a look if you find it useful for a better understanding of the approaches carried out in this project.

1. Performing sm.OLS we obtain coefficients different from those of LinearRegression:

import statsmodels.api as sm
lr = sm.OLS(y_train, X_train)
result = lr.fit()
result.summary2()

we see coefficients that differ from the ones given by LinearRegression. Moreover, there are features with a weak correlation with daily_stress and, nevertheless, a high t, meaning a low probability of the corresponding coefficient being 0 (for example, social_network has t=10.1 while its correlation with daily_stress is 0.012720, so they are barely correlated). What's happening? This is because sm.OLS doesn't include the intercept by default, so in the least-squares fit the function it minimizes is:

$f = \sum (y_{i}^{pred}-y_{i})^2 = \sum (\beta_{1} x_{i} - y_{i})^2$, where $\beta_{1}, x_{i}$ can be vectors. The thing is that the intercept $\beta_{0}$ is not included.

This is $\sum (\beta_{0} + \beta_{1} x_{i} - y_{i})^2$ with $\beta_{0} = 0$. Thus, the resulting regression line passes through the origin (0,0) and the slope differs from the one obtained with an intercept. Adding X = sm.add_constant(X) before lr = sm.OLS(y, X), the variable X will include a column of ones, and the corresponding $\beta$ coefficient will be $\beta_{0}$, as it multiplies that column of ones.

X_train_ones = sm.add_constant(X_train)
ols = sm.OLS(y_train, X_train_ones)
ols = ols.fit()

Doing this, we obtain the intercept, and the result is the same as fitting LinearRegression with our original X (without the column of ones, since LinearRegression already accounts for the intercept via its default argument fit_intercept=True).
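
A quick numerical check of this equivalence (assuming X_train and y_train as in section 5):

In [ ]:
# statsmodels OLS with an explicit constant column...
ols_with_const = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# ...gives the same intercept and coefficients as sklearn's LinearRegression
sklearn_lr = LinearRegression().fit(X_train, y_train)

print(np.allclose(ols_with_const.params['const'], sklearn_lr.intercept_))
print(np.allclose(ols_with_const.params.drop('const').values, sklearn_lr.coef_))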

 

2. Distribution of coefficients from random samples: $N(0,\sigma)$

In [82]:
x = np.linspace(0,100,1000).reshape(-1,1)
y = np.random.rand(1000)*1000000

coeficientes = np.array([])
lin = LinearRegression()
for i in range(1000):
    y = np.random.rand(1000)*100
    lin.fit(x,y)
    coeficientes = np.hstack([coeficientes, lin.coef_])
In [83]:
plt.hist(coeficientes);
In [84]:
from scipy.stats import normaltest
normaltest(coeficientes)
Out[84]:
NormaltestResult(statistic=0.012647543546749278, pvalue=0.9936961811898626)

For random samples, the mean is the best model, the one which minimizes the squared errors. Here we can see that linear regression on random samples gives a coefficient of 0 with the highest probability, with the coefficients following a normal distribution: $coef \sim N(0,\sigma)$

3. Hypothesis tests about stress level and gender

We performed the analysis by stress level: given a particular stress level, how many women and men answered. Let's now perform some hypothesis tests by gender. Given the male answers, are men less stressed than the population on average? Are women more stressed than the population on average?

H0: Men are as stressed as the average of the population

In [85]:
df_concat_stress_gender
Out[85]:
0 1
Expected Observed Expected Observed
0 339.887730 243 222.112270 319
1 1207.750529 1006 789.249471 991
2 1631.703018 1592 1066.296982 1106
3 2137.905919 2270 1397.094081 1265
4 1403.095257 1448 916.904743 872
5 993.657546 1155 649.342454 488
In [ ]:
real_df = df_concat_stress_gender[[(0,'Observed'),(1,'Observed')]]
expected_df = df_concat_stress_gender[[(0,'Expected'),(1,'Expected')]]
real_df.columns = real_df.columns.droplevel(1)
expected_df.columns = expected_df.columns.droplevel(1)

chi2 = sum((real_df[1] - expected_df[1])**2/expected_df[1])
chi2

Rejected: the statistic (the sum of the male column of the $\chi^2$ matrix above, about 150) far exceeds the critical value $\chi^2_{5, 0.01} = 15.09$.

H0: Women are as stressed as the average of the population

In [ ]:
sum((real_df[0] - expected_df[0])**2/expected_df[0])

Rejected: the statistic (about 98) also far exceeds $\chi^2_{5, 0.01} = 15.09$.

4. Testing the significance of correlation coefficient

We saw that the categories of daily stress and gender are not independent. Now we have to show that, beyond this, there is a correlation, with women being more stressed than men. The fact that they are dependent says nothing by itself about correlation.

Regarding correlation, we obtained a value of data.corr()['daily_stress']['gender'] = -0.12867. Now we have to check whether this is significantly different from zero, to conclude that there is not only a dependence but also a (negative) tendency. To obtain the significance, the test we have to perform is a t-test, where $H_{0}: \beta_1=0$ (with $\beta_1$ the slope) and we estimate the standard deviation of the population slopes by the standard deviation of the slope obtained from our sample.

Standard error of the slope in a linear regression:

$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum(x_i - \overline{x})^2} = \frac{\sigma^2}{S_{xx}} \implies Std(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{xx}}}$

(the step-by-step derivation was shown here as an image)

Note that the variance of the slope estimator comes from the variance of the residuals, $\sigma$, assuming the residuals are normally distributed. The meaning of the slope estimator's variance is not so much the uncertainty about the value of the slope that minimizes the cost function, but rather the small differences in the slope value we would need to predict each individual data point. So we have a single overall $\beta_1$, the one used for all predictions, and the variance says that a slightly (or very) different one would be needed to obtain each $y_i$ as $y_i = \beta_1\, x_i$.

The variance of $\beta_1$ can be interpreted as absorbing the residuals $\mathbf{\epsilon \sim N(0,\sigma)}$ into $\beta_1$, leaving only $\mathbf{y = \beta_1\, x}$, with $\beta_1$ having a variance, instead of $y = \beta_1\, x + \epsilon$.

Calculation of t statistic

$\small \sigma^2 = MSE$, so:

$Std(\beta_1) = \large \frac{\sqrt{MSE}}{\sqrt{S_{xx}}} \hspace{.5cm}\normalsize [0]$

 

The calculated value of the statistic t is:

$t =\Large \frac{\beta_1 - 0}{Std(\beta_1)} \hspace{.5cm}\normalsize [1]$, where

 

$\beta_1 =\frac{\sum(x_i - \overline{x})(y_i - \overline{y})}{\sum(x_i - \overline{x})^2} \hspace{.5cm}\normalsize [2]$

is the expression for $\beta_1$ obtained by minimum squares and $Std(\beta_1)$ is given by [0]

We can also obtain an expression of t from the Pearson correlation coefficient, r. We need the relation between $\beta_1$ and r:

$\beta_1 = r\cdot\sqrt{\frac{S_{yy}}{S_{xx}}} \hspace{.5cm}\normalsize [3]$

which is derived from [2] and the expression for the Pearson correlation coefficient, $r=\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$. Taking [3] into account and proceeding as in the derivation of the standard error of the slope above, we obtain $std(r) = std(\beta_1)\cdot\frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}} = ... = \sqrt{\frac{1-r^2}{N-2}} \hspace{.5cm}\normalsize [4]$ (where we use $r^2 = 1 - \frac{SSE}{SST}$), and finally:

$\large t = \frac{r - 0}{std(r)} = \frac{r}{\sqrt{\frac{1-r^2}{N-2}}} = ... = \frac{r·\sqrt{N-2}}{\sqrt{1-r^2}} \hspace{.5cm}\normalsize [5]$
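
A quick numerical check of [2], [3] and [4] on our data (gender as x, daily_stress as the response), assuming `data` as above:

In [ ]:
x_ = data['gender']
y_ = data['daily_stress']
Sxx = ((x_ - x_.mean())**2).sum()
Syy = ((y_ - y_.mean())**2).sum()
Sxy = ((x_ - x_.mean())*(y_ - y_.mean())).sum()

r = Sxy/np.sqrt(Sxx*Syy)                      # Pearson r
print("beta1 from least squares, Eq.[2]:", Sxy/Sxx)
print("beta1 from r, Eq.[3]            :", r*np.sqrt(Syy/Sxx))
print("std(r), Eq.[4]                  :", np.sqrt((1 - r**2)/(len(x_) - 2)))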

Comparison with the values returned by sm.OLS(y, X).fit().summary2()

Let's check that the values of t obtained by hand from [1] or from [5], as well as the value of $Std(\beta_1)$ from [0], are the ones shown automatically by the summary2() method of the sm.OLS object, the OLS linear regressor of statsmodels.api:

In [ ]:
# t with Eq.[5]

r = data.corr(method='pearson')['daily_stress']['gender']
N = data.shape[0]
std_r = np.sqrt((1-r**2)/(N-2)) #Equation [4]
t = (r-0)/std_r #Equation [5]
print(t)
In [ ]:
# t with Eq.[1]
# (uses `result` and `std_b1`, which are computed in the two cells below)

t = result.params['gender'] / std_b1 #Equation [1]
print(t)
In [ ]:
import statsmodels.api as sm
y = data['daily_stress']
X = sm.add_constant(data['gender'])
lr = sm.OLS(y, X)
result = lr.fit()
In [ ]:
rmse = np.sqrt(mean_squared_error(y, result.predict(sm.add_constant(X['gender']))))
Sxx = ((X['gender'] - X['gender'].mean())**2).sum()

std_b1 = rmse/np.sqrt(Sxx) #Equation [0]
print(std_b1)

$t = -14.653189$ and $std\_b1 = 0.02465$ match the t and std.err given by result.summary2():

In [ ]:
result.summary2()

Note and discussion: Residuals are not normally distributed

The expression for the slope estimator's variance comes from assuming normally distributed residuals. However, this is not our case (I tried transforming the data, but couldn't obtain normally distributed residuals in any case), so we should be prudent with the conclusions reached here. Nevertheless, there are other studies affirming that women suffer more stress than men, as well as studies about the so-called "financial stress" that relate stress to low income. Besides, in our study we can easily see the lower-than-expected number of observed cases of women with low stress levels and the higher-than-expected number of women with high stress levels, and since the dependency between the categories is guaranteed by the $\chi^2$ independence test (which is not affected by the distribution of residuals, as it has nothing to do with descriptive trend models and only deals with unordered categories), it is very likely a real correlation.

5. Pearson correlation coefficient vs linear regression coefficient

The Pearson correlation coefficient (r) is an indirect measure of the t-statistic, which determines whether the linear regression coefficient ($\beta_1$) is 0 (t~0) or not (|t|>>0). The relation between the two is $t = \frac{r\cdot\sqrt{N-2}}{\sqrt{1-r^2}}$: as $r \rightarrow 0$, $t \rightarrow 0$, and as $|r| \rightarrow 1$, $|t| \rightarrow \infty$. The magnitude of r tells how well the data fit a straight line; the larger |r|, the less $\beta_1$ varies around its estimated value ($\Delta \beta_1 \sim 0$) and thus the less likely it is to be 0 (strictly, the less likely it is that, if it were 0 in reality, we would have obtained data so well aligned and with so little variance).

Coefficients VS Correlations (Pearson)

In [55]:
df_coef = pd.DataFrame(list(zip(X.columns, OLSresults.params[1:])), columns=['Feature', 'Coefficient LR'])
df_coef['Pearson Correlation r'] = data.corr()['daily_stress'].drop('daily_stress').values
df_coef.sort_values(by='Coefficient LR')
Out[55]:
Feature Coefficient LR Pearson Correlation r
19 gender -0.331477 -0.128677
14 sufficient_income -0.256305 -0.144872
11 sleep_hours -0.095864 -0.152862
17 daily_meditation -0.048766 -0.213672
16 time_for_passion -0.032759 -0.161858
8 flow -0.029573 -0.142187
2 core_circle -0.023396 -0.115973
7 todo_completed -0.023260 -0.166975
0 fruits_veggies -0.019387 -0.095132
5 achievement -0.016670 -0.120786
10 live_vision -0.015525 -0.135016
1 places_visited -0.012374 -0.131707
6 donation -0.000887 -0.038291
9 daily_steps 0.004181 -0.057925
3 supporting_others 0.019178 -0.035373
18 age 0.020990 -0.020649
15 personal_awards 0.022434 -0.044362
4 social_network 0.023383 0.012720
12 lost_vacation 0.048679 0.192353
13 daily_shouting 0.125297 0.309264

Linear regression coefficients do not follow the same ascending/descending order as the Pearson correlation coefficients (r). As we mentioned above, Pearson correlation coefficients tell how well a linear model fits the data, but not how much the variable influences stress. What tells us which features influence stress the most is the magnitude of the linear regression coefficients.
