Wellbeing and lifestyle data

1. Description and objectives

This dataset from Kaggle.com (https://www.kaggle.com/ydalat/lifestyle-and-wellbeing-data) contains 12,757 survey responses with 23 attributes describing how we live our lives. How can we reinvent our lifestyles to optimize our individual wellbeing while supporting the UN Sustainable Development Goals?

The Work-Life Balance survey evaluates how we thrive in both our professional and personal lives: it reflects how well we shape our lifestyle, habits and behaviors to maximize our overall life satisfaction along the following five dimensions:

  1. Healthy body, reflecting your fitness and healthy habits;
  2. Healthy mind, indicating how well you embrace positive emotions;
  3. Expertise, measuring the ability to grow your expertise and achieve something unique;
  4. Connection, assessing the strength of your social network and your inclination to discover the world;
  5. Meaning, evaluating your compassion, generosity and how much 'you are living the life of your dream'.

Description

This is a description of the attributes in the dataset:

In [1]:
%run wellbeing_description.py
Variable Description Values Type of attribute
0 Timestamp Date of the survey 7/7/15 - 24/2/2020 Ordinal, object
1 Fruits_Veggies How many fruits or veggies do you eat everyday? 0 - 5 Numerical, int 64
2 Daily_stress How much stress do you typically experience every day? On average, over 12 months 0 - 5 Numerical, object
3 Places_visited How many places do you visit? Over a period of 12 months. Include new states, new cities, etc. 0 - 10 Numerical, int 64
4 Core_circle How many people are very close to you? i.e. close family and friends 0 - 10 Numerical, int 64
5 Supporting others How many people do you help achieve a better life? A reflection of your altruism. Over a period of 12 months. 0 - 10 Numerical, int 64
6 Social_network How many people do you interact with during a typical day? At home, at work... Average of workdays and weekends 0 - 10 Numerical, int 64
7 Achievement How many remarkable achievements are you proud of? Over the last 12 months, such as running a marathon, success at work, opening a new business, etc. 0 - 10 Numerical, int 64
8 Donation How many times do you donate your time or money to good causes? Over a period of 12 months 0 - 5 Numerical, int 64
9 BMI_Range What is your body mass index range? Your body mass in kg divided by the square of your height in meters. 1 - 2 Numerical, int 64
10 Todo_completed How well do you complete your weekly to-do list? Include work- and personal-related tasks. On a scale of 0 = not at all to 10 = very well 0 - 10 Numerical, int 64
11 Flow How many hours do you experience 'flow'? 'Flow' is defined as the mental state, in which you are fully immersed in performing an activity 0 - 10 Numerical, int 64
12 Daily_steps How many steps (in thousands) do you typically walk everyday? Thousand steps, daily average over multiple days including work days and week-end 1 - 10 Numerical, int 64
13 Live_vision For how many years ahead is your life vision very clear for? For instance, illustrated in a vision board or discussed with a friend 0 - 10 Numerical, int 64
14 Sleep_hours About how long do you typically sleep? Over the course of a typical working week, including week-end. 1 - 10 Numerical, int 64
15 Lost_vacation How many days of vacation do you typically lose every year? Unused vacation days, lost or carried forward into the following year, etc. 0 - 10 Numerical, int 64
16 Daily_shouting How often do you shout or sulk at somebody? In a typical week. Expressing your negative emotions in an active or passive manner. 0 - 10 Numerical, int 64
17 Sufficient_income How sufficient is your income to cover basic life expenses? Such as the costs of housing, food, health care, car and education. 1 - 2 Numerical, int 64
18 Personal_awards How many recognitions have you received in your life? E.g. diploma, degree, certificate, accreditation, award, prize, etc. 0 - 10 Numerical, int 64
19 Time_for_passion How many hours do you spend everyday doing what you are passionate about? Daily hours spent doing what you are passionate and dreaming about 0 - 10 Numerical, int 64
20 Daily_meditation How many times during the day do you think about yourself in a typical week? Include meditation, praying and relaxation activities such as fitness, walking in a park or lunch breaks. 0 - 10 Numerical, int 64
21 Age Age groups 21 to 35 - Less than 20 Ratio Scaled, object
22 Gender Male or female Female - Male Nominal, object

Objectives

We are looking for findings that might help us succeed and improve our quality of life. In particular, we'll try to answer these questions:

  1. Achievements: What indicators have the greatest influence on our achievements and how well can we predict them?
  2. Stress levels: What indicators have the greatest influence on our stress levels and how well can we predict them? For this variable we will fit some linear models with interpretation as the main goal: we are interested in the largest coefficients in the predictive model, as well as in their statistical significance, so that we can identify the features with the most influence on our daily stress and take action to hopefully reduce it.

From the conclusions we reach, we can modify some habits to maximize our achievements and reduce our stress levels.

2. Exploratory Data Analysis

This is the general guideline followed for the EDA in this project:

2.1 Data Exploration: Exploration of the values and data types.

2.2 Data cleaning: Clean the data, handling possible wrong values, NaN and outliers, and changing the data types to numeric (integers) when possible to perform numerical data analysis.

2.3 Feature engineering: Create encodings for the categorical variables, such as an ordinal encoding for age and a binary encoding / dummy variables for gender

2.4 Data statistics: See statistical measures with the describe method: mean, standard deviation, max, min and quartiles. Plot histograms and boxplots for all the variables

2.5 Correlations between variables: Look for correlations between variables

2.6 Summary of the EDA: Key Findings and insights

2.1 Data Exploration

Initial plan:

We are going to explore the raw dataset. I'll check the values and data types, looking for possible wrong values, NaNs or missing values.

From the description, we saw that the variable daily_stress is numerical, but it is of type object. There might be some values which are not numbers, which is why the data type of the variable is "object". We'll see it in the exploration of unique values with the unique method of pd.Series. Next, during the data cleaning (section 2.2), I'll handle these wrong values, either dropping the entire rows if there are not too many of them, or imputing them from the variable mean.

The variables age and gender are objects. I'll see the categories in which age is divided, and next, during the feature engineering (section 2.3), I'll perform an ordinal encoding to be able to plot a histogram of age while performing data statistics (section 2.4). With respect to gender, a binary encoding can be done with a view to using it in a future predictive model.

Once I have all the variables cleaned and appropriately treated in the case of categorical features, I am ready to plot histograms and boxplots to search for possible outliers that we'll handle. Finally, I'll look for the highest positive and negative correlations, especially between daily stress and the other variables, and between achievement and the other variables. In this last section, and looking at the description of the variables, we can expect negative correlations with daily stress in general and positive correlations with achievement.

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import OrdinalEncoder

import math
In [3]:
data = pd.read_csv('Wellbeing_and_lifestyle_data.csv')
In [4]:
data.head()
Out[4]:
Timestamp FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE SUPPORTING_OTHERS SOCIAL_NETWORK ACHIEVEMENT DONATION BMI_RANGE ... LIVE_VISION SLEEP_HOURS LOST_VACATION DAILY_SHOUTING SUFFICIENT_INCOME PERSONAL_AWARDS TIME_FOR_PASSION DAILY_MEDITATION AGE GENDER
0 7/7/15 3 2 2 5 0 5 2 0 1 ... 0 7 5 5 1 4 0 5 36 to 50 Female
1 7/7/15 2 3 4 3 8 10 5 2 2 ... 5 8 2 2 2 3 2 6 36 to 50 Female
2 7/7/15 2 3 3 4 4 10 3 2 2 ... 5 8 10 2 2 4 8 3 36 to 50 Female
3 7/7/15 3 3 10 3 10 7 2 5 2 ... 0 5 7 5 1 5 2 0 51 or more Female
4 7/7/15 5 1 3 3 10 4 2 4 2 ... 0 7 0 0 2 8 1 5 51 or more Female

5 rows × 23 columns

In [5]:
# Is there any NaN?
data.isnull().values.any()
Out[5]:
False
In [6]:
#Number of rows and columns
print("Number of data points: {}".format(data.shape[0]), "\nNumber of variables: {}".format(data.shape[1]))

#Column names, data types and counts
data_types = data.dtypes
data_counts = data.count()
df = pd.concat([data_types, data_counts], axis=1).reset_index()
df = df.rename(columns={'index': 'Column names' , 0: 'Types', 1: 'Counts'})
print("\nData types:\n{}".format(df))
Number of data points: 12756 
Number of variables: 23

Data types:
         Column names   Types  Counts
0           Timestamp  object   12756
1      FRUITS_VEGGIES   int64   12756
2        DAILY_STRESS  object   12756
3      PLACES_VISITED   int64   12756
4         CORE_CIRCLE   int64   12756
5   SUPPORTING_OTHERS   int64   12756
6      SOCIAL_NETWORK   int64   12756
7         ACHIEVEMENT   int64   12756
8            DONATION   int64   12756
9           BMI_RANGE   int64   12756
10     TODO_COMPLETED   int64   12756
11               FLOW   int64   12756
12        DAILY_STEPS   int64   12756
13        LIVE_VISION   int64   12756
14        SLEEP_HOURS   int64   12756
15      LOST_VACATION   int64   12756
16     DAILY_SHOUTING   int64   12756
17  SUFFICIENT_INCOME   int64   12756
18    PERSONAL_AWARDS   int64   12756
19   TIME_FOR_PASSION   int64   12756
20   DAILY_MEDITATION   int64   12756
21                AGE  object   12756
22             GENDER  object   12756

It is a dataset with 12756 data points (rows) and 23 variables (columns). There are no missing values or NaNs.

In [7]:
#Let's change the column names to lowercase
data.columns = pd.Index(data.columns.str.lower())
data.columns
Out[7]:
Index(['timestamp', 'fruits_veggies', 'daily_stress', 'places_visited',
       'core_circle', 'supporting_others', 'social_network', 'achievement',
       'donation', 'bmi_range', 'todo_completed', 'flow', 'daily_steps',
       'live_vision', 'sleep_hours', 'lost_vacation', 'daily_shouting',
       'sufficient_income', 'personal_awards', 'time_for_passion',
       'daily_meditation', 'age', 'gender'],
      dtype='object')

Unique values

In [8]:
for col in data.columns:
    print(col, data[col].unique())
timestamp ['7/7/15' '7/8/15' '7/9/15' ... '2/24/2020 10:35:02' '2/24/2020 12:54:10'
 '2/24/2020 12:55:21']
fruits_veggies [3 2 5 4 1 0]
daily_stress ['2' '3' '1' '4' '5' '0' '1/1/00']
places_visited [ 2  4  3 10  5  6  7  0  8  1  9]
core_circle [ 5  3  4  9  6  7  8 10  2  1  0]
supporting_others [ 0  8  4 10  5  3  1  2  6  7  9]
social_network [ 5 10  7  4  3  1  2  8  6  9  0]
achievement [ 2  5  3  4  0  1  6 10  8  7  9]
donation [0 2 5 4 3 1]
bmi_range [1 2]
todo_completed [ 6  5  2  3  8 10  7  4  1  0  9]
flow [ 4  2  5  0  1  8  7  6  3 10  9]
daily_steps [ 5  4  7  8  1  3  6  2 10  9]
live_vision [ 0  5 10  4  2  1  6  3  8  9  7]
sleep_hours [ 7  8  5  6 10  9  4  3  2  1]
lost_vacation [ 5  2 10  7  0  3  1  4  8  6  9]
daily_shouting [ 5  2  0  3  1  7  6  4 10  8  9]
sufficient_income [1 2]
personal_awards [ 4  3  5  8 10  1  2  7  6  0  9]
time_for_passion [ 0  2  8  1  3  6  5  4 10  9  7]
daily_meditation [ 5  6  3  0 10  2  1  7  4  8  9]
age ['36 to 50' '51 or more' '21 to 35' 'Less than 20']
gender ['Female' 'Male']
In [9]:
(data['daily_stress']=='1/1/00').sum()
Out[9]:
1

Daily stress: There is a wrong value for daily stress (1/1/00), but as it is only one, I'll just drop the entire row (next section)
Age: There are 4 categories of age. In the section "feature engineering" we'll perform the ordinal encoding.

2.2 Data cleaning

As there are no nulls, missing or strange values except the one in daily stress, the data cleaning will consist only of eliminating the row corresponding to the value 1/1/00 of daily_stress and changing the column to type int64.

NOTE: The visualizations of the data (histograms and boxplots) will be done in the data statistics section (2.4), once we have performed an initial cleaning and encoding; a later cleaning will be done for outliers, if required.

Cleaning variables: daily stress

There is only one strange value of daily_stress, so we can just drop the entire row from the dataset. Then I am going to change the data type from object to integer:

In [10]:
data = data.drop(data[data['daily_stress']=='1/1/00'].index)
data.shape
Out[10]:
(12755, 23)
In [11]:
# Changing data type from object to int64
data['daily_stress'] = data['daily_stress'].astype('int64')

2.3 Feature engineering

In this section we have to handle 2 categorical variables:

  • Age: We are going to perform an ordinal encoding, changing the categories to ordered integers

  • Gender: We are going to perform a binary encoding, mapping the two categories to 0 and 1

Processing categorical variables: age

We are going to perform an ordinal encoding, changing the categories to ordered integers:

In [12]:
age_categories = ['Less than 20','21 to 35','36 to 50','51 or more'] #data['age'].unique(), but in the order we want

enc = OrdinalEncoder(categories=[age_categories])
enc.fit(np.array(data['age']).reshape(-1,1)) #Alternatively: method fit_transform = method fit + method transform
                                #An array of shape (n_samples, n_features) should be passed to fit --> reshape(-1,1)
enc.categories
Out[12]:
[['Less than 20', '21 to 35', '36 to 50', '51 or more']]
In [13]:
age_encoded = enc.transform(np.array(data['age']).reshape(-1,1))
age_encoded
Out[13]:
array([[2.],
       [2.],
       [2.],
       ...,
       [2.],
       [1.],
       [3.]])
In [14]:
np.unique(age_encoded)
Out[14]:
array([0., 1., 2., 3.])
In [15]:
age_original = enc.inverse_transform(age_encoded)
age_original
Out[15]:
array([['36 to 50'],
       ['36 to 50'],
       ['36 to 50'],
       ...,
       ['36 to 50'],
       ['21 to 35'],
       ['51 or more']], dtype=object)
In [16]:
# Verification
enc.inverse_transform([[code] for code in [0,1,2,3]])
Out[16]:
array([['Less than 20'],
       ['21 to 35'],
       ['36 to 50'],
       ['51 or more']], dtype=object)
In [17]:
data['age'] = age_encoded

Processing categorical variables: gender

We are going to perform a binary encoding, mapping Female to 0 and Male to 1:

In [18]:
data['gender'].unique()
Out[18]:
array(['Female', 'Male'], dtype=object)
In [19]:
pd.get_dummies(data['gender'], drop_first=True)
Out[19]:
Male
0 0
1 0
2 0
3 0
4 0
... ...
12751 0
12752 0
12753 0
12754 0
12755 0

12755 rows × 1 columns

In [20]:
gender_translator = {'Female': 0, 'Male': 1}
print(gender_translator)
{'Female': 0, 'Male': 1}
In [21]:
data['gender'] = pd.get_dummies(data['gender'], drop_first=True)
d = {'Male': 'Gender (Female:0, Male: 1)'}
data.rename(columns=d)
Out[21]:
timestamp fruits_veggies daily_stress places_visited core_circle supporting_others social_network achievement donation bmi_range ... live_vision sleep_hours lost_vacation daily_shouting sufficient_income personal_awards time_for_passion daily_meditation age gender
0 7/7/15 3 2 2 5 0 5 2 0 1 ... 0 7 5 5 1 4 0 5 2.0 0
1 7/7/15 2 3 4 3 8 10 5 2 2 ... 5 8 2 2 2 3 2 6 2.0 0
2 7/7/15 2 3 3 4 4 10 3 2 2 ... 5 8 10 2 2 4 8 3 2.0 0
3 7/7/15 3 3 10 3 10 7 2 5 2 ... 0 5 7 5 1 5 2 0 3.0 0
4 7/7/15 5 1 3 3 10 4 2 4 2 ... 0 7 0 0 2 8 1 5 3.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12751 2/23/2020 22:03:56 3 4 10 8 10 8 6 5 1 ... 3 7 0 1 1 10 6 7 1.0 0
12752 2/24/2020 2:44:30 3 3 6 5 2 5 1 0 2 ... 0 7 0 0 2 3 0 2 2.0 0
12753 2/24/2020 10:35:02 4 4 7 5 3 3 4 2 1 ... 5 6 0 2 2 6 3 5 2.0 0
12754 2/24/2020 12:54:10 3 3 10 4 8 10 3 3 1 ... 1 6 0 1 1 10 1 10 1.0 0
12755 2/24/2020 12:55:21 2 2 6 1 6 1 5 2 1 ... 0 7 0 1 2 3 2 10 3.0 0

12755 rows × 23 columns

2.4 Data statistics and visualizations

Description

In [22]:
description = data.describe()
description.loc["range"] = description.loc['max'] - description.loc['min']
out_fields = ["mean", "25%", "50%", "75%", "range"]
description = description.loc[out_fields]
description.transpose()
Out[22]:
mean 25% 50% 75% range
fruits_veggies 2.930223 2.0 3.0 4.0 5.0
daily_stress 2.782673 2.0 3.0 4.0 5.0
places_visited 5.339240 3.0 5.0 8.0 10.0
core_circle 5.484908 3.0 5.0 8.0 10.0
supporting_others 5.577421 3.0 5.0 10.0 10.0
social_network 6.551156 4.0 6.0 10.0 10.0
achievement 3.963230 2.0 3.0 6.0 10.0
donation 2.700745 1.0 3.0 5.0 5.0
bmi_range 1.400706 1.0 1.0 2.0 1.0
todo_completed 5.706233 4.0 6.0 8.0 10.0
flow 3.126382 1.0 3.0 4.0 10.0
daily_steps 5.704822 3.0 5.0 8.0 9.0
live_vision 3.712034 1.0 3.0 5.0 10.0
sleep_hours 7.035202 6.0 7.0 8.0 9.0
lost_vacation 2.832536 0.0 0.0 5.0 10.0
daily_shouting 2.920659 1.0 2.0 4.0 10.0
sufficient_income 1.728185 1.0 2.0 2.0 1.0
personal_awards 5.702626 3.0 5.0 9.0 10.0
time_for_passion 3.266484 1.0 2.0 5.0 10.0
daily_meditation 6.253156 4.0 7.0 10.0 10.0
age 1.581968 1.0 1.0 2.0 3.0
gender 0.395218 0.0 0.0 1.0 1.0

Boxplots

In [23]:
# Boxplots

fig, axes = plt.subplots(figsize=(16,5))

data.boxplot()

for tick in axes.get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right') #Horizontal alignment

plt.show()
In [24]:
n_outliers = data['daily_shouting'][data['daily_shouting']>8].shape[0]
p = n_outliers/data.shape[0]
print("{} values above 8 (beyond Q3 + 1.5·IQR) for daily shouting \n{}% of data".format(n_outliers, round(p*100,2)))
753 values above 8 (beyond Q3 + 1.5·IQR) for daily shouting 
5.9% of data
In [25]:
n_outliers = data['flow'][data['flow']>8].shape[0]
p = n_outliers/data.shape[0]
print("{} values above 8 hours (beyond Q3 + 1.5·IQR) of daily flow \n{}% of data".format(n_outliers, round(p*100,2)))
374 values above 8 hours (beyond Q3 + 1.5·IQR) of daily flow 
2.93% of data
In [26]:
n_outliers = data['sleep_hours'][data['sleep_hours']<3].shape[0]
p = n_outliers/data.shape[0]
print("{} values below 3 hours (beyond Q1 - 1.5·IQR) of daily sleep \n{}% of data".format(n_outliers, round(p*100,2)))
32 values below 3 hours (beyond Q1 - 1.5·IQR) of daily sleep 
0.25% of data

We decide to keep the outliers, as they might be meaningful and represent the reality of the population. Besides, they are not so extreme as to make us think of them as errors.
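The cutoffs used above (8 and 3) were read off the describe output and boxplots by hand. A small sketch, assuming the cleaned `data` DataFrame from section 2.2, that computes the Tukey fences Q1 - 1.5·IQR and Q3 + 1.5·IQR for every numeric column instead of hard-coding them:

# Sketch: share of points outside the Tukey fences, per numeric column
numeric = data.select_dtypes(include='number')
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr

# Percentage of rows falling outside the fences for each variable
outlier_share = ((numeric.lt(lower) | numeric.gt(upper)).mean() * 100).round(2)
print(outlier_share.sort_values(ascending=False))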

Histograms

In [27]:
# Histograms numerical variables

axs = data.hist(figsize=(16, 16))

for ax in axs.flatten():
    if ax.is_first_col():
        ax.set_ylabel('Frequency')
        
plt.suptitle('Histograms', y=0.95, fontsize=20)
plt.show()

2.5 Correlations between variables

Correlation coefficients tell us how well two variables are related by a straight line. However, they don't give any information about how much one influences the other. For that purpose we would have to fit a linear regression model, which we'll do later. For now, let's look at the (Pearson) correlations.
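For reference, the Pearson coefficient that pandas computes by default for two variables $x$ and $y$ is:

$r_{xy} = \large \frac{\sum_{i=1}^n (x_{i}-\overline x)(y_{i}-\overline y)}{\sqrt{\sum_{i=1}^n (x_{i}-\overline x)^2} \sqrt{\sum_{i=1}^n (y_{i}-\overline y)^2}}$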

In [28]:
data.corr()
Out[28]:
fruits_veggies daily_stress places_visited core_circle supporting_others social_network achievement donation bmi_range todo_completed ... live_vision sleep_hours lost_vacation daily_shouting sufficient_income personal_awards time_for_passion daily_meditation age gender
fruits_veggies 1.000000 -0.095132 0.248520 0.153606 0.207907 0.105804 0.166643 0.200787 -0.091937 0.230350 ... 0.113970 0.105838 -0.069845 -0.071099 0.151236 0.165587 0.171531 0.195067 0.183554 -0.118868
daily_stress -0.095132 1.000000 -0.131707 -0.115973 -0.035373 0.012720 -0.120786 -0.038291 0.084938 -0.166975 ... -0.135016 -0.152862 0.192353 0.309264 -0.144872 -0.044362 -0.161858 -0.213672 -0.020649 -0.128677
places_visited 0.248520 -0.131707 1.000000 0.259982 0.255008 0.157615 0.264964 0.212655 -0.103696 0.234049 ... 0.149852 0.129648 -0.124751 -0.089066 0.173262 0.279604 0.198304 0.203315 0.003419 -0.054253
core_circle 0.153606 -0.115973 0.259982 1.000000 0.338630 0.304941 0.292359 0.220173 -0.025774 0.222466 ... 0.216494 0.060685 -0.083369 -0.076232 0.122687 0.250223 0.227560 0.103687 0.002869 -0.096980
supporting_others 0.207907 -0.035373 0.255008 0.338630 1.000000 0.311285 0.359009 0.394205 0.027601 0.252200 ... 0.233593 0.013629 -0.026155 -0.064187 0.115812 0.333632 0.322820 0.149154 0.190202 -0.131287
social_network 0.105804 0.012720 0.157615 0.304941 0.311285 1.000000 0.252225 0.147615 0.017123 0.212253 ... 0.184818 -0.029174 0.011538 -0.010305 0.132226 0.214703 0.202677 -0.006766 -0.058707 -0.050410
achievement 0.166643 -0.120786 0.264964 0.292359 0.359009 0.252225 1.000000 0.236727 -0.034303 0.309286 ... 0.316496 0.054920 -0.010751 -0.066264 0.117023 0.400330 0.373957 0.172801 0.002239 -0.006039
donation 0.200787 -0.038291 0.212655 0.220173 0.394205 0.147615 0.236727 1.000000 0.059614 0.189849 ... 0.158218 0.005471 -0.025449 -0.054656 0.123999 0.274397 0.188767 0.150520 0.233210 -0.129959
bmi_range -0.091937 0.084938 -0.103696 -0.025774 0.027601 0.017123 -0.034303 0.059614 1.000000 -0.067473 ... -0.009084 -0.095543 0.032417 0.054456 -0.014296 0.016309 -0.023930 -0.070900 0.200395 -0.009803
todo_completed 0.230350 -0.166975 0.234049 0.222466 0.252200 0.212253 0.309286 0.189849 -0.067473 1.000000 ... 0.271881 0.117985 -0.080262 -0.145587 0.203476 0.249857 0.278042 0.181956 0.096548 -0.089833
flow 0.136223 -0.142187 0.154917 0.243300 0.266881 0.240043 0.389727 0.162248 0.011586 0.302141 ... 0.300045 0.036995 0.000539 -0.080099 0.086671 0.217978 0.482637 0.145943 -0.002399 -0.016699
daily_steps 0.243607 -0.057925 0.196342 0.151113 0.145220 0.213596 0.193328 0.107843 -0.128937 0.204796 ... 0.119280 0.010364 -0.045349 -0.031981 0.101667 0.150016 0.146962 0.141684 -0.020511 0.032735
live_vision 0.113970 -0.135016 0.149852 0.216494 0.233593 0.184818 0.316496 0.158218 -0.009084 0.271881 ... 1.000000 0.047949 -0.028527 -0.081179 0.130777 0.206593 0.311015 0.152180 0.041089 0.022373
sleep_hours 0.105838 -0.152862 0.129648 0.060685 0.013629 -0.029174 0.054920 0.005471 -0.095543 0.117985 ... 0.047949 1.000000 -0.093058 -0.078230 0.047208 0.018537 0.067811 0.164169 -0.056300 -0.048616
lost_vacation -0.069845 0.192353 -0.124751 -0.083369 -0.026155 0.011538 -0.010751 -0.025449 0.032417 -0.080262 ... -0.028527 -0.093058 1.000000 0.085184 -0.083065 -0.038252 -0.026963 -0.113086 -0.052863 0.025472
daily_shouting -0.071099 0.309264 -0.089066 -0.076232 -0.064187 -0.010305 -0.066264 -0.054656 0.054456 -0.145587 ... -0.081179 -0.078230 0.085184 1.000000 -0.076268 -0.041729 -0.104970 -0.096076 -0.084847 -0.064035
sufficient_income 0.151236 -0.144872 0.173262 0.122687 0.115812 0.132226 0.117023 0.123999 -0.014296 0.203476 ... 0.130777 0.047208 -0.083065 -0.076268 1.000000 0.153977 0.079585 0.071580 0.161993 0.006567
personal_awards 0.165587 -0.044362 0.279604 0.250223 0.333632 0.214703 0.400330 0.274397 0.016309 0.249857 ... 0.206593 0.018537 -0.038252 -0.041729 0.153977 1.000000 0.239808 0.154547 0.130737 -0.020684
time_for_passion 0.171531 -0.161858 0.198304 0.227560 0.322820 0.202677 0.373957 0.188767 -0.023930 0.278042 ... 0.311015 0.067811 -0.026963 -0.104970 0.079585 0.239808 1.000000 0.196000 0.000118 0.015833
daily_meditation 0.195067 -0.213672 0.203315 0.103687 0.149154 -0.006766 0.172801 0.150520 -0.070900 0.181956 ... 0.152180 0.164169 -0.113086 -0.096076 0.071580 0.154547 0.196000 1.000000 0.099682 0.094050
age 0.183554 -0.020649 0.003419 0.002869 0.190202 -0.058707 0.002239 0.233210 0.200395 0.096548 ... 0.041089 -0.056300 -0.052863 -0.084847 0.161993 0.130737 0.000118 0.099682 1.000000 -0.089236
gender -0.118868 -0.128677 -0.054253 -0.096980 -0.131287 -0.050410 -0.006039 -0.129959 -0.009803 -0.089833 ... 0.022373 -0.048616 0.025472 -0.064035 0.006567 -0.020684 0.015833 0.094050 -0.089236 1.000000

22 rows × 22 columns

In [29]:
# Heatmap correlations

fig, ax = plt.subplots(figsize=(8,6))

sns.heatmap(data.corr(), xticklabels=True, yticklabels=True, ax=ax, center=0, cmap='seismic') #see "diverging colormaps"
Out[29]:
<AxesSubplot:>

Negative correlations:

  • daily stress is the variable with the greatest number of negative correlations with other variables (as the others in general describe desirable characteristics for a good life), especially with daily meditation, as we might expect, but also with to-do completed, time for passion and sleep hours. The strongest positive correlation with daily stress is with daily shouting. There might be a real influence of these variables on the level of stress people experience. For a predictive model, daily shouting and daily meditation could be good predictor variables.

Poor correlations

  • Variables like bmi_range, sleep hours, lost vacation and daily shouting seem to be poorly correlated with the others, or only slightly negatively. For example, pairs like bmi range and daily steps, daily shouting and to-do completed, or lost vacation and places visited are slightly negative, as we might expect.

Positive correlations:

  • The rest of the variables are in general positively correlated with each other, and these positive correlations are the strongest ones. From the numerical analysis, achievement and supporting others are the variables positively correlated with the greatest number of other variables, achievement being the leader. For a predictive model, we found that flow and personal awards could be good predictor variables, as they are not too correlated with each other. We could try only these two, but we could include core circle and social network in case we want more than two. However, that would probably increase model complexity unnecessarily and therefore the risk of overfitting.

    Other unsurprising strong correlations include:

    • time for passion with flow
    • supporting others with donation
    • achievement with flow and personal awards
In [30]:
sorted_corr = np.unique(np.sort(data.corr().values.flatten()))[::-1] #Correlations in descendent order
array2D_corr = np.array(data.corr())

n = 12
nth_highest = sorted_corr[:n]
In [31]:
nth_highest
Out[31]:
array([1.        , 0.48263733, 0.40032959, 0.39420465, 0.38972744,
       0.37395656, 0.35900949, 0.33862994, 0.33363161, 0.32281997,
       0.3164962 , 0.31128491])
In [32]:
for col in data.corr().columns:
    df_pos_corr = pd.DataFrame(data.corr()[col].nlargest(2)[1:]).transpose()
    df_neg_corr = pd.DataFrame(data.corr()[col].nsmallest(1)).transpose()
    df = pd.concat([df_pos_corr, df_neg_corr], axis=1)
    display(df)
places_visited gender
fruits_veggies 0.24852 -0.118868
daily_shouting daily_meditation
daily_stress 0.309264 -0.213672
personal_awards daily_stress
places_visited 0.279604 -0.131707
supporting_others daily_stress
core_circle 0.33863 -0.115973
donation gender
supporting_others 0.394205 -0.131287
supporting_others age
social_network 0.311285 -0.058707
personal_awards daily_stress
achievement 0.40033 -0.120786
supporting_others gender
donation 0.394205 -0.129959
age daily_steps
bmi_range 0.200395 -0.128937
achievement daily_stress
todo_completed 0.309286 -0.166975
time_for_passion daily_stress
flow 0.482637 -0.142187
fruits_veggies bmi_range
daily_steps 0.243607 -0.128937
achievement daily_stress
live_vision 0.316496 -0.135016
daily_meditation daily_stress
sleep_hours 0.164169 -0.152862
daily_stress places_visited
lost_vacation 0.192353 -0.124751
daily_stress todo_completed
daily_shouting 0.309264 -0.145587
todo_completed daily_stress
sufficient_income 0.203476 -0.144872
achievement daily_stress
personal_awards 0.40033 -0.044362
flow daily_stress
time_for_passion 0.482637 -0.161858
places_visited daily_stress
daily_meditation 0.203315 -0.213672
donation gender
age 0.23321 -0.089236
daily_meditation supporting_others
gender 0.09405 -0.131287

Positive correlations:

The strongest correlations are positive. Let's see which pairs of variables have the highest correlations.

In [33]:
# Pairs of variables with the highest correlations

sorted_corr = np.unique(np.sort(data.corr().values.flatten()))[::-1] #Correlations in descendent order
array2D_corr = np.array(data.corr())

n = 20
nth_highest = sorted_corr[:n]

fig, axs = plt.subplots(5,4, figsize=(16,16))
fig.tight_layout(pad=5) #padding among subplots

axs = axs.flatten()
rows = []
columns = []
print("Highest correlations:\n\n")
for i in range(n):
    position = np.where(array2D_corr == nth_highest[i]) 
    row = data.corr().index[position[0][0]]
    column = data.corr().columns[position[1][0]]
    rows.append(row)
    columns.append(column)
    print("{} - {}:".format(row, column))
    print(nth_highest[i], "\n")
    frecuency = pd.crosstab(data[row],data[column]).iloc[::-1]
    ax = fig.sca(axs[i])
    ax.set(title='{} VS {}'.format(column,row))
    sns.heatmap(frecuency, ax=ax, cmap="YlGnBu")
    
    # axs[i].imshow(frecuency, cmap='Blues')
    # axs[i].hist2d(data[row],data[column], cmap='Blues')
Highest correlations:


fruits_veggies - fruits_veggies:
1.0 

flow - time_for_passion:
0.4826373253512499 

achievement - personal_awards:
0.40032959107956373 

supporting_others - donation:
0.39420465220667267 

achievement - flow:
0.38972743712470126 

achievement - time_for_passion:
0.37395655829229923 

supporting_others - achievement:
0.35900949207041805 

core_circle - supporting_others:
0.33862994082240555 

supporting_others - personal_awards:
0.33363160903780187 

supporting_others - time_for_passion:
0.32281997227051246 

achievement - live_vision:
0.31649619770451914 

supporting_others - social_network:
0.3112849098938094 

live_vision - time_for_passion:
0.31101500793287484 

achievement - todo_completed:
0.30928570200317784 

daily_stress - daily_shouting:
0.30926429602231414 

core_circle - social_network:
0.3049413472087288 

todo_completed - flow:
0.30214098153931707 

flow - live_vision:
0.30004494164840867 

core_circle - achievement:
0.29235878020232786 

places_visited - personal_awards:
0.2796037154559053 

In [34]:
# Number of variables that each one is highly correlated with

rows_cols = rows + columns
for variable in list(set(rows_cols)):
    print(variable, rows_cols.count(variable))
live_vision 3
social_network 2
fruits_veggies 2
achievement 7
personal_awards 3
places_visited 1
time_for_passion 4
supporting_others 6
core_circle 3
daily_stress 1
daily_shouting 1
donation 1
flow 4
todo_completed 2

achievement and supporting others are the variables positively correlated with the greatest number of other variables.

Variable 'Achievement':

Let's see the correlations of achievement with the others, and which pairs of these are least correlated with each other:

In [35]:
# Correlations with achievement
data.corr()['achievement'].sort_values(ascending=False)
Out[35]:
achievement          1.000000
personal_awards      0.400330
flow                 0.389727
time_for_passion     0.373957
supporting_others    0.359009
live_vision          0.316496
todo_completed       0.309286
core_circle          0.292359
places_visited       0.264964
social_network       0.252225
donation             0.236727
daily_steps          0.193328
daily_meditation     0.172801
fruits_veggies       0.166643
sufficient_income    0.117023
sleep_hours          0.054920
age                  0.002239
gender              -0.006039
lost_vacation       -0.010751
bmi_range           -0.034303
daily_shouting      -0.066264
daily_stress        -0.120786
Name: achievement, dtype: float64
In [36]:
# What pair of them are less correlated between them?
achievement_corrs = data.corr()['achievement'].sort_values().index.tolist()
data.corr()[achievement_corrs].loc[achievement_corrs]
Out[36]:
daily_stress daily_shouting bmi_range lost_vacation gender age sleep_hours sufficient_income fruits_veggies daily_meditation ... social_network places_visited core_circle todo_completed live_vision supporting_others time_for_passion flow personal_awards achievement
daily_stress 1.000000 0.309264 0.084938 0.192353 -0.128677 -0.020649 -0.152862 -0.144872 -0.095132 -0.213672 ... 0.012720 -0.131707 -0.115973 -0.166975 -0.135016 -0.035373 -0.161858 -0.142187 -0.044362 -0.120786
daily_shouting 0.309264 1.000000 0.054456 0.085184 -0.064035 -0.084847 -0.078230 -0.076268 -0.071099 -0.096076 ... -0.010305 -0.089066 -0.076232 -0.145587 -0.081179 -0.064187 -0.104970 -0.080099 -0.041729 -0.066264
bmi_range 0.084938 0.054456 1.000000 0.032417 -0.009803 0.200395 -0.095543 -0.014296 -0.091937 -0.070900 ... 0.017123 -0.103696 -0.025774 -0.067473 -0.009084 0.027601 -0.023930 0.011586 0.016309 -0.034303
lost_vacation 0.192353 0.085184 0.032417 1.000000 0.025472 -0.052863 -0.093058 -0.083065 -0.069845 -0.113086 ... 0.011538 -0.124751 -0.083369 -0.080262 -0.028527 -0.026155 -0.026963 0.000539 -0.038252 -0.010751
gender -0.128677 -0.064035 -0.009803 0.025472 1.000000 -0.089236 -0.048616 0.006567 -0.118868 0.094050 ... -0.050410 -0.054253 -0.096980 -0.089833 0.022373 -0.131287 0.015833 -0.016699 -0.020684 -0.006039
age -0.020649 -0.084847 0.200395 -0.052863 -0.089236 1.000000 -0.056300 0.161993 0.183554 0.099682 ... -0.058707 0.003419 0.002869 0.096548 0.041089 0.190202 0.000118 -0.002399 0.130737 0.002239
sleep_hours -0.152862 -0.078230 -0.095543 -0.093058 -0.048616 -0.056300 1.000000 0.047208 0.105838 0.164169 ... -0.029174 0.129648 0.060685 0.117985 0.047949 0.013629 0.067811 0.036995 0.018537 0.054920
sufficient_income -0.144872 -0.076268 -0.014296 -0.083065 0.006567 0.161993 0.047208 1.000000 0.151236 0.071580 ... 0.132226 0.173262 0.122687 0.203476 0.130777 0.115812 0.079585 0.086671 0.153977 0.117023
fruits_veggies -0.095132 -0.071099 -0.091937 -0.069845 -0.118868 0.183554 0.105838 0.151236 1.000000 0.195067 ... 0.105804 0.248520 0.153606 0.230350 0.113970 0.207907 0.171531 0.136223 0.165587 0.166643
daily_meditation -0.213672 -0.096076 -0.070900 -0.113086 0.094050 0.099682 0.164169 0.071580 0.195067 1.000000 ... -0.006766 0.203315 0.103687 0.181956 0.152180 0.149154 0.196000 0.145943 0.154547 0.172801
daily_steps -0.057925 -0.031981 -0.128937 -0.045349 0.032735 -0.020511 0.010364 0.101667 0.243607 0.141684 ... 0.213596 0.196342 0.151113 0.204796 0.119280 0.145220 0.146962 0.143280 0.150016 0.193328
donation -0.038291 -0.054656 0.059614 -0.025449 -0.129959 0.233210 0.005471 0.123999 0.200787 0.150520 ... 0.147615 0.212655 0.220173 0.189849 0.158218 0.394205 0.188767 0.162248 0.274397 0.236727
social_network 0.012720 -0.010305 0.017123 0.011538 -0.050410 -0.058707 -0.029174 0.132226 0.105804 -0.006766 ... 1.000000 0.157615 0.304941 0.212253 0.184818 0.311285 0.202677 0.240043 0.214703 0.252225
places_visited -0.131707 -0.089066 -0.103696 -0.124751 -0.054253 0.003419 0.129648 0.173262 0.248520 0.203315 ... 0.157615 1.000000 0.259982 0.234049 0.149852 0.255008 0.198304 0.154917 0.279604 0.264964
core_circle -0.115973 -0.076232 -0.025774 -0.083369 -0.096980 0.002869 0.060685 0.122687 0.153606 0.103687 ... 0.304941 0.259982 1.000000 0.222466 0.216494 0.338630 0.227560 0.243300 0.250223 0.292359
todo_completed -0.166975 -0.145587 -0.067473 -0.080262 -0.089833 0.096548 0.117985 0.203476 0.230350 0.181956 ... 0.212253 0.234049 0.222466 1.000000 0.271881 0.252200 0.278042 0.302141 0.249857 0.309286
live_vision -0.135016 -0.081179 -0.009084 -0.028527 0.022373 0.041089 0.047949 0.130777 0.113970 0.152180 ... 0.184818 0.149852 0.216494 0.271881 1.000000 0.233593 0.311015 0.300045 0.206593 0.316496
supporting_others -0.035373 -0.064187 0.027601 -0.026155 -0.131287 0.190202 0.013629 0.115812 0.207907 0.149154 ... 0.311285 0.255008 0.338630 0.252200 0.233593 1.000000 0.322820 0.266881 0.333632 0.359009
time_for_passion -0.161858 -0.104970 -0.023930 -0.026963 0.015833 0.000118 0.067811 0.079585 0.171531 0.196000 ... 0.202677 0.198304 0.227560 0.278042 0.311015 0.322820 1.000000 0.482637 0.239808 0.373957
flow -0.142187 -0.080099 0.011586 0.000539 -0.016699 -0.002399 0.036995 0.086671 0.136223 0.145943 ... 0.240043 0.154917 0.243300 0.302141 0.300045 0.266881 0.482637 1.000000 0.217978 0.389727
personal_awards -0.044362 -0.041729 0.016309 -0.038252 -0.020684 0.130737 0.018537 0.153977 0.165587 0.154547 ... 0.214703 0.279604 0.250223 0.249857 0.206593 0.333632 0.239808 0.217978 1.000000 0.400330
achievement -0.120786 -0.066264 -0.034303 -0.010751 -0.006039 0.002239 0.054920 0.117023 0.166643 0.172801 ... 0.252225 0.264964 0.292359 0.309286 0.316496 0.359009 0.373957 0.389727 0.400330 1.000000

22 rows × 22 columns

The further down and to the right, the more correlated the variables are with achievement. For a prediction model, we are interested in predictor variables with a low correlation between them, so we should look for small values in the lower-right part of this dataframe. We see 0.217978 for flow and personal awards, so choosing only these two might be a good option. core circle and social network would be the next ones to choose in case we want more.
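A small sketch (assuming the `data` DataFrame from above) that automates this search: among the variables most correlated with achievement, pick the pair whose mutual correlation is lowest.

# Sketch: among the top-k variables correlated with achievement,
# find the pair with the lowest mutual correlation (k=4 is an arbitrary choice)
corr = data.corr()
top = corr['achievement'].drop('achievement').sort_values(ascending=False).head(4).index

best_pair, best_value = None, 1.0
for i, a in enumerate(top):
    for b in top[i+1:]:
        if abs(corr.loc[a, b]) < best_value:
            best_pair, best_value = (a, b), abs(corr.loc[a, b])

print(best_pair, round(best_value, 3))

With the top 4 correlates (personal_awards, flow, time_for_passion, supporting_others), this should pick personal_awards and flow (about 0.218), in line with the choice discussed above.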

In [37]:
# The lowest values of correlation in absolute value
np.unique(np.sort(np.array(abs(data.corr()[achievement_corrs[11:]].loc[achievement_corrs[11:]])).flatten()))
Out[37]:
array([0.14761525, 0.14985247, 0.15491703, 0.15761513, 0.15821833,
       0.16224798, 0.18481764, 0.18876745, 0.18984902, 0.19830437,
       0.20267701, 0.20659311, 0.21225279, 0.21265549, 0.21470312,
       0.21649415, 0.21797815, 0.22017343, 0.22246642, 0.22756044,
       0.23359329, 0.23404912, 0.23672727, 0.23980767, 0.24004297,
       0.24329982, 0.2498565 , 0.25022277, 0.25219994, 0.25222473,
       0.25500816, 0.25998208, 0.26496444, 0.26688059, 0.271881  ,
       0.27439719, 0.27804196, 0.27960372, 0.29235878, 0.30004494,
       0.30214098, 0.30494135, 0.3092857 , 0.31101501, 0.31128491,
       0.3164962 , 0.32281997, 0.33363161, 0.33862994, 0.35900949,
       0.37395656, 0.38972744, 0.39420465, 0.40032959, 0.48263733,
       1.        ])

Negative correlations:

Let's see the most negatively correlated pairs of variables:

In [38]:
# Pairs of variables with the most negative correlations

sorted_corr = np.unique(np.sort(data.corr().values.flatten()))
array2D_corr = np.array(data.corr())

n = 20
nth_lowest = sorted_corr[:n]

fig, axs = plt.subplots(5,4, figsize=(16,16))
fig.tight_layout(pad=5) #padding between subplots

axs = axs.flatten()
rows = []
columns = []
print("Lowest correlations:\n\n")
for i in range(n):
    position = np.where(array2D_corr == nth_lowest[i]) 
    row = data.corr().index[position[0][0]]
    column = data.corr().columns[position[1][0]]
    rows.append(row)
    columns.append(column)
    print("{} - {}:".format(row, column))
    print(nth_lowest[i], "\n")
    frecuency = pd.crosstab(data[row],data[column]).iloc[::-1]
    ax = fig.sca(axs[i])
    ax.set(title='{} VS {}'.format(column,row))
    sns.heatmap(frecuency, ax=ax, cmap="YlGnBu")
    
    # axs[i].imshow(frecuency, cmap='Blues')
    # axs[i].hist2d(data[row],data[column], cmap='Blues')
Lowest correlations:


daily_stress - daily_meditation:
-0.21367242460175634 

daily_stress - todo_completed:
-0.16697526420745487 

daily_stress - time_for_passion:
-0.1618577321177887 

daily_stress - sleep_hours:
-0.15286158450655735 

todo_completed - daily_shouting:
-0.14558711896131799 

daily_stress - sufficient_income:
-0.14487168163684583 

daily_stress - flow:
-0.142187370687171 

daily_stress - live_vision:
-0.1350155981795351 

daily_stress - places_visited:
-0.13170694011317127 

supporting_others - gender:
-0.13128712907783338 

donation - gender:
-0.12995859803876536 

bmi_range - daily_steps:
-0.1289368441372705 

daily_stress - gender:
-0.128676852685506 

places_visited - lost_vacation:
-0.12475119803990595 

daily_stress - achievement:
-0.12078578179755883 

fruits_veggies - gender:
-0.11886752387580067 

daily_stress - core_circle:
-0.11597261409797491 

lost_vacation - daily_meditation:
-0.11308591354343855 

daily_shouting - time_for_passion:
-0.10497038387085024 

places_visited - bmi_range:
-0.10369634596618633 

In [39]:
# Number of other variables that each one is most negatively correlated with

rows_cols = rows + columns
for variable in list(set(rows_cols)):
    print(variable, rows_cols.count(variable))
places_visited 3
core_circle 1
daily_shouting 2
bmi_range 2
flow 1
sufficient_income 1
live_vision 1
time_for_passion 2
supporting_others 1
daily_stress 11
sleep_hours 1
daily_meditation 2
gender 4
lost_vacation 2
fruits_veggies 1
achievement 1
daily_steps 1
donation 1
todo_completed 2

Variable 'daily stress'

As we showed above in the heatmap of the correlation matrix, daily stress is negatively correlated with a number of other variables. Let's see the most negative correlations of daily stress with the others, and which pairs of those variables are least correlated with each other:

In [40]:
# Correlations with daily stress
data.corr()['daily_stress'].sort_values()
Out[40]:
daily_meditation    -0.213672
todo_completed      -0.166975
time_for_passion    -0.161858
sleep_hours         -0.152862
sufficient_income   -0.144872
flow                -0.142187
live_vision         -0.135016
places_visited      -0.131707
gender              -0.128677
achievement         -0.120786
core_circle         -0.115973
fruits_veggies      -0.095132
daily_steps         -0.057925
personal_awards     -0.044362
donation            -0.038291
supporting_others   -0.035373
age                 -0.020649
social_network       0.012720
bmi_range            0.084938
lost_vacation        0.192353
daily_shouting       0.309264
daily_stress         1.000000
Name: daily_stress, dtype: float64
In [41]:
# What pair of them are less correlated between them?
stress_corrs = abs(data.corr()['daily_stress']).sort_values().index.tolist()
data.corr()[stress_corrs[11:]].loc[stress_corrs[11:]]
Out[41]:
places_visited live_vision flow sufficient_income sleep_hours time_for_passion todo_completed lost_vacation daily_meditation daily_shouting daily_stress
places_visited 1.000000 0.149852 0.154917 0.173262 0.129648 0.198304 0.234049 -0.124751 0.203315 -0.089066 -0.131707
live_vision 0.149852 1.000000 0.300045 0.130777 0.047949 0.311015 0.271881 -0.028527 0.152180 -0.081179 -0.135016
flow 0.154917 0.300045 1.000000 0.086671 0.036995 0.482637 0.302141 0.000539 0.145943 -0.080099 -0.142187
sufficient_income 0.173262 0.130777 0.086671 1.000000 0.047208 0.079585 0.203476 -0.083065 0.071580 -0.076268 -0.144872
sleep_hours 0.129648 0.047949 0.036995 0.047208 1.000000 0.067811 0.117985 -0.093058 0.164169 -0.078230 -0.152862
time_for_passion 0.198304 0.311015 0.482637 0.079585 0.067811 1.000000 0.278042 -0.026963 0.196000 -0.104970 -0.161858
todo_completed 0.234049 0.271881 0.302141 0.203476 0.117985 0.278042 1.000000 -0.080262 0.181956 -0.145587 -0.166975
lost_vacation -0.124751 -0.028527 0.000539 -0.083065 -0.093058 -0.026963 -0.080262 1.000000 -0.113086 0.085184 0.192353
daily_meditation 0.203315 0.152180 0.145943 0.071580 0.164169 0.196000 0.181956 -0.113086 1.000000 -0.096076 -0.213672
daily_shouting -0.089066 -0.081179 -0.080099 -0.076268 -0.078230 -0.104970 -0.145587 0.085184 -0.096076 1.000000 0.309264
daily_stress -0.131707 -0.135016 -0.142187 -0.144872 -0.152862 -0.161858 -0.166975 0.192353 -0.213672 0.309264 1.000000

The further down and to the right, the more correlated the variables are with daily stress. For a prediction model, we are interested in predictor variables with a low correlation (in absolute value) between them, so we should look for small values in the lower-right part of this dataframe. We see -0.096076 for daily shouting and daily meditation, so choosing these two might be a good option. Other variables we could choose that are poorly correlated with daily shouting and daily meditation are lost vacation and sufficient income.

In [42]:
# The lowest values of correlation in absolute value
np.unique(np.sort(np.array(abs(data.corr()[stress_corrs[11:]].loc[stress_corrs[11:]])).flatten()))
Out[42]:
array([5.39244520e-04, 2.69626501e-02, 2.85272804e-02, 3.69947788e-02,
       4.72083003e-02, 4.79493603e-02, 6.78111005e-02, 7.15803146e-02,
       7.62676096e-02, 7.82295707e-02, 7.95853327e-02, 8.00993712e-02,
       8.02622588e-02, 8.11793283e-02, 8.30652481e-02, 8.51840221e-02,
       8.66714470e-02, 8.90661821e-02, 9.30579567e-02, 9.60761312e-02,
       1.04970384e-01, 1.13085914e-01, 1.17985233e-01, 1.24751198e-01,
       1.29647517e-01, 1.30777052e-01, 1.31706940e-01, 1.35015598e-01,
       1.42187371e-01, 1.44871682e-01, 1.45587119e-01, 1.45942796e-01,
       1.49852474e-01, 1.52180100e-01, 1.52861585e-01, 1.54917034e-01,
       1.61857732e-01, 1.64169500e-01, 1.66975264e-01, 1.73262396e-01,
       1.81955930e-01, 1.92352962e-01, 1.96000331e-01, 1.98304373e-01,
       2.03315316e-01, 2.03476252e-01, 2.13672425e-01, 2.34049116e-01,
       2.71880996e-01, 2.78041959e-01, 3.00044942e-01, 3.02140982e-01,
       3.09264296e-01, 3.11015008e-01, 4.82637325e-01, 1.00000000e+00])

2.6 Summary of the EDA: Key Findings and insights

From the EDA we can conclude some key findings:

  • Regarding the distribution of the data, we see that the variables are not normally distributed. Most of them are left-skewed, in particular those related to the social area: core circle, social network, supporting others or donation. This means that there are a few people with a poor social life compared with the majority. On the other hand, other variables are right-skewed, so for example a few people stand out from the rest in time for passion, flow and achievement.
  • There is a clear division between variables positively (red) and negatively (blue) correlated with the others in general. achievement and daily stress are the ones with the greatest potential to be predicted, as they have the strongest positive and negative correlations respectively.
  • achievement is the variable positively correlated with the greatest number of other variables. For a predictive model, we found that flow and personal awards could be good predictor variables, as they are not too correlated with each other. We could try only these two, but we could include core circle and social network in case we want more than two. However, that would probably increase model complexity unnecessarily and therefore the risk of overfitting.
  • daily stress is the variable with the greatest number of negative correlations with other variables, such as daily meditation, to-do completed, time for passion or sleep hours. There might be a real influence of these variables on the level of stress people experience. For a predictive model, daily shouting and daily meditation could be good predictor variables. Other good options to include are lost vacation, although it doesn't seem to have much of a real relationship with stress, and sufficient income.

3. Hypothesis

Hypothesis 1: The population sleeps less than 7 hours (the World Health Organization recommendation is 7-8 hours). We can perform a t-test and check whether people sleep on average fewer hours than the WHO recommends, which might be indicative of a poor quality of life. The next step would be to see if those who sleep few hours also have negative indicators such as high levels of stress or low marks in achievement, personal awards, etc.

Hypothesis 2: Data points with high values for positive attributes are grouped together. It is said that one good habit positively affects other areas, producing a kind of "virtuous cycle". Because of this, and the similar behaviour of the positive attributes (reds in the correlation matrix) with respect to the other variables, we might expect the data to be grouped by positive and negative attributes. We could perform clustering or classification algorithms, along with a significance test of the results, to test this hypothesis.

Hypothesis 3: The correlations between achievement and the variables flow, daily_meditation and time for passion are significantly different from zero. If so, we can try a predictive model such as a linear model to predict achievement, or just use the findings for interpretation and to increase our achievements.

Hypothesis 4: The correlation between daily stress and daily meditation is significantly different from zero. The same analysis as for hypothesis 3 can be performed with daily stress.

Significance test for hypothesis 1: the population sleeps less than 7 hours

We assume that the sleep hours follow a normal distribution and we are going to treat the variable as continuous, although our data only takes integer values between 1 and 10. Our null ($H_{0}$) and alternative ($H_{1}$) hypotheses are:

$H_{0}: \mu < 7h$
$H_{1}: \mu \geq 7h$

The population standard deviation is unknown, so we are going to estimate it from our data and perform a t test:

1. We calculate the t statistic:

$t_0 = \large \frac{\overline{X} - \mu}{\Large \frac{S} {\sqrt n}} \hspace{0.5cm} \small [1]$

where,

$S = \large \sqrt{\frac{\sum_{i=1}^n (x_{i}-\overline x)^2}{n-1}}$

is the standard deviation of our sample, and

$\mu$: population mean
$\overline X$: sample mean


2. The p-value is the probability that $t$ is greater than the value we got from our sample, $t_0$. This is the area under the curve of the Student's t distribution, whose expression is:

$f(t) = \large \frac{\Gamma(\frac{\Large \nu + 1}{2})}{\sqrt{\nu\pi}\hspace{1mm} \Gamma(\nu/2)} (1+\frac{t^2}{\nu})^{-\frac{\nu+1}{2}} \hspace{0.5cm} \small [2]$,

where

$\hspace{2cm}\Gamma(p) = \int_0^\infty x^{p-1}e^{-x}dx \hspace{0.5cm} \small [3]$


The area under the curve is the integral of $f(t)$ between two values of t. Integrating between $t_0$ and $\infty$ we get the p-value, which is the probability that $t > t_0$, and therefore the probability that the mean of another sample is greater than our $\overline X$.

What is the probability of the population mean being less than 7h, given our sample?
We can see in [1] that, for the variable t, keeping $\mu$ fixed and varying $\overline X$ is equivalent to keeping $\overline X$ fixed and varying $\mu$. To calculate the probability of the population mean being less than 7h given our sample, we would keep our $\overline X$ fixed (from our sample) and traverse $\mu$ from $7 \rightarrow -\infty$ (to 0 actually), which is the same as keeping $\mu = 7$ and traversing $\overline X$ from $7.03 \rightarrow \infty$. When integrating $f(t)$ we would traverse t in opposite directions, but we can also see in [1] that $dt$ has the opposite sign, so the result of the integral is the same.

Thus, the probability of the population mean being less than 7h given our sample is exactly the same as the probability of any sample mean being greater than ours (7.03h) under the hypothesis that the population mean is 7h, i.e., given $\mu = 7$.

In [43]:
from scipy.stats import t

mu = 7 # Population mean if the hypothesis is true
sample_mean = data['sleep_hours'].mean()
sample_std = data['sleep_hours'].std()
n = data['sleep_hours'].shape[0]

t_stat = (sample_mean - mu) / (sample_std/np.sqrt(n))
t_stat
Out[43]:
3.3069605551230157
In [44]:
# Probability of mu < 7h

p_value = 1-t.cdf(t_stat, df=n-1)
print("p-value: ", p_value)
print("1-p-pavlue: ", 1-p_value)
p-value:  0.0004728744186621725
1-p-pavlue:  0.9995271255813378

Our confidence level is 99%, so p-values below 0.01 lead us to reject the null hypothesis. I obtained a $p$-value $\approx 0.0005$, so we can reject the null hypothesis with (at least) 99% confidence.

$H_{1}$ accepted: the population sleeps on average 7h or more
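As a cross-check, the same one-sided test can be obtained directly from scipy (a sketch; the alternative argument assumes scipy >= 1.6):

from scipy.stats import ttest_1samp

# One-sided, one-sample t-test: the alternative is that the mean is greater than 7
t_stat, p_value = ttest_1samp(data['sleep_hours'], popmean=7, alternative='greater')
print(t_stat, p_value)  # should reproduce t ≈ 3.307 and p ≈ 0.00047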


Extra: Bayesian interpretation

NOTE: I think the expressions below are not valid for continuous variables, so this could all be wrong

Define:

$r = \large \frac{p(H_{0}|x)}{p(H_{1}|x)} = \frac{p(H_{0})·p(x|H_{0})}{p(H_{1})·p(x|H_{1})} = \frac{p(H_{0})}{p(H_{1})}·\frac{p_{value}}{1 - p_{value}}$,

In [45]:
# Giving the same prior probabilities to both hypotheses, p(H0) = p(H1):

r = p_value/(1-p_value)
r
Out[45]:
0.00047309813466757356
In [46]:
1/r
Out[46]:
2113.726363987164

The alternative hypothesis is 2114 times more likely than the null hypothesis. The sum of both probabilities is 1, so:

In [47]:
p_H1 = (1/r) / ((1/r)+1)
p_H1
Out[47]:
0.9995271255813378

Posterior probability: $p(H_{1}) = 99.95 \% $

What is the value of the population mean?

In [48]:
mu = 7.03
t_stat = (sample_mean - mu) / (sample_std/np.sqrt(n))
t_stat
Out[48]:
0.48867891499420146
In [49]:
t.cdf(t_stat, df=n-1)
Out[49]:
0.6874612820197958

$\mu = \overline{X} \pm \large t \cdot \frac{S}{\sqrt{n}}$

In [50]:
t.ppf(0.95, df=n-1) * sample_std/math.sqrt(n)
Out[50]:
0.01751038383904761

The mean is $(7.03 \pm 0.02)\,h$, using the one-sided 95% quantile of the t distribution (equivalent to a two-sided 90% confidence interval)

4. Next steps

The steps may be:

  • Create a linear model for daily stress with daily meditation and daily shouting as predictor variables, and see if they really have a correlation with stress significantly different from zero. We can try a log transformation in case we don't get normally distributed residuals, as the variable daily shouting is right-skewed and this can lead to skewed residuals. With daily meditation a polynomial transformation might be required to fix the left skew.
  • Create a predictive model for achievement with flow and personal awards as predictor variables, and see if they really have a correlation with achievement significantly different from zero. Again, we can try a log transformation with the variable flow in case we don't get normally distributed residuals, as it is right-skewed, and we can also try a polynomial transformation with personal awards.
  • Look for clusters to see the arrangement of the data. We could perform clustering or classification algorithms to test hypothesis 2. Scaling the variables may be necessary, as these algorithms are based on distances (see the sketch below).
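A minimal sketch of how the clustering of hypothesis 2 could start (assumptions: an arbitrary k=2, all numeric columns, standard scaling; a starting point, not the final analysis):

# Sketch: scale the features and fit a simple KMeans with an arbitrary k=2
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = data.drop('timestamp', axis=1)             # timestamp is the only non-numeric column here
X_scaled = StandardScaler().fit_transform(features)   # scaling matters because KMeans is distance-based

kmeans = KMeans(n_clusters=2, random_state=2021).fit(X_scaled)

# If the hypothesis holds, the cluster means of the "positive" attributes should separate clearly
print(features.assign(cluster=kmeans.labels_)
              .groupby('cluster')[['achievement', 'flow', 'supporting_others', 'daily_stress']]
              .mean())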

5. Linear Regression for daily stress

Pearson correlation coefficients, r, gave us an idea of how well a linear model fits the data, but they don't say anything about how much each variable influences stress. The magnitude of the linear regression coefficients and the Pearson correlations are different things and are not necessarily related. What tells us which features influence stress the most is the magnitude of the linear regression coefficients.

Now we are going to fit some linear regression models, focusing on the interpretation of the main factors that cause daily stress, so we'll see which features have the strongest influence by comparing the corresponding coefficients. These features can probably be sources of stress, and the findings may therefore be of great interest.

In [51]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error

First we'll remove the variable timestamp, as we are not interested in it, and bmi_range, as its values are only 1 or 2 and thus seem to be wrong ("body mass index" has typical values of 18 - 30).

In [52]:
data_original = data.copy()
data = data.drop(['timestamp', 'bmi_range'], axis=1)
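An aside before fitting the models: the features are on different scales (0-10 counts, a 1-2 income indicator, a 0/1 gender flag), so the magnitudes of raw coefficients are not directly comparable across features. A minimal sketch of standardized coefficients, assuming the `data` DataFrame as it stands after the cell above; this is only a sanity check for the interpretation, not the baseline model we fit next:

# Sketch: linear regression on standardized features, so coefficient magnitudes are comparable
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

y = data['daily_stress']
X = data.drop('daily_stress', axis=1)

scaled_lr = Pipeline([('scaler', StandardScaler()), ('lr', LinearRegression())])
scaled_lr.fit(X, y)

std_coefs = pd.Series(scaled_lr.named_steps['lr'].coef_, index=X.columns)
print(std_coefs.sort_values())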

5.1 Simple Linear Regression with OLS from statsmodels.api

First, we will fit a linear model with the class OLS (Ordinary Least Squares) from statsmodels.api, which provides a more statistical treatment, for example the t value from a significance test for the coefficients being non-zero. We will pass the training sets X_train and y_train to the OLS object. We will do the train-test split with a test size of 0.3 (30%), which results in 8928 data points for the training set and 3827 for the test set. We choose a seed for convenience, for example random_state=2021, to ensure the splits are the same when comparing with other models. Once we have a baseline, we will try adding regularization and polynomial features and see if they shed more light on which factors have the most impact on stress.

In [53]:
# Baseline

import statsmodels.api as sm

y = data['daily_stress']
X = data.drop('daily_stress', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=2021)

lr = sm.OLS(y_train, sm.add_constant(X_train))
OLSresults = lr.fit()

print("R2 on training set:", r2_score(y_train, OLSresults.predict(sm.add_constant(X_train))))
print("R2 on test set:", r2_score(y_test, OLSresults.predict(sm.add_constant(X_test))))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults.predict(sm.add_constant(X_test))))
print("Root mean squared error: ", rmse)

display(OLSresults.summary())
R2 on training set: 0.20751448027951147
R2 on test set: 0.19640373519899001
Root mean squared error:  1.2239315195417808
OLS Regression Results
Dep. Variable: daily_stress R-squared: 0.208
Model: OLS Adj. R-squared: 0.206
Method: Least Squares F-statistic: 116.6
Date: Sun, 18 Apr 2021 Prob (F-statistic): 0.00
Time: 12:05:44 Log-Likelihood: -14475.
No. Observations: 8928 AIC: 2.899e+04
Df Residuals: 8907 BIC: 2.914e+04
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.1133 0.105 39.351 0.000 3.908 4.318
fruits_veggies -0.0194 0.010 -1.955 0.051 -0.039 5.48e-05
places_visited -0.0124 0.004 -2.777 0.006 -0.021 -0.004
core_circle -0.0234 0.005 -4.515 0.000 -0.034 -0.013
supporting_others 0.0192 0.005 3.879 0.000 0.009 0.029
social_network 0.0234 0.005 4.933 0.000 0.014 0.033
achievement -0.0167 0.006 -2.852 0.004 -0.028 -0.005
donation -0.0009 0.008 -0.112 0.910 -0.016 0.015
todo_completed -0.0233 0.006 -4.060 0.000 -0.034 -0.012
flow -0.0296 0.007 -4.383 0.000 -0.043 -0.016
daily_steps 0.0042 0.005 0.862 0.388 -0.005 0.014
live_vision -0.0155 0.004 -3.498 0.000 -0.024 -0.007
sleep_hours -0.0959 0.011 -8.602 0.000 -0.118 -0.074
lost_vacation 0.0487 0.004 13.530 0.000 0.042 0.056
daily_shouting 0.1253 0.005 25.168 0.000 0.116 0.135
sufficient_income -0.2563 0.031 -8.346 0.000 -0.317 -0.196
personal_awards 0.0224 0.005 4.641 0.000 0.013 0.032
time_for_passion -0.0328 0.006 -5.658 0.000 -0.044 -0.021
daily_meditation -0.0488 0.005 -10.535 0.000 -0.058 -0.040
age 0.0210 0.015 1.420 0.156 -0.008 0.050
gender -0.3315 0.028 -11.950 0.000 -0.386 -0.277
Omnibus: 103.262 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 64.643
Skew: -0.030 Prob(JB): 9.18e-15
Kurtosis: 2.588 Cond. No. 169.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [54]:
rmse_cv_ols = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X, y)))
print(rmse_cv_ols)
1.2263339068499821

Features with coefficients significantly different from 0 are shown next. Let's take a 99% confidence level as our criterion (p-value < 0.01), so we will keep the features whose "P>|t|" column is below 0.01:
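
Instead of listing the columns by hand, the significant features can also be picked programmatically from the p-values reported by statsmodels; a small sketch, assuming `OLSresults` from the previous cell (it should reproduce the hand-picked list used in the next cell):

In [ ]:
# Keep the features whose coefficient has p-value < 0.01, dropping the intercept term
pvalues = OLSresults.pvalues.drop('const')
features_auto = pvalues[pvalues < 0.01].index
print(list(features_auto))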

In [57]:
features_nonzero = X.columns.drop(['fruits_veggies','donation', 'daily_steps', 'age'])
X_train2 = sm.add_constant(X_train[features_nonzero])
X_test2 = sm.add_constant(X_test[features_nonzero])

lr2 = sm.OLS(y_train, X_train2)
OLSresults2 = lr2.fit() #OLSResults object

print("R2 on training set:", r2_score(y_train, OLSresults2.predict(X_train2)))
print("R2 on test set:", r2_score(y_test, OLSresults2.predict(X_test2)))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults2.predict(X_test2)))
print("Root mean squared error: ", rmse)

coefs2 = OLSresults2.params.sort_values()
print("-"*50)
print("\nCoefficients: \n")
print(coefs2)

rmse_cv_ols2 = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X[features_nonzero], y)))
print("\nRmse CV: ", rmse_cv_ols2)
R2 on training set: 0.2070379500456585
R2 on test set: 0.19625620857542925
Root mean squared error:  1.224043860904083
--------------------------------------------------

Coefficients: 

gender              -0.327591
sufficient_income   -0.253618
sleep_hours         -0.098262
daily_meditation    -0.048974
time_for_passion    -0.033616
flow                -0.029732
core_circle         -0.023774
todo_completed      -0.023402
achievement         -0.016938
live_vision         -0.015428
places_visited      -0.013766
supporting_others    0.019710
personal_awards      0.022859
social_network       0.023035
lost_vacation        0.048548
daily_shouting       0.125046
const                4.135555
dtype: float64

Rmse CV:  1.2260816886645598

These coefficients are significantly different from zero*. The variables gender, sufficient income and daily shouting (positive coefficient) seem to be the ones with most influence on stress.

\*Note: Residuals are not normally distributed, and no data transformation (log, root, Box-Cox) yields more normally distributed residuals. Since the significance tests performed by `statsmodels.api.OLS` assume normally distributed residuals, we should be prudent and critical with these results before reaching a conclusion. However, one fact in their favour is that the independence $\chi^2$ tests show a dependency between the categories of *stress* and, at least, the variables *gender, sufficient income* and *daily shouting*. These tests only guarantee a dependency between unordered categories, not a correlation, but looking at the data it is noticeable that this dependency follows a definite direction. Everything points to significant non-null correlations. See the section Statistical Hypothesis Testing for dependencies and correlations for details.

We have a number of features, but some of them are close to zero. We can now try reducing the number of features and see which coefficients we obtain for the remaining ones. We will do this with two approaches: Lasso regression and RFE (Recursive Feature Elimination):

5.2 Reducing features: Lasso

We see some features that stand out against the others: gender, sufficient income and daily shouting. However, we didn't obtain a great $R^2$ score, so the fit might not be good enough to fully trust these coefficients; we can't yet be sure whether there is a real relation between those variables and stress.

We can try simplifying the model by adding regularization and see if we obtain a better fit that gives us more confidence in the real influence of the predictive features. For this purpose, Lasso is usually a better choice than Ridge in terms of interpretation, as coefficients are more likely to vanish with Lasso, keeping only those with a stronger impact on the output variable and improving the interpretability of the simplified model. In addition, we tried an ElasticNetCV and obtained that the best l1_ratio parameter was 1, so we will try different alpha values for a Lasso regression. We will therefore create a Lasso model using LassoCV in order to try several values of alpha, the parameter that controls the amount of regularization, and compare the resulting models through the $R^2$ score obtained after cross-validation, to ensure that the particular choice of training-validation split doesn't determine which one is better. We keep the parameter cv=None, which uses the default 5-fold cross-validation, and we reuse the X_train subset used previously with LinearRegression, which in turn will be split into 5 folds to perform the cross-validation. We keep X_test to check the performance on unseen data.

In [58]:
alphas = np.geomspace(1e-5, 1, num=24)
lassoCV = LassoCV(alphas=alphas)
lassoCV.fit(X_train, y_train)

print("Best model:\n")
print("Alpha: ", lassoCV.alpha_)
print("R2 on test set: ", lassoCV.score(X_test, y_test))  #r2_score(y_test, lassoCV.predict(X_test))
rmse = np.sqrt(mean_squared_error(y_test, lassoCV.predict(X_test)))
print("Root mean squared error: ", rmse)
Best model:

Alpha:  0.0014924955450518291
R2 on test set:  0.1964983336215841
Root mean squared error:  1.223859477520299
In [59]:
rmse_cv_lasso = np.sqrt(mean_squared_error(y, cross_val_predict(Lasso(alpha=lassoCV.alpha_), X, y)))
print(rmse_cv_lasso)
1.226344366252987
In [60]:
pd.DataFrame(list(zip(X.columns, lassoCV.coef_)), columns=['feature','coef']).set_index('feature').sort_values(by='coef')
Out[60]:
coef
feature
gender -0.324783
sufficient_income -0.248169
sleep_hours -0.095088
daily_meditation -0.048859
time_for_passion -0.032768
flow -0.029366
core_circle -0.023177
todo_completed -0.023167
fruits_veggies -0.018384
achievement -0.016536
live_vision -0.015557
places_visited -0.012463
donation -0.000072
daily_steps 0.003708
age 0.018553
supporting_others 0.019050
personal_awards 0.022077
social_network 0.023043
lost_vacation 0.048645
daily_shouting 0.125290

None of the coefficients has been driven to zero. In fact, the results are very similar to the ones from plain linear regression, both the $R^2$ score and the coefficient values. However, with stronger regularization (alpha above 1), the $R^2$ score becomes practically null.
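
The effect of a much stronger regularization can be checked quickly; a minimal sketch, assuming the X_train/X_test split from above (exact numbers may vary):

In [ ]:
# With a large alpha, Lasso shrinks every coefficient to zero and the test R2 collapses,
# which is why we do not push the regularization further.
lasso_strong = Lasso(alpha=5.0)
lasso_strong.fit(X_train, y_train)
print("Non-zero coefficients:", np.sum(lasso_strong.coef_ != 0))
print("R2 on test set:", lasso_strong.score(X_test, y_test))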

5.3 Reducing features: RFE

We are going to be stricter and keep only 12 features. Let's fit a model with the 12 features selected by RFE and see whether they coincide with the significant ones in features_nonzero.

LinearRegression with the top 12 features selected by Recursive Feature Elimination (RFE)

In [61]:
rfe = RFE(LinearRegression(), n_features_to_select=12)
rfe.fit(X, y)

features = X.columns[rfe.support_]
features
Out[61]:
Index(['core_circle', 'social_network', 'todo_completed', 'flow',
       'sleep_hours', 'lost_vacation', 'daily_shouting', 'sufficient_income',
       'time_for_passion', 'daily_meditation', 'age', 'gender'],
      dtype='object')
In [62]:
X_train_rfe_1 = sm.add_constant(X_train[features])
X_test_rfe_1 = sm.add_constant(X_test[features])

lr_rfe = sm.OLS(y_train, X_train_rfe_1)
OLSresults_rfe = lr_rfe.fit() #OLSResults object

print("R2 on training set:", r2_score(y_train, OLSresults_rfe.predict(X_train_rfe_1)))
print("R2 on test set:", r2_score(y_test, OLSresults_rfe.predict(X_test_rfe_1)))
rmse = np.sqrt(mean_squared_error(y_test, OLSresults_rfe.predict(X_test_rfe_1)))
print("Root mean squared error: ", rmse)

coefs_rfe = OLSresults_rfe.params.sort_values()
print("-"*50)
print("Coefficients:\n")
print(coefs_rfe)
R2 on training set: 0.2018642122384604
R2 on test set: 0.19128926477826702
Root mean squared error:  1.2278201843907077
--------------------------------------------------
Coefficients:

gender              -0.336682
sufficient_income   -0.265136
sleep_hours         -0.101721
daily_meditation    -0.050159
time_for_passion    -0.033970
flow                -0.033181
todo_completed      -0.025641
core_circle         -0.022373
social_network       0.026723
age                  0.035751
lost_vacation        0.049624
daily_shouting       0.126775
const                4.161317
dtype: float64
In [63]:
rmse_cv_rfe = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X[features], y)))
print(rmse_cv_rfe)
1.2294077096448424

There might be interactions between variables. Let's try adding polynomial features and see if we get better performance:

5.4 Polynomial Features

We can drop gender to look at the interactions between the other variables, as interaction terms with gender don't make much sense, and then generate the polynomial features. Daily_shouting has negative correlations with the others, so we tried including ratio features like $\frac{meditation}{shouting}$, but no good results were obtained. Adding polynomial features might shed light on possible interactions between positively correlated variables.

Let's try polynomial features, keeping only the interaction terms, and creating a Lasso model feeding it with these new features:

In [64]:
pf = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = pf.fit_transform(X_train.drop('gender', axis=1))
X_test_poly = pf.transform(X_test.drop('gender', axis=1))
lassoCV_poly = LassoCV(alphas = alphas, max_iter=100000)
lassoCV_poly.fit(X_train_poly, y_train)

print("Best model with polynomial features:\n")
print("Alpha: ", lassoCV_poly.alpha_)
print("R2 on test set: ", lassoCV_poly.score(X_test_poly, y_test))
rmse = np.sqrt(mean_squared_error(y_test, lassoCV_poly.predict(X_test_poly)))
print("Root mean squared error: ", rmse)
Best model with polynomial features:

Alpha:  0.13503140378698722
R2 on test set:  0.18848756892228846
Root mean squared error:  1.2299451744264844
In [65]:
# Note: this cross-validated rmse is computed on the original features X (not the
# polynomial ones), so it coincides with rmse_cv_ols above.
rmse_cv_poly = np.sqrt(mean_squared_error(y, cross_val_predict(LinearRegression(), X, y)))
print(rmse_cv_poly)
1.2263339068499821
In [66]:
selected_features = [(a,b) for a,b in zip(pf.get_feature_names(input_features=X_train.drop('gender', axis=1).columns), lassoCV_poly.coef_) 
                     if b!=0]
df_selected_features = pd.DataFrame(selected_features, columns=['Polynomial Feature', 'Coef']).sort_values(by='Coef')
print("Polynomial Features coefficients:")
display(df_selected_features.head())
display(df_selected_features.tail())
Polynomial Features coefficients:
Polynomial Feature Coef
46 sleep_hours sufficient_income -0.027884
48 sleep_hours daily_meditation -0.003442
20 supporting_others daily_meditation -0.003251
42 live_vision sleep_hours -0.002931
55 time_for_passion daily_meditation -0.002821
Polynomial Feature Coef
13 supporting_others social_network 0.002591
38 daily_steps lost_vacation 0.002792
24 social_network age 0.003054
53 daily_shouting daily_meditation 0.003471
45 sleep_hours daily_shouting 0.012051

The coefficients are very close to 0 because, despite the regularization, there is still a large number of features. However, we can't increase the regularization (parameter alpha), because we have seen that in that case $R^2$ drops to almost zero and the linear coefficients lose their reliability. In addition, the interaction terms obtained have fairly similar coefficients and don't make much sense, so they don't provide a good interpretation.

5.5 Best model for interpretation: conclusion

After exploring several possibilities, we discard adding polynomial features. We saw that regularization does not improve the results reached with the simple linear regression, giving very similar coefficients without eliminating any of them. In addition, it doesn't provide a better fit that could have made it more trustworthy. In conclusion, we can keep the simple linear regression.

In [67]:
print("rmse with cross validation predict:\n")
print("OLS: {}\nOLS with significantly different from zero coefficients: {}\nLasso: {}\nLasso with polynomial features: {}\nOLS with RFE: {}".format(
    rmse_cv_ols, rmse_cv_ols2, rmse_cv_lasso, rmse_cv_poly, rmse_cv_rfe))
rmse with cross validation predict:

OLS: 1.2263339068499821
OLS with significantly different from zero coefficients: 1.2260816886645598
Lasso: 1.226344366252987
Lasso with polynomial features: 1.2263339068499821
OLS with RFE: 1.2294077096448424

The model with the lowest rmse will be the most reliable one and thus the one we will use to explain which features have the strongest influence on stress. We discard polynomial features because those features did not make sense and there were many of them with very similar coefficients, which is not useful for explaining stress. Plain linear regression (OLS) restricted to the significant features best fits our interpretation objective, using only the following features:

In [68]:
print(coefs2)
gender              -0.327591
sufficient_income   -0.253618
sleep_hours         -0.098262
daily_meditation    -0.048974
time_for_passion    -0.033616
flow                -0.029732
core_circle         -0.023774
todo_completed      -0.023402
achievement         -0.016938
live_vision         -0.015428
places_visited      -0.013766
supporting_others    0.019710
personal_awards      0.022859
social_network       0.023035
lost_vacation        0.048548
daily_shouting       0.125046
const                4.135555
dtype: float64

5.6 Key Findings and Insights

Daily stress depends mainly on two features: gender and sufficient income. On average, women suffer more stress than men, and the same happens with people who do not have sufficient income. These findings are not surprising, since other studies also support these hypotheses. Other variables having an impact on stress, although considerably smaller, are the number of sleep hours, and to an even lesser degree daily meditation, lost vacation, etc. These have only a small influence, but it is not zero.

Comparison: here we compare the gender coefficient (and the sufficient income coefficient) with the others. We can use the following ratios to compare the influence of each variable on stress: the larger the ratio in absolute value, the smaller the influence of the variable in the denominator.

In [72]:
print("Ratios coefficients\n")
for i,coef in enumerate(coefs2[:2]):
    ratios = coef/coefs2[:-1].drop(coefs2.index[i])
    ratios.index = ratios.index.map(lambda x: coefs2.index[i] + "/" + x)
    print(ratios)
    print("\n")
Ratios coefficients

gender/sufficient_income     1.291672
gender/sleep_hours           3.333870
gender/daily_meditation      6.689066
gender/time_for_passion      9.744976
gender/flow                 11.018036
gender/core_circle          13.779386
gender/todo_completed       13.998338
gender/achievement          19.340617
gender/live_vision          21.233975
gender/places_visited       23.796995
gender/supporting_others   -16.620255
gender/personal_awards     -14.331076
gender/social_network      -14.221421
gender/lost_vacation        -6.747794
gender/daily_shouting       -2.619754
dtype: float64


sufficient_income/gender                0.774190
sufficient_income/sleep_hours           2.581049
sufficient_income/daily_meditation      5.178610
sufficient_income/time_for_passion      7.544465
sufficient_income/flow                  8.530056
sufficient_income/core_circle          10.667866
sufficient_income/todo_completed       10.837376
sufficient_income/achievement          14.973317
sufficient_income/live_vision          16.439136
sufficient_income/places_visited       18.423401
sufficient_income/supporting_others   -12.867239
sufficient_income/personal_awards     -11.094979
sufficient_income/social_network      -11.010085
sufficient_income/lost_vacation        -5.224076
sufficient_income/daily_shouting       -2.028188
dtype: float64


Interpretation: these coefficient ratios say how many units of the corresponding feature (sleep hours, daily meditation, flow, daily shouting...) would have the same effect on daily stress as gender. So we see, for example, that being a woman has roughly the same effect on stress as scoring about 2.6 points higher on the shouting scale 0-10 (0: not at all, 10: very often), and that each sleep hour is comparable to about 2 more points on the meditation scale or 3 more hours dedicated to your passion, etc.

5.7 Next steps

Here are some ideas to improve the model built so far:

  • Better performance of the linear model could be achieved by adding some features. Some ideas are the level of noise people are exposed to, how many hours they spend in nature, or how many negative people respondents have in their social circle (according to their own criteria, which is what matters). On the other hand, a more advanced model could come from a more technical study, where bodily indicators related to stress taken from medical analyses could be used along with noise measurements in decibels and other measurable features.
  • We could apply clustering or classification algorithms, which might perform better than the linear model since we are dealing with discrete categories within intervals instead of continuous variables (a minimal sketch follows below). Additional data could be gathered for this model. For example, new features such as the living area (big city/town/village) might have an impact on stress: it is said that people in cities are in general more stressed, and in fact the noise of modern life is a well-known source of stress. We could also add information about the type of job (physical/sedentary) or the working hours and, regarding the social network, it would be insightful to know what the people we are surrounded by are like, and not only how many they are, which is all this dataset records. We are strongly influenced by our social circle; we could test to what extent this is true by dividing people into clusters or running a classification algorithm, and see, for example, whether stressed or successful people are usually surrounded by other stressed or successful people respectively.
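
As a starting point for the classification idea above, daily_stress can be treated directly as a class label; a minimal sketch, assuming `X` and `y` as defined in section 5 (the extra features mentioned in the bullets are not available in this dataset):

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Treat the 0-5 stress levels as discrete classes instead of a continuous target
clf = RandomForestClassifier(n_estimators=200, random_state=2021)
scores = cross_val_score(clf, X, y.astype(int), cv=5, scoring='accuracy')
print("Mean CV accuracy:", scores.mean())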

6. Statistical Hypothesis Testing for dependencies and correlations between variables

I'll perform tests on some pairs of variables to check whether there is a dependency and, going one step further, whether there is a correlation between them.

$\chi^2$ test to check independence:

I'll calculate the statistic $\chi^2 =\large \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}}$, where:
$O_{i}$ is each of the observed (real) values
$E_{i}$ is each of the expected values, i.e. the count we would expect from the marginal proportions if the two variables were independent.

I'll create a function that shows the crosstab counts of 2 variables, both observed and expected (from the proportion of each category within the variable it belongs to), and calculates the $\chi^2$ statistic:

In [66]:
def independence_test(variable1, variable2):
    
    '''
    Return: DataFrame comparing observed vs expected data, chi2, chi2_matrix (addends, contributions to the value of chi2)
    
    Example: 
    
    >> df, chi2, chi2_matrix = independence_test('feature1', 'feature2')
    '''

    # variable1 in Rows
    # variable2 in Columns

    # Proportions
    proportions = (data[variable2].value_counts()/data[variable2].count()).sort_index()

    # Expected values
    value_counts_var1 = (data[variable1].value_counts()).sort_index()
    expected_df = pd.DataFrame()
    for index,value in zip(proportions.index, proportions.values):
        expected_df[index] = value_counts_var1*value

    # Observed values
    observed_df = pd.crosstab(data[variable1], data[variable2])
    observed_df.index.name = None
    observed_df.columns.name = None

    # Dataframe Observed vs Expected values
    reality = ['Observed','Expected']
    df_concat = pd.concat([observed_df, expected_df], axis=1, keys=reality)
    df_concat = df_concat.sort_index(axis=1, level=1)
    df_concat.columns = df_concat.columns.map(lambda x: x[::-1])
    df_concat = df_concat.sort_index(axis=0)
    print('Index: Values of {} \nColumns: Values of {}'.format(variable1, variable2))
    display(df_concat)

    # Chi2
    print("-"*100)
    print("\nCalculation of chi2:")
    print("\nAddends (contributions to the values of chi2):")
    chi2_matrix = (((expected_df - observed_df)**2)/expected_df).sort_index(axis=1)
    chi2_matrix.index.name = variable1
    chi2_matrix.columns.name = variable2
    display(chi2_matrix)
    chi2 = chi2_matrix.sum(axis=1).sum(axis=0)
    print("chi2:", chi2)
    
    return df_concat, chi2, chi2_matrix
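
As a cross-check, scipy.stats.chi2_contingency computes the same statistic directly from the contingency table and also returns the p-value and degrees of freedom; a small sketch, assuming `data` as above:

In [ ]:
from scipy.stats import chi2_contingency

# Same chi2 as independence_test, plus p-value, degrees of freedom and expected counts
table = pd.crosstab(data['daily_stress'], data['gender'])
chi2_value, p_value, dof, expected = chi2_contingency(table)
print("chi2:", chi2_value, " p-value:", p_value, " dof:", dof)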

t-test to check correlation

Let's perform a significance test to check whether there is a correlation significantly different from 0. We'll use $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. Explanations of this expression and of the significance test for correlations are given in the attached observations at the end of this notebook.

In [67]:
def correlation_test(variable1, variable2):
    
    '''
    Return: t statistic 
    
    '''
    
    r = data.corr()[variable1][variable2]
    n = data.shape[0]
    std_r = np.sqrt((1-r**2)/(n-2))
    print("Correlation coefficient r: ", r)
    print("Standard error of correlation coefficient: ", std_r)

    #t statistic
    t_statistic = (r*np.sqrt(n-2))/np.sqrt(1-r**2)
    print("\nt statistic: ", t_statistic)
    
    return t_statistic
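
The same r and its two-sided p-value can also be obtained with scipy.stats.pearsonr, which is a convenient sanity check for the function above; a minimal sketch, assuming `data` as above:

In [ ]:
from scipy.stats import pearsonr

# pearsonr returns the Pearson correlation coefficient and the two-sided p-value
r, p = pearsonr(data['daily_stress'], data['gender'])
print("r:", r, " p-value:", p)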

$H_{0}$: Stress levels are independent of gender: women are as stressed as men

$H_{1}$: Stress levels do depend on gender

1. $\chi^2$ test to check independence:

I'll perform a $\chi^2$ test. There are $n_{1} = 6$ stress levels and $n_{2} = 2$ possible values for gender, so we have $(n_{1} - 1)*(n_{2} - 1) = 5$ degrees of freedom. The critical $\chi^2$ values for 5 degrees of freedom are:

$\chi^2_{0.01, 5} = 15.087$ for 99% of confidence level
$\chi^2_{0.005, 5} = 16.75$ for 99.5% of confidence level
$\chi^2_{0.001, 5} = 20.515$ for 99.9% of confidence level
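
These critical values can be recovered with scipy.stats.chi2.ppf (the inverse CDF); a quick check:

In [ ]:
from scipy.stats import chi2 as chi2_dist

# Critical values of chi2 with 5 degrees of freedom for each significance level
for alpha in [0.01, 0.005, 0.001]:
    print(alpha, "->", chi2_dist.ppf(1 - alpha, df=5))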

This is done by the function created above, independence_test:

In [68]:
print(gender_translator, "\n")
df_concat_stress_gender, chi2_stress_gender, chi2_matrix_stress_gender = independence_test('daily_stress', 'gender')
{'Female': 0, 'Male': 1} 

Index: Values of daily_stress 
Columns: Values of gender
0 1
Expected Observed Expected Observed
0 339.887730 243 222.112270 319
1 1207.750529 1006 789.249471 991
2 1631.703018 1592 1066.296982 1106
3 2137.905919 2270 1397.094081 1265
4 1403.095257 1448 916.904743 872
5 993.657546 1155 649.342454 488
----------------------------------------------------------------------------------------------------

Calculation of chi2:

Addends (contributions to the values of chi2):
gender 0 1
daily_stress
0 27.618627 42.263457
1 33.701725 51.572130
2 0.966064 1.478321
3 8.161653 12.489385
4 1.437134 2.199177
5 26.197544 40.088843
chi2: 248.17405935784157

The critical $\chi^2$ value for 5 degrees of freedom and a significance level of 0.001 is $\chi^2_{5,0.001} = 20.515$. We obtained $248.174$, so we can reject the null hypothesis "stress level is independent of gender" with (at least) a 99.9% confidence level.

We can accept the alternative hypothesis "stress level depends on gender". This doesn't necessarily imply correlation, although we can see that it does exist. We see the differences between women and men: for low stress levels (0 and 1) there are fewer women ('Observed' values) than expected and more men than expected from the male/female proportion, while for the highest stress level (5) there are more women and fewer men than expected. By eye, women suffer more stress than men, generally speaking. But we have to perform a significance test for the correlation, which we'll do below, to back this up with numbers.

If we look at the $\chi^2$ contribution of each stress level separately, we see that answers of 2 and 4 in "stress level" follow the male/female proportion, so no dependence on gender shows up among these middle values. The differences come from the extreme values: 0, 1 and 5:

In [69]:
from matplotlib import colors
norm=colors.TwoSlopeNorm(vmin=0, vcenter=6.63/2, vmax=11)
sns.heatmap(chi2_matrix_stress_gender, norm=norm, cmap='RdBu_r');

The critical $\chi^2$ value for 1 degree of freedom is $\chi^2_{1, 0.01} = 6.63$ at the 0.01 significance level, so in the heatmap the red cells are the ones pushing the row sum towards 6.63 and thus contributing to that particular stress level being significantly dependent on gender.

In [70]:
df_diff = pd.DataFrame()
for col in df_concat_stress_gender.columns.levels[0]:
    df_diff[col] = (df_concat_stress_gender[(col,'Observed')] - df_concat_stress_gender[(col,'Expected')])

norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
sns.heatmap(df_diff, norm=norm, cmap='RdBu_r', annot=True);

We clearly see a correlation: there are more observed cases than expected for the pairs men (1) - low stress (0, 1, 2), and for the pairs women (0) - high stress (3, 4, 5). The previous heatmap shows where these differences are significant, and below we'll perform a t-test to confirm the significance of the correlation.

In [71]:
# Plots

plt.figure(figsize=(8,4))

plt.subplot(1,2,1)
plt.plot(df_concat_stress_gender[0])
plt.legend(df_concat_stress_gender[0].columns);
plt.title('Female');

plt.subplot(1,2,2)
plt.plot(df_concat_stress_gender[1])
plt.legend(df_concat_stress_gender[1].columns);
plt.title('Male');

2. t-test to check correlation

In [72]:
t_statistic = correlation_test('daily_stress', 'gender')

# p-value for n-2 degrees of freedom
from scipy.stats import t
p_value = t.cdf(t_statistic, n-2)*2 #2 tails into account, to obtain P>|t|
print(p_value)
Correlation coefficient r:  -0.128676852685506
Standard error of correlation coefficient:  0.00878149089059826

t statistic:  -14.653189793007869
3.1680727107627955e-48

Strictly, the residuals should be normally distributed. However, the dependency is guaranteed by the $\chi^2$ test, and looking at the data it is noticeable that this dependency follows a definite direction. Although this t-test is not definitive, everything points to the hypothesis that women suffer more stress than men, and in fact there are other studies on this topic reaching the same conclusion.

$𝐻_0$ : Stress levels are independent of sufficient income

1. $\chi^2$ test to check independence

In [73]:
variable1, variable2 = 'daily_stress', 'sufficient_income'
df_concat_stress_income, chi2_stress_income, chi2_matrix_stress_income = independence_test(variable1, variable2)
Index: Values of daily_stress 
Columns: Values of sufficient_income
1 2
Expected Observed Expected Observed
0 152.760016 111 409.239984 451
1 542.814504 381 1454.185496 1616
2 733.356801 614 1964.643199 2084
3 960.865935 957 2574.134065 2578
4 630.610741 724 1689.389259 1596
5 446.592003 680 1196.407997 963
----------------------------------------------------------------------------------------------------

Calculation of chi2:

Addends (contributions to the values of chi2):
sufficient_income 1 2
daily_stress
0 11.415938 4.261311
1 48.237351 18.005910
2 19.425805 7.251213
3 0.015554 0.005806
4 13.830329 5.162548
5 121.988958 45.535715
chi2: 295.1364376469751
In [74]:
chi2_matrix_stress_income
Out[74]:
sufficient_income 1 2
daily_stress
0 11.415938 4.261311
1 48.237351 18.005910
2 19.425805 7.251213
3 0.015554 0.005806
4 13.830329 5.162548
5 121.988958 45.535715
In [75]:
norm=colors.TwoSlopeNorm(vmin=0, vcenter=6.63/2, vmax=11)
sns.heatmap(chi2_matrix_stress_income, norm=norm, cmap='RdBu_r');

There are especially large divergences between expected and observed values for a stress level of 5, and also for 1. As in the previous case of stress vs gender, the critical $\chi^2$ value for 1 degree of freedom is $\chi^2_{1,0.01} = 6.63$ at the 0.01 significance level, so in the heatmap the red cells are the ones pushing the row sum towards 6.63 and thus contributing to that particular stress level being significantly dependent on income.

In [76]:
df_diff = pd.DataFrame()
for col in df_concat_stress_income.columns.levels[0]:
    df_diff[col] = (df_concat_stress_income[(col,'Observed')] - df_concat_stress_income[(col,'Expected')])

norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
sns.heatmap(df_diff, norm=norm, cmap='RdBu_r', annot=True);

Once again, we see a negative correlation between these two variables. Let's confirm this correlation is significant:

2. t-test to check correlation

In [77]:
t_statistic = correlation_test(variable1, variable2)
p_value = t.cdf(t_statistic, n-2)*2
print("\np-value: ", p_value)
Correlation coefficient r:  -0.14487168163684583
Standard error of correlation coefficient:  0.008761689756615417

t statistic:  -16.534673751425867

p-value:  8.849120787406811e-61

Again, strictly the residuals should be normally distributed. However, the dependency is guaranteed by the $\chi^2$ test, and looking at the data it is noticeable that this dependency follows a definite direction. Although this t-test is not definitive, everything points to the alternative hypothesis: stress and sufficient income are negatively correlated; the less sufficient the income, the more stress. Other studies also support this hypothesis, under the name of "financial stress".

$𝐻_0$ : Stress levels are independent of daily shouting

1. $\chi^2$ test to check independence

In [78]:
variable1, variable2 = 'daily_shouting', 'daily_stress'
output = independence_test(variable1, variable2)
chi2_matrix_stress_shouting = output[2]
df_concat_stress_shouting = output[0]
Index: Values of daily_shouting 
Columns: Values of daily_stress
0 1 2 3 4 5
Expected Observed Expected Observed Expected Observed Expected Observed Expected Observed Expected Observed
0 87.241082 244 310.000784 460 418.819287 436 548.749510 416 360.141121 264 255.048216 160
1 132.668130 142 471.420384 673 636.901450 699 834.487260 806 547.669149 434 387.853626 257
2 93.894316 82 333.642258 379 450.759545 538 590.598589 596 387.606429 354 274.498863 182
3 71.379067 34 253.637005 214 342.670325 370 448.976872 495 294.660917 328 208.675813 179
4 43.840847 12 155.783222 81 210.467268 210 275.760486 387 180.980008 181 128.168169 124
5 44.413642 20 157.818581 95 213.217091 192 279.363387 274 183.344571 221 129.842728 206
6 18.285378 5 64.974912 28 87.782830 68 115.015680 131 75.484124 115 53.457076 68
7 24.013328 4 85.328499 24 115.281066 77 151.044688 171 99.129753 157 70.202666 112
8 13.086162 2 46.500118 14 62.822893 27 82.312426 105 54.021168 76 38.257232 73
9 4.890788 2 17.378832 6 23.479263 13 30.763230 25 20.189730 33 14.298158 32
10 28.287260 15 100.515406 23 135.798981 68 177.927871 129 116.773030 157 82.697452 250
----------------------------------------------------------------------------------------------------

Calculation of chi2:

Addends (contributions to the values of chi2):
daily_stress 0 1 2 3 4 5
daily_shouting
0 281.671867 72.579703 0.704783 32.113801 25.665259 35.421394
1 0.656403 86.195555 6.054673 0.972482 23.592119 44.147251
2 1.506745 6.166260 16.884605 0.049399 2.913760 31.169673
3 19.574291 6.194255 2.179678 4.717678 3.772114 4.220201
4 23.125455 35.899439 0.001037 44.873106 0.000002 0.135553
5 13.419884 25.004496 2.111299 0.102970 7.733697 44.668886
6 9.652591 21.041107 4.458279 2.221423 20.686528 3.956383
7 16.679625 44.078881 12.711888 2.636402 33.783656 24.885339
8 9.391829 22.715161 20.426943 6.253321 8.942218 31.551156
9 1.708652 7.450317 4.677104 1.079692 8.128045 21.915776
10 6.241371 59.778281 33.849310 13.454534 13.857730 338.464389
chi2: 1712.8777027227065
In [79]:
norm=colors.TwoSlopeNorm(vmin=0, vcenter=15/5, vmax=15)
sns.heatmap(chi2_matrix_stress_shouting, norm=norm, cmap='RdBu_r');

Once again, the blue cells are the ones not pushing the row sum towards 15, which is the critical $\chi^2$ value for 5 degrees of freedom and a significance level of 0.01 (99% confidence level), so we can interpret them as not being so unexpected. The red cells correspond to pairs of categories with more observed cases than would be "normal" under the proportions' distribution. Let's see the differences between the observed and expected values:

In [80]:
# Differences observed-expected

df_diff = pd.DataFrame()
for col in df_concat_stress_shouting.columns.levels[0]:
    df_diff[col] = (df_concat_stress_shouting[(col,'Observed')] - df_concat_stress_shouting[(col,'Expected')])

norm=colors.TwoSlopeNorm(vmin=df_diff.min().min(), vcenter=0, vmax=df_diff.max().max())
print("\nHeatmap differences data observed - data expected")
sns.heatmap(df_diff, norm=norm, cmap="RdBu_r", annot=True);
Heatmap differences data observed - data expected

Here the correlation is easily recognizable: there are more cases of low stress among people with low shouting scores than expected from the proportions' distribution (positive observed-expected differences), and more cases of high stress among people with high shouting scores than expected. On the other hand, there are negative differences (blues) for the crossed categories (low stress and high shouting, high stress and low shouting), indicating fewer highly stressed people among those who don't shout much, and also fewer relatively unstressed people among those who shout a lot. In general, the more shouting, the more stress.

2. t-test to check correlation

In [81]:
t_statistic = correlation_test(variable1, variable2)
p_value = (1 - t.cdf(t_statistic, n-2))*2 
print("\n p-value: ", p_value)
Correlation coefficient r:  0.30926429602231414
Standard error of correlation coefficient:  0.008420995496599783

t statistic:  36.72538432625792

 p-value:  0.0

Again, strictly the residuals should be normally distributed, but everything points to the alternative hypothesis: stress and daily shouting are positively correlated; the more shouting, the more stress.

Attached:

Below is a series of observations and notes to myself about ideas and questions that arose and were resolved during this study. Take a look if you find it useful for a better understanding of the approaches carried out in this project.

1. Performing sm.OLS we obtain coefficients different from those of LinearRegression:

import statsmodels.api as sm
lr = sm.OLS(y_train, X_train)
result = lr.fit()
result.summary2()

we see coefficients that differ from the ones given by LinearRegression. Moreover, there are features with a weak correlation with daily_stress and, nevertheless, a high t, meaning a low probability of the corresponding coefficient being 0 (for example, social_network has t=10.1 while its correlation with daily_stress is 0.012720, so they are barely correlated). What's happening? This is because sm.OLS doesn't include the intercept by default, so in the least-squares fit the function it minimizes is:

$f = \sum (y_{i}^{pred}-y_{i})^2 = \sum (\beta_{1} x_{i} - y_{i})^2$, where $\beta_{1}, x_{i}$ can be vectors. The thing is that the intercept $\beta_{0}$ is not included.

This is $\sum (\beta_{0} + \beta_{1} x_{i} - y_{i})^2$ with $\beta_{0} = 0$. Thus, the resulting regression line passes through the origin (0,0) and the slope differs from the one obtained with an intercept. Adding X = sm.add_constant(X) before lr = sm.OLS(y, X), the variable X will include a column of ones, and the corresponding $\beta$ coefficient will be $\beta_{0}$, as it multiplies that column of ones.

X_train_ones = sm.add_constant(X_train)
ols = sm.OLS(y_train, X_train_ones)
ols = ols.fit()

Doing this, we obtain the intercept, and the result is the same as fitting LinearRegression with our original X (without the column of ones, since LinearRegression already accounts for the intercept via its default argument fit_intercept=True).
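
A quick numerical check of this equivalence (assuming X_train and y_train as in section 5):

In [ ]:
# statsmodels OLS with an explicit constant column...
ols_with_const = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# ...gives the same intercept and coefficients as sklearn's LinearRegression
sklearn_lr = LinearRegression().fit(X_train, y_train)

print(np.allclose(ols_with_const.params['const'], sklearn_lr.intercept_))
print(np.allclose(ols_with_const.params.drop('const').values, sklearn_lr.coef_))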

 

2. Distribution of coefficients from random samples: $N(0,\sigma)$

In [82]:
x = np.linspace(0,100,1000).reshape(-1,1)
y = np.random.rand(1000)*1000000

coeficientes = np.array([])
lin = LinearRegression()
for i in range(1000):
    y = np.random.rand(1000)*100
    lin.fit(x,y)
    coeficientes = np.hstack([coeficientes, lin.coef_])
In [83]:
plt.hist(coeficientes);
In [84]:
from scipy.stats import normaltest
normaltest(coeficientes)
Out[84]:
NormaltestResult(statistic=0.012647543546749278, pvalue=0.9936961811898626)

For random samples, the mean is the best model, the one which minimizes the squared errors. Here we can see that linear regression on random samples gives a coefficient of 0 with the highest probability, with the coefficients following a normal distribution: $coef \sim N(0,\sigma)$

3. Hypothesis tests about stress level and gender

We performed the analysis by stress level: given a particular stress level, how many women and men answered. Let's now perform some hypothesis tests by gender. Given the male answers, are men less stressed than the population on average? Are women more stressed than the population on average?

H0: Men are as stressed as the average of the population

In [85]:
df_concat_stress_gender
Out[85]:
0 1
Expected Observed Expected Observed
0 339.887730 243 222.112270 319
1 1207.750529 1006 789.249471 991
2 1631.703018 1592 1066.296982 1106
3 2137.905919 2270 1397.094081 1265
4 1403.095257 1448 916.904743 872
5 993.657546 1155 649.342454 488
In [ ]:
real_df = df_concat_stress_gender[[(0,'Observed'),(1,'Observed')]]
expected_df = df_concat_stress_gender[[(0,'Expected'),(1,'Expected')]]
real_df.columns = real_df.columns.droplevel(1)
expected_df.columns = expected_df.columns.droplevel(1)

chi2 = sum((real_df[1] - expected_df[1])**2/expected_df[1])
chi2

Rejected: the statistic (the sum of the male column of the $\chi^2$ matrix above, about 150) far exceeds the critical value $\chi^2_{5, 0.01} = 15.09$.

H0: Women are as stressed as the average of the population

In [ ]:
sum((real_df[0] - expected_df[0])**2/expected_df[0])

Rejected: the statistic (about 98) also far exceeds $\chi^2_{5, 0.01} = 15.09$.

4. Testing the significance of correlation coefficient

We saw that the categories of daily stress and gender are not independent. Now we have to show that, beyond this, there is a correlation, with women being more stressed than men. The fact that they are dependent says nothing by itself about correlation.

Regarding correlation, we obtained a value of data.corr()['daily_stress']['gender'] = -0.12867. Now we have to check whether this is significantly different from zero, to conclude that there is not only a dependence but also a (negative) tendency. To obtain the significance, the test we have to perform is a t-test, where $H_{0}: \beta_1=0$ (with $\beta_1$ the slope) and we estimate the standard deviation of the population slopes by the standard deviation of the slope obtained from our sample.

Standard error of the slope in a linear regression:

$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum(x_i - \overline{x})^2} = \frac{\sigma^2}{S_{xx}} \implies Std(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{xx}}}$

(the step-by-step derivation was shown here as an image)

Note that the variance of the slope estimator comes from the variance of the residuals, $\sigma$, assuming the residuals are normally distributed. The meaning of the slope estimator's variance is not so much the uncertainty about the value of the slope that minimizes the cost function, but rather the small differences in the slope value we would need to predict each individual data point. So we have a single overall $\beta_1$, the one used for all predictions, and the variance says that a slightly (or very) different one would be needed to obtain each $y_i$ as $y_i = \beta_1\, x_i$.

The variance of $\beta_1$ can be interpreted as absorbing the residuals $\mathbf{\epsilon \sim N(0,\sigma)}$ into $\beta_1$, leaving only $\mathbf{y = \beta_1\, x}$, with $\beta_1$ having a variance, instead of $y = \beta_1\, x + \epsilon$.

Calculation of t statistic

$\small \sigma^2 = MSE$, so:

$Std(\beta_1) = \large \frac{\sqrt{MSE}}{\sqrt{S_{xx}}} \hspace{.5cm}\normalsize [0]$

 

The calculated value of the statistic t is:

$t =\Large \frac{\beta_1 - 0}{Std(\beta_1)} \hspace{.5cm}\normalsize [1]$, where

 

$\beta_1 =\frac{\sum(x_i - \overline{x})(y_i - \overline{y})}{\sum(x_i - \overline{x})^2} \hspace{.5cm}\normalsize [2]$

is the expression for $\beta_1$ obtained by minimum squares and $Std(\beta_1)$ is given by [0]

We can also obtain an expression of t from the Pearson correlation coefficient, r. We need the relation between $\beta_1$ and r:

$\beta_1 = r\cdot\sqrt{\frac{S_{yy}}{S_{xx}}} \hspace{.5cm}\normalsize [3]$

which is derived from [2] and the expression for the Pearson correlation coefficient, $r=\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$. Taking [3] into account and proceeding as in the derivation of the standard error of the slope above, we obtain $std(r) = std(\beta_1)\cdot\frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}} = ... = \sqrt{\frac{1-r^2}{N-2}} \hspace{.5cm}\normalsize [4]$ (where we use $r^2 = 1 - \frac{SSE}{SST}$), and finally:

$\large t = \frac{r - 0}{std(r)} = \frac{r}{\sqrt{\frac{1-r^2}{N-2}}} = ... = \frac{r·\sqrt{N-2}}{\sqrt{1-r^2}} \hspace{.5cm}\normalsize [5]$
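
A quick numerical check of [2], [3] and [4] on our data (gender as x, daily_stress as the response), assuming `data` as above:

In [ ]:
x_ = data['gender']
y_ = data['daily_stress']
Sxx = ((x_ - x_.mean())**2).sum()
Syy = ((y_ - y_.mean())**2).sum()
Sxy = ((x_ - x_.mean())*(y_ - y_.mean())).sum()

r = Sxy/np.sqrt(Sxx*Syy)                      # Pearson r
print("beta1 from least squares, Eq.[2]:", Sxy/Sxx)
print("beta1 from r, Eq.[3]            :", r*np.sqrt(Syy/Sxx))
print("std(r), Eq.[4]                  :", np.sqrt((1 - r**2)/(len(x_) - 2)))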

Comparison with the values returned by sm.OLS(y, X).fit().summary2()

Let's check that the values of t obtained by hand from [1] or from [5], as well as the value of $Std(\beta_1)$ from [0], are the ones shown automatically by the summary2() method of the sm.OLS object, the OLS linear regressor of statsmodels.api:

In [ ]:
# t with Eq.[5]

r = data.corr(method='pearson')['daily_stress']['gender']
N = data.shape[0]
std_r = np.sqrt((1-r**2)/(N-2)) #Equation [4]
t = (r-0)/std_r #Equation [5]
print(t)
In [ ]:
# t with Eq.[1]
# (uses `result` and `std_b1`, which are computed in the two cells below)

t = result.params['gender'] / std_b1 #Equation [1]
print(t)
In [ ]:
import statsmodels.api as sm
y = data['daily_stress']
X = sm.add_constant(data['gender'])
lr = sm.OLS(y, X)
result = lr.fit()
In [ ]:
rmse = np.sqrt(mean_squared_error(y, result.predict(sm.add_constant(X['gender']))))
Sxx = ((X['gender'] - X['gender'].mean())**2).sum()

std_b1 = rmse/np.sqrt(Sxx) #Equation [0]
print(std_b1)

$t = -14.653189$ and $std\_b1 = 0.02465$ match the t and std.err given by result.summary2():

In [ ]:
result.summary2()

Note and discussion: Residuals are not normally distributed

The expression for the slope estimator's variance comes from assuming normally distributed residuals. However, this is not our case (I tried transforming the data, but couldn't obtain normally distributed residuals in any case), so we should be prudent with the conclusions reached here. Nevertheless, there are other studies affirming that women suffer more stress than men, as well as studies about the so-called "financial stress" that relate stress to low income. Besides, in our study we can easily see the lower-than-expected number of observed cases of women with low stress levels and the higher-than-expected number of women with high stress levels, and since the dependency between the categories is guaranteed by the $\chi^2$ independence test (which is not affected by the distribution of residuals, as it has nothing to do with descriptive trend models and only deals with unordered categories), it is very likely a real correlation.

5. Pearson correlation coefficient vs linear regression coefficient

The Pearson correlation coefficient (r) is an indirect measure of the t-statistic, which determines whether the linear regression coefficient ($\beta_1$) is 0 (t~0) or not (|t|>>0). The relation between the two is $t = \frac{r\cdot\sqrt{N-2}}{\sqrt{1-r^2}}$: as $r \rightarrow 0$, $t \rightarrow 0$, and as $|r| \rightarrow 1$, $|t| \rightarrow \infty$. The magnitude of r tells how well the data fit a straight line; the larger |r|, the less $\beta_1$ varies around its estimated value ($\Delta \beta_1 \sim 0$) and thus the less likely it is to be 0 (strictly, the less likely it is that, if it were 0 in reality, we would have obtained data so well aligned and with so little variance).

Coefficients VS Correlations (Pearson)

In [55]:
df_coef = pd.DataFrame(list(zip(X.columns, OLSresults.params[1:])), columns=['Feature', 'Coefficient LR'])
df_coef['Pearson Correlation r'] = data.corr()['daily_stress'].drop('daily_stress').values
df_coef.sort_values(by='Coefficient LR')
Out[55]:
Feature Coefficient LR Pearson Correlation r
19 gender -0.331477 -0.128677
14 sufficient_income -0.256305 -0.144872
11 sleep_hours -0.095864 -0.152862
17 daily_meditation -0.048766 -0.213672
16 time_for_passion -0.032759 -0.161858
8 flow -0.029573 -0.142187
2 core_circle -0.023396 -0.115973
7 todo_completed -0.023260 -0.166975
0 fruits_veggies -0.019387 -0.095132
5 achievement -0.016670 -0.120786
10 live_vision -0.015525 -0.135016
1 places_visited -0.012374 -0.131707
6 donation -0.000887 -0.038291
9 daily_steps 0.004181 -0.057925
3 supporting_others 0.019178 -0.035373
18 age 0.020990 -0.020649
15 personal_awards 0.022434 -0.044362
4 social_network 0.023383 0.012720
12 lost_vacation 0.048679 0.192353
13 daily_shouting 0.125297 0.309264

Linear regression coefficients do not follow the same ascending/descending order as the Pearson correlation coefficients (r). As we mentioned above, Pearson correlation coefficients tell how well a linear model fits the data, but not how much the variable influences stress. What tells us which features influence stress the most is the magnitude of the linear regression coefficients.
