Write a Python function that:
- reads the users.csv file line per line;
- prints "<name> is <age> years old." for each line.
Here is a demo about how to read a file line per line:
with open("path to the file", "r") as finput:  # "r" means read mode and finput is a variable (its name is free)
    for line in finput:
        print(line)
To split a string s according to a separator sep, you should use the split function (s.split(sep)). This function returns a list.
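As a small illustration of split, the sketch below cuts one line into fields. The sample line and the comma separator are assumptions made for the example; the actual layout of users.csv may differ.

```python
# Hypothetical line, as it could be read from a text file (note the trailing newline)
line = "Alice,23\n"

# strip() removes the trailing newline, split(",") cuts the string at each comma
fields = line.strip().split(",")

# split returns a list of strings, so the age has to be converted explicitly
name, age = fields[0], int(fields[1])

print(fields)
print(name, age)
```

Without the call to strip(), the last field would keep the newline character.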
Write a Python function that reads users.csv into a dictionary dUsers and returns it. dUsers must have the following format:
dUsers = {
    id: {
        "name" : name,
        "age" : age,
        "sex" : sex,
        "interests" : ["interest1", "interest2"]
    }
}
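To make this format concrete, here is a hand-written dUsers with two invented users (the ids and values are purely illustrative and do not come from users.csv), followed by a few typical accesses on the nested structure.

```python
# Illustrative data only: these users are invented for the example
dUsers = {
    1: {"name": "Alice", "age": 23, "sex": "F", "interests": ["music", "chess"]},
    2: {"name": "Bob", "age": 31, "sex": "M", "interests": ["chess"]},
}

# Access one field of one user
alice_age = dUsers[1]["age"]

# Average age over all users
mean_age = sum(user["age"] for user in dUsers.values()) / len(dUsers)

# Ids of the users who list a given interest
chess_fans = [uid for uid, user in dUsers.items() if "chess" in user["interests"]]

print(alice_age, mean_age, chess_fans)
```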
Using the data structure you have just created, write two Python functions that:
We now consider the links between users and thus manipulate the links.csv file. Create a Python function that returns an adjacency list dRel with the following format:
dRel = {
    id : [list of connected users]
}
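A minimal sketch of how such an adjacency list could be built. It assumes that each link is a pair of user ids and that links are undirected; both points are assumptions, since the layout of links.csv is not described here. The function name build_adjacency and the sample links are hypothetical.

```python
from collections import defaultdict


def build_adjacency(links):
    """Build an adjacency list from (id_a, id_b) pairs, treating links as undirected."""
    dRel = defaultdict(list)
    for a, b in links:
        dRel[a].append(b)
        dRel[b].append(a)
    return dict(dRel)


# Made-up links, standing in for the content of links.csv
links = [(1, 2), (1, 3), (2, 3), (3, 4)]
dRel = build_adjacency(links)
print(dRel)
```

Using defaultdict(list) avoids testing whether an id is already a key before appending to its list.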
Write two Python functions that:
The purpose of this last exercise is to write a basic recommender system that implements the following principle: for each user, the system should recommend the most common interest among his/her friends. If the user already has this interest, the second most common one is recommended instead.
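One possible reading of this principle is sketched below. It relies on the dUsers and dRel formats introduced earlier; the function name recommend and the sample data are hypothetical. The sketch slightly generalizes the rule: it returns the most frequent interest among friends that the user does not already have, and ties are broken by order of first appearance.

```python
from collections import Counter


def recommend(user_id, dUsers, dRel):
    """Return the most common interest among friends that the user lacks, or None."""
    own = set(dUsers[user_id]["interests"])
    counts = Counter(
        interest
        for friend in dRel.get(user_id, [])
        for interest in dUsers[friend]["interests"]
    )
    # most_common() lists interests from the most to the least frequent
    for interest, _ in counts.most_common():
        if interest not in own:
            return interest
    return None


# Invented data, reduced to the only field the function needs
dUsers = {
    1: {"interests": ["chess"]},
    2: {"interests": ["chess", "music"]},
    3: {"interests": ["chess", "music"]},
    4: {"interests": ["hiking"]},
}
dRel = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}

print(recommend(1, dUsers, dRel))
print(recommend(4, dUsers, dRel))
```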
There are 5 major steps in any data science / machine learning project:
A brief introduction about how these steps can be handled in Python (>= 3.6) is given below.
Pandas
Pandas is a library written for the Python programming language that allows data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical vectors and time series.
It provides the DataFrame object to manipulate data easily and efficiently with indexes that can be strings.
Documentation link: https://pandas.pydata.org/pandas-docs/stable/
NumPy
NumPy is an extension of the Python programming language designed to manipulate multidimensional arrays and matrices, together with mathematical functions operating on these arrays. It offers much more efficient types and operations than the standard library, and provides vectorized shortcuts for bulk processing.
Documentation link: https://docs.scipy.org/doc/
Matplotlib
Matplotlib is a library of the Python programming language designed to plot and visualize data in graphical form. It can be combined with the NumPy and SciPy python scientific computation libraries.
Documentation link: https://matplotlib.org/contents.html
Scikit-learn
Scikit-learn is a free Python library dedicated to machine learning. It is developed by many contributors, particularly in the academic world, by French institutes of higher education and research such as Inria and Télécom ParisTech. It includes functions for estimating random forests, logistic regressions, classification algorithms, and support vector machines. It is designed to work together with other free Python libraries, including NumPy and SciPy.
Documentation link: http://scikit-learn.org/stable/
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib
import numpy as np
%matplotlib inline
In machine learning competitions, two files are usually given: a training file that is used to train the machine learning algorithm, and a test file that is used to measure the performance of the algorithm.
Instructions: read the pandas documentation and find how to read the two CSV files. Then, print the first ten lines of the train data frame using the head function.
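As a hedged sketch of the reading step: pandas provides read_csv, which accepts a file path or a file-like object. The file names of the competition are not given here, so the example reads a small in-memory CSV with made-up values instead; in the notebook you would pass the path of your own file.

```python
import io

import pandas as pd

# A small in-memory CSV (made-up values) stands in for the real file;
# with a file on disk you would write pd.read_csv("<path to your file>")
csv_text = "Id,GrLivArea,SalePrice\n1,1500,200000\n2,1200,150000\n3,1800,250000\n"
toy_train = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted)
print(toy_train.head(2))
```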
Let's talk about the context: we have to predict house prices.
As you should know, it is a SUPERVISED machine learning problem, because a target variable (SalePrice) has to be predicted.
As we have to predict a continuous value, it is a regression problem, so you will use regression algorithms.
Instructions:
- list the columns of the training data frame using the columns primitive;
- display the dimensions of the training and test data frames using the shape primitive.
Instructions:
- use the describe function on the SalePrice column;
- use the distplot function on the SalePrice column (distplot is deprecated in recent seaborn versions in favor of histplot and displot).
The piece of code below shows how to plot a scatter plot of the two numerical variables GrLivArea and SalePrice (the target variable).
Instructions. Modify this piece of code to display the relationship between every numerical feature and the target variable (you should use a loop).
Hint. To determine whether a variable (column of the data frame) is numerical, you can have a look at the following Stack Overflow post.
# scatter plot grlivarea/saleprice
var = 'GrLivArea'
# A new data frame is created with only the desired columns (the two we would like to display)
price_surface = pd.concat([train['SalePrice'], train[var]], axis=1)
price_surface.plot.scatter(x=var, y='SalePrice', ylim=(0,800000))
plt.ylabel("Price")
plt.xlabel("Living area")
plt.show()
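One way to find the numerical columns for such a loop is select_dtypes. The sketch below uses a toy data frame with made-up values in place of train; the variable names toy and numerical_cols are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data frame with made-up values, standing in for train
toy = pd.DataFrame({
    "GrLivArea": [1500, 1200, 1800],
    "SaleCondition": ["Normal", "Abnorml", "Normal"],
    "SalePrice": [200000, 150000, 250000],
})

# select_dtypes keeps the columns whose dtype matches; np.number covers ints and floats.
# The target itself is dropped from the list of features to plot.
numerical_cols = toy.select_dtypes(include=np.number).columns.drop("SalePrice")

for col in numerical_cols:
    # In the notebook, the scatter-plot code would go here with var = col
    print(col, toy[col].dtype)
```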
The piece of code below shows how to plot a boxplot of the categorical variable SaleCondition w.r.t. the target variable.
Instructions. Modify this piece of code to display the relationship between every categorical feature and the target variable (you should use a loop).
Hint. To determine whether a variable (column of the data frame) is categorical, you can have a look at the following Stack Overflow post.
var = 'SaleCondition'
pair = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=pair)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
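For the categorical case, a possible approach is to keep every non-numerical column. The sketch below assumes that categorical features are exactly the non-numerical columns, which may not hold if some categories are encoded as integers. It prints the median price per category, a textual counterpart of what the boxplot displays; the data is made up.

```python
import numpy as np
import pandas as pd

# Toy data frame with made-up values, standing in for train
toy = pd.DataFrame({
    "GrLivArea": [1500, 1200, 1800],
    "SaleCondition": ["Normal", "Abnorml", "Normal"],
    "SalePrice": [200000, 150000, 250000],
})

# Everything that is not a number is treated as categorical here
categorical_cols = toy.select_dtypes(exclude=np.number).columns

for col in categorical_cols:
    # In the notebook, the boxplot code would go here with var = col
    medians = toy.groupby(col)["SalePrice"].median()
    print(medians)
```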
A quick way to get an overall view of your dataset is to draw a heatmap representing the correlations between variables. The code below shows how to do that. Have a look at the documentation to determine which method is used by default to calculate the correlations.
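# correlation matrix (restricted to numerical columns: recent pandas versions
# raise an error when corr() meets non-numerical columns)
corrmat = train.select_dtypes(include='number').corr()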
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=1, vmin=-1, square=True);
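The pandas documentation of DataFrame.corr lists the method parameter, whose default is "pearson". The toy example below (made-up values) checks that calling corr() without argument gives the same result as the explicit Pearson call, and contrasts it with the rank-based Spearman alternative.

```python
import numpy as np
import pandas as pd

# y grows with x but not linearly: the last point is far off the line
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1.0, 2.0, 3.0, 10.0]})

default_corr = toy.corr().loc["x", "y"]                     # no method given
pearson_corr = toy.corr(method="pearson").loc["x", "y"]     # linear correlation
spearman_corr = toy.corr(method="spearman").loc["x", "y"]   # correlation of the ranks

print(default_corr, pearson_corr, spearman_corr)
```

Because y is strictly increasing with x, the Spearman coefficient is 1, whereas the Pearson coefficient stays below 1.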
We now focus on the 10 features that are the most correlated with the target feature.
k = 10 #Number of features to consider
# We keep only the k most (negatively or positively) correlated features
cols = abs(corrmat).nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
Most machine learning algorithms do not deal with missing data (NaN). One of the first challenges to address is to manage these missing values by replacing them with estimates.
We first check the ratio of missing values per feature.
# missing data
# the isnull method outputs a data frame with the same shape as train; each element
# is a boolean: True if the value is missing (NaN), False otherwise
# We then sum these booleans to count the missing values per column
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
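The ratio computed above (sum divided by count) is the mean of the boolean mask, which gives a shorter equivalent. The sketch below shows it on a toy data frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy data frame with a few missing values
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": ["x", "y", None, "z"],
    "C": [1, 2, 3, 4],
})

# isnull() gives booleans; their mean is the fraction of missing values per column
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```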
We can see that the first 5 variables contain too many missing values; it is better not to use them.
The training and test sets are merged so that the same formatting is applied to both. This is standard practice.
data = pd.concat([train, test], axis='rows', sort=False)  # merge the two datasets
data = data.reset_index(drop=True)  # reset_index returns a new data frame, so the result must be assigned
data.head()
Instructions. Remove features:
Pandas method to succeed in the task: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
Now you have to replace missing values in order to make sense of them.
Pandas method to succeed in the task: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
For example, you can replace missing values with the most frequent value, or the mean, median...
Instructions. Replace the NaN values of the other variables.
# Replace NaN values in LotFrontage with the mean
data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].mean())
# Replace NaN values in Alley with a dedicated 'NOACCESS' category
data['Alley'] = data['Alley'].fillna('NOACCESS')
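The other strategies mentioned above (median, most frequent value) can be sketched in the same way. The data frame below is a toy example with made-up values, and the column name Zone is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "Zone": ["A", "A", None, "B"],
})

# Numerical column: fill with the median, which is less sensitive to outliers than the mean
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

# Categorical column: fill with the most frequent value;
# mode() returns a Series (there may be ties), so the first entry is taken
toy["Zone"] = toy["Zone"].fillna(toy["Zone"].mode()[0])

print(toy)
```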
Very few machine learning algorithms accept categorical variables as inputs; most of them need numerical values. It is thus necessary to convert categorical variables into numerical features.
To remedy this, there exist several methods:
Resources :
Instructions. Apply the one hot encoding on all categorical features.
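One common way to one-hot encode with pandas is get_dummies, sketched below on a toy data frame with made-up values. Whether this matches the method intended by the exercise is an assumption; scikit-learn's OneHotEncoder is another option.

```python
import pandas as pd

toy = pd.DataFrame({
    "SaleCondition": ["Normal", "Abnorml", "Normal"],
    "GrLivArea": [1500, 1200, 1800],
})

# get_dummies replaces each categorical column with one indicator column per category;
# numerical columns are left untouched
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Applying the encoding to the merged train and test data guarantees that both sets end up with the same columns.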
Now that the DataFrame is ready, it is good practice to normalize the data if you use machine learning algorithms such as SVM or KNN, which are sensitive to the scale of the features.
Instructions. Apply the MinMaxScaler to normalize the data.
Resources. This Stack Overflow entry should be of interest: https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame
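A minimal sketch of MinMaxScaler on a toy data frame (made-up values; the column name Rooms is hypothetical). In the notebook you would probably leave the Id and SalePrice columns out of the scaling, which is an assumption about the intended solution.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0],
    "Rooms": [4.0, 6.0, 8.0],
})

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
# fit_transform returns a NumPy array, so the data frame is rebuilt with the original labels
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```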
is_test = data['SalePrice'].isnull() # Mask used to separate the training and test sets:
# the target value is unknown for the test rows, so their SalePrice is NaN
train = data[~is_test] # the tilde is the negation
test = data[is_test].drop('SalePrice', axis = 'columns')
Always check your code before training
assert raises an error if the condition is false
assert len(train) == 1460 # Check the size of the training set
assert len(test) == 1459 # Check the size of the test set
assert train.isnull().sum().sum() == 0 # Check if there still exists NaN values
assert test.isnull().sum().sum() == 0 # Check if there still exists NaN values
# X are the training data and Y the prices to predict
X_train = train.drop(['SalePrice','Id'], axis = 'columns')
Y_train = train['SalePrice']
The performance is measured using the RMSE (root mean square error), which is the square root of the average squared deviation between the predicted value $\hat{y}_i$ and the true value $y_i$: $$\sqrt{\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$ The goal is to minimize this evaluation metric.
After data formatting, the evaluation of the model is the most important step.
Validating the model tells us how well it performs and whether additions or modifications to the data have enabled the model to predict better. It also tells us whether the model overfits (the worst enemy in machine learning).
def rmse(predictions, targets):
    """Implementation of RMSE
    Arguments:
        predictions {np array} -- Predicted value
        targets {np array} -- True value
    Returns:
        float -- RMSE score
    """
    return np.sqrt(np.mean((predictions - targets)**2))
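As a quick check of the formula, here is a worked example on three made-up values. The function is repeated so that the snippet runs on its own.

```python
import numpy as np


def rmse(predictions, targets):
    return np.sqrt(np.mean((predictions - targets) ** 2))


predictions = np.array([2.0, 4.0, 6.0])
targets = np.array([1.0, 4.0, 8.0])

# The errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
# Their mean is 5/3 and the RMSE is the square root of that value.
score = rmse(predictions, targets)
print(score)
```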
The validation method we will use is cross-validation.
Cross-validation is, in machine learning, a method of estimating the reliability of a model based on a sampling technique.
Suppose you have a statistical model with one or more unknown parameters, and a training set on which you can fit the model. The learning process optimizes the model parameters to match the data as closely as possible. If an independent validation sample is then taken from the same population, the model generally does not perform as well on it as it did on the training data: this is called overfitting. Cross-validation is a way to estimate how well a model would perform on a hypothetical validation set when an explicit, independent validation set is not available.
k-fold cross-validation: the original sample is divided into k sub-samples (folds). One of the k folds is selected as the validation set and the other k-1 folds constitute the training set. The model is trained on the training set and its performance score is computed on the validation set. The operation is then repeated by selecting another fold that has not yet been used for validation. It is repeated k times, so that in the end each fold has been used exactly once as a validation set. The mean of the k root mean square errors is finally calculated to estimate the prediction error.
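The splitting principle can be observed on a tiny example. The sketch below uses scikit-learn's KFold on 10 dummy samples (the sizes and the seed are arbitrary choices) and records which indices serve as validation in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 dummy samples with one feature each
kf = KFold(n_splits=5, shuffle=True, random_state=0)

seen = []  # validation indices collected over all folds
for fold, (fit_index, val_index) in enumerate(kf.split(X)):
    print(f"fold {fold + 1}: fit on {len(fit_index)} samples, validate on {val_index.tolist()}")
    seen.extend(val_index.tolist())

# Each sample appears exactly once in a validation set
print(sorted(seen))
```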
Declaration of the model
from sklearn.linear_model import LinearRegression
model = LinearRegression() # Try to use some others!!
Cross-validation
from sklearn.model_selection import KFold
# Split the dataset into 5 folds using a predefined seed (for reproducibility purposes)
# shuffle=True is needed for random_state to have an effect (recent scikit-learn versions raise an error otherwise)
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
CV = KFold(n_splits=5, shuffle=True, random_state=42)
# Lists to save the scores obtained on each fold
fit_score = []
val_score = []
verbose = False  # Set it to True if you want additional info to be displayed
# enumerate is a built-in function in Python: https://docs.python.org/3/library/functions.html#enumerate
for i, (fit_index, val_index) in enumerate(CV.split(X_train, Y_train)):
    X_fit = X_train.iloc[fit_index]
    Y_fit = Y_train.iloc[fit_index]
    X_val = X_train.iloc[val_index]
    Y_val = Y_train.iloc[val_index]
    model.fit(X_fit, Y_fit)
    pred_fit = model.predict(X_fit)
    pred_val = model.predict(X_val)
    if verbose:
        print(f'Rmse fit for fold {i+1} : {rmse(pred_fit,Y_fit):.3f}')
        print(f'Rmse val for fold {i+1} : {rmse(pred_val,Y_val):.3f}')
    fit_score.append(rmse(pred_fit, Y_fit))
    val_score.append(rmse(pred_val, Y_val))
fit_score = np.array(fit_score)
val_score = np.array(val_score)
print(f'RMSE score for fit :{np.mean(fit_score):.3f} ± {np.std(fit_score):.3f}')
print(f'RMSE score for val :{np.mean(val_score):.3f} ± {np.std(val_score):.3f}')
Instructions. Some areas for improvement:
At this point, you have trained k models and have an idea of how effective your solution is (the features used, the algorithm and its parameters). We now train the model on all the training data because, during cross-validation, each model was fitted on only $\frac{4}{5}$ of the data.
model.fit(X_train,Y_train)
pred = model.predict(test.drop(['Id'],axis = 'columns'))
If you want to participate in a machine learning competition (e.g., Kaggle), you need to submit your predictions and thus first write them to a file. You will find below a piece of code to achieve this goal.
submission = pd.DataFrame()
submission['Id'] = np.array(test['Id'])
submission['SalePrice'] = pred
submission.head()
filename = f'submission_{np.mean(val_score):.3f}_{np.std(val_score):.3f}.csv'
# The submission/ directory must already exist
submission.to_csv(f'submission/{filename}', index=False)