In this series, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms that are widely used in industry and in professional competitions, such as Kaggle.
However, in this introductory post, we will go through the following topics:
- A brief introduction to machine learning
- What is scikit-learn?
- Installing scikit-learn
- The algorithms you will learn to implement with scikit-learn in this series
- An example to help you build your first regression model
Now, let's begin this fun journey into the world of machine learning with scikit-learn!
A brief introduction to machine learning!
Machine learning has generated quite the buzz – from Elon Musk fearing the role of unregulated artificial intelligence in society, to Mark Zuckerberg holding a view that contradicts Musk's. 💥 So, what exactly is machine learning? Simply put, machine learning is a set of methods that can detect patterns in data and use those patterns to make predictions about the future. Machine learning has found immense value in a wide range of industries, from finance to healthcare, and this translates into growing demand for talent skilled in the field.
Here is a quick overview of the Google Trends interest in machine learning. 😧
Broadly speaking, machine learning can be categorized into three main types:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Supervised learning
Supervised learning is a form of machine learning in which our data comes with a set of labels or a numeric target. These labels/categories belong to one feature/attribute, commonly known as the target variable. For instance, each row of your data could belong to either the Healthy or Not Healthy category.
Given a set of features such as weight, blood sugar levels, and age, we can use the supervised machine learning algorithm to predict whether the person is healthy or not.
In the following simple mathematical expression, S is the supervised learning algorithm, X is the set of input features, such as weight and age, and Y is the target variable with the labels Healthy or Not Healthy:

Y = S(X)
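To make this concrete, here is a minimal, hypothetical sketch of that expression in code (the data and feature values below are invented purely for illustration):

from sklearn.linear_model import LogisticRegression

# Invented toy data: each row is [weight (kg), blood sugar (mg/dL), age]
X = [[62, 90, 28], [85, 160, 54], [70, 100, 33], [95, 180, 61]]
Y = ['Healthy', 'Not Healthy', 'Healthy', 'Not Healthy']

S = LogisticRegression(max_iter=1000)  # S, the supervised learning algorithm
S.fit(X, Y)                            # learn the mapping from X to Y

# Predict the label for a new, unseen person
print(S.predict([[68, 95, 30]]))  # e.g. ['Healthy']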
Although supervised machine learning is the most common type of machine learning that is implemented with scikit-learn and in the industry, most datasets typically do not come with predefined labels. Unsupervised learning algorithms are first used to cluster data without labels into distinct groups to which we can then assign labels. This is discussed in detail in the following section.
Supervised learning algorithms
Supervised learning algorithms can be used to solve both classification and regression problems. In this series, you will learn how to implement some of the most popular supervised machine learning algorithms: the ones that are widely used in industry and research, and have helped us solve a wide range of problems across a wide range of domains. Some of these supervised learning algorithms are as follows:
- Linear regression: This supervised learning algorithm is used to predict continuous numeric outcomes such as house prices, stock prices, and temperature, to name a few
- Logistic regression: The logistic learning algorithm is a popular classification algorithm that is especially used in the credit industry in order to predict loan defaults
- k-Nearest Neighbors: The k-NN algorithm is a classification algorithm that is used to classify data into two or more categories, and is widely used to classify houses into expensive and affordable categories based on price, area, bedrooms, and a whole range of other features
- Support vector machines: The SVM algorithm is a popular classification algorithm that is used in image and face detection, along with applications such as handwriting recognition
- Tree-based algorithms: Tree-based algorithms such as decision trees, random forests, and boosted trees are used to solve both classification and regression problems
- Naive Bayes: The Naive Bayes classifier is a machine learning algorithm that uses the mathematical model of probability to solve classification problems
Unsupervised learning
Unsupervised learning is a form of machine learning in which the algorithm tries to detect/find patterns in data that do not have an outcome/target variable. In other words, we do not have data that comes with pre-existing labels. Thus, the algorithm will typically use a metric such as distance to group data together depending on how close they are to each other.
As discussed in the previous section, most of the data that you will encounter in the real world will not come with a set of predefined labels and, as such, will only have a set of input features without a target attribute.
In the following simple mathematical expression, U is the unsupervised learning algorithm, while X is a set of input features, such as weight and age:

Groups = U(X)
Given this data, our objective is to create groups that could potentially be labeled as Healthy or Not Healthy. The unsupervised learning algorithm will use a metric such as distance in order to identify how close a set of points are to each other and how far apart two such groups are.
Unsupervised learning algorithms
Unsupervised machine learning algorithms are typically used to cluster points of data based on distance. The unsupervised learning algorithm that you will learn is as follows (a short sketch follows the list):
- k-means: The k-means algorithm is a popular algorithm that is typically used to segment customers into unique categories based on a variety of features, such as their spending habits. This algorithm is also used to segment houses into categories based on their features, such as price and area.
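As a minimal, hypothetical sketch (the spending figures below are made up for illustration), k-means can group unlabeled customers into two segments like this:

from sklearn.cluster import KMeans

# Invented toy data: each row is [annual spend ($), store visits per month]
X = [[200, 2], [250, 3], [5000, 20], [5400, 22]]

# Group the unlabeled points into two clusters based on distance
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # e.g. [0 0 1 1]: two spending segments
print(kmeans.cluster_centers_)  # the center point of each group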
Reinforcement learning
Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data comes with the answer key, so the model is trained with the correct answers; in reinforcement learning, there is no answer key, and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Let's dive in!
How are we going to do it?
-- With scikit-learn!
Scikit-learn is a free and open source software that helps you tackle supervised and unsupervised machine learning problems. The software is built entirely in Python and utilizes some of the most popular libraries that Python has to offer, namely NumPy and SciPy.
The main reason scikit-learn is so popular is that most of the world's most popular machine learning algorithms can be implemented quite quickly in a plug-and-play format once you know what the core pipeline looks like. Another reason is that popular classification algorithms such as logistic regression and support vector machines are written in Cython, which gives them C-like performance and makes scikit-learn quite efficient in the process.
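To illustrate this plug-and-play design, here is a short sketch (with placeholder data) of the shared estimator interface; any of the algorithms above can be swapped in without changing the surrounding code:

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # placeholder features
y = [0, 1, 0, 1]                      # placeholder labels

# Every scikit-learn estimator trains with .fit() and predicts with .predict()
for model in (SVC(), KNeighborsClassifier(n_neighbors=1)):
    model.fit(X, y)
    print(model.predict([[1, 0]]))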
Scikit-learn is designed to tackle problems pertaining to supervised and unsupervised learning only, and does not support reinforcement learning at present.
Installing the scikit-learn package
There are two ways in which you can install scikit-learn on your personal device:
- By using the pip method
- By using the Anaconda method
The pip method can be implemented on the macOS/Linux Terminal or the Windows PowerShell, while the Anaconda method will work with the Anaconda prompt.
Choosing between these two methods of installation is pretty straightforward: if you use the Anaconda distribution of Python, use the Anaconda method; otherwise, use pip.
The pip method
pip3 install numpy
pip3 install scipy
pip3 install scikit-learn
pip3 install -U scikit-learn  # use -U to upgrade an existing installation
The Anaconda method
conda install numpy
conda install scipy
conda install scikit-learn
conda update scikit-learn  # conda uses "update" rather than -U to upgrade
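Whichever method you choose, you can verify the installation from your terminal (the version number printed will vary):

python3 -c "import sklearn; print(sklearn.__version__)"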
So far, this lesson has given a brief introduction to what machine learning is for those of you who are just beginning your journey into the world of machine learning. You have learned how scikit-learn fits into the context of machine learning and how you can go about installing the necessary software. Now, we'll put this into practice 💪 and do some data exploration and analysis.
The dataset we'll look at in this section is the so-called Boston housing dataset.
Loading the Data into Jupyter Using a Pandas DataFrame
Oftentimes, data is stored in tables, which means it can be saved as a comma-separated values (CSV) file. This format, and many others, can be read into Python as a DataFrame object using the Pandas library. Other common formats include tab-separated values (TSV), SQL tables, and JSON data structures; indeed, Pandas has support for all of these. In this example, however, we are not going to load the data this way, because the dataset is available directly through scikit-learn.
from sklearn import datasets

# Load the Boston housing dataset bundled with scikit-learn
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
boston = datasets.load_boston()
type(boston)  # sklearn.utils.Bunch, a dictionary-like container

# Print the textual description of the dataset
print(boston['DESCR'])
import pandas as pd

# Load the data as a DataFrame in pandas
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])

# Check the top five rows of the DataFrame
df.head()
In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict given the features. For this dataset, the suggested target is MEDV, the median house value in 1,000s of dollars.
# Add the target column to our DataFrame
df['MEDV'] = boston['target']

# Create a copy of the target values
y = df['MEDV'].copy()

# Delete the newly created column
del df['MEDV']

# Concatenate the target column to the front of our existing DataFrame
df = pd.concat((y, df), axis=1)
| | MEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
Here, we introduce a dummy variable y to hold a copy of the target column before removing it from the DataFrame. We then use the Pandas concatenation function to combine it with the remaining DataFrame along the first axis (axis=1, which combines columns, as opposed to axis=0, which combines rows).
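If the axis argument is unfamiliar, this tiny standalone sketch (with made-up values) shows the difference between the two axes:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

print(pd.concat((a, b), axis=1))  # axis=1: combine side by side as columns
print(pd.concat((a, a), axis=0))  # axis=0: stack one on top of the other as rows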
# Check the shape of the DataFrame and the number of missing values per column
print(df.shape)
df.isnull().sum()
---------------------
(506, 14)
MEDV 0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
dtype: int64
For this dataset, we see there are no NaNs, which means we have no immediate work to do in cleaning the data and can move on.
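(For reference only: had there been NaNs, a minimal cleanup might have looked like the following sketch; the choice between dropping and filling is an illustration, not a rule.)

# Drop rows with any missing values, or fill them with each column's median
df_clean = df.dropna()
df_filled = df.fillna(df.median())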
To simplify the analysis, the final thing we'll do before exploration is remove some of the columns. We won't bother looking at these, and instead focus on the remainder in more detail.
Remove some columns by running the cell that contains the following code:
for col in ['ZN', 'NOX', 'RAD', 'PTRATIO', 'B']:
    del df[col]
Data Exploration
Since this is an entirely new dataset that we've never seen before, the first goal here is to understand the data. We've already seen the textual description of the data, which is important for qualitative understanding. We'll now compute a quantitative description.
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
This computes various properties, including the mean, standard deviation, minimum, and maximum for each column. This table gives a high-level idea of how everything is distributed. Note that we have taken the transpose of the result by appending .T to the output; this swaps the rows and columns.
# Select a subset of columns and compute their pairwise correlations
cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV']
df[cols].corr()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ax = sns.heatmap(df[cols].corr(),
                 cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
ax.xaxis.tick_top()  # move the labels to the top
We call sns.heatmap and pass the pairwise correlation matrix as input. We use a custom color palette here to override the Seaborn default.
The resulting heatmap shows the correlation score between each pair of variables. Large positive scores indicate a strong positive (that is, in the same direction) correlation. As expected, we see maximum values of 1 on the diagonal.
The Pearson coefficient is defined as the covariance between two variables, divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

The covariance, in turn, is defined as follows:

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here, $n$ is the number of samples, $x_i$ and $y_i$ are the individual samples being summed over, and $\bar{x}$ and $\bar{y}$ are the means of each set.
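As a quick sanity check of these definitions, here is a small sketch that computes the Pearson coefficient by hand for LSTAT and MEDV, and compares it with the built-in pandas result:

x = df['LSTAT'].values
y_vals = df['MEDV'].values
n = len(x)

# Covariance: the mean product of deviations from each variable's mean
cov_xy = ((x - x.mean()) * (y_vals - y_vals.mean())).sum() / n

# Pearson coefficient: covariance divided by the product of standard deviations
pearson = cov_xy / (x.std() * y_vals.std())

print(pearson)                       # approximately -0.74
print(df['LSTAT'].corr(df['MEDV']))  # pandas computes the same quantity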
Linear Model with Scikit-Learn
This dataset presents us with a regression problem, where we predict a continuous target variable given a set of features. In particular, we'll be predicting the median house value (MEDV). We'll train models that take only one feature as input to make this prediction. This way, the models will be conceptually simple to understand, and we can focus more on the technical details of the scikit-learn API.
We'll import the LinearRegression class and build our linear regression model. Run the following code, which uses scikit-learn to fit a polynomial regression model that predicts the median house value (MEDV) given the LSTAT values. We are hoping to build a model with a low mean squared error (MSE):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Pull out the target and the single feature we'll use
y = df['MEDV'].values
x = df['LSTAT'].values.reshape(-1, 1)

# Transform the feature into polynomial terms (1, x, x^2, x^3)
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)

# Fit a linear regression model on the polynomial features
clf = LinearRegression()
clf.fit(x_poly, y)

# Predict and keep the residuals for later analysis
y_pred = clf.predict(x_poly)
resid_MEDV = y - y_pred

# Evaluate the fit with the mean squared error
error = mean_squared_error(y, y_pred)
print('mse = {:.2f}'.format(error))
import numpy as np

fig, ax = plt.subplots()

# Plot the samples
ax.scatter(x.flatten(), y, alpha=0.6)

# Plot the polynomial model over a smooth grid of LSTAT values
x_ = np.linspace(2, 38, 50).reshape(-1, 1)
x_poly = poly.fit_transform(x_)
y_ = clf.predict(x_poly)
ax.plot(x_, y_, color='red', alpha=0.8)
ax.set_xlabel('LSTAT')
ax.set_ylabel('MEDV')
--------------------------------------
mse = 28.88
This error metric can be converted into an average error in dollars by taking its square root (the root mean squared error, or RMSE). Doing this for the polynomial model, we find that the average error for the median house value is only about $5,400.
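A quick sketch of that conversion, reusing the error variable from above:

import numpy as np

# RMSE: the square root of the MSE, back in units of $1,000
rmse = np.sqrt(error)
print('rmse = {:.2f}'.format(rmse))  # -> rmse = 5.37, i.e. about $5,400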
This completes our guide to writing your first machine learning model. Here, we used visual aids such as scatter plots to deepen our understanding of the data, and we performed some simple predictive modeling. In the next part, we will look at what MSE and RMSE are and work with other models to improve accuracy.
Hope you liked it! Please Share this guide to other fellow readers.
Till then Happy Machine learning! 😅😅