In this series, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms that are widely used in industry and in professional competitions, such as Kaggle.
However, in this introductory post, we will go through the following topics:
- A brief introduction to machine learning
- What is scikit-learn?
- Installing scikit-learn
- The algorithms you will learn to implement with scikit-learn in this series
- An example to help you build your first regression model
Now, let's begin this fun journey into the world of machine learning with scikit-learn!
A brief introduction to machine learning!
Machine learning has generated quite the buzz – from Elon Musk fearing the role of unregulated artificial intelligence in society, to Mark Zuckerberg holding a view that contradicts Musk's. 💥 So, what exactly is machine learning? Simply put, machine learning is a set of methods that can detect patterns in data and use those patterns to make predictions about the future. Machine learning has found immense value in a wide range of industries, from finance to healthcare, and this translates into growing demand for talent skilled in the field.
Here is a quick overview of the Google Trends interest in machine learning. 😧
Broadly speaking, machine learning can be categorized into three main types:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Supervised learning
Supervised learning is a form of machine learning in which our data comes with a set of labels or a numeric target. These labels/categories belong to one feature/attribute, commonly known as the target variable. For instance, each row of your data could belong to either the Healthy or Not Healthy category.
Given a set of features such as weight, blood sugar levels, and age, we can use the supervised machine learning algorithm to predict whether the person is healthy or not.
In the following simple mathematical expression, S is the supervised learning algorithm, X is the set of input features, such as weight and age, and Y is the target variable with the labels Healthy or Not Healthy:

Y = S(X)
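To make this concrete, here is a minimal, hypothetical sketch of that expression in code (the data and feature values below are invented purely for illustration):

from sklearn.linear_model import LogisticRegression

# Invented toy data: each row is [weight (kg), blood sugar (mg/dL), age]
X = [[62, 90, 28], [85, 160, 54], [70, 100, 33], [95, 180, 61]]
Y = ['Healthy', 'Not Healthy', 'Healthy', 'Not Healthy']

S = LogisticRegression(max_iter=1000)  # S, the supervised learning algorithm
S.fit(X, Y)                            # learn the mapping from X to Y

# Predict the label for a new, unseen person
print(S.predict([[68, 95, 30]]))  # e.g. ['Healthy']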
Although supervised machine learning is the most common type of machine learning that is implemented with scikit-learn and in the industry, most datasets typically do not come with predefined labels. Unsupervised learning algorithms are first used to cluster data without labels into distinct groups to which we can then assign labels. This is discussed in detail in the following section.
Supervised learning algorithms
Supervised learning algorithms can be used to solve both classification and regression problems. In this series, you will learn how to implement some of the most popular supervised machine learning algorithms: the ones that are widely used in industry and research, and have helped us solve a wide range of problems across a wide range of domains. Some of these supervised learning algorithms are as follows:
- Linear regression: This supervised learning algorithm is used to predict continuous numeric outcomes such as house prices, stock prices, and temperature, to name a few
- Logistic regression: The logistic learning algorithm is a popular classification algorithm that is especially used in the credit industry in order to predict loan defaults
- k-Nearest Neighbors: The k-NN algorithm is a classification algorithm that is used to classify data into two or more categories, and is widely used to classify houses into expensive and affordable categories based on price, area, bedrooms, and a whole range of other features
- Support vector machines: The SVM algorithm is a popular classification algorithm that is used in image and face detection, along with applications such as handwriting recognition
- Tree-based algorithms: Tree-based algorithms such as decision trees, random forests, and boosted trees are used to solve both classification and regression problems
- Naive Bayes: The Naive Bayes classifier is a machine learning algorithm that uses the mathematical model of probability to solve classification problems
Unsupervised learning
Unsupervised learning is a form of machine learning in which the algorithm tries to detect/find patterns in data that do not have an outcome/target variable. In other words, we do not have data that comes with pre-existing labels. Thus, the algorithm will typically use a metric such as distance to group data together depending on how close they are to each other.
As discussed in the previous section, most of the data that you will encounter in the real world will not come with a set of predefined labels and, as such, will only have a set of input features without a target attribute.
In the following simple mathematical expression, U is the unsupervised learning algorithm, while X is a set of input features, such as weight and age:

Groups = U(X)
Given this data, our objective is to create groups that could potentially be labeled as Healthy or Not Healthy. The unsupervised learning algorithm will use a metric such as distance in order to identify how close a set of points are to each other and how far apart two such groups are.
Unsupervised learning algorithms
Unsupervised machine learning algorithms are typically used to cluster points of data based on distance. The unsupervised learning algorithm that you will learn is as follows (a short sketch follows the list):
- k-means: The k-means algorithm is a popular algorithm that is typically used to segment customers into unique categories based on a variety of features, such as their spending habits. This algorithm is also used to segment houses into categories based on their features, such as price and area.
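As a minimal, hypothetical sketch (the spending figures below are made up for illustration), k-means can group unlabeled customers into two segments like this:

from sklearn.cluster import KMeans

# Invented toy data: each row is [annual spend ($), store visits per month]
X = [[200, 2], [250, 3], [5000, 20], [5400, 22]]

# Group the unlabeled points into two clusters based on distance
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # e.g. [0 0 1 1]: two spending segments
print(kmeans.cluster_centers_)  # the center point of each group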
Reinforcement learning
Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data comes with the answer key, so the model is trained with the correct answers; in reinforcement learning, there is no answer key, and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Let's dive in!
How are we going to do it?
-- With scikit-learn!
Scikit-learn is a free and open source software that helps you tackle supervised and unsupervised machine learning problems. The software is built entirely in Python and utilizes some of the most popular libraries that Python has to offer, namely NumPy and SciPy.
The main reason scikit-learn is so popular is that most of the world's most popular machine learning algorithms can be implemented quite quickly in a plug-and-play format once you know what the core pipeline looks like. Another reason is that popular classification algorithms such as logistic regression and support vector machines are written in Cython, which gives them C-like performance and makes scikit-learn quite efficient in the process.
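To illustrate this plug-and-play design, here is a short sketch (with placeholder data) of the shared estimator interface; any of the algorithms above can be swapped in without changing the surrounding code:

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # placeholder features
y = [0, 1, 0, 1]                      # placeholder labels

# Every scikit-learn estimator trains with .fit() and predicts with .predict()
for model in (SVC(), KNeighborsClassifier(n_neighbors=1)):
    model.fit(X, y)
    print(model.predict([[1, 0]]))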
Scikit-learn is designed to tackle problems pertaining to supervised and unsupervised learning only, and does not support reinforcement learning at present.
Installing the scikit-learn package
There are two ways in which you can install scikit-learn on your personal device:
- By using the pip method
- By using the Anaconda method
The pip method can be implemented on the macOS/Linux Terminal or the Windows PowerShell, while the Anaconda method will work with the Anaconda prompt.
Choosing between these two methods of installation is pretty straightforward: if you use the Anaconda distribution of Python, use the Anaconda method; otherwise, use pip.
The pip method
pip3 install numpy
pip3 install scipy
pip3 install scikit-learn
pip3 install -U scikit-learn  # use -U to upgrade an existing installation
The Anaconda method
conda install numpy
conda install scipy
conda install scikit-learn
conda update scikit-learn  # conda uses "update" rather than -U to upgrade
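Whichever method you choose, you can verify the installation from your terminal (the version number printed will vary):

python3 -c "import sklearn; print(sklearn.__version__)"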
So far, this lesson has given a brief introduction to what machine learning is for those of you who are just beginning your journey into the world of machine learning. You have learned how scikit-learn fits into the context of machine learning and how you can go about installing the necessary software. Now, we'll put this into practice 💪 and do some data exploration and analysis.
The dataset we'll look at in this section is the so-called Boston housing dataset.
Loading the Data into Jupyter Using a Pandas DataFrame
Oftentimes, data is stored in tables, which means it can be saved as a comma-separated values (CSV) file. This format, and many others, can be read into Python as a DataFrame object using the Pandas library. Other common formats include tab-separated values (TSV), SQL tables, and JSON data structures; indeed, Pandas has support for all of these. In this example, however, we are not going to load the data this way, because the dataset is available directly through scikit-learn.
from sklearn import datasets

# Load the Boston housing dataset bundled with scikit-learn
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
boston = datasets.load_boston()
type(boston)  # sklearn.utils.Bunch, a dictionary-like container

# Print the textual description of the dataset
print(boston['DESCR'])
import pandas as pd

# Load the data as a DataFrame in pandas
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])

# Check the top five rows of the DataFrame
df.head()
In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict given the features. For this dataset, the suggested target is MEDV, the median house value in 1,000s of dollars.
# Add the target column to our DataFrame
df['MEDV'] = boston['target']

# Create a copy of the target values
y = df['MEDV'].copy()

# Delete the newly created column
del df['MEDV']

# Concatenate the target column to the front of our existing DataFrame
df = pd.concat((y, df), axis=1)
| | MEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
Here, we introduce a dummy variable y to hold a copy of the target column before removing it from the DataFrame. We then use the Pandas concatenation function to combine it with the remaining DataFrame along the first axis (axis=1, which combines columns, as opposed to axis=0, which combines rows).
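If the axis argument is unfamiliar, this tiny standalone sketch (with made-up values) shows the difference between the two axes:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

print(pd.concat((a, b), axis=1))  # axis=1: combine side by side as columns
print(pd.concat((a, a), axis=0))  # axis=0: stack one on top of the other as rows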
# Check the shape of the DataFrame and the number of missing values per column
print(df.shape)
df.isnull().sum()
---------------------
(506, 14)
MEDV 0
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
dtype: int64
For this dataset, we see there are no NaNs, which means we have no immediate work to do in cleaning the data and can move on.
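(For reference only: had there been NaNs, a minimal cleanup might have looked like the following sketch; the choice between dropping and filling is an illustration, not a rule.)

# Drop rows with any missing values, or fill them with each column's median
df_clean = df.dropna()
df_filled = df.fillna(df.median())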
To simplify the analysis, the final thing we'll do before exploration is remove some of the columns. We won't bother looking at these, and instead focus on the remainder in more detail.
Remove some columns by running the cell that contains the following code:
for col in ['ZN', 'NOX', 'RAD', 'PTRATIO', 'B']:
    del df[col]
Data Exploration
Since this is an entirely new dataset that we've never seen before, the first goal here is to understand the data. We've already seen the textual description of the data, which is important for qualitative understanding. We'll now compute a quantitative description.
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
This computes various properties, including the mean, standard deviation, minimum, and maximum for each column. This table gives a high-level idea of how everything is distributed. Note that we have taken the transpose of the result by appending .T to the output; this swaps the rows and columns.
# Select a subset of columns and compute their pairwise correlations
cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV']
df[cols].corr()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ax = sns.heatmap(df[cols].corr(),
                 cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
ax.xaxis.tick_top()  # move the labels to the top
We call sns.heatmap and pass the pairwise correlation matrix as input. We use a custom color palette here to override the Seaborn default.
The resulting heatmap shows the correlation score between each pair of variables. Large positive scores indicate a strong positive (that is, in the same direction) correlation. As expected, we see maximum values of 1 on the diagonal.
The Pearson coefficient is defined as the covariance between two variables, divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

The covariance, in turn, is defined as follows:

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here, $n$ is the number of samples, $x_i$ and $y_i$ are the individual samples being summed over, and $\bar{x}$ and $\bar{y}$ are the means of each set.
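As a quick sanity check of these definitions, here is a small sketch that computes the Pearson coefficient by hand for LSTAT and MEDV, and compares it with the built-in pandas result:

x = df['LSTAT'].values
y_vals = df['MEDV'].values
n = len(x)

# Covariance: the mean product of deviations from each variable's mean
cov_xy = ((x - x.mean()) * (y_vals - y_vals.mean())).sum() / n

# Pearson coefficient: covariance divided by the product of standard deviations
pearson = cov_xy / (x.std() * y_vals.std())

print(pearson)                       # approximately -0.74
print(df['LSTAT'].corr(df['MEDV']))  # pandas computes the same quantity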
Linear Model with Scikit-Learn
This dataset presents us with a regression problem, where we predict a continuous target variable given a set of features. In particular, we'll be predicting the median house value (MEDV). We'll train models that take only one feature as input to make this prediction. This way, the models will be conceptually simple to understand, and we can focus more on the technical details of the scikit-learn API.
We'll import the LinearRegression class and build our linear regression model. Run the following code, which uses scikit-learn to fit a polynomial regression model that predicts the median house value (MEDV) given the LSTAT values. We are hoping to build a model with a low mean squared error (MSE):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Pull out the target and the single feature we'll use
y = df['MEDV'].values
x = df['LSTAT'].values.reshape(-1, 1)

# Transform the feature into polynomial terms (1, x, x^2, x^3)
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)

# Fit a linear regression model on the polynomial features
clf = LinearRegression()
clf.fit(x_poly, y)

# Predict and keep the residuals for later analysis
y_pred = clf.predict(x_poly)
resid_MEDV = y - y_pred

# Evaluate the fit with the mean squared error
error = mean_squared_error(y, y_pred)
print('mse = {:.2f}'.format(error))
import numpy as np

fig, ax = plt.subplots()

# Plot the samples
ax.scatter(x.flatten(), y, alpha=0.6)

# Plot the polynomial model over a smooth grid of LSTAT values
x_ = np.linspace(2, 38, 50).reshape(-1, 1)
x_poly = poly.fit_transform(x_)
y_ = clf.predict(x_poly)
ax.plot(x_, y_, color='red', alpha=0.8)
ax.set_xlabel('LSTAT')
ax.set_ylabel('MEDV')
--------------------------------------
mse = 28.88
This error metric can be converted into an average error in dollars by taking its square root (the root mean squared error, or RMSE). Doing this for the polynomial model, we find that the average error for the median house value is only about $5,400.
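A quick sketch of that conversion, reusing the error variable from above:

import numpy as np

# RMSE: the square root of the MSE, back in units of $1,000
rmse = np.sqrt(error)
print('rmse = {:.2f}'.format(rmse))  # -> rmse = 5.37, i.e. about $5,400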
This completes our guide to writing your first machine learning model. Here, we used visual aids such as scatter plots to deepen our understanding of the data, and we performed some simple predictive modeling. In the next part, we will look at what MSE and RMSE are and work with other models to improve accuracy.
Hope you liked it! Please Share this guide to other fellow readers.
Till then Happy Machine learning! 😅😅