In this tutorial, we are going to find reasonable salaries for a company that is hiring new employees. To complete this task, we’ll use the Linear Regression algorithm, which happens to be one of the most basic machine learning algorithms. To predict the best salary package for an employee, we’ll use their years of experience as the key input.

For Linear Regression we use an old-school math formula: y = mX + b, where y is our prediction, X is our data, m is the slope (gradient) of the line and b is the y-intercept. You can learn more about it here.
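To make the formula concrete, here is a minimal sketch in plain Python. The values of m and b are picked by hand purely for illustration; in the rest of the tutorial the model learns them from the data.

```python
# Slope and intercept chosen by hand, only to illustrate y = mX + b.
m = 2.0  # slope (gradient) of the line
b = 1.0  # y-intercept


def predict(X):
    """Return y = m * x + b for every x in X."""
    return [m * x + b for x in X]


X = [0, 1, 2, 3]
y = predict(X)
print(y)  # [1.0, 3.0, 5.0, 7.0]
```

Every step of 1 in X raises y by m, and b is the value of y when X is 0.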

Let’s say the company gave you the dataset of their old employees, and the dataset contains their years of experience and their salaries. Now you, as a data scientist, have to predict the perfect salary structure for their new employees. Let’s get started.

For this task, we’ll be using three Python libraries: scikit-learn, pandas and matplotlib. Let’s import these libraries and go into the details!
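The import lines could look like the sketch below. The exact names imported are an assumption based on what the rest of the tutorial uses (train_test_split and Linear Regression from scikit-learn).

```python
import pandas as pd                                   # dataset handling
import matplotlib.pyplot as plt                       # plotting
from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.linear_model import LinearRegression     # the model itself
```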

Pandas:

Pandas is an open-source Python library. For data science in Python, it’s one of the most powerful libraries for data manipulation. In this tutorial, we’ll use pandas to handle our dataset.

Sklearn:

Here comes the most important library for this tutorial! Sklearn (the import name of scikit-learn) is one of the most important libraries when it comes to machine learning or data science. It helps us implement very complex mathematical formulas/algorithms in just a few lines of code. In fact, we’ll be implementing Linear Regression with the help of Sklearn.

Matplotlib:

Matplotlib is a very powerful data visualization library in Python. This library draws our data as 2D plots. When working with a large amount of data, Matplotlib helps us better understand the structure of the data.

Now that we have imported the libraries we need, let’s import our data! We’ll be using a free dataset provided by superdatascience.com. Download your dataset here.
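Loading the data might look like the sketch below. To keep it runnable without the download, it reads a few made-up rows that only mimic the YearsExperience/Salary layout; they are not the real dataset. The file name in the comment is also an assumption.

```python
import io

import pandas as pd

# Made-up rows that only mimic the layout described in this tutorial.
# They are NOT the real dataset.
csv_text = """YearsExperience,Salary
1.0,40000
2.0,44000
3.0,51000
4.0,55000
5.0,61000
6.0,64000
7.0,71000
8.0,74000
9.0,81000
10.0,85000
"""

# With the downloaded file you would pass its path instead, for example
# pd.read_csv("Salary_Data.csv") (the file name is an assumption).
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```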

Now that we’ve loaded our dataset, we can see its entries by just printing ‘df’. Let’s do that and understand our data!

After printing ‘df’ we’ll get the following output.

The output has three columns: the first is the index of each entry, the second is YearsExperience and the third is Salary. For our task, we’ll just focus on YearsExperience and Salary.

In the next step, we’ll split our data into two separate datasets, X and y. X will contain ‘YearsExperience’ and y will contain ‘Salary’. Let’s split our data!
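One way to do this split is sketched below, again on made-up rows rather than the real dataset. Selecting the column with double brackets is a choice made here so that X stays two-dimensional, which is the shape scikit-learn expects for features.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [40000, 44000, 51000, 55000, 61000,
               64000, 71000, 74000, 81000, 85000],
})

X = df[["YearsExperience"]]  # features: a one-column table (2D)
y = df["Salary"]             # target: a single column (1D)

print(X.shape, y.shape)  # (10, 1) (10,)
```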

Just splitting our data into X and y isn’t going to help us, so we’ll split X and y into four different parts: X_train, X_test, y_train and y_test. We’ll use 70% of our data in X_train and y_train and the rest in X_test and y_test. After splitting, you’ll see why we had to split into four different parts! Let’s add the code and split!
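A sketch of the 70/30 split with scikit-learn’s train_test_split follows. The rows are made up for illustration; only the test_size and random_state values come from this tutorial.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [40000, 44000, 51000, 55000, 61000,
               64000, 71000, 74000, 81000, 85000],
})
X = df[["YearsExperience"]]
y = df["Salary"]

# Hold out 30% of the rows for testing; the other 70% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(len(X_train), len(X_test))  # 7 3
```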

So why did we have to split our data into X_train, X_test, y_train and y_test? Well, machine learning works in two phases: the first is training/learning and the second is predicting the output. We’ll use X_train and y_train to train our model, and X_test and y_test to check how well the model predicts on data it hasn’t seen.

‘test_size=0.3’ tells train_test_split to hold out 30% of the data for testing, and ‘random_state=0’ fixes the seed of the random shuffle, so we get the same split every time we run the code.
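A quick way to see what random_state does: it seeds the shuffle rather than switching it off. The sketch below uses a plain list of numbers as stand-in data and compares two calls with the same seed against a call with shuffle=False, the train_test_split parameter that actually disables shuffling.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))  # stand-in data: 0..9

# Same seed, same shuffle, same split on every run.
_, test_a = train_test_split(data, test_size=0.3, random_state=0)
_, test_b = train_test_split(data, test_size=0.3, random_state=0)
print(test_a == test_b)  # True

# shuffle=False keeps the original order: the test set is the last 30%.
_, test_unshuffled = train_test_split(data, test_size=0.3, shuffle=False)
print(test_unshuffled)  # [7, 8, 9]
```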

Now here comes the fun part! We’ll fit the Linear Regression model to the training data and predict y using X_test. It takes just three lines of code! Let’s do that!
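Those three lines could look like the sketch below. The variable name regressor is an assumption, and the rows are made up, so the printed numbers will differ from what you get with the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [40000, 44000, 51000, 55000, 61000,
               64000, 71000, 74000, 81000, 85000],
})
X = df[["YearsExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

regressor = LinearRegression()      # 1. create the model
regressor.fit(X_train, y_train)     # 2. learn m and b from the training data
y_pred = regressor.predict(X_test)  # 3. predict salaries for the test set

print("slope m:", regressor.coef_[0])
print("intercept b:", regressor.intercept_)
print("predicted:", y_pred)
print("actual:   ", y_test.to_numpy())
```

After fitting, coef_ and intercept_ hold the m and b from y = mX + b.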

We are almost done! Let’s use matplotlib to visualize our outputs! We’ll use two separate visualizations to see the outcomes on the training and test sets.

Code for training set:

Code for test set:
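Both plots can be sketched together as below, with the two visualizations drawn side by side in one figure. The colors, titles and axis labels are choices made here, and the rows are again made up for illustration. The blue line is the same in both panels because the model is trained only once, on the training set.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [40000, 44000, 51000, 55000, 61000,
               64000, 71000, 74000, 81000, 85000],
})
X = df[["YearsExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# Points on the fitted line, computed once from the training inputs.
line_x = X_train["YearsExperience"].to_numpy()
line_y = regressor.predict(X_train)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [
    (axes[0], X_train, y_train, "Salary vs Experience (training set)"),
    (axes[1], X_test, y_test, "Salary vs Experience (test set)"),
]
for ax, X_part, y_part, title in panels:
    # Red dots: the actual salaries in this part of the data.
    ax.scatter(X_part["YearsExperience"].to_numpy(), y_part.to_numpy(),
               color="red")
    # Blue line: what the trained model predicts.
    ax.plot(line_x, line_y, color="blue")
    ax.set_title(title)
    ax.set_xlabel("Years of experience")
    ax.set_ylabel("Salary")

plt.tight_layout()
plt.show()
```

The closer the red dots in the test panel sit to the blue line, the better the model predicts salaries it has not seen.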