How to split data into training and test sets with train_test_split using the Scikit-learn library?

Aparna Mishra
3 min read · Jan 6, 2022

In this story we will first see how to perform train_test_split using the sklearn library, and then we will see why it is important to split our data set into training data and testing data.

Importing train_test_split from sklearn.model_selection

from sklearn.model_selection import train_test_split

The sklearn library provides a function called train_test_split()

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

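arrays: One or more array-like inputs of the same length (lists, NumPy arrays, pandas Series or DataFrames), typically the features x and the labels y.

test_size: A float between 0.0 and 1.0 gives the proportion of the data to include in the test set; an int gives the absolute number of test samples. For example, to hold out 30% of the data for testing, set test_size = 0.3. If both test_size and train_size are None, it defaults to 0.25.

train_size: By default it is None, in which case it is set to the complement of test_size. A float between 0.0 and 1.0 gives the proportion of the data to include in the training set; an int gives the absolute number of training samples.

random_state: Controls the shuffling applied to the data before the split. Pass an int to get the same split on every run.

shuffle: By default it is True. It specifies whether to shuffle the data before splitting. If shuffle is False, then stratify must be None.

stratify: By default it is None. If an array-like (typically the class labels y) is passed, the data is split in a stratified fashion, so that each class appears in roughly the same proportion in the training and test sets.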

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0, shuffle=True)
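
As a minimal sketch of how these parameters interact, the example below runs the split on a small made-up, imbalanced dataset. The arrays x and y here are purely illustrative (they are not the article's data), and the snippet assumes NumPy is installed alongside scikit-learn. It suggests that a fixed random_state reproduces the same split, and that stratify=y keeps the class ratio in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
x = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y
)
print(x_train.shape, x_test.shape)   # (70, 1) (30, 1)
print(y_train.sum(), y_test.sum())   # class-1 counts: 14 6

# Calling again with the same random_state gives the same split.
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y
)
same_split = np.array_equal(x_test, x_test2)
print(same_split)                    # True
```

Without stratify, a small or imbalanced dataset can end up with a test set whose class ratio differs noticeably from the full data.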

Step by step procedure:

# Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.3, random_state=0, shuffle=True)

{ Note: Here x and y should be array-like objects of the same length, for example lists, NumPy arrays, or pandas Series. }

Checking the shape of the training and testing datasets.
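
The shape check could look like the following sketch. The data DataFrame here is a hypothetical stand-in with message and label columns, built only so that the snippet runs on its own; it assumes pandas is installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the article's dataset: 10 labelled messages.
data = pd.DataFrame({
    "message": [f"message {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

x_train, x_test, y_train, y_test = train_test_split(
    data["message"], data["label"], test_size=0.3, random_state=0, shuffle=True
)

# Each split is a pandas Series, so .shape is a one-element tuple.
print(x_train.shape, y_train.shape)  # (7,) (7,)
print(x_test.shape, y_test.shape)    # (3,) (3,)
```

With 10 rows and test_size = 0.3, 3 rows go to the test set and the remaining 7 to the training set.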

Now let’s discuss why we split data with train_test_split.

We split the data by putting 70–80% of it in the “train set” and the remaining 20–30% in the “test set”. If we test the model on the same data that was used for training, the model will appear to perform well, but this score is misleading because it hides an “overfitting problem”, i.e. the model may have memorized the training data and will not provide accurate results for unseen data.

In simple terms, testing the model on the same data that was used for training cannot reveal overfitting. For example, if you sat an exam and got the same question paper that you had solved a few weeks earlier, chances are you would score 100%, but that score would say little about how well you know the subject. If instead you only got an overview of the questions that would be asked from each module and prepared accordingly, the questions would be new to you, and the marks you scored would accurately reflect your learning of the subject.
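
To make this concrete, here is a rough sketch that compares accuracy on the training set with accuracy on the held-out test set. The synthetic data from make_classification, the label-noise setting flip_y=0.2, and the choice of an unpruned DecisionTreeClassifier are all assumptions made for illustration; the exact test score depends on the data and the model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which about 20% of the labels are assigned at random
# (an illustrative assumption, not the article's dataset).
x, y = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0
)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# An unpruned decision tree can memorize the training set, noise included.
model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

train_acc = model.score(x_train, y_train)  # scored on data the model has seen
test_acc = model.score(x_test, y_test)     # scored on unseen data
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training score is typically perfect in a setup like this, while the test score is lower. Only the test score reflects how the model handles new data.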

Therefore, we split the data into training and test sets.

Resources:

  1. sklearn.model_selection.train_test_split — scikit-learn 1.0.2 documentation
  2. sklearn.datasets.make_classification — scikit-learn 1.0.2 documentation
  3. sklearn.datasets.make_blobs — scikit-learn 1.0.2 documentation
  4. scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation
