Tokenization in NLP
Tokenization is a key step when working with text data.
As the name suggests, it is the process of breaking text into smaller chunks such as sentences or words. In other words, it means splitting text into “tokens” so that a machine can interpret language. This step is performed as part of data preprocessing, or text preprocessing, in Natural Language Processing.
It is the building block of any NLP project.
To understand what tokenization is, I have broken the operation into steps, to give a basic idea of how it is performed.
Consider a sentence: “Yoga develops inner awareness.”
If we perform word tokenization on this sentence, the result looks like this:
['Yoga',
'develops',
'inner',
'awareness',
'.']
# a list of tokens
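For contrast, here is what a naive whitespace split (plain Python, no nltk) produces on the same sentence; note how the full stop stays attached to the last word, which is one reason a proper tokenizer is useful:

```python
sentence = "Yoga develops inner awareness."

# naive tokenization: split on whitespace only
naive_tokens = sentence.split()
print(naive_tokens)  # ['Yoga', 'develops', 'inner', 'awareness.']

# the trailing '.' is stuck to 'awareness' -- nltk's word_tokenize
# would instead separate it into its own token
```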
Import the Natural Language Toolkit library, nltk.
import nltk

# the paragraph we will tokenize
paragraph = """Yoga is not just about bending or twisting the body and holding the breath. It is a technique to bring you into a state where you see and experience reality simply the way it is. If you enable your energies to become exuberant and ecstatic, your sensory body expands. This enables you to experience the whole universe as a part of yourself, making everything one, this is the union that yoga creates."""
Tokenizing sentences — called sentence tokenization
# tokenizing sentences
# sent_tokenize(), provided by the nltk library, splits text into sentences
# (it requires the Punkt models: run nltk.download('punkt') once if needed)
sentences = nltk.sent_tokenize(paragraph)
sentences
Output: the paragraph is broken into sentence tokens, returned as a list.
['Yoga is not just about bending or twisting the body and holding the breath.',
'It is a technique to bring you into a state where you see and experience reality simply the way it is.',
'If you enable your energies to become exuberant and ecstatic, your sensory body expands.',
'This enables you to experience the whole universe as a part of yourself, making everything one, this is the union that yoga creates.']
Accessing the first sentence :
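Since sent_tokenize returns an ordinary Python list, the first sentence is simply index 0. A minimal sketch, using the output list shown above in place of rerunning nltk:

```python
# sentences is the list returned by nltk.sent_tokenize(paragraph) above
sentences = [
    'Yoga is not just about bending or twisting the body and holding the breath.',
    'It is a technique to bring you into a state where you see and experience reality simply the way it is.',
    'If you enable your energies to become exuberant and ecstatic, your sensory body expands.',
    'This enables you to experience the whole universe as a part of yourself, making everything one, this is the union that yoga creates.',
]

print(sentences[0])   # the first sentence token
print(len(sentences)) # 4 sentences in total
```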
Tokenizing Words — called word tokenization
# tokenizing words
# word_tokenize(), provided by the nltk library, splits text into words
words = nltk.word_tokenize(paragraph)
words
Output: a list of word tokens.
['Yoga',
'is',
'not',
'just',
'about',
'bending',
'or',
'twisting',
'the',
'body',
'and',
'holding',
'the',
'breath',
'.',
'It',
'is',
'a',
'technique',
'to',
'bring',
'you',
'into',
'a',
'state',
'where',
'you',
'see',
'and',
'experience',
'reality',
'simply',
'the',
'way',
'it',
'is',
'.',
'If',
'you',
'enable',
'your',
'energies',
'to',
'become',
'exuberant',
'and',
'ecstatic',
',',
'your',
'sensory',
'body',
'expands',
'.',
'This',
'enables',
'you',
'to',
'experience',
'the',
'whole',
'universe',
'as',
'a',
'part',
'of',
'yourself',
',',
'making',
'everything',
'one',
',',
'this',
'is',
'the',
'union',
'that',
'yoga',
'creates',
'.']
We can see that even punctuation marks such as full stops and commas are treated as tokens.
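Because punctuation appears as separate tokens, a common follow-up step is to filter it out. A minimal sketch in plain Python, using a small slice of the token list above:

```python
import string

# a slice of the token list produced by nltk.word_tokenize above
words = ['Yoga', 'is', 'not', 'just', 'about', 'bending', 'or',
         'twisting', 'the', 'body', 'and', 'holding', 'the', 'breath', '.']

# keep only tokens that are not pure punctuation
word_tokens = [w for w in words if w not in string.punctuation]
print(word_tokens)  # the '.' token is gone; 'breath' is now the last token
```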
There are various other operations involved in NLP text preprocessing, which I will discuss in other articles.