How to add custom stopwords and remove them from text in NLP
Stopwords are words that occur very frequently and therefore contribute little to the meaning of a sentence, paragraph, or document. Examples include "it's", "the", "a", "an", and "on". Our aim is to remove them from a text.
The NLTK library already ships with a list of stopwords, but if there are additional words we want the machine to ignore, we can add our own custom stopwords to that list.
In this article we will see how to do this step by step.
Step 1 — Importing and downloading stopwords from nltk.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
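Step 2 — Checking the stopwords list that the NLTK library provides for the English language.
# use a separate name so the imported stopwords module is not overwritten
english_stopwords = stopwords.words('english')
print(english_stopwords)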
Output :
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
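Every entry in this list is lowercase, and membership checks on a Python list are case-sensitive, which is why the text is converted to lowercase before filtering. A minimal sketch of the effect, using a small hand-picked subset of the list above so that it runs without any downloads (the name sample_stopwords is only for illustration):

```python
# A small subset of the NLTK English stopwords shown above (illustrative only)
sample_stopwords = ['i', 'me', 'my', 'the', 'a', 'an', 'on', "it's"]

# Membership tests are case-sensitive string comparisons
print('the' in sample_stopwords)          # True
print('The' in sample_stopwords)          # False
print('The'.lower() in sample_stopwords)  # True
```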
Step 3 — Let’s create a text / paragraph.
text = '''Yoga develops inner awareness. It focuses your attention on your body's abilities at the present moment.
It helps develop breath and strength of mind and body.
It's not about physical appearance.Yoga studios typically don't have mirrors.
This is so people can focus their awareness inward rather than how a pose — or the people around them — looks.'''
# converting text into lower case
text = text.lower()
print(text)
Output :
"yoga develops inner awareness. it focuses your attention on your body's abilities at the present moment.\nit helps develop breath and strength of mind and body. \nit's not about physical appearance.yoga studios typically don't have mirrors. \nthis is so people can focus their awareness inward rather than how a pose — or the people around them — looks."
Step 4 — Creating custom list of stopwords.
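# creating a custom list of stopwords
# word_tokenize splits "don't" into "do" and "n't" and "body's" into "body" and "'s",
# so these fragments, along with punctuation, are listed as separate entries
stop_list = ["n't","'s",".","—"]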
Step 5 — Adding the custom list of stopwords to the NLTK list of stopwords.
stpwrd = nltk.corpus.stopwords.words('english')
# extend() is used to add the custom stopwords
stpwrd.extend(stop_list)
print(stpwrd)
Output: we can see that the custom stopwords have been added to the NLTK list of stopwords.
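If the extend() call is run more than once, for example by re-running a notebook cell, the custom entries are appended again each time. One possible alternative is to combine both lists into a set, which keeps each entry only once and also makes membership checks faster. It is sketched here with a short stand-in list instead of the full NLTK list; base_stopwords and all_stopwords are hypothetical names.

```python
# Stand-in for nltk.corpus.stopwords.words('english'); an assumed subset for illustration
base_stopwords = ['it', 'the', 'a', 'an', 'on', 'not']
stop_list = ["n't", "'s", ".", "—"]

# A set union keeps each entry once, however often this is re-run
all_stopwords = set(base_stopwords) | set(stop_list)
all_stopwords = all_stopwords | set(stop_list)  # repeating the union changes nothing

print(len(all_stopwords))      # 10
print("n't" in all_stopwords)  # True
```

A set does not preserve order, so this only suits cases where the order of the stopwords does not matter, as in the filtering step of this article.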
Step 6 — Downloading and importing tokenizer from nltk.
nltk.download('punkt')  # tokenizer models; newer NLTK releases may need 'punkt_tab'
# importing tokenizer from nltk
from nltk.tokenize import word_tokenize
Step 7 — Performing word tokenization on text.
word_tokens = word_tokenize(text)
print(word_tokens)
Step 8 — Removing custom stop words.
custom_words = [word for word in word_tokens if word not in stpwrd]
print(custom_words)
Output: We can see that none of the custom stopwords we added are present in the list.
['yoga', 'develops', 'inner', 'awareness', 'focuses', 'attention', 'body', 'abilities', 'present', 'moment', 'helps', 'develop', 'breath', 'strength', 'mind', 'body', 'physical', 'appearance.yoga', 'studios', 'typically', 'mirrors', 'people', 'focus', 'awareness', 'inward', 'rather', 'pose', 'people', 'around', 'looks']
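Note that the token 'appearance.yoga' is still in the result. The original text has no space after the full stop in "appearance.Yoga", so the two words and the period stayed together as a single token that matches no stopword. A rough way to handle this is to insert the missing space before tokenizing. This assumes the text has no abbreviations, URLs, or similar strings where a period is directly followed by a letter. The helper name add_space_after_period is made up for this sketch.

```python
import re

def add_space_after_period(text):
    # Insert a space after any period that is immediately followed by a letter
    return re.sub(r'\.(?=[A-Za-z])', '. ', text)

sample = "it's not about physical appearance.yoga studios typically don't have mirrors."
fixed = add_space_after_period(sample)
print(fixed)
# it's not about physical appearance. yoga studios typically don't have mirrors.
```

Applied to the text before Step 7, this should let "appearance", "." and "yoga" be tokenized separately, as already happens at the other sentence boundaries in the output above.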