Stop Word and Tokenization with NLTK

Stop Words and Tokenization with NLTK: Natural Language Processing (NLP) is a sub-area of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (native) languages. This is nothing but how to program computers to process and analyze large amounts of natural language data.

NLP = Computer Science + AI + Computational Linguistics

In another way, Natural language processing is the capability of computer software to understand human language as it is spoken. NLP is one of the component of artificial intelligence (AI).

About Natural Language Processing Tool Kit

The Natural Language Toolkit, or more commonly is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.
It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
A software package for manipulating linguistic data and performing NLP tasks.
Natural Language Processing Tool Kit is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning
It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.
It includes more than 50 corpora and lexical sources such as the Penn Treebank Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s Dependency Thesaurus.

Following table shows the important modules in Natural Language Processing Tool Kit :

Language Processing Task	NLTK-modules	Functionality
Accessing corpora	nltk.corpus	Standardized interfaces to corpora and lexicons
String processing	nltk.tokenize, nltk.stem	Tokenizers, sentence tokenizers, stemmers
Collocation discovery	nltk.collocations	t-test, chi-squared, point-wise mutual information
Classification	nltk.classify, nltk.cluster	Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking	nltk.chunk	Regular expression, n-gram, named entity
Part-of-speech tagging	nltk.tag	n-gram, backoff, Brill, HMM, TnT
Parsing	nltk.parse	Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation	nltk.sem, nltk.inference	Lambda calculus, first-order logic, model checking
Evaluation metrics	nltk.metrics	Precision, recall, agreement coefficients
Probability and estimation	nltk.probability	Frequency distributions, smoothed probability distributions
Applications	nltk.app, nltk.chat	Graphical concordancer, parsers, WordNet browser, chatbots

The goal of designing the Natural Language Processing Tool Kit are as follows,

Simplicity:

It provides an in-built framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data

Consistency:

To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names

Text Data Analytics with NLTK and Python CourseExtensibility:

To provide a structure into which new software modules can be easily accommodated,including alternative implementations and competing approaches to the same task

Modularity:

To provide components that can be used independently without needing to understand

the rest of the toolkit

NLTK-Installation and Requirements

NLTK requires Python versions 2.7, 3.4, 3.5, or 3.6

NLTK-Library included in ANACONDA

On Windows or Linux or Mac, you can install NLTK using pip command. It can be directly installed using pip in terminal using the following command

>>> pip install nltk

You can directly Install from http://pypi.python.org/pypi/nltk

Step 1:

Open Command Prompt and type python

Step 2:

After a successful installation, we will go for importing-NLTK.

The NLTK-package is then included using the following command

>>>import nltk

Step 3:

Next, we need to install some of the packages or components for Natural Language Processing Tool Kit. For that Open command prompt or anaconda prompt then type,

>>>nltk.download( )

After typing nltk.download() we will get NLTK-downloader window through which we can download all the package needed for further processing in Natural Language Processing Tool Kit.

Natural Language Processing ApplicationsStop Words

The process of converting data to something a computer can understand is referred to as pre-processing. One of the primary forms of pre-processing is to filter out useless data. In natural language processing, useless words (data), are referred to as stop words.

There is also a corpus of stop words, that is, high-frequency of words like “the, to and also” that we sometimes want to filter out of a document before further processing.

A stop word is a commonly used word such as “the”, “a”, “an”, “in” that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

Stop words usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

>>> import nltk

>>> from nltk.corpus import stopwords

>>> stopwords.words(‘english’)

Program for Removing the stop words with NLTK:

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

example_sent = “This is a sample sentence, showing off the stop words filtration.”

stop_words = set(stopwords.words(‘english’))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:

if w not in stop_words:

filtered_sentence.append(w)

print(word_tokens)

print(filtered_sentence)

Output:

[‘This’, ‘is’, ‘a’, ‘sample’, ‘sentence’, ‘,’, ‘showing’,

‘off’, ‘the’, ‘stop’, ‘words’, ‘filtration’, ‘.’]

[‘This’, ‘sample’, ‘sentence’, ‘,’, ‘showing’, ‘stop’,

‘words’, ‘filtration’, ‘.’]

Learn Natural Language Processing with PythonTokenization

Tokenization is the process by dividing the quantity of text into smaller parts called tokens.

Alternatively, Tokenization is the process of breaking up the given text into units called tokens. The tokens may be words or number or punctuation mark. Tokenization does this task by locating word boundaries. Ending point of a word and beginning of the next word is called word boundaries.

These tokens are very useful for finding such patterns as well as is considered as a base step for stemming and lemmatization.

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Sent1: – Hello Friends, Welcome to the world of Natural Language Processing

Word Token in Sent1 are as follows

‘Hello’ ‘Friends’ ‘,’ ‘Welcome’ ‘to’ ‘the’ ‘world’ ‘of’ ‘Natural’ ‘Language’ ‘Processing’

Total Number of Tokens: – 11

There are two ways of tokenization in Natural Language Processing Tool Kit, one is sentence tokenization, and another is word tokenization

Code for Sentence Tokenization:

Following is the code for Sentence Tokenization

>>import nltk

>>from nltk.tokenize import sent_tokenize

>> EXAMPLE_TEXT = “Hello Mr. Nitin, what are you doing today? The weather is dull, and NLTK is awesome. The sky is pinkish-blue. You shouldn’t eat cardboard.”

>>print(sent_tokenize(EXAMPLE_TEXT))

>>print(len(sent_tokenize(EXAMPLE_TEXT)))

Output

[‘Hello Mr. Nitin, what are you doing today?’, ‘The weather is dull, and NLTK is awesome.’, ‘The sky is pinkish-blue.’, “You shouldn’t eat cardboard.”]

Code for Word Tokenization:

Following is the code for word tokenization,

>>import nltk

>>from nltk.tokenize import word_tokenize

>> EXAMPLE_TEXT = “Hello Mr. Nitin, what are you doing today? The weather is dull, and NLTK is awesome. The sky is pinkish-blue. You shouldn’t eat cardboard.”

>>print(word_tokenize(EXAMPLE_TEXT))

>>print(len(word_tokenize(EXAMPLE_TEXT)))

OUTPUT

[‘Hello’, ‘Mr.’, ‘Smith’, ‘,’, ‘how’, ‘are’, ‘you’, ‘doing’, ‘today’, ‘?’, ‘The’, ‘weather’, ‘is’, ‘great’, ‘,’, ‘and’, ‘Python’, ‘is’, ‘awesome’, ‘.’, ‘The’, ‘sky’, ‘is’, ‘pinkish-blue’, ‘.’, ‘You’, ‘should’, “n’t”, ‘eat’, ‘cardboard’, ‘.’]

Stop Words and Tokenization with NLTK

learntek

Infographics: Digital Marketing Certifications

SDLC (Software Development Life Cycle)

Related Posts:

How AI is affecting Digital Marketing in 2021

Artificial Intelligence – A Great Way to Improve Data Handling and Processing

Machine Learning and Artificial Intelligence

Leave a reply: Cancel Reply