Using simple Python libraries
There’s so much going on in natural language processing these days (GRUs, LSTMs, XLNet, BERT and so on!). It can be confusing figuring out where to begin. This article talks about the basics of natural language processing including data cleaning, normalization, encoding, sentiment analysis and a simple text classifier using basic yet powerful Python libraries. This is often the first step before diving into complicated deep learning models.
Data Cleaning and Normalization
Depending on the nature of the problem, this step may or may not be required. If our model is trying to learn the language to the largest extent, it may be best to use the data in its raw format; in fact, modern deep learning techniques recommend keeping stop words, emojis, and original casing because they provide additional context. However, if you’re trying to do a trend analysis or classification based on certain word occurrences (like in a bag-of-words model), it helps to perform this step. There are a few common preprocessing steps I’d like to highlight here:
- Removing punctuation: When trying to train a machine learning model, it helps to reduce overfitting by removing punctuation (like !,* etc.). However, be careful to not remove something important, for example, question marks (?) help to recognize questions.
- Removing emojis: Sometimes people attach emojis to words without spaces (for example: you❤ ) making it difficult to interpret such words. Removing emojis can help with such cases. Again, be careful while removing these as emojis might actually be really useful for tasks like sentiment analysis and topic classification.
- Removing stop words: For tasks like data exploration and trend analysis, it might not be very useful to see common words like ‘the’, ‘and’, ‘of’ etc. The `sklearn` package actually has a collection of commonly used English stop words that we can use to remove these.
- Making all text lowercase: This is the simplest way to normalize text. (After all, `Better` and `better` do have the same semantic implication.)
- Stemming words: Another way of normalizing is by replacing derived words with their root form (e.g., ‘posting’, ‘posted’, ‘posts’ are all replaced by ‘post’). To stem words we use the `PorterStemmer` utility provided by `nltk`.
- Extracting/Removing hashtags and mentions: Hashtags and mentions can be very useful in identifying trends in your data. It helps to extract them out of your text and analyze them separately.
Here’s a simple function to perform the above-mentioned tasks:
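A minimal, stdlib-only sketch of such a function is shown below. The stop-word set here is a tiny illustrative one (in practice you’d use `sklearn`’s built-in English stop words), and the `nltk` `PorterStemmer` step is left out to keep the snippet dependency-free:

```python
import re

# Tiny illustrative stop-word set; swap in sklearn's ENGLISH_STOP_WORDS in practice
STOP_WORDS = {"the", "and", "of", "a", "an", "is", "to"}

def clean_text(text):
    """Lowercase, strip punctuation (keeping '?'), drop emojis and stop words,
    and extract hashtags/mentions for separate analysis."""
    hashtags = re.findall(r"#\w+", text)
    mentions = re.findall(r"@\w+", text)
    text = re.sub(r"[#@]\w+", "", text)   # remove the extracted hashtags/mentions
    text = text.lower()
    text = re.sub(r"[^\w\s?]", "", text)  # strip punctuation and emojis, keep '?'
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens), hashtags, mentions

cleaned, hashtags, mentions = clean_text("I ❤ the #sunset, don't you @sam?")
print(cleaned)   # "i dont you ?"
print(hashtags)  # ["#sunset"]
print(mentions)  # ["@sam"]
```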
Word vectors — What are they?
Machine learning algorithms are only capable of processing fixed-length numerical inputs, i.e., they cannot take string inputs to process textual data! This is where word vectors come in: we represent each word using a vector of fixed length. Individual word vectors are then used to encode sentences.
One hot encoding:
This is the simplest way of encoding words. It assumes a bag-of-words representation where each word is considered an independent entity and word relationships are ignored (for example, `profession` and `occupation` are treated as completely independent words even though they practically have the same meaning). This method involves creating a vocabulary of distinct words from the entire corpus, the length of this vocabulary being the length of each word vector. Each word has a designated index in the vector, and that index is marked 1 (while all others are marked 0) to represent the particular word.
The vocabulary here consists of 9 distinct words, and these words can be one hot encoded into vectors of length 9. The word vector representations are:
`going`: [1,0,0,0,0,0,0,0,0]
`good`: [0,1,0,0,0,0,0,0,0] and so on.
Using this representation, the text “Tomorrow will be a good day” can be encoded into [0,1,1,0,1,1,0,1,0]. Notice how the word `will` is ignored because it doesn’t exist in the vocabulary at all. Having a good and extensive vocabulary is necessary for making this model work well. Also note how word relationships (the order of occurrence, semantic relationships) are completely ignored in this representation.
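The encoding above can be sketched in a few lines of plain Python. The 9-word vocabulary below is partly assumed: only `going`, `good`, `tomorrow`, `be`, `a`, and `day` are fixed by the example, and the words at indices 3, 6, and 8 are placeholders:

```python
# Illustrative 9-word vocabulary; "is", "it" and "nice" are placeholder entries
vocab = ["going", "good", "tomorrow", "is", "be", "a", "it", "day", "nice"]

def one_hot(word):
    """One-hot vector: 1 at the word's vocabulary index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def encode_sentence(sentence):
    """Bag-of-words encoding: 1 for every vocabulary word present in the text."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(one_hot("going"))                                # [1, 0, 0, 0, 0, 0, 0, 0, 0]
print(encode_sentence("Tomorrow will be a good day"))  # [0, 1, 1, 0, 1, 1, 0, 1, 0]
```

Out-of-vocabulary words like `will` simply contribute nothing to the sentence vector, which is exactly the limitation noted above.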
Word2Vec word embeddings:
This method of word encoding (commonly known as word embeddings) takes context into consideration. For example, we can expect the words `king` and `royal` to have a smaller spatial distance than `king` and `honey`. Word2Vec uses a shallow two-layer neural network to perform a specific task (based on the method used) and learns weights for the hidden layer for every word. These learned hidden-layer weights are used as our final word vectors. You can read the original paper to get an in-depth understanding of how these word vectors are obtained. But at a high level, these are the two common methods of obtaining context-based word vectors using Word2Vec:
CBOW (Continuous Bag of words):
The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words).
Skip-gram model:
The Skip-gram model architecture tries to achieve the reverse of what the CBOW model does. It tries to predict the source context words (surrounding words) given a target word (the center word).
The window size (number of surrounding words considered on each side) is a hyperparameter in both cases.
GloVe embeddings:
GloVe is quite similar to Word2Vec, but unlike Word2Vec, GloVe takes advantage of the global co-occurrences of words rather than just the local context, which makes it more powerful in some ways. Again, you can get a better understanding by going through the original paper.
Word Embeddings — How do I use them?
Now that we have a rough idea about what word embeddings are and why they are useful, let’s talk about how we can use them to our advantage.
Using pre-trained word vectors:
There are many publicly available pre-trained word vectors of different vector lengths, like GloVe, fastText, etc. These have been trained on massive corpora (Wikipedia, Twitter, and Common Crawl datasets) and can be downloaded and used to encode words in our corpus.
Example: Finding the most similar document to a given document using word vector similarity
Given a set of documents belonging to different topics (the training set) and a new document, can we find the most similar document to it from the original set?
- Load the pre-trained word vector file into a dictionary with the word as key and its vector representation as the value.
- Find the centroid vector for each document in the training set by averaging the word vectors of words that exist in the particular document (ignore words that are not part of the vocabulary)
- Find the centroid of the new document, pick the document from the training set whose centroid is closest to the new document’s centroid (using a suitable measure of similarity, like Euclidean distance, cosine similarity, etc.)
Here are some helper functions to load the GloVe dictionary, find the centroid of a document, and find the distance between centroids:
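A sketch of those helpers, assuming the standard GloVe text format (one word per line followed by its float components) and using cosine distance as the similarity measure:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def centroid(document, vectors):
    """Average the vectors of in-vocabulary words; None if no word is known."""
    word_vecs = [vectors[w] for w in document.lower().split() if w in vectors]
    return np.mean(word_vecs, axis=0) if word_vecs else None

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, larger when dissimilar."""
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The most similar training document is then the one whose centroid has the smallest `cosine_distance` to the new document’s centroid.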
Training to generate word vectors from scratch:
If you want to find word vectors for your particular corpus, you can use the `gensim` package for training.
In the example above, I’ve just used two lines from this Wikipedia page. The training is very fast and simple; all you need to input is the list of words, the size of the word vectors you need, the window size (number of surrounding words to be considered), and the minimum number of occurrences of a word for it to be considered part of the vocabulary. It’s easy to inspect the vocabulary, obtain a vector, and look at the most common words from the corpus. Of course, training from scratch might not always yield results as good as the pre-trained ones, but it is good for problems involving data that looks very different from the datasets used in pretraining.
Exploratory Data Analysis:
EDA with text data is not as straightforward as with tabular or numerical data. However, there are some libraries that can make these tasks easier. For the rest of this article I’ve used the following dataset from Kaggle:
Exploration using spaCy:
spaCy is a very powerful NLP library that has a variety of uses. It can be used for named entity recognition, identifying the part of speech a word belongs to, and even giving the word vector and sentiment of a word. The `nlp` pipeline from `spacy` converts text into tokens, each carrying attributes such as its part of speech, lemma, and entity type.
Wordclouds are a simple yet interesting way to visualize how frequently various words appear in our corpus. Let’s take the most frequently occurring nouns in our comments’ data for example:
A very common task in NLP is identifying how positive or negative a particular comment or piece of text is. The `vaderSentiment` package provides a quick and easy way to do this:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
For classification, one of the easiest libraries I’ve used is `fasttext`. It was released by Facebook in 2016 and uses a linear technique both for combining the word vectors into a vector representing the text and for computing the classification criterion. It takes very little time to train and gives decent results for most common text-classification problems, so it can be used to come up with a baseline model. You can read the original paper to get a better understanding of the mechanics behind the library.
This is how I used `fasttext` to classify toxic vs non-toxic comments:
We’ve touched upon most of the basics but of course, there’s a lot more to NLP. However, this article is a good starting point and hopefully helpful for beginners because these were the first things I learned when I started off!