
Stemming vs Lemmatization

 

Text is one of the messiest forms of data you will ever work with.

And there is always some redundancy in text data, because a word can take many forms due to:

1. Different spellings, e.g., color and colour.

2. Contractions, e.g., "can't" is a contraction for "cannot."

3. Inflected forms, i.e., words whose form changes to indicate distinctions such as tense, number, or person, e.g., "walked" is the past tense of "walk".

Fixing the redundancies caused by different spellings and contractions is a bit more involved, as it often requires manual intervention.

Inflected forms, however, are commonly handled through two methods: stemming and lemmatization.

Stemming is the fastest way to deal with inflected words. It simply chops off word endings in the hope of reaching the word's root form. It works well for converting words like "walking" or "walked" to "walk".

But, ask it to change the word "found" to "find", and you'll notice that stemming can sometimes be like fitting a square peg into a round hole.
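Here is a quick sketch of both behaviors using NLTK's PorterStemmer (this assumes `nltk` is installed; the Porter algorithm is purely rule-based, so no corpus downloads are needed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["walking", "walked", "walks", "found"]:
    print(word, "->", stemmer.stem(word))

# "walking", "walked", and "walks" all reduce to "walk",
# but "found" stays "found" -- suffix chopping cannot recover "find".
```

Because it only strips suffixes, a stemmer can also produce non-words (e.g., "studies" becomes "studi"), which is usually acceptable for search or matching but not for display.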

Lemmatization is slower, but it is a better alternative to stemming when you care more about accuracy than speed of execution. Lemmatization is essentially the tortoise to stemming's hare.

Lemmatization uses the context and part of speech of a word to find its "lemma": the base form of the word. Lemmatizers are usually backed by a lexical database containing mappings between words and their lemmas.

WordNetLemmatizer is one of the most popular lemmatizer implementations in Python. It is provided by the Natural Language Toolkit (NLTK) package and uses WordNet, a vast lexical database of the English language.

One of the most common mistakes when using this lemmatizer is forgetting to provide POS tags along with the tokens. WordNetLemmatizer assumes all tokens are nouns unless told otherwise. This omission leads to improper lemmatization of verbs, which can make stemming's results look superior.


However, adding a POS tagging step before the lemmatizer increases time and memory consumption. Therefore, it is important to weigh your time and memory constraints when deciding between stemming and lemmatization.









