Text is one of the messiest forms of data you will ever work with.
Some amount of redundancy is almost always present, because a word in the text can take many forms due to:
1. Different spellings, e.g., color and colour.
2. Contractions, e.g., "can't" is a contraction for "cannot."
3. Inflected forms, i.e., changed forms of a word that indicate distinctions such as tense, number, or person, e.g., "walked" is the past tense of "walk".
Fixing the redundancies due to different spellings and contractions is a bit complicated, as it might require some manual intervention.
However, inflection forms are commonly handled through two methods: stemming and lemmatization.
Stemming is the fastest way to deal with inflected words. It simply chops off word endings in the hope of reaching the word's root form. It works well for converting words like "walking" or "walked" to "walk".
But, ask it to change the word "found" to "find", and you'll notice that stemming can sometimes be like fitting a square peg into a round hole.
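A minimal sketch of this behavior, using NLTK's `PorterStemmer` (stemming needs no corpus downloads, only the `nltk` package; the example words are taken from the text above):

```python
# Stemming with NLTK's PorterStemmer: fast, rule-based suffix chopping.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem("walking"))  # walk
print(stemmer.stem("walked"))   # walk
print(stemmer.stem("found"))    # found -- not "find": stemming only strips suffixes
```

Because the Porter rules only strip suffixes, irregular forms like "found" pass through unchanged.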
Lemmatization is slower, but a better alternative to stemming when you care more about accuracy than speed of execution. Lemmatization is essentially the tortoise to stemming's hare.
Lemmatization uses the context and part of speech of a word to find what is known as a "lemma", the base form of the word. Lemmatizers are usually backed by a lexical database that contains conditional mappings between words and their lemmas.
WordNetLemmatizer is one of the most popular lemmatizer implementations in Python. It is provided by the Natural Language Toolkit (NLTK) and uses WordNet, a vast lexical database of the English language.
One of the most common mistakes made when using this lemmatizer is forgetting to provide POS tags along with the tokens. WordNetLemmatizer assumes that all tokens are nouns unless otherwise specified. This omission leads to improper lemmatization of verbs, and can even make plain stemming look superior by comparison.
However, adding a POS-tagging step before the lemmatizer increases time and memory consumption. It is therefore important to weigh your time and memory constraints when deciding between stemming and lemmatization.