Stemming vs Lemmatization in NLP

Niraj Bhoi
Dec 14, 2022

Words often take on different meanings depending on how they are used in a text. Similarly, different forms of a word convey related meanings: “toy” and “toys”, for example, denote the same thing.

Searching for “toy” and then searching for “toys” probably reflects the same intent. However, this kind of difference between word forms, called “inflection”, causes various problems in understanding queries.

Conversely, words such as “came” and “camel” look alike but do not share a root, so the search intent behind them differs. Similarly, searching Google for the word “Love” also matches its inflected forms, such as “Loves,” “Loved,” and “Loving.”

Stemming and lemmatization are strategies used to simplify various search queries.

Stemming and lemmatization were developed in the 1960s. They are text normalization techniques used in text mining and natural language processing to prepare texts, words, and documents for further processing. They are widely used in tagging systems, SEO, web search, and information retrieval.

When implementing NLP, we constantly encounter words that share a root but appear in different forms. For example, crude suffix stripping reduces the word “caring” to “car”, whereas lemmatization returns “care”.

Stemming

Stemming is a rule-based approach that reduces inflected or derived words to a root/base form. Simply put, it reduces the variants of a word to a common stem. This heuristic process is the simpler of the two because it indiscriminately trims the ends of words. Normalizing words to a common stem regardless of inflection helps many applications, such as clustering and text classification. Search engines use these methods extensively to return better results regardless of word form. Before Google implemented stemming in 2003, a search for the word “fish” did not include websites about fishes or fishing.
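To make the idea of indiscriminately trimming word endings concrete, here is a minimal sketch of a toy suffix stripper. The function name naive_stem, the suffix list, and the minimum stem length are all assumptions made for illustration; this is not one of the published stemming algorithms.

```python
# Hypothetical, deliberately tiny rule set: longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Chop the first matching suffix off the word, ignoring context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word would remain afterwards.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["toys", "loved", "loving", "caring"]:
    print(w, "->", naive_stem(w))
# toys -> toy
# loved -> lov
# loving -> lov
# caring -> car
```

The stems “lov” and “car” show the price of this approach: the rules are fast and simple, but the result is not always a real word, and it can collide with an unrelated one.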

Challenges of stemming

Stemming helps to shorten the look-up and normalize the sentences for a better understanding. The process has two main challenges:

1. Over stemming

2. Under stemming

Over stemming

Over stemming occurs when too much of a word is cut off. The resulting stem may be meaningless, and words with different meanings may end up with the same stem. For example, “universal”, “university”, and “universe” are all reduced to “univers”. The three words are etymologically related but have vastly different modern meanings. When search engines treat them as synonyms, search results get worse.

Under stemming

Under stemming occurs when words that are actually forms of one another are not reduced to the same stem. Examples of under stemming in Porter’s stemmer are “alumnus” → “alumnu”, “alumni” → “alumni”, and “alumna” / “alumnae” → “alumna”. This English word keeps its Latin forms, so these near-synonyms are not conflated.
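Both problems can be observed with NLTK's PorterStemmer. The sketch below is only an illustration: the two word groups are assumptions chosen for the demonstration, the first being a typical over stemming case and the second (the “alumnus” family) a commonly cited under stemming case.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Words with different modern meanings that collapse to one stem (over stemming).
over_stemmed = ["universal", "university", "universe"]
# Forms of one word that end up with different stems (under stemming).
under_stemmed = ["alumnus", "alumni", "alumna", "alumnae"]

for group in (over_stemmed, under_stemmed):
    stems = {word: porter.stem(word) for word in group}
    print(stems, "->", len(set(stems.values())), "distinct stem(s)")
```

The first group should print a single distinct stem, while the second group should print more than one.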

Stemming algorithms

Porter’s Stemmer

The Porter stemmer, proposed in 1980, is one of the most popular stemming algorithms. It is based on the idea that suffixes in English are made up of combinations of smaller and simpler suffixes.

Although it is known to be efficient and straightforward, it also has a number of drawbacks. It can only be used for English words because it relies on many hard-coded rules derived from English. In some cases the output of the Porter stemmer is also an artificial stem rather than an actual English word.

Lancaster stemming algorithm

The Lancaster stemmer is one of the most aggressive stemmers and tends to over stem many words. It was developed at Lancaster University and is another very common stemming algorithm. Like the Porter stemmer, the Lancaster stemmer consists of a set of rules where each rule specifies either deletion or replacement of an ending. Some rules are restricted to intact words, and some are applied iteratively as the word goes through them.

There are two additional conditions to prevent the stemming of various short-rooted words:

  • If the word starts with a vowel, then at least two letters must remain after stemming (owing -> ow, but not ear -> e)
  • If the word starts with a consonant, then at least three letters must remain after stemming, and at least one of these must be a vowel or “y” (saying -> say, but not string -> str)
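These two conditions can be checked with NLTK's LancasterStemmer. This is a small sketch; the four test words are assumptions picked to exercise each condition and are not taken from the original article's code.

```python
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()

# Two vowel-initial and two consonant-initial words that probe the conditions.
for word in ["owed", "ear", "saying", "string"]:
    print(f"{word} -> {lancaster.stem(word)}")
# owed -> ow
# ear -> ear
# saying -> say
# string -> string
```

“ear” and “string” are left untouched because stripping their endings would leave too little of the word behind.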

Regexp Stemmer — RegexpStemmer()

NLTK has a RegexpStemmer class that makes it easy to implement a regular-expression stemmer. It takes a single regular expression and uses it to identify morphological affixes: every prefix or suffix that matches the expression is discarded. To use it, create an instance of the RegexpStemmer class and give it the suffixes or prefixes to remove from words.
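A minimal sketch of the class in use follows. The pattern and the word list are assumptions chosen for illustration; the min argument tells the stemmer to leave words shorter than that length unchanged.

```python
from nltk.stem import RegexpStemmer

# Strip a few common suffixes; words shorter than 4 characters are left alone.
stemmer = RegexpStemmer(r"ing$|s$|e$|able$", min=4)

for word in ["cars", "mass", "was", "bee", "compute", "advisable"]:
    print(f"{word} -> {stemmer.stem(word)}")
# cars -> car
# mass -> mas
# was -> was
# bee -> bee
# compute -> comput
# advisable -> advis
```

Because the stemmer only knows the one expression it was given, “mass” loses its final “s” just like a plural would.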

Snowball stemming algorithm

NLTK has a SnowballStemmer class that makes it easy to use the Snowball stemming algorithm. Besides English, it supports more than a dozen other languages. To use this stemmer class, create an instance with the name of the language you are using and then call its stem() method.
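As a short sketch of that workflow, the snippet below lists the languages available in the installed NLTK version and stems a few words. The example words are assumptions chosen for illustration.

```python
from nltk.stem import SnowballStemmer

# Language names accepted by the constructor (the set depends on the NLTK version).
print(SnowballStemmer.languages)

english = SnowballStemmer("english")
print(english.stem("running"))     # run
print(english.stem("generously"))  # generous

german = SnowballStemmer("german")
print(german.stem("Autobahnen"))   # autobahn
```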

Python Implementation of Stemming

Natural Language Toolkit (NLTK) is a Python library for building programs that work with natural language. It provides a user-friendly interface to over 50 corpora and lexical resources, such as the WordNet lexical database. The library can perform different operations such as tokenizing, stemming, classification, parsing, tagging, and semantic reasoning.

As mentioned previously, there are different stemming algorithms. Here we stem words with PorterStemmer and LancasterStemmer; nltk.stem is the package that provides these and other stemmer classes.

Stemming Words
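The following sketch stems a handful of words with both classes so the results can be compared side by side. The word list is an assumption chosen for illustration and is not the article's original input.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["cats", "trouble", "troubling", "troubled", "maximum"]

# Print one row per word: original, Porter stem, Lancaster stem.
print(f"{'Word':<12}{'Porter':<12}{'Lancaster':<12}")
for word in words:
    print(f"{word:<12}{porter.stem(word):<12}{lancaster.stem(word):<12}")
```

Both stemmers map “trouble”, “troubling”, and “troubled” to the artificial stem “troubl”, while only the more aggressive Lancaster stemmer shortens “maximum” (to “maxim”).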

Stemming Sentences

We can stem sentences and documents using NLTK stemmers. To stem each word in a sentence, we first have to split the sentence into words with a tokenizer. The NLTK tokenizer separates the sentence into words.
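A minimal sketch of this tokenize-then-stem approach is shown below. Two things are assumptions: the input sentence is reconstructed from the output that follows, and wordpunct_tokenize is used as the tokenizer because it is purely regex-based and needs no extra model download (the original code may well have used a different NLTK tokenizer).

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

porter = PorterStemmer()


def stem_sentence(sentence):
    """Split the sentence into tokens, stem each one, and join them again."""
    tokens = wordpunct_tokenize(sentence)
    return " ".join(porter.stem(token) for token in tokens)


# Assumed input sentence, reconstructed from the stemmed output.
sentence = ("Pythoners are very intelligent and work very pythonly "
            "and now they are pythoning their way to success.")
print(stem_sentence(sentence))
```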

Output:

python are veri intellig and work veri pythonli and now they are python their way to success .

Stemming a document

Stemming a whole document is also possible. The process consists of the following steps:

  1. Take a document as the input
  2. Read the content in the document line by line
  3. Tokenize each line
  4. Stem the words
  5. Output the Stemmed words
  6. Repeat steps 2–5 for the whole document
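The steps above can be sketched as a small function. The article's own input file is not reproduced here, so this sketch uses an in-memory two-line stand-in (io.StringIO) in place of an open file; the function name stem_document and the regex-based wordpunct_tokenize tokenizer are likewise assumptions.

```python
import io

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

porter = PorterStemmer()


def stem_document(lines):
    """Stem a document given as an iterable of lines, e.g. an open text file."""
    stemmed_lines = []
    for line in lines:                                  # step 2: read line by line
        tokens = wordpunct_tokenize(line)               # step 3: tokenize the line
        stems = [porter.stem(token) for token in tokens]  # step 4: stem the words
        stemmed_lines.append(" ".join(stems))           # step 5: collect the output
    return stemmed_lines                                # step 6: whole document done


# Stand-in for step 1; with a real file, use: with open(path) as document: ...
document = io.StringIO(
    "Pythoners are very intelligent and work very pythonly\n"
    "and now they are pythoning their way to success.\n"
)
for stemmed in stem_document(document):
    print(stemmed)
```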

Output:

Stemmed sentence

natur languag process ( nlp ) refer to the branch of comput science-- and more specif , the branch of artifici intellig or ai-- concern with give comput the abil to understand text and spoken word in much the same way human be can

Typical errors of stemming are the following:

  • Over stemming: Happens when too much is removed. For example, ‘wander’ becomes ‘wand’; ‘news’ becomes ‘new’; or ‘universal’, ‘universe’, ‘universities’, and ‘university’ are all reduced to ‘univers’. A better result would be ‘univers’ for the first two and ‘universi’ for the last two.
  • Under stemming: Happens when words are from the same root but are not seen that way. For example, ‘knavish’ remains as ‘knavish’; ‘data’ and ‘datum’ become ‘dat’ and ‘datu’ although they’re from the same root.
  • Mis-stemming: Usually not a problem unless it leads to false conflations. For example, ‘relativity’ becomes ‘relative’.

Lemmatization

Lemmatization is an evolution of stemming and describes the process of grouping the various inflectional forms of a word so that they can be analyzed as a single element. Lemmatization is similar to stemming, but it takes the context of a word into account, so the different forms of a word are mapped to one meaningful base word. Lemmatization algorithms usually also take the part of speech as input, such as whether a word is an adjective, noun, or verb.

Whenever you do text preprocessing for NLP, you need to know both stemming and lemmatization. Sometimes you will come across articles or discussions where the two terms are used interchangeably even though they are not the same. Lemmatization is generally preferred over stemming because it analyzes words in context instead of using hard-coded rules to remove suffixes. However, if the text document is very long, lemmatization takes considerably longer, which is a serious drawback.

Lemmatization entails reducing words to a standard or dictionary form. The root of the word is called the “lemma.” This method groups the inflected forms of a word so that they can be recognized as a single element. The process is similar to stemming, but the resulting root is a meaningful word.

Lemmatization Implementation in Python

NLTK provides the WordNetLemmatizer class, which uses the WordNet database to look up the lemmas of words.

Lemmatization of words

Lemmatization of Sentence

We can also lemmatize whole sentences in Python. To make lemmatization more accurate and context dependent, we have to find the POS tag of each token and pass it to the lemmatizer. First we find the POS tag for each token using NLTK, then we map it to the corresponding WordNet tag, and finally we lemmatize the token based on that tag.

Lemmatization vs Stemming

Both procedures pursue the same goal: the inflected forms of each word are reduced to a common stem or root. The main difference lies in how they work and hence in the results each returns.

  • Stemming is a faster process than lemmatization as stemming chops off the word irrespective of the context, whereas the latter is context-dependent.
  • Stemming is a rule-based approach, whereas lemmatization is a canonical dictionary-based approach.
  • Lemmatization has higher accuracy than stemming.
  • Lemmatization is preferred for context analysis, whereas stemming is recommended when the context is not important.

Applications Of Lemmatization

  1. Biomedicine: Using lemmatization to parse biomedical literature may increase the efficiency of data retrieval tasks.
  2. Search engines
  3. Compact indexing: Lemmatization is an efficient method for storing data in the form of index values.

Applications Of Stemming

  1. Stemming is used in information retrieval systems like search engines.
  2. It is used to determine domain vocabularies in domain analysis.
  3. It is used when indexing documents for search, as documents are converted into numeric representations, and when mapping documents to common topics.
  4. Sentiment analysis, which examines reviews and comments made by different users, is frequently used for product analysis, such as for online retail stores. Stemming is applied as a text-preparation step before the text is interpreted.

Conclusion

  • Stemming and Lemmatization are methods that help us in text preprocessing for Natural Language Processing.
  • Both of them help to map multiple words to a common root word.
  • That way, these words are treated similarly and the model learns that they can be used in similar contexts.
