The 5 phases in Natural Language Processing

Shashank Prasad
7 min read · Jun 16, 2021

Natural language processing (NLP) is the understanding and manipulation of a natural language, i.e. the text or speech we humans use to communicate. Given our dependence on technology, the number of calls we make, the messages we send, and the amount of data exchanged on the internet, a great deal of data is generated as text or speech (audio files). We need ways to understand and process this data, which is why natural language processing is of utmost importance for using it to our benefit.

NLP can be used for numerous applications, such as search engines, chatbots, review analysis, understanding the sentiment of a tweet or product review, voice assistants, spelling correction, text prediction, and many more.

But to implement these, we first need to understand natural language processing and design a system accordingly. There are 5 phases of NLP that one needs to understand:

1. Lexical analysis

2. Parsing

3. Semantic analysis

4. Discourse Integration

5. Pragmatic Analysis

The 5 phases of NLP (Source: https://www.javatpoint.com/nlp)

Let’s discuss these phases one by one.

1. Lexical Analysis:

Lexical analysis is the first phase of NLP. It deals with understanding the meaning of words, their relation to other words, and the context in which they appear. Lexical analysis is usually the starting point of an NLP pipeline.

Lexical analysis converts the complex input text (or program) into tokens in a particular order.

With this statement, another question arises: what are tokens?

A token in NLP refers to a sequence of characters that can be treated as a single unit in the grammar of the respective language.

Lexical analysis can be used in many forms:

a. As part of a compiler: lexical analysis can be used to build a compiler for a programming language. It takes the source code as input and breaks it into tokens, removing the whitespace and comments that are not supposed to be analyzed further. After tokenization, the analyzer extracts meaning by classifying the tokens into keywords, operators, and variables. This can be implemented with deterministic finite automata, as elaborated in the figure below.

The flow of how a compiler is created using lexical analysis (Source:https://www.geeksforgeeks.org/introduction-of-lexical-analysis/#:~:text=Lexical%20Analysis%20is%20the%20first,the%20parser%20for%20syntax%20analysis)

b. When used for a chatbot, the tokens are looked up in a database to understand the intent of the words and their relation to the sentence they appear in. In this setting, lexical analysis may also group multiple words together, called n-grams, when analyzing the sentence.
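To make tokenization and n-grams concrete, here is a minimal sketch in plain Python. It assumes a simple regex-based tokenizer that lowercases words and drops punctuation; real pipelines typically use a library tokenizer (for example, NLTK or spaCy) instead.

```python
import re

def tokenize(text):
    # Split raw text into lowercase word tokens; punctuation is dropped
    return re.findall(r"[a-zA-Z']+", text.lower())

def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Sun rises in the east."
tokens = tokenize(sentence)
print(tokens)             # ['sun', 'rises', 'in', 'the', 'east']
print(ngrams(tokens, 2))  # [('sun', 'rises'), ('rises', 'in'), ('in', 'the'), ('the', 'east')]
```

A chatbot could then look these tokens or bigrams up against its intent database, as described above.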

2. Parsing

Grammar is essential for describing the syntactic structure of well-formed programs. In the literary sense, grammars denote the syntactical rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English, Hindi, etc.

The theory of formal languages is also applicable in computer science, mainly in programming languages and data structures. For example, in the ‘C’ language, precise grammar rules state how functions are made from lists and statements.

The origin of the word ‘parsing’ is from the Latin word ‘pars’ which means ‘part’.

It means to break down a given sentence into its ‘grammatical constituents’.

The purpose of this phase is to analyze the grammatical structure of the text. Syntax analysis checks the text against the rules of formal grammar; a phrase like “hot ice cream”, although syntactically well-formed, would later be rejected by the semantic analyzer because it is meaningless.

Correct Syntax: Sun rises in the east.
Incorrect Syntax: Rise in sun the east.

Parser:

A parser is the software component that implements parsing. It takes input data (text), checks it for correct syntax as per a formal grammar, and gives a structural representation of the input. It builds a data structure, generally a parse tree, an abstract syntax tree, or another hierarchical structure (a small parsing sketch follows the list below).

The main roles of the parser include:

  • To report any syntax error.
  • To recover from commonly occurring errors so that the processing of the remainder of the program can continue.
  • To create a parse tree.
  • To create a symbol table.
  • To produce intermediate representations (IR).
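As a rough illustration of what a parser produces, here is a small sketch using NLTK’s chart parser with a toy context-free grammar written just for the example sentence “Sun rises in the east”. The grammar rules are my own illustrative assumption, not a general grammar of English.

```python
import nltk

# A toy context-free grammar that covers only the example sentence
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | N
VP -> V PP
PP -> P NP
Det -> 'the'
N  -> 'sun' | 'east'
V  -> 'rises'
P  -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = ['sun', 'rises', 'in', 'the', 'east']
for tree in parser.parse(tokens):
    tree.pretty_print()  # prints the parse tree, the structural representation described above
```

A scrambled token order such as ['rises', 'in', 'sun', 'the', 'east'] yields no parse tree at all, which is one way a parser signals a syntax error.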

3. Semantic Analysis:

Semantic analysis is the process of understanding natural language in the way humans communicate: it extracts the meaning of a text. Semantics focuses on the literal meaning of each word, phrase, and sentence, i.e. the dictionary meaning of the given text.
The semantic analysis starts by reading every word in the content to capture the actual meaning of the text. It identifies the text elements and assigns them their logical and grammatical roles. It analyzes the context in the surrounding text and, finally, it disambiguates the words that have more than one meaning.

Now let’s look at the different techniques used to achieve this.

A. Coreference resolution

Example of Coreference Resolution (Source: https://towardsai.net/p/nlp/c-r)

Through coreference resolution, we try to find out which phrase refers to which entity; in this method, we need to find all the references to an entity within the text document. We have to consider more than just pronouns, since both noun phrases and pronouns are used as references: words like “this”, “that”, and “it” can also refer to an entity.
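Production coreference systems are statistical (for example, neural models in modern NLP libraries), but the core idea of linking a mention back to an entity can be shown with a deliberately naive heuristic. The sketch below is only illustrative: it resolves every pronoun to the most recent capitalized word, and the example output shows exactly why such a simple rule is not enough.

```python
import re

PRONOUNS = {"he", "she", "it", "they", "this", "that"}

def naive_coref(text):
    # Link each pronoun to the most recently seen capitalized word (a crude stand-in for an entity)
    tokens = re.findall(r"[A-Za-z]+", text)
    last_entity = None
    links = []
    for tok in tokens:
        if tok[0].isupper() and tok.lower() not in PRONOUNS:
            last_entity = tok                   # remember the latest named mention
        elif tok.lower() in PRONOUNS and last_entity:
            links.append((tok, last_entity))    # resolve the pronoun to it
    return links

print(naive_coref("John bought a car. He drives it every day."))
# [('He', 'John'), ('it', 'John')] -- 'He' is right, but 'it' should point to the car
```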

B. Semantic role labeling

Every sentence has a predicate that conveys the main logic of that sentence, and the main verb helps in identifying that predicate. Identifying the predicate and the arguments of that particular predicate is known as semantic role labeling (a rough sketch follows the figure below).

Example of Semantic Role Labeling (Source: https://www.slideshare.net/marinasantini1/semantic-role-labeling)
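True semantic role labeling relies on dedicated models, but the idea of finding a predicate and its arguments can be approximated from a dependency parse. The sketch below uses spaCy’s small English model (en_core_web_sm, assumed to be installed) and treats each verb’s subject and object as its arguments; it is a rough stand-in for SRL, not an SRL system, and the exact output depends on the model.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def predicate_arguments(sentence):
    # Rough "who did what to whom" extraction from the dependency parse
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":  # treat each verb as a predicate
            subj = [w.text for w in token.children if w.dep_ in ("nsubj", "nsubjpass")]
            obj = [w.text for w in token.children if w.dep_ in ("dobj", "obj")]
            triples.append((subj, token.lemma_, obj))
    return triples

print(predicate_arguments("Jose Mourinho praised Juan Mata after the match."))
# e.g. [(['Mourinho'], 'praise', ['Mata'])] -- only the head word of each argument is shown
```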

C. Word sense disambiguation

There are different types of ambiguity that a language processor has to resolve. A word can have different meanings, which makes it ambiguous. For example, the word “Will” can have two meanings in different sentences: in “Will has a car”, “Will” is a noun (a name), while in “He will go to the market”, “will” is a verb. To resolve this ambiguity, context comes in handy: the neighboring words help determine the sense of the word. Hence, with the help of word sense disambiguation (WSD), we can select the correct sense of a particular word.
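NLTK ships a classic WSD baseline, the Lesk algorithm, which picks the WordNet sense whose dictionary gloss overlaps most with the context. WordNet does not contain the proper noun “Will” from the example above, so the sketch below uses the standard “bank” ambiguity instead; it assumes the WordNet corpus has been downloaded, and the exact sense returned can vary.

```python
from nltk.wsd import lesk

# Assumes the WordNet data is available: import nltk; nltk.download('wordnet')
context = "I went to the bank to deposit my money".split()

sense = lesk(context, "bank", "n")  # restrict to noun senses of 'bank'
print(sense)                        # a WordNet Synset, typically a money-related sense of 'bank'
print(sense.definition())           # the dictionary gloss of the chosen sense
```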

D. Named entity recognition

This method focuses on extracting and identifying named entities such as persons, locations, and organizations, which typically appear as proper nouns.

Example of Named entity recognition (Source: https://lionbridge.ai/business-resources/named-entity-recognition-dataset/)

In the above example, we identify several named entities: “Manchester United” and “Newcastle United” are categorized as organizations, “Old Trafford” is categorized as a location, and “Anthony Martial”, “Juan Mata”, “Jose Mourinho”, and “Alexis Sanchez” are categorized as persons.
Named entity recognition is used in many applications, such as text classification, topic modeling, content recommendation, and trend detection.
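Off-the-shelf NER is available in several libraries; a minimal sketch with spaCy’s small English model (en_core_web_sm, assumed to be installed) looks like this. The entity labels shown in the comments are what the pretrained model typically predicts, and they can differ between model versions.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Jose Mourinho praised Juan Mata after Manchester United beat "
        "Newcastle United at Old Trafford.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typically: Jose Mourinho PERSON, Juan Mata PERSON, Manchester United ORG,
# Newcastle United ORG, Old Trafford FAC (labels are model-dependent)
```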

4. Discourse Integration

Discourse relations specify the relations between sentences or clauses; it is these relations that make two adjacent sentences read as coherent. The meaning of an individual sentence may depend on the sentences that precede it and may influence the meanings of the sentences that follow it. For example, the word “that” in the sentence “He needs that” depends upon the prior discourse context.

Discourse Integration is very important for information retrieval, text summarization, and information extraction.

What kind of structure the discourse has depends upon the segmentation applied to it.

5. Pragmatic Analysis

Pragmatic analysis is part of the process of extracting information from text. Specifically, it is the portion that focuses on figuring out the actual intended meaning of a structured set of text. It comes from the field of linguistics, where meaning is always considered in the context in which the text occurs.

It is important because much of a text’s meaning depends on the context in which it was said or written. Ambiguity, and limiting ambiguity, are at the core of natural language processing, so needless to say, pragmatic analysis is crucial for extracting meaning or information.

Conclusion

Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. Its applications have spread to various fields such as machine translation, email spam detection, information extraction, summarization, medicine, and question answering. This article walked through the five phases of NLP, from lexical analysis through parsing and semantic analysis to discourse integration and pragmatic analysis, which together form the foundation for designing such systems.
