Thirumala Reddy
December 19, 2022
Natural Language Processing
Stemming:-
Stemming is the process of reducing inflected words to their word stem. Stemming executes faster than lemmatization, but the resulting stem is not always a meaningful word.
Ex:- 1) The words history and historical are reduced to a stem such as histori or histor (the exact form depends on the stemmer), which is not a dictionary word.
2) The words finally, final, and finalized are reduced to final.
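The suffix-stripping idea behind stemming can be sketched in a few lines of Python. This is a deliberately naive toy, not the Porter algorithm or any library's implementation; the suffix list `SUFFIXES` and the function name `toy_stem` are assumptions made purely for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only, not the Porter algorithm).
# The suffix list is a hand-picked assumption for this example.
SUFFIXES = ("ized", "ical", "ly", "y")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["finally", "final", "finalized", "history", "historical"]:
    print(w, "->", toy_stem(w))
```

With these rules, finally, final, and finalized all collapse to final, while history and historical collapse to histor, which shows how a stem can stop being a real word.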
Applications of stemming:-
Stemming is used in sentiment classifiers, Gmail spam classifiers, etc.
Lemmatization:-
Lemmatization performs the same kind of reduction as stemming, but the output word is a meaningful word. Lemmatization takes more execution time than stemming.
Ex:- The words history and historical are mapped to history, and the words finally, final, and finalized are mapped to final. In practice the exact result depends on the lemmatizer and on the part of speech supplied for each word.
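One way to picture lemmatization is as a dictionary lookup rather than suffix chopping. The sketch below uses a tiny hand-written table (`LEMMA_TABLE`, a hypothetical stand-in for a real lexical database) only to show why the output is always a real word; real lemmatizers rely on far larger vocabularies and on part-of-speech information.

```python
# Toy dictionary-based lemmatizer (illustrative only).
# LEMMA_TABLE is a hypothetical, hand-written stand-in for a real lexicon.
LEMMA_TABLE = {
    "boys": "boy",
    "girls": "girl",
    "is": "be",
    "are": "be",
    "better": "good",
}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["Boys", "girls", "are", "better", "history"]:
    print(w, "->", toy_lemmatize(w))
```

The lookup is also a reasonable intuition for why lemmatization tends to be slower than stemming: it has to consult a vocabulary instead of applying a handful of string rules.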
Applications of Lemmatization:-
Lemmatization is used in chatbots, text summarization, language translation, etc.
-->The stopwords list in nltk is used to remove unwanted words. Examples of stop words are is, this, the, are …
-->sent_tokenize is a function in nltk that takes a paragraph and splits it into separate sentences, using NLTK's pre-trained Punkt sentence tokenizer by default.
-->word_tokenize then converts each of those sentences into words.
Click here to see the implementation of Stemming
Click here to see the implementation of Lemmatization
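The preprocessing steps described above (sentence splitting, word splitting, stop-word removal) can be sketched with the Python standard library alone. This is a simplified stand-in for what `sent_tokenize`, `word_tokenize`, and the nltk stop-word list do, not their actual implementation; the regular expressions, the helper names, and the small `STOP_WORDS` set are assumptions chosen for this example.

```python
import re

# Small hand-picked stop-word set (an assumption; nltk ships a much longer list).
STOP_WORDS = {"he", "she", "is", "a", "and", "are", "this", "the"}

def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Naive word splitter: lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(words):
    """Drop every word that appears in STOP_WORDS."""
    return [w for w in words if w not in STOP_WORDS]

paragraph = "He is a good boy. She is a good girl. Boy and girl are good."
for sentence in split_sentences(paragraph):
    print(remove_stop_words(split_words(sentence)))
```

A naive splitter like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer in real projects.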
Techniques to convert words into vectors:-
1)Based on the count or frequency of words
a)Bag of Words
b)TF-IDF
c)One-Hot Encoding
2)Deep learning trained models
a)Word2Vec
i)Continuous Bag of Words (CBOW)
ii)Skip-gram
BAG OF WORDS:-
Let's understand the concept of a bag of words by taking an example. Let's assume that I have 3 statements
sent 1:- He is a good boy.
sent 2:- She is a good girl.
sent 3:- Boy and girl are good.
-->The first step is to lowercase the sentences and remove the stop words. The resulting sentences are
sent 1:- good boy
sent 2:- good girl
sent 3:- boy girl good
Let's construct a bag of words
WORDS | FREQUENCY
good | 3
boy | 2
girl | 2

BOW--->
SENTENCE | good | boy | girl
sent 1 | 1 | 1 | 0
sent 2 | 1 | 0 | 1
sent 3 | 1 | 1 | 1
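The bag-of-words construction for these three sentences can be reproduced with a short Python sketch. It assumes the sentences have already been lowercased and stripped of stop words, and the variable names (`tokenized`, `vocabulary`, `bow`) are illustrative choices, not part of any library.

```python
from collections import Counter

# Sentences after lowercasing and stop-word removal (from the example above).
sentences = ["good boy", "good girl", "boy girl good"]
tokenized = [s.split() for s in sentences]

# Word frequencies over all sentences, most frequent first.
frequency = Counter(word for tokens in tokenized for word in tokens)
vocabulary = [word for word, _ in frequency.most_common()]

# One row per sentence, one column per vocabulary word.
bow = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(dict(frequency))
print(vocabulary)
for name, row in zip(["sent 1", "sent 2", "sent 3"], bow):
    print(name, row)
```

Each row is the vector for one sentence, so every sentence ends up with the same length (the vocabulary size) regardless of how many words it contains.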
TF-IDF:-
TF of a word in a sentence = (number of times the word occurs in the sentence) / (number of words in the sentence). IDF of a word = log(number of sentences / number of sentences containing the word).
WORDS | FREQUENCY
good | 3
boy | 2
girl | 2

TF
WORDS | sent 1 | sent 2 | sent 3
good | 1/2 | 1/2 | 1/3
boy | 1/2 | 0 | 1/3
girl | 0 | 1/2 | 1/3

IDF
WORDS | IDF
good | log(3/3)=0
boy | log(3/2)
girl | log(3/2)

TF-IDF = TF * IDF
SENTENCE | good | boy | girl
sent 1 | 0 | (1/2)*log(3/2) | 0
sent 2 | 0 | 0 | (1/2)*log(3/2)
sent 3 | 0 | (1/3)*log(3/2) | (1/3)*log(3/2)
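The TF-IDF values for the three example sentences can be checked numerically with a small sketch. It assumes the plain definitions used in the tables (TF as the within-sentence proportion, IDF as the logarithm of total sentences over sentences containing the word) and assumes the natural logarithm, since the tables do not state a base. The helper names `tf` and `idf` are illustrative.

```python
import math

# Sentences after lowercasing and stop-word removal (from the example above).
sentences = ["good boy", "good girl", "boy girl good"]
tokenized = [s.split() for s in sentences]
vocabulary = ["good", "boy", "girl"]
n_sentences = len(tokenized)

def tf(word, tokens):
    """Term frequency: share of the sentence's words that are `word`."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """Inverse document frequency, natural log (the base is an assumption)."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_sentences / containing)

# One row per sentence, one column per vocabulary word.
tfidf = [[tf(w, tokens) * idf(w) for w in vocabulary] for tokens in tokenized]

for name, row in zip(["sent 1", "sent 2", "sent 3"], tfidf):
    print(name, [round(v, 3) for v in row])
```

The word good scores 0 everywhere because it appears in every sentence, which is exactly the effect TF-IDF is designed to produce. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their numbers will differ from this hand calculation.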
Refer to this to know how to apply TF-IDF