Selen Arslan

DATA PREPROCESSING FOR BERTOPIC

Step-by-Step Guide to Cleaning and Preprocessing Data for BERT-based Topic Modeling

Data cleansing covers the steps that come before the analysis itself. These steps, and the order in which they are applied, are as important as the analysis, and in some cases even more important: a dataset that has not been properly cleaned and still contains noise is not safe to analyze, and the results will not be reliable. The cleansing process consists of a handful of steps that may seem to be the same for every kind of project, but they are not. You need to be very careful about the order of the steps to avoid losing useful data.

This project presents the preprocessing steps for BERTopic. See the BERTopic documentation for more information about BERTopic.

DATASET

Let’s import the dataset. You can download the dataset:
Dataset

import pandas as pd
df=pd.read_csv('YOUR_PATH/reddit_data.csv')

The dataset contains Reddit posts and comments. You can see how to scrape the data from Reddit in my blog post. We will analyze the posts.

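# keep only the posts and renumber the rows 0..n-1 so that positional loops work later
df_post=df[df['type']=='post'].reset_index(drop=True)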

TEXT TO SUBSENTENCES

If you are working with long Reddit texts and your goal is to analyze the emotions expressed in the text, dividing the texts into subsentences can be a very efficient approach. This is because emotions can change throughout a text, and by analyzing subsentences separately, you can more accurately capture the nuances of the emotions expressed.

Dividing the text into subsentences can also help you to identify patterns or themes in the emotional content of the text. This can allow you to gain a more detailed understanding of the emotions expressed in the text, which can be beneficial for tasks such as sentiment analysis or emotion recognition.

Additionally, by dividing the text into subsentences, you can also reduce the computational resources required to analyze the text. As the subsentences are shorter than the whole text, the model can process them more efficiently, which can be beneficial if you are working with a large dataset.

In summary, dividing long Reddit texts into subsentences can be an efficient approach for analyzing the emotions expressed in the text. It can help to capture the nuances of the emotions expressed, identify patterns or themes in the emotional content, and reduce the computational resources required to analyze the text.

We will use the Natural Language Toolkit (NLTK) library to tokenize the text in the “selftext” column of the DataFrame (df_post) into sentences. The sent_tokenize function from the NLTK library is used to split the text into a list of sentences, where each sentence is represented as a separate string.

For each text in the “selftext” column, we will tokenize it into sentences using the sent_tokenize function and append it to the list “sent_list”. This will give us a list of lists, where each inner list contains the sentences for one text.

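import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')  # sentence tokenizer models; newer NLTK releases may ask for 'punkt_tab'
sent_list=[]
# iterate over the values so the loop does not depend on the index labels;
# missing selftext values are treated as empty strings
for text in df_post.selftext.fillna(''):
    sent_list.append(sent_tokenize(text))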

We will create a new DataFrame (df_post_seperated) to store the tokenized sentences and their corresponding main text. The DataFrame will have three columns: “main_text_index”, “main_text” and “sentences”.

The “main_text_index” column will contain the index of the main text in the original DataFrame, “main_text” column will contain the original text, and “sentences” column will contain the tokenized sentences.

# build the columns from plain lists so pandas does not try to align on df_post's index
df_post_seperated=pd.DataFrame({
    "main_text_index": list(range(len(sent_list))),
    "main_text": df_post.selftext.tolist(),
    "sentences": sent_list,
})

In this block of code, we create a DataFrame called df_p_s with the columns “main_text_index”, “sub_sentence_index” and “sub_sentence”. We use a nested for loop to iterate over each list in the sentences column of the df_post_seperated DataFrame. For each sentence, we append the sentence, its position within the post, and the index of its parent text to the sub_sentence, sub_index, and main_index lists respectively. We then assign these lists to the corresponding columns of df_p_s, which now holds one row per sub-sentence together with the indices of the main text and the sub-sentence.

sub_sentence=[]
sub_index=[]
main_index=[]
for i in range(len(df_post_seperated)):
    row=df_post_seperated.sentences[i]
    for e in range(len(row)):
        sub_sentence.append(row[e])
        sub_index.append(e)
        main_index.append(i)
df_p_s=pd.DataFrame(columns = ["main_text_index", "sub_sentence_index","sub_sentence"])
df_p_s.main_text_index=[element for element in main_index]
df_p_s.sub_sentence_index=[element for element in sub_index]
df_p_s.sub_sentence=[element for element in sub_sentence]
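
The flattening step can be seen in isolation on a tiny hand-written example. The sketch below uses a made-up sent_list (not taken from the Reddit data) and plain Python lists, so it runs without pandas or NLTK; the variable names are only illustrative.

```python
# Hypothetical stand-in for sent_list: one inner list of sentences per post.
sent_list = [
    ["My hair changed a lot.", "Is that normal?"],
    ["Ten months postpartum."],
]

# One output row per sentence: (parent post index, position in post, sentence).
rows = [
    (main_idx, sub_idx, sentence)
    for main_idx, sentences in enumerate(sent_list)
    for sub_idx, sentence in enumerate(sentences)
]

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame with the three column names used above would give a table with the same shape as df_p_s.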

HASHTAGS AND MENTIONS

Hashtags need to be extracted before the data cleaning process, because cleaning removes the ‘#’ character; we will keep them for later analysis.

Before cleaning the punctuation, we should also remove the mentions, which start with ‘@’. We don’t want to keep the mentioned usernames as words, because they would be noise in the dataset.

from bs4 import BeautifulSoup 
import re 
!pip install autocorrect
from autocorrect import Speller 
# continue with the sub-sentence DataFrame built above, renaming its text column
df = df_p_s.rename(columns = {'sub_sentence':'text'})
# hashtags: list of the words that follow '#' in each text
df['hashtags'] = df.text.apply(lambda text: re.findall(r"#(\w+)", text))
# wh_mention: the text with @username mentions removed
df['wh_mention'] = df.text.apply(lambda text: re.sub(r"@[A-Za-z0-9_]+", "", text))
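
To see what the two regular expressions do, they can be tried on a single string. This is a small sketch on a made-up sentence (the username and hashtags are invented for illustration), using only the standard re module.

```python
import re

# Made-up example text, not from the dataset.
text = "Thanks @some_user for the tips! #postpartum #hairloss"

# Capture the word after each '#', without the '#' itself.
hashtags = re.findall(r"#(\w+)", text)

# Delete '@' followed by letters, digits or underscores.
without_mentions = re.sub(r"@[A-Za-z0-9_]+", "", text)

print(hashtags)
print(without_mentions)
```

Removing the mention leaves a double space behind; the whitespace step in the cleansing section collapses it later.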

DATA CLEANSING

def strip_html_tags(text):
    """ 
    This function will remove all the occurrences of html tags from the text.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of html tags.
        
    Example:
    Input : This is a nice place to live. <IMG>
    Output : This is a nice place to live.  
    """
    # Initiating BeautifulSoup object soup.
    soup = BeautifulSoup(text, "html.parser")
    # Get all the text other than html tags.
    stripped_text = soup.get_text(separator=" ")
    return stripped_text

def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.
        
    Example:
    Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
    Output : This is her first day at this place. Please, Be nice to her. 
    
    """
    
    # Replacing all the occurrences of \n,\\n,\t,\\ with a space.
    Formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t',' ').replace('\\', ' ').replace('. com', '.com')
    return Formatted_text

def remove_links(text):
    """
    This function will remove all the occurrences of links.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of all types of links.
        
    Example:
    Input : To know more about this website: kajalyadav.com  visit: https://kajalyadav.com//Blogs
    Output : To know more about this website: visit:      
    """  
    # Removing all the occurrences of links that starts with https
    remove_https = re.sub(r'http\S+', '', text)
    # Remove all the occurrences of text that ends with .com
    remove_com = re.sub(r"\ [A-Za-z]*\.com", " ", remove_https)
    return remove_com

def remove_whitespace(text):
    """ This function will remove 
        extra whitespaces from the text
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after extra whitespaces removed .
        
    Example:
    Input : How   are   you   doing   ?
    Output : How are you doing ?     
        
    """
    pattern = re.compile(r'\s+') 
    Without_whitespace = re.sub(pattern, ' ', text)
    # There are some instances where there is no space after '?' & ')', 
    # So I am replacing these with one space so that It will not consider two words as one token.
    text = Without_whitespace.replace('?', ' ? ').replace(')', ') ')
    return text

# Code for accented characters removal
def accented_characters_removal(text):
    # this is a docstring
    """
    The function will remove accented characters from the 
    text contained within the Dataset.
       
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" with removed accented characters.
        
    Example:
    Input : Málaga, àéêöhello
    Output : Malaga, aeeohello    
        
    """
    # Remove accented characters from text using unidecode.
    # Unidecode() - It takes unicode data & tries to represent it to ASCII characters. 
    text = unidecode(text)
    return text

# Code for text lowercasing
def lower_casing_text(text):
    
    """
    The function will convert text into lower case.
    
    arguments:
         input_text: "text" of type "String".
         
    return:
         value: text in lowercase
         
    Example:
    Input : The World is Full of Surprises!
    Output : the world is full of surprises!
    
    """
    # Convert text to lower case
    # lower() - It converts all uppercase letters of the given string to lowercase.
    text = text.lower()
    return text

# Code for removing repeated characters and punctuations

def reducing_incorrect_character_repeatation(text):
    """
    This function will reduce repetition to two characters 
    for letters and to one character for punctuation.
    
    arguments:
         input_text: "text" of type "String".
         
    return:
        value: Finally formatted text with letters repeating at most 
        twice & punctuation limited to one repetition 
        
    Example:
    Input : Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
    Output : Reallyy, Greeaatt !?.;:)
    
    """
    # Pattern matching for all case alphabets
    Pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)
    
    # Limiting all the repetition to two characters.
    Formatted_text = Pattern_alpha.sub(r"\1\1", text) 
    
    # Pattern matching for all the punctuations that can occur
    Pattern_Punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')
    
    # Limiting punctuations in previously formatted string to only one.
    Combined_Formatted = Pattern_Punct.sub(r'\1', Formatted_text)
    
    # The below statement replaces runs of two or more spaces with a single space.
    Final_Formatted = re.sub(' {2,}',' ', Combined_Formatted)
    return Final_Formatted

CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
}
# The code for expanding contraction words
def expand_contractions(text, contraction_mapping =  CONTRACTION_MAP):
    """expand shortened words to the actual form.
       e.g. don't to do not
    
       arguments:
            input_text: "text" of type "String".
         
       return:
            value: Text with expanded form of shortened words.
        
       Example: 
       Input : ain't, aren't, can't, 'cause, can't've
       Output : is not, are not, cannot, because, cannot have 
    
     """
    # Try longer keys first so that "can't've" wins over "can't".
    keys = sorted(contraction_mapping, key=len, reverse=True)

    # Match a key only as a whole word: it must not be glued to another
    # letter, digit or apostrophe on either side. Punctuation such as a
    # trailing comma is allowed, so "don't," is expanded as well.
    pattern = re.compile(
        r"(?<![\w'])(?:" + "|".join(re.escape(key) for key in keys) + r")(?![\w'])"
    )

    # Replace every matched contraction with its expanded form.
    String_Of_tokens = pattern.sub(lambda match: contraction_mapping[match.group(0)], text)
    return String_Of_tokens

# The code for removing special characters
def removing_special_characters(text):
    """Removing all the special characters except the one that is passed within 
       the regex to match, as they have imp meaning in the text provided.
   
    
    arguments:
         input_text: "text" of type "String".
         
    return:
        value: Text with removed special characters that don't require.
        
    Example: 
    Input : Hello, K-a-j-a-l. Thi*s is $100.05 : the payment that you will recieve! (Is this okay?) 
    Output :  Hello, K a j a l. Thi s is $100.05 : the payment that you will recieve! Is this okay? 
    
   """
    # Keep letters, digits and : $ , % . ? ! ; every other run of characters becomes one space.
    # (No hyphen between '$' and ',' here: inside [...] that would define a character range.)
    Formatted_Text = re.sub(r"[^a-zA-Z0-9:$,%.?!]+", ' ', text) 
    # In the above regex expression, I am providing the set of punctuation marks that are frequent in this particular dataset.
    return Formatted_Text

# The code for spelling corrections
def spelling_correction(text):
    ''' 
    This function will correct spellings.
    
    arguments:
         input_text: "text" of type "String".
         
    return:
        value: Text after corrected spellings.
        
    Example: 
    Input : This is Oberois from Dlhi who came heree to studdy.
    Output : This is Oberoi from Delhi who came here to study.
      
    
    '''
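    # Check for spellings in English language.
    # Build the Speller once and reuse it; creating it on every call is slow.
    if not hasattr(spelling_correction, "_speller"):
        spelling_correction._speller = Speller(lang='en')
    Corrected_text = spelling_correction._speller(text)
    return Corrected_text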
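
When adjusting the set of punctuation kept by removing_special_characters, note that a hyphen between two characters inside [...] denotes a range rather than three literal characters. The sketch below illustrates this on a made-up string; the sample text and variable names are only for illustration.

```python
import re

sample = "Thi*s (is) a-b $5"

# '$-,' is read as the range from '$' to ',', which also covers % & ' ( ) * +
as_range = re.sub(r"[^a-zA-Z0-9$-,]+", " ", sample)

# Escaping the hyphen keeps only '$', '-' and ',' as literal characters.
as_literal = re.sub(r"[^a-zA-Z0-9$\-,]+", " ", sample)

print(as_range)    # '*' and the parentheses survive, the hyphen does not
print(as_literal)  # the hyphen survives, '*' and the parentheses do not
```

Placing the hyphen last in the class, or escaping it as above, avoids the accidental range.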

Let’s create a DataFrame to store the cleaned version of the text column.

clean_df=pd.DataFrame(df.wh_mention) #original version
clean_df['clean_text']=clean_df.wh_mention.copy() #we will store cleaned version in here
!pip install unidecode
from unidecode import unidecode

Let’s apply the functions.

def clean_text(text):
    # remove all the occurrences of html tags
    text = strip_html_tags(text)
    # remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.
    text = remove_newlines_tabs(text)
    # remove all the occurrences of links (https)
    text = remove_links(text)
    # remove extra whitespaces (blanks)
    text = remove_whitespace(text)
    # remove accented characters
    text = accented_characters_removal(text)
    # convert text into lower case
    text = lower_casing_text(text)
    # reduce repetition to two characters for letters and to one character for punctuation (Ex: !!! to !)
    text = reducing_incorrect_character_repeatation(text)
    # Ex: 'isn't' to 'is not'
    text = expand_contractions(text)
    # remove all the special characters
    text = removing_special_characters(text)
    # remove numbers, then the remaining punctuation
    text = re.sub(r"[^a-zA-Z:$,%.?!]+", ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# apply the whole pipeline, in this order, to every row of the clean_text column
clean_df['clean_text'] = clean_df['clean_text'].apply(clean_text)

The spelling correction step will take some time.

#spelling correction 
for i in range(len(clean_df)):
    clean_df.iloc[i,1]=spelling_correction(clean_df.iloc[i,1])
    clean_df.iloc[i,1] = re.sub(r'[^\w\s]','',clean_df.iloc[i,1])
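
The order of the steps matters here. As a minimal sketch, assuming a tiny two-entry contraction map with lowercase keys (a hypothetical subset, not the full CONTRACTION_MAP), the example below shows that expanding contractions before lowercasing misses capitalized forms.

```python
# Tiny hypothetical subset of a contraction map; keys are lowercase.
contractions = {"don't": "do not", "isn't": "is not"}

def expand(text):
    # Replace each whitespace-separated token that is a known contraction.
    return " ".join(contractions.get(tok, tok) for tok in text.split())

text = "Don't worry, it isn't serious"

expand_first = expand(text).lower()   # expand, then lowercase
lower_first = expand(text.lower())    # lowercase, then expand

print(expand_first)
print(lower_first)
```

Only the second ordering expands the leading "Don't", which is why lowercasing comes before contraction expansion in the pipeline above.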

We have now removed the noise from our dataset, but there is still work to do. The next step is tokenization.

TOKENIZATION

Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. There are different tokenization techniques and different Python libraries for them; this project uses word_tokenize from the nltk library.

Punctuation has already been removed and all letters are lowercased, so we don’t need to know where a sentence starts or ends for topic analysis. We will treat the whole dataset as one large piece of information to see what it is about.

import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may also ask for 'punkt_tab'
from nltk.tokenize import word_tokenize

# tokenize each cleaned text into a list of word tokens
clean_df['token']=clean_df.clean_text.apply(word_tokenize)

PART OF SPEECH TAGGING

Universal POS tags are the part-of-speech tags used in the Universal Dependencies (UD) project, which defines a small set of plain, meaningful categories.

Here are the POS tags:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary
CCONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other

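See the Universal Dependencies documentation for more information. Note that NLTK’s tagset='universal' option, used below, returns a coarser 12-tag version of this scheme: ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRT, PRON, VERB, X and '.' for punctuation.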

import ast
nltk.download('averaged_perceptron_tagger')  # newer NLTK releases may ask for 'averaged_perceptron_tagger_eng'
nltk.download('universal_tagset')

# if the token column was reloaded from CSV and holds strings, parse it back into lists first:
# clean_df['token']=clean_df.token.apply(ast.literal_eval)
clean_df['pos_tag']=clean_df.token.apply(lambda tokens: nltk.pos_tag(tokens, tagset='universal'))

STOP WORDS

Stop words are commonly used words (such as “the”, “a”, “an”, “in”) that carry little meaning on their own; search engines, for example, are often programmed to ignore them both when indexing entries and when retrieving them as the result of a search query. NLTK provides lists of commonly used stop words in various languages, which can be useful for natural language processing tasks such as text classification and text mining. To use them in your Python code, import the nltk.corpus.stopwords module, which contains the lists for the different languages.

df=clean_df
import nltk
nltk.download('stopwords') 
from nltk.corpus import stopwords 
import ast

We are going to remove the stop words from the text in our dataframe df. The list of English stop words comes from stopwords.words('english') in the nltk library; we convert it to a set once so that the membership checks stay fast. Four new columns are created, each built with a list comprehension applied to every row:

token_stopword_removed: the tokens that remain after the stop words are removed from the token column.
extracted_stopword: the stop words that were removed, kept so that the result can be checked.
text_stopword_removed: the remaining tokens joined back into a text string with the join() method.
pos_stopword_removed: the original POS tags matched with the remaining words.

#stop words removing and matching removed version of the clean data with original postags
#build the stop word set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

#if the columns were reloaded from CSV as strings, parse them first:
#df['token']=df.token.apply(ast.literal_eval)
#df['pos_tag']=df.pos_tag.apply(ast.literal_eval)

#cleaning the stopwords
df['token_stopword_removed'] = df.token.apply(lambda tokens: [word for word in tokens if word not in stop_words])
#check the extracted stopwords 
df['extracted_stopword'] = df.token.apply(lambda tokens: [word for word in tokens if word in stop_words])
#creating text from the stopword-removed tokens
df['text_stopword_removed'] = df.token_stopword_removed.apply(" ".join)
#matching the original pos tag with the stopword removed word 
df['pos_stopword_removed'] = df.pos_tag.apply(lambda pairs: [(word, tag) for word, tag in pairs if word not in stop_words])
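
The filtering logic can be checked on a single row without NLTK. This is a minimal sketch that assumes a small hand-written stop word set (NLTK's English list is much longer) and made-up tokens and tags in the same shape as the token and pos_tag columns.

```python
# Small hand-written stop word set for illustration only.
stop_words = {"my", "is", "but", "so", "and", "very"}

tokens = ["my", "hair", "is", "still", "thick", "but", "so", "soft"]
pos_tags = [("my", "PRON"), ("hair", "NOUN"), ("is", "VERB"), ("still", "ADV"),
            ("thick", "ADJ"), ("but", "CONJ"), ("so", "ADV"), ("soft", "ADJ")]

# Tokens that survive, tokens that were dropped, and the surviving (word, tag) pairs.
kept = [w for w in tokens if w not in stop_words]
removed = [w for w in tokens if w in stop_words]
kept_tags = [(w, t) for w, t in pos_tags if w not in stop_words]
text = " ".join(kept)

print(kept, removed, text, sep="\n")
```

Using a set rather than a list keeps each membership check fast, which adds up over many rows.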

TAG ANALYSIS

The cleanest version of the text is the one that has been preprocessed and has had its mentions and stop words removed; it is stored in the ‘text_stopword_removed’ column. The POS tag information for this version is stored in the ‘pos_stopword_removed’ column. We will count the tags, order the tag counts, and store the verbs with their POS tags for future analysis. You can modify this part as you wish.

from collections import Counter

This block of code counts the POS tags in the ‘pos_stopword_removed’ column and stores the count in ‘tag_count’ column. It also orders the tags by count in descending order and stores the result in ‘tag_order’ column. Additionally, it extracts all verbs from the ‘pos_stopword_removed’ column and stores them in ‘verb_tag’ column along with their count, and verb’s POS tag information in ‘verb_pos_tag’ column for future analysis. The goal is to provide a summary of the POS tags and verbs in the cleaned text for further analysis.

#count the tags
df['tag_count']=df.pos_stopword_removed.apply(lambda pairs: Counter(tag for _, tag in pairs))

#order of tag_count (most frequent tag first)
df['tag_order']=df.tag_count.apply(lambda counts: sorted(counts.items(), key=lambda item: item[1], reverse=True))

#verb tags: the (tag, count) entry for VERB, or an empty list if the row has no verbs
df['verb_tag']=df.tag_order.apply(lambda order: [(tag, count) for tag, count in order if tag=='VERB'])

#verbs with their pos tags
df['verb_pos_tag']=df.pos_stopword_removed.apply(lambda pairs: [(word, tag) for word, tag in pairs if tag=='VERB'])
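
The counting and ordering can be tried on one row using only the standard library. The sketch below uses hand-written (word, tag) pairs in the shape of the pos_stopword_removed column.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape of pos_stopword_removed.
pairs = [("ten", "ADJ"), ("months", "NOUN"), ("postpartum", "NOUN"),
         ("want", "VERB"), ("know", "VERB"), ("alone", "ADV")]

# Count how often each tag occurs, then list the tags from most to least frequent.
tag_count = Counter(tag for _, tag in pairs)
tag_order = tag_count.most_common()

# Keep only the words tagged as verbs.
verbs = [(word, tag) for word, tag in pairs if tag == "VERB"]

print(tag_order)
print(verbs)
```

Counter.most_common() gives the same ordering as sorting the items by count in descending order; tags with equal counts stay in the order in which they were first seen.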

LEMMATIZATION

Lemmatization is the process of reducing a word to its base or root form. It is similar to stemming, but it takes into account the word’s part-of-speech (POS) tag. This means that it uses the context of the word to determine its base form, rather than just removing common morphological affixes. For example, the lemma of the verb “running” would be “run”, and the lemma of the noun “dogs” would be “dog”.

The lemmatization process is done by comparing the word and its POS tag to a dictionary of base forms, such as WordNet. The lemmatizer then replaces the word with its corresponding base form, if it’s found in the dictionary. This process can be useful for text processing tasks such as text classification, information retrieval, and machine learning, because it reduces the dimensionality of the data and makes it easier to compare different words.

It is common practice to perform POS tagging before lemmatization, because the lemmatizer needs the POS tag to determine the base form of a word. The tag tells the lemmatizer whether the word is a noun, verb, adjective or adverb, and this information is used to select the right lemma from the dictionary, which can differ depending on the tag. For example, “running” as a verb has the base form “run”, while “running” as a noun keeps the base form “running”.
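
The role of the POS tag can be illustrated with a toy lookup table. This is only a sketch of the idea: TOY_LEMMAS and toy_lemmatize are hypothetical names, and a real lemmatizer such as NLTK's WordNetLemmatizer consults WordNet rather than a hand-written dictionary.

```python
# Toy lookup table keyed by (word, part of speech). Purely illustrative.
TOY_LEMMAS = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("dogs", "NOUN"): "dog",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the pair is not in the table.
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("running", "VERB"))
print(toy_lemmatize("running", "NOUN"))
print(toy_lemmatize("dogs", "NOUN"))
```

The same surface form maps to different lemmas depending on the tag, which is why the tags are computed first.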

clean_df=pd.DataFrame(df.text_stopword_removed.copy())
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
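def pos_tagger(nltk_tag):
    # The tags stored earlier are universal tags (ADJ, VERB, NOUN, ADV, ...),
    # so match them by name; prefix checks such as 'J' or 'R' only fit Penn Treebank tags.
    universal_to_wordnet = {
        'ADJ': wordnet.ADJ,
        'VERB': wordnet.VERB,
        'NOUN': wordnet.NOUN,
        'ADV': wordnet.ADV,
    }
    # Tags without a WordNet counterpart (NUM, PRON, ...) map to None.
    return universal_to_wordnet.get(nltk_tag)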

clean_df['tag_original']=df.pos_stopword_removed.copy()
# if tag_original was reloaded from CSV as strings, parse it first with ast.literal_eval
# map each (word, tag) pair to (word, WordNet POS constant or None)
clean_df['tag_v2']=clean_df.tag_original.apply(lambda pairs: [(word, pos_tagger(tag)) for word, tag in pairs])

def lemmatize_pairs(pairs):
    # lemmatize the words that have a WordNet POS; keep the others unchanged
    return [(lemmatizer.lemmatize(word, tag), tag) if tag is not None else (word, tag)
            for word, tag in pairs]

# (lemmatized word, tag) pairs for every row
clean_df['lemmatized_pair_v1']=clean_df.tag_v2.apply(lemmatize_pairs)
# the lemmatized words joined back into one string per row
clean_df['lemmatized_sentence']=clean_df.lemmatized_pair_v1.apply(lambda pairs: " ".join(word for word, _ in pairs))

def reverse_pos_tagger(nltk_tag):
	if nltk_tag==wordnet.ADJ:
		return 'ADJ'
	elif nltk_tag==wordnet.VERB:
		return 'VERB'
	elif nltk_tag==wordnet.NOUN:
		return 'NOUN'
	elif nltk_tag==wordnet.ADV:
		return 'ADV'
	else:		
		return None
# translate the WordNet POS constants back to universal tag names
clean_df['lemmatized_pair_v2']=clean_df.lemmatized_pair_v1.apply(lambda pairs: [(word, reverse_pos_tagger(tag)) for word, tag in pairs])


#matching: where the WordNet mapping produced no tag, fall back to the original universal tag

def restore_missing_tags(lemmatized_pairs, original_pairs):
    return [(word, tag if tag is not None else original_tag)
            for (word, tag), (_, original_tag) in zip(lemmatized_pairs, original_pairs)]

clean_df['lemmatized_pair_v3']=[
    restore_missing_tags(lemmatized, original)
    for lemmatized, original in zip(clean_df.lemmatized_pair_v2, clean_df.tag_original)
]

df['text_lemmatized']=clean_df.lemmatized_sentence.copy()
df['lemmatized_postag']=clean_df.lemmatized_pair_v3.copy()
#if lemmatized_postag was reloaded from CSV as strings, parse it first with ast.literal_eval
#keep only the lemmatized words tagged as verbs
df['lemmatized_verb_pos_tag']=df.lemmatized_postag.apply(lambda pairs: [(word, tag) for word, tag in pairs if tag=='VERB'])

Here is the final dataset:

	wh_mention	clean_text	token	pos_tag	token_stopword_removed	text_stopword_removed	extracted_stopword	pos_stopword_removed	tag_count	tag_order	verb_tag	verb_pos_tag	text_lemmatized	lemmatized_postag	lemmatized_verb_pos_tag
0	All my homies who have experienced hair change…	all my homes who have experienced hair changes…	[all, my, homes, who, have, experienced, hair,…	[(all, DET), (my, PRON), (homes, NOUN), (who, …	[homes, experienced, hair, changes, postpartum…	homes experienced hair changes postpartum plea…	[all, my, who, have, your]	[(homes, NOUN), (experienced, VERB), (hair, NO…	{‘NOUN’: 5, ‘VERB’: 2, ‘ADJ’: 1}	[(NOUN, 5), (VERB, 2), (ADJ, 1)]	[(VERB, 2)]	[(experienced, VERB), (postpartum, VERB)]	home experience hair change postpartum please …	[(home, NOUN), (experience, VERB), (hair, NOUN…	[(experience, VERB), (postpartum, VERB)]
1	I’m getting to the point that I’m worried that…	i am getting to the point that i am worried th…	[i, am, getting, to, the, point, that, i, am, …	[(i, NOUN), (am, VERB), (getting, VERB), (to, …	[getting, point, worried, limp, sad, hair, due…	getting point worried limp sad hair due hormon…	[i, am, to, the, that, i, am, that, my, is, to…	[(getting, VERB), (point, NOUN), (worried, ADJ…	{‘VERB’: 1, ‘NOUN’: 4, ‘ADJ’: 5, ‘ADV’: 1}	[(ADJ, 5), (NOUN, 4), (VERB, 1), (ADV, 1)]	[(VERB, 1)]	[(getting, VERB)]	get point worried limp sad hair due hormonal c…	[(get, VERB), (point, NOUN), (worried, ADJ), (…	[(get, VERB)]
2	My hair is still thick but so soft and very li…	my hair is still thick but so soft and very li…	[my, hair, is, still, thick, but, so, soft, an…	[(my, PRON), (hair, NOUN), (is, VERB), (still,…	[hair, still, thick, soft, limp, nearly, curly…	hair still thick soft limp nearly curly handfu…	[my, is, but, so, and, very, not, as, as, it, …	[(hair, NOUN), (still, ADV), (thick, ADJ), (so…	{‘NOUN’: 3, ‘ADV’: 5, ‘ADJ’: 2}	[(ADV, 5), (NOUN, 3), (ADJ, 2)]	[]	[]	hair still thick soft limp nearly curly handfu…	[(hair, NOUN), (still, ADV), (thick, ADJ), (so…	[]
3	Ten months postpartum, and I just want to know…	ten months postpartum and i just want to know …	[ten, months, postpartum, and, i, just, want, …	[(ten, ADJ), (months, NOUN), (postpartum, NOUN…	[ten, months, postpartum, want, know, alone]	ten months postpartum want know alone	[and, i, just, to, that, i, am, not]	[(ten, ADJ), (months, NOUN), (postpartum, NOUN…	{‘ADJ’: 1, ‘NOUN’: 2, ‘VERB’: 2, ‘ADV’: 1}	[(NOUN, 2), (VERB, 2), (ADJ, 1), (ADV, 1)]	[(VERB, 2)]	[(want, VERB), (know, VERB)]	ten month postpartum want know alone	[(ten, ADJ), (month, NOUN), (postpartum, NOUN)…	[(want, VERB), (know, VERB)]
4	Hi fellow fit pregnant friends!	hi fellow fit pregnant friends	[hi, fellow, fit, pregnant, friends]	[(hi, ADV), (fellow, ADJ), (fit, NOUN), (pregn…	[hi, fellow, fit, pregnant, friends]	hi fellow fit pregnant friends	[]	[(hi, ADV), (fellow, ADJ), (fit, NOUN), (pregn…	{‘ADV’: 1, ‘ADJ’: 2, ‘NOUN’: 2}	[(ADJ, 2), (NOUN, 2), (ADV, 1)]	[]	[]	hi fellow fit pregnant friend	[(hi, ADV), (fellow, ADJ), (fit, NOUN), (pregn…	[]
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…

Cleaned dataset

REFERENCES

https://towardsdatascience.com/cleaning-preprocessing-text-data-by-building-nlp-pipeline-853148add68a