Selen Arslan

SCRAPING REDDIT DATA WITH THE PMAW LIBRARY FOR A SPECIFIC KEYWORD

Collecting and Cleaning Reddit Data for Postpartum-related Submissions and Comments in 2021

In this script, we will use the pmaw library to scrape Reddit data for a specific keyword. The data will be collected for the keyword “postpartum” within the year 2021. PMAW (Pushshift Multithread API Wrapper) is a Python wrapper for the Pushshift.io API. It gives easy access to the data the API provides, along with functions for searching and filtering that data.

Before we begin, it’s important to understand the structure of Reddit. Reddit is an online platform that allows users to submit, discuss, and share content. The content is organized into communities called “subreddits”. Each subreddit focuses on a specific topic, and users can submit posts or comments related to that topic. A submission is a post that a user submits to a subreddit. A comment is a reply to a submission or to another comment. In this script, we will scrape both the submission (post) data and the comment data from Reddit.

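Let’s begin by installing the pmaw library and importing the libraries we need: pandas, NumPy, datetime, and the PushshiftAPI class from pmaw. The “!pip install” line uses notebook syntax, so it runs as written only in an environment such as Jupyter or Colab.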

# Install first so that the pmaw import below succeeds (notebook syntax)
!pip install pmaw pandas
import datetime as dt
import pandas as pd
import numpy as np
from pmaw import PushshiftAPI
import matplotlib.pyplot as plt
import seaborn as sns
import ast

SCRAPING POSTS

We then use the PushshiftAPI to search for submissions that contain the keyword “postpartum” within the year 2021. The search_submissions() method runs the search. The ‘q’ parameter specifies the keyword, the ‘after’ and ‘before’ parameters specify the date range as Unix timestamps, and the ‘limit’ parameter sets the maximum number of submissions to return. The results are stored in a dataframe called “df_post”.

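api = PushshiftAPI()
# Bounds are Unix timestamps; naive datetimes are read in the local timezone
start_epoch = int(dt.datetime(2021,1,1,0,0).timestamp())
# End at the start of 2022 so that posts from December 31 are included
end_epoch = int(dt.datetime(2022,1,1,0,0).timestamp())
# limit caps the number of submissions returned
submissions = api.search_submissions(q='postpartum', after=start_epoch, before=end_epoch, limit=30000)
df_post = pd.DataFrame(submissions)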
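A naive datetime is interpreted in the machine's local timezone when .timestamp() is called, so the search window can shift by a few hours depending on where the script runs. Below is a minimal sketch of timezone-independent bounds, assuming UTC is the intended reference and that the upper bound should fall at the start of the following year. The helper name year_bounds_utc is hypothetical and not part of pmaw.

```python
import datetime as dt


def year_bounds_utc(year):
    """Return (start_epoch, end_epoch) spanning one calendar year in UTC.

    Hypothetical helper for illustration; not part of pmaw.
    """
    start = dt.datetime(year, 1, 1, tzinfo=dt.timezone.utc)
    # Start of the next year, so the last day of `year` falls inside the window
    end = dt.datetime(year + 1, 1, 1, tzinfo=dt.timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


start_epoch, end_epoch = year_bounds_utc(2021)
```

The two values can be passed as the ‘after’ and ‘before’ arguments in place of the ones computed from naive datetimes.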

Now we will start cleaning the data. We first replace empty selftext values with NaN, and then drop the rows whose selftext is NaN.

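# Mark empty post bodies as missing; assign back so df_post is actually updated
df_post['selftext'] = df_post['selftext'].replace('', np.nan)
# Keep only rows that have a body, and rebuild the index without keeping the old one
df_post = df_post[df_post['selftext'].notna()].reset_index(drop=True)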
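The same step can be tried on a small hand-made dataframe before running it on the scraped data. This is a sketch on toy data, not on real Pushshift output. Treating whitespace-only bodies as empty is an added assumption that goes slightly beyond the step described above.

```python
import pandas as pd

# Toy stand-in for df_post: one real body, one empty, one missing, one blank
toy = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "selftext": ["first post", "", None, "   "],
})

# A body counts as present only if something is left after stripping whitespace
# (assumption: blank bodies are as uninformative as empty ones)
has_body = toy["selftext"].fillna("").str.strip() != ""
cleaned = toy[has_body].reset_index(drop=True)
```

Only the row with id “a1” survives, which is a quick way to confirm that the filter removes what we expect.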

We also drop any rows with a duplicated selftext and keep the first one. We do this because some posts are shared in several subreddits, which results in duplicate entries in our dataframe. Keeping only one instance of each duplicated post means every post body is counted once in later analysis.

df_post = df_post.drop_duplicates(subset=["selftext"], keep='first')

We also drop any rows whose selftext is “[deleted]” or “[removed]”, because the original text of those posts is no longer available.

df_post.drop(df_post[df_post['selftext']=='[deleted]'].index, inplace = True)
df_post.drop(df_post[df_post['selftext']=='[removed]'].index, inplace = True)
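The de-duplication and placeholder filtering can also be written as a single boolean filter. The sketch below uses toy data. The PLACEHOLDERS list is an illustrative name, and the isin() filter is an alternative to the two drop() calls above rather than the approach used in this script.

```python
import pandas as pd

# Toy stand-in for df_post after the empty-selftext step
toy = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5"],
    "subreddit": ["A", "B", "A", "C", "B"],
    "selftext": ["same text", "same text", "[deleted]", "[removed]", "unique text"],
})

# Bodies that carry no usable text (illustrative name)
PLACEHOLDERS = ["[deleted]", "[removed]"]

# Keep the first copy of each body, then filter out the placeholder bodies
deduped = toy.drop_duplicates(subset=["selftext"], keep="first")
kept = deduped[~deduped["selftext"].isin(PLACEHOLDERS)].reset_index(drop=True)
```

Here “p2” is dropped as a cross-posted duplicate of “p1”, and “p3” and “p4” are dropped as placeholders.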

Finally, we save the cleaned dataframe to a CSV file.

df_post.to_csv('YOUR PATH/df_post_original.csv')
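One detail worth knowing when saving: to_csv() writes the dataframe index as an extra unnamed column by default, and that column comes back when the file is reloaded. Below is a small sketch of a round trip without the index. It uses an in-memory buffer instead of a real path, so nothing is written to disk.

```python
import io

import pandas as pd

toy = pd.DataFrame({"id": ["p1", "p2"], "selftext": ["text one", "text two"]})

# Write without the index so the reloaded file has exactly the original columns
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read it back the same way a saved file would be read
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```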

With this script, we have successfully scraped Reddit data for the keyword “postpartum” within the year 2021 and cleaned the data to obtain a useful dataset for further analysis.

SCRAPING COMMENTS

This code block scrapes Reddit comments that contain the keyword “postpartum” within the year 2021. We use the same PushshiftAPI object that we used to scrape the submission data. The api.search_comments() method searches for comments; as before, the ‘q’ parameter specifies the keyword, the ‘after’ and ‘before’ parameters specify the date range, and the ‘limit’ parameter sets the maximum number of comments to return.

Once we have the comments, we create a dataframe from them, store it in the variable “comment_df”, and save it to a CSV file. The steps mirror the first code block, except that we scrape comments instead of posts.

comment = api.search_comments(q='postpartum', after=start_epoch, before=end_epoch,limit=70000)
comment_df=pd.DataFrame(comment)
comment_df.to_csv('YOUR PATH/df_comment_original.csv')

MATCHING

This code block matches the comments with their related posts. To do so, we first need a common identifier that we can use to join the two dataframes. The ‘parent_id’ column in the comment dataframe contains the id of the item that the comment replies to. For a comment made directly on a post, this is the post id with the prefix “t3_”, which we need to remove. For a reply to another comment, the prefix is “t1_” and the id belongs to that comment, so such replies will not match any post id in the join below. We use the .str[] accessor to remove the three-character prefix and store the cleaned parent id in a new column called parent_id_clean.

comment_df['parent_id_clean'] = comment_df['parent_id'].str[3:]
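To see what the prefix removal does to both kinds of parent id, here is a sketch on toy data. The column parent_is_post is a hypothetical addition for illustration and is not used elsewhere in this script.

```python
import pandas as pd

# Toy stand-in for comment_df: two direct replies to posts, one reply to a comment
toy_comments = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "parent_id": ["t3_post1", "t1_c1", "t3_post2"],
})

# "t3_" marks a post as the parent, "t1_" marks another comment
toy_comments["parent_is_post"] = toy_comments["parent_id"].str[:3] == "t3_"

# Dropping the first three characters removes either prefix
toy_comments["parent_id_clean"] = toy_comments["parent_id"].str[3:]
```

The cleaned id of “c2” is “c1”, a comment id, so it can never equal a post id. If the goal is every comment in a thread rather than only direct replies, the comment's link_id field, where present, would be the column to strip instead; that is an assumption about the returned fields, not something this script relies on.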

We make copies of the post dataframe (“df_post”) and the comment dataframe (“comment_df”) and store them in “post_df1” and “comment_df1” respectively, so that the original dataframes are not modified. We create a dataframe from the parent_id_clean column of the comment dataframe and rename that column to “id” using the .rename() method. We also create a dataframe from the id column of the post dataframe. Then we use the pd.merge() function to inner-join these two dataframes on the ‘id’ column, which leaves only the ids that appear both as a post and as the parent of a comment. We then drop the duplicate rows, reset the index of the resulting dataframe, and store it in the variable “merged_id”.

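# Work on copies so the original dataframes stay untouched
post_df1 = df_post.copy()
comment_df1 = comment_df.copy()
# Parent ids referenced by comments, renamed to match the post id column
parent_id = pd.DataFrame(comment_df1['parent_id_clean'])
parent_id.rename(columns = {'parent_id_clean':'id'}, inplace = True)
post_id = pd.DataFrame(post_df1['id'])
# Ids present in both dataframes, one row per id
merged_id = pd.merge(parent_id, post_id, how='inner', on=['id'])
merged_id = merged_id.id.drop_duplicates().reset_index(drop=True)
merged_id = pd.DataFrame(merged_id)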

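Then we create the final post and comment datasets by keeping only the rows whose ids appear in “merged_id”, label each row with its type, and save both to CSV files.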

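# Posts that received at least one matching comment
merged_post = pd.merge(post_df1, merged_id, how='inner', on=['id'])
# Same ids under the comment-side column name; merging on this single column
# avoids a clash with the comments' own 'id' column
parent_keys = merged_id[['id']].rename(columns={'id': 'parent_id_clean'})
merged_comment = pd.merge(comment_df1, parent_keys, how='inner', on=['parent_id_clean'])
merged_comment['type'] = 'comment'
merged_post['type'] = 'post'
merged_comment.to_csv('YOUR PATH/matched_comment.csv')
merged_post.to_csv('YOUR PATH/matched_post.csv')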
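The matching logic can be checked on toy data before running it on the full dataframes. The sketch below is an alternative formulation that filters with isin() on the set of shared ids instead of merging. It is meant as a cross-check under the same matching rule, not as the method used above, and all names in it are illustrative.

```python
import pandas as pd

toy_posts = pd.DataFrame({
    "id": ["post1", "post2", "post3"],
    "selftext": ["a", "b", "c"],
})
toy_comments = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    # c3 replies to a comment and c4 to a post we never scraped
    "parent_id_clean": ["post1", "post1", "c1", "post9"],
    "body": ["w", "x", "y", "z"],
})

# Ids that occur both as a post and as the parent of some comment
shared_ids = set(toy_posts["id"]) & set(toy_comments["parent_id_clean"])

matched_post = toy_posts[toy_posts["id"].isin(shared_ids)].assign(type="post")
matched_comment = toy_comments[
    toy_comments["parent_id_clean"].isin(shared_ids)
].assign(type="comment")
```

Only “post1” and its two direct comments remain; posts without matching comments and comments without a scraped parent post are both dropped, which is the behaviour of the inner joins as well.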

In conclusion, this script uses the pmaw library to scrape submission data from Reddit for the keyword “postpartum” within a specific date range. The data is stored in a dataframe and cleaned to remove duplicates and rows with no usable text. The script also scrapes the comment data and matches it with the related post data to create two final datasets, one for posts and one for comments. These datasets can then be used for further analysis or modeling. The same steps can be reused to collect and clean Reddit data for other keywords and date ranges.