TWITTER API V2 SEARCH WITH TWEEPY

How to use Tweepy to get data for a specific hashtag with Twitter API V2?

A Twitter API v2 Academic Research account is used for this project. Please check here for more information.

WHAT IS TWEEPY?

Tweepy is an open-source Python package that gives you a very convenient way to access the Twitter API with Python. It is described as ‘An easy-to-use Python library for accessing the Twitter API’ on the official website. You can check the documentation for more detail here.

WHY TWEEPY?

Tweepy is not a third-party application that scrapes data from Twitter; it is a Python library that accesses the real Twitter API, which requires a developer account and keys. You can easily find all the methods you need in the Tweepy documentation, and zero experience is not a problem for a Tweepy user: it is easy to learn and apply.

SET UP

The setup process for Tweepy is pretty easy. You only need one line of code to install Tweepy.

!pip install tweepy

Be sure you install the latest version to avoid errors in the next steps.

!pip install --upgrade tweepy
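
If you want to confirm which version is installed, a quick check:

import tweepy

# print the installed Tweepy version
print(tweepy.__version__)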

Tweepy is set up and ready to use. Now we will connect to the Twitter API through Tweepy. The Bearer Token from your developer account will be used in the next step.

CONNECTING TO THE TWITTER API

The Twitter API enables programmatic access to Twitter in unique and advanced ways, and it is quite easy to use it to collect data. The only thing you need is the Bearer Token from your developer account. A Bearer Token is a byte array of unspecified format that you generate using a script such as a curl command. You can also obtain a Bearer Token from the developer portal, inside the keys and tokens section of your App’s settings. We will be ready to collect data from Twitter after the following code:

Put your Bearer Token in place of 'BEARER TOKEN' in the following code.

import tweepy

bearer_token = 'BEARER TOKEN'
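
Hardcoding a token in a script is risky if you ever share the code. As an alternative, you can read it from an environment variable; here is a minimal sketch, where the variable name TWITTER_BEARER_TOKEN is just an example:

import os

# read the token from an environment variable instead of hardcoding it
# (TWITTER_BEARER_TOKEN is an arbitrary name chosen for this example)
bearer_token = os.environ['TWITTER_BEARER_TOKEN']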

tweepy.Client is Tweepy’s interface for the Twitter API v2. Details can be found here. We will initialize the client with the Bearer Token.

client = tweepy.Client(bearer_token, wait_on_rate_limit=True)

COLLECT DATA FROM TWITTER

Topic

I want to collect data about postpartum. We will search for tweets that include the word ‘postpartum’ or the hashtag ‘#postpartum’.

Time Period

I chose 2021. I want a homogeneous dataset, which means the same amount of data for each month.

Language

We will look for English tweets.

Type of Tweet

I want original tweets, so we will not search for retweets.

We will decide later which information will be kept.

Let’s work on the time period. Unfortunately, the Twitter API does not support an even distribution over a given time period, so we will do our best to build the most homogeneous dataset possible for the whole of 2021: we will collect 50 tweets from each day of the year. Since 50 tweets * 365 days = 18250, we expect to see 18250 rows in our dataset. The time format of the request process is %Y-%m-%dT%H:%M:%SZ, so first we should create a list of the days in 2021 in this format.

import datetime

date = []

# start date: January 1st, 2021
start_date = datetime.date(2021, 1, 1)

# end date: January 1st, 2022
# the end date (2022-01-01) will not be included in the data collection;
# the last collected day will be 2021-12-31
end_date = datetime.date(2022, 1, 1)

# step of one day
delta = datetime.timedelta(days=1)

# iterate over the range of dates and store each day in the request format
while start_date <= end_date:
    date.append(start_date.strftime("%Y-%m-%dT00:00:00Z"))
    start_date += delta
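
A quick sanity check on the list we just built: it should contain 366 entries, the 365 days of 2021 plus 2022-01-01, which only serves as the final end_time.

print(len(date))  # 366
print(date[0])    # 2021-01-01T00:00:00Z
print(date[-1])   # 2022-01-01T00:00:00Z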

Basically, the code structure is:

for response in tweepy.Paginator(client.search_all_tweets,
                                 query = ...,
                                 tweet_fields = ...,
                                 expansions = ...,
                                 start_time = ...,
                                 end_time = ...,
                                 max_results = ...,
                                 limit = 1):

  • query is used for the keywords.
  • tweet_fields and expansions are used to select the required columns.
  • start_time and end_time define the time period.
  • max_results is the maximum number of tweets collected in each request.
  • limit is the number of requests in each loop iteration.

Please check the Twitter documentation page for more details.
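
Before launching the year-long loop, it may be worth testing the query on a single day first. Here is a minimal sketch that reuses the client and the first two entries of our date list, with a single request:

# one test request for January 1st, 2021
for response in tweepy.Paginator(client.search_all_tweets,
                                 query = '(postpartum OR #postpartum) lang:en -is:retweet',
                                 start_time = date[0],
                                 end_time = date[1],
                                 max_results = 50, limit = 1):
    # response.data is None when no tweets matched the query
    print(len(response.data) if response.data else 0, 'tweets found')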

Note: This process will take some time.

import time

df_tweet = []
# iterate over the days
for i in range(len(date)-1): # the first day of 2022 is not included
    # request process
    for response in tweepy.Paginator(client.search_all_tweets,
                                     # tweets that include 'postpartum' or '#postpartum'
                                     query = '(postpartum OR #postpartum) lang:en -is:retweet',
                                     tweet_fields = ['entities', 'id', 'text', 'author_id', 'lang',
                                                     'created_at', 'public_metrics', 'referenced_tweets'],
                                     expansions = ['author_id', 'in_reply_to_user_id', 'referenced_tweets.id'],
                                     start_time = date[i],
                                     end_time = date[i+1], # (end_time) - (start_time) = 1 day
                                     max_results = 50, limit = 1): # 50 tweets per request, one request per day
        time.sleep(1)
        df_tweet.append(response)

So, we are searching for original English tweets that include ‘postpartum’ or ‘#postpartum’.

Finally, we have the data. We should save it in CSV format with column names that make sense. The following code maps the information we collected and stores everything in a CSV file.

import pandas as pd

result = []
# all the information is in df_tweet; this loop extracts it from each response
for response in df_tweet:
    if response.data is None: # skip days with no matching tweets
        continue
    for tweet in response.data:
        result.append({'author_id': tweet['author_id'],
                       'tweet_id': tweet['id'],
                       'text': tweet['text'],
                       'created_at': tweet['created_at'],
                       'lang': tweet['lang'],
                       'retweets': tweet['public_metrics']['retweet_count'],
                       'replies': tweet['public_metrics']['reply_count'],
                       'likes': tweet['public_metrics']['like_count'],
                       'quote_count': tweet['public_metrics']['quote_count']
                      })

df = pd.DataFrame(result)
# change 'YOUR_PATH' to your directory
df.to_csv('YOUR_PATH/twitter_data.csv')
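
As a quick check against the 18250-row expectation, you can print the shape of the dataframe; days with fewer than 50 matching tweets will yield fewer rows.

# number of rows and columns collected
print(df.shape)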

Let’s check the distribution of the tweets with the following code.

# extract the month from the date of each collected tweet
df_month = []
for i in range(len(df)):
    x = str(df.created_at[i]).split('-')
    df_month.append(x[1])

df_month = pd.DataFrame(df_month)
df_month.columns = ['month']

# frequency of each month
df_month['month'].value_counts()

# bar chart
df_month['month'].value_counts().plot(kind='bar');
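
Note that value_counts() orders the bars by frequency, not by month. If you prefer calendar order, sort the index first:

# bar chart with months in calendar order
df_month['month'].value_counts().sort_index().plot(kind='bar');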