GetOldTweets3

How to bypass the limitations of the official Twitter API when scraping Twitter data.

Apr 13, 2020

GetOldTweets3 is a free Python 3 library that lets you scrape data from Twitter without requiring any API keys. It also lets you scrape historical tweets that are more than a week old, which you cannot do with the standard Twitter API.

Using GetOldTweets3, you can scrape tweets using a variety of search parameters such as start/end dates, username(s), a text query, and a reference location. Additionally, you can specify which tweet attributes you would like to include. Some of the available attributes are username, tweet text, date, retweets, and hashtags.

Let’s run through two examples that illustrate different ways to use the GetOldTweets3 library to extract tweets.

Before you start, you will need to install GetOldTweets3.

!pip install GetOldTweets3

Example 1

Top 100 Recent Tweets from Specified News Sources

# Importing GetOldTweets3
import GetOldTweets3 as got
# Importing pandas
import pandas as pd


def get_tweets(username, top_only, start_date, end_date, max_tweets):

    # specifying tweet search criteria
    # (username can be a single name or a list of names)
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                          .setTopTweets(top_only)\
                          .setSince(start_date)\
                          .setUntil(end_date)\
                          .setMaxTweets(max_tweets)

    # scraping tweets based on criteria
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # creating list of tweets with the tweet attributes
    # specified in the list comprehension
    text_tweets = [[tw.username,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags] for tw in tweets]

    # creating dataframe, assigning column names in the same
    # order as the tweet attributes listed above
    news_df = pd.DataFrame(text_tweets,
                           columns=['User', 'Text', 'Date', 'Retweets',
                                    'Favorites', 'Mentions', 'HashTags'])

    return news_df

Now we can run the function based on our desired criteria.

# Defining news sources I want to include
news_sources = ['nytimes', 'bbcbreaking', 'bbcnews', 'bbcworld', 'theeconomist',
                'reuters', 'wsj', 'financialtimes', 'guardian']

# getting tweets from the defined news sources,
# only including top tweets,
# looking at the past week with the end_date not inclusive,
# and specifying that we want a max number of tweets = 100.
# also sorting the tweets by date, descending.
news_df = get_tweets(news_sources,
                     top_only=True,
                     start_date="2020-04-07",
                     end_date="2020-04-14",
                     max_tweets=100).sort_values('Date', ascending=False)

news_df.head()

Here is a screenshot of this output.

Let’s export this DataFrame as an HTML file so we can read the full text of each tweet in our browser.

# exporting as an html file
news_df.to_html('news_twitter_summary.html')

Here is a screenshot of what it looks like when opened in my browser:

Now we have a personal news feed. Pretty cool, right?!
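The one-week window in Example 1 was typed in by hand. As a minimal sketch, the start and end dates could instead be computed with Python's standard datetime module. The helper name search_window is hypothetical (it is not part of GetOldTweets3), and it assumes the end date is not inclusive, as noted in the comments of Example 1.

```python
from datetime import date, timedelta


def search_window(last_day, days=7):
    """Return (since, until) date strings covering `days` days ending on `last_day`.

    Assumption: the `until` date is exclusive, so it is set to the day
    after `last_day`. This helper is illustrative, not part of GetOldTweets3.
    """
    until = last_day + timedelta(days=1)
    since = until - timedelta(days=days)
    return since.isoformat(), until.isoformat()


# Week ending April 13, 2020 (the window used in Example 1)
since, until = search_window(date(2020, 4, 13))
print(since, until)  # 2020-04-07 2020-04-14
```

The two strings can be passed as start_date and end_date to get_tweets. Calling search_window(date.today()) would give the most recent seven days, including today.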

Example 2

First 1000 tweets that include the word “flood” in Wisconsin and its surrounding area, from 7/18/19 through 7/20/19.

(I used the following code in a recent project that used social media to map floods in the US. Using data from FEMA, I identified periods with and without confirmed flooding in a variety of states, and then scraped Twitter using GetOldTweets3. After cleaning the data and performing NLP, I converted the text data to numeric features and trained a classifier to identify which “flood” tweets corresponded to an actual US flood.)

# Importing GetOldTweets3
import GetOldTweets3 as got
# Importing pandas
import pandas as pd


def get_tweets(state, startdate, enddate, maxtweet):
    # searching for "Flood" tweets within 500 miles of the given state
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch("Flood")\
                                            .setSince(startdate)\
                                            .setUntil(enddate)\
                                            .setNear(state)\
                                            .setWithin("500mi")\
                                            .setMaxTweets(maxtweet)
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    text_tweets = [[tw.username,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags,
                    tw.geo] for tw in tweets]

    # column names follow the same order as the attributes above
    df_state = pd.DataFrame(text_tweets,
                            columns=['User', 'Text', 'Date', 'Retweets',
                                     'Favorites', 'Mentions', 'Hashtags',
                                     'Geolocation'])

    return df_state

Now we can run the function based on our desired criteria.

df_1 = get_tweets('Wisconsin', "2019-07-18", "2019-07-20", 1000)
df_1.head()

Here is a screenshot of the output.
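In both examples, the list of tweet attributes and the list of column names are written separately, so they have to be kept in the same order by hand. One way to make that harder to get wrong is to keep each column name next to the attribute it comes from. The sketch below is only an illustration: COLUMNS and tweets_to_rows are hypothetical names, and a SimpleNamespace with placeholder values stands in for a scraped tweet so the snippet runs without a network connection.

```python
from types import SimpleNamespace

# Column name -> tweet attribute, defined together so they cannot drift apart.
COLUMNS = {
    "User": "username",
    "Text": "text",
    "Date": "date",
    "Retweets": "retweets",
    "Favorites": "favorites",
    "Mentions": "mentions",
    "Hashtags": "hashtags",
}


def tweets_to_rows(tweets):
    """Build one dict per tweet, keyed by column name."""
    return [{col: getattr(tw, attr) for col, attr in COLUMNS.items()}
            for tw in tweets]


# Placeholder object with the same attribute names used in the examples;
# the values are made up for illustration, not real scraped data.
fake_tweet = SimpleNamespace(username="example_user", text="placeholder text",
                             date="2020-04-13", retweets=3, favorites=5,
                             mentions="", hashtags="#example")

rows = tweets_to_rows([fake_tweet])
print(rows[0]["Retweets"], rows[0]["Favorites"])  # 3 5
```

Passing rows to pd.DataFrame(rows) should then produce one column per key, without a separate columns argument to keep in sync.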

Last update: April 13, 2020

Resources

GetOldTweets3 Python3 library documentation at https://pypi.org/project/GetOldTweets3/.


Written by Andrea Yoss