GetOldTweets3

How to bypass the limitations of the official Twitter API when scraping Twitter data.

Andrea Yoss
Apr 13, 2020

GetOldTweets3 is a free Python 3 library that lets you scrape data from Twitter without any API keys. It also lets you scrape historical tweets more than one week old, which the standard (free) Twitter search API does not allow.

Using GetOldTweets3, you can scrape tweets using a variety of search parameters, such as start and end dates, username(s), a text query, and a reference location. You can also choose which tweet attributes to keep. Available attributes include the username, tweet text, date, retweets, and hashtags.

Let’s run through some examples to illustrate the different ways we can use the GetOldTweets3 library to extract tweets.

Before you start, you will need to install GetOldTweets3.

!pip install GetOldTweets3

Example 1

Top 100 Recent Tweets from Specified News Sources

# Importing GetOldTweets3
import GetOldTweets3 as got
# Importing pandas
import pandas as pd

def get_tweets(username, top_only, start_date, end_date, max_tweets):

    # specifying tweet search criteria
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                        .setTopTweets(top_only)\
                        .setSince(start_date)\
                        .setUntil(end_date)\
                        .setMaxTweets(max_tweets)

    # scraping tweets based on criteria
    tweet = got.manager.TweetManager.getTweets(tweetCriteria)

    # creating list of tweets with the tweet attributes
    # specified in the list comprehension
    text_tweets = [[tw.username,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags] for tw in tweet]

    # creating dataframe; the column names must be listed in the
    # same order as the tweet attributes above
    news_df = pd.DataFrame(text_tweets,
                           columns = ['User', 'Text', 'Date', 'Retweets', 'Favorites', 'Mentions', 'HashTags'])

    return news_df

Now we can run the function based on our desired criteria.

# Defining news sources I want to include
news_sources = ['nytimes', 'bbcbreaking', 'bbcnews', 'bbcworld', 'theeconomist', 'reuters', 'wsj', 'financialtimes', 'guardian']

# getting tweets from the defined news sources,
# only including top tweets,
# looking at the past week with the end_date not inclusive,
# and specifying that we want a max number of tweets = 100.
# also sorting the tweets by date, descending.
news_df = get_tweets(news_sources,
                     top_only = True,
                     start_date = "2020-04-07",
                     end_date = "2020-04-14",
                     max_tweets = 100).sort_values('Date', ascending=False)
news_df.head()

Here is a screenshot of this output.
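
A quick note on the date window used above. Since the end date is not inclusive, "the past week" means starting seven days back and ending on today's date. If you don't want to hard-code the strings, here is a minimal sketch of a helper that builds them. The name `past_week_window` is my own for illustration and is not part of GetOldTweets3.

```python
from datetime import date, timedelta

def past_week_window(today):
    """Return (since, until) date strings covering the 7 days before `today`.

    `until` is the first day that is NOT included, which matches the
    exclusive end date described in the comments above.
    """
    since = today - timedelta(days=7)
    return since.isoformat(), today.isoformat()

since, until = past_week_window(date(2020, 4, 14))
print(since, until)  # 2020-04-07 2020-04-14
```

These are the same two strings passed as `start_date` and `end_date` in the call above; passing `date.today()` instead would give a rolling window.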

Let’s try something. Let’s export this as an HTML file so we can view the tweet text in its entirety in our browser…

# exporting as an html file
news_df.to_html('news_twitter_summary.html')

Here is a screenshot of what it looks like when opened in my browser:

Now we have a personal news feed. Pretty cool, right?!
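
One thing to watch in `get_tweets`: the tweet attributes and the column names are typed out by hand in two separate lists, so it is easy for them to fall out of order (retweets and favorites are the usual suspects). A possible way around this, sketched below, is to keep attribute/column pairs in one place and build a dict per tweet. `FakeTweet`, `FIELDS`, and `tweets_to_records` are hypothetical names I made up so the sketch runs offline; a real run would pass the objects returned by `getTweets`.

```python
from collections import namedtuple

# Hypothetical stand-in for a scraped tweet object, so this runs without scraping.
FakeTweet = namedtuple("FakeTweet", ["username", "text", "retweets", "favorites"])

# Single source of truth: (tweet attribute, column name) pairs.
FIELDS = [
    ("username", "User"),
    ("text", "Text"),
    ("retweets", "Retweets"),
    ("favorites", "Favorites"),
]

def tweets_to_records(tweets, fields=FIELDS):
    """Build one dict per tweet, keyed by column name."""
    return [{col: getattr(tw, attr) for attr, col in fields} for tw in tweets]

records = tweets_to_records([FakeTweet("nytimes", "Example headline", 12, 34)])
print(records[0]["Retweets"], records[0]["Favorites"])  # 12 34
```

A list of dicts like `records` can be handed straight to `pd.DataFrame(records)`, and each value ends up under the column name it was paired with.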

Example 2

Up to 1,000 tweets containing the word “flood,” posted in or within 500 miles of Wisconsin, from 7/18/19 up to (but not including) 7/20/19.

(I used the following code in a recent project in which I used social media to map floods in the US. Using data from FEMA, I identified periods with and without definite flooding in a variety of states, and then scraped Twitter using GetOldTweets3. After cleaning the data and performing NLP, I converted the text data to numeric features and trained a classifier to identify which “flood” tweets corresponded to an actual US flood.)
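
To give a feel for the cleaning step mentioned above, here is a rough sketch of how tweet text might be normalized before any NLP. This is not the project's actual cleaning code, which isn't shown here. The function name `clean_tweet`, the patterns, and the sample tweet are all assumptions for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")       # @usernames

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, '#' symbols, and extra whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")       # keep the hashtag word, drop the symbol
    return " ".join(text.lower().split())

sample = "Flash FLOOD warning near Madison, thanks @NWS #wiwx https://example.com/x"
print(clean_tweet(sample))  # flash flood warning near madison, thanks wiwx
```

In a DataFrame like the one built below, something along the lines of `df['Text'].apply(clean_tweet)` would apply this to every tweet.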

# Importing GetOldTweets3
import GetOldTweets3 as got
# Importing pandas
import pandas as pd

def get_tweets(state, startdate, enddate, maxtweet):

    # specifying tweet search criteria: query text, date range,
    # reference location and search radius, and max number of tweets
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch("Flood")\
                        .setSince(startdate)\
                        .setUntil(enddate)\
                        .setNear(state)\
                        .setWithin("500mi")\
                        .setMaxTweets(maxtweet)

    # scraping tweets based on criteria
    tweet = got.manager.TweetManager.getTweets(tweetCriteria)

    text_tweets = [[tw.username,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags,
                    tw.geo] for tw in tweet]

    # column names listed in the same order as the tweet attributes above
    df_state = pd.DataFrame(text_tweets, columns = ['User', 'Text', 'Date', 'Retweets', 'Favorites', 'Mentions', 'Hashtags', 'Geolocation'])

    return df_state

Now we can run the function based on our desired criteria.

df_1 = get_tweets('Wisconsin', "2019-07-18", "2019-07-20", 1000)
df_1.head()

Here is a screenshot of the output:

Last update: April 13, 2020
