GetOldTweets3
How to bypass limitations of the official Twitter API when scraping Twitter data.
GetOldTweets3 is a free Python 3 library that allows you to scrape Twitter data without requiring any API keys. It also allows you to scrape historical tweets more than one week old, which you cannot do with the Twitter API.
Using GetOldTweets3, you can scrape tweets using a variety of search parameters such as start/end dates, username(s), a text query, and a reference location area. Additionally, you can specify which tweet attributes you would like to include. Some attributes include: username, tweet text, date, retweets, and hashtags.
Let’s run through some examples to illustrate the different ways we can use the GetOldTweets3 library to extract tweets:
Before you start, you will need to install GetOldTweets3.
!pip install GetOldTweets3
Example 1
Top 100 Recent Tweets from Specified News Sources
# Importing GetOldTweets3
import GetOldTweets3 as got

# Importing pandas
import pandas as pd

def get_tweets(username, top_only, start_date, end_date, max_tweets):
    # specifying tweet search criteria
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
                                               .setTopTweets(top_only)\
                                               .setSince(start_date)\
                                               .setUntil(end_date)\
                                               .setMaxTweets(max_tweets)
    # scraping tweets based on criteria
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # creating list of tweets with the tweet attributes
    # specified in the list comprehension
    text_tweets = [[tw.username,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags] for tw in tweets]

    # creating dataframe, assigning column names in the same
    # order as the tweet attributes above
    news_df = pd.DataFrame(text_tweets,
                           columns=['User', 'Text', 'Date', 'Retweets',
                                    'Favorites', 'Mentions', 'Hashtags'])
    return news_df
Now we can run the function based on our desired criteria.
# Defining news sources I want to include
news_sources = ['nytimes', 'bbcbreaking', 'bbcnews', 'bbcworld',
                'theeconomist', 'reuters', 'wsj', 'financialtimes',
                'guardian']

# getting tweets from the defined news sources,
# only including top tweets,
# looking at the past week with the end_date not inclusive,
# and specifying that we want a max number of tweets = 100.
# also sorting the tweets by date, descending.
news_df = get_tweets(news_sources,
                     top_only=True,
                     start_date="2020-04-07",
                     end_date="2020-04-14",
                     max_tweets=100).sort_values('Date', ascending=False)

news_df.head()
Here is a screenshot of this output.
Let’s try something. Let’s export this as an html file so we can view the tweet text in its entirety in our browser…
# exporting as an html file
news_df.to_html('news_twitter_summary.html')
Here is a screenshot of what it looks like when opened in my browser:
Now we have a personal news feed. Pretty cool, right?!
Example 2
First 1000 tweets that include the word “flood” in Wisconsin and its surrounding area, from 7/18/19 up to (but not including) 7/20/19.
(I used the following code in a recent project I completed, using social media to map floods in the US. Using data from FEMA, I identified periods with and without definite flooding in a variety of states, and then scraped twitter using GetOldTweets3. After cleaning the data and performing NLP, I converted the text data to numeric, and trained a classifier to identify which “flood” tweets corresponded to an actual US flood.)
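To give a rough idea of the classification step described above, here is a minimal sketch of converting tweet text to numbers and fitting a simple classifier. Everything in it is a hypothetical stand-in: the four toy tweets, the labels, and the bag-of-words nearest-centroid model are illustrative only, not the actual project pipeline.

```python
# A hypothetical sketch of the "text to numeric + classifier" step.
# The tiny corpus, labels, and nearest-centroid model are made-up
# stand-ins; the real project used scraped tweets and fuller NLP.

def vectorize(text, index):
    """Bag-of-words count vector over a fixed vocabulary index."""
    vec = [0.0] * len(index)
    for word in text.lower().split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

def train_centroids(texts, labels, index):
    """Average the count vectors of each class into one centroid per label."""
    sums, counts = {}, {}
    for text, label in zip(texts, labels):
        acc = sums.setdefault(label, [0.0] * len(index))
        for i, x in enumerate(vectorize(text, index)):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def predict(text, centroids, index):
    """Assign the label whose centroid is closest in squared distance."""
    vec = vectorize(text, index)
    return min(centroids, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(vec, centroids[label])))

# toy training data: two tweets about real flooding, two figurative ones
texts = ["streets are flooding near the river tonight",
         "flash flood warning issued for the county",
         "i am flooded with emails at work today",
         "what a flood of memes on my timeline"]
labels = ["flood", "flood", "not_flood", "not_flood"]

vocab = sorted({word for text in texts for word in text.lower().split()})
index = {word: i for i, word in enumerate(vocab)}
centroids = train_centroids(texts, labels, index)

print(predict("flood warning near the river", centroids, index))  # flood
```

In the actual project a proper vectorizer and a trained model would replace this toy, but the shape of the pipeline (vectorize, fit, predict) is the same.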
# Importing GetOldTweets3
import GetOldTweets3 as got

# Importing pandas
import pandas as pd

def get_tweets(state, startdate, enddate, maxtweet):
    # specifying tweet search criteria
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch("Flood")\
                                               .setSince(startdate)\
                                               .setUntil(enddate)\
                                               .setNear(state)\
                                               .setWithin("500mi")\
                                               .setMaxTweets(maxtweet)
    # scraping tweets based on criteria
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # creating list of tweets with the tweet attributes
    text_tweets = [[tw.username,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags,
                    tw.geo] for tw in tweets]

    # creating dataframe with columns matching the attribute order above
    df_state = pd.DataFrame(text_tweets,
                            columns=['User', 'Text', 'Date', 'Retweets',
                                     'Favorites', 'Mentions', 'Hashtags',
                                     'Geolocation'])
    return df_state
Now we can run the function based on our desired criteria.
df_1 = get_tweets('Wisconsin', "2019-07-18", "2019-07-20", 1000)
df_1.head()
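Once the tweets are in a dataframe, a quick sanity check is to count tweets per calendar day to see when flood chatter spiked. The rows below are synthetic stand-ins for the real df_1 (which comes from the scrape above); only pandas is assumed.

```python
import pandas as pd

# synthetic stand-in rows shaped like the df_1 output above
df_1 = pd.DataFrame({
    'Text': ['flood on main st', 'river flood watch', 'basement flooded again'],
    'Date': pd.to_datetime(['2019-07-18 09:00',
                            '2019-07-18 17:30',
                            '2019-07-19 08:15']),
})

# count tweets per calendar day to see when flood chatter spiked
per_day = df_1.groupby(df_1['Date'].dt.date).size()
print(per_day)
```

On the real scrape, a spike in the per-day counts is a first hint of an actual flood event worth cross-checking against the FEMA records.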
Here is a screenshot of the output:
Last update: April 13, 2020
Resources
- GetOldTweets3 Python3 library documentation at https://pypi.org/project/GetOldTweets3/.