Returns two data frames (tweets data and users data) using a provided search query.
search_tweets(q, n = 100, type = "recent", max_id = NULL,
include_rts = TRUE, parse = TRUE, usr = TRUE, token = NULL,
retryonratelimit = FALSE, verbose = TRUE, ...)
q: Query to be searched, used to filter and select tweets to return from
Twitter's REST API. Must be a character string no longer than 500 characters.
Spaces behave like the boolean "AND" operator. To search for tweets containing
at least one of multiple possible terms, separate the search terms with "OR"
(in caps). For example, the query q = "data science" looks for tweets
containing both "data" and "science" located anywhere in the tweet and in any
order, whereas the query q = "data OR science" returns any tweet that contains
either "data" or "science." It is also possible to search for exact phrases
using double quotes. To do this, either wrap the double-quoted phrase in
single quotes, e.g., q = '"data science"', or escape each internal double
quote with a single backslash, e.g., q = "\"data science\"".
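A brief sketch illustrating the three query forms described above (the search
terms and n values here are illustrative only):
## tweets containing both "data" and "science"
both <- search_tweets(q = "data science", n = 100)
## tweets containing either "data" or "science"
either <- search_tweets(q = "data OR science", n = 100)
## tweets containing the exact phrase "data science"
exact <- search_tweets(q = '"data science"', n = 100)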
n: Integer, specifying the total number of desired tweets to return. Defaults
to 100. The maximum number of tweets returned from a single token is 18,000.
To return more than 18,000 tweets, users are encouraged to set
retryonratelimit to TRUE. See details for more information.
type: Character string specifying which type of search results to return from
Twitter's REST API. The current default is type = "recent"; other valid types
include type = "mixed" and type = "popular".
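For example, to favor popular results over the most recent ones (the query and
n value are placeholders):
## return popular, rather than the most recent, matching tweets
pop <- search_tweets("rstats", n = 100, type = "popular")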
max_id: Character string specifying the [oldest] status id beyond which search
results should resume returning. Especially useful for large data returns that
require multiple iterations interrupted by user time constraints. For searches
exceeding 18,000 tweets, users are encouraged to take advantage of rtweet's
internal automation procedures for waiting on rate limits by setting the
retryonratelimit argument to TRUE. In some cases, due to processing time and
rate limits, retrieving several million tweets can take several hours or even
multiple days. In these cases, it would likely be useful to leverage
retryonratelimit for sets of tweets and max_id to allow results to continue
where previous efforts left off.
include_rts: Logical, indicating whether to include retweets in search
results. Retweets are classified as any tweet generated by Twitter's built-in
"retweet" (recycle arrows) function. These are distinct from quotes (retweets
with additional text provided by the sender) or manual retweets (the
old-school method of manually entering "RT" into the text of one's tweets).
parse: Logical, indicating whether to return a parsed data frame, if true, or
a nested list (fromJSON), if false. By default, parse = TRUE saves users the
time and frustration associated with disentangling the nasty nested list
returned from Twitter's API (for proof, check rtweet's Github commit history).
As Twitter's APIs are subject to change, this argument is especially useful
when changes to Twitter's APIs affect the performance of internal parsers.
Setting parse = FALSE also ensures the maximum amount of possible information
is returned. By default, the rtweet parse process returns nearly all bits of
information returned from Twitter. However, users may occasionally encounter
new or omitted variables. In these rare cases, the nested list object will be
the only way to access those variables.
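A minimal sketch of inspecting the unparsed return; the exact structure of the
nested list depends on Twitter's API response and may vary:
## return the raw, unparsed response as a nested list
raw <- search_tweets("rstats", n = 100, parse = FALSE)
## inspect the top-level structure to locate new or omitted variables
str(raw, max.level = 2)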
usr: Logical indicating whether to return a data frame of users data. Users
data is stored as an attribute; to access this data, see users_data. Setting
this to false offers, at best, marginal savings in memory demand, and any
gains are likely to be negligible, as Twitter's API invariably returns this
data anyway. As such, this defaults to true; see users_data.
token: OAuth token. By default, token = NULL fetches a non-exhausted token
from an environment variable. Find instructions on how to create tokens and
set up an environment variable in the tokens vignette (in R, send ?tokens to
the console).
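A hedged sketch of supplying a token explicitly rather than relying on the
environment variable; the create_token() call and its placeholder values
follow the tokens vignette and should be treated as assumptions here:
## create a token (see ?tokens for setup details; values are placeholders)
twitter_token <- create_token(app = "my_app_name",
  consumer_key = "XXXX", consumer_secret = "XXXX")
## pass the token explicitly
rt <- search_tweets("rstats", n = 100, token = twitter_token)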
retryonratelimit: Logical indicating whether to wait and retry when rate
limited. This argument is only relevant if the desired return (n) exceeds the
remaining limit of available requests (assuming no other searches have been
conducted in the past 15 minutes, this limit is 18,000 tweets). Defaults to
false. Set to TRUE to automate the process of conducting big searches (i.e.,
n > 18000). For many search queries, especially specific or specialized
searches, there won't be more than 18,000 tweets to return. But for broad,
generic, or popular topics, the total number of tweets within the REST window
of time (7-10 days) can easily reach the millions.
verbose: Logical, indicating whether or not to include output
processing/retrieval messages. Defaults to TRUE. For larger searches, messages
include rough estimates for time remaining between searches. It should be
noted, however, that these time estimates only describe the amount of time
between searches and not the total time remaining. For large searches
conducted with retryonratelimit set to TRUE, the total retrieval time can be
estimated by dividing the number of requested tweets by 18,000 and then
multiplying the quotient by 15 (token cooldown time, in minutes).
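Applying that back-of-the-envelope estimate to a 100,000-tweet request:
## rough total retrieval time, in minutes, for n = 100000
## 100000 / 18000 = ~5.6 rate-limit windows, each 15 minutes long
(100000 / 18000) * 15  ## roughly 83 minutes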
...: Further arguments passed on to make_url. All named arguments that do not
match the above arguments (i.e., count, type, etc.) will be built into the
request. To return only English language tweets, for example, use lang = "en".
For more options see Twitter's API documentation.
List object with tweets and users each returned as a data frame.
Twitter API documentation recommends limiting searches to 10 keywords and
operators. Complex queries may also produce API errors, preventing recovery of
information related to the query. It should also be noted that Twitter's
search API does not consist of an index of all Tweets. At the time of
searching, the search API index includes only between 6 and 9 days of Tweets.
The number of tweets returned will often be less than what was specified by
the user. This can happen because (a) the search query did not return many
results (the search pool is already thinned out from the population of tweets
to begin with), (b) the user hit the rate limit for a given token, or (c)
there was recent activity (either more tweets, which affect pagination in
returned results, or deletion of tweets). To return more than 18,000 tweets in
a single call, users must set the retryonratelimit argument to true. This
method relies on updating the max_id parameter and waiting for token rate
limits to refresh between searches. As a result, it is possible to search for
50,000, 100,000, or even 10,000,000 tweets, but these searches can take hours
or even days. At these durations, it would not be uncommon for connections to
time out. Users are instead encouraged to break up data retrieval into smaller
chunks by leveraging retryonratelimit and then using the status_id of the
oldest tweet as the max_id to resume searching where the previous efforts left
off, as in the sketch below.
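A hedged sketch of this chunked approach; it assumes the parsed return
includes a status_id column and that the final row holds the oldest tweet,
which should be verified against the actual return:
## first chunk, letting rtweet wait out rate limits as needed
chunk1 <- search_tweets("rstats", n = 50000, retryonratelimit = TRUE)
## status_id of the oldest tweet returned (assumed to be the last row)
oldest <- chunk1$status_id[nrow(chunk1)]
## resume the search where the previous chunk left off
chunk2 <- search_tweets("rstats", n = 50000, retryonratelimit = TRUE,
  max_id = oldest)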
https://dev.twitter.com/overview/documentation
Other tweets: get_favorites, get_timeline, lookup_statuses, stream_tweets,
tweets_data
# NOT RUN {
## search for 1000 tweets mentioning Hillary Clinton
hrc <- search_tweets(q = "hillaryclinton", n = 1000)
## data frame where each observation (row) is a different tweet
hrc
## users data also retrieved. can access it via users_data()
users_data(hrc)
## search for 1000 tweets in English
djt <- search_tweets(q = "realdonaldtrump", n = 1000, lang = "en")
djt
users_data(djt)
## exclude retweets
rt <- search_tweets("rstats", n = 500, include_rts = FALSE)
## perform search for lots of tweets
rt <- search_tweets("trump OR president OR potus", n = 100000,
retryonratelimit = TRUE)
## plot time series of tweets frequency
ts_plot(rt, by = "mins", theme = "spacegray",
main = "Tweets about Trump")
# }