In order to share Twitter data with academic researchers, it is important that Pushshift does so in a way that complies with Twitter's TOS.
Recently, Twitter made some modifications to its TOS which allow for sharing an unlimited number of tweet ids with other researchers.
The research team that is given this list of ids then needs to "rehydrate" the tweets via Twitter's API. One of the things I am looking into is what metadata can be shared along with each tweet id. This would enable researchers using the data to filter out tweets that aren't needed for their research and reduce the number of API requests needed to rehydrate the remaining tweet ids.
Once I confirm with Twitter whether sharing metadata for each tweet id is possible, I want to ask other researchers which pieces of metadata are most important.
Right now, I am creating a list of metadata fields to accompany each tweet id. My question to other researchers is what other metadata would help with pre-filtering the data. Here are the fields I've added so far:
tweet_id (bigint)
user_id (bigint)
conversation_id (bigint)
is_retweet (bool)
is_quoted_tweet (bool)
is_reply (bool)
is_root (bool)
last_update_time (int)
retweet_count (int)
favorites_count (int)
reply_count (int)
contains_location_data (bool)
contains_media_types (enum int)
is_verified (bool)
user_follower_count (int)
user_creation_epoch (int)
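As a rough illustration of the pre-filtering this metadata is meant to enable, here is a minimal Python sketch. It assumes a subset of the fields listed above arrives as one record per tweet id. The `TweetMeta` class, the `ids_to_rehydrate` helper, and the sample values are hypothetical and are not part of any Pushshift or Twitter API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TweetMeta:
    """Hypothetical per-tweet record using a subset of the proposed fields."""
    tweet_id: int
    user_id: int
    is_retweet: bool
    is_reply: bool
    retweet_count: int
    contains_location_data: bool


def ids_to_rehydrate(
    records: Iterable[TweetMeta],
    min_retweets: int = 0,
    require_location: bool = False,
) -> List[int]:
    """Return ids of non-retweets that pass the given metadata filters."""
    return [
        r.tweet_id
        for r in records
        if not r.is_retweet
        and r.retweet_count >= min_retweets
        and (r.contains_location_data or not require_location)
    ]


# Made-up sample records, for illustration only.
sample = [
    TweetMeta(1, 10, False, False, 5, True),
    TweetMeta(2, 11, True, False, 50, False),
    TweetMeta(3, 12, False, True, 0, False),
]

# Only these ids would need to be sent to the API for rehydration.
kept = ids_to_rehydrate(sample, min_retweets=1)
```

Only the ids that survive the filter would then be rehydrated, which is where the savings in API requests would come from.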
Keep in mind that some fields may not be allowed under Twitter's TOS (I need to clarify this with Twitter). Again, the main objective is to give as much metadata as possible to help researchers easily pre-filter the data before rehydrating tweets. If you can think of any other useful metadata to accompany each tweet id, please feel free to add your ideas and recommendations.
I will update with more info once I get more guidance from Twitter.
Replying to
I propose user_statuses_count; sometimes it's useful to filter out accounts that are too active before rehydrating.
Replying to
Most of these would be useful to filter what needs to be hydrated for each project. How much geolocation data is there these days?
Replying to
A little bit above 1% of all tweets appear to contain some type of location data.
Replying to
I'm not familiar with (I mean, I've never seen) reply_count, is_root or conversation_id. The last one sounds like a key from DMs?
Though I am using only the free REST API.
Replying to
I believe is_root can be calculated from the standard API data. The other two, I believe, are in the enhanced API, so I may not be able to include them. I'll have to check.
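One possible way to derive such a flag is sketched below in Python. It assumes that "root" simply means the tweet is not a reply to another tweet, and that the tweet arrives as a dict with an `in_reply_to_status_id` key, as in the standard API's tweet objects. Both the definition and the `is_root` helper are assumptions for illustration and may not match what Pushshift ends up using.

```python
from typing import Any, Dict


def is_root(tweet: Dict[str, Any]) -> bool:
    """Assumed definition: a tweet is a root if it does not reply to another tweet."""
    return tweet.get("in_reply_to_status_id") is None


# Made-up tweet objects, for illustration only.
original = {"id": 100, "in_reply_to_status_id": None}
reply = {"id": 101, "in_reply_to_status_id": 100}

root_flags = [is_root(original), is_root(reply)]
```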
Replying to
What about something like has_url, has_hashtag, has_mention, has_image? Or even more fine grained num_url, num_hashtag, etc?
Replying to
Great ideas. I'm thinking hashtag_count, user_mention_count, etc. The has_image flag is part of the media metadata, and I'll probably use some type of bit field (8-bit int) to cover all the possible values.
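A bit field like that could look roughly like the Python sketch below. The flag names and bit positions are assumptions chosen for illustration (the actual layout has not been decided), and `encode_media` and `has_media` are hypothetical helpers.

```python
from enum import IntFlag
from typing import Iterable


class MediaTypes(IntFlag):
    """Assumed bit layout; each media type occupies one bit of an 8-bit int."""
    NONE = 0
    PHOTO = 1
    VIDEO = 2
    ANIMATED_GIF = 4


def encode_media(types: Iterable[str]) -> int:
    """Pack media type names (e.g. "photo", "video") into one small integer."""
    mask = MediaTypes.NONE
    for name in types:
        mask |= MediaTypes[name.upper()]
    return int(mask)


def has_media(mask: int, flag: MediaTypes) -> bool:
    """Check whether a given media type bit is set in the packed value."""
    return bool(MediaTypes(mask) & flag)


packed = encode_media(["photo", "video"])
```

With this layout, a has_image style check becomes `has_media(packed, MediaTypes.PHOTO)`, and several more media types can be added before the value outgrows 8 bits.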
Replying to
How about, instead of providing the metadata with tweet IDs, you provide a queryable API that filters on the metadata on your end? You would share only the tweet IDs resulting from a query made to that API. This wouldn't require approval from Twitter either, since no metadata is shared.
Oh wow, that's actually a very good idea and I think it would comply with the ToS and the developer agreement of Twitter. I think one could even add a filter for words in the tweets or user bios!
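To make the idea concrete, here is a minimal Python sketch of server-side filtering that returns only tweet ids. It uses an in-memory list as a stand-in for the real data store. The `query_ids` function, the row layout, and the sample rows are all hypothetical, and whether this approach actually satisfies Twitter's terms would still need to be confirmed.

```python
from typing import Any, Dict, List

# Stand-in for the server-side store. Rows are made up and never leave the server.
_STORE: List[Dict[str, Any]] = [
    {"tweet_id": 101, "is_retweet": False, "hashtag_count": 2,
     "text": "new dataset released"},
    {"tweet_id": 102, "is_retweet": True, "hashtag_count": 0,
     "text": "RT new dataset released"},
    {"tweet_id": 103, "is_retweet": False, "hashtag_count": 0,
     "text": "lunch time"},
]


def query_ids(filters: Dict[str, Any], contains_word: str = "") -> List[int]:
    """Filter on metadata (and optionally a word) but return tweet ids only."""
    word = contains_word.lower()
    matches = []
    for row in _STORE:
        # Skip rows where any requested field differs from the requested value.
        if any(row.get(key) != value for key, value in filters.items()):
            continue
        # Optional whole-word match against the tweet text.
        if word and word not in row["text"].lower().split():
            continue
        matches.append(row["tweet_id"])
    return matches


originals_about_datasets = query_ids({"is_retweet": False}, contains_word="dataset")
```

The caller only ever sees a list of ids, which it can then rehydrate through Twitter's API.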