needed for their research and reduce the number of API requests needed to rehydrate the remaining tweet ids. After I confirm with Twitter whether sharing metadata for each tweet id is possible, I want to ask other researchers which pieces of metadata are most important.
Right now, I am creating a list of metadata fields to accompany each tweet id. My question to other researchers is what other metadata would help with pre-filtering the data. Here are the fields I've added so far:
tweet id (bigint)
user_id (bigint)
conversation_id (bigint)
is_retweet (bool)
is_quoted_tweet (bool)
is_reply (bool)
is_root (bool)
last_update_time (int)
retweet_count (int)
favorites_count (int)
reply_count (int)
contains_location_data (bool)
contains_media_types (enum int)
is_verified (bool)
user_follower_count (int)
user_creation_epoch (int)
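To make the proposed fields concrete, here is a rough sketch of what one metadata row and a pre-filter over it might look like. The field names come from the list above; the `MediaType` bit values and the `worth_rehydrating` helper are assumptions made up for illustration, not part of any agreed format.

```python
from dataclasses import dataclass
from enum import IntFlag


class MediaType(IntFlag):
    # Hypothetical bit values; the list above only says "enum int".
    NONE = 0
    PHOTO = 1
    VIDEO = 2
    ANIMATED_GIF = 4


@dataclass(frozen=True)
class TweetMetadata:
    tweet_id: int
    user_id: int
    conversation_id: int
    is_retweet: bool
    is_quoted_tweet: bool
    is_reply: bool
    is_root: bool
    last_update_time: int
    retweet_count: int
    favorites_count: int
    reply_count: int
    contains_location_data: bool
    contains_media_types: MediaType
    is_verified: bool
    user_follower_count: int
    user_creation_epoch: int


def worth_rehydrating(row: TweetMetadata) -> bool:
    # Example pre-filter (hypothetical criteria): keep non-retweets that
    # carry a photo and come from accounts with at least 1000 followers.
    return (
        not row.is_retweet
        and bool(row.contains_media_types & MediaType.PHOTO)
        and row.user_follower_count >= 1000
    )
```

Only the ids of rows that pass a filter like this would then need to be rehydrated through the API.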
Keep in mind that some fields may not be allowed under Twitter's ToS (I need to clarify this with Twitter). Again, the main objective is to give as much metadata as possible to help researchers easily pre-filter the data before rehydrating tweets. If you can think of any other useful metadata to accompany each tweet id, please feel free to add your ideas and recommendations.
I will update with more info once I get more guidance from Twitter.
how about instead of providing the metadata with tweet IDs, you provide a queryable API filtering on the metadata on your end? you can share only tweet IDs from the resulting query made to said API. this wouldn't require an approval from Twitter either since no metadata is shared
Oh wow, that's actually a very good idea and I think it would comply with the ToS and the developer agreement of Twitter. I think one could even add a filter for words in the tweets or user bios!
precisely. essentially, this would be the Twitter search endpoint simulated. The only drawback being that it returns just the `tweet_id` attribute instead of the entire Tweet Object.
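A minimal sketch of that idea, assuming the metadata stays on the server as plain dict rows and callers get back nothing but ids. The `query_tweet_ids` name and the sample rows are hypothetical, just to show the shape of such a filter:

```python
from typing import Callable, Iterable, List, Mapping

Row = Mapping[str, object]


def query_tweet_ids(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[int]:
    # Apply the filter to the private metadata, but expose only `tweet_id`.
    return [int(row["tweet_id"]) for row in rows if predicate(row)]


# Made-up sample rows standing in for the server-side metadata store.
ROWS = [
    {"tweet_id": 101, "is_retweet": False, "retweet_count": 12},
    {"tweet_id": 102, "is_retweet": True, "retweet_count": 40},
    {"tweet_id": 103, "is_retweet": False, "retweet_count": 3},
]

ids = query_tweet_ids(
    ROWS,
    lambda r: not r["is_retweet"] and r["retweet_count"] >= 5,
)
print(ids)  # [101]
```

The metadata never leaves the server in this setup; the caller still has to rehydrate the returned ids to get the full Tweet Object.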
also, in case Twitter does permit the sharing of metadata, and you choose to go ahead with that instead, `created_at` and `lang` are two crucial attributes that I found missing in your list.
created_at can be reconstructed from the tweet id itself. It would make things easier for many people to have that field anyway. I think adding created_at would not be additional metadata but just another way of formatting ids, so Twitter should be fine with that.
For reference:
Quote Tweet
Replying to @AlduCornelissen
The reason I didn't include created_at is because, for these datasets, the tweets have Twitter's Snowflake implementation which includes the millisecond epoch within the Tweet id itself.
You can get the millisecond epoch like this:
epoch = 1288834974657 + (tweet_id >> 22)
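The same formula as a small Python sketch, with a conversion to a UTC datetime added for convenience. The function names are made up here, and this only applies to Snowflake ids; older pre-Snowflake tweet ids do not encode a timestamp.

```python
from datetime import datetime, timezone

# Offset from the formula above: start of the Snowflake epoch, in milliseconds.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_epoch_ms(tweet_id: int) -> int:
    # Dropping the low 22 bits leaves the milliseconds elapsed since the Snowflake epoch.
    return TWITTER_EPOCH_MS + (tweet_id >> 22)


def snowflake_to_datetime(tweet_id: int) -> datetime:
    # Convert milliseconds to seconds and return a timezone-aware UTC datetime.
    return datetime.fromtimestamp(snowflake_to_epoch_ms(tweet_id) / 1000, tz=timezone.utc)
```

For example, `snowflake_to_datetime(tweet_id).isoformat()` gives a created_at-style string for any Snowflake tweet id.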


