Pushshift will be releasing a word frequency list created from over one billion tweets. For each word, the list will give the word itself, the number of times it occurs, and its frequency as a percentage of all words in the sample.
This may help researchers identify
Sounds great! What tokenization strategy are you planning to use? E.g. will hashtags be stripped or case normalized?
Great questions! I am normalizing a few ways:
- Remove hashtags
- Remove user mentions
- Normalize unicode (basically ascii flattening)
- Lowercase string
If you have any other suggestions, please let me know!
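
For anyone who wants to try something similar, here is a rough Python sketch of the normalization steps listed above, followed by the count and percentage calculation. It is not the actual Pushshift code. The regexes, the function names (`normalize`, `word_frequencies`), the choice to drop a hashtag or mention entirely rather than just the `#` or `@` symbol, and the use of NFKD decomposition for the ASCII flattening are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter

# Assumed patterns: a hashtag or mention is "#" or "@" followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Assumed tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def normalize(text):
    """Apply the four steps from the thread, in the order listed."""
    text = HASHTAG_RE.sub(" ", text)   # remove hashtags (whole token, by assumption)
    text = MENTION_RE.sub(" ", text)   # remove user mentions
    # ASCII flattening: decompose accented characters, then drop non-ASCII code points.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return text.lower()                # lowercase string


def word_frequencies(tweets):
    """Return (word, count, percent of all words) tuples, most common first."""
    counts = Counter()
    for tweet in tweets:
        counts.update(TOKEN_RE.findall(normalize(tweet)))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(word, n, 100.0 * n / total) for word, n in counts.most_common()]


if __name__ == "__main__":
    sample = ["Café time with @bob #coffee", "Coffee time again"]
    for word, n, pct in word_frequencies(sample):
        print(f"{word}\t{n}\t{pct:.2f}%")
```

One side effect of flattening this way is that characters with no ASCII decomposition (emoji, CJK text, and so on) are dropped outright, so a list built like this would mostly reflect Latin-script tweets.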

