Does anyone have a Python method that extracts hashtags from text using the same rules Twitter applies for its hashtag extraction? I've checked Stack Overflow but haven't found any up-to-date code.
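import re

def extract_hashtags(text):
    '''Extract hashtags, skipping tags made up only of digits.'''
    valid_tags = set()
    # \w is Unicode-aware in Python 3, so non-English letters match too
    tags = re.findall(r'#(\w+)', text)
    for tag in tags:
        if tag.isdigit():
            # digit-only tags such as "#123" are skipped
            continue
        valid_tags.add(tag)
    return valid_tags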
This ignores hashtags that consist only of digits. It seems hashtags can start with numbers, but they can't be made up of numbers alone.
At least this is a starting point and it appears to handle characters from other languages besides English.
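A quick way to see both behaviours is to run the function on a couple of sample strings. The sample texts below are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import re

def extract_hashtags(text):
    '''Extract hashtags, skipping tags made up only of digits.'''
    valid_tags = set()
    for tag in re.findall(r'#(\w+)', text):
        if not tag.isdigit():
            valid_tags.add(tag)
    return valid_tags

# "#2020" is digits only and is dropped; "#2020vision" starts with digits and is kept
print(extract_hashtags('Top #2020 moments from #2020vision'))

# Cyrillic and Japanese characters are matched by \w in Python 3
print(extract_hashtags('#привет #日本語'))
```

The result is a set, so duplicates are collapsed and the order of the tags is not guaranteed.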
If anyone can test this method to see if it produces invalid hashtags, I'd really appreciate it. I'll keep working on it to try and get it to exactly emulate Twitter's hashtag rules.
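For anyone who wants to try it, here is a rough sketch of a test run over a few edge cases. The sample strings are hypothetical, and the comments describe what this regex returns, not what Twitter does. I believe Twitter does not treat a '#' in the middle of a word or inside a URL as the start of a hashtag, but that is an assumption to verify against Twitter itself.

```python
import re

def extract_hashtags(text):
    '''Extract hashtags, skipping tags made up only of digits.'''
    valid_tags = set()
    for tag in re.findall(r'#(\w+)', text):
        if not tag.isdigit():
            valid_tags.add(tag)
    return valid_tags

# Hypothetical edge cases, chosen only for illustration
samples = [
    'email me at user#home',            # '#' in the middle of a word
    'see http://example.com/#section',  # '#' inside a URL fragment
    '#under_score',                     # underscore inside a tag
    '#123',                             # digits only
]

for sample in samples:
    print(repr(sample), '->', sorted(extract_hashtags(sample)))
```

With this regex the first two samples yield "home" and "section", so if Twitter really does reject them, the pattern would need a check on the character before the '#'.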
