1/ When someone types “neeva” into search, how do we know they mean “neeva.com” instead of “neevaneevaneeva.com”? After all, the second has 3 times as much neeva!
See how you can do much better than vanilla TF-IDF / cosine similarity for textual relevance!🧵
7/ Yet, you can argue that longer documents are more likely to discuss multiple topics, so normalizing the vectors unjustly penalizes them.
To fairly balance between these extremes, we use ideas from the paper “pivoted document length normalization”:
ecommons.cornell.edu/bitstream/hand
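A rough sketch of the pivoted idea, not our production code: instead of dividing a document's score by its own length norm, divide by a blend of that norm and a collection-wide "pivot" (typically the average norm). The function names, the use of document length as the norm, and the default `slope` of 0.25 are illustrative assumptions.

```python
def pivoted_norm(old_norm: float, pivot: float, slope: float = 0.25) -> float:
    """Pivoted normalization factor: (1 - slope) * pivot + slope * old_norm.

    slope = 1.0 reduces to ordinary length normalization;
    slope = 0.0 ignores document length entirely.
    """
    return (1.0 - slope) * pivot + slope * old_norm


def normalized_score(raw_score: float, doc_length: float,
                     avg_length: float, slope: float = 0.25) -> float:
    # Hypothetical helper: uses document length as the "old" norm
    # and the collection's average length as the pivot.
    return raw_score / pivoted_norm(doc_length, avg_length, slope)
```

With an average length of 100, a 400-word document is divided by 175 rather than 400, so it is penalized less than under full normalization, while a 50-word document is divided by 87.5 rather than 50, so it is boosted less.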
9/ This helps prevent “unfair” penalization of long documents, while still showing relevant short documents.
However, this still doesn’t solve how to show neeva.com instead of neevaneevaneeva.com: even after length normalization, both are equally relevant.
12/ For example, “neeva.com” is linked to from a lot of pages with text like “see the neeva homepage here:” Meanwhile, “neevaneevaneeva.com” isn’t linked to by anyone.
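To make the anchor-text signal concrete, here is a toy sketch that counts how many inbound links mention the query term. The function name, the whitespace tokenization, and the sample anchor texts are all made up for illustration; a real ranker would weight and combine this signal with others.

```python
def anchor_text_score(query: str, inbound_anchor_texts: list[str]) -> int:
    """Count inbound links whose anchor text contains the query as a word."""
    q = query.lower()
    return sum(1 for text in inbound_anchor_texts if q in text.lower().split())


# Invented sample data: anchor texts of links pointing at each site.
anchors = {
    "neeva.com": [
        "see the neeva homepage here:",
        "Neeva search engine",
        "try neeva today",
    ],
    "neevaneevaneeva.com": [],  # nobody links to it
}

scores = {site: anchor_text_score("neeva", texts) for site, texts in anchors.items()}
best = max(scores, key=scores.get)
```

Here `best` comes out as `neeva.com`, because the other site has no inbound anchor text to count, no matter how many times "neeva" appears in its own domain name.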
