2/ Textual relevance is only one part of document ranking (alongside signals like centrality, page quality, and click rate). But it’s one of the most important parts, and the one we’ll be covering in today’s thread.
3/ The most popular way to rank documents relative to queries is to use a TF-IDF vector representation. Essentially, the claim is: the more often a term occurs on a page (TF), and the less often it occurs on other pages (IDF), the more likely that term is to be relevant to the page.
4/ e.g., the word “is” occurs frequently on most pages, so even though the TF is high, the IDF is low, giving it a low score overall. The word “koala” is relatively infrequent, so if it occurs a few times in a document it is likely to be an important part of the document.
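A minimal sketch of this weighting (helper names and the log-based IDF are illustrative assumptions; real systems vary in the exact formula):

```python
import math

def tf_idf(term, doc, corpus):
    """Weight of `term` for `doc`: term frequency in the document
    times inverse document frequency across the corpus."""
    tf = doc.count(term)                      # how often the term occurs on this page
    df = sum(1 for d in corpus if term in d)  # how many pages contain it at all
    idf = math.log(len(corpus) / (1 + df))    # rare terms get a higher weight
    return tf * idf

# "is" appears on every page (low IDF); "koala" is rare (high IDF)
corpus = [
    "the koala is a marsupial and the koala is cute".split(),
    "this is a page about something else".split(),
    "here is yet another page".split(),
]
```

With this corpus, `tf_idf("koala", corpus[0], corpus)` comes out well above `tf_idf("is", corpus[0], corpus)`, even though both terms occur twice on the page.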
5/ The score of a document relative to a query can then simply be the dot product of the two vectors (document and query). This brings up an important question - to normalize or not?
6/ Normalizing vectors helps shorter documents - while the document “Neeva neeva neeva” will have a higher term frequency than the document “neeva,” it feels incorrect to promote it because it’s also three times as long.
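A toy illustration of that trade-off (a sketch, not Neeva’s actual scorer): the raw dot product rewards the repetitive document, while cosine-style normalization scores the two documents identically.

```python
import math
from collections import Counter

def dot_score(query, doc, normalize=False):
    """Dot product of term-frequency vectors; optionally divide by
    the document vector's Euclidean length (cosine normalization)."""
    q, d = Counter(query), Counter(doc)
    score = sum(q[t] * d[t] for t in q)
    if normalize:
        score /= math.sqrt(sum(v * v for v in d.values()))
    return score

spam = ["neeva", "neeva", "neeva"]
short = ["neeva"]
dot_score(["neeva"], spam)                  # 3: repetition wins
dot_score(["neeva"], spam, normalize=True)  # 1.0: same score as the short doc
```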
8/ We tune a length normalization pivot such that “the probability of retrieval for the documents of a given length is very close to the probability of finding a relevant document of that length”. The final formula looks something like this 👇
[Image: pivoted length normalization formula]
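The original image isn’t recoverable here, but the quote above comes from the pivoted normalization literature (Singhal et al.), where the standard form divides each term’s weight by a pivoted length instead of the raw document length — something like:

```latex
\mathrm{score}(q, d) \;=\; \sum_{t \in q}
  \frac{\mathrm{tf}_{t,d} \cdot \mathrm{idf}_t}
       {(1 - s) \;+\; s \cdot \dfrac{|d|}{\mathrm{avgdl}}}
```

Here $|d|$ is the document length, $\mathrm{avgdl}$ the average document length in the corpus, and $s$ the tuned slope: $s = 0$ means no length normalization, $s = 1$ means full normalization by relative length.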
10/ A pillar of web search ranking algorithms is “what you think of yourself is much less important than what others think of you”. It’s easy to lie on the internet, but it’s hard to convince a lot of other websites (and people) of your lie.
11/ To apply this logic, we limit the contribution of the page’s own content (title, url, body) to the final score by squashing it with a sigmoid. A larger contribution instead comes from anchor text: the text on *other* webpages that links to your page.
13/ We further weight the anchors differently based on whether they came from an authoritative (e.g. high pagerank) or less authoritative source.
14/ Most importantly, the part of scoring that is hardest to manipulate is click data. For each click on a URL, we associate the search terms with that URL. The contribution of these click terms is squashed as well, but to a much higher cap.
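Sketching the bookkeeping (a hypothetical structure and URL; the thread doesn’t describe the exact implementation): each click associates the query’s terms with the clicked URL, and the raw match count later gets squashed with a higher cap than on-page text.

```python
from collections import Counter, defaultdict

click_terms = defaultdict(Counter)  # url -> {search term: click count}

def record_click(query, url):
    """On a click, associate every term of the query with the URL."""
    for term in query.lower().split():
        click_terms[url][term] += 1

def click_score(query, url):
    """Raw click-term match between a new query and a URL's click history."""
    return sum(click_terms[url][t] for t in query.lower().split())

record_click("koala facts", "example.com/koalas")
record_click("koala diet", "example.com/koalas")
click_score("koala", "example.com/koalas")  # 2
```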
15/ Simply put: 1) use pivoted DLN to fairly score documents of all lengths, and 2) apply squashing based on the importance of each section.
16/ As mentioned, textual relevance is just one of many factors that go into deciding which results to display for a search query. If you found this🧵interesting, follow us and keep an eye out for future ranking threads on things like centrality and deep learning!