(That I used the word "repository" instead of "corpus" should signal my level of experience in this area.)
-
I don't do OCR, so I can't speak to that portion, but once you get these into a computer-readable format, it doesn't seem like too difficult a task. It depends on how fine-grained you need "similarity of questions" to be, though.
-
Yea, the second part is where I have very little experience.
-
Sounds like you got many positive responses, so I guess it's my job to give the negative. Assuming these are academic papers, it seems that sometimes even undergrads cannot extract main questions from closely read text. I'd be sceptical of NLP delivering any deep solution to this.
-
Yea. I want to do so using *explicit* questions; content summarization is too thorny (and beyond my expertise). I think the main problem is that it may introduce bias conditioned by style: I use lots of rhetorical questions, but lots of people think that's too conversational.
-
I want to sit down and do the same. Should be a relatively straightforward task. The Python data stack has way too many libs that can help. Search for term vectorization, tf-idf, and document clustering for a start.
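To make the tf-idf suggestion concrete, here's a minimal sketch in plain Python, assuming the questions have already been extracted as strings. The three example questions and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are made up for illustration, and the smoothed idf formula is one common variant, not the only one. In practice a library such as scikit-learn's `TfidfVectorizer` would handle the vectorization step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical research questions, invented for this example.
questions = [
    "How do voters respond to economic shocks?",
    "How do economic shocks change voter behavior?",
    "What explains variation in protein folding rates?",
]
vectors = tfidf_vectors(questions)

for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        print(i, j, round(cosine(vectors[i], vectors[j]), 3))
```

The pairwise scores could then feed whatever clustering method you like. Note that "voters" and "voter" count as different terms here, so stemming or lemmatizing first would probably help.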
-
(I'm flying and doing a bunch of nonsense this week, but feel free to DM me next week if I forget.)
-
Assuming you have the questions as entities, word embeddings? But embed the whole question. See, e.g., https://fasttext.cc/ for some examples (esp. the "sentence embedding" sections). Then compare the vectors via cosine similarity.
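A rough sketch of that idea, assuming a question is embedded by averaging the vectors of its words (one simple way to get a sentence vector, not necessarily what fastText does internally). The tiny 3-dimensional `TOY_VECTORS` table and the example questions are invented for illustration; real vectors would come from a pretrained model.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for this example.
TOY_VECTORS = {
    "voters": [0.9, 0.1, 0.0],
    "elections": [0.8, 0.2, 0.1],
    "respond": [0.3, 0.3, 0.1],
    "shape": [0.3, 0.4, 0.1],
    "proteins": [0.0, 0.1, 0.9],
    "fold": [0.1, 0.0, 0.8],
}


def embed_question(question, vectors=TOY_VECTORS):
    """Average the vectors of all known words; return None if none are known."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


q1 = embed_question("How do voters respond to elections?")
q2 = embed_question("What shape do elections give voters?")
q3 = embed_question("Why do proteins fold?")

print(round(cosine_similarity(q1, q2), 3))  # two questions about voters and elections
print(round(cosine_similarity(q1, q3), 3))  # unrelated topics
```

Averaging throws away word order, so two questions with the same vocabulary but different meanings will look identical. Whether that matters depends on how fine-grained the comparison needs to be.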
-
Thanks, Brent!
-
https://arxiv.org/abs/1801.04898 uses topic modeling over arXiv abstracts and then clusters them (via similarity metrics). Might have some useful info in it.