Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Word Stemming
Use an existing stemmer, such as NLTK's PorterStemmer (nltk.stem.PorterStemmer), etc.
Expand contractions and abbreviations: didn’t -> did not, there’s -> there is, Mr. -> Mister, Mrs. -> ..., Ms. -> ..., etc.
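A minimal sketch of the expansion step, assuming a small hand-written lookup table. The names EXPANSIONS and expand are hypothetical, and the table holds only the three mappings spelled out above; a real table would be much longer. Note that the replacement text is inserted as written in the table, so a capitalized "Didn't" becomes a lowercase "did not".

```python
import re

# Hypothetical, deliberately tiny table: only the mappings given in the notes.
EXPANSIONS = {
    "didn't": "did not",
    "there's": "there is",
    "mr.": "Mister",
}

# Match any key as a standalone token, ignoring case.
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in EXPANSIONS) + r")(?!\w)",
    re.IGNORECASE,
)

def expand(text: str) -> str:
    """Replace known contractions/abbreviations with their long forms."""
    # Normalize the typographic apostrophe so one table covers both spellings.
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)
```

For example, expand("Mr. Smith didn’t say there's time") gives "Mister Smith did not say there is time".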
Other things
It seems hard to get useful information from 1-grams alone.
URLs in text are often important and are relatively easy to extract.
After handling URLs, you can replace “/” and “.” with spaces in the remaining text, so that joined pieces are not mistaken for genuinely long words.
Long words often contain useful information. However, be careful with words of the form “and/or”, etc., and do not confuse them with URLs.
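The order of operations described above (pull out URLs first, then split on “/” and “.”) can be sketched as follows. The URL pattern here is an assumption and is intentionally simple: it only catches http(s) links and will also swallow punctuation that directly follows a URL. The names URL_RE and split_urls are hypothetical.

```python
import re

# Assumed, simplistic URL pattern: "http://" or "https://" up to the next space.
URL_RE = re.compile(r"https?://\S+")

def split_urls(text: str):
    """Return (urls, remaining_text) with '/' and '.' turned into spaces."""
    urls = URL_RE.findall(text)
    # Remove the URLs first so their slashes and dots are not split apart.
    rest = URL_RE.sub(" ", text)
    # Now "and/or" becomes "and or" instead of looking like one long word.
    rest = re.sub(r"[/.]", " ", rest)
    # Collapse the runs of whitespace left behind by the substitutions.
    rest = " ".join(rest.split())
    return urls, rest
```

Doing the URL step first is what keeps "example.com/a.html" intact as one feature while "and/or" is still broken into two ordinary words.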
Keeping the upper/lower quantile (e.g., 5%) of long words, 2-grams, etc. is a very good idea.
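One possible reading of the quantile idea, sketched under assumptions: rank the distinct items by some score (word length here, but a frequency count would work the same way) and keep the top fraction. The helper names bigrams and top_fraction are hypothetical, and rounding up so that at least one item survives is a choice made for this sketch.

```python
import math

def bigrams(tokens):
    """Return adjacent token pairs (2-grams) in order."""
    return list(zip(tokens, tokens[1:]))

def top_fraction(items, key, fraction=0.05):
    """Keep the highest-scoring `fraction` of the distinct items."""
    # Sort alphabetically first so ties in the score come out in a stable order.
    unique = sorted(sorted(set(items)), key=key, reverse=True)
    # Assumption: round up, so at least one item is kept when any exist.
    k = max(1, math.ceil(len(unique) * fraction))
    return unique[:k]
```

With tokens = "a bb ccc internationalization dd".split(), top_fraction(tokens, key=len) keeps only the longest word. For the lower quantile, pass a key that negates the score.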