I'm using Python to parse URLs into words. I'm having some success, but I'm trying to cut down on ambiguity. For example, given the following URL
"abbeycarsuk.com"
my algorithm outputs:
['abbey','car','suk'],['abbey','cars','uk']
Clearly the second parsing is the correct one, but the first is also technically correct (apparently 'suk' is a word in the dictionary I'm using).
What would help me out a lot is a word list that contains the frequency/popularity of each word. I could work that into my algorithm so that the second parsing is chosen (since 'uk' is more common than 'suk'). Does anyone know where I can find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to use successfully.
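For illustration, here is a minimal sketch of the ranking step I have in mind. The word_freq table and its counts are made up just to show the idea; a real table would come from whatever frequency list I find:

    # Hypothetical frequency table; real counts would come from a corpus or word list.
    word_freq = {'abbey': 4500, 'car': 98000, 'cars': 41000, 'uk': 120000, 'suk': 12}

    def score(parse):
        """Score a segmentation by the product of its word frequencies;
        words missing from the table get a small default so they are penalized."""
        total = 1.0
        for word in parse:
            total *= word_freq.get(word, 0.5)
        return total

    candidates = [['abbey', 'car', 'suk'], ['abbey', 'cars', 'uk']]
    print(max(candidates, key=score))  # ['abbey', 'cars', 'uk']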
Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself, but if such a data set already exists, it would make my life a lot easier.
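If I do end up counting words myself, I imagine it would be something like this sketch, where 'gutenberg_corpus.txt' is just a placeholder for whatever corpus file I download:

    import re
    from collections import Counter

    def build_freq_table(path):
        """Count lowercase alphabetic tokens in a plain-text corpus file."""
        with open(path, encoding='utf-8') as f:
            words = re.findall(r'[a-z]+', f.read().lower())
        return Counter(words)

    # 'gutenberg_corpus.txt' is a placeholder for a downloaded corpus file.
    # freqs = build_freq_table('gutenberg_corpus.txt')
    # print(freqs.most_common(10))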
There is an extensive article on this subject written by Peter Norvig (Google's head of research), which contains worked examples in Python and is easy to understand. The article, along with the data used in the sample programs (some excerpts of Google ngram data), can be found here. The complete set of Google ngrams, for several languages, can be found here (free to download if you live in the east of the US).
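As a rough sketch of the idea in that article (not Norvig's exact code), a unigram segmenter can be memoized over suffixes like this. COUNTS is a stand-in table with invented numbers; in practice you would load real counts from the ngram data:

    from functools import lru_cache

    # Stand-in counts; a real table would be loaded from the ngram data.
    COUNTS = {'abbey': 4500, 'car': 98000, 'cars': 41000, 'uk': 120000, 'suk': 12}
    TOTAL = sum(COUNTS.values())

    def pword(word):
        """Unigram probability; unseen words get a small smoothed value."""
        return COUNTS.get(word, 0.1) / TOTAL

    def prod(nums):
        result = 1.0
        for n in nums:
            result *= n
        return result

    @lru_cache(maxsize=None)
    def segment(text):
        """Return the most probable split of text into words, trying every
        possible first word and recursing on the remainder."""
        if not text:
            return ()
        splits = ((text[:i], text[i:]) for i in range(1, len(text) + 1))
        candidates = [(first,) + segment(rest) for first, rest in splits]
        return max(candidates, key=lambda words: prod(pword(w) for w in words))

    print(segment('abbeycarsuk'))  # ('abbey', 'cars', 'uk')

Note that Norvig's version also penalizes unseen words in proportion to their length, which matters once the candidate set gets large; the flat smoothing above is just to keep the sketch short.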