I will not dig extensively into the coding part, as it is largely self-explanatory. An ideal solution to a problem like this would be to crawl over various web pages, extract all relevant text, and then run an NLTK pipeline on top of it to extract the tags. Since I have already covered web crawling in one of my previous blogs, I will not elaborate on that here; instead, I will use a hard-coded corpus and analyse it to extract tags.
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then chunk named entities.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            # An entity subtree: record its label and the joined tokens.
            current_chunk.append(i.label() + ": " +
                                 " ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A plain token ends the current run of entities; flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
    # Flush the final chunk in case the text ends with an entity.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = ("A subsidiary of SAP SE, SAP North America oversees all business "
       "operations in the U.S. and Canada, and is headquartered in Newtown "
       "Square, PA, in the Philadelphia area. Get to know our management "
       "teams, learn about our long-term commitment to community, and find "
       "out what the SAP University Alliances program is doing to empower "
       "students at home and abroad.")
print("\n".join(get_continuous_chunks(txt)))
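The heart of the function above is how it tells entity subtrees apart from ordinary tokens: `ne_chunk` returns a sequence in which named entities are `Tree` nodes and everything else is a plain `(token, pos)` tuple. A minimal sketch of that distinction, using a hand-built chunk structure standing in for real `ne_chunk` output (the labels and tokens here are illustrative, not actual tagger results):

```python
from nltk.tree import Tree

def flatten_entities(chunked):
    # Walk a chunked sentence; Tree nodes are named entities,
    # plain (token, pos) tuples are ordinary words.
    entities = []
    for node in chunked:
        if isinstance(node, Tree):
            entities.append(node.label() + ": " +
                            " ".join(tok for tok, pos in node.leaves()))
    return entities

# Hand-built example mirroring the shape of ne_chunk output,
# so the grouping logic can be run without the NLTK model data.
chunked = [
    Tree("ORGANIZATION", [("SAP", "NNP"), ("SE", "NNP")]),
    ("oversees", "VBZ"),
    Tree("GPE", [("Canada", "NNP")]),
]
print(flatten_entities(chunked))
```

Because the entity test is just an `isinstance` check against `Tree`, the same walk works unchanged on the output of `ne_chunk` itself.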
Here's the output, with all noun forms nicely tagged.