Topic Modeling

This notebook contains a very simple implementation of topic modeling: taking a document and automatically identifying which of a set of predefined topics it describes.

Import Modules

The first thing we need to do is import some modules.

In [1]:
import sys
# Make packages installed in the local environment importable.
sys.path.append('env/lib/python3.6/site-packages/')
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial
import numpy as np
import nltk, string
import warnings
warnings.filterwarnings("ignore")

Helper Functions

In this section, we define two helper functions for pre-processing our text. The first is a stemmer, which converts full words to shortened stems so that related word forms compare as equal. The second is a normalizer, which lowercases the text, removes punctuation, performs "tokenization" (splitting the text into a Python list of individual words), and then stems each token.

In [2]:
stemmer = nltk.stem.porter.PorterStemmer()
def stem_tokens(tokens):
    # Reduce each token to its Porter stem (e.g. "interviews" -> "interview").
    return [stemmer.stem(item) for item in tokens]

remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def normalize(text):
    # Lowercase, strip punctuation, tokenize into words, then stem each token.
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
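
As a quick sanity check on these helpers, we can run a short phrase through normalize. This sketch assumes the NLTK "punkt" tokenizer data is already downloaded (nltk.download('punkt') fetches it if not), and the stems shown in the comment are only an approximation of the Porter stemmer's output.

# Sanity-check the normalizer; assumes the NLTK 'punkt' data is available
# (run nltk.download('punkt') once if word_tokenize complains it is missing).
print(normalize('We conducted interviews with the property owner.'))
# Roughly: ['we', 'conduct', 'interview', 'with', 'the', 'properti', 'owner']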

Topic Definitions

In order to classify documents by topic, we need to define the topic names and topic content. We'll use topics related to Phase I Environmental Site Assessment (ESA) reports.

In [3]:
topicnames=['Owner Interview',
            'Aerial Photographs',
            'City Records',
            'REC',
            'Resumes and Qualifications'
           ]
topiccontent=['We conducted interviews with the property owner',
             'We included photos from aerial reconnaissance',
             'There were local records from cities in our research',
             'We found a recognized environmental condition on the property',
             'The appendix has resumes and qualifications of the professional who prepared the report']

# Combine each topic name with its content into a single string per topic.
topic_combination=[m+' '+n for m,n in zip(topicnames,topiccontent)]
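
To see what the vectorizer will actually be fit on, you can print one of the combined strings; each entry is simply the topic name concatenated with its content.

# Each combined topic is just '<name> <content>', e.g. the first entry:
print(topic_combination[0])
# Owner Interview We conducted interviews with the property owner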

Vectorizer

We'll convert text to numeric vectors so that we can perform mathematical operations on it later. For vectorization, we'll use a simple method called TF-IDF (term frequency-inverse document frequency), fit on our defined topics. We'll also take this chance to remove "stop words": extremely common English words that carry little meaning and would otherwise "distract" the comparison.

In [4]:
stopwords="a able about across after all almost also am among an and any are as at be because been but by can cannot could dear did do does either else ever every for from get got had has have he her hers him his how however i if in into is it its just least let like likely may me might most must my neither no nor not of off often on only or other our own rather said say says she should since so some than that the their them then there these they this tis to too twas us wants was we were what when where which while who whom why will with would yet you your ain't aren't can't could've couldn't didn't doesn't don't hasn't he'd he'll he's how'd how'll how's i'd i'll i'm i've isn't it's might've mightn't must've mustn't shan't she'd she'll she's should've shouldn't that'll that's there's they'd they'll they're they've wasn't we'd we'll we're weren't what'd what's when'd when'll when's where'd where'll where's who'd who'll who's why'd why'll why's won't would've wouldn't you'd you'll you're you've"
# The stop words are passed through normalize() so they are stemmed the same way
# as the document tokens; ngram_range=(1, 2) keeps both unigrams and bigrams.
clf = TfidfVectorizer(ngram_range=(1, 2), tokenizer=normalize, stop_words=normalize(stopwords))
clf.fit(topic_combination)
Out[4]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words=['a', 'abl', 'about', 'across', 'after', 'all',
                            'almost', 'also', 'am', 'among', 'an', 'and', 'ani',
                            'are', 'as', 'at', 'be', 'becaus', 'been', 'but',
                            'by', 'can', 'can', 'not', 'could', 'dear', 'did',
                            'do', 'doe', 'either', ...],
                strip_accents=None, sublinear_tf=False,
                token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function normalize at 0x7feb9007a9d8>, use_idf=True,
                vocabulary=None)
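
If you want to check which stemmed unigrams and bigrams survived stop-word removal, you can inspect the fitted vocabulary. The method name depends on the scikit-learn version: older releases (like the one used here) provide get_feature_names(), while newer ones replace it with get_feature_names_out(); the sketch below tries both.

# Inspect the vocabulary of stemmed unigrams and bigrams learned from the topics.
# Older scikit-learn versions expose get_feature_names(); newer ones use
# get_feature_names_out(), so try the newer name first and fall back.
try:
    features = clf.get_feature_names_out()
except AttributeError:
    features = clf.get_feature_names()
print(len(features), 'features, for example:', list(features)[:5])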

Calculate Similarity Metrics

In this section, we'll use our vectorizer to convert text to numeric vectors. Then, we'll compute the cosine similarity (one minus the cosine distance) between each defined topic and our query document.

In [5]:
document='This ESA is for the property at 12400 Hwy 71. In this document, you can read some recognized environmental conditions. You can also see a list of the qualifications in an appendix.'
tfidf_reports = clf.transform(topic_combination).todense()
tfidf_question = clf.transform([document]).todense()

# Score each topic against the query document: cosine similarity is one minus
# the cosine distance between the two TF-IDF vectors.
row_similarities = [1 - spatial.distance.cosine(tfidf_reports[x], tfidf_question) for x in range(len(tfidf_reports))]
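
To make the metric concrete, here is a small, self-contained sketch of how cosine similarity behaves: one minus the cosine distance is about 1.0 for vectors pointing in the same direction, 0.0 for orthogonal vectors, and somewhere in between for partial overlap, which is exactly how the topic scores below should be read.

# A tiny worked example of cosine similarity (1 - cosine distance).
a = np.array([1.0, 1.0, 0.0])
b = np.array([2.0, 2.0, 0.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a
d = np.array([1.0, 0.0, 1.0])   # partially overlaps a
print(1 - spatial.distance.cosine(a, b))   # ~1.0 (identical direction)
print(1 - spatial.distance.cosine(a, c))   # ~0.0 (no overlap)
print(1 - spatial.distance.cosine(a, d))   # ~0.5 (partial overlap)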

Printing Output

Our output will consist of topic names, together with the similarity measurements we've calculated for each topic.

In [6]:
print(list(zip(topicnames,row_similarities)))
print('**************Final output*************')
print('The most relevant topic is:')
print(topicnames[np.argmax(row_similarities)])
[('Owner Interview', 0.057817799723204444), ('Aerial Photographs', 0.0), ('City Records', 0.0), ('REC', 0.5887003944688504), ('Resumes and Qualifications', 0.29676952010975954)]
**************Final output*************
The most relevant topic is:
REC
We can wrap the same calculation in a reusable function that takes a query document and a list of topic strings, and returns the similarity of the query to each topic.

In [7]:
def topicmodel(query, allreports):
    # Vectorize the topics and the query, then score each topic against the
    # query with cosine similarity (1 - cosine distance).
    tfidf_reports = clf.transform(allreports).todense()
    tfidf_question = clf.transform([query]).todense()
    row_similarities = [1 - spatial.distance.cosine(tfidf_reports[x], tfidf_question) for x in range(len(tfidf_reports))]
    print(row_similarities)
    return row_similarities

similarities=list(zip(topicnames,topicmodel(document,topic_combination)))
print(similarities)
[0.057817799723204444, 0.0, 0.0, 0.5887003944688504, 0.29676952010975954]
[('Owner Interview', 0.057817799723204444), ('Aerial Photographs', 0.0), ('City Records', 0.0), ('REC', 0.5887003944688504), ('Resumes and Qualifications', 0.29676952010975954)]
Finally, we'll report only the topics whose similarity to the document clears a small threshold.

In [8]:
# Keep only topics with similarity above 0.1.
similarities2=[sim for sim in similarities if sim[1]>0.1]
print('Topics that Appear in Our Document')
print('')
for sim in similarities2:
    print('The topic '+str(sim[0])+' appears in the document, with relevance  '+str(round(sim[1],3)))
Topics that Appear in Our Document

The topic REC appears in the document, with relevance  0.589
The topic Resumes and Qualifications appears in the document, with relevance  0.297