Python: Word counting and TF-IDF analysis on Google+

I wrote these test snippets to fetch and analyze sentences from Google+ posts.

 

Getting and processing text data:

I used google-api-python-client to retrieve the Google+ posts of Tim O’Reilly and Chris Anderson.

# In[0]:
import httplib2
import apiclient.discovery

# Set API key.
API_KEY = ""
service = apiclient.discovery.build("plus", "v1", http=httplib2.Http(), developerKey=API_KEY)

# Get public activities of O'Reilly or Chris Anderson, at most 100 posts.
users = {'oreilly': '107033731246200681024', 'chris-anderson': '105910977869522122580'}
activity_feed = service.activities().list(userId=users['oreilly'], collection="public", maxResults=100).execute()

Then I stripped the HTML tags from the post text with BeautifulSoup4 and tokenized it with nltk. The resulting list contains, for each sentence, a list of its words.

# In[1]:
# Delete HTML tags.
from bs4 import BeautifulSoup

def cleanHtml(html):
    if html == "": return ""
    soup = BeautifulSoup(html, "html.parser")  # Specify parser explicitly.
    return soup.get_text()

sentences = [cleanHtml(i["object"]["content"]) for i in activity_feed["items"]]

# Lowercase the text and tokenize it with nltk.
import nltk

sentences_token = [nltk.word_tokenize(i.lower()) for i in sentences if len(i) != 0]
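For reference, the shape of `sentences_token` can be sketched with a simple regex stand-in (only a rough approximation of `nltk.word_tokenize`, which handles contractions and punctuation more carefully):

```python
import re

def simple_tokenize(sentence):
    # Rough stand-in for nltk.word_tokenize: lowercase words,
    # with each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

print(simple_tokenize("Data is the new oil."))
# ['data', 'is', 'the', 'new', 'oil', '.']
```

Each post thus becomes a list of lowercase tokens, punctuation included, which is what the counting functions below rely on.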

 

Word count analysis:

First, I made a histogram of sentence lengths.
The patterns are similar, but O’Reilly usually writes longer posts than Chris Anderson.

# In[2]:
import matplotlib.pyplot as plt

# Histogram of sentence lengths.
sentence_len = [len(i) for i in sentences_token]
plt.hist(sentence_len)
plt.xlabel("Length of Sentence")
plt.ylabel("Count")
Left: O’Reilly, Right: Chris.

 

Second, I made a scatter plot comparing a word’s count against sentence length.
Obviously, short posts have less room for a variety of words, so if you chose a meaningful word rather than a functional one, short posts could not be evaluated properly.
Thus I decided to look at functional words in these examples.

# Plot the count of a word against sentence length.
def countRatio(sentences_token, word):
    word_ratio = [i.count(word) for i in sentences_token]

    plt.scatter(sentence_len,word_ratio)
    plt.xlabel("Length of Sentence")
    plt.ylabel("Word Count")

 

I assumed “.” should appear more in longer posts, because longer posts commonly contain many sentences. O’Reilly’s sample seems to fit my assumption.

countRatio(sentences_token,".")
Left: O’Reilly, Right: Chris.

 

I thought “!” would be used in short posts, such as news items. O’Reilly doesn’t follow my expectation, but Chris Anderson does seem to use “!” in shorter posts.

countRatio(sentences_token,"!")
Left: O’Reilly, Right: Chris.

 

Another functional word is “?”. I had no idea whether it would appear in shorter or longer posts, and I couldn’t figure out a pattern from the result either.

countRatio(sentences_token,"?")

Left: O’Reilly, Right: Chris.

 

TF-IDF analysis:

Next, I calculated TF-IDF with scikit-learn.

# In[3]:
# TF-IDF by scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, max_df=50)
X = vectorizer.fit_transform(sentences)

# Top 10 words by IDF: (word, IDF value, index).
top_n = 10
indices = vectorizer.idf_.argsort()[::-1]  # Sort by IDF, descending.
features = vectorizer.get_feature_names()  # Word list (get_feature_names_out in newer scikit-learn).
top_features = [(features[i], vectorizer.idf_[i], i) for i in indices[:top_n]]
print(top_features)

 

The top 10 IDF words of O’Reilly are below:

[('zoos', 4.9219733362813143, 2227), ('goal', 4.9219733362813143, 829), ('golden', 4.9219733362813143, 832), ('goldman', 4.9219733362813143, 834), ('gone', 4.9219733362813143, 835), ('googley', 4.9219733362813143, 841), ('governmental', 4.9219733362813143, 845), ('grades', 4.9219733362813143, 848), ('granite', 4.9219733362813143, 850), ('graphs', 4.9219733362813143, 851)]

The top 10 IDF words of Chris Anderson are below:

[('zimmer', 4.9219733362813143, 801), ('hoeken', 4.9219733362813143, 351), ('i2c', 4.9219733362813143, 359), ('huge', 4.9219733362813143, 358), ('https', 4.9219733362813143, 357), ('house', 4.9219733362813143, 354), ('hot', 4.9219733362813143, 353), ('hook', 4.9219733362813143, 352), ('hobby', 4.9219733362813143, 350), ('hear', 4.9219733362813143, 338)]
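The identical values are no accident: with scikit-learn’s default smoothed IDF, every word that appears in exactly one document gets the same score. A quick check of the formula (assuming n = 100 documents, consistent with maxResults=100 above):

```python
import math

# TfidfVectorizer (smooth_idf=True, the default) computes:
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
# where n is the number of documents and df(t) the document frequency.
n, df = 100, 1  # 100 posts; the word appears in a single post.
idf = math.log((1 + n) / (1 + df)) + 1
print(idf)  # ~4.92197, matching the values above
```

So all of these top-ranked words simply occur in one post each; the ordering among them is arbitrary.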

 

I made a heat map of the bag of words against the sentences.

# Heat map of bag of words vs. sentences.
plt.matshow(X.toarray().T)
Left: O’Reilly, Right: Chris.

 

Lastly, I made a histogram to see how the IDF values are distributed. O’Reilly produces more high-IDF words, which suggests his posts are less simple than Chris’s.

# Histogram of IDF values.
plt.hist(vectorizer.idf_)
Left: O’Reilly, Right: Chris.
