Monday, November 12, 2012

twitteR - Twitter data collection & analysis in R

Hi all,

Session 9 - "Emerging trends in MKTR" - is going to be reading-heavy (you've been warned). Sooo many great readings and so little time. Anyway, there's this Twitter-based reading for which I'm putting up sample code below. Will ask the AAs to load the data on LMS. But before that, some background.

Some folks have asked why we stopped where we did with text analysis, when, obviously, so much more downstream analysis and processing could have been done. Sure, a lot is possible and doable in R. But class time is limited and only so much can fit in. One particular Q that came up:

"Can we do better sentiment analysis than what we just did for the session 6 HW?"
Sure, we can. It would be great if we could categorize sentiment and then classify text responses accordingly.
Here's what Wikipedia says on the subject:
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."

The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations. As businesses look to automate the process of filtering out the noise, understanding the conversations, identifying the relevant content and actioning it appropriately, many are now looking to the field of sentiment analysis. If web 2.0 was all about democratizing publishing, then the next stage of the web may well be based on democratizing data mining of all the content that is getting published.

Several research teams in universities around the world currently focus on understanding the dynamics of sentiment in e-communities through sentiment analysis. The CyberEmotions project, for instance, recently identified the role of negative emotions in driving social networks discussions. Sentiment analysis could therefore help understand why certain e-communities die or fade away (e.g., MySpace) while others seem to grow without limits (e.g., Facebook).

The problem is that most sentiment analysis algorithms use simple terms to express sentiment about a product or service. However, cultural factors, linguistic nuances and differing contexts make it extremely difficult to turn a string of written text into a simple pro or con sentiment. The fact that humans often disagree on the sentiment of text illustrates how big a task it is for computers to get this right. The shorter the string of text, the harder it becomes.

Anyway, R does sentiment analysis. Its package twitteR (the last 'R' is capital) lets you set what keywords you want mined from Twitter feeds and where in the world you want this data collected from (specify the latitude and longitude of major cities, for example, and a 50-mile radius around them), collect that data, text mine it, analyze its content, score its sentiment and more. Neat, eh? Well, that's R for you.

Now, finally, by popular demand, here is some elementary R code that I used to analyze tweeple reactions to the latest Bond movie, 'Skyfall'.

Step 1: Invoke the appropriate libraries. Ensure you have the 'twitteR', 'sentiment', 'tm', 'Snowball' and 'wordcloud' packages downloaded and installed.

library(twitteR)   # Twitter data collection
library(sentiment) # emotion & polarity classification
library(tm)        # text mining utilities
library(Snowball)  # stemming
library(wordcloud) # wordcloud plots
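(In case any of these are missing on your machine, something like the sketch below should fetch them. Note that the 'sentiment' package has been known to drop off CRAN; if install.packages() fails for it, you may need to install it from the CRAN archive or a local zip instead.)

# one-time step: install any missing packages
install.packages(c("twitteR", "tm", "Snowball", "wordcloud"))
# 'sentiment' may need a manual install from the CRAN archive if this fails
install.packages("sentiment")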

Step 2: Send R to search for and save the data you want. This step is a little involved. Please read the instructions given in the bullet points below carefully.

  • First, copy and paste the block of code below into an empty notepad. Make all edits to the code in this notepad and then copy-paste it to the R console.
  • Your PGP username and password that you use to connect to the web are required. Enter these in the code in place of 'username' and 'password' in the 'set proxy in R' step.
  • If you want specific city-based tweets only, use the geocode option in the searchTwitter() function below. For example, geocode="29.0167,77.3833,50mi" (no spaces) refers to tweets originating within a 50-mile radius around the center of Delhi.
  • In write.table(), write the collected tweets to a plain-text (notepad) file only.
  • If you ask R to save more than n=500 tweets in the searchTwitter() function, it might take up to a couple of minutes (depending on your web connection) to find and save them.
###### search in twitter #######
#set proxy in R
Sys.setenv(http_proxy = "http://username:password@172.16.0.87:8080")

# send R to go collect data
rev = searchTwitter("#skyfall", n=500, lang="en")

## -- to do location-specific searches ---
# (illustrative only; replace searchString and the since/until dates with your own values)
# searchTwitter(searchString, n=25, lang="en", since=date, until=date, geocode="38.5,81.4,50mi")

rev[1:5] #shows first 5 tweets
rev.df = twListToDF(rev) # changes tweets to data frame
# save the tweet text (column 1 of rev.df) to a file of your choice
write.table(as.matrix(rev.df[ ,1]), file.choose())
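By the way, if you want to peek at what twListToDF() produced, here's a quick sketch; the first column of rev.df holds the tweet text, which is why write.table() above saves rev.df[ ,1].

dim(rev.df) # no. of tweets collected x no. of fields per tweet
names(rev.df) # field names; 'text' (column 1) holds the tweet text
rev.df[1:3, 1] # view the text of the first 3 tweets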
Here's what the first 5 tweets that R captured looked like:

Step 3: Standard text mining stuff which we already saw in session 6. I won't go into making barplots and histograms; you can do that yourself using the session 6 code.

x = readLines(file.choose())
x1 = Corpus(VectorSource(x))
# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeNumbers)
# add content words like 'skyfall' and 'bond' to the stopword list
myStopwords <- c(stopwords('english'), "skyfall", "bond")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)
# make the doc-term matrix #
x1mat = DocumentTermMatrix(x1)
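Incidentally, since we loaded wordcloud in Step 1, here's a minimal sketch of how you might eyeball the most frequent terms at this point (the lowfreq=10 cutoff below is arbitrary; adjust to taste):

# terms occurring 10 or more times across the tweets
findFreqTerms(x1mat, lowfreq = 10)
# a quick wordcloud from the term frequencies
freq = sort(colSums(as.matrix(x1mat)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 50)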

Step 4: Invoke sentiment analysis. Classify the tweets by emotion and find the polarity (i.e., which pole - positive or negative - dominates a tweet's text) using simple functions.

## --- inspect only those tweets which
## got a clear sentiment orientation ---

library(sentiment)
a1 = classify_emotion(x1) # classify each tweet on 6 emotion dimensions
a2 = x[(!is.na(a1[,7]))] # tweets with a clear best-fit emotion (col 7 is BEST_FIT)
a2[1:10]

# what is the polarity score of each tweet? #
# that is, what's the ratio of pos to neg content? #
b1=classify_polarity(x1)
dim(b1)
# build polarities table
b1[1:5,] # view a few rows
The top 10 tweets shown are a subset of the ones which have clear emotional content in them. The bottom table shows rows from the emotional polarities table - it gives the POS score, the NEG score, the POS/NEG ratio, and then a net balance polarity for the document (in this case, a tweet) under the column BEST_FIT.
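For a one-glance summary of the polarity spread, here's a small sketch (assuming, per the table layout just described, that BEST_FIT is the 4th column of the classify_polarity() output):

# tabulate tweets by their best-fit polarity class
table(b1[ ,4])
# and a quick barplot of the same
barplot(table(b1[ ,4]), main="Polarity of #skyfall tweets")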

Step 5: Now we dive deeper into emotion classification. Six primary emotion states are available in the sentiment package's output: "anger", "disgust", "fear", "joy", "sadness" and "surprise". We classify which tweets score high on which emotion type and view a few rows of each type.

## --- convert the a1 emotion output into a regular numeric matrix
a1a = data.matrix(as.numeric(a1))
a1b = matrix(a1a, nrow(a1), ncol(a1))
# the emotion type-score matrix
a1b[1:4,] # view a few rows

# find each emotion column's modal (most frequent) value and subtract it out,
# so that only deviations from the typical score remain
mode1 <- function(x){names(sort(-table(x)))[1]} # returns the modal value of x
for (i1 in 1:6){ # for the 6 primary emotion dimensions
mode11 = as.numeric(mode1(a1b[,i1]))
a1b[,i1] = a1b[,i1] - mode11 }
summary(a1b)
a1c = a1b[,1:6]
colnames(a1c) <- c("anger", "disgust", "fear", "joy", "sadness", "surprise")
a1c[1:10,] # view a few rows
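Before pulling out individual tweets, a quick tally of how many tweets register each emotion can be handy. A small sketch, using the same nonzero-after-mode-removal logic as the tweet pulls below:

# count tweets registering each emotion (nonzero deviation from the mode)
emo.counts = colSums(a1c != 0)
emo.counts
barplot(emo.counts, main="Emotion counts in #skyfall tweets")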

## -- see the top 10 tweets in "joy" (for example)
a1c = as.data.frame(a1c); attach(a1c)
test = x[(joy != 0)]; test[1:10]
# likewise for the top few tweets in "anger" and "sadness"
test = x[(anger != 0)]; test[1:10]
test = x[(sadness != 0)]; test[1:10]
The above output shows the results I got. If you pull tweets at a later time, you will get a different set of tweets and a different set of results than what I got. To replicate my results, please use my dataset (up on LMS under skyfall_twitteR.txt).

Could more be done downstream? Can I now cluster tweets by sentiment? Do collocation dendrograms by sentiment polarity?
Sure and more.
But I will stop here for now.
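(For the curious, though, here's a bare-bones sketch of the clustering idea - k-means on the Step 5 emotion scores matrix a1c. The choice of 3 clusters below is arbitrary, purely for illustration.)

# cluster tweets on their emotion score profiles (a1c from Step 5)
set.seed(1234) # so the cluster assignments replicate
fit = kmeans(a1c, centers=3)
table(fit$cluster) # how many tweets per cluster
x[fit$cluster == 1][1:5] # view a few tweets from cluster 1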

See you in class soon. Ciao.

Sudhir

1 comment:

  1. Good day Prof. Voleti,

    I had been struggling with the Twitter API handshake with R for a long time. I had tried multiple code snippets but they seemed to give some error or the other. I recently managed to crack the code and thought I'd share it with everyone, in case others are having trouble too.

    > require(ROAuth)
    Loading required package: ROAuth
    Loading required package: RCurl
    Loading required package: bitops
    Loading required package: digest
    > require(twitteR)
    Loading required package: twitteR
    Loading required package: rjson
    > reqURL <- "http://api.twitter.com/oauth/request_token"
    > accessURL <- "http://api.twitter.com/oauth/access_token"
    > authURL <- "http://api.twitter.com/oauth/authorize"
    > consumerKey <- "YOUR CONSUMER KEY"
    > consumerSecret <- "YOUR CONSUMER SECRET"
    > twitCred <- OAuthFactory$new(consumerKey=consumerKey,
    + consumerSecret=consumerSecret,
    + requestURL=reqURL,
    + accessURL=accessURL,
    + authURL=authURL
    + )
    > twitCred$handshake(cainfo="cacert.pem", ssl.verifypeer=FALSE)
    To enable the connection, please direct your web browser to:
    http://api.twitter.com/oauth/authorize?oauth_token= "YOUR AUTHORIZATION PIN"
    When complete, record the PIN given to you and provide it here: VPVPbiutaQw3xEPRyf4yweRdP2KoqRIVzgy1JmV4Rnw
    > registerTwitterOAuth(twitCred)
    [1] TRUE



