Classifying and Performing Sentiment Analysis on Twitter Data

Gustavo Alejandro Chavez
7 min read · Mar 22, 2021
Photo by Alexander Shatov on Unsplash

Twitter. What other platform can compare to Twitter in the amount of sudden news, hype, or vitriol it can produce? Twitter is the go-to place for sharing one's own thoughts or those of others. Personally, I am fascinated by how people communicate and by the ways the advent of the digital age has changed and revolutionized communication.

One brief example is the criticism of architecture. Since the rise of the internet, just about anyone who wants to critique architecture can do so simply by posting, whether on a blog, a social media platform, or anywhere else opinions may be shared. While some may view this as a bad thing, from the perspective of a producer-consumer relationship, the ability to receive near-instant feedback and critique, warranted or not, is a powerful thing only achievable in this modern era.

It is for this reason that I decided to do my fourth Flatiron School project on Natural Language Processing and analyzing Twitter data. What people say about a product or brand has an effect on said product/brand. When properly analyzed, this can be used to one’s advantage and possibly help identify what is succeeding and what needs more attention.

The Data

The data I used for this project can be found on data.world and contains a little over 9,000 tweets, where each tweet has an attached sentiment as well as the product that the emotion was directed towards. By looking at the data, we can deduce some information about when it was gathered. One of the most prevalent hashtags among the tweets was #SXSW, referring to South by Southwest, a film, music, and interactive media festival that typically occurs in mid-March in Austin, TX. Furthermore, we can deduce that this data is specifically from 2011, because Google Circles (which later became part of Google+) is mentioned fairly often throughout the tweets. Considering that the average lifespan of a tweet is only about 18 minutes, we can also assume that these tweets were gathered around the time of the event itself rather than long afterwards.

Preprocessing and Natural Language ToolKit (NLTK)

One of the most important steps of this project, and probably my favorite, was the preprocessing. With text data, this step is typically more involved than with numerical/categorical data because of challenges that arise in language that aren't present in numbers. For example, if the word 'run' were to appear, how do we know if every 'run' refers to the same kind? Someone could go on a run, while someone else has a run of good luck. While the spelling is the same, each 'run' is unique in meaning and usage. Other challenges and decisions concern the uniqueness of words. For example, should 'goat' and 'goats' be the same word or separate, since they refer to the same thing and one just happens to be plural? By default, Python would treat them as two distinct strings!

NLTK provides solutions for these questions, in addition to having tools that remove stop-words. Stop-words are words that occur very frequently simply because they are grammatically necessary; if they aren't removed, they will almost always appear as the most common words within any body of text. If you are interested in using NLTK, here are the imports for the various NLTK-related parts that I used in my project.

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.collocations import *
#additional imports for preprocessing
import string
import re
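
To make the 'goat'/'goats' and stop-word ideas above concrete, here is a quick sketch of what these tools do, using the imports above; the sentence is made up for illustration:

#Lemmatization reduces plurals to their root word
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('goats'))  # -> 'goat'

#Stop-word removal drops the grammatically necessary filler words
stopwords_list = stopwords.words('english')
words = word_tokenize('the goats are at the festival')
print([w for w in words if w not in stopwords_list])
# -> ['goats', 'festival']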

For preprocessing the tweets, I created a function that uses two different regular expressions. One expression removes all mentions and punctuation, while the other tokenizes the text by gathering all the words. Afterwards, I removed the stop-words and performed lemmatization to reduce any plurals in the tweets to their root words. Below is the function I wrote to apply to the tweets.

#Creating function that will preprocess and tokenize the data
def preprocess(X):
    """Takes in str X and processes it to tokens
    """
    #Lowercases everything
    X = X.lower()

    #Removes all mentions and all punctuation
    #(re.escape keeps the punctuation characters from breaking the character class)
    subpattern = f'(@[a-z0-9_]*)|[{re.escape(string.punctuation.replace("@", ""))}]'
    replacer = re.compile(subpattern)
    X = replacer.sub('', X)

    #Tokenizes the text. Written this way so that it also pulls words with numbers
    tokenpattern = '([0-9]*[a-z]+[0-9]*[a-z]*)'
    tokenizer = re.compile(tokenpattern)
    X = tokenizer.findall(X)

    #Removes stopwords (plus the event hashtag, which shows up in nearly every tweet)
    stopwords_list = stopwords.words('english') + ['sxsw']
    X = [word for word in X if word not in stopwords_list]

    #Lemmatizes each token, then rejoins the tokens into a single string
    lemmatizer = WordNetLemmatizer()
    X = [lemmatizer.lemmatize(word) for word in X]
    X = ' '.join(X)

    return X
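
As a quick sanity check, here is the function applied to a made-up tweet (the handle and text are hypothetical, not taken from the dataset):

#Example usage on a hypothetical tweet
sample = '@somehandle The goats at #SXSW were great!'
print(preprocess(sample))  # -> 'goat great'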

Word Clouds

If the preprocessing wasn't my favorite part of this project, then creating the word clouds as a portion of my EDA would be. They are so much fun! Shoutout to DataCamp and Duong Vu, who wrote an easy-to-follow tutorial on creating word clouds. Word clouds are a fun and intuitive way of visualizing word frequency within a body of text: simply put, the more often a word appears, the larger that word will be within the cloud. I've seen these used with live Twitter data during events such as the Grammys, where the cloud follows hashtags to show what people are buzzing about during the event.

Since I had already separated my data into positive and negative emotion, I also wanted to separate it by company (just Apple and Google in this set) and see what words were frequently used in the positive and negative tweets for each. In addition to the regular stop-words, I removed an additional set of words for each company's word clouds that were heavily present in both positively and negatively rated tweets. For the sake of brevity, this post will only include the positive and negative word clouds I produced for Google. For Google, one thing worth noting is that if people were talking about Google, they were mostly talking about Google+, which was what was being unveiled during the event. (If you are interested in seeing my full work, there is a link to my GitHub for this project at the bottom.)
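
For reference, here is a minimal sketch of how such a cloud can be generated with the wordcloud library that the tutorial covers; google_pos is a hypothetical stand-in for all of the preprocessed positive tweets about Google joined into one string, and the parameters are illustrative rather than my exact settings:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

#google_pos is a hypothetical string of all preprocessed positive Google tweets
cloud = WordCloud(background_color='white', max_words=100).generate(google_pos)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()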

Word Cloud for positively-rated tweets that were directed at Google

It is wonderful how many different conclusions our brains can draw when given just a word and a sentiment through which to view it. There are a lot of generally positive words here, such as 'great', 'best', 'fun' and 'awesome', but then there are words like 'party' that carry inherently positive connotations while, in this context, also conveying a sense of the kind of atmosphere that was created. Word association aside, this cloud does give us a general sense that people had a variety of positive things to say about Google+ at the time.

Word Cloud for negatively-rated tweets that were directed at Google

Before diving into the words shown in this cloud, I do want to note that the number of positive tweets was significantly higher than the number of negative tweets for both Apple and Google. In the word cloud of negative tweets, we see a lot of words with downright bad connotations. Words like 'suck', 'fail' and 'lost' are negative in nature and need no additional context. Considering how large 'product' is in this cloud, we can infer that the people who tweeted or retweeted negative tweets about Google during the event were specifically displeased with Google+, thinking of it as an unneeded product at best and a completely disconnected failure at worst. Since these tweets are now well in the past, we know with hindsight that Google+ ultimately did not succeed, and some of its flaws may have been visible as early as its unveiling.

Modeling

Before putting the data into a classifier, I used the TF-IDF vectorizer found within the scikit-learn package and used TruncatedSVD to reduce the dimensionality of the data. I used LogisticRegressionCV, SVC, RandomForest, and XGBoost classifiers to run basic models on the data. After getting a baseline for the models, I applied SMOTE to deal with the class imbalance found in the dataset and tuned my classifiers based on the macro f1-score. I chose the macro f1-score because the f1-score is the harmonic mean of precision and recall, and the macro average weights every class equally. Evaluating on accuracy alone can be misleading when a dataset has class imbalance, which is another reason I chose the macro f1-score. Below is a table with the initial and final macro f1-scores and accuracies.
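
To give a sense of how these pieces fit together, here is a minimal sketch of this kind of pipeline; it uses imbalanced-learn's Pipeline so that SMOTE is only applied during fitting, and the variable names and parameter values are illustrative assumptions rather than my exact settings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

#X_train, X_test, y_train, y_test are hypothetical splits of the preprocessed tweets
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=100)),
    ('smote', SMOTE(random_state=42)),
    ('clf', SVC()),
])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print(f1_score(y_test, preds, average='macro'))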

Table with Macro F1 scores and Accuracies on the classifiers used

My best model was the SVC (support vector classifier). A final accuracy of 0.90 is pretty good!

Thank you for reading! The link to my GitHub which contains my full code for this project can be accessed by clicking here if you are interested in viewing it.

