Sentiment Analysis on the NLTK Reuters Corpus

Halit Dönmez
4 min readDec 7, 2020


Here I will show how to carry out sentiment analysis on NLTK's Reuters data set.

I will first talk briefly about sentiment analysis, then show how to get the data, and then carry out the sentiment analysis itself.

Finally, I will visualize the data. You can see the full notebook here: https://github.com/halitanildonmez/vader-sentiment-analysis/blob/main/VaderReutersSentiment.ipynb

It is a Jupyter notebook, so the output of each function is visible.

Sentiment Analysis

Simply put, sentiment analysis means analyzing a given text and deciding on the "mood" of that text.

For example, you may have some user reviews of a movie and want to know what the reviewers thought of it. You would then go over the reviews, give a score to each one, and from those scores decide the general "sentiment" toward the movie.

Valence Aware Dictionary and Sentiment Reasoner (VADER)

To do that we will use VADER, a lexicon- and rule-based sentiment analysis tool.

See here: https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

Reuters NLTK

This section is about the data from NLTK. Under the section "Reuters Corpus" you can see the news documents from Reuters, containing 1.3 million words. See https://www.nltk.org/book/ch02.html

In this notebook I will:

  • Download and parse the Reuters corpus
  • Create a pandas data frame from it
  • Use VADER to do sentiment analysis
  • Visualize the data

So let's begin!

Downloading and Parsing the Reuters Corpus

The data in the corpus is actually separated into files and words. You first have to get a file ID, and that file gives you its words, including the punctuation, as a list.

So you just have to join them together.
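To see what that join produces, here is a sketch with a hypothetical token list of the kind `reuters.words` returns (a real document is much longer):

```python
# A hypothetical token list, as reuters.words(file_id) would return it:
# words and punctuation are separate items.
tokens = ["ASIAN", "EXPORTERS", "FEAR", "DAMAGE", "FROM", "U", ".", "S", ".", "-", "JAPAN", "RIFT", "."]

# Joining with spaces gives one plain-text string per document.
document = " ".join(tokens)
print(document)
```

Note that this leaves a space before each punctuation mark, which is fine for VADER since it tokenizes the text itself anyway.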

First you have to do some imports:

import nltk
import json
import pandas as pd

# Both the Reuters corpus and the VADER lexicon need to be downloaded once,
# otherwise SentimentIntensityAnalyzer will raise a LookupError later.
nltk.download('reuters')
nltk.download('vader_lexicon')

from nltk.corpus import reuters
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Then we will get the file IDs:

fileids = reuters.fileids()

Creating a Data frame

These are just file names, each pointing to an array of words. So you can do the following:

all_reuters_words = []
for file_id in fileids:
    # Each file is a list of tokens; join them into one string per document.
    file_words = reuters.words(file_id)
    output = " ".join(file_words)
    all_reuters_words.append(output)

json_data = {"all_words": all_reuters_words}
df_sentiment = pd.DataFrame.from_dict(json_data)

The code above lets you either write the documents to a JSON file or load them into a pandas data frame.
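If you want the JSON route (this is what the `json` import is for), a minimal sketch of serializing the dictionary and reading it back:

```python
import json

# A small dictionary of the same shape as json_data in the code above.
json_data = {"all_words": ["First document text .", "Second document text ."]}

# Serialize to a JSON string; json.dump(json_data, fh) would write
# to an open file handle instead.
serialized = json.dumps(json_data)

# Loading it back recovers the same dictionary.
restored = json.loads(serialized)
```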

Let's take a look at the data frame:

df_sentiment

Doing Sentiment Analysis

So we have the data and now we can carry out sentiment analysis.

Let's first run an experiment.

vader_sentiment_analyzer = SentimentIntensityAnalyzer()
text = "This movie is really good!!"
vader_sentiment_analyzer.polarity_scores(text)

This will give the output:

{'neg': 0.0, 'neu': 0.514, 'pos': 0.486, 'compound': 0.5827}

What this means:

  • Negative, neutral, positive: these are straightforward. The sentence above is 0% negative, roughly 51% neutral and 49% positive.
  • Compound: a normalized score in [-1, 1], ranging from most extreme negative (-1) to most extreme positive (+1).

For more information read here: https://github.com/cjhutto/vaderSentiment#about-the-scoring
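The VADER documentation linked above also suggests conventional thresholds on the compound score (≥ 0.05 positive, ≤ -0.05 negative, otherwise neutral). A sketch of labeling documents that way — `label_sentiment` is a helper name I am introducing here, not part of VADER:

```python
def label_sentiment(compound: float) -> str:
    """Map a VADER compound score to a coarse label, using the
    thresholds suggested in the VADER documentation."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label_sentiment(0.5827))  # the movie-review example above -> positive
print(label_sentiment(0.0))     # -> neutral
```

You could apply this to the `compound` column later to get a categorical label per article.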

So now let's run the analysis for each document in the data frame:

scores_neg = []
scores_neu = []
scores_pos = []
scores_comp = []

for text in df_sentiment["all_words"]:
    s = vader_sentiment_analyzer.polarity_scores(text)
    scores_neg.append(float(s['neg']))
    scores_neu.append(float(s['neu']))
    scores_pos.append(float(s['pos']))
    scores_comp.append(float(s['compound']))

df_sentiment["negative"] = scores_neg
df_sentiment["neutral"] = scores_neu
df_sentiment["positive"] = scores_pos
df_sentiment["compound"] = scores_comp
df_sentiment

This will output the data frame with the four new score columns.

And that is it. Here is what I did:

I run the sentiment analysis and place the results into their respective lists. I also cast the scores to float to make them easier to visualize.

Then I create a new column in the data frame for each score, and that is it!
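Before plotting, it can be handy to summarize the new columns. A sketch using a small synthetic data frame standing in for `df_sentiment` (the real one has thousands of rows):

```python
import pandas as pd

# Synthetic scores standing in for the real VADER output.
df = pd.DataFrame({
    "negative": [0.0, 0.1, 0.2],
    "neutral":  [0.9, 0.7, 0.6],
    "positive": [0.1, 0.2, 0.2],
    "compound": [0.3, 0.1, -0.2],
})

# describe() gives count, mean, std, min, quartiles and max per column.
summary = df.describe()
print(summary.loc["mean"])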

Visualizing the Data

It is really simple to do. The only challenge is the sheer amount of data.

df_sentiment.plot(y = 'positive', figsize=[20,10])

Above is the visualization of the positive sentiment scores. It looks like there are a few rare strongly positive articles, but most are not that positive.

The good thing with this approach is that now you can play around!

df_sentiment.plot(y='negative', figsize=[20,10])
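A line plot over tens of thousands of documents is quite noisy, so binning the compound scores gives a clearer view of the distribution. A sketch using `pd.cut` on synthetic scores standing in for `df_sentiment["compound"]`:

```python
import pandas as pd

# Synthetic compound scores standing in for df_sentiment["compound"].
scores = pd.Series([-0.8, -0.1, 0.0, 0.05, 0.3, 0.6, 0.9])

# Bin into three coarse buckets (right edges inclusive) and count
# the number of documents in each bucket.
bins = pd.cut(scores, bins=[-1.0, -0.05, 0.05, 1.0],
              labels=["negative", "neutral", "positive"])
counts = bins.value_counts().sort_index()
print(counts)

# counts.plot(kind="bar") would then draw the distribution.
```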

Hope this helps in some way!
