Data Analysis of Tuscaloosa Tweets

I wanted to play around with sentiment analysis of tweets; specifically, I wanted to try the Python TextBlob library, which has a built-in sentiment property that scores whether a string reads as positive or negative. After pondering a bit, I decided it would be fun to search for tweets created specifically within the city limits of Tuscaloosa, where I am currently attending school. I wrote a script that scrapes Twitter for tweets by geolocation and then runs TextBlob on the results.

# -*- coding: utf-8 -*-
"""
Created on Wed Jul  6 15:58:58 2022

@author: austin
"""

import snscrape.modules.twitter as sntwitter # social network scraping library
import pandas as pd # so I can make a dataframe of results
from textblob import TextBlob # sentiment analysis
import time

#Tuscaloosa = geocode:33.23726448661455,-87.58279011262114,20km
query = "geocode:33.23726448661455,-87.58279011262114,20km"
tweets = []
limit = 10000000 # cap on how many results to pull

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    
    if len(tweets) == limit:
        break
    else:
        # set sentiment 
        text = tweet.content
        analysis = TextBlob(text)
        if analysis.sentiment.polarity >= 0: # treat neutral (0.0) polarity as positive
            sentiment = 'positive'
        else:
            sentiment = 'negative'
        tweets.append([tweet.date, tweet.user.username, tweet.content, sentiment])

df = pd.DataFrame(tweets, columns=['Date', 'User','Tweet', 'Sentiment'])
df.to_csv('twitter_scrape_results.csv') #save dataframe as csv

print("\014") #clear console
time.sleep(10) 
print("CSV Successfully Created")

The results were pretty interesting (I uploaded the dataset to Kaggle if anyone is interested). Sentiment stays roughly the same each year, hovering around 85% positive and 15% negative. I really would have thought negative sentiment would be much higher based on my personal observations of Twitter content; it makes me wonder whether Tuscaloosa is an unusually happy place, or whether my impressions of Twitter are skewed by negativity bias…

In any case, perhaps a more interesting bit of data is that the total number of tweets declines quite a bit each year. This raises the question: why are Tuscaloosans tweeting less often? I put the results into this Tableau dashboard, which shows just how steady and steep the decline has been.
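
If you would rather not use Tableau, the same yearly breakdown can be reproduced straight from the CSV the scraper writes. Here is a minimal sketch; the filename and the Date/Sentiment columns are exactly the ones produced by the script above.

import pandas as pd

# Read the CSV written by the scraper above
df = pd.read_csv('twitter_scrape_results.csv')
df['Date'] = pd.to_datetime(df['Date'], utc=True) # snscrape dates are timezone-aware
year = df['Date'].dt.year

# Total tweets per year
print(df.groupby(year).size())

# Share of positive vs. negative tweets per year
split = df.groupby([year, 'Sentiment']).size().unstack(fill_value=0)
print(split.div(split.sum(axis=1), axis=0).round(2))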

Update:

I decided to test a hypothesis: perhaps the high level of positive tweet sentiment is due to the fact that this is a college town, and numerous tweets were posted by official University of Alabama departments. I used OpenRefine to filter out official UA accounts, which was easy enough to do since their usernames tend to either begin with “UA_” or end with “_UA”. Surprisingly, that didn’t change the sentiment percentages at all. I now suspect that even if you account for all of the official UA Twitter accounts, you would also have to account for the fact that a large number of Tuscaloosans work for UA (45,000 employees). I know many of my professors post UA-related content from their personal Twitter accounts, and that content naturally slants positive.
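
OpenRefine made this easy, but the same filter is simple to sketch in pandas against the CSV from the scraper, in case anyone wants to reproduce it; the username pattern above is the only assumption.

import pandas as pd

df = pd.read_csv('twitter_scrape_results.csv')

# Flag usernames that look like official UA accounts ("UA_..." or "..._UA")
is_official_ua = (df['User'].str.startswith('UA_', na=False) |
                  df['User'].str.endswith('_UA', na=False))

# Sentiment split with official UA accounts removed
print(df[~is_official_ua]['Sentiment'].value_counts(normalize=True).round(2))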

Data Analysis of the MechanicalKeyboards Subreddit

Developers tend to take their keyboards seriously. I have been using classic buckling-spring IBM Model M keyboards since I first began programming. They are great to type on, and I still love them (it feels a bit like typing on a typewriter), but I recently decided to upgrade to a compact keyboard that uses modern mechanical switches. That would free up space on my desk and allow for some customization. There is a seemingly endless sea of options to choose from, though, so what is a developer to do? The first step in my consumer journey was to narrow the field down to a few top brands. I figured a good way to cut through the clutter would be to scrape the r/MechanicalKeyboards subreddit and see which brands are the most talked about right now, so I wrote this Python script that uses Reddit’s API to scrape the subreddit.

import praw # PRAW is a Python wrapper for the Reddit API
import datetime
import pandas as pd

# Connect to the Reddit API (fill in your own credentials)
reddit = praw.Reddit(client_id='', client_secret='', user_agent='')

# Scraping the posts
posts = reddit.subreddit('MechanicalKeyboards').hot(limit=None) # Sorted by hottest
 
posts_dict = {"Title": [], "Post Text": [], "Date":[],
               "Score": [], "ID": [],
              "Total Comments": [], "Post URL": []
              }

comments_dict = {"Title": [], "Comment": [], "Date":[],
              "Score": [], "ID": [], "Post URL": []
              }

for post in posts:
    # Title of each post
    posts_dict["Title"].append(post.title)
     
    # Text inside a post
    posts_dict["Post Text"].append(post.selftext)
    
    # Date of each post
    dt = datetime.date.fromtimestamp(post.created_utc) # Convert UTC to DateTime
    posts_dict["Date"].append(dt)
     
    # The score of a post
    posts_dict["Score"].append(post.score)
    
    # Unique ID of each post
    posts_dict["ID"].append(post.id)
     
    # Total number of comments inside the post
    posts_dict["Total Comments"].append(post.num_comments)
     
    # URL of each post
    posts_dict["Post URL"].append(post.url)
    
    # Now we need to scrape the comments on each post
    submission = reddit.submission(post.id)
    submission.comments.replace_more(limit=0) # replace_more(limit=0) removes all MoreComments placeholders
    
    # Use .list() method to also get the comments of the comments
    for comment in submission.comments.list(): 
        # Title of each post
        comments_dict["Title"].append(post.title)
        
        # The comment
        comments_dict["Comment"].append(comment.body)
        
        # Date of each comment
        dt = datetime.date.fromtimestamp(comment.created_utc) # Convert UTC to DateTime
        comments_dict["Date"].append(dt)
        
        # The score of a comment
        comments_dict["Score"].append(comment.score)
         
        # Unique ID of each post
        comments_dict["ID"].append(post.id)
         
        # URL of each post
        comments_dict["Post URL"].append(post.url)

# Save the data in pandas dataframes
allPosts = pd.DataFrame(posts_dict)
allComments = pd.DataFrame(comments_dict)

# Output everything to CSV files
allPosts.to_csv("MechanicalKeyboards_Posts.csv", index=True)
allComments.to_csv("MechanicalKeyboards_Comments.csv", index=True)

Reddit caps API listings at 1,000 posts, so the 1,000 posts returned by the query above are my sample. My code outputs two files: the posts themselves and, more importantly, the comments on those posts, which ended up being 9,042 rows of data. (I posted the files to Kaggle if anyone would like to play with them.) I then imported the comments dataset into OpenRefine so I could run text filters for brand names and record the number of mentions of each brand. Finally, using Tableau, I created a couple of data visualization charts to express my findings. Here are the most talked-about keyboard brands on r/MechanicalKeyboards right now:
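
If you would rather stay in Python than use OpenRefine, a rough equivalent of the text-filter step looks like this; the brand list below is illustrative, not the exact list I filtered on.

import pandas as pd

comments = pd.read_csv("MechanicalKeyboards_Comments.csv")

# Illustrative brand list; swap in whichever brands you care about
brands = ["Keychron", "Ducky", "Logitech", "Razer", "Corsair", "Leopold", "GMMK"]

mentions = {brand: comments["Comment"].str.contains(brand, case=False, na=False).sum()
            for brand in brands}
print(pd.Series(mentions).sort_values(ascending=False))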

Update:

I decided to go with a Keychron keyboard, the brand my research found to be the most discussed (and I also added Glorious Panda switches and HK Gaming PBT keycaps). Couldn’t be happier; it’s a pleasure to type on.

School Shooting Data Analysis

I came across this interesting dataset on Kaggle covering U.S. school shootings from 1990 to 2022 and decided to poke around to see if I could find any trends. Since the set was amassed from multiple sources, there were some duplicate entries, which I removed. I then filtered out incidents at colleges, leaving only K-12 data, dropped incidents that did not result in any fatalities, and visualized the results.
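
The cleanup itself is straightforward in pandas. Here is a rough sketch, with the filename and the column names ("School_Level", "Fatalities", "Year", "State", "City") standing in for whatever the Kaggle dataset actually uses.

import pandas as pd

shootings = pd.read_csv("school_shootings.csv") # stand-in filename for the Kaggle dataset

# Drop duplicate rows introduced by merging multiple sources
shootings = shootings.drop_duplicates()

# Keep K-12 incidents only, then drop incidents with no fatalities
k12 = shootings[shootings["School_Level"] != "College"]
fatal = k12[k12["Fatalities"] > 0]

# Roll-ups used for the visualizations
fatalities_by_year = fatal.groupby("Year")["Fatalities"].sum()
fatalities_by_state = fatal.groupby("State")["Fatalities"].sum().sort_values(ascending=False)
fatalities_by_city = fatal.groupby(["State", "City"])["Fatalities"].sum().sort_values(ascending=False)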

Fatalities remain fairly consistent over time: between 2 and 37 per year, with a mean of 13. Thankfully, these numbers are quite small relative to the roughly 49.4 million U.S. K-12 students. Each school shooting is horrifically tragic, of course, but it is a statistically rare occurrence. News media outlets focus on school shootings when they do happen, creating a false sense that they occur at a much higher rate (this is the principle of cultivation theory: because the news focuses disproportionately on negative incidents, people come to hold a disproportionate view of how often those events occur). To put the odds in perspective, I made a chart comparing the likelihood of a U.S. student dying in a school shooting with the odds of being struck by lightning. As you can see, you are more than twice as likely to be struck by lightning.
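
As a sanity check on that comparison, the back-of-the-envelope arithmetic looks something like this. The lightning figure is the commonly cited National Weather Service estimate of roughly 1-in-1.2-million odds of being struck in a given year; that number is my assumption, not something in the dataset.

# Back-of-the-envelope annual odds, per person
mean_fatalities_per_year = 13
us_k12_students = 49_400_000

p_shooting = mean_fatalities_per_year / us_k12_students # roughly 1 in 3.8 million
p_lightning = 1 / 1_200_000                             # assumed NWS estimate

print(f"School shooting death: about 1 in {1 / p_shooting:,.0f}")
print(f"Lightning strike:      about 1 in {1 / p_lightning:,.0f}")
print(f"Lightning is roughly {p_lightning / p_shooting:.1f}x more likely")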

While I was at it, I also broke the data down by state and city. Texas and California have had the highest number of fatalities, with 53 and 52, respectively. In Texas, cities near the southern border have been hit the hardest, with the remainder concentrated around the Dallas area. California also has pockets of increased incidents near Hollywood, Belmont, and Sacramento. The city with the largest number of fatalities, however, is Newtown, Connecticut, with 28. Newtown is an outlier, though, as it is where the infamous Sandy Hook Elementary School shooting took place, and that single incident accounts for all 28 fatalities.
