Deep Learning and Sentiment Analysis to Forecast Stock Market Volatility
CMU 11-785 Course: Deep Learning
Introduction —
Many investors and traders are interested in stock trends and volatility as they can adjust their trading and pricing strategies to make profits by predicting stock markets' price differences. While there has been extensive research analyzing and predicting stock returns using historical stock data and alternative data, many studies have neglected the importance of analyzing social media data.
We explore the effectiveness of different news sources (e.g., Webull, Reddit, Twitter) for capturing sentiment scores to predict stock price differences. We will compare the performance of using a single news data source (i.e., Webull news headlines) versus multiple news sources to identify whether diverse sources capture sentiment scores better. We will also investigate the correlation between the sentiment scores and stock price differences for two popular stocks, AMC and Zoom, during the COVID-19 period. Furthermore, the study aims to improve upon previous research by exploring keywords and tickers beyond those previously identified and by applying sentiment analysis to data from these sources, among others.
Black Swan Events: COVID 19 —
The COVID-19 pandemic brought an unprecedented level of uncertainty and volatility to the financial markets. Social media and news headlines played a critical role in driving this volatility. Widespread discussion about the uncertainty of consumer behavior had a significant impact on the stock prices of both large-cap and small-cap companies, and especially on small-cap companies, due to broader media coverage. We conducted a market cap analysis for six stocks: three small-cap stocks with a market value below \$10 billion in 2020 (GME, BNGO, AMC), and three large-cap stocks with a market value at or above \$10 billion in 2020 (ZM, MRNA, PTON).
Baseline —
Our chosen baseline, by Kolasani et al., aimed to improve on previous research that used social media and historical data to predict stock market trends and prices by implementing a Multilayer Perceptron (MLP) neural network model. The study uses a sentiment-tagged Twitter dataset of 1.6 million tweets collected from Sentiment 140 for sentiment classification. They compare the effectiveness of the MLP model with the Boosted Regression Tree model for predicting the next day’s stock movement from the present day’s tweets containing the keywords “stock market”, “StockTwits”, and “AAPL”. Our methodology improved on this baseline by collecting news and stock market details for two stocks, Zoom and AMC, from four main sources on a daily basis: Reddit, Twitter, Webull, and Yahoo Finance.
Our Implementation —
A separate train dataset was created for each news source for both stocks covering the period 2020-2021, while a separate test dataset was created for each news source for the year 2022.
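The split described above can be sketched in pandas. This is a minimal illustration rather than our exact pipeline: the `date` and `headline` column names, the `split_by_year` helper, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

def split_by_year(df, date_col="date"):
    """Split one per-source dataset into 2020-2021 training rows and 2022 test rows.

    The column name `date` is an assumption; adapt it to the scraped data.
    """
    years = pd.to_datetime(df[date_col]).dt.year
    train = df[years.isin([2020, 2021])].reset_index(drop=True)
    test = df[years == 2022].reset_index(drop=True)
    return train, test

# Hypothetical rows standing in for one news source and one stock.
sample = pd.DataFrame({
    "date": ["2020-03-16", "2021-06-02", "2022-01-04", "2022-11-30"],
    "headline": ["a", "b", "c", "d"],
})
train_df, test_df = split_by_year(sample)
```

Splitting strictly by calendar year keeps every 2022 row out of training, so the test set reflects data the model has never seen.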
The Reddit, Twitter, and Webull headlines were preprocessed for exploratory data analysis, but no preprocessing was required for the VADER sentiment analysis tool.
We calculated the price difference between the closing and opening prices of the stocks, and performed feature engineering by adding new columns to the stock dataset to help predict future stock prices. The new columns were created by shifting the "Close" price of the stock by 1 to 7 days, so that each row also carries the closing prices of the preceding days. This helps the model learn the relationship between past and future stock prices.
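A minimal pandas sketch of this feature engineering step follows. Several details are assumptions for illustration: the `add_price_features` helper and the `price_diff` and `close_lag_k` column names are hypothetical, the difference is taken as close minus open, and the shift is applied per row, which corresponds to trading days rather than calendar days.

```python
import pandas as pd

def add_price_features(stock, n_lags=7):
    """Add a close-minus-open price difference and lagged "Close" columns.

    Assumes `stock` has "Open" and "Close" columns sorted by date ascending.
    """
    out = stock.copy()
    out["price_diff"] = out["Close"] - out["Open"]  # sign convention is an assumption
    for lag in range(1, n_lags + 1):
        # shift(lag) moves values down, so each row sees the close from `lag` rows earlier
        out[f"close_lag_{lag}"] = out["Close"].shift(lag)
    # The first n_lags rows have no complete history, so they are dropped.
    return out.dropna().reset_index(drop=True)

# Synthetic prices, for illustration only.
opens = [float(p) for p in range(10, 20)]
stock = pd.DataFrame({"Open": opens, "Close": [p + 1.0 for p in opens]})
features = add_price_features(stock)
```

With 10 input rows and 7 lags, only the last 3 rows have a full week of history and survive the `dropna` call.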
Once our datasets were curated, we trained our LSTM model with dropout, using stochastic gradient descent as the optimizer and mean squared error as the loss function, for 500 epochs with a batch size of 64. The model was trained on data from January 2020 to December 2021 (the COVID period) and tested on the curated dataset from January to December 2022 (post-COVID).
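The training setup can be sketched in PyTorch as follows. This is an illustrative sketch, not our exact model: the `PriceLSTM` class, the hidden size, the dropout rate, the learning rate, and the synthetic tensors are all assumptions. Only the optimizer, the loss function, the default epoch count, and the batch size come from the description above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class PriceLSTM(nn.Module):
    """Single-layer LSTM followed by dropout and a linear regression head."""

    def __init__(self, n_features, hidden_size=32, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)        # out: (batch, seq_len, hidden_size)
        last = out[:, -1, :]         # hidden state at the final time step
        return self.head(self.dropout(last)).squeeze(-1)

def train(model, X, y, epochs=500, batch_size=64, lr=0.01):
    """Train with SGD and MSE loss; returns the mean training loss per epoch."""
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    history = []
    for _ in range(epochs):
        total = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)
        history.append(total / len(X))
    return history

# Synthetic stand-in data: 128 samples, 7 time steps, 2 features per step.
torch.manual_seed(0)
X = torch.randn(128, 7, 2)
y = X[:, -1, 0]                      # toy regression target
model = PriceLSTM(n_features=2)
losses = train(model, X, y, epochs=5)  # 5 epochs to keep the demo fast
```

Calling `model.eval()` before scoring the 2022 test set disables dropout, so predictions are deterministic at evaluation time.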
Reflection —
This project was a challenging yet rewarding experience. One of the most difficult aspects was data scraping, which required a significant amount of time and effort to gather relevant data. Prioritizing work and managing time effectively also proved crucial, as we had to balance data collection, model development, and evaluation. We also faced the challenge of comparing our model with a baseline model effectively and interpreting the results. Despite these challenges, the project provided a valuable opportunity to apply our knowledge and skills in natural language processing and deep learning to a real-world problem. Ultimately, we were able to develop models that showed promising results in sentiment analysis and stock price prediction.
Acknowledgement —
Thank you to my teammates for the support and collaboration: Moulya Sudhir, Aishwarya Agrawal, Diwen Dang. Huge shout-out to my project mentor for his guidance: Varun Jain. Lastly, I would love to thank everyone in the IDL course who reviewed and gave valuable feedback on this project.