Deep Learning for Drug Sentiment

For our course, Advanced Machine Learning for Public Policy, my team undertook a sentiment analysis project using the Natural Language Processing (NLP) and deep learning concepts we learned throughout the academic term. Building on the dataset of scraped online drug reviews collected by Grasser et al. (2018), we wrote a scraper that picks up where the original data leaves off and extends the dataset to May 2021.
While scraping the website and updating the dataset, we noticed that a large share of the reviews in the original dataset were duplicated (40%, to be precise), and we explored the implications of these duplicates for the models in previously published research. We trained a Logistic Regression (as used by Grasser et al.) on both the duplicated and deduplicated datasets (see Figures 5 and 6 below), as well as an LSTM model on both datasets. We found little difference in accuracy for the Logistic Regression; the LSTM, however, reached 10% higher accuracy on the dataset with duplicates.
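As an illustration of the duplicate check described above, here is a minimal sketch in plain Python. It is not the code from our repository: the function names (`normalize`, `duplicate_rate`, `deduplicate`, `leaked_test_fraction`), the normalization rule, and the toy reviews are all assumptions made for the example. It measures the share of repeated reviews, drops the repeats, and counts how many test reviews also appear in the training split. That overlap is one possible reason a model could score higher on data with duplicates, though this summary does not establish it as the cause.

```python
from __future__ import annotations

from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def duplicate_rate(reviews: list[str]) -> float:
    """Fraction of rows that repeat an earlier review after normalization."""
    if not reviews:
        return 0.0
    counts = Counter(normalize(r) for r in reviews)
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(reviews)


def deduplicate(rows: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only the first occurrence of each review text."""
    seen: set[str] = set()
    unique = []
    for text, label in rows:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def leaked_test_fraction(train: list[tuple[str, int]],
                         test: list[tuple[str, int]]) -> float:
    """Fraction of test reviews whose text also occurs in the training split."""
    if not test:
        return 0.0
    train_keys = {normalize(text) for text, _ in train}
    return sum(normalize(text) in train_keys for text, _ in test) / len(test)


# Toy (review, label) pairs invented for this example; 1 = positive, 0 = negative.
rows = [
    ("Worked well, no side effects.", 1),
    ("worked well,  no side effects.", 1),
    ("Gave me headaches.", 0),
    ("Gave me headaches.", 0),
    ("No change after a month.", 0),
]

rate = duplicate_rate([text for text, _ in rows])
unique_rows = deduplicate(rows)

# A naive positional split of the duplicated data, to show the overlap check.
train, test = rows[:3], rows[3:]
leak = leaked_test_fraction(train, test)

print(f"duplicate rate: {rate:.0%}")
print(f"rows after deduplication: {len(unique_rows)}")
print(f"test reviews also seen in training: {leak:.0%}")
```

Deduplicating before splitting into training and test sets keeps the same review from landing on both sides of the split. How aggressively to normalize the text before comparing (case, whitespace, punctuation) is a judgment call that depends on the data.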

The code repository can be found at this link.