Identifying Amazon Product Review Spam or Fraudulent Reviews

Introduction/Background

In e-commerce, consumers can easily order products online and have them delivered to their homes. As more people buy from platforms like Amazon or eBay, reviews have become more important: shoppers rely on customer reviews and other user-generated content to judge how well a product works. Analyzing the data in those reviews has therefore become an essential field [3]. With advances in machine learning, we can filter out spam or junk reviews so that customers see only trustworthy ones.

The objective of this project is to use Natural Language Processing, specifically sentiment analysis, to identify spam or fraudulent reviews. Sentiment analysis can be instrumental in getting an overview of a passage of text [2]. This proposal first explains how sentiment analysis extracts the underlying sentiment of a text review, then presents our proposed method of using it to identify fake reviews. In most cases, sentiment classification uses two classes, positive and negative, or sometimes three classes with an additional neutral class [3].

Problem Definition

Question: Is it possible to develop a system that detects and filters out fraudulent or spammy reviews, maintaining the credibility of the review system on Amazon?

Motivation: Detect and filter out fake or spammy reviews to maintain the integrity of the review system on Amazon.

Methods

The dataset we will use is Amazon US Reviews, specifically the subset of Musical Instruments product reviews [4]. To make the data usable, we will clean and preprocess it before applying unsupervised and supervised learning models. Feature engineering may include text length, sentiment scores, and user behavior analysis. For unsupervised learning, we plan to use K-means and Gaussian Mixture Models (GMM). For supervised learning, we plan to use Naive Bayes and Support Vector Machines (SVM). Our approach may change as we start implementing different ML algorithms, and we will adapt accordingly.
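As a rough illustration of the feature-engineering step, the sketch below turns one raw review into a few numeric features (text length, word count, punctuation use, and a crude sentiment score). The word lists, the `ReviewFeatures` class, and the `extract_features` function are hypothetical names invented for this example; the actual project would likely use a trained sentiment model or an established lexicon instead of the toy one shown here.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicons, for illustration only. They are NOT part of the
# dataset or of any library; a real pipeline would use a proper sentiment model.
POSITIVE_WORDS = {"great", "love", "excellent", "perfect", "amazing"}
NEGATIVE_WORDS = {"bad", "poor", "broken", "terrible", "awful"}


@dataclass
class ReviewFeatures:
    char_length: int
    word_count: int
    exclamation_count: int
    sentiment_score: float  # in [-1, 1]; 0.0 when no lexicon word appears


def extract_features(review_text: str) -> ReviewFeatures:
    """Turn one raw review into a few simple numeric features."""
    # Lowercase and keep only alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", review_text.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    matched = positive + negative
    # Net polarity of the matched lexicon words; neutral if none matched.
    sentiment = (positive - negative) / matched if matched else 0.0
    return ReviewFeatures(
        char_length=len(review_text),
        word_count=len(words),
        exclamation_count=review_text.count("!"),
        sentiment_score=sentiment,
    )


if __name__ == "__main__":
    print(extract_features("Great strings, love the tone!"))
```

Feature vectors of this kind could then be fed to the clustering and classification models named above; user-behavior features (for example, reviews per account) would need the dataset's metadata columns and are not sketched here.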

Potential Results And Discussion

Since this is a binary classification task, we will evaluate our classification models with the common metrics of accuracy, recall, precision, and F1 score. Precision and recall in particular reflect the type I errors (false positives) and type II errors (false negatives) of our models. We will also look at each model's confusion matrix to determine its specific weaknesses. For unsupervised learning, we plan to use clustering; a good metric for checking whether the data is clustered correctly is the adjusted Rand index (adjusted_rand_score), which compares cluster assignments against reference labels. We plan to compute these metrics on both the training and the testing data. We hope to achieve an accuracy above 50%, because 50% is the expected accuracy of uniformly random guessing on a binary task.
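To make these metrics concrete, here is a minimal pure-Python sketch of how they can be computed from labels and predictions. The function names `binary_metrics` and `adjusted_rand_index`, and the convention that label 1 means "fake", are assumptions made for this example. In practice, scikit-learn's `sklearn.metrics` module offers equivalents (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `confusion_matrix`, `adjusted_rand_score`).

```python
from collections import Counter
from math import comb


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics; label 1 is assumed to mean 'fake'."""
    pairs = Counter(zip(y_true, y_pred))
    tp, fp = pairs[(1, 1)], pairs[(0, 1)]
    fn, tn = pairs[(1, 0)], pairs[(0, 0)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": fp,  # type I errors: genuine reviews flagged as fake
        "false_negatives": fn,  # type II errors: fake reviews that were missed
    }


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference labeling and a clustering."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Count pairs of points that share a cell, a true class, or a cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single group
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(binary_metrics(y_true, y_pred))
    # Cluster ids are arbitrary, so a relabeled but identical grouping still agrees.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the adjusted Rand index ignores how clusters are numbered, it suits K-means and GMM output, where cluster ids carry no meaning on their own.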

Timeline & Contribution Table (Gantt Chart)

Branching

Checkpoints

Here is the dataset we plan on using for this project: https://www.tensorflow.org/datasets/catalog/amazon_us_reviews#amazon_us_reviewsmusical_instruments_v1_00

References

[1] Haque, Tanjim Ul, Nudrat Nawal Saber, and Faisal Muhammad Shah. “Sentiment analysis on large scale Amazon product reviews.” 2018 IEEE International Conference on Innovative Research and Development (ICIRD). IEEE, 2018.
[2] Chen, Mingxiang, and Yi Sun. “Sentimental Analysis with Amazon Review Data.” (2017).
[3] Mukherjee, Anirban, et al. “Utilization of oversampling for multiclass sentiment analysis on amazon review dataset.” 2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST). IEEE, 2019.
[4] Amazon_us_reviews: TensorFlow Datasets. TensorFlow. (n.d.). https://www.tensorflow.org/datasets/catalog/amazon_us_reviews#amazon_us_reviewsmusical_instruments_v1_00
[5] Tarasov, K. (2020, September 6). Amazon is filled with fake reviews and it’s getting harder to spot them. CNBC. https://www.cnbc.com/2020/09/06/amazon-reviews-thousands-are-fake-heres-how-to-spot-them.html

CS 7641 Final Project Proposal

By Carolina Hau Loo, Desiree Dominguez, Lucas Luwa, James Wellington, Matthew Chen