
Identifying Amazon Product Review Spam or Fraudulent Reviews

Introduction/Background

In the world of e-commerce, products are readily available for consumers to order online and have delivered to their homes. As people buy more products from platforms like Amazon or eBay, the importance of reviews has grown: customers rely on reviews and other user-generated content to judge how effective a product is. Analyzing the data in these customer reviews is therefore an increasingly important field [1, 2, 3]. With the advancement of machine learning algorithms, we can filter out spam or junk reviews to ensure that customers see trustworthy, verified reviews.

The dataset we will be using is the Amazon US Reviews dataset, more specifically the subset of Musical Instruments product reviews [4]. As of October 17th, 2023, the dataset was removed because Amazon chose to retire its public datasets [5]. We had already downloaded it, so we implemented our code against a local copy. The “Musical Instruments” subset has 904,765 data points and 15 features: ‘marketplace’, ‘customer_id’, ‘review_id’, ‘product_id’, ‘product_parent’, ‘product_title’, ‘product_category’, ‘star_rating’, ‘helpful_votes’, ‘total_votes’, ‘vine’, ‘verified_purchase’, ‘review_headline’, ‘review_body’ and ‘review_date’.
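
For reference, here is a minimal sketch of how the local copy can be loaded, assuming it is the original tab-separated file from the Amazon US Reviews distribution (the file name shown is an assumption and may differ for other copies):

```python
import pandas as pd

# Load the local Musical Instruments subset; the TSV file name is assumed here.
df = pd.read_csv(
    "amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz",
    sep="\t",
    compression="gzip",
    on_bad_lines="skip",  # skip malformed rows, if any
)
print(df.shape)  # roughly (904765, 15)
```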

Literature Review

Spam analysis has been widely studied since 2008, with “Opinion Spam and Analysis” by Jindal and Liu (2008) [10] being one of the seminal works in the field. They were the first to classify review spam into three categories (though in this project we only classify reviews as either spam or not spam):

  1. Untruthful Opinions - Reviews written to deliberately mislead readers (think undeserved positive reviews or unjust negative reviews).

  2. Brand-Only Reviews - Reviews that target the brand rather than the product the brand sells.

  3. Non-Reviews - Advertisements and other irrelevant text; any review that does not provide feedback on the product.

The paper notes that detecting class 2 and class 3 spam is relatively straightforward, whereas detecting class 1 (untruthful) spam is much harder, since such reviews cannot be identified simply by reading them.

Another important work in the field is Finding Deceptive Opinion Spam by Any Stretch of the Imagination by Ott et al. (2011) [6], which applies work from psychology and computational linguistics to automate the detection of spam in TripAdvisor reviews. Their main contributions include creating a “gold standard” opinion spam dataset by scraping TripAdvisor and asking human judges to classify the reviews as either spam or not spam. They then compare this manual labelling approach with an automated approach that featurizes reviews using part-of-speech (POS) tags, Linguistic Inquiry and Word Count (LIWC) categories, and text categorization features (unigrams, bigrams, trigrams). Surprisingly, one of their main results was that automated labelling of reviews as spam or not spam performed much better than manual human labelling, a fact we take into account in our data preprocessing step.

Other works such as [8] acknowledge that it is not possible to precisely and accurately label every review as either spam or not spam, and instead use Positive-Unlabeled (PU) Learning to perform spam classification. In PU Learning, a subset of the data is known to be positively labelled (in this case, as spam) while the rest is unlabelled, i.e. it can still be either spam or not spam.

A state-of-the-art review paper [11] notes that Yelp has a proprietary algorithm [7] that excels at spam detection, and that methods for annotating spam datasets include rule-based, human-based, algorithm-based, and paid manual review filtering. Moreover, the features considered in prior work have mostly been linguistic and text features derived from the review bodies. We attempt to expand on this by also considering Amazon-specific review metadata.

Problem Definition

Question: Is it possible to develop a system that detects and filters out fraudulent or spammy reviews, maintaining the credibility of the review system on Amazon?

Motivation: Detect and filter out fake or spammy reviews to maintain the integrity of the review system on Amazon

Data Preprocessing for Supervised Learning

We are interested in detecting whether a review is spam, and one of the primary ways of doing that is by looking at the review body. However, there are other useful fields as well, including customer ID, product ID, product title, star rating, helpful votes, total votes, whether it was a Vine review, whether the review is on a verified purchase, the review headline, and the review date. All of these rows are unlabelled, which means we will have to label reviews as spam or not ourselves before performing any supervised (or at least semi-supervised) learning.

There is significant research on spam detection from unlabelled data [6, 7, 8, 9, 10]. We focus on the methods from [4] and [5] because they also use Amazon reviews as their dataset. Also, as [10] suggested, manual human labelling is error prone and ineffective at detecting spam. We therefore use the following rule-based process for labelling spam reviews:

  1. We find duplicate reviews (reviews where either the review body or the review headline is identical) in the dataset; there are at least 60,000 of them. These are very likely to be spam, because it is very unlikely for someone to write the exact same review for different products.

We further whittle down this set by checking whether the review is a verified purchase and whether the ratio of helpful votes to total votes is greater than 80%. This ensures we do not label reviews based on the review body alone, since [10] suggested that the same review on the same product does not necessarily mean that it is a spam review.

  2. We find the customer IDs of these reviews and assume that all other reviews made by these users are also spam reviews.

  3. We then convert the review bodies into unigrams/bigrams and compare each unlabelled review’s unigrams/bigrams with those of the spam reviews; if the cosine similarity is greater than 0.99, we consider that review spam as well.

    • We featurize the unigrams/bigrams into count frequencies (Bag-of-Words) and TF-IDF vectors.
    • Since there are a huge number of unigram/bigram features, we need to reduce the dimensionality. Moreover, the matrix is sparse, so PCA cannot be applied directly. We therefore use sklearn’s TruncatedSVD to reduce the millions of features down to 150. Since this method is randomized, we fix the random seed before transforming the data. Even so, there is still a lot of unexplained variance: 150 features explain only around 68.2% of the variance for the Bag-of-Words features and 53.9% for the TF-IDF features.
      • We also limit the initial number of features to 100,000,000 and ignore unigrams/bigrams with a document frequency ratio greater than 0.99 or less than 0.01, because terms that occur too often or too rarely are not representative of the reviews.
    • We choose Truncated SVD because it is a very common dimensionality reduction method in NLP, where it is also known as Latent Semantic Analysis (LSA): TruncatedSVD Sklearn Documentation. A code sketch of this featurization pipeline is shown below.
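
To make the featurization steps above concrete, here is a minimal sketch using scikit-learn, assuming the cleaned review bodies are in a pandas Series `texts` and a boolean NumPy array `is_spam` holds the rule-based labels; the parameter values mirror those described above, while the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Unigrams and bigrams, capped at 100,000,000 features, dropping terms with a
# document frequency above 0.99 or below 0.01.
bow_vec = CountVectorizer(ngram_range=(1, 2), max_features=100_000_000,
                          max_df=0.99, min_df=0.01)
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=100_000_000,
                            max_df=0.99, min_df=0.01)
X_bow = bow_vec.fit_transform(texts)      # sparse count matrix
X_tfidf = tfidf_vec.fit_transform(texts)  # sparse TF-IDF matrix

# Truncated SVD (LSA) works directly on sparse input; the solver is
# randomized, so we fix the seed for reproducibility.
svd_bow = TruncatedSVD(n_components=150, random_state=0)
X_bow_150 = svd_bow.fit_transform(X_bow)
print("BoW explained variance:", svd_bow.explained_variance_ratio_.sum())

svd_tfidf = TruncatedSVD(n_components=150, random_state=0)
X_tfidf_150 = svd_tfidf.fit_transform(X_tfidf)
print("TF-IDF explained variance:", svd_tfidf.explained_variance_ratio_.sum())

# Label propagation: an unlabelled review whose n-gram vector is nearly
# identical (cosine similarity > 0.99) to any known spam review is also spam.
# In practice this comparison would be done in batches to keep memory manageable.
sims = cosine_similarity(X_tfidf[~is_spam], X_tfidf[is_spam])
newly_spam = sims.max(axis=1) > 0.99
```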

Before performing these steps, we clean the data by removing quotes, punctuation, digits, and stop words, and by lowercasing all text. We also drop the marketplace, product_parent, and product_category columns, since they are constant across the dataset or intuitively have no bearing on whether a review is spam.
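
A minimal sketch of this cleaning step, assuming the reviews are in a pandas DataFrame `df` with the column names listed earlier and using NLTK's English stop word list:

```python
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_review(text: str) -> str:
    """Lowercase, strip quotes/punctuation/digits, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[\"']", " ", text)                             # quotes
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)                               # digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

df["review_body"] = df["review_body"].fillna("").map(clean_review)

# Drop columns that are constant or carry no signal for spam detection.
df = df.drop(columns=["marketplace", "product_parent", "product_category"])
```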

In the end, out of the roughly 900,000 reviews, we label approximately 90,000 as spam and the rest as not spam. We remain mindful that this labelling may not be entirely correct, though it is correct with high probability.
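
To make the labelling rules above concrete, here is a minimal sketch, again assuming a DataFrame `df` with the original column names; treating verified, highly-helpful duplicates as legitimate is our reading of the whittling-down rule, and the 'Y'/'N' encoding of verified_purchase is an assumption about the local copy.

```python
# Step 1: duplicated review bodies or headlines are spam candidates.
dup_body = df.duplicated(subset="review_body", keep=False)
dup_head = df.duplicated(subset="review_headline", keep=False)
candidates = dup_body | dup_head

# Whittle down: duplicates from verified purchases with a helpful-vote ratio
# above 80% are treated as legitimate and dropped from the candidate set
# (assumption: verified_purchase is stored as 'Y'/'N' strings).
helpful_ratio = df["helpful_votes"] / df["total_votes"].clip(lower=1)
trusted = (df["verified_purchase"] == "Y") & (helpful_ratio > 0.8)
spam = candidates & ~trusted

# Step 2: every other review by a customer who wrote a spam review is spam too.
spam_customers = set(df.loc[spam, "customer_id"])
df["is_spam"] = spam | df["customer_id"].isin(spam_customers)
```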

Moreover, here are some metrics during feature reduction for both the BoW and TF-IDF features:

| Feature | Original Sparsity | Explained Variance after Reducing to 150 Features |
| --- | --- | --- |
| Bag-of-Words | 0.29 | 0.682 |
| TF-IDF | 0.29 | 0.539 |

Unsupervised Learning

Using a mock data set of only 100 data points, we implemented a simple K-means clustering on the ‘Star Rating’ and ‘Verified Purchase’ features. We provide a cluster plot of the preliminary results below, and a code sketch of the experiment follows. The next step for unsupervised learning is to apply K-means clustering to the entire cleaned data set.
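
A minimal sketch of this experiment, assuming the 100-point sample is in a pandas DataFrame `sample`; the feature encoding and k = 2 are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Encode star rating as a float and verified purchase as 0/1
# (assumption: verified_purchase is stored as 'Y'/'N' strings).
X = np.column_stack([
    sample["star_rating"].astype(float),
    (sample["verified_purchase"] == "Y").astype(float),
])
X = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
sample["cluster"] = kmeans.labels_
```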

Results And Discussion

Initial Data: [figure]

Vine and Verified Status with Relative Number of Stars: [figure]
The image above gives us a good idea of the makeup of each category in terms of the stars they receive.

Distribution of Reviews by Vine/Verified Status: [figure]

Unhelpful & Helpful Reviews: [figures]
The two graphs above show the number of helpful and unhelpful reviews and their distribution of stars.

Distribution of Helpful vs. Unhelpful Reviews: [figure]

It seems that most of the reviews are not rated as helpful, which could suggest that many of the reviews are insignificant or spam, or simply that it takes much more effort to rate a review as helpful than as not helpful.

Filtered Data: [figure]

Although this is quite similar to the initial data, you can see a small difference in some of the points that are included. One can see that, on average, most reviews hover around 4.5 stars and generally decrease in variability as time moves forward, which is interesting and unexpected, since the date should not affect how many stars a review receives.

Unsupervised Learning (K-means Clustering): [figure]

This visual representation illustrates that the unsupervised learning model has partitioned the data into discernible, well-defined clusters. This outcome emphasizes the distinctiveness of each cluster and suggests that the algorithm has captured meaningful patterns in the dataset with respect to the boolean verified feature and the rating. Since the colors vary with rating, the K-means clustering inherently groups reviews well by rating.

Unsupervised Clustering on the Feature Reduced BoW Vectors and TF-IDF Vectors:

Besides clustering the data on ratings and helpfulness, we attempt K-means clustering with K = 2 on the BoW and TF-IDF features we extracted, in the hope that spam and non-spam fall into two different clusters. Below are the results using several different clustering metrics:

Review Clustering on BoW Features Reduced to 150 Features

| Metric | Value |
| --- | --- |
| Homogeneity | 0.001 |
| Completeness | 0.002 |
| V-measure | 0.001 |
| Adjusted Rand-Index | -0.021 |
| Silhouette Coefficient | 0.614 |
| Davies-Bouldin Score | 2.368 |

Review Clustering on TF-IDF Features Reduced to 150 Features

| Metric | Value |
| --- | --- |
| Homogeneity | 0.001 |
| Completeness | 0.002 |
| V-measure | 0.001 |
| Adjusted Rand-Index | -0.021 |
| Silhouette Coefficient | 0.591 |
| Davies-Bouldin Score | 2.369 |

Unfortunately, the V-measure, which combines homogeneity and completeness, is extremely low for both clusterings, which means that spam is not easily distinguishable as a latent cluster in these features. Moreover, the Adjusted Rand-Index is below 0, meaning the clustering agrees with the spam labels even less than a random assignment would. However, the silhouette coefficient for both clusterings is noticeably greater than 0, which suggests that the clusters themselves are well separated and might correspond to a different kind of labelling. Since the Davies-Bouldin score is only meaningful relative to other clusterings of the same dataset, we cannot yet say whether these values are low; they are, however, close for both feature sets and could signal reasonable separation of clusters. In other words, the external clustering measures performed poorly, which may be due to imperfect labelling, whereas the internal measure, the silhouette coefficient, presents more promising results.
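
For reference, the metrics in the tables above can be computed with scikit-learn roughly as follows, assuming `X` is one of the 150-dimensional reduced feature matrices, `labels_true` the rule-based spam labels, and `labels_pred` the K-means cluster assignments:

```python
from sklearn import metrics

print("Homogeneity:         %.3f" % metrics.homogeneity_score(labels_true, labels_pred))
print("Completeness:        %.3f" % metrics.completeness_score(labels_true, labels_pred))
print("V-measure:           %.3f" % metrics.v_measure_score(labels_true, labels_pred))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(labels_true, labels_pred))
# Silhouette on a subsample keeps the pairwise-distance computation tractable.
print("Silhouette:          %.3f" % metrics.silhouette_score(X, labels_pred, sample_size=10000))
print("Davies-Bouldin:      %.3f" % metrics.davies_bouldin_score(X, labels_pred))
```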

Finding a better K using the elbow method:

Perhaps distinguishing the reviews by spam or not spam is not the best way to cluster them. We therefore apply the elbow method: for K ranging from 2 to 8 we compute the distortion, i.e. the average of the squared distances from each data point to its closest cluster center, and the inertia, i.e. the sum of the squared distances of samples to their closest cluster center. When these values stop decreasing sharply and the curve flattens into almost a straight line, we can choose that value of $k$, which may perform much better. A sketch of this computation is shown below.
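
A minimal sketch of the elbow computation, assuming `X` is one of the reduced feature matrices; the seed and `n_init` are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

distortions, inertias = [], []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    d = cdist(X, km.cluster_centers_)                 # distances to every centre
    distortions.append(np.mean(np.min(d, axis=1) ** 2))  # mean squared distance to closest centre
    inertias.append(km.inertia_)                      # sum of squared distances to closest centre
```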

Elbow Method for BoW features: [figures]

Elbow Method for TF-IDF features: [figures]

In both cases the distortion and inertia keep decreasing noticeably up to K = 8 (especially for the BoW features) and might decrease further, but for the sake of computational feasibility we only consider values up to K = 8. Let us now evaluate the clustering at K = 8:

BoW, 150 Features, K = 8

| Metric | Value |
| --- | --- |
| Homogeneity | 0.006 |
| Completeness | 0.002 |
| V-measure | 0.003 |
| Adjusted Rand-Index | -0.022 |
| Silhouette Coefficient | 0.093 |
| Davies-Bouldin Score | 3.701 |

TF-IDF, 150 Features, K = 8

| Metric | Value |
| --- | --- |
| Homogeneity | 0.007 |
| Completeness | 0.001 |
| V-measure | 0.002 |
| Adjusted Rand-Index | 0.004 |
| Silhouette Coefficient | 0.023 |
| Davies-Bouldin Score | 4.591 |

It seems that the internal clustering measures, the Silhouette Coefficient and the Davies-Bouldin Score, both get worse with a higher K, which suggests that a lower K is preferable. Something to note is that we now have additional Davies-Bouldin scores to compare against, which suggests that the earlier values below 3 were relatively good. Moreover, the V-measure increases slightly with the higher K, which adds further evidence that the labelling method used in data preprocessing is imperfect. One hypothesis is that spam needs to be split further into separate classes, as mentioned in [10], because different classes of spam could look significantly different. We will consider these avenues in our next steps.

Next Steps:

We will next implement supervised learning models and analyze their predictive performance. Afterwards, we will test the unsupervised and supervised models on the whole cleaned and labelled dataset. Finally, we plan to spend some time optimizing our models so that they achieve better performance (e.g., better hyperparameter selection, and further data engineering to improve the labelling and obtain better external clustering measures).

Timeline & Contribution Table (Gantt Chart)

[Figures: timeline and contribution table] Link to Gantt Chart: http://tiny.cc/fn5evz

References

  1. Haque, Tanjim Ul, Nudrat Nawal Saber, and Faisal Muhammad Shah. “Sentiment analysis on large scale Amazon product reviews.” 2018 IEEE International Conference on Innovative Research and Development (ICIRD). IEEE, 2018.

  2. Chen, Mingxiang, and Yi Sun. “Sentimental Analysis with Amazon Review Data.” (2017).

  3. Mukherjee, Anirban, et al. “Utilization of oversampling for multiclass sentiment analysis on Amazon review dataset.” 2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST). IEEE, 2019.

  4. Amazon_us_reviews : tensorflow datasets. TensorFlow. (n.d.). https://www.tensorflow.org/datasets/catalog/amazon_us_reviews#amazon_us_reviewsmusical_instruments_v1_00

  5. “FileNotFound: Does this still work?” https://huggingface.co/datasets/amazon_us_reviews/discussions/4

  6. Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 309–319, Portland, Oregon, USA. Association for Computational Linguistics.

  7. Mukherjee, Arjun et al. “What Yelp Fake Review Filter Might Be Doing?” Proceedings of the International AAAI Conference on Web and Social Media (2013): n. pag.

  8. H. Li, Z. Chen, B. Liu, X. Wei and J. Shao, “Spotting Fake Reviews via Collective Positive-Unlabeled Learning,” 2014 IEEE International Conference on Data Mining, Shenzhen, China, 2014, pp. 899-904, doi: 10.1109/ICDM.2014.47.

  9. Wijnhoven, Fons and Pieper, Anna Theres, “Review spam criteria for enhancing a review spam detector” (2018). ICIS 2018 Proceedings. 3. https://aisel.aisnet.org/icis2018/social/Presentations/3

  10. Jindal, Nitin and B. Liu. “Opinion spam and analysis.” Web Search and Data Mining (2008).

  11. Arvind Mewada, Rupesh Kumar Dewang, Research on false review detection Methods: A state-of-the-art review, Journal of King Saud University - Computer and Information Sciences, Volume 34, Issue 9, 2022, Pages 7530-7546, ISSN 1319-1578, https://doi.org/10.1016/j.jksuci.2021.07.021.