Amazon Reviews Analysis: Unlocked Mobile Phones

VincentWei

天地间，浩然正气长存，为天地立心，为生民立命，为往圣继绝学，为万世开太平！

免责声明：网站内容仅供个人学习记录，禁做商业用途，转载请注明出处。

版权所有 © 2017-2020 NEUSNCP个人学习笔记辽ICP备17017855号-2

Amazon Reviews Analysis: Unlocked Mobile Phones

VincentWei 2020年3月2日 15:16:36

Introduction

Merchants selling products through ecommerce often received a high amount of customers reviews too large in scale for human processing. These reviews often have important business insights that can be leveraged to perform actions that can improve profits. In this project we analyze ~400,000 mobile phone reviews from Amazon.com aiming to find trends and patterns to determine which product characteristics are mentioned most by customers and with what sentiment. Our task is performed in six steps: (1) pre-processing to prepare the data for analysis including tokenization and part-of-speech tagging, (2) product names standardization, (3) characteristics extraction, (4) reviews filtering to remove reviews considered as outliers, unbalanced or meaningless, (5) sentiment extraction for each product-characteristic and (6) performance analysis to determine the accuracy of the model where we evaluate characteristic extraction separately from sentiment scores.

Methodology

A flowchart of the project, including the approach, performance and final business analysis is presented below:

1. Pre-procesing

This part includes:

1.1 Tokenization

Applied to both product names and reviews. It involves removal of stopwords, treating stemming of words, case-folding, removing characters that are not alphanumeric and breaking at whitespace.

Synonyms

Synonyms were grouped together as a means of dimensionality reduction, with manually inputted gazetteer with most common synonyms (for example the words “camera”, “video”, “display” are all transformed into “camera”).

Negation

It was important to handle negation for sentiment analysis so that negated opinion words could be reversed when computing its score. This method comes from Das and Chen 2001 - basically appending the suffix '_NEG' to every word appearing between a negation and a clause-level punctuation mark (such as comma). The built-in function sentiment.util.mark_negation from NLTK package was used without considering double negation.

Spelling correction Because reviews are hand-typed the function 'spell' from the 'autocorrect package' was used to treat misspellings but also considering a manually inputted gazetteer to ignore special cases (for example the word “microsd” was incorrectly being transformed into “micros”).

1.2 Part of Speech tagging

POS tagging was critical for three reasons.

(1) To find adjectives which were all considered as opinion words (as well as others exceptions that will be discussed in next sections),

(2) to extract it’s sentiment score since words have different polarity depending on their POS tag and

(3) to extract products characteristics where Nouns (NN) and Noun-phrases (NNP) were considered as potential candidates.

The function pos_tag from NLTK package was used for this task.

1.3 Vector Space Model and TF * IDF transformation

Vector Space Model

A vector space model was created based on a normalized (by euclidean distance) Term-Document-Matrix via bags-of-words for both product names as well as reviews in preparation for clustering purposes. For the first to standardize product names and for the latter to filter reviews.

Inverse Document Frequency

Another normalized TDM was constructed this time using TF*IDF weightings for each product name term. Its purpose was to determine which potential terms could be considered as standardized product names. The higher the IDF value the more important to be a potential part of the standardized name since the most commons words such as “unlocked”, “black” or “dual-core” should be avoided (and they have low IDF scores).

Load Libraries

from IPython.display import display

import timeit

from collections import defaultdict

import math

import numpy as np

import pandas as pd

import random

import seaborn as sns

from matplotlib import pyplot as plt

import matplotlib.dates as md

%matplotlib inline

import operator

from sklearn.model_selection import train_test_split

from sklearn.cluster import KMeans

from bs4 import BeautifulSoup

import re

import nltk

from nltk import word_tokenize

from nltk.corpus import sentiwordnet as swn

from nltk.corpus import wordnet as wn

from nltk.corpus import wordnet

from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer

from nltk import sentiment

from autocorrect import spell # For spelling correction

from urllib import request

Load alternative for WordNet

url_pos = r'https://raw.githubusercontent.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/master/data/opinion-lexicon-English/positive-words.txt'

url_neg = r'https://raw.githubusercontent.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/master/data/opinion-lexicon-English/negative-words.txt'

pos_list = request.urlopen(url_pos).read().decode('utf-8')[1:]

pos_list = pos_list[pos_list.find("a+"):].split("\n")

neg_list = request.urlopen(url_neg).read().decode('ISO-8859-1')[1:]

neg_list = neg_list[neg_list.find("2-faced"):].split("\n")

Load and correct Test Data

# The initial format of he annotated test_set is difficult to read

# as a dataframe, transformation to .csv format is computed first

# with regular expressions.

test = open('data/annotated_test_set.txt','r', encoding='utf8')

test_file = test.read()

test.close()

test_file[:200]

test_file = re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), test_file)

test_file = test_file.replace(';', "%")

test_file = test_file.replace(',', ";")

test_file = test_file.replace('%', ",")

test_file = test_file.replace('{', "{'")

test_file = test_file.replace(',', ",'")

test_file = test_file.replace(':', "':")

test_file = test_file.replace("},'", "}")

# Once fixed, save and load:

text_file = open("data/annotated_test_set_corrected.csv", "w")

for row in test_file.split(",\n"):

    text_file.write(row)

    text_file.write("\n")

text_file.close()

test = open('data/annotated_test_set_corrected.csv','r', encoding='utf8')

test_file = test.read()

test.close()

test = pd.read_csv('data/annotated_test_set_corrected.csv', delimiter = ";")

test.columns = ['review_id', 'Product', 'Sentiments_test']

Load Amazon Reviews Data

https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/downloads/Amazon_Unlocked_Mobile.csv

df = pd.read_csv('data/Amazon_Unlocked_Mobile.csv', delimiter = ",")

n = len(df)

df.columns = ['Product', 'Brand', 'Price', 'Rating', 'Review', 'Votes']

df['id_col'] = range(0, n)

n_reviews = 1000 # Let's get a sample

keep = sorted(random.sample(range(1,n),n_reviews))

keep += list(set(test.review_id)) # this are the reviews annotated for test

df = df[df.id_col.isin(keep)]

n_reviews = len(df)

df['id_new_col'] = range(0, n_reviews)

df.head()

Out[8]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2
73	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ...	0.0	73	3
75	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	4	The keys are a little hard to hit, and I didn'...	0.0	75	4

Sample review:

id_prod = 69

for val in df[df.id_col == id_prod].Review:

    print(val)

Nokia Asha 302 Unlocked GSM Phone with 3.2MP Camera, Video, QWERTYDependableTraditional Nokia Menu'sNot Complicated like 'Smart Phones'DurableEasy to use on Straighttalk, Internet, WiFi, Bluetooth.

Create functions

def get_tokens(df, stem = False, negation = False):

    stemmer = PorterStemmer()

    stop = set(stopwords.words('english'))

    reviews = []

    i = 1

    for review in df["Review"]:

        tokenized_review = []

        review = str(review).lower() # lowercase

        # Remove every character except A-Z, a-z,space

        # and punctuation (we'll need it for negation)

        review = re.sub(r'[^A-Za-z /.]','',review)

        # mark_negation needs punctuation separated by white space.

        review = review.replace(".", " .")

        tokens = word_tokenize(review)

        for token in tokens:

            # Remove single characters and stop words

            if (len(token)>1 or token == ".") and token not in stop:

                if stem:

                    tokenized_review.append(stemmer.stem(get_synonym(token)))

                else:

                    tokenized_review.append(get_synonym(token))

        if negation:

            tokenized_review = sentiment.util.mark_negation(tokenized_review)

        # Now we can get rid of punctuation and also let's fix some spellings:

        tokenized_review = [correction(x) for x in tokenized_review if x != "." ]

        reviews.append(tokenized_review)

        if i%100 == 0:

            print('progress: ', (i/len(df["Review"]))*100, "%")

        i = i + 1

    return reviews

def get_pos(tokenized_reviews):

    tokenized_pos = []

    for review in tokenized_reviews:

        tokenized_pos.append(nltk.pos_tag(review))

    return tokenized_pos

def get_frequency(tokens):

    term_freqs = defaultdict(int)

    for token in tokens:

        term_freqs[token] += 1

    return term_freqs

def get_tdm(tokenized_reviews):

    tdm = []

    for tokens in tokenized_reviews:

        tdm.append(get_frequency(tokens))

    return tdm

def normalize_tdm(tdm):

    tdm_normalized = []

    for review in tdm:

        den = 0

        review_normalized = defaultdict(int)

        for k,v in review.items():

            den += v**2

        den = math.sqrt(den)

        for k,v in review.items():

            review_normalized[k] = v/den

        tdm_normalized.append(review_normalized)

    return tdm_normalized

def get_all_terms(tokenized_reviews):

    all_terms = []

    for tokens in tokenized_reviews:

        for token in tokens:

            all_terms.append(token)

    return(set(all_terms))

def get_all_terms_dft(tokenized_reviews, all_terms):

    terms_dft = defaultdict(int)

    for term in all_terms:

        for review in tokenized_reviews:

            if term in review:

                terms_dft[term] += 1

    return terms_dft

def get_tf_idf_transform(tokenized_reviews, tdm, n_reviews):

    tf_idf = []

    all_terms = get_all_terms(tokenized_reviews)

    terms_dft = get_all_terms_dft(tokenized_reviews, all_terms)

    for review in tdm:

        review_tf_idf = defaultdict(int)

        for k,v in review.items():

            review_tf_idf[k] = v * math.log(n_reviews / terms_dft[k], 2)

        tf_idf.append(review_tf_idf)

    return tf_idf

def get_idf_transform(tokenized_reviews, tdm, n_reviews):

    idf = []

    terms_dft = defaultdict(int)

    all_terms = get_all_terms(tokenized_reviews)

    for term in all_terms:

        for review in tokenized_reviews:

            if term in review:

                terms_dft[term] += 1

    for review in tdm:

        review_idf = defaultdict(int)

        for k,v in review.items():

            review_idf[k] = math.log(n_reviews / terms_dft[k], 2)

        idf.append(review_idf)

    return idf

def correction(x):

    ok_words = ["microsd"]

    if x.find("_NEG") == -1 and x not in ok_words: # Don't correct if they are negated words or exceptions

        return spell(x)

    else:

        return x

def get_synonym(word):

    synonyms = [["camera","video", "display"],

                ["phone", "cellphone", "smartphone", "phones"],

               ["setting", "settings"],

               ["feature", "features"],

               ["pictures", "photos"],

               ["speakers", "speaker"]]

    synonyms_parent = ["camera", "phone", "settings", "features", "photos", "speakers"]

    for i in range(len(synonyms)):

        if word in synonyms[i]:

            return synonyms_parent[i]

    return word

def get_similarity_matrix(similarity, tokenized_reviews):

    similarity_matrix = []

    all_terms = get_all_terms(tokenized_reviews)

    for review in similarity:

        similarity_matrix_row = []

        for term in all_terms:

            similarity_matrix_row.append(review[term])

        similarity_matrix.append(similarity_matrix_row)

    return similarity_matrix

# EXECUTE

tic=timeit.default_timer()

tokenized_reviews = get_tokens(df, stem = False, negation = False)

tokenized_pos = get_pos(tokenized_reviews)

tdm = get_tdm(tokenized_reviews)

vsm = normalize_tdm(tdm)

tf_idf = get_tf_idf_transform(tokenized_reviews, tdm, n_reviews)

toc=timeit.default_timer()

print("minutes: ", (toc - tic)/60)

progress:  8.865248226950355 %
progress:  17.73049645390071 %
progress:  26.595744680851062 %
progress:  35.46099290780142 %
progress:  44.32624113475177 %
progress:  53.191489361702125 %
progress:  62.056737588652474 %
progress:  70.92198581560284 %
progress:  79.7872340425532 %
progress:  88.65248226950354 %
progress:  97.51773049645391 %
minutes:  2.6501673290627097

Let's see a sample of:

Tokenized reviews
Part of speech
Term-Document-Matrix (TDM)
TD*IDF transformation

lookup_review = 1

for val in df[df.id_new_col == lookup_review]["Review"]: print(val)

display(tokenized_reviews[lookup_review])

display(tokenized_pos[lookup_review])

display(tdm[lookup_review])

display(tf_idf[lookup_review])

Nokia Asha 302 Unlocked GSM Phone with 3.2MP Camera, Video, QWERTYDependableTraditional Nokia Menu'sNot Complicated like 'Smart Phones'DurableEasy to use on Straighttalk, Internet, WiFi, Bluetooth.

['Nokia',
 'Asha',
 'unlocked',
 'grm',
 'phone',
 'imp',
 'camera',
 'camera',
 'qwertydependabletraditional',
 'Nokia',
 'menusnot',
 'complicated',
 'like',
 'smart',
 'phonesdurableeasy',
 'use',
 'straighttalk',
 'internet',
 'WiFi',
 'Bluetooth']

[('Nokia', 'NNP'),
 ('Asha', 'NNP'),
 ('unlocked', 'VBD'),
 ('grm', 'JJ'),
 ('phone', 'NN'),
 ('imp', 'NN'),
 ('camera', 'NN'),
 ('camera', 'NN'),
 ('qwertydependabletraditional', 'JJ'),
 ('Nokia', 'NNP'),
 ('menusnot', 'NN'),
 ('complicated', 'VBN'),
 ('like', 'IN'),
 ('smart', 'JJ'),
 ('phonesdurableeasy', 'NN'),
 ('use', 'NN'),
 ('straighttalk', 'NN'),
 ('internet', 'NN'),
 ('WiFi', 'NNP'),
 ('Bluetooth', 'NNP')]

defaultdict(int,
            {'Asha': 1,
             'Bluetooth': 1,
             'Nokia': 2,
             'WiFi': 1,
             'camera': 2,
             'complicated': 1,
             'grm': 1,
             'imp': 1,
             'internet': 1,
             'like': 1,
             'menusnot': 1,
             'phone': 1,
             'phonesdurableeasy': 1,
             'qwertydependabletraditional': 1,
             'smart': 1,
             'straighttalk': 1,
             'unlocked': 1,
             'use': 1})

defaultdict(int,
            {'Asha': 9.139551352398794,
             'Bluetooth': 5.969626350956481,
             'Nokia': 12.66439286068238,
             'WiFi': 5.010268335453827,
             'camera': 7.393215713100131,
             'complicated': 10.139551352398794,
             'grm': 6.439111634257702,
             'imp': 10.139551352398794,
             'internet': 5.817623257511432,
             'like': 3.0627357553479633,
             'menusnot': 10.139551352398794,
             'phone': 0.9179642311339886,
             'phonesdurableeasy': 10.139551352398794,
             'qwertydependabletraditional': 10.139551352398794,
             'smart': 5.554588851677638,
             'straighttalk': 8.554588851677638,
             'unlocked': 4.554588851677638,
             'use': 3.2940613014544184})

In [ ]:

1. Pre-procesing

This part includes:

1.4 Product Names Standardization

Merchants often name their products in different ways, for example “iPhone 4 32GB Black, AT&T” and “iPhone 4 16GB Gold, Verizon”. This tasks objective was to add a new standard name that for this case should simply be “iPhone 4”.

Three different approaches were combined,

(1) manually inputted set/gazetteer with words to be removed,

(2) IDF importance score and

(3) Clustering.

The first step cleaned the names through the (1) gazetteer, removing colors names and common terms (such as “unlocked”).

With the remaining terms (2) selected only the first 5 terms with the highest IDF in the group (none common terms).

Finally with the remaining terms performed the (3) Clustering using a VSM matrix with k=N°_reviews/2 clusters. This number was approximated through trial and error validating with visualization since did not know a priori how many product names were in the dataset. The number of clusters was a trade-off between having different products in the same standardized names (low number of clusters) which was highly undesirable and having too many standardized names that couldn’t standardize properly (for example having “iPhone 4 32GB” and “iPhone 4 16GB”).

A sample output can be seen in figure 2.

The approach attempted was taking advantage of POS tagging. The hypothesis was that nouns (NN and NNP) could be potential terms in a standardized product name, however NLTK tagging couldn’t grab some of the terms as NN which were the most important ones (for example model names such as “A850” were tagged as verbs).

Load Function

def get_product_tokens(df):

    stop = set(stopwords.words('english'))

    products = []

    i = 1

    for product in df["Product"]:

        tokenized_product = []

        product = product.lower() # lowercase

        # Remove every character except A-Z, a-z,space

        # and punctuation (we'll need it for negation)

        product = re.sub(r'[^0-9A-Za-z \.]','',product)

        # Only consider first 10 words of the product names

        tokens = word_tokenize(product)[:11]

        for token in tokens:

            # Remove stop words

            if token not in stop:

                tokenized_product.append(token)

        products.append(tokenized_product)

        if i%100 == 0:

            print('progress: ', (i/len(df["Product"]))*100, "%")

        i = i + 1

    return products

tokenized_products = get_product_tokens(df)

products_tokenized_pos = get_pos(tokenized_products)

products_tdm = get_tdm(tokenized_products)

products_tf_idf = get_tf_idf_transform(tokenized_products, products_tdm, n_reviews)

products_idf = get_idf_transform(tokenized_products, products_tdm, n_reviews)

progress:  8.865248226950355 %
progress:  17.73049645390071 %
progress:  26.595744680851062 %
progress:  35.46099290780142 %
progress:  44.32624113475177 %
progress:  53.191489361702125 %
progress:  62.056737588652474 %
progress:  70.92198581560284 %
progress:  79.7872340425532 %
progress:  88.65248226950354 %
progress:  97.51773049645391 %

Based on IDF we will get only the words that have the highest importance. The hypothesis is that the most common will be words typical in many smarphones such as "black, unlocked, dual, etc." **
Unfortunately we can't filter through POS, it fails to grab the most important words. For example A850 is tagged as Verb when in fact it's the model of a smartphone (the main word to have in the product name) **
We can assume that we will not lose the brand (which might not be grabed) since we do have it in a second column.

Visualization for analysis below:

lookup_product = 53

display(df[df.id_new_col== lookup_product]["Product"])

# we want to grab those with higher scores (least common terms)

display(sorted(products_idf[lookup_product].items(),

               key=operator.itemgetter(1), reverse = True))

# Unfortunately we can't filter through POS

display(products_tokenized_pos[lookup_product])

10170    Apple iPhone 4 8GB, White, for Straight Talk, ...
Name: Product, dtype: object

[('straight', 9.139551352398794),
 ('talk', 9.139551352398794),
 ('contract', 6.33219643034119),
 ('4', 4.8916238389552085),
 ('8gb', 4.4671260104272985),
 ('white', 3.052088511148454),
 ('apple', 2.2569083030369526),
 ('iphone', 2.202913413396223)]

[('apple', 'NN'),
 ('iphone', 'NN'),
 ('4', 'CD'),
 ('8gb', 'CD'),
 ('white', 'JJ'),
 ('straight', 'JJ'),
 ('talk', 'NN'),
 ('contract', 'NN')]

In [19]:

colors = ["black", "red", "blue", "white", "gray", "green","yellow", "pink", "gold"]

common_terms = ["smarthphone", "phone", "cellphone", "retail", "warranty",

                "silver", "bluetooth", "wifi", "wireless", "keyboard", "gps",

               "original", "unlocked", "camera", "certified", "international",

               "actory", "packaging", "us", "usa", "international", "refurbished",

               "phones", "att", "verizon", "-", "8gb", "16gb", "32gb", "64gb", "contract"]

def standardize_names(products_idf, colors, common_terms):

    standard_names = []

    brands = [str(x).lower() for x in set(df.Brand)]

    for product in products_idf:

        for k, v in product.items():

            # Remove color and brand words

            if k in colors or k in common_terms or k in brands:

                product[k] = 0

        # Grab the first 5 words with highest score

        product = sorted(product.items(), key=operator.itemgetter(1), reverse = True)[:5]

        standard_names.append(product)

        tokenized_standard_product_names = []

    for product in standard_names:

        product_name = []

        for word in product:

            if word[1] > 0:

                product_name.append(word[0])

        tokenized_standard_product_names.append(product_name)

    return tokenized_standard_product_names

standard_product_names = standardize_names(products_idf, colors, common_terms)

product_tdm = get_tdm(standard_product_names)

product_vsm = normalize_tdm(product_tdm)

product_vsm[1]

Out[20]:

defaultdict(int,
            {'3.2mp': 0.4472135954999579,
             '302': 0.4472135954999579,
             'asha': 0.4472135954999579,
             'qwerty': 0.4472135954999579,
             'video': 0.4472135954999579})

CLUSTER PRODUCT NAMES

similarity = product_tdm

product_names_clusters = int(round(n_reviews/2,0))

similarity_matrix = pd.DataFrame(get_similarity_matrix(similarity, standard_product_names), columns = get_all_terms(standard_product_names))

kmeans = KMeans(n_clusters=product_names_clusters, random_state=0).fit(similarity_matrix)

clusters=kmeans.labels_.tolist()

clustered_matrix = similarity_matrix.copy()

clustered_matrix['product_name_cluster'] = clusters

clustered_matrix['id_col'] = range(0, n_reviews)

display(clustered_matrix[:5])

count_clusters = pd.DataFrame(clustered_matrix.product_name_cluster.value_counts())

display(count_clusters[:5])

	...	product_name_cluster	id_col
0	...	18	0
1	...	18	1
2	...	18	2
3	...	18	3
4	...	18	4

5 rows × 924 columns

	product_name_cluster
28	17
30	14
167	13
27	12
9	12

ASSIGN CLUSTER PRODUCT NAMES

df["cluster_name"] = list(clustered_matrix.product_name_cluster)

def create_standard_name(df):

    new_names = defaultdict(int)

    current_names = df.groupby('cluster_name').first().Product

    for i in set(clusters):

        cluster_name = df[df.cluster_name == i].Product.value_counts().index[0]

        new_name = []

        for word in cluster_name.split():

            temp_word= re.sub(r'[^0-9A-Za-z \.\-]','',word).lower()

            if temp_word not in colors and temp_word not in common_terms :

                new_name.append(word)

        new_names[i] = ' '.join(new_name)

    new_standard_names = []

    for row in df.cluster_name:

        new_standard_names.append(new_names[row])

    df["Standard_Product_Name"] = new_standard_names

    return df

df = create_standard_name(df)

df.head()

Out[23]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
73	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ...	0.0	73	3	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
75	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	4	The keys are a little hard to hit, and I didn'...	0.0	75	4	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...

Sample for 'iPhone'

df[["Product","Standard_Product_Name"]][df['Product'].str.contains("iPhone")][:8]

Out[26]:

	Product	Standard_Product_Name
3968	Apple A1533 Unlocked iPhone 5S Smart Phone, 16...	Apple A1533 iPhone 5S Smart 16 GB
7237	Apple iPhone 4 16GB (Black) - AT&T	Apple iPhone 4
7283	Apple iPhone 4 16GB (Black) - AT&T	Apple iPhone 4
7592	Apple iPhone 4 16GB (Black) - AT&T	Apple iPhone 4
8913	Apple iPhone 4 32GB (Black) - Verizon	Apple iPhone 4
9022	Apple iPhone 4 32GB (Black) - Verizon	Apple iPhone 4
9541	Apple iPhone 4 32GB (White) - Verizon	Apple iPhone 4
9739	Apple iPhone 4 8GB Unlocked- Black	Apple iPhone 4

2.1 Characteristics Extraction

Two steps were taken to extract the main characteristics from reviews:

(1) manually inputted set/gazetteer with words to be removed or included and

(2) identifying NN/NNP POS tagged terms that exceeded a specific threshold (set in 1%*N°_reviews) of reviews occurrences.

Load functions and shortcuts

def get_all_terms_pos_dft(all_terms, terms_dft):

    all_terms_pos = nltk.pos_tag(all_terms)

    i = 0

    for k, v in terms_dft.items():

        all_terms_pos[i] = all_terms_pos[i] + (v,)

        i+=1

    return all_terms_pos

def get_threshold_terms(all_terms_pos_dft, threshold = 20):

    threshold_terms = []

    for term in all_terms_pos_dft:

        if term[0] in exceptions_to_consider or (term[2] >= threshold and term[1] in ["NN", "NNS", "NNP", "NNPS"] and term[0] not in exceptions_not_to_consider):

            threshold_terms.append(term)

    return threshold_terms

exceptions_to_consider = ["apps", "android", "buttons", "hardware", "wifi",

                         "audio", "speed", "settings", "charger", "design",

                         "price", "look", "trackball", "microsd", "speaker"]

exceptions_not_to_consider = ["phone", "cool", "love", "awesome", "tell",  'tell',

 'feels',

  'works',

 'excelente',

 'item',

 'get',

 'iPhone',

 'dont',

 'lot',

 'let',

 'money',

 'brand',

 'recommend',

 'issues',

 'cant',

 'nothing',

 'number',

 'check',

 'month',

 'husband',

 'need',

 'note',

 'venezuela',

 'give',

 'Samsung',

 'see',

 'turn',

 'pocket',

 'amazing',

 'hands',

 'couldnt',

 'fast',

 'condition',

 'super',

 'today',

 'star',

 'life',

 'anyone',

 'storage',

 'speaker',

 'internet',

 'delivery',

 'picture',

 'games',

 'hand',

 'model',

 'glass',

 'case',

 'micro',

 'sound',

 'mp',

 'watch',

 'grm',

 'try',

 'line',

 'thing',

 'isnt',

 'thanks',

 'Verizon',

 'experience',

 'box',

 'scratches',

 'problems',

 'waste',

 'bottom',

 'company',

 'bit',

 'youre',

 'lack',

 'deal',

 'pay',

 'i',

 'reason',

 'issue',

 'couple',

 'option',

 'beautiful',

 'mobile',

 'replacement',

 'wasnt',

 'way',

 'days',

 'loves',

 'trouble',

 'quick',

 'someone',

 'glad',

 'weeks',

 'ones',

 'something',

 'market',

 'galaxy',

 'apple',

 'havent',

 'download',

 'time',

 'lg',

 'send',

 'home',

 'years',

 'product',

 'change',

 'people',

 'review',

 'price',

 'simple',

 'person',

 'lasts',

 'user',

 'hold',

 'please',

 'reviews',

 'work',

 'thats',

 'text',

 'im',

 'end',

 'thank',

 'look',

 'cost',

 'months',

 'buying',

 'point',

 'version',

 'web',

 'times',

 'Nokia',

 'problem',

 'wouldnt',

 'performance',

 'products',

 'minutes',

 'customer',

 'order',

 'guess',

 'things',

 'everything',

 'week',

 'play',

 'daughter',

 'anything',

 'purchase',

 'ok',

 'year',

 'stars',

 'day',

 'wife',

 'son',

 'doesnt',

 'blackberry',

 'hours',

 'return',

 'use']

all_terms = get_all_terms(tokenized_reviews)

terms_dft = get_all_terms_dft(tokenized_reviews, all_terms)

all_terms_pos_dft = get_all_terms_pos_dft(all_terms, terms_dft)

threshold_terms = get_threshold_terms(all_terms_pos_dft, threshold = 0.01 * n_reviews)

threshold_terms[:10]

Out[31]:

[('voice', 'NN', 12),
 ('design', 'NN', 13),
 ('microsd', 'NN', 7),
 ('online', 'NN', 15),
 ('connection', 'NN', 13),
 ('service', 'NN', 36),
 ('warranty', 'NN', 22),
 ('os', 'NN', 13),
 ('WiFi', 'NNP', 35),
 ('network', 'NN', 22)]

characteristics = [x[0] for x in threshold_terms]

characteristics[:10]

Out[32]:

['voice',
 'design',
 'microsd',
 'online',
 'connection',
 'service',
 'warranty',
 'os',
 'WiFi',
 'network']

2.2 Filtering

To effectively reduce the dataset size and improve performance, we need to filter out unusable, misleading and noisy reviews through 4 methods described below. In the end the dataset was reduced by 77% from an initial ~1500 reviews.

POS Filter Reviews without an adjective POS tag are removed since sentiment orientation is extracted only from adjectives.

Wordnet Filter Reviews with descriptive words not recognised by Wordnet or other sentiment lexicons are also pruned.

Rough Sentiment Analysis Filter To filter misleading reviews, we first conduct a rough sentiment analysis on individual opinion words, giving them a score of -1 or 1, and the overall score of a review, which is the sum of scores of all adjectives in the review. If there are less than three times the number of positive adjectives than that of negative reviews, or vice versa, then we assume the review is noisy and filter it out. Additionally, we assume that the sum of all reviews being positive, zero, or negative corresponds to a rating of >=3, 3 and below 3 respectively. Thus reviews not satisfying this equality condition against the rating are pruned.

Clustering Filter Performing clustering through a raw and normalized (VSM) TDM, the best results were obtained through VSM since it managed to obtain more diverse clusters - TDM was biased to create clusters based on the amount/frequency of words (clustering almost based exclusively on the lengths of the reviews).

Characteristics Filter The last step aims to keep only the reviews that have at least one characteristic. Since the objective of the project is determining why a product is good or bad through their characteristics sentiment instead of just computing the sentiment score of the review which can already be derived by the rating.

# first import 1000 rows in dataframe

to_prune = [i+1 for i in range(n_reviews)]

ratings = list(df['Rating'])

def get_wordnet_pos(pos):

    for tag in [('J','ADJ'),('V','VERB'),('N','NOUN'),('R','ADV')]:

        if pos.startswith(tag[0]):

            return getattr(wordnet,tag[1])

    else:

        return 'None'

def get_adj(review):

    with_adj = [tup for tup in review if tup[1] == 'JJ']

    return with_adj

# score for each word

def senti(synset):

    s = swn.senti_synset(synset).pos_score() - swn.senti_synset(synset).neg_score()

    if s>=0:

        return 1

    else:

        return -1

adjs = {x.name().split('.', 1)[0] for x in wn.all_synsets('a')}

### 1. prune reviews without adjectives recognised by wordnet

def prune_adj(tokenized_pos):

    for k in [i for i in to_prune if i!=0]:

        if not len(get_adj(tokenized_pos[k-1])) or not all(i[0] in adjs for i in get_adj(tokenized_pos[k-1])):

                to_prune[k-1] = 0

    return to_prune

### 2. prune by number of pos and neg adj

# list of scores for each review

def slist(tokenized_pos):

    score = []

    for k in [i for i in to_prune if i!=0]:

        r = get_adj(tokenized_pos[k-1])

        tag = [get_wordnet_pos(tuple[1]) for tuple in r]

        synsets = [r[i][0] + '.' + tag[i] + '.01' for i in range(len(r))]

        score.append([senti(i) for i in synsets])

    return score

def balance(score_list):

    m=-1

    for k in [i for i in to_prune if i!=0]:

        m+=1

        s = score_list[m]

        if 1 in s and -1 in s and max([s.count(1),s.count(-1)])/min([s.count(1),s.count(-1)]) <= 3:

            to_prune[k-1] = 0

    return to_prune

### 3. prune by average score compared to rating score

def average_score(score_list):

    m = -1

    for k in [i for i in to_prune if i!=0]:

        m += 1

        s = score_list[m]

        #(sum >=0, then rating >=3)

        if sum(s)>=0 and (sum(s)+1)*(ratings[k-1]-2.5)<=0:

            to_prune[k-1] = 0

        elif sum(s)<0 and (sum(s)+1)*(2.5-ratings[k-1])<=0:

            to_prune[k-1] = 0

    return to_prune

# initialise index to_prune = [1,2,3,...,1000]

to_prune = [i+1 for i in range(n_reviews)]

#to_prune = list(set(df.id_col))

to_prune = prune_adj(tokenized_pos)

score_list = slist(tokenized_pos)

to_prune = balance(score_list)

to_prune = average_score(score_list)

# len([i for i in to_prune if i!=0])

to_keep = [i for i in to_prune if i!=0]

to_keep += list(df[df.id_col.isin(list(set(test.review_id)))].id_new_col) # this are the reviews annotated for test

to_keep = list(set(to_keep))

df_filtered = df[df.id_new_col.isin(to_keep)]

df_filtered[:3]

Out[53]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...

len(list(df_filtered[df_filtered.id_col.isin(list(set(test.review_id)))].id_new_col))

Out[54]:

to_keep = list(df_filtered.id_new_col)

clustering filter

n_reviews = len(to_keep)

tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)

tokenized_pos = get_pos(tokenized_reviews)

tdm = get_tdm(tokenized_reviews)

vsm = normalize_tdm(tdm)

tf_idf = get_tf_idf_transform(tokenized_reviews, tdm, n_reviews)

similarity = vsm #vsm # tdm

similarity_matrix = pd.DataFrame(get_similarity_matrix(similarity, tokenized_reviews), columns = get_all_terms(tokenized_reviews))

similarity_matrix[:10]

progress:  48.54368932038835 %
progress:  97.0873786407767 %

Out[56]:

	grm	...	copy	wider
0	0.000000	...	0.000000	0.00000
1	0.204124	...	0.000000	0.00000
2	0.000000	...	0.000000	0.00000
3	0.000000	...	0.000000	0.00000
4	0.000000	...	0.000000	0.00000
5	0.000000	...	0.000000	0.07785
6	0.000000	...	0.000000	0.00000
7	0.000000	...	0.000000	0.00000
8	0.000000	...	0.000000	0.00000
9	0.000000	...	0.102062	0.00000

10 rows × 1578 columns

kmeans = KMeans(n_clusters=int(round(math.sqrt(n_reviews),0)), random_state=0).fit(similarity_matrix)

clusters=kmeans.labels_.tolist()

# clustered_matrix = pd.DataFrame(tf_idf_matrix, clusters)

clustered_matrix = similarity_matrix.copy()

clustered_matrix['cluster'] = clusters

clustered_matrix['id_col'] = to_keep

display(len(clustered_matrix))

display(clustered_matrix[:5])

top_clusters = pd.DataFrame(clustered_matrix.cluster.value_counts())

display(top_clusters)

	grm	...	cluster	id_col
0	0.000000	...	0	0
1	0.204124	...	5	1
2	0.000000	...	0	2
3	0.000000	...	0	3
4	0.000000	...	1	4

5 rows × 1580 columns

	cluster
13	59
0	28
5	25
3	17
1	17
11	11
10	10
12	8
7	6
6	6
4	6
8	5
2	5
9	3

limit = top_clusters.cluster.quantile(0.3)

cluster_filter = top_clusters[top_clusters.cluster > limit]

display(cluster_filter)

list(cluster_filter.index)

	cluster
13	59
0	28
5	25
3	17
1	17
11	11
10	10
12	8

Out[58]:

[13, 0, 5, 3, 1, 11, 10, 12]

df_filtered["cluster"] = list(clustered_matrix.cluster)

df_filtered[:3]

C:\Program Files\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Out[59]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name	cluster
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	5
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0

to_keep = list(df_filtered[df_filtered.cluster.isin(list(cluster_filter.index))].id_new_col)

to_keep += list(df[df.id_col.isin(list(set(test.review_id)))].id_new_col) # this are the reviews annotated for test

to_keep = list(set(to_keep))

df_filtered = df_filtered[df_filtered.id_new_col.isin(to_keep)]

df_filtered[:3]

Out[60]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name	cluster
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	5
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0

Characteristic Filter

The idea is to consider only reviews that have at least one characteristic

def filter_with_characteristics(df_filtered, characteristics):

    tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)

    to_keep_in = []

    j = 0

    for i in df_filtered.id_col:

        for token in tokenized_reviews[j]:

            if token in characteristics:

                to_keep_in.append(i)

                break

        j+=1

    return to_keep_in

to_keep_in = filter_with_characteristics(df_filtered, characteristics)

len(to_keep_in)

progress:  52.083333333333336 %

Out[61]:

to_keep_in += list(set(test.review_id)) # this are the reviews annotated for test

df_filtered = df_filtered[df_filtered.id_col.isin(to_keep_in)]

df_filtered[:3]

Out[62]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name	cluster
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	5
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0

2.3 Characteristics Sentiment Extraction

This task was approached with five combined methods.

(1) Manually inputting set/gazetteer to fix wordnet sentiments that should be positive/negative instead, to ignore certain opinion words (for example “unlocked”, “old”, “normal”, “yellow”, etc.) and to include words that are not tagged as adjectives (i.e. opinion words) such as “broken”, “love” or “cool”.

(2) Inverting the polarity of words when they were negated.

(3) Using “Minqing Hu and Bing Liu” lexicon when opinion words were not supported by wordnet (either missing or neutral).

(4) Extracting the nearests opinion words (with a maximum set at 2) considering token distance (with a maximum set at 5).

(5) Computing the final characteristic sentiment score weighting by their distance (the further apart the lesser its weight).

For (4) the procedure considered looking at opinion words before and after the characteristic found and always keeping the closest ones first (for example taking the opinion word at distance -1 before the one at distance +2) where distance refers to the numbers of tokens from the characteristic word. The maximum amount of opinion words was set at two since usually when a third is found it is because the application misses a new characteristic that was there and which that third opinion word was referring to, hence avoiding assigning incorrectly an opinion word to a characteristic. Furthermore the procedure also considers if an opinion word has already been assigned to a characteristic, which is an advantage (for example avoiding to assign an opinion word twice) as well as a limitation (an opinion word between two characteristic might end up being incorrectly assigned to the first characteristic found) and should be handled in further improvements through Relationship Extraction (RE).

Another limitation and challenge of this task was the fact that customers usually give a review comparing the product with another one. This is problematic since for example they could be talking positively about the screen of the product of importance and negatively about another one they previously had, giving a neutral sentiment in the end. This was not handled in the current project and should be further investigated using RE as well.

For (5) an altered formulation proposed by Ding et al. was used, where the sentiment score for one characteristic of one product is aggregated across all sentiment polarities , as follows:

Challenges not taken care of: Sometimes customers explain they got rid of their old phone that had a "bad" camera, to get a new one. This algorithm considers that that bad camera as part of the new phone.

Load functions

positive_exceptions = ["high", "surprised"] # wordnet have it as negative, should be positive

negative_exceptions = ["old"] # wordnet have it as positive, should be negative.

ignore_exceptions = ["old", "new", "unlocked", "normal"]

ignore_exceptions += colors

word_exceptions = ["missing", "broken", "love", "awesome", "cool"] # They are not tagged as JJ sometimes, they should.

def compute_score(word, word_neg):

    if word in ignore_exceptions:

        return 0

    if word in positive_exceptions:

        if word_neg.find("_NEG") == -1:

            return 1

        else:

            return -1

    if word in negative_exceptions:

        print(word)

        if word_neg.find("_NEG") == -1:

            return -1

        else:

            return 1

    word2 = ''.join([word,".a.01"])

    try:

        pos_score = swn.senti_synset(word2).pos_score()

        neg_score = swn.senti_synset(word2).neg_score()

    except:

        if word in pos_list:

            pos_score = 1

            neg_score = 0

        elif word in neg_list:

            pos_score = 0

            neg_score = 1

        else:

            return 0

    if pos_score > neg_score:

        if word_neg.find("_NEG") == -1:

            return 1

        else:

            return -1

    elif neg_score > pos_score:

        if word_neg.find("_NEG") == -1:

            return -1

        else:

            return 1

    else:

        if word in pos_list:

            return 1

        elif word in neg_list:

            return -1

        else:

            return 0

def extract_characteristic_opinion_words(review, review_neg, max_opinion_words = 2, max_distance = 5, use_distance = False):

    review_charactetistics_sentiment = defaultdict(list)

    i = 0

    temp_review = []

    for word in review:

        word = word + ("free",)

        temp_review.append(list(word))

    for i in range(len(review)):

        if review[i][0] in characteristics:

            keep_forward = True

            keep_backward = True

            opinion_words = 0

            for j in range(1,max_distance+1):

                if  i+j >= len(review):

                    keep_forward = False

                if keep_forward:

                    if  review[i+j][0] in characteristics or opinion_words >= max_opinion_words:

                        keep_forward = False

                    elif i+j < len(review) and (review[i+j][1] in ["JJ", "JJR", "JJS"] or review[i+j][0] in word_exceptions) and temp_review[i+j][2] == "free":

                        sentiment = defaultdict(int)

                        score = compute_score(review[i+j][0], review_neg[i+j][0])

                        if score == 0: continue

                        if use_distance:

                            distance = j

                        else:

                            distance = 1

                        sentiment[review[i+j][0]] = (score,distance)

                        review_charactetistics_sentiment[review[i][0]].append(sentiment)

                        temp_review[i+j][2] = "used"

                        opinion_words +=1

                if  i-j < 0:

                    keep_backward = False

                if keep_backward:

                    if  review[i-j][0] in characteristics or opinion_words >= max_opinion_words:

                        keep_backward = False

                    elif i-j > -1 and (review[i-j][1] in ["JJ", "JJR", "JJS"] or review[i-j][0] in word_exceptions) and temp_review[i-j][2] == "free":

                        sentiment = defaultdict(int)

                        score = compute_score(review[i-j][0], review_neg[i-j][0])

                        if score == 0: continue

                        if use_distance:

                            distance = j

                        else:

                            distance = 1

                        sentiment[review[i-j][0]] = (score,distance)

                        review_charactetistics_sentiment[review[i][0]].append(sentiment)

                        temp_review[i-j][2] = "used"

                        opinion_words +=1

                if not keep_forward and not keep_backward:

                    break

    return review_charactetistics_sentiment

def consolidate_score(characteristic_dict):

    num = 0

    den = 0

    for opinion in characteristic_dict:

        for k, v in opinion.items():

            num += v[0]/v[1]

            den += 1/v[1]

    return num/den

def compute_sentiment_scores(tokenized_pos, tokenized_pos_neg, max_distance = 5, use_distance = True):

    if len(tokenized_pos) != len(tokenized_pos_neg):

        print("FATAL ERROR: Different lenght between tokenized_pos and tokenized_pos_neg")

        return None

    else:

        reviews_sentiment_scores = []

        for i in range(len(tokenized_pos)):

            review_sentiment_score = defaultdict(int)

            review_characteristics_opinion_words = extract_characteristic_opinion_words(tokenized_pos[i], tokenized_pos_neg[i], max_distance = max_distance, use_distance = use_distance)

            for k, v in review_characteristics_opinion_words.items():

                review_sentiment_score[k] = consolidate_score(v)

            reviews_sentiment_scores.append(review_sentiment_score)

        return reviews_sentiment_scores

def get_NN_count(tokenized_pos):

    NN_count = []

    for review in tokenized_pos:

        review_NN_count = 0

        for token in review:

            if token[1] in ["NN", "NNS", "NNP"] or token[0] in characteristics:

                review_NN_count += 1

        NN_count.append(review_NN_count)

    return NN_count

tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)

tokenized_pos = get_pos(tokenized_reviews)

tokenized_reviews_neg = get_tokens(df_filtered, stem = False, negation = True)

tokenized_pos_neg = get_pos(tokenized_reviews_neg)

NN_count = get_NN_count(tokenized_pos)

df_filtered['new_id'] = range(0, len(df_filtered))

progress:  59.171597633136095 %
progress:  59.171597633136095 %

The following review as an example gives insight of the application capabilities and limitations:

lookup_product_id = 7

for val in df_filtered[df_filtered.new_id == lookup_product_id]["Review"]: print(val)

display(tokenized_pos[lookup_product_id])

review_characteristics_opinion_words = extract_characteristic_opinion_words(tokenized_pos[lookup_product_id], tokenized_pos_neg[lookup_product_id], max_distance = 5, use_distance = True)

display(review_characteristics_opinion_words)

This phone in an excellent phone at a great price. I was impressed with the features of this phone and would recommend this to anyone.

[('phone', 'NN'),
 ('excellent', 'JJ'),
 ('phone', 'NN'),
 ('great', 'JJ'),
 ('price', 'NN'),
 ('impressed', 'VBD'),
 ('features', 'NNS'),
 ('phone', 'NN'),
 ('would', 'MD'),
 ('recommend', 'VB'),
 ('anyone', 'NN')]

defaultdict(list,
            {'price': [defaultdict(int, {'great': (1, 1)}),
              defaultdict(int, {'excellent': (1, 3)})]})

On the sentiment dictionary we store the characteristic as well as its opinion words with the sentiment score {-1,1} and the distance from the characteristic. Because “impressed” was tagged as a Verb it was not included as an opinion word. The application deal with this kind of cases using a gazetteer (not implemented for “impressed” in this case). Furthermore since “features” does not have any opinion words after it, it was not included in the sentiment dictionary (and even if “impressed” was considered as an opinion word it would be given to “price” which comes first).

review_sentiment_scores = compute_sentiment_scores(tokenized_pos, tokenized_pos_neg, max_distance = 5, use_distance = True)

review_sentiment_scores[:6]

df_filtered["Sentiments"] = list(review_sentiment_scores)

df_filtered["NN_count"] = list(NN_count)

df_filtered[:3]

Out[68]:

	Product	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name	cluster	new_id	Sentiments	NN_count
53	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0	0	{}	1
69	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	5	1	{'WiFi': 1.0}	14
71	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.0	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0	2	{}	26

3. Performance

To determine how effective the application is performing we separated the measurements in two steps.

(1) Measure effectiveness of the mobile phones characteristics extraction and

(2) over the corrected characteristics extracted, measure how effective the sentiments were recorded.

For both steps we created a manually annotated test set with ~150 reviews chosen at random. The format of the test set is as follows:

Where the third column corresponds to manually inputted results.

For step (1) we compared the characteristics extracted by the application for the reviews annotated in the test set. The measurements computed for each review were:

● True_Positives: Correctly extracted characteristics

● True_Negatives: All potential characteristics (NN/NNPs) that were not considered and are not in test set

● False_Positives: Incorrectly extracted characteristics (i.e. not in the test set)

● False_Negatives: Missed characteristics that were considered in the test set

Based on those metrics aggregated on all reviews we calculated Specificity (0.773), Recall (0.070), F1_score (0.036) and Accuracy (0.720).

Our main focus is to have a high Recall, that is, to correctly extract characteristics which represent the main output of the business objective.

Currently it’s extremely low failing to produce relevant insights. Since in contrast Specificity is relatively high it further proves that the model is missing characteristics.

For step (2) using only the characteristics correctly extracted (Recall results) we compared their sentiment scored against those from the test set. The measurements computed for each review were:

● True_Positives: Characteristic correctly classified with positive score

● True_Negatives: Characteristic correctly classified with negative score

● False_Positive: Characteristic incorrectly classified with positive score

● False_Negatives: Characteristic incorrectly classified with negative score

Based on those metrics aggregated on all reviews we calculated Specificity (0.8), Recall (0.666), F1_score (0.666) and Accuracy (0.75). However, results are not statistically significant since the test set on this part was extremely low with only 7 reviews considered that had the correct characteristic extraction. Nonetheless it gives insights that assigning correct sentiment scores is performing better than characteristic extraction with higher Specificity and Recall.

Load and correct Test Data

# The initial format of he annotated test_set is difficult to read

# as a dataframe, transformation to .csv format is computed first

# with regular expressions.

test = open('data/annotated_test_set.txt','r', encoding='utf8')

test_file = test.read()

test.close()

test_file[:200]

test_file = re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), test_file)

test_file = test_file.replace(';', "%")

test_file = test_file.replace(',', ";")

test_file = test_file.replace('%', ",")

test_file = test_file.replace('{', "{'")

test_file = test_file.replace(',', ",'")

test_file = test_file.replace(':', "':")

test_file = test_file.replace("},'", "}")

# Once fixed, save and load:

text_file = open("data/annotated_test_set_corrected.csv", "w")

for row in test_file.split(",\n"):

    text_file.write(row)

    text_file.write("\n")

text_file.close()

test = open('data/annotated_test_set_corrected.csv','r', encoding='utf8')

test_file = test.read()

test.close()

test = pd.read_csv('data/annotated_test_set_corrected.csv', delimiter = ";")

test.columns = ['review_id', 'Product', 'Sentiments_test']

test[:3]

Out[70]:

	review_id	Product	Sentiments_test
0	1540	BlackBerry Curve	{'Trackball':-1,'Battery':-1,'Micro-SD':-1}
1	1554	Acer Liquid E700 TRIO	{'Camera':-1,'Hardware':-1,'Buttons':-1}
2	1697	Alcatel OneTouch	{'Hardware':-1,'Charging Port':-1}

df_merge = pd.merge(df_filtered, test, left_on='id_col', right_on='review_id', how = "left")

df_merge[df_merge.Sentiments_test.isnull()==False]

Out[71]:

	Product_x	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name	cluster	new_id	Sentiments	NN_count	review_id	Product_y	Sentiments_test
0	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	muy buen producto	0.0	53	0	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0	0	{}	1	53.0	Asha 302	{'sound': 1,' smart phone features': 1,' soft...
1	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	Nokia Asha 302 Unlocked GSM Phone with 3.2MP C...	13.0	69	1	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	5	1	{'WiFi': 1.0}	14	69.0	Asha 302	{'build': 1,' keyboard': 1,'sound': 1,' Xpres...
2	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	1	Hola, compramos dos teléfonos y vienieron tota...	2.0	71	2	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0	2	{}	26	71.0	Asha 302	{'build': 1,' reception': 1,' audio': 1,' key...
3	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ...	0.0	73	3	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	0	3	{}	8	73.0	Asha 302	{'apps': 1}
4	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	4	The keys are a little hard to hit, and I didn'...	0.0	75	4	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	1	4	{'didnt': -1.0, 'keyboard': 1.0}	5	75.0	Asha 302	{'SMS': 1,' rings': 1,' body': 1,' freezes': -1}
5	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	I bought this phone as a Christmas present for...	3.0	78	5	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	13	5	{'amazon': -0.14285714285714285, 'features': 0...	57	78.0	Asha 302	{'ring tones': 1}
6	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	4	The Phone is pretty good. I am using it with a...	2.0	79	6	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	13	6	{}	12	79.0	Asha 302	{'wi-fi': 1,' calendar': 1,' alarm clock': 1,...
7	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	This phone in an excellent phone at a great pr...	1.0	82	7	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	13	7	{'price': 1.0}	6	82.0	Asha 302	{'price': 1}
8	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	4	This is a good phone although it seems to have...	1.0	84	8	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	1	8	{}	7	84.0	Asha 302	{'time': -1,' support': -1,' booklet': -1}
9	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	I've been a long time user of the iPhone. I fi...	2.0	85	9	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	13	9	{'battery': 1.0, 'email': 1.0, 'etc': 1.0, 'ke...	45	85.0	Asha 302	{'screen': -1,' calling': 1,' messaging': 1,'...
10	"Nokia Asha 302 Unlocked GSM Phone with 3.2MP ...	Nokia	299.00	5	This phone, like all of Nokia's feature phones...	12.0	86	10	18	"Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...	13	10	{'features': 1.0, 'value': -1.0, 'keyboard': -...	81	86.0	Asha 302	{'texting interface': 1,' battery': 1,' email...
11	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	5	Very nice.arrived on time. I love it.	0.0	734	14	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	12	11	{}	1	734.0	Lenovo A850	{'screen':1,' audio': 1,' apps': -1,' speed':...
12	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	3	I sent the phone to Colombia, and They had to ...	0.0	755	15	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	13	12	{}	9	755.0	Lenovo A850	{'language setting': -1,' battery': -1}
13	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	2	Didn't have the color that originally wanted w...	0.0	773	16	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	3	13	{}	14	773.0	Lenovo A850	{'apps': 1,' price': 1}
14	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	3	sometimes the screen and home button are unres...	0.0	774	17	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	6	14	{'button': -0.6666666666666667}	5	774.0	Lenovo A850	{'charger': -1}
15	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	5	Nice phone. Android gsm with 2sims great.	0.0	776	18	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	13	15	{'android': 1.0}	3	776.0	Lenovo A850	{'screen': -1,' button': -1,' brand': 1}
16	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	5	I like this smartphone, good quality very very...	0.0	828	19	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	13	16	{'quality': 1.0, 'color': 1.0, 'amazon': 1.0}	10	828.0	Lenovo A850	{'price': 1,' size': 1}
17	5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...	NaN	161.06	5	I have not reached the tlf	1.0	943	20	9	5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q...	0	17	{}	1	943.0	Lenovo A850	{'speed': 1,' size': 1,' screen': 1,' camera'...
19	8330 BlackBerry Curve (US Cellular) Titanium P...	NaN	29.95	1	I recevied the phone with broken trackball, mi...	4.0	1540	26	322	8330 BlackBerry Curve Cellular) Titanium	13	19	{'trackball': -1.0, 'microsd': -1.0}	16	1540.0	BlackBerry Curve	{'Trackball':-1,'Battery':-1,'Micro-SD':-1}
20	Acer Liquid Jade Z Andoid KitKat Unlocked Quad...	Acer	129.99	2	I had high hopes for this Acer phone based the...	0.0	1554	27	179	Acer Liquid Jade Z Andoid KitKat Quad-Core 5" ...	13	20	{'plastic': 0.6666666666666667, 'battery': 0.1...	34	1554.0	Acer Liquid E700 TRIO	{'Camera':-1,'Hardware':-1,'Buttons':-1}
21	ALCATEL OneTouch Idol 3 Global Unlocked 4G LTE...	Alcatel	292.98	5	It's good	0.0	1697	28	112	ALCATEL OneTouch Idol 3 Global 4G LTE Smartpho...	1	21	{}	0	1697.0	Alcatel OneTouch	{'Hardware':-1,'Charging Port':-1}
22	ALCATEL OneTouch Idol 3 Global Unlocked 4G LTE...	Alcatel	129.00	1	I am never one to write negative reviews but i...	2.0	1930	29	444	ALCATEL OneTouch Idol 3 Global 4G LTE Smartpho...	13	22	{'audio': 1.0}	18	1930.0	Alcatel OneTouch	{'Screen':1,'Size':-1}
23	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	1	The phone that I got doesnt work!	0.0	3177	32	27	Apple Iphone 5c A1532 16 GB Cell	5	23	{}	2	3177.0	iPhone 5c	{'size': 1,' charger': 1,' apps': 1,' headpho...
24	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	5	Received a great looking and working used phon...	0.0	3270	33	27	Apple Iphone 5c A1532 16 GB Cell	13	24	{}	3	3270.0	iPhone 5c	{'Wifi':-1}
25	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	5	Received a great looking and working used phon...	0.0	3270	33	27	Apple Iphone 5c A1532 16 GB Cell	13	24	{}	3	3270.0	iPhone 5c	{'wi-fi': -1}
26	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	5	good phone unlocked	0.0	3274	34	27	Apple Iphone 5c A1532 16 GB Cell	1	25	{}	1	3274.0	iPhone 5c	{'screen':1,'speed':1,'battery':-1}
27	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	2	The phone came with a bad speaker I could retu...	0.0	3308	35	27	Apple Iphone 5c A1532 16 GB Cell	13	26	{'speakers': -1.0}	8	3308.0	iPhone 5c	{'charging':1,'battery':1}
28	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	1	I did not receive a Verizon wireless,as stated...	31.0	3310	36	27	Apple Iphone 5c A1532 16 GB Cell	8	27	{}	3	3310.0	iPhone 5c	{'speaker':-1}
29	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	5	The phone was like new, and works perfect, tha...	0.0	3323	37	27	Apple Iphone 5c A1532 16 GB Cell	10	28	{}	3	3323.0	iPhone 5c	{'charger':-1}
30	Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho...	Apple	33.00	5	quick delivery, product received as described,...	0.0	3329	38	27	Apple Iphone 5c A1532 16 GB Cell	7	29	{}	3	3329.0	iPhone 5c	{'battery':-1}
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
140	HTC Rhyme 3G Android Smartphone Plum Verizon	HTC	64.99	4	This phone is what a smart stands for; the nav...	0.0	198109	594	8	HTC Rhyme 3G Android Smartphone Plum	13	139	{'system': 1.0, 'etc': 1.0}	18	198109.0	HTC Rhyme	{'charger':-1,'battery':-1}
141	HTC Rhyme 3G Android Smartphone Plum Verizon	HTC	64.99	5	It was a delightful surprise to find that this...	0.0	198111	595	8	HTC Rhyme 3G Android Smartphone Plum	5	140	{}	4	198111.0	HTC Rhyme	{'navigation system':1,'voice search':1,'spea...
142	HTC Rhyme 3G Android Smartphone Plum Verizon	HTC	64.99	5	just started having a few issues but it has wo...	0.0	198115	596	8	HTC Rhyme 3G Android Smartphone Plum	0	141	{}	1	198115.0	HTC Rhyme	{'size':1,'weight':1,'keyboard':-1,'SD card':...
143	Huawei Ascend P7 16G 5" Android 4.4 Quad Core ...	Huawei	2066.00	5	All very good, excellent product.	0.0	199958	604	304	Huawei Ascend P7 16G 5" Android 4.4 Quad Core ...	4	142	{}	1	199958.0	Huawei Ascend P7	{'software':-1}
144	HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L...	Huawei	182.99	5	t camera and designupdate still love the pho...	1.0	199986	605	61	HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone	13	143	{}	10	199986.0	Huawei Ascend P7	{'image quality':-1,'coverage':-1}
145	HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L...	Huawei	182.99	4	Network weak	0.0	199992	606	61	HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone	3	144	{'network': -1.0}	1	199992.0	Huawei Ascend P7	{'wifi':-1}
146	HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L...	Huawei	182.99	4	Great phone good price. It needs to come with ...	6.0	200009	607	61	HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone	1	145	{'price': 1.0}	6	200009.0	Huawei Ascend P7	{'price':1,'hardware':1}
147	HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L...	Huawei	182.99	5	excellent service and product as described	0.0	200041	608	61	HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone	7	146	{'service': 1.0}	2	200041.0	Huawei Ascend P7	{'specs':1,'price':1,'software':1,'screen':1,...
148	Huawei GX8 Unlocked Smartphone (US Version: RI...	Huawei	285.00	5	Best phone I have had. Fingerprint sensor is e...	1.0	200629	610	64	Huawei GX8 Smartphone Version: RIO-L03) Horizon	13	147	{'device': 1.0, 'battery': -1.0}	9	200629.0	Huawei GX8	{'battery': 1}
149	Huawei GX8 Unlocked Smartphone (US Version: RI...	Huawei	285.00	5	It's less than half the price of Galaxy 6, but...	3.0	200641	611	64	Huawei GX8 Smartphone Version: RIO-L03) Horizon	11	148	{'feel': 1.0}	6	200641.0	Huawei GX8	{'price': 1,' security options': 1,' battery'...
150	Huawei GX8 Unlocked Smartphone (US Version: RI...	Huawei	285.00	4	Phone looks and feel nice. It is about the siz...	13.0	200642	612	64	Huawei GX8 Smartphone Version: RIO-L03) Horizon	13	149	{'feel': 1.0, 'size': 1.0, 'didnt': -1.0}	11	200642.0	Huawei GX8	{'price': 1,' design': 1,' screen': 1,' finge...
151	Huawei GX8 Unlocked Smartphone (US Version: RI...	Huawei	285.00	1	Just bought a gx8 from Amazon. After one day o...	1.0	200657	613	64	Huawei GX8 Smartphone Version: RIO-L03) Horizon	5	150	{}	11	200657.0	Huawei GX8	{'multi-tasking': -1,' fingerprint reader': 1...
152	Huawei GX8 Unlocked Smartphone (US Version: RI...	Huawei	285.00	5	Just switched from iphone 6s plus to this prod...	4.0	200658	614	64	Huawei GX8 Smartphone Version: RIO-L03) Horizon	3	151	{}	17	200658.0	Huawei GX8	{'battery': 1,' camera': 1,' screen': 1,' pri...
153	Huawei Mate 2 - Factory Unlocked (Black)	Huawei	229.99	2	I bought the phone July 2014,It didn't work we...	2.0	200946	615	102	Huawei Mate 2 Factory	5	152	{'voice': -1.0, 'function': -1.0}	30	200946.0	Huawei Mate 2	{'battery':1,'bluetootch':1,'size':1}
154	Huawei Mate 2 - Factory Unlocked (Black)	Huawei	229.99	1	This phone ages quickly. My previous phone was...	2.0	200955	616	102	Huawei Mate 2 Factory	13	153	{'look': -1.0, 'settings': -1.0, 'power': -1.0}	48	200955.0	Huawei Mate 2	{'battery':1,'screen':1}
155	LG Optimus S Android Phone, Gray (Sprint)	LG	69.98	4	I was skeptical of buying this item because of...	0.0	236742	697	24	LG Optimus S Android (Sprint)	0	154	{'design': -0.3333333333333333}	37	236742.0	Optimus S	{'look': 1,' case': 1,' screen protector': 1}
156	LG Optimus S Android Phone, Gray (Sprint)	LG	69.98	4	Great rubberized case! Instead of being flimsy...	0.0	236743	698	24	LG Optimus S Android (Sprint)	11	155	{'feel': 1.0, 'look': 0.3333333333333333}	21	236743.0	Optimus S	{'case': 1,' design': -1}
157	LG Optimus S Android Phone, Gray (Sprint)	LG	69.98	5	Thank you	0.0	236745	699	24	LG Optimus S Android (Sprint)	10	156	{}	1	236745.0	Optimus S	{'case': 1,' screen protector': -1}
158	LG Optimus S Android Phone, Gray (Sprint)	LG	69.98	5	I purshes this ad on for my cell phone sprit s...	0.0	236748	700	24	LG Optimus S Android (Sprint)	10	157	{'cell': 1.0}	15	236748.0	Optimus S	{'design': 1,' case': -1}
159	LG Optimus S Android Phone, Gray (Sprint)	LG	69.98	1	They sent me the wrong case. I was so disappoi...	0.0	236750	701	24	LG Optimus S Android (Sprint)	0	158	{}	9	236750.0	Optimus S	{'build': -1}
160	LG Optimus S Android Phone, Gray (Sprint)	LG	69.98	5	This case is for an LG Optimus S but fits the ...	0.0	236751	702	24	LG Optimus S Android (Sprint)	0	159	{'design': -1.0}	14	236751.0	Optimus S	{'case': -1}
161	LG Xenon GR500 Unlocked Phone with QWERTY Keyb...	LG	129.99	5	This phone is easy to maneuver and user friend...	2.0	238741	707	49	LG Xenon GR500 with QWERTY 2MP and Touch Screen	13	160	{'speakers': 1.0, 'look': 1.0}	15	238741.0	LG Xenon GR500	{'keyboard': 1,' color': 1}
162	LG Xenon GR500 Unlocked Phone with QWERTY Keyb...	LG	129.99	4	I bought this phone for my son, who has had lo...	0.0	238859	708	49	LG Xenon GR500 with QWERTY 2MP and Touch Screen	13	161	{'service': -1.0, 'didnt': -1.0, 'features': 1.0}	31	238859.0	LG Xenon GR500	{'screen': -1,' ease of use': 1,' battery': -1}
163	LG Xenon GR500 Unlocked Phone with QWERTY Keyb...	LG	129.99	5	Great phone for the money. Easy to operate and...	3.0	238891	709	49	LG Xenon GR500 with QWERTY 2MP and Touch Screen	13	162	{}	7	238891.0	LG Xenon GR500	{'ease of use': 1,' keyboard': 1}
164	Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ...	Microsoft	300.51	5	The most important for me is that it let me wo...	1.0	240859	715	90	Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ...	5	163	{'windows': 1.0}	7	240859.0	Microsoft Lumia 950	{'price': 1,' camera': 1,' battery': 1,' weig...
165	Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ...	Microsoft	300.51	3	Not too bad.	1.0	240862	716	90	Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ...	0	164	{}	0	240862.0	Microsoft Lumia 950	{'size': 1,' software': 1,' apps': 1}
166	Microsoft Lumia 950 XL RM-1085 32GB Black, Sin...	Microsoft	328.41	5	As a Tmobile customer, I was saddened when I l...	4.0	240956	718	45	Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7...	13	165	{'windows': 1.0, 'cover': 0.3333333333333333}	66	240956.0	Microsoft Lumia 950	{'speed': 1,' case': 1,' screen': 1}
167	Microsoft Lumia 950 XL RM-1085 32GB Black, Sin...	Microsoft	328.41	5	I can't even tell you how much I love this pho...	1.0	241013	720	45	Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7...	13	166	{'online': -1.0, 'device': -0.3333333333333333...	38	241013.0	Microsoft Lumia 950	{'ease of use': 1}
168	Microsoft Lumia 950 XL RM-1085 32GB White, Sin...	NaN	333.41	4	So,the phone arrived with no phone case and no...	2.0	241146	722	45	Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7...	13	167	{}	8	241146.0	Microsoft Lumia 950	{'internet': -1}
169	Microsoft Lumia 950 XL RM-1085 32GB White, Sin...	NaN	333.41	5	This is the single SIM card version, not the d...	1.0	241214	723	45	Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7...	5	168	{'features': 1.0}	52	241214.0	Microsoft Lumia 950	{'sim card': 1,' speed': 1,' camera': 1,' SMS...

129 rows × 17 columns

lookup = 1540

for val in df_merge[df_merge.id_col == lookup].Review:

    print(val)

df_merge[df_merge.id_col == lookup]

I recevied the phone with broken trackball, missing micro-sd and missing battery.The seller claimed that it is 100% working. i cannot see how such a phone can be workingwithout the internal sd and battery. It claimed that it is OEM and brand new.My obervations indicated this was a poorly attempted refurbished phone. They must berunnin out of second handed parts.

Out[72]:

	Product_x	Brand	Price	Rating	Review	Votes	id_col	id_new_col	cluster_name	Standard_Product_Name	cluster	new_id	Sentiments	NN_count	review_id	Product_y	Sentiments_test
19	8330 BlackBerry Curve (US Cellular) Titanium P...	NaN	29.95	1	I recevied the phone with broken trackball, mi...	4.0	1540	26	322	8330 BlackBerry Curve Cellular) Titanium	13	19	{'trackball': -1.0, 'microsd': -1.0}	16	1540.0	BlackBerry Curve	{'Trackball':-1,'Battery':-1,'Micro-SD':-1}

Load Functions (for characteristics extraction performance)

def characteristics_extraction_performance(NN_count, training, test):

    TP = 0

    TN = 0

    FP = 0

    FN = 0

    temp_test = []

    test = eval(test)

    for test_characteristic in test.keys():

        test_characteristic = str(test_characteristic).lower()

        test_characteristic = re.sub(r'[^A-Za-z /.]','',test_characteristic)

        temp_test.append(test_characteristic)

        if test_characteristic in training.keys():

            TP += 1

        else:

            FN += 1

    TN = NN_count - len(training.keys()) - FN

    for train_characteristic in training.keys():

        if train_characteristic not in temp_test:

            FP += 1

    return TP, TN, FP, FN

def compute_characteristics_extraction_performance(df_merge):

    total_TP = 0

    total_TN = 0

    total_FP = 0

    total_FN = 0

    for i in range(len(df_merge)):

        NN_count = df_merge.NN_count[i]

        training = df_merge.Sentiments[i]

        test = df_merge.Sentiments_test[i]

        if pd.isnull(test): continue

        TP, TN, FP, FN = characteristics_extraction_performance(NN_count, training, test)

        total_TP += TP

        total_TN += TN

        total_FP += FP

        total_FN += FN

    if total_TP + total_FP == 0:

        TPR_RECALL = 0

    else:

        TPR_RECALL =  total_TP / (total_TP + total_FP)

    TNR_SPECIFICITY = total_TN / (total_TN + total_FN)

    F1_Score = 2* total_TP / (2*total_TP + total_FP + total_FN)

    Accuracy = (total_TP + total_TN) / (total_TP + total_TN + total_FP + total_FN)

    fpr = total_FP / (total_FN + total_FP)

    return TPR_RECALL, TNR_SPECIFICITY, F1_Score, Accuracy, fpr

Recall, Specificity, F1_Score, Accuracy, fpr= compute_characteristics_extraction_performance(df_merge)

print("Recall: ", Recall)

print("Specificity: ", Specificity)

print("F1_Score: ", F1_Score)

print("Accuracy: ", Accuracy)

Recall:  0.05128205128205128
Specificity:  0.777699364855
F1_Score:  0.0273972602739726
Accuracy:  0.722294654498

Load Functions (for sentiment analysis performance)

def characteristics_sentiment_performance(training, test):

    TP = 0

    TN = 0

    FP = 0

    FN = 0

    test = eval(test)

    for test_characteristic, test_score in test.items():

        test_characteristic = str(test_characteristic).lower()

        test_characteristic = re.sub(r'[^A-Za-z /.]','',test_characteristic)

        if test_characteristic in training.keys():

            if test_score == training[test_characteristic]:

                if test_score > 0:

                    TP += 1

                else:

                    TN += 1

            else:

                if test_score > 0:

                    FN += 1

                else:

                    FP += 1

        else:

            continue

    return TP, TN, FP, FN

def compute_characteristics_sentiment_performance(df_merge):

    total_TP = 0

    total_TN = 0

    total_FP = 0

    total_FN = 0

    cases = 0

    for i in range(len(df_merge)):

        training = df_merge.Sentiments[i]

        test = df_merge.Sentiments_test[i]

        if pd.isnull(test): continue

        TP, TN, FP, FN = characteristics_sentiment_performance(training, test)

        if TP+ TN+ FP+ FN > 0:

            cases+=1

        total_TP += TP

        total_TN += TN

        total_FP += FP

        total_FN += FN

    if total_TP + total_FP == 0:

        TPR_RECALL = 0

    else:

        TPR_RECALL =  total_TP / (total_TP + total_FP)

    TNR_SPECIFICITY = total_TN / (total_TN + total_FN)

    F1_Score = 2* total_TP / (2*total_TP + total_FP + total_FN)

    Accuracy = (total_TP + total_TN) / (total_TP + total_TN + total_FP + total_FN)

    fpr = total_FP / (total_FN + total_FP)

    return TPR_RECALL, TNR_SPECIFICITY, F1_Score, Accuracy, cases

Recall, Specificity, F1_Score, Accuracy, cases= compute_characteristics_sentiment_performance(df_merge)

print("Reviews Evaluated: ", cases)

print("Recall: ", Recall)

print("Specificity: ", Specificity)

print("F1_Score: ", F1_Score)

print("Accuracy: ", Accuracy)

Reviews Evaluated:  5
Recall:  0.6666666666666666
Specificity:  0.6666666666666666
F1_Score:  0.6666666666666666
Accuracy:  0.6666666666666666

4. Business Insights

By extracting the main characteristics that customers are reviewing and which rating (i.e sentiment score) they are giving to them the business will be able to understand what positively or negatively affects product reviews and what specifically users choose as highlights or pain points. From the output table with the sentiments scores assigned to each product name characteristics and simple reporting transformation a the following table can be obtained:

Flexible enough allowing to create further reports such as:

Which can then be used by manufacturers (i.e. Apple or Samsung) to improve the quality of their products based on a specific characteristic they are getting negative reviews, and also by sellers who can use this information to diversify their products (for example have one which is strong in screen quality and another in battery) or to stop buying products that have critical issues.

5. Discussion

5.1 Further Improvements

Businesses do not necessarily need to have a sentiment score for reviews, especially for ecommerce sites such as Amazon where a rating is also available. For manufactures in particular even if they have a score they would not know exactly where to prioritize their efforts to improve their products. Instead, by giving them the specifics characteristics where their products are failing or not they get valuable insights to tackle problems as they arise. Hence, the challenge of correctly extracting the products characteristics is of major importance. This application underperforms in the capabilities of extracting the characteristics and seems to perform fairly well in assigning the correct sentiments to them (although some exceptions need to be adjusted through gazetteers by using the domain knowledge of the business and industry). To further improve characteristics extraction an approach using topic modelling could be implemented, where assumptions are made on the probabilistic distribution of topics inside documents. An example of this would be the Latent Dirichlet Allocation that outputs word clusters. By extending the basic model of identifying topics, we can separate sentiment and features from each topic. As mentioned before, opinion word can be incorrectly assigned to characteristics when multiple characteristics are present, a task that could be tackled and improved with the usage of Name Entity Recognition (NER) and Relationship extraction (RE). Because of computational limitations we worked only on a subsample of the ~400,000 reviews. In the future using cloud computing as well as parallelization and improving the algorithm will allow to process an even larger amount of reviews. Finally to have statistically significant results a larger test set should be created with roughly at least 10% of the data (for this project only ~150 reviews were created).

5.2 Conclusion

In this project we analyzed the performance of measuring sentiment analysis on specific characteristics of mobile phones mentioned in customer reviews to provide manufacturers with actionable insights to improve their products and for sellers to improve their offerings. Results shows the worst performance on characteristic extraction where Recall is critically low. This topic is also the main challenge which could be further improved by implementing topic modelling. Sentiment scores on characteristics extraction revealed a good but not great performance suggesting that further improvements could be made using Relationship Extraction. However the test set was too small to have a clear statistical significance on the results.

最近更新： 2020年3月2日 15:48:05

浏览： 2.4K

您的评论 *

[[total]] 条评论

添加评论

[[item.time]]

[[item.user.username]] [[item.floor]]楼

[[cc.time]]

[[cc.user.username]] #[[cc.room]]

- «
- 1
- ...
- [[i]]
- ...
- »

点击加载更多……
添加评论
登录后即可回复

添加评论登录后即可回复

	...	product_name_cluster	id_col
0	...	18	0
1	...	18	1
2	...	18	2
3	...	18	3
4	...	18	4

	...	product_name_cluster	id_col
0	...	18	0
1	...	18	1
2	...	18	2
3	...	18	3
4	...	18	4

VincentWei

67

1.1K

Amazon Reviews Analysis: Unlocked Mobile Phones

Introduction

Methodology

1. Pre-procesing

1.1 Tokenization

1.2 Part of Speech tagging

1.3 Vector Space Model and TF * IDF transformation

Load Libraries

Load alternative for WordNet

Load and correct Test Data

Load Amazon Reviews Data

Sample review:

Create functions

Let's see a sample of:

1. Pre-procesing

1.4 Product Names Standardization

Load Function

CLUSTER PRODUCT NAMES

ASSIGN CLUSTER PRODUCT NAMES

Sample for 'iPhone'

2.1 Characteristics Extraction

Load functions and shortcuts

2.2 Filtering

clustering filter

Characteristic Filter

2.3 Characteristics Sentiment Extraction

Load functions

The following review as an example gives insight of the application capabilities and limitations:

3. Performance

Load and correct Test Data

Load Functions (for characteristics extraction performance)

Load Functions (for sentiment analysis performance)

4. Business Insights

5. Discussion

5.1 Further Improvements

5.2 Conclusion

[[total]] 条评论

	...	product_name_cluster	id_col
0	...	18	0
1	...	18	1
2	...	18	2
3	...	18	3
4	...	18	4