Python beginner: preprocessing a French text in Python and calculating the polarity with a lexicon - pandas

I am writing an algorithm in Python which processes a column of sentences and then gives the polarity (positive or negative) of each cell in that column. The script uses a list of negative and positive words from the NRC emotion lexicon (French version). I am having a problem writing the preprocessing function. I have already written the count function and the polarity function, but since I have some difficulty writing the preprocessing function, I am not really sure those functions work.
The positive and negative words were in the same file (the lexicon), but I exported the positive and negative words separately because I did not know how to use the lexicon as it was.
My functions that count occurrences of positive and negative words do not work and I do not know why; they always return 0. I added positive words to each sentence, so they should appear in the dataframe:
Output (printed dataframe):
   id                                           Verbatim  ...  word_positive  word_negative
0  15  Je n'ai pas bien compris si c'était destiné a ...  ...              0              0
1  44  Moi aérien affable affaire agent de conservati...  ...              0              0
2  45  Je affectueux affirmative te hais et la Foret ...  ...              0              0
3  47  Je absurde accidentel accusateur accuser affli...  ...              0              0

[4 rows x 6 columns]
=>
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)

csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))
This is my CSV data: lines 44 and 45 contain positive words and line 47 contains more negative words, but the positive and negative word columns are always empty; the function does not return the number of words, and the final polarity column is always positive even though the last sentence is negative.
id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue
Here is the full code:
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")
from itertools import islice
from collections import defaultdict, Counter
csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())
stopWords = set(stopwords.words('french'))
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
def process_text(text):
    '''Extract lemmas and lowercase, then remove stopwords.'''
    text_preprocess = []
    text_without_stopwords = []
    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)
    text_without_stopwords = [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords
csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)
lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)
lexiconneg = open('negative.txt', 'r', encoding='utf-8')
def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)
def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text = count_occurences_pos(text, lexiconpos)
    negatives_text = count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text:
        return "positive"
    else:
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)
If you could also check whether the rest of the code is correct, thank you.

I have found your error!
It comes from the polarity_score function.
It's just a typo:
In your if statement you were comparing count_occurences_Pos and count_occurences_Neg, which are functions, instead of comparing the results returned by count_occurences_pos and count_occurences_neg.
Your code should be like this:
def Polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    count_text_pos = count_occurences_Pos(text, word_list)
    count_text_neg = count_occurences_Neg(text, word_list)
    if count_text_pos > count_text_neg:
        return "Positive"
    else:
        return "negative"
In the future, try to use meaningful names for your variables to avoid this kind of error.
With correct variable names, your function would be:
def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text = count_occurences_pos(text, word_list)
    negatives_text = count_occurences_neg(text, word_list)
    if positives_text > negatives_text:
        return "Positive"
    else:
        return "negative"
Another improvement you can make in your count_occurences_pos and count_occurences_neg functions is to use sets instead of lists. Your text and word_list can both be converted to sets, and you can use set intersection to count the matching words, because membership tests and intersections are faster on sets than on lists.
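For example, a minimal sketch of that set-based version (assuming positive.txt contains one lexicon word per line, which may not match your actual file format) could look like this:
with open('positive.txt', 'r', encoding='utf-8') as f:
    lexiconpos = set(line.strip().lower() for line in f if line.strip())

def count_occurences_pos(text, word_set):
    '''Count how many processed tokens appear in the lexicon set.'''
    tokens = set(process_text(text))
    return len(tokens & word_set)

csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
Loading the file into a set of strings up front also avoids comparing tokens against an open file object, which is one likely reason the counts stay at 0.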

How to solve a set of nonlinear equations multiple times when one of the coefficient of the variable changes

I want to solve for the variables s, L for different values of t. t is part of my second equation and its value changes. I tried to solve for s, L for the different t values and then append the results to empty lists so that I could have different values of s, L for the different t values, but what I was getting is just an empty list. Please help me with this.
from scipy.optimize import fsolve
import numpy as np
import math as m

q0 = 0.0011
thetas, thetai, thetar = 0.43, 0.1, 0.05
ks = 0.0022  # m/hr
psib = -0.15  # m
lamda = 1
eta = 2 + 3*lamda
ki = 8.676*10**(-8)
si = 0.13157
t = np.array([3, 18, 24])
S = 0.02/24
delta = -0.1001
b = []
n = []
for i in range(3):
    def equations(p):
        s, L = p
        f1 = (ks*s**(3+(2/lamda))-(psib/(1-eta))*(((ki*si**(-1/lamda))-(ks*s**(3+(1/lamda))))/L)-q0)
        f2 = (L*(s*(thetas-thetar))+S*t[i]*0.5*(m.exp(-delta*psib*(-1+s**(-1/lamda))))-(q0-ki)*t[i])
        return (f1, f2)
    s, L = fsolve(equations, ([0.19, 0.001]))
    b.append(s)
    n.append(L)
print(b)
print(n)
There are several ways to evaluate this system with an adjustable parameter. You could plug each value in before solving, which would keep it compatible with additional solvers if fsolve didn't give you the desired results, or you could use the args parameter of fsolve. If I set up a dummy system in which I solve for x, y, z from some initial guess and step through a parameter, I can fill a preallocated solution array with the results:
import numpy as np
from scipy.optimize import fsolve

a = np.linspace(0, 10, 21)

def equations(variables, a):
    x, y, z = variables
    eq1 = x + y + z*a
    eq2 = x - y - z
    eq3 = x*y*x*a
    return tuple([eq1, eq2, eq3])

solutions = np.zeros((21, 3))
for idx, i in enumerate(a):
    solutions[idx] = fsolve(equations, [-1, 0, 1], args=(i,))
print(solutions)
which gives
[[ 5.00000000e-01 -5.00000000e-01 1.00000000e+00]
[ 9.86864911e-17 -2.96059473e-16 3.94745964e-16]
[ 1.62191889e-39 -1.28197512e-16 1.28197512e-16]
[-2.15414908e-17 -1.07707454e-16 8.61659633e-17]
[ 2.19853562e-28 6.59560686e-28 -4.39707124e-28]
[-1.20530409e-28 -2.81237621e-28 1.60707212e-28]
[-3.34744837e-17 -6.69489674e-17 3.34744837e-17]
[ 6.53567253e-17 1.17642106e-16 -5.22853803e-17]
[-3.14018492e-17 -5.23364153e-17 2.09345661e-17]
[-5.99115518e-17 -9.41467242e-17 3.42351724e-17]
[ 5.18199815e-29 7.77299722e-29 -2.59099907e-29]
[-2.70691440e-17 -3.90998747e-17 1.20307307e-17]
[-2.57288510e-17 -3.60203914e-17 1.02915404e-17]
[-2.44785120e-17 -3.33797891e-17 8.90127708e-18]
[-1.27252940e-28 -1.69670587e-28 4.24176466e-29]
[ 2.24744956e-56 2.93897250e-56 -6.91522941e-57]
[-2.12580678e-17 -2.73318015e-17 6.07373366e-18]
[-2.03436865e-17 -2.57686696e-17 5.42498307e-18]
[-3.89960988e-17 -4.87451235e-17 9.74902470e-18]
[-1.87148635e-17 -2.31183608e-17 4.40349730e-18]
[-7.19531738e-17 -8.79427680e-17 1.59895942e-17]]
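Applied to the original system, a hedged sketch of the same args idea (reusing the constants and equations from the question, with t passed in explicitly rather than captured from the loop index) could replace the original loop like this:

def equations(p, ti):
    s, L = p
    f1 = (ks*s**(3+(2/lamda))-(psib/(1-eta))*(((ki*si**(-1/lamda))-(ks*s**(3+(1/lamda))))/L)-q0)
    f2 = (L*(s*(thetas-thetar))+S*ti*0.5*(m.exp(-delta*psib*(-1+s**(-1/lamda))))-(q0-ki)*ti)
    return (f1, f2)

b, n = [], []
for ti in t:
    s, L = fsolve(equations, [0.19, 0.001], args=(ti,))
    b.append(s)
    n.append(L)
print(b)
print(n)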

Using string output from pytesseract to do a vlookup in pandas dataframe

I'm very new to Python, and I'm trying to make a simple image-to-song-title-to-BPM program. My approach is to use pytesseract to generate a string output and then, using that string, do a vlookup in a dataframe created with pandas. However, it always returns zero even though the song does exist in the data.
import PIL.ImageGrab
from PIL import ImageGrab
import numpy as np
import pytesseract
import pandas as pd
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
def getTitleImage(left, top, width, height):
    printscreen_pil = ImageGrab.grab((left, top, left + width, top + height))
    printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \
        .reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3))
    return printscreen_numpy
# Printscreen:
titleImage = getTitleImage(x, y, w, h)
# pytesseract to string:
songTitle = pytesseract.image_to_string(titleImage)
print('Name of the song: ', songTitle)
# Importing the csv data via pandas.
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
# A simple vlookup formula that return the BPM of the song by taking data from the same row.
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 0
However, when I hard-code the string into the songTitle variable, it works:
songTitle = 'Macarena'
print('Name of the song: ', songTitle)
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 103
I have checked the string generated by pytesseract: it has no extra space at the front or the back and looks identical to the hard-coded string, yet they still produce different results. What could be the problem?
I found the answer.
It is because the songTitle coming from:
songTitle = pytesseract.image_to_string(titleImage)
...is actually 'Macarena\n' instead of 'Macarena'.
They may look the same when printed, except the former adds a new line after it.
A great lesson learned for me.
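A minimal fix (a one-line sketch) is to strip the OCR output before doing the lookup:

songTitle = pytesseract.image_to_string(titleImage).strip()  # drop the trailing '\n' and any surrounding whitespace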

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in PDF and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc. on those books. So I converted them into .txt files and tried to get tf-idf scores. Previously I performed tf-idf on a CSV file, so I made some changes to that code and tried to use it for the .txt file, but my attempt was unsuccessful.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
The code runs up to the 'Non Zero Count' print and outputs:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it is giving error:
ZeroDivisionError: division by zero
Even if I run the code ignoring the ZeroDivisionError, it is still wrong, since it is not counting any frequencies.
I have no idea how to work with a .txt file. What is the proper way to work with .txt files for NLP tasks?
Thanks in advance!
You are getting the error because the data variable is an open file object, not a string: fit() consumes the file line by line, so by the time transform() runs the file is exhausted and no documents are left (hence the (0, 7132) shape). Just opening the text file is not enough; you have to read the contents into a string variable and then do the preprocessing on that variable. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
    data = file.read()
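Note that CountVectorizer's fit and transform expect an iterable of documents rather than one big string, so with the whole book read into a single string you would likely still need to wrap or split it. A hedged sketch that treats each non-empty line of the book as its own document (so that min_df/max_df still have multiple documents to work with):

with open('jungle book.txt', 'r') as file:
    docs = [line.strip() for line in file if line.strip()]  # one document per non-empty line

cvec.fit(docs)
cvec_count = cvec.transform(docs)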

POS Tags to Wordnet in Pandas Dataframe

I am using NLTK on a dataset stored as a pandas dataframe. All the raw text processing steps worked fine until I tried to convert the Treebank POS tags to WordNet POS tags. This is the code that worked fine for me.
import pandas as pd
import string
import nltk
from nltk import WordPunctTokenizer, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn, stopwords

# Example dataframe
df = pd.DataFrame([[2, "I am new at programming."],
                   [7, "Leaves are falling from the tree."],
                   [4, "Sophia has been studying since this morning."]], columns = ['ID', 'Text'])
# Tokenize text
tokenizer = nltk.WordPunctTokenizer()
df["Tokens"] = df["Text"].str.lower().apply(tokenizer.tokenize)
# Remove punctuations
pattern = string.punctuation
print(pattern)
def remove_punctuation(tokens):
    filtered = [word for word in tokens if word not in pattern]
    return filtered
df["Tokens"] = df["Tokens"].apply(remove_punctuation)
# Remove stopwords
stopwords = stopwords.words('english')
def remove_stopwords(tokens):
    filtered_words = [word for word in tokens if word not in stopwords]
    return filtered_words
df["Tokens"] = df["Tokens"].apply(remove_stopwords)
The following lines of code did not work and I got this error:
ValueError: too many values to unpack (expected 2)
def wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wn.ADJ
    elif pos_tag.startswith('V'):
        return wn.VERB
    elif pos_tag.startswith('N'):
        return wn.NOUN
    elif pos_tag.startswith('R'):
        return wn.ADV
    else:
        return None

def wordnet(tokens):
    pos_tokens = [nltk.pos_tag(token) for token in tokens]
    pos_tokens = [(word, wordnet_pos(pos_tag)) for (word, pos_tag) in pos_tokens]
    return pos_tokens

df["Wordnet"] = df["Tokens"].apply(wordnet)
This is what I had hoped to achieve - to create df["Wordnet"] with the Wordnet POS tags.
print(df["Wordnet"])
0 [(new, a), (programming, n)]
1 [(leaves, n), (falling, v), (tree, n)]
2 [(sophia, n), (studying, v), (since, n), (...
Name: Wordnet, dtype: object
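For what it's worth, that unpack error usually means nltk.pos_tag was called on a single word string inside the list comprehension, so it tags each character and the resulting inner lists cannot be unpacked into (word, pos_tag) pairs. A hedged sketch that tags the whole token list in one call instead:

def wordnet(tokens):
    # Tag the list once so each element is a (word, treebank_tag) pair.
    pos_tokens = nltk.pos_tag(tokens)
    return [(word, wordnet_pos(tag)) for (word, tag) in pos_tokens]

df["Wordnet"] = df["Tokens"].apply(wordnet)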

word2vec, sum or average word embeddings?

I'm using word2vec to represent a small phrase (3 to 4 words) as a single vector, either by adding the individual word embeddings or by calculating the average of the word embeddings.
From the experiments I've done I always get the same cosine similarity. I suspect it has to do with the word vectors generated by word2vec being normalized to unit length (Euclidean norm) after training, or else I have a bug in the code, or I'm missing something.
Here is the code:
import numpy as np
from nltk import PunktWordTokenizer
from gensim.models import Word2Vec
from numpy.linalg import norm
from scipy.spatial.distance import cosine
def pattern2vector(tokens, word2vec, AVG=False):
    pattern_vector = np.zeros(word2vec.layer1_size)
    n_words = 0
    if len(tokens) > 1:
        for t in tokens:
            try:
                vector = word2vec[t.strip()]
                pattern_vector = np.add(pattern_vector, vector)
                n_words += 1
            except KeyError, e:
                continue
        if AVG is True:
            pattern_vector = np.divide(pattern_vector, n_words)
    elif len(tokens) == 1:
        try:
            pattern_vector = word2vec[tokens[0].strip()]
        except KeyError:
            pass
    return pattern_vector
def main():
    print "Loading word2vec model ...\n"
    word2vecmodelpath = "/data/word2vec/vectors_200.bin"
    word2vec = Word2Vec.load_word2vec_format(word2vecmodelpath, binary=True)
    pattern_1 = 'founder and ceo'
    pattern_2 = 'co-founder and former chairman'
    tokens_1 = PunktWordTokenizer().tokenize(pattern_1)
    tokens_2 = PunktWordTokenizer().tokenize(pattern_2)
    print "vec1", tokens_1
    print "vec2", tokens_2
    p1 = pattern2vector(tokens_1, word2vec, False)
    p2 = pattern2vector(tokens_2, word2vec, False)
    print "\nSUM"
    print "dot(vec1,vec2)", np.dot(p1,p2)
    print "norm(p1)", norm(p1)
    print "norm(p2)", norm(p2)
    print "dot((norm)vec1,norm(vec2))", np.dot(norm(p1),norm(p2))
    print "cosine(vec1,vec2)", np.divide(np.dot(p1,p2), np.dot(norm(p1),norm(p2)))
    print "\n"
    print "AVG"
    p1 = pattern2vector(tokens_1, word2vec, True)
    p2 = pattern2vector(tokens_2, word2vec, True)
    print "dot(vec1,vec2)", np.dot(p1,p2)
    print "norm(p1)", norm(p1)
    print "norm(p2)", norm(p2)
    print "dot(norm(vec1),norm(vec2))", np.dot(norm(p1),norm(p2))
    print "cosine(vec1,vec2)", np.divide(np.dot(p1,p2), np.dot(norm(p1),norm(p2)))

if __name__ == "__main__":
    main()
and here is the output:
Loading word2vec model ...
Dimensions 200
vec1 ['founder', 'and', 'ceo']
vec2 ['co-founder', 'and', 'former', 'chairman']
SUM
dot(vec1,vec2) 5.4008677771
norm(p1) 2.19382594282
norm(p2) 2.87226958166
dot((norm)vec1,norm(vec2)) 6.30125952303
cosine(vec1,vec2) 0.857109242583
AVG
dot(vec1,vec2) 0.450072314758
norm(p1) 0.731275314273
norm(p2) 0.718067395416
dot(norm(vec1),norm(vec2)) 0.525104960252
cosine(vec1,vec2) 0.857109242583
I'm using the cosine similarity as defined here Cosine Similarity (Wikipedia). The values for the norms and dot products are indeed different.
Can anyone explain why the cosine is the same?
Thank you,
David
Cosine measures the angle between two vectors and does not take the length of either vector into account. When you divide by the length of the phrase, you are just shortening the vector, not changing its angular position. So your results look correct to me.
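As a quick numeric check (a made-up sketch, not tied to any particular model), scaling either vector by a positive constant, which is all the averaging does, leaves the cosine unchanged:

import numpy as np
from numpy.linalg import norm

def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 0.5, 1.0])
print(cosine_sim(v1, v2))                # some value c
print(cosine_sim(v1 / 3.0, v2 / 4.0))    # the same value c: dividing by phrase length only rescales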