How to quickly convert a pandas dataframe to a list of tuples - pandas

I have a pandas dataframe as follows.
thi 0.969378
text 0.969378
is 0.969378
anoth 0.699030
your 0.497120
first 0.497120
book 0.497120
third 0.445149
the 0.445149
for 0.445149
analysi 0.445149
I want to convert it to a list of tuples as follows.
[["this", 0.969378], ["text", 0.969378], ..., ["analysi", 0.445149]]
My code is as follows.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems
# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum the tf-idf scores for each term across documents (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
print(top_words)
I tried the following two options.
list(zip(*map(top_words.get, top_words)))
I got the error as TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.9693779251346359] of <class 'float'>
list(top_words.itertuples(index=True))
I got the error as AttributeError: 'Series' object has no attribute 'itertuples'.
Please let me know a quick way of doing this in pandas.
I am happy to provide more details if needed.

Use zip with the index, then map the tuples to lists:
a = list(map(list,zip(top_words.index,top_words)))
Or convert the index to a column, convert to a numpy array and then to nested lists:
a = top_words.reset_index().to_numpy().tolist()
print (a)
[['thi', 0.9693780000000001], ['text', 0.9693780000000001],
['is', 0.9693780000000001], ['anoth', 0.69903],
['your', 0.49712], ['first', 0.49712], ['book', 0.49712],
['third', 0.44514899999999996], ['the', 0.44514899999999996],
['for', 0.44514899999999996], ['analysi', 0.44514899999999996]]
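Another equivalent option (just a sketch, not from the original answer) is a plain list comprehension over Series.items(), which yields (index, value) pairs and avoids the detour through numpy:
a = [[term, score] for term, score in top_words.items()]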

Related

Tensor to Dataframe for each sentence

For a 6 class sentence classification task, I have a list of sentences where I retrieve the absolute values before the softmax is applied. Example list of sentences:
s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']
I get the values the following way:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    print(output.logits.detach().numpy())
# returns [[-0.8390876 2.9480567 -0.5134539 0.70386493 -0.5019671 -2.619496 ]]
#[[-0.8847909 -0.9642067 -2.2108874 -0.43932158 4.3386173 -0.37383893]]
#[[-0.48750368 3.2949197 2.1660519 -0.6453249 -1.7101991 -2.817954 ]]
How do I create a data frame with columns sentence, class_1, class_2, class_3, class_4, class_5, class_6, adding the values iteratively (or in some more optimal way), so that each new sentence is appended together with its absolute values? What would be the best way?
Expected output:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
If I only had one sentence, I could transform it to a data frame like this, but I would still need to append the sentence itself somehow:
sentence = tokenizer("Love is in the air", return_tensors="pt")
output = model(sentence["input_ids"])
px = pd.DataFrame(output.logits.detach().numpy())
Maybe creating two separate data frames and then appending them would be one plausible way of doing this?
Save the model outputs in a list, then build a dict and create the dataframe from it:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
outputs = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    outputs.append(output.logits.detach().numpy()[0])

# convert to one numpy array
outputs = np.array(outputs)

# create dataframe
obj = {"sentence": s}
for class_id in range(outputs.shape[1]):
    # get the data column for that class
    obj[f"class_{class_id}"] = outputs[:, class_id].tolist()
df = pd.DataFrame(obj)
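A slightly more compact variant (a sketch under the same assumptions, reusing the outputs array built above; the class_* column names simply mirror the loop) builds the frame in one call:
# one column per logit, plus the original sentences as the first column
df = pd.DataFrame(outputs, columns=[f"class_{c}" for c in range(outputs.shape[1])])
df.insert(0, "sentence", s)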
I managed to come up with a solution and I am posting it, as someone might find it useful.
The idea is to initialize an empty data frame and append the absolute values for every sentence while iterating:
absolute_vals = pd.DataFrame()
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    px = pd.DataFrame(output.logits.detach().numpy())
    absolute_vals = absolute_vals.append(px, ignore_index=True)
absolute_vals
Returns:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
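Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and the loop above only yields the numeric logit columns; the sentence column still has to be attached separately. A pd.concat-based sketch (assuming model, tokenizer and s from the question; the class_* names are illustrative):
frames = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    frames.append(pd.DataFrame(output.logits.detach().numpy()))

absolute_vals = pd.concat(frames, ignore_index=True)
absolute_vals.columns = [f"class_{c}" for c in range(absolute_vals.shape[1])]
absolute_vals.insert(0, "sentence", s)  # add the sentences as the first column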

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in PDF and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc. on those books. So I converted them into .txt files and tried to get tf-idf scores. Previously I computed tf-idf on a CSV file, so I made some changes to that code and tried to use it for the .txt file, but I was unsuccessful.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code works until print('Non Zero Count : ', cvec_count.nnz) and prints:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it gives the error:
ZeroDivisionError: division by zero
Even if I run the code ignoring the ZeroDivisionError, the result is still wrong because no frequencies are being counted.
I have no idea how to work with a .txt file here. What is the proper way to use .txt files for NLP tasks?
Thanks in advance!
You are getting the error because the data variable is empty or of the wrong type. Just opening the text file is not enough; you have to read the contents into a string variable and then do the preprocessing on that variable. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
    data = file.read()
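One more caveat (an assumption beyond the original answer): CountVectorizer expects an iterable of documents, so a single big string is not the right input either. A minimal sketch that treats each non-empty line of the book as a document:
with open('jungle book.txt', 'r') as file:
    # one "document" per non-empty line; paragraphs or chapters would also work
    data = [line.strip() for line in file if line.strip()]

cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1, 2))
cvec_count = cvec.fit_transform(data)  # sparse matrix of shape (n_lines, n_terms)
The rest of the original code can then run unchanged on cvec_count.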

Form Data to get a particular output from urlencode

What should the dictionary form_data be?
Desired output from the Python code data = parse.urlencode(form_data).encode():
"entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4"
I tried various dictionary structures, including ones with None, [], and a dictionary within a dictionary, but I am unable to get this output:
form_data = {'entry.330812148_sentinel': None,
             'entry.330812148': 'Test1',
             'entry.330812148': 'Test2',
             'entry.330812148': 'Test3',
             'entry.330812148': 'Test4'}
from urllib import request, parse
data = parse.urlencode(form_data).encode()
print("Printing Parsed Form Data........")
"entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4"
You can use parse_qs from urllib.parse to recover the Python data structure:
import urllib.parse
>>> s = 'entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4'
>>> d1 = urllib.parse.parse_qs(s)
>>> d1
{'entry.330812148': ['Test1', 'Test2', 'Test3', 'Test4']}
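The question, though, asks about the forward direction. A plain dict cannot hold the key entry.330812148 four times; one way to get the desired string (a sketch, not part of the original answer) is to map the key to a list of values and pass doseq=True:
from urllib import parse

form_data = {'entry.330812148_sentinel': '',
             'entry.330812148': ['Test1', 'Test2', 'Test3', 'Test4']}
data = parse.urlencode(form_data, doseq=True).encode()
print(data)
# b'entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4'
A list of (key, value) tuples passed to urlencode works as well and makes the ordering explicit.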

My question is about "module 'textacy' has no attribute 'Doc'"

I get the error module 'textacy' has no attribute 'Doc'.
I am trying to extract verb phrases with textacy/spaCy, but there is no such attribute. Please let me know how I can extract verb phrases or adjective phrases using spaCy. I want to do full shallow parsing.
def extract_named_nouns(row_series):
    """Combine nouns and non-numerical entities.

    Keyword arguments:
    row_series -- a Pandas Series object
    """
    ents = set()
    idxs = set()
    # remove duplicates and merge two lists together
    for noun_tuple in row_series['nouns']:
        for named_ents_tuple in row_series['named_ents']:
            if noun_tuple[1] == named_ents_tuple[1]:
                idxs.add(noun_tuple[1])
                ents.add(named_ents_tuple)
        if noun_tuple[1] not in idxs:
            ents.add(noun_tuple)
    return sorted(list(ents), key=lambda x: x[1])
def add_named_nouns(df):
    """Create new column in data frame with nouns and named ents.

    Keyword arguments:
    df -- a dataframe object
    """
    df['named_nouns'] = df.apply(extract_named_nouns, axis=1)
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
from textacy import io
#using spacy for nlp
nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
pattern = r'<VERB>?<ADV>*<VERB>+'
doc = textacy.Doc.load(sentence, metadata=metadata, lang='en_core_web_sm')
# doc = textacy.corpus.Corpus(sentence, lang='en_core_web_sm')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for list in lists:
    print(list.text)
module 'textacy' has no attribute 'Doc'
Try following the examples here: https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#make-a-doc
It should be as simple as:
doc = textacy.make_spacy_doc("The author is writing a new book.", lang='en_core_web_sm')
You might look into just using spacy (without textacy) with its built-in Matcher instead (https://spacy.io/usage/rule-based-matching).
spacy_lang = textacy.load_spacy_lang("en_core_web_sm")
docx_textacy = spacy_lang(sentence)
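As a rough illustration of the Matcher route mentioned above (a sketch, assuming spaCy v3 and the en_core_web_sm model; the rule name VERB_PHRASE is arbitrary):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# token-level equivalent of the <VERB>?<ADV>*<VERB>+ pattern
matcher.add("VERB_PHRASE", [[{"POS": "VERB", "OP": "?"},
                             {"POS": "ADV", "OP": "*"},
                             {"POS": "VERB", "OP": "+"}]])

doc = nlp("The author is writing a new book.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)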

Python 3.6 Pandas Difflib Get_Close_Matches to filter a dataframe with user input

Using a CSV imported into a pandas dataframe, I am trying to search one column of the df for entries similar to user-generated input. I have never used difflib before, and my attempts have ended in either TypeError: object of type 'float' has no len() or an empty [] list.
import difflib
import pandas as pd
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1")
word = input ("Enter a vendor: ")
def find_it(w):
    w = w.lower()
    return difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6)
alternatives = find_it(word)
print (alternatives)
The error seems to occur at return difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6).
I am attempting to get results similar to "word" from a column called 'vendorname'.
Help is greatly appreciated.
Your column vendorname is of the incorrect type.
Try this in your return statement:
return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
import difflib
import pandas as pd
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1")
word = input ("Enter a vendor: ")
def find_it(w):
    w = w.lower()
    return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
alternatives = find_it(word)
print (alternatives)
As stated in the comments by @johnchase:
The question also mentions the return of an empty list. The return of get_close_matches is a list of matches, if no item matched within the cutoff an empty list will be returned – johnchase
I've skipped the astype(str) in
return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
and instead used dtype='string' when reading the CSV:
df = pd.read_csv("Vendorlist.csv", encoding="ISO-8859-1", dtype="string")