Get count vectorizer vocabulary in new dataframe column by applying vectorizer on existing dataframe column using pandas

I have a dataframe column 'review' with content like 'Food was Awesome', and I want a new column that counts the repetitions of each word. A sample row looks like this:
name The First Years Massaging Action Teether
review A favorite in our house!
rating 5
Name: 269, dtype: object
Expecting output like {'Food': 1, 'was': 1, 'Awesome': 1}
I tried with a for loop, but it's taking too long to execute:
for row in range(products.shape[0]):
    try:
        count_vect.fit_transform([products['review_without_punctuation'][row]])
        products['word_count'][row] = count_vect.vocabulary_
    except:
        print(row)
I would like to do it without a for loop.

I found a solution for this.
I defined a function like this:
def Vectorize(text):
    try:
        count_vect.fit_transform([text])
        return count_vect.vocabulary_
    except:
        return -1
and applied the above function:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)
This solution worked, and I got the vocabulary in a new column.
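As a sanity check, here is a minimal, self-contained run of this approach on a made-up one-row frame. Note that vocabulary_ maps each term to its column index in the fitted vectorizer, not to its count, so the values are indices rather than the frequencies the question asked for:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

def Vectorize(text):
    try:
        count_vect.fit_transform([text])
        return count_vect.vocabulary_
    except:
        return -1

# made-up sample row
products = pd.DataFrame({'review_without_punctuation': ['Food was Awesome']})
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)
print(products['word_count'][0])  # {'food': 1, 'was': 2, 'awesome': 0}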

You can get the count vector for all docs like this:
cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])
To get the count vector in array format for a particular document by index, say the 1st doc:
count_vectors[0].toarray()
The vocabulary is in
cv.vocabulary_
To get the words that make up a count vector, say, for the 1st doc, use
cv.inverse_transform(count_vectors[0])
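Putting those pieces together, a minimal, self-contained sketch of this flow (the reviews are made up for illustration):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

products = pd.DataFrame({'review_without_punctuation': [
    'food was awesome',
    'a favorite in our house',
]})

cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])

print(cv.vocabulary_)                          # term -> column index
print(count_vectors[0].toarray())              # counts per vocabulary term, 1st doc
print(cv.inverse_transform(count_vectors[0]))  # terms present in the 1st doc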

Related

Pandas - finding most important words from each row

I have a pandas dataframe with a text column. I am trying to find the most important words from this text column for each row. How do I do this?
I am currently trying to do this using tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(stop_words='english')
x = v.fit_transform(df['cleansed_text'])
I see that x is a sparse matrix with the same number of rows as my dataframe, and it looks like the number of columns equals the number of words in the vocabulary.
How do I use this to find the most important words for each row?
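One common approach (a sketch, not from the original post, assuming a recent scikit-learn with get_feature_names_out) is to argsort each row of the tf-idf matrix and map the top column indices back to terms:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up sample standing in for df['cleansed_text']
df = pd.DataFrame({'cleansed_text': ['the cat sat on the mat',
                                     'dogs chase cats all day']})

v = TfidfVectorizer(stop_words='english')
x = v.fit_transform(df['cleansed_text'])

terms = np.array(v.get_feature_names_out())
top_k = 2
scores = x.toarray()  # fine for small data; iterate the sparse rows for large corpora
top_idx = np.argsort(scores, axis=1)[:, ::-1][:, :top_k]
df['top_words'] = [terms[row].tolist() for row in top_idx]
print(df)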

How to compare one row in Pandas Dataframe to all other rows in the same Dataframe

I have a csv file in which I want to compare each row with all other rows. I want to do a linear regression and get the r^2 value for the linear regression line and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows but I can't think of a way to define the other rows once I have my primary row using this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
from itertools import combinations
from sklearn.metrics import r2_score

def bad_pairs(df, limit):
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        final[fluor] = r2_score(df.xs(fluor[0]), df.xs(fluor[1]))
    bad_final = {}
    for i in final:
        if final[i] > limit:
            bad_final[i] = final[i]
    return bad_final
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0 and 1 for each detector (220 columns).
I'm still working on a way to make a new pandas DataFrame from a dictionary with all the values (final in the code above), not just those over the limit.
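For that last step, one option (a sketch, assuming final is keyed by (row, row) tuples as in the code above, with made-up values) is to build a Series from the dictionary and reset its MultiIndex:
import pandas as pd

# made-up sample of the `final` dictionary
final = {('red', 'blue'): 0.91, ('red', 'green'): 0.42, ('blue', 'green'): 0.15}

pairs = pd.Series(final).rename_axis(['color_a', 'color_b']).reset_index(name='r2')
print(pairs)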

Pandas - get count of each boolean field

I have other programs where I group and count fields. Now, I want to get a count of each boolean field. Is there a Pandas way to do that rather than me looping and writing my own code? Ideally, I would generate a new dataframe with the results (kind of like what I did here).
Easy Example CSV Data (data about poker hands generated):
Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
Program:
import pandas as pd
import warnings
filename = "./data/TestGroup2.csv"
# tell run time to ignore certain read_csv type errors (from pandas)
warnings.filterwarnings('ignore', message="^Columns.*")
count_cols = ['IsFourOfAKind', 'IsThreeOfAKind', 'IsPair']
#TODO - use the above to get counts of only these columns
df = pd.read_csv(filename)
print(df.head(10))
Desired output (could just be a new dataframe):
Column Count
IsFourOfAKind 1
IsThreeOfAKind 2
IsPair 3
Please try:
df.filter(like='Is').sum(0)
or, if you need the result as a dataframe:
df1=df.filter(like='Is').agg('sum').reset_index().rename(columns={'index':'column', 0:'count'})
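A quick run on a miniature of the CSV above (values made up to keep it short) shows what the first variant returns; the second yields the same counts reshaped into a two-column dataframe:
import pandas as pd

# miniature stand-in for the poker-hand CSV
df = pd.DataFrame({'Hand': [1, 2, 3, 4],
                   'IsFourOfAKind': [1, 0, 0, 0],
                   'IsThreeOfAKind': [0, 1, 1, 0],
                   'IsPair': [0, 0, 0, 1]})

print(df.filter(like='Is').sum(0))
# IsFourOfAKind     1
# IsThreeOfAKind    2
# IsPair            1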

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values; I will only be interested in NodeID and GSE:
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, uses pivot, and gets the following result:
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This gives me an index with two values (the header is blank for the first one, and 'Layer' for the second), i.e. 1, L1.
What I need help on is:
I cannot find a way to rename that first blank level in the index to 'NodeID'.
I want to name it that so that I can do the lookup function and use NodeID in both dataframes, in order to bring the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first column in the second dataframe and I cannot seem to find a solution. Any ideas help at this point. I think my pivot function might be wrong...
This is a picture of dataframe #2 before the pivot; the numbers 1-4 are the NodeID.
When I export it to csv to see what the dataframe looks like, I get this..
Try
df = df.rename(columns={"Index": "your preferred name"})
(rename returns a new frame, so assign the result). If it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() command, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0.,max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns = vect.get_feature_names())
When I now output the respective dataframe count_vect_df, it contains a lot of columns which are empty / contain only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
CountVectorizer returns a sparse matrix in which most values are zero; the non-zero values represent the number of times a specific term appeared in a particular document.
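So most of those zeros are expected. If the concern is memory rather than the zero values themselves, one option (a sketch, assuming pandas >= 0.25 and a recent scikit-learn) is to keep the counts sparse instead of calling todense():
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['bretagne mer phare', 'bretagne quimper bzh']  # made-up sample hashtags

vect = CountVectorizer()
X = vect.fit_transform(docs)

# sparse-backed DataFrame: the zeros are not materialised in memory
count_vect_df = pd.DataFrame.sparse.from_spmatrix(
    X, columns=vect.get_feature_names_out())
print(count_vect_df.sparse.density)  # fraction of non-zero entries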