I have a pandas dataframe with a text column. I am trying to find the most important words from this text column for each row. How do I do this?
I am currently trying to do this using tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(stop_words='english')
x = v.fit_transform(df['cleansed_text'])
I see that x is a sparse matrix with same number of rows as my dataframe and looks like number of columns equals the number of words in the vocabulary.
How do I use this to find the most important words for each row?
Related
I have a Pandas dataframe with several columns wherein the entries of each column are a combination of numbers, upper and lower case letters and some special characters:, i.e, "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the numbers 0-9 appearing only between | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
Instead of working with these unreadable regex expressions, you might want to consider a simple split. For example:
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
I have a csv file with in which I want to compare each row with all other rows. I want to do a linear regression and get the r^2 value for the linear regression line and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows but I can't think of a way to define the other rows once I have my primary row using this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
def bad_pairs(df, limit):
list_fluor = list(combinations(df.index.values, 2))
final = {}
for fluor in list_fluor:
final[fluor] = (r2_score(df.xs(fluor[0]),
df.xs(fluor[1])))
bad_final = {}
for i in final:
if final[i] > limit:
bad_final[i] = final[i]
return(bad_final)
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0-1 for each detector (220 columns).
I'm still working on a way to make a new pandas Dataframe from a dictionary with all the values (final in the code above), not just those over the limit.
I'm new to data science & pandas. I'm just trying to visualize the distribution of data from a single series (a single column), but the histogram that I'm generating is only a single column (see below where it's sorted descending).
My data is over 11 million rows. The max value is 27,235 and the min values are 1. I'd like to see the "count" column grouped into different bins and a column/bar whose height is the total for each bin. But, I'm only seeing a single bar and am not sure what to do.
Data
df = pd.DataFrame({'count':[27235,26000,25877]})
Solution
import matplotlib.pyplot as plt
df['count'].hist()
Alternatively
sns.distplot(df['count'])
I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now instead of using pandas get_dummmies() command I would like to use CountVectorizer to create the same output. Because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0.,max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns = vect.get_feature_names())
When I now output the respective data frame "count_vect_df" then the data frame contains a lot of columns which are empty/ contains only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
The CountVectorizer returns a sparse-matrix, which contains most of zero values, where non-zero values represent the number of times that specific term has appeared in the particular document.
I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row in my pandas dataframe. I am trying to repeat the same but for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collection.Counter(data) and its .most_common(k) method for such, please can anyone help me out.
from collections import Counter
import pandas as pd
#check frequency for the following for each row, but no repetition for row
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
#sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# the data frame with one coloumn tweet
df = pd.DataFrame(data,columns=['tweet'])
#no duplicates per row
print [(df.tweet.str.contains(word).sum(),word) for word in Liberal]
#captures all duplicates located in each row
print pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal})
References:
Contains & match