Count frequency of multiple words - pandas

I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appears in each row of my pandas dataframe. I am trying to do the same for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collections.Counter(data) and its .most_common(k) method for this. Can anyone help me out?

import pandas as pd
# words to look for in each tweet
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
# sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# the data frame with one column, tweet
df = pd.DataFrame(data, columns=['tweet'])
# number of rows containing each word (a row counts once, however often the word repeats in it)
print([(df.tweet.str.contains(word).sum(), word) for word in Liberal])
# total occurrences of each word, including repeats within a row
print(pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal}))
References:
Contains & match
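For the per-row count the question asks about, one possible sketch is to join the words into a single regular expression and pass it to str.count. The sample tweets and the COUNT column name here are illustrative assumptions. Note that alternation counts non-overlapping matches, so a token such as liberal2015 is counted once, not once for each listed word it contains.

```python
import re
import pandas as pd

Liberal = ['lpc', 'justin', 'trudeau', 'realchange', 'liberal', 'liberals',
           'liberal2015', 'lib2015', 'justin2015', 'trudeau2015', 'lpc2015']

# Illustrative sample data (an assumption, not the asker's real tweets)
df = pd.DataFrame({'tweet': ['lpc living dream camerama',
                             'but',
                             'mulcair suggests thereslcp bad lpc blood',
                             'justin trudeau lpc']})

# Longest words first, so 'liberal2015' is tried before 'liberal'
words = sorted(Liberal, key=len, reverse=True)
pattern = '|'.join(re.escape(w) for w in words)

# Number of non-overlapping matches of any listed word, per row
df['COUNT'] = df.tweet.str.count(pattern)
print(df)
```

If each listed word should be counted separately even when words overlap, summing df.tweet.str.count(w) over the words is an alternative.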

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
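One possibly faster alternative to the nested loop is groupby with transform('size'), which broadcasts each group's size back onto the rows that belong to it. This is a minimal sketch: the small frame below is a hypothetical stand-in for the CSV data, and its column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV file
df_k1 = pd.DataFrame({
    "a": [1, 1, 2, 1],
    "b": ["x", "x", "y", "z"],
    "value": [10, 20, 30, 40],
})

# Group by every column except the last, as in the question
columns_for_groups = list(df_k1.columns)[:-1]
last_column = df_k1.columns[-1]

# transform('size') returns one value per original row: the size of its group
group_size = df_k1.groupby(columns_for_groups)[last_column].transform("size")
print(group_size)
```

The result keeps the index of df_k1, so row i of group_size lines up with row i of the original frame. Rows with missing values in the grouping columns may need separate handling, since groupby excludes them by default.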

Pandas - finding most important words from each row

I have a pandas dataframe with a text column. I am trying to find the most important words from this text column for each row. How do I do this?
I am currently trying to do this using tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(stop_words='english')
x = v.fit_transform(df['cleansed_text'])
I see that x is a sparse matrix with the same number of rows as my dataframe, and it looks like the number of columns equals the number of words in the vocabulary.
How do I use this to find the most important words for each row?

Pandas - get count of each boolean field

I have other programs where I group and count fields. Now I want to get a count of each boolean field. Is there a Pandas way to do that, rather than looping and writing my own code? Ideally, I would generate a new dataframe with the results (kind of like what I did here).
Easy Example CSV Data (data about poker hands generated):
Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
Program:
import pandas as pd
import warnings
filename = "./data/TestGroup2.csv"
# tell run time to ignore certain read_csv type errors (from pandas)
warnings.filterwarnings('ignore', message="^Columns.*")
count_cols = ['IsFourOfAKind','IsThreeOfAKind','IsPair']
#TODO - use the above to get counts of only these columns
df = pd.read_csv(filename)
print(df.head(10))
Desired output (could just be a new dataframe):
Column          Count
IsFourOfAKind   1
IsThreeOfAKind  2
IsPair          4
Please try:
df.filter(like='Is').sum(0)
or, if you need the result as a dataframe:
df1=df.filter(like='Is').agg('sum').reset_index().rename(columns={'index':'column', 0:'count'})
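If the count should be restricted to an explicit list of columns instead of every column whose name contains 'Is', a sketch along these lines may help. The CSV text is inlined here (without the quote characters) so the example runs on its own, and the Column and Count output names are assumptions taken from the desired output.

```python
import io
import pandas as pd

# Inline copy of the sample data so the example is self-contained
csv_text = """Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,a,b,1,0,0
2,c,d,0,1,0
3,a,b,0,1,0
4,x,y,0,0,1
5,a,b,0,0,1
6,a,b,0,0,1
7,a,b,0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

count_cols = ['IsFourOfAKind', 'IsThreeOfAKind', 'IsPair']

# Summing 0/1 flags gives the number of rows where each flag is set
counts = (
    df[count_cols]
    .sum()
    .rename_axis('Column')
    .reset_index(name='Count')
)
print(counts)
```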

More efficient Pandas code

I am trying to learn Python and Pandas, and coming from VBA I am still in the habit of looping through every single cell, but I am looking for ways to operate on entire rows at a time.
Below is part of my code. I have about 3000 stocks in the columns and about 40 or so data points in the rows, saved in a dataframe called df.
I use the same kind of loop as shown to test multiple criteria based on row values for the stock in each column. As you can see, my code uses .ix to loop through the 'cells' in the dataframe.
I have looked for ways to operate on entire rows at a time, but every attempt has failed.
This takes about 7 minutes for the 3000 stocks (but only about 1 minute or so for 2000 stocks??). Surely this can run much faster?
def piotrosky():
    df_temp = pd.DataFrame(np.nan, index=range(10), columns=df.columns)
    # use a dictionary as the rename input so it does not have to be done for each row
    dic={0:'positiveNetIncome',1:'positiveOperatingCF',2:'increasingROA', 3:'QualityOfEarnings',4:'longTermDebtToAssets',
         5:'currentRatio', 6:'sharesOutVsSharesLast',7:'increasingGrossM',8:'IncreasingAssetTurnOver', 9:'total' }
    df_temp.rename(dic, inplace = True)
    r=1
    # df has stocks in the columns and data points in the rows,
    # so I always need to loop across the columns
    for i in range(df.shape[1]-1):
        # positive net income
        if df.ix[2,r]>0:
            df_temp.ix[0,r]=1
        else:
            df_temp.ix[0,r]=0
        # positive operating cash flow
        if df.ix[3,r]>0:
            df_temp.ix[1,r]=1
        else:
            df_temp.ix[1,r]=0
        # continue with several similar checks
        # total
        df_temp.ix[9,r]=df_temp.ix[0,r]+df_temp.ix[1,r]+df_temp.ix[2,r]+df_temp.ix[3,r]+ \
            df_temp.ix[4,r]+df_temp.ix[5,r]+df_temp.ix[6,r]+df_temp.ix[7,r]+df_temp.ix[8,r]
        r=r+1
Edit:
All of the below is done on a dataframe that is the transpose of the one you describe in your post. df.T should produce properly formatted input.
Method:
For conditionals on pandas dataframes, you can use the numpy function np.where:
criteria = {}
# np.where(condition, value_if_true, value_if_false)
criteria['positive_net_income'] = np.where(df[2] > 0, 1, 0)
After you get these numpy arrays, you can construct a dataframe from them,
pd.DataFrame(criteria)
and sum across it
pd.DataFrame(criteria).sum(axis=1)
to get a Series you can add as a column to your initial DataFrame
def piotrosky(df):
    criteria = {}
    criteria['positive_net_income'] = np.where(df[2] > 0, 1, 0)
    criteria['positive_operating_cf'] = np.where(df[3] > 0, 1, 0)
    ...
    return pd.DataFrame(criteria).sum(axis=1)
df['piotrosky_score'] = piotrosky(df)
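To make the approach concrete, here is a small self-contained sketch on toy data. The tickers, the values, and the restriction to two criteria are assumptions for illustration only; the real frame has about 40 data points per stock.

```python
import numpy as np
import pandas as pd

# Toy data in the question's layout: data points in rows, stocks in columns.
# Row 2 stands in for net income and row 3 for operating cash flow.
raw = pd.DataFrame({
    "AAA": [0.0, 0.0, 5.0, -1.0],
    "BBB": [0.0, 0.0, -2.0, 3.0],
    "CCC": [0.0, 0.0, 1.0, 2.0],
})

# Transpose so that each stock is a row and each data point a column
df = raw.T

# Each criterion is evaluated for all stocks at once
criteria = pd.DataFrame({
    "positive_net_income": np.where(df[2] > 0, 1, 0),
    "positive_operating_cf": np.where(df[3] > 0, 1, 0),
}, index=df.index)

# Row-wise sum gives the total score per stock
score = criteria.sum(axis=1)
print(score)
```

Passing index=df.index keeps the stock labels on the result, so the scores can be assigned back to the transposed frame without realigning.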

Counting Data based on Cover_Type using pandas

I have the following data in an Excel sheet.
I need to count the number of times a given elevation occurs for a given cover_type. For example, elevation=1905 occurs twice for cover_type=6 and once for cover_type=3. I need to do the same for Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Hillshade_9am, Hillshade_Noon, Hillshade_3pm, Horizontal_Distance_To_Fire_Points, Soil, Wilderness_Area.
I will be using the counts to calculate the entropy of each column. I need to execute this formula.
You can do the following:
import pandas as pd
df = pd.read_csv('train_data.csv')
grouped = df[['elevation','cover_type']].groupby(['elevation','cover_type'], as_index = False, sort = False)['cover_type'].count()
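A variant of the same idea, sketched with a few made-up rows in place of train_data.csv: groupby(...).size() counts the rows in each (elevation, cover_type) pair without having to select a column to count. The count column name is an assumption.

```python
import pandas as pd

# Made-up rows standing in for train_data.csv
df = pd.DataFrame({
    "elevation": [1905, 1905, 1905, 2000],
    "cover_type": [6, 6, 3, 3],
})

# One row per (elevation, cover_type) pair with the number of occurrences
counts = (
    df.groupby(["elevation", "cover_type"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

The same call can be repeated for the other columns by swapping "elevation" for each column name in turn.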