Pandas - get count of each boolean field

I have other programs where I group and count fields. Now I want to get a count of each boolean field. Is there a pandas way to do that rather than looping and writing my own code? Ideally, I would generate a new dataframe with the results (kind of like what I did here).
Easy Example CSV Data (data about poker hands generated):
Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
Program:
import pandas as pd
import warnings
filename = "./data/TestGroup2.csv"
# tell run time to ignore certain read_csv type errors (from pandas)
warnings.filterwarnings('ignore', message="^Columns.*")
count_cols = ['IsFourOfAKind','IsThreeOfAKind','IsPair']
#TODO - use the above to get counts of only these columns
df = pd.read_csv(filename)
print(df.head(10))
Desired Output - could just be a new dataframe
Column Count
IsFourOfAKind 1
IsThreeOfAKind 2
IsPair 4

Please try:
df.filter(like='Is').sum(0)
or, if you need the result as a new dataframe:
df1 = df.filter(like='Is').agg('sum').reset_index().rename(columns={'index': 'column', 0: 'count'})
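If only the columns listed in count_cols should be counted (rather than every column whose name contains 'Is'), a minimal sketch could look like the following. It assumes the flags are stored as 0/1 integers as in the sample CSV, and it reads the sample rows from an in-memory string instead of the file path in the question. The output column names Column and Count are taken from the desired output.

```python
import io

import pandas as pd

# Sample rows from the question, read from a string instead of a file.
csv_text = """Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

count_cols = ['IsFourOfAKind', 'IsThreeOfAKind', 'IsPair']

# Summing 0/1 flags gives the number of rows where each flag is set.
counts = (
    df[count_cols]
    .sum()
    .rename_axis('Column')
    .reset_index(name='Count')
)
print(counts)
```

With these seven sample rows, IsPair is set in four of them, so its count comes out as 4.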

Related

python - if-else in a for loop processing one column

I want to loop through a column and convert it into a processed series.
Below is an example data frame with two rows and four columns:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1. Otherwise, I want to use diagnosis_name_edited.
At the moment, I have the following code, which uses the diagnosis_name_edited column directly. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str))  # generator of processed strings
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)
Thank you.

If I get you right, try out this fast solution using numpy.where:
import numpy as np
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
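One thing to check is the threshold: in the sample rows the flag is only ever 0 or 1, so a strict `> 1` comparison would never pick the corrected value. The sketch below assumes the flag is 1 when a corrected value exists and therefore compares with `>= 1`. That is an assumption about the data, not something the question confirms. The name `chosen` is only illustrative.

```python
import numpy as np
import pandas as pd

data = [
    ['r/o ac. nephritis. /. nephrotic syndrome',
     ' ac. nephritis. /. nephrotic syndrome', 1,
     'ac nephritis nephrotic syndrome'],
    ['sternocleidomastoid contracture',
     'sternocleidomastoid contracture', 0, 'NA'],
]
df_diagnosis = pd.DataFrame(
    data,
    columns=['diagnosis_name', 'diagnosis_name_edited',
             'is_spell_corrected', 'spell_corrected_value'],
)

# Assumption: is_spell_corrected == 1 means "use the corrected value".
use_corrected = df_diagnosis['is_spell_corrected'] >= 1

# Pick one column or the other row by row, without an explicit loop.
chosen = pd.Series(
    np.where(use_corrected,
             df_diagnosis['spell_corrected_value'],
             df_diagnosis['diagnosis_name_edited']),
    index=df_diagnosis.index,
)
print(chosen.tolist())
```

The resulting series can then be fed to the same generator as before, iterating over `chosen.astype(str)` instead of `df_diagnosis['diagnosis_name_edited'].astype(str)`.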

How to delete two or more columns in csv using pandas

I have a CSV file with 4 columns (Name, User_Name, Phone#, Email). I want to delete the rows that have no value in both Phone# and Email. If a row has no value in Phone# but has a value in Email, or vice versa, I don't want to delete that row. I hope this makes clear what I want.
Sorry I don't have the code.
Thanks in advance

You can use the pandas notna() function to get a boolean series indicating which values are not missing. You can call this on both the email and the phone column and combine it with boolean | to get a truth series indicating that at least one of the email and phone columns is not missing. Then, you can use this series as a mask to filter the right rows.
import pandas as pd
# Import .csv file
df = pd.read_csv('mypath/myfile.csv')
# Filter to get rows where columns 'Email' and 'Phone#' are not both None
new_df = df[df['Email'].notna() | df['Phone#'].notna()]
# Write pandas df to disk
new_df.to_csv('mypath/mynewfile.csv', index=False)
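The same filter can also be written with `dropna`, which drops a row only when every column named in `subset` is missing if `how='all'` is passed. The sketch below uses a small made-up data frame (the names, phone number and e-mail address are placeholders, not data from the question) and checks that both approaches keep the same rows.

```python
import pandas as pd

# Hypothetical rows standing in for the CSV described in the question.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'User_Name': ['a1', 'b1', 'c1'],
    'Phone#': ['555-0100', None, None],
    'Email': [None, 'b@example.com', None],
})

# Drop rows only when BOTH 'Phone#' and 'Email' are missing.
new_df = df.dropna(subset=['Phone#', 'Email'], how='all')

# Equivalent boolean-mask version from the answer above.
mask_df = df[df['Email'].notna() | df['Phone#'].notna()]

print(new_df)
```

Row C is removed because it has neither a phone number nor an e-mail address. Rows A and B are kept because each has at least one of the two.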

Get count vectorizer vocabulary in new dataframe column by applying vectorizer on existing dataframe column using pandas

I have a dataframe column 'review' with content like 'Food was Awesome', and I want a new column that counts the number of repetitions of each word.
name The First Years Massaging Action Teether
review A favorite in our house!
rating 5
Name: 269, dtype: object
I am expecting output like {'Food': 1, 'was': 1, 'Awesome': 1}.
I tried a for loop, but it's taking too long to execute:
for row in range(products.shape[0]):
    try:
        count_vect.fit_transform([products['review_without_punctuation'][row]])
        products['word_count'][row]=count_vect.vocabulary_
    except:
        print(row)
I would like to do it without a for loop.

I found a solution for this.
I defined a function like this:
def Vectorize(text):
    try:
        count_vect.fit_transform([text])
        return count_vect.vocabulary_
    except Exception:
        # e.g. fit_transform raises ValueError when the text yields an empty vocabulary
        return -1
and applied it to the column:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)
This solution worked and I got the vocabulary in the new column.
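Note that `vocabulary_` maps each term to its column index in the count matrix, not to the number of times the term occurs. If per-row word counts are what is needed, one possible sketch uses `collections.Counter` instead. It swaps the vectorizer for a plain regular expression that mirrors CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters). The helper name `word_counts` and the sample reviews are made up for illustration.

```python
import re
from collections import Counter

import pandas as pd

# Same token pattern CountVectorizer uses by default.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_counts(text):
    """Return a {word: count} dict for one review (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(str(text).lower())
    return dict(Counter(tokens))


# Made-up reviews standing in for the products data frame.
products = pd.DataFrame({
    'review_without_punctuation': [
        'Food was Awesome',
        'A favorite in our house',
        'good good food',
    ]
})

products['word_count'] = products['review_without_punctuation'].apply(word_counts)
print(products['word_count'].tolist())
```

Because nothing is fitted per row, this avoids refitting a vectorizer for every review. Single-character tokens such as "A" are dropped, as they would be with the default CountVectorizer settings.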

You can get the count vector for all docs like this:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])
To get the count vector in array format for a particular document by index, say, the 1st doc,
count_vectors[0].toarray()
The vocabulary is in
cv.vocabulary_
To get the words that make up a count vector, say, for the 1st doc, use
cv.inverse_transform(count_vectors[0])

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() command, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(df_x)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())
When I now output the data frame "count_vect_df", it contains a lot of columns that are empty or contain only zero values. How can I avoid this?
Cheers,
Andi

From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
CountVectorizer returns a sparse matrix in which most entries are zero; each non-zero entry is the number of times a specific term appears in a particular document. Converting it with todense() makes all of those zeros explicit, which is why the data frame looks mostly empty.

Count frequency of multiple words

I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row of my pandas dataframe. I am trying to do the same for a set of words, such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collections.Counter(data) and its .most_common(k) method for this. Can anyone help me out?

from collections import Counter
import pandas as pd
# check the frequency of each of the following words
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
# sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# the data frame with one column, tweet
df = pd.DataFrame(data, columns=['tweet'])
# number of rows containing each word (a row counts at most once per word)
print([(df.tweet.str.contains(word).sum(), word) for word in Liberal])
# total occurrences of each word across all rows, including repeats within a row
print(pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal}))
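To get a single per-row count for the whole word set, analogous to the 'mulcair' example in the question, one option is to join the words into a single regular expression and pass it to `str.count`. This is a sketch under one assumption: overlapping words such as 'liberal' and 'liberals' should count as one match, so the alternatives are sorted longest first. Summing the individual per-word counts would instead count such a token more than once.

```python
import re

import pandas as pd

Liberal = ['lpc', 'ptlib', 'justin', 'trudeau', 'realchange', 'liberal',
           'liberals', 'liberal2015', 'lib2015', 'justin2015',
           'trudeau2015', 'lpc2015']

# Sample tweets from the answer above.
df = pd.DataFrame({'tweet': [
    'lpc living dream camerama',
    'jsutingnasndsa dnsadnsadnsa dsalpcdnsa',
    'but',
    'mulcair suggests thereslcp bad lpc blood',
]})

# One alternation pattern; longest words first so 'liberal2015'
# is matched as a whole rather than as 'liberal'.
pattern = '|'.join(
    re.escape(word) for word in sorted(Liberal, key=len, reverse=True)
)

# Number of non-overlapping matches of any word in each row.
df['COUNT'] = df['tweet'].str.count(pattern)
print(df)
```

Like the original `str.count('mulcair')`, this matches substrings, so 'lpc' inside 'dsalpcdnsa' is counted. Wrapping each word in `\b` word boundaries would restrict matches to whole words.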
