I have other programs where I group and count fields. Now, I want to get a count of each boolean field. Is there a Pandas way to do that rather than me looping and writing my own code? Ideally, I would generated a new dataframe with the results (kind of like what I did here).
Easy Example CSV Data (data about poker hands generated):
Hand,Other1,Other2,IsFourOfAKind,IsThreeOfAKind,IsPair
1,'a','b',1,0,0
2,'c','d',0,1,0
3,'a','b',0,1,0
4,'x','y',0,0,1
5,'a','b',0,0,1
6,'a','b',0,0,1
7,'a','b',0,0,1
Program:
import pandas as pd
import warnings
filename = "./data/TestGroup2.csv"
# tell run time to ignore certain read_csv type errors (from pandas)
warnings.filterwarnings('ignore', message="^Columns.*")
count_cols = ['IsFourOfAKind','IsThreeOfAKind','IsPair ']
enter code here
#TODO - use the above to get counts of only these columns
df = pd.read_csv(filename)
print(df.head(10))
Desired Output - could just be a new dataframe
Column Count
IsFourOfAKind 1
IsThreeOfAKind 2
IsPair 3
Please try:
df.filter(like='Is').sum(0)
or did you need;
df1=df.filter(like='Is').agg('sum').reset_index().rename(columns={'index':'column', 0:'count'})
Related
I am interested to loop through column to convert into processed series.
Below is an example of two row, four columns data frame:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use spell_corrected_value column if is_spell_corrected column is more than 1. Else, use diagnosis_name_edited
At the moment, I have following code to directly use diagnosis_name_edited column. How do I make into if-else/lambda check for is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str)) # characters (generator)
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series) #
Thank you.
If I get you right, try out this fast solution using numpy.where:
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
I have a csv file with 4 columns (Name, User_Name, Phone#, Email"). I want to delete those rows which have none value either in Phone# or Email. If there is none value in column (Phone#) and have some value in column(Email)or vise versa I don't want to delete that column. I hope you people will get what I want.
Sorry I don't have the code.
Thanks in advance
You can use the pandas notna() function to get a boolean series indicating which values are not missing. You can call this on both the email and the phone column and combine it with boolean | to get a truth series indicating that at least one of the email and phone columns is not missing. Then, you can use this series as a mask to filter the right rows.
import pandas as pd
# Import .csv file
df = pd.read_csv('mypath/myfile.csv')
# Filter to get rows where columns 'Email' and 'Phone#' are not both None
new_df = df[df['Email'].notna() | df['Phone#'].notna()]
# Write pandas df to disk
new_df.to_csv('mypath/mynewfile.csv', index=False)
I have dataframe column 'review' with content like 'Food was Awesome' and I want a new column which counts the number of repetition of each word.
name The First Years Massaging Action Teether
review A favorite in our house!
rating 5
Name: 269, dtype: object
Expecting output like ['Food':1,'was':1,'Awesome':1]
I tried with for loop but its taking too long to execute
for row in range(products.shape[0]):
try:
count_vect.fit_transform([products['review_without_punctuation'][row]])
products['word_count'][row]=count_vect.vocabulary_
except:
print(row)
I would like to do it without for loop.
I found a solution for this.
I have defined a function like this-
def Vectorize(text):
try:
count_vect.fit_transform([text])
return count_vect.vocabulary_
except:
return-1
and applied above function-
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
products['word_count'] = products['review_without_punctuation'].apply(Vectorize)
This solution worked and I got vocabulary in new column.
You can get the count vector for all docs like this:
cv = CountVectorizer()
count_vectors = cv.fit_transform(products['review_without_punctuation'])
To get the count vector in array format for a particular document by index, say, the 1st doc,
count_vectors[0].toarray()
The vocabulary is in
cv.vocabulary_
To get the words that make up a count vector, say, for the 1st doc, use
cv.inverse_transform(count_vectors[0])
I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now instead of using pandas get_dummmies() command I would like to use CountVectorizer to create the same output. Because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0.,max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns = vect.get_feature_names())
When I now output the respective data frame "count_vect_df" then the data frame contains a lot of columns which are empty/ contains only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
The CountVectorizer returns a sparse-matrix, which contains most of zero values, where non-zero values represent the number of times that specific term has appeared in the particular document.
I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row in my pandas dataframe. I am trying to repeat the same but for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collection.Counter(data) and its .most_common(k) method for such, please can anyone help me out.
from collections import Counter
import pandas as pd
#check frequency for the following for each row, but no repetition for row
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
#sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# the data frame with one coloumn tweet
df = pd.DataFrame(data,columns=['tweet'])
#no duplicates per row
print [(df.tweet.str.contains(word).sum(),word) for word in Liberal]
#captures all duplicates located in each row
print pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal})
References:
Contains & match