frequency table for all columns in pandas - pandas

I want to run a frequency table on each variable in my df.
def frequency_table(x):
    return pd.crosstab(index=x, columns="count")
for column in df:
    return frequency_table(column)
I got the error 'ValueError: If using all scalar values, you must pass an index'.
How can I fix this?
Thank you!

You aren't passing any data; you are passing just the column name. Iterating over a DataFrame yields its column names:
for column in df:
    print(column) # will print column names as strings
Try:
ctabs = {}
for column in df:
    ctabs[column] = frequency_table(df[column])
Then you can look at each crosstab by using the column name as the key in the ctabs dictionary.
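A self-contained sketch of that fix may help. The sample DataFrame and its column names are made up for illustration; only frequency_table and the ctabs dictionary come from the posts above.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "red"],
    "size": ["S", "M", "M", "S"],
})

def frequency_table(x):
    # x must be a Series (the column's data), not the column name.
    return pd.crosstab(index=x, columns="count")

# Iterating over a DataFrame yields column names, so index back into df
# to get the data for each column.
ctabs = {column: frequency_table(df[column]) for column in df}

print(ctabs["color"])
```

Each value in ctabs is a one-column DataFrame indexed by the distinct values of that column, so ctabs["color"].loc["red", "count"] gives the number of rows where color is "red".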

for column in df:
    print(df[column].value_counts())
For example:
import pandas as pd
my_series = pd.DataFrame(pd.Series([1,2,2,3,3,3, "fred", 1.8, 1.8]))
my_series[0].value_counts()
will generate output like the following:
3       3
1.8     2
2       2
fred    1
1       1
Name: 0, dtype: int64

Related

Why would an extra column (unnamed: 0) appear after saving the df and then reading it through pd.read_csv?

My code to save the df is:
fdi_out_vdem.to_csv("fdi_out_vdem.csv")
The code to read the df back into Python is:
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
The df:
Unnamed: 0  country_name  value
1           Spain         190
2           Spain         311
Your df has two columns, but also an index with "0" and "1". When writing it to csv it looks like this:
,country_name,value
0,Spain,190
1,Spain,311
When you import it with pandas, it is read as a df with 3 columns (the first of which has no name).
You have two possibilities here:
Save it without the index column:
df.to_csv("fdi_out_vdem.csv", index=False)
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
Or save it with the index column and declare that column as the index when reading it with pd.read_csv:
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
UPDATE
As recommended by #ouroboros1 in the comments, you could also name your index before saving it to csv, so that you can select the index column by that name:
df.index.name = "index"
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col="index")
You can either pass the parameter index_col=[0] to pandas.read_csv:
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
Or even better, get rid of the index at the beginning when calling pandas.DataFrame.to_csv:
fdi_out_vdem.to_csv("fdi_out_vdem.csv", index=False)
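To see the effect without touching the file system, here is a minimal round-trip sketch. It uses io.StringIO in place of the real CSV path, and the two-row DataFrame is rebuilt from the values shown in the question; both are assumptions made for the sake of a runnable example.

```python
import io
import pandas as pd

df = pd.DataFrame({"country_name": ["Spain", "Spain"], "value": [190, 311]})

# By default to_csv writes the index as a first column with an empty header.
csv_with_index = df.to_csv()
reread = pd.read_csv(io.StringIO(csv_with_index))
print(reread.columns.tolist())  # ['Unnamed: 0', 'country_name', 'value']

# Option 1: do not write the index at all.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Option 2: keep the index in the file and restore it when reading.
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(no_index.columns.tolist())
print(restored.columns.tolist())
```

Both options leave only country_name and value as columns. Option 2 is the one to pick if the index carries information you want to keep.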

Pandas DataFrame read_csv then GroupBy - How to get just a single count instead of one per column

I'm getting the counts I want, but I don't understand why it creates a separate count for each data column. How can I create just one column called "count"? Would the counts differ only when a column has a null (NaN) value?
Also, what are the actual column names below? Is the column name a tuple?
Can I change the groupby/agg to return just one column called "Count"?
CSV Data:
'Occupation','col1','col2'
'Carpenter','data1','x'
'Carpenter','data2','y'
'Carpenter','data3','z'
'Painter','data1','x'
'Painter','data2','y'
'Programmer','data1','z'
'Programmer','data2','x'
'Programmer','data3','y'
'Programmer','data4','z'
Program:
import pandas as pd
filename = "./data/TestGroup.csv"
df = pd.read_csv(filename)
print(df.head())
print("Computing stats by HandRank... ")
df_stats = df.groupby("'Occupation'").agg(['count'])
print(df_stats.head())
print("----- Columns-----")
for col_name in df_stats.columns:
    print(col_name)
Output:
Computing stats by HandRank...
             'col1' 'col2'
              count  count
'Occupation'
'Carpenter'       3      3
'Painter'         2      2
'Programmer'      4      4
----- Columns-----
("'col1'", 'count')
("'col2'", 'count')
The df.head() shows it is using "Occupation" as my column name.
Try size(), which counts the rows in each group and so returns a single column:
df_stats = df.groupby("'Occupation'").size().to_frame('count')
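To answer the side questions as well: count() reports the non-null values of every column separately, so the per-column numbers differ only where a column contains NaN, and agg(['count']) produces MultiIndex columns whose labels are tuples. The sketch below rebuilds the CSV from the question in memory; reading it with quotechar="'" is my own addition (not in the original program) so that the column is named Occupation rather than 'Occupation'.

```python
import io
import pandas as pd

csv_text = """'Occupation','col1','col2'
'Carpenter','data1','x'
'Carpenter','data2','y'
'Carpenter','data3','z'
'Painter','data1','x'
'Painter','data2','y'
'Programmer','data1','z'
'Programmer','data2','x'
'Programmer','data3','y'
'Programmer','data4','z'
"""

# Assumption: treat the single quote as the quoting character so the
# quotes are stripped from both the header and the values.
df = pd.read_csv(io.StringIO(csv_text), quotechar="'")

# size() counts rows per group, regardless of NaN, giving one column.
df_stats = df.groupby("Occupation").size().to_frame("count")
print(df_stats)

# agg(['count']) counts non-null values per column; its column labels
# are (column, aggregation) tuples.
multi = df.groupby("Occupation").agg(["count"])
print(multi.columns.tolist())  # [('col1', 'count'), ('col2', 'count')]
```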

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby method to this output in the Streamlit app.
If I try the following:
st.table(new_df.value_counts())
I get this error:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column name in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which has no value_counts method in pandas versions before 1.1.0.
Can you try st.table(new_df[col_name].value_counts())?
I think the error occurs because value_counts() applies to a Series and not to a dataframe.
You can try converting the ".value_counts" output to a dataframe.
If you want to apply it to a single column:
def value_counts_df(df, col):
    """
    Returns df[col].value_counts() as a DataFrame
    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts
    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.
    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
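For a variant that is easy to test outside Streamlit, here is a small sketch that builds one count table per selected column. The helper name column_value_counts and the sample data are hypothetical; in the app you would pass the list returned by st.multiselect and hand each resulting table to st.table or st.dataframe.

```python
import pandas as pd

def column_value_counts(df, columns):
    """Return {column name: one-column "count" DataFrame} for each column."""
    tables = {}
    for col in columns:
        # Selecting a single column gives a Series, which has value_counts().
        counts = df[col].value_counts(dropna=False)
        # Name the column explicitly so the result does not depend on
        # the pandas version's default naming.
        tables[col] = counts.rename("count").to_frame()
    return tables

# Hypothetical data and selection, standing in for df and selected_columns.
df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo"], "pet": ["cat", "cat", "dog"]})
tables = column_value_counts(df, ["city", "pet"])
print(tables["city"])
```

Keeping one table per column avoids the NaN padding you get when value counts of unrelated columns are forced into a single frame.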

Pandas dataframe row data filtering

I have a column of data in a pandas dataframe in Bxxxx-xx-xx-xx.y format. I only need the first part (Bxxxx). How do I split the data? The same column also contains data in BSxxxx-xx-xx-xx format, which I would like to remove using regex='^BS', but for some reason it's not working. Any help would be appreciated. BTW, I am using the df.filter command.
This should work.
df[df.col1.apply(lambda x: x.split("-")[0][0:2]!="BS")].col1.apply(lambda x: x.split("-")[0])
Consider the example below:
df = pd.DataFrame({
    'col':['B123-34-gd-op','BS01010-9090-00s00','B000003-3frdef4-gdi-ortp','B1263423-304-gdcd-op','Bfoo3-poo-plld-opo', 'BSfewf-sfdsd-cvc']
})
print(df)
Output:
                        col
0             B123-34-gd-op
1        BS01010-9090-00s00
2  B000003-3frdef4-gdi-ortp
3      B1263423-304-gdcd-op
4        Bfoo3-poo-plld-opo
5          BSfewf-sfdsd-cvc
Now let's do two tasks:
Extract the Bxxxx part from Bxxxx-xx-xx-xx.
Remove the BSxxx formatted strings.
Consider the code below, which uses startswith():
df[~df.col.str.startswith('BS')].col.str.split('-').str[0]
Output:
0        B123
2     B000003
3    B1263423
4       Bfoo3
Name: col, dtype: object
Breakdown:
df[~df.col.str.startswith('BS')] gives us all the strings that do not start with BS. Next, we split those strings on - and take the first part with .col.str.split('-').str[0].
You can define a function that treats Bxxxx-xx-xx-xx.y as a string and extracts just the first 5 characters. This assumes the prefix is always exactly five characters long.
>>> def edit_entry(x):
...     return (str(x)[:5])
>>> df['Column_name'].apply(edit_entry)
A one-liner solution would be:
df["column_name"] = df["column_name"].apply(lambda x: x[:5])
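As for why regex='^BS' did nothing: DataFrame.filter selects rows or columns by their labels, not by cell values, so it never looks at the strings in the column. A sketch of a regex route that does work on the values follows, using the .str accessor. The column name col and the shortened sample data are taken from the example above; the patterns are my own choice.

```python
import pandas as pd

df = pd.DataFrame({
    "col": ["B123-34-gd-op", "BS01010-9090-00s00", "Bfoo3-poo-plld-opo"],
})

# Keep only the rows whose value does not start with "BS".
mask = ~df["col"].str.contains(r"^BS")

# Capture everything before the first hyphen; expand=False returns a Series.
prefix = df.loc[mask, "col"].str.extract(r"^([^-]+)", expand=False)
print(prefix.tolist())  # ['B123', 'Bfoo3']
```

Unlike slicing the first five characters, the capture group copes with prefixes of any length.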

Equivalent of R's which in pandas

How do I get the column of the min in the example below, not the actual number?
In R I would do:
which(min(abs(_quantiles - mean(_quantiles))))
In pandas I tried (did not work):
_quantiles.which(min(abs(_quantiles - mean(_quantiles))))
You could do it this way: call np.min on the df's values as a NumPy array, use the result to build a boolean mask, and drop the columns that don't contain at least one non-NaN value:
In [2]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df
Out[2]:
          a         b
0 -0.860548 -2.427571
1  0.136942  1.020901
2 -1.262078 -1.122940
3 -1.290127 -1.031050
4  1.227465  1.027870
In [15]:
df[df==np.min(df.values)].dropna(axis=1, thresh=1).columns
Out[15]:
Index(['b'], dtype='object')
idxmin and idxmax exist, but there is no general which as far as I can see. Compute the absolute deviations first, then call idxmin on the result:
(_quantiles - _quantiles.mean()).abs().idxmin()
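A small worked sketch, assuming _quantiles is a Series with labelled entries (the question does not show its type, so the values and labels here are invented):

```python
import pandas as pd

# Hypothetical quantiles; the labels and values are illustrative only.
_quantiles = pd.Series({"q25": 1.0, "q50": 4.0, "q75": 10.0})

# Absolute deviation of each entry from the mean (5.0 here).
deviation = (_quantiles - _quantiles.mean()).abs()

# idxmin returns the label of the smallest value, not the value itself.
label = deviation.idxmin()
print(label)  # q50

# For a DataFrame, the column that holds the overall minimum:
df = pd.DataFrame({"a": [3.0, 1.0], "b": [-2.0, 5.0]})
min_column = df.min().idxmin()
print(min_column)  # b
```

df.min() reduces each column to its minimum, so calling idxmin on that result names the column holding the smallest value in the whole frame.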