Pandas DataFrame read_csv then GroupBy - how to get a single count instead of one per column

I'm getting the counts I want, but I don't understand why it creates a separate count for each data column. How can I create just one column called "count"? Would the counts only differ when a column has a Null (NaN) value?
Also, what are the actual column names below? Is the column name a tuple?
Can I change the groupby/agg to return just one column called "Count"?
CSV Data:
'Occupation','col1','col2'
'Carpenter','data1','x'
'Carpenter','data2','y'
'Carpenter','data3','z'
'Painter','data1','x'
'Painter','data2','y'
'Programmer','data1','z'
'Programmer','data2','x'
'Programmer','data3','y'
'Programmer','data4','z'
Program:
import pandas as pd

filename = "./data/TestGroup.csv"
df = pd.read_csv(filename)
print(df.head())
print("Computing stats by HandRank... ")
df_stats = df.groupby("'Occupation'").agg(['count'])
print(df_stats.head())
print("----- Columns-----")
for col_name in df_stats.columns:
    print(col_name)
Output:
Computing stats by HandRank...
             'col1' 'col2'
              count  count
'Occupation'
'Carpenter'       3      3
'Painter'         2      2
'Programmer'      4      4
----- Columns-----
("'col1'", 'count')
("'col2'", 'count')
The df.head() shows it is using "Occupation" as my column name.

Try with size
df_stats = df.groupby("'Occupation'").size().to_frame('count')
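A runnable sketch of the size-based answer, with the quote characters stripped from the column names for readability (the data below mirrors the question's CSV):

```python
import pandas as pd

# Sample data mirroring the question's CSV
df = pd.DataFrame({
    "Occupation": ["Carpenter", "Carpenter", "Carpenter", "Painter",
                   "Painter", "Programmer", "Programmer", "Programmer", "Programmer"],
    "col1": ["data1", "data2", "data3", "data1", "data2",
             "data1", "data2", "data3", "data4"],
    "col2": ["x", "y", "z", "x", "y", "z", "x", "y", "z"],
})

# size() counts rows per group exactly once, regardless of how many
# columns exist, and to_frame() names the resulting single column
df_stats = df.groupby("Occupation").size().to_frame("count")
# Carpenter -> 3, Painter -> 2, Programmer -> 4
```

Note the difference from agg(['count']): count excludes NaN values per column (which is why you get one count per column, and why they can differ), while size counts rows per group once.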

Related

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns, such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The 2nd dataframe contains certain column names from the 1st dataframe, e.g.
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose names are present in the 2nd. In the above example: t_pono, t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained from df.columns; it returns an Index you can treat like a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can do set intersection (and I cast it to a list, so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
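One caveat: a set intersection does not preserve column order. If the order of df1's columns matters, a list comprehension (plain Python, runnable as-is) keeps it:

```python
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]

# Membership checks against a set stay fast; the comprehension
# walks df1_cols in order, so the original column order is kept
wanted = set(df2_cols)
select_columns = [c for c in df1_cols if c in wanted]
# select_columns == ["t_pono", "t_pric"]
```

The resulting list can then be passed to df1.select(*select_columns) exactly as above.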

My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe. How do I do that?

My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe.
How do I do that?
I used query(), but got:
TypeError: query() takes from 2 to 3 positional arguments but 27 were given
If you want to select the columns based on their name, you can do the following:
df_new = df[["colA", "colB", "colC", ...]]
or use the "filter" function:
df_new = df.filter(["colA", "colB", "colC", ..])
In case that your column selection is based on the index of columns:
df_new = df.iloc[:, 0:27] # if columns are consecutive
df_new = df.iloc[:, [0,2,10,..]] # if columns are not consecutive (the numbers refer to the column indices)
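A small runnable sketch of the three options above (the column names here are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the 500-column DataFrame
df = pd.DataFrame({"colA": [1, 2], "colB": [3, 4],
                   "colC": [5, 6], "colD": [7, 8]})

by_name = df[["colA", "colC"]]           # label-based selection
by_filter = df.filter(["colA", "colC"])  # same result via filter()
by_pos = df.iloc[:, 0:2]                 # first two columns by position
```

For the real case, list all 27 names (or positions) in the same way; none of these calls modify the original frame.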

Delete all rows with an empty cell anywhere in the table at once in pandas

I have googled it and found lots of questions on Stack Overflow. So suppose I have a dataframe like this:
A B
-----
1
2
4 4
First 3 rows will be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request - first replace empty cells with NaN:
import numpy as np

df = df.replace(r'^\s*$', np.nan, regex=True)  # blank or whitespace-only cells -> NaN
df = df.dropna()                               # drop every row that contains a NaN
If you only want to drop rows based on emptiness in specific columns, pass those column names to dropna via its subset argument.
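A sketch of the column-restricted variant using dropna's subset parameter (the column names are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "", "4"], "B": ["", "", "4"]})

df = df.replace(r'^\s*$', np.nan, regex=True)  # blanks -> NaN
df = df.dropna(subset=["B"])  # drop a row only when column B is empty
```

Here the first two rows are dropped because B is empty in both, even though A has a value in the first row; only the last row survives.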

Replace a subset of pandas data frame with another data frame

I have a data frame (DF1) with 100 columns (one of the columns is ID).
I have one more data frame (DF2) with 30 columns (one column is ID).
I have to update the first 30 columns of DF1 with the values in DF2, keeping the rest of the values in the remaining columns of DF1 intact.
Update the first 30 column values in DF1 (out of the 100 columns) when the ID in DF2 is present in DF1.
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = df1.reset_index() \
[['index', 'ID']] \
.merge(df2, on='ID')
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index, or the merge will fail complaining about a duplicate key and suffixes.
Here a slow-motion of its inner workings:
df1.reset_index() returns a dataframe same as df1 but with an additional column: index
[['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1
.merge(...) merges with df2, matching on ID. The result (joined) is a dataframe with 31 columns: index, ID, and the 29 remaining columns of df2.
df1.loc[<row_indexes>, <column_names>] = <another_dataframe> means you want to replace those particular cells with data from another_dataframe. Since joined still carries the two extra columns (index and ID), we drop them before assigning.
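If the overlapping columns happen to share names between the two frames, DataFrame.update (with ID as the index on both sides) is a simpler alternative; a sketch with invented column names:

```python
import pandas as pd

# Invented stand-ins: df1 is the wide frame, df2 carries replacement values
df1 = pd.DataFrame({"ID": [1, 2, 3], "a": [10, 20, 30],
                    "b": [1, 2, 3], "extra": ["x", "y", "z"]})
df2 = pd.DataFrame({"ID": [2, 3], "a": [99, 98], "b": [7, 8]})

df1 = df1.set_index("ID")
df1.update(df2.set_index("ID"))  # overwrite cells where index AND column name match
df1 = df1.reset_index()
```

update() modifies df1 in place and only touches matching labels: rows whose ID has no counterpart in df2 (here ID 1), and columns df2 lacks (here "extra"), keep their original values.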

frequency table for all columns in pandas

I want to run a frequency table on each variable in my df.
def frequency_table(x):
    return pd.crosstab(index=x, columns="count")

for column in df:
    return frequency_table(column)
I got the error 'ValueError: If using all scalar values, you must pass an index'.
How can I fix this?
Thank you!
You aren't passing any data. You are just passing a column name.
for column in df:
    print(column)  # will print column names as strings
try
ctabs = {}
for column in df:
    ctabs[column] = frequency_table(df[column])
then you can look at each crosstab by using the column name as keys in the ctabs dictionary
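A runnable version of that dictionary approach, using a small invented frame:

```python
import pandas as pd

# Toy data; the column names are invented for illustration
df = pd.DataFrame({"color": ["red", "red", "blue"], "size": ["S", "M", "M"]})

def frequency_table(x):
    # x must be the column's data (a Series), not the column's name
    return pd.crosstab(index=x, columns="count")

ctabs = {}
for column in df:
    ctabs[column] = frequency_table(df[column])
```

ctabs["color"] then holds the crosstab for that column, with one row per distinct value.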
for column in df:
    print(df[column].value_counts())
For example:
import pandas as pd
my_series = pd.DataFrame(pd.Series([1,2,2,3,3,3, "fred", 1.8, 1.8]))
my_series[0].value_counts()
will generate output like the one below:
3 3
1.8 2
2 2
fred 1
1 1
Name: 0, dtype: int64