Return column name of min sum greater than 0 in pandas

I have a dataframe with one date column and the rest numeric columns, something like this:
date       col1  col2  col3  col4
2020-1-30  0     1     2     3
2020-2-1   0     2     3     4
2020-2-2   0     2     2     5
I now want to find the name of the column with the smallest column sum, considering only sums greater than 0. In the above case it should return col2, because its sum (5) is the smallest of all the columns apart from col1, whose sum is 0. Appreciate any help with this.

I would use:
# get only numeric columns
df2 = df.select_dtypes('number')
# drop the columns with 0, compute the sum
# get index of min
out = df2.loc[:, df2.ne(0).all()].sum().idxmin()
If you want to ignore a column only if all values are 0, use any in place of all:
df2.loc[:, df2.ne(0).any()].sum().idxmin()
Output: 'col2'
all minima
# get only numeric columns
df2 = df.select_dtypes('number')
# drop the columns with 0, compute the sum
s = df2.loc[:, df2.ne(0).any()].sum()
# get all minimal
out = s[s.eq(s.min())].index.tolist()
Output:
['col2']
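For completeness, here is a minimal self-contained version of the above (a sketch, assuming pandas is imported as pd and the frame is rebuilt from the sample data in the question):
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-1-30', '2020-2-1', '2020-2-2'],
    'col1': [0, 0, 0],
    'col2': [1, 2, 2],
    'col3': [2, 3, 2],
    'col4': [3, 4, 5],
})

# keep numeric columns, drop the all-zero ones, then take the column with the smallest sum
df2 = df.select_dtypes('number')
print(df2.loc[:, df2.ne(0).any()].sum().idxmin())  # 'col2'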

Related

Subtracting the 2nd column of the 2nd DataFrame from the 2nd column of the 1st DataFrame based on conditions

I have 2 dataframes with 2 columns each and with different numbers of rows. I want to subtract the 2nd column of my 2nd dataframe from the 2nd column of my 1st dataframe and store the result in another dataframe. NOTE: only subtract values whose column 1 entries match.
e.g.
df:
col1 col2
29752 35023.0 40934.0
subtract from it the matching row of df2:
df2:
c1 c2
962 35023.0 40935.13
Here is my first Dataframe:
col1 col2
0 193431.0 40955.41
1 193432.0 40955.63
2 193433.0 40955.89
3 193434.0 40956.31
4 193435.0 40956.43
... ...
29752 35023.0 40934.89
29753 35024.0 40935.00
29754 35025.0 40935.13
29755 35026.0 40934.85
29756 35027.0 40935.18
Here is my 2nd dataframe:
c1 c2
0 194549.0 41561.89
1 194550.0 41563.96
2 194551.0 41563.93
3 194552.0 41562.75
4 194553.0 41561.22
.. ... ...
962 35027.0 41563.80
963 35026.0 41563.18
964 35025.0 41563.87
965 35024.0 41563.97
966 35023.0 41564.02
You can iterate over the rows of the first dataframe, look up the row of df2 whose first column holds the same value, and subtract the paired values:
for i, row in df1.iterrows():
    # find the df2 row whose key column c1 matches this row's col1
    match = df2.loc[df2['c1'] == row['col1'], 'c2']
    if not match.empty:
        diff = row['col2'] - match.iloc[0]
        print(i, diff)
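A vectorized alternative (a sketch, assuming the two frames are named df1 and df2 with the column names shown above) is to merge on the key columns and subtract in one step:
# pair up rows whose keys match, then subtract the paired values
merged = df1.merge(df2, left_on='col1', right_on='c1', how='inner')
merged['diff'] = merged['col2'] - merged['c2']
print(merged[['col1', 'diff']])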

How to remove columns that have all values below a certain threshold

I am trying to remove any columns in my dataframe that do not have at least one value above .9. I know this probably isn't the most efficient way to do it, but I can't find the problem with it. I know it isn't correct because it only removes one column, and I know it should be closer to 20. So I count how many values are below .9, and if the count equals the length of the column's value list I drop that column. Thanks in advance.
for i in range(len(df3.columns)):
    count = 0
    for j in df3.iloc[:, i].tolist():
        if j < .9:
            count += 1
    if len(df3.iloc[:, i].tolist()) == count:
        df4 = df3.drop(df3.columns[i], axis=1)
df4
You can loop through each column in the dataframe and check the column's maximum value against your defined threshold (0.9 in this case); if no value exceeds the threshold, drop the column.
The input:
col1 col2 col3
0 0.2 0.8 1.0
1 0.3 0.5 0.5
Code:
import pandas as pd

# define dataframe
df = pd.DataFrame({'col1': [0.2, 0.3], 'col2': [0.8, 0.5], 'col3': [1, 0.5]})
# define threshold
threshold = 0.9
# loop through each column in dataframe
for col in df:
    # get the maximum value in the column and
    # check if it is less than or equal to the defined threshold
    if df[col].max() <= threshold:
        # if true, drop the column
        df = df.drop([col], axis=1)
This outputs:
col3
0 1.0
1 0.5
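A loop-free variant of the same idea (a sketch, assuming the same df and threshold as above) keeps only the columns whose maximum exceeds the threshold:
# boolean mask over columns: keep a column only if its max is above the threshold
df = df.loc[:, df.max() > threshold]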

Add a categorical column with three values assigned to each row in a PySpark df, then apply aggregate functions to 30 columns

I have a dataframe as a result of validation codes:
df = spark.createDataFrame(
    list(zip(
        ['c_1', 'c_1', 'c_1', 'c_2', 'c_3', 'c_1', 'c_2', 'c_2'],
        ['valid', 'valid', 'invalid', 'missing', 'invalid', 'valid', 'valid', 'missing'],
        ['missing', 'valid', 'invalid', 'invalid', 'valid', 'valid', 'missing', 'missing'],
        ['invalid', 'valid', 'valid', 'missing', 'missing', 'valid', 'invalid', 'missing'],
    )),
    ['clinic_id', 'name', 'phone', 'city'],
)
I counted the number of valid, invalid, and missing values using aggregation code grouped by clinic_id in PySpark:
from pyspark.sql.functions import col, when, sum

agg_table = (
    df
    .groupBy('clinic_id')
    .agg(
        # name
        sum(when(col('name') == 'valid', 1).otherwise(0)).alias('validname')
        , sum(when(col('name') == 'invalid', 1).otherwise(0)).alias('invalidname')
        , sum(when(col('name') == 'missing', 1).otherwise(0)).alias('missingname')
        # phone
        , sum(when(col('phone') == 'valid', 1).otherwise(0)).alias('validphone')
        , sum(when(col('phone') == 'invalid', 1).otherwise(0)).alias('invalidphone')
        , sum(when(col('phone') == 'missing', 1).otherwise(0)).alias('missingphone')
        # city
        , sum(when(col('city') == 'valid', 1).otherwise(0)).alias('validcity')
        , sum(when(col('city') == 'invalid', 1).otherwise(0)).alias('invalidcity')
        , sum(when(col('city') == 'missing', 1).otherwise(0)).alias('missingcity')
    )
)
display(agg_table)
output:
clinic_id validname invalidname missingname ... invalidcity missingcity
--------- --------- ----------- ----------- ... ----------- -----------
c_1 3 1 0 ... 1 0
c_2 1 0 2 ... 1 0
c_3 0 1 0 ... 0 1
The resulting aggregated table is fine, but it is not ideal for further analysis. I tried pivoting within PySpark, trying to get something like the following:
# note: the counts below are just made up, not the actual counts from above, but I hope you get what I mean
clinic_id category name phone city
-------- ------- ---- ------- ----
c_1 valid 3 1 3
c_1 invalid 1 0 2
c_1 missing 0 2 3
c_2 valid 3 1 3
c_2 invalid 1 0 2
c_2 missing 0 2 3
c_3 valid 3 1 3
c_3 invalid 1 0 2
c_3 missing 0 2 3
I initially searched for pivot/unpivot, but I learned it is called unstack in PySpark, and I also came across mapping.
I tried the approach suggested in How to unstack dataset (using pivot)? but it shows me only one column, and I cannot get the desired result when I apply it to my dataframe of 30 columns.
I also tried the following using the validated table/dataframe
expression = ""
cnt = 0
for column in agg_table.columns:
    if column != 'clinc_id':
        cnt += 1
        expression += f"'{column}' , {column},"
exprs = f"stack({cnt}, {expression[:-1]}) as (Type,Value)"
unpivoted = agg_table.select('clinic_id', expr(exprs))
I get an error just pointing to the line that may be referring to a return value.
I also tried grouping the results by id and category, but that is where I am stuck. If I group by an aggregated variable, say the values of validname, the aggregate function only counts the values in that column and does not apply to the other count columns. So I thought of inserting a column with .withColumn, assigning the three categories to each ID, so that each aggregated count is grouped by id and category as in the table above, but I haven't had any luck finding a solution for this.
Also, maybe a SQL approach would be easier?
I found the right search phrase: "column to row in pyspark"
One of the suggested answers that fits my dataframe is this function:
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
long_df = (to_long(df, ["clinic_id"])
           .withColumnRenamed("key", "column_names")
           .withColumnRenamed("val", "status"))
This created a dataframe of three columns: clinic_id, column_names, status (valid, invalid, missing).
Then I created my aggregated table grouped by clinic_id, status:
display(
    long_df.groupBy('clinic_id', 'status')
    .agg(
        sum(when(col('column_names') == 'name', 1).otherwise(0)).alias('name')
        , sum(when(col('column_names') == 'phone', 1).otherwise(0)).alias('phone')
        , sum(when(col('column_names') == 'city', 1).otherwise(0)).alias('city')
    )
)
I got my intended table.
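For reference, the same unpivot can also be written with Spark's stack expression instead of the to_long helper (a sketch, assuming the original df with columns clinic_id, name, phone, city):
from pyspark.sql.functions import expr

# turn the three validation columns into (column_names, status) rows
long_df = df.select(
    'clinic_id',
    expr("stack(3, 'name', name, 'phone', phone, 'city', city) as (column_names, status)")
)
The grouped aggregation above can then be run on this long_df unchanged.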

Pandas DataFrame read_csv then GroupBy - How to get just a single count instead of one per column

I'm getting the counts I want, but I don't understand why it is creating a separate count for each data column. How can I create just one column called "count"? Would the counts only be different when a column has a null (NaN) value?
Also, what are the actual column names below? Is the column name a tuple?
Can I change the groupby/agg to return just one column called "Count"?
CSV Data:
'Occupation','col1','col2'
'Carpenter','data1','x'
'Carpenter','data2','y'
'Carpenter','data3','z'
'Painter','data1','x'
'Painter','data2','y'
'Programmer','data1','z'
'Programmer','data2','x'
'Programmer','data3','y'
'Programmer','data4','z'
Program:
filename = "./data/TestGroup.csv"
df = pd.read_csv(filename)
print(df.head())
print("Computing stats by HandRank... ")
df_stats = df.groupby("'Occupation'").agg(['count'])
print(df_stats.head())
print("----- Columns-----")
for col_name in df_stats.columns:
    print(col_name)
Output:
Computing stats by HandRank...
'col1' 'col2'
count count
'Occupation'
'Carpenter' 3 3
'Painter' 2 2
'Programmer' 4 4
----- Columns-----
("'col1'", 'count')
("'col2'", 'count')
The df.head() shows it is using "Occupation" as my column name.
Try with size
df_stats = df.groupby("'Occupation'").size().to_frame('count')
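If you want 'Occupation' back as a regular column rather than the index, a small variation (a sketch, assuming the same df) is:
# size() counts rows per group regardless of NaNs in other columns;
# reset_index turns the group key back into a column with the requested name
df_stats = df.groupby("'Occupation'").size().reset_index(name='count')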

Pandas dataframe row removal

I am trying to repair a csv file.
Some data rows need to be removed based on a couple conditions.
Say you have the following dataframe:
A    B  C
000  0  0
000  1  0
001  0  1
011  1  0
001  1  1
If two or more rows have column A in common, i want to keep the row that has column B set to 1.
The resulting dataframe should look like this:
A    B  C
000  1  0
011  1  0
001  1  1
I've experimented with merges and drop_duplicates but cannot seem to get the result I need. It is not certain that the row with column B = 1 will come after a row with B = 0. The take_last argument of drop_duplicates seemed attractive, but I don't think it applies here.
Any advice will be greatly appreciated. Thank you.
Not straightforward, but this should work:
DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})
DF.loc[DF.groupby('A').apply(lambda df: df[df.B == 1].index[0] if len(df) > 1 else df.index[0])]
A B C
1 0 1 0
4 1 1 1
3 11 1 0
Notes:
groupby divides DF into groups of rows with unique A values i.e. groups with A = 0 (2 rows), A=1 (2 rows) and A=11 (1 row)
Apply then calls the function on each group and assimilates the results
In the function (lambda) I'm looking for the index of row with value B == 1 if there is more than one row in the group, else I use the index of the default row
The result of apply is a list of index values that represent rows with B==1 if more than one row in the group else the default row for given A
The index values are then used to access the corresponding rows with .loc
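Another option (a sketch, assuming the same DF and that the rows to keep are exactly those with B == 1 whenever a group has more than one row) is to sort on B and keep the last duplicate per A value:
# B == 1 rows sort after B == 0 rows, so keep='last' picks them for each A
out = DF.sort_values('B').drop_duplicates(subset='A', keep='last').sort_index()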
Was able to weave my way around pandas to get the result I want.
It's not pretty, but it gets the job done:
res = pd.DataFrame(columns=('CARD_NO', 'STATUS'))
# assumes grouped = df.groupby('CARD_NO') was created earlier
for i in grouped.groups:
    if len(grouped.groups[i]) > 1:
        card_no = i
        print(card_no)
        for a in grouped.groups[card_no]:
            status = df.iloc[a]['STATUS']
            print('iloc:' + str(a) + '\t' + 'status:' + str(status))
            if status == 1:
                print('yes')
                row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
                res = pd.concat([res, row], ignore_index=True)
            else:
                print('no')
    else:
        # only 1 record found
        # could be a status of 0 or 1
        # add to dataframe
        print('UNIQUE RECORD')
        card_no = i
        print(card_no)
        status = df.iloc[grouped.groups[card_no][0]]['STATUS']
        print(grouped.groups[card_no][0])
        # print(status)
        print('iloc:' + str(grouped.groups[card_no][0]) + '\t' + 'status:' + str(status))
        row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
        res = pd.concat([res, row], ignore_index=True)
print(res)