Copy the value to rows for distinct values - Spark dataframe

I have a Spark dataframe. I want to copy a value across rows based on a specific column:
ColumnA columnB columnC
a       Null    Null
a       1       1
b       2       2
c       Null    Null
c       3       3
In this dataframe, ColumnC is a copy of ColumnB, but I want a single value for each value of ColumnA: if a value of ColumnA has both Null and a proper value, I want the proper value in all places, otherwise Null.
Required dataframe:
ColumnA columnB columnC
a       Null    1
a       1       1
b       2       2
c       Null    3
c       3       3
I tried partitionBy(ColumnA), but it doesn't seem to work.
Can I get Scala code for this?

If I understand correctly, you are looking to choose a single B value for every A value.
That is a groupBy.
To choose the value correctly, you need the proper aggregation function; in this case, the first aggregation function can be used:
import org.apache.spark.sql.{functions => F}
import spark.implicits._  // needed for the $"col" syntax

df
  .groupBy($"columnA")
  .agg(F.first($"columnB", ignoreNulls = true) as "columnC")
This will give you a single value from B for every value of A:
columnA columnC
a       1
b       2
c       3
You can then proceed to join it with the original DF by A.

Related

Add a categorical column with three values assigned to each row in a PySpark df, then perform aggregate functions on 30 columns

I have a dataframe as a result of validation codes:
clinic_id = ['c_1', 'c_1', 'c_1', 'c_2', 'c_3', 'c_1', 'c_2', 'c_2']
name      = ['valid', 'valid', 'invalid', 'missing', 'invalid', 'valid', 'valid', 'missing']
phone     = ['missing', 'valid', 'invalid', 'invalid', 'valid', 'valid', 'missing', 'missing']
city      = ['invalid', 'valid', 'valid', 'missing', 'missing', 'valid', 'invalid', 'missing']

df = spark.createDataFrame(
    list(zip(clinic_id, name, phone, city)),
    ['clinic_id', 'name', 'phone', 'city'])
I counted the number of valid, invalid, and missing values using an aggregation grouped by clinic_id in PySpark:
from pyspark.sql.functions import col, when, sum

agg_table = (
    df
    .groupBy('clinic_id')
    .agg(
        # name
        sum(when(col('name') == 'valid', 1).otherwise(0)).alias('validname')
        , sum(when(col('name') == 'invalid', 1).otherwise(0)).alias('invalidname')
        , sum(when(col('name') == 'missing', 1).otherwise(0)).alias('missingname')
        # phone
        , sum(when(col('phone') == 'valid', 1).otherwise(0)).alias('validphone')
        , sum(when(col('phone') == 'invalid', 1).otherwise(0)).alias('invalidphone')
        , sum(when(col('phone') == 'missing', 1).otherwise(0)).alias('missingphone')
        # city
        , sum(when(col('city') == 'valid', 1).otherwise(0)).alias('validcity')
        , sum(when(col('city') == 'invalid', 1).otherwise(0)).alias('invalidcity')
        , sum(when(col('city') == 'missing', 1).otherwise(0)).alias('missingcity')
    ))
display(agg_table)
output:
clinic_id validname invalidname missingname ... invalidcity missingcity
--------- --------- ----------- ----------- ... ----------- -----------
c_1 3 1 0 ... 1 0
c_2 1 0 2 ... 1 0
c_3 0 1 0 ... 0 1
The resulting aggregated table is just fine, but it is not ideal for further analysis. I tried pivoting within PySpark, trying to get something like the table below:
#note: counts below are just made up, not the actual count from above, but I hope you get what I mean.
clinic_id category name phone city
-------- ------- ---- ------- ----
c_1 valid 3 1 3
c_1 invalid 1 0 2
c_1 missing 0 2 3
c_2 valid 3 1 3
c_2 invalid 1 0 2
c_2 missing 0 2 3
c_3 valid 3 1 3
c_3 invalid 1 0 2
c_3 missing 0 2 3
I initially searched for pivot/unpivot, but I learned it is called unstack in PySpark, and I also came across mapping.
I tried the suggested approach in How to unstack dataset (using pivot)? but it shows me only one column, and I cannot get the desired result when I apply it to my dataframe of 30 columns.
I also tried the following using the validated table/dataframe:
from pyspark.sql.functions import expr

expression = ""
cnt = 0
for column in agg_table.columns:
    if column != 'clinc_id':
        cnt += 1
        expression += f"'{column}' , {column},"

exprs = f"stack({cnt}, {expression[:-1]}) as (Type, Value)"
unpivoted = agg_table.select('clinic_id', expr(exprs))
I get an error that just points to this line, apparently referring to the return value.
I also tried grouping the results by id and category, but that is where I am stuck. If I group by an aggregated variable, say the values of validname, the aggregate function only counts the values in that column and does not apply to the other count columns. So I thought of inserting a column with .withColumn that assigns the three categories to each ID, so that the aggregated counts are grouped by id and category as in the table above, but I have had no luck finding a solution to this.
Also, maybe an SQL approach would be easier?
I found the right search phrase: "column to row in pyspark"
One of the suggested answers that fits my dataframe is this function:
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
long_df = (to_long(df, ["clinic_id"])
           .withColumnRenamed('key', 'column_names')
           .withColumnRenamed('val', 'status'))
This created a dataframe of three columns: clinic_id, column_names, and status (valid, invalid, missing).
Then I created my aggregated table, grouped by clinic_id and status:
display(long_df.groupBy('clinic_id', 'status')
        .agg(
            sum(when(col('column_names') == 'name', 1).otherwise(0)).alias('name')
            , sum(when(col('column_names') == 'phone', 1).otherwise(0)).alias('phone')
            , sum(when(col('column_names') == 'city', 1).otherwise(0)).alias('city')
        ))
I got my intended table.
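For comparison, here is a hedged sketch of the same unpivot done with Spark SQL's stack expression plus a pivot, instead of the explode/struct helper; the column names are assumed from the example above, and na.fill(0) supplies the zero counts:

from pyspark.sql import functions as F

# Unpivot the three status columns into (column_names, status) rows with one
# stack expression, then count occurrences per clinic and status.
long_df2 = df.select(
    'clinic_id',
    F.expr("stack(3, 'name', name, 'phone', phone, 'city', city) "
           "as (column_names, status)")
)

result = (long_df2
          .groupBy('clinic_id', 'status')
          .pivot('column_names', ['name', 'phone', 'city'])
          .count()
          .na.fill(0))
result.show()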

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but it returns only one column name per row, even when more than one column holds the minimum value. How can I change this to return multiple column names when there is a tie for the minimum?
Please note, I want row-wise results, i.e. the minimum column names for each row.
Thanks!
Try the code below and see if it's in the output format you anticipated; it produces the intended result at least.
The result will be stored in mins.
# start from the first minimum per row, then widen it to every column tied for that minimum
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
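The snippet above reports the columns containing the frame-wide minimum. Since the question asks for row-wise results, here is a hedged sketch (same assumed input) that instead lists every column tied for each row's minimum, matching what the loop-based answer produces:

import pandas as pd

df = pd.DataFrame({'A': [5, 0, 6, 5, 4],
                   'B': [8, 0, 9, 2, 7],
                   'C': [9, 1, 2, 4, 7],
                   'D': [5, 7, 4, 2, 9]})

# Compare each row against its own minimum, then collect the matching column names.
is_min = df.eq(df.min(axis=1), axis=0)
row_mins = is_min.apply(lambda r: list(r[r].index), axis=1)
print(row_mins)  # e.g. row 0 -> ['A', 'D'], row 1 -> ['A', 'B'], row 2 -> ['C']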

Pandas finding average in a comma separated column

I want to group by one column, which is comma separated, and take the mean of the other column.
My file looks like this:
ColumnA ColumnB
A, B, C 2.9
A, C 9.087
D 6.78
B, D, C 5.49
My output should look like this:
A 7.4435
B 5.645
C 5.83
D 6.135
My code is this:
df = pd.DataFrame(data.ColumnA.str.split(',', expand=True).stack(), columns= ['ColumnA'])
df = df.reset_index(drop = True)
df_avg = pd.DataFrame(df.groupby(by = ['ColumnA'])['ColumnB'].mean())
df_avg = df_avg.reset_index()
It has to be something along these lines, but I can't figure it out.
In your solution, the index is created from ColumnB to avoid losing its values after stack and Series.reset_index; last, as_index=False is added so ColumnA remains a column after aggregation:
df = (df.set_index('ColumnB')['ColumnA']
        .str.split(',', expand=True)
        .stack()
        .reset_index(name='ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print(df)
ColumnA ColumnB
0 A 5.993500
1 B 4.195000
2 C 5.825667
3 D 6.135000
Or an alternative solution with DataFrame.explode:
df = (df.assign(ColumnA = df['ColumnA'].str.split(','))
        .explode('ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print(df)
ColumnA ColumnB
0 A 5.993500
1 B 4.195000
2 C 5.825667
3 D 6.135000
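If the real file has spaces after the commas, as in the sample shown above, the split tokens should also be stripped before the groupby, otherwise 'B' and ' B' end up as separate groups. A minimal sketch of that variant, assuming the sample data:

import pandas as pd

df = pd.DataFrame({'ColumnA': ['A, B, C', 'A, C', 'D', 'B, D, C'],
                   'ColumnB': [2.9, 9.087, 6.78, 5.49]})

out = (df.assign(ColumnA=df['ColumnA'].str.split(','))
         .explode('ColumnA')
         .assign(ColumnA=lambda d: d['ColumnA'].str.strip())  # drop stray spaces
         .groupby('ColumnA', as_index=False)['ColumnB']
         .mean())
print(out)  # A 5.9935, B 4.195, C 5.825667, D 6.135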

Deleting/Selecting rows from pandas based on conditions on multiple columns

From a pandas dataframe, I need to delete specific rows based on a condition applied on two columns of the dataframe.
The dataframe is
0 1 2 3
0 -0.225730 -1.376075 0.187749 0.763307
1 0.031392 0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001 3.328741
4 1.879985 -0.968238 1.229118 -1.044477
5 0.440025 -0.809856 -0.336522 0.787792
6 1.499040 0.195022 0.387194 0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete the rows where the absolute difference between column 1 and column 2 is less than a threshold t.
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied to individual column values, but I did not find anything where a row is deleted based on a condition involving multiple columns.
First select the columns by position with DataFrame.iloc, subtract them, take Series.abs, compare against the threshold with the inverted operator (< becomes >= or >), and filter by boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If need select columns by names, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]
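A minimal runnable sketch of the second form, using made-up random data and an assumed threshold t = 0.5 (the question does not specify one):

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4))  # integer column labels 0..3, as in the question
t = 0.5                                   # assumed threshold

# keep the rows where |column 0 - column 1| >= t, i.e. drop those below the threshold
filtered = df[(df[0] - df[1]).abs() >= t]
print(filtered)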

Replacing partial text in cells in a dataframe

This is an extension of a question asked and solved earlier (Replace specific values inside a cell without chaging other values in a dataframe)
I have a dataframe where different numeric codes are used in place of text strings, and now I would like to replace those codes with text values. In the referenced question (link above) the regex method worked before, but now it is not working anymore, and I am clueless as to whether any changes were made to the .replace method.
Example of my dataframe:
col1
0 1,2,3
1 1,2
2 2-3
3 2, 3
The code I wrote uses a dictionary of the values that need to be changed, with regex set to True.
I used the following code:
d = {'1':'a', '2':'b', '3':'c'}
df['col2'] = df['col1'].replace(d, regex=True)
The result I got is:
col1 col2
0 1,2,3 a,2,3
1 1,2 a,2
2 2-3 b-3
3 2, 3 b, 3
Whereas, I was expecting:
col1 col2
0 1,2,3 a,b,c
1 1,2 a,b
2 2-3 b-c
3 2, 3 b, c
Or alternatively:
col1
0 a,b,c
1 a,b
2 b-c
3 b, c
Have there been any changes to the .replace method in the last year, or am I doing something wrong here? The same code I had written worked earlier, but not anymore.
OK, after some experimenting, I found that for each code (number) in my cells I need a separate regex replacement statement, such as:
df.replace({'col1': r'1'}, {'col1': 'a'}, regex=True, inplace=True)
df.replace({'col1': r'2'}, {'col1': 'b'}, regex=True, inplace=True)
df.replace({'col1': r'3'}, {'col1': 'c'}, regex=True, inplace=True)
Which results in:
col1
0 a,b,c
1 a,b
2 b-c
3 b, c
This is just a workaround, as it overwrites the existing column, but it works in my case since my main objective was to replace the codes with values.
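For reference, here is a hedged sketch of a single-pass alternative that produces the expected col2 while keeping col1 intact; it assumes only the codes 1, 2 and 3 occur, as in the example:

import pandas as pd

df = pd.DataFrame({'col1': ['1,2,3', '1,2', '2-3', '2, 3']})
d = {'1': 'a', '2': 'b', '3': 'c'}

# Map every digit code through the dictionary in one pass (assumes every digit is a key in d).
df['col2'] = df['col1'].str.replace(r'\d', lambda m: d[m.group()], regex=True)
print(df)
#     col1   col2
# 0  1,2,3  a,b,c
# 1    1,2    a,b
# 2    2-3    b-c
# 3   2, 3   b, c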