How do I take the mode of multiple columns and groupby the rest of the columns in Pyspark? - dataframe

I'm trying to take a dataframe and find the mode of 2 of the columns while simultaneously grouping by the rest of the columns:
W   X   Y  Z
50  20  a  cat
50  20  a  cat
40  30  b  dog
50  20  a  dog
40  30  a  cat
50  20  b  dog
50  20  a  cat
50  20  a  cat
50  20  a  cat

(The full dataframe has 4 columns and 1000 rows.)
How would I take the columns ('W', 'Y') and find the modes of those, then take
the columns ('Y', 'Z') and groupby them as well?
It is important that these steps are done at the same time to produce the correct output.
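A sketch of the grouped-mode logic, shown in pandas so the example is self-contained (a PySpark version typically uses groupBy plus a count and a row_number window, since Spark has no built-in mode aggregator). The column split here is an assumption: the mode of W and X within each (Y, Z) group.

```python
import pandas as pd

# Sample data mirroring the table above (the real frame has 1000 rows)
df = pd.DataFrame({
    'W': [50, 50, 40, 50, 40, 50, 50, 50, 50],
    'X': [20, 20, 30, 20, 30, 20, 20, 20, 20],
    'Y': ['a', 'a', 'b', 'a', 'a', 'b', 'a', 'a', 'a'],
    'Z': ['cat', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'cat', 'cat'],
})

# Mode of W and X within each (Y, Z) group; ties are broken by taking
# the smallest value, since Series.mode() returns a sorted Series.
modes = (df.groupby(['Y', 'Z'])[['W', 'X']]
           .agg(lambda s: s.mode().iat[0])
           .reset_index())
print(modes)
```

Both modes are computed in the same groupby pass, so the grouping and the aggregation happen together rather than in separate steps.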


why is groupby in pandas not displaying

I have a df like:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot',
                              'Elephant', 'Elephant', 'Elephant'],
                   'Max Speed': [380, 370, 24, 26, 5, 7, 3]})
I would like to groupby Animal.
if I do in a notebook:
a = df.groupby(['Animal'])
display(a)
I get:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f945bdd7b80>
I expected something like:
What I ultimately want to do is sort the df by number of animal appearances (Elephant 3, Falcon 2, etc.)
This happens because you are not using any aggregation function after groupby:
a = df.groupby(['Animal'])
display(a)
Rectified:
a = df.groupby(['Animal']).count()
display(a)
Now, after applying count(), sum(), or another aggregation function (optionally followed by sort_values()), you will get the groupby results.
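For the stated end goal, sorting by number of animal appearances, one hedged sketch: count rows per animal with size() and sort descending.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot',
                              'Elephant', 'Elephant', 'Elephant'],
                   'Max Speed': [380, 370, 24, 26, 5, 7, 3]})

# Count appearances per animal, then sort so the most frequent comes first
counts = df.groupby('Animal').size().sort_values(ascending=False)
print(counts)
```

df['Animal'].value_counts() gives the same counts directly, already sorted descending.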
You need to check DataFrame.groupby:
Group DataFrame using a mapper or by a Series of columns.
So it is not for removing duplicated values in a column, but for aggregation.
If you need to remove duplicated values by setting them to an empty string, use:
df.loc[df['Animal'].duplicated(), 'Animal'] = ''
print (df)
     Animal  Max Speed
0    Falcon        380
1                   370
2    Parrot         24
3                    26
4  Elephant          5
5                     7
6                     3
If you need groupby:
for i, g in df.groupby(['Animal']):
print (g)
     Animal  Max Speed
4  Elephant          5
5  Elephant          7
6  Elephant          3
   Animal  Max Speed
0  Falcon        380
1  Falcon        370
   Animal  Max Speed
2  Parrot         24
3  Parrot         26
The groupby object requires an action, like a max or a min. This will result in two things:
A regular pandas data frame
The grouping key appearing once
You clearly expect both of the Falcon entries to remain, so you don't actually want a groupby. If you want to see the entries with repeated animal values hidden, you would do that by setting the Animal column as the index. I say that because your input data frame is already in the order you want to display.
Use mask:
>>> df.assign(Animal=df['Animal'].mask(df['Animal'].duplicated(), ''))
     Animal  Max Speed
0    Falcon        380
1                   370
2    Parrot         24
3                    26
4  Elephant          5
5                     7
6                     3
Or as index:
df.assign(Animal=df['Animal'].mask(df['Animal'].duplicated(), '')).set_index('Animal')
          Max Speed
Animal
Falcon          380
                370
Parrot           24
                 26
Elephant          5
                  7
                  3

Pandas column merging on condition

This is my pandas df:
Id  Protein  A_Egg  B_Meat  C_Milk  Category
A        10     10      20       0  egg
B        20     10       0      10  milk
C        20     10      10      10  meat
D        25     20      10       0  egg
I wish to merge the Protein column with one of the other columns, based on "Category".
My desired output is:
Id  Protein_final
A              20
B              30
C              30
D              45
Ideally, I would show how I am approaching this, but I am frankly clueless!
EDIT: Also, how do I handle the case where Category is blank or does not match any of the columns? (In that case the final value should be the same as the initial value in the Protein column.)
Use DataFrame.lookup with some preprocessing: strip the part of each column name before the underscore, lowercase the result, and finally add the looked-up values to the Protein column:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
  Id  Protein  A_Egg  B_Meat  C_Milk Category
0  A       20     10      20       0      egg
1  B       30     10       0      10     milk
2  C       30     10      10      10     meat
3  D       45     20      10       0      egg
If you only need the 2 columns in the final result:
df = df[['Id','Protein']]
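Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer pandas, the same row-wise pick can be sketched with NumPy indexing, under the same column-renaming assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': ['A', 'B', 'C', 'D'],
                   'Protein': [10, 20, 20, 25],
                   'A_Egg': [10, 10, 10, 20],
                   'B_Meat': [20, 0, 10, 10],
                   'C_Milk': [0, 10, 10, 0],
                   'Category': ['egg', 'milk', 'meat', 'egg']})

# Normalize column names the same way: keep the part after '_', lowercased
renamed = df.rename(columns=lambda c: c.split('_')[-1].lower())

# For each row, pick the value in the column named by that row's Category
col_idx = renamed.columns.get_indexer(df['Category'])
df['Protein'] += renamed.to_numpy()[np.arange(len(df)), col_idx]
print(df[['Id', 'Protein']])
```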
You can melt the dataframe, filter for rows where Category equals the variable column, and add to get the final column:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
  Id  Protein_final
0  A             20
3  D             45
6  C             30
9  B             30

how to groupby dataframe with two rows as header

I have a dataframe with two rows as header (name and unit). Is there a way to group the dataframe by unit? What I am trying to achieve is to group similar units together and run analysis on them.
df = pd.read_csv('filename', header=[0, 1])
Customer  length  height  adress  city
Name      meter   meter   bldg    name
A         10      20      1       Delhi
C         30      20      10      Delhi
B         20      40      19      Delhi
D         40      50      10      Delhi
I am trying to isolate the dataframe columns by unit (the second header row),
for example:
length  height
10      20
30      20
20      40
40      50
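One way to sketch the unit-based selection: reading with header=[0, 1] makes the columns a MultiIndex, so the unit row becomes the second level and columns can be filtered on it. The construction below is an assumption standing in for the CSV read.

```python
import pandas as pd

# Two-level columns; pd.read_csv('filename', header=[0, 1]) would
# produce the same structure from the file shown above.
cols = pd.MultiIndex.from_tuples([('Customer', 'Name'), ('length', 'meter'),
                                  ('height', 'meter'), ('adress', 'bldg'),
                                  ('city', 'name')])
df = pd.DataFrame([['A', 10, 20, 1, 'Delhi'],
                   ['C', 30, 20, 10, 'Delhi'],
                   ['B', 20, 40, 19, 'Delhi'],
                   ['D', 40, 50, 10, 'Delhi']], columns=cols)

# Select every column whose second header row (the unit) is 'meter'
meters = df.loc[:, df.columns.get_level_values(1) == 'meter']
print(meters)
```

The same mask works for any other unit value, so each unit group can be isolated and analyzed in turn.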

Groupby sum and average in pandas and make data frame [duplicate]

This question already has answers here:
Multiple aggregations of the same column using pandas GroupBy.agg()
(4 answers)
Closed 2 years ago.
I have a dataframe as shown below
ID Score
A 20
B 60
A 40
C 50
B 100
C 60
C 40
A 10
A 10
A 70
From the above I would like to calculate the average score and the total score for each ID.
Expected output:
ID Average_score Total_score
A 30 150
B 80 160
C 50 150
Use named aggregation for custom column names:
df1 = (df.groupby('ID').agg(Average_score=('Score','mean'),
                            Total_score=('Score','sum'))
         .reset_index())
print (df1)
ID Average_score Total_score
0 A 30 150
1 B 80 160
2 C 50 150
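An alternative sketch, if renaming afterwards is acceptable: pass a list of aggregations to agg and rename the resulting columns.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'A', 'C', 'B', 'C', 'C', 'A', 'A', 'A'],
                   'Score': [20, 60, 40, 50, 100, 60, 40, 10, 10, 70]})

# List-of-aggregations form; columns come out named 'mean' and 'sum'
df1 = (df.groupby('ID')['Score'].agg(['mean', 'sum'])
         .rename(columns={'mean': 'Average_score', 'sum': 'Total_score'})
         .reset_index())
print(df1)
```

Named aggregation (the answer above) does the renaming in one step, so it is usually the cleaner choice; this form is handy when the aggregation list is built dynamically.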

Retrieve value from different fields for each record of an Access table

I would be more than appreciative for some help here, as I have been having some serious problems with this.
Background:
I have a list of unique records. For each record I have a monotonically increasing pattern (either A, B or C), and a development position (1 to 5) assigned to it.
So each of the 3 patterns is set out in five fields representing the development period.
Problem:
I need to retrieve the percentages relating to the relevant development periods, from different fields for each row. It should be in a single column called "Output".
Example:
Apologies, not sure how to attach a table here, but the fields are below, the table is a transpose of these fields.
ID - (1,2,3,4,5)
Pattern - (A, B, C, A, C)
Dev - (1,5,3,4,2)
1 - (20%, 15%, 25%, 20%, 25%)
2 - (40%, 35%, 40%, 40%, 40%)
3 - (60%, 65%, 60%, 60%, 60%)
4 - (80%, 85%, 65%, 80%, 65%)
5 - (100%, 100%, 100%, 100%, 100%)
Output - (20%, 100%, 60%, 80%, 40%)
In MS Excel, I could simply use a HLOOKUP or OFFSET function to do this. But how do I do this in Access? The best I have come up with so far is Output: Eval([Category]), but this doesn't seem to achieve what I want, which is to take the value of the "Dev" field and treat it as a field name when building an expression.
In practice, I have more than 100 development periods to play with, and over 800 different patterns, so "switch" methods can't work here I think.
Thanks in advance,
alch84
Assuming that
[ID] is a unique column (primary key), and
the source column for [Output] only depends on the value of [Dev]
then this seems to work:
UPDATE tblAlvo SET Output = DLookup("[" & Dev & "]", "tblAlvo", "ID=" & ID)
Before:
ID Pattern Dev 1 2 3 4 5 Output
-- ------- --- -- -- -- -- --- ------
1 A 1 20 40 60 80 100
2 B 5 15 35 65 85 100
3 C 3 25 40 60 65 100
4 A 4 20 40 60 80 100
5 C 2 25 40 60 65 100
After:
ID Pattern Dev 1 2 3 4 5 Output
-- ------- --- -- -- -- -- --- ------
1 A 1 20 40 60 80 100 20
2 B 5 15 35 65 85 100 100
3 C 3 25 40 60 65 100 60
4 A 4 20 40 60 80 100 80
5 C 2 25 40 60 65 100 40