How to groupby a dataframe with two rows as header - pandas

I have a dataframe with two rows as header (name and unit). Is there a way to group the dataframe by unit? What I am trying to achieve is to group columns with similar units and run analysis on them.
df = pd.read_csv('filename', header=[0, 1])
Customer length height adress city
Name meter meter bldg name
A 10 20 1 Delhi
C 30 20 10 Delhi
B 20 40 19 Delhi
D 40 50 10 Delhi
I am trying to isolate the part of the dataframe that shares a unit (the second header row),
for example:
length height
10 20
30 20
20 40
40 50
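A minimal sketch of one way to do this, assuming the file is comma-separated and that header=[0, 1] yields a two-level column MultiIndex (name on level 0, unit on level 1). The inline CSV text stands in for the real file, and the variable names meters and by_unit are only illustrative:

```python
import io

import pandas as pd

# Stand-in for the real file; the content mirrors the sample above.
csv_text = """Customer,length,height,adress,city
Name,meter,meter,bldg,name
A,10,20,1,Delhi
C,30,20,10,Delhi
B,20,40,19,Delhi
D,40,50,10,Delhi
"""

# Two header rows become a MultiIndex: level 0 = name, level 1 = unit.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Select every column whose unit (level 1) is "meter".
# xs drops the selected level, leaving plain column names.
meters = df.xs("meter", axis=1, level=1)
print(meters)

# One sub-frame per unit, keyed by the unit label.
units = df.columns.get_level_values(1).unique()
by_unit = {unit: df.xs(unit, axis=1, level=1) for unit in units}
```

Each entry of by_unit can then be analysed separately, for example by_unit["meter"].mean().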

Related

How do I take the mode of multiple columns and groupby the rest of the columns in Pyspark?

I'm trying to take a dataframe and find the mode of 2 of the columns while simultaneously grouping by the rest of the columns:
W   X   Y  Z
50  20  a  cat
50  20  a  cat
40  30  b  dog
50  20  a  dog
40  30  a  cat
50  20  b  dog
50  20  a  cat
50  20  a  cat
50  20  a  cat
(1000 rows x 4 columns)
How would I take the columns ('W', 'Y') and find the modes of those, then take
the columns ('Y', 'Z') and groupby them as well?
It is important that these steps are done at the same time to produce the correct output.

Pandas column merging on condition

This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to add to the Protein column the value from the other column that matches "Category".
My expected output is
Id Protein_final
A 20
B 30
C 30
D 45
Ideally, I would like to show how I am approaching this but, frankly, I am clueless!
EDIT: Also, how do I handle the case where the category is blank or does not match any of the columns? (In that case the final value should be the same as the initial value in the Protein column.)
Use DataFrame.lookup after some preprocessing: strip everything before the _ in the column names and lowercase the rest so that they match Category, then add the looked-up values to the Protein column:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If you only need the two columns in the end:
df = df[['Id','Protein']]
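DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the snippet above fails on current versions. The sketch below is one possible replacement based on NumPy integer indexing, not the original answerer's method. The helper name add_category_value and the value_cols argument are hypothetical. Rows whose Category matches none of the columns keep their original Protein value, which is how the EDIT in the question is read here:

```python
import numpy as np
import pandas as pd


def add_category_value(df, value_cols):
    """Return Protein plus the value of the column named by Category.

    Hypothetical helper: rows with an unmatched Category add 0.
    """
    # "A_Egg" -> "egg", so the labels are comparable with Category.
    labels = pd.Index([c.split("_")[-1].lower() for c in value_cols])
    # Column position for each row; -1 where Category matches nothing.
    col_pos = labels.get_indexer(df["Category"])
    values = df[value_cols].to_numpy()
    row_pos = np.arange(len(df))
    # Pick one cell per row, then zero out the unmatched rows.
    picked = np.where(col_pos >= 0, values[row_pos, col_pos], 0)
    return df["Protein"] + picked


df = pd.DataFrame({
    "Id": ["A", "B", "C", "D"],
    "Protein": [10, 20, 20, 25],
    "A_Egg": [10, 10, 10, 20],
    "B_Meat": [20, 0, 10, 10],
    "C_Milk": [0, 10, 10, 0],
    "Category": ["egg", "milk", "meat", "egg"],
})
df["Protein_final"] = add_category_value(df, ["A_Egg", "B_Meat", "C_Milk"])
print(df[["Id", "Protein_final"]])
```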
You can melt the dataframe, keep the rows where Category equals the variable column, and add the matching value to Protein:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30

Divide two row values based on label and create a new column to populate the calculated value

New to Python and looking for some help.
I would like to divide values in two different rows (part of the same column) and then insert a new column with the calculated value
City 2017-18 Item
0 Boston 100 Primary
1 Boston 200 Secondary
2 Boston 300 Tertiary
3 Boston 400 Nat'l average
4 Chicago 500 Primary
5 Chicago 600 Secondary
6 Chicago 700 Tertiary
7 Chicago 800 Nat'l average
In the above DataFrame, I am trying to divide each City's Primary, Secondary and Tertiary values by the Nat'l average for that City. The result should go into a new column of the same DataFrame. After the calculation, the rows labelled 'Nat'l average' need to be deleted.
Appreciate your help...
City 2017-18 Item New_column
0 Boston 100 Primary 100/400
1 Boston 200 Secondary 200/400
2 Boston 300 Tertiary 300/400
3 Chicago 500 Primary 500/800
4 Chicago 600 Secondary 600/800
5 Chicago 700 Tertiary 700/800
If the average is always the last row of each group, divide the column by the Series created with GroupBy.transform and 'last':
df['new'] = df['2017-18'].div(df.groupby('City')['2017-18'].transform('last'))
If not, first filter the rows that hold the averages, then divide by the Series obtained by mapping City with Series.map:
s = df[df['Item'] == "Nat'l average"].set_index('City')['2017-18']
df['new'] = df['2017-18'].div(df['City'].map(s))
Finally, filter out the average rows with boolean indexing:
df = df[df['Item'] != "Nat'l average"]
print (df)
City 2017-18 Item new
0 Boston 100 Primary 0.250
1 Boston 200 Secondary 0.500
2 Boston 300 Tertiary 0.750
4 Chicago 500 Primary 0.625
5 Chicago 600 Secondary 0.750
6 Chicago 700 Tertiary 0.875
Detail:
print (df['City'].map(s))
0 400
1 400
2 400
3 400
4 800
5 800
6 800
7 800
Name: City, dtype: int64

Retrieve value from different fields for each record of an Access table

I would be more than appreciative for some help here, as I have been having some serious problems with this.
Background:
I have a list of unique records. For each record I have a monotonically increasing pattern (either A, B or C), and a development position (1 to 5) assigned to it.
So each of the 3 patterns is set out in five fields representing the development period.
Problem:
I need to retrieve the percentages relating to the relevant development periods, from different fields for each row. It should be in a single column called "Output".
Example:
Apologies, not sure how to attach a table here, but the fields are below, the table is a transpose of these fields.
ID - (1,2,3,4,5)
Pattern - (A, B, C, A, C)
Dev - (1,5,3,4,2)
1 - (20%, 15%, 25%, 20%, 25%)
2 - (40%, 35%, 40%, 40%, 40%)
3 - (60%, 65%, 60%, 60%, 60%)
4 - (80%, 85%, 65%, 80%, 65%)
5 - (100%, 100%, 100%, 100%, 100%)
Output - (20%, 100%, 60%, 80%, 40%)
In MS Excel, I could simply use a HLOOKUP or OFFSET function to do this. But how do I do this in Access? The best I have come up with so far is Output: Eval([Category]) but this doesn't seem to achieve what I want which is to select the "Dev" field, and treat this as a field when building an expression.
In practice, I have more than 100 development periods to play with, and over 800 different patterns, so "switch" methods can't work here I think.
Thanks in advance,
alch84
Assuming that
[ID] is a unique column (primary key), and
the source column for [Output] only depends on the value of [Dev]
then this seems to work:
UPDATE tblAlvo SET Output = DLOOKUP("[" & Dev & "]", "tblAlvo", "ID=" & ID)
Before:
ID Pattern Dev 1 2 3 4 5 Output
-- ------- --- -- -- -- -- --- ------
1 A 1 20 40 60 80 100
2 B 5 15 35 65 85 100
3 C 3 25 40 60 65 100
4 A 4 20 40 60 80 100
5 C 2 25 40 60 65 100
After:
ID Pattern Dev 1 2 3 4 5 Output
-- ------- --- -- -- -- -- --- ------
1 A 1 20 40 60 80 100 20
2 B 5 15 35 65 85 100 100
3 C 3 25 40 60 65 100 60
4 A 4 20 40 60 80 100 80
5 C 2 25 40 60 65 100 40

PowerPivot formula for row wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. The table holds the velocity and the number of vehicles that pass the camera during a specific time window (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0 - 23), where the second column of each row is the weighted average velocity for that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row gets the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows
WeightedVelocity = [count]*[vel]
Create a measure "WeightedAverage" as follows
WeightedAverage = sum(stat_table[WeightedVelocity]) / sum(stat_table[count])
Put the measure "WeightedAverage" in the VALUES area of the pivot table and the "hour" column in ROWS to get the desired result.
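To sanity-check the measure outside PowerPivot, the same arithmetic (sum of count * vel divided by sum of count, per hour) can be reproduced in pandas. This is only a cross-check sketch on four rows copied from the sample, not part of the DAX solution. The names stat, sums and weighted_avg are illustrative:

```python
import pandas as pd

# Four rows copied from the sample stat_table (hours 18 and 2).
stat = pd.DataFrame({
    "count": [4, 6, 3, 3],
    "vel": [77.7841, 85.39127, 113.3561, 94.48055],
    "hour": [18, 18, 2, 2],
})

# Same role as the calculated column WeightedVelocity = [count] * [vel].
stat["weighted"] = stat["count"] * stat["vel"]

# Same role as the measure: sum(weighted) / sum(count), evaluated per hour.
sums = stat.groupby("hour")[["weighted", "count"]].sum()
weighted_avg = sums["weighted"] / sums["count"]
print(weighted_avg)
```

For hour 18 this is (4 * 77.7841 + 6 * 85.39127) / 10, which the pivot table should match for the same rows.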