Group one column by another column in pandas?

I would like to get the median value of one column and use the associated value of another column. For example,
   col1  col2 index
0     1     3     A
1     2     4     A
2     3     5     A
3     4     6     B
4     5     7     B
5     6     8     B
6     7     9     B
I group by index to get the median value of col1 and take the associated value of col2, to get

       col1  col2
index
A         2     4
B         5     7
I can't use the actual median value for index B because it will average the two middle values and that value won't have a corresponding value in col 2.
What's the best way to do this? Will a groupby method work? Or somehow use sort? Do I need to define my own function?

It seems you need the row at the middle position, not the median, from the original df:
df.groupby('index')[['col1','col2']].apply(lambda x: pd.Series(sorted(x.values.tolist())[len(x)//2]))
Out[297]:
       0  1
index
A      2  4
B      6  8
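An alternative sketch (not from the answer above, and assuming the example frame from the question): sort each group by col1 and take the row at the middle position with iloc, which keeps col1 and col2 paired.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6, 7],
    'col2': [3, 4, 5, 6, 7, 8, 9],
    'index': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
})

# Sort each group by col1 and take the row at the middle
# position, so col2 stays paired with the chosen col1 value.
mid = (df.sort_values('col1')
         .groupby('index')
         .apply(lambda g: g.iloc[len(g) // 2])[['col1', 'col2']])
print(mid)
```

For an even-sized group this picks the upper middle row (B gives 6, 8), which matches the output of the answer above.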


Create a new column pandas based on another column condition [duplicate]

I want to create a new column, say named "group id", on the following basis:
compare the nth row with the (n-1)th row;
if the two records are equal, copy the previous "group id";
if they are not equal, add 1 to the previous "group id".
I want the result in the following way:

Column A   Column B
6-Aug-10   0
30-Aug-11  1
31-Aug-11  2
31-Aug-11  2
6-Sep-12   3
30-Aug-13  4
I'm looking for a result similar to this Excel formula:
=IF(T3=T2, U2, U2+1)
You can use ngroup:
df['Group ID'] = df.groupby('DOB').ngroup()
# according to your example
df['Group ID'] = df.groupby('Column A').ngroup()
Use factorize: repeated values share one ID even when they are not consecutive. To number each consecutive run separately instead, compare shifted values with Series.cumsum and subtract 1:
print (df)
    Column A  Column B
0   6-Aug-10         0
1  30-Aug-11         1
2  31-Aug-11         2
3  31-Aug-11         2
4   6-Sep-12         3
5  30-Aug-13         4
6  30-Aug-11         5  <- added data to see the difference
7  31-Aug-11         6  <- added data to see the difference

df['Group ID1'] = pd.factorize(df['Column A'])[0]
df['Group ID2'] = df['Column A'].ne(df['Column A'].shift()).cumsum().sub(1)
print (df)
    Column A  Column B  Group ID1  Group ID2
0   6-Aug-10         0          0          0
1  30-Aug-11         1          1          1
2  31-Aug-11         2          2          2
3  31-Aug-11         2          2          2
4   6-Sep-12         3          3          3
5  30-Aug-13         4          4          4
6  30-Aug-11         5          1          5
7  31-Aug-11         6          2          6

Remove duplicates from dataframe, based on two columns A,B, keeping [list of values] in another column C

I have a pandas dataframe which contains duplicate values according to two columns (A and B):

A  B  C
1  2  1
1  2  4
2  7  1
3  4  0
3  4  8
I want to remove the duplicates, keeping the values of column C inside a list of up to N values (N = 2 in this example). This would lead to:

A  B  C
1  2  [1, 4]
2  7  1
3  4  [0, 8]
I cannot figure out how to do that. Maybe use groupby and drop_duplicates?
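No answer is recorded here; one hedged sketch, assuming the frame from the question, is groupby + agg(list): collect every C value per (A, B) pair, then optionally unwrap single-element lists to match the mixed scalar/list output shown above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# Collect every C value for each (A, B) pair into a list.
out = df.groupby(['A', 'B'], as_index=False)['C'].agg(list)

# Optionally unwrap single-element lists to plain scalars,
# matching the mixed output shown in the question.
out['C'] = out['C'].map(lambda v: v[0] if len(v) == 1 else v)
print(out)
```

If the lists must be capped at N values, slicing inside the aggregation (e.g. `lambda s: list(s)[:N]`) would do it.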

How to keep only the last index in groups of rows where a condition is met in pandas?

I have the following dataframe:
d = {'value': [1,1,1,1,1,1,1,1,1,1], 'flag_1': [0,1,0,1,1,1,0,1,1,1],'flag_2':[1,0,1,1,1,1,1,0,1,1],'index':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I need to perform the following filter on it:
If flag 1 and flag 2 are equal keep the row with the maximum index from the consecutive indices. Below for rows 4,5,6 and rows 9,10 flag 1 and flag 2 are equal. From the group of consecutive indices 4,5,6 therefore I wish to keep only row 6 and drop rows 4 and 5. For the next group of rows 9 and 10 I wish to keep only row 10. The rows where flag 1 and 2 are not equal should all be retained. I want my final output to look as shown below:
I am really not sure how to achieve what is required so I would be grateful for any advice on how to do it.
IIUC, you can compare consecutive rows with shift. This solution requires a sorted index.
In [5]: df[~df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)]
Out[5]:
   value  flag_1  flag_2  index
0      1       0       1      1
1      1       1       0      2
2      1       0       1      3
5      1       1       1      6
6      1       0       1      7
7      1       1       0      8
9      1       1       1     10

Calculate new column using relative row references

I would like to turn a data frame like this:
DF
Nrow  a
1     5
2     6
3     7
4     11
5     16
Into this:
DF
Nrow  a   b
1     5   NA
2     6   NA
3     7   2
4     11  5
5     16  9
Column 'b' is calculated as the value from column 'a' minus the value of column 'a' two rows earlier. For example, b4 = a4 - a2.
I have had no success so far with indexing or loops. Is there a tool or command for this or some obvious notation that I am missing? I need to do this continuously without splitting into groups.
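No answer is recorded for this one either. In pandas terms (the example could equally be R), the relative row reference is a shift, sketched here with the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 6, 7, 11, 16]})

# b[n] = a[n] - a[n-2]: shift(2) aligns each row with the value
# two rows above, producing NaN for the first two rows.
df['b'] = df['a'] - df['a'].shift(2)
print(df)
```

This is vectorized over the whole column, so no loops or group splitting are needed.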

Using temporary extended table to make a sum

From a given table I want to be able to sum values having the same number (should be easy, right?)
Problem: A given value can be assigned from 2 to n consecutive numbers.
For some reasons this information is stored in a single row describing the value, the starting number and the ending number as below.
TABLE A
 id | starting_number | ending_number | value
----+-----------------+---------------+-------
  1 |               2 |             5 |     8
  2 |               0 |             3 |     5
  3 |               4 |             6 |     6
  4 |               7 |             8 |    10
For instance the first row means:
value '8' is assigned to numbers: 2, 3 and 4 (5 is excluded)
So, I would like the following intermediary result table
TABLE B
 id | number | value
----+--------+-------
  1 |      2 |     8
  1 |      3 |     8
  1 |      4 |     8
  2 |      0 |     5
  2 |      1 |     5
  2 |      2 |     5
  3 |      4 |     6
  3 |      5 |     6
  4 |      7 |    10
So I can sum 'value' for elements having identical 'number'
SELECT number, sum(value)
FROM B
GROUP BY number
TABLE C
 number | sum(value)
--------+------------
      2 |         13
      3 |          8
      4 |         14
      0 |          5
      1 |          5
      5 |          6
      7 |         10
I don't know how to do this and didn't find any answer on the web (maybe I'm not looking with the appropriate keywords...).
Any idea?
You can do what you want with generate_series(). So, TableB is basically:
select id, generate_series(starting_number, ending_number - 1, 1) as n, value
from tableA;
Your aggregation is then:
select n, sum(value)
from (select id, generate_series(starting_number, ending_number - 1, 1) as n, value
      from tableA
     ) a
group by n;