pandas iterate over rows based on column values - pandas

I want to calculate the temperature difference between two cities at the same point in time. The data structure looks as follows:
dic = {'city':['a','a','a','a','a','b','b','b','b','b'],'week':[1,2,3,4,5,3,4,5,6,7],'temp':[20,21,23,21,25,20,21,24,21,22]}
df = pd.DataFrame(dic)
df
+------+------+------+
| city | week | temp |
+------+------+------+
| a    | 1    | 20   |
| a    | 2    | 21   |
| a    | 3    | 23   |
| a    | 4    | 21   |
| a    | 5    | 25   |
| b    | 3    | 20   |
| b    | 4    | 21   |
| b    | 5    | 24   |
| b    | 6    | 21   |
| b    | 7    | 22   |
+------+------+------+
I would like to calculate the difference in temperature between city a and b at week 3, 4, and 5. The final data structure should look as follows:
+--------+--------+------+------+
| city_1 | city_2 | week | diff |
+--------+--------+------+------+
| a      | b      | 3    | 3    |
| a      | b      | 4    | 0    |
| a      | b      | 5    | 1    |
+--------+--------+------+------+

I would pivot your data, drop the NA values, and do the subtraction directly. This way you can keep the source temperatures associated with each city.
result = (
    df.pivot(index='week', columns='city', values='temp')
      .dropna(how='any', axis='index')
      .assign(diff=lambda df: df['a'] - df['b'])
)
print(result)
city     a     b  diff
week
3     23.0  20.0   3.0
4     21.0  21.0   0.0
5     25.0  24.0   1.0
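If you need the exact layout from the question, you can reset the index and add the city labels afterwards (a small sketch of my own; the column names city_1/city_2 simply mirror the question's table):
# Reshape into the requested layout (column names are illustrative)
out = (
    result.reset_index()
          .assign(city_1='a', city_2='b')
          [['city_1', 'city_2', 'week', 'diff']]
)
print(out)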

Related

Why does sorting a pandas column reorder the sub-groups? [duplicate]

Closed as a duplicate of: Sorting Dataframe using pandas. Keeping columns intact
The goal of my question is to understand why this happens and whether this is defined behaviour. I need to know in order to design my unit tests in a predictable way. I do not want or need to change that behaviour or work around it.
Here is the initial data: on the left the complete frame, on the right just the rows where ID.eq(1). The order is the same, as you can see from the index and the val column.
|    |   ID | val   |      |    |   ID | val   |
|---:|-----:|:------|      |---:|-----:|:------|
|  0 |    1 | A     |      |  0 |    1 | A     |
|  1 |    2 | B     |      |  3 |    1 | x     |
|  2 |    9 | C     |      |  4 |    1 | R     |
|  3 |    1 | x     |      |  6 |    1 | G     |
|  4 |    1 | R     |      |  9 |    1 | a     |
|  5 |    4 | F     |      | 12 |    1 | d     |
|  6 |    1 | G     |      | 13 |    1 | e     |
|  7 |    9 | H     |
|  8 |    4 | I     |
|  9 |    1 | a     |
| 10 |    2 | b     |
| 11 |    9 | c     |
| 12 |    1 | d     |
| 13 |    1 | e     |
| 14 |    4 | f     |
| 15 |    2 | g     |
| 16 |    9 | h     |
| 17 |    9 | i     |
| 18 |    4 | X     |
| 19 |    5 | Y     |
This right table is also the result I would expect when doing the following:
When I sort by ID the order of the rows inside the subgroups (e.g. ID.eq(1)) is modified. Why is it so?
This is the unexpected result
|    |   ID | val   |
|---:|-----:|:------|
|  0 |    1 | A     |
| 13 |    1 | e     |
| 12 |    1 | d     |
|  6 |    1 | G     |
|  9 |    1 | a     |
|  3 |    1 | x     |
|  4 |    1 | R     |
This is a full MWE
#!/usr/bin/env python3
import pandas as pd

# initial data
df = pd.DataFrame(
    {
        'ID': [1, 2, 9, 1, 1, 4, 1, 9, 4, 1,
               2, 9, 1, 1, 4, 2, 9, 9, 4, 5],
        'val': list('ABCxRFGHIabcdefghiXY')
    }
)
print(df.to_markdown())

# only the group "1"
print(df.loc[df.ID.eq(1)].to_markdown())

# sort by 'ID'
df = df.sort_values('ID')

# only the group "1" (after sorting)
print(df.loc[df.ID.eq(1)].to_markdown())
As explained in the sort_values documentation, the stability of the sort is not guaranteed; it depends on the chosen algorithm:
kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
Choice of sorting algorithm. See also :func:`numpy.sort` for more
information. `mergesort` and `stable` are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
If you want to ensure a stable sort:
df.sort_values('ID', kind='stable')
output:
ID val
0 1 A
3 1 x
4 1 R
6 1 G
9 1 a
...
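To make the expectation concrete, here is a small check of my own (not part of the original answer): with kind='stable' the rows of each sub-group keep the order in which they appear in the source.
# Sketch: a stable sort keeps each sub-group in its original row order.
# Assumes `df` is the original, unsorted DataFrame built in the MWE above.
stable = df.sort_values('ID', kind='stable')
print(stable.loc[stable.ID.eq(1), 'val'].tolist())   # ['A', 'x', 'R', 'G', 'a', 'd', 'e']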

Count values less than in another dataframe based on values in existing dataframe

I have two python pandas dataframes, in simplified form they look like this:
DF1
+---------+------+
| Date 1  | Item |
+---------+------+
| 1991-08 | A    |
| 1992-08 | A    |
| 1997-02 | B    |
| 1998-03 | C    |
| 1999-02 | D    |
| 1999-02 | D    |
+---------+------+
DF2
+---------+------+
| Date 2  | Item |
+---------+------+
| 1993-08 | A    |
| 1993-09 | B    |
| 1997-01 | C    |
| 1999-03 | D    |
| 2000-02 | E    |
| 2001-03 | F    |
+---------+------+
I want to count, for each item in the Item column of DF2, how many times it appears in DF1 with a DF1 date earlier than the corresponding DF2 date.
Desired Output
+---------+------+-------+
| Date 2  | Item | Count |
+---------+------+-------+
| 1993-08 | A    | 2     |
| 1993-09 | B    | 0     |
| 1997-01 | C    | 0     |
| 1999-03 | D    | 2     |
| 2000-02 | E    | 0     |
| 2001-03 | F    | 0     |
+---------+------+-------+
Appreciate any comment and feedback, thanks in advance
Let's merge the two frames on Item (which gives every DF1/DF2 date pair per item), filter to the rows where the DF1 date is earlier, then use value_counts and map the counts back to your dataframe:
df_c = df1.merge(df2, on='Item')
df_c = df_c[df_c['Date 1'] < df_c['Date 2']]
df2['Count'] = df2['Item'].map(df_c['Item'].value_counts()).fillna(0)
print(df2)
Output:
Date 2 Item Count
0 1993-08 A 2.0
1 1993-09 B 0.0 # Note, I get no counts for B
2 1997-01 C 0.0
3 1999-03 D 2.0
4 2000-02 E 0.0
5 2001-03 F 0.0
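A small follow-up of my own (not part of the original answer): if you prefer whole-number counts instead of the float column shown above, cast after filling the missing values.
# Cast the mapped counts to plain integers (NaN -> 0 first)
df2['Count'] = df2['Item'].map(df_c['Item'].value_counts()).fillna(0).astype(int)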

Pandas Groupby : Get all rows for an ID for the earliest date (large dataset with many ID's)

I have a DataFrame with monthly values for an ID and its associated columns. There are "groups" of rows for each ID and month. There may be up to 12 months of data for each ID.
I want, for each ID, all the rows whose date is that ID's earliest date.
The data looks like
+-------+----+--------+-------------+-------------+----------+
| index | ID | Date   | X           | Y           | Category |
+-------+----+--------+-------------+-------------+----------+
| 0     | 1  | 1/1/18 | 0.118758835 | 0.954677438 | A        |
| 1     | 1  | 1/1/18 | 0.148103273 | 0.976617504 | B        |
| 2     | 1  | 1/1/18 | 0.365541214 | 0.551642346 | C        |
| 3     | 1  | 1/2/18 | 0.405002687 | 0.343279097 | A        |
| 4     | 1  | 1/2/18 | 0.130580643 | 0.144486528 | B        |
| 5     | 1  | 1/2/18 | 0.395113106 | 0.113118681 | C        |
| 6     | 2  | 1/1/18 | 0.425580038 | 0.725166189 | A        |
| 7     | 2  | 1/1/18 | 0.889677796 | 0.386824338 | B        |
| 8     | 2  | 1/1/18 | 0.835311629 | 0.363802849 | C        |
| 9     | 2  | 1/2/18 | 0.8375818   | 0.769265522 | A        |
| 10    | 2  | 1/2/18 | 0.648162611 | 0.075286355 | B        |
| 11    | 2  | 1/2/18 | 0.639060695 | 0.791222309 | C        |
+-------+----+--------+-------------+-------------+----------+
I am wondering if I can use Groupby to process the data to output
+-------+----+--------+-------------+-------------+----------+
| index | ID | Date   | X           | Y           | Category |
+-------+----+--------+-------------+-------------+----------+
| 0     | 1  | 1/1/18 | 0.118758835 | 0.954677438 | A        |
| 1     | 1  | 1/1/18 | 0.148103273 | 0.976617504 | B        |
| 2     | 1  | 1/1/18 | 0.365541214 | 0.551642346 | C        |
| 6     | 2  | 1/1/18 | 0.425580038 | 0.725166189 | A        |
| 7     | 2  | 1/1/18 | 0.889677796 | 0.386824338 | B        |
| 8     | 2  | 1/1/18 | 0.835311629 | 0.363802849 | C        |
+-------+----+--------+-------------+-------------+----------+
N.B. I have left the index numbers in the output df unchanged to show which rows I want to get.
Note: there are varying numbers of Categories for each ID, i.e. I can't just take the first n rows; I must use the earliest month for each ID.
I have written a Python loop that iterates through each ID and then selects the rows with the earliest date, but with a large 2+ GB dataset it is very slow. Hope this is enough information.
If Groupby is not suitable, other approaches are welcome.
Update:
I have done some more investigation on this and come up with a solution
see StackOverflow Pandas groupby rank date time
Use groupby and rank to create a DateRank column at the ID level:
df['DateRank'] = df.groupby('ID')['Date'].rank(method='dense', ascending=True)
Filter on rank 1 (the first entries):
xdf = df[df['DateRank'] == 1.0]
Remove the ranking column:
xdf.drop('DateRank', axis=1, inplace=True)
Print the dataframe:
xdf
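As an aside (my addition, not part of the original update): the same filter can be written without a helper rank column by comparing each row's date to its group minimum via transform, which also sidesteps the chained-assignment warning that the inplace drop on a slice can raise.
# Sketch: keep, per ID, only the rows whose Date equals that ID's earliest Date.
# Assumes Date has already been parsed to datetimes, e.g. pd.to_datetime(df['Date'], dayfirst=True).
earliest = df.groupby('ID')['Date'].transform('min')
xdf = df[df['Date'] == earliest].copy()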
The snippet below returns, for each (ID, Category) group, the row with the smallest Date:
df = df.groupby(["ID", "Category"], group_keys=False).apply(lambda g: g.nsmallest(1, "Date"))
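One caveat of my own: nsmallest expects a numeric (or datetime) column, so if Date is stored as strings like '1/1/18' it should be parsed first, for example:
# Parse the date strings before using nsmallest; dayfirst=True is an assumption based on the monthly data
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)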
First create a month column, then keep the row with the smallest month value in each [ID, Category] group.
Computation
df['month'] = df['Date'].map(lambda x: int(x.split("/")[1]))   # month is the middle component of the d/m/yy string
df = df.loc[df.groupby(["ID", "Category"])['month'].idxmin()]  # keep the earliest-month row per ID and Category
Output
print(df.to_string())
index ID Date X Y Category month
0 0 1 1/1/18 0.118758835 0.954677438 A 1
1 1 1 1/1/18 0.148103273 0.976617504 B 1
2 2 1 1/1/18 0.365541214 0.551642346 C 1
6 6 2 1/1/18 0.425580038 0.725166189 A 1
7 7 2 1/1/18 0.889677796 0.386824338 B 1
8 8 2 1/1/18 0.835311629 0.363802849 C 1

Vectorizing a variable length look-ahead loop in pandas

This is a very simplified version of my data:
+----+---------+---------------------+
|    | user_id | seconds_since_start |
+----+---------+---------------------+
| 0  | 1       | 10                  |
| 1  | 1       | 12                  |
| 2  | 1       | 15                  |
| 3  | 1       | 52                  |
| 4  | 1       | 60                  |
| 5  | 1       | 67                  |
| 6  | 1       | 120                 |
| 7  | 2       | 55                  |
| 8  | 2       | 62                  |
| 9  | 2       | 105                 |
| 10 | 3       | 200                 |
| 11 | 3       | 206                 |
+----+---------+---------------------+
And this is the data I would like to produce:
+----+---------+---------------------+-----------------+------------------+
|    | user_id | seconds_since_start | session_ordinal | session_duration |
+----+---------+---------------------+-----------------+------------------+
| 0  | 1       | 10                  | 1               | 5                |
| 1  | 1       | 12                  | 1               | 5                |
| 2  | 1       | 15                  | 1               | 5                |
| 3  | 1       | 52                  | 2               | 15               |
| 4  | 1       | 60                  | 2               | 15               |
| 5  | 1       | 67                  | 2               | 15               |
| 6  | 1       | 120                 | 3               | 0                |
| 7  | 2       | 55                  | 1               | 7                |
| 8  | 2       | 62                  | 1               | 7                |
| 9  | 2       | 105                 | 2               | 0                |
| 10 | 3       | 200                 | 1               | 6                |
| 11 | 3       | 206                 | 1               | 6                |
+----+---------+---------------------+-----------------+------------------+
My notion of a session is a group of events from a single user which occur not more than 10 seconds apart, and a session's duration is defined as the difference between the first event in the session and the last event (in seconds).
I have written working Python that achieves what I want.
import pandas as pd
events_data = [[1, 10], [1, 12], [1, 15], [1, 52], [1, 60], [1, 67], [1, 120],
               [2, 55], [2, 62], [2, 105],
               [3, 200], [3, 206]]
events = pd.DataFrame(data=events_data, columns=['user_id', 'seconds_since_start'])

def record_session(index_range, ordinal, duration):
    for i in index_range:
        events.at[i, 'session_ordinal'] = ordinal
        events.at[i, 'session_duration'] = duration

session_indexes = []
current_user = previous_time = session_start = -1
session_num = 0
for i, row in events.iterrows():
    if row['user_id'] != current_user or (row['seconds_since_start'] - previous_time) > 10:
        record_session(session_indexes, session_num, previous_time - session_start)
        session_indexes = [i]
        session_num += 1
        session_start = row['seconds_since_start']
        if row['user_id'] != current_user:
            current_user = row['user_id']
            session_num = 1
    previous_time = row['seconds_since_start']
    session_indexes.append(i)
record_session(session_indexes, session_num, previous_time - session_start)
My problem is the length of time this takes to run. As I said, this is a very simplified version of my data; my actual data has 70,000,000 rows. Is there a way to vectorize (and thus speed up) algorithms like this that formulate additional columns based on variable-length look-aheads?
You could try:
# Create a helper boolean Series
s = df.groupby('user_id')['seconds_since_start'].diff().gt(10)

df['session_ordinal'] = s.groupby(df['user_id']).cumsum().add(1).astype(int)
df['session_duration'] = (df.groupby(['user_id', 'session_ordinal'])['seconds_since_start']
                            .transform(lambda x: x.max() - x.min()))
Output:
user_id seconds_since_start session_ordinal session_duration
0 1 10 1 5
1 1 12 1 5
2 1 15 1 5
3 1 52 2 15
4 1 60 2 15
5 1 67 2 15
6 1 120 3 0
7 2 55 1 7
8 2 62 1 7
9 2 105 2 0
10 3 200 1 6
11 3 206 1 6
Chris A's answer here is great. It contains several techniques or calls I was unfamiliar with. This answer copies his and adds copious annotations.
We start by building a helper Boolean series. This series records which events start additional sessions for any user. This is OK as a Boolean series because, in numeric contexts, Booleans behave like the integers 0 and 1 (quoting from here). Let's put the series together bit by bit.
starts_session = events.groupby('user_id')['seconds_since_start'].diff().gt(10)
First we group events by user_id (documentation), then choose the 'seconds_since_start' column and call diff (documentation) on that. The result of events.groupby('user_id')['seconds_since_start'].diff() is
+----+----------------------+
|    | seconds_since_start  |
+----+----------------------+
| 0  | NaN                  |
| 1  | 2.0                  |
| 2  | 3.0                  |
| 3  | 37.0                 |
| 4  | 8.0                  |
| 5  | 7.0                  |
| 6  | 53.0                 |
| 7  | NaN                  |
| 8  | 7.0                  |
| 9  | 43.0                 |
| 10 | NaN                  |
| 11 | 6.0                  |
+----+----------------------+
I can see that the start of each group is already picking up the correct NaN difference as there's no previous event from that user to give a delta from.
Then using the element-wise greater than gt(10) (documentation) we get
+----+----------------------+
|    | seconds_since_start  |
+----+----------------------+
| 0  | False                |
| 1  | False                |
| 2  | False                |
| 3  | True                 |
| 4  | False                |
| 5  | False                |
| 6  | True                 |
| 7  | False                |
| 8  | False                |
| 9  | True                 |
| 10 | False                |
| 11 | False                |
+----+----------------------+
(N.B. The column heading is odd, but it is not used and so does not matter.)
events['session_ordinal'] = starts_session.groupby(events['user_id']).cumsum().add(1).astype(int)
We then re-group starts_session by the user_ids in events and take the cumulative sum, cumsum (documentation), over each group. The grouping does the work for us here, ensuring that each user's count restarts at zero. We need the session ordinal to start at 1, not zero, so we simply add one with add(1) (documentation), and we cast to int with astype(int) (documentation) since none of the values are NaN. This gives the derived session_ordinal column I wanted.
events['session_duration'] = events.groupby(['user_id', 'session_ordinal'])['seconds_since_start'].transform(lambda x: x.max() - x.min())
To derive each session's duration we first group events by both the user_id and the new session_ordinal, i.e. we group them into sessions. Using transform (documentation) we find the minimum and maximum value of seconds_since_start for each group (i.e. each session), and the difference between them is the session's duration. This pattern, applying transform to grouped data, is used extensively in the split-apply-combine process.
Thanks Chris.
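One possible refinement of my own (untested on the full 70-million-row data): on very large frames the Python lambda inside transform can be slow, and the same duration can be computed with built-in aggregations instead.
# Use built-in 'max'/'min' transforms instead of a Python lambda
grp = events.groupby(['user_id', 'session_ordinal'])['seconds_since_start']
events['session_duration'] = grp.transform('max') - grp.transform('min')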

output difference of two values same column to another column

Can anyone help me out or point me in the right direction? What is the simplest way to get from the current table to the output table?
Current Table
ID | type | amount |
2  | A    | 19     |
2  | B    | 6      |
3  | A    | 5      |
3  | B    | 11     |
4  | A    | 1      |
4  | B    | 23     |
Desired output
ID | type | amount | change |
2  | A    | 19     | 13     |
2  | B    | 6      | -6     |
3  | A    | 5      | -22    |
3  | B    | 11     |        |
4  | A    | 1      |        |
4  | B    | 23     |        |
I don't get how the values are assigned to rows. You can, for instance, subtract the "B" value from the "A" value for any given ID:
select t.*,
       (case when type = 'A'
             then amount - max(amount) filter (where type = 'B') over (partition by id)
        end) as diff_a_b
from t;
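For completeness, here is a rough pandas sketch of my own that mirrors the SQL above: broadcast each ID's "B" amount across its rows and subtract it on the "A" rows.
import pandas as pd

df = pd.DataFrame({'ID': [2, 2, 3, 3, 4, 4],
                   'type': list('ABABAB'),
                   'amount': [19, 6, 5, 11, 1, 23]})

# Broadcast each ID's "B" amount to all rows of that ID, then take A - B on the "A" rows only
b_amount = df['amount'].where(df['type'].eq('B')).groupby(df['ID']).transform('max')
df['change'] = (df['amount'] - b_amount).where(df['type'].eq('A'))
print(df)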