Pandas groupby issue after melt bug?

Python version 3.8.12
pandas 1.4.1
Given the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [1000] * 4,
    'date': ['2022-01-01'] * 4,
    'ts': pd.date_range('2022-01-01', freq='5min', periods=4),
    'A': np.random.randint(1, 6, size=4),
    'B': np.random.rand(4)
})
that looks like this:
     id        date                  ts  A          B
0  1000  2022-01-01 2022-01-01 00:00:00  4    0.98019
1  1000  2022-01-01 2022-01-01 00:05:00  3    0.82021
2  1000  2022-01-01 2022-01-01 00:10:00  4   0.549684
3  1000  2022-01-01 2022-01-01 00:15:00  5  0.0818311
I reshaped the columns A and B to long format with pandas melt:
melted = df.melt(
    id_vars=['id', 'date', 'ts'],
    value_vars=['A', 'B'],
    var_name='label',
    value_name='value',
    ignore_index=True
)
that looks like this:
     id        date                  ts label      value
0  1000  2022-01-01 2022-01-01 00:00:00     A          4
1  1000  2022-01-01 2022-01-01 00:05:00     A          3
2  1000  2022-01-01 2022-01-01 00:10:00     A          4
3  1000  2022-01-01 2022-01-01 00:15:00     A          5
4  1000  2022-01-01 2022-01-01 00:00:00     B    0.98019
5  1000  2022-01-01 2022-01-01 00:05:00     B    0.82021
6  1000  2022-01-01 2022-01-01 00:10:00     B   0.549684
7  1000  2022-01-01 2022-01-01 00:15:00     B  0.0818311
Then I groupby and select the first group:
melted.groupby(['id', 'date']).first()
that gives me this:
                         ts label  value
id   date
1000 2022-01-01  2022-01-01     A    4.0
but I would expect this output instead:
ts A B
id date
1000 2022-01-01 2022-01-01 00:00:00 4 0.980190
2022-01-01 2022-01-01 00:05:00 3 0.820210
2022-01-01 2022-01-01 00:10:00 4 0.549684
2022-01-01 2022-01-01 00:15:00 5 0.081831
What am I not getting? Or is this a bug? And why is the ts column converted to a date?

My bad! I thought first() would return the first group, but it actually returns the first element of each group, as stated in the pandas documentation for aggregation functions. Sorry folks, I was doing this late at night and could not think straight :/
To select the first group, I needed the get_group() function.
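For reference, a minimal sketch of the fix; the group key (1000, '2022-01-01') comes from the sample data above:

# get_group() returns all rows of one group, selected by its key,
# instead of aggregating the first row of every group
melted.groupby(['id', 'date']).get_group((1000, '2022-01-01'))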

Related

Execute an SQL query depending on the parameters of a pandas dataframe

I have a pandas data frame called final_data that looks like this:
cust_id  start_date    end_date
  10001  2022-01-01  2022-01-30
  10002  2022-02-01  2022-02-30
  10003  2022-01-01  2022-01-30
  10004  2022-03-01  2022-03-30
  10005  2022-02-01  2022-02-30
I have another table in my SQL database called penalties that looks like this:
cust_id  level1_pen  level_2_pen        date
  10001           1            4  2022-01-01
  10001           1            1  2022-01-02
  10001           0            1  2022-01-03
  10002           1            1  2022-01-01
  10002           5            0  2022-02-01
  10002           4            0  2022-02-04
  10003           1            6  2022-01-02
I want the final_data frame to look like this, where total_penalties aggregates the data from the penalties table in the SQL database based on cust_id, start_date, and end_date:
cust_id  start_date    end_date  total_penalties
  10001  2022-01-01  2022-01-30                8
  10002  2022-02-01  2022-02-30                9
  10003  2022-01-01  2022-01-30                7
How do I combine this with a lambda function over each row, so that the aggregation from the SQL query uses the cust_id, start_date, and end_date values of that row of the pandas dataframe?
Suppose
df = final_data table
df2 = penalties table
You can get the final_data frame that you want using this query:
SELECT
    df.cust_id,
    df.start_date,
    df.end_date,
    SUM(df2.level1_pen + df2.level_2_pen) AS total_penalties
FROM df
LEFT JOIN df2
    ON df.cust_id = df2.cust_id
    AND df2.date BETWEEN df.start_date AND df.end_date
GROUP BY
    df.cust_id,
    df.start_date,
    df.end_date;
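If you would rather pull the penalties table into pandas (e.g. with pd.read_sql) and aggregate there, a rough equivalent is a merge plus groupby. This is only a sketch: it assumes final_data and penalties are DataFrames with the columns shown above, and that the date columns are datetimes (or ISO-format strings, which compare correctly). An inner merge drops customers with no penalties in range, matching the sample output:

# keep each customer's penalty rows that fall inside the window,
# then sum both penalty levels per (cust_id, start_date, end_date)
merged = final_data.merge(penalties, on='cust_id')
in_range = merged[merged['date'].between(merged['start_date'], merged['end_date'])]
result = (
    in_range.assign(total_penalties=in_range['level1_pen'] + in_range['level_2_pen'])
            .groupby(['cust_id', 'start_date', 'end_date'], as_index=False)['total_penalties']
            .sum()
)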

How to groupby in Pandas by datetime range from different DF

I'm stuck and can't solve this...
I have 2 dataframes.
One has datetimes intervals, another has datetimes and values.
I need to get MIN() values based on datetime ranges.
import pandas as pd

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
        ['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
        ['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
    ], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])

values = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', 1],
        ['2018-01-01T01:00:00.000000000', 2],
        ['2018-01-01T02:00:00.000000000', 0],
        ['2018-01-02T00:00:00.000000000', -1],
        ['2018-01-02T01:00:00.000000000', 3],
        ['2018-01-02T02:00:00.000000000', 10],
        ['2018-01-03T00:00:00.000000000', 7],
        ['2018-01-03T01:00:00.000000000', 11],
        ['2018-01-03T02:00:00.000000000', 2],
    ], columns=['DT', 'Value'])
Required output:
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
Any ideas?
Use an IntervalIndex created from the timeseries columns, get the positions with Index.get_indexer, aggregate min, and finally join the result back to timeseries:
values['DT'] = pd.to_datetime(values['DT'])  # 'DT' holds strings in the sample data

s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
                                 timeseries['End DT'],
                                 closed='both')
values['new'] = timeseries.index[s.get_indexer(values['DT'])]
print(values)

                   DT  Value  new
0 2018-01-01 00:00:00      1    0
1 2018-01-01 01:00:00      2    0
2 2018-01-01 02:00:00      0    0
3 2018-01-02 00:00:00     -1    1
4 2018-01-02 01:00:00      3    1
5 2018-01-02 02:00:00     10    1
6 2018-01-03 00:00:00      7    2
7 2018-01-03 01:00:00     11    2
8 2018-01-03 02:00:00      2    2
df = timeseries.join(values.groupby('new')['Value'].min().rename('Min'))
print(df)

    Start DT              End DT  Min
0 2018-01-01 2018-01-01 03:00:00    0
1 2018-01-02 2018-01-02 03:00:00   -1
2 2018-01-03 2018-01-03 03:00:00    2
EDIT: If a timestamp matches no interval, get_indexer returns -1 rather than a missing value, so timeseries.index[pos] would wrongly select the last index value (here 2). Mask those positions to get missing values instead:
timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
        ['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
        ['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
    ], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])

values = pd.DataFrame(
    [
        ['2017-12-31T00:00:00.000000000', -10],
        ['2018-01-01T00:00:00.000000000', 1],
        ['2018-01-01T01:00:00.000000000', 2],
        ['2018-01-01T02:00:00.000000000', 0],
        ['2018-01-02T00:00:00.000000000', -1],
        ['2018-01-02T01:00:00.000000000', 3],
        ['2018-01-02T02:00:00.000000000', 10],
        ['2018-01-03T00:00:00.000000000', 7],
        ['2018-01-03T01:00:00.000000000', 11],
        ['2018-01-03T02:00:00.000000000', 2],
    ], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])
print(values)

                   DT  Value
0 2017-12-31 00:00:00    -10
1 2018-01-01 00:00:00      1
2 2018-01-01 01:00:00      2
3 2018-01-01 02:00:00      0
4 2018-01-02 00:00:00     -1
5 2018-01-02 01:00:00      3
6 2018-01-02 02:00:00     10
7 2018-01-03 00:00:00      7
8 2018-01-03 01:00:00     11
9 2018-01-03 02:00:00      2

s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
                                 timeseries['End DT'],
                                 closed='both')
pos = s.get_indexer(values['DT'])
values['new'] = timeseries.index[pos].where(pos != -1)
print(values)

                   DT  Value  new
0 2017-12-31 00:00:00    -10  NaN
1 2018-01-01 00:00:00      1  0.0
2 2018-01-01 01:00:00      2  0.0
3 2018-01-01 02:00:00      0  0.0
4 2018-01-02 00:00:00     -1  1.0
5 2018-01-02 01:00:00      3  1.0
6 2018-01-02 02:00:00     10  1.0
7 2018-01-03 00:00:00      7  2.0
8 2018-01-03 01:00:00     11  2.0
9 2018-01-03 02:00:00      2  2.0

df = timeseries.join(values.dropna(subset=['new']).groupby('new')['Value'].min().rename('Min'))
print(df)

    Start DT              End DT  Min
0 2018-01-01 2018-01-01 03:00:00    0
1 2018-01-02 2018-01-02 03:00:00   -1
2 2018-01-03 2018-01-03 03:00:00    2
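A related sketch (my addition, not part of the original answer): pd.cut accepts the same IntervalIndex s as bins, and timestamps outside every interval become NaN automatically, so no masking is needed. With observed=False the groupby yields one row per interval in the original order, which aligns with timeseries:

# each 'DT' is binned into its interval (NaN if unmatched); grouping
# by the resulting Categorical gives one minimum per interval
mins = values.groupby(pd.cut(values['DT'], bins=s), observed=False)['Value'].min()
timeseries['Min'] = mins.to_numpy()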
One possible solution is to create a key variable on which to join the two datasets:
# create 'key' variable
timeseries['key'] = timeseries['Start DT'].astype(str)
values['key'] = pd.to_datetime(values['DT'].str.replace('T', ' '), format='%Y-%m-%d %H:%M:%S.%f').dt.date.astype(str)
# create dataset with minima
mins = values.groupby('key').agg({'Value': 'min'}).reset_index()
# join
timeseries.merge(mins, on='key').drop(columns=['key'])
Start DT End DT Value
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
values['DT'] = values['DT'].astype(str)   # convert to string
s = values['DT'].str.split(' ')           # split on the space
values['day'] = s.str[0]                  # take the day part
df4 = values.groupby(by='day').min()      # group by day and take the min value
df4.reset_index(inplace=True)             # reset the index
df4['day'] = pd.to_datetime(df4['day'])   # convert back to datetime for merging
final = pd.merge(timeseries, df4, left_on='Start DT', right_on='day', how='inner')  # merge

Error in creating a new columns in pandas dataframe

Tried creating a new column to categorize different time frames into categories using np.select. However, Python throws a shape mismatch error, and I'm not sure how to correct it.
For your logic, it's simplest to use the hour attribute of the datetime:
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("1-Apr-2021", "now", freq="4H"), name="start_date")
(s.to_frame()
  .join(pd.Series(np.select([s.dt.hour.between(1, 6),
                             s.dt.hour.between(7, 12)],
                            [1, 2], 0), name="cat"))
  .head(8)
)
           start_date  cat
0 2021-04-01 00:00:00    0
1 2021-04-01 04:00:00    1
2 2021-04-01 08:00:00    2
3 2021-04-01 12:00:00    2
4 2021-04-01 16:00:00    0
5 2021-04-01 20:00:00    0
6 2021-04-02 00:00:00    0
7 2021-04-02 04:00:00    1
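To attach the same logic directly as a new column, a minimal sketch (df here is a hypothetical stand-in for your frame, assumed to have a datetime column start_date):

# np.select picks the first matching condition per row; rows
# matching neither condition fall through to the default of 0
df["cat"] = np.select(
    [df["start_date"].dt.hour.between(1, 6),
     df["start_date"].dt.hour.between(7, 12)],
    [1, 2],
    default=0,
)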

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want to take the daily mean of all trip counts every 20 minutes. The desired output (for the above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by the times created by Series.dt.time, because there are always only 00, 20, 40 minutes and no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
# alternative:
# df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print(df1)

          Trip count
Date
00:00:00         4.5
00:20:00         3.0
00:40:00         3.0
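For completeness, a self-contained version of the above, built from the sample rows in the question:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-08-01 00:00:00', '2019-08-01 00:20:00', '2019-08-01 00:40:00',
             '2019-08-02 00:00:00', '2019-08-02 00:20:00', '2019-08-02 00:40:00'],
    'Trip count': [3, 2, 4, 6, 4, 2],
})
df['Date'] = pd.to_datetime(df['Date'])

# one row per time of day, averaged across all days in the data
print(df.groupby(df['Date'].dt.time)['Trip count'].mean())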

Converting dataframe object to date using to_datetime

I have a data set that looks like this:
date id
0 2014-01-01 11000929
1 2014-01-01 11000190
2 2014-01-01 11000216
3 2014-01-01 11000822
4 2014-01-01 11000971
5 2014-01-01 11000721
6 2014-01-01 11000970
7 2014-01-01 11000574
8 2014-01-01 11000967
9 2014-01-01 11000172
10 2014-01-01 11000208
11 2014-01-01 11000966
12 2014-01-01 11000344
13 2014-01-01 11000965
14 2014-01-01 11000935
15 2014-01-01 11000964
16 2014-01-01 11000741
17 2014-01-01 11000868
18 2014-01-01 11000035
19 2014-01-01 11000203
20 2014-01-02 11000574
as you can see there are a lot of duplicate dates for different products. I will merge this table with another table, which requires me to convert the date column, currently an object, to datetime64[ns].
I tried
df_date_id.date = pd.to_datetime(df_date_id.date)
but I end up having the error:
TypeError: <class 'pandas._libs.tslibs.period.Period'> is not convertible to datetime
p.s: the table I am going to merge with looks like this:
date id score
0 2014-01-01 11000035 75
1 2014-01-02 11000035 84
2 2014-01-03 11000035 55
so the date format of both tables looks the same to me.
Thanks in advance.
I think it's necessary to convert the periods to datetimes with to_timestamp:
df['date'] = df['date'].dt.to_timestamp()
print(df['date'].dtypes)
datetime64[ns]
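A minimal sketch of why this happens (a hypothetical reconstruction, assuming the column really holds pandas Period objects, e.g. from an earlier dt.to_period call): pd.to_datetime cannot handle Periods directly, while .dt.to_timestamp converts them cleanly:

import pandas as pd

# a 'date' column of daily Periods, mimicking the error situation
df = pd.DataFrame({'date': pd.period_range('2014-01-01', periods=3, freq='D'),
                   'id': [11000929, 11000190, 11000216]})

df['date'] = df['date'].dt.to_timestamp()   # Period -> Timestamp
print(df['date'].dtype)                     # datetime64[ns]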
Another solution is to convert the date column of the other DataFrame to periods instead:
df2['date'] = df2['date'].dt.to_period('d')
Works for me by specifying the format (note the lowercase %m for month; an uppercase %M would be parsed as minutes):
df.date = pd.to_datetime(df.date, format='%Y-%m-%d')

        date        id
0 2014-01-01  11000929
1 2014-01-01  11000190
2 2014-01-01  11000216
3 2014-01-01  11000822
4 2014-01-01  11000971
If not, try:
df.date = pd.to_datetime(df.date.astype(str), format='%Y-%m-%d')