Varying apply function on Columns when using Pandas TimeGrouper

I have a very large time series dataset. I would like to do a count() on close_p but a sum() on prd_vlm.
open_p high_p low_p close_p tot_vlm prd_vlm
datetime
2005-09-06 16:33:00 1234.25 1234.50 1234.25 1234.25 776 98
2005-09-06 16:34:00 1234.50 1234.75 1234.25 1234.50 1199 423
2005-09-06 16:35:00 1234.50 1234.50 1234.25 1234.50 1330 131
...
2017-06-25 18:41:00 2431.75 2432.00 2431.75 2432.00 5436 189
2017-06-25 18:42:00 2431.75 2432.25 2431.75 2432.25 5654 218
2017-06-25 18:43:00 2432.25 2432.75 2432.25 2432.75 5877 223
2017-06-25 18:44:00 2432.75 2432.75 2432.50 2432.75 5894 17
2017-06-25 18:45:00 2432.50 2432.50 2432.25 2432.25 6098 204
I can achieve this using the following code, but I was wondering if there is a better way to achieve this using an apply function:
group_count = df['close_p'].groupby(pd.TimeGrouper('D')).count()
group_volume = df['prd_vlm'].groupby(pd.TimeGrouper('D')).sum()
grouped = pd.concat([group_count,group_volume], axis=1)
print(grouped)
close_p prd_vlm
datetime
2005-09-06 232 4776.0
2005-09-07 1039 631548.0
2005-09-08 999 544112.0
2005-09-09 810 595044.0

You can use agg and apply different functions to different columns.
df.groupby(pd.TimeGrouper('D')).agg({'close_p':'count','prd_vlm':'sum'})
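Note that pd.TimeGrouper was deprecated and is no longer available in current pandas releases; pd.Grouper(freq='D') is its replacement. A minimal sketch of the same per-column aggregation follows. The frame is a tiny stand-in: the first three rows reuse values from the question, and the second day's rows are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the question's minute bars.
# The 2005-09-07 rows are hypothetical, added so that two daily groups exist.
idx = pd.to_datetime([
    "2005-09-06 16:33:00",
    "2005-09-06 16:34:00",
    "2005-09-06 16:35:00",
    "2005-09-07 09:00:00",
    "2005-09-07 09:01:00",
])
df = pd.DataFrame(
    {
        "close_p": [1234.25, 1234.50, 1234.50, 1235.00, 1235.25],
        "prd_vlm": [98, 423, 131, 50, 70],
    },
    index=idx,
)
df.index.name = "datetime"

# One groupby, a different aggregation per column.
grouped = df.groupby(pd.Grouper(freq="D")).agg(
    {"close_p": "count", "prd_vlm": "sum"}
)
print(grouped)
```

The dict passed to agg maps each column name to the function applied to it, so no concat of separate groupby results is needed.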

Related

Need SQL query for max value per latest date per unique item - Novice SQL user needing assistance

REVISED POST
I need a query with a desired output shown in bullet #2. Below is a simple query of the data for a specific inventoryno. Notice avgcost can fluctuate for any given date. I need the highest avgcost on the most recent date, distinct to the inventoryno.
Note I have included sample snippets for additional reference however stackoverflow links my images instead of pasting directly here because I am a new OP.
Current query and output
select inventoryno, avgcost, dts
from invtrans
where DTS < '01-JAN-23'
order by dts desc;
INVENTORYNO  AVGCOST    DTS
264          52.36411   12/31/2022
264          52.36411   12/31/2022
264          52.36411   12/31/2022
507          149.83039  12/31/2022
6005         57.45968   12/31/2022
6005         57.45968   12/31/2022
6005         57.45968   12/31/2022
1518         4.05530    12/31/2022
1518         4.05530    12/31/2022
1518         4.05530    12/31/2022
1518         4.15254    12/31/2022
1518         4.15254    12/31/2022
1518         4.1525     12/31/2022
365          0.00000    2/31/2022
365          0.00000    2/31/2022
365          0.00000    2/31/2022
Snippet for above
My proposed query, which doesn't work due to the error 'not a single-group group function':
Select distinct inventoryno, Max(avgcost), max(dts)
from invtrans
where DTS < '01-JAN-23'
order by inventoryno;
DESIRED OUTPUT
INVENTORYNO  AVGCOST    DTS
264          52.36411   12/31/2022
507          149.83039  12/31/2022
6005         57.45968   12/31/2022
1518         4.15254    12/31/2022
365          0.00000    2/31/2022
Desired for above snippet
I have included the raw table with a few rows below for better context.
Raw table for reference
select * from invtrans
KEY     SOURCE   INVENTORYNO  WAREHOUSENO  QUANTITY  QOH   AVGCOST  DTS        EMPNO  INVTRANSNO  TOTALAMT    CO_ID
1805    INVXFER  223          3            1200      2811  0.78377  5/22/2018  999    112029      940.80000   1
076394  PROJ     223          3            -513      2298  0.78376  5/23/2018  999    112030      -402.19000  1
111722  APVCHR   223          3            3430      5728  0.79380  6/1/2018   999    112033      2862.68000  1
073455  PROJ     223          3            -209      5519  0.79392  6/8/2018   999    112034      -163.86000  1
076142  PROJ     223          3            -75       5444  0.79396  6/12/2018  999    112035      -58.80000   1
073492  PROJ     223          3            -252      5192  0.79411  6/13/2018  999    112036      -197.57000  1
072377  PROJ     223          3            -1200     3992  0.79414  8/22/2018  999    112056      -952.80000  1
If anyone could assist me further, it would be ideal for the query below to also contain the 'avgcost' column. Otherwise I can take the fixed query from step 2 and the one below into Excel and combine them there, but I would prefer not to.
Remember, avgcost NEEDS to be the maximum avgcost on the most recent date. I cannot figure it out. Thank you.
select inventoryno,
count(inventoryno),
MAX(DTS),
sum(quantity),
sum(totalamt)
from invtrans
where DTS < '01-JAN-23'
group by inventoryno
order by inventoryno;
INVENTORYNO  COUNT(INVENTORYNO)  MAX(DTS)                SUM(QUANTITY)  SUM(TOTALAMT)
1            103                 11/28/2022 7:07:46 AM   75             1153.46
10           888                 9/26/2022 9:31:20 AM    0              0
100          1287                12/31/2022              162            70486.77
1001         241                 11/28/2022 7:27:04 PM   181            14207.43
1002         759                 12/31/2022              566            76424.46
1003         936                 12/31/2022              120            25252.61
1004         263                 11/30/2022 10:48:00 AM  550            1627.62
1005         487                 11/28/2022 5:05:56 PM   750            4435.51
1006         9                   11/23/2022 8:38:05 AM   1311           504.63
1008         13                  11/30/2022 10:48:00 AM  0              0
1009         38                  10/31/2022 6:50:27 AM   90             2680.36
101          535                 12/31/2022              79             48153.44
102          238                 11/28/2022 6:42:01 PM   24             17802.91
1020         2                   12/13/2019              50             119.89
1021         262                 12/31/2022              2000           4844.37
1022         656                 11/23/2022 4:49:35 PM   300            1315.17
1023         1693                12/31/2022              1260           2002.56
1025         491                 11/28/2022 5:05:56 PM   225            864.75
1026         62                  9/23/2022 4:35:14 PM    375            11956.17
1027         109                 10/28/2022 8:44:21 AM   300            2157.97
1028         39                  9/4/2019 12:30:00 AM    50             244.62
Example output of what I ultimately need
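One possible way to get "maximum avgcost on the most recent date per inventoryno" is to find each item's latest DTS in a subquery, join back to the table on that date, and take MAX(avgcost) over the matching rows. The sketch below runs that idea in Python's built-in sqlite3 rather than the asker's database, so treat it as an illustration only. Assumptions: dates are stored as ISO text (YYYY-MM-DD) with no time component, and the 2022-12-30 row is a hypothetical addition to show that an older, higher avgcost is ignored. If DTS carries a time of day, as the MAX(DTS) output above suggests, the date would first need to be truncated (for example with TRUNC(dts) in Oracle).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invtrans (inventoryno INTEGER, avgcost REAL, dts TEXT)"
)
conn.executemany(
    "INSERT INTO invtrans VALUES (?, ?, ?)",
    [
        (264, 52.36411, "2022-12-31"),
        (264, 52.36411, "2022-12-31"),
        (507, 149.83039, "2022-12-31"),
        (1518, 4.05530, "2022-12-31"),
        (1518, 4.15254, "2022-12-31"),
        (1518, 9.99999, "2022-12-30"),  # hypothetical older row; must not win
    ],
)

query = """
SELECT t.inventoryno, MAX(t.avgcost) AS avgcost, t.dts
FROM invtrans t
JOIN (
    -- latest transaction date per item, before the cutoff
    SELECT inventoryno, MAX(dts) AS last_dts
    FROM invtrans
    WHERE dts < '2023-01-01'
    GROUP BY inventoryno
) latest
  ON latest.inventoryno = t.inventoryno
 AND latest.last_dts = t.dts
GROUP BY t.inventoryno, t.dts
ORDER BY t.inventoryno
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Every selected column is either grouped or aggregated, which is what the 'not a single-group group function' error in the proposed query was complaining about.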

Future dates calculating incorrectly in FBProphet - make_future_dataframe method

I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right: it produces the correct one-week intervals except for the step between Jul 3 and Jul 5. Every other interval is correct at 7 days. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02

All you need to do is create a dataframe with the dates you need for the predict method. Using the make_future_dataframe method is not necessary.
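The two-day step has a simple cause: in pandas, the frequency alias 'W' means 'W-SUN', so generated dates snap to Sundays, while the history in the question ends on Friday 2020-07-03. A minimal sketch of building the future dates by hand with a Friday anchor is shown here; it only uses pandas, and the variable names are illustrative. Passing the resulting frame to the fitted model's predict method is assumed to work because predict expects a dataframe with a ds column.

```python
import pandas as pd

last_date = pd.Timestamp("2020-07-03")  # last ds in the question's data (a Friday)

# 'W' is shorthand for 'W-SUN': the first generated date rolls forward to Sunday.
sunday_anchored = pd.date_range(start=last_date, periods=6, freq="W")

# Anchoring on Friday keeps the 7-day spacing of the history.
future = pd.DataFrame(
    {
        "ds": pd.date_range(
            start=last_date + pd.Timedelta(weeks=1), periods=5, freq="W-FRI"
        )
    }
)
print(sunday_anchored[0].date())   # first Sunday-anchored date
print(future["ds"].dt.date.tolist())
```

If the in-sample fit is wanted as well, the historical ds values can be concatenated in front of these five rows before calling predict.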

select avg for specific values of date

I have this table 'meteorecords' with date, temperature, rh and the meteo station which made the record.
rerowid date temp rh meteostid
1 2019-09-09 28.8 55.6 AITNIA2
2 2019-09-10 30.3 51.3 AITNIA2
3 2019-09-11 28.6 49.0 AITNIA2
4 2019-09-12 26.7 51.9 AITNIA2
5 2019-09-13 25.3 48.1 AITNIA2
6 2019-09-14 25.3 38.5 AITNIA2
7 2019-09-15 25.0 42.2 AITNIA2
8 2019-09-16 24.1 52.1 AITNIA2
9 2019-09-17 23.3 65.2 AITNIA2
10 2019-09-18 22.7 72.2 AITNIA2
11 2019-09-19 23.4 73.9 AITNIA2
12 2019-09-20 23.1 76.7 AITNIA2
13 2019-09-21 22.5 60.3 AITNIA2
14 2019-09-22 20.9 61.6 AITNIA2
15 2019-09-23 21.9 73.9 AITNIA2
16 2019-09-24 23.2 79.6 AITNIA2
17 2019-09-25 21.8 73.6 AITNIA2
18 2019-09-26 22.2 77.6 AITNIA2
19 2019-09-27 22.9 77.1 AITNIA2
20 2019-09-28 22.8 68.4 AITNIA2
21 2019-09-29 22.6 75.5 AITNIA2
...........................
I want to select all the fields plus the average temperature of the last 3 days.
I'm using postgresql because I have some geometric and spatial data in the db.
I tried this with no luck:
SELECT rerowid,redate,retemp,rerh,meteostid,
(SELECT AVG(retemp)
FROM meteorecords m
WHERE meteostid = m.meteostid AND m.redate BETWEEN redate-2 AND redate)
FROM meteorecords
which returns a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 22.2824
2 2019-09-10 30.3 51.3 AITNIA2 22.2824
3 2019-09-11 28.6 49.0 AITNIA2 22.2824
4 2019-09-12 26.7 51.9 AITNIA2 22.2824
5 2019-09-13 25.3 48.1 AITNIA2 22.2824
6 2019-09-14 25.3 38.5 AITNIA2 22.2824
7 2019-09-15 25.1 42.2 AITNIA2 22.2824
..................
But I want a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 28.8
2 2019-09-10 30.3 51.3 AITNIA2 29.5
3 2019-09-11 28.6 49.0 AITNIA2 29.2
4 2019-09-12 26.7 51.9 AITNIA2 28.5
5 2019-09-13 25.3 48.1 AITNIA2 26.9
6 2019-09-14 25.3 38.5 AITNIA2 25.8
7 2019-09-15 25.1 42.2 AITNIA2 25.2
..................

Use window functions. If you have one row per date, or you want the previous three dates (in the data):
SELECT rerowid, redate, retemp, rerh, meteostid,
AVG(retemp) OVER (PARTITION BY meteostid ORDER BY redate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
If you want 3 chronological days, use RANGE (RANGE frames with an offset require PostgreSQL 11 or later):
SELECT rerowid, redate, retemp, rerh, meteostid,
AVG(retemp) OVER (PARTITION BY meteostid
ORDER BY redate
RANGE BETWEEN '2 DAY' PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
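The attempted subquery returns the same value on every row because the unqualified names meteostid and redate inside it resolve to the inner alias m, so both conditions are always true. The difference between the ROWS and RANGE frames can be cross-checked with a rough pandas analogue. This is a sketch, not part of the original answer: the temperatures come from the question, but the gap (no rows for 12 and 13 September) is introduced on purpose so the two windows disagree.

```python
import pandas as pd

# Question's values with a deliberate gap: 2019-09-12 and 2019-09-13 are left out.
df = pd.DataFrame(
    {
        "redate": pd.to_datetime(
            ["2019-09-09", "2019-09-10", "2019-09-11", "2019-09-14"]
        ),
        "retemp": [28.8, 30.3, 28.6, 25.3],
        "meteostid": ["AITNIA2"] * 4,
    }
).sort_values(["meteostid", "redate"]).reset_index(drop=True)

# Analogue of ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: last three rows per station.
df["avg_rows_3"] = df.groupby("meteostid")["retemp"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# Analogue of RANGE BETWEEN 2 days PRECEDING AND CURRENT ROW: a 3-day time window.
# The frame is sorted by station and date, so positional assignment lines up.
df["avg_days_3"] = (
    df.set_index("redate")
    .groupby("meteostid")["retemp"]
    .rolling("3D")
    .mean()
    .to_numpy()
)
print(df)
```

On the last row the row-based average still includes 10 and 11 September, while the time-based one only sees 14 September.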

Convert more than one Columns into Rows

I have a table with columns and value like
ID Values FirstCol 2ndCol 3rdCol 4thCol 5thCol
1 1stValue 5466 34556 53536 54646 566
1 2ndValue 3544 957 667 1050 35363
1 3rdValue 1040 1041 4647 6477 1045
1 4thValue 1048 3546 1095 1151 65757
2 1stValue 845 5466 86578 885 859
2 2ndValue 35646 996 1300 7101 456467
2 3rdValue 102 46478 565 657 107
2 4thValue 5509 55110 1411 1152 1144
3 1stValue 845 854 847 884 675
3 2ndValue 984 994 4647 1041 1503
3 3rdValue 1602 1034 1034 1055 466
3 4thValue 1069 1610 6111 1124 1144
Now I want a result set in the form below. Is this possible with a PIVOT or CASE statement?
ID Cols 1stValue 2ndValue 3rdValue 4thValue
1 FirstCol 5466 3544 1040 1048
1 2ndCol 34556 957 1041 3546
1 3rdCol 53536 667 4647 1095
1 4thCol 54646 1050 6477 1151
1 5thCol 566 35363 1045 65757
2 FirstCol 845 35646 102 5509
2 2ndCol 5466 996 46478 55110
2 3rdCol 86578 1300 565 1411
2 4thCol 885 7101 657 1152
2 5thCol 859 456467 107 1144
3 FirstCol 845 984 1602 1069
3 2ndCol 854 994 1034 1610
3 3rdCol 847 4647 1034 6111
3 4thCol 884 1041 1055 1124
3 5thCol 675 1503 466 1144

Assuming the table name is t1 this should do the trick:
SELECT * FROM t1
UNPIVOT (val FOR name IN ([FirstCol], [2ndCol], [3rdCol], [4thCol], [5thCol])) unpiv
PIVOT (SUM(val) FOR [Values] IN ([1stValue], [2ndValue], [3rdValue], [4thValue])) piv
There is a sorting issue: it would be good to rename FirstCol to 1stCol, so that ORDER BY ID, name puts the rows in the required order.
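For readers working in pandas instead of SQL, the same unpivot-then-pivot idea can be sketched with melt and pivot_table. This is an analogue of the query above, not the answer's own code, and it uses only a subset of the question's data (two IDs, three of the five columns).

```python
import pandas as pd

# Subset of the question's table.
t1 = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 2, 2, 2, 2],
        "Values": ["1stValue", "2ndValue", "3rdValue", "4thValue"] * 2,
        "FirstCol": [5466, 3544, 1040, 1048, 845, 35646, 102, 5509],
        "2ndCol": [34556, 957, 1041, 3546, 5466, 996, 46478, 55110],
        "3rdCol": [53536, 667, 4647, 1095, 86578, 1300, 565, 1411],
    }
)

# UNPIVOT step: one row per (ID, Values, source column).
long = t1.melt(id_vars=["ID", "Values"], var_name="Cols", value_name="val")

# PIVOT step: spread the Values labels back out as columns.
wide = long.pivot_table(
    index=["ID", "Cols"], columns="Values", values="val", aggfunc="sum"
)
print(wide)
```

As in the SQL version, the Cols level sorts lexicographically, so FirstCol lands after 3rdCol unless it is renamed or reordered.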

Grouping different results from same table

I would like to get a result from a table:
Date Charges
22/04/2010 1764
22/04/2010 200
22/04/2010 761
22/04/2010 3985
22/04/2010 473
22/04/2010 677
22/04/2010 1361
22/04/2010 6232
22/04/2010 4095
23/04/2010 7224
23/04/2010 1748
23/04/2010 1355
23/04/2010 2095
23/04/2010 2063
23/04/2010 2331
23/04/2010 2331
23/04/2010 4473
23/04/2010 478
23/04/2010 1901
23/04/2010 1250
23/04/2010 1743
24/04/2010 1743
24/04/2010 3923
24/04/2010 1575
24/04/2010 1859
24/04/2010 2431
24/04/2010 1208
24/04/2010 158
24/04/2010 3246
24/04/2010 2898
24/04/2010 1517
24/04/2010 2368
24/04/2010 961
24/04/2010 4111
24/04/2010 3066
24/04/2010 740
25/04/2010 2651
25/04/2010 2693
25/04/2010 4847
25/04/2010 312
25/04/2010 1247
25/04/2010 5858
25/04/2010 1040
25/04/2010 941
25/04/2010 942
25/04/2010 1784
25/04/2010 418
25/04/2010 2248
25/04/2010 1834
25/04/2010 418
25/04/2010 2263
26/04/2010 2746
26/04/2010 942
26/04/2010 883
26/04/2010 3339
26/04/2010 3517
26/04/2010 761
26/04/2010 1738
26/04/2010 1370
26/04/2010 1501
26/04/2010 1197
26/04/2010 2452
26/04/2010 209
26/04/2010 1092
26/04/2010 4316
26/04/2010 1208
26/04/2010 1213
26/04/2010 2179
26/04/2010 1213
26/04/2010 1538
26/04/2010 1939
26/04/2010 956
26/04/2010 10715
26/04/2010 4321
26/04/2010 956
26/04/2010 2975
26/04/2010 798
26/04/2010 1738
where it shows the following fields:
Date, Count of >2500, Total of >2500, Total Count and Grand total between 1/4/2010 to 30/4/2010
i.e.
22/4/2010, 3, 14312, 9, 19548
23/4/2010, 2, 11697, 12, 28992
24/4/2010, 5, 17244, 15, 31804
25/4/2010, 4, 16049, 15, 29496
26/4/2010, 7, 31929, 27, 57812
...
...
All help is much appreciated! Thanks in advance.

The basic approach is to use SUM with CASE, something like:
SELECT
DATEADD(day,DATEDIFF(day,'20010101',DateTimeActivity),'20010101') as Date,
SUM(CASE WHEN Charges > 2500 THEN 1 ELSE 0 END) as Count2500,
SUM(CASE WHEN Charges > 2500 THEN Charges END) as Sum2500,
COUNT(*) as CountTotal,
SUM(Charges) as SumTotal
FROM
AccActivity
WHERE
DateTimeActivity >= '20100401' and
DateTimeActivity < '20100501'
GROUP BY
DATEADD(day,DATEDIFF(day,'20010101',DateTimeActivity),'20010101')
Updated based on your comment, to use real table/column names. I assume you want to include transactions which occur on 30th April.
Note that I'm using a safe date format for my date literals (YYYYMMDD) - most other formats are ambiguous based on the regional settings on the server.
Also, I'm using DATEADD(day,DATEDIFF(day,'20010101',DateTimeActivity),'20010101') to strip the time component from the datetime. It looks slightly funky, but it's reasonably fast, and the same pattern can be used to do other datetime conversions relatively easily (e.g. if you need to group on months, you can just change both day options to month, and the dates will all be set to the 1st of their respective month).
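The SUM/CASE pattern can be tried out quickly with Python's built-in sqlite3. This is a sketch under assumptions: the table name charges_log and the column name day are hypothetical, dates are stored as ISO text so no time-stripping expression is needed, and only the question's rows for 22 and 23 April are loaded.

```python
import sqlite3

# Charges from the question for two of the days.
charges_22 = [1764, 200, 761, 3985, 473, 677, 1361, 6232, 4095]
charges_23 = [7224, 1748, 1355, 2095, 2063, 2331, 2331, 4473, 478, 1901, 1250, 1743]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charges_log (day TEXT, charges INTEGER)")
conn.executemany(
    "INSERT INTO charges_log VALUES (?, ?)",
    [("2010-04-22", c) for c in charges_22]
    + [("2010-04-23", c) for c in charges_23],
)

query = """
SELECT day,
       SUM(CASE WHEN charges > 2500 THEN 1 ELSE 0 END)       AS count_gt_2500,
       SUM(CASE WHEN charges > 2500 THEN charges ELSE 0 END) AS sum_gt_2500,
       COUNT(*)                                              AS count_total,
       SUM(charges)                                          AS sum_total
FROM charges_log
WHERE day >= '2010-04-01' AND day < '2010-05-01'
GROUP BY day
ORDER BY day
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The two result rows match the first two lines of the expected output in the question.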

You can try with:
SELECT date,
count(if(charges>2500, 1, NULL)) as countGt2500,
sum(if(charges>2500, charges, 0)) as totalGt2500,
count(charges) as countTotal,
sum(charges) as sumTotal
FROM yourTable
WHERE date >= '2010/04/01'
AND date <= '2010/04/30'
GROUP BY date;
If the date field stores a full datetime, you have to extract the date part from it. You can use the DATE function as follows:
SELECT DATE(date) as day,
count(if(charges>2500, 1, NULL)) as countGt2500,
sum(if(charges>2500, charges, 0)) as totalGt2500,
count(charges) as countTotal,
sum(charges) as sumTotal
FROM yourTable
WHERE date >= '2010/04/01'
AND date <= '2010/04/30'
GROUP BY day;
