Distinct values based on correlated MIN() and MAX() values - sql

I have a table called EVENTS and I need to write 4 scripts to generate 4 separate tables of distinct EVENT_ID values. Below is the logic for the first script; the other 3 will be similar. Can anyone help with this first script, please, so I can then use it as a template for the other 3? I am writing these scripts in SQL Server 2005 and they must be backwards compatible with SQL Server 2000. I have already removed any duplicates, so there shouldn't be a need to involve the rank of the EVENT_ID in the logic.
For each CARE_ID select the value of EVENT_ID which has an EVENT_TYPE
of CP and has a MAX(EVENT_DATE) which is <= the MIN(EVENT_DATE) where
the EVENT_TYPE is in ('B','CH','S', 'T')
CARE_ID EVENT_ID EVENT_DATE EVENT_TYPE
3 194 01/10/2012 S
3 228 07/07/2010 S
3 104 12/05/2010 CH
3 16 12/07/2010 B
3 17 13/07/2010 B
3 43 15/01/2010 P
3 189 15/04/2010 S
39 45 09/10/2009 T
39 4 21/07/2009 P
39 6 21/07/2009 CH
78 28 08/07/2009 S
78 706 08/12/2010 CP
78 707 09/12/2010 CP
78 9 28/07/2009 T
78 11 28/07/2009 CH
95 21 31/07/2009 CH
95 21 31/07/2009 T
107 1474 21/09/2012 S
107 93 23/02/2010 CP
107 59 29/10/2012 P
107 58 29/12/2009 P
151 186 19/03/2010 S
151 49 21/03/2010 T
152 69 26/08/2009 CH
206 85 21/08/2009 CP
206 84 28/07/2009 CP
217 158 18/02/2010 S
217 102 30/03/2010 CH
218 159 12/03/2010 S
227 1378 01/04/2011 CP
355 19 13/07/2010 B
355 20 13/07/2010 B
355 239 13/07/2010 S
355 56 16/07/2010 T
355 111 16/07/2010 CH
364 1136 18/02/2011 CP
364 569 19/02/2011 S
364 774 23/08/2012 CH
364 1122 26/01/2011 CP
367 247 01/07/2010 S
367 151 21/06/2010 CP
369 108 26/07/2010 P
369 152 27/07/2010 CP
369 109 28/07/2010 P
369 117 28/07/2010 CH
369 248 28/07/2010 S
380 277 08/07/2011 T
396 1573 06/06/2011 CP
481 63 07/09/2010 T
481 116 07/09/2010 P
481 194 07/09/2010 CP
481 289 07/09/2010 S
502 200 13/08/2010 CP
530 220 14/06/2010 CP
535 222 05/07/2010 CP
535 303 13/07/2010 S
535 223 19/07/2010 CP
535 224 26/07/2010 CP
536 135 10/09/2010 CH
536 225 23/08/2010 CP
568 155 06/10/2010 P
568 315 15/10/2010 S
631 148 02/02/2010 CH
631 74 15/01/2010 T
631 256 15/12/2009 CP
631 345 15/12/2009 S
631 147 25/12/2009 CH
632 259 18/09/2010 CP
653 189 29/10/2010 P
653 360 30/09/2010 S
655 1570 06/06/2011 CP
680 569 08/12/2010 CP
680 1191 24/11/2011 S
680 530 25/01/2011 S
680 151 30/09/2010 P
680 281 30/09/2010 CP
680 480 30/11/2010 CP
689 306 02/11/2010 CP
689 158 06/10/2010 P
689 372 06/10/2010 S
689 2720 06/11/2012 CP
689 2736 11/11/2012 CP
689 2752 13/11/2012 CP
689 2765 15/11/2012 CP
689 2125 22/09/2011 CP
689 2654 24/09/2012 CP
689 1944 26/08/2011 CP
689 307 26/10/2010 CP
689 1947 27/08/2011 CP
729 299 15/09/2010 CP
811 413 27/10/2010 S
834 622 01/01/2012 CH
834 1233 06/01/2012 S
834 624 15/01/2012 CH
834 625 23/01/2012 CH
834 627 23/01/2012 CH
838 629 02/01/2012 CH
838 630 20/01/2012 CH
838 632 27/01/2012 CH
846 416 05/10/2010 S
849 195 03/11/2010 P
849 336 21/02/2011 CP
923 441 26/07/2010 S
963 371 29/10/2010 CP
981 624 23/03/2011 S
984 384 13/11/2010 CP
984 392 18/11/2010 CP

Have you tried using a HAVING clause? Maybe I've messed something up, but your test data only seems to have one such entry, namely when checking against the 'S' case alone, as follows:
SELECT e.[CARE_ID], e.[EVENT_ID]
FROM dbo.EVENTS e
WHERE e.[EVENT_TYPE] = 'CP'
GROUP BY e.[CARE_ID], e.[EVENT_ID]
HAVING MAX( e.[EVENT_DATE] ) <= ( SELECT MIN( [EVENT_DATE] )
                                  FROM [EVENTS]
                                  WHERE [EVENT_ID] = e.[EVENT_ID]
                                    AND [EVENT_TYPE] = 'S' );
Here's an SQL Fiddle to help sort you out!
Edit: The original fiddle correlated on the EVENT_ID, when we're supposed to be looking for the maximum date per CARE_ID. I think this will get you on the right track!
SELECT e.[CARE_ID], e.[EVENT_ID]
FROM dbo.EVENTS e
WHERE e.[EVENT_TYPE] = 'CP'
GROUP BY e.[CARE_ID], e.[EVENT_ID]
HAVING MAX( e.[EVENT_DATE] ) <= ( SELECT MIN( [EVENT_DATE] )
                                  FROM [EVENTS]
                                  WHERE [CARE_ID] = e.[CARE_ID]
                                    AND [EVENT_TYPE] = 'S' );
Edit 3: Now with proper DATETIME!
Edit 4: Distinct EVENT_ID!
SELECT DISTINCT e.[EVENT_ID]
FROM dbo.EVENTS e
WHERE e.[EVENT_TYPE] = 'CP'
GROUP BY e.[CARE_ID], e.[EVENT_ID]
HAVING MAX( e.[EVENT_DATE] ) <= ( SELECT MIN( [EVENT_DATE] )
                                  FROM dbo.[EVENTS]
                                  WHERE [CARE_ID] = e.[CARE_ID]
                                    AND [EVENT_TYPE] = 'S' );
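Note that the original requirement compares against the earliest date across EVENT_TYPE IN ('B','CH','S','T'), not just 'S'; widening the subquery's filter to the full list is a small change (sketch only, against the same EVENTS table):

```sql
-- Sketch: same shape as above, but the inner MIN() considers all four
-- comparison event types from the requirement, not only 'S'.
SELECT DISTINCT e.[EVENT_ID]
FROM dbo.EVENTS e
WHERE e.[EVENT_TYPE] = 'CP'
GROUP BY e.[CARE_ID], e.[EVENT_ID]
HAVING MAX( e.[EVENT_DATE] ) <= ( SELECT MIN( [EVENT_DATE] )
                                  FROM dbo.[EVENTS]
                                  WHERE [CARE_ID] = e.[CARE_ID]
                                    AND [EVENT_TYPE] IN ('B','CH','S','T') );
```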

This translates quite directly into SQL using window functions:
select care_id, event_id
from (select e.*,
             max(case when event_type = 'CP' then event_date end) over (partition by care_id) as MaxED_CP,
             min(case when event_type in ('B', 'CH', 'S', 'T') then event_date end) over (partition by care_id) as MinED_others
      from events e
     ) e
where event_type = 'CP' and
      MaxED_CP <= MinED_others and
      event_date = MaxED_CP;
These window functions calculate the max() and min() values on every row; the outer query then selects just the appropriate rows.
Note that if no such event exists for a given care_id, then that care_id is not in the output.
This is not backwards compatible with SQL Server 2000. Once the logic works, you can replace the window functions with correlated subqueries.
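For a SQL Server 2000-compatible version, the window functions can be replaced with correlated subqueries. A sketch, assuming the same EVENTS(CARE_ID, EVENT_ID, EVENT_DATE, EVENT_TYPE) table:

```sql
-- Sketch only: correlated-subquery equivalent, no window functions.
-- The first subquery pins e to the latest 'CP' row per CARE_ID;
-- the second enforces MAX(CP date) <= MIN(date) over the other types.
SELECT e.CARE_ID, e.EVENT_ID
FROM dbo.EVENTS e
WHERE e.EVENT_TYPE = 'CP'
  AND e.EVENT_DATE = ( SELECT MAX(e2.EVENT_DATE)
                       FROM dbo.EVENTS e2
                       WHERE e2.CARE_ID = e.CARE_ID
                         AND e2.EVENT_TYPE = 'CP' )
  AND e.EVENT_DATE <= ( SELECT MIN(e3.EVENT_DATE)
                        FROM dbo.EVENTS e3
                        WHERE e3.CARE_ID = e.CARE_ID
                          AND e3.EVENT_TYPE IN ('B','CH','S','T') );
```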

Related

rename column titles unstacked data pandas

I have a data table derived via unstacking an existing dataframe:
Day 0 1 2 3 4 5 6
Hrs
0 223 231 135 122 099 211 217
1 156 564 132 414 156 454 157
2 950 178 121 840 143 648 192
3 025 975 151 185 341 145 888
4 111 264 469 330 671 201 345
-- -- -- -- -- -- -- --
I want to simply change the column titles so I have the days of the week displayed instead of numbered. Something like this:
Day Mon Tue Wed Thu Fri Sat Sun
Hrs
0 223 231 135 122 099 211 217
1 156 564 132 414 156 454 157
2 950 178 121 840 143 648 192
3 025 975 151 185 341 145 888
4 111 264 469 330 671 201 345
-- -- -- -- -- -- -- --
I've tried .rename(columns = {'original':'new', etc}, inplace = True) and other similar functions, none of which have worked.
I also tried going back to the original dataframe and creating a dt.day_name column from the parsed dates, but it came out with the days of the week mixed up.
I'm sure it's a simple fix, but I'm living off nothing but caffeine, so help would be appreciated.
You can try assigning the column labels directly:
import pandas as pd
df = pd.DataFrame(columns=[0,1,2,3,4,5,6])
df.columns = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
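The likely reason .rename(columns={'original':'new'}) appeared to do nothing is that unstacking produces integer column labels, and rename silently ignores keys that don't match existing labels, so string keys like '0' never hit. A small sketch with integer keys (the data row is made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the unstacked data: integer column labels 0..6.
df = pd.DataFrame([[223, 231, 135, 122, 99, 211, 217]], columns=range(7))

# rename silently skips non-matching keys, so the keys must be the
# existing integer labels, not strings like '0'.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
df = df.rename(columns=dict(zip(range(7), days)))
print(df.columns.tolist())  # ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
```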

How to aggregate multiple columns - Pandas

I have this df:
ID Date XXX 123_Var 456_Var 789_Var 123_P 456_P 789_P
A 07/16/2019 1 987 551 313 22 12 94
A 07/16/2019 9 135 748 403 92 40 41
A 07/18/2019 8 376 938 825 14 69 96
A 07/18/2019 5 259 176 674 52 75 72
B 07/16/2019 9 690 304 948 56 14 78
B 07/16/2019 8 819 185 699 33 81 83
B 07/18/2019 1 580 210 847 51 64 87
I want to group the df by ID and Date, aggregate the XXX column by the maximum value, and aggregate 123_Var, 456_Var, 789_Var columns by the minimum value.
* Note: The df contains many of these columns; they follow the naming pattern {some int}_Var.
This is the current code I've started to write:
df = (df.groupby(['ID','Date'], as_index=False)
.agg({'XXX':'max', list(df.filter(regex='_Var')): 'min'}))
Expected result:
ID Date XXX 123_Var 456_Var 789_Var
A 07/16/2019 9 135 551 313
A 07/18/2019 8 259 176 674
B 07/16/2019 9 690 185 699
B 07/18/2019 1 580 210 847
Create the dictionary dynamically with dict.fromkeys, then merge it with the {'XXX':'max'} dict and pass the result to GroupBy.agg:
d = dict.fromkeys(df.filter(regex='_Var').columns, 'min')
df = df.groupby(['ID','Date'], as_index=False).agg({**{'XXX':'max'}, **d})
print (df)
ID Date XXX 123_Var 456_Var 789_Var
0 A 07/16/2019 9 135 551 313
1 A 07/18/2019 8 259 176 674
2 B 07/16/2019 9 690 185 699
3 B 07/18/2019 1 580 210 847
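As a quick end-to-end check, that approach reproduces the expected result on the question's data (a sketch; one of the _P columns is included to show it simply drops out of the aggregation):

```python
import pandas as pd

# The question's frame, with one *_P column kept to show it is dropped.
df = pd.DataFrame({
    'ID':      ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Date':    ['07/16/2019', '07/16/2019', '07/18/2019', '07/18/2019',
                '07/16/2019', '07/16/2019', '07/18/2019'],
    'XXX':     [1, 9, 8, 5, 9, 8, 1],
    '123_Var': [987, 135, 376, 259, 690, 819, 580],
    '456_Var': [551, 748, 938, 176, 304, 185, 210],
    '789_Var': [313, 403, 825, 674, 948, 699, 847],
    '123_P':   [22, 92, 14, 52, 56, 33, 51],
})

# Build {'<n>_Var': 'min'} for every matching column, merge in XXX: max.
d = dict.fromkeys(df.filter(regex='_Var').columns, 'min')
out = df.groupby(['ID', 'Date'], as_index=False).agg({**{'XXX': 'max'}, **d})
print(out)
```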

To find avg in pig and sort it in ascending order

I have a schema with 9 fields and I want to take only two of them (the 6th and 7th, i.e. $5 and $6). I want to calculate the average of $5 and sort $6 in ascending order. How can I do this? Can someone help me?
Input Data:
N368SW 188 170 175 17 -1 MCO MHT 1142
N360SW 100 115 87 -10 5 MCO MSY 550
N626SW 114 115 90 13 14 MCO MSY 550
N252WN 107 115 84 -10 -2 MCO MSY 550
N355SW 104 115 85 -1 10 MCO MSY 550
N405WN 113 110 96 14 11 MCO ORF 655
N456WN 110 110 92 24 24 MCO ORF 655
N743SW 144 155 124 7 18 MCO PHL 861
N276WN 142 150 129 -2 6 MCO PHL 861
N369SW 153 145 134 30 22 MCO PHL 861
N363SW 151 145 137 5 -1 MCO PHL 861
N346SW 141 150 128 51 60 MCO PHL 861
N785SW 131 145 118 -15 -1 MCO PHL 861
N635SW 144 155 127 -6 5 MCO PHL 861
N242WN 298 300 276 68 70 MCO PHX 1848
N439WN 130 140 111 -4 6 MCO PIT 834
N348SW 140 135 124 7 2 MCO PIT 834
N672SW 136 135 122 9 8 MCO PIT 834
N493WN 151 160 136 -9 0 MCO PVD 1073
N380SW 170 155 155 13 -2 MCO PVD 1073
N705SW 164 160 147 6 2 MCO PVD 1073
N233LV 157 160 143 1 4 MCO PVD 1073
N786SW 156 160 139 6 10 MCO PVD 1073
N280WN 160 160 146 1 1 MCO PVD 1073
N282WN 104 95 81 10 1 MCO RDU 534
N694SW 89 100 77 3 14 MCO RDU 534
N266WN 94 95 82 9 10 MCO RDU 534
N218WN 98 100 77 12 14 MCO RDU 534
N355SW 47 50 35 15 18 MCO RSW 133
N388SW 44 45 30 37 38 MCO RSW 133
N786SW 46 50 31 4 8 MCO RSW 133
N707SA 52 50 33 10 8 MCO RSW 133
N795SW 176 185 153 -9 0 MCO SAT 1040
N402WN 176 185 161 4 13 MCO SAT 1040
N690SW 123 130 107 -1 6 MCO SDF 718
N457WN 135 130 105 20 15 MCO SDF 718
N720WN 144 155 131 13 24 MCO STL 880
N775SW 147 160 135 -6 7 MCO STL 880
N291WN 136 155 122 96 115 MCO STL 880
N247WN 144 155 127 43 54 MCO STL 880
N748SW 179 185 159 -4 2 MDW ABQ 1121
N709SW 176 190 158 21 35 MDW ABQ 1121
N325SW 110 105 97 36 31 MDW ALB 717
N305SW 116 110 90 107 101 MDW ALB 717
N403WN 145 165 128 -6 14 MDW AUS 972
N767SW 136 165 125 59 88 MDW AUS 972
N730SW 118 120 100 28 30 MDW BDL 777
I have written the code like this, but it is not working properly:
a = load '/path/to/file' using PigStorage('\t');
b = foreach a generate (int)$5 as field_a:int,(chararray)$6 as field_b:chararray;
c = group b all;
d = foreach c generate b.field_b,AVG(b.field_a);
e = order d by field_b ASC;
dump e;
I am facing error at order by:
grunt> a = load '/user/horton/sample_pig_data.txt' using PigStorage('\t');
grunt> b = foreach a generate (int)$5 as fielda:int,(chararray)$6 as fieldb:chararray;
grunt> describe b;
b: {fielda: int,fieldb: chararray}
grunt> c = group b all;
grunt> describe c;
c: {group: chararray,b: {(fielda: int,fieldb: chararray)}}
grunt> d = foreach c generate b.fieldb,AVG(b.fielda);
grunt> e = order d by fieldb ;
2017-01-05 15:51:29,623 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 6, column 15> Invalid field projection. Projected field [fieldb] does not exist in schema: :bag{:tuple(fieldb:chararray)},:double.
Details at logfile: /root/pig_1483631021021.log
I want output like(not related to input data):
(({(Bharathi),(Komal),(Archana),(Trupthi),(Preethi),(Rajesh),(siddarth),(Rajiv) },
{ (72) , (83) , (87) , (75) , (93) , (90) , (78) , (89) }),83.375)
If you have found the answer, best practice is to post it so that others referring to this can have a better understanding.
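One note on the error itself: after GROUP b ALL and the subsequent FOREACH, fieldb only exists inside a bag, so ORDER cannot project it as a top-level field. A sketch that keeps fieldb sortable is to group by it instead, giving one average per fieldb value (untested; assumes the same tab-separated input layout):

```pig
a = LOAD '/path/to/file' USING PigStorage('\t');
b = FOREACH a GENERATE (int)$5 AS fielda, (chararray)$6 AS fieldb;
c = GROUP b BY fieldb;                         -- group per fieldb, not ALL
d = FOREACH c GENERATE group AS fieldb, AVG(b.fielda) AS avg_a;
e = ORDER d BY fieldb ASC;                     -- fieldb is now a top-level field
DUMP e;
```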

How to query using an array of columns on SQL Server 2008

Can you please help with this? I'm trying to write a query which retrieves a total amount from an array of columns, and I don't know if there is a way to do this. I retrieve the array of columns I need from this query:
USE Facebook_Global
GO
SELECT c.name AS column_name
FROM sys.tables AS t
INNER JOIN sys.columns AS c
ON t.OBJECT_ID = c.OBJECT_ID
WHERE t.name LIKE '%Lifetime Likes by Gender and###$%' and c.name like '%m%'
Which gives me this table
column_name
M#13-17
M#18-24
M#25-34
M#35-44
M#45-54
M#55-64
M#65+
So I need a query that gives me a TotalAmount of those columns listed in that table. Can this be possible?
Just to clarify a little:
I have this table
Date F#13-17 F#18-24 F#25-34 F#35-44 F#45-54 F#55-64 F#65+ M#13-17 M#18-24 M#25-34 M#35-44 M#45-54 M#55-64 M#65+
2015-09-06 00:00:00.000 257 3303 1871 572 235 116 71 128 1420 824 251 62 32 30
2015-09-07 00:00:00.000 257 3302 1876 571 234 116 72 128 1419 827 251 62 32 30
2015-09-08 00:00:00.000 257 3304 1877 572 234 116 73 128 1421 825 253 62 32 30
2015-09-09 00:00:00.000 257 3314 1891 575 236 120 73 128 1438 828 254 62 33 30
2015-09-10 00:00:00.000 259 3329 1912 584 245 131 76 128 1460 847 259 66 37 31
2015-09-11 00:00:00.000 259 3358 1930 605 248 136 79 128 1475 856 261 67 39 31
2015-09-12 00:00:00.000 259 3397 1953 621 255 139 79 128 1486 864 264 68 41 31
2015-09-13 00:00:00.000 259 3426 1984 642 257 144 80 129 1499 883 277 74 42 32
And I need a column with a SUM of all the columns containing the word F, and another for the columns containing the word M, instead of writing something like this:
F#13-17+F#18-24+F#25-34+F#35-44+F#45-54+etc.
Is this possible?
Try something like this:
with derivedTable as
(
    -- sql from your question goes here
)
select column_name
from derivedTable
union
select cast(count(*) as varchar(10)) + ' records'
from derivedTable
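That only lists the column names, though. To actually total a variable set of columns on SQL Server 2008, the usual pattern is dynamic SQL: build the sum expression from sys.columns and execute it. A sketch (the target table name below is a placeholder, since only the LIKE pattern appears in the question):

```sql
-- Sketch: build "[M#13-17] + [M#18-24] + ..." from sys.columns and run it.
-- QUOTENAME is needed because the column names contain # and -.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

SELECT @cols = COALESCE(@cols + ' + ', '') + QUOTENAME(c.name)
FROM sys.tables AS t
INNER JOIN sys.columns AS c
    ON t.object_id = c.object_id
WHERE t.name LIKE '%Lifetime Likes by Gender and%'
  AND c.name LIKE 'M%';

SET @sql = N'SELECT [Date], ' + @cols + N' AS TotalAmount '
         + N'FROM [dbo].[YourLifetimeLikesTable];';  -- table name is a placeholder

EXEC sp_executesql @sql;
```

Repeating the same build with `c.name LIKE 'F%'` would give the second total; both expressions can also be concatenated into a single SELECT.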

Group clause in SQL command

I have 3 tables: Deliveries, IssuedWarehouse, ReturnedStock.
Deliveries: ID, OrderNumber, Material, Width, Gauge, DelKG
IssuedWarehouse: OrderNumber, IssuedKG
ReturnedStock: OrderNumber, IssuedKG
What I'd like to do is group all the orders by Material, Width and Gauge and then sum the amount delivered, issued to the warehouse and issued back to stock.
This is the SQL that is really quite close:
SELECT
    DELIVERIES.Material,
    DELIVERIES.Width,
    DELIVERIES.Gauge,
    Count(DELIVERIES.OrderNo) AS [Orders Placed],
    Sum(DELIVERIES.DeldQtyKilos) AS [KG Delivered],
    Sum(IssuedWarehouse.[Qty Issued]) AS [Film Issued],
    Sum([Film Retns].[Qty Issued]) AS [Film Returned],
    [KG Delivered] - [Film Issued] + [Film Returned] AS [Qty Remaining]
FROM (DELIVERIES
INNER JOIN IssuedWarehouse
    ON DELIVERIES.OrderNo = IssuedWarehouse.[Order No From])
INNER JOIN [Film Retns]
    ON DELIVERIES.OrderNo = [Film Retns].[Order No From]
GROUP BY Material, Width, Gauge, ActDelDate
HAVING ActDelDate Between [start date] And [end date]
ORDER BY DELIVERIES.Material;
This groups the products almost perfectly. However if you take a look at the results:
Material Width Gauge Orders Placed Delivered Qnty Kilos Film Issued Film Returned Qty Remaining
COEX-GLOSS 590 75 1 534 500 124 158
COEX-MATT 1080 80 1 4226 4226 52 52
CPP 660 38 8 6720 2768 1384 5336
CPP 666 47 1 5677 5716 536 497
CPP 690 65 2 1232 717 202 717
CPP 760 38 3 3444 1318 510 2636
CPP 770 38 4 4316 3318 2592 3590
CPP 786 38 2 672 442 212 442
CPP 800 47 1 1122 1122 116 116
CPP 810 47 1 1127 1134 69 62
CPP 810 47 2 2250 1285 320 1285
CPP 1460 38 12 6540 4704 2442 4278
LD 975 75 1 502 502 182 182
LDPE 450 50 1 252 252 50 50
LDPE 520 70 1 250 250 95 95
LDPE 570 65 2 504 295 86 295
LDPE 570 65 2 508 278 48 278
LDPE 620 50 1 252 252 67 67
LDPE 660 50 1 256 256 62 62
LDPE 670 75 1 248 248 80 80
LDPE 690 47 1 476 476 390 390
LDPE 790 38 2 2104 1122 140 1122
LDPE 790 50 1 286 286 134 134
LDPE 790 50 1 250 250 125 125
LDPE 810 30 1 4062 4062 100 100
LDPE 843 33 1 408 408 835 835
LDPE 850 80 1 412 412 34 34
LDPE 855 30 1 740 740 83 83
LDPE 880 60 1 304 304 130 130
LDPE 900 70 2 1000 650 500 850
LDPE 1017 60 1 1056 1056 174 174
OPP 25 1100 1 381 381 95 95
OPP 1000 30 2 1358 1112 300 546
OPP 1000 30 1 1492 1491 100 101
OPP 1200 20 1 418 417 461 462
PET 760 12 3 1227 1876 132 -517
You'll see that there are some materials that have the same width and gauge yet they are not grouped. I think this is because the delivered qty is different on the orders. For example:
Material Width Gauge Orders Placed Delivered Qnty Kilos Film Issued Film Returned Qty Remaining
LDPE 620 50 1 252 252 67 67
LDPE 660 50 1 256 256 62 62
I would like these two rows to be grouped. They have the same material, width and gauge but the delivered qty is different therefore it hasn't grouped it.
Can anyone help me group these strange rows?
Your "problem" is that the deliveries occurred on different dates. You're grouping by ActDelDate, so the data splits; but because you haven't selected the ActDelDate column, this isn't obvious.
The fix: remove ActDelDate from the GROUP BY list.
You should also remove the unnecessary brackets around the first join, and change
HAVING ActDelDate Between [start date] And [end date]
to
WHERE ActDelDate Between [start date] And [end date]
and place it before the GROUP BY.
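Putting that advice together, the corrected query might look like this (a sketch keeping the question's table and column names; [start date] / [end date] stay as parameters, and the [Qty Remaining] arithmetic is expanded since the select-list aliases cannot be referenced there):

```sql
SELECT
    DELIVERIES.Material,
    DELIVERIES.Width,
    DELIVERIES.Gauge,
    COUNT(DELIVERIES.OrderNo) AS [Orders Placed],
    SUM(DELIVERIES.DeldQtyKilos) AS [KG Delivered],
    SUM(IssuedWarehouse.[Qty Issued]) AS [Film Issued],
    SUM([Film Retns].[Qty Issued]) AS [Film Returned],
    SUM(DELIVERIES.DeldQtyKilos)
      - SUM(IssuedWarehouse.[Qty Issued])
      + SUM([Film Retns].[Qty Issued]) AS [Qty Remaining]
FROM DELIVERIES
INNER JOIN IssuedWarehouse
    ON DELIVERIES.OrderNo = IssuedWarehouse.[Order No From]
INNER JOIN [Film Retns]
    ON DELIVERIES.OrderNo = [Film Retns].[Order No From]
WHERE ActDelDate BETWEEN [start date] AND [end date]
GROUP BY Material, Width, Gauge
ORDER BY DELIVERIES.Material;
```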
You are grouping by the delivery date, which is causing the rows to be split. Either omit the delivery date from the results and group by, or take the min/max of the delivery date.