Conditional sum in SQL (SAS) (SUMIFS equivalent) - Part 2 - sql

Let's say I have two tables:
Table1:
ID Item
1 A
1 B
1 A
2 B
2 B
3 A
3 B
3 B
3 A
Table2:
ID A B C
1 91 94 90
2 100 97 93
1 97 94 96
2 97 95 90
3 99 100 93
1 90 97 97
Now I would like to compute a conditional sum for each row of table1 from table2, where the ID matches by row and the Item matches by column name:
ID Item Want
1 A 278
1 B 285
2 A 197
2 B
2 B
3 A
3 B
3 B
3 A
So 278 is the sum of column A over all rows with ID 1, 285 is the sum of column B over all rows with ID 1, and 197 is the sum of column A over all rows with ID 2.
How can I do this in SQL?
Thanks in advance.

You can use join and conditional aggregation:
select t1.id, t1.item,
sum(case when t1.item = 'A' then t2.A
when t1.item = 'B' then t2.B
when t1.item = 'C' then t2.C
end) as want
from table1 t1 left join
table2 t2
on t1.id = t2.id
group by t1.id, t1.item
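As a rough illustration, the same join with conditional aggregation can be tried on the question's sample data using Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for SAS PROC SQL here, and the SELECT DISTINCT on table1 is an added assumption, because table1 repeats (ID, Item) pairs and each repeat would otherwise multiply the joined rows and inflate the sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, item TEXT);
    CREATE TABLE table2 (id INTEGER, a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO table1 VALUES (1,'A'),(1,'B'),(1,'A'),(2,'B'),(2,'B'),
                              (3,'A'),(3,'B'),(3,'B'),(3,'A');
    INSERT INTO table2 VALUES (1,91,94,90),(2,100,97,93),(1,97,94,96),
                              (2,97,95,90),(3,99,100,93),(1,90,97,97);
""")

# Pick the table2 column named by table1.item, then sum it per (id, item).
# DISTINCT is an assumption added here so repeated pairs are counted once.
query = """
    SELECT t1.id, t1.item,
           SUM(CASE t1.item WHEN 'A' THEN t2.a
                            WHEN 'B' THEN t2.b
                            WHEN 'C' THEN t2.c END) AS want
    FROM (SELECT DISTINCT id, item FROM table1) AS t1
    LEFT JOIN table2 AS t2 ON t1.id = t2.id
    GROUP BY t1.id, t1.item
    ORDER BY t1.id, t1.item
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On this data, ID 1 gives 278 for item A and 285 for item B, matching the figures in the question.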

Proc MEANS is built from the ground up for the sole purpose of computing statistics for aggregates.
Consider this example:
data have; input
ID $ A B C; datalines;
1 91 94 90
2 100 97 93
1 97 94 96
2 97 95 90
3 99 100 93
1 90 97 97
;
ods _all_ close;
proc means data=have stackodsoutput sum;
class id;
var a b c;
ods output summary=want;
run;
That produces the data set want.

Randomly select rows in BigQuery table for each ID

I have a BigQuery table consisting of multiple entries for each ID for each day. Basically, the IDs are stores with a list of products for which 2 columns represent properties.
Store Product Date property1 property2
0 ID1 A1 202212-01 1 5
1 ID1 A1 202212-02 2 6
2 ID1 A1 202212-03 3 7
3 ID1 A1 202212-04 4 8
4 ID1 A1 202212-05 5 9
5 ID1 A1 202212-06 6 10
6 ID1 A1 202212-07 7 11
7 ID1 A1 202212-08 8 12
8 ID1 A1 202212-09 9 13
9 ID1 A1 202212-10 10 14
10 ID1 A2 202212-01 11 15
11 ID1 A2 202212-02 12 16
12 ID1 A2 202212-03 13 17
13 ID1 A2 202212-04 14 18
14 ID1 A2 202212-05 15 19
15 ID1 A2 202212-06 16 20
16 ID1 A2 202212-07 17 21
17 ID1 A2 202212-08 18 22
18 ID1 A2 202212-09 19 23
19 ID1 A2 202212-10 20 24
20 ID2 B1 202212-01 21 25
21 ID2 B1 202212-02 22 26
22 ID2 B1 202212-03 23 27
23 ID2 B1 202212-04 24 28
24 ID2 B1 202212-05 25 29
25 ID2 B1 202212-06 26 30
26 ID2 B1 202212-07 27 31
27 ID2 B1 202212-08 28 32
28 ID2 B1 202212-09 29 33
29 ID2 B1 202212-10 30 34
30 ID2 B2 202212-01 31 35
31 ID2 B2 202212-02 32 36
32 ID2 B2 202212-03 33 37
33 ID2 B2 202212-04 34 38
34 ID2 B2 202212-05 35 39
35 ID2 B2 202212-06 36 40
36 ID2 B2 202212-07 37 41
37 ID2 B2 202212-08 38 42
38 ID2 B2 202212-09 39 43
39 ID2 B2 202212-10 40 44
Now, the real table consists of more than a billion rows, so I want to take a random sample of products for the last day of entry, but the sample needs to include ALL stores.
I tried the following approach:
Since I want the last date of entry I use a with clause to limit to the last date (max(DATE(product_timestamp))) and list all the stores with another with clause on stores. I then take the random sample:
query_random_sample = """
with maxdate as (select max(DATE(product_timestamp)) as maxdate from `MyProject.DataSet1.product_timeline`)
,
stores as (select store from `MyProject.DataSet1.stores`)
select t.*,
t2.ProductDescription,
t2.ProductName,
t2.CreatedDate,
from (`MyProject.DataSet1.product_timeline` as t
join `MyProject.DataSet2.LableStore` as t2
on t.store = t2.store
and t.barcode = t2.barcode
join maxdate
on maxdate.maxdate = DATE(t.product_timestamp)
)
join stores
on stores.store = t.store
where rand()< 0.01
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
]
)
sampled_labels = bigquery_client.query(query_random_sample, job_config=job_config).to_dataframe()
The problem is that this also samples over stores, whereas I want the sample to be over products within each store.
I work in Python, and an alternative would be to run the query once per store, but the cost of that would be huge (over 1200 stores).
How can I solve this in a cost-efficient way?
If I'm right to assume you want a random sample specific to each store, then I think your best bet is using a window function to do your random selection, using a window partitioned by Store:
SELECT
Store,
Product,
Date,
property1,
property2,
FROM
`MyProject.DataSet1.product_timeline`
QUALIFY
PERCENT_RANK() OVER(all_stores_rand) < 0.01
WINDOW
all_stores_rand AS (
PARTITION BY Store
ORDER BY RAND()
)
To explain that, we are partitioning the table into one group per value of Store (analogous to what we'd do for a GROUP BY), then calculating PERCENT_RANK over a set of random numbers separately for each store (generating these numbers using RAND()).
Since the part of the table corresponding to each Store must then yield a set of values evenly spanning 0 to 1, we can throw this into a QUALIFY (BigQuery's filter clause for window expressions) in order to just grab 1% of the values for each Store.
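The per-store sampling idea can be checked locally on toy data. The sketch below is an illustration under stated assumptions, not the BigQuery query itself: it uses Python's sqlite3 module (SQLite 3.25 or later for window functions), a simplified hypothetical product_timeline table, random() in place of RAND(), and an outer WHERE in place of QUALIFY, which SQLite does not have. The 30% fraction is chosen only so that ten rows per store yield a visible sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_timeline (store TEXT, product TEXT, day INTEGER)")
# Toy data: 2 stores x 2 products x 5 days = 10 rows per store.
conn.executemany(
    "INSERT INTO product_timeline VALUES (?, ?, ?)",
    [(store, f"{store}-P{p}", day)
     for store in ("ID1", "ID2") for p in (1, 2) for day in range(1, 6)],
)

# Rank rows in random order separately within each store, then keep
# the rows whose percent rank falls below the sampling fraction.
sample = conn.execute("""
    SELECT store, product, day
    FROM (
        SELECT store, product, day,
               PERCENT_RANK() OVER (PARTITION BY store ORDER BY random()) AS pr
        FROM product_timeline
    )
    WHERE pr < 0.3
""").fetchall()

# Count sampled rows per store to confirm every store is represented.
per_store = {}
for store, _product, _day in sample:
    per_store[store] = per_store.get(store, 0) + 1
print(per_store)
```

Because the rank is computed inside each partition, every store contributes the same share of its own rows, which a plain WHERE rand() < 0.01 over the whole table does not guarantee.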

How to find a specific value in consecutive date

I need your help for a little issue.
I use MS ACCESS to work with a database and I need to resolve a query. My query asks:
Find the CUSTOMER_ID and TRANSC_ID where 2 consecutive values are between 200 and 500 WITHIN the same TRANSC_ID.
Let me explain.
I have this table in this format:
CUSTOMER_ID TRANSC_ID VALUE VALUE_DATE
51 10 15 29-12-1999
51 10 20 15-07-2000
51 10 35 18-08-2000
51 10 250 30-08-2000
51 10 13 10-09-2000
51 10 450 15-09-2000
51 11 5 15-09-2000
51 11 23 30-09-2000
51 11 490 10-10-2000
51 11 300 12-10-2000
51 11 85 30-10-2000
51 11 98 01-01-2000
53 10 65 15-10-2000
53 10 14 29-12-2000
And I need just
51 11 490 10-10-2000
51 11 300 12-10-2000
because the two values are consecutive (and both of them are >250 and <500).
How can I make a query in MS ACCESS to obtain this result?
Thank you.
You can get the "next" and "previous" values using correlated subqueries, and then do the comparison:
select t.*
from t
where t.value between 200 and 500 and
( (select top 1 t2.value
from t as t2
where t2.CUSTOMER_ID = t.CUSTOMER_ID and t2.TRANSC_ID = t.TRANSC_ID and
t2.value_date > t.value_date
order by t2.value_date
) between 200 and 500 or
(select top 1 t2.value
from t as t2
where t2.CUSTOMER_ID = t.CUSTOMER_ID and t2.TRANSC_ID = t.TRANSC_ID and
t2.value_date < t.value_date
order by t2.value_date desc
) between 200 and 500
);
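To see the correlated-subquery approach run on the sample data, here is a minimal sketch using Python's sqlite3 module rather than MS Access. Two things differ from the Access query and are assumptions of this sketch: SQLite uses LIMIT 1 where Access uses TOP 1, and the dates are stored as ISO strings (YYYY-MM-DD) so that text comparison orders them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (customer_id INTEGER, transc_id INTEGER, "
    "value INTEGER, value_date TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (51, 10, 15, "1999-12-29"), (51, 10, 20, "2000-07-15"),
    (51, 10, 35, "2000-08-18"), (51, 10, 250, "2000-08-30"),
    (51, 10, 13, "2000-09-10"), (51, 10, 450, "2000-09-15"),
    (51, 11, 5, "2000-09-15"), (51, 11, 23, "2000-09-30"),
    (51, 11, 490, "2000-10-10"), (51, 11, 300, "2000-10-12"),
    (51, 11, 85, "2000-10-30"), (51, 11, 98, "2000-01-01"),
    (53, 10, 65, "2000-10-15"), (53, 10, 14, "2000-12-29"),
])

# Keep a row if it is in range and its next or previous row
# (by date, within the same customer and transaction) is also in range.
query = """
    SELECT t.customer_id, t.transc_id, t.value, t.value_date
    FROM t
    WHERE t.value BETWEEN 200 AND 500
      AND ( (SELECT t2.value FROM t AS t2
             WHERE t2.customer_id = t.customer_id
               AND t2.transc_id = t.transc_id
               AND t2.value_date > t.value_date
             ORDER BY t2.value_date LIMIT 1) BETWEEN 200 AND 500
         OR (SELECT t2.value FROM t AS t2
             WHERE t2.customer_id = t.customer_id
               AND t2.transc_id = t.transc_id
               AND t2.value_date < t.value_date
             ORDER BY t2.value_date DESC LIMIT 1) BETWEEN 200 AND 500 )
    ORDER BY t.value_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A row with no next (or previous) neighbour yields NULL from the subquery, so that branch of the OR simply does not match.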

SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery

In an SQLite database I have one table where I need to count the duplicates across certain columns (i.e. rows where 3 particular columns are the same) and then also number each of these cases (i.e. if there are 2 occurrences of a particular duplicate, they need to be numbered as 1 and 2). I'm finding it a bit difficult to explain in words so I'll use a simplified example below.
The data I have is similar to the following (first line is header row, table is referenced in following as "idcountdata"):
id match1 match2 match3 data
1 AbCde BC 0 data01
2 AbCde BC 0 data02
3 AbCde BC 1 data03
4 AbCde AB 0 data04
5 FGhiJ BC 0 data05
6 FGhiJ AB 0 data06
7 FGhiJ BC 1 data07
8 FGhiJ BC 1 data08
9 FGhiJ BC 2 data09
10 HkLMop BC 1 data10
11 HkLMop BC 1 data11
12 HkLMop BC 1 data12
13 HkLMop DE 1 data13
14 HkLMop DE 2 data14
15 HkLMop DE 2 data15
16 HkLMop DE 2 data16
17 HkLMop DE 2 data17
And the output I need to generate for the above would be:
id match1 match2 match3 data matchid matchcount
1 AbCde BC 0 data01 1 2
2 AbCde BC 0 data02 2 2
3 AbCde BC 1 data03 1 1
4 AbCde AB 0 data04 1 1
5 FGhiJ BC 0 data05 1 1
6 FGhiJ AB 0 data06 1 1
7 FGhiJ BC 1 data07 1 2
8 FGhiJ BC 1 data08 2 2
9 FGhiJ BC 2 data09 1 1
10 HkLMop BC 1 data10 1 3
11 HkLMop BC 1 data11 2 3
12 HkLMop BC 1 data12 3 3
13 HkLMop DE 1 data13 1 1
14 HkLMop DE 2 data14 1 4
15 HkLMop DE 2 data15 2 4
16 HkLMop DE 2 data16 3 4
17 HkLMop DE 2 data17 4 4
Previously I was using a couple of correlated subqueries to achieve this as follows:
SELECT id, match1, match2, match3, data,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
AND d2.id<=d1.id)
AS matchid,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3)
AS matchcount
FROM idcountdata d1;
But the table has over 200,000 rows (and the data can be variable in length/content) and hence this takes hours to run. (Strangely, when I first used the same query on the same data back in mid-to-late 2013 it took minutes rather than hours, but that is beside the point - even back then I thought it was inelegant and inefficient.)
I've already converted the correlated subquery for "matchcount" in the above to an uncorrelated subquery with a JOIN as follows:
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data,
matchcount
FROM idcountdata d1
JOIN
(SELECT id,match1,match2,match3,count(*) matchcount
FROM idcountdata
GROUP BY match1,match2,match3) d2
ON (d1.match1=d2.match1 and d1.match2=d2.match2 and d1.match3=d2.match3);
So it's just the subquery for "matchid" that I would like some help to optimise.
In short, the following query runs too slowly for larger datasets:
SELECT id, match1, match2, match3, data,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
AND d2.id<=d1.id)
matchid
FROM idcountdata d1;
How can I improve the performance of the above query?
It doesn't have to run in seconds, but it needs to be minutes rather than hours (for around 200,000 rows).
A self join may be faster than a correlated subquery:
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data, count(*) matchid
FROM idcountdata d1
JOIN idcountdata d2 on d1.match1 = d2.match1
and d1.match2 = d2.match2
and d1.match3 = d2.match3
and d1.id >= d2.id
GROUP BY d1.id, d1.match1, d1.match2, d1.match3, d1.data
This query can take advantage of a composite index on (match1, match2, match3, id).
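As an alternative to the self join (a different technique, not the one proposed above), SQLite 3.25 and later support window functions, which can number and count the duplicates in a single pass. The sketch below assumes such a SQLite version is bundled with Python's sqlite3 module and uses the sample rows from the question; the data column is filled in with the question's data01..data17 pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idcountdata (id INTEGER PRIMARY KEY, match1 TEXT, "
    "match2 TEXT, match3 INTEGER, data TEXT)"
)
sample = [
    ("AbCde", "BC", 0), ("AbCde", "BC", 0), ("AbCde", "BC", 1),
    ("AbCde", "AB", 0), ("FGhiJ", "BC", 0), ("FGhiJ", "AB", 0),
    ("FGhiJ", "BC", 1), ("FGhiJ", "BC", 1), ("FGhiJ", "BC", 2),
    ("HkLMop", "BC", 1), ("HkLMop", "BC", 1), ("HkLMop", "BC", 1),
    ("HkLMop", "DE", 1), ("HkLMop", "DE", 2), ("HkLMop", "DE", 2),
    ("HkLMop", "DE", 2), ("HkLMop", "DE", 2),
]
conn.executemany(
    "INSERT INTO idcountdata VALUES (?, ?, ?, ?, ?)",
    [(i, m1, m2, m3, f"data{i:02d}")
     for i, (m1, m2, m3) in enumerate(sample, start=1)],
)

# ROW_NUMBER numbers each duplicate within its group in id order;
# COUNT(*) OVER gives the size of the group on every row.
rows = conn.execute("""
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY match1, match2, match3
                              ORDER BY id) AS matchid,
           COUNT(*) OVER (PARTITION BY match1, match2, match3) AS matchcount
    FROM idcountdata
    ORDER BY id
""").fetchall()
for row in rows:
    print(row)
```

Whether this is faster than the self join on 200,000 rows would need to be measured; the composite index mentioned above may help either form.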

SQL join two table together

I have the following tables:
table1
a b
1 100
2 200
3 300
4 400
table2
c b
55 100
55 200
56 300
I want to get the following output:
55 100 1
55 200 2
55 300 -
55 400 -
56 100 -
56 200 -
56 300 3
56 400 -
I tried the following:
SELECT *
FROM table1
full JOIN table2
output:
a b c a
1 100 55 100
1 100 55 200
1 100 55 100
1 100 55 200
2 300 56 300
....
also I tried:
SELECT *
FROM table1
join table2 on table1.b = table2.b
union
SELECT *
FROM table2
join table1 on table1.b = table2.b
the output:
1 100 55 100
1 200 55 200
3 300 56 300
Is this possible in Microsoft SQL Server 2012, and if so, how?
I'm not completely sure I understand your expected outcome, but it sounds like you're looking for a FULL OUTER JOIN.
SELECT table1.a, COALESCE(table1.b, table2.b), table2.c
FROM table1
FULL OUTER JOIN table2 ON table1.b = table2.b
This will get the fields from table1 and, if any exist, map them to those from table2.
Given your example, it will return the following table.
A B C
1 100 55
2 200 55
3 300 56
4 400 (null)
I know that isn't the same as the expected result you gave, but this will correlate the data that actually exists.
I'm requesting clarification in a comment and will revise this as necessary.
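If the expected output in the question is read literally (every value of c paired with every b, with a shown only where that pair exists in table2), one possible reading is a cross join followed by a left join. The sketch below is a guess at that intent rather than a confirmed answer, and it is run in SQLite through Python's sqlite3 module instead of SQL Server; the alias cs is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b INTEGER);
    CREATE TABLE table2 (c INTEGER, b INTEGER);
    INSERT INTO table1 VALUES (1,100),(2,200),(3,300),(4,400);
    INSERT INTO table2 VALUES (55,100),(55,200),(56,300);
""")

# Pair every distinct c with every row of table1, then show a only
# when that (c, b) combination actually occurs in table2.
query = """
    SELECT cs.c, t1.b,
           CASE WHEN t2.b IS NOT NULL THEN t1.a END AS a
    FROM (SELECT DISTINCT c FROM table2) AS cs
    CROSS JOIN table1 AS t1
    LEFT JOIN table2 AS t2 ON t2.c = cs.c AND t2.b = t1.b
    ORDER BY cs.c, t1.b
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The NULLs in the third column correspond to the "-" entries in the question's expected output.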

Can't get unique values joining two tables

I have 2 tables that I need to join and select the unique rows from. Here is a sample of my data: (there are more columns)
tbl1:
MB# MBName PCCNo_PRI Primary_IP PCCNo_SEC Secondary_IP ID
100 name 0 10.1.9.10 30 10.1.9.10 1
103 name3 17 10.1.9.27 47 10.1.9.67 4
403 name13 17 10.1.9.27 47 10.1.9.67 14
tbl2:
RTU PCC#_PRI PCC#_SEC STATION ADDRESS
15 0 30 6
52 12 42 1
53* 17 47 1
54 18 48 1
63 9 39 2
69* 17 47 2
I need to join the two tables and get the unique RTU(s) in tbl2 for a given MB# in tbl1.
Query =
SELECT t1.MB#,t2.RTU,t2.[Device Manufacturer],t2.PCC#_PRI,t2.PCC#_SEC,t2.[STATION ADDRESS]
INTO C300_RTU_MASTERBLK_Map
FROM mbm_PCDIMasterBlk_tbl as t1, dbo.WOA_PCC_Conn_tbl as t2
WHERE t1.PCCNo_PRI = t2.PCC#_PRI
I am getting duplicate rows for tbl2 53 and 69 (* above). 53 ends up with 2 entries; one to 103 and one 403 (69 gets same). How can I query this for unique RTU(s) to MB#?
The duplicate rows appear because you join on "17", which matches 2 rows on each side.
As it stands, you can't avoid that with this SELECT list.
How do you decide which t1.MB# you want for the t2 columns?
There is no secondary JOIN column that I can see.
So the best you can do is use MAX (or MIN) to pick either 403 or 103.
SELECT
MAX(t1.MB#) AS MB#,
t2.RTU,t2.[Device Manufacturer],t2.PCC#_PRI,t2.PCC#_SEC,t2.[STATION ADDRESS]
INTO C300_RTU_MASTERBLK_Map
FROM
dbo.mbm_PCDIMasterBlk_tbl as t1
JOIN
dbo.WOA_PCC_Conn_tbl as t2 ON t1.PCCNo_PRI = t2.PCC#_PRI
GROUP BY
t2.RTU,t2.[Device Manufacturer],t2.PCC#_PRI,t2.PCC#_SEC,t2.[STATION ADDRESS]
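The effect of MAX with GROUP BY can be seen on a cut-down version of the sample data. This is a sketch only: it runs in SQLite through Python's sqlite3 module, and the table and column names (tbl1, tbl2, mb, pcc_pri and so on) are simplified stand-ins for the real ones, which contain # and spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (mb INTEGER, pcc_pri INTEGER);
    CREATE TABLE tbl2 (rtu INTEGER, pcc_pri INTEGER, pcc_sec INTEGER,
                       station_address INTEGER);
    INSERT INTO tbl1 VALUES (100,0),(103,17),(403,17);
    INSERT INTO tbl2 VALUES (15,0,30,6),(52,12,42,1),(53,17,47,1),
                            (54,18,48,1),(63,9,39,2),(69,17,47,2);
""")

# Plain join: RTU 53 and 69 each match both MB 103 and MB 403.
joined = conn.execute("""
    SELECT t1.mb, t2.rtu
    FROM tbl1 AS t1 JOIN tbl2 AS t2 ON t1.pcc_pri = t2.pcc_pri
""").fetchall()

# Grouping on the tbl2 columns and taking MAX(mb) leaves one row per RTU.
grouped = conn.execute("""
    SELECT MAX(t1.mb) AS mb, t2.rtu, t2.pcc_pri, t2.pcc_sec,
           t2.station_address
    FROM tbl1 AS t1 JOIN tbl2 AS t2 ON t1.pcc_pri = t2.pcc_pri
    GROUP BY t2.rtu, t2.pcc_pri, t2.pcc_sec, t2.station_address
    ORDER BY t2.rtu
""").fetchall()
print(len(joined), grouped)
```

Using MIN instead of MAX would pick 103 rather than 403 for RTU 53 and 69; which one is right depends on a rule the data shown does not supply.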