Randomly select rows in BigQuery table for each ID - sql

I have a BigQuery table consisting of multiple entries for each ID for each day. Basically, the IDs are stores, each with a list of products, and two columns represent properties of each product.
Store Product Date property1 property2
0 ID1 A1 202212-01 1 5
1 ID1 A1 202212-02 2 6
2 ID1 A1 202212-03 3 7
3 ID1 A1 202212-04 4 8
4 ID1 A1 202212-05 5 9
5 ID1 A1 202212-06 6 10
6 ID1 A1 202212-07 7 11
7 ID1 A1 202212-08 8 12
8 ID1 A1 202212-09 9 13
9 ID1 A1 202212-10 10 14
10 ID1 A2 202212-01 11 15
11 ID1 A2 202212-02 12 16
12 ID1 A2 202212-03 13 17
13 ID1 A2 202212-04 14 18
14 ID1 A2 202212-05 15 19
15 ID1 A2 202212-06 16 20
16 ID1 A2 202212-07 17 21
17 ID1 A2 202212-08 18 22
18 ID1 A2 202212-09 19 23
19 ID1 A2 202212-10 20 24
20 ID2 B1 202212-01 21 25
21 ID2 B1 202212-02 22 26
22 ID2 B1 202212-03 23 27
23 ID2 B1 202212-04 24 28
24 ID2 B1 202212-05 25 29
25 ID2 B1 202212-06 26 30
26 ID2 B1 202212-07 27 31
27 ID2 B1 202212-08 28 32
28 ID2 B1 202212-09 29 33
29 ID2 B1 202212-10 30 34
30 ID2 B2 202212-01 31 35
31 ID2 B2 202212-02 32 36
32 ID2 B2 202212-03 33 37
33 ID2 B2 202212-04 34 38
34 ID2 B2 202212-05 35 39
35 ID2 B2 202212-06 36 40
36 ID2 B2 202212-07 37 41
37 ID2 B2 202212-08 38 42
38 ID2 B2 202212-09 39 43
39 ID2 B2 202212-10 40 44
Now, the real table consists of more than a billion rows, so I want to take a random sample of products for the last day of entry, but the sample needs to include ALL stores.
I tried the following approach:
Since I want the last date of entry I use a with clause to limit to the last date (max(DATE(product_timestamp))) and list all the stores with another with clause on stores. I then take the random sample:
query_random_sample = """
with maxdate as (select max(DATE(product_timestamp)) as maxdate from `MyProject.DataSet1.product_timeline`)
,
stores as (select store from `MyProject.DataSet1.stores`)
select t.*,
t2.ProductDescription,
t2.ProductName,
t2.CreatedDate,
from (`MyProject.DataSet1.product_timeline` as t
join `MyProject.DataSet2.LableStore` as t2
on t.store = t2.store
and t.barcode = t2.barcode
join maxdate
on maxdate.maxdate = DATE(t.product_timestamp)
)
join stores
on stores.store = t.store
where rand()< 0.01
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
]
)
sampled_labels = bigquery_client.query(query_random_sample, job_config=job_config).to_dataframe()
The problem is that this also samples across stores, so some stores can be missing from the result. I want the sample to be taken over the products within each store.
I work in Python, and an alternative would be to run the query once per store, but the cost of that would be huge (over 1200 stores).
How can I solve this in a cost-efficient way?

If I'm right to assume you want a random sample specific to each store, then I think your best bet is using a window function to do your random selection, using a window partitioned by Store:
SELECT
Store,
Product,
Date,
property1,
property2,
FROM
`MyProject.DataSet1.product_timeline`
QUALIFY
PERCENT_RANK() OVER(all_stores_rand) < 0.01
WINDOW
all_stores_rand AS (
PARTITION BY Store
ORDER BY RAND()
)
To explain that, we are partitioning the table into one group per value of Store (analogous to what we'd do for a GROUP BY), then calculating PERCENT_RANK over a set of random numbers separately for each store (generating these numbers using RAND()).
Since the part of the table corresponding to each Store must then yield a set of values evenly spanning 0 to 1, we can throw this into a QUALIFY (BigQuery's filter clause for window expressions) in order to just grab 1% of the values for each Store.
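To make the mechanics concrete, here is a minimal Python sketch of the same per-store PERCENT_RANK filter, run on in-memory rows rather than in BigQuery. The function name, the toy data, and the seed are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_per_store(rows, fraction, seed=None):
    """Keep rows whose percent rank, under a random order within their store, is < fraction."""
    rng = random.Random(seed)
    by_store = defaultdict(list)
    for row in rows:
        by_store[row["Store"]].append(row)  # PARTITION BY Store

    sampled = []
    for store_rows in by_store.values():
        shuffled = store_rows[:]
        rng.shuffle(shuffled)  # plays the role of ORDER BY RAND()
        n = len(shuffled)
        for position, row in enumerate(shuffled):
            # PERCENT_RANK is (rank - 1) / (rows in partition - 1), or 0 for a single row
            percent_rank = position / (n - 1) if n > 1 else 0.0
            if percent_rank < fraction:
                sampled.append(row)
    return sampled


# Hypothetical toy data: 3 stores with 200 products each.
rows = [{"Store": f"ID{s}", "Product": f"P{p}"} for s in range(1, 4) for p in range(200)]
sample = sample_per_store(rows, fraction=0.01, seed=0)
```

Because the first row of every partition has a percent rank of 0, each store contributes at least one row, which is the property the question asks for.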

Related

T-SQL creating a hierarchy out of orderly numbers

I have a table like this:
Id code
1 10
2 11
3 20
4 21
5 30
6 31
7 32
8 40
9 10
10 11
11 20
12 21
13 30
14 31
15 32
16 40
17 20
18 21
19 30
20 31
21 32
22 40
23 20
24 21
25 30
26 31
27 32
28 40
29 20
30 21
31 30
32 31
33 32
34 40
35 20
36 21
37 30
38 31
39 32
40 40
41 41
42 90
The column id simply represents the order of the records.
The column code represents the type of record.
The problem is that the records are part of a hierarchy, as shown here:
What I need to obtain is the parent of every record:
Id code Parent
1 10 1
2 11 1
3 20 1
4 21 3
5 30 3
6 31 3
7 32 3
8 40 3
9 10 9
10 11 9
11 20 9
12 21 11
13 30 11
14 31 11
15 32 11
16 40 11
17 20 9
18 21 17
19 30 17
20 31 17
21 32 17
22 40 17
23 20 9
24 21 23
25 30 23
26 31 23
27 32 23
28 40 23
29 20 9
30 21 29
31 30 29
32 31 29
33 32 29
34 40 29
35 20 9
36 21 35
37 30 35
38 31 35
39 32 35
40 40 35
41 41 40
42 90 42
The parent of every record should be expressed as its Id.
The rules are like this:
10s are their own parents, since they are the roots
90s are their own parents, since they mark the end of the data
The parent of a 20 is the previous 10
The parent of a 21, 30, 31, 32, or 33 is the previous 20
The parent of a 40 or 50 is the previous 20
The parent of a 41 is the previous 40
As you can see, the order of the records is very important.
I tried to solve this declaratively (with lag() etc.) and imperatively with loops, but I could not find a solution.
Please help.
This should work. It probably does not have optimal performance, but it's pretty clear what it's doing, so it should be easy to modify if (when!) your hierarchy changes.
It can obviously produce NULLs if your hierarchy or ordering is not as you have described.
CREATE TABLE #data(id INT, code INT);
INSERT INTO #data values
(1 , 10),(2 , 11),(3 , 20),(4 , 21),(5 , 30),(6 , 31),(7 , 32),(8 , 40),(9 , 10),(10 , 11),
(11 , 20),(12 , 21),(13 , 30),(14 , 31),(15 , 32),(16 , 40),(17 , 20),(18 , 21),(19 , 30),(20 , 31),
(21 , 32),(22 , 40),(23 , 20),(24 , 21),(25 , 30),(26 , 31),(27 , 32),(28 , 40),(29 , 20),(30 , 21),
(31 , 30),(32 , 31),(33 , 32),(34 , 40),(35 , 20),(36 , 21),(37 , 30),(38 , 31),(39 , 32),(40 , 40),
(41 , 41),(42 , 90);
WITH
tens AS (SELECT id FROM #data WHERE code = 10),
twenties AS (SELECT id FROM #data WHERE code = 20),
forties AS (SELECT id FROM #data WHERE code = 40)
SELECT #data.id,
#data.code,
CASE WHEN code IN (10,90) THEN #data.id
WHEN code IN (11,20) THEN prev_ten.id
WHEN code IN (21,30,31,32,33,40,50) THEN prev_twenty.id
WHEN code = 41 THEN prev_forty.id
ELSE NULL
END AS Parent
FROM #data
OUTER APPLY (SELECT TOP (1) id FROM tens WHERE tens.id < #data.id ORDER BY tens.id DESC) AS prev_ten
OUTER APPLY (SELECT TOP (1) id FROM twenties WHERE twenties.id < #data.id ORDER BY twenties.id DESC) AS prev_twenty
OUTER APPLY (SELECT TOP (1) id FROM forties WHERE forties.id < #data.id ORDER BY forties.id DESC) AS prev_forty;
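For comparison, the same rules can be sketched imperatively as a single pass that remembers the most recent 10, 20, and 40. This is a minimal Python illustration, not a T-SQL replacement; the function name and the shortened sample are assumptions.

```python
def assign_parents(records):
    """records: (id, code) pairs in id order. Returns (id, code, parent) tuples."""
    last_seen = {10: None, 20: None, 40: None}  # id of the latest row with each code
    result = []
    for rec_id, code in records:
        if code in (10, 90):
            parent = rec_id  # roots and end-of-data rows are their own parents
        elif code in (11, 20):
            parent = last_seen[10]
        elif code in (21, 30, 31, 32, 33, 40, 50):
            parent = last_seen[20]
        elif code == 41:
            parent = last_seen[40]
        else:
            parent = None  # unknown code, mirrors the ELSE NULL branch
        result.append((rec_id, code, parent))
        if code in last_seen:
            # update after choosing the parent, so a row only sees earlier rows
            last_seen[code] = rec_id
    return result


# Shortened, hypothetical sample following the same rules as the question's table.
sample = [(1, 10), (2, 11), (3, 20), (4, 21), (5, 40), (6, 41), (7, 90)]
parents = [parent for _, _, parent in assign_parents(sample)]
```

Like the query above, it yields None (NULL) when a required earlier row does not exist.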
I think you should add a parentId column to the existing table as a FOREIGN KEY referencing Id, fill this new column with an UPDATE (or with data from an external source), and then run SELECT * FROM tableName ORDER BY parentId to get the tree structure.

I have a table fetching values from an SP. I'd like to know if a combination of data is reversely repeated

I have a table named #hierachy, for which there are three columns:
childId, parentId, and linkType
The data is as follows:
childID parentID linktype
30 31 53
31 42 56
31 415349 18
31 437327 18
31 438333 18
35 32 56
32 35 18
32 38 52
32 39 52
32 439395 51
34 40 51
I'd like to spot any reverse repetitions and return a "true" or "1" if there is a reverse repetition between the first two columns childID and parentID
For example, the sixth and seventh lines are reversely similar: One is 35 then 32 and the other is the reverse: 32 and 35.
Thank you!
You can use exists:
select t.*,
(case when exists (select 1
from t t2
where t2.childID = t.parentID and t2.parentID = t.childID
)
then 1 else 0
end) as is_reversed
from t;
If the data is being returned from a stored procedure, you need to either put this logic in the stored procedure or put the data into a table and run the logic using that table.
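If the rows have already been fetched into client code, the same check can be sketched with a set lookup. This is a minimal Python illustration; the function name is hypothetical and the rows are a subset of the sample data in the question.

```python
def flag_reversed(rows):
    """rows: (childID, parentID, linktype) tuples. Appends 1 if the reversed pair exists, else 0."""
    pairs = {(child, parent) for child, parent, _ in rows}
    return [
        (child, parent, link, 1 if (parent, child) in pairs else 0)
        for child, parent, link in rows
    ]


rows = [(30, 31, 53), (31, 42, 56), (35, 32, 56), (32, 35, 18), (32, 38, 52), (34, 40, 51)]
flagged = flag_reversed(rows)
flags = [flag for _, _, _, flag in flagged]
```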

Conditional sum in SQL (SAS) (SUMIFS equivalent) - Part 2

Let's say I have two tables:
Table1:
ID Item
1 A
1 B
1 A
2 B
2 B
3 A
3 B
3 B
3 A
Table2:
ID A B C
1 91 94 90
2 100 97 93
1 97 94 96
2 97 95 90
3 99 100 93
1 90 97 97
Now, for each row of Table1, I would like the conditional sum from Table2: match the ID by row, and use the Item to pick the COLUMN to sum:
ID Item Want
1 A 278
1 B 285
2 A 197
2 B
2 B
3 A
3 B
3 B
3 A
So 278 is the sum of column A for ID 1, 285 is the sum of column B for ID 1, and 197 is the sum of column A for ID 2.
How can I do this in SQL?
Thanks in advance.
You can use join and conditional aggregation:
select t1.id, t1.item,
sum(case when t1.item = 'A' then t2.A
when t1.item = 'B' then t2.B
when t1.item = 'C' then t2.C
end) as want
from table1 t1 left join
table2 t2
on t1.id = t2.id
group by t1.id, t1.item
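The lookup itself can be sketched in plain Python, which may help when checking the expected numbers. Note one design difference: this sketch totals Table2 per ID first and then looks up each (ID, Item) pair, so a pair that appears more than once in Table1 does not multiply the sum, which a direct join followed by GROUP BY would do. The function name and the de-duplicated table1 list are assumptions.

```python
from collections import defaultdict


def conditional_sums(table1, table2):
    """table1: (id, item) pairs; table2: dicts with keys ID, A, B, C."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in table2:
        for column in ("A", "B", "C"):
            totals[row["ID"]][column] += row[column]  # per-ID column totals
    # Look up the total for the column named by each item; None if the ID is absent.
    return [
        (id_, item, totals[id_][item] if id_ in totals else None)
        for id_, item in table1
    ]


table2 = [
    {"ID": 1, "A": 91, "B": 94, "C": 90},
    {"ID": 2, "A": 100, "B": 97, "C": 93},
    {"ID": 1, "A": 97, "B": 94, "C": 96},
    {"ID": 2, "A": 97, "B": 95, "C": 90},
    {"ID": 3, "A": 99, "B": 100, "C": 93},
    {"ID": 1, "A": 90, "B": 97, "C": 97},
]
table1 = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (3, "A"), (3, "B")]
want = conditional_sums(table1, table2)
```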
Proc MEANS is built from the ground up for the sole purpose of computing statistics for aggregates.
Consider this example:
data have;
input ID $ A B C;
datalines;
1 91 94 90
2 100 97 93
1 97 94 96
2 97 95 90
3 99 100 93
1 90 97 97
;
ods _all_ close;
proc means data=have stackodsoutput sum;
class id;
var a b c;
ods output summary=want;
run;
That produces a data set named want containing the sums of A, B, and C for each ID.

Aggregate result from query by quarter SQL

Let's say I have a table which holds all exports for some time back in a Microsoft SQL Server database:
Name:
ExportTable
Columns:
id - numeric(18)
exportdate - datetime
In order to get the number of exports per week I can run the following query:
SELECT DATEPART(ISO_WEEK,[exportdate]) as 'exportdate', count(exportdate) as 'totalExports'
FROM [ExportTable]
Group By DATEPART(ISO_WEEK,[exportdate])
order by exportdate;
Returns:
exportdate totalExports
---------- ------------
27 13
28 12
29 15
30 8
31 17
32 10
33 7
34 15
35 4
36 18
37 10
38 14
39 14
40 21
41 19
Would it be possible to aggregate the weekly results by quarter so the output becomes something like the below?
UPDATE
Sorry for not being crystal clear: I would like each week's result to add up with the previous results until a new quarter starts.
Note week 41 contains 21+19 = 40
Week 39 contains 157 (13+12+15+8+17+10+7+15+4+18+10+14+14)
exportdate totalExports Quarter
---------- ------------ -------
27 13 3
28 25 3
29 40 3
30 48 3
31 65 3
32 75 3
33 82 3
34 97 3
35 101 3
36 119 3
37 129 3
38 143 3
39 157 3 -- Sum of 3 Quarter values.
40 21 4 -- New Quarter show current week value
41 40 4 -- (21+19)
You can use this.
SELECT
DATEPART(ISO_WEEK,[exportdate]) as 'exportdate'
, SUM( count(exportdate) ) OVER ( PARTITION BY DATEPART(QUARTER,MIN([exportdate])) ORDER BY DATEPART(ISO_WEEK,[exportdate]) ROWS UNBOUNDED PRECEDING ) as 'totalExports'
, DATEPART(QUARTER,MIN([exportdate])) [Quarter]
FROM [ExportTable]
Group By DATEPART(ISO_WEEK,[exportdate])
order by exportdate;
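The windowed SUM above is a running total that restarts whenever the quarter changes. A minimal Python sketch of that logic, using the weekly counts from the question, might look like this; the function name is hypothetical, and the week-to-quarter mapping is copied from the expected output rather than computed from dates.

```python
from itertools import accumulate, groupby


def running_total_by_quarter(weekly):
    """weekly: (week, count, quarter) tuples sorted by week. Returns (week, running_total, quarter)."""
    out = []
    # groupby on consecutive rows acts like PARTITION BY quarter for sorted input
    for quarter, group in groupby(weekly, key=lambda r: r[2]):
        group = list(group)
        totals = accumulate(count for _, count, _ in group)  # ROWS UNBOUNDED PRECEDING
        out.extend((week, total, quarter) for (week, _, _), total in zip(group, totals))
    return out


counts = [13, 12, 15, 8, 17, 10, 7, 15, 4, 18, 10, 14, 14, 21, 19]
weekly = [(week, count, 3 if week <= 39 else 4) for week, count in zip(range(27, 42), counts)]
result = running_total_by_quarter(weekly)
```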
You could use a CASE expression to separate the dates into quarters, e.g.
CASE
WHEN EXPORT_DATE BETWEEN '1' AND '4' THEN 1
WHEN Export_Date BETWEEN '5' and '9' THEN 2
ELSE 0
END AS [Quarter]
It's just an example, but you get the idea.
Alternatively, DATEPART(quarter, ...) gives the quarter directly, and you can group by it as well:
SELECT DATEPART(ISO_WEEK,[exportdate]) as 'exportdate',
count(exportdate) as 'totalExports',
DATEPART(quarter,[exportdate]) as quarter
FROM [ExportTable]
Group By DATEPART(ISO_WEEK,[exportdate]), DATEPART(quarter,[exportdate])
order by exportdate;

How to display only a certain amount of data for a specific column

Consider the following table
Dept product name parts WO
32 aa abc 11 1234
32 aa aas 18 2213
32 bb asd 16 3424
32 aa adf 19 1255
32 cc asa 10 7567
32 aa agd 11 1233
31 ss fsf 23 3434
I have around 100 dept. in my table. What I want is that when the dept. is 32 and the product is "aa", I only want to display 30 parts or less. So in this case the total number of parts for aa is 59. So the first aa product has 11 parts and the next aa product has 18 parts so that's 29. It should now ignore all the other aa products.
Expected Output
Dept product name parts WO
32 aa abc 11 1234
32 aa aas 18 2213
32 bb asd 16 3424
32 cc asa 10 7567
31 ss fsf 23 3434
Appreciate any help provided.
Assuming WO is a primary key, you can use the SUM window function to solve it.
SELECT yt.Dept, yt.product, yt.name, yt.parts, yt.WO
FROM yourtable yt
LEFT JOIN (
-- running total of parts for product 'aa'; ordering by name is an assumption, use whichever column defines "first"
SELECT *, sum(y.parts) over (partition by y.dept order by y.name) tsum
FROM yourtable y
WHERE y.product = 'aa'
) t ON yt.WO= t.WO
WHERE yt.dept != 32 or (yt.dept = 32 and t.tsum <= 30) or (yt.dept = 32 and yt.product != 'aa')
You can use the SUM() window function, partitioning by dept and product:
SELECT dept,
product,
name,
parts,
wo
FROM (SELECT *,
SUM(parts) OVER (PARTITION BY dept, product ORDER BY name) rt
FROM t
) t_rt
WHERE rt <= 30
ORDER BY dept DESC,
product,
wo
Result
dept product name parts wo
32 aa abc 11 1234
32 aa aas 18 2213
32 bb asd 16 3424
32 cc asa 10 7567
31 ss fsf 23 3434
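To check the result by hand, the running-total filter can be sketched in Python. This is only an illustration of the logic, assuming names are unique within each (dept, product) group and that parts are never negative; the function name and the cap parameter are hypothetical.

```python
from collections import defaultdict


def keep_within_cap(rows, cap=30):
    """Keep rows whose running total of parts per (dept, product), ordered by name, is <= cap."""
    running = defaultdict(int)
    kept = []
    for row in sorted(rows, key=lambda r: (r["dept"], r["product"], r["name"])):
        key = (row["dept"], row["product"])  # PARTITION BY dept, product
        running[key] += row["parts"]         # SUM(parts) ... ORDER BY name
        if running[key] <= cap:
            kept.append(row)
    return kept


columns = ("dept", "product", "name", "parts", "wo")
data = [
    (32, "aa", "abc", 11, 1234), (32, "aa", "aas", 18, 2213), (32, "bb", "asd", 16, 3424),
    (32, "aa", "adf", 19, 1255), (32, "cc", "asa", 10, 7567), (32, "aa", "agd", 11, 1233),
    (31, "ss", "fsf", 23, 3434),
]
rows = [dict(zip(columns, values)) for values in data]
kept_wo = {row["wo"] for row in keep_within_cap(rows)}
```

As in the query, the cap applies to every (dept, product) group, not only to dept 32 and product aa; on the sample data the other groups all stay under 30, so the output is the same.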