bigquery group by and get all elements except the groupby value - google-bigquery

I'm dealing with some transactional history in BigQuery. The table contains two columns:
a transaction number (trans_num) and item_id.
I'm trying to identify two features:
How many products (average and std) are purchased along with a certain item_id in the same transaction?
What is the list of products purchased along with a certain item_id in the same transaction?
For example: if we assume these are the products purchased in the same transaction,
|---------------------|------------------|
| trans_num | item_id |
|---------------------|------------------|
| 1 | 34 |
|---------------------|------------------|
| 1 | 35 |
|---------------------|------------------|
| 2 | 36 |
|---------------------|------------------|
| 2 | 37 |
|---------------------|------------------|
| 2 | 34 |
|---------------------|------------------|
I want the first output to be
|----------------------|------------------|
| item_id | feature_1 |
|----------------------|------------------|
| 34 | 2.5 |
|----------------------|------------------|
| 35 | 2 |
|----------------------|------------------|
| 36 | 2 |
|----------------------|------------------|
| 37 | 2 |
|----------------------|------------------|
| 38 | 2 |
|----------------------|------------------|
And feature_2 should contain
|--------|------------|
|item_id | feature 2 |
|--------|------------|
| 34 |[35, 36, 37]|
|--------|------------|
| 35 | [34] |
|--------|------------|
| 36 | [37, 34] |
|--------|------------|
| 37 | [36, 34] |
|--------|------------|
How should I approach this?

Below is for BigQuery Standard SQL
#standardSQL
with pre_aggregation as (
select a.trans_num, a.item_id, array_agg(b.item_id) other_items
from `project.dataset.table` a
join `project.dataset.table` b
on a.trans_num = b.trans_num
and a.item_id != b.item_id
group by trans_num, item_id
order by item_id, trans_num
)
select item_id,
feature_1,
array (
select distinct item
from t.feature_2 item
order by item
) as feature_2
from (
select item_id,
avg(array_length(other_items)) as feature_1,
array_concat_agg(other_items) as feature_2
from pre_aggregation
group by item_id
) t
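The self-join-then-aggregate idea in the query above can be tried locally. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for BigQuery; the table name `tx` is hypothetical, and because SQLite has no arrays, the per-transaction count replaces `array_length(other_items)` and the list for feature_2 is assembled in Python.

```python
import sqlite3

# Sample rows from the question: (trans_num, item_id).
rows = [(1, 34), (1, 35), (2, 36), (2, 37), (2, 34)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (trans_num INTEGER, item_id INTEGER)")  # hypothetical table name
conn.executemany("INSERT INTO tx VALUES (?, ?)", rows)

# feature_1: average number of *other* items per transaction containing the item.
feature_1 = dict(conn.execute("""
    WITH pre_aggregation AS (
        SELECT a.trans_num, a.item_id, COUNT(*) AS n_others
        FROM tx a
        JOIN tx b
          ON a.trans_num = b.trans_num
         AND a.item_id != b.item_id
        GROUP BY a.trans_num, a.item_id
    )
    SELECT item_id, AVG(n_others)
    FROM pre_aggregation
    GROUP BY item_id
"""))

# feature_2: distinct co-purchased items, sorted, collected per item in Python.
feature_2 = {}
for item_id, other in conn.execute("""
    SELECT DISTINCT a.item_id, b.item_id
    FROM tx a
    JOIN tx b
      ON a.trans_num = b.trans_num
     AND a.item_id != b.item_id
    ORDER BY a.item_id, b.item_id
"""):
    feature_2.setdefault(item_id, []).append(other)

for item_id in sorted(feature_1):
    print(item_id, feature_1[item_id], feature_2[item_id])
```

Because the join excludes the item itself, item 34 averages 1.5 other items here, whereas the 2.5 in the question counts the item too. A transaction containing a single item produces no row in the self-join, so it does not enter the average.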
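To apply this to the sample data from your question, define the table as the first CTE, so that the statement starts with the block below and pre_aggregation follows it as the next CTE:
with `project.dataset.table` as (
select 1 trans_num, 34 item_id union all
select 1, 35 union all
select 2, 36 union all
select 2, 37 union all
select 2, 34
)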
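The output is
|---------|-----------|--------------|
| item_id | feature_1 | feature_2    |
|---------|-----------|--------------|
| 34      | 1.5       | [35, 36, 37] |
| 35      | 1.0       | [34]         |
| 36      | 2.0       | [34, 37]     |
| 37      | 2.0       | [34, 36]     |
|---------|-----------|--------------|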

Related

Dynamic intersections between groups based on relation

I have 2 tables:
product_facet_values_facet_value
+-----------+--------------+
| productId | facetValueId |
+-----------+--------------+
| 6 | 1 |
| 6 | 34 |
| 7 | 39 |
| 8 | 34 |
| 8 | 1 |
| 8 | 11 |
| 9 | 1 |
| 9 | 39 |
+-----------+--------------+
facet_value
+--------------+---------+
| facetValueId | facetId |
+--------------+---------+
| 1 | 2 |
| 34 | 6 |
| 39 | 2 |
| 44 | 2 |
| 56 | 11 |
+--------------+---------+
I need to get all productIds that have the facetValueIds I ask for, with one extra step: I need an intersection between the groups of facetValueIds that share the same facetId.
For example, I want to get all product ids with facetValueId 1, 34, 39, and the result should be the same as I would get with the following query:
select "productId"
from "product_facet_values_facet_value"
where "facetValueId" in (1, 39)
INTERSECT
select "productId"
from "product_facet_values_facet_value"
where "facetValueId" in (34)
I wrote this query based on the following: facetValueIds 1 and 39 have the same "facetId"=2, and facetValueId 34 has "facetId"=6.
I need a query that gives the same result without my having to group the values manually. If, for example, I next ask for all products that have facetValueIds 1, 34, 39, 56, the result of such a dynamic query should be the same as if I wrote an INTERSECT between the three groups IN (1, 39), IN (34) and IN (56), like:
select "productId"
from "product_facet_values_facet_value"
where "facetValueId" in (1, 39)
INTERSECT
select "productId"
from "product_facet_values_facet_value"
where "facetValueId" in (34)
INTERSECT
select "productId"
from "product_facet_values_facet_value"
where "facetValueId" in (56)
https://dbfiddle.uk/?rdbms=postgres_13&fiddle=d06344b4a68c7b97fc1fad46c7437894
This is the same method that @a_horse_with_no_name used, but generalised very slightly.
WITH
targets AS
(
-- the requested facetValueIds, each with the facetId it belongs to
SELECT * FROM facet_value WHERE facetValueId IN (1, 34, 39)
)
SELECT
map.productId
FROM
product_facet_values_facet_value AS map
INNER JOIN
targets AS tgt
ON tgt.facetValueId = map.facetValueId
GROUP BY
map.productId
HAVING
COUNT(DISTINCT tgt.facetId) = (SELECT COUNT(DISTINCT facetId) FROM targets)
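The HAVING COUNT(DISTINCT ...) pattern can be checked against the sample tables from the question. This is a minimal sketch, assuming Python's built-in sqlite3 module rather than PostgreSQL and a targets CTE filtered on the requested facetValueIds; the helper name `products_matching` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_facet_values_facet_value (productId INTEGER, facetValueId INTEGER);
    CREATE TABLE facet_value (facetValueId INTEGER, facetId INTEGER);
    INSERT INTO product_facet_values_facet_value VALUES
        (6, 1), (6, 34), (7, 39), (8, 34), (8, 1), (8, 11), (9, 1), (9, 39);
    INSERT INTO facet_value VALUES (1, 2), (34, 6), (39, 2), (44, 2), (56, 11);
""")


def products_matching(facet_value_ids):
    """Hypothetical helper: products with at least one requested value in every requested facet."""
    marks = ", ".join("?" for _ in facet_value_ids)
    sql = f"""
        WITH targets AS (
            SELECT * FROM facet_value WHERE facetValueId IN ({marks})
        )
        SELECT map.productId
        FROM product_facet_values_facet_value AS map
        INNER JOIN targets AS tgt ON tgt.facetValueId = map.facetValueId
        GROUP BY map.productId
        HAVING COUNT(DISTINCT tgt.facetId) =
               (SELECT COUNT(DISTINCT facetId) FROM targets)
        ORDER BY map.productId
    """
    return [row[0] for row in conn.execute(sql, list(facet_value_ids))]


print(products_matching([1, 34, 39]))      # same products as the two-way INTERSECT
print(products_matching([1, 34, 39, 56]))  # same products as the three-way INTERSECT
```

A product qualifies only when the number of distinct facets it matches equals the number of distinct facets among the requested values, which is what the chain of INTERSECTs expresses by hand.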

count total items, sold items (in another table reference by id) and grouped by serial number

I have a table of items in the shop. An item may have several entries with the same serial number (sn) but different ids if it was bought again later at a different price (price here is how much a single item cost the shop).
id | sn | amount | price
----+------+--------+-------
1 | AP01 | 100 | 7
2 | AP01 | 50 | 8
3 | X2P0 | 200 | 12
4 | X2P0 | 30 | 18
5 | STT0 | 20 | 20
6 | PLX1 | 200 | 10
and a table of transactions
id | item_id | price
----+---------+-------
1 | 1 | 10
2 | 1 | 9
3 | 1 | 10
4 | 2 | 11
5 | 3 | 15
6 | 3 | 15
7 | 3 | 15
8 | 4 | 18
9 | 5 | 22
10 | 5 | 22
11 | 5 | 22
12 | 5 | 22
and transaction.item_id references items(id)
I want to group items by serial number (sn), get their sum(amount) and avg(price), and join that with a sold column that counts the number of transactions referencing each id.
I did the first with
select i.sn, sum(i.amount), avg(i.price) from items i group by i.sn;
sn | sum | avg
------+-----+---------------------
STT0 | 20 | 20.0000000000000000
PLX1 | 200 | 10.0000000000000000
AP01 | 150 | 7.5000000000000000
X2P0 | 230 | 15.0000000000000000
Then when I tried to join it with transactions I got strange results
select i.sn, sum(i.amount), avg(i.price) avg_cost, count(t.item_id) sold, sum(t.price) profit from items i left join transactions t on (i.id=t.item_id) group by i.sn;
sn | sum | avg_cost | sold | profit
------+-----+---------------------+------+--------
STT0 | 80 | 20.0000000000000000 | 4 | 88
PLX1 | 200 | 10.0000000000000000 | 0 | (null)
AP01 | 350 | 7.2500000000000000 | 4 | 40
X2P0 | 630 | 13.5000000000000000 | 4 | 63
As you can see, only the sold and profit columns show correct results; the sum and avg differ from the expected values.
I can't separate the statements because I am not sure how to attach the count to the sn group when item_id refers to the item id rather than the sn:
select
j.sn,
j.sum,
j.avg,
count(item_id)
from (
select
i.sn,
sum(i.amount),
avg(i.price)
from items i
group by i.sn
) j
left join transactions t
on (j.id???=t.item_id);
There are multiple matches in both tables, so the join multiplies the rows (and eventually produces wrong results). I would recommend counting the transactions per item first, then aggregating by sn:
select
sn,
sum(amount) total_amount,
avg(price) avg_price,
sum(no_transactions) no_transactions
from (
select
i.*,
(
select count(*)
from transactions t
where t.item_id = i.id
) no_transactions
from items i
) t
group by sn
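The row multiplication and its fix can be reproduced on the sample data. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the original database; the subquery alias `per_item` and the Python variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER, sn TEXT, amount INTEGER, price INTEGER);
    CREATE TABLE transactions (id INTEGER, item_id INTEGER, price INTEGER);
    INSERT INTO items VALUES
        (1, 'AP01', 100, 7), (2, 'AP01', 50, 8), (3, 'X2P0', 200, 12),
        (4, 'X2P0', 30, 18), (5, 'STT0', 20, 20), (6, 'PLX1', 200, 10);
    INSERT INTO transactions VALUES
        (1, 1, 10), (2, 1, 9), (3, 1, 10), (4, 2, 11), (5, 3, 15), (6, 3, 15),
        (7, 3, 15), (8, 4, 18), (9, 5, 22), (10, 5, 22), (11, 5, 22), (12, 5, 22);
""")

# Naive version: each item row is repeated once per matching transaction,
# so SUM(amount) and AVG(price) are computed over the repeated rows.
naive = {
    sn: (total, avg_price, sold)
    for sn, total, avg_price, sold in conn.execute("""
        SELECT i.sn, SUM(i.amount), AVG(i.price), COUNT(t.item_id)
        FROM items i
        LEFT JOIN transactions t ON i.id = t.item_id
        GROUP BY i.sn
    """)
}

# Fixed version: count transactions per item in a correlated subquery,
# so every item contributes exactly one row to the aggregation by sn.
fixed = {
    sn: (total, avg_price, sold)
    for sn, total, avg_price, sold in conn.execute("""
        SELECT sn, SUM(amount), AVG(price), SUM(no_transactions)
        FROM (
            SELECT i.*,
                   (SELECT COUNT(*) FROM transactions t WHERE t.item_id = i.id) AS no_transactions
            FROM items i
        ) AS per_item
        GROUP BY sn
    """)
}

print("naive:", naive["AP01"])
print("fixed:", fixed["AP01"])
```

For AP01, item 1 has three transactions and item 2 has one, so the naive join sums 100 three times plus 50 once, which is where the 350 in the question comes from.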

eSQL multiple join but with conditions

I've 3 tables as under
MERCHANDISE
+-----------+-----------+---------------+
| MERCH_NUM | MERCH_DIV | MERCH_SUB_DIV |
+-----------+-----------+---------------+
| 1 | car | awd |
| 1 | car | awd |
| 2 | bike | 1kcc |
| 3 | cycle | hybrid |
| 3 | cycle | city |
| 4 | moped | fixie |
+-----------+-----------+---------------+
PRIORITY
+----------+-----------+---------+---------+------------+------------+---------------+
| CUST_NUM | SALES_NUM | DOC_NUM | BALANCE | PRIORITY_1 | PRIORITY_2 | PRIORITY_CODE |
+----------+-----------+---------+---------+------------+------------+---------------+
| 90 | 1000 | 10 | 23 | 1 | 6 | NO |
| 91 | 1001 | 20 | 32 | 3 | 7 | PRI |
| 92 | 1002 | 30 | 11 | 2 | 8 | LATE |
| 93 | 1003 | 40 | 22 | 5 | 9 | 1MON |
+----------+-----------+---------+---------+------------+------------+---------------+
ORDER
+----------+-----------+---------+---------+-----------+-----------+
| CUST_NUM | SALES_NUM | DOC_NUM | COUNTRY | MERCH_NUM | MERCH_DIV |
+----------+-----------+---------+---------+-----------+-----------+
| 90 | 1000 | 10 | INDIA | 1 | car |
| 91 | 1001 | 20 | CHINA | 2 | bike |
| 92 | 1002 | 30 | USA | 3 | cycle |
| 93 | 1003 | 40 | UK | 4 | moped |
+----------+-----------+---------+---------+-----------+-----------+
I want to left join ORDER with PRIORITY and then join that result with MERCHANDISE, such that the MERCH_SUB_DIV 'awd' appears only once for each unique combination of MERCH_NUM and MERCH_DIV.
The code I came up with is below, but I'm not sure how to eliminate the duplicate row just for 'awd':
select
ROW#, MERCH.MERCH_NUMBER, ORDPRI.MERCH_NUMBER, ORDPRI.CUST_NUM,
BALANCE, SALES_NUM, ITEM_NUM, RANK, PRIORITY_1
from (
select
ROW_NUMBER() OVER(
PARTITION BY ORD.DOC_NUM, ORD.ITEM_NUM
ORDER BY ORD.DOC_NUM, ORD.ITEM_NUM ASC
) AS Row#,
ORD.CUST_NUM, PRI.CUST_NUM, ORD.MERCH_NUM, ORD.MERCH_DIV, PRI.BALANCE,
pri.DOC_NUM, pri.SALES_NUM, pri.PRIORITY_1, pri.PRIORITY_2
from ORDER as ORD
left join PRIORITY as PRI on ORD.DOC_NUM = PRI.DOC_NUM
and ORD.SALES_NUMBER = PRI.SALES_NUM
where country_name in ('USA', 'INDIA')
) as ORDPRI
left join MERCHANDISE as MERCH on ORDPRI.DIV = MERCH.DIV
and ORDPRI.MERCH_NUM = MERCH.MERCH_NUM
You have to use the DISTINCT keyword to get unique values, but if your PRIORITY and ORDER tables contain different values for the same MERCH_NUM, the final result will still repeat that MERCH_NUM.
SELECT DISTINCT M.MERCH_NUMBER, O.MERCH_NUMBER, O.CUST_NUM, BALANCE, SALES_NUM,ITEM_NUM,RANK,PRIORITY_1
FROM priority_table P
LEFT JOIN order_table O ON P.CUST_NUM = O.CUST_NUM AND P.SALES_NUM=O.SALES_NUM AND P.DOC_NUM = O.DOC_NUM
LEFT JOIN merchandise_table M ON M.MERCH_NUM = O.MERCH_NUM
A workaround is to add a new ROW_NUMBER() in the outermost query, partitioned by MERCH_SUB_DIV plus all the columns in the final list, and then filter the final results on that new row number. The pseudocode below might help:
select
-- All expected columns in final result except the newRow#
ROW#, MERCH_NUM, CUST_NUM,
BALANCE, SALES_NUM, PRIORITY_1
from (
select
ROW#,
-- the new row number includes all column you want to show in final result
row_number() over ( PARTITION BY MERCH.MERCH_SUB_DIV ,
MERCH.MERCH_NUM, ORDPRI.MERCH_NUM, ORDPRI.CUST_NUM,
BALANCE, SALES_NUM, PRIORITY_1
order by (select 1 )) as newRow# ,
MERCH.MERCH_NUM, ORDPRI.CUST_NUM,
BALANCE, SALES_NUM, PRIORITY_1
from (
-- main query goes here
select
ROW_NUMBER() OVER(
PARTITION BY ORD.DOC_NUM --, ORD.ITEM_NUM
ORDER BY ORD.DOC_NUM ASC --, ORD.ITEM_NUM
) AS Row#,
ORD.CUST_NUM, ORD.MERCH_NUM, ORD.MERCH_DIV as DIV, PRI.BALANCE,
pri.DOC_NUM, pri.SALES_NUM, pri.PRIORITY_1, pri.PRIORITY_2
from #ORDER as ORD
left join #PRIORITY as PRI on ORD.DOC_NUM = PRI.DOC_NUM
and ORD.SALES_NUMBER = PRI.SALES_NUM
where country_name in ('USA', 'INDIA')
) as ORDPRI
left join #MERCHANDISE as MERCH on ORDPRI.DIV = MERCH.DIV
and ORDPRI.MERCH_NUM = MERCH.MERCH_NUM
) as T
-- final filter to get distinct values
where newRow# = 1
Sample code here .. Hope this helps!!
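A ROW_NUMBER() filter of this kind can be tried on the sample rows. This is a minimal sketch of a variant, assuming Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer): it numbers the MERCHANDISE rows before the join instead of numbering the joined result, the PRIORITY table is left out for brevity, and the table is named `orders` because ORDER is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merchandise (merch_num INTEGER, merch_div TEXT, merch_sub_div TEXT);
    CREATE TABLE orders (cust_num INTEGER, sales_num INTEGER, doc_num INTEGER,
                         country TEXT, merch_num INTEGER, merch_div TEXT);
    INSERT INTO merchandise VALUES
        (1, 'car', 'awd'), (1, 'car', 'awd'), (2, 'bike', '1kcc'),
        (3, 'cycle', 'hybrid'), (3, 'cycle', 'city'), (4, 'moped', 'fixie');
    INSERT INTO orders VALUES
        (90, 1000, 10, 'INDIA', 1, 'car'), (91, 1001, 20, 'CHINA', 2, 'bike'),
        (92, 1002, 30, 'USA', 3, 'cycle'), (93, 1003, 40, 'UK', 4, 'moped');
""")

rows = conn.execute("""
    SELECT o.cust_num, m.merch_num, m.merch_div, m.merch_sub_div
    FROM orders o
    LEFT JOIN (
        -- rn restarts at 1 for every distinct (merch_num, merch_div, merch_sub_div),
        -- so exact duplicates get rn = 2, 3, ... and are dropped by the rn = 1 filter
        SELECT merch_num, merch_div, merch_sub_div,
               ROW_NUMBER() OVER (
                   PARTITION BY merch_num, merch_div, merch_sub_div
                   ORDER BY merch_num
               ) AS rn
        FROM merchandise
    ) m
      ON m.merch_num = o.merch_num
     AND m.merch_div = o.merch_div
     AND m.rn = 1
    WHERE o.country IN ('USA', 'INDIA')
    ORDER BY o.cust_num, m.merch_sub_div
""").fetchall()

for row in rows:
    print(row)
```

The duplicated (1, car, awd) row now joins once, while the two distinct sub-divisions of the cycle order both survive.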

Want to JOIN fourth table in query

I have four tables:
mls_category
points_matrix
mls_entry
bonus_points
My first table (mls_category) is like below:
*--------------------------------*
| cat_no | store_id | cat_value |
*--------------------------------*
| 10 | 101 | 1 |
| 11 | 101 | 4 |
*--------------------------------*
My second table (points_matrix) is like below:
*----------------------------------------------------*
| pm_no | store_id | value_per_point | maxpoint |
*----------------------------------------------------*
| 1 | 101 | 1 | 10 |
| 2 | 101 | 2 | 50 |
| 3 | 101 | 3 | 80 |
*----------------------------------------------------*
My third table (mls_entry) is like below:
*-------------------------------------------*
| user_id | category | distance | status |
*-------------------------------------------*
| 1 | 10 | 20 | approved |
| 1 | 10 | 30 | approved |
| 1 | 11 | 40 | approved |
*-------------------------------------------*
My fourth table (bonus_points) is like below:
*--------------------------------------------*
| user_id | store_id | bonus_points | type |
*--------------------------------------------*
| 1 | 101 | 200 | fixed |
| 2 | 102 | 300 | fixed |
| 1 | 103 | 4 | per |
*--------------------------------------------*
Now, I want to add the bonus_points value to the total distance according to store_id, user_id and type.
I am using the following code to get total distance:
SELECT MIN(b.value_per_point) * d.total_distance FROM points_matrix b
JOIN
(
SELECT store_id, sum(t1.totald/c.cat_value) as total_distance FROM mls_category c
JOIN
(
SELECT SUM(distance) totald, user_id, category FROM mls_entry
WHERE user_id= 1 AND status = 'approved' GROUP BY user_id, category
) t1 ON c.cat_no = t1.category
) d ON b.store_id = d.store_id AND b.maxpoint >= d.total_distance
The above code calculates the value correctly; now I want to JOIN my fourth table.
It gives me 60 * 3 = 180 as the total value. Now I want (60 + 200) * 3 = 780 for user 1 and store_id 101, where the bonus type is fixed.
I think your query would look like this:
SELECT Max(b.value_per_point)*( max(d.total_distance)+max(bonus_points)) FROM mls_point_matrix b
JOIN
(
SELECT store_id, sum(t1.totald/c.cat_value) as total_distance FROM mls_category c
JOIN
(
SELECT SUM(distance) totald, user_id, category FROM mls_entry
WHERE user_id= 1 AND status = 'approved' GROUP BY user_id, category
) t1 ON c.cat_no = t1.category group by store_id
) d ON b.store_id = d.store_id inner join bonus_points bp on bp.store_id=d.store_id
DEMO fiddle
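The target figure of 780 can be reproduced on the sample tables. This is a minimal sketch, assuming Python's built-in sqlite3 module instead of the original database, the table name points_matrix from the question, and two choices that are assumptions rather than part of the answer above: the bonus row is restricted to user 1 and type 'fixed', and the maxpoint threshold is compared against the distance before the bonus is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mls_category (cat_no INTEGER, store_id INTEGER, cat_value INTEGER);
    CREATE TABLE points_matrix (pm_no INTEGER, store_id INTEGER,
                                value_per_point INTEGER, maxpoint INTEGER);
    CREATE TABLE mls_entry (user_id INTEGER, category INTEGER, distance INTEGER, status TEXT);
    CREATE TABLE bonus_points (user_id INTEGER, store_id INTEGER, bonus_points INTEGER, type TEXT);
    INSERT INTO mls_category VALUES (10, 101, 1), (11, 101, 4);
    INSERT INTO points_matrix VALUES (1, 101, 1, 10), (2, 101, 2, 50), (3, 101, 3, 80);
    INSERT INTO mls_entry VALUES (1, 10, 20, 'approved'), (1, 10, 30, 'approved'),
                                 (1, 11, 40, 'approved');
    INSERT INTO bonus_points VALUES (1, 101, 200, 'fixed'), (2, 102, 300, 'fixed'),
                                    (1, 103, 4, 'per');
""")

total_value = conn.execute("""
    SELECT MIN(b.value_per_point) * (d.total_distance + bp.bonus_points) AS total_value
    FROM points_matrix b
    JOIN (
        -- 1.0 forces real division, since SQLite divides integers as integers
        SELECT c.store_id, SUM(t1.totald * 1.0 / c.cat_value) AS total_distance
        FROM mls_category c
        JOIN (
            SELECT SUM(distance) AS totald, user_id, category
            FROM mls_entry
            WHERE user_id = 1 AND status = 'approved'
            GROUP BY user_id, category
        ) t1 ON c.cat_no = t1.category
        GROUP BY c.store_id
    ) d ON b.store_id = d.store_id AND b.maxpoint >= d.total_distance
    JOIN bonus_points bp
      ON bp.store_id = d.store_id
     AND bp.user_id = 1        -- assumption: same user as the distance subquery
     AND bp.type = 'fixed'     -- assumption: only fixed bonuses are added to the distance
    GROUP BY d.store_id, d.total_distance, bp.bonus_points
""").fetchone()[0]

print(total_value)
```

Keeping the maxpoint condition means value_per_point is still chosen from the tier that covers the distance of 60, which is the 3 used in the question's arithmetic.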

Add Values to Grouping Column

I am having a lot of trouble with a scenario that I think some of you might have come across.
(The context is business trips: one table holds the payments made on business trips and the other describes the trips themselves, so the first has more rows than the second, since there are more payments than trips.)
I have two tables, Table A and Table B.
Table A looks as follows
| TableA_ID | TableB_ID | PaymentMethod | ValuePayed |
| 52 | 1 | Method1 | 23,2 |
| 21 | 1 | Method2 | 23,2 |
| 33 | 2 | Method3 | 23,2 |
| 42 | 1 | Method2 | 14 |
| 11 | 14 | Method1 | 267 |
| 42 | 1 | Method2 | 14,7 |
| 13 | 32 | Method1 | 100,2 |
Table B looks like this
| TableB_ID | TravelExpenses | OperatingExpense |
| 1 | 23 | 12 |
| 1 | 234 | 24 |
| 2 | 12 | 7 |
| 1 | 432 | 12 |
| 14 | 110 | 12 |
I am trying to create a measure Table (Table C) that looks like this:
| TableC_ID | TypeofCost | Amount |
| 1 | Method1 | 100,2 |
| 2 | Method2 | 52 |
| 3 | TravelExpenses | 7 |
| 4 | OperatingExpense| 12 |
| 5 | Method3 | 12 |
| 6 | OperatingExpense| 7 |
| 7 | Method3 | 12 |
(The Amount values are to be summed, grouped by the columns Employee, Month and TypeofCost.)
So I have to group not only by the PaymentMethod that I get from Table A,
but also insert new values into the grouping (TravelExpenses and OperatingExpense).
Can anybody give me an idea of how this can be done in SQL?
Here is what I have tried so far
SELECT PaymentMethod as TypeofCost
,Sum(ValuePayed) as Amount
FROM TableA Left Outer Join TableB on TableA.TableB_ID = TableB.TableB_ID
GROUP PaymentMethod
UNION
SELECT 'TravelExpenses' as TypeofCost
,Sum(TableB.TravelExpenses) as Amount
FROM TableA Left Outer Join TableB on TableA.TableB_ID = TableB.TableB_ID
GROUP PaymentMethod
UNION
SELECT 'OperatingExpense' as TypeofCost
,Sum(TableB.OperatingExpense) as Amount
FROM TableA Left Outer Join TableB on TableA.TableB_ID = TableB.TableB_ID
GROUP PaymentMethod
It should be something like this:
Select
-- number the combined rows to generate TableC_ID
row_number() OVER(ORDER BY u.TableB_ID) as 'TableC_ID',
u.TypeofCost,
u.Amount
from (
-- payments, summed per trip and payment method
Select
a.TableB_ID,
a.PaymentMethod as 'TypeofCost',
SUM(a.ValuePayed) as 'Amount'
from
TableA as a
group by a.TableB_ID, a.PaymentMethod
union
-- travel expenses, summed per trip
Select
b1.TableB_ID,
'TravelExpenses' as 'TypeofCost',
SUM(b1.TravelExpenses) as 'Amount'
from
TableB as b1
group by b1.TableB_ID
union
-- operating expenses, summed per trip (the column is OperatingExpense in Table B)
Select
b2.TableB_ID,
'OperatingExpense' as 'TypeofCost',
SUM(b2.OperatingExpense) as 'Amount'
from
TableB as b2
group by b2.TableB_ID
) as u
EDIT: Generate TableC_ID
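The union of three grouped selects can be run on the sample rows. This is a minimal sketch, assuming Python's built-in sqlite3 module rather than the original database, table names TableA and TableB, decimal points in place of the decimal commas shown in the question, and UNION ALL in place of UNION; TableC_ID is generated in Python with enumerate instead of row_number().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (TableA_ID INTEGER, TableB_ID INTEGER,
                         PaymentMethod TEXT, ValuePayed REAL);
    CREATE TABLE TableB (TableB_ID INTEGER, TravelExpenses INTEGER, OperatingExpense INTEGER);
    INSERT INTO TableA VALUES
        (52, 1, 'Method1', 23.2), (21, 1, 'Method2', 23.2), (33, 2, 'Method3', 23.2),
        (42, 1, 'Method2', 14), (11, 14, 'Method1', 267), (42, 1, 'Method2', 14.7),
        (13, 32, 'Method1', 100.2);
    INSERT INTO TableB VALUES (1, 23, 12), (1, 234, 24), (2, 12, 7), (1, 432, 12), (14, 110, 12);
""")

# Each branch produces (TableB_ID, TypeofCost, Amount); the two expense columns
# of TableB are turned into rows by selecting them under a constant label.
rows = conn.execute("""
    SELECT TableB_ID, PaymentMethod AS TypeofCost, SUM(ValuePayed) AS Amount
    FROM TableA
    GROUP BY TableB_ID, PaymentMethod
    UNION ALL
    SELECT TableB_ID, 'TravelExpenses', SUM(TravelExpenses)
    FROM TableB
    GROUP BY TableB_ID
    UNION ALL
    SELECT TableB_ID, 'OperatingExpense', SUM(OperatingExpense)
    FROM TableB
    GROUP BY TableB_ID
    ORDER BY TableB_ID, TypeofCost
""").fetchall()

# Generate TableC_ID as a running number over the combined rows.
table_c = [(c_id, kind, amount) for c_id, (_, kind, amount) in enumerate(rows, start=1)]

for row in table_c:
    print(row)
```

Neither branch joins TableA to TableB, so the payment sums and the expense sums cannot inflate each other through repeated rows.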