Generating JSON on the fly with BigQuery - google-bigquery

BigQuery has a TO_JSON_STRING() function that converts the result of SQL expressions to json strings. I'm trying to figure out how to use it with a data structure that has a nested array represented as a one-to-many relationship in BigQuery's tables.
This is the query I'm trying to run:
SELECT a.account_id,
TO_JSON_STRING((SELECT s.skill_id FROM skills s WHERE s.account_id = a.account_id))
FROM accounts a
I get this error from BigQuery:
Scalar subquery produced more than one element
The final objective will be to get the account_id into the JSON as well, and to persist the result in a string column.

Here is another solution:
with accounts as (
  select *
  from unnest([struct(1 as account_id, 'first acc' as account_name)
              ,struct(2 as account_id, 'second acc' as account_name)
              ,struct(3 as account_id, 'third acc' as account_name)
              ])
)
, skills as (
  select *
  from unnest([struct(1 as account_id, 1 as skill_id)
              ,struct(1 as account_id, 2 as skill_id)
              ,struct(1 as account_id, 3 as skill_id)
              ,struct(2 as account_id, 1 as skill_id)
              ,struct(2 as account_id, 4 as skill_id)
              ])
)
, nest as (
  select a.account_id
        ,any_value(a.account_name) as account_name
        ,to_json_string(ifnull(array_agg(s.skill_id ignore nulls),[])) as skills
  from accounts a
  left join skills s
    on a.account_id = s.account_id
  group by a.account_id
)
select *
from nest
Output will look like:

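You can join first and then generate the JSON, like this. Note that this produces one JSON string per joined row (one per account and skill pair), not a nested array per account:
select TO_JSON_STRING(t)
from (
  select accounts.account_id as acc_id, skills.*
  from accounts
  inner join skills using (account_id)
) as t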
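The nesting idea behind these answers (collect the "many" side into an array per account, then serialize) can be sketched locally. This is only an illustration in Python with the standard-library sqlite3 module standing in for BigQuery; the sample rows reuse the values from the answer above, and the grouping is done in Python rather than with BigQuery's ARRAY_AGG. In BigQuery itself, wrapping the subquery in ARRAY(...) inside a STRUCT that also carries account_id should avoid the scalar-subquery error, but that variant is not run here.

```python
import json
import sqlite3

# In-memory stand-in for the accounts/skills tables (sample values from the answer above).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER, account_name TEXT);
    CREATE TABLE skills (account_id INTEGER, skill_id INTEGER);
    INSERT INTO accounts VALUES (1, 'first acc'), (2, 'second acc'), (3, 'third acc');
    INSERT INTO skills VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 4);
""")

# LEFT JOIN keeps accounts that have no skills (skill_id comes back as NULL).
rows = conn.execute("""
    SELECT a.account_id, s.skill_id
    FROM accounts a
    LEFT JOIN skills s ON s.account_id = a.account_id
    ORDER BY a.account_id, s.skill_id
""").fetchall()

# Nest the "many" side into one list per account, skipping the NULL placeholder.
nested = {}
for account_id, skill_id in rows:
    skills = nested.setdefault(account_id, [])
    if skill_id is not None:
        skills.append(skill_id)

# One JSON string per account, with account_id inside the document.
json_rows = [
    json.dumps({"account_id": account_id, "skills": skills})
    for account_id, skills in nested.items()
]
for line in json_rows:
    print(line)
```

Account 3 has no skills and serializes with an empty array, which mirrors what the ifnull(..., []) in the answer above is for.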

How to aggregate different CTEs in outer query SQL

I am trying to join two CTEs to compare planned and actual performance across countries, grouped by campaign ID. Here is my example.
Every campaign can run in several countries, so how can I group at the end to get one row per campaign ID?
CTE 1: (planned)
select
country
, campaign_id
, sum(sales) as planned_sales
from table x
group by 1,2
CTE 2: (Actual)
select
country
, campaign_id
, sum(sales) as actual_sales
from table y
group by 1,2
outer select
select
cte1.country,
planned_sales,
actual_sales,
planned_sales - actual_sales as diff
from cte1
join cte2
on cte1.campaign_id = cte2.campaign_id
This should do it:
select
cte1.campaign_id,
sum(cte1.planned_sales) as planned_sales,
sum(cte2.actual_sales) as actual_sales,
sum(cte1.planned_sales) - sum(cte2.actual_sales) as diff
from cte1
join cte2
on cte1.campaign_id = cte2.campaign_id and cte1.country = cte2.country
group by 1
I would suggest using a full join, so that rows from both tables are included, not just rows that match in both. Your query is basically correct, but it needs a group by.
select campaign_id,
sum(cte1.planned_sales) as planned_sales,
sum(cte2.actual_sales) as actual_sales,
(coalesce(sum(cte1.planned_sales), 0) -
coalesce(sum(cte2.actual_sales), 0)
) as diff
from cte1 full join
cte2
using (campaign_id, country)
group by campaign_id;
That said, there is no reason why the CTEs should aggregate by both campaign and country. They could just aggregate by campaign id -- simplifying the query and improving performance.
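The simplification suggested above (aggregate each CTE by campaign only, then combine) can be sketched as follows. This is a minimal illustration using Python's sqlite3 module, not the asker's actual setup: table_x, table_y and all sample rows are hypothetical. Because FULL JOIN is only available in newer SQLite versions, the sketch emulates it with a union of campaign IDs followed by two left joins; in BigQuery the full join from the answer above would be used directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical planned (table_x) and actual (table_y) sales per country and campaign.
    CREATE TABLE table_x (country TEXT, campaign_id INTEGER, sales INTEGER);
    CREATE TABLE table_y (country TEXT, campaign_id INTEGER, sales INTEGER);
    INSERT INTO table_x VALUES ('UK', 1, 100), ('DE', 1, 50), ('UK', 2, 70);
    INSERT INTO table_y VALUES ('UK', 1, 90), ('FR', 1, 30), ('DE', 3, 20);
""")

campaign_diff = conn.execute("""
    WITH cte1 AS (          -- planned sales, one row per campaign
        SELECT campaign_id, SUM(sales) AS planned_sales
        FROM table_x
        GROUP BY campaign_id
    ),
    cte2 AS (               -- actual sales, one row per campaign
        SELECT campaign_id, SUM(sales) AS actual_sales
        FROM table_y
        GROUP BY campaign_id
    ),
    ids AS (                -- every campaign seen on either side (full-join emulation)
        SELECT campaign_id FROM cte1
        UNION
        SELECT campaign_id FROM cte2
    )
    SELECT ids.campaign_id,
           COALESCE(cte1.planned_sales, 0) AS planned_sales,
           COALESCE(cte2.actual_sales, 0) AS actual_sales,
           COALESCE(cte1.planned_sales, 0) - COALESCE(cte2.actual_sales, 0) AS diff
    FROM ids
    LEFT JOIN cte1 ON cte1.campaign_id = ids.campaign_id
    LEFT JOIN cte2 ON cte2.campaign_id = ids.campaign_id
    ORDER BY ids.campaign_id
""").fetchall()

for row in campaign_diff:
    print(row)
```

Since each CTE already has one row per campaign, no outer group by is needed, and a campaign that appears on only one side still shows up with 0 for the missing figure.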

How to write SQL query without join?

Recently, during an interview, I was asked a question. Given a table like the one below:
The requirement is: how many orders and how many shipments are there per day (based on the date columns)? The output needs to look like this:
I wrote the following code, but the interviewer asked me to write a SQL query that achieves the same output without JOIN or UNION.
SELECT
COALESCE(a.order_date, b.ship_date), orders, shipments
FROM
(SELECT
order_date, COUNT(1) AS orders
FROM
table
GROUP BY 1) a
FULL JOIN
(SELECT
ship_date, COUNT(1) AS shipments
FROM table
GROUP BY 1) b ON a.order_date = b.ship_date
Is this possible? Could you please advise?
You can use UNION and GROUP BY with conditional aggregation as follows:
SELECT DATE_,
COUNT(CASE WHEN FLAG = 'ORDER' THEN 1 END) AS ORDERS,
COUNT(CASE WHEN FLAG = 'SHIP' THEN 1 END) AS SHIPMENTS
FROM (SELECT ORDER_DATE AS DATE_, 'ORDER' AS FLAG FROM YOUR_TABLE
UNION ALL
SELECT SHIP_DATE AS DATE_, 'SHIP' AS FLAG FROM YOUR_TABLE) T
GROUP BY DATE_
In BigQuery, I would express this as:
select date, countif(n = 0) as orders, countif(n = 1) as numships
from t cross join
unnest(array[order_date, ship_date]) date with offset n
group by 1
order by date;
The advantage of this approach (over union all) is two-fold. First, it only scans the table once. More importantly, the unnest() is all on the same node where the data resides -- so data does not need to be moved for the unpivot.
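The unpivot-then-count idea above can be tried outside BigQuery. The sketch below uses Python's sqlite3 module, where a two-row helper CTE (the hypothetical idx) plays the role of unnest(...) with offset; the table t and its rows are made-up sample data. Like the BigQuery version, it still relies on a cross join, just not on a join between two scans of the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical orders table: one row per order with both dates.
    CREATE TABLE t (order_id INTEGER, order_date TEXT, ship_date TEXT);
    INSERT INTO t VALUES
        (1, '2020-01-01', '2020-01-02'),
        (2, '2020-01-01', '2020-01-03'),
        (3, '2020-01-02', '2020-01-03');
""")

daily_counts = conn.execute("""
    WITH idx(n) AS (VALUES (0), (1))   -- 0 = order_date, 1 = ship_date
    SELECT CASE idx.n WHEN 0 THEN t.order_date ELSE t.ship_date END AS date_,
           SUM(idx.n = 0) AS orders,      -- comparisons evaluate to 0/1 in SQLite
           SUM(idx.n = 1) AS shipments
    FROM t
    CROSS JOIN idx                     -- each order becomes two rows, one per date
    GROUP BY 1
    ORDER BY 1
""").fetchall()

for row in daily_counts:
    print(row)
```

Each source row is read once and fanned out into two rows, which is the single-scan property the answer above describes.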

SQL: Sum / Group By Issue for Multiple Rows

I have looked elsewhere, but not managed to get an answer to this, so hoping someone with much more SQL experience can help me out on this!
I have the following portfolio table:
Ticker Company_ID Exposure
ABC 1 0.02
DEF 2 0.10
XYZ 3 0.01
GTS 3 0.01
And the following information table (where there are duplicates, with other information, and they cannot be deleted):
Company_ID Company_Name
1 Alpha
2 Defacto
2 Defacto
3 XeeWhy
3 XeeWhy
And I would like the result to be of the form
Company_ID Company_Name Sum(Exposure)
1 Alpha 0.02
2 Defacto 0.10
3 XeeWhy 0.02
I can run something to get a simple sum from the portfolio table, but this does not include the company name:
Select Distinct Company_ID, Sum(Exposure)
From Portfolio
Group By Company_ID
But whenever I join the tables to get the Company Name, I get the sum duplicated depending how many times they appear in the Information table.
Any help or pointers would be much appreciated!
Thanks!
Your simplest way would be to make the JOIN to your companies table DISTINCT, something like this:
Select p.Company_ID,
c.Company_name,
Sum(Exposure) as Exposure
From Portfolio p
INNER JOIN (
SELECT DISTINCT Company_id, Company_Name
FROM Companies) c
ON c.Company_id = p.Company_ID
Group By p.Company_ID,
c.Company_Name
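To see why the duplicates inflate the sum and how the DISTINCT subquery avoids it, here is a small sketch using Python's sqlite3 module with the sample data from the question. The table names portfolio and information are assumptions taken from the question's wording, and the totals are rounded in Python only to keep floating-point noise out of the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE portfolio (ticker TEXT, company_id INTEGER, exposure REAL);
    CREATE TABLE information (company_id INTEGER, company_name TEXT);
    INSERT INTO portfolio VALUES
        ('ABC', 1, 0.02), ('DEF', 2, 0.10), ('XYZ', 3, 0.01), ('GTS', 3, 0.01);
    INSERT INTO information VALUES
        (1, 'Alpha'), (2, 'Defacto'), (2, 'Defacto'), (3, 'XeeWhy'), (3, 'XeeWhy');
""")

def totals(sql):
    """Run a (company_id, company_name, sum) query and round the sums to 2 places."""
    return [(cid, name, round(total, 2)) for cid, name, total in conn.execute(sql)]

# Joining straight to the table with duplicates repeats each portfolio row
# once per duplicate, so the sums are multiplied.
naive = totals("""
    SELECT p.company_id, i.company_name, SUM(p.exposure)
    FROM portfolio p
    JOIN information i ON i.company_id = p.company_id
    GROUP BY p.company_id, i.company_name
    ORDER BY p.company_id
""")

# De-duplicating the lookup side first leaves one name row per company.
deduplicated = totals("""
    SELECT p.company_id, i.company_name, SUM(p.exposure)
    FROM portfolio p
    JOIN (SELECT DISTINCT company_id, company_name FROM information) i
      ON i.company_id = p.company_id
    GROUP BY p.company_id, i.company_name
    ORDER BY p.company_id
""")

print("naive:       ", naive)
print("deduplicated:", deduplicated)
```

The naive join doubles the totals for companies 2 and 3, while the de-duplicated version matches the result the question asks for.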
Try joining a subquery that gets the distinct company information to a subquery that gets the grouped portfolio data.
SELECT x1.company_id,
x1.company_name,
x2.exposure
FROM (SELECT DISTINCT
company_id,
company_name
FROM information) x1
LEFT JOIN (SELECT company_id,
sum(exposure) exposure
FROM portfolio
GROUP BY company_id) x2
ON x2.company_id = x1.company_id;
I wasn't sure whether you want all companies in the result or only those that have portfolio data. If you want the latter, change the LEFT JOIN to an INNER JOIN.
Try this simple query:
SELECT (SELECT TOP 1 Company_Name FROM CompanyTable
WHERE Company_ID = P.Company_ID) Company_Name,
Sum(Exposure)
FROM Portfolio P
GROUP BY Company_ID
I used a CTE to get the Aggregation out of the way first:
Create table #portfolio (Ticker varchar(10), Company_ID int,Exposure decimal(10,2))
Insert into #portfolio values
('ABC', 1, 0.02),
('DEF', 2, 0.10),
('XYZ', 3, 0.01),
('GTS', 3, 0.01)
Create table #Information (Company_ID int,Company_Name varchar(10))
Insert into #Information values
(1,'Alpha'),
(2,'Defacto'),
(2,'Defacto'),
(3,'XeeWhy'),
(3,'XeeWhy')
;WITH CTE as(
SELECT Company_ID, SUM(Exposure) EXP from #portfolio GROUP BY Company_ID
)
SELECT t1.Company_ID,t2.Company_Name, t1.EXP
from CTE t1
INNER JOIN (SELECT DISTINCT Company_ID, Company_Name from #Information) t2 on
t1.Company_ID = t2.Company_ID

BigQuery Standard SQL count original rows after CROSS JOIN UNNEST

I have a table with a repeated field that requires a CROSS JOIN UNNEST and I want to be able to get the count of the original, nested rows. For example.
SELECT studentId, COUNT(1) as studentCount
FROM myTable
CROSS JOIN UNNEST classes
WHERE classes.id in ('1', '2')
Right now, if a student is in class 1 and 2 it will count that student twice in studentCount.
I know I can do count(distinct(student.id)) to workaround this, but this ends up being a lot slower than a simple count. It's not taking advantage of the fact there's exactly one row per student.
So is there any way to compute count of the original rows before unnesting (but after the where clause) but still include the unnest in the query?
Note this must be in Standard SQL.
I understood your "challenge" as showing only students from classes 1 and 2 while still showing the total count of students across all classes. If that is the case, see below:
#standardSQL
SELECT studentId, studentCount
FROM myTable
CROSS JOIN (SELECT COUNT(1) studentCount FROM myTable)
WHERE studentId IN (
SELECT studentID FROM UNNEST(classes) AS classes
WHERE classes.id IN ('1', '2')
)
You can test or play with it using dummy data, as below:
#standardSQL
WITH myTable AS (
SELECT 1 AS studentId, [STRUCT<id STRING>('1'),STRUCT('2'),STRUCT('3')] AS classes UNION ALL
SELECT 2, [STRUCT<id STRING>('4'),STRUCT('5')]
)
SELECT studentId, studentCount
FROM myTable
CROSS JOIN (SELECT COUNT(1) studentCount FROM myTable)
WHERE studentId IN (
SELECT studentID FROM UNNEST(classes) AS classes
WHERE classes.id IN ('1', '2')
)
If your desired output is different from what I guessed - you still might find above useful for calculating studentCount
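The answer above filters with a subquery over the repeated field instead of cross joining it, which keeps one row per student. The sketch below illustrates that difference with Python's sqlite3 module. SQLite has no repeated fields, so a hypothetical child table student_classes stands in for the classes array, and EXISTS stands in for the IN (... UNNEST(classes) ...) filter; all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- student_classes stands in for the repeated "classes" field.
    CREATE TABLE students (student_id INTEGER);
    CREATE TABLE student_classes (student_id INTEGER, class_id TEXT);
    INSERT INTO students VALUES (1), (2), (3);
    INSERT INTO student_classes VALUES
        (1, '1'), (1, '2'), (1, '3'), (2, '4'), (2, '5'), (3, '2');
""")

# Joining to the class rows multiplies students: student 1 matches twice.
joined_count = conn.execute("""
    SELECT COUNT(*)
    FROM students s
    JOIN student_classes c ON c.student_id = s.student_id
    WHERE c.class_id IN ('1', '2')
""").fetchone()[0]

# Filtering with EXISTS keeps one row per student, so a plain COUNT(*) is
# enough and no COUNT(DISTINCT ...) is needed.
filtered_count = conn.execute("""
    SELECT COUNT(*)
    FROM students s
    WHERE EXISTS (
        SELECT 1
        FROM student_classes c
        WHERE c.student_id = s.student_id
          AND c.class_id IN ('1', '2')
    )
""").fetchone()[0]

print(joined_count, filtered_count)
```

Here the join counts 3 rows for 2 matching students, while the EXISTS filter counts each matching student once.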
Just given the original constraints--that unnesting is required and you need to count the number of students--you can use a query of this form:
SELECT studentId, (SELECT COUNT(*) FROM myTable) AS studentCount
FROM myTable
CROSS JOIN UNNEST(classes) AS c
WHERE c.id IN ('1', '2')

SQL Select one row over a matching row from two tables

I have two tables with the same fields, but a final value that is calculated slightly differently. I need to combine the data from these two tables into one but need to prioritise one record over another when there is a match. Do you know how this might be possible?
Below is a mock up of two matching records:
ID  Balance      Type    CCY  Payment  Final_Balance
28  1068376.037  F - CC  GBP  78124    990252.0367
28  1068376.037  F - DD  GBP  982905   85470.08293
Apologies if the format comes out poorly, I'm unsure how to format table data.
I have thousands of records in these two tables but for a handful of records I have the same information in both tables. Essentially what I'm trying to get to is where there is a match I want it to select F-CC over F-DD so I end up with unique records in my final table.
Thanks
I personally use ROW_NUMBER() for things like this, but there may be a better solution.
You can re-run this SQL to show how the final answer is slowly built up:
declare @t1 table (id int)
declare @t2 table (id int, txt varchar(2))
insert into @t1
select 1 union
select 2
insert into @t2
select 1, 'FC' union
select 1, 'FD' union
select 2, 'FC' union
select 2, 'FD'
-- Step 1: number the rows within each id, lowest txt first
select *, row_number() over (partition by id order by txt) as we_want_the_ones
from @t2
-- Step 2: keep only the first row for each id
select * from (
select id, txt, row_number() over (partition by id order by txt) as we_want_the_ones
from @t2
) z
where we_want_the_ones = 1
-- Step 3: join the chosen rows back to the other table
select *
from @t1 a
join (
select * from (
select id, txt, row_number() over (partition by id order by txt) as we_want_the_ones
from @t2
) z
where we_want_the_ones = 1
) b on a.id = b.id
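Applied to the columns from the question, the ROW_NUMBER() approach might look like the sketch below. It uses Python's sqlite3 module (window functions need SQLite 3.25 or later) rather than SQL Server. The table names a and b, and the rows for IDs 29 and 30, are hypothetical; only the two rows for ID 28 come from the question. The priority is spelled out with a CASE expression instead of relying on alphabetical order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, balance REAL, type TEXT, ccy TEXT,
                    payment REAL, final_balance REAL);
    CREATE TABLE b (id INTEGER, balance REAL, type TEXT, ccy TEXT,
                    payment REAL, final_balance REAL);
    -- ID 28 is the matching pair from the question; 29 and 30 are made up.
    INSERT INTO a VALUES (28, 1068376.037, 'F - CC', 'GBP', 78124, 990252.0367),
                         (29, 500.0, 'F - CC', 'GBP', 100, 400.0);
    INSERT INTO b VALUES (28, 1068376.037, 'F - DD', 'GBP', 982905, 85470.08293),
                         (30, 200.0, 'F - DD', 'GBP', 50, 150.0);
""")

chosen = conn.execute("""
    SELECT id, type, final_balance
    FROM (
        SELECT id, type, final_balance,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   -- explicit priority: F - CC wins over anything else
                   ORDER BY CASE type WHEN 'F - CC' THEN 0 ELSE 1 END
               ) AS rn
        FROM (SELECT * FROM a UNION ALL SELECT * FROM b)
    )
    WHERE rn = 1      -- keep the highest-priority row per id
    ORDER BY id
""").fetchall()

for row in chosen:
    print(row)
```

IDs that exist in only one table pass through unchanged, and for ID 28 only the F - CC row survives.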
My understanding of the question is that you have two tables (A and B) which have the exact same columns. You want to UNION these tables into one dataset, but sometimes you have rows in the two tables which "match" each other. In this case you only take one of the rows based on some priority.
From your example it seems that..
Match: Occurs when the ID is the same.
Priority: Is based on the Type column, prioritized by lower alphabetical order.
Also I'm assuming SQL Server, since that's what I prefer and you didn't say.
Hopefully all that is correct.. Now, here is how I would approach it.
I would start by performing the UNION of the two tables. Taking all records and not worrying about matching yet, putting them in a temp table to use later.
SELECT ID, Balance, Type, CCY, Payment, Final_Balance
INTO #AllRecords
FROM A
UNION
SELECT ID, Balance, Type, CCY, Payment, Final_Balance
FROM B
Next, I would GROUP BY the fields which determine a match, then use MIN or MAX to get the correct value for priority columns. By my understanding of your problem that means..
SELECT ID, MIN(Type) AS Type
FROM #AllRecords
GROUP BY ID
With that query you now have the natural key for all the records you want to display in your final result. All that is left to do is look up the rest of the columns using those keys, we can do this by using that query as a subquery.
SELECT r.ID, r.Balance, r.Type, r.CCY, r.Payment, r.Final_Balance
FROM #AllRecords r
INNER JOIN (
SELECT ID, MIN(Type) AS Type
FROM #AllRecords
GROUP BY ID ) final ON r.ID = final.ID AND r.Type = final.Type
So all together the resulting query is..
SELECT ID, Balance, Type, CCY, Payment, Final_Balance
INTO #AllRecords
FROM A
UNION
SELECT ID, Balance, Type, CCY, Payment, Final_Balance
FROM B
SELECT r.ID, r.Balance, r.Type, r.CCY, r.Payment, r.Final_Balance
FROM #AllRecords r
INNER JOIN (
SELECT ID, MIN(Type) AS Type
FROM #AllRecords
GROUP BY ID ) final ON r.ID = final.ID AND r.Type = final.Type