BigQuery SQL Query Optimization

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).
The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only so I have to filter out old copies via the ROW_NUMBER() OVER line you'll see below, so that's why it exists. Seems like the standard way to deal with this quirk.)
The wrinkle with the companies table is that every field except the two ID fields is of type RECORD (see the screenshot at the bottom for an example). This serves to keep a history of field value changes. Unfortunately the entries don't seem to be in any particular order, so wrapping a field (properties.first_conversion_event_name, for example) in MIN() or MAX() and grouping by companyid doesn't work.
This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):
WITH companies AS (
SELECT
o.companyid as companyid,
ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)
SELECT
companyid,
name.value as name,
industry.value as industry,
lifecyclestage.value as lifecyclestage
FROM companies
The WITH clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT()) includes. For each field I would have two columns - [field].value and [field].timestamp - and I only want the [field].value one.
Thanks in advance!
Schema Screenshot

Based on the schema you presented, and assuming your query really returns what you expect, the "optimized" version below should return the same result:
#standardSQL
WITH companies AS (
SELECT
o.companyid AS companyid,
STRUCT(o.properties.name.value, o.properties.name.timestamp) AS name,
STRUCT(o.properties.industry.value, o.properties.industry.timestamp) AS industry,
STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) AS lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) AS seqnum
FROM `project.hubspot.companies` o
) o
WHERE seqnum = 1
)
SELECT
companyid,
name.value AS name,
industry.value AS industry,
lifecyclestage.value AS lifecyclestage
FROM companies
As you can see, I simply removed GROUP BY companyid: after you apply WHERE seqnum = 1 you already have exactly one row per companyid, so there is nothing left to group. For the same reason I removed ARRAY_AGG(... ORDER BY ...)[SAFE_OFFSET(0)]: it aggregated a single struct into an array and then extracted that one element again, so it is not needed.
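To make the "one row per key, so nothing to group" point concrete, here is a minimal sketch that runs locally with Python's built-in sqlite3 module instead of BigQuery (SQLite 3.25 or newer is assumed for window functions). The flat companies table, the batched_at column, and the sample rows are hypothetical stand-ins for the Stitch-synced data, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (companyid INTEGER, name TEXT, batched_at INTEGER);
    INSERT INTO companies VALUES
        (1, 'Acme (old)', 100),   -- stale copy from an earlier sync
        (1, 'Acme',       200),   -- latest copy
        (2, 'Globex',     150);
""")

# Keep only the most recently batched copy of each company.
latest = """
    SELECT companyid, name
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY companyid ORDER BY batched_at DESC) AS seqnum
        FROM companies
    ) AS o
    WHERE seqnum = 1
"""

# Version 1: no GROUP BY at all.
plain_rows = conn.execute(latest + " ORDER BY companyid").fetchall()

# Version 2: GROUP BY plus an aggregate on top of the same filtered rows.
grouped_rows = conn.execute(
    "SELECT companyid, MAX(name) FROM (" + latest + ") AS d "
    "GROUP BY companyid ORDER BY companyid"
).fetchall()

print(plain_rows)
print(grouped_rows)
```

Both queries return the same rows here, because each group in the second version contains exactly one row after the seqnum filter.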

Related

Selecting 1 column's value in a group after grouping by another column

How would I include the name of any one of the books that belong to that particular type in the below query?
select distinct
(select sum(ob.Balance)),
ob.BookType
from orders.OrderBooks ob
group by ob.BookType
In its current state it does what I need it to and groups books by BookType and sums their balances, as seen below.
However I need the name of any book that belongs to that BookType as part of the result.
If I select the BookName column and then group by it like below, it results in more unique entries and to an extent undoes the original grouping.
select distinct
(select sum(ob.Balance)),
ob.BookType,
ob.BookName
from orders.OrderBooks ob
group by ob.BookType, ob.BookName
;WITH x AS
(
SELECT
Balance = SUM(Balance) OVER (PARTITION BY BookType),
BookType,
BookName,
rn = ROW_NUMBER() OVER (PARTITION BY BookType ORDER BY BookName DESC)
FROM orders.OrderBooks
)
SELECT Balance, BookType, BookName
FROM x
WHERE rn = 1;
ORDER BY BookName DESC was dealer's choice. If you truly don't care which title shows up in the result, you can use any ordering you like. If you want the results to be random every time, you can use ORDER BY NEWID().
In general I like this flexibility better than the TOP (1) subquery approach, in addition to a single scan instead of an additional table access per row. But you can also do it a different way; just take min/max of the bookname, too:
SELECT Balance = SUM(Balance),
BookType,
BookName = MIN(BookName) -- or MAX()
FROM orders.OrderBooks
GROUP BY BookType;
You can see these give similar results in this db<>fiddle. Plan is simpler, too; most notably: no spools. However when you use an aggregate function against that column, it makes it harder to provide arbitrary/random results, and if you intend to add other columns pulled from the right row, you'll need to go back to the row_number solution.
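As a rough illustration of how the two approaches compare, here is a small sketch using Python's built-in sqlite3 module rather than SQL Server (SQLite 3.25 or newer is assumed). The sample rows are made up, the table is created without the orders schema, and standard AS aliases replace the T-SQL "alias = expression" form, which SQLite does not accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderBooks (BookType TEXT, BookName TEXT, Balance INTEGER);
    INSERT INTO OrderBooks VALUES
        ('Fiction', 'Alpha', 10),
        ('Fiction', 'Beta',   5),
        ('Science', 'Gamma',  7);
""")

# ROW_NUMBER approach: windowed SUM plus one picked row per BookType.
row_number_rows = conn.execute("""
    WITH x AS (
        SELECT SUM(Balance) OVER (PARTITION BY BookType) AS Balance,
               BookType,
               BookName,
               ROW_NUMBER() OVER (
                   PARTITION BY BookType ORDER BY BookName DESC) AS rn
        FROM OrderBooks
    )
    SELECT Balance, BookType, BookName
    FROM x
    WHERE rn = 1
    ORDER BY BookType
""").fetchall()

# Aggregate approach: MAX(BookName) matches ORDER BY BookName DESC above.
aggregate_rows = conn.execute("""
    SELECT SUM(Balance) AS Balance, BookType, MAX(BookName) AS BookName
    FROM OrderBooks
    GROUP BY BookType
    ORDER BY BookType
""").fetchall()

print(row_number_rows)
print(aggregate_rows)
```

The two result sets match on this data only because MAX(BookName) picks the same title as ORDER BY BookName DESC; with a different ordering in the window (or other columns pulled from the chosen row) only the row_number version would still apply.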
You can use a correlated subquery to get a single book name of that type. This assumes there's an ID field and you want to pull the most recent one:
select
Balance = sum(ob.Balance),
ob.BookType,
-- select ob2.BookName (the correlated row), not ob.BookName
BookName = (SELECT TOP(1) ob2.BookName FROM orders.OrderBooks ob2 WHERE ob2.BookType = ob.BookType ORDER BY ob2.ID DESC)
from orders.OrderBooks ob
group by ob.BookType

SQL Question regarding fields associated with the MAX([field]) only

I'm trying to gather the entire row information associated with the MAX() of a particular field.
I essentially have several [Flight_Leg] for a unique [Shipment_ID], and each one has unique [Destination_Airport], [Departure_Time], and [Arrival_Time]. Obviously, each [Shipment_ID] can have multiple [Flight_Leg], and each [Flight_Leg] has a unique row of information.
SELECT
[Shipment_ID],
MAX([Flight_Leg]) AS "Final Leg",
[Arrival_Time],
[Destination_Airport]
FROM
[Flight_Info]
Group By
[Shipment_ID],
[Arrival_Time]
The output is multiple lines, rather than having one unique line for [Shipment_ID]. I'm just trying to isolate the FINAL flight info for a shipment.
Most databases support window functions, though it depends on which one you use. Here's one option using row_number():
select *
from (
select *, row_number() over (partition by shipment_id order by flight_leg desc) rn
from flight_info
) t
where rn = 1
Alternatively here's a more generic approach joining back to itself:
select fi.*
from flight_info fi
join (select shipment_id, max(flight_leg) max_flight_leg
from flight_info
group by shipment_id) t on fi.shipment_id = t.shipment_id
and fi.flight_leg = t.max_flight_leg
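For anyone who wants to try both options side by side, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer is assumed for row_number()). The three-column flight_info table and its rows are invented for illustration and are much narrower than the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flight_info (
        shipment_id INTEGER, flight_leg INTEGER, destination_airport TEXT);
    INSERT INTO flight_info VALUES
        (1, 1, 'ORD'),
        (1, 2, 'JFK'),   -- final leg of shipment 1
        (2, 1, 'LAX');   -- only leg of shipment 2
""")

# Option 1: number the legs per shipment, highest leg first, keep rn = 1.
window_rows = conn.execute("""
    SELECT shipment_id, flight_leg, destination_airport
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY shipment_id ORDER BY flight_leg DESC) AS rn
        FROM flight_info
    ) AS t
    WHERE rn = 1
    ORDER BY shipment_id
""").fetchall()

# Option 2: find MAX(flight_leg) per shipment and join back to the table.
join_rows = conn.execute("""
    SELECT fi.shipment_id, fi.flight_leg, fi.destination_airport
    FROM flight_info AS fi
    JOIN (
        SELECT shipment_id, MAX(flight_leg) AS max_flight_leg
        FROM flight_info
        GROUP BY shipment_id
    ) AS t
      ON fi.shipment_id = t.shipment_id
     AND fi.flight_leg = t.max_flight_leg
    ORDER BY fi.shipment_id
""").fetchall()

print(window_rows)
print(join_rows)
```

One difference to keep in mind: if two rows of a shipment shared the same maximum flight_leg, the join version would return both of them, while row_number() would keep just one.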

How to work with problems correlated subqueries that reference other tables, without using Join

I am trying to work with the public BigQuery dataset bigquery-public-data.austin_crime.crime. My goal is to get three columns of output: the description (of the crime), the count of crimes with that description, and the top district for that particular description.
I am able to get the first two columns with this query.
select
a.description,
count(*) as district_count
from `bigquery-public-data.austin_crime.crime` a
group by description order by district_count desc
I was hoping to get this done in one query, so I tried adding the subquery below to get a third column showing the top district for each description (crime):
select
a.description,
count(*) as district_count,
(
select district from
( select
district, rank() over(order by COUNT(*) desc) as rank
FROM `bigquery-public-data.austin_crime.crime`
where description = a.description
group by district
) where rank = 1
) as top_District
from `bigquery-public-data.austin_crime.crime` a
group by description
order by district_count desc
The error I am getting is: "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
I think I can do that with joins. Does anyone have a better solution, possibly one that doesn't use a join?
Below is for BigQuery Standard SQL
#standardSQL
SELECT description,
ANY_VALUE(district_count) AS district_count,
STRING_AGG(district ORDER BY cnt DESC LIMIT 1) AS top_district
FROM (
SELECT description, district,
COUNT(1) OVER(PARTITION BY description) AS district_count,
COUNT(1) OVER(PARTITION BY description, district) AS cnt
FROM `bigquery-public-data.austin_crime.crime`
)
GROUP BY description
-- ORDER BY district_count DESC
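The same "count per description, then pick the district with the highest count" idea can be tried outside BigQuery. The sketch below uses Python's built-in sqlite3 module (SQLite 3.25 or newer assumed) and a tiny made-up crime table. SQLite has no STRING_AGG(... ORDER BY ... LIMIT 1), so this swaps in ROW_NUMBER() over pre-aggregated counts; it is a different formulation of the technique, not the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crime (description TEXT, district TEXT);
    INSERT INTO crime VALUES
        ('THEFT', 'A'), ('THEFT', 'A'), ('THEFT', 'B'),
        ('BURGLARY', 'B');
""")

top_district_rows = conn.execute("""
    SELECT description, district_count, district AS top_district
    FROM (
        SELECT description,
               district,
               -- total rows for the description, across all districts
               SUM(cnt) OVER (PARTITION BY description) AS district_count,
               -- rank districts within a description by their row count
               ROW_NUMBER() OVER (
                   PARTITION BY description ORDER BY cnt DESC) AS rn
        FROM (
            SELECT description, district, COUNT(*) AS cnt
            FROM crime
            GROUP BY description, district
        ) AS per_district
    ) AS ranked
    WHERE rn = 1
    ORDER BY district_count DESC
""").fetchall()

print(top_district_rows)
```

No correlated subquery is involved, so the de-correlation error from the question does not come into play. Ties between districts are broken arbitrarily by ROW_NUMBER().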

Extract and concatenate the same field from multiple records in big query

I would like to be able to extract one field from multiple records from within a single table. For example, assuming I have a schema as follows
userId, eventTimestamp, theField
What I want to do is concatenate all instances of 'theField' into a single string for a given userId, ordered by eventTimestamp. And for an extra wrinkle, let's say I only want to include the fifty oldest records.
My first attempt was to try something like:
SELECT
userId,
eventTimestamp,
LEAD(theField,0) OVER (PARTITION BY userId ORDER BY eventTimestamp) AS step0,
LEAD(theField,1) OVER (PARTITION BY userId ORDER BY eventTimestamp) AS step1,
....,
LEAD(theField,50) OVER (PARTITION BY userId ORDER BY eventTimestamp) AS step50,
And then the next step was to wrap that first step up in another SELECT statement as follows:
SELECT userId, eventTimestamp, CONCAT(STRING(step0), STRING(step1),...,STRING(step50)) as concatenatedString
FROM [whateverDataset.whateverTable],
GROUP BY
userId, eventTimestamp
This approach doesn't work though because if I have more than 50 steps (which I do), then I end up getting multiple rows for each of those outer SELECT statements, basically N-50 rows, where N = the total number of records for a particular userId. A 'solution' to this would be to have a HAVING statement in the inner SELECT statement to limit itself to only reporting the first 50 records, but overall this seems like a rather cumbersome solution. In non-BigQuery variants of SQL the GROUP_CONCAT seems to be a good way to go forward, but it either doesn't work here or I lack the creativity to get it to work. Anyone have any suggestions?
Thanks,
Brad
For BigQuery Legacy SQL:
SELECT
userid, GROUP_CONCAT(theField) AS Fields
FROM (
SELECT
userid, eventTimestamp, theField,
ROW_NUMBER() OVER(PARTITION BY userid ORDER BY eventTimestamp DESC) AS pos
FROM YourTable
ORDER BY eventTimestamp
)
WHERE pos < 51
GROUP BY userid
Please note: the inner ORDER BY does not guarantee the order of theField in GROUP_CONCAT. So far, in all practical cases I have seen, the order is preserved, but test carefully.
For BigQuery Standard SQL:
Don't forget to uncheck the Use Legacy SQL checkbox under Show Options.
SELECT
userid,
(SELECT STRING_AGG(fields) FROM t.fields) AS fields
FROM (
SELECT
userid,
ARRAY(SELECT theField FROM t.fields ORDER BY eventTimestamp) fields
FROM (
SELECT
userid,
ARRAY_AGG(STRUCT(theField, eventTimestamp)) fields
FROM (
SELECT
userid,
eventTimestamp,
theField,
ROW_NUMBER() OVER(PARTITION BY userid ORDER BY eventTimestamp DESC) AS pos
FROM YourTable
)
WHERE pos < 51
GROUP BY userid
) t
) t
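The overall shape of the approach (limit each user to N rows with ROW_NUMBER(), then concatenate in timestamp order) can be sketched locally as well. The example below uses Python's built-in sqlite3 module (SQLite 3.25 or newer assumed), a hypothetical events table, and a limit of 2 instead of 50 to keep the data small. Because aggregate ordering is not guaranteed in older SQLite either, the concatenation is done in Python over explicitly sorted rows; that part is an assumption of this sketch, not something the BigQuery queries do.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (userid TEXT, eventTimestamp INTEGER, theField TEXT);
    INSERT INTO events VALUES
        ('u1', 1, 'a'), ('u1', 2, 'b'), ('u1', 3, 'c'),
        ('u2', 5, 'x');
""")

# Same numbering as the queries above (DESC), with "pos < 3" standing in
# for "pos < 51"; the outer ORDER BY makes the row order explicit.
rows = conn.execute("""
    SELECT userid, theField
    FROM (
        SELECT userid, eventTimestamp, theField,
               ROW_NUMBER() OVER (
                   PARTITION BY userid ORDER BY eventTimestamp DESC) AS pos
        FROM events
    ) AS t
    WHERE pos < 3
    ORDER BY userid, eventTimestamp
""").fetchall()

# Rows arrive sorted by userid, so groupby sees each user exactly once.
concatenated = {
    userid: ",".join(field for _, field in group)
    for userid, group in groupby(rows, key=lambda row: row[0])
}

print(concatenated)
```

Changing DESC to ASC in the window would keep the earliest rows per user instead of the latest ones.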

Return row data based on Distinct Column value

I'm using SQL Azure with asp script, and for the life of me, have had no luck trying to get this to work. The table I'm running a query on has many columns, but I want to query for distinct values on 2 columns (name and email), from there I want it to return the entire row's values.
What my query looks like now:
SELECT DISTINCT quote_contact, quote_con_email
FROM quote_headers
WHERE quote_contact LIKE '"&query&"%'
But I need it to return the whole row so I can retrieve other data points. Had I been smart a year ago, I would have created a separate table just for the contacts, but that's a year ago.
And before I implemented LiveSearch features.
One approach would be to use a CTE (Common Table Expression). With this CTE, you can partition your data by some criteria - i.e. your quote_contact and quote_con_email - and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria - i.e. probably SomeDateTimeStamp.
So try something like this:
;WITH DistinctContacts AS
(
SELECT
quote_contact, quote_con_email, (other columns here),
RN = ROW_NUMBER() OVER(PARTITION BY quote_contact, quote_con_email ORDER BY SomeDateTimeStamp DESC)
FROM
dbo.quote_headers
WHERE
quote_contact LIKE '"&query&"%'
)
SELECT
quote_contact, quote_con_email, (other columns here)
FROM
DistinctContacts
WHERE
RN = 1
Here, I am selecting only the last entry for each "partition" (i.e. for each pair of name/email) - ordered in a descending fashion by the time stamp.
Does that approach what you're looking for??
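If it helps to see the pattern run end to end, here is a minimal sketch using Python's built-in sqlite3 module instead of SQL Azure (SQLite 3.25 or newer assumed). The quote_id and created_at columns and the sample rows are hypothetical; created_at plays the role of SomeDateTimeStamp, and the search prefix is passed as a bound parameter rather than concatenated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quote_headers (
        quote_id INTEGER, quote_contact TEXT,
        quote_con_email TEXT, created_at INTEGER);
    INSERT INTO quote_headers VALUES
        (1, 'Ann',  'ann@example.com',  10),
        (2, 'Ann',  'ann@example.com',  20),   -- newer quote, same contact
        (3, 'Andy', 'andy@example.com', 15),
        (4, 'Bob',  'bob@example.com',  30);
""")

prefix = "A"  # hypothetical search text typed by the user

contact_rows = conn.execute("""
    WITH DistinctContacts AS (
        SELECT quote_id, quote_contact, quote_con_email,
               ROW_NUMBER() OVER (
                   PARTITION BY quote_contact, quote_con_email
                   ORDER BY created_at DESC) AS RN
        FROM quote_headers
        WHERE quote_contact LIKE ?
    )
    SELECT quote_id, quote_contact, quote_con_email
    FROM DistinctContacts
    WHERE RN = 1
    ORDER BY quote_contact
""", (prefix + "%",)).fetchall()

print(contact_rows)
```

Each name/email pair comes back once, carrying the columns of its most recent row, so any other column of that row can be added to the select list.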
You need to provide more details.
This is what I could come up with without them:
WITH dist as (
SELECT DISTINCT quote_contact, quote_con_email, RANK() OVER(ORDER BY quote_contact, quote_con_email) rankID
FROM quote_headers
WHERE quote_contact LIKE '"&query&"%'
),
data as (
SELECT *, RANK() OVER(ORDER BY quote_contact, quote_con_email) rankID FROM quote_headers
WHERE quote_contact LIKE '"&query&"%' -- same filter as dist, so the rankID values line up
)
SELECT * FROM dist d INNER JOIN data src ON d.rankID = src.rankID