Multiple Left Joins in BigQuery - sql

I'm trying to make a currently working SQL query that I have in BigQuery more streamlines and am running into the following issue:
Error: ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name. Consider using Standard SQL .google.com/bigquery/docs/reference/standard-sql/), which allows non-equality JOINs and comparisons involving expressions and residual predicates.
Below is the query that is giving the error above. The first LEFT JOIN works. When I added the second one, right below, I started getting the error. What I'm trying to do is get the human readable own.o.firstname and own.o.lastname values rather than the owner_id value of the deal record (o.properties.hubspot_owner_id.value), but in order to do so I need to join some tables.
I had to use CAST on the ON clause of the second JOIN because the fields are of different types in each table's respective schema. If I don't do that, I get the following error: Error: Join keys o.properties.hubspot_owner_id.value (string) and o.ownerid (int64) have types that cannot be automatically coerced.
The WHERE clause is just a suppression list to not return entries that have been deleted from the database.
SELECT o.*
FROM (
SELECT
o.dealid,
o.properties.dealname.value,
stages.Label,
o.properties.closedate.value,
o.properties.hubspot_owner_id.value,
own.o.firstname,
own.o.lastname,
o.properties.amount.value,
o.properties.createdate.value,
o.properties.pipeline.value,
o.associations.associatedcompanyids,
ROW_NUMBER() OVER (PARTITION BY o.dealid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM [sample-table:hubspot.deals] o
LEFT JOIN [sample-table:hubspot.sales_stages_lookup] stages ON o.properties.dealstage.value = stages.Internal_Value
LEFT JOIN [sample-table:hubspot.owners_reporting] own ON CAST(o.properties.hubspot_owner_id.value AS INTEGER) = CAST(own.o.ownerid AS INTEGER)) o
WHERE o.dealid NOT IN (SELECT objectid FROM [sample-table:hubspot_suppression_list.data] WHERE subscriptiontype = 'deal.deletion') AND seqnum = 1

Use standard SQL in BigQuery instead, which supports expressions as part of the ON clause:
#standardSQL
SELECT o.*
FROM (
SELECT
o.dealid,
o.properties.dealname.value AS dealname_value,
stages.Label,
o.properties.closedate.value AS closedate_value,
o.properties.hubspot_owner_id.value AS hubspot_owner_id_value,
own.o.firstname,
own.o.lastname,
o.properties.amount.value AS amount_value,
o.properties.createdate.value AS createdate_value,
o.properties.pipeline.value AS pipeline_value,
o.associations.associatedcompanyids,
ROW_NUMBER() OVER (PARTITION BY o.dealid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM `sample-table.hubspot.deals` o
LEFT JOIN `sample-table.hubspot.sales_stages_lookup` stages ON o.properties.dealstage.value = stages.Internal_Value
LEFT JOIN `sample-table.hubspot.owners_reporting` own ON CAST(o.properties.hubspot_owner_id.value AS INT64) = CAST(own.o.ownerid AS INT64)) o
WHERE o.dealid NOT IN (SELECT objectid FROM `sample-table.hubspot_suppression_list.data` WHERE subscriptiontype = 'deal.deletion') AND seqnum = 1
For more on the differences between legacy and standard SQL in BigQuery, see the migration guide.

Related

postgres: LEFT JOIN table and field does not exist

This is my query
SELECT org.id,
org.name,
org.type,
org.company_logo,
(SELECT org_profile.logo_url FROM org_profile WHERE org_profile.org_id = org.id AND org_profile.status = 'active' ORDER BY org_profile.id DESC LIMIT 1) as logo_url,
org.short_description,
org_profile.value_prop,
count(*) OVER () AS total
FROM org
LEFT JOIN user_info ON user_info.id = org.approved_by
INNER JOIN (select distinct org_profile.org_id from org_profile) org_profile ON org_profile.org_id = org.id
WHERE
org.type = 'Fintech'
AND org.status = 'APPROVED'
AND org.draft != TRUE
ORDER BY org.id DESC
I am using LEFT JOIN query with my org_profile table. I used distinct for unique org id but the problem is org_profile.value_prop column does not work. The error is showing column org_profile.value_prop does not exist
I'm trying to solve this issue. But I don't figure out this issue.
basically, the error informs that you try to get the value_prop field from org_profile subquery, which basically doesn't exist.
It's difficult to give a working query by writting just on the paper, but I think that:
it's worth to apply the handy aliasses for each subquery
deduplication, if required, should be done in the subquery. When multiple fields used DISTINCT may be insufficient - RANK function may be required.
you make some operations to get the logo_url by a scalar subquery - it seems a bit strange, especially the same table is used in JOIN - I would suggest to move all logic related to org_profile to the subquery. Scalar expressions would throw an error in case multiple values would be found in output.
SELECT
org.id,
org.name,
org.type,
org.company_logo,
prof.logo_url,
org.short_description,
prof.value_prop,
count(*) OVER () AS total
FROM org
JOIN (
select distinct org_id, logo_url, value_prop -- other deduplication type (RANK) may be required
from org_profile
where status = 'active' -- ?
) prof ON org.id = prof.org_id
LEFT JOIN user_info usr ON usr.id = org.approved_by
WHERE
org.type = 'Fintech'
AND org.status = 'APPROVED'
AND org.draft != TRUE
ORDER BY org.id DESC

Query performance: CTE using ROW_NUMBER() to select first row

We have three environments and when I run my SQL query in two of them just takes 30 or 38 seconds to run but in the other environment running is not completed and I should cancel it. Query is based on two parts, a CTE and a very simple select from a table, in both CTE and select I'm using the same table.
Could you please tell me why it takes so long time? how can I improve the query?
ALTER VIEW [fact].[vPurchase]
AS
WITH VKPL AS
(
SELECT *
FROM
(SELECT
iv.[Delivery_FK],
1 AS column2,
ROW_NUMBER() OVER(PARTITION BY [Delivery_FK] ORDER BY iv.UpdateDate) AS rk
FROM
[fact].[KRMFact] iv
LEFT JOIN
[dimension].[Product] pr ON iv.Product_FK =pr.Product_SK
LEFT JOIN
[dimension].[Delivery] le ON le.Delivery_FK = iv.Delivery_FK
WHERE
pr.Product_Key = '740') X
WHERE
rk = 1
)
SELECT
-- .... here are some columns
Delivery_FK,
Product_FK,
CAST(column2 AS VARCHAR) AS column2,
f.[UpdateDate] AS [Update date]
FROM
[fact].[KRMFact] f
LEFT JOIN
VKPL v ON f.Delivery_FK = v.Delivery_FK
This is guesswork.
I guess the environment where this query is slow is the one with lots of production data in it.
I guess some index on your KRMFact table will, maybe, help you. Here's how to figure out what you need: SQL Server Management Studio (SSMS) has a feature to show you a query's execution plan. Put your query (not simplified, please, the actual query) into SSMS, right click and choose "Include Actual Execution Plan." Then run the query. The execution plan display may recommend an index for you to create to get this query to run faster.
I guess you're trying to find rows with the earliest values of UpdateDate.
Your subquery
SELECT *
FROM
(SELECT
iv.[Delivery_FK],
1 AS column2,
ROW_NUMBER() OVER(PARTITION BY [Delivery_FK] ORDER BY iv.UpdateDate) AS rk
FROM
[fact].[KRMFact] iv
LEFT JOIN
[dimension].[Product] pr ON iv.Product_FK =pr.Product_SK
LEFT JOIN
[dimension].[Delivery] le ON le.Delivery_FK = iv.Delivery_FK
WHERE
pr.Product_Key = '740') X
WHERE
rk = 1
looks like it picks out the row with the earliest KRMFact.UpdateDate for each value of KRMFact.Delivery_FK. That's what the ROW_NUMBER() OVER... WHERE rk=1 language does.
If my guess about that is correct you can do that a different way, which may be more efficient.
SELECT *
FROM
(SELECT
iv.[Delivery_FK],
1 AS column2,
1 AS rk
FROM
[fact].[KRMFact] iv
JOIN ( SELECT Delivery_FK, MIN(UpdateDate) first_update
FROM [fact].[KRMFact]
GROUP BY Delivery_FK
) first_update ON iv.UpdateDate = first_update.first_update
LEFT JOIN
[dimension].[Product] pr ON iv.Product_FK =pr.Product_SK
LEFT JOIN
[dimension].[Delivery] le ON le.Delivery_FK = iv.Delivery_FK
WHERE
pr.Product_Key = '740') X
WHERE
rk = 1
You should probably try out the old and new versions of the subquery to determine whether they will yield the same results.
If you use this subquery query I suggest, this index will help make it run faster by optimizing the new sub-sub-query's MIN() ... GROUP BY operation.
CREATE INDEX x_KRMFact_Product_Update
ON [fact].[KRMFact]
([Product_FK],[UpdateDate])
By the way, WHERE pr.Product_Key = '740' turns your LEFT JOIN [dimension].[Product] operation into an ordinary inner JOIN.

SQL not efficient enough, tuning assistance required

We have some SQL that is ok on smaller data volumes but poor once we scale up to selecting from larger volumes. Is there a faster alternative style to achieve the same output as below? The idea is to pull back a single unique row to get latest version of the data... The SQL does reference another view but this view runs very fast - so we expect the issue is here below and want to try a different approach
SELECT *
FROM
(SELECT (select CustomerId from PremiseProviderVersionsToday
where PremiseProviderId = b.PremiseProviderId) as CustomerId,
c.D3001_MeterId, b.CoreSPID, a.EnteredBy,
ROW_NUMBER() OVER (PARTITION BY b.PremiseProviderId
ORDER BY a.effectiveDate DESC) AS rowNumber
FROM PremiseMeterProviderVersions a, PremiseProviders b,
PremiseMeterProviders c
WHERE (a.TransactionDateTimeEnd IS NULL
AND a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND b.PremiseProviderId = c.PremiseProviderId)
) data
WHERE data.rowNumber = 1
As Bilal Ayub stated above, the correlated subquery can result in performance issues. See here for more detail. Below are my suggestions:
Change all to explicit joins (ANSI standard)
Use aliases that are more descriptive than single characters (this is mostly to help readers understand what each table does)
Convert data subquery to a temp table or cte (temp tables and ctes usually perform better than subqueries)
Note: normally, you should explicitly create and insert into your temp table but I chose not to do that here as I do not know the data types of your columns.
SELECT d.CustomerId
, c.D3001_MeterId
, b.CoreSPID
, a.EnteredBy
, rowNumber = ROW_NUMBER() OVER(PARTITION BY b.PremiseProviderId ORDER BY a.effectiveDate DESC)
INTO #tmp_RowNum
FROM PremiseMeterProviderVersions a
JOIN PremiseMeterProviders c ON c.PremiseMeterProviderId = a.PremiseMeterProviderId
JOIN PremiseProviders b ON b.PremiseProviderId = c.PremiseProviderId
JOIN PremiseProviderVersionsToday d ON d.PremiseProviderId = b.PremiseProviderId
WHERE a.TransactionDateTimeEnd IS NULL
SELECT *
FROM #tmp_RowNum
WHERE rowNumber = 1
You are running a correlated query that will run in loop, if size of table is small it will be faster, i would suggest to change it and try to join the table in order to get customerid.
(select CustomerId from PremiseProviderVersionsToday where PremiseProviderId = b.PremiseProviderId) as CustomerId
Consider derived tables including an aggregate query that calculates maximum EffectoveDate by PremiseProviderId and unit level query, each using explicit joins (current ANSI SQL standard) and not implicit as you currently use:
SELECT data.*
FROM
(SELECT t.CustomerId, c.D3001_MeterId, b.CoreSPID, a.EnteredBy,
b.PremiseProviderId, a.EffectiveDate
FROM PremiseMeterProviders c
INNER JOIN PremiseMeterProviderVersions a
ON a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND a.TransactionDateTimeEnd IS NULL
INNER JOIN PremiseProviders b
ON b.PremiseProviderId = c.PremiseProviderId
INNER JOIN PremiseProviderVersionsToday t
ON t.PremiseProviderId = b.PremiseProviderId
) data
INNER JOIN
(SELECT b.PremiseProviderId, MAX(a.EffectiveDate) As MaxEffDate
FROM PremiseMeterProviders c
INNER JOIN PremiseMeterProviderVersions a
ON a.PremiseMeterProviderId = c.PremiseMeterProviderId
AND a.TransactionDateTimeEnd IS NULL
INNER JOIN PremiseProviders b
ON b.PremiseProviderId = c.PremiseProviderId
GROUP BY b.PremiseProviderId
) agg
ON data.PremiseProviderId = agg.PremiseProviderId
AND data.EffectiveDate = agg.MaxEffDate

SQL - select only newest record with WHERE clause

I have been trying to get some data off our database but got stuck when I needed to only get the newest file upload for each file type. I have done this before using the WHERE clause but this time there is an extra table involved that is needed to determine the file type.
My query looks like this so far and i am getting six records for this user (2x filetypeNo4 and 4x filetypeNo2).
SELECT db_file.fileID
,db_profile.NAME
,db_applicationFileType.fileTypeID
,> db_file.dateCreated
FROM db_file
LEFT JOIN db_applicationFiles
ON db_file.fileID = db_applicationFiles.fileID
LEFT JOIN db_profile
ON db_applicationFiles.profileID = db_profile.profileID
LEFT JOIN db_applicationFileType
ON db_applicationFiles.fileTypeID = > > db_applicationFileType.fileTypeID
WHERE db_profile.profileID IN ('19456')
AND db_applicationFileType.fileTypeID IN ('2','4')
I have the WHERE clause looking like this which is not working:
(db_file.dateCreated IS NULL
OR db_file.dateCreated = (
SELECT MAX(db_file.dateCreated)
FROM db_file left join
db_applicationFiles on db_file.fileID = db_applicationFiles.fileID
WHERE db_applicationFileType.fileTypeID = db_applicationFiles.FiletypeID
))
Sorry I am a noob so this may be really simple, but I just learn this stuff as I go on my own..
SELECT
ff.fileID,
pf.NAME,
ff.fileTypeID,
ff.dateCreated
FROM db_profile pf
OUTER APPLY
(
SELECT TOP 1 af.fileTypeID, df.dateCreated, df.fileID
FROM db_file df
INNER JOIN db_applicationFiles af
ON df.fileID = af.fileID
WHERE af.profileID = pf.profileID
AND af.fileTypeID IN ('2','4')
ORDER BY create_date DESC
) ff
WHERE pf.profileID IN ('19456')
And it looks like all of your joins are actually INNER. Unless there may be profile without files (that's why OUTER apply instead of CROSS).
What about an obvious:
SELECT * FROM
(SELECT * FROM db_file ORDER BY dateCreated DESC) AS files1
GROUP BY fileTypeID ;

SQL - Subtraction within a Common Table Expression,

------ SOLVED --------
Instead of trying to perform the subtraction within the scope of the CTE, I just had to place it into the sub query which was using this particular CTE, in the main select list.
The Problem was, `InnerOQLI.FreePlaceCount is not being recognized, since it hasn't been defined within the scope of the CTE and is only being used in the Exists statement.
------ PROBLEM ----------
This is the first time I've used CTE's, I've joined multiple CTE's together so that I can retrieve an overall total in a single column.
I need to perform a subtraction within one of the CTE's
I first wrote it like this
MyCount2
AS
(
SELECT DISTINCT
O.ID AS OrderID,
(
(
(SELECT SUM(InnerOC.[Count])
FROM Order InnerO
INNER JOIN SubOrder InnerSO ON InnerO.ID = InnerSO.OrderID
INNER JOIN OrderComponent InnerOC ON SO.ID = OC.SubOrderID
WHERE OC.OrderComponentTypeID IN (1,2,4,5)
AND EXISTS (SELECT * FROM OrderQuoteLineItem InnerOQLI
WHERE InnerOQLI.OrderQuoteLineItemTypeID = 9 AND Order.ID = InnerO.ID)
AND Inner0.ID = ).ID)
)
- --< Minus Here
OQLI.FreePlaceCount
) AS [SHPCommExpression2]
FROM Order O
INNER JOIN SubOrder SO ON O.ID = SO.OrderID
INNER JOIN OrderComponent OC ON SO.ID = OC.SubOrderID
INNER JOIN OrderQuoteLineItem OQLI ON SO.ID = 0QLI.SubOrderID
),
Without going into to much detail, this brings back incorrect data because of repeated rows in the main query. (I believe it cos of the same joins within the main query)
So I then wrote this
MyCount2
AS
(SELECT InnerO.ID AS OrderID
SUM(InnerOC.[Count]
- InnerOQLI.FreePlaceCount) --- Tried to place subtraction here ----
AS [SHPCommExpression12])
FROM Order InnerO
INNER JOIN SubOrder InnerSO ON InnerO.ID = InnerSO.OrderID
INNER JOIN OrderComponent InnerOC ON SO.ID = OC.SubOrderID
WHERE OC.OrderComponentTypeID IN (1,2,4,5)
AND EXISTS (SELECT * FROM OrderQuoteLineItem InnerOQLI
WHERE InnerOQLI.OrderQuoteLineItemTypeID = 9 AND Order.ID = InnerO.ID)
GROUP BY InnerO.ID)
),
You can see where I've attempted to perform the subtraction, but It doesn't recognize InnerOQLI, where I've tried to add it to perform the subtraction. I can't work out how to correct this, I realize that it cant fully recognize the InnerOQLI since it's in the Exists statement, Is there away around this? If anyone could help I'd appreciate it
Thanks
Instead of trying to perform the subtraction within the scope of the CTE, I just had to place it into the sub query which was using this particular CTE, in the main select list.
The Problem was, `InnerOQLI.FreePlaceCount is not being recognized, since it hasn't been defined within the scope of the CTE and is only being used in the Exists statement.