Redshift subquery not accepted - sql

I'm trying to execute the following query against my dataset stored in Redshift:
SELECT v_users.user_id AS user_id,
v_users.first_name AS first_name,
v_users.email AS email,
COALESCE(v_users.country, accounts.region) AS country_code,
profiles.language AS language,
v_users.mobilenum AS mobile_num,
NULL as mobile_verification_date,
COALESCE(v_users.registration_date, accounts.date_created) AS activation_date,
EXISTS (SELECT 1
FROM cds.user_session_201612 AS users_session,
cds.access_logs_summary_201612 AS access_logs_summary,
views_legacy AS views_legacy
WHERE users_session.userid = v_users.user_id
OR access_logs_summary.userid = v_users.user_id
OR views_legacy.user_id = v_users.user_id) AS has_viewed,
NULL as preferred_genre_1,
NULL as preferred_genre_2,
NULL as preferred_genre_3
FROM users AS v_users,
users_metadata AS v_users_metadata,
account.account AS accounts,
account.profile AS profiles
WHERE accounts.id = v_users.user_id
AND profiles.id = v_users.user_id
AND v_users_metadata.user_id = v_users.user_id
The error I get is the following:
ERROR: This type of correlated subquery pattern is not supported due to internal error
It is caused by the subquery, but how can I solve it? Can you give me some suggestions?

Redshift doesn't allow correlated subqueries in the SELECT clause, which I don't think is a limitation as all the examples I've encountered can be otherwise expressed.
I've refactored the subquery as a CTE and used a left join with an IS NOT NULL check to mark whether each user has viewed something.
This particular query below may not work, but any solution will likely take the following form:
WITH has_viewed AS (
SELECT
u.user_id
FROM users u
LEFT JOIN cds.user_session_201612 AS users_session
ON users_session.userid = u.user_id
LEFT JOIN cds.access_logs_summary_201612 AS access_logs_summary
ON access_logs_summary.userid = u.user_id
LEFT JOIN views_legacy
ON views_legacy.user_id = u.user_id
WHERE users_session.userid IS NOT NULL
OR access_logs_summary.userid IS NOT NULL
OR views_legacy.user_id IS NOT NULL
GROUP BY 1
)
SELECT
v_users.user_id AS user_id
, v_users.first_name AS first_name
, v_users.email AS email
, COALESCE(v_users.country, accounts.region) AS country_code
, profiles.language AS language
, v_users.mobilenum AS mobile_num
, NULL as mobile_verification_date
, COALESCE(v_users.registration_date, accounts.date_created) AS activation_date
, has_viewed.user_id IS NOT NULL AS has_viewed
, NULL as preferred_genre_1
, NULL as preferred_genre_2
, NULL as preferred_genre_3
FROM users AS v_users
JOIN users_metadata AS v_users_metadata
ON v_users_metadata.user_id = v_users.user_id
JOIN account.account AS accounts
ON accounts.id = v_users.user_id
JOIN account.profile AS profiles ON profiles.id = v_users.user_id
LEFT JOIN has_viewed
ON has_viewed.user_id = v_users.user_id
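The shape of this rewrite can be tried out on a small scale. The following is only a sketch under assumptions: it runs against an in-memory SQLite database rather than Redshift, the tables are cut down to two activity sources, and the sample rows and the function name users_with_view_flag are made up for illustration.

```python
import sqlite3


def users_with_view_flag():
    """Flag each user as viewed / not viewed without a correlated subquery."""
    con = sqlite3.connect(":memory:")
    # Hypothetical, simplified stand-ins for the tables in the question.
    con.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT);
        CREATE TABLE user_session (userid INTEGER);
        CREATE TABLE views_legacy (user_id INTEGER);
        INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO user_session VALUES (1), (1);
        INSERT INTO views_legacy VALUES (3);
    """)
    rows = con.execute("""
        WITH has_viewed AS (
            -- One row per user that appears in at least one activity table.
            SELECT u.user_id
            FROM users u
            LEFT JOIN user_session s ON s.userid = u.user_id
            LEFT JOIN views_legacy v ON v.user_id = u.user_id
            WHERE s.userid IS NOT NULL
               OR v.user_id IS NOT NULL
            GROUP BY u.user_id
        )
        SELECT u.user_id,
               u.first_name,
               hv.user_id IS NOT NULL AS has_viewed
        FROM users u
        LEFT JOIN has_viewed hv ON hv.user_id = u.user_id
        ORDER BY u.user_id
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in users_with_view_flag():
        print(row)
```

The GROUP BY in the CTE matters: without it, a user with several session rows would appear several times and the final left join would duplicate that user.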

I have tried all the combinations I could think of:
A subquery in the SELECT list doesn't work.
A CTE (Common Table Expression), as shown by Haleemur Ali, doesn't work either.
I needed an alternative to GROUP BY, because Redshift rejected the GROUP BY in this query.
The solution I found is the OVER keyword.
As a replacement for GROUP BY, I used OVER with PARTITION BY, like this:
SELECT *
FROM (
SELECT *,ROW_NUMBER()
OVER (PARTITION BY **VARIOUS COLUMNS** ORDER BY datetime DESC) rn
FROM schema.tableName
) derivedTable
WHERE derivedTable.rn = 1;
Maybe OVER might help you out. I am not sure though.

Multiple left joins with aggregation on same table causes huge performance hit in SAP HANA

I am joining two tables on HANA and, to get some statistics, I am LEFT joining the items table 3 times to get a total count, number of entries processed and number of errors, as shown below.
This is a dev system and the items table has only 1500 items. But the query below runs for 17 seconds.
When I remove any of the three aggregation terms (but leave the corresponding JOIN in place), the query executes almost immediately.
I have also tried adding indexes on the fields used in the specific JOINs, but that makes no difference.
select rk.guid, rk.run_id, rk.status, rk.created_at, rk.created_by,
count( distinct rp.guid ),
count( distinct rp2.guid ),
count( distinct rp3.guid )
from zbsbpi_rk as rk
left join zbsbpi_rp as rp
on rp.header = rk.guid
left join zbsbpi_rp as rp2
on rp2.header = rk.guid
and rp2.processed = 'X'
left join zbsbpi_rp as rp3
on rp3.header = rk.guid
and rp3.result_status = 'E'
where rk.run_id = '0000000010'
group by rk.guid, run_id, status, created_at, created_by
I think you can rewrite your query to improve the performance:
select rk.guid, rk.run_id, rk.status, rk.created_at, rk.created_by,
count( distinct rp.guid ),
count( distinct (CASE WHEN rp.processed = 'X' then rp.guid else null end) ),
count( distinct (CASE WHEN rp.result_status = 'E' then rp.guid else null end))
from zbsbpi_rk as rk
left join zbsbpi_rp as rp
on rp.header = rk.guid
where rk.run_id = '0000000010'
group by rk.guid, run_id, status, created_at, created_by
I'm not entirely sure whether the COUNT(DISTINCT CASE ...) construct works on HANA, but you may try it.
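To see what the single-join version computes, here is a small sketch. It is an assumption-laden illustration only: it uses SQLite instead of HANA, the tables rk and rp are reduced stand-ins for zbsbpi_rk and zbsbpi_rp, and the rows and the function name run_statistics are invented.

```python
import sqlite3


def run_statistics():
    """Count total, processed and failed items per header with one join."""
    con = sqlite3.connect(":memory:")
    # Hypothetical sample data: header H1 has three items, H2 has none.
    con.executescript("""
        CREATE TABLE rk (guid TEXT PRIMARY KEY, run_id TEXT);
        CREATE TABLE rp (guid TEXT PRIMARY KEY, header TEXT,
                         processed TEXT, result_status TEXT);
        INSERT INTO rk VALUES ('H1', '0000000010'), ('H2', '0000000010');
        INSERT INTO rp VALUES
            ('I1', 'H1', 'X', 'S'),
            ('I2', 'H1', 'X', 'E'),
            ('I3', 'H1', '',  'S');
    """)
    rows = con.execute("""
        SELECT rk.guid,
               COUNT(DISTINCT rp.guid) AS total,
               -- CASE yields NULL for non-matching rows, which COUNT ignores.
               COUNT(DISTINCT CASE WHEN rp.processed = 'X'
                                   THEN rp.guid END) AS processed,
               COUNT(DISTINCT CASE WHEN rp.result_status = 'E'
                                   THEN rp.guid END) AS errors
        FROM rk
        LEFT JOIN rp ON rp.header = rk.guid
        WHERE rk.run_id = '0000000010'
        GROUP BY rk.guid
        ORDER BY rk.guid
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(run_statistics())
```

Because the items table is joined only once, the intermediate result has one row per item instead of the product of three joins, and a header without items still shows up with zero counts.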
My apologies, but I forgot that I had posted this question here. I had posted the same question at answers.sap.com after not getting any joy here: https://answers.sap.com/questions/172096/multiple-left-joins-with-aggregation-on-same-table.html
I eventually came up with the solution, which was a bit of a "doh!" moment:
select rk.guid, rk.run_id, rk.status, rk.created_at, rk.created_by,
count( distinct rp.guid ),
count( distinct rp2.guid ),
count( distinct rp3.guid )
from zbsbpi_rk as rk
join zbsbpi_rp as rp
on rp.header = rk.guid
left join zbsbpi_rp as rp2
on rp2.guid = rp.guid
and rp2.processed = 'X'
left join zbsbpi_rp as rp3
on rp3.guid = rp.guid
and rp3.result_status = 'E'
where rk.run_id = '0000000010'
group by rk.guid, run_id, status, created_at, created_by
The subsequent left joins needed only to be joined to the first join on the same table, as the first join contained a superset of all the records anyway.

Providing Language FallBack In A SQL Select Statement

I have a table that represents an Object. It has many columns but also fields that require language support.
For simplicity let's say I have 3 tables:
MainObjectTable
LanguageDependantField1
LanguageDependantField2.
MainObjectTable has a PK int called ID, and both LanguageDependantTables have a foreign key link back to the MainObjectTable along with a language code and the date they were added.
I've created a stored procedure that accepts the MainObjectTable ID and a Language. It will return a single row containing the most recent items from the language tables. The select statement looks like
SELECT
MainObjectTable.VariousColumns,
LanguageDependantField1.Description,
LanguageDependantField2.SomeOtherText
FROM
MainObjectTable
OUTER APPLY
(SELECT TOP 1 LanguageDependantField1.Description
FROM LanguageDependantField1
WHERE LanguageDependantField1.MainObjectTable_ID = MainObjectTable.ID
AND LanguageDependantField1.Language_ID = @language
ORDER BY
LanguageDependantField1.[Default], LanguageDependantField1.CreatedDate DESC) LanguageDependantField1
OUTER APPLY
(SELECT TOP 1 LanguageDependantField2.SomeOtherText
FROM LanguageDependantField2
WHERE LanguageDependantField2.MainObjectTable_ID = MainObjectTable.ID
AND LanguageDependantField2.Language_ID = @language
ORDER BY
LanguageDependantField2.[Default] DESC, LanguageDependantField2.CreatedDate DESC) LanguageDependantField2
WHERE
MainObjectTable.ID = @MainObjectTableID
What I want to add is the ability to fall back to a default language if a row isn't found in the specified language. Let's say we use "German" as the selected language. Is it possible to return an English row from LanguageDependantField1 if the German one does not exist, presuming we have @fallbackLanguageID?
Also, am I right to use OUTER APPLY in this scenario, or should I be using JOIN?
Many thanks for your help.
Try this:
SELECT MainObjectTable.VariousColumns,
COALESCE(PrefLang.Description,Fallback.Description,'Not Found Desc')
as Description,
COALESCE(PrefLang.SomeOtherText,FallBack.SomeOtherText,'Not found')
as SomeOtherText
FROM MainObjectTable
LEFT JOIN
(SELECT TOP 1 pl.Description,pl.SomeOtherText
FROM LanguageDependantField1 pl
WHERE pl.MainObjectTable_ID = MainObjectTable.ID
AND pl.Language_ID = @language
ORDER BY
pl.[Default], pl.CreatedDate DESC)
PrefLang ON 1=1
LEFT JOIN
(SELECT TOP 1 fb.Description,fb.SomeOtherText
FROM LanguageDependantField1 fb
WHERE fb.MainObjectTable_ID = MainObjectTable.ID
AND fb.Language_ID = @fallbackLanguageID
ORDER BY
fb.[Default], fb.CreatedDate DESC)
Fallback ON 1=1
WHERE
MainObjectTable.ID = @MainObjectTableID
Basically, make two queries, one to the preferred language and one to English (Default). Use the LEFT JOIN, so if the first one isn't found, the second query is used...
I don't have your actual tables, so there might be a syntax error in above, but hope it gives you the concept you want to try...
Yes, the use of Outer Apply is correct if you want to correlate the MainObjectTable table rows to the inner queries. You cannot use Joins with references in the derived table to the outer table. If you wanted to use Joins, you would need to include the joining column(s) and in this case pre-filter the results. Here is what that might look like:
With RankedLanguages As
(
Select LDF1.MainObjectTable_ID, LDF1.Language_ID, LDF1.Description, LDF1.SomeOtherText, ...
, Row_Number() Over ( Partition By LDF1.MainObjectTable_ID, LDF1.Language_ID
Order By LDF1.[Default] Desc, LDF1.CreatedDate Desc ) As Rnk
From LanguageDependantField1 As LDF1
Where LDF1.Language_ID In( @languageId, @defaultLanguageId )
)
Select M.VariousColumns
, Coalesce( SpecificLDF.Description, DefaultLDF.Description ) As Description
, Coalesce( SpecificLDF.SomeOtherText, DefaultLDF.SomeOtherText ) As SomeOtherText
, ...
From MainObjectTable As M
Left Join RankedLanguages As SpecificLDF
On SpecificLDF.MainObjectTable_ID = M.ID
And SpecificLDF.Language_ID = @languageId
And SpecificLDF.Rnk = 1
Left Join RankedLanguages As DefaultLDF
On DefaultLDF.MainObjectTable_ID = M.ID
And DefaultLDF.Language_ID = @defaultLanguageId
And DefaultLDF.Rnk = 1
Where M.ID = @MainObjectTableID
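The ranked-CTE fallback can be exercised on a toy schema. This is a sketch under stated assumptions, not the poster's actual tables: it runs on SQLite (window functions need SQLite 3.25 or newer) instead of SQL Server, and the table names main_object and lang_field, the sample rows, and the function name describe are all hypothetical.

```python
import sqlite3


def describe(object_id, language_id, default_language_id):
    """Return (name, description), falling back to the default language."""
    con = sqlite3.connect(":memory:")
    # Hypothetical data: object 1 has German and English text, object 2 only English.
    con.executescript("""
        CREATE TABLE main_object (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE lang_field (main_object_id INTEGER, language_id TEXT,
                                 description TEXT, created_date TEXT);
        INSERT INTO main_object VALUES (1, 'widget'), (2, 'gadget');
        INSERT INTO lang_field VALUES
            (1, 'en', 'old widget',  '2020-01-01'),
            (1, 'en', 'new widget',  '2021-01-01'),
            (1, 'de', 'Widget (de)', '2020-06-01'),
            (2, 'en', 'gadget',      '2020-01-01');
    """)
    row = con.execute("""
        WITH ranked AS (
            -- Newest row per object and language gets rnk = 1.
            SELECT main_object_id, language_id, description,
                   ROW_NUMBER() OVER (PARTITION BY main_object_id, language_id
                                      ORDER BY created_date DESC) AS rnk
            FROM lang_field
            WHERE language_id IN (:lang, :fallback)
        )
        SELECT m.name,
               COALESCE(s.description, d.description) AS description
        FROM main_object m
        LEFT JOIN ranked s ON s.main_object_id = m.id
                          AND s.language_id = :lang
                          AND s.rnk = 1
        LEFT JOIN ranked d ON d.main_object_id = m.id
                          AND d.language_id = :fallback
                          AND d.rnk = 1
        WHERE m.id = :id
    """, {"lang": language_id, "fallback": default_language_id,
          "id": object_id}).fetchone()
    con.close()
    return row


if __name__ == "__main__":
    print(describe(1, "de", "en"))  # German text exists
    print(describe(2, "de", "en"))  # falls back to English
```

COALESCE picks the requested language when its join produced a row and otherwise takes the default-language row, so no procedural branching is needed.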

Eliminating "not in" for SQL command

I would like to take a sample of an Oracle table, but not include entries from another table. I have a query that currently works, but I'm pretty sure it will blow-up when the sub-select gets more than 1000 records.
select user_key from users sample(5)
where active_flag = 'Y'
and user_key not in (
select user_key from user_validation where validation_state <> 'expired'
);
How could this be rewritten without the NOT IN? I thought of using MINUS, but then my sample size would keep going down as new entries were added to the user_validation table.
You can do this with a left outer join:
select *
from (select u.user_key,
count(*) over () as numrecs
from users u left outer join
user_validation uv
on u.user_key = uv.user_key and
uv.validation_state <> 'expired'
where u.active_flag = 'Y' and uv.user_key is null
) t
where rownum <= numrecs * 0.05
You are using the sample clause. It is not clear if you just want the non-matches in the 5% you choose or if you want 5% of the data that is non-matches. This is the latter.
EDIT: Added example based on author's comment:
select user_key from (
select u.user_key, row_number() over (order by dbms_random.value) as randval
from users u
left outer join user_validation uv
on u.user_key = uv.user_key
and uv.validation_state <> 'expired'
where u.active_flag = 'Y'
and uv.user_key is null
) myrandomjoin where randval <=100;
select us.user_key
from users us -- sample(5)
where us.active_flag = 'Y'
and NOT EXISTS (
SELECT *
from user_validation nx
where nx.user_key = us.user_key
AND nx.validation_state <> 'expired'
);
BTW: I commented-out the sample(5) because I don't know what it means. (I strongly believe that it is not relevant, though)
select u.user_key from users u, user_validation uv
where u.active_flag = 'Y'
and u.user_key=uv.user_key
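and uv.validation_state = 'expired';
This was a double-negation query: x not in the list of non-expired IDs, which I treated as equivalent to x being in the list of expired IDs. That is what I did here, in addition to changing the subquery to a join. Note that the two are only equivalent when every user has exactly one row in user_validation with a non-null validation_state.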
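The rewrites proposed in this thread can be compared on a tiny dataset. The sketch below is illustrative only: it uses SQLite rather than Oracle, leaves out the sample clause, and the four users, their validation rows, and the function name compare_rewrites are invented for the comparison.

```python
import sqlite3


def compare_rewrites():
    """Return the user keys produced by three rewrites of the NOT IN query."""
    con = sqlite3.connect(":memory:")
    # Hypothetical data: user 1 has a pending validation, user 2 an expired one,
    # user 3 has no validation rows at all, user 4 is inactive.
    con.executescript("""
        CREATE TABLE users (user_key INTEGER PRIMARY KEY, active_flag TEXT);
        CREATE TABLE user_validation (user_key INTEGER, validation_state TEXT);
        INSERT INTO users VALUES (1, 'Y'), (2, 'Y'), (3, 'Y'), (4, 'N');
        INSERT INTO user_validation VALUES
            (1, 'pending'), (2, 'expired'), (4, 'expired');
    """)

    def keys(sql):
        return [row[0] for row in con.execute(sql)]

    not_exists = keys("""
        SELECT u.user_key FROM users u
        WHERE u.active_flag = 'Y'
          AND NOT EXISTS (SELECT 1 FROM user_validation uv
                          WHERE uv.user_key = u.user_key
                            AND uv.validation_state <> 'expired')
        ORDER BY u.user_key
    """)
    left_join = keys("""
        SELECT u.user_key FROM users u
        LEFT JOIN user_validation uv
               ON uv.user_key = u.user_key
              AND uv.validation_state <> 'expired'
        WHERE u.active_flag = 'Y' AND uv.user_key IS NULL
        ORDER BY u.user_key
    """)
    inner_join = keys("""
        SELECT u.user_key FROM users u
        JOIN user_validation uv ON uv.user_key = u.user_key
        WHERE u.active_flag = 'Y' AND uv.validation_state = 'expired'
        ORDER BY u.user_key
    """)
    con.close()
    return not_exists, left_join, inner_join


if __name__ == "__main__":
    print(compare_rewrites())
```

On this data the NOT EXISTS and LEFT JOIN forms agree, while the inner join on expired rows drops the user who has no validation rows.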

Limit join to one row

I have the following query:
SELECT sum((select count(*) as itemCount) * "SalesOrderItems"."price") as amount, 'rma' as
"creditType", "Clients"."company" as "client", "Clients".id as "ClientId", "Rmas".*
FROM "Rmas" JOIN "EsnsRmas" on("EsnsRmas"."RmaId" = "Rmas"."id")
JOIN "Esns" on ("Esns".id = "EsnsRmas"."EsnId")
JOIN "EsnsSalesOrderItems" on("EsnsSalesOrderItems"."EsnId" = "Esns"."id" )
JOIN "SalesOrderItems" on("SalesOrderItems"."id" = "EsnsSalesOrderItems"."SalesOrderItemId")
JOIN "Clients" on("Clients"."id" = "Rmas"."ClientId" )
WHERE "Rmas"."credited"=false AND "Rmas"."verifyStatus" IS NOT null
GROUP BY "Clients".id, "Rmas".id;
The problem is that the table "EsnsSalesOrderItems" can have the same EsnId in different entries. I want to restrict the query to only pull the last entry in "EsnsSalesOrderItems" that has the same "EsnId".
By "last" entry I mean the following:
The one that appears last in the table "EsnsSalesOrderItems". So for example if "EsnsSalesOrderItems" has two entries with "EsnId" = 6 and "createdAt" = '2012-06-19' and '2012-07-19' respectively it should only give me the entry from '2012-07-19'.
SELECT (count(*) * sum(s."price")) AS amount
, 'rma' AS "creditType"
, c."company" AS "client"
, c.id AS "ClientId"
, r.*
FROM "Rmas" r
JOIN "EsnsRmas" er ON er."RmaId" = r."id"
JOIN "Esns" e ON e.id = er."EsnId"
JOIN (
SELECT DISTINCT ON ("EsnId") *
FROM "EsnsSalesOrderItems"
ORDER BY "EsnId", "createdAt" DESC
) es ON es."EsnId" = e."id"
JOIN "SalesOrderItems" s ON s."id" = es."SalesOrderItemId"
JOIN "Clients" c ON c."id" = r."ClientId"
WHERE r."credited" = FALSE
AND r."verifyStatus" IS NOT NULL
GROUP BY c.id, r.id;
Your query in the question has an illegal aggregate over another aggregate:
sum((select count(*) as itemCount) * "SalesOrderItems"."price") as amount
Simplified and converted to legal syntax:
(count(*) * sum(s."price")) AS amount
But do you really want to multiply with the count per group?
I retrieve the single row per group in "EsnsSalesOrderItems" with DISTINCT ON. Detailed explanation:
Select first row in each GROUP BY group?
I also added table aliases and formatting to make the query easier to parse for human eyes. If you could avoid camel case you could get rid of all the double quotes clouding the view.
Something like:
join (
select "EsnId", "SalesOrderItemId",
row_number() over (partition by "EsnId" order by "createdAt" desc) as rn
from "EsnsSalesOrderItems"
) t ON t."EsnId" = "Esns"."id" and rn = 1
This selects the latest row per "EsnId" from "EsnsSalesOrderItems" based on the "createdAt" column. You can use any column that allows you to define an order on the rows that suits you.
But remember that the concept of the "last row" is only valid if you specify an order for the rows. A table as such is not ordered, nor is the result of a query unless you specify an ORDER BY.
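The row_number() approach can be checked against the example from the question (two entries for "EsnId" 6). The following is a sketch with assumptions: it runs on SQLite (3.25 or newer for window functions) rather than PostgreSQL, uses snake_case names instead of the quoted camel-case ones, and the sales order item IDs and the function name latest_item_per_esn are made up.

```python
import sqlite3


def latest_item_per_esn():
    """Keep only the most recently created row for each esn_id."""
    con = sqlite3.connect(":memory:")
    # Hypothetical rows: esn 6 has two entries, esn 7 has one.
    con.executescript("""
        CREATE TABLE esns_sales_order_items (
            esn_id INTEGER, sales_order_item_id INTEGER, created_at TEXT);
        INSERT INTO esns_sales_order_items VALUES
            (6, 100, '2012-06-19'),
            (6, 101, '2012-07-19'),
            (7, 102, '2012-05-01');
    """)
    rows = con.execute("""
        SELECT esn_id, sales_order_item_id, created_at
        FROM (
            SELECT esn_id, sales_order_item_id, created_at,
                   -- Number the rows of each esn_id from newest to oldest.
                   ROW_NUMBER() OVER (PARTITION BY esn_id
                                      ORDER BY created_at DESC) AS rn
            FROM esns_sales_order_items
        ) t
        WHERE rn = 1
        ORDER BY esn_id
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(latest_item_per_esn())
```

The derived table can then be joined in place of the full table, so each "EsnId" contributes at most one row to the aggregate.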
Necromancing because the answers are outdated.
Take advantage of the LATERAL keyword introduced in PG 9.3
left | right | inner JOIN LATERAL
I'll explain with an example:
Assuming you have a table "Contacts".
Now contacts have organisational units.
They can have one OU at a point in time, but N OUs at N points in time.
Now, if you have to query contacts and OU in a time period (not a reporting date, but a date range), you could N-fold increase the record count if you just did a left join.
So, to display the OU, you need to just join the first OU for each contact (where what shall be first is an arbitrary criterion - when taking the last value, for example, that is just another way of saying the first value when sorted by descending date order).
In SQL-server, you would use cross-apply (or rather OUTER APPLY since we need a left join), which will invoke a table-valued function on each row it has to join.
SELECT * FROM T_Contacts
--LEFT JOIN T_MAP_Contacts_Ref_OrganisationalUnit ON MAP_CTCOU_CT_UID = T_Contacts.CT_UID AND MAP_CTCOU_SoftDeleteStatus = 1
--WHERE T_MAP_Contacts_Ref_OrganisationalUnit.MAP_CTCOU_UID IS NULL -- 989
-- CROSS APPLY -- = INNER JOIN
OUTER APPLY -- = LEFT JOIN
(
SELECT TOP 1
--MAP_CTCOU_UID
MAP_CTCOU_CT_UID
,MAP_CTCOU_COU_UID
,MAP_CTCOU_DateFrom
,MAP_CTCOU_DateTo
FROM T_MAP_Contacts_Ref_OrganisationalUnit
WHERE MAP_CTCOU_SoftDeleteStatus = 1
AND MAP_CTCOU_CT_UID = T_Contacts.CT_UID
/*
AND
(
(@in_DateFrom <= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateTo)
AND
(@in_DateTo >= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateFrom)
)
*/
ORDER BY MAP_CTCOU_DateFrom
) AS FirstOE
In PostgreSQL, starting from version 9.3, you can do that, too - just use the LATERAL keyword to achieve the same:
SELECT * FROM T_Contacts
--LEFT JOIN T_MAP_Contacts_Ref_OrganisationalUnit ON MAP_CTCOU_CT_UID = T_Contacts.CT_UID AND MAP_CTCOU_SoftDeleteStatus = 1
--WHERE T_MAP_Contacts_Ref_OrganisationalUnit.MAP_CTCOU_UID IS NULL -- 989
LEFT JOIN LATERAL
(
SELECT
--MAP_CTCOU_UID
MAP_CTCOU_CT_UID
,MAP_CTCOU_COU_UID
,MAP_CTCOU_DateFrom
,MAP_CTCOU_DateTo
FROM T_MAP_Contacts_Ref_OrganisationalUnit
WHERE MAP_CTCOU_SoftDeleteStatus = 1
AND MAP_CTCOU_CT_UID = T_Contacts.CT_UID
/*
AND
(
(__in_DateFrom <= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateTo)
AND
(__in_DateTo >= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateFrom)
)
*/
ORDER BY MAP_CTCOU_DateFrom
LIMIT 1
) AS FirstOE
Try using a subquery in your ON clause. An abstract example:
SELECT
*
FROM table1
JOIN table2 ON table2.id = (
SELECT id FROM table2 WHERE table2.table1_id = table1.id LIMIT 1
)
WHERE
...

SQL WHERE In a many-to-many or many-to-many empty

Does anyone know a way to simplify this WHERE expression?
WHERE (
(@UserSpecialtyID in
(
SELECT CharacteristicSpecialties_Id
FROM ModalityVariantSpecialty
WHERE ModalityVariants_Id = ModalityVariants.Id
)
)
OR
NOT EXISTS
(
SELECT CharacteristicSpecialties_Id
FROM ModalityVariantSpecialty
WHERE ModalityVariants_Id = ModalityVariants.Id
)
)
Something like this should probably work, but I'm not exactly clear on the relationships between your tables. I could probably give a better example if you explained them.
SELECT
*
FROM ModalityVariants mv
LEFT JOIN ModalityVariantSpecialty mvs on mvs.ModalityVariants_ID = mv.ID
WHERE
@UserSpecialtyID = mvs.CharacteristicSpecialties_ID
OR
mvs.CharacteristicSpecialties_ID is null
WHERE (
@UserSpecialtyID in
(
SELECT COALESCE(CharacteristicSpecialties_Id, A.A)
FROM (SELECT @UserSpecialtyID A) A LEFT JOIN ModalityVariantSpecialty
ON ModalityVariants_Id = ModalityVariants.Id
)
)
This works well if CharacteristicSpecialties_Id is a non-nullable field.
I am assuming that this is a WHERE clause of a SELECT on the table ModalityVariants
Would this work (The SQL is not tested)?
SELECT *
FROM ModalityVariants
LEFT OUTER JOIN ModalityVariantSpecialty
ON ModalityVariants.Id = ModalityVariants_ID
WHERE CharacteristicSpecialties_Id = @UserSpecialtyID or
CharacteristicSpecialties_Id is NULL
Here's my attempt:
WHERE @UserSpecialtyID = COALESCE
(
(SELECT TOP 1 CharacteristicSpecialties_Id
FROM ModalityVariantSpecialty
WHERE ModalityVariants_Id = ModalityVariants.Id
ORDER BY
CASE WHEN CharacteristicSpecialties_Id = @UserSpecialtyID THEN 1
ELSE 2 END ASC
), @UserSpecialtyID)
If both ModalityVariants_Id and @UserSpecialtyID match, the subquery returns CharacteristicSpecialties_Id, and the WHERE succeeds.
If only ModalityVariants_Id matches, the subquery returns a different ID, and the WHERE fails.
If neither matches, the subquery returns NULL, the COALESCE returns @UserSpecialtyID, and the WHERE succeeds.
Probably clearest is a variant of John Hartsock's answer, with a subquery to ensure the left join doesn't add any rows.
select *
from ModalityVariants mv
left join
(
select distinct ModalityVariants_ID
, CharacteristicSpecialties_ID
from ModalityVariantSpecialty
) as mvs
on mvs.ModalityVariants_ID = mv.ID
where @UserSpecialtyID = mvs.CharacteristicSpecialties_ID
OR
mvs.CharacteristicSpecialties_ID is null
I'll vote for John's answer :)
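A quick way to convince yourself that the left-join form matches the original WHERE expression is to run both on a small dataset. This sketch rests on assumptions: it uses SQLite with a bound parameter in place of the SQL Server variable, the tables hold only the columns named in the question, and the sample rows and the function name visible_variants are hypothetical.

```python
import sqlite3


def visible_variants(user_specialty_id):
    """Return variant IDs from the original filter and the LEFT JOIN rewrite."""
    con = sqlite3.connect(":memory:")
    # Hypothetical data: variant 1 has specialties 10 and 20, variant 2 has 20,
    # variant 3 has no specialties and should always be visible.
    con.executescript("""
        CREATE TABLE ModalityVariants (Id INTEGER PRIMARY KEY);
        CREATE TABLE ModalityVariantSpecialty (
            ModalityVariants_Id INTEGER, CharacteristicSpecialties_Id INTEGER);
        INSERT INTO ModalityVariants VALUES (1), (2), (3);
        INSERT INTO ModalityVariantSpecialty VALUES (1, 10), (1, 20), (2, 20);
    """)
    params = {"sid": user_specialty_id}
    original = [r[0] for r in con.execute("""
        SELECT mv.Id FROM ModalityVariants mv
        WHERE :sid IN (SELECT CharacteristicSpecialties_Id
                       FROM ModalityVariantSpecialty
                       WHERE ModalityVariants_Id = mv.Id)
           OR NOT EXISTS (SELECT 1 FROM ModalityVariantSpecialty
                          WHERE ModalityVariants_Id = mv.Id)
        ORDER BY mv.Id
    """, params)]
    joined = [r[0] for r in con.execute("""
        SELECT mv.Id FROM ModalityVariants mv
        LEFT JOIN (SELECT DISTINCT ModalityVariants_Id,
                                   CharacteristicSpecialties_Id
                   FROM ModalityVariantSpecialty) mvs
               ON mvs.ModalityVariants_Id = mv.Id
        WHERE mvs.CharacteristicSpecialties_Id = :sid
           OR mvs.CharacteristicSpecialties_Id IS NULL
        ORDER BY mv.Id
    """, params)]
    con.close()
    return original, joined


if __name__ == "__main__":
    for sid in (10, 20, 99):
        print(sid, visible_variants(sid))
```

Both forms return the same variants here. The equivalence relies on CharacteristicSpecialties_Id never being NULL in the mapping table, as one of the answers above already points out.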