Eliminating "not in" for SQL command - sql

I would like to take a sample of an Oracle table, but not include entries from another table. I have a query that currently works, but I'm pretty sure it will blow-up when the sub-select gets more than 1000 records.
select user_key from users sample(5)
where active_flag = 'Y'
and user_key not in (
select user_key from user_validation where validation_state <> 'expired'
);
How could this be rewritten without the not in? I thought of using minus, but then my sample size would keep going down as new entries were added to the user_validation table.

You can do this with a left outer join:
select *
from (select u.user_key,
count(*) over () as numrecs
from users u left outer join
user_validation uv
on u.user_key = uv.user_key and
uv.validation_state <> 'expired'
where u.active_flag = 'Y' and uv.user_key is null
) t
where rownum <= numrecs * 0.05
You are using the sample clause. It is not clear if you just want the non-matches in the 5% you choose or if you want 5% of the data that is non-matches. This is the latter.
EDIT: Added example based on author's comment:
select user_key from (
select u.user_key, row_number() over (order by dbms_random.value) as randval
from users u
left outer join user_validation uv
on u.user_key = uv.user_key
and uv.validation_state <> 'expired'
where u.active_flag = 'Y'
and uv.user_key is null
) myrandomjoin where randval <=100;

select us.user_key
from users us -- sample(5)
where us.active_flag = 'Y'
and NOT EXISTS (
SELECT *
from user_validation nx
where nx.user_key = us.user_key
AND nx.validation_state <> 'expired'
);
BTW: I commented-out the sample(5) because I don't know what it means. (I strongly believe that it is not relevant, though)
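As a rough check of the rewrites above, the sketch below runs the original not in query, the NOT EXISTS form, and the left outer join form against a small in-memory SQLite database and compares the results. The sample rows are made up for illustration, and the Oracle sample(5) clause is left out because SQLite has no equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_key INTEGER PRIMARY KEY, active_flag TEXT);
CREATE TABLE user_validation (user_key INTEGER, validation_state TEXT);
-- hypothetical data: user 4 is inactive, user 5 has no validation rows
INSERT INTO users VALUES (1,'Y'),(2,'Y'),(3,'Y'),(4,'N'),(5,'Y');
INSERT INTO user_validation VALUES
  (1,'expired'),(2,'pending'),(3,'expired'),(3,'valid'),(4,'expired');
""")

def keys(sql):
    """Run a query and return the sorted list of user_key values."""
    return sorted(row[0] for row in conn.execute(sql))

not_in = keys("""
    SELECT user_key FROM users
    WHERE active_flag = 'Y'
      AND user_key NOT IN (
          SELECT user_key FROM user_validation
          WHERE validation_state <> 'expired')
""")

not_exists = keys("""
    SELECT u.user_key FROM users u
    WHERE u.active_flag = 'Y'
      AND NOT EXISTS (
          SELECT 1 FROM user_validation uv
          WHERE uv.user_key = u.user_key
            AND uv.validation_state <> 'expired')
""")

left_join = keys("""
    SELECT u.user_key FROM users u
    LEFT JOIN user_validation uv
      ON uv.user_key = u.user_key
     AND uv.validation_state <> 'expired'
    WHERE u.active_flag = 'Y' AND uv.user_key IS NULL
""")

print(not_in, not_exists, left_join)
```

One caveat worth keeping in mind: if user_validation.user_key can be NULL, not in returns no rows at all, while NOT EXISTS and the join form are unaffected.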

select u.user_key from users u, user_validation uv
where u.active_flag = 'Y'
and u.user_key=uv.user_key
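and uv.validation_state = 'expired';
This was a double-negation query: "x not in the list of non-expired IDs" becomes "x is in the list of expired IDs", which is what I did, in addition to changing the subquery to a join. Note that the two are only equivalent if every user has exactly one row in user_validation: a user with no validation rows passes the original not in but is dropped by this join.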

Related

Enter data for missing category in snowflake

I have a table like
For each keyword, there are 2 devices - mobile and desktop. If entry for only one device is found, then it should automatically create the entry for other device keeping the data in rest of the columns same. I am currently doing a full outer join which is working fine for the case where one device category is missing but generating duplicates where both devices are present. For example,
my current query is giving the result as
select a.keyword, b.device, a.rating
from kw a full outer join kw b
on a.keyword=b.keyword and a.rating=b.rating
How do I get the result as
The first step will be to identify records that don't have a paired record. There's a couple of ways to do this, but the easiest is probably just a quick GROUP BY/HAVING:
SELECT keyword
FROM kw
GROUP BY keyword
HAVING COUNT(*) = 1
You can then join those results back into the original table to generate the new records that are needed:
SELECT sk.keyword,
CASE WHEN kw.device = 'mobile' THEN 'desktop' ELSE 'mobile' END as device,
kw.rating
FROM
(
SELECT keyword
FROM kw
GROUP BY keyword
HAVING COUNT(*) = 1
)sk
INNER JOIN kw ON kw.keyword = sk.keyword
Then you can UNION back in the original table to bring your new records and existing records into a single result set:
SELECT sk.keyword,
CASE WHEN kw.device = 'mobile' THEN 'desktop' ELSE 'mobile' END as device,
kw.rating
FROM
(
SELECT keyword
FROM kw
GROUP BY keyword
HAVING COUNT(*) = 1
)sk
INNER JOIN kw ON kw.keyword = sk.keyword
UNION ALL
SELECT * FROM kw;
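To see the GROUP BY/HAVING plus UNION ALL approach end to end, here is a minimal sketch using an in-memory SQLite database in place of Snowflake. The kw rows are invented for illustration, since the question's sample table is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kw (keyword TEXT, device TEXT, rating INTEGER);
-- hypothetical data: 'shoes' has both devices, 'hats' and 'socks' have one
INSERT INTO kw VALUES
  ('shoes', 'mobile', 5),
  ('shoes', 'desktop', 3),
  ('hats', 'mobile', 4),
  ('socks', 'desktop', 2);
""")

filled = conn.execute("""
    -- new rows: the missing device for keywords that have only one row
    SELECT sk.keyword,
           CASE WHEN kw.device = 'mobile' THEN 'desktop' ELSE 'mobile' END AS device,
           kw.rating
    FROM (SELECT keyword FROM kw GROUP BY keyword HAVING COUNT(*) = 1) sk
    INNER JOIN kw ON kw.keyword = sk.keyword
    UNION ALL
    -- existing rows, unchanged
    SELECT keyword, device, rating FROM kw
    ORDER BY 1, 2
""").fetchall()

for row in filled:
    print(row)
```

Keywords that already have both devices are left alone, so no duplicates appear for them.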
Another option, which will scale if you add more devices, is to cross join all the potential device/keyword combinations and then left join to your original table:
SELECT
fe.keyword,
fe.device,
CASE WHEN kw.rating IS NULL THEN max(rating) OVER (PARTITION BY fe.keyword) ELSE kw.rating END AS rating
FROM
(
SELECT DISTINCT kw.keyword, kw2.device
FROM kw, kw kw2
) fe
LEFT OUTER JOIN kw ON kw.keyword = fe.keyword
AND kw.device = fe.device;

(probably) very simple SQL query needed

Having a slow day....could use some assistance writing a simple ANSI SQL query.
I have a list of individuals within families (first and last names), and a second table which lists a subset of those individuals. I would like to create a third table which flags every individual within a family if ANY of the individuals are not listed in the second table. The goal is essentially to flag "incomplete" families.
Below is an example of the two input tables, and the desired third table.
As I said...very simple...having a slow day. Thanks!
I think you want a left join and case expression:
select t1.*,
(case when t2.first_name is null then 'INCOMPLETE' else 'OK' end) as flag
from table1 t1 left join
table2 t2
on t1.first_name = t2.first_name and t1.last_name = t2.last_name;
Of course, this marks "Diane Thomson" as "OK", but I think that is an error in the question.
EDIT:
Oh, I see. The last name defines the family (that seems like a pretty big assumption). But you can do this with window functions:
select t1.*,
(case when count(t2.first_name) over (partition by t1.last_name) =
count(*) over (partition by t1.last_name)
then 'OK'
else 'INCOMPLETE'
end) as flag
from table1 t1 left join
table2 t2
on t1.first_name = t2.first_name and t1.last_name = t2.last_name;
That's not simple, at least not in SAS :-)
Standard SQL, when Windowed Aggregates are supported:
select ft.*,
-- counts differ when st.first_name is null due to the outer join
case when count(*) over (partition by ft.last_name)
= count(st.first_name) over (partition by ft.last_name)
then 'OK'
else 'INCOMPLETE'
end
from first_table as ft
left join second_table as st
on ft.first_name = st.first_name
and ft.last_name = st.last_name
Otherwise you need a standard aggregate and a join back:
select ft.*, st.flag
from first_table as ft
join
(
select ft.last_name,
case when count(*)
= count(st.first_name)
then 'OK'
else 'INCOMPLETE'
end as flag
from first_table as ft
left join second_table as st
on ft.first_name = st.first_name
and ft.last_name = st.last_name
group by ft.last_name
) as st
on ft.last_name = st.last_name
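The aggregate-and-join-back variant can be tried out with the sketch below, which uses an in-memory SQLite database. The names are hypothetical stand-ins for the question's tables, which are not shown, and the family is assumed to be defined by last name, as in the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE first_table (first_name TEXT, last_name TEXT);
CREATE TABLE second_table (first_name TEXT, last_name TEXT);
-- hypothetical data: Dan Jones is missing from second_table
INSERT INTO first_table VALUES
  ('Ann', 'Smith'), ('Bob', 'Smith'),
  ('Cara', 'Jones'), ('Dan', 'Jones');
INSERT INTO second_table VALUES
  ('Ann', 'Smith'), ('Bob', 'Smith'), ('Cara', 'Jones');
""")

rows = conn.execute("""
    SELECT ft.first_name, ft.last_name, fam.flag
    FROM first_table AS ft
    JOIN (
        -- one row per family: COUNT(s.first_name) skips the NULLs
        -- produced by the outer join, so the counts differ when
        -- any family member is missing from second_table
        SELECT f.last_name,
               CASE WHEN COUNT(*) = COUNT(s.first_name)
                    THEN 'OK' ELSE 'INCOMPLETE' END AS flag
        FROM first_table AS f
        LEFT JOIN second_table AS s
          ON f.first_name = s.first_name
         AND f.last_name = s.last_name
        GROUP BY f.last_name
    ) AS fam ON fam.last_name = ft.last_name
    ORDER BY ft.last_name, ft.first_name
""").fetchall()

for row in rows:
    print(row)
```

Every member of the Jones family is flagged, including Cara, who is present in the second table.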
It is pretty easy to do in SAS if you want to take advantage of its non-ANSI SQL feature of automatically re-merging aggregate function results back onto detail records.
select
a.first
, a.last
, case when 1=max(missing(b.last)) then 'INCOMPLETE'
else 'OK'
end as flag
from table1 a left join table2 b
on a.last=b.last and a.first=b.first
group by 2
order by 2,1
;

Multiple left joins with aggregation on same table causes huge performance hit in SAP HANA

I am joining two tables on HANA and, to get some statistics, I am LEFT joining the items table 3 times to get a total count, number of entries processed and number of errors, as shown below.
This is a dev system and the items table has only 1500 items. But the query below runs for 17 seconds.
When I remove any of the three aggregation terms (but leave the corresponding JOIN in place), the query executes almost immediately.
I have also tried adding indexes on the fields used in the specific JOINs, but that makes no difference.
select rk.guid, rk.run_id, rk.status, rk.created_at, rk.created_by,
count( distinct rp.guid ),
count( distinct rp2.guid ),
count( distinct rp3.guid )
from zbsbpi_rk as rk
left join zbsbpi_rp as rp
on rp.header = rk.guid
left join zbsbpi_rp as rp2
on rp2.header = rk.guid
and rp2.processed = 'X'
left join zbsbpi_rp as rp3
on rp3.header = rk.guid
and rp3.result_status = 'E'
where rk.run_id = '0000000010'
group by rk.guid, run_id, status, created_at, created_by
I think you can rewrite your query to improve the performance:
select rk.guid, rk.run_id, rk.status, rk.created_at, rk.created_by,
count( distinct rp.guid ),
count( distinct (CASE WHEN rp.processed = 'X' then rp.guid else null end) ),
count( distinct (CASE WHEN rp.result_status = 'E' then rp.guid else null end))
from zbsbpi_rk as rk
left join zbsbpi_rp as rp
on rp.header = rk.guid
where rk.run_id = '0000000010'
group by rk.guid, run_id, status, created_at, created_by
I'm not entirely sure whether the count distinct case construct will work on HANA, but you can try it.
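The sketch below compares the three-join query with the single-join conditional aggregation on an in-memory SQLite database. It is not HANA, and the shortened table names and rows are invented, so it only illustrates that the two forms return the same counts. With three joins on the same header, the intermediate result per header has (items × processed items × error items) rows, which is one plausible reason the original query slows down as the item count grows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- rk / rp are shortened, hypothetical stand-ins for zbsbpi_rk / zbsbpi_rp
CREATE TABLE rk (guid TEXT PRIMARY KEY, run_id TEXT);
CREATE TABLE rp (guid TEXT PRIMARY KEY, header TEXT,
                 processed TEXT, result_status TEXT);
INSERT INTO rk VALUES ('H1', '0000000010'), ('H2', '0000000010');
INSERT INTO rp VALUES
  ('I1', 'H1', 'X', 'S'),
  ('I2', 'H1', 'X', 'E'),
  ('I3', 'H1', '',  ''),
  ('I4', 'H2', 'X', 'E');
""")

three_joins = conn.execute("""
    SELECT rk.guid,
           COUNT(DISTINCT rp.guid),
           COUNT(DISTINCT rp2.guid),
           COUNT(DISTINCT rp3.guid)
    FROM rk
    LEFT JOIN rp        ON rp.header  = rk.guid
    LEFT JOIN rp AS rp2 ON rp2.header = rk.guid AND rp2.processed = 'X'
    LEFT JOIN rp AS rp3 ON rp3.header = rk.guid AND rp3.result_status = 'E'
    WHERE rk.run_id = '0000000010'
    GROUP BY rk.guid
    ORDER BY rk.guid
""").fetchall()

one_join = conn.execute("""
    SELECT rk.guid,
           COUNT(DISTINCT rp.guid),
           -- CASE without ELSE yields NULL, which COUNT ignores
           COUNT(DISTINCT CASE WHEN rp.processed = 'X' THEN rp.guid END),
           COUNT(DISTINCT CASE WHEN rp.result_status = 'E' THEN rp.guid END)
    FROM rk
    LEFT JOIN rp ON rp.header = rk.guid
    WHERE rk.run_id = '0000000010'
    GROUP BY rk.guid
    ORDER BY rk.guid
""").fetchall()

print(three_joins)
print(one_join)
```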
My apologies, but I forgot that I had posted this question here. I had posted the same question at answers.sap.com after not getting any joy here: https://answers.sap.com/questions/172096/multiple-left-joins-with-aggregation-on-same-table.html
I eventually came up with the solution, which was a bit of a "doh!" moment:
select rk.guid, rk.run_id, rk.status, rk.created_at, rk.created_by,
count( distinct rp.guid ),
count( distinct rp2.guid ),
count( distinct rp3.guid )
from zbsbpi_rk as rk
join zbsbpi_rp as rp
on rp.header = rk.guid
left join zbsbpi_rp as rp2
on rp2.guid = rp.guid
and rp2.processed = 'X'
left join zbsbpi_rp as rp3
on rp3.guid = rp.guid
and rp3.result_status = 'E'
where rk.run_id = '0000000010'
group by rk.guid, run_id, status, created_at, created_by
The subsequent left joins needed only to be joined to the first join on the same table, as the first join contained a superset of all the records anyway.

Redshift subquery not accepted

I'm trying to execute the following query against my dataset stored in Redshift:
SELECT v_users.user_id AS user_id,
v_users.first_name AS first_name,
v_users.email AS email,
COALESCE(v_users.country, accounts.region) AS country_code,
profiles.language AS language,
v_users.mobilenum AS mobile_num,
NULL as mobile_verification_date,
COALESCE(v_users.registration_date, accounts.date_created) AS activation_date,
EXISTS (SELECT 1
FROM cds.user_session_201612 AS users_session,
cds.access_logs_summary_201612 AS access_logs_summary,
views_legacy AS views_legacy
WHERE users_session.userid = v_users.user_id
OR access_logs_summary.userid = v_users.user_id
OR views_legacy.user_id = v_users.user_id) AS has_viewed,
NULL as preferred_genre_1,
NULL as preferred_genre_2,
NULL as preferred_genre_3
FROM users AS v_users,
users_metadata AS v_users_metadata,
account.account AS accounts,
account.profile AS profiles
WHERE accounts.id = v_users.user_id
AND profiles.id = v_users.user_id
AND v_users_metadata.user_id = v_users.user_id
The problem which I get is the following:
ERROR: This type of correlated subquery pattern is not supported due to internal error
The error is caused by the subquery, but how can I solve it? Can you give me some suggestions?
Redshift doesn't allow correlated subqueries in the SELECT clause, which I don't think is a limitation as all the examples I've encountered can be otherwise expressed.
I've refactored the subquery as a CTE, and used a left join with an is not null check to mark users who have or have not viewed something.
This particular query below may not work, but any solution will likely take the following form:
WITH has_viewed AS (
SELECT
u.user_id
FROM users u
LEFT JOIN cds.user_session_201612 AS users_session
ON users_session.userid = u.user_id
LEFT JOIN cds.access_logs_summary_201612 AS access_logs_summary
ON access_logs_summary.userid = u.user_id
LEFT JOIN views_legacy
ON views_legacy.user_id = u.user_id
WHERE users_session.userid IS NOT NULL
OR access_logs_summary.userid IS NOT NULL
OR views_legacy.user_id IS NOT NULL
GROUP BY 1
)
SELECT
v_users.user_id AS user_id
, v_users.first_name AS first_name
, v_users.email AS email
, COALESCE(v_users.country, accounts.region) AS country_code
, profiles.language AS language
, v_users.mobilenum AS mobile_num
, NULL as mobile_verification_date
, COALESCE(v_users.registration_date, accounts.date_created) AS activation_date
, has_viewed.user_id IS NOT NULL AS has_viewed
, NULL as preferred_genre_1
, NULL as preferred_genre_2
, NULL as preferred_genre_3
FROM users AS v_users
JOIN users_metadata AS v_users_metadata
ON v_users_metadata.user_id = v_users.user_id
JOIN account.account AS accounts
ON accounts.id = v_users.user_id
JOIN account.profile AS profiles ON profiles.id = v_users.user_id
LEFT JOIN has_viewed
ON has_viewed.user_id = v_users.user_id
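The general shape of that refactoring (compute the set of users who have viewed something once, then left join to it) can be sketched as follows. This runs on an in-memory SQLite database rather than Redshift, the table contents are made up, and the CTE here uses UNION over the three sources instead of the three left joins above, which is a swapped-in variant and not the answer's exact query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- simplified, hypothetical versions of the tables in the question
CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE user_session (userid INTEGER);
CREATE TABLE access_logs_summary (userid INTEGER);
CREATE TABLE views_legacy (user_id INTEGER);
INSERT INTO users VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy'), (4, 'Di');
INSERT INTO user_session VALUES (1), (1);
INSERT INTO access_logs_summary VALUES (2);
INSERT INTO views_legacy VALUES (1), (3);
""")

rows = conn.execute("""
    WITH has_viewed AS (
        -- UNION removes duplicates, so each user appears at most once
        SELECT userid AS user_id FROM user_session
        UNION
        SELECT userid FROM access_logs_summary
        UNION
        SELECT user_id FROM views_legacy
    )
    SELECT u.user_id,
           u.first_name,
           hv.user_id IS NOT NULL AS has_viewed
    FROM users AS u
    LEFT JOIN has_viewed AS hv ON hv.user_id = u.user_id
    ORDER BY u.user_id
""").fetchall()

for row in rows:
    print(row)
```

Because the CTE yields at most one row per user, the left join cannot multiply rows in the outer query. SQLite reports the boolean as 1 or 0.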
I have tried all of these combinations:
The SELECT subquery doesn't work.
The CTE (Common Table Expression) shown by Haleemur Ali doesn't work either.
What I needed was an alternative to GROUP BY, because Redshift would not accept GROUP BY in my query.
The solution I found is the OVER keyword: as a replacement for GROUP BY, I used OVER with PARTITION BY, like this:
SELECT *
FROM (
SELECT *,ROW_NUMBER()
OVER (PARTITION BY **VARIOUS COLUMNS** ORDER BY datetime DESC) rn
FROM schema.tableName
) derivedTable
WHERE derivedTable.rn = 1;
Maybe OVER might help you out. I am not sure though.

Limit join to one row

I have the following query:
SELECT sum((select count(*) as itemCount) * "SalesOrderItems"."price") as amount, 'rma' as
"creditType", "Clients"."company" as "client", "Clients".id as "ClientId", "Rmas".*
FROM "Rmas" JOIN "EsnsRmas" on("EsnsRmas"."RmaId" = "Rmas"."id")
JOIN "Esns" on ("Esns".id = "EsnsRmas"."EsnId")
JOIN "EsnsSalesOrderItems" on("EsnsSalesOrderItems"."EsnId" = "Esns"."id" )
JOIN "SalesOrderItems" on("SalesOrderItems"."id" = "EsnsSalesOrderItems"."SalesOrderItemId")
JOIN "Clients" on("Clients"."id" = "Rmas"."ClientId" )
WHERE "Rmas"."credited"=false AND "Rmas"."verifyStatus" IS NOT null
GROUP BY "Clients".id, "Rmas".id;
The problem is that the table "EsnsSalesOrderItems" can have the same EsnId in different entries. I want to restrict the query to only pull the last entry in "EsnsSalesOrderItems" that has the same "EsnId".
By "last" entry I mean the following:
The one that appears last in the table "EsnsSalesOrderItems". So for example if "EsnsSalesOrderItems" has two entries with "EsnId" = 6 and "createdAt" = '2012-06-19' and '2012-07-19' respectively it should only give me the entry from '2012-07-19'.
SELECT (count(*) * sum(s."price")) AS amount
, 'rma' AS "creditType"
, c."company" AS "client"
, c.id AS "ClientId"
, r.*
FROM "Rmas" r
JOIN "EsnsRmas" er ON er."RmaId" = r."id"
JOIN "Esns" e ON e.id = er."EsnId"
JOIN (
SELECT DISTINCT ON ("EsnId") *
FROM "EsnsSalesOrderItems"
ORDER BY "EsnId", "createdAt" DESC
) es ON es."EsnId" = e."id"
JOIN "SalesOrderItems" s ON s."id" = es."SalesOrderItemId"
JOIN "Clients" c ON c."id" = r."ClientId"
WHERE r."credited" = FALSE
AND r."verifyStatus" IS NOT NULL
GROUP BY c.id, r.id;
Your query in the question has an illegal aggregate over another aggregate:
sum((select count(*) as itemCount) * "SalesOrderItems"."price") as amount
Simplified and converted to legal syntax:
(count(*) * sum(s."price")) AS amount
But do you really want to multiply with the count per group?
I retrieve the single row per group in "EsnsSalesOrderItems" with DISTINCT ON. Detailed explanation:
Select first row in each GROUP BY group?
I also added table aliases and formatting to make the query easier to parse for human eyes. If you could avoid camel case you could get rid of all the double quotes clouding the view.
Something like:
join (
select "EsnId", "SalesOrderItemId",
row_number() over (partition by "EsnId" order by "createdAt" desc) as rn
from "EsnsSalesOrderItems"
) t ON t."EsnId" = "Esns"."id" and t.rn = 1
This will select the latest row per "EsnId" from "EsnsSalesOrderItems" based on the "createdAt" column from your example. You can use any column that allows you to define an order on the rows that suits you.
But remember that the concept of the "last row" is only valid if you specify an order for the rows. A table as such is not ordered, nor is the result of a query unless you specify an order by.
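A small runnable sketch of the row_number() approach, using an in-memory SQLite database (window functions need SQLite 3.25 or newer). The table and column names are simplified, hypothetical versions of the ones in the question, and the dates follow the example given there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical, simplified stand-ins for "Esns" and "EsnsSalesOrderItems"
CREATE TABLE esns (id INTEGER PRIMARY KEY);
CREATE TABLE esn_items (esn_id INTEGER, sales_order_item_id INTEGER,
                        created_at TEXT);
INSERT INTO esns VALUES (6), (7);
INSERT INTO esn_items VALUES
  (6, 100, '2012-06-19'),
  (6, 101, '2012-07-19'),
  (7, 200, '2012-05-01');
""")

latest = conn.execute("""
    SELECT e.id, t.sales_order_item_id, t.created_at
    FROM esns AS e
    JOIN (
        -- number the rows of each esn_id from newest to oldest
        SELECT esn_id, sales_order_item_id, created_at,
               ROW_NUMBER() OVER (PARTITION BY esn_id
                                  ORDER BY created_at DESC) AS rn
        FROM esn_items
    ) AS t ON t.esn_id = e.id AND t.rn = 1   -- keep only the newest row
    ORDER BY e.id
""").fetchall()

print(latest)
```

For esn_id 6, only the entry dated 2012-07-19 survives the join.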
Necromancing because the answers are outdated.
Take advantage of the LATERAL keyword introduced in PG 9.3
left | right | inner JOIN LATERAL
I'll explain with an example:
Assuming you have a table "Contacts".
Now contacts have organisational units.
They can have one OU at a point in time, but N OUs at N points in time.
Now, if you have to query contacts and OU in a time period (not a reporting date, but a date range), you could N-fold increase the record count if you just did a left join.
So, to display the OU, you need to just join the first OU for each contact (where what shall be first is an arbitrary criterion - when taking the last value, for example, that is just another way of saying the first value when sorted by descending date order).
In SQL Server, you would use CROSS APPLY (or rather OUTER APPLY, since we need a left join), which will invoke a table-valued function on each row it has to join.
SELECT * FROM T_Contacts
--LEFT JOIN T_MAP_Contacts_Ref_OrganisationalUnit ON MAP_CTCOU_CT_UID = T_Contacts.CT_UID AND MAP_CTCOU_SoftDeleteStatus = 1
--WHERE T_MAP_Contacts_Ref_OrganisationalUnit.MAP_CTCOU_UID IS NULL -- 989
-- CROSS APPLY -- = INNER JOIN
OUTER APPLY -- = LEFT JOIN
(
SELECT TOP 1
--MAP_CTCOU_UID
MAP_CTCOU_CT_UID
,MAP_CTCOU_COU_UID
,MAP_CTCOU_DateFrom
,MAP_CTCOU_DateTo
FROM T_MAP_Contacts_Ref_OrganisationalUnit
WHERE MAP_CTCOU_SoftDeleteStatus = 1
AND MAP_CTCOU_CT_UID = T_Contacts.CT_UID
/*
AND
(
(#in_DateFrom <= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateTo)
AND
(#in_DateTo >= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateFrom)
)
*/
ORDER BY MAP_CTCOU_DateFrom
) AS FirstOE
In PostgreSQL, starting from version 9.3, you can do that, too - just use the LATERAL keyword to achieve the same:
SELECT * FROM T_Contacts
--LEFT JOIN T_MAP_Contacts_Ref_OrganisationalUnit ON MAP_CTCOU_CT_UID = T_Contacts.CT_UID AND MAP_CTCOU_SoftDeleteStatus = 1
--WHERE T_MAP_Contacts_Ref_OrganisationalUnit.MAP_CTCOU_UID IS NULL -- 989
LEFT JOIN LATERAL
(
SELECT
--MAP_CTCOU_UID
MAP_CTCOU_CT_UID
,MAP_CTCOU_COU_UID
,MAP_CTCOU_DateFrom
,MAP_CTCOU_DateTo
FROM T_MAP_Contacts_Ref_OrganisationalUnit
WHERE MAP_CTCOU_SoftDeleteStatus = 1
AND MAP_CTCOU_CT_UID = T_Contacts.CT_UID
/*
AND
(
(__in_DateFrom <= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateTo)
AND
(__in_DateTo >= T_MAP_Contacts_Ref_OrganisationalUnit.MAP_KTKOE_DateFrom)
)
*/
ORDER BY MAP_CTCOU_DateFrom
LIMIT 1
) AS FirstOE ON TRUE -- LEFT JOIN requires a join condition in PostgreSQL
Try using a subquery in your ON clause. An abstract example:
SELECT
*
FROM table1
JOIN table2 ON table2.id = (
SELECT id FROM table2 WHERE table2.table1_id = table1.id LIMIT 1
)
WHERE
...
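Here is one way the subquery-in-ON idea might look when made concrete, run against an in-memory SQLite database. The created_at column and the sample rows are assumptions added for illustration. Note the ORDER BY inside the subquery: without it, LIMIT 1 picks an arbitrary matching row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
-- created_at is a hypothetical column used to define "last"
CREATE TABLE table2 (id INTEGER PRIMARY KEY, table1_id INTEGER,
                     created_at TEXT);
INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
INSERT INTO table2 VALUES
  (10, 1, '2012-06-19'),
  (11, 1, '2012-07-19'),
  (12, 2, '2012-01-01');
""")

rows = conn.execute("""
    SELECT t1.id, t1.name, t2.id, t2.created_at
    FROM table1 AS t1
    JOIN table2 AS t2 ON t2.id = (
        -- correlated subquery: newest table2 row for this table1 row
        SELECT x.id
        FROM table2 AS x
        WHERE x.table1_id = t1.id
        ORDER BY x.created_at DESC
        LIMIT 1
    )
    ORDER BY t1.id
""").fetchall()

print(rows)
```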