Select values based on DISTINCT combination of rest of columns Oracle DB - sql

I want to select row IDs associated with distinct column combinations in the remainder of a table. For instance, if the distinct rows are
I want to get the row IDs associated with each row. I can't query for distinct IDs since they are the row's primary key (and hence are all distinct).
So far I have:
SELECT e.ID
FROM E_UPLOAD_TEST e
INNER JOIN (
SELECT DISTINCT WHAT, MATERIALS, ERROR_FIELD, UNITS, SEASONALITY, DATA_TYPE, DETAILS, METHODS, DATA_FORMAT
FROM E_UPLOAD_TEST) c
ON e.WHAT = c.WHAT AND e.MATERIALS = c.MATERIALS AND e.ERROR_FIELD = c.ERROR_FIELD AND e.DATA_TYPE = c.DATA_TYPE AND e.METHODS = c.METHODS AND e.DATA_FORMAT = c.DATA_FORMAT;
which runs but doesn't return anything. Am I missing a GROUP BY and/or MIN() statement?

#serg is correct. Every single row in your example has at least one column value that is null. That means that no row will match your join condition. That is why your query results in no rows found.
Modifying your condition might get you what you want so long has your data isn't changing frequently. If it is changing frequently, then you probably want a single query for the entire job otherwise you'll have to set your transaction so that it is immune to data changes.
An example of such a condition change is this:
( (e.WHAT is null and c.WHAT is null) or (e.WHAT = c.WHAT) )
But such a change makes sense only if two rows having a null value in the same column means the same thing for both rows and it has to mean the same thing as time marches on. What "WHAT is null" means today might not be the same thing tomorrow. And that is probably why C. J. Date hates nulls so much.

Instead of comparing, use the decode function which compares two null values correctly.
e.WHAT = c.WHAT -> DECODE(e.WHAT, c.WHAT, 1) = 1

Related

RETURNING causes error: missing FROM-clause entry for table

I am getting the users data from UUID WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'.
Since I don't want to make an additional query to fetch additional user data I'm trying to sneak them through the INSERT.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
)
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
RETURNING _u.empl_user_pvt_uuid, _u.email, _u.name_first;
However I get:
[42P01] ERROR: missing FROM-clause entry for table "_u"
Position: 994
What am I doing wrong?
It's true, as has been noted, that the RETURNING clause of an INSERT only sees the inserted row. More specifically, quoting the manual here:
The optional RETURNING clause causes INSERT to compute and return
value(s) based on each row actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used). This is primarily useful for
obtaining values that were supplied by defaults, such as a serial
sequence number. However, any expression using the table's columns
is allowed. The syntax of the RETURNING list is identical to that
of the output list of SELECT. Only rows that were successfully
inserted or updated will be returned. [...]
Bold emphasis mine.
So nothing keeps you from adding a correlated subquery to the RETURNING list:
INSERT INTO employees.password_resets AS ep
(empl_pwd_reset_uuid , empl_user_pvt_uuid , t_valid , for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', eu.empl_user_pvt_uuid , '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM employees.users eu
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
RETURNING for_empl_user_pvt_uuid AS empl_user_pvt_uuid -- alias to meet your org. query
, (SELECT email
FROM employees.emails
WHERE empl_user_pvt_uuid = ep.empl_user_pvt_uuid
ORDER BY t DESC -- NULLS LAST ?
LIMIT 1
) AS email
, (SELECT name_first
FROM employees.profiles
WHERE empl_user_pvt_uuid = ep.empl_user_pvt_uuid
-- ORDER BY ???
LIMIT 1
) AS name_first;
This is also much more efficient than the query you had (or what was proposed) for multiple reasons.
We don't run the subqueries ee and ep over all rows of the tables employees.emails and employees.profiles. That would be efficient if we needed major parts of those tables, but we only fetch a single row of interest from each. With appropriate indexes, a correlated subquery is much more efficient for this. See:
Efficient query to get greatest value per group from big table
Two SQL LEFT JOINS produce incorrect result
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
We don't add the overhead of one or more CTEs.
We only fetch additional data after a successful INSERT, so no time is wasted if the insert didn't go through for any reason. (See quote at the top!)
Plus, possibly most important, this is correct. We use data from the row that has actually been inserted - after inserting it. (See quote at the top!) After possible default values, triggers or rules have been applied. We can be certain that what we see is what's actually in the database (currently).
You have no ORDER BY for profiles.name_first. That's not right. Either there is only one qualifying row, then we need no DISTINCT nor LIMIT 1. Or there can be multiple, then we also need a deterministic ORDER BY to get a deterministic result.
And if emails.t can be NULL, you'll want to add NULLS LAST in the ORDER BY clause. See:
Sort by column ASC, but NULL values first?
Indexes
Ideally, you have these multicolumn indexes (with columns in this order):
users (empl_user_pub_uuid, empl_user_pvt_uuid)
emails (empl_user_pvt_uuid, email)
profiles (empl_user_pvt_uuid, name_first)
Then, if the tables are vacuumed enough, you get three index-only scans and the whole operation is lightening fast.
Get pre-INSERT values?
If you really want that (which I don't think you do), consider:
Return pre-UPDATE column values using SQL only
According Postgres Docs about 6.4. Returning Data From Modified Rows :
In an INSERT, the data available to RETURNING is the row as it was
inserted.
But here you are trying to return columns from source table instead of destination. Returning will not be able to return columns form _u table rather only from employees.password_resets table's inserted row. But you can write nested cte for insertion and can select data from source table as well. Please try below approach.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
), I as
(
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
)
select _u.empl_user_pvt_uuid, _u.email, _u.name_first from _u

SQL Server ISNULL multiple columns

I have the following query which works great but how do I add multiple columns in its select statement? Following is the query:
SELECT ISNULL(
(SELECT DISTINCT a.DatasourceID
FROM [Table1] a
WHERE a.DatasourceID = 5 AND a.AgencyID = 4 AND a.AccountingMonth = 201907), NULL) TEST
So currently I only get one column (TEST) but would like to add other columns such as DataSourceID, AgencyID and AccountingMonth.
If you want to output a row for some condition (or requested values ) and output a row when it does not meet condition,
you can set a pseudo table for your requested values in the FROM clause and make a left outer join with your Table1.
SELECT ISNULL(Table1.DatasourceId, 999999),
Table1.AgencyId,
Table1.AccountingMonth,
COUNT(*) as count
FROM ( VALUES (5, 4, 201907 ),
(6, 4, 201907 ))
AS requested(DatasourceId, AgencyId, AccountingMonth)
LEFT OUTER JOIN Table1 ON requested.agencyid=Table1.AgencyId
AND requested.datasourceid = Table1.DatasourceId
AND requested.AccountingMonth = Table1.AccountingMonth
GROUP BY Table1.DatasourceId, Table1.AgencyId, Table1.AccountingMonth
Note that:
I have put a ISNULL for the first column like you did to output a particular value (9999) when no value is found.
I did not put the ISNULL(...,NULL) like your query in the other columns since IMHO it is not necessary: if there is no value, a null will be output anyway.
I added a COUNT(*) column to illustrate an aggregate, you could use another (SUM, MIN, MAX) or none if you do not need it.
The set of requested values is provided as a constant table values (see https://learn.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql?view=sql-server-2017)
I have added multiple rows for requested conditions : you can request for multiple datasources, agencies or months in one query with one line for each in the output.
If you want only one row, put only one row in "requested" pseudo table values.
There must be a GROUP BY, even if you do not want to use an aggregate (count, sum or other) in order to have the same behavior as your distinct clause , it restricts the output to single lines for requested values.
To me it seems that you want to see does data exists, i guess that your's AgencyID is foreign key to agency table, DataSourceID also to DataSource, and that you have AccountingMonth table which has all accounting periods:
SELECT ds.ID as DataSourceID , ag.ID as AgencyID , am.ID as AccountingMonth ,
ISNULL(COUNT(a.*),0) as Count
FROM [Table1] a
RIGHT JOIN [Datasource] ds ON ds.ID = a.DataSourceID
RIGHT JOIN [Agency] ag ON ag.ID = a.AgencyID
RIGHT JOIN [AccountingMonth] am on am.ID = a.AccountingMonth
GROUP BY ds.ID, ag.ID, am.ID
In this way you can see count of records per group by criteria. Notice RIGHT join, you must use RIGHT JOIN if you want to include all record from "Right" table.
In yours query you have DISTINCT a.DatasourceID and WHERE a.DatasourceID = 5 and it returns 5 if in table exists rows that match yours WHERE criteria, and returns null if there is no data. If you remove WHERE a.DatasourceID = 5 your query would break with error: subquery returned multiple rows.
the way you are doing only allows for one column and one record and giving it the name of test. It does not look like you really need to test for null. because you are returning null so that does nothing to help you. Remove all the null testing and return a full recordset distinct will also limit your returns to 1 record. When working with a single table you don't need an alias, if there are no spaces or keywords braced identifiers not required. if you need to see if you have an empty record set, test for it in the calling program.
SELECT DatasourceID, AgencyID,AccountingMonth
FROM Table1
WHERE DatasourceID = 5 AND AgencyID = 4 AND AccountingMonth = 201907

SQL Server Comparing 2 tables having Millions of Records

I have 2 tables in SQL Server: Table 1 and Table 2.
Table 1 has 500 Records and Table 2 has Millions of Records.
Table 2 may/may not have the 500 Records of Table 1 in it.
I have to compare Table 1 and Table 2. But the result should give me only the Records of Table 1 which has any data change in Table 2. Means the Result should be less than or equal to 500.
I don't have any primary key but the columns in the 2 tables are same. I have written the following query. But I am getting time out exception and it is taking much time to process. Please help.
With CTE_DUPLICATE(OLD_FIRSTNAME ,New_FirstName,
OLD_LASTNAME ,New_LastName,
OLD_MINAME ,New_MIName ,
OLD_FAMILYID,NEW_FAMILYID,ROWNUMBER)
as (
Select distinct
OLD.FIRST_NAME AS 'OLD_FIRSTNAME' ,New.First_Name AS 'NEW_FIRSTNAME',
OLD.LAST_NAME AS 'OLD_LASTNAME',New.Last_Name AS 'NEW_LASTNAME',
OLD.MI_NAME AS 'OLD_MINAME',New.MI_Name AS 'NEW_MINAME',
OLD.FAMILY_ID AS 'OLD_FAMILYID',NEW.FAMILY_ID AS 'NEW_FAMILYID',
row_number()over(partition by OLD.FIRST_NAME ,New.First_Name,
OLD.LAST_NAME ,New.Last_Name,
OLD.MI_NAME ,New.MI_Name ,
OLD.FAMILY_ID,NEW.FAMILY_ID
order by OLD.FIRST_NAME ,New.First_Name,
OLD.LAST_NAME ,New.Last_Name,
OLD.MI_NAME ,New.MI_Name ,
OLD.FAMILY_ID,NEW.FAMILY_ID )as rank
From EEMSCDBStatic OLD,EEMS_VIPFILE New where
OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name
and OLD.LAST_NAME <> New.Last_Name and OLD.MI_NAME <> New.MI_Name
and old.Family_Id<>New.Family_id
)
sELECT OLD_FIRSTNAME ,New_FirstName,
OLD_LASTNAME ,New_LastName,
OLD_MINAME ,New_MIName ,
OLD_FAMILYID,NEW_FAMILYID FROM CTE_DUPLICATE where rownumber=1
I think the main problem here is that your query is forcing the DB to fully multiply your tables, which means processing ~500M combinations. It happens because you're connecting any record from T1 with any record from T2 that has at least one different value, including MPID that looks like the unique identifier that must be used to connect records.
If MPID is really the column that identifies records in both tables then your query should have a bit different structure:
SELECT old.FIRSTNAME, new.FirstName,
old.LASTNAME, new.LastName,
old.MINAME, new.MIName,
old.FAMILYID, new.FAMILYID
FROM EEMSCDBStatic old
INNER JOIN EEMS_VIPFILE new ON old.MPID = new.MPID
WHERE old.FIRST_NAME <> New.First_Name
AND OLD.LAST_NAME <> New.Last_Name
AND OLD.MI_NAME <> New.MI_Name
AND old.Family_Id <> New.Family_id
ORDER BY old.FIRSTNAME, new.FirstName,
old.LASTNAME, new.LastName,
old.MINAME, new.MIName,
old.FAMILYID, new.FAMILYID
A couple of other thoughts:
If you're looking for any change in a record (even if only one column has different values), you should use ORs in the WHERE clause, not ANDs. Now you're only looking for records that changed values in all columns. For instance, you'll fail to find a person who changed his or her first name but decided to keep last name.
You should obviously consider indexing your tables if it's possible.
Surely it is pointless to use DISTINCT keyword together with ROWNUMBER.
See this sql query distinct with Row_Number.
You are doing CROSS JOIN, which is terribly big in your case.
Perhaps in that condition you
where OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name and ...
you wanted to have OR instead of AND?
It is also not entirely clear why you use ROWNUMBER at all - perhaps to find the best match.
All this is because as #Shnugo correctly remarked, the logic behind your comparing is faulty - you must have some logic defined that would JOIN the tables (Like First and second name must be the same).

SQL Server where column in where clause is null

Let's say that we have a table named Data with Id and Weather columns. Other columns in that table are not important to this problem. The Weather column can be null.
I want to display all rows where Weather fits a condition, but if there is a null value in weather then display null value.
My SQL so far:
SELECT *
FROM Data d
WHERE (d.Weather LIKE '%'+COALESCE(NULLIF('',''),'sunny')+'%' OR d.Weather IS NULL)
My results are wrong, because that statement also shows values where Weather is null if condition is not correct (let's say that users mistyped wrong).
I found similar topic, but there I do not find appropriate answer.
SQL WHERE clause not returning rows when field has NULL value
Please help me out.
Your query is correct for the general task of treating NULLs as a match. If you wish to suppress NULLs when there are no other results, you can add an AND EXISTS ... condition to your query, like this:
SELECT *
FROM Data d
WHERE d.Weather LIKE '%'+COALESCE(NULLIF('',''),'sunny')+'%'
OR (d.Weather IS NULL AND EXISTS (SELECT * FROM Data dd WHERE dd.Weather LIKE '%'+COALESCE(NULLIF('',''),'sunny')+'%'))
The additional condition ensures that NULLs are treated as matches only if other matching records exist.
You can also use a common table expression to avoid duplicating the query, like this:
WITH cte (id, weather) AS
(
SELECT *
FROM Data d
WHERE d.Weather LIKE '%'+COALESCE(NULLIF('',''),'sunny')+'%'
)
SELECT * FROM cte
UNION ALL
SELECT * FROM Data WHERE weather is NULL AND EXISTS (SELECT * FROM cte)
statement show also values where Wether is null if condition is not correct (let say that users typed wrong sunny).
This suggests that the constant 'sunny' is coming from end-user's input. If that is the case, you need to parameterize your query to avoid SQL injection attacks.

What is the difference between the IN operator and = operator in SQL?

I am just learning SQL, and I'm wondering what the difference is between the following lines:
WHERE s.parent IN (SELECT l.parent .....)
versus
WHERE s.parent = (SELECT l.parent .....)
IN
will not generate an error if you have multiple results on the subquery. Allows to have more than one value in the result returned by the subquery.
=
will generate an error if you have more than one result on the subquery.
SQLFiddle Demo (IN vs =)
when you are using 'IN' it can compare multiple values....like
select * from tablename where student_name in('mari','sruthi','takudu')
but when you are using '=' you can't compare multiple values
select * from tablenamewhere student_name = 'sruthi'
i hope this is the right answer
The "IN" clause is also much much much much slower. If you have many results in the select portion of
IN (SELECT l.parent .....),
it will be extremely inefficient as it actually generates a separate select sql statement for each and every result within the select statement ... so if you return 'Cat', 'Dog', 'Cow'
it will essentially create a sql statement for each result... if you have 200 results... you get the full sql statement 200 times...takes forever... (This was as of a few years ago... maybe imporved by now... but it was horribly slow on big result sets.)
Much more efficient to do an inner join such as:
Select id, parent
from table1 as T
inner join (Select parent from table2) as T2 on T.parent = T2.parent
For future visitors.
Basically in case of equals (just remember that here we are talking like where a.name = b.name), each cell value from table 1 will be compared one by one to each cell value of all the rows from table 2, if it matches then that row will be selected (here that row will be selected means that row from table 1 and table 2) for the overall result set otherwise will not be selected.
Now, in case of IN, complete result set on the right side of the IN will be used for comparison, so its like each value from table 1 will be checked on whether this cell value is present in the complete result set of the IN, if it is present then that value will be shown for all the rows of the IN’s result set, so let say IN result set has 20 rows, so that cell value from table 1 will be present in overall result set 20 times (i.e. that particular cell value will have 20 rows).
For more clarity see below screen shot, notice below that how complete result set from the right of the IN (and NOT IN) is considered in the overall result set; whole emphasis is on the fact that in case comparison using =, matching row from second table is selected, while in case of IN complete result from the second table is selected.
In can match a value with more than one values, in other words it checks if a value is in the list of values so for e.g.
x in ('a', 'b', 'x') will return true result as x is in the the list of values
while = expects only one value, its as simple as
x = y returns false
and
x = x returns true
The general rule of thumb is:
The = expects a single value to compare with. Like this:
WHERE s.parent = 'father_name'
IN is extremely useful in scenarios where = cannot work i.e. scenarios where you need the comparison with multiple values.
WHERE s.parent IN ('father_name', 'mother_name', 'brother_name', 'sister_name')
Hope this is useful!!!
IN
This helps when a subquery returns more than one result.
=
This operator cannot handle more than one result.
Like in this example:
SQL>
Select LOC from dept where DEPTNO = (select DEPTNO from emp where
JOB='MANAGER');
Gives ERROR ORA-01427: single-row subquery returns more than one row
Instead use
SQL>
Select LOC from dept where DEPTNO in (select DEPTNO from emp
where JOB='MANAGER');
1) Sometimes = also used as comparison operator in case of joins which IN doesn't.
2) You can pass multiple values in the IN block which you can't do with =. For example,
SELECT * FROM [Products] where ProductID IN((select max(ProductID) from Products),
(select min(ProductID) from Products))
would work and provide you expected number of rows.However,
SELECT * FROM [Products] where ProductID = (select max(ProductID) from Products)
and ProductID =(select min(ProductID) from Products)
will provide you 'no result'. That means, in case subquery supposed to return multiple number of rows , in that case '=' isn't useful.