Subqueries: What am I doing fundamentally wrong? - sql

I thought that selecting values from a subquery in SQL would only yield values from that subset until I found a very nasty bug in code. Here is an example of my problem.
I'm selecting the rows that contain the latest (max) date for each function. This correctly returns 4 rows with the latest check-in of each function.
select *, max(date) from cm where file_id == 5933 group by function_id;
file_id function_id date value max(date)
5933 64807 1407941297 1 1407941297
5933 64808 1407941297 11 1407941297
5933 895175 1306072348 1306072348
5933 895178 1363182349 1363182349
When selecting only the value from the subset above, it returns function values from previous dates, i.e. rows that don't belong in the subset above. You can see the result below where the dates are older than in the first subset.
select temp.function_id, temp.date, temp.value
from (select *, max(date)
from cm
where file_id == 5933
group by function_id) as temp;
function_id date value
64807 1306072348 1 <- outdated row, not in first subset
64808 1306072348 17 <- outdated row, not in first subset
895175 1306072348
895178 1363182349
What am I doing fundamentally wrong? Shouldn't selects performed on subqueries only return possible results from those subqueries?

SQLite allows you to use MAX() to select the row to be returned by a GROUP BY, but this works only if the MAX() is actually computed.
When you throw the max(date) column away, this no longer works.
In this case, you actually want to use the date value, so you can just keep the MAX():
SELECT function_id,
max(date) AS date,
value
FROM cm
WHERE file_id = 5933
GROUP BY function_id
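To try this out, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the question; the sample rows are made up to resemble the output shown there, so treat them as an assumption. It relies on SQLite's documented behaviour that, when a query has a single max() aggregate, the other bare columns are taken from the row that holds the maximum.

```python
import sqlite3

# In-memory database with hypothetical sample rows modelled on the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cm (file_id INTEGER, function_id INTEGER, date INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO cm VALUES (?, ?, ?, ?)",
    [
        (5933, 64807, 1306072348, 1),   # older check-in
        (5933, 64807, 1407941297, 1),   # latest check-in
        (5933, 64808, 1306072348, 17),  # older check-in
        (5933, 64808, 1407941297, 11),  # latest check-in
    ],
)

# Keep max(date) in the select list so SQLite picks "value" from the max row.
latest_rows = conn.execute(
    """
    SELECT function_id, max(date) AS date, value
    FROM cm
    WHERE file_id = 5933
    GROUP BY function_id
    ORDER BY function_id
    """
).fetchall()

for row in latest_rows:
    print(row)
```

Note that taking bare columns from the max() row is specific to SQLite; most other databases reject or handle such a query differently.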

You seem to be missing the fact that your subquery is returning ALL rows for the given file_id. If you want to restrict your subquery to recs with the most recent date, then you need to restrict it with a WHERE NOT EXISTS clause to check that no more recent records exist for the given condition.
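A minimal sketch of the WHERE NOT EXISTS approach, run through Python's sqlite3 module. The sample rows are hypothetical and the aliases c and newer are chosen only for illustration. Unlike the bare-column trick, this form should work in most SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cm (file_id INTEGER, function_id INTEGER, date INTEGER, value INTEGER)"
)
# Hypothetical rows: two check-ins per function.
conn.executemany(
    "INSERT INTO cm VALUES (?, ?, ?, ?)",
    [
        (5933, 64807, 1306072348, 1),
        (5933, 64807, 1407941297, 1),
        (5933, 64808, 1306072348, 17),
        (5933, 64808, 1407941297, 11),
    ],
)

# Keep a row only if no newer row exists for the same file and function.
newest = conn.execute(
    """
    SELECT c.function_id, c.date, c.value
    FROM cm AS c
    WHERE c.file_id = 5933
      AND NOT EXISTS (
          SELECT 1
          FROM cm AS newer
          WHERE newer.file_id = c.file_id
            AND newer.function_id = c.function_id
            AND newer.date > c.date
      )
    ORDER BY c.function_id
    """
).fetchall()

print(newest)
```

If two rows share the same latest date for a function, both are returned, so add a tie-breaker if exactly one row per function is required.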

Perhaps my question was not formulated correctly, but this post had the solutions I was essentially looking for:
https://stackoverflow.com/a/123481/2966951
https://stackoverflow.com/a/121435/2966951
Filtering out the most recent row was my problem. I was surprised that selecting from a subquery with a max value could yield anything other than that value.

Related

Start date and end date assigning based on date ranges and value change

I have two tables, temp_N and temp_C. The table script and data are given below. I am using Teradata.
Table Script and data
First image is temp_N and second one is temp_C
Now I will try to explain my requirement. The key column for these two tables is 'nbr'. The two tables contain all the changes for a particular period of time (this is sample data; the tables are loaded daily based on updates). I need to merge these two tables into one table with the date ranges assigned correctly. The expected result is given below. To explain the logic behind it: in the first row of the expected result, fstrtdate is the earliest date from the two tables, which is 2022-01-31, and the end date for that row is 2022-07-10 because the cpnrate changes on 2022-07-11. The second row starts on 2022-07-11 with the changed cpnrate. In the third row, ntr changes on 2022-08-31 and the data is updated accordingly. Please note that all of these are date fields with no time component; please ignore the timestamps in the screenshots.
I would like to know how to achieve this in SQL, or whether it is possible at all.
You can combine all the changes into a single table and order by effective start date (fstrtdate). Then you can compute effective end date as day prior to next change, and where one of the data values is NULL use LAG IGNORE NULLS to "bring down" the desired previous not-NULL value:
SELECT nbr, fstrtdate,
PRIOR(LEAD(fstrtdate) OVER (PARTITION BY nbr ORDER BY fstrtdate)) as fenddate,
COALESCE(ntr,LAG(ntr) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) as ntr,
COALESCE(cpnrate,LAG(cpnrate) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) as cpnrate
FROM (
SELECT nbr, fstrtdate, max(ntr) as ntr, max(cpnrate) as cpnrate
FROM (
SELECT nbr, fstrtdate, ntr, NULL (DECIMAL(9,2)) as cpnrate
from temp_n
UNION ALL
SELECT nbr, fstrtdate, NULL (DECIMAL(9,2)) as ntr, cpnrate
from temp_c
) AS COMBINED
GROUP BY 1, 2
) AS UNIQUESTART
ORDER BY fstrtdate;
The innermost SELECTs make the structure the same for data from both tables with NULLs for the values that come from the other table, so we can do a UNION to form one COMBINED derived table with rows for both types of change events. Note that you should explicitly assign datatype for those added NULL columns to match the datatype for the corresponding column in the other table; I somewhat arbitrarily chose DECIMAL(9,2) above since I didn't know the real data types. They can't be INT as in the example, though, since that would truncate the decimal part. There's no reason to carry along the original fenddate; a new non-overlapping fenddate will be computed in the outer query.
The intermediate GROUP BY is only to combine what would otherwise be two rows in the special case where both ntr and cpnrate changed on the same day for the same nbr. That case is not present in the example data - the dates are already unique - but it might be necessary to do this when processing the full table. The syntax requires an aggregate function, but there should be at most two rows for a (nbr, fstrtdate) group; and when there are two rows, in each of the other columns one row has NULL and the other row does not. In that case either MIN or MAX will return the non-NULL value.
In the outer query, the COALESCEs will return the value for that column from the current row in the UNIQUESTART derived table if it's not NULL; otherwise LAG is used to obtain the value from a previous row.
The first two rows in the result won't match the screenshot above but they do accurately reflect the data provided - specifically, the example does not identify a cpnrate for any date prior to 2022-05-11.
nbr fstrtdate fenddate ntr cpnrate
233 2022-01-31 2022-05-10 311,000.00 NULL
233 2022-05-11 2022-07-10 311,000.00 3.31
... - - - -
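The same logic can be sketched outside the database to make the steps easier to follow. The Python below is an illustration, not Teradata code: it handles a single nbr, the function name merge_changes is hypothetical, and the later ntr and cpnrate values (320000.00 and 3.50) are invented, since the post does not show them. It combines the change events by start date, carries the last non-null value forward, and sets each end date to the day before the next change.

```python
from datetime import date, timedelta


def merge_changes(ntr_changes, cpnrate_changes):
    """Merge two lists of (start_date, value) into non-overlapping date ranges."""
    # Combine both kinds of change event, keyed by start date
    # (the role played by UNION ALL plus GROUP BY in the SQL).
    events = {}
    for start, value in ntr_changes:
        events.setdefault(start, {})["ntr"] = value
    for start, value in cpnrate_changes:
        events.setdefault(start, {})["cpnrate"] = value

    starts = sorted(events)
    last = {"ntr": None, "cpnrate": None}
    rows = []
    for i, start in enumerate(starts):
        # Carry previous values forward (the role of COALESCE + LAG IGNORE NULLS).
        last.update(events[start])
        # End date is the day before the next change; None for the open last row.
        end = starts[i + 1] - timedelta(days=1) if i + 1 < len(starts) else None
        rows.append((start, end, last["ntr"], last["cpnrate"]))
    return rows


# Hypothetical change history for one nbr.
merged = merge_changes(
    ntr_changes=[(date(2022, 1, 31), 311000.00), (date(2022, 8, 31), 320000.00)],
    cpnrate_changes=[(date(2022, 5, 11), 3.31), (date(2022, 7, 11), 3.50)],
)

for row in merged:
    print(row)
```

As in the SQL result, the first range has no cpnrate because none is known before 2022-05-11.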

get the latest records

I am currently still on my SQL educational journey and need some help!
The query I have is as below;
SELECT
Audit_Non_Conformance_Records.kf_ID_Client_Reference_Number,
Audit_Non_Conformance_Records.TimeStamp_Creation,
Audit_Non_Conformance_Records.Clause,
Audit_Non_Conformance_Records.NC_type,
Audit_Non_Conformance_Records.NC_Rect_Received,
Audit_Non_Conformance_Records.Audit_Num
FROM Audit_Non_Conformance_Records
I am trying to tweak this to show only the most recent results based on Audit_Non_Conformance_Records.TimeStamp_Creation
I have tried using MAX(), but all this does is show the latest date for all records.
Basically, the results of the above give me this:
But I only need the results with the date 02/10/2019, as this is the latest result. There may be multiple results, however. So, for example, if 02/10/2019 had never happened, I would need all of the individual records from the 14/10/2019 ones.
Does that make any sense at all?
You can filter with a subquery:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where TimeStamp_Creation = (
select max(TimeStamp_Creation)
from Audit_Non_Conformance_Records
)
This will give you all records whose TimeStamp_Creation is equal to the greatest value available in the table.
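A quick way to check this filter is to run it against a throwaway SQLite database from Python. The sketch below keeps only three of the columns and uses made-up rows and timestamps, so everything except the table and column names is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Audit_Non_Conformance_Records ("
    "kf_ID_Client_Reference_Number TEXT, TimeStamp_Creation TEXT, Clause TEXT)"
)
# Hypothetical rows: one older record and two sharing the newest timestamp.
conn.executemany(
    "INSERT INTO Audit_Non_Conformance_Records VALUES (?, ?, ?)",
    [
        ("C1", "2019-09-30 09:00:00", "4.1"),
        ("C1", "2019-10-14 11:30:00", "4.2"),
        ("C1", "2019-10-14 11:30:00", "7.5"),
    ],
)

# The scalar subquery finds the newest timestamp; the outer query keeps
# every row that carries exactly that timestamp.
latest = conn.execute(
    """
    SELECT kf_ID_Client_Reference_Number, TimeStamp_Creation, Clause
    FROM Audit_Non_Conformance_Records
    WHERE TimeStamp_Creation = (
        SELECT MAX(TimeStamp_Creation)
        FROM Audit_Non_Conformance_Records
    )
    ORDER BY Clause
    """
).fetchall()

print(latest)
```

Both rows with the newest timestamp come back, which matches the requirement that there may be multiple latest results.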
If you want all records that have the greatest day (excluding time), then you can do:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where cast(TimeStamp_Creation as date) = (
select cast(max(TimeStamp_Creation) as date)
from Audit_Non_Conformance_Records
)
Edit
If you want the latest record per refNumber, then you can correlate the subquery, like so:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where TimeStamp_Creation = (
select max(TimeStamp_Creation)
from Audit_Non_Conformance_Records a1
where a1.refNumber = a.refNumber
)
For performance, you want an index on (refNumber, TimeStamp_Creation).
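Here is a small sketch of the correlated version with the suggested index, run in SQLite from Python. It assumes that "refNumber" stands for the kf_ID_Client_Reference_Number column from the question; the index name and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Audit_Non_Conformance_Records ("
    "kf_ID_Client_Reference_Number TEXT, TimeStamp_Creation TEXT, Clause TEXT)"
)
# Composite index so the per-client MAX() lookup can use the index.
conn.execute(
    "CREATE INDEX idx_ref_ts ON Audit_Non_Conformance_Records "
    "(kf_ID_Client_Reference_Number, TimeStamp_Creation)"
)
# Hypothetical rows for two clients.
conn.executemany(
    "INSERT INTO Audit_Non_Conformance_Records VALUES (?, ?, ?)",
    [
        ("C1", "2019-09-30 09:00:00", "4.1"),
        ("C1", "2019-10-14 11:30:00", "4.2"),
        ("C2", "2019-08-01 08:00:00", "6.3"),
    ],
)

# For each outer row, the subquery finds the newest timestamp of the
# same client; only rows matching that timestamp are kept.
latest_per_client = conn.execute(
    """
    SELECT a.kf_ID_Client_Reference_Number, a.TimeStamp_Creation, a.Clause
    FROM Audit_Non_Conformance_Records AS a
    WHERE a.TimeStamp_Creation = (
        SELECT MAX(a1.TimeStamp_Creation)
        FROM Audit_Non_Conformance_Records AS a1
        WHERE a1.kf_ID_Client_Reference_Number = a.kf_ID_Client_Reference_Number
    )
    ORDER BY a.kf_ID_Client_Reference_Number
    """
).fetchall()

print(latest_per_client)
```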
If you want the latest date in SQL Server, you can express this as:
SELECT TOP (1) WITH TIES ancr.kf_ID_Client_Reference_Number,
ancr.TimeStamp_Creation,
ancr.Clause,
ancr.NC_type,
ancr.NC_Rect_Received,
ancr.Audit_Num
FROM Audit_Non_Conformance_Records ancr
ORDER BY CONVERT(date, ancr.TimeStamp_Creation) DESC;
SQL Server is pretty good about handling dates with conversions, so I would not be surprised if this used an index on TimeStamp_Creation.

SELECT MIN from a subset of data obtained through GROUP BY

There is a database in place with hourly timeseries data, where every row in the DB represents one hour. Example:
TIMESERIES TABLE
id date_and_time entry_category
1 2017/01/20 12:00 type_1
2 2017/01/20 13:00 type_1
3 2017/01/20 12:00 type_2
4 2017/01/20 12:00 type_3
First I used the GROUP BY statement to find the latest date and time for each type of entry category:
SELECT MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category;
However now, I want to find which is the date and time which is the LEAST RECENT among the datetime's I obtained with the query listed above. I will need to use somehow SELECT MIN(date_and_time), but how do I let SQL know I want to treat the output of my previous query as a "new table" to apply a new SELECT query on? The output of my total query should be a single value—in case of the sample displayed above, date_and_time = 2017/01/20 12:00.
I've tried using aliases, but don't seem to be able to do the trick, they only rename existing columns or tables (or I'm misusing them..).There are many questions out there that try to list the MAX or MIN for a particular group (e.g. https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ or Select max value of each group) which is what I have already achieved, but I want to do work now on this list of obtained datetime's. My database structure is very simple, but I lack the knowledge to string these queries together.
Thanks, cheers!
You can use your first query as a sub-query; this is what you describe as using the first query's output as the input for the second query. The result is a single row containing the minimum date, as required.
SELECT MIN(date_and_time)
FROM (SELECT MAX(date_and_time) as date_and_time, entry_category
FROM timeseries_table
GROUP BY entry_category)a;
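The derived-table approach can be verified against the sample rows from the question with Python's sqlite3 module. One assumption in this sketch: the timestamps are stored as text in the fixed-width format shown, so comparing them as strings matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeseries_table (id INTEGER, date_and_time TEXT, entry_category TEXT)"
)
# Sample rows taken from the question.
conn.executemany(
    "INSERT INTO timeseries_table VALUES (?, ?, ?)",
    [
        (1, "2017/01/20 12:00", "type_1"),
        (2, "2017/01/20 13:00", "type_1"),
        (3, "2017/01/20 12:00", "type_2"),
        (4, "2017/01/20 12:00", "type_3"),
    ],
)

# Inner query: latest timestamp per category.
# Outer query: the earliest of those per-category maxima.
least_recent = conn.execute(
    """
    SELECT MIN(date_and_time)
    FROM (
        SELECT MAX(date_and_time) AS date_and_time, entry_category
        FROM timeseries_table
        GROUP BY entry_category
    ) AS a
    """
).fetchone()[0]

print(least_recent)
```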
Is this what you want?
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC;
If several categories tie for the earliest maximum, TOP 1 returns only one of them, chosen arbitrarily. To make the choice deterministic, include an additional sort key:
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category;
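TOP is SQL Server syntax. As a rough equivalent for experimenting, the sketch below uses LIMIT 1 in SQLite (via Python's sqlite3 module) with the sample rows from the question. The alias latest is introduced here for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeseries_table (id INTEGER, date_and_time TEXT, entry_category TEXT)"
)
conn.executemany(
    "INSERT INTO timeseries_table VALUES (?, ?, ?)",
    [
        (1, "2017/01/20 12:00", "type_1"),
        (2, "2017/01/20 13:00", "type_1"),
        (3, "2017/01/20 12:00", "type_2"),
        (4, "2017/01/20 12:00", "type_3"),
    ],
)

# Sort the per-category maxima ascending and keep the first row.
# entry_category is the tie-breaker, so the result is deterministic.
first_row = conn.execute(
    """
    SELECT MAX(date_and_time) AS latest, entry_category
    FROM timeseries_table
    GROUP BY entry_category
    ORDER BY latest ASC, entry_category ASC
    LIMIT 1
    """
).fetchone()

print(first_row)
```

Here type_2 and type_3 tie on the earliest maximum, and the second sort key selects type_2.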

SQL: Getting the latest date using Max() while using group by

I'm struggling to get the correct result with this query:
select max(kts.my_date), kts.name
join ktt on ktt.someId = kts.someOtherId
where ktt.someId = 'example'
group by kts.name;
I have two (possibly stupid) questions:
Will this max() take time into account? I know that order by does if the dates are the same. Does max do the same?
This is connected to my previous question, but when I run the query above, if the dates are same, it orders it by the name. I want the latest date at the top. Do I need to put an order by clause for the date in? If so, using Max is pointless, right?
Thanks for the help.
Yes,
--2
select max(kts.my_date) over (partition by kts.name) as maxdate, kts.name
from -- choose your table
join ktt on ktt.someId = kts.someOtherId
where ktt.someId = 'example'
order by -- choose your column here
Give this a try.
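Both questions can be checked with a small experiment. The sketch below uses SQLite from Python with invented rows; it assumes the missing FROM clause refers to a table named kts and that my_date holds a full date and time stored as ISO text. It uses a plain GROUP BY rather than the window function above, and orders by the aggregated value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ktt (someId TEXT)")
conn.execute("CREATE TABLE kts (someOtherId TEXT, name TEXT, my_date TEXT)")
conn.execute("INSERT INTO ktt VALUES ('example')")
# Hypothetical rows: both names have entries on the same day at different times.
conn.executemany(
    "INSERT INTO kts VALUES (?, ?, ?)",
    [
        ("example", "alpha", "2020-01-01 10:00:00"),
        ("example", "alpha", "2020-01-01 08:00:00"),
        ("example", "beta", "2020-01-01 12:00:00"),
        ("example", "beta", "2019-12-31 23:00:00"),
    ],
)

# MAX() compares the whole value, time included; the explicit ORDER BY
# is still needed to put the latest group first.
result = conn.execute(
    """
    SELECT MAX(kts.my_date) AS maxdate, kts.name
    FROM kts
    JOIN ktt ON ktt.someId = kts.someOtherId
    WHERE ktt.someId = 'example'
    GROUP BY kts.name
    ORDER BY maxdate DESC
    """
).fetchall()

print(result)
```

So MAX() is not pointless: it picks the latest value within each name, while ORDER BY decides the order in which the groups are listed.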

Understanding a Correlated Subquery

I want to create a query that returns the most recent date for a date field and the highest value of an integer field for each "assessment" record. What I think is required is a correlated subquery using the MAX function.
Example data would be as follows.
The date field could have duplicate dates for each assessment, but each row within a duplicate date group would have a different integer in the integer field.
E.g.
1256 2/6/14 0
1256 2/6/14 1
1256 1/6/14 0
4534 3/6/14 0
4534 3/6/14 1
4534 3/6/14 2
select assessment, Max(correctnum) maxofcorrectnum, dateeffect
from lraassm outerassm
where dateeffect =
(select MAX(dateeffect) maxofdateeffect
from pthdbo.lraassm innerassm
where innerassm.assessment = outerassm.assessment
group by innerassm.assessment)
group by assessment, dateeffect
So my theory is that the inner query executes and gives the outer query the criteria for the dateeffect field, and then the outer query returns the maximum of the correctnum field for this dateeffect along with its corresponding assessment and dateeffect.
Could someone please confirm this is correct? How does the subquery handle the rows? What other ways are there to solve this problem? Thanks.
Your query is doing the right thing, but granted, the correlated subquery is a little difficult to understand. What the subquery does is, it filters the records based on assessment from the outer query and then returns the maximum dateeffect for that assessment. In fact, you don't need the group by clause on the correlated query.
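To see the row handling in practice, here is a sketch that runs the correlated query, without the inner GROUP BY, in SQLite from Python. It uses the sample data from the question and assumes the dates are day/month/year, rewritten as ISO text so that they compare correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lraassm (assessment INTEGER, dateeffect TEXT, correctnum INTEGER)"
)
# Sample data from the question, with dates converted to ISO format
# (assumption: 2/6/14 means 2 June 2014).
conn.executemany(
    "INSERT INTO lraassm VALUES (?, ?, ?)",
    [
        (1256, "2014-06-02", 0),
        (1256, "2014-06-02", 1),
        (1256, "2014-06-01", 0),
        (4534, "2014-06-03", 0),
        (4534, "2014-06-03", 1),
        (4534, "2014-06-03", 2),
    ],
)

# For each outer row, the subquery returns the newest dateeffect of the
# same assessment; rows with older dates are filtered out before grouping.
rows = conn.execute(
    """
    SELECT assessment, MAX(correctnum) AS maxofcorrectnum, dateeffect
    FROM lraassm AS outerassm
    WHERE dateeffect = (
        SELECT MAX(dateeffect)
        FROM lraassm AS innerassm
        WHERE innerassm.assessment = outerassm.assessment
    )
    GROUP BY assessment, dateeffect
    ORDER BY assessment
    """
).fetchall()

print(rows)
```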
These types of queries are common when working with data in ERP systems, where you're only interested in the "latest" records, etc. This is also known as a "top segment" type of query (which the query optimizer is sometimes able to figure out by itself). I've found that on SQL Server 2005 or newer it is a lot easier to use the ROW_NUMBER() function. The following query should return the same as yours, namely one record from lraassm for each assessment, with the highest value of dateeffect and correctnum.
select * from (
select
assessment, dateeffect, correctnum,
ROW_NUMBER() OVER (
PARTITION BY assessment
ORDER BY dateeffect DESC, correctnum DESC
) AS segment
from lraassm) AS innerQuery
where segment = 1
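The ROW_NUMBER() approach can also be tried in SQLite, which supports window functions from version 3.25 onwards. The sketch below runs it from Python on the question's sample data, with the dates converted to ISO text on the assumption that they are day/month/year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lraassm (assessment INTEGER, dateeffect TEXT, correctnum INTEGER)"
)
conn.executemany(
    "INSERT INTO lraassm VALUES (?, ?, ?)",
    [
        (1256, "2014-06-02", 0),
        (1256, "2014-06-02", 1),
        (1256, "2014-06-01", 0),
        (4534, "2014-06-03", 0),
        (4534, "2014-06-03", 1),
        (4534, "2014-06-03", 2),
    ],
)

# Number the rows of each assessment from newest date / highest correctnum
# downwards, then keep only the first row of each partition.
top_rows = conn.execute(
    """
    SELECT assessment, dateeffect, correctnum
    FROM (
        SELECT assessment, dateeffect, correctnum,
               ROW_NUMBER() OVER (
                   PARTITION BY assessment
                   ORDER BY dateeffect DESC, correctnum DESC
               ) AS segment
        FROM lraassm
    ) AS innerQuery
    WHERE segment = 1
    ORDER BY assessment
    """
).fetchall()

print(top_rows)
```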
This is the query I worked out using my tables. But it will get you on the right track and you should be able to substitute your fields/tables in.
Select * from Decode
where updated_time = (Select MAX(updated_time) from Decode)
That query gives you every record that has the most recent updated_time. The next query will return the greatest entry_id value as well as the most recent updated_time from those records.
Select MAX(entry_id), updated_time from Decode
where updated_time = (Select MAX(updated_time) from Decode)
group by updated_time
The result is 2 columns 1 record, 1st column is the Maximum value of entry id, the second is the most recent updated_time. Is that what you wanted to return?