Avoiding analytic functions on very large data - SQL

Assume there are 100 million distinct SID values in this table.
Example: Table test, columns SID, date_operation, score
The score changes every day, and I want a report of the most recent score for every SID. I don't want to use an analytic function, as the cost would otherwise be very high. I also tried a self join, but that looks costly too.
If this question is redundant, please direct me to a similar question and I will delete this one.

select sid, max(date_operation)
from test
group by sid

will return what you asked for:

get all the SID with most recent score

Note, though, that this returns only the most recent date_operation per SID, not the score itself; to pull the score you need to join back, or use one of the approaches below.

One method is a correlated subquery:

select t.*
from test t
where t.date_operation = (select max(t2.date_operation)
                          from test t2
                          where t2.sid = t.sid);

This can take advantage of an index on test(sid, date_operation).
However, I have found good performance in Oracle with keep:

select sid, max(date_operation),
       max(score) keep (dense_rank first order by date_operation desc) as most_recent_score
from test
group by sid;
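As a quick sanity check of the correlated-subquery pattern, here is a runnable sketch using Python's stdlib sqlite3 module. The table and sample values are invented for illustration (the real table has 100 million distinct SIDs); the Oracle-specific keep variant cannot be shown in SQLite.

```python
import sqlite3

# Toy stand-in for the question's table: test(sid, date_operation, score).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (sid INTEGER, date_operation TEXT, score INTEGER);
INSERT INTO test VALUES
  (1, '2024-01-01', 10),
  (1, '2024-01-02', 20),
  (2, '2024-01-01', 5),
  (2, '2024-01-03', 7);
""")

# Latest row per sid via the correlated subquery from the answer.
rows = conn.execute("""
SELECT t.sid, t.date_operation, t.score
FROM test t
WHERE t.date_operation = (SELECT MAX(t2.date_operation)
                          FROM test t2
                          WHERE t2.sid = t.sid)
ORDER BY t.sid
""").fetchall()

print(rows)  # [(1, '2024-01-02', 20), (2, '2024-01-03', 7)]
```

On a real engine the subquery is evaluated once per outer row, which is why the composite index on (sid, date_operation) matters at this scale.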

Related

SQL Server : verify that two columns are in same sort order

I have a table with an ID and a date column. It's possible (likely) that when a new record is created, it gets the next larger ID and the current datetime. So if I were to sort by date or I were to sort by ID, the resulting data set would be in the same order.
How do I write a SQL query to verify this?
It's also possible that an older record is modified and the date is updated. In that case, the records would not be in the same sort order. I don't think this happens.
I'm trying to move the data to another location, and if I know that there are no modified records, that makes it a lot simpler.
I'm pretty sure I only need to query those two columns: ID, RecordDate. Other links indicate I should be able to use LAG, but I'm getting an error that it isn't a built-in function name.
In other words, both https://dba.stackexchange.com/questions/42985/running-total-to-the-previous-row and Is there a way to access the "previous row" value in a SELECT statement? should help, but I'm still not able to make that work for what I want.
If you cannot use window functions, you can use a correlated subquery and EXISTS.
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.id < t1.id
                AND t2.recorddate > t1.recorddate);
It'll select all records where another record with a lower ID and a greater timestamp exists. If the result is empty you know that no such record exists and the data is like you want it to be.
Maybe you want to tighten it a bit by using t2.recorddate >= t1.recorddate instead of t2.recorddate > t1.recorddate, depending on whether equal timestamps should count as out of order.
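The EXISTS check can be tried end to end with a small sketch using Python's sqlite3 module (the table name elbat comes from the answer; the three sample rows are invented, with the third row deliberately dated before the second):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE elbat (id INTEGER, recorddate TEXT);
INSERT INTO elbat VALUES
  (1, '2024-01-01'),
  (2, '2024-01-05'),
  (3, '2024-01-03');
""")

# Rows for which an earlier id carries a later timestamp.
out_of_order = conn.execute("""
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.id < t1.id
                AND t2.recorddate > t1.recorddate)
""").fetchall()

print(out_of_order)  # [(3, '2024-01-03')] -- id 3 is dated before id 2
```

An empty result means the ID ordering and the date ordering agree.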
Use this:

SELECT ID, RecordDate
FROM tablename t
WHERE (SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
   <> (SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate);
For each row, it counts how many rows have an ID less than that row's ID, and how many rows have a RecordDate less than that row's RecordDate. If these two counts are not equal, it outputs the row.
The result is all the rows that would not be in the same position after sorting by ID as after sorting by RecordDate.
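Here is the rank-counting variant run on the same kind of invented data, again via Python's sqlite3 module as a sketch. Note a behavioral difference worth knowing: this version flags both rows of a swapped pair, whereas the EXISTS version above flags only the row that comes too early.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablename (ID INTEGER, RecordDate TEXT);
INSERT INTO tablename VALUES
  (1, '2024-01-01'),
  (2, '2024-01-05'),
  (3, '2024-01-03');
""")

# A row is flagged when its rank by ID differs from its rank by RecordDate.
mismatched = conn.execute("""
SELECT ID, RecordDate FROM tablename t
WHERE (SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
   <> (SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate)
ORDER BY ID
""").fetchall()

print(mismatched)  # [(2, '2024-01-05'), (3, '2024-01-03')]
```

Both counting subqueries scan the table per row, so on large tables this is O(n^2) without supporting indexes.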
One method uses window functions:

select count(*)
from (select t.*,
             row_number() over (order by id) as seqnum_id,
             row_number() over (order by date, id) as seqnum_date
      from t
     ) t
where seqnum_id <> seqnum_date;
When the count is zero, then the two columns have the same ordering. Note that the second order by includes id. Two rows could have the same date. This makes the sort stable, so the comparison is valid even when date has duplicates.
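A runnable sketch of this window-function check, using Python's sqlite3 module (SQLite 3.25+ is needed for window functions; the table t and its three rows are invented, with the second and third rows swapped relative to their dates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, date TEXT);
INSERT INTO t VALUES
  (1, '2024-01-01'),
  (2, '2024-01-05'),
  (3, '2024-01-03');
""")

# Compare each row's position in the id ordering with its position
# in the (date, id) ordering; count the disagreements.
mismatches = conn.execute("""
select count(*)
from (select t.*,
             row_number() over (order by id) as seqnum_id,
             row_number() over (order by date, id) as seqnum_date
      from t
     ) t
where seqnum_id <> seqnum_date
""").fetchone()[0]

print(mismatches)  # 2 -> the swapped pair; 0 would mean identical ordering
```

A single pass with two sorts, so it typically scales better than the correlated-count approach.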
The above solutions are all good, but if the IDs increment by exactly one (no gaps), this should also work; it flags each row whose successor by ID has an earlier date:

select modifiedid = t2.id
from yourtable t1
join yourtable t2
  on t1.id = t2.id + 1 and t1.recordDate < t2.recordDate

SQL Query Deduplication / Join Issue

I've been having the worst time trying to write what I feel should be a pretty simple query to deal with duplicate entries.
For context: I've created a data warehouse using Big Query and am using Stitch to pull data from Hubspot. Everything works as expected as in: I have confirmed that I have the right number of records in BigQuery.
The issue comes from how Stitch refreshes data. Instead of updating records based on object id, it appends a new row. According to their documentation, the query below should work, but it doesn't, for the simple reason that multiple versions of a given record can exist with the same _sdc_sequence (which I don't think should happen). There are other _sdc (Stitch system) fields that I could use to help, but they aren't completely reliable either, for the same reasons as above.
SELECT DISTINCT o.*
FROM [sample-table:hubspot.companies] o
INNER JOIN (
    SELECT companyid,
           MAX(_sdc_sequence) AS seq
    FROM [sample-table:hubspot.companies]
    GROUP BY companyid ) oo
  ON o.companyid = oo.companyid
 AND o._sdc_sequence = oo.seq
The query above returns fewer results than it should. If I run the following query, I get the right number of results, but I need the other fields besides companyid like name, description, revenue, etc.
SELECT o.companyid
FROM [samples_table:hubspot.companies] o
GROUP BY o.companyid
I was trying something like this, but it doesn't work (I'm getting the error: Expression 'oo.properties.name.value' is not present in the GROUP BY list).
SELECT o.companyid,
oo.properties.name.value,
oo.properties.hubspot_owner_id.value,
oo.properties.description.value
FROM [sample_table:hubspot.companies] o
LEFT JOIN [sample_table:hubspot.companies] oo
ON o.companyid = oo.companyid
GROUP BY o.companyid
In my mind, the way I'm thinking about this is:
Get list of unique records id (companyid)
Do a SQL "vlookup equivalent" of the raw, ungrouped company table that is sorted by insert time to get the first record that matches the id (which will be the most recent since the table is sorted)
I just don't know how to write this...
Try using window functions:
#standardSQL
SELECT c.*
FROM (SELECT c.*,
             ROW_NUMBER() OVER (PARTITION BY companyid
                                ORDER BY _sdc_sequence DESC) AS seqnum
      FROM `sample-table.hubspot.companies` c
     ) c
WHERE seqnum = 1;
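The ROW_NUMBER deduplication pattern itself is portable, so here is a sketch of it on SQLite via Python's sqlite3 module (the companies table, names, and _sdc_sequence values are invented; BigQuery-specific syntax like the backtick-quoted table path does not apply in SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (companyid INTEGER, name TEXT, _sdc_sequence INTEGER);
INSERT INTO companies VALUES
  (1, 'Acme v1', 100),
  (1, 'Acme v2', 200),
  (2, 'Globex',  100);
""")

# Keep only the row with the highest _sdc_sequence per companyid.
latest = conn.execute("""
SELECT companyid, name
FROM (SELECT c.*,
             ROW_NUMBER() OVER (PARTITION BY companyid
                                ORDER BY _sdc_sequence DESC) AS seqnum
      FROM companies c
     ) c
WHERE seqnum = 1
ORDER BY companyid
""").fetchall()

print(latest)  # [(1, 'Acme v2'), (2, 'Globex')]
```

If duplicate _sdc_sequence values really do occur, ROW_NUMBER still returns exactly one row per companyid, but which duplicate wins is arbitrary unless you add another tiebreaker to the ORDER BY.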
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)]
FROM `sample-table.hubspot.companies` t
GROUP BY companyid

Retrieving last record in each group from database with additional max() condition in MSSQL

This is a follow-up question to Retrieving last record in each group from database - SQL Server 2005/2008
In the answers, this example was provided to retrieve the last record for each group (the example below retrieves the last update for each value of computername):
select t.*
from t
where t.lastupdate = (select max(t2.lastupdate)
from t t2
where t2.computername = t.computername
);
In my case, however, "lastupdate" is not unique (some updates come in batches and share the same lastupdate value; if two updates for the same "computername" arrive in the same batch, you get non-unique output for "computername + lastupdate").
Suppose I also have field "rowId" that is just auto-incremental. The mitigation would be to include in the query another criterion for a max('rowId') field.
NB: while the example employs time-specific name "lastupdate", the actual selection criteria may not be related to the time at all.
I would therefore like to ask: what would be the most performant query that selects the last record in each group, based both on the group-defining parameter (in the case above, "computername") and on the maximal rowId?
If you don't have uniqueness, then row_number() is simpler:

select t.*
from (select t.*,
             row_number() over (partition by computername
                                order by lastupdate desc, rowid desc) as seqnum
      from t
     ) t
where seqnum = 1;
With the right indexes, the correlated subquery is usually faster. However, the performance difference is not that great.
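This tiebreak behavior can be sketched with Python's sqlite3 module. The data below is invented, with two rows for 'pc1' sharing the same lastupdate (a "batch"); the column is named row_id here because rowid has a special built-in meaning in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE updates (computername TEXT, lastupdate TEXT, row_id INTEGER);
INSERT INTO updates VALUES
  ('pc1', '2024-01-01', 1),
  ('pc1', '2024-01-02', 2),
  ('pc1', '2024-01-02', 3),
  ('pc2', '2024-01-01', 4);
""")

# Order by lastupdate desc first, then row_id desc to break ties
# within a batch; pick the top row per computername.
latest = conn.execute("""
select computername, lastupdate, row_id
from (select u.*,
             row_number() over (partition by computername
                                order by lastupdate desc, row_id desc) as seqnum
      from updates u
     ) t
where seqnum = 1
order by computername
""").fetchall()

print(latest)  # [('pc1', '2024-01-02', 3), ('pc2', '2024-01-01', 4)]
```

Without the second sort key, either of the two '2024-01-02' rows for 'pc1' could be returned.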

Find the Max and related fields

Here is my (simplified) problem, very common I guess:
create table sample (client, recordDate, amount)
I want to find out the latest recording, for each client, with recordDate and amount.
I wrote the code below, which works, but I wonder if there is a better pattern or Oracle tweak to improve the efficiency of such a SELECT. (I am not allowed to modify the structure of the database, so indexes etc. are out of reach for me and out of scope for this question.)
select s.client, s.recordDate, s.Amount
from sample s
inner join (select client, max(recordDate) lastDate
            from sample
            group by client) t
  on s.client = t.client and s.recordDate = t.lastDate

The table has half a million records and the select takes 2-4 seconds, which is acceptable, but I am curious to see whether that can be improved.
Thanks
In most cases Windowed Aggregate Functions perform better (at least they're easier to write):

select client, recordDate, Amount
from (select client, recordDate, Amount,
             rank() over (partition by client order by recordDate desc) as rn
      from sample s
     ) dt
where rn = 1
Another structure for the query is not exists. This can perform faster under some circumstances:

select client, recordDate, Amount
from sample s
where not exists (select 1
                  from sample s2
                  where s2.client = s.client and
                        s2.recordDate > s.recordDate
                 );
This would take good advantage of an index on sample(client, recordDate), if one were available.
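A runnable sketch of the not exists variant, using Python's sqlite3 module with a tiny invented dataset matching the question's sample(client, recordDate, amount) shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (client TEXT, recordDate TEXT, amount INTEGER);
INSERT INTO sample VALUES
  ('c1', '2024-01-01', 100),
  ('c1', '2024-02-01', 150),
  ('c2', '2024-01-15', 80);
""")

# A row is the latest for its client when no later row exists for that client.
latest = conn.execute("""
select client, recordDate, amount
from sample s
where not exists (select 1
                  from sample s2
                  where s2.client = s.client
                    and s2.recordDate > s.recordDate)
order by client
""").fetchall()

print(latest)  # [('c1', '2024-02-01', 150), ('c2', '2024-01-15', 80)]
```

Note that if a client has two rows tied on the latest recordDate, this form returns both of them, unlike the keep version.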
And, another thing to try is keep:
select client, max(recordDate),
max(Amount) keep (dense_rank first order by recordDate desc)
from sample s
group by client;
This version assumes only one max record date per client (your original query does not make that assumption).
These queries (plus the one by dnoeth) should all have different query plans and you might get lucky on one of them. The best solution, though, is to have the appropriate index.

SQL query to get single row value from an aggregate

I have an Oracle table with two columns, ID and START_DATE. I want to run a query to get the ID of the record with the most recent date. Initially I wrote this:
select id from (select * from mytable order by start_date desc) where rownum = 1
Is there a cleaner, more efficient way of doing this? I often run into this pattern in SQL and end up creating a nested query.
SELECT id FROM mytable WHERE start_date = (SELECT MAX(start_date) FROM mytable)
Still a nested query, but more straightforward and also, in my experience, more standard.
This looks to be a pretty clean and efficient solution to me - I don't think you can do much better, assuming you have an index on start_date. If you want all ids for the latest start date, then froadie's solution is better.
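The tie behavior mentioned above is easy to see in a sketch using Python's sqlite3 module (the rows are invented, with two ids deliberately sharing the latest date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, start_date TEXT);
INSERT INTO mytable VALUES
  (1, '2024-01-01'),
  (2, '2024-03-01'),
  (3, '2024-03-01');
""")

# The scalar MAX subquery returns every row tied for the latest date.
ids = conn.execute("""
SELECT id FROM mytable
WHERE start_date = (SELECT MAX(start_date) FROM mytable)
ORDER BY id
""").fetchall()

print(ids)  # [(2,), (3,)]
```

The rownum = 1 form from the question would instead return a single arbitrary one of the tied ids.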