Getting a single row based on unique column - sql

I think this is mostly a terminology issue, where I'm having a hard time articulating a problem.
I've got a table with a couple of columns that manage some historical log data. The two columns I'm interested in are timestamp (or Id, as the id is generated sequentially) and terminalID.
I'd like to supply a list of terminal ids and find only the latest data, that is, the row with the highest id or timestamp per terminalID.
I ended up using the GROUP BY solution that @Danny suggested, and also tried the other solution he referenced.
I found the time difference to be quite noticeable, so I'm posting both results here for anyone's reference.
S1:
SELECT UR.* FROM(
SELECT TerminalID, MAX(ID) as lID
FROM dbo.Results
WHERE TerminalID in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
GROUP BY TerminalID
) GT left join dbo.Results UR on UR.id=lID
S2
SELECT *
FROM (
SELECT TOP 100
Row_Number() OVER (PARTITION BY terminalID ORDER BY Id DESC) AS [Row], *
FROM dbo.Results
WHERE TerminalID in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
ORDER BY Id DESC
) a
WHERE a.row=1
the results were:
S1:
CPU time = 297 ms, elapsed time = 343 ms.
Query Cost 36%
Missing index impact - 94%
S2:
CPU time = 562 ms, elapsed time = 1000 ms.
Query Cost 64%
Missing index impact - 41%
After adding the missing index to S1 (indexing ID only, as opposed to S2, where multiple columns needed an index), I got the query down to 15 ms.

use the TOP keyword:
SELECT TOP 1 ID, terminalID FROM MyTable WHERE <your condition> ORDER BY <something that orders it like you need so that the correct top row is returned>.

I think you're on the right track with GROUP BY. Sounds like you want:
SELECT TerminalID, MAX(Timestamp) AS LastTimestamp
FROM [Table_Name]
WHERE TerminalID IN (.., .., .., ..)
GROUP BY TerminalID

While not as obvious as using MAX with a GROUP BY, this can offer extra flexibility if you need to have more than one column determining which row or rows you want pulled back.
SELECT *
FROM (
    SELECT
        Row_Number() OVER (PARTITION BY terminalID ORDER BY Id DESC) AS [Row],
        [terminalID],[Id],[timestamp]
    FROM <TABLE>
    -- no ORDER BY here: SQL Server rejects ORDER BY in a derived table unless TOP is also specified
) a
WHERE a.row=1
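To compare the two approaches from this thread side by side, here is a minimal sketch that runs both against a tiny in-memory SQLite database from Python. The table contents and the Reading column are made up for illustration, and SQLite (3.25 or later, for window functions) stands in for SQL Server, so timings will not carry over; it only shows that both queries pick the same rows.

```python
import sqlite3

# Hypothetical sample data in a stand-in Results table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Results (Id INTEGER PRIMARY KEY, TerminalID INTEGER, Reading TEXT)"
)
conn.executemany(
    "INSERT INTO Results (Id, TerminalID, Reading) VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 2, "b"), (3, 1, "c"), (4, 3, "d"), (5, 2, "e"), (6, 1, "f")],
)

# Approach 1: find MAX(Id) per terminal, then join back to fetch the full row.
group_by_sql = """
SELECT r.Id, r.TerminalID, r.Reading
FROM (SELECT TerminalID, MAX(Id) AS lID
      FROM Results
      WHERE TerminalID IN (1, 2)
      GROUP BY TerminalID) g
JOIN Results r ON r.Id = g.lID
ORDER BY r.TerminalID
"""

# Approach 2: number the rows per terminal, newest first, and keep row 1.
row_number_sql = """
SELECT Id, TerminalID, Reading
FROM (SELECT r.*,
             ROW_NUMBER() OVER (PARTITION BY TerminalID ORDER BY Id DESC) AS rn
      FROM Results r
      WHERE TerminalID IN (1, 2)) a
WHERE rn = 1
ORDER BY TerminalID
"""

latest_by_group = conn.execute(group_by_sql).fetchall()
latest_by_row_number = conn.execute(row_number_sql).fetchall()
print(latest_by_group)
print(latest_by_row_number)
```

Both queries return the newest row for terminals 1 and 2; which one is faster depends on the engine and the available indexes, as the timings in the question suggest.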

Related

ROW_NUMBER with PARTITION BY & ORDER BY performing very slowly in Hive (3 million rows)

I have a Hive table with 50 columns and over 3 million records. The requirement is to fetch the latest 200 records based on the date column, so I applied a row_number function. It worked really well initially, when the number of records was under 100K; unfortunately, it runs forever now. Is there any particular optimization technique that I can try?
It is a partitioned table and this is the implementation for more details: ROW_NUMBER() OVER (PARTITION BY date, rule_id, run_id ORDER BY load_date DESC) as rule_row_num from table
My guess is that the column you are partitioning by is skewed. You should group by that column and count the number of rows to confirm.
If the data really is skewed, a general way to optimize is to introduce an extra stage up front that partially aggregates and reduces the data.
In your case, you can try the following.
Suppose your original SQL looks like this:
select * from (
select
ca,
cb,
row_number() over(partition by ca order by cb desc) as rk1
from my_table
) tmp
where rk1 <=200;
The optimized version would look something like this:
with tt1 as(
select
ca,
cb,
row_number() over(partition by ca, cast(rand()*20 as int) order by cb desc) as rk1
from my_table
)
select * from (
select
ca,
cb,
row_number() over(partition by ca order by cb desc) as rk2
from tt1
where rk1 <= 200
) tmp
where rk2 <= 200;
When skew happens for the column ca, the second SQL is expected to be faster than the first one. More parallelism is added at the first stage by the cast(rand()*20 as int). And after that, the predicate where rk1 <= 200 is performed, significantly reducing data to be processed later.
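The two-stage idea can be checked on a small scale. The sketch below is only an illustration: it uses SQLite from Python instead of Hive, a made-up skewed data set, a top-3 cutoff instead of 200, and ABS(RANDOM() % 20) as a stand-in for cast(rand()*20 as int). It shows that pre-filtering inside random buckets does not change the final top-N per key; it says nothing about Hive run times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (ca TEXT, cb INTEGER)")
# Hypothetical skewed data: key 'hot' has far more rows than key 'cold'.
rows = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(5)]
conn.executemany("INSERT INTO my_table VALUES (?, ?)", rows)

TOP_N = 3  # the thread uses 200; kept small here for readability

# Single-stage version: one window per key.
direct_sql = """
SELECT ca, cb FROM (
    SELECT ca, cb,
           ROW_NUMBER() OVER (PARTITION BY ca ORDER BY cb DESC) AS rk1
    FROM my_table
) WHERE rk1 <= ?
ORDER BY ca, cb DESC
"""

# Two-stage version: top N inside each (key, random bucket), then top N per key.
two_stage_sql = """
WITH salted AS (
    SELECT ca, cb, ABS(RANDOM() % 20) AS salt FROM my_table
),
tt1 AS (
    SELECT ca, cb,
           ROW_NUMBER() OVER (PARTITION BY ca, salt ORDER BY cb DESC) AS rk1
    FROM salted
)
SELECT ca, cb FROM (
    SELECT ca, cb,
           ROW_NUMBER() OVER (PARTITION BY ca ORDER BY cb DESC) AS rk2
    FROM tt1
    WHERE rk1 <= ?
) WHERE rk2 <= ?
ORDER BY ca, cb DESC
"""

direct = conn.execute(direct_sql, (TOP_N,)).fetchall()
two_stage = conn.execute(two_stage_sql, (TOP_N, TOP_N)).fetchall()
print(direct == two_stage)
```

The results match whatever buckets the rows land in, because every row in a key's overall top N is necessarily within the top N of its own bucket.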

I need to find top 2 most frequently occurring device_id and how many time they occur

I have a table like this
I want to find top 2 most frequently occurring device ids with their counts.
device_id count
32145678665 3
3214567866555 4
I'm only really answering because it is slightly more tricky than simply using GROUP BY and COUNT()
SELECT *
FROM(
SELECT device_id , COUNT(*)
FROM table_name
GROUP BY device_id
ORDER BY COUNT(*) DESC
)
WHERE rownum <=2
The subquery (inline view) will find all device_ids and how often they come up as well as order them from most frequent to least frequent.
Then we can query that inline view and keep only the two most frequent rows by using the Oracle pseudocolumn ROWNUM.
select top 2 DEVICEID,COUNT(DEVICEID) as CountOfDevice from yourtable
group by DEVICEID
order by COUNT(DEVICEID) DESC
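For a quick way to try the same idea without Oracle or SQL Server, here is a small sketch using SQLite from Python. The two device ids and their counts come from the expected output in the question; the third id and the single-column table layout are assumptions. SQLite has neither ROWNUM nor TOP, so LIMIT plays that role here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (device_id TEXT)")
# Two ids taken from the question's expected output, plus one made-up id.
sample = ["32145678665"] * 3 + ["3214567866555"] * 4 + ["999"]
conn.executemany("INSERT INTO table_name VALUES (?)", [(d,) for d in sample])

# Count rows per device, sort by frequency, keep the first two.
top_two = conn.execute(
    """
    SELECT device_id, COUNT(*) AS cnt
    FROM table_name
    GROUP BY device_id
    ORDER BY cnt DESC
    LIMIT 2
    """
).fetchall()
print(top_two)
```

If two devices tie for second place, which of them is returned is not defined unless a tie-breaking column is added to the ORDER BY.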

Select all but last row in Oracle SQL

I want to pull all rows except the last one in Oracle SQL
My database is like this
Prikey - Auto_increment
common - varchar
miles - int
So I want to sum all rows except the last row ordered by primary key grouped by common. That means for each distinct common, the miles will be summed (except for the last one)
Note: the question was changed after this answer was posted. The first two queries work for the original question. The last query (in the addendum) works for the updated question.
This should do the trick, though it will be a bit slow for larger tables:
SELECT prikey, authnum FROM myTable
WHERE prikey <> (SELECT MAX(prikey) FROM myTable)
ORDER BY prikey
This query is longer, but for a large table it should be faster. I'll leave it to you to decide:
SELECT * FROM (
SELECT
prikey,
authnum,
ROW_NUMBER() OVER (ORDER BY prikey DESC) AS RowRank
FROM myTable)
WHERE RowRank <> 1
ORDER BY prikey
Addendum There was an update to the question; here's the updated answer.
SELECT
common,
SUM(miles)
FROM (
SELECT
common,
miles,
ROW_NUMBER() OVER (PARTITION BY common ORDER BY prikey DESC) AS RowRank
FROM myTable
)
WHERE RowRank <> 1
GROUP BY common
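The addendum query can be tried on a few made-up rows. This sketch runs it in SQLite from Python (an assumption; the thread targets Oracle) with hypothetical values for common and miles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (prikey INTEGER PRIMARY KEY, common TEXT, miles INTEGER)"
)
# Hypothetical rows: 'a' has three rows, 'b' has two, 'c' has only one.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(1, "a", 10), (2, "a", 20), (3, "b", 5), (4, "a", 30), (5, "b", 7), (6, "c", 9)],
)

# Rank rows per group from newest to oldest, drop rank 1, sum the rest.
sums_without_last = conn.execute(
    """
    SELECT common, SUM(miles)
    FROM (
        SELECT common, miles,
               ROW_NUMBER() OVER (PARTITION BY common ORDER BY prikey DESC) AS RowRank
        FROM myTable
    )
    WHERE RowRank <> 1
    GROUP BY common
    ORDER BY common
    """
).fetchall()
print(sums_without_last)
```

Note that a group with a single row, such as 'c' here, disappears from the output entirely, because its only row is the one being excluded.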
Looks like I am a little too late but here is my contribution, similar to Ed Gibbs' first solution but instead of calculating the max id for each value in the table and then comparing I get it once using an inline view.
SELECT d1.prikey,
d1.authnum
FROM myTable d1,
(SELECT MAX(prikey) prikey FROM myTable) d2
WHERE d1.prikey != d2.prikey
At least I think this is more efficient if you want to go without the use of Analytics.
Query to retrieve all the records in the table except the first row and the last row (note that TOP is SQL Server syntax, not Oracle):
select * from table_name
where primary_id_column not in
(
select top 1 primary_id_column from table_name order by primary_id_column asc
)
and
primary_id_column not in
(
select top 1 primary_id_column from table_name order by primary_id_column desc
)

Over clause in SQL Server

I have the following query
select * from
(
SELECT distinct
rx.patid
,rx.fillDate
,rx.scriptEndDate
,MAX(datediff(day, rx.filldate, rx.scriptenddate)) AS longestScript
,rx.drugClass
,COUNT(rx.drugName) over(partition by rx.patid,rx.fillDate,rx.drugclass) as distinctFamilies
FROM [I 3 SCI control].dbo.rx
where rx.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
GROUP BY rx.patid, rx.fillDate, rx.scriptEndDate,rx.drugName,rx.drugClass
) r
order by distinctFamilies desc
which produces results that look like
This should mean that, for the patID and the two dates shown in the table, there are 5 unique drug names. However, when I run the following query:
select distinct *
from rx
where patid = 1358801781 and fillDate between '2008-10-17' and '2008-11-16' and drugClass='H4B'
I have a result set returned that looks like
You can see that while there are in fact five rows returned for the second query between the dates of 2008-10-17 and 2009-01-15, there are only three unique names. I've tried various ways of modifying the over clause, all with different levels of non-success. How can I alter my query so that I only find unique drugNames within the timeframe specified for each row?
Taking a shot at it:
SELECT DISTINCT
patid,
fillDate,
scriptEndDate,
MAX(DATEDIFF(day, fillDate, scriptEndDate)) AS longestScript,
drugClass,
MAX(rn) OVER(PARTITION BY patid, fillDate, drugClass) as distinctFamilies
FROM (
SELECT patid, fillDate, scriptEndDate, drugClass,rx.drugName,
DENSE_RANK() OVER(PARTITION BY patid, fillDate, drugClass ORDER BY drugName) as rn
FROM [I 3 SCI control].dbo.rx
WHERE drugClass IN ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
)x
GROUP BY x.patid, x.fillDate, x.scriptEndDate,x.drugName,x.drugClass,x.rn
ORDER BY distinctFamilies DESC
Not sure if DISTINCT is really necessary - left it in since you've used it.
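The reason the DENSE_RANK approach helps is that a windowed COUNT counts rows, not distinct values, while the highest dense rank within a partition equals the number of distinct values in it. A minimal sketch in SQLite from Python, with a simplified rx table and made-up drug names (both assumptions), shows the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rx (patid INTEGER, fillDate TEXT, drugClass TEXT, drugName TEXT)"
)
# Hypothetical data: five rows, but only three distinct drug names.
names = ["drugA", "drugA", "drugB", "drugC", "drugC"]
conn.executemany(
    "INSERT INTO rx VALUES (1, '2008-10-17', 'H4B', ?)", [(n,) for n in names]
)

# row_count uses COUNT() OVER; distinct_names uses MAX over the dense rank.
counts = conn.execute(
    """
    SELECT DISTINCT patid, fillDate, drugClass,
           COUNT(drugName) OVER (PARTITION BY patid, fillDate, drugClass) AS row_count,
           MAX(rn) OVER (PARTITION BY patid, fillDate, drugClass) AS distinct_names
    FROM (
        SELECT patid, fillDate, drugClass, drugName,
               DENSE_RANK() OVER (PARTITION BY patid, fillDate, drugClass
                                  ORDER BY drugName) AS rn
        FROM rx
    ) x
    """
).fetchall()
print(counts)
```

Here row_count comes out as 5 and distinct_names as 3, which mirrors the mismatch described in the question.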

Fastest/most efficient way to perform this SQL Server 2008 query?

I have a table which contains:
-an ID for a financial instrument
-the price
-the date the price was recorded
-the actual time the price was recorded
-the source of the price
I want to get the index ID, the latest price, price source and the date of this latest price, for each instrument, where the source is either "L" or "R". I prefer source "L" to "R", but the latest price is more important (so if the latest price date only has a source of "R"- take this, but if for the latest date we have both, take "L").
This is the SQL I have:
SELECT tab1.IndexID, tab1.QuoteDate, tab2.Source, tab2.ActualTime FROM
(SELECT IndexID, Max(QuoteDate) as QuoteDate FROM PricesTable GROUP BY IndexID) tab1
JOIN
(SELECT IndexID, Min(Source) AS Source, Max(UpdatedTime) AS ActualTime, QuoteDate FROM PricesTable WHERE Source IN ('L','R') GROUP BY IndexID, QuoteDate) tab2
ON tab1.IndexID = tab2.IndexID AND tab1.QuoteDate = tab2.QuoteDate
However, I also want to extract the price field but cannot get this due to the GROUP BY clause. I cannot extract the price without including price in either the GROUP BY, or an aggregate function.
Instead, I have had to join the above SQL code to another piece of SQL which just gets the prices and index IDs and joins on the index ID.
Is there a faster way of performing this query?
EDIT: thanks for the replies so far. Would it be possible to have some advice on which are more efficient in terms of performance?
Thanks
Use ROW_NUMBER within a subquery or CTE to order the rows in the way you're interested in, then just select the rows that come at the top of that ordering. (Use PARTITION BY so that row numbers are reassigned starting at 1 for each IndexID):
;WITH OrderedValues as (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY IndexID ORDER BY QuoteDate desc,Source asc) as rn
FROM
PricesTable
)
SELECT * from OrderedValues where rn=1
Try:
select * from
(select p.*,
row_number() over (partition by IndexID
order by QuoteDate desc, Source) rn
from PricesTable p
where Source IN ('L','R')
) sq
where rn = 1
(This syntax should work in relatively recent versions of Oracle, SQL Server or PostgreSQL. MySQL only supports window functions from version 8.0 onward, so it won't work in older MySQL versions.)
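To see how the ORDER BY QuoteDate DESC, Source ordering implements the "latest date first, then prefer L over R" rule and still returns the price, here is a small sketch in SQLite from Python. The table is cut down to four columns and the rows are invented for illustration; it is not a performance comparison for SQL Server 2008.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PricesTable (IndexID INTEGER, Price REAL, QuoteDate TEXT, Source TEXT)"
)
# Hypothetical prices. Index 1 has both L and R on its latest date,
# index 2 has only R on its latest date, index 3's latest row has another source.
conn.executemany(
    "INSERT INTO PricesTable VALUES (?, ?, ?, ?)",
    [
        (1, 100.0, "2020-01-01", "L"),
        (1, 101.0, "2020-01-02", "R"),
        (1, 102.0, "2020-01-02", "L"),
        (2, 50.0, "2020-01-01", "L"),
        (2, 51.0, "2020-01-03", "R"),
        (3, 9.0, "2020-01-05", "X"),
        (3, 8.0, "2020-01-04", "R"),
    ],
)

# Newest date wins; on the same date 'L' sorts before 'R'.
latest_prices = conn.execute(
    """
    SELECT IndexID, Price, QuoteDate, Source
    FROM (
        SELECT p.*,
               ROW_NUMBER() OVER (PARTITION BY IndexID
                                  ORDER BY QuoteDate DESC, Source) AS rn
        FROM PricesTable p
        WHERE Source IN ('L', 'R')
    ) sq
    WHERE rn = 1
    ORDER BY IndexID
    """
).fetchall()
print(latest_prices)
```

Because the whole row is carried through the window, the price comes along without needing a GROUP BY or a second join.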