Query historized data - SQL

To describe my query problem, the following data is helpful:
A single table contains the columns ID (int), VAL (varchar) and ORD (int)
The value of VAL may change over time; older rows for an ID are never updated, new rows are appended instead. The latest valid row for an ID is the one with the highest ORD value (ORD increases over time).
T0, T1 and T2 are points in time where data got entered.
How do I get to the result set efficiently?
A solution must not involve materialized views etc. and should be expressible as a single SQL query. Using PostgreSQL 9.3.
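For illustration, a minimal sketch of the setup (the column names come from the question; the sample values are made up, and the sysid surrogate column referenced by some answers below is omitted):
-- sample data entered at T0, T1 and T2 (values are assumptions)
CREATE TABLE my_table (id int, val varchar, ord int);

INSERT INTO my_table VALUES
(1, 'A',  0),  -- T0
(2, 'B',  0),  -- T0
(1, 'A1', 1),  -- T1: new version of id 1
(2, 'B1', 2);  -- T2: new version of id 2

-- desired result: the newest row per id, i.e. (1, 'A1', 1) and (2, 'B1', 2)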

The correct way to select a groupwise maximum in Postgres is DISTINCT ON:
SELECT DISTINCT ON (id) sysid, id, val, ord
FROM my_table
ORDER BY id, ord DESC;

You want all records for which no newer record exists:
select *
from mytable
where not exists
(
select *
from mytable newer
where newer.id = mytable.id
and newer.ord > mytable.ord
)
order by id;
You can do the same with row numbers. Give the latest entry per ID the number 1 and keep these:
select sysid, id, val, ord
from
(
select
sysid, id, val, ord,
row_number() over (partition by id order by ord desc) as rn
from mytable
) t
where rn = 1
order by id;

Left join the table (A) against itself (B) on the condition that B is more recent than A. Pick only the rows where B does not exist (i.e. A is the most recent row).
SELECT last_value.*
FROM my_table AS last_value
LEFT JOIN my_table
ON my_table.id = last_value.id
AND my_table.ord > last_value.ord
WHERE my_table.id IS NULL;

Related

Foreach/per-item iteration in SQL

I'm new to SQL and I think I must just be missing something, but I can't find any resources on how to do the following:
I have a table with three relevant columns: id, creation_date, latest_id. latest_id refers to the id of another entry (a newer revision).
For each entry, I would like to find the min creation date of all entries with latest_id = this.id. How do I perform this type of iteration in SQL / reference the value of the current row in an iteration?
select
t.id, min(t2.creation_date) as min_creation_date
from
mytable t
left join
mytable t2 on t2.latest_id = t.id
group by
t.id
You could solve this with a loop, but it's nowhere near the best strategy. Instead, try this:
SELECT tf.id, tf.Creation_Date
FROM
(
SELECT t0.id, t1.Creation_Date,
row_number() over (partition by t0.id order by t1.creation_date) rn
FROM [MyTable] t0 -- table prime
INNER JOIN [MyTable] t1 ON t1.latest_id = t0.id -- table 1
) tf -- table final
WHERE tf.rn = 1
This connects the id to the latest_id by joining the table to itself. Then it uses a windowing function to help identify the smallest Creation_Date for each match.

How to select rows corresponding to a randomly selected column value in SQL

My query returns a result like the one shown in the table below. I would like to randomly pick an ID from the ID column and get all the rows having that ID. How can I do that in Snowflake or SQL?
ID    Postalcode  Value  ...
1e3d  NK25F4      3214   ...
1e3d  NK25F4      3258   ...
1e3d  NK25F4      3354   ...
1f74  NG2LK8      5524
1f74  NG2LK8      5548
3e9a  N6B7H4      3694
3e9a  N6B7H4      3325
38e4  N6C7H2      3654
...
There is a Snowflake clause, SAMPLE, that returns a fixed number of "random" rows, so using it reduces the need to read all rows.
SELECT t.*
FROM your_table as t
JOIN (SELECT ID FROM your_table SAMPLE (1 ROWS)) as r
ON t.id = r.id
thus using your data above:
with your_table(id, postalcode, value) as (
select * from values
('1e3d', 'NK25F4', 3214),
('1e3d', 'NK25F4', 3258),
('1e3d', 'NK25F4', 3354),
('1f74', 'NG2LK8', 5524),
('1f74', 'NG2LK8', 5548),
('3e9a', 'N6B7H4', 3694),
('3e9a', 'N6B7H4', 3325),
('38e4', 'N6C7H2', 3654)
)
SELECT t.*
FROM your_table as t
JOIN (SELECT ID FROM your_table SAMPLE (1 ROWS)) as r
ON t.id = r.id
I get a random set, but one run looks like:
ID    POSTALCODE  VALUE
1f74  NG2LK8      5,524
1f74  NG2LK8      5,548
You could also use a NATURAL JOIN like:
SELECT *
FROM your_table
NATURAL JOIN (SELECT ID FROM your_table SAMPLE (1 ROWS))
You could put your existing query in a common table expression, then pick a random ID from it, and use it to filter the dataset:
with
dat as ( ... your query ...),
tid as (select id from dat order by random() fetch first 1 row)
select d.*
from dat d
inner join tid t on t.id = d.id
The second CTE, tid, picks the random id; it does that by randomly ordering the dataset, then taking the id of the top row.
Something like
SELECT *
FROM Table_NAME
WHERE ID = (SELECT ID FROM Table_Name ORDER BY RAND() LIMIT 1);
Should work, though it's not particularly efficient, and in many application scenarios it would arguably be more reasonable to compute the random ID in your application (e.g. keeping the set of all ids cached and pulling it separately on a schedule if need be).
(Note: the query assumes MySQL; other dialects may use slightly different keywords/structure, e.g. for the random function.)
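As a rough sketch of that application-side idea (table and column names are placeholders, and :chosen_id is a bind parameter supplied by the application):
-- fetched once and cached in the application (refresh periodically if needed)
SELECT DISTINCT ID FROM Table_Name;

-- the application picks :chosen_id at random from the cached list,
-- then fetches only that id's rows
SELECT * FROM Table_Name WHERE ID = :chosen_id;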
WITH DATA AS (
select '1e3d' id,'NK25F4' postalcode,3214 some_value union all
select '1e3d' id,'NK25F4' postalcode,3258 some_value union all
select '1e3d' id,'NK25F4' postalcode,3354 some_value union all
select '1f74' id,'NG2LK8' postalcode,5524 some_value union all
select '1f74' id,'NG2LK8' postalcode,5548 some_value union all
select '3e9a' id,'N6B7H4' postalcode,3694 some_value union all
select '3e9a' id,'N6B7H4' postalcode,3325 some_value union all
select '38e4' id,'N6C7H2' postalcode,3654 some_value )
SELECT * FROM DATA, LATERAL (SELECT ID FROM DATA SAMPLE(2 ROWS)) I WHERE I.ID = DATA.ID
You can also play with the window frame a little and let qualify do the work
select *
from your_table
qualify id=first_value(id) over (order by random() rows between unbounded preceding and unbounded following)
Snowflake deviates from the ANSI standard on the default window frames for rank-related functions (first_value, last_value, nth_value), so that makes the above equivalent to:
select *
from your_table
qualify id=first_value(id) over (order by random())

How to delete the duplicate data in table (Postgres)

I want to delete the duplicated data in a table , I know there is a way use
SELECT
fruit,
COUNT( fruit )
FROM
basket
GROUP BY
fruit
HAVING
COUNT( fruit )> 1
ORDER BY
fruit;
to find them, but I need to determine that every column's value is equal, which means tableA.* = tableA.* (except id; id is the auto-increment primary key)
and I tried this:
SELECT
*,
COUNT( * )
FROM
myTable
GROUP BY
*
HAVING
COUNT( * )> 1
ORDER BY
id;
but it says I can't use GROUP BY *, so how can I find & delete the duplicated data (every column's value equal except id)?
Using
SELECT DISTINCT *
DISTINCT removes duplicated rows from the result set.
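If rebuilding the table is acceptable, a hedged sketch of applying that idea with Postgres' DISTINCT ON (col1 and col2 stand in for all the non-id columns; adjust to the real schema):
-- keeps one arbitrary id per group of otherwise-identical rows;
-- assumes nothing else writes to the table while it is swapped out
CREATE TABLE mytable_dedup AS
SELECT DISTINCT ON (col1, col2) *
FROM mytable
ORDER BY col1, col2, id;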
You need to try something similar to the query below. Apply PARTITION BY to the columns other than Id (as it is an incrementing unique value). PARTITION BY should list the columns for which you want to check duplicates.
Also refer to Row_Number in Postgres & Common Table expression in Postgres
WITH DuplicateTableRows AS
(
SELECT Id, Row_Number() OVER (PARTITION BY col1, col2... ORDER BY Id) AS row_number
FROM
Table1
)
DELETE FROM Table1
WHERE Id IN (SELECT Id FROM DuplicateTableRows WHERE row_number > 1)
You can do this using JSON:
select (to_jsonb(b) - 'id')
from basket b
group by 1
having count(*) > 1;
The result is JSON. Unfortunately, to extract the values back into a record, you need to list the columns individually.
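For example, a sketch of pulling the values back out into columns (fruit and price are assumed column names; replace them with the table's real columns):
select dup.j ->> 'fruit' as fruit,
(dup.j ->> 'price')::int as price
from (
select (to_jsonb(b) - 'id') as j
from basket b
group by 1
having count(*) > 1
) dup;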

Filter SQL data by repetition on a column

Very simple basic SQL question here.
I have this table:
Row  Id          Hour  Minute  City_Search
1    1409346767  23    24      Balears (Illes)
2    1409346767  23    13      Albacete
3    1409345729  23    7       Balears (Illes)
4    1409345729  23    3       Balears (Illes)
5    1409345729  22    56      Balears (Illes)
What I want is only one distinct row per Id, selecting the last City_Search made by that Id.
So, in this case, the result would be:
Row  Id          Hour  Minute  City_Search
1    1409346767  23    24      Balears (Illes)
3    1409345729  23    7       Balears (Illes)
What's the easiest way to do it?
Obviously I don't want to delete any data, just query it.
Thanks for your time.
SELECT T.Row,
T.Id,
T.Hour,
T.Minute,
T.City_Search
FROM Table T
JOIN
(
SELECT MIN(Row) AS Row,
ID
FROM Table
GROUP BY ID
) AS M
ON M.Row = T.Row
AND M.ID = T.ID
Can you change hour/minute to a timestamp?
What you want in this case is to first select what uniquely identifies your row:
Select id, max(time) from [table] group by id
Then join that query back to the table to pull in the rest of the data.
SELECT tdata.id, tdata.city_search, tdata.time
FROM (SELECT id, max(time) as lasttime FROM [table] GROUP BY id) as Tkey
INNER JOIN [table] as tdata
ON tkey.id = tdata.id AND tkey.lasttime = tdata.time
That should do it.
Two options to do it without a join:
Use the Row_Number function to find the last one:
Select * FROM
(Select *,
row_number() over(Partition BY ID Order BY Hour desc, Minute Desc) as RNB
from table) t
Where RNB=1
Or manipulate the string and use a simple Max function:
Select ID, Right(MAX(Concat(Hour, Minute, RPAD(City_Search, 20, ' '))), 20)
From Table
Group by ID
Avoiding joins is often faster, depending on table size and indexes.
Hope this helps

SQL Select highest values from table on two (or more) columns

Not sure if there's an elegant way to achieve this:
Data
ID Ver recID (loads more columns of stuff)
1 1 1
2 2 1
3 3 1
4 1 2
5 1 3
6 2 3
So, we have ID as the primary key, Ver as the version and recID as a record ID (an arbitrary base ID to tie all the versions together).
So I'd like to select from the following data, rows 3, 4 and 6. i.e. the highest version for a given record ID.
Is there a way to do this with one SQL query? Or would I need to do a SELECT DISTINCT on the record ID, then a separate query to get the highest value? Or pull the lot into the application and filter from there?
A GROUP BY would be sufficient to get each maximum version for every recID.
SELECT Ver = MAX(Ver), recID
FROM YourTable
GROUP BY
recID
If you also need the corresponding ID, you can wrap this into a subselect
SELECT yt.*
FROM Yourtable yt
INNER JOIN (
SELECT Ver = MAX(Ver), recID
FROM YourTable
GROUP BY
recID
) ytm ON ytm.Ver = yt.Ver AND ytm.recID = yt.RecID
or, depending on the SQL Server version you are using, use ROW_NUMBER
SELECT *
FROM (
SELECT ID, Ver, recID
, rn = ROW_NUMBER() OVER (PARTITION BY recID ORDER BY Ver DESC)
FROM YourTable
) yt
WHERE yt.rn = 1
Getting maximum ver for a given recID is easy. To get the ID, you need to join on a nested query that gets these maximums:
select x.ID, x.ver, x.recID from table x
inner join
(select max(ver) as ver, recID
from table
group by recID) y
on x.ver = y.ver and x.recID = y.recID
You could use a CTE with the ROW_NUMBER function:
WITH cte AS(
SELECT ID, Ver, recID
, ROW_NUMBER()OVER(PARTITION BY recID ORDER BY Ver DESC)as RowNum
FROM data
)
SELECT ID,Ver,recID FROM cte
WHERE RowNum = 1
A straightforward example using a subquery:
SELECT a.*
FROM tab a
WHERE ver = (
SELECT max(ver)
FROM tab b
WHERE b.recId = a.recId
)
(Note: this assumes that the combination of (recId, ver) is unique. Typically there would be a primary key or unique constraint on those columns, in that order, and then that index can be used to optimize this query)
This works on almost all RDBMSes, although the correlated subquery might not be handled very efficiently (depending on the RDBMS). Should work fine in MS SQL 2008 though.
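As a sketch of the index that note refers to (the index name is a placeholder):
-- hypothetical supporting index: lets the correlated MAX(ver) lookup per recId
-- be answered directly from the index
CREATE UNIQUE INDEX ux_tab_recid_ver ON tab (recId, ver);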