Join complementary tables in PostgreSQL - sql

Let's say I have two tables, table1 and table2, with three columns each: id, time, value. They store the same kind of information, i.e. 30-minute time-series data for several ids (let's imagine a machine that produces an amount of energy per day). table2 contains more precise information than table1, but not for all timestamps nor all ids.
I want to get the best out of the two tables over the period covered by table2, i.e. keeping data from table2 when available, and falling back to table1 when required (to add some more complexity, let's say that table1 is not a real table, but rather a view that takes a hell of a lot of time to be fully computed, so I want to avoid computing it in its entirety).
I thought I could define a perimeter of id-day pairs to indicate which value should be kept each day (the daily scale should be equivalent to the 30-minute timestamps, and be less resource-consuming). Thus I went for:
with perimeter_per_day_table2 as (
    select distinct
        id,
        date_trunc('day', time) as day
    from table2
),
perimeter_per_day_table1 as (
    select id,
           date_trunc('day', time) as day
    from table1
    where day >= (select min(time) from table2)
      and day <= (select max(time) from table2)
      and (id, day) not in (select id, day from perimeter_per_day_table2)
)
select * from perimeter_per_day_table1
but that takes a hell of a lot of time. In particular, it seems like the condition where (id, day) not in (select id, day from perimeter_per_day_table2) is very hard for PostgreSQL to handle.
Any suggestions?

Indeed, NOT IN isn't optimized as well as NOT EXISTS in Postgres. So an equivalent not exists () condition is typically faster.
However, in neither case do you need to apply a (costly) DISTINCT on the rows in the sub-query.
with perimeter_per_day_table1 as (
    select t1.id,
           date_trunc('day', t1.time) as day
    from table1 t1
    where date_trunc('day', t1.time) >= (select min(time) from table2)
      and date_trunc('day', t1.time) <= (select max(time) from table2)
      and not exists (select *
                      from table2 t2
                      where t1.id = t2.id
                        and date_trunc('day', t1.time) = date_trunc('day', t2.time))
)
select *
from perimeter_per_day_table1;
You can even avoid querying table2 twice for the min/max, but I doubt that will make a huge difference if there is an index on the time column:
with min_max as (
    select min(time) as min_time,
           max(time) as max_time
    from table2
),
perimeter_per_day_table1 as (
    select t1.id,
           date_trunc('day', t1.time) as day
    from table1 t1
    cross join min_max
    where date_trunc('day', t1.time) >= min_max.min_time
      and date_trunc('day', t1.time) <= min_max.max_time
      and not exists (select *
                      from table2 t2
                      where t1.id = t2.id
                        and date_trunc('day', t1.time) = date_trunc('day', t2.time))
)
select *
from perimeter_per_day_table1;
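The anti-join pattern above can be exercised end to end. Here is a minimal sketch against an in-memory SQLite database (hypothetical sample data; SQLite's date() stands in for Postgres's date_trunc('day', ...)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, time TEXT, value REAL);
CREATE TABLE table2 (id INTEGER, time TEXT, value REAL);
INSERT INTO table1 VALUES
  (1, '2023-01-01 00:00', 10), (1, '2023-01-02 00:00', 11),
  (2, '2023-01-01 00:00', 20), (2, '2023-01-03 00:00', 21);
INSERT INTO table2 VALUES
  (1, '2023-01-01 00:30', 10.5);  -- table2 only covers id 1 on Jan 1
""")

# id-day pairs from table1 inside table2's period that table2 does NOT cover
rows = conn.execute("""
SELECT t1.id, date(t1.time) AS day
FROM table1 t1
WHERE date(t1.time) >= (SELECT date(min(time)) FROM table2)
  AND date(t1.time) <= (SELECT date(max(time)) FROM table2)
  AND NOT EXISTS (SELECT 1 FROM table2 t2
                  WHERE t2.id = t1.id
                    AND date(t2.time) = date(t1.time))
ORDER BY t1.id, day
""").fetchall()
print(rows)  # only id 2 on 2023-01-01 needs to fall back to table1
```

The NOT EXISTS form also behaves sensibly when the sub-query can return NULLs, which is a classic trap with NOT IN.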

Related

Add a few rows from one table to another according to criteria in the other table, and add a group code to each group

I'm new to the field, so I'd love some help.
I prefer answers that can be implemented in Access or Excel.
I have two tables. The first has information on specific trips; here is an example: example table1.
The second has information on other trips; here is an example: example table2.
Both of them have the following columns: 1. tripid, 2. time.
I want to copy all the trips from Table 2 that were made within an hour of the trips listed in Table 1, and I want each group (a trip in Table 1 and all of its continuation trips within an hour) to have a common code (in a new column named group code, where each group will have the same code).
Here's an example of how I want it to come out: example result
Thank you
For the group code, just concatenate the tripid and time from table1. So, this results in:
select t1.tripid, t1.time, t1.tripid & t1.time as group_code
from table1 as t1
union all
select t2.tripid, t2.time,
       (select top 1 t1.tripid & t1.time
        from table1 as t1
        where t1.time <= t2.time and t1.time > dateadd("hour", -1, t2.time)
        order by t1.time desc
       ) as group_code
from table2 as t2
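The Access query above can be sanity-checked outside Access. A minimal sketch with hypothetical trips in SQLite, where || and datetime(..., '-1 hour') stand in for Access's & and dateadd, and LIMIT 1 for TOP 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (tripid INTEGER, time TEXT);
CREATE TABLE table2 (tripid INTEGER, time TEXT);
INSERT INTO table1 VALUES (100, '2023-01-01 08:00:00');
INSERT INTO table2 VALUES
  (200, '2023-01-01 08:30:00'),  -- within an hour of trip 100
  (201, '2023-01-01 09:30:00');  -- more than an hour later: no group
""")

rows = conn.execute("""
SELECT t1.tripid, t1.time, t1.tripid || t1.time AS group_code
FROM table1 AS t1
UNION ALL
SELECT t2.tripid, t2.time,
       (SELECT t1.tripid || t1.time
        FROM table1 AS t1
        WHERE t1.time <= t2.time
          AND t1.time > datetime(t2.time, '-1 hour')
        ORDER BY t1.time DESC
        LIMIT 1) AS group_code
FROM table2 AS t2
ORDER BY tripid
""").fetchall()
print(rows)
```

Trip 200 inherits trip 100's concatenated code; trip 201, being more than an hour out, gets no group code at all.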

SQL get max date grouped by person from 2 identical tables (with different data)

I have a fairly simple model in MSSQL
Table1 (approximately 1000 rows)
Id, PersonId, Time
Table2 (approximately 10,000,000 rows)
Id, PersonId, Time
I need the latest entry (time) for each person based on the data in these 2 tables. If it helps performance, I can add that Table1 is significantly smaller than Table2. There is no rule, though, that the latest entry is in Table1 or in Table2.
It strikes me as a fairly simple query, but I simply cannot crack it (without seemingly complex measures). Any inputs out there?
I would use union all & do aggregation:
select max(id), PersonId, max(time)
from (select t1.id, t1.PersonId, t1.Time
      from table1 t1
      union all
      select t2.id, t2.PersonId, t2.Time
      from table2 t2
     ) t
group by PersonId;
EDIT: You can use the row_number() function with the WITH TIES clause:
select top (1) with ties t.*
from ( . . .
) t
order by row_number() over (partition by PersonId order by time desc);
One method uses row_number() on the UNION ALLed results:
select top (1) with ties t.*
from (select t1.id, t1.PersonId, t1.Time
      from table1 t1
      union all
      select t2.id, t2.PersonId, t2.Time
      from table2 t2
     ) t
order by row_number() over (partition by PersonId order by time desc);
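The row_number() approach is portable beyond SQL Server (TOP (1) WITH TIES itself is T-SQL only). A minimal sketch of the same idea with hypothetical data in SQLite, filtering on rn = 1 instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, PersonId INTEGER, time TEXT);
CREATE TABLE table2 (id INTEGER, PersonId INTEGER, time TEXT);
INSERT INTO table1 VALUES (1, 1, '2023-01-05'), (2, 2, '2023-01-01');
INSERT INTO table2 VALUES (3, 1, '2023-01-02'), (4, 2, '2023-01-06');
""")

# Latest row per person across the UNION ALL of both tables
rows = conn.execute("""
SELECT id, PersonId, time
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY PersonId
                                ORDER BY time DESC) AS rn
      FROM (SELECT * FROM table1
            UNION ALL
            SELECT * FROM table2))
WHERE rn = 1
ORDER BY PersonId
""").fetchall()
print(rows)  # person 1's latest row is in table1, person 2's is in table2
```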

ORA-01427 - Need the counts of each value

I get "ORA-01427: single-row subquery returns more than one row" when I run the following query:
select count(*)
from table1
where to_char(timestamp,'yyyymmddhh24') = to_char(sysdate-1/24,'yyyymmddhh24')
  and attribute = (select distinct attribute from table2);
I want to get the counts of each value of attribute in the specific time frame.
I would recommend writing this as:
select count(*)
from table1 t1
where timestamp >= trunc(sysdate-1/24, 'HH') and
      timestamp < trunc(sysdate, 'HH') and
      exists (select 1 from table2 t2 where t2.attribute = t1.attribute);
This formulation makes it easier to use indexes and statistics for optimizing the query. Also, select distinct is not appropriate with in (although I think Oracle will optimize away the distinct).
EDIT:
You appear to want to aggregate by attribute as well:
select t1.attribute, count(*)
from table1 t1
where timestamp >= trunc(sysdate-1/24, 'HH') and
      timestamp < trunc(sysdate, 'HH') and
      exists (select 1 from table2 t2 where t2.attribute = t1.attribute)
group by t1.attribute;
You can do it with a join and GROUP BY:
SELECT count(*) AS Cnt,
       a.attribute
FROM table1 t
JOIN table2 a ON t.attribute = a.attribute
WHERE to_char(t.timestamp,'yyyymmddhh24') = to_char(sysdate-1/24,'yyyymmddhh24')
GROUP BY a.attribute
This produces a row for each distinct attribute from table2, paired up with the corresponding count from table1.
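The EXISTS rewrite is easy to verify on a toy dataset. A minimal sketch in SQLite with hypothetical data, where literal timestamps replace Oracle's trunc(sysdate, 'HH') window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (timestamp TEXT, attribute TEXT);
CREATE TABLE table2 (attribute TEXT);
INSERT INTO table1 VALUES
  ('2023-01-01 09:15:00', 'a'),
  ('2023-01-01 09:45:00', 'a'),
  ('2023-01-01 09:20:00', 'b'),
  ('2023-01-01 10:05:00', 'a');  -- outside the one-hour window
INSERT INTO table2 VALUES ('a');  -- 'b' is not in table2
""")

# Count per attribute within the window, for attributes present in table2
rows = conn.execute("""
SELECT t1.attribute, count(*)
FROM table1 t1
WHERE t1.timestamp >= '2023-01-01 09:00:00'
  AND t1.timestamp <  '2023-01-01 10:00:00'
  AND EXISTS (SELECT 1 FROM table2 t2 WHERE t2.attribute = t1.attribute)
GROUP BY t1.attribute
""").fetchall()
print(rows)
```

Note that the half-open range (>= start, < end) is what lets an index on the timestamp column be used, unlike wrapping the column in to_char().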

PostgreSQL Selecting Most Recent Entry for a Given ID

The table essentially looks like:
Serial-ID, ID, Date, Data, Data, Data, etc.
There can be multiple rows for the same ID. I'd like to create a view of this table, to be used in reports, that only shows the most recent entry for each ID. It should show all of the columns.
Can someone help me with the SQL select? Thanks.
There are about five different ways to do this, but here's one:
SELECT *
FROM yourTable AS T1
WHERE NOT EXISTS(
    SELECT *
    FROM yourTable AS T2
    WHERE T2.ID = T1.ID AND T2.Date > T1.Date
)
And here's another:
SELECT T1.*
FROM yourTable AS T1
LEFT JOIN yourTable AS T2 ON
    (
        T2.ID = T1.ID
        AND T2.Date > T1.Date
    )
WHERE T2.ID IS NULL
One more:
WITH T AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Date DESC) AS rn
    FROM yourTable
)
SELECT * FROM T WHERE rn = 1
OK, I'm getting carried away; here's the last one I'll post (for now):
WITH T AS (
    SELECT ID, MAX(Date) AS latest_date
    FROM yourTable
    GROUP BY ID
)
SELECT yourTable.*
FROM yourTable
JOIN T ON T.ID = yourTable.ID AND T.latest_date = yourTable.Date
I would use DISTINCT ON
CREATE VIEW your_view AS
SELECT DISTINCT ON (id) *
FROM your_table a
ORDER BY id, date DESC;
This works because distinct on suppresses rows with duplicates of the expression in parentheses. DESC in the order by means the row that would normally sort last comes first, and is therefore the one that shows up in the result.
https://www.postgresql.org/docs/10/static/sql-select.html#SQL-DISTINCT
This seems like a good use for correlated subqueries:
CREATE VIEW your_view AS
SELECT *
FROM your_table a
WHERE date = (
    SELECT MAX(date)
    FROM your_table b
    WHERE b.id = a.id
)
Your date column would need to uniquely identify each row within an id (like a TIMESTAMP type), or ties will produce duplicate rows.
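The correlated-subquery version is the most portable of the answers here. A quick check with hypothetical data in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (serial_id INTEGER, id INTEGER, date TEXT, data TEXT);
INSERT INTO your_table VALUES
  (1, 10, '2023-01-01', 'old'),
  (2, 10, '2023-01-03', 'new'),
  (3, 20, '2023-01-02', 'only');
""")

# Keep only the row whose date equals the max date for its id
rows = conn.execute("""
SELECT *
FROM your_table a
WHERE date = (SELECT MAX(date) FROM your_table b WHERE b.id = a.id)
ORDER BY id
""").fetchall()
print(rows)  # the 'old' row for id 10 is filtered out
```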

Redundancy in doing sum()

table1 -> id, time_stamp, value
This table consists of 10 ids. Each id has a value for each hour in a day.
So for one day, there would be 240 records in this table.
table2 -> id
Table2 consists of a dynamically changing subset of the ids present in table1.
At a particular instance, the intention is to get sum(value) from table1, considering only the ids present in table2,
grouping by each hour in the day, giving the summarized values a rank, and repeating this for each day.
The query is at this stage:
select time_stamp, sum(value),
       rank() over (partition by trunc(time_stamp) order by sum(value) desc) rn
from table1
where exists (select t2.id from table2 t2 where id = t2.id)
  and time_stamp >= to_date('05/04/2010 00','dd/mm/yyyy hh24')
  and time_stamp <= to_date('25/04/2010 23','dd/mm/yyyy hh24')
group by time_stamp
order by time_stamp asc
If the query is correct, can it be made more efficient, considering that table1 will actually consist of thousands of ids instead of 10?
EDIT: I am using sum(value) twice in the query, and I have not been able to find a workaround so that the sum() is done only once. Please help with this.
from table1
where exists (select t2.id from table2 t2 where value=t2.value)
table2 doesn't have a value field. Why does the above query use t2.value?
You could use a join here
from table1 t1 join table2 t2 on t1.id = t2.id
EDIT: It's been a while since I worked on Oracle. Pardon me if my comment on t2.value doesn't make sense.
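Regarding the asker's EDIT: computing the sum once in a derived table and ranking over its alias avoids repeating sum(value). A minimal sketch of that idea, combined with the join suggested above, using hypothetical data in SQLite (date() plays the role of Oracle's trunc() on a date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, time_stamp TEXT, value REAL);
CREATE TABLE table2 (id INTEGER);
INSERT INTO table1 VALUES
  (1, '2023-01-01 00:00:00', 5),
  (2, '2023-01-01 00:00:00', 7),
  (3, '2023-01-01 00:00:00', 100),  -- id 3 is not in table2: excluded
  (1, '2023-01-01 01:00:00', 9);
INSERT INTO table2 VALUES (1), (2);
""")

# Inner query sums once per hour; outer query ranks the alias per day
rows = conn.execute("""
SELECT time_stamp, total,
       rank() OVER (PARTITION BY date(time_stamp)
                    ORDER BY total DESC) AS rn
FROM (SELECT t1.time_stamp, sum(t1.value) AS total
      FROM table1 t1
      JOIN table2 t2 ON t1.id = t2.id
      GROUP BY t1.time_stamp)
ORDER BY time_stamp
""").fetchall()
print(rows)
```

The 00:00 hour sums to 12 (id 3's 100 is filtered out by the join) and ranks first within the day; the 01:00 hour's 9 ranks second.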