Time series querying in Postgres - sql

This is a follow-up question from @Erwin's answer to Efficient time series querying in Postgres.
In order to keep things simple I'll use the same table structure as that question
id | widget_id | for_date | score |
The original question was to get score for each of the widgets for every date in a range. If there was no entry for a widget on a date then show the score from the previous entry for that widget. The solution using a cross join and a window function worked well if all the data was contained in the range you were querying for. My problem is I want the previous score even if it lies outside the date range we are looking at.
Example data:
INSERT INTO score (id, widget_id, for_date, score) values
(1, 1337, '2012-04-07', 52),
(2, 2222, '2012-05-05', 99),
(3, 1337, '2012-05-07', 112),
(4, 2222, '2012-05-07', 101);
When I query for the range May 5th to May 10th 2012 (ie generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')) I would like to get the following:
DAY WIDGET_ID SCORE
May, 05 2012 1337 52
May, 05 2012 2222 99
May, 06 2012 1337 52
May, 06 2012 2222 99
May, 07 2012 1337 112
May, 07 2012 2222 101
May, 08 2012 1337 112
May, 08 2012 2222 101
May, 09 2012 1337 112
May, 09 2012 2222 101
May, 10 2012 1337 112
May, 10 2012 2222 101
The best solution so far (also by @Erwin) is:
SELECT a.day, a.widget_id, s.score
FROM (
SELECT d.day, w.widget_id
,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
FROM (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
) a
LEFT JOIN score s ON s.for_date = a.effective_date AND s.widget_id = a.widget_id
ORDER BY a.day, a.widget_id;
But as you can see in this SQL Fiddle it produces null scores for widget 1337 on the first two days. I would like to see the earlier score of 52 from row 1 in its place.
Is it possible to do this in an efficient way?

As @Roman mentioned, DISTINCT ON can solve this. Details in this related answer:
Select first row in each GROUP BY group?
Subqueries are generally a bit faster than CTEs, though (before Postgres 12 a CTE always acted as an optimization fence):
SELECT DISTINCT ON (d.day, w.widget_id)
d.day, w.widget_id, s.score
FROM generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER BY d.day, w.widget_id, s.for_date DESC;
You can use a set returning function like a table in the FROM list.
One multicolumn index should be the key to performance:
CREATE INDEX score_multi_idx ON score (widget_id, for_date, score);
The third column score is only included to make it a covering index in Postgres 9.2 or later. You would not include it in earlier versions.
Of course, if you have many widgets and a wide range of days, the CROSS JOIN produces a lot of rows, which has a price-tag. Only select the widgets and days you actually need.

As you wrote, you should find the matching score, and if there is a gap, fill it with the nearest earlier score. In SQL it will be:
SELECT d.day, w.widget_id,
coalesce(s.score, (select s2.score from score s2
where s2.for_date<d.day and s2.widget_id=w.widget_id order by s2.for_date desc limit 1)) as score
from (select distinct widget_id FROM score) AS w
cross join (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
left join score s ON (s.for_date = d.day AND s.widget_id = w.widget_id)
order by d.day, w.widget_id;
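COALESCE here means "if there is a gap": it returns s.score when the day has its own entry and falls back to the subquery only when s.score is NULL.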
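A minimal sketch of the same "carry the last score forward" idea, run against the question's example data. It uses Python's built-in sqlite3 module rather than Postgres, so two things are assumptions of this sketch and not part of the answers: a recursive CTE stands in for generate_series, and dates are stored as ISO text so that string comparison orders them correctly. A single correlated subquery with <= covers both the exact match and the gap case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE score (id INTEGER PRIMARY KEY, widget_id INTEGER,
                    for_date TEXT, score INTEGER);
INSERT INTO score (id, widget_id, for_date, score) VALUES
  (1, 1337, '2012-04-07', 52),
  (2, 2222, '2012-05-05', 99),
  (3, 1337, '2012-05-07', 112),
  (4, 2222, '2012-05-07', 101);
""")

query = """
WITH RECURSIVE d(day) AS (          -- stand-in for generate_series
    SELECT '2012-05-05'
    UNION ALL
    SELECT date(day, '+1 day') FROM d WHERE day < '2012-05-10'
)
SELECT d.day, w.widget_id,
       (SELECT s.score              -- latest score on or before this day
          FROM score s
         WHERE s.widget_id = w.widget_id
           AND s.for_date <= d.day
         ORDER BY s.for_date DESC
         LIMIT 1) AS score
FROM d
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
ORDER BY d.day, w.widget_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first printed row is ('2012-05-05', 1337, 52): the score from April 7th is picked up even though it lies outside the queried range.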

You can use distinct on syntax in PostgreSQL
with cte_d as (
select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
select distinct widget_id from score
)
select distinct on (d.day, w.widget_id)
d.day, w.widget_id, s.score
from cte_d as d
cross join cte_w as w
left outer join score as s on s.widget_id = w.widget_id and s.for_date <= d.day
order by d.day, w.widget_id, s.for_date desc;
or get max date by subquery:
with cte_d as (
select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
select distinct widget_id from score
)
select
d.day, w.widget_id, s.score
from cte_d as d
cross join cte_w as w
left outer join score as s on s.widget_id = w.widget_id
where
exists (
select 1
from score as tt
where tt.widget_id = w.widget_id and tt.for_date <= d.day
having max(tt.for_date) = s.for_date
)
order by d.day, w.widget_id;
The performance really depends on the indexes you have on your table (a unique index on widget_id, for_date if possible). I think that if you have many rows for each widget_id the second query would be more efficient, but you have to test it on your data.

Related

SQL Editing Year

The query works but only gives values for the year 1985. How do I include the whole range of years (1985-2014)?
use baseball;
SELECT CAST(tf.franchname AS CHAR(20)), s.yearID, s.lgid, AVG(s.salary)
FROM salaries s, teams t, teamsfranchises tf
WHERE s.teamID = t.teamID AND
t.franchID = tf.franchID AND
s.yearID = 1985 AND
(s.lgid='AL' OR s.lgid='NL') GROUP BY tf.franchname, s.yearID, s.lgid order BY
s.yearID;
You could just use BETWEEN.
Your where clause should then look like
(s.yearID BETWEEN 1985 AND 2014) and
Alternatively you could use the >= and <= operators:
(s.yearID >= 1985 and s.yearID <= 2014)
If for any reason you don't have a continuous range of years (say you only want 5 years), IN could also be an option:
s.yearID IN (1984, 1991, 1996, 2001, 2006)
Your query has a condition filtering on the year, s.yearID = 1985. You may want to change it using the keyword BETWEEN, or remove it altogether, depending on your need.
select cast(tf.franchname as char(20)), s.yearID, s.lgid, avg(s.salary)
from salaries s, teams t, teamsfranchises tf
where s.teamID = t.teamID and
t.franchID = tf.franchID and
(s.yearID between 1985 and 2014 )and
(s.lgid='AL' OR s.lgid='NL')
group by tf.franchname, s.yearID, s.lgid
order by s.yearID;
Here is another angle: you may want every year to appear in the result even when there is no data for it.
To do that, generate the list of years (1985 to the current year), then left outer join your query to it.
I used your query below; you can substitute the query from the accepted answer.
Declare @Startyear int = 1985
--1st approach: generate consecutive years
;with yearlist as
(
select 1985 as year
union all
select yl.year + 1 as year
from yearlist yl
where yl.year + 1 <= YEAR(GetDate())
)
select year from yearlist order by year desc;
--2nd approach: generate consecutive years
;WITH n(n) AS
(
SELECT 0
UNION ALL
SELECT n+1 FROM n WHERE n < (year(getdate()) - @Startyear)
)
)
SELECT year(DATEADD( YY, -n, GetDate()))
FROM n ORDER BY n
--take either approach and then join it with your query
;with yearlist as
(
select 1985 as year
union all
select yl.year + 1 as year
from yearlist yl
where yl.year + 1 <= YEAR(GetDate())
)
select yl.year, yt.franchname, yt.lgid, yt.avg_salary
from yearlist yl
left join
(
-- a derived table needs a name for every column and cannot contain ORDER BY;
-- the s.yearID = 1985 filter is dropped so that every year can match
SELECT CAST(tf.franchname AS CHAR(20)) AS franchname, s.yearID, s.lgid, AVG(s.salary) AS avg_salary
FROM salaries s, teams t, teamsfranchises tf
WHERE s.teamID = t.teamID AND
t.franchID = tf.franchID AND
(s.lgid='AL' OR s.lgid='NL')
GROUP BY tf.franchname, s.yearID, s.lgid
) yt on yt.yearID = yl.year
order by yl.year desc;

distinct count with group by

I have already searched SO but found no answer to my question. My question is if I use the query below I get correct count which is 90:
select count(distinct account_id)
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
-- 90
But when I group by CalendarYear as below I am missing 12 counts. The query and output is below:
select CalendarYear,count(distinct account_id) as accountCount
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
group by CalendarYear
output:
CalendarYear accountCount
2005 10
2006 26
2007 49
2008 63
2009 65
2010 78
I am not sure why I am missing 12 counts. To debug, I ran the following query to check whether any date_id in FactCustomerAccount is missing from DimDate, but found no missing keys:
select distinct f.date_id from FactCustomerAccount f
where f.date_id not in
(select DateKey from dimdate d)
I am using SQL Server 2008 R2.
Can anyone please suggest what could be the reason for missing 12 counts?
Thanks in advance.
EDIT ONE:
I did not quite understand reason/answer given to my question in the 2 replies so I would like to add 2 queries below using AdventureWorksDW2008R2 where no count is missing:
select count (distinct EmployeeKey)
from FactSalesQuota f
join dimdate d on f.DateKey = d.DateKey
-- out: 17
select d.CalendarYear, count (distinct EmployeeKey) as Employecount
from FactSalesQuota f
join dimdate d on f.DateKey = d.DateKey
group by d.CalendarYear
-- out:
-- CalendarYear Employecount
-- 2005 10
-- 2006 14
-- 2007 17
-- 2008 17
So please correct me what I am missing.
Your queries are very different:
The first:
select count(distinct account_id)
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
This returns the count of different accounts over all years, so an account_id that is present in two years is counted only once.
The second:
This is grouped by CalendarYear, so an account_id that appears in two different years is counted in two different rows.
select CalendarYear,count(distinct account_id) as accountCount
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
group by CalendarYear
EDIT
Let me try to explain better.
Suppose this data set of ordered pairs (year, account_id):
2008 10
2009 10
2010 10
2010 12
If you run the two queries above you get:
2
and
2008 1
2009 1
2010 2
because there are two different account_ids (10 and 12), and only the last year (2010) has rows for both of them.
But if you have this data set:
`2008 10`
`2009 10`
`2009 12`
`2010 12`
You'll have:
First query result:
2
Second query result:
2008 1
2009 2
2010 1
You aren't missing 12 accounts. An account with no activity in a given year is simply not counted in that year's row, so the count for the final year can be lower than the overall distinct count.
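The second data set above can be checked directly. This is only a sketch with Python's built-in sqlite3 module; the activity table and its column names are hypothetical stand-ins for the join of FactCustomerAccount and DimDate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical stand-in for FactCustomerAccount joined to DimDate
CREATE TABLE activity (calendar_year INTEGER, account_id INTEGER);
INSERT INTO activity VALUES (2008, 10), (2009, 10), (2009, 12), (2010, 12);
""")

# Distinct accounts over all years
overall = conn.execute(
    "SELECT COUNT(DISTINCT account_id) FROM activity"
).fetchone()[0]

# Distinct accounts within each year
per_year = conn.execute("""
    SELECT calendar_year, COUNT(DISTINCT account_id)
    FROM activity
    GROUP BY calendar_year
    ORDER BY calendar_year
""").fetchall()

print(overall)                         # 2
print(per_year)                        # [(2008, 1), (2009, 2), (2010, 1)]
print(sum(n for _, n in per_year))     # 4
```

No single year's count has to equal the overall count, and the per-year counts do not add up to it either, because an account active in several years is counted once in each of those years.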
To analyze this, check the number of rows and check the CalendarYear column: are there any rows with NULL in CalendarYear? You could also try numbering the rows per year and account, though I am not sure it will help:
select *,
ROW_NUMBER() over(partition by d.CalendarYear, f.account_id order by d.CalendarYear) as rn
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey

Get rows with difference of dates being one

I have the following table and rows defined in SQLFiddle
I need to select rows from the products table where the difference between the start_date of one row and the nvl(return_date,end_date) of the previous row is 1 day, i.e. the current row's start_date is one day after the previous row's nvl(return_date,end_date).
For example
PRODUCT_NO TSH098 and PRODUCT_REG_NO FLDG, the END_DATE is August, 15 2012 and
PRODUCT_NO TSH128 and PRODUCT_REG_NO FLDG start_date is August, 16 2012, so the difference is only of a day.
How can I get the desired output using SQL?
Any help is highly appreciated.
Thanks
You can use lag analytical function to get access to a row at a given physical offset prior to the current position. According to your sorting order it might look like this(not so elegant though).
select *
from products p
join (select *
from(select p.Product_no
, p.Product_Reg_No
, case
when (lag(start_date, 1, start_date) over(order by product_reg_no)-
nvl(return_date, end_date)) = 1
then lag(start_date, 1, start_date)
over(order by product_reg_no)
end start_date
, End_Date
, Return_Date
from products p
order by 2,1 desc
)
where start_date is not null
) s
on (p.start_date = s.start_date or p.end_date = s.end_date)
order by 2, 1 desc
In Oracle, date + X adds X days to the date. So you can:
select *
from products
where start_date + 1 = nvl(end_date, return_date)
If the dates could contain a time part, use trunc to remove the time part:
select *
from products
where trunc(start_date) + 1 = trunc(nvl(end_date, return_date))
I am under the impression you only want the matching dates differing by 1 day if the product reg no matches. So I simply joined on it, and I think this is what you want:
select p1.product_reg_no,
p1.product_no product_no_1,
p2.product_no product_no_2,
p1.start_date start_date_1,
nvl(p2.return_date,p2.end_date) return_or_end_date_2
from products p1
join products p2 on (p1.product_reg_no = p2.product_reg_no)
where p1.start_date-1 = nvl(p2.return_date,p2.end_date)
If I was wrong about the grouping, just drop the join condition; with the given example products table this returns the same result:
select p1.product_reg_no,
p1.product_no product_no_1,
p2.product_no product_no_2,
p1.start_date start_date_1,
nvl(p2.return_date,p2.end_date) return_or_end_date_2
from products p1, products p2
where p1.start_date-1 = nvl(p2.return_date,p2.end_date)
You say the difference is 1 day. I assumed that start_date is 1 day later than nvl(return_date,end_date), and that every date has a midnight time part. To drop both assumptions, you can work with trunc and check both directions:
select p1.product_reg_no,
p1.product_no product_no_1,
p2.product_no product_no_2,
p1.start_date start_date_1,
nvl(p2.return_date,p2.end_date) return_or_end_date_2
from products p1, products p2
where trunc(p1.start_date)-1 = trunc(nvl(p2.return_date,p2.end_date))
or trunc(p1.start_date)+1 = trunc(nvl(p2.return_date,p2.end_date))
And this all works because dates (not timestamp) can be calculated by adding and subtracting.
EDIT: Following your comment you want return_date or end_date to be compared and equal dates are also wanted:
select p1.product_reg_no,
p1.product_no product_no_1,
p2.product_no product_no_2,
p1.start_date start_date_1,
p2.return_date return_date_2,
p2.end_date end_date_2
from products p1, products p2
where trunc(p1.start_date) = trunc(p2.return_date)
or trunc(p1.start_date)-1 = trunc(p2.return_date)
or trunc(p1.start_date)+1 = trunc(p2.return_date)
or trunc(p1.start_date) = trunc(p2.end_date)
or trunc(p1.start_date)-1 = trunc(p2.end_date)
or trunc(p1.start_date)+1 = trunc(p2.end_date)
The way to compare the current row with the previous row is to use the LAG() function. Something like this:
select * from
(
select p.*
, lag (end_date) over
(order by start_date )
as prev_end_date
, lag (return_date) over
(order by start_date )
as prev_return_date
from products p
)
where (trunc(start_date) - 1) = trunc(nvl(prev_return_date, prev_end_date))
order by 2,1 desc
However, this will not return the results you desire, because you have not defined a mechanism for defining a sort order. And without a sort order the concept of "previous row" is meaningless.
However, what you can do is this:
select p1.*
, p2.*
from products p1 cross join products p2
where (trunc(p2.start_date) - 1) = trunc(nvl(p1.return_date, p1.end_date))
order by 2, 1 desc
This SQL queries your table twice, filtering on the basis of dates. Each row in the result set contains a record from each copy of the table. If a given start_date matches more than one end_date, or vice versa, you will get records for multiple hits.
You mean like this?
SELECT T2.*
FROM PRODUCTS T1
JOIN PRODUCTS T2 ON (
nvl(T1.end_date, T1.return_date) + 1 = T2.start_date
);
In your SQL Fiddle example, it returns:
PRODUCT_NO PRODUCT_REG_NO START_DATE END_DATE RETURN_DATE
TSH128 FLDG August, 16 2012 00:00:00-0400 September, 15 2012 00:00:00-0400 (null)
TSH125 SCRW August, 08 2012 00:00:00-0400 September, 07 2012 00:00:00-0400 (null)
TSH137 SCRW September, 08 2012 00:00:00-0400 October, 07 2012 00:00:00-0400 (null)
TSH128 is returned for the reasons you already explained.
TSH125 is returned because TSH116 end_date is August, 07 2012.
TSH137 is returned because TSH125 end_date is September, 07 2012.
If you want to compare only rows within the same product_reg_no, it's easy to add that to the JOIN condition. If you want both "directions" of the 1-day difference, it's easy to add that too.
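A sketch of the self-join restricted to the same product_reg_no, using Python's built-in sqlite3 module instead of Oracle. Several things here are assumptions of the sketch: COALESCE and date(x, '+1 day') replace nvl and date + 1, dates are stored as ISO text, the start dates of TSH098 and TSH116 are invented, and TSH116 is assumed to belong to SCRW. The remaining dates are the ones quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no TEXT, product_reg_no TEXT,
                       start_date TEXT, end_date TEXT, return_date TEXT);
INSERT INTO products VALUES
  ('TSH098', 'FLDG', '2012-07-16', '2012-08-15', NULL),  -- start_date assumed
  ('TSH128', 'FLDG', '2012-08-16', '2012-09-15', NULL),
  ('TSH116', 'SCRW', '2012-07-08', '2012-08-07', NULL),  -- start_date and reg no assumed
  ('TSH125', 'SCRW', '2012-08-08', '2012-09-07', NULL),
  ('TSH137', 'SCRW', '2012-09-08', '2012-10-07', NULL);
""")

rows = conn.execute("""
SELECT cur.product_no  AS product_no,
       prev.product_no AS previous_product_no
FROM products prev
JOIN products cur
  ON cur.product_reg_no = prev.product_reg_no
 -- cur starts exactly one day after prev was returned (or ended)
 AND cur.start_date = date(COALESCE(prev.return_date, prev.end_date), '+1 day')
ORDER BY cur.product_no
""").fetchall()

for row in rows:
    print(row)
```

With this data the join returns TSH125 (following TSH116), TSH128 (following TSH098) and TSH137 (following TSH125), the same three products as in the output above.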

Detecting duplicates which fall outside of a date interval

I searched SO but couldn't find a direct answer.
There are patients, hospitals, medical branches(ER,urology,orthopedics,internal disease etc), medical operation codes (examination,surgical operation, MRI, ultrasound or sth. else) and patient visiting dates.
Patient visits doctor, doctor prescribes medicine and asks to come again for control check.
If patient returns after 10 days, (s)he has to pay another examination fee to the same hospital. Hospitals may appoint a date after 10 days telling there are no available slots in following 10 days, in order to get the examination fee.
Table structure is like:
Patient id.no  Hospital  Medical Branch  Medical Op. Code  Date
1              H1        M0              P1                01/05/2011
5              H1        M1              P9                03/05/2011
3              H2        M0              P2                09/05/2011
1              H1        M0              P1                14/05/2011
3              H1        M0              P2                20/05/2011
5              H1        M2              P9                25/05/2011
1              H1        M0              P3                26/05/2011
Here, visiting patients no. 3 and 5 does not constitute a problem as patient no. 3 visits different hospitals and patient no.5 visits different medical branches. They would pay the examination fee even if they visited within 10 days.
Patient no.1, however, visits same hospital, same branch and is subject to same process (P1: examination) on 01/05 and 14/05.
26/05 doesn't count because it is not a medical examination.
What I want to flag is same patient, same hospital, same branch and same medical operation code (that is specifically medical examination : P1 ), with date range more than 10 days.
The format of resulting table:
HOSPITAL  TOTAL NUM. of PATIENTS  NUM. of PATIENTS OUT OF DATE RANGE
H1        x                       a
H2        y                       b
H3        z                       c
Thanks.
Once again, it's analytic functions to the rescue.
This query uses the LAG() function to link a record in YOUR_TABLE with the previous (defined by DATE) matching record (defined by PATIENT_ID) in the table.
select hospital_id
, count(*) as total_num_of_patients
, sum (out_of_range) as num_of_patients_out_of_range
from (
select patient_id
, hospital_id_1 as hospital_id
, case
when hospital_id_1 = hospital_id_0
and visit_1 > visit_0 + 10
and med_op_code_1 = med_op_code_0
then 1
else 0
end as out_of_range
from (
-- DATE is a reserved word in Oracle; substitute your real column name
select patient_id
, hospital_id as hospital_id_1
, date as visit_1
, med_op_code as med_op_code_1
, lag (date) over (partition by patient_id order by date) as visit_0
, lag (hospital_id) over (partition by patient_id order by date) as hospital_id_0
, lag (med_op_code) over (partition by patient_id order by date) as med_op_code_0
from your_table
where med_op_code = 'P1'
)
)
group by hospital_id
/
Caveat: I haven't tested this code, so it may contain syntax errors. I will check it the next time I can access an Oracle database.
This is a little rough, as I haven't got an Oracle DB to hand, but the key feature is the same: the analytical function LAG(). Along with its companion function, LEAD(), they're great for helping to deal with things like periods of activity.
Here's my attempt at the code:
select n.hospital, COUNT(n.patient_id) as patients_out_of_date_range
from (
select *
from (
select d.*, lag(date, 1) over (partition by d.patient_id, d.hospital, d.medical_branch, d.medical_op_code order by d.date) as prev_date
from datatable d inner join
(
select d.patient_id, d.hospital, d.medical_branch, d.medical_op_code
from datatable d
where d.medical_op_code = 'P1'
group by d.patient_id, d.hospital, d.medical_branch, d.medical_op_code
having COUNT(d.date) > 1
) t on d.patient_id = t.patient_id and d.hospital = t.hospital and d.medical_branch = t.medical_branch and d.medical_op_code = t.medical_op_code
) m
where date - prev_date > 10
) n
group by n.hospital
Like I say, this isn't tested, but it should at least get you started in the right direction.
Some references:
http://www.adp-gmbh.ch/ora/sql/analytical/lag.html
http://www.oracle-base.com/articles/misc/LagLeadAnalyticFunctions.php
I think this is what you're trying for:
WITH Patient_Visits (Patient_Id, Hospital_Id, Branch_Id, Visit_Date, Visit_Order) as (
SELECT Patient_Id, Hospital_Id, Branch_Id, Visit_Date,
ROW_NUMBER() OVER(PARTITION BY Patient_Id, Hospital_Id, Branch_Id
ORDER BY Visit_Date)
FROM Hospital_Visits
WHERE Procedure_Id = 'P1'),
Hospital_Recent_Visits (Hospital_Id, Recent_Visitor_Count) as (
SELECT a.Hospital_Id, COUNT(DISTINCT a.Patient_Id)
FROM Patient_Visits as a
JOIN Patient_Visits as b
ON b.Hospital_Id = a.Hospital_Id
AND b.Branch_Id = a.Branch_Id
AND b.Patient_Id = a.Patient_Id
AND b.Visit_Order = a.Visit_Order - 1
-- this counts return visits within 10 days; flip the comparison to count later ones
AND b.Visit_Date + 10 > a.Visit_Date
GROUP BY a.Hospital_Id),
Hospital_Patient_Count (Hospital_Id, Patient_Count) as (
SELECT Hospital_Id, COUNT(DISTINCT Patient_Id)
FROM Hospital_Visits
GROUP BY Hospital_Id)
SELECT a.Hospital_Id, b.Patient_Count, c.Recent_Visitor_Count
FROM Hospitals as a
LEFT JOIN Hospital_Patient_Count as b
ON b.Hospital_Id = a.Hospital_Id
LEFT JOIN Hospital_Recent_Visits as c
ON c.Hospital_id = a.Hospital_Id
Please note that this was written and tested against a DB2 system. I think Oracle databases have the relevant functionality, so the query should still work as written. However, DB2 appears to lack some of the OLAP functions Oracle has (my version, at least), which could be useful in knocking out some of the CTEs.
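For a version that can be run against the question's sample rows, here is a sketch using Python's built-in sqlite3 module. It is not any of the answers above: it avoids LAG() (which older SQLite builds lack) by self-joining each examination to its immediate predecessor, and the table name visits, its column names, and the ISO date format are all assumptions of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical table and column names; dates rewritten as ISO text
CREATE TABLE visits (patient_id INTEGER, hospital TEXT, branch TEXT,
                     op_code TEXT, visit_date TEXT);
INSERT INTO visits VALUES
  (1, 'H1', 'M0', 'P1', '2011-05-01'),
  (5, 'H1', 'M1', 'P9', '2011-05-03'),
  (3, 'H2', 'M0', 'P2', '2011-05-09'),
  (1, 'H1', 'M0', 'P1', '2011-05-14'),
  (3, 'H1', 'M0', 'P2', '2011-05-20'),
  (5, 'H1', 'M2', 'P9', '2011-05-25'),
  (1, 'H1', 'M0', 'P3', '2011-05-26');
""")

rows = conn.execute("""
WITH exams AS (
    SELECT * FROM visits WHERE op_code = 'P1'      -- examinations only
),
late AS (
    SELECT DISTINCT cur.hospital, cur.patient_id
    FROM exams cur
    JOIN exams prev
      ON prev.patient_id = cur.patient_id
     AND prev.hospital   = cur.hospital
     AND prev.branch     = cur.branch
     AND prev.visit_date < cur.visit_date
    WHERE NOT EXISTS (                              -- prev is the immediate predecessor
              SELECT 1 FROM exams mid
              WHERE mid.patient_id = cur.patient_id
                AND mid.hospital   = cur.hospital
                AND mid.branch     = cur.branch
                AND mid.visit_date > prev.visit_date
                AND mid.visit_date < cur.visit_date)
      AND julianday(cur.visit_date) - julianday(prev.visit_date) > 10
)
SELECT v.hospital,
       COUNT(DISTINCT v.patient_id) AS total_patients,
       COUNT(DISTINCT l.patient_id) AS patients_out_of_range
FROM visits v
LEFT JOIN late l
  ON l.hospital = v.hospital AND l.patient_id = v.patient_id
GROUP BY v.hospital
ORDER BY v.hospital
""").fetchall()

for row in rows:
    print(row)
```

Only patient 1 is flagged: the two P1 examinations at H1, branch M0, are 13 days apart. So H1 reports 3 patients with 1 out of range, and H2 reports 1 patient with none.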

Sqlite query comparison multiple times

I have the following schemas (sqlite):
JournalArticle(articleID, title, journal, volume, year, month)
ConferenceArticle(articleID, title, conference, year, location)
Person(name, affiliation)
Author(name, articleID)
I'm trying to get the names of all authors whose number of conference articles is >= their number of journal articles in every year from 2000-2018 inclusive. If an author has 0 articles in each category in a year then the condition still holds. The only years that matter are 2000-2018.
The query would be much easier if it were over all years, since I could count the journal articles and conference articles, make a comparison, and then get the names. However, I'm stuck when trying to check over every year 2000-2018.
I of course don't want to do repetitive queries over all the years. I feel like I may need to group by year but I'm not sure. So far I've been able to get all articles of both types from 2000-2018 as one large table, but I'm not sure what to do next:
select articleID, year
from JournalArticle
where year >= 2000 and year <= 2018
union
select articleID, year
from ConferenceArticle
where year >= 2000 and year <= 2018
Hmmm. Let's start by getting a count for each author and year:
select a.name, jc.year, sum(jc.is_journal) as num_journal, sum(jc.is_conference) as num_conference
from (select ja.articleID, ja.year, 1 as is_journal, 0 as is_conference
from JournalArticle ja
union all
select ca.articleID, ca.year, 0 as is_journal, 1 as is_conference
from ConferenceArticle ca
) jc join
Author a
on a.articleID = jc.articleID
group by a.name, jc.year
Now you can aggregate again and keep only the authors who have no year in 2000-2018 with more journal articles than conference articles:
select ay.name
from (select a.name, jc.year, sum(jc.is_journal) as num_journal, sum(jc.is_conference) as num_conference
from (select ja.articleID, ja.year, 1 as is_journal, 0 as is_conference
from JournalArticle ja
union all
select ca.articleID, ca.year, 0 as is_journal, 1 as is_conference
from ConferenceArticle ca
) jc join
Author a
on a.articleID = jc.articleID
where jc.year >= 2000 and jc.year <= 2018
group by a.name, jc.year
) ay
group by ay.name
-- authors with no articles at all in 2000-2018 do not appear here
having sum(case when ay.num_journal > ay.num_conference then 1 else 0 end) = 0;
Sounds like you could use a COALESCE in the GROUP BY
SELECT a.name,
COALESCE(j.year, c.year) as "year",
COUNT(j.articleID) AS JournalArticles,
COUNT(c.articleID) AS ConferenceArticles
FROM Author a
LEFT JOIN JournalArticle j ON (j.articleID = a.articleID AND j.year BETWEEN 2000 AND 2018)
LEFT JOIN ConferenceArticle c ON (c.articleID = a.articleID AND c.year BETWEEN 2000 AND 2018)
WHERE (j.year IS NOT NULL OR c.year IS NOT NULL)
GROUP BY a.name, COALESCE(j.year, c.year)
HAVING COUNT(c.articleID) >= COUNT(j.articleID)
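The "in every year" requirement can also be phrased as "there is no year in which journal articles outnumber conference articles". Below is a sketch of that NOT EXISTS formulation, run with Python's built-in sqlite3 module; the sample people and articles are invented for illustration. Starting from Person means an author with nothing published in 2000-2018 is kept, which matches the question's rule that zero articles still satisfies the condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE JournalArticle (articleID INTEGER, title TEXT, journal TEXT,
                             volume INTEGER, year INTEGER, month INTEGER);
CREATE TABLE ConferenceArticle (articleID INTEGER, title TEXT, conference TEXT,
                                year INTEGER, location TEXT);
CREATE TABLE Person (name TEXT, affiliation TEXT);
CREATE TABLE Author (name TEXT, articleID INTEGER);

-- invented sample data
INSERT INTO Person (name) VALUES ('Ann'), ('Bob'), ('Cy');
INSERT INTO JournalArticle (articleID, year) VALUES (1, 2001), (2, 2005), (3, 1999);
INSERT INTO ConferenceArticle (articleID, year) VALUES (10, 2001), (11, 2001);
INSERT INTO Author VALUES
  ('Ann', 1), ('Ann', 10), ('Ann', 11),   -- 2001: 1 journal, 2 conference
  ('Bob', 2), ('Bob', 10),                -- 2005: 1 journal, 0 conference
  ('Cy', 3);                              -- only a 1999 article, outside the range
""")

names = [r[0] for r in conn.execute("""
WITH articles AS (
    SELECT articleID, year, 1 AS is_journal, 0 AS is_conference FROM JournalArticle
    UNION ALL
    SELECT articleID, year, 0, 1 FROM ConferenceArticle
),
per_year AS (
    SELECT a.name, x.year,
           SUM(x.is_journal)    AS num_journal,
           SUM(x.is_conference) AS num_conference
    FROM Author a
    JOIN articles x ON x.articleID = a.articleID
    WHERE x.year BETWEEN 2000 AND 2018
    GROUP BY a.name, x.year
)
SELECT p.name
FROM Person p
WHERE NOT EXISTS (                     -- no offending year for this person
          SELECT 1 FROM per_year y
          WHERE y.name = p.name
            AND y.num_journal > y.num_conference)
ORDER BY p.name
""")]

print(names)
```

Bob is excluded because of 2005, where he has a journal article and no conference article. Ann and Cy remain.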