yet another date gap-fill SQL puzzle

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as a CTE in Vertica.
Here's what I've got:
t:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:
t_fill:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0 -- a delta of 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.
I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.
edit:
assume I have a table like this:
calendar:
day
------------
2011-01-01
2011-01-02
...
that can be used in joins. My intent is to maintain the date range in calendar so that it matches the date range in t.
edit:
A few more notes on what I'm looking for, just to be specific:
In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:
1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.
2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.
A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that first appears after the first date in t, rows between the first date in t and the id's first date would be filled with metric=0 and d_metric=0. I'd prefer to avoid this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only the rows where metric and d_metric are not both 0.

This is about what Jonathan Leffler proposed, but cast into old-fashioned low-level SQL (without fancy CTEs, window functions, or aggregating subqueries):
SET search_path='tmp';
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
( zday date NOT NULL
, id INTEGER NOT NULL
, metric INTEGER NOT NULL
, d_metric INTEGER NOT NULL
, PRIMARY KEY (id,zday)
);
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
;
DROP TABLE ctable CASCADE;
CREATE TABLE ctable
( zday date NOT NULL
, PRIMARY KEY (zday)
);
INSERT INTO ctable(zday) VALUES
('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
;
CREATE VIEW v_cte AS (
    -- Days that have a real row in ttable: keep them as-is.
    SELECT t.zday, t.id, t.metric, t.d_metric
    FROM ttable t
    JOIN ctable c ON c.zday = t.zday
    UNION
    -- Calendar days missing for an id: carry the metric forward from the
    -- most recent earlier row, with a delta of 0.
    SELECT c.zday, t.id, t.metric, 0
    FROM ctable c, ttable t
    WHERE t.zday < c.zday
    -- no real row for this id on the calendar day itself ...
    AND NOT EXISTS ( SELECT *
                     FROM ttable nx
                     WHERE nx.id = t.id
                     AND nx.zday = c.zday
                   )
    -- ... and no row between t's day and the calendar day,
    -- i.e. t is the most recent row before the gap
    AND NOT EXISTS ( SELECT *
                     FROM ttable nx
                     WHERE nx.id = t.id
                     AND nx.zday < c.zday
                     AND nx.zday > t.zday
                   )
)
;
SELECT * FROM v_cte;
The results:
zday | id | metric | d_metric
------------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
(4 rows)

I am not a Vertica user, but if you do not want to use its native support for gap filling, you can find a more generic SQL-only solution above.

If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view scoped to a single query.
Depending on your needs you can make the temporary table transaction or session-scoped.
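In Vertica, for example, that scoping is controlled by the ON COMMIT clause. A rough sketch, where the AS query is just a stand-in for the real gap-filling query:
-- PRESERVE ROWS makes the table session-scoped; the default,
-- DELETE ROWS, empties it at the end of each transaction.
CREATE LOCAL TEMPORARY TABLE t_fill ON COMMIT PRESERVE ROWS AS
SELECT c.day, t.id, t.metric, t.d_metric
FROM calendar c
JOIN t ON t.day = c.day;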
I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.
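For reference, Vertica's native constant-interpolation gap filling would look roughly like this with the TIMESERIES clause (a sketch from memory; verify the syntax against your Vertica version):
-- TS_FIRST_VALUE(..., 'CONST') carries the last seen metric forward into
-- each generated 1-day slice; LAG then rebuilds the deltas.
SELECT slice_time::DATE AS day, id, metric,
       metric - COALESCE(LAG(metric) OVER (PARTITION BY id ORDER BY slice_time), 0) AS d_metric
FROM (
    SELECT slice_time, id, TS_FIRST_VALUE(metric, 'CONST') AS metric
    FROM t
    TIMESERIES slice_time AS '1 day' OVER (PARTITION BY id ORDER BY day::TIMESTAMP)
) gap_filled;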

Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.
Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:
For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.
This gives you a complete list of dates with metrics.
You then need to self-join two copies of that list with dates one day apart to form the deltas.
Note that if some ID values don't appear at the start of the date range, they won't show up.
With that as guidance, you should be able to get going, I believe.
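For what it's worth, a sketch of that recipe in plain SQL, with no CTEs or window functions (calendar and t are the tables from the question; the aliases are mine, and the two derived tables are identical, one shifted a day via the join):
SELECT cur.day, cur.id, cur.metric,
       cur.metric - COALESCE(prev.metric, 0) AS d_metric
FROM (
    -- for every calendar day and id: the metric of the most recent
    -- row in t on or before that day (NULL before the id's first row)
    SELECT c.day, i.id,
           (SELECT t.metric
            FROM t
            WHERE t.id = i.id
            AND t.day = (SELECT MAX(t2.day)
                         FROM t t2
                         WHERE t2.id = i.id
                         AND t2.day <= c.day)) AS metric
    FROM calendar c
    CROSS JOIN (SELECT DISTINCT id FROM t) i
) cur
LEFT JOIN (
    SELECT c.day, i.id,
           (SELECT t.metric
            FROM t
            WHERE t.id = i.id
            AND t.day = (SELECT MAX(t2.day)
                         FROM t t2
                         WHERE t2.id = i.id
                         AND t2.day <= c.day)) AS metric
    FROM calendar c
    CROSS JOIN (SELECT DISTINCT id FROM t) i
) prev ON prev.id = cur.id AND prev.day = cur.day - 1
-- drop days before the id's first appearance (no backfill)
WHERE cur.metric IS NOT NULL;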

Related

Compare one row of a table to every row of a second table

I am trying to retrieve the number of days between a random date and the next known date for a holiday. Let's say my first table looks like this:
date | is_holiday | zone
9/11/18 | 0 | A
22/12/18 | 1 | A
and my holidays table looks like this
start_date | end_date | zone
20/12/18 | 04/01/18 | A
21/12/18 | 04/01/18 | B
...
I want to be able to know how many days are between an entry that is not a holiday in the first table and the next holiday date.
I have tried to get the next row with a later date in a join clause but the join isn't the tool for this task. I also have tried grouping by date and comparing the date with the next row but I can have multiple entries with the same date in the first table so it doesn't work.
This is the join clause I have tried :
SELECT mai.*, vac.start_date, datediff(vac.start_date, mai.date)
FROM (SELECT *
FROM MAIN
WHERE is_holiday = 0
) mai LEFT JOIN
(SELECT start_date, zone
FROM VACATIONS_UPDATED
ORDER BY start_date
) vac
ON mai.date < vac.start_date AND mai.zone = vac.zone
I expect to get a table looking like this:
date | is_holiday | zone | next_holiday
9/11/18 | 0 | A | 11
22/12/18 | 1 | A | 0
Any lead on how to achieve this?
It might get messy to do it in SQL, but in case you are open to doing it from code, here is what it could look like. You basically need a crossJoin:
Dataset<Row> table1 = <readData>
Dataset<Row> holidays = <readData>
// cache the small table to get the best performance, and alias both
// sides so the filter can reference their columns unambiguously
Dataset<Row> result = table1.alias("t1")
    .crossJoin(holidays.cache().alias("hol"))
    .filter("t1.zone = hol.zone AND t1.date < hol.start_date")
    .selectExpr("t1.*", "hol.start_date")
    .withColumn("nextHoliday", functions.datediff(functions.col("start_date"), functions.col("date")));
In scenarios where one row from table1 matches multiple holidays, you can add an id column to table1 and then group the crossJoin result by that id, keeping the minimum difference.
// add unique id to the rows
table1 = table1.withColumn("id", functions.monotonically_increasing_id() )
Some details on crossJoins:
http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-apache-spark/
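If staying in SQL is preferable after all, the multiple-match problem can also be handled with a grouped join. A sketch against the question's tables (DATEDIFF argument order as in the question; only the non-holiday rows, as in the original attempt):
SELECT mai.date, mai.is_holiday, mai.zone,
       MIN(DATEDIFF(vac.start_date, mai.date)) AS next_holiday
FROM MAIN mai
LEFT JOIN VACATIONS_UPDATED vac
       ON vac.zone = mai.zone AND vac.start_date > mai.date
WHERE mai.is_holiday = 0
GROUP BY mai.date, mai.is_holiday, mai.zone;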

Finding & updating duplicate rows

I need to implement a query (or maybe a stored procedure) that will perform soft de-duplication of data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update the other.
The similarity is based on a score. Score is calculated the following way:
from both records, take values of column A,
values equal? add A1 to the score,
values not equal? subtract A2 from the score,
move on to the next column.
As soon as all desired value pairs are checked:
is the resulting score more than X?
yes – records are duplicates; mark the older record as "duplicate" and append its id to a duplicate_ids column on the newer record.
no – do nothing.
How would I approach solving this task in SQL?
The table in question is called people. People records are entered by different admins. The de-duplication process exists to make sure no two records for the same person exist in the system.
The motivation for the task is simple: performance.
Right now the solution is implemented in a scripting language via several sub-par SQL queries and logic on top of them. However, the volume of data is expected to grow to tens of millions of records, and the script will eventually become very slow (it runs via cron every night).
I'm using postgresql.
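(For concreteness, the scoring scheme described above could be expressed as a pairwise self-join along these lines. This is a sketch only: the column names and the A1/A2/X constants are placeholders, and the quadratic join is exactly the part that gets slow at scale.)
SELECT *
FROM (SELECT p1.id AS older_id, p2.id AS newer_id,
             CASE WHEN p1.first_name = p2.first_name THEN 10 ELSE -5 END
           + CASE WHEN p1.last_name  = p2.last_name  THEN 10 ELSE -5 END
           + CASE WHEN p1.birth_date = p2.birth_date THEN 20 ELSE -10 END AS score
      FROM people p1
      JOIN people p2 ON p1.id < p2.id  -- each pair once; assumes ids grow over time
     ) scored
WHERE score > 25;  -- X = 25, purely illustrative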
It appears that de-duplication is generally a tough problem.
I found this: https://github.com/dedupeio/dedupe. There's a good description of how this works: https://dedupe.io/documentation/how-it-works.html.
I'm going to explore dedupe. I'm not going to try to implement it in SQL.
If I get you correctly, this could help.
You can use PostgreSQL Window Functions to get all the duplicates and use "weights" to determine which records are duplicated so you can do whatever you like with them.
Here is an example:
-- Temporal table for the test, primary key is id and
-- we have A,B,C columns with a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text,creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
-- id | colA | colB | colC | creation_date
-- ----+-------+-------+-------+---------------
-- 1 | A | B | C | 2017-05-01
-- 2 | D | E | F | 2017-06-01
-- 3 | A | B | D | 2017-08-01 <-- Duplicate A,B
-- 4 | A | B | R | 2017-09-01 <-- Duplicate A,B
-- 5 | C | J | K | 2017-09-01
-- 6 | A | C | J | 2017-10-01
-- 7 | C | W | K | 2017-10-01 <-- Duplicate C,K
-- 8 | R | T | Y | 2017-11-01
-- Here is the query you can use to get the ids of the duplicate records
-- (read the comments from the innermost query outward):
-- third, you select the id of the duplicates
SELECT id
FROM
(
-- Second, select all the columns needed and weight the duplicates.
-- You don't need to select every column, if only the id is needed
-- then you can only select the id
-- Query this SQL to see results:
SELECT
id,"colA", "colB", "colC",creation_date,
-- The weights are simple: if the row number is greater than 1 (the value
-- has appeared in an earlier row), assign 1, otherwise assign 0; sum them
-- and you have a total weight of 'duplicity'.
CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
FROM
(
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
FROM test ORDER BY id
) count_column_duplicates
) duplicates
-- Here you define which weight counts as a duplicate; for the test,
-- rows with a weight greater than 1 are selected
WHERE weight>1
-- The full SQL returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can add this query to a stored procedure so you can run it whenever you like. Hope it helps.

Using a value from a previous row to calculate a value in the next row

I am trying to create a report that pulls the date from a previous row, does some calculation and then displays the answer on the row below that row. The column in question is "Time Spent".
E.g. I have 3 rows.
+=====+===============+============+====+
|name | DatCompleted | Time Spent | idx|
+=====+===============+============+====+
| A | 1/1/17 | NULL | 0 |
+-----+---------------+------------+----+
| B | 11/1/17 | 10 days | 1 |
+-----+---------------+------------+----+
| C | 20/1/17 | 9 days | 2 |
+=====+===============+============+====+
Time Spent of C = DateCompleted of C - DateCompleted of B
Apart from using a crazy loop and working row by row instead of set-based, I can't see how I would complete this. Has anyone ever used this logic before in SQL? If so, how did you go about it?
Thanks in advance!
Most databases support the ANSI standard LAG() function. Date functions differ depending on the database, but something like this:
select t.*,
(DateCompleted - lag(DateCompleted) over (order by DateCompleted)) as TimeSpent
from t;
In SQL Server, you would use datediff():
select t.*,
datediff(day,
lag(DateCompleted) over (order by DateCompleted),
DateCompleted
) as TimeSpent
from t;
You can do this by using ROW_NUMBER. The syntax is:
ROW_NUMBER ( ) OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause )
For reference, see the ROW_NUMBER documentation.
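When there is no index column, the same idea works with ROW_NUMBER in derived tables. A sketch in SQL Server syntax (the derived-table aliases are mine):
-- Number the rows by completion date, then join each row to the previous one.
SELECT cur.name, cur.DateCompleted,
       DATEDIFF(day, prev.DateCompleted, cur.DateCompleted) AS TimeSpent
FROM (SELECT name, DateCompleted,
             ROW_NUMBER() OVER (ORDER BY DateCompleted) AS rn
      FROM t) cur
LEFT JOIN (SELECT DateCompleted,
                  ROW_NUMBER() OVER (ORDER BY DateCompleted) AS rn
           FROM t) prev
       ON prev.rn = cur.rn - 1;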
You have an index already (similar to the row number above), so you can join the table to itself, matching each row with the previous one:
SELECT t.*,
       DATEDIFF(day, prev.DateCompleted, t.DateCompleted) AS TimeSpent
FROM table1 t
LEFT JOIN table1 prev ON prev.idx = t.idx - 1
The LEFT JOIN keeps the first row (idx 0) with a NULL TimeSpent, matching the expected output.

How to do a sub-select per result entry in postgresql?

Assume I have a table with only two columns: id and maturity. maturity is some date in the future that indicates until when a specific entry will be available. It is thus different for different entries but not necessarily unique, and over time the number of entries that have not reached their maturity date changes.
I need to count the number of entries from such a table that were available on a specific date (that is, entries that had not yet reached their maturity). So I basically need to join these two queries:
SELECT generate_series as date FROM generate_series('2015-10-01'::date, now()::date, '1 day');
SELECT COUNT(id) FROM mytable WHERE mytable.maturity > now()::date;
where instead of now()::date I need to put the entry from the generated series. I'm sure this has to be simple enough, but I can't quite get my head around it. I need the resulting solution to remain a query, so it seems I can't use for loops.
Sample table entries:
id | maturity
---+-------------------
1 | 2015-10-03
2 | 2015-10-05
3 | 2015-10-11
4 | 2015-10-11
Expected output:
date | count
------------+-------------------
2015-10-01 | 4
2015-10-02 | 4
2015-10-03 | 3
2015-10-04 | 3
2015-10-05 | 2
2015-10-06 | 2
NOTE: This count doesn't constantly decrease, since new entries are added and this count increases.
You have to reference fields of the outer query in the WHERE clause of the sub-query; in other words, use a correlated subquery. This works when the subquery is in the SELECT clause of the outer query:
SELECT generate_series AS date,
       (SELECT COUNT(id)
        FROM mytable
        WHERE mytable.maturity > generate_series) AS count
FROM generate_series('2015-10-01'::date, now()::date, '1 day');
More info: http://www.techonthenet.com/sql_server/subqueries.php
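If the correlated subquery gets slow on large tables, an equivalent formulation is a join plus GROUP BY (a sketch; the LEFT JOIN keeps days where the count is 0):
SELECT d.date, COUNT(m.id) AS count
FROM generate_series('2015-10-01'::date, now()::date, '1 day') AS d(date)
LEFT JOIN mytable m ON m.maturity > d.date
GROUP BY d.date
ORDER BY d.date;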
I think you want to group your data by the maturity date.
Check this:
select maturity,count(*) as count
from your_table group by maturity;

Transforming a 2 column SQL table into 3 columns, column 3 lagged on 2

Here's my problem: I want to write a query (that goes into a larger query) that takes a table like this;
ID | DATE
A | 1
A | 2
A | 3
B | 1
B | 2
and so on, and transforms it into;
ID | DATE1 | DATE2
A | 1 | 2
A | 2 | 3
A | 3 | NOW
B | 1 | 2
B | 2 | NOW
Where the numbers are dates, and NOW() is always appended to the most recent date. Given free rein I would do this in Python, but unfortunately this goes into a larger query. We're using Sybase's SQL Anywhere 12, I think? I interact with the database using SQuirreL SQL.
I'm very stumped. I thought (SQL query to transform a list of numbers into 2 columns) would help, but I'm afraid I don't know enough to make it work. I was thinking of JOINing the table to itself, but I don't know how to SELECT for only the A-1-2 rows instead of the A-1-3 rows as well, for instance, or how to insert the NOW() value into it. Does anyone have any ideas?
I made an sqlfiddle.com example to outline a solution for your case. You mentioned dates but used integers, so I chose to do an integer example; it can be modified. I wrote it in PostgreSQL, so the coalesce() function can be substituted with nvl() or similar. Also, the parameter '0' can be substituted with any value, including now(), but then you must change the data type of the "i" column in the table to a date as well. Please let me know if you need further help on this.
SELECT a.id, a.i AS date1, COALESCE(MIN(b.i), '0') AS date2
FROM test a
LEFT JOIN test b ON b.id = a.id AND a.i < b.i
GROUP BY a.id, a.i
ORDER BY a.id, a.i;
http://sqlfiddle.com/#!15/f1fba/6
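For the date-typed case the question actually asks about, the same pattern might look like this (a sketch; t, id, and the date column d are hypothetical names):
-- DATE2 is the next later date for the same id, or NOW() when none exists.
SELECT a.id, a.d AS date1, COALESCE(MIN(b.d), NOW()) AS date2
FROM t a
LEFT JOIN t b ON b.id = a.id AND b.d > a.d
GROUP BY a.id, a.d
ORDER BY a.id, a.d;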