Get rid of "semi-duplicates" in a query


I currently have a query that returns, for example, the following: (You can assume that this is what the table structure looks like)
customer_id | start_date | end_date
1 | 20120101 | 20120401
2 | 20120402 | 20121231
1 | 20130101 | 20130401
1 | 20130101 | 20130330
2 | 20130331 | 99991231
2 | 20130402 | 99991231
There are two points to consider:
A Customer can come back, so doing a normal max/min approach on this doesn't work.
This is actually an overview of multiple services, and sometimes one of them starts or ends in a different date. (Very uncommon, but I need to deal with this scenario.)
So taking the above into account, I want a query that will return the 1st, 2nd, 3rd, and 5th lines.
My idea & approach to this would be:
If start_dates are equal, display the max end date. (group by customer_id & start_date, max(end_date))
If end_dates are equal, display the min start date. (group by customer_id & end_date, min(start_date))
I can write a query that will do one of the above, but I'm not sure how I'd be able to go about doing both of them at the same time. Or if a different approach altogether would be better.
SQL Server 2008
Thank you!

I think you can do this with a NOT EXISTS condition.
The following query should produce that output:
select customer_id , start_date , end_date
from table_name t_1
where not exists(
select 1 from table_name t_2
where t_2.customer_id = t_1.customer_id
and t_2.start_date = t_1.start_date
and t_2.end_date > t_1.end_date)
and not exists (
select 1 from table_name t_3
where t_3.customer_id = t_1.customer_id
and t_3.end_date = t_1.end_date
and t_3.start_date<t_1.start_date)
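To check the query above against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module instead of SQL Server 2008. The table name table_name comes from the answer; storing the dates as integers and the added ORDER BY (only there to make the output order predictable) are assumptions.

```python
import sqlite3

# In-memory database holding the six sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (customer_id INTEGER, start_date INTEGER, end_date INTEGER)"
)
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [
        (1, 20120101, 20120401),
        (2, 20120402, 20121231),
        (1, 20130101, 20130401),
        (1, 20130101, 20130330),
        (2, 20130331, 99991231),
        (2, 20130402, 99991231),
    ],
)

QUERY = """
SELECT customer_id, start_date, end_date
FROM table_name t_1
WHERE NOT EXISTS (          -- no row with the same start and a later end
        SELECT 1 FROM table_name t_2
        WHERE t_2.customer_id = t_1.customer_id
          AND t_2.start_date = t_1.start_date
          AND t_2.end_date > t_1.end_date)
  AND NOT EXISTS (          -- no row with the same end and an earlier start
        SELECT 1 FROM table_name t_3
        WHERE t_3.customer_id = t_1.customer_id
          AND t_3.end_date = t_1.end_date
          AND t_3.start_date < t_1.start_date)
ORDER BY start_date, customer_id
"""

kept = conn.execute(QUERY).fetchall()
for row in kept:
    print(row)
```

On this sample it keeps the 1st, 2nd, 3rd, and 5th rows of the question's table.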

Related

How to do a sub-select per result entry in postgresql?

Assume I have a table with only two columns: id, maturity. maturity is some date in the future and represents the date until which a specific entry will be available. Thus it differs between entries but is not necessarily unique. Over time, the number of entries that have not yet reached their maturity date changes.
I need to count the number of entries in such a table that were available on a specific date (that is, entries that had not yet reached their maturity). So I basically need to join these two queries:
SELECT generate_series as date FROM generate_series('2015-10-01'::date, now()::date, '1 day');
SELECT COUNT(id) FROM mytable WHERE mytable.maturity > now()::date;
where instead of now()::date I need to put each entry from the generated series. I'm sure this is simple enough, but I can't quite figure it out. The solution needs to remain a single query, so it seems I can't use for loops.
Sample table entries:
id | maturity
---+-------------------
1 | 2015-10-03
2 | 2015-10-05
3 | 2015-10-11
4 | 2015-10-11
Expected output:
date | count
------------+-------------------
2015-10-01 | 4
2015-10-02 | 4
2015-10-03 | 3
2015-10-04 | 3
2015-10-05 | 2
2015-10-06 | 2
NOTE: This count doesn't constantly decrease, since new entries are added and this count increases.

You have to reference the outer query's column in the WHERE clause of a subquery. This works when the subquery sits in the SELECT list of the outer query:
SELECT generate_series,
(SELECT COUNT(id)
FROM mytable
WHERE mytable.maturity > generate_series)
FROM generate_series('2015-10-01'::date, now()::date, '1 day');
More info: http://www.techonthenet.com/sql_server/subqueries.php
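The correlated subquery can be tried on the question's sample rows. The sketch below uses Python's sqlite3 module, so two things differ from the PostgreSQL original and are assumptions of this example: SQLite has no built-in generate_series, so a recursive CTE named days stands in for it, and the series ends at a fixed date (2015-10-06) instead of now() so that the output is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, maturity TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "2015-10-03"), (2, "2015-10-05"), (3, "2015-10-11"), (4, "2015-10-11")],
)

QUERY = """
WITH RECURSIVE days(day) AS (      -- stand-in for generate_series
    SELECT '2015-10-01'
    UNION ALL
    SELECT date(day, '+1 day') FROM days WHERE day < '2015-10-06'
)
SELECT day,
       (SELECT COUNT(id)           -- correlated on the outer row's day
        FROM mytable
        WHERE mytable.maturity > days.day) AS cnt
FROM days
ORDER BY day
"""

counts = conn.execute(QUERY).fetchall()
for day, cnt in counts:
    print(day, cnt)
```

The ISO date strings compare correctly as text, which is why plain TEXT columns are enough here.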

I think you want to group your data by the maturity Date.
Check this:
select maturity,count(*) as count
from your_table group by maturity;

Find last (first) instance in table but exclude most recent (oldest) date

I have a table that reflects a monthly census of a certain population. Each month on an unpredictable day early in that month, the population is polled. Any member who existed at that point is included in that month's poll, any member who didn't is not.
My task is to look through an arbitrary date range and determine which members were added or lost during that time period. Consider the sample table:
ID | Date
2 | 1/3/2010
3 | 1/3/2010
1 | 2/5/2010
2 | 2/5/2010
3 | 2/5/2010
1 | 3/3/2010
3 | 3/3/2010
In this case, member with ID "1" was added between Jan and Feb, and member with ID 2 was lost between Feb and Mar.
The problem I am having is that if I just poll to try and find the most recent entry, I will capture all the members that were dropped, but also all the members that exist on the last date. For example, I could run this query:
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
This would return:
ID | Date
1 | 3/3/2010
2 | 2/5/2010
3 | 3/3/2010
What I actually want, however, is just:
ID | Date
2 | 2/5/2010
Of course I can manually filter out the last date, but since the start and end date are parameters I want to generalize that. One way would be to run sequential queries. In the first query I'd find the last date, and then use that to filter in the second query. It would really help, however, if I could wrap this logic into a single query.
I'm also having a related problem when I try to find when a member was first added to the population. In that case I'm using a different type of query:
SELECT
ID,
Date
FROM
tableName i
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
AND
NOT EXISTS(
SELECT
ID,
Date
FROM
tableName ii
WHERE
ii.ID=i.ID
AND
ii.Date < i.Date
AND
Date BETWEEN '1/1/2010' AND '3/27/2010'
)
This returns:
ID | Date
1 | 2/5/2010
2 | 1/3/2010
3 | 1/3/2010
But what I want is:
ID | Date
1 | 2/5/2010
I would like to know:
1. Which approach (the MAX() or the subquery with NOT EXISTS) is more efficient and
2. How to fix the queries so that they only return the rows I want, excluding the first (last) date.
Thanks!

You could do something like this:
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
having max(date) < '3/1/2010'
This filters out anyone polled in March.
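Since the question asks for a version that does not hard-code the cutoff date, one possible generalization is to compare each member's MAX (or MIN) date with the overall MAX (or MIN) date inside the same range. This is a sketch, not part of the answer above: it runs on Python's sqlite3 module, stores dates as ISO strings, and the parameter names start and end are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableName (ID INTEGER, Date TEXT)")
conn.executemany(
    "INSERT INTO tableName VALUES (?, ?)",
    [
        (2, "2010-01-03"), (3, "2010-01-03"),
        (1, "2010-02-05"), (2, "2010-02-05"), (3, "2010-02-05"),
        (1, "2010-03-03"), (3, "2010-03-03"),
    ],
)

# Members whose last poll is earlier than the last poll in the range.
LOST = """
SELECT ID, MAX(Date)
FROM tableName
WHERE Date BETWEEN :start AND :end
GROUP BY ID
HAVING MAX(Date) < (SELECT MAX(Date) FROM tableName
                    WHERE Date BETWEEN :start AND :end)
"""

# Members whose first poll is later than the first poll in the range.
ADDED = """
SELECT ID, MIN(Date)
FROM tableName
WHERE Date BETWEEN :start AND :end
GROUP BY ID
HAVING MIN(Date) > (SELECT MIN(Date) FROM tableName
                    WHERE Date BETWEEN :start AND :end)
"""

params = {"start": "2010-01-01", "end": "2010-03-27"}
lost = conn.execute(LOST, params).fetchall()
added = conn.execute(ADDED, params).fetchall()
print("lost:", lost)
print("added:", added)
```

With the sample table this reports member 2 as lost and member 1 as added, which are the two single-row results the question asks for.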

SQL to find the date when the price last changed

Input:
Date Price
12/27 5
12/21 5
12/20 4
12/19 4
12/15 5
Required Output:
The earliest date since which the price has continuously been the same as the current price.
For example, the price has been 5 since 12/21.
The answer cannot be 12/15, because we want the earliest date from which the price has stayed equal to the current price without changing (on 12/20, the price was 4).

This should be about right. You didn't provide table structures or names, so...
DECLARE @CurrentPrice MONEY
SELECT TOP 1 @CurrentPrice=Price FROM Table ORDER BY Date DESC
SELECT MIN(Date) FROM Table WHERE Price=@CurrentPrice AND Date>(
SELECT MAX(Date) FROM Table WHERE Price<>@CurrentPrice
)
In one query:
SELECT MIN(Date)
FROM Table
WHERE Date >
( SELECT MAX(Date)
FROM Table
WHERE Price <>
( SELECT TOP 1 Price
FROM Table
ORDER BY Date DESC
)
)
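The single-query form can be checked against the question's input. This is a sketch on Python's sqlite3 module, so TOP 1 becomes LIMIT 1. The table name prices, the column names, and the year 2011 are assumptions, since the question gives neither names nor a year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (date TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [
        ("2011-12-27", 5),
        ("2011-12-21", 5),
        ("2011-12-20", 4),
        ("2011-12-19", 4),
        ("2011-12-15", 5),
    ],
)

QUERY = """
SELECT MIN(date)
FROM prices
WHERE date > (SELECT MAX(date)          -- last day with a different price
              FROM prices
              WHERE price <> (SELECT price   -- current (most recent) price
                              FROM prices
                              ORDER BY date DESC
                              LIMIT 1))
"""

since = conn.execute(QUERY).fetchone()[0]
print(since)
```

One caveat of this shape: if the price has never differed from the current price, the inner MAX is NULL and the query returns NULL rather than the first date in the table.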

This question doesn't quite make sense to me, so I'm not 100% sure what you are after.
Create four columns: old_price, new_price, old_date, new_date.
If old_price = new_price, simply print the old_date.

What database server are you using? If it was Oracle, I would use their windowing function. Anyway, here is a quick version that works in mysql:
Here is the sample data:
+------------+------------+---------------+
| date | product_id | price_on_date |
+------------+------------+---------------+
| 2011-01-01 | 1 | 5 |
| 2011-01-03 | 1 | 4 |
| 2011-01-05 | 1 | 6 |
+------------+------------+---------------+
Here is the query (it only works if you have 1 product - will have to add a "and product_id = ..." condition on the where clause if otherwise).
SELECT p.date as last_price_change_date
FROM test.prices p
left join test.prices p2 on p.product_id = p2.product_id and p.date < p2.date
where p.price_on_date - p2.price_on_date <> 0
order by p.date desc
limit 1
In this case, it will return "2011-01-03".
Not a perfect solution, but I believe it works. Have not tested on a larger dataset, though.
Make sure to create indexes on date and product_id, as it will otherwise bring your database server to its knees and beg for mercy.
Bernardo.

yet another date gap-fill SQL puzzle

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as CTEs in Vertica.
Here's what I've got:
t:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:
t_fill:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0 -- a delta of 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.
I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.
edit:
assume I have a table like this:
calendar:
day
------------
2011-01-01
2011-01-02
...
that can be involved with joins. My intent would be to maintain the date range in calendar to match the date range in t.
edit:
A few more notes on what I'm looking for, just to be specific:
In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:
1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.
2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.
A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that appears after the first date in t, rows between the first date in t and the first date for the id will be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only rows where metric!=0 and d_metric!=0.

This is about what Jonathan Leffler proposed, but in old-fashioned low-level SQL (without fancy CTEs, window functions, or aggregating subqueries):
SET search_path='tmp';
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
( zday date NOT NULL
, id INTEGER NOT NULL
, metric INTEGER NOT NULL
, d_metric INTEGER NOT NULL
, PRIMARY KEY (id,zday)
);
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
;
DROP TABLE ctable CASCADE;
CREATE TABLE ctable
( zday date NOT NULL
, PRIMARY KEY (zday)
);
INSERT INTO ctable(zday) VALUES
('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
;
CREATE VIEW v_cte AS (
SELECT t.zday,t.id,t.metric,t.d_metric
FROM ttable t
JOIN ctable c ON c.zday = t.zday
UNION
SELECT c.zday,t.id,t.metric, 0
FROM ctable c, ttable t
WHERE t.zday < c.zday
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday = c.zday
)
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday < c.zday
AND nx.zday > t.zday
)
)
;
SELECT * FROM v_cte;
The results:
zday | id | metric | d_metric
------------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
(4 rows)

I am not a Vertica user, but if you do not want to use their native support for gap filling, here you can find a more generic SQL-only solution to do so.

If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.
Depending on your needs you can make the temporary table transaction or session-scoped.
I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.

Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.
Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:
For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.
This gives you a complete list of dates with metrics.
You then need to self-join two copies of that list with dates one day apart to form the deltas.
Note that if some ID values don't appear at the start of the date range, they won't show up.
With that as guidance, you should be able to get going, I believe.
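The steps above can be sketched roughly as follows. This is an illustration on Python's sqlite3 module, not Vertica: the view name filled, the date() arithmetic, and the LIMIT 1 subquery are SQLite-flavoured assumptions, and the two extra calendar days only show that the range is clipped to the dates present in t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (day TEXT, id INTEGER, metric INTEGER, d_metric INTEGER);
INSERT INTO t VALUES ('2011-12-01', 1, 10, 10),
                     ('2011-12-03', 1, 12, 2),
                     ('2011-12-04', 1, 15, 3);

CREATE TABLE calendar (day TEXT PRIMARY KEY);
INSERT INTO calendar VALUES ('2011-11-30'), ('2011-12-01'), ('2011-12-02'),
                            ('2011-12-03'), ('2011-12-04'), ('2011-12-05');

-- Every calendar day within t's range, for every id, carrying the metric
-- of the most recent record on or before that day (NULL if none yet).
CREATE VIEW filled AS
SELECT c.day AS day,
       i.id  AS id,
       (SELECT t.metric FROM t
        WHERE t.id = i.id AND t.day <= c.day
        ORDER BY t.day DESC LIMIT 1) AS metric
FROM calendar c
CROSS JOIN (SELECT DISTINCT id FROM t) i
WHERE c.day BETWEEN (SELECT MIN(day) FROM t) AND (SELECT MAX(day) FROM t);
""")

# Self-join the filled list with the previous day to form the deltas.
QUERY = """
SELECT f.day, f.id, f.metric,
       f.metric - COALESCE(p.metric, 0) AS d_metric
FROM filled f
LEFT JOIN filled p
       ON p.id = f.id AND p.day = date(f.day, '-1 day')
WHERE f.metric IS NOT NULL      -- skip days before the id first appears
ORDER BY f.id, f.day
"""

t_fill = conn.execute(QUERY).fetchall()
for row in t_fill:
    print(row)
```

Treating a missing previous metric as 0 keeps the question's convention that the delta on an id's first day equals its metric value.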

Finding correlated values from second table without resorting to PL/SQL

I have the following two tables in my database:
a) A table containing values acquired at a certain date (you may think of these as, say, temperature readings):
sensor_id | acquired | value
----------+---------------------+--------
1 | 2009-04-01 10:00:00 | 20
1 | 2009-04-01 10:01:00 | 21
1 | 2009-04-01 10:02:00 | 20
1 | 2009-04-01 10:09:00 | 20
1 | 2009-04-01 10:11:00 | 25
1 | 2009-04-01 10:15:00 | 30
...
The interval between the readings may differ, but the combination of (sensor_id, acquired) is unique.
b) A second table containing time periods and a description (you may think of these as, say, periods when someone turned on the radiator):
sensor_id | start_date | end_date | description
----------+---------------------+---------------------+------------------
1 | 2009-04-01 10:00:00 | 2009-04-01 10:02:00 | some description
1 | 2009-04-01 10:10:00 | 2009-04-01 10:14:00 | something else
Again, the length of the period may differ, but there will never be overlapping time periods for any given sensor.
I want to get a result that looks like this for any sensor and any date range:
sensor id | start date | v1 | end date | v2 | description
----------+---------------------+----+---------------------+----+------------------
1 | 2009-04-01 10:00:00 | 20 | 2009-04-01 10:02:00 | 20 | some description
1 | 2009-04-01 10:10:00 | 25 | 2009-04-01 10:14:00 | 30 | something else
Or in text form: given a sensor_id and a date range of range_start and range_end,
find me all time periods which overlap the date range (that is, start_date < range_end and end_date > range_start) and, for each of these rows, find the corresponding values from the value table for the time period's start_date and end_date (the first row with acquired >= start_date and the first row with acquired >= end_date, respectively).
If it wasn't for the start_value and end_value columns, this would be a textbook trivial example of how to join two tables.
Can I somehow get the output I need in one SQL statement without resorting to writing a PL/SQL function to find these values?
Unless I have overlooked something blatantly obvious, this can't be done with simple subselects.
Database is Oracle 11g, so any Oracle-specific features are acceptable.
Edit: yes, looping is possible, but I want to know if this can be done with a single SQL select.

You can give this a try. Note the caveats at the end though.
SELECT
RNG.sensor_id,
RNG.start_date,
RDG1.value AS v1,
RNG.end_date,
RDG2.value AS v2,
RNG.description
FROM
Ranges RNG
INNER JOIN Readings RDG1 ON
RDG1.sensor_id = RNG.sensor_id AND
RDG1.acquired >= RNG.start_date
LEFT OUTER JOIN Readings RDG1_NE ON
RDG1_NE.sensor_id = RDG1.sensor_id AND
RDG1_NE.acquired >= RNG.start_date AND
RDG1_NE.acquired < RDG1.acquired
INNER JOIN Readings RDG2 ON
RDG2.sensor_id = RNG.sensor_id AND
RDG2.acquired >= RNG.end_date
LEFT OUTER JOIN Readings RDG2_NE ON
RDG2_NE.sensor_id = RDG2.sensor_id AND
RDG2_NE.acquired >= RNG.end_date AND
RDG2_NE.acquired < RDG2.acquired
WHERE
RDG1_NE.sensor_id IS NULL AND
RDG2_NE.sensor_id IS NULL
This uses the first reading after the start date of the range and the first reading after the end date (personally, I'd think using the last date before the start and end would make more sense or the closest value, but I don't know your application). If there is no such reading then you won't get anything at all. You can change the INNER JOINs to OUTER and put additional logic in to handle those situations based on your own business rules.
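To see this join pattern at work on the question's sample data, here is a small sketch using Python's sqlite3 module rather than Oracle 11g. The table names readings and ranges follow the query above; storing timestamps as ISO text and the trailing ORDER BY are assumptions of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (sensor_id INTEGER, acquired TEXT, value INTEGER);
INSERT INTO readings VALUES
  (1, '2009-04-01 10:00:00', 20), (1, '2009-04-01 10:01:00', 21),
  (1, '2009-04-01 10:02:00', 20), (1, '2009-04-01 10:09:00', 20),
  (1, '2009-04-01 10:11:00', 25), (1, '2009-04-01 10:15:00', 30);

CREATE TABLE ranges (sensor_id INTEGER, start_date TEXT,
                     end_date TEXT, description TEXT);
INSERT INTO ranges VALUES
  (1, '2009-04-01 10:00:00', '2009-04-01 10:02:00', 'some description'),
  (1, '2009-04-01 10:10:00', '2009-04-01 10:14:00', 'something else');
""")

# r1 / r2 are candidate readings at or after the start / end of a range.
# The *_ne joins look for an earlier candidate; keeping only rows where
# none exists leaves the first reading at or after each boundary.
QUERY = """
SELECT rng.sensor_id, rng.start_date, r1.value AS v1,
       rng.end_date, r2.value AS v2, rng.description
FROM ranges rng
JOIN readings r1
  ON r1.sensor_id = rng.sensor_id AND r1.acquired >= rng.start_date
LEFT JOIN readings r1_ne
  ON r1_ne.sensor_id = r1.sensor_id
 AND r1_ne.acquired >= rng.start_date AND r1_ne.acquired < r1.acquired
JOIN readings r2
  ON r2.sensor_id = rng.sensor_id AND r2.acquired >= rng.end_date
LEFT JOIN readings r2_ne
  ON r2_ne.sensor_id = r2.sensor_id
 AND r2_ne.acquired >= rng.end_date AND r2_ne.acquired < r2.acquired
WHERE r1_ne.sensor_id IS NULL AND r2_ne.sensor_id IS NULL
ORDER BY rng.start_date
"""

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

Restricting the output to a requested date range would only need the question's overlap condition added to the WHERE clause.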

It seems pretty straightforward.
Find the sensor values for each range. Find a row (I will call the acquired value of this row X) where X > start_date and there is no other row with acquired > start_date and acquired < X. Do the same for the end date.
Select only the ranges that match the query, that is, start_date before and end_date after the dates supplied by the query.
In SQL this would be something like this:
SELECT R1.*, SV1.acquired, SV2.acquired
FROM ranges R1
INNER JOIN sensor_values SV1 ON SV1.sensor_id = R1.sensor_id
INNER JOIN sensor_values SV2 ON SV2.sensor_id = R1.sensor_id
WHERE SV1.acquired > R1.start_date
AND NOT EXISTS (
SELECT *
FROM sensor_values SV3
WHERE SV3.sensor_id = R1.sensor_id
AND SV3.acquired > R1.start_date
AND SV3.acquired < SV1.acquired)
AND SV2.acquired > R1.end_date
AND NOT EXISTS (
SELECT *
FROM sensor_values SV4
WHERE SV4.sensor_id = R1.sensor_id
AND SV4.acquired > R1.end_date
AND SV4.acquired < SV2.acquired)
AND R1.start_date < :range_end
AND R1.end_date > :range_start
