How can I tell whether two intervals overlap at some point?
I have two tables, each storing an interval whose values are in metres.
The first interval corresponds to geological codes (VBv, P4, etc.).
The second interval corresponds to samples.
They are connected through a field called hole_id.
CREATE TABLE codes (
    code VARCHAR(10),
    depth_from INT,
    depth_to INT,
    hole_id INT
);
INSERT INTO codes VALUES ('P4', 1, 2, 100);
INSERT INTO codes VALUES ('VBv', 2, 6, 100);
INSERT INTO codes VALUES ('P4', 6, 10, 100);
CREATE TABLE samples (
    sample VARCHAR(50),
    depth_from INT,
    depth_to INT,
    hole_id INT
);
INSERT INTO samples VALUES ('OP0051780', 1, 3, 100);
INSERT INTO samples VALUES ('OP0051781', 3, 9, 100);
INSERT INTO samples VALUES ('OP0051780', 9, 10, 100);
I need all the sample ranges that match the code ranges, passing a particular code as a parameter.
What I have tried: I built a query that checks whether the "from" or "to" values match, and also whether either interval is contained in the other.
SELECT * FROM codes INNER JOIN samples ON codes.hole_id = samples.hole_id
WHERE codes.code = 'VBv' AND
(
-- Possibility 1: From or to match
(samples.depth_from = codes.depth_from or samples.depth_to = codes.depth_to)
-- Possibility 2: Some interval contained in another.
or (samples.depth_from >= codes.depth_from and samples.depth_to <= codes.depth_to)
or (codes.depth_from >= samples.depth_from and codes.depth_to <= samples.depth_to)
)
This works when an endpoint matches or when one interval is contained in the other. But when neither the "from" nor the "to" values match and neither interval is contained in the other, I don't know how to detect the overlap.
Two intervals overlap precisely when each one starts before the other ends, so a single condition covers matching endpoints, containment, and partial overlap alike:
SELECT * FROM codes INNER JOIN samples ON codes.hole_id = samples.hole_id
WHERE codes.code = 'VBv' AND
(samples.depth_from <= codes.depth_to AND samples.depth_to >= codes.depth_from)
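With the sample data this behaves as expected for 'VBv' (2 to 6): the 1-3 and 3-9 samples (OP0051780 and OP0051781) both satisfy the condition, while the 9-10 row is rejected because 9 <= 6 is false.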
Table 1 contains the history of all employee information but only captures the data every 90 days. Table 2 contains the current information for all employees and is updated weekly with a timestamp.
Table 2 is appended to Table 1 every 90 days. I figured that by taking the timestamp in Table 1, adding 90 days to it, and comparing it to the timestamp in Table 2, I could use the logic below to decide when to execute the append, but I'm getting an error:
TypeError: '<' not supported between instances of 'DataFrame' and 'DataFrame'
Am I missing something?
# Let's say the max date in table 1 is 2023-01-15. Adding 90 days would put us on 2023-04-15
futr_date = spark.sql('SELECT date_add(MAX(tm_update), 90) AS future_date FROM tbl_one')

# Checking the date in the weekly refresh table, I have a timestamp of 2023-02-03
curr_date = spark.sql('SELECT DISTINCT tm_update AS current_date FROM tbl_two')

if curr_date > futr_date:
    print('execute block of code that transforms table 2 data and append to table 1')
else:
    print('ignore and check again next week')
spark.sql() does not return a value but a DataFrame, and that's why you are getting the error. If you want the value, you need to collect it:
futr_date = spark.sql('SELECT date_add(MAX(tm_update), 90) AS future_date FROM tbl_one').collect()[0]
In the second query you are using DISTINCT to get the date, which may return a list of values; I am not sure that's what you want. Maybe you should use MIN here? With only one timestamp value it may not matter, but with more values it may cause issues.
As I said, I am not sure whether your logic is correct, but here is a working example you can use as a basis for further changes:
import time
import pyspark.sql.functions as F

# Epoch seconds for the historical (90-day) table.
historicalData = [
    (1, time.mktime(time.strptime("24/10/2022", "%d/%m/%Y"))),
    (2, time.mktime(time.strptime("15/01/2023", "%d/%m/%Y"))),
    (3, time.mktime(time.strptime("04/11/2022", "%d/%m/%Y"))),
]

# Epoch seconds for the weekly refresh table.
currentData = [
    (1, time.mktime(time.strptime("01/02/2023", "%d/%m/%Y"))),
    (2, time.mktime(time.strptime("02/02/2023", "%d/%m/%Y"))),
    (3, time.mktime(time.strptime("03/02/2023", "%d/%m/%Y"))),
]

oldDf = spark.createDataFrame(historicalData, schema=["id", "tm_update"]).withColumn(
    "tm_update", F.to_timestamp("tm_update")
)
oldDf.createOrReplaceTempView("tbl_one")

currentDf = spark.createDataFrame(currentData, schema=["id", "tm_update"]).withColumn(
    "tm_update", F.to_timestamp("tm_update")
)
currentDf.createOrReplaceTempView("tbl_two")

# collect() materialises the single-row results, so we compare Row values,
# not DataFrames.
futr_date = spark.sql(
    "SELECT date_add(MAX(tm_update), 90) AS future_date FROM tbl_one"
).collect()[0]
curr_date = spark.sql(
    "SELECT cast(MIN(tm_update) as date) AS current_date FROM tbl_two"
).collect()[0]

print(futr_date)
print(curr_date)

if curr_date > futr_date:
    print("execute block of code that transforms table 2 data and append to table 1")
else:
    print("ignore and check again next week")
Output
Row(future_date=datetime.date(2023, 4, 15))
Row(current_date=datetime.date(2023, 2, 3))
ignore and check again next week
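Alternatively, the comparison can be pushed into SQL itself so that only a single boolean comes back to the driver. A minimal sketch, assuming the same tbl_one/tbl_two temp views (the should_append alias is illustrative):
-- True when the weekly timestamp is past max(historical) + 90 days.
SELECT (SELECT CAST(MIN(tm_update) AS DATE) FROM tbl_two)
     > (SELECT date_add(MAX(tm_update), 90) FROM tbl_one) AS should_append
spark.sql(...).collect()[0][0] then yields a plain Python bool to branch on.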
I use generate_series to create a series of numbers from 0 to 200.
I have a table with a column called polutionmm2 that contains dirt areas in mm². What I need is to left join this table to the generated series, but the dirt area must be in cm², so divided by 100. I was not able to make this work, as I can't figure out how to join a table to a series that has no name.
This is what I have so far:
select generate_series(0,200,1) as x, cast(p.polutionmm2/100 as char(8)) as metric
from x
left join polutiondistributionstatistic as p on metric = x
ERROR: relation "x" does not exist
Here is some sample data: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=3d7d851887adb938819d6cf3e5849719
What I would need is the first column (x) counting all the way from 0 to 200 and, wherever there is a matching value, showing it in the second column.
Like this:
x, metric
0, 0
1, 1
2, 2
3, null
4, 4
5, null
... , ...
... , ...
200, null
You can put generate_series() in the FROM clause. So, I think you want something like this:
select gs.x, cast(p.polutionmm2 / 100 as char(8)) as metric
from generate_series(0, 200, 1) gs(x)
left join polutiondistributionstatistic p
    on gs.x = (p.polutionmm2 / 100);
I imagine there is also more to your query, because on its own this doesn't do much that is useful.
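One detail to note, assuming polutionmm2 is an integer column: / 100 is integer division in Postgres and truncates. Here that truncation is exactly what makes the equality join against the integer series work, but it silently discards the fraction:
SELECT 150 / 100   AS int_div,   -- 1   (integer division truncates)
       150 / 100.0 AS exact_div; -- 1.5 (numeric division keeps the fraction)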
I'm using SQL Server 2014 and I have this query, which needs to be rebuilt to accomplish what it does more efficiently.
As an example, I created this schema and added data to it so the problem can be reproduced. You can try it at Rextester (http://rextester.com/AIYG36293):
create table Dogs
(
    Name nvarchar(20),
    Owner_ID int,
    Shelter_ID int
);
insert into Dogs values
('alpha', 1, 1),
('beta', 2, 1),
('charlie', 3, 1),
('beta', 1, 2),
('alpha', 2, 2),
('charlie', 3, 2),
('charlie', 1, 3),
('beta', 2, 3),
('alpha', 3, 3);
I want to find out which shelter has exactly this set of owner and dog-name combinations. This is the query I'm using right now (it is more or less what Entity Framework generated, with some slight changes to make it simpler):
SELECT DISTINCT
Shelter_ID
FROM Dogs AS [Extent1]
WHERE ( EXISTS (SELECT
1 AS [C1]
FROM [Dogs] AS [Extent2]
WHERE [Extent1].[Shelter_ID] = [Extent2].[Shelter_ID] AND [Extent2].[Name] = 'charlie' AND [Extent2].[Owner_ID] = 1
)) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Dogs] AS [Extent3]
WHERE [Extent1].[Shelter_ID] = [Extent3].[Shelter_ID] AND [Extent3].[Name] = 'beta' AND [Extent3].[Owner_ID] = 2
)) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Dogs] AS [Extent4]
WHERE [Extent1].[Shelter_ID] = [Extent4].[Shelter_ID] AND [Extent4].[Name] = 'alpha' AND [Extent4].[Owner_ID] = 3
))
This query gets what I need, but I want to know whether there is a simpler way to write it, because in my actual use case I have far more than 3 combinations to worry about; it could reach 1,000 or more. Just imagine having 1,000 subqueries in there. When I try querying with that many, I get an error saying:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions.
NOTE: One solution I tried was using a PIVOT to flatten the data. The query becomes simpler, just a WHERE clause with a number of AND conditions, but at a higher number of combinations I exceed the allowable maximum row size and get this error when creating the temporary table that stores the flattened data:
Cannot create a row of size 10514 which is greater than the allowable maximum row size of 8060.
I appreciate any help or thoughts on this matter.
Thanks!
Count them: join the wanted combinations against Dogs and keep the shelters where the number of matching rows equals the size of the set.
WITH dogSet AS (
    SELECT *
    FROM (
        VALUES ('charlie', 1), ('beta', 2), ('alpha', 3)
    ) ts (Name, Owner_ID)
)
SELECT Shelter_ID
FROM Dogs AS [Extent1]
JOIN dogSet ts
    ON ts.Name = [Extent1].Name AND ts.Owner_ID = [Extent1].Owner_ID
GROUP BY Shelter_ID
HAVING COUNT(*) = (SELECT COUNT(*) FROM dogSet)
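One hedged caveat, not from the original answer: the count comparison assumes Dogs holds no duplicate (Name, Owner_ID, Shelter_ID) rows. If duplicates are possible, count each matched combination once, for example:
WITH dogSet AS (
    SELECT *
    FROM (VALUES ('charlie', 1), ('beta', 2), ('alpha', 3)) ts (Name, Owner_ID)
)
SELECT d.Shelter_ID
FROM Dogs d
JOIN dogSet ts
    ON ts.Name = d.Name AND ts.Owner_ID = d.Owner_ID
GROUP BY d.Shelter_ID
-- Count each matched combination once, however many times it is stored.
HAVING COUNT(DISTINCT ts.Name + '|' + CAST(ts.Owner_ID AS varchar(10)))
     = (SELECT COUNT(*) FROM dogSet);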
I am running PostgreSQL 9.1.9 x64 with PostGIS 2.0.3 under Windows Server 2008 R2.
I have a table:
CREATE TABLE field_data.trench_samples (
    pgid SERIAL NOT NULL,
    trench_id TEXT,
    sample_id TEXT,
    from_m INTEGER
);
With some data in it:
INSERT INTO field_data.trench_samples (
trench_id, sample_id, from_m
)
VALUES
('TR01', '1000001', 0),
('TR01', '1000002', 5),
('TR01', '1000003', 10),
('TR01', '1000004', 15),
('TR02', '1000005', 0),
('TR02', '1000006', 3),
('TR02', '1000007', 9),
('TR02', '1000008', 14);
Now, what I am interested in is finding the difference (distance in metres in this example) between a record's "from_m" and the "next" "from_m" for that trench_id.
So, based on the data above, I'd like to end up with a query that produces the following table:
pgid, trench_id, sample_id, from_m, to_m, interval
1, 'TR01', '1000001', 0, 5, 5
2, 'TR01', '1000002', 5, 10, 5
3, 'TR01', '1000003', 10, 15, 5
4, 'TR01', '1000004', 15, 20, 5
5, 'TR02', '1000005', 0, 3, 3
6, 'TR02', '1000006', 3, 9, 6
7, 'TR02', '1000007', 9, 14, 5
8, 'TR02', '1000008', 14, 19, 5
Now, you are likely saying "wait, how do we infer an interval length for the last sample in each line, since there is no "next" from_m to compare to?"
For the "ends" of lines (sample_id 1000004 and 1000008) I would like to use the identical interval length of the previous two samples.
Of course, I have no idea how to tackle this in my current environment. Your help is very much appreciated.
Here is how you get the difference, reusing the previous interval at the end of each trench (as shown in the desired output, though not explained clearly in the text).
The logic is repeated application of lead() and lag(). First apply lead() to calculate each interval; then apply lag() to fill in the interval at the boundary, using the previous interval.
The rest is basically just arithmetic:
select trench_id, sample_id, from_m,
       coalesce(to_m,
                from_m + lag(interval) over (partition by trench_id order by sample_id)
       ) as to_m,
       coalesce(interval,
                lag(interval) over (partition by trench_id order by sample_id)
       ) as interval
from (select t.*,
             lead(from_m) over (partition by trench_id order by sample_id) as to_m,
             lead(from_m) over (partition by trench_id order by sample_id) - from_m as interval
      from field_data.trench_samples t
     ) t
Here is the SQLFiddle showing it working.
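A small design note, not from the original answer: ordering by sample_id works here because the IDs happen to increase with depth, but ordering by from_m states the intent directly and survives non-sequential sample numbering. The inner query with that change would look like:
select t.*,
       lead(from_m) over (partition by trench_id order by from_m) as to_m,
       lead(from_m) over (partition by trench_id order by from_m) - from_m as interval
from field_data.trench_samples t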
I have two tables, both with start time and end time fields. I need to find, for each row in the first table, all of the rows in the second table where the time intervals intersect.
For example:
<-----row 1 interval------->
<---find this--> <--and this--> <--and this-->
Please phrase your answer in the form of a SQL WHERE-clause, AND consider the case where the end time in the second table may be NULL.
Target platform is SQL Server 2005, but solutions from other platforms may be of interest also.
SELECT *
FROM table1,table2
WHERE table2.start <= table1.end
AND (table2.end IS NULL OR table2.end >= table1.start)
It sounds very complicated until you start working in reverse.
Below I illustrate ONLY THE GOOD CASES (no overlap), defined by two simple conditions: the ranges do not overlap if condA OR condB is true. So we negate that, NOT condA AND NOT condB; in the code I just reversed the signs (> became <=).
/*
|--------| A \___ CondA: b.ddStart > a.ddEnd
|=========| B / \____ CondB: a.ddS > b.ddE
|+++++++++| A /
*/
--DROP TABLE ran
create table ran ( mem_nbr int, ID int, ddS date, ddE date)
insert ran values
(100, 1, '2012-1-1','2012-12-30'), ----\ ovl
(100, 11, '2012-12-12','2012-12-24'), ----/
(100, 2, '2012-12-31','2014-1-1'),
(100, 3, '2014-5-1','2014-12-14') ,
(220, 1, '2015-5-5','2015-12-14') , ---\ovl
(220, 22, '2014-4-1','2015-5-25') , ---/
(220, 3, '2016-6-1','2016-12-16')
select DISTINCT a.mem_nbr, a.*, '-' [ ], b.ddS, b.ddE, b.ID
FROM ran a
join ran b on a.mem_nbr = b.mem_nbr  -- match by member number
          AND a.ID <> b.ID           -- exclude the row itself
          AND b.ddS <= a.ddE         -- NOT b.ddS > a.ddE
          AND a.ddS <= b.ddE         -- NOT a.ddS > b.ddE
"solutions from other platforms may be of interest also."
SQL Standard defines OVERLAPS predicate:
Specify a test for an overlap between two events.
<overlaps predicate> ::=
<row value constructor 1> OVERLAPS <row value constructor 2>
Example:
SELECT 1
WHERE ('2020-03-01'::DATE, '2020-04-15'::DATE) OVERLAPS
('2020-02-01'::DATE, '2020-03-15'::DATE)
-- 1
db<>fiddle demo
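SQL Server, the asker's target platform, does not implement OVERLAPS. For half-open periods (start <= t < end), which is how the standard treats the operands, the predicate reduces to two comparisons; a sketch of the equivalent T-SQL check for the same dates:
SELECT 1
WHERE CAST('2020-03-01' AS date) < CAST('2020-03-15' AS date)  -- start1 < end2
  AND CAST('2020-02-01' AS date) < CAST('2020-04-15' AS date); -- start2 < end1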
select *
from table_1
right join table_2
    on (
        table_1.start between table_2.start and table_2.[end]
        or table_1.[end] between table_2.start and table_2.[end]
        or (table_1.[end] > table_2.start and table_2.[end] is null)
    )
EDIT: OK, don't go for my solution, it performs like shit. The "where" solution is 14x faster. Oops...
Some statistics: running on a DB with ~65,000 records in both table 1 and table 2 (no indexing), with intervals of 2 days between start and end in each row, capped at 2 minutes of runtime in SSMS (I don't have the patience to wait longer):
Using join: 8356 rows in 2 minutes
Using where: 115436 rows in 2 minutes
And what if you want to analyse such overlaps at minute precision with 70M+ rows? The only solution I could come up with was a time dimension table for the join; otherwise the duplicate handling became a headache, and the processing cost was astronomical.
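A minimal sketch of that time-dimension idea, assuming a Postgres-style generate_series; the minute_dim, start_ts, and end_ts names are illustrative, not from the comment:
-- One row per minute of the period under analysis.
CREATE TABLE minute_dim AS
SELECT ts
FROM generate_series('2020-01-01'::timestamp,
                     '2020-12-31'::timestamp,
                     interval '1 minute') AS g(ts);
-- Pairs of rows that share at least one minute overlap; GROUP BY collapses
-- the per-minute duplicates mentioned above.
SELECT t1.id AS t1_id, t2.id AS t2_id
FROM table_1 t1
JOIN minute_dim m ON m.ts >= t1.start_ts AND m.ts < t1.end_ts
JOIN table_2 t2   ON m.ts >= t2.start_ts AND m.ts < t2.end_ts
GROUP BY t1.id, t2.id;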