Get the beginning of a union of intervals - sql

Disclaimer
While searching for an answer, I found this question, but I couldn't find a way to express the solution in SQL:
Union of intervals
Background
I'm trying to calculate how long the people at the company I work for have been employed. In the database I have (which has been in the company for years and, sadly, cannot be changed), each contract is stored as one line. Each line holds a lot of information about the employee and the contract, including a contract creation date, a contract rescission date (or infinity, if still active) and the current contract status ("active" or "deactivated"). There are, however, two problems that prevent me from simply doing what might seem obvious:
People can be "multicontratual", so the same person could have multiple active lines at the same time.
Sometimes a transfer deactivates one of a person's contracts and creates a new contract line. These transfers must not be counted as a break (i.e., I should take both timelines into account). There is, however, no explicit flag for transfers in the database, so it was defined that "it is a transfer if any contract was rescinded within the 60 days before a new contract is created".
When trying to account for the multiple cases that could arise from this scenario (e.g., if the same person had many contracts over time, then no contract for more than 60 days, and then some other contracts, I'd want to start counting after the "more-than-60-days" gap), I found that two rules solve the problem. I need:
The last contract creation at which no other contract was already active (this solves problem 1)
and before which no other contract was active during the preceding 60 days.
To the DB
To solve the problem, I decided to rearrange the rules. I wanted to take all contracts for which there was no other active contract until 60 days before its creation, and then take the "MAX()" of them. So, for example, for the following person, I would say she is active since 1973:
+----------+-----+-----------+-----------+---------------+-----------------+
| CONTRACT | ... | PERSON_ID | STATUS | CREATION_DATE | RESCISSION_DATE |
+----------+-----+-----------+-----------+---------------+-----------------+
| 1 | ... | 1 | deactived | 1973/10/01 | 1999/07/01 |
| 2 | ... | 1 | deactived | 1978/06/01 | 2000/07/01 |
| 3 | ... | 1 | deactived | 2000/08/01 | 2008/06/01 |
| 4 | ... | 1 | active | 2000/08/01 | infinity |
| 5 | ... | 1 | active | 2000/08/01 | infinity |
+----------+-----+-----------+-----------+---------------+-----------------+
I am treating the dates as if they were integers (in fact, they are integers in the real database). My question is: how could I write a query that returns "1973/10/01"? I.e., how could I get all the "creation_date"s that are at least 60 days later than the other lines' intervals and that do not fall inside any of them?
[and, anyway, does this seem the best way to solve the problem? (I don't think so)]
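One way to read the two rules is as interval merging with a tolerance: sort the contracts by creation date, keep extending the current run while the next contract starts no more than 60 days after the latest rescission seen so far, and start a new run otherwise. The start of the last run is then the date being asked for. The following is a minimal Python sketch of that idea, not a SQL answer; the function name, the use of None for "infinity", and the choice to treat a gap of exactly 60 days as a transfer are assumptions made for illustration.

```python
from datetime import date

def current_employment_start(contracts, max_gap_days=60):
    """Return the start of the last run of contracts for one person.

    contracts: iterable of (creation_date, rescission_date) pairs;
    a rescission_date of None stands for "infinity" (still active).
    Assumption: a gap of up to max_gap_days counts as a transfer.
    """
    intervals = sorted((start, end or date.max) for start, end in contracts)
    if not intervals:
        return None
    run_start, run_end = intervals[0]
    for start, end in intervals[1:]:
        if (start - run_end).days > max_gap_days:
            # Real break in employment: a new run begins here.
            run_start, run_end = start, end
        else:
            # Overlap or transfer: extend the current run.
            run_end = max(run_end, end)
    return run_start

# The sample person from the table above.
person_1 = [
    (date(1973, 10, 1), date(1999, 7, 1)),
    (date(1978, 6, 1), date(2000, 7, 1)),
    (date(2000, 8, 1), date(2008, 6, 1)),
    (date(2000, 8, 1), None),
    (date(2000, 8, 1), None),
]
employed_since = current_employment_start(person_1)
```

For the sample person the only gap runs from 2000/07/01 to 2000/08/01, which is 31 days, so everything merges into one run starting on 1973/10/01. The same sort-and-scan can be expressed in SQL with window functions, but that translation is left out here because it depends on the database in use.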


In BDD, should scenarios be abstract descriptions of behavior, or should they contain concrete examples?

I feel I have reached a fundamental dilemma in writing BDD scenarios as a tester.
When writing BDD scenarios from testing perspective, I tend to end up using concrete examples with concrete data and observing the state, i.e. Given these initial values, When user performs an action, Then these final values should be observed. Example with an initial dataset given in Background:
Background:
Given following items are in the store
| type | name | X | Y | Z | tags |
| single | el1 | 10 | 20 | 1.03 | t1 |
| multi | el2 | 10 | 20 | 30 | t2 |
| single | el3 | 10 | 3.02 | 30 | t3 |
Scenario: Adding tag to multi-type item
Given Edit Item Popup is opened for item: el2
When user adds tag NEWTAG
And user clicks on Apply changes button
Then item store should display following items
| type | name | X | Y | Z | tags |
| single | el1 | 10 | 20 | 1.03 | t1 |
| multi | el2 | 10 | 20 | 30 | t2, NEWTAG |
| single | el3 | 10 | 3.02 | 30 | t3 |
The initial dataset from Background can be reused in all (or most) scenarios that deal with modifying and adding/deleting items in relation to a particular feature. I can also iterate the scenario over some data set that explores the problem space, boundary conditions etc. (trivial example here: tags with too many or forbidden chars).
But when requirements are not entirely clear, I sometimes take a different approach and start from a more abstract description of the behavior (so that scenarios can become the specification), which seems to me the more (for lack of a better word) correct way of doing BDD. So I end up with behavior descriptions which are perfectly clear when a human reads them from the requirement analysis position, but appear extremely vague when you shift to the testing perspective:
Scenario: Adding tag to multi-type item
Given Edit Item Popup is opened for multi-type item
When user adds a new tag
And user clicks on Apply changes button
Then that item should have that tag displayed in item store
For some reason I feel way better writing a scenario like that, as it seems closer to BDD ideals (describing the behavior, doh!). But at the same time I feel terrible for 2 reasons:
A lot of details are implicit here and thus hidden deep in the implementation. Because of that, while implementing, we need to ask ourselves a ton of questions like 'what initial data should I use here?', 'how to keep track of which item we are handling?', 'how deep should I examine the final state?'. This all goes away when you just compare the final state with a reference table, as in the first approach.
(Possibly more serious) I am not exploring the problem space here at all, while bugs often await us somewhere in dark corners of that space.
One could argue that these 2 approaches I presented are just extreme ends of a spectrum, but I still see them as fundamentally different, and often find myself wondering which approach to choose.
So, how do you write your BDD (test) scenarios? Data-driven and state-comparing, or full blown abstract descriptions of behavior?

how to have one itempointer serialize from 1 to n across the selected rows

As shown in the example below, the block IDs in the query output start at 324 and end at 327, and the item pointer (the row index within the block) starts again from one for each new block ID. In other words, as shown below:
for block ID 324 there is only the item pointer with index 10
for block ID 325 the item pointers start at 1 and end at 9
I want to have a single block ID so that the item pointer (the row index) starts from 1 and ends at 25.
Please let me know how to achieve that, and
why do I get several different block IDs?
ex-1
query:
select ctid
from awanti_grid_cell_data agcd
where selectedsiteid = '202230060950'
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment IS NOT NULL
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment <> 'None'
result:
|ctid |
|--------|
|(324,10)|
|(325,1) |
|(325,2) |
|(325,3) |
|(325,4) |
|(325,5) |
|(325,6) |
|(325,7) |
|(325,8) |
|(325,9) |
|(326,1) |
|(326,2) |
|(326,3) |
|(326,4) |
|(326,5) |
|(326,6) |
|(326,7) |
|(326,8) |
|(326,9) |
|(327,1) |
|(327,2) |
|(327,3) |
|(327,4) |
|(327,5) |
|(327,6) |
You are missing the point. The ctid is the physical address of a row in the table, and it is none of your business. The database is free to choose whatever place it thinks fit for a table row. As a comparison, you cannot go to the authorities and request that your social security number should be 12345678 - it is simply assigned to you, and you have no say. That's how it is with the physical location of tuples.
Very likely you are not asking this question out of pure curiosity, but because you want to solve some problem. You should instead ask a question about your real problem, and there may be a good answer to that. But whatever problem you are trying to solve, using the ctid is probably not the correct answer, in particular if you want to control it.
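If the real goal is simply a running index from 1 to n over the selected rows, the usual tool is the row_number() window function rather than ctid. The sketch below demonstrates the idea with Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer; PostgreSQL supports the same function). The table grid_cells and its columns are made-up stand-ins, not the table from the question, and ordering by cell_name is an assumption, since the question does not say which order the numbering should follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the real table; names are illustrative only.
conn.execute("CREATE TABLE grid_cells (site_id TEXT, cell_name TEXT)")
conn.executemany(
    "INSERT INTO grid_cells VALUES (?, ?)",
    [("202230060950", f"cell-{i:02d}") for i in range(1, 26)]
    + [("some-other-site", "cell-99")],
)

# row_number() numbers the rows of the result, independent of where
# the database physically stores them.
indexed_rows = conn.execute(
    """
    SELECT row_number() OVER (ORDER BY cell_name) AS row_index, cell_name
    FROM grid_cells
    WHERE site_id = ?
    ORDER BY row_index
    """,
    ("202230060950",),
).fetchall()
```

The numbering is computed per query, so it always runs from 1 to the number of rows returned, no matter how many physical blocks the rows are spread over.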

Find current data set using two SQL tables storing separately historical insertions and deletions

Problem
I need to do daily syncs of our latest internal data to an external audit database that does not offer an update interface. In order to update some records, I need to first generate and send in a deletion file to remove those records, and then follow by an insertion file with the same but updated records in it.
An important detail is that all of the records in deletion files must match the external records verbatim, in order to be deleted.
Proposed approach
Currently I use two separate SQL tables to version control what I have inserted/deleted.
Let's say that right now the inserted_records table looks like this:
id | file_version | contract_id | customer_name | start_year
9 | 6 | 1 | Alice | 2015
10 | 6 | 2 | Bob | 2015
11 | 6 | 3 | Charlie | 2015
Accompanied by a separate and empty deleted_records table with identical columns.
Now, if I want to
change the customer_name from Alice to Dave on line id 9
change the start_year for Bob from 2015 to 2020 on line id 10
Two new lines in inserted_records would be generated, line 12 and 13, in turn creating a new insertion file 7.
id | file_version | contract_id | customer_name | start_year
9 | 6 | 1 | Alice | 2015
10 | 6 | 2 | Bob | 2015
11 | 6 | 3 | Charlie | 2015
12 | 7 | 1 | Dave | 2015
13 | 7 | 2 | Bob | 2020
Then their original column values in line 9 and 10 are then copied onto the previously empty deleted_records, in turn creating a new deletion file 1.
id | file_version | contract_id | customer_name | start_year
1 | 1 | 1 | Alice | 2015
2 | 1 | 2 | Bob | 2015
Now, if I were to send in the deletion file 1 first followed by the insertion file 7, I would get the result that I wanted.
Question
How can I query the current set of records, considering all insertions and deletions that have occurred? Assuming all records in deleted_records always have matches in inserted_records and if multiple, we always delete records with smaller file version numbers first.
I first tried writing a query that returns the latest record per contract_id from inserted_records.
select top 1 with ties *
from inserted_records
order by row_number() over (partition by contract_id order by file_version desc)
This would give me line 11, 12 and 13, which is what I wanted in this particular example. But if we also wanted to delete the record line 11 with Charlie, then my query wouldn't work anymore as it doesn't take deleted_records into account, and I have no idea how to do it in SQL.
Furthermore, my gut tells me that this approach isn't solid, as there are two separate and moving parts. Perhaps there is a better approach to solve this?
How can I query the current set of records
I don't understand your question. Every SQL query is against the current set of records, if by that you mean the data currently in the database.
I do see a couple of problems.
Unless the table you're deleting from has a key defined, even an exact match on every column risks deleting more than one row.
You're performing an ad hoc update without UPDATE's transaction guarantee. I suppose the table you're updating is otherwise idle, and as a practical matter you don't have to worry about someone else (or you) re-inserting the deleted rows before your inserts arrive. But it's a problem waiting to happen.
If what you're trying to do is produce the set of rows that will be the result of a series of inserts and deletions, you haven't provided enough information to say how that could be done, or even if it's possible. There would have to be some way to uniquely identify rows, so that deletions and insertions can be associated. (They don't match on all columns, after all.) And you'd need some indication of order of operation, because it matters whether INSERT follows or precedes DELETE.
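If the question's own assumptions are taken at face value (every deleted row matches an inserted row on all data columns, and among identical rows the one with the smallest file version is deleted first), the current set can be treated as a multiset difference: number identical rows in each table and cancel the n-th insertion against the n-th deletion. The following is only a sketch under those assumptions, run through Python's sqlite3 module (SQLite 3.25+ for window functions); the CTE names ins and del and the column n are invented for the example, and rows containing NULLs would need extra care because NULL = NULL does not match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inserted_records (id INTEGER PRIMARY KEY, file_version INTEGER,
    contract_id INTEGER, customer_name TEXT, start_year INTEGER);
CREATE TABLE deleted_records (id INTEGER PRIMARY KEY, file_version INTEGER,
    contract_id INTEGER, customer_name TEXT, start_year INTEGER);
INSERT INTO inserted_records VALUES
    (9, 6, 1, 'Alice', 2015), (10, 6, 2, 'Bob', 2015),
    (11, 6, 3, 'Charlie', 2015), (12, 7, 1, 'Dave', 2015),
    (13, 7, 2, 'Bob', 2020);
INSERT INTO deleted_records VALUES
    (1, 1, 1, 'Alice', 2015), (2, 1, 2, 'Bob', 2015);
""")

CURRENT_SET_SQL = """
WITH ins AS (
    SELECT *, row_number() OVER (
        PARTITION BY contract_id, customer_name, start_year
        ORDER BY file_version, id) AS n
    FROM inserted_records
), del AS (
    SELECT contract_id, customer_name, start_year, row_number() OVER (
        PARTITION BY contract_id, customer_name, start_year
        ORDER BY file_version, id) AS n
    FROM deleted_records
)
SELECT ins.id, ins.file_version, ins.contract_id,
       ins.customer_name, ins.start_year
FROM ins
LEFT JOIN del
       ON del.contract_id = ins.contract_id
      AND del.customer_name = ins.customer_name
      AND del.start_year = ins.start_year
      AND del.n = ins.n          -- n-th deletion cancels n-th insertion
WHERE del.n IS NULL              -- keep insertions with no matching deletion
ORDER BY ins.id
"""
current_rows = conn.execute(CURRENT_SET_SQL).fetchall()
```

With the sample data this leaves lines 11, 12 and 13. Adding a deletion row for Charlie would remove line 11 as well. This does not address the concerns raised above about missing keys and ordering between files; it only reflects what the two tables record.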

Best data structure for finding tags of nested locations

Somebody pointed out that my data structure architecture sucks.
The task
I have a locations table which stores the name of a location. Then I have a tags table which stores information about those locations. The locations have a hierarchy which I want to use to get all tags.
Example
Locations:
USA <- California <- San Francisco <- Mission St
Tags:
USA: English
California: Sunny
California: West coast
San Francisco: Sea side
Mission St: Cable car station
If somebody requests information about Mission St, I want to deliver all tags of it and its ancestors (["English", "Sunny", "West coast", "Sea side", "Cable car station"]). If I request all tags of California, the answer would be ["English", "Sunny", "West coast"].
I'm looking for the best read performance! I don't care about write performance. This data is not changed very often. And I don't care about table sizes either. If I need more or larger tables to solve this quicker so be it.
The tables
So currently I'm thinking about setting up these tables:
locations
id | name
---|--------------
1 | USA
2 | California
3 | San Francisco
4 | Mission St
tags
id | location_id | name
---|-------------|------------------
1 | 1 | English
2 | 2 | Sunny
3 | 2 | West coast
4 | 3 | Sea side
5 | 4 | Cable car station
ancestors
I added a position field to store the hierarchy.
| id | location_id | ancestor_id | position |
|----|-------------|-------------|----------|
| 1 | 2 | 1 | 1 |
| 2 | 3 | 2 | 1 |
| 3 | 3 | 1 | 2 |
| 4 | 4 | 3 | 1 |
| 5 | 4 | 2 | 2 |
| 6 | 4 | 1 | 3 |
Question
Is this a good solution to the problem, or is there a better one? I want to select, as fast as possible, all tags of any given location including all the tags of its ancestors. I'm using a PostgreSQL database, but I think this is a pure SQL architecture problem.
Your problem seems to consist of two challenges. The most interesting is "how do I store hierarchies in a relational database". There are lots of answers to that - the one you've proposed is the most common.
There's an alternative called "nested set" which is faster for reading (in your example, finding all locations within a particular hierarchy would be "between x and y").
Postgres has dedicated support for hierarchies; I'd assume this would also provide great performance.
The second part of your question is "given a path in my hierarchy, retrieve all matching tags". The easiest option is to join to the tags table as you suggest.
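As a rough illustration of that join, the sketch below loads the question's three tables into an in-memory SQLite database through Python's sqlite3 module and fetches the tags of a location together with those of its ancestors. The helper name tags_for and the inline "chain" subquery are assumptions made for the example; the same SQL shape should carry over to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, location_id INTEGER, name TEXT);
CREATE TABLE ancestors (id INTEGER PRIMARY KEY, location_id INTEGER,
                        ancestor_id INTEGER, position INTEGER);
INSERT INTO locations VALUES
    (1, 'USA'), (2, 'California'), (3, 'San Francisco'), (4, 'Mission St');
INSERT INTO tags VALUES
    (1, 1, 'English'), (2, 2, 'Sunny'), (3, 2, 'West coast'),
    (4, 3, 'Sea side'), (5, 4, 'Cable car station');
INSERT INTO ancestors VALUES
    (1, 2, 1, 1), (2, 3, 2, 1), (3, 3, 1, 2),
    (4, 4, 3, 1), (5, 4, 2, 2), (6, 4, 1, 3);
""")

def tags_for(conn, location_id):
    """Tags of a location and all its ancestors, outermost ancestor first."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM (SELECT ? AS loc, 0 AS position        -- the location itself
              UNION ALL
              SELECT ancestor_id, position          -- plus every ancestor
              FROM ancestors
              WHERE location_id = ?) AS chain
        JOIN tags AS t ON t.location_id = chain.loc
        ORDER BY chain.position DESC, t.id
        """,
        (location_id, location_id),
    ).fetchall()
    return [name for (name,) in rows]

mission_st_tags = tags_for(conn, 4)
california_tags = tags_for(conn, 2)
```

Because the ancestors table already lists every ancestor of every location, the lookup needs no recursion regardless of how deep the hierarchy is; indexes on ancestors(location_id) and tags(location_id) would be the natural companions.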
The final aspect is "should you denormalize/precalculate". I usually recommend building and optimizing the "normalized" solution and only denormalize when you need to.
If you want to deliver all tags for a particular location, then I would recommend replicating the data and storing the tags in a tags array on a row for each location.
You say that the locations don't change very much. So, I would simply batch create the entire table, when any underlying data changes.
Modifying the data in situ is rather problematic. A single update could end up affecting a zillion different rows -- consider a tag change on USA. Recalculating the entire table is going to be more efficient.
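The batch rebuild described above might look roughly like the following sketch, again using Python's sqlite3 module with the question's sample data. The table name location_tags, storing the array as JSON text, and the helper names are all assumptions for illustration; in PostgreSQL a native array column would be the more natural fit.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, location_id INTEGER, name TEXT);
CREATE TABLE ancestors (id INTEGER PRIMARY KEY, location_id INTEGER,
                        ancestor_id INTEGER, position INTEGER);
INSERT INTO locations VALUES
    (1, 'USA'), (2, 'California'), (3, 'San Francisco'), (4, 'Mission St');
INSERT INTO tags VALUES
    (1, 1, 'English'), (2, 2, 'Sunny'), (3, 2, 'West coast'),
    (4, 3, 'Sea side'), (5, 4, 'Cable car station');
INSERT INTO ancestors VALUES
    (1, 2, 1, 1), (2, 3, 2, 1), (3, 3, 1, 2),
    (4, 4, 3, 1), (5, 4, 2, 2), (6, 4, 1, 3);
""")

def rebuild_location_tags(conn):
    """Recreate the read-only table from scratch (hypothetical helper)."""
    rows = conn.execute("""
        SELECT c.location_id, t.name
        FROM (SELECT id AS location_id, id AS source_id, 0 AS position
              FROM locations
              UNION ALL
              SELECT location_id, ancestor_id, position FROM ancestors) AS c
        JOIN tags AS t ON t.location_id = c.source_id
        ORDER BY c.location_id, c.position DESC, t.id
    """).fetchall()
    tags_by_location = {}
    for location_id, tag in rows:
        tags_by_location.setdefault(location_id, []).append(tag)
    with conn:  # commit the whole rebuild together
        conn.execute("DROP TABLE IF EXISTS location_tags")
        conn.execute("CREATE TABLE location_tags ("
                     "location_id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")
        conn.executemany(
            "INSERT INTO location_tags VALUES (?, ?)",
            [(loc, json.dumps(t)) for loc, t in tags_by_location.items()],
        )

def read_tags(conn, location_id):
    """Single-row lookup; locations without any tags have no row here."""
    row = conn.execute("SELECT tags FROM location_tags WHERE location_id = ?",
                       (location_id,)).fetchone()
    return json.loads(row[0]) if row else []

rebuild_location_tags(conn)
mission_st = read_tags(conn, 4)
```

Reads then touch exactly one row per location, and the rebuild can be rerun whenever locations, tags or the hierarchy change.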
If you need to search on the tags as well as return them, then I would go for a more traditional structure of a table with two important columns, location and tag. Then you can have indexes on both (location) and (tag) to facilitate searching in either direction.
If write performance is not crucial, I would go for denormalization of the database. That means you use the above structure for your write operations and fill a table for your read operations by a trigger or some async job, if you are afraid of triggers. Then the read performance is optimal, but you have to invest a bit more into the write logic.
Using the above structure for read operations is indeed not a smart solution, because you don't know how deep the tree can get.

ID Extracted from string not useable for connecting to bound form - "expression ... too complex"

I have a linked table to an Outlook Mailitem folder in my Access Database. This is handy in that it keeps itself constantly updated, but I can't add an extra field to relate these records to a parent table.
My workaround was to put an automatically generated/added ID String into the Subject so I could work from there. In order to make my form work the way I need it to, I'm trying to create a query that takes the fields I need from the linked table and adds a calculated field with the extracted ID so it can be referenced for relating records in the form.
The query works fine (I get all the records and their IDs extracted) but when I try to filter records from this query by the calculated field I get:
This expression is typed incorrectly, or it is too complex to be evaluated. For example, a numeric expression may contain too many complicated elements. Try simplifying the expression by assigning parts of the expression to variables.
I tried separating the calculated field out into three fields so it's easier to read, hoping that would make it easier to evaluate for Access, but I still get the same error. My base query is currently:
SELECT InStr(Subject,"Support Project #CS")+19 AS StartID,
InStr(StartID,Subject," ") AS EndID,
Int(Mid(Subject,StartID,EndID-StartID)) AS ID,
ProjectEmails.Subject,
ProjectEmails.[From],
ProjectEmails.To,
ProjectEmails.Received,
ProjectEmails.Contents
FROM ProjectEmails
WHERE (((ProjectEmails.[Subject]) Like "*Support Project [#]CS*"));
I've tried to bind a subform to this query on qryProjectEmailWithID.ID = SupportProject.ID where the main form is bound to SupportProject, and I get the above error. I tried building a query that selects all records from that query where the ID = a given parameter and I still get the same error.
The working query that adds Support Project IDs would look like:
+----+--------------------------------------+----------------------+----------------------+------------+----------------------------------+
| ID | Subject | To | From | Received | Contents |
+----+--------------------------------------+----------------------+----------------------+------------+----------------------------------+
| 1 | RE: Support Project #CS1 ID Extra... | questions#so.com | Isaac.Reefman#so.com | 2019-03-11 | Trying to work out how to add... |
| 1 | RE: Support Project #CS1 ID Extra... | isaac.reefman#so.com | questions#so.com | 2019-03-11 | Thanks for your question. The... |
| 1 | RE: Support Project #CS1 ID Extra... | isaac.reefman#so.com | questions#so.com | 2019-03-11 | You should use a different me... |
| 2 | RE: Support Project #CS2 IT issue... | support#domain.com | someone#company.com | 2019-02-21 | I really need some help with ... |
| 2 | RE: Support Project #CS2 IT issue... | someone#company.com | support#domain.com | 2019-02-21 | Thanks for your question. The... |
| 2 | RE: Support Project #CS2 IT issue... | someone#company.com | support#domain.com | 2019-02-21 | Have you tried turning it off... |
| 3 | RE: Support Project #CS3 email br... | support#domain.com | someone#company.com | 2019-02-12 | my email server is malfunccti... |
| 3 | RE: Support Project #CS3 email br... | someone#company.com | support#domain.com | 2019-02-12 | Thanks for your question. The... |
| 3 | RE: Support Project #CS3 email br... | someone#company.com | support#domain.com | 2019-02-13 | I've just re-started the nece... |
+----+--------------------------------------+----------------------+----------------------+------------+----------------------------------+
The view in question would populate a datasheet that looks the same with just the items whose ID matches the ID of the current SupportProject record, updating when a new record is selected. A separate text box should show the full content of whichever record is selected in that grid, like this:
Have you tried turning it off and on again?
From: support#domain.com
On: 21/02/2019
Thanks for your question. The matter has been assigned to Support Project #CS2, and a support staff member will be in touch shortly to help you out. As it is considered of medium priority, you should expect daily updates.
Thanks,
Support
From: someone#company
On: 21/02/2019
I really need some help with my computer. It seems really slow and I can't do my work efficiently.
Neither of these things happens when I try to use the calculated number to relate to the PK of the SupportProject table...
I don't know if this is a part of the problem, but whether I use Int(Mid(Subject... or Val(Mid(Subject... I still apparently get a Double, where the ID field (as an autoincrement ID) is a Long. I can't work out how to force it to return a Long, so I can't test whether that's the problem.
So that is output resulting from posted SQL? I really wanted raw data but close enough. If requirement is to extract number after ...CS, calculate in query and save query:
Val(Mid([Subject],InStr([Subject],"CS")+2))
Then build another query to join first query to table.
SELECT qryProjectEmailWithID.*, SupportProject.tst
FROM qryProjectEmailWithID
INNER JOIN SupportProject ON qryProjectEmailWithID.ID = SupportProject.ID;
Filter criteria can be applied to either ID field.
A subform can display the related child records synchronized with SupportProject records on main form.
I tested the ID calc with your data and then with a link to my Inbox. No issue with query join.
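To sanity-check the extraction logic outside Access, the Val(Mid(...InStr...)) expression above can be mirrored in a few lines of Python. This is only an approximation for the simple case in the sample data: the function name extract_project_id is made up, only leading whole numbers are handled (Access's Val accepts more formats), and the search here is case-sensitive whereas InStr in Access may not be.

```python
import re

def extract_project_id(subject, marker="CS"):
    """Return the integer that follows the first occurrence of marker.

    Returns None when the marker is absent, and 0 when no digits follow
    it (Val also yields 0 for text without a leading number).
    """
    pos = subject.find(marker)
    if pos == -1:
        return None
    match = re.match(r"\s*(\d+)", subject[pos + len(marker):])
    return int(match.group(1)) if match else 0

sample_subjects = [
    "RE: Support Project #CS1 ID Extra...",
    "RE: Support Project #CS2 IT issue...",
    "RE: Support Project #CS3 email br...",
]
extracted_ids = [extract_project_id(s) for s in sample_subjects]
```

The result is a plain integer, which matches the intent of joining the extracted value to a whole-number key.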