Find current data set using two SQL tables storing separately historical insertions and deletions - sql

Problem
I need to do daily syncs of our latest internal data to an external audit database that does not offer an update interface. In order to update some records, I need to first generate and send in a deletion file to remove those records, and then follow by an insertion file with the same but updated records in it.
An important detail is that all of the records in deletion files must match the external records verbatim, in order to be deleted.
Proposed approach
Currently I use two separate SQL tables to version control what I have inserted/deleted.
Let's say that right now the inserted_records table looks like this:
id | file_version | contract_id | customer_name | start_year
9 | 6 | 1 | Alice | 2015
10 | 6 | 2 | Bob | 2015
11 | 6 | 3 | Charlie | 2015
Accompanied by a separate and empty deleted_records table with identical columns.
Now, if I want to
change the customer_name from Alice to Dave on line id 9
change the start_year for Bob from 2015 to 2020 on line id 10
Two new lines in inserted_records would be generated, line 12 and 13, in turn creating a new insertion file 7.
id | file_version | contract_id | customer_name | start_year
9 | 6 | 1 | Alice | 2015
10 | 6 | 2 | Bob | 2015
11 | 6 | 3 | Charlie | 2015
12 | 7 | 1 | Dave | 2015
13 | 7 | 2 | Bob | 2020
Then their original column values in line 9 and 10 are then copied onto the previously empty deleted_records, in turn creating a new deletion file 1.
id | file_version | contract_id | customer_name | start_year
1 | 1 | 1 | Alice | 2015
2 | 1 | 2 | Bob | 2015
Now, if I were to send in the deletion file 1 first followed by the insertion file 7, I would get the result that I wanted.
Question
How can I query the current set of records, considering all insertions and deletions that have occurred? Assuming all records in deleted_records always have matches in inserted_records and if multiple, we always delete records with smaller file version numbers first.
I have tried by first writing one to query the inserted_records for the latest records grouped by contract_id.
select top 1 with ties *
from insertion_record
order by row_number() over (partition by contract_id order by file_version desc)
This would give me line 11, 12 and 13, which is what I wanted in this particular example. But if we also wanted to delete the record line 11 with Charlie, then my query wouldn't work anymore as it doesn't take deleted_records into account, and I have no idea how to do it in SQL.
Furthermore, my nut tells me that this approach isn't solid as there are two separate and moving parts, perhaps there is a better approach to solve this?

How can I query the current set of records
I don't understand your question. Every SQL query is against the current set of records, if by that you mean the data currently in the database.
I do see a couple of problems.
Unless the table you're deleting from has a key defined, even an exact match on every column risks deleting more than one row.
You're performing an ad hoc update with UPDATE's transaction guarantee. I suppose the table you're updating is otherwise idle, and as a practical matter you don't have to worry about someone else (or you) re-inserting the deleted rows before your inserts arrive. But it's problem waiting to happen.
If what you're trying to do is produce the set of rows that will be the result of a series of inserts and deletions, you haven't provided enough information to say how that could be done, or even if it's possible. There would have to be some way to uniquely identify rows, so that deletions and insertions can be associated. (They don't match on all columns, after all.) And you'd need some indication of order of operation, because it matters whether INSERT follows or precedes DELETE.

Related

how to have one itempointer serialize from 1 to n across the selected rows

as shown in the example below, the output of the query contains blockid startds from 324 and it ends at 127, hence, the itempointer or the row index within the block starts from one for each new block id. in otherwords, as shown below
for the blockid 324 it has only itempointer with index 10
for the blockid 325 it has itempointers starts with 1 and ends with 9
i want to have a single blockid so that the itempointer or the row index starts from 1 and ends with 25
plese let me know how to achive that and
why i have three different blockids?
ex-1
query:
select ctid
from awanti_grid_cell_data agcd
where selectedsiteid = '202230060950'
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment IS NOT NULL
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment <> 'None'
result:
|ctid |
|--------|
|(324,10)|
|(325,1) |
|(325,2) |
|(325,3) |
|(325,4) |
|(325,5) |
|(325,6) |
|(325,7) |
|(325,8) |
|(325,9) |
|(326,1) |
|(326,2) |
|(326,3) |
|(326,4) |
|(326,5) |
|(326,6) |
|(326,7) |
|(326,8) |
|(326,9) |
|(327,1) |
|(327,2) |
|(327,3) |
|(327,4) |
|(327,5) |
|(327,6) |
You are missing the point. The ctid is the physical address of a row in the table, and it is none of your business. The database is free to choose whatever place it thinks fit for a table row. As a comparison, you cannot go to the authorities and request that your social security number should be 12345678 - it is simply assigned to you, and you have no say. That's how it is with the physical location of tuples.
Very likely you are not asking this question out of pure curiosity, but because you want to solve some problem. You should instead ask a question about your real problem, and there may be a good answer to that. But whatever problem you are trying to solve, using the ctid is probably not the correct answer, in particular if you want to control it.

Select rows in a table (postgis) from selected features QGIS

How do I select rows in a table based on a key (PK) from another table. I have selected multiple polygons which is within a geografical region from one layer.
The attributes table from the selected layer look like this:
| Bloknr | Column 1 | Column 2 | Column 3 |
| 111-08 | xqyz | xyzq | qxyz |
| 208-09 | abc | cba | bca |
Where the row in question (row 1) is selected.
I now want to select this row from a nongeographic layer (from a postgresql database) with a table that looks like this:
| BLOKNR | Column 1 | Column 2 | Column 3 |
| 111-08 | cab | bac | cab |
| 208-09 | abc | cba | bca |
| 111-08 | cba | bca | cab |
Where the first and third row is to be selected.
There is about 20.000.000 rows in the postgres table and multiple matches on each bloknr
I work in qgis ver. 3.2 and postgresql with PGadmin4
Any help most appreciated.
UPDATE to answer the comments
It would be simple, if it was a matter of doing it within postgres - it's kind of made for that - but i cannot figure out how to query within qgis i would like not to have to export each table (I have a few, and for each i need multiple selection queries, based on geography) to postgresql - partly because i would like to keep the workflow in qgis, and partly because the export feature in the DB manager of qgis gives me this error - which i think means that i have to make all the tables manually.
" ERROR: function addgeometrycolumn(unknown, unknown, unknown,
integer, unknown, integer) does not exist LINE 1: SELECT
AddGeometryColumn('public','Test',NULL,0,'MULTIPOLYGO...
HINT: No function matches the given name and argument types. You might need to add explicit type casts."
So again any help appreciated.
So i have come up with an answer, that will work in theory.
First make the desired geographical selection and make a new layer with the selection
Then export the layer to the postgis database, with which you are connected
Now it is possible to make queries in postgresql - and PGadmin.
Note that this does not keep the workflow in qgis - and for further processing of statistics etc. one will have to work on the integration between the new postgis layer and selection within this - and it doesn't quite solve the geographical/mapbased selection approach - although it will work

SQL: What is a value?

The Question
One thing that I am confused about is the technical definition of possibly the most basic component of a database: a single value.
Some Examples
I understand and follow (at a minimum) the first three normal forms of database normalization - or so I think. That said, with the introduction of RANGE in PostgreSQL 9.2 I started thinking about what makes a single value.
From the docs:
Range types are useful because they represent many element values in a single range value
So, what are you? Several values, or a single value... nothingness... 42?
Why does this matter?
Because is speaks directly to the Second Normal Form:
Create separate tables for sets of values that apply to multiple records.
Relate these tables with a foreign key.
#1 Ranges
For example, in Postgres 9.1 I had some tables structured like this:
"SomeSchema"."StatusType"
"StatusTypeID" | "StatusType"
--------------------|----------------
1 | Start
2 | Stop
"SomeSchema"."Statuses"
"StatusID" | "Identifier" | "StatusType" | "Value" | "Timestamp"
---------------|----------------|----------------|---------|---------------------
1 | 1 | 1 | 0 | 2000-01-01 00:00:00
2 | 1 | 2 | 5 | 2000-01-02 12:00:00
3 | 2 | 1 | 1 | 2000-01-01 00:00:00
4 | 3 | 1 | 2 | 2000-01-01 00:00:00
5 | 2 | 2 | 7 | 2000-01-01 18:30:00
6 | 1 | 2 | 3 | 2000-01-02 12:00:00
This enabled me to keep an historical record of how things were configured at any given point in time.
This structure takes the position that the data in the "Value" column were all separate values.
Now, in Postgres 9.2 if I do the same thing with a RANGE value it would look like this:
"SomeSchema"."Statuses"
"StatusID" | "Identifier" | "Value" | "Timestamp"
---------------|----------------|-------------|---------------------
1 | 1 | (0, NULL) | 2000-01-01 00:00:00
2 | 1 | (0, 5) | 2000-01-02 12:00:00
3 | 2 | (1, NULL) | 2000-01-01 00:00:00
4 | 3 | (2, NULL) | 2000-01-01 00:00:00
5 | 2 | (1, 7) | 2000-01-01 18:30:00
6 | 1 | (0, 3) | 2000-01-02 12:00:00
Again, this structure would enable me to keep an historical record of how things were configured, but I would be storing the same value several times in separate places. It makes updating (technically inserting a new record) more tricky because I have to make sure the data rolls over from the original record.
#2 Arrays
Arrays have been around for a long time, and while they can be abused, I tend to use them for things like color codes. For example, my project stores information and at times needs to know how to display it. I could create three columns to store red, green, and blue values; but that just seems silly. When would I ever create a foreign key (or even just filter) based on one of the given color codes.
When I created the field it was from the perspective that I needed to store a color in a neutral format so that I could feed anything that accepts a color value. I made the column an array and filled it with the appropriate codes to make the color I want.
#3 PostGIS: Geometry & Geography
When storing a polygon in PostGIS, it stores all the points that make the boundary in a single field. If one point were to change and I wanted to keep an historical record, I would have to store all of the points that have not changed twice in order to store the new polygon along with the old.
So, what is a value? and... if RANGE, ARRAY, and GEOGRAPHY are values do they really break the second normal form?
The fact that some operation can derive new values from X that appear to be components of X's value doesn't mean X itself isn't "single valued". Thus "range" values and "geography" values should be single values as far as the DBMSs type system is concerned. I don't know enough about Postgresql's implementation to know whether "arrays" can be considered as single values in themselves. SQL DBMSs like Postgresql are not truly relational DBMSs and SQL supports various structures that certainly aren't proper relation variables, values or types (pointers, nulls and other exotica).
This is a difficult and sometimes controversial topic however. If you haven't read it then I recommend the book Databases, Types, and the Relational Model - The Third Manifesto by Date and Darwen. It addresses exactly the kind of questions you are asking about.
I don't like your description of 2NF but it's not very relevant here.

Creating a flattened table/view of a hierarchically-defined set of data

I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy.
I really like the way the data is structured, but performance is dismal when I need to know if a record at level 8 is a child of a record at level 1.
I have PL/SQL stored functions which do these lookups for me, each having a select * from tbl start with ... connect by... statement. This works fine when I'm querying a handful of records, but I'm in a situation now where I need to query ~10k records at once and for each of them run this function. It's taking 2-3 minutes where I need it to run in just a few seconds.
Using some heuristics based on my knowledge of the current data, I can get rid of the lookup function and just do childrecord.key || '%' LIKE parentrecord.key but that's a really dirty hack and will not always work.
So now I'm thinking that for this hierarchically-defined table I need to have a separate parent-child table, which will contain every relationship...for a hierarchy going from level 1-8 there would be 8! records, associating 1 with 2, 1 with 3,...,1 with 8 and 2 with 3, 2 with 4,...,2 with 8. And so forth.
My thought is that I would need to have an insert trigger where it will basically run the connect by query and for every match going up the hierarchy it will insert a record in the lookup table. And to deal with old data I'll just set up foreign keys to the main table with cascading deletes.
Are there better options than this? Am I missing another way that I could determine these distant ancestor/descendant relationships more quickly?
EDIT: This appears to be exactly what I'm thinking about: http://evolt.org/working_with_hierarchical_data_in_sql_using_ancestor_tables
So what you want is to materialize the transitive closures. That is, given this application table ...
ID | PARENT_ID
------+----------
1 |
2 | 1
3 | 2
4 | 2
5 | 4
... the graph table would look like this:
PARENT_ID | CHILD_ID
-----------+----------
1 | 2
1 | 3
1 | 4
1 | 5
2 | 3
2 | 4
2 | 5
4 | 5
It is possible to maintain a table like this in Oracle, although you will need to roll your own framework for it. The question is whether it is worth the overhead. If the source table is volatile then keeping the graph data fresh may cost more cycles than you will save on the queries. Only you know your data's profile.
I don't think you can maintain such a graph table with CONNECT BY queries and cascading foreign keys. Too much indirect activity, too hard to get right. Also a materialized view is out, because we cannot write a SQL query which will zap the 1->5 record when we delete the source record for ID=4.
So what I suggest you read a paper called Maintaining Transitive Closure of Graphs in SQL by Dong, Libkin, Su and Wong. This contains a lot of theory and some gnarly (Oracle) SQL but it will give you the grounding to build the PL/SQL you need to maintain a graph table.
"can you expand on the part about it
being too difficult to maintain with
CONNECT BY/cascading FKs? If I control
access to the table and all
inserts/updates/deletes take place via
stored procedures, what kinds of
scenarios are there where this would
break down?"
Consider the record 1->5 which is a short-circuit of 1->2->4->5. Now what happens if, as I said before, we delete the the source record for ID=4? Cascading foreign keys could delete the entries for 2->4 and 4->5. But that leaves 1->5 (and indeed 2->5) in the graph table although they no longer represent a valid edge in the graph.
What might work (I think, I haven't done it) would be to use an additional synthetic key in the source table, like this.
ID | PARENT_ID | NEW_KEY
------+-----------+---------
1 | | AAA
2 | 1 | BBB
3 | 2 | CCC
4 | 2 | DDD
5 | 4 | EEE
Now the graph table would look like this:
PARENT_ID | CHILD_ID | NEW_KEY
-----------+----------+---------
1 | 2 | BBB
1 | 3 | CCC
1 | 4 | DDD
1 | 5 | DDD
2 | 3 | CCC
2 | 4 | DDD
2 | 5 | DDD
4 | 5 | DDD
So the graph table has a foreign key referencing the relationship in the source table which generated it, rather than linking to the ID. Then deleting the record for ID=4 would cascade deletes of all records in the graph table where NEW_KEY=DDD.
This would work if any given ID can only have zero or one parent IDs. But it won't work if it is permissible for this to happen:
ID | PARENT_ID
------+----------
5 | 2
5 | 4
In other words the edge 1->5 represents both 1->2->4->5 and 1->2->5. So, what might work depends on the complexity of your data.

How to represent and insert into an ordered list in SQL?

I want to represent the list "hi", "hello", "goodbye", "good day", "howdy" (with that order), in a SQL table:
pk | i | val
------------
1 | 0 | hi
0 | 2 | hello
2 | 3 | goodbye
3 | 4 | good day
5 | 6 | howdy
'pk' is the primary key column. Disregard its values.
'i' is the "index" that defines that order of the values in the 'val' column. It is only used to establish the order and the values are otherwise unimportant.
The problem I'm having is with inserting values into the list while maintaining the order. For example, if I want to insert "hey" and I want it to appear between "hello" and "goodbye", then I have to shift the 'i' values of "goodbye" and "good day" (but preferably not "howdy") to make room for the new entry.
So, is there a standard SQL pattern to do the shift operation, but only shift the elements that are necessary? (Note that a simple "UPDATE table SET i=i+1 WHERE i>=3" doesn't work, because it violates the uniqueness constraint on 'i', and also it updates the "howdy" row unnecessarily.)
Or, is there a better way to represent the ordered list? I suppose you could make 'i' a floating point value and choose values between, but then you have to have a separate rebalancing operation when no such value exists.
Or, is there some standard algorithm for generating string values between arbitrary other strings, if I were to make 'i' a varchar?
Or should I just represent it as a linked list? I was avoiding that because I'd like to also be able to do a SELECT .. ORDER BY to get all the elements in order.
As i read your post, I kept thinking 'linked list'
and at the end, I still think that's the way to go.
If you are using Oracle, and the linked list is a separate table (or even the same table with a self referencing id - which i would avoid) then you can use a CONNECT BY query and the pseudo-column LEVEL to determine sort order.
You can easily achieve this by using a cascading trigger that updates any 'index' entry equal to the new one on the insert/update operation to the index value +1. This will cascade through all rows until the first gap stops the cascade - see the second example in this blog entry for a PostgreSQL implementation.
This approach should work independent of the RDBMS used, provided it offers support for triggers to fire before an update/insert. It basically does what you'd do if you implemented your desired behavior in code (increase all following index values until you encounter a gap), but in a simpler and more effective way.
Alternatively, if you can live with a restriction to SQL Server, check the hierarchyid type. While mainly geared at defining nested hierarchies, you can use it for flat ordering as well. It somewhat resembles your approach using floats, as it allows insertion between two positions by assigning fractional values, thus avoiding the need to update other entries.
If you don't use numbers, but Strings, you may have a table:
pk | i | val
------------
1 | a0 | hi
0 | a2 | hello
2 | a3 | goodbye
3 | b | good day
5 | b1 | howdy
You may insert a4 between a3 and b, a21 between a2 and a3, a1 between a0 and a2 and so on. You would need a clever function, to generate an i for new value v between p and n, and the index can become longer and longer, or you need a big rebalancing from time to time.
Another approach could be, to implement a (double-)linked-list in the table, where you don't save indexes, but links to previous and next, which would mean, that you normally have to update 1-2 elements:
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
2 | 0 | goodbye
3 | 2 | good day
5 | 3 | howdy
hey between hello & goodbye:
hey get's pk 6,
pk | prev | val
------------
1 | 0 | hi
0 | 1 | hello
6 | 0 | hi <- ins
2 | 6 | goodbye <- upd
3 | 2 | good day
5 | 3 | howdy
the previous element would be hello with pk=0, and goodbye, which linked to hello by now has to link to hey in future.
But I don't know, if it is possible to find a 'order by' mechanism for many db-implementations.
Since I had a similar problem, here is a very simple solution:
Make your i column floats, but insert integer values for the initial data:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
Then, if you want to insert something in between, just compute a float value in the middle between the two surrounding values:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
6 | 2.5 | hey
This way the number of inserts between the same two values is limited to the resolution of float values but for almost all cases that should be more than sufficient.