So we have a new project where we need to use Postgres 14 to scale up a transaction table that gets heavily updated. The master table has about 60 million rows over a six-month period, and a child table has about 600 million rows. The data retention period is six months, after which we have to drop the oldest month's partition.
I want opinions from Postgres experts on whether this design is right and whether anything has been overlooked:
Parent/Master table
ID
JSON 1 ---> a couple of hundred characters
JSON 2 ---> 50 characters
The table has about 20 columns. Updates are always based on the primary key.
Child Table
Parent_IDFK (Parent key from Parent or Master Table)
Occurance_id (every parent has 10 rows in the child table, numbered 1, 2, 3, 4, 5, ...). These are occurrences.
Occurance JSON. Each child linked to a parent has a specific JSON; let's call it the Occurance JSON. So child 1 has Occurance 1 JSON, child 2 has Occurance 2 JSON.
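For concreteness, here is a rough sketch of the parent table as Postgres DDL; the column names and types are my assumptions, and the real table has about 20 columns:
-- Sketch only; names, types, and the identity column are assumptions
CREATE TABLE parent (
    id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    json1          jsonb,        -- a couple of hundred characters
    json2          jsonb,        -- about 50 characters
    aggregate_json jsonb,        -- aggregate of the child occurrences
    start_date     date NOT NULL
    -- ...roughly 15 more columns
);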
Over the period of a day, a row first gets inserted into the master. Then about 10 rows get inserted into the child. After the child records are inserted, we have to update the parent
with the aggregate of the occurrences. The aggregate JSON in the parent table will be maintained with something like this:
UPDATE PARENT SET AGGREGATE_JSON = (sum of the occurrences in the CHILD table for that parent key) WHERE ID = <>;
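A rough Postgres sketch of that update, assuming purely for illustration that each occurrence JSON carries a numeric "sum" field and that the columns are jsonb:
-- The field name 'sum' inside occurance_json is an assumption
UPDATE parent p
SET aggregate_json = jsonb_build_object(
        'occurrence_sum',
        (SELECT sum((c.occurance_json ->> 'sum')::numeric)
         FROM child c
         WHERE c.parent_idfk = p.id))
WHERE p.id = 42;   -- the parent key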
There will also be updates to the child table based on the primary key and occurrence ID.
Other than that, there will be heavy reads. Here is my design:
1) Primary key on the master ID. There may be no need to partition a sixty-million-row table. Because searches are based on dates, I will have another index on the start date.
2) Child table. The primary key is (Master ID, Occurance ID, StartDate). The table is partitioned by start date (see the sketch after this list).
3) We will compute aggregates on a daily basis as much as possible and read from those aggregates so that full table scans are avoided.
4) When we update the child table, we always specify the partition key. Something like this:
UPDATE CHILD SET <> WHERE PARENT_IDFK = <> AND OCCURANCE_ID = <> AND START_DATE (partition key) = <>;
That way full table scans are avoided.
5) All INSERTs/UPDATEs will be via stored procedures, keeping the Python/Flask middleware as free of SQL code as possible.
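To make (2) and (4) concrete, here is a minimal sketch of the partitioned child table for Postgres 14; the column types, partition names, and month boundaries are illustrative only:
-- Sketch only; adjust types and boundaries to the real schema
CREATE TABLE child (
    parent_idfk    bigint   NOT NULL REFERENCES parent (id),
    occurance_id   smallint NOT NULL,
    occurance_json jsonb,
    start_date     date     NOT NULL,
    PRIMARY KEY (parent_idfk, occurance_id, start_date)
) PARTITION BY RANGE (start_date);
-- One partition per month; retention is handled by dropping the oldest one
CREATE TABLE child_2024_01 PARTITION OF child
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- ...
DROP TABLE child_2023_08;   -- oldest month, past the six-month retention window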
Are there any other points you would add to this, or is this good enough?
Background info:
Table A has 100 rows in it representing inventory on a shelf. I want to make the boxCode column a primary key, but I can't because the empty rows all have a boxCode of empty string.
Table B has a variable number of rows based on the actual inventory. I want to make its boxCode column a foreign key referencing Table A.
Problem:
Currently I perform two SQL queries to carry out the above operations, but the action isn't atomic. I update Table A and then update Table B, but there is a small window of time during which the tables' information is out of sync. This is causing issues with our API. Is there a way in SQL to add or delete rows in Table B when I UPDATE Table A? I can guarantee I only ever UPDATE Table A, if that makes a difference.
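For illustration, the current two-statement flow looks roughly like this (the column names here are made up); the gap between the two statements is the window where the tables disagree:
-- Hypothetical columns; only the shape of the problem matters
UPDATE TableA SET boxCode = 'BOX123' WHERE shelfSlot = 7;
-- <-- between these two statements the API can see A and B out of sync
INSERT INTO TableB (boxCode, quantity) VALUES ('BOX123', 0);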
I do not know if my request has a name, but I think it could be called "duplicating a row together with all the rows below it in the table hierarchy".
I want to create a stored procedure that can duplicate a record and all of its child and grandchild records.
When I say "duplicate" I mean that all of the record's column values will be copied to a new record in the same table (the copy should exclude the primary key ID column because it is generated automatically).
This Stored Procedure needs to get two parameters:
Parameter 1 - Table Name
Parameter 2 - the value of the primary key ID (in that table) of the record that needs to be duplicated.
Instructions for the code:
The value (Parameter 2) will locate one record in the table (Parameter 1) and duplicate it (as I said, the same values for all the columns, excluding that table's primary key).
Then we will need to know all the tables that have a relationship with this primary key (of the parent table).
Then, in every child table, the records that reference the same ID value (Parameter 2) will be duplicated, pointing at the new ID value (of the new parent record).
Several things:
In the dream scenario the code can go down any number of levels (children, grandchildren, great-grandchildren, ...), but I guess that would be very complex and involve some recursion, so even code that handles up to 3-4 levels would be happily accepted.
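For a single, known parent/child pair I could hard-code the duplication with something like the sketch below (Postgres-flavored, with made-up table and column names); the generic version with a table-name parameter is what I am really asking about, and would presumably need dynamic SQL plus recursion:
-- Hypothetical schema: parent_tbl (id PK, col_a, col_b), child_tbl (id PK, parent_id FK, col_x)
WITH new_parent AS (
    INSERT INTO parent_tbl (col_a, col_b)        -- every column except the PK
    SELECT col_a, col_b
    FROM parent_tbl
    WHERE id = 42                                -- Parameter 2
    RETURNING id
)
INSERT INTO child_tbl (parent_id, col_x)         -- again, the child PK is omitted
SELECT np.id, c.col_x
FROM child_tbl c
CROSS JOIN new_parent np
WHERE c.parent_id = 42;                          -- Parameter 2
-- Grandchild tables would repeat the pattern, which also requires mapping
-- each old child id to its new id (e.g. in a loop inside the procedure).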
I have a parent table which has just 10 rows,
but the corresponding child table has 100K records; for one ID in the parent table we have 10K records in the child table.
When I fire a delete command on the parent table it also deletes the records from the child table, but it takes around 5 minutes to delete all 10K records.
So my question is: what is the best practice for deleting records from the child table when we have cascading deletes on the table?
10K records is just an example; for some IDs we have millions of records to delete.
Assuming SQL Server could make use of an index while deleting, placing an index on the foreign key column in the child table might speed up the deletion. For example:
parent (id, col1, col2, ...)
child (id, parent_id, ...)
CREATE INDEX ix_child_parent_id ON child (parent_id);
Such an index might let SQL Server quickly look up each child record, given a parent record.
This is too long for a comment.
It takes time to delete thousands of rows in a table. The deletion has to:
Find the rows to delete.
Change the data on the data pages.
Log the changes.
Modify indexes.
Execute triggers (if any).
Delete related rows in other tables (if any).
This can be quite expensive.
Given the low volume of parent IDs, I think you can speed this up by partitioning the child table by the parent ID. A clustered index on the parent ID might also help -- but that might affect insert performance.
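For example, something like this in SQL Server (the names are illustrative, and the existing clustered primary key, if any, would first have to be rebuilt as nonclustered):
-- Only one clustered index is allowed per table
CREATE CLUSTERED INDEX ix_child_parent_id ON child (parent_id);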
With the H2 database, suppose there is a SUMS table that has a key and several count fields and there is an UPDATES table which has the same key and count fields. The keys in the UPDATES table may or may not exist in the SUMS table.
What is the most efficient way to add all the counts for each key from the UPDATES table to the SUMS table, or insert a row with those counts if the SUMS table does not yet have that key?
Of course I could always process the result set of a select on the UPDATES table and then one-by-one update or insert into the SUMS table, but this feels like there should be a more efficient way to do it.
If it is not possible in H2 but is possible in some other Java-embeddable solution, I would be interested in that too, because this processing is just an intermediate step in handling a much larger set of these counts (a couple of dozen million keys and a couple of billion rows for updating them).
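One candidate I am considering, if the H2 version in use supports the standard MERGE ... USING syntax, is a single set-based statement along these lines (one key column and one count column, names made up):
-- Sketch only; the real tables have several count columns
MERGE INTO SUMS s
USING UPDATES u
    ON s.key_col = u.key_col
WHEN MATCHED THEN
    UPDATE SET count1 = s.count1 + u.count1
WHEN NOT MATCHED THEN
    INSERT (key_col, count1) VALUES (u.key_col, u.count1);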
I have a product and product_detail table pair that I need to copy data from, updating the primary key. Essentially what I'm trying to do is copy last night's data from both tables, give the rows new primary keys so they don't clash with the current data, and insert them back into the tables with some of the information updated: the PK/FK, update_date, and two fields flagged as something different.
I can't make changes to the tables, so I can't use ON UPDATE CASCADE. We have a file that does an end-of-day batch and inserts the data into the tables. We also have a file that is updated every time a transaction happens during the day, so what I thought I could do is copy last night's data, update the keys so they don't clash, and then update that new set of data with whatever comes in from the file during the day. The way it is now, our users have to wait until the end of the day to see where we are; with updates during the day, they can see the balance as the day progresses.
I believe I will have to grab the information out of the main table, get a new PK, update the other info, pass that new PK to the second table to replace the original FK it has, and do this row by row.
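In Postgres-flavored SQL (the actual database isn't named above, and the names here are made up), that row-by-row step might look roughly like this for one copied row:
-- Hypothetical names; product_id_seq is an assumed sequence supplying the new key
WITH copied AS (
    INSERT INTO product (product_id, update_date, flag1, flag2)   -- plus the other columns, copied as-is
    SELECT nextval('product_id_seq'), current_date, 'N', 'N'
    FROM product
    WHERE product_id = 42                  -- one of last night's rows
    RETURNING product_id
)
INSERT INTO product_detail (product_id, detail_data)              -- plus the other columns, copied as-is
SELECT c.product_id, d.detail_data
FROM product_detail d
CROSS JOIN copied c
WHERE d.product_id = 42;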
Am I heading in the right direction?