Implementing a sorting key in SQL - sql

We store relationships between documents in an Oracle database, using a table with a column named docid and a column named parentid. If I have a document, doc1, related to child documents child1_1 and child1_2, they would be represented by the following records in the Documents table.
docid  parentid
1000   null      (record for doc1)
1001   1000      (record for child1_1)
1002   1000      (record for child1_2)
The Documents table can have millions of rows, so to make sure all related documents are grouped together in our UI we pre-sort the Documents table using an indexed varchar column named sortedfamily, which we populate with the concatenation of the docids of the related documents. Without the sortedfamily column, sorting the records at query time is too slow. The records shown above become:
docid  parentid  sortedfamily
1000   null      1000          (record for doc1)
1001   1000      1000_1001     (record for child1_1)
1002   1000      1000_1002     (record for child1_2)
This allows us to add 'ORDER BY sortedfamily' to our queries, and the returned records will always be grouped by related documents. What I outlined above works pretty well, but it has some limitations related to a document family's hierarchical depth, and it feels weird concatenating integers to sort the records. Is there a way to do the above using only integers?
Thanks in advance.
UPDATE: My example above was not detailed enough. The children themselves may also have related documents. If child1_1 had a related document the resulting value for sortedfamily may be "1000_1001_2000".

Oracle has excellent support for hierarchical queries. You can get your document hierarchy without resorting to the sortedfamily column. Here's the query:
SELECT docid, PRIOR docid AS "Parent"
FROM Documents
START WITH parentid IS NULL
CONNECT BY parentid = PRIOR docid
ORDER SIBLINGS BY docid
Now to explain:
SELECT docid, PRIOR docid AS "Parent"
This gets the document and its parent on the same row by "looking back" with the PRIOR operator.
START WITH parentid IS NULL
This defines the hierarchy's root. Every row that has a null parentid is considered the root of a branch.
CONNECT BY parentid = PRIOR docid
This says how rows are linked: a row is a child of the previously visited (PRIOR) row when the child's parentid equals that parent's docid.
ORDER SIBLINGS BY docid
This sorts rows that share the same parent by docid while keeping every child directly under its parent, so the hierarchical order is preserved rather than being flattened the way a plain ORDER BY would.
The best thing about the Oracle hierarchical queries is that they'll query an entire branch, so if you have a document with a child that has a child (that has a child, and so on...) Oracle will handle it. It will also handle multiple children per parent.
There's a SQL Fiddle here with your data plus a few additional documents.
The Fiddle also includes a column that shows the entire "root to branch" relationship using the SYS_CONNECT_BY_PATH function. SYS_CONNECT_BY_PATH does the same thing as your sortedfamily column, but it does it dynamically, without the need to maintain the column. It's also a good way to visualize each branch of the hierarchy.
Addendum
Note that the query above will return every branch for every document. If you're just interested in a single document such as docid = 1000, replace the START WITH parentid IS NULL with this:
START WITH docid = 1000
That will give you the entire branch for docid 1000. If you have an index on docid it will be very fast.
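For readers without an Oracle instance at hand, the same idea can be tried out with a recursive CTE. The following is only a sketch under stated assumptions: it uses SQLite through Python's sqlite3 module as a stand-in for Oracle's CONNECT BY, the rows 2000 and 1500 are sample data added for illustration, and the path is built from docids zero-padded to 10 digits (an assumed width) so that sorting the text path matches numeric order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (docid INTEGER PRIMARY KEY, parentid INTEGER);
-- 1000/1001/1002 come from the question; 2000 and 1500 are illustrative
INSERT INTO Documents VALUES
    (1000, NULL), (1001, 1000), (1002, 1000), (2000, 1001), (1500, NULL);
""")

# Recursive CTE: start at the roots (parentid IS NULL), then repeatedly
# attach children, extending a zero-padded path that plays the role of
# sortedfamily / SYS_CONNECT_BY_PATH but is computed at query time.
rows = conn.execute("""
WITH RECURSIVE family(docid, parentid, path) AS (
    SELECT docid, parentid, printf('%010d', docid)
    FROM Documents
    WHERE parentid IS NULL
    UNION ALL
    SELECT d.docid, d.parentid, f.path || '_' || printf('%010d', d.docid)
    FROM Documents d
    JOIN family f ON d.parentid = f.docid
)
SELECT docid, parentid, path
FROM family
ORDER BY path
""").fetchall()

ordered_docids = [row[0] for row in rows]
for docid, parentid, path in rows:
    print(docid, parentid, path)
```

Each family comes back contiguous, with children listed directly under their parent. On Oracle itself, the CONNECT BY query above remains the more direct tool.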

Related

Postgres transaction table with Billion rows and multiple JSON columns

So we have a new project where we need to use Postgres 14 to scale up a transaction table that gets heavily updated. The master table has about 60 million rows over a six-month period and a child table has about 600 million rows. The data retention period is six months, after which we have to drop the oldest monthly partition.
I want opinions from Postgres Experts on whether this design is right and whether anything is overlooked:
Parent/Master table
ID
JSON 1---> A couple of hundred characters
JSON 2 ---> 50 characters
The table has about 20 columns. Updates are always based on the primary key.
Child Table
Parent_IDFK (Parent key from Parent or Master Table)
Occurance_id (Every parent has 10 rows in the Child table: 1, 2, 3, 4, 5, ...) These are the occurrences.
Occurance JSON. Each child linked to a parent has a specific JSON. Let's call it Occurance JSON. So Child 1 has Occurance 1 JSON, Child 2 has Occurance 2 JSON.
Over the period of a day, a row first gets inserted into the master. Then about 10 rows get inserted into the child. After the child records are inserted, we have to update the parent with the aggregate occurrence. The update of the aggregate JSON in the parent table will look something like this:
UPDATE PARENT SET AGGREGATE_JSON= (sum up the SUM of occurrences in the Child table for that parent key) WHERE ID=<>.
There will also be updates to the Child table based on primary key and occurrence id.
Other than that, there will be heavy reads. Here is my design:
1) Primary key on the Master ID. There may be no need to partition a sixty-million-row table. Because searches are based on dates, I will have another index on the startdate.
2) Child table. Primary key is Master ID, Occurance ID, StartDate. The table is partitioned based on start date.
3) Will try to compute aggregates as much as possible on a daily basis and read from the aggregates so full table scans are avoided.
4) When we update the child table, we always specify the partition. Something like this:
UPDATE CHILD SET <> WHERE PARENT_IDFK=<> AND OCCURANCE_ID=<> AND START_DATE(partition key)=<>.
That way full table scans are avoided.
5) All INSERTS/UPDATES will be via stored procedures, keeping the Python/Flask middleware as SQL code-free as possible.
Any other points you want to add to this or is this good enough?

SQL Best way to return data from one table along with mapped data from another table

I have the following problem.
I have a table Entries that contains 2 columns:
EntryID - unique identifier
Name - some name
I have another EntriesMapping table (many to many mapping table) that contains 2 columns :
EntryID that refers to the EntryID of the Entries table
PartID that refers to a PartID in a separate Parts table.
I need to write a SP that will return all data from the Entries table, but for each row in the Entries table I want to provide a list of all PartIDs that are registered in the EntriesMapping table.
My question is how best to approach the design of the solution, given that the results of the SP would regularly be processed by an app, so performance is quite important.
1. Do I write a SP that will select multiple rows per entry, so that if more than one PartID is registered for a given entry, I return multiple rows, each having the same EntryID and Name but a different PartID?
OR
2. Do I write a SP that will select 1 row per entry in the Entries table, and have a field that is a string/xml/json that contains all the different PartIDs?
OR
3. Is there some other solution that I am not thinking of?
Solution 1 seems to me to be the better way to go, but I will be passing lots of repeating data.
Solution 2 won't pass extra data, but the string/json/xml would need to be processed additionally, resulting in more CPU time per item.
PS: I feel like this is quite a common problem to solve, but I was unable to find any resource that can provide common solutions or some pros/cons to different approaches.
I think you need a simple JOIN:
SELECT e.EntryId, e.Name, em.PartId
FROM Entries e
JOIN EntriesMapping em ON e.EntryId = em.EntryId
This will return what you want, no need for stored procedure for that.
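To make the trade-off concrete, here is a minimal sketch of the JOIN approach (option 1 in the question) with the per-entry grouping done by the consuming app. It assumes SQLite via Python's sqlite3 module purely so the snippet is runnable; the sample rows and the parts_by_entry name are made up for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entries (EntryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE EntriesMapping (EntryID INTEGER, PartID INTEGER);
-- illustrative sample data
INSERT INTO Entries VALUES (1, 'alpha'), (2, 'beta');
INSERT INTO EntriesMapping VALUES (1, 10), (1, 11), (2, 20);
""")

# One row per (entry, part) pair: EntryID and Name repeat for each part.
rows = conn.execute("""
SELECT e.EntryID, e.Name, em.PartID
FROM Entries e
JOIN EntriesMapping em ON e.EntryID = em.EntryID
ORDER BY e.EntryID, em.PartID
""").fetchall()

# The app folds the repeated rows into one list of PartIDs per entry.
parts_by_entry = defaultdict(list)
for entry_id, name, part_id in rows:
    parts_by_entry[(entry_id, name)].append(part_id)

print(dict(parts_by_entry))
```

Note that an inner JOIN omits entries that have no mapping rows; if those must appear as well, a LEFT JOIN would keep them, with a NULL PartID.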

Displaying multiple columns on one row (SQL)

I have a report I am trying to make that displays parent information and all children in one household on ONE row.
There is no "parent" table that stores the information on parents and there is no ID that links parents to child and no ID that links sibling to sibling. The only way to tell if they are siblings is if they have the same address (logic being that if they have the same address, they live together, and are part of the same household). All the information is pulled from a "student" table or a custom field in the student table that stores the parent information, address they live at, etc.
Instead of displaying parent info twice I want to display
the information like this:
Parent_name, address, phone,child1_name, child1_schoolname, child1_age, child2_name, child2_schoolname, child2_age, etc(for every child in that household)
The problem is that not every household will have the same number of children, and I can only link siblings by their address.
How can I display all information for each household on ONE row? Is this possible, and how? I've tried a pivot table, but to no avail.
This is a classic 'you shouldn't be doing reports in the database' question. A database is for data retrieval, not data formatting. But let's assume you know this and need to do it anyway for some reason.
The algorithm I'd use for this would be
1. Create a windowed query across the data: partition by address (the joinable value) and sort by age descending.
2. Create a query that uses this window and returns the first item in each group.
3. Create additional queries that return the second, the third, the fourth item in each group, etc.
4. Outer join these together.
This is going to be far easier if you define some maximum number of siblings (five?) as opposed to dynamically building these siblings.
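The steps above can be sketched roughly as follows for a maximum of two children per household. This is an illustration under assumptions, not the asker's schema: the student table layout, its column names, and the sample rows are invented, and it runs on SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical layout: parent info is stored on each student row
CREATE TABLE student (
    name TEXT, age INTEGER, school TEXT, address TEXT, parent_name TEXT
);
INSERT INTO student VALUES
    ('Ann', 12, 'North', '1 Elm St',  'Pat'),
    ('Bob',  9, 'South', '1 Elm St',  'Pat'),
    ('Cy',   7, 'South', '9 Oak Ave', 'Lee');
""")

rows = conn.execute("""
WITH ranked AS (
    -- number the children within each address, oldest first
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY address ORDER BY age DESC) AS rn
    FROM student s
)
SELECT c1.parent_name, c1.address,
       c1.name AS child1_name, c1.age AS child1_age,
       c2.name AS child2_name, c2.age AS child2_age
FROM ranked c1
-- outer join so households with a single child still appear
LEFT JOIN ranked c2 ON c2.address = c1.address AND c2.rn = 2
WHERE c1.rn = 1
ORDER BY c1.address
""").fetchall()

for row in rows:
    print(row)
```

Supporting a third or fourth child means adding one more LEFT JOIN per slot, which is why fixing a maximum number of siblings keeps the query manageable.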
If the parents are in the same table, how do you know which items are parents and which are children?
If you have two tables, one for parents (PARENT) and one for children (CHILD), you can do something like this in your data model:
select Parent.NAME as parent_name,
       Parent.ADDRESS as parent_address,
       Parent.PHONE as phone,
       (
         select listagg(Child.NAME, ',')
                within group (order by Child.NAME)
         from CHILD Child
         where Child.ADDRESS = Parent.ADDRESS
       ) as children_names,
       (
         select listagg(Child.AGE, ',')
                within group (order by Child.NAME)
         from CHILD Child
         where Child.ADDRESS = Parent.ADDRESS
       ) as children_ages
from PARENT Parent
This returns one row per parent, with the children's names and ages as comma-separated lists.
LISTAGG is the solution here: it aggregates multiple rows into one.
However, LISTAGG is only available in Oracle Database 11g Release 2 and later, so if you have an older version this is not going to work.
Hope this helps.

Uniqueness in many-to-many

I couldn't figure out what terms to google, so help tagging this question or just pointing me in the way of a related question would be helpful.
I believe that I have a typical many-to-many relationship:
CREATE TABLE groups (
id integer PRIMARY KEY);
CREATE TABLE elements (
id integer PRIMARY KEY);
CREATE TABLE groups_elements (
groups_id integer REFERENCES groups,
elements_id integer REFERENCES elements,
PRIMARY KEY (groups_id, elements_id));
I want to have a constraint that there can only be one groups_id for a given set of elements_ids.
For example, the following is valid:
groups_id | elements_id
1 | 1
1 | 2
2 | 2
2 | 3
The following is not valid, because then groups 1 and 2 would be equivalent.
groups_id | elements_id
1 | 1
1 | 2
2 | 2
2 | 1
Not every subset of elements must have a group (this is not the power set), but new subsets may be formed. I suspect that my design is incorrect since I'm really talking about adding a group as a single entity.
How can I create identifiers for subsets of elements without risk of duplicating subsets?
That is an interesting problem.
One solution, albeit a clunky one, would be to store a concatenation of groups_id and elements_id in the groups table (for example 1-1-2) and put a unique index on it.
Searching for duplicate groups before inserting each new row would be an enormous performance hit.
The following query would spit out offending group ids:
with group_elements_arr as (
  select groups_id, array_agg(elements_id order by elements_id) elements
  from groups_elements
  group by groups_id
)
select elements, count(*), array_agg(groups_id) offending_groups
from group_elements_arr
group by elements
having count(*) > 1;
Depending on the size of groups_elements and its change rate, you might get away with putting something along these lines into a trigger watching groups_elements. If that's not fast enough, you can materialize group_elements_arr into a real table managed by triggers.
I also think the trigger should be FOR EACH STATEMENT and INITIALLY DEFERRED, so that a new group can be built up easily.
This link from user ypercube was most helpful: unique constraint on a set. In short, a bit of what everyone is saying is correct.
It's a question of tradeoffs, but here are the best options:
a) Add a hash or some other combination of element values to the groups table and make it unique, then populate the groups_elements table off of it using triggers. Pros of this method are that it preserves querying ability and enforces the constraint so long as you deny naked updates to groups_elements. Cons are that it adds complexity and you've now introduced logic like "how do you uniquely represent a set of elements" into your database.
b) Leave the tables as-is and control the access to groups_elements with your access layer, be it a stored procedure or otherwise. This has the advantage of preserving querying ability and keeps the database itself simple. However, it means that you are moving an analytic constraint into your access layer, which necessarily means that your access layer will need to be more complex. Another point is that it separates what the data should be from the data itself, which has both pros and cons. If you need faster access to whether or not a set already exists, you can attack that problem separately.
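As a rough illustration of option a) combined with the controlled access of option b), the sketch below stores a canonical key for each group's element set in the groups table and lets a UNIQUE constraint reject duplicates. Everything beyond the three original tables is an assumption made for the example: the element_key column, the canonical_key and create_group helpers, and the choice of a sorted comma-separated string instead of a hash. It uses SQLite via Python's sqlite3 module so that it can be run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE groups (
    id INTEGER PRIMARY KEY,
    element_key TEXT NOT NULL UNIQUE   -- hypothetical column: canonical set key
);
CREATE TABLE elements (id INTEGER PRIMARY KEY);
CREATE TABLE groups_elements (
    groups_id INTEGER REFERENCES groups,
    elements_id INTEGER REFERENCES elements,
    PRIMARY KEY (groups_id, elements_id)
);
INSERT INTO elements VALUES (1), (2), (3);
""")

def canonical_key(element_ids):
    """Same set of ids always yields the same string, whatever the order."""
    return ",".join(str(e) for e in sorted(set(element_ids)))

def create_group(conn, element_ids):
    """Insert a group and its members atomically; duplicate sets are rejected."""
    members = sorted(set(element_ids))
    with conn:  # commits on success, rolls back if the UNIQUE check fails
        cur = conn.execute(
            "INSERT INTO groups (element_key) VALUES (?)",
            (canonical_key(members),),
        )
        group_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO groups_elements VALUES (?, ?)",
            [(group_id, e) for e in members],
        )
    return group_id

g1 = create_group(conn, [1, 2])
g2 = create_group(conn, [2, 3])
try:
    create_group(conn, [2, 1])      # same set as g1, different order
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(g1, g2, duplicate_rejected)
```

The constraint only holds if groups_elements is never modified directly, which is the "deny naked updates" caveat mentioned in a).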

Can I select full hierarchy of parents when id and parent id are in the same table?

I have a table which has a column for Id and parentId. ParentId contains the Id of another row in the table. If the ParentId is null then it is the top of the hierarchy.
I have the Id of a row and I want to select all rows above it in the hierarchy. Can I do this in a single select?
so in this example:
Id | parentId | other columns
1 | null
2 | 1
3 | 2
if I have id=3 I want to select rows 1,2,3.
Can I do it in linq to sql?
You can do it in a single select using a recursive CTE, however LINQ to SQL doesn't support this so you will have to create a stored procedure with the query and call that from LINQ to SQL.
Take a look at this example, uses recursive CTE.
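A minimal sketch of such a recursive CTE, using the three rows from the question plus one extra sibling row added for illustration. The table name nodes and the ancestors helper are hypothetical, and SQLite (through Python's sqlite3 module) stands in for the asker's database; the WITH RECURSIVE keyword differs slightly between engines (SQL Server, for instance, uses plain WITH).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (Id INTEGER PRIMARY KEY, parentId INTEGER);
-- rows 1-3 are from the question; row 4 is an extra sibling for illustration
INSERT INTO nodes VALUES (1, NULL), (2, 1), (3, 2), (4, 1);
""")

def ancestors(conn, start_id):
    """Return start_id and every row above it, ordered from the root down."""
    rows = conn.execute("""
    WITH RECURSIVE chain(Id, parentId, depth) AS (
        -- anchor: the row we start from
        SELECT Id, parentId, 0 FROM nodes WHERE Id = ?
        UNION ALL
        -- recursive step: climb from each row to its parent
        SELECT n.Id, n.parentId, c.depth + 1
        FROM nodes n
        JOIN chain c ON n.Id = c.parentId
    )
    SELECT Id FROM chain ORDER BY depth DESC
    """, (start_id,)).fetchall()
    return [row[0] for row in rows]

print(ancestors(conn, 3))
```

The recursion stops on its own once it reaches a row whose parentId is null, because no row joins to a null parent.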
I don't know LINQ, but as other answerers have written, many relational databases support recursive Common Table Expressions (CTEs), though not all of them do (Oracle only added recursive WITH in 11g Release 2; earlier versions rely on CONNECT BY). Where it is supported, a CTE is a good approach to retrieving the "ancestry".
That noted, there are some other approaches to consider, in particular a bridge table or a nested set. See my question for some explanation of these options and other ways of representing hierarchical data. Briefly, a bridge table (most likely updated using a CTE from a trigger) will easily give you all ancestors or descendants, just not how close they are. A nested set model will give you this information, including how close they are, at the expense of comparatively more expensive inserts and updates.