I have a SQLite table, below is its structure
CREATE TABLE file(
    "ID" STRING PRIMARY KEY,
    "filename" STRING NOT NULL,
    "parent" STRING,
    "is_folder" BOOLEAN NOT NULL,
    FOREIGN KEY (parent) REFERENCES file(ID)
)
I need to write a recursive query in SQLite that returns the list of names making up the path to a file. Here is an example implementation using simple queries in Python:
async def get_file_path(self, ID: str):
    # bad example: one query per level of nesting
    cursor = await self.db.execute(
        'SELECT * FROM file WHERE ID = ?',
        (ID, )
    )
    file = await cursor.fetchone()
    names = [file[1]]
    while file[2]:  # while file has a parent
        cursor = await self.db.execute(
            'SELECT * FROM file WHERE ID = ?',
            (file[2], )
        )
        file = await cursor.fetchone()
        names.append(file[1])
    return '/'.join(names[::-1])
This works, but if the file is deep in the file tree it has to make many consecutive queries to the database, which is slow. I found that SQLite supports the WITH RECURSIVE construction, but I can't figure out how to compose my query correctly.
There are many places in my project where recursive queries would apply, so I want to understand how to write them.
[UPDATE] Data sample:
INSERT INTO file (ID, filename, parent, is_folder) VALUES
('ID1', 'folder_one', NULL, true),
('ID2', 'folder_in_folder_one', 'ID1', true),
('ID3', 'another_folder_in_first', 'ID1', true),
('ID4', 'deep_file', 'ID2', false);
The output for "deep_file" (which has "ID4") must be:
["deep_file", "folder_in_folder_one", "folder_one"]
Then I can build the filepath of "deep_file". In this example the file with "ID3" is ignored because it is not a parent of "ID4".
Use a recursive CTE and the aggregate function json_group_array():
WITH cte AS (
SELECT * FROM file WHERE ID = ?
UNION ALL
SELECT f.*
FROM file f INNER JOIN cte c
ON c.parent = f.ID
)
SELECT json_group_array(filename) AS json_names FROM cte;
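A quick way to check this query is to run it from Python with the stdlib sqlite3 module against the sample data from the question. This is only a sketch; it assumes the JSON1 functions (json_group_array) are compiled into your SQLite build, which they are in most distributions:

```python
import json
import sqlite3

# Build the sample table from the question in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file(
        "ID" TEXT PRIMARY KEY,
        "filename" TEXT NOT NULL,
        "parent" TEXT,
        "is_folder" BOOLEAN NOT NULL,
        FOREIGN KEY (parent) REFERENCES file(ID)
    );
    INSERT INTO file (ID, filename, parent, is_folder) VALUES
        ('ID1', 'folder_one', NULL, 1),
        ('ID2', 'folder_in_folder_one', 'ID1', 1),
        ('ID3', 'another_folder_in_first', 'ID1', 1),
        ('ID4', 'deep_file', 'ID2', 0);
""")

# One round trip instead of one query per tree level.
row = conn.execute("""
    WITH RECURSIVE cte AS (
        SELECT * FROM file WHERE ID = ?
        UNION ALL
        SELECT f.* FROM file f JOIN cte c ON c.parent = f.ID
    )
    SELECT json_group_array(filename) FROM cte
""", ("ID4",)).fetchone()

names = json.loads(row[0])
print("/".join(reversed(names)))
```

In practice this prints the path leaf-first before the reversal, but as noted below, the aggregation order is not guaranteed by the documentation.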
Although the above query aggregated the filenames in the expected order in every test I did, this ordering is not guaranteed and is not mentioned in the documentation.
To be on the safe side, here is another way to do it, which concatenates the filenames one by one in the expected order:
WITH cte AS (
SELECT *, json_array(filename) AS json_names FROM file WHERE ID = 'ID4'
UNION ALL
SELECT f.*, json_insert(json_names, '$[#]', f.filename)
FROM file f INNER JOIN cte c
ON c.parent = f.ID
)
SELECT json_names FROM cte ORDER BY LENGTH(json_names) DESC LIMIT 1;
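The order-safe variant can be checked the same way from Python's sqlite3. Note that appending with the '$[#]' path in json_insert requires SQLite 3.31 or later; this sketch assumes that:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file("ID" TEXT PRIMARY KEY, "filename" TEXT NOT NULL,
                      "parent" TEXT, "is_folder" BOOLEAN NOT NULL);
    INSERT INTO file VALUES
        ('ID1', 'folder_one', NULL, 1),
        ('ID2', 'folder_in_folder_one', 'ID1', 1),
        ('ID3', 'another_folder_in_first', 'ID1', 1),
        ('ID4', 'deep_file', 'ID2', 0);
""")

# Each recursion step appends the parent's name to the JSON array, so the
# longest array is the complete leaf-to-root path, in a guaranteed order.
row = conn.execute("""
    WITH RECURSIVE cte AS (
        SELECT *, json_array(filename) AS json_names FROM file WHERE ID = ?
        UNION ALL
        SELECT f.*, json_insert(c.json_names, '$[#]', f.filename)
        FROM file f JOIN cte c ON c.parent = f.ID
    )
    SELECT json_names FROM cte ORDER BY LENGTH(json_names) DESC LIMIT 1
""", ("ID4",)).fetchone()

names = json.loads(row[0])
print(names)  # ['deep_file', 'folder_in_folder_one', 'folder_one']
```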
Background
I'm running Postgres 11 on CentOS 7.
I recently learned the basics of recursive CTEs in Postgres thanks to S-Man's answer to my recent question.
The problem
While working on a closely related issue (counting parts sold within bundles and assemblies) and using this recursive CTE, I ran into a problem where the query looped indefinitely and never completed.
I tracked this down to the presence of non-spurious 'self-referential' entries in the relator table, i.e. rows with the same value for parent_name and child_name.
I know these rows are the source of the problem because when I recreated the situation with test tables and data, the undesired looping occurred when they were present and disappeared when they were absent, or when UNION (which excludes duplicate returned rows) was used in the CTE rather than UNION ALL.
I think the data model itself probably needs adjusting so that these 'self-referential' rows aren't necessary, but for now, what I need to do is get this query to return the desired data on completion and stop looping.
How can I achieve this result? All guidance much appreciated!
Tables and test data
CREATE TABLE the_schema.names_categories (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
thing_name TEXT NOT NULL,
thing_category TEXT NOT NULL
);
CREATE TABLE the_schema.relator (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
parent_name TEXT NOT NULL,
child_name TEXT NOT NULL,
child_quantity INTEGER NOT NULL
);
/* NOTE: listing_name below is like an alias of a relator.parent_name as it appears in a catalog,
required to know because it is these listing_names that are reflected by sales.sold_name */
CREATE TABLE the_schema.catalog_listings (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
listing_name TEXT NOT NULL,
parent_name TEXT NOT NULL
);
CREATE TABLE the_schema.sales (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
sold_name TEXT NOT NULL,
sold_quantity INTEGER NOT NULL
);
CREATE VIEW the_schema.relationships_with_child_category AS (
SELECT
c.listing_name,
r.parent_name,
r.child_name,
r.child_quantity,
n.thing_category AS child_category
FROM
the_schema.catalog_listings c
INNER JOIN
the_schema.relator r
ON c.parent_name = r.parent_name
INNER JOIN
the_schema.names_categories n
ON r.child_name = n.thing_name
);
INSERT INTO the_schema.names_categories (thing_name, thing_category)
VALUES ('parent1', 'bundle'), ('child1', 'assembly'), ('child2', 'assembly'),('subChild1', 'component'),
('subChild2', 'component'), ('subChild3', 'component');
INSERT INTO the_schema.catalog_listings (listing_name, parent_name)
VALUES ('listing1', 'parent1'), ('parent1', 'child1'), ('parent1','child2'), ('child1', 'child1'), ('child2', 'child2');
INSERT INTO the_schema.catalog_listings (listing_name, parent_name)
VALUES ('parent1', 'child1'), ('parent1','child2');
/* note the two 'self-referential' entries */
INSERT INTO the_schema.relator (parent_name, child_name, child_quantity)
VALUES ('parent1', 'child1', 1), ('child1', 'subChild1', 1), ('child1', 'subChild2', 1),
('parent1', 'child2', 1), ('child2', 'subChild1', 1), ('child2', 'subChild3', 1), ('child1', 'child1', 1), ('child2', 'child2', 1);
INSERT INTO the_schema.sales (sold_name, sold_quantity)
VALUES ('parent1', 1), ('parent1', 2), ('listing1', 1);
The present query loops indefinitely with the required UNION ALL:
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
s.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category as category
FROM
the_schema.sales s
JOIN the_schema.relationships_with_child_category r
ON s.sold_name = r.listing_name
UNION ALL
SELECT
cte.sold_name,
cte.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category
FROM cte
JOIN the_schema.relationships_with_child_category r
ON cte.child_name = r.parent_name
)
SELECT
child_name,
SUM(sold_quantity * child_quantity)
FROM cte
WHERE category = 'component'
GROUP BY child_name
;
In the catalog_listings table, listing_name and parent_name are the same for child1 and child2.
In the relator table, parent_name and child_name are likewise the same for child1 and child2.
These rows create the cycling recursion.
Just remove those two rows from both tables:
delete from catalog_listings where id in (4,5)
delete from relator where id in (7,8)
Then your desired output will be as below:
child_name | sum
-----------+----
subChild2  |  8
subChild3  |  8
subChild1  | 16
Is this the result you are looking for?
If you can't delete the rows, you can add a parent_name <> child_name condition to the joins to exclude them:
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
s.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category as category
FROM
the_schema.sales s
JOIN the_schema.relationships_with_child_category r
ON s.sold_name = r.listing_name and r.parent_name <>r.child_name
UNION ALL
SELECT
cte.sold_name,
cte.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category
FROM cte
JOIN the_schema.relationships_with_child_category r
ON cte.child_name = r.parent_name and r.parent_name <>r.child_name
)
SELECT
child_name,
SUM(sold_quantity * child_quantity)
FROM cte
WHERE category = 'component'
GROUP BY child_name ;
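To see why the extra predicate terminates the recursion, here is the same idea shrunk down to a toy table and run through SQLite from Python (SQLite stands in for Postgres here; the recursive CTE semantics are the same for this case, and the table is a made-up miniature of relator):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relator(parent_name TEXT, child_name TEXT, child_quantity INTEGER);
    -- the last row is the problematic self-referential entry
    INSERT INTO relator VALUES
        ('parent1', 'child1', 1),
        ('child1', 'subChild1', 1),
        ('child1', 'child1', 1);
""")

# Without the <> filter the self-referential row re-joins itself forever
# under UNION ALL; with it, the traversal reaches the leaves and stops.
rows = conn.execute("""
    WITH RECURSIVE cte AS (
        SELECT parent_name, child_name FROM relator WHERE parent_name = 'parent1'
        UNION ALL
        SELECT r.parent_name, r.child_name
        FROM cte JOIN relator r
          ON cte.child_name = r.parent_name
         AND r.parent_name <> r.child_name   -- breaks the cycle
    )
    SELECT child_name FROM cte
""").fetchall()
print([r[0] for r in rows])
```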
You may be able to avoid infinite recursion simply by using UNION instead of UNION ALL.
The documentation describes the implementation:
Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
So long as the working table is not empty, repeat these steps:
Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
"Getting rid of the duplicates" should cause the intermediate table to be empty at some point, which ends the iteration.
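The dedup behaviour is easy to observe on a toy cyclic graph. The sketch below uses SQLite from Python; its UNION handling in recursive CTEs matches the description quoted above, so it serves as a stand-in for Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges(parent TEXT, child TEXT);
    INSERT INTO edges VALUES ('a', 'b'), ('b', 'c'), ('c', 'a');  -- a cycle
""")

rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION    -- UNION (not UNION ALL) discards rows already produced,
                 -- so the working table empties once every node is seen
        SELECT e.child FROM edges e JOIN reach ON reach.node = e.parent
    )
    SELECT node FROM reach
""").fetchall()
print([r[0] for r in rows])
```

With UNION ALL the same query would loop around the a→b→c→a cycle indefinitely; UNION terminates after visiting each node once.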
DROP TABLE IF EXISTS t;
CREATE TABLE t(
mypath varchar(100),
parent_path varchar(100)
);
INSERT INTO t VALUES ('a', NULL),('a/b', 'a'),('a/b/c', 'a/b');
-- Listing all parent paths
1) Using LIKE, not making use of the parent_path column:
SELECT a.mypath, b.mypath AS parent_path
FROM t a
JOIN t b ON a.mypath LIKE b.mypath + '%' AND a.mypath != b.mypath
2) Using a recursive CTE, making use of the parent_path column:
WITH cte AS (
SELECT mypath, parent_path
FROM t
UNION ALL
SELECT a.mypath, b.parent_path
FROM cte a
JOIN t b ON a.parent_path = b.mypath
)
SELECT * FROM cte WHERE parent_path IS NOT NULL;
On large datasets, what are the pros and cons of each method, performance-wise?
Am I right to think the recursive CTE method should be faster?
Should LIKE be able to make use of an index, since there is only a trailing wildcard?
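As a correctness check (not a benchmark), both queries can be run against the sample rows with Python's sqlite3. Note that SQLite concatenates with || rather than the + used above, which is SQL Server syntax; also, LIKE 'a%' would match a sibling such as 'ab', though the sample data has none:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t(mypath varchar(100), parent_path varchar(100));
    INSERT INTO t VALUES ('a', NULL), ('a/b', 'a'), ('a/b/c', 'a/b');
""")

# 1) LIKE with a trailing wildcard.
like_rows = conn.execute("""
    SELECT a.mypath, b.mypath AS parent_path
    FROM t a JOIN t b
      ON a.mypath LIKE b.mypath || '%' AND a.mypath != b.mypath
""").fetchall()

# 2) Recursive CTE walking the parent_path column.
cte_rows = conn.execute("""
    WITH RECURSIVE cte AS (
        SELECT mypath, parent_path FROM t
        UNION ALL
        SELECT a.mypath, b.parent_path
        FROM cte a JOIN t b ON a.parent_path = b.mypath
    )
    SELECT * FROM cte WHERE parent_path IS NOT NULL
""").fetchall()

print(sorted(like_rows) == sorted(cte_rows))  # True
```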
I am supporting a poorly made webpage that is saving all of the data on screen in one SQL column as a long JSON string. Like so:
{"companies":[{"__type":"MyReplacementCompany:#LifeEApplication.GAINWeb.Utilities.Classes","data":[]}],"data":[{"Key":"existing_insurance","Value":"false"},{"Key":"replace_existing","Value":"false"}]}
It is gross, and I am in the middle of redoing this page properly. However, I need to pull all of the current records into my redesigned database, and I am having trouble pulling the values out of this column.
Using parseJSON, I am able to get the data out of this field and into this:
I am struggling with getting the data from this approach over to a table like so:
Any suggestions?
Something like the following may be what you need. This specifically creates a table with the output columns, assigns a unique row identifier (rownum) to each input row, and then uses that identifier to join together the various attributes that are in separate rows of the JSON parser output.
create table compdata (
company_name varchar(100),
insured_name varchar(100),
policy_number varchar(100),
face_amount money
);
with numbered as (
    select
        row_number() over (order by (select null)) as rownum,
        your_json_data_column as json_data
    from
        your_input_table
),
parsed_json as (
    -- every parsed attribute row keeps the rownum of the input row it came from
    select n.rownum, p.parent_ID, p.StringValue
    from numbered n
    cross apply parseJSON(n.json_data) p
)
insert into compdata
    (company_name, insured_name, policy_number, face_amount)
select
    company_name.StringValue,
    insured_name.StringValue,
    policy_number.StringValue,
    face_amount.StringValue
from
    (select rownum, StringValue from parsed_json where parent_ID = 1) as company_name
    inner join (select rownum, StringValue from parsed_json where parent_ID = 2) as insured_name
        on insured_name.rownum = company_name.rownum
    inner join (select rownum, StringValue from parsed_json where parent_ID = 3) as policy_number
        on policy_number.rownum = company_name.rownum
    inner join (select rownum, StringValue from parsed_json where parent_ID = 4) as face_amount
        on face_amount.rownum = company_name.rownum
;
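Setting the parseJSON details aside, the underlying pivot is simple to express in plain Python. This sketch uses made-up rows standing in for your_json_data_column (the real documents presumably carry company/insured/policy data in the same Key/Value shape); it parses each row's JSON string and flattens the Key/Value pairs into one record per source row:

```python
import json

# Hypothetical rows as they might sit in the JSON column.
raw_rows = [
    '{"data":[{"Key":"existing_insurance","Value":"false"},'
    '{"Key":"replace_existing","Value":"false"}]}',
]

records = []
for rownum, raw in enumerate(raw_rows, start=1):
    doc = json.loads(raw)
    # Pivot the Key/Value list into one flat dict per source row,
    # tagged with a row identifier like the rownum in the SQL above.
    record = {"rownum": rownum}
    record.update({kv["Key"]: kv["Value"] for kv in doc["data"]})
    records.append(record)

print(records)
```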
I'm loading some quite nasty data through Azure Data Factory.
This is how the data looks after being loaded, consisting of 2 parts:
1. Metadata of a test
2. Actual measurements of the test -> the measurement is numeric
Imagine I have about 10 such 'packages' of 1. Metadata + 2. Measurements
What I would like it to be / what I'm looking for is the following:
The number column with 1,2,.... is what I'm looking for!
Imagine my screenshot could go no further, but this continues until id=10.
I guess a while loop is necessary here...
Query before:
SELECT Field1 FROM Input
Query after:
SELECT GeneratedId, Field1 FROM Input
Thanks a lot in advance!
EDIT: added a hint:
Here is a solution; it requires SQL Server 2012 or later.
Start by getting an Id column on your data. If you can do this upstream of this script, even better; if not, try something like this...
CREATE TABLE #InputTable (
Id INT IDENTITY(1, 1),
TestData NVARCHAR(MAX) )
INSERT INTO #InputTable (TestData)
SELECT Field1 FROM Input
Now create a query to get the GeneratedId of each package as well as the Id where they start and end. You can do this by getting all the records LIKE 'title%' since that is the first record of each package, then using ROW_NUMBER, Id, and LEAD for the GeneratedId, StartId, and EndId respectively.
SELECT
GeneratedId = ROW_NUMBER() OVER(ORDER BY (Id)),
StartId = Id,
EndId = LEAD(Id) OVER (ORDER BY (Id))
FROM #InputTable
WHERE TestData LIKE 'title%'
Lastly, join this to the input in order to get all the records, with the correct GeneratedId.
SELECT
package.GeneratedId, i.TestData
FROM (
SELECT
GeneratedId = ROW_NUMBER() OVER(ORDER BY (Id)),
StartId = Id,
EndId = LEAD(Id) OVER (ORDER BY (Id))
FROM #InputTable
WHERE TestData LIKE 'title%' ) package
INNER JOIN #InputTable i
ON i.Id >= package.StartId
AND (package.EndId IS NULL OR i.Id < package.EndId)
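The ROW_NUMBER/LEAD pattern can be exercised end-to-end with SQLite (3.25+ for window functions) from Python. The rows below are made up, but follow the same shape: each package starts with a 'title...' line:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE InputTable(Id INTEGER PRIMARY KEY, TestData TEXT);
    INSERT INTO InputTable (TestData) VALUES
        ('title A'), ('meta 1'), ('42'),
        ('title B'), ('meta 2'), ('7');
""")

# Number the 'title%' rows, find where each package ends with LEAD, then
# join every input row back to the package whose [StartId, EndId) range
# contains its Id.
rows = conn.execute("""
    SELECT package.GeneratedId, i.TestData
    FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY Id) AS GeneratedId,
               Id AS StartId,
               LEAD(Id) OVER (ORDER BY Id) AS EndId
        FROM InputTable
        WHERE TestData LIKE 'title%'
    ) package
    JOIN InputTable i
      ON i.Id >= package.StartId
     AND (package.EndId IS NULL OR i.Id < package.EndId)
    ORDER BY i.Id
""").fetchall()
print(rows)  # GeneratedId 1 for the first three rows, 2 for the rest
```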
What is the equivalent Teradata syntax for the question about reverse aggregation inside of a common table expression found at Reverse aggregation inside of common table expression?
I am trying to hack together a Teradata version that iterates over a parent-child relationship table and builds JSON, placing the parent of a child who is parent to a child who is parent to a child (and so on) in the one JSON field.
Below is the answer given in the question linked above, which I think is written for PostgreSQL. I would really appreciate assistance in translating it to Teradata, as I think it should let me accomplish my task. If not, please set me straight.
I am not sure what row_to_json(c) is calling; should this be JSON_AGG(c.children)? I think the double colon (NULL::JSON) casts a NULL to the JSON data type. In any case, I have tried a few variations to no avail. Please help.
Here is the PostgreSQL syntax answer given:
WITH RECURSIVE cte AS (
SELECT id, parent_id, name, NULL::JSON AS children
FROM people p
WHERE NOT EXISTS ( -- only leaf nodes; see link below
SELECT 1 FROM people
WHERE parent_id = p.id
)
UNION ALL
SELECT p.id, p.parent_id, p.name, row_to_json(c) AS children
FROM cte c
JOIN people p ON p.id = c.parent_id
)
SELECT id, name, json_agg(children) AS children
FROM cte
GROUP BY 1, 2;
When translating the PostgreSQL to Teradata I encountered a restriction: JSON columns are not supported by set operations like UNION.
Casting JSON/VarChar back and forth is a workaround:
CREATE VOLATILE TABLE people (id INT, name VARCHAR(20), parent_id INT) ON COMMIT PRESERVE ROWS;
INSERT INTO people VALUES(1, 'Adam', NULL);
INSERT INTO people VALUES(2, 'Abel', 1);
INSERT INTO people VALUES(3, 'Cain', 1);
INSERT INTO people VALUES(4, 'Enoch', 3);
WITH RECURSIVE cte AS (
SELECT id, parent_id, name,
CAST(NULL AS VARCHAR(2000)) AS children
FROM people p
WHERE NOT EXISTS (
SELECT * FROM people
WHERE parent_id = p.id
)
UNION ALL
SELECT p.id, p.parent_id, p.name,
-- VarChar -> JSON -> VarChar
CAST(JSON_COMPOSE(c.id,
c.name,
NEW JSON(c.children) AS children) AS VARCHAR(10000)) AS children
FROM cte c
JOIN people p ON p.id = c.parent_id
)
SELECT id, name,
JSON_AGG(NEW JSON(children) AS children) AS children
FROM cte
GROUP BY 1, 2;
The result is similar, but not exactly the same: Teradata adds "children":, e.g.:
{"children":{"id":4,"name":"Enoch","children":null}} -- Teradata
[{"id":4,"name":"Enoch","children":null}] -- PostgreSQL
Finally adding JSONExtract to get the array only:
SELECT id, name,
JSON_AGG(NEW JSON(children) AS X).JSONExtract('$..X') AS children
FROM cte
GROUP BY 1, 2;
[{"id":4,"name":"Enoch","children":null}]
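Independent of dialect, what the recursion computes can be sketched in plain Python: walk up from a leaf, wrapping each ancestor's row around the JSON built so far. This mirrors the UNION ALL arm of the CTE (the final json_agg/JSON_AGG then groups the resulting chains per node):

```python
import json

# The sample people table: id -> (name, parent_id).
people = {
    1: ("Adam", None),
    2: ("Abel", 1),
    3: ("Cain", 1),
    4: ("Enoch", 3),
}

def chain(pid, child_doc=None):
    """Wrap ancestors around child_doc until the root is reached."""
    name, parent = people[pid]
    doc = {"id": pid, "name": name, "children": child_doc}
    return doc if parent is None else chain(parent, doc)

# The chain that starts at the leaf Enoch (id 4):
# Adam -> Cain -> Enoch, with the innermost children being null.
print(json.dumps(chain(4)))
```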