Recursive query used for transitive closure - sql

I've created a simple example to illustrate transitive closure using recursive queries in PostgreSQL.
However, something is off with my recursive query. I'm not familiar with the syntax yet so this request may be entirely noobish of me, and for that I apologize in advance. If you run the query, you will see that node 1 repeats itself in the path results. Can someone please help me figure out how to tweak the SQL?
/*       1
       /   \
      2     3
     / \   /
    4   5 6
   /
  7
 / \
8   9
*/
create table account(
acct_id INT,
parent_id INT REFERENCES account(acct_id),
acct_name VARCHAR(100),
PRIMARY KEY(acct_id)
);
insert into account (acct_id, parent_id, acct_name) values (1,1,'account 1');
insert into account (acct_id, parent_id, acct_name) values (2,1,'account 2');
insert into account (acct_id, parent_id, acct_name) values (3,1,'account 3');
insert into account (acct_id, parent_id, acct_name) values (4,2,'account 4');
insert into account (acct_id, parent_id, acct_name) values (5,2,'account 5');
insert into account (acct_id, parent_id, acct_name) values (6,3,'account 6');
insert into account (acct_id, parent_id, acct_name) values (7,4,'account 7');
insert into account (acct_id, parent_id, acct_name) values (8,7,'account 8');
insert into account (acct_id, parent_id, acct_name) values (9,7,'account 9');
WITH RECURSIVE search_graph(acct_id, parent_id, depth, path, cycle) AS (
SELECT g.acct_id, g.parent_id, 1,
ARRAY[g.acct_id],
false
FROM account g
UNION ALL
SELECT g.acct_id, g.parent_id, sg.depth + 1,
path || g.acct_id,
g.acct_id = ANY(path)
FROM account g, search_graph sg
WHERE g.acct_id = sg.parent_id AND NOT cycle
)
SELECT path[1] as Child,parent_id as Parent,path || parent_id as path FROM search_graph
ORDER BY path[1],depth;

You can simplify in several places (assuming acct_id and parent_id are NOT NULL):
WITH RECURSIVE search_graph AS (
SELECT parent_id, ARRAY[acct_id] AS path
FROM account
UNION ALL
SELECT g.parent_id, sg.path || g.acct_id
FROM search_graph sg
JOIN account g ON g.acct_id = sg.parent_id
WHERE g.acct_id <> ALL(sg.path)
)
SELECT path[1] AS child
, path[array_upper(path,1)] AS parent
, path
FROM search_graph
ORDER BY path;
The columns acct_id, depth, cycle are just noise in your query.
The WHERE condition has to exit the recursion one step earlier, before the duplicate entry from the top node is in the result. That was an "off-by-one" in your original.
The rest is formatting.
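To see why the exit condition has to fire one step earlier, here is a minimal Python sketch that mimics the recursion on the sample data from the question. The names PARENT and ancestor_paths are made up for illustration and are not part of the query; the sketch assumes every account has exactly one parent_id.

```python
# Parent pointers copied from the question's INSERT statements
# (account 1 is its own parent, which is the only cycle).
PARENT = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 4, 8: 7, 9: 7}


def ancestor_paths(parent):
    """Emulate the recursive CTE: one path per (start node, chain of ancestors)."""
    # Non-recursive term: every account starts a path containing only itself.
    working = [(parent[acct], [acct]) for acct in sorted(parent)]
    result = list(working)
    while working:
        next_rows = []
        for next_id, path in working:
            # Counterpart of "g.acct_id <> ALL(sg.path)": refuse to append a
            # node that is already on the path, so the duplicate never lands.
            if next_id not in path:
                next_rows.append((parent[next_id], path + [next_id]))
        result.extend(next_rows)
        working = next_rows
    return sorted(path for _, path in result)


paths = ancestor_paths(PARENT)
print(paths[-1])  # longest chain starting at node 9
```

Checking membership before appending is what keeps node 1 from showing up twice; testing after the append, as in the original query, lets the duplicate into the result first.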
If you know that the only possible cycle in your graph is a self-reference, there is a cheaper variant:
WITH RECURSIVE search_graph AS (
SELECT parent_id, ARRAY[acct_id] AS path, acct_id <> parent_id AS keep_going
FROM account
UNION ALL
SELECT g.parent_id, sg.path || g.acct_id, g.acct_id <> g.parent_id
FROM search_graph sg
JOIN account g ON g.acct_id = sg.parent_id
WHERE sg.keep_going
)
SELECT path[1] AS child
, path[array_upper(path,1)] AS parent
, path
FROM search_graph
ORDER BY path;
Note that data types with a modifier (like varchar(5)) would cause problems, at least up to PostgreSQL 9.4, because array concatenation loses the modifier while the recursive CTE insists on types matching exactly. See the related question:
Surprising results for data types with type modifier

You have account 1 set as its own parent. If you set that account's parent to NULL, you avoid having that account as both the start and the end node. (The way your logic is set up, you include a cycle but then do not extend it, which seems reasonable.) It also looks a little nicer to change your final "path" column to something like case when parent_id is not null then path || parent_id else path end, which avoids the NULL at the end.
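As a rough illustration of the NULL-parent idea, here is a small Python sketch using the standard sqlite3 module. It is not PostgreSQL: SQLite has no array type, so the path is built as text with a '>' separator, and the column name child is made up for the example. The point is only that the recursion stops by itself once parent_id is NULL, with no cycle check.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE account (
        acct_id   INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES account(acct_id)
    );
    -- Same tree as in the question, but the root has a NULL parent.
    INSERT INTO account VALUES
        (1, NULL), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 4), (8, 7), (9, 7);
""")

rows = con.execute("""
    WITH RECURSIVE search_graph(child, parent_id, path) AS (
        SELECT acct_id, parent_id, CAST(acct_id AS TEXT)
        FROM account
        UNION ALL
        -- A NULL parent_id matches no row, so the walk ends at the root.
        SELECT sg.child, g.parent_id, sg.path || '>' || g.acct_id
        FROM search_graph sg
        JOIN account g ON g.acct_id = sg.parent_id
    )
    SELECT child, path FROM search_graph ORDER BY child, length(path)
""").fetchall()

for child, path in rows:
    print(child, path)
```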

Related

Oracle CONNECT BY recursive and return value a match

In the following example:
TABLE
ID   NAME   ATTR
-----------------
1    A1     ROOT
2    A2
3    A3     VALX
4    A4
5    A5
6    A6
RELATIONSHIP
ID   CHILD_ID   PARENT_ID
-------------------------
1    6          4
2    5          4
3    4          3
4    3          1
5    2          1
SCHEMA
I need a query that returns the ATTR value of the nearest ancestor whose ATTR is not null, moving up the levels until the first match is found.
For example with ID 6:
ID NAME NAME_PARENT ATTR_PARENT
-----------------------------------------
6 A6 A3 VALX
I have tried with:
select T.ID, T.NAME, T2.NAME PARENT_NAME, T2.ATTR ATTR_PARENT
from TABLE T
INNER JOIN RELATIONSHIP R
ON R.CHILD_ID = T.ID
INNER JOIN TABLE T2
ON T2.ID = R.PARENT_D
WHERE T2.ATTR IS NOT NULL
START WITH T.ID = 6
CONNECT BY T.ID = PRIOR R.PARENTID
--and R.PARENTID != prior T.ID
And sorry for my bad english
Instead of using the [mostly obsolete] CONNECT BY clause you can use standard Recursive SQL CTEs (Common Table Expressions).
For example:
with
n (id, name, name_parent, attr_parent, parent_id, lvl) as (
select t.id, t.name, b.name, b.attr, r.parent_id, 1
from t
join r on t.id = r.child_id
join t b on b.id = r.parent_id
where t.id = 6 -- starting node
union all
select n.id, n.name, b.name, b.attr, r.parent_id, lvl + 1
from n
join r on r.child_id = n.parent_id
join t b on b.id = r.parent_id
where n.attr_parent is null
)
select id, name, name_parent, attr_parent
from n
where lvl = (select max(lvl) from n)
Result:
ID NAME NAME_PARENT ATTR_PARENT
-- ---- ----------- -----------
6 A6 A3 VALX
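The logic of the recursion (keep climbing while the parent's ATTR is null, stop at the first non-null one) can be sketched in plain Python. The dictionaries below simply restate the TABLE and RELATIONSHIP data from the question; the function name first_ancestor_with_attr is hypothetical.

```python
# id -> (name, attr), copied from the question's TABLE
NODES = {
    1: ("A1", "ROOT"), 2: ("A2", None), 3: ("A3", "VALX"),
    4: ("A4", None), 5: ("A5", None), 6: ("A6", None),
}
# child_id -> parent_id, copied from the question's RELATIONSHIP
PARENT_OF = {6: 4, 5: 4, 4: 3, 3: 1, 2: 1}


def first_ancestor_with_attr(node_id):
    """Climb the parent chain and return the first ancestor with a non-null attr."""
    current = PARENT_OF.get(node_id)
    while current is not None:
        parent_name, parent_attr = NODES[current]
        if parent_attr is not None:
            return (node_id, NODES[node_id][0], parent_name, parent_attr)
        current = PARENT_OF.get(current)
    return None  # reached the top without finding a non-null attr


print(first_ancestor_with_attr(6))
```

Unlike the SQL above, this sketch returns nothing when no ancestor has a value, so treat that branch as a design choice rather than as the query's behavior.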
For reference, the data script I used is:
create table t (
id number(6),
name varchar2(10),
attr varchar2(10)
);
insert into t (id, name, attr) values (1, 'A1', 'ROOT');
insert into t (id, name, attr) values (2, 'A2', null);
insert into t (id, name, attr) values (3, 'A3', 'VALX');
insert into t (id, name, attr) values (4, 'A4', null);
insert into t (id, name, attr) values (5, 'A5', null);
insert into t (id, name, attr) values (6, 'A6', null);
create table r (
id number(6),
child_id number(6),
parent_id number(6)
);
insert into r (id, child_id, parent_id) values (1, 6, 4);
insert into r (id, child_id, parent_id) values (2, 5, 4);
insert into r (id, child_id, parent_id) values (3, 4, 3);
insert into r (id, child_id, parent_id) values (4, 3, 1);
insert into r (id, child_id, parent_id) values (5, 2, 1);
Here is how you can do the whole thing in a single pass of connect by - using the various features available for this kind of query (including the connect_by_isleaf flag and the connect_by_root pseudo-column):
select connect_by_root(r.child_id) as id,
connect_by_root(t.name) as name,
t.name as name_parent,
t.attr as attribute_parent
from r join t on r.child_id = t.id
where connect_by_isleaf = 1
start with r.child_id = 6
connect by prior r.parent_id = r.child_id and prior t.attr is null
;
ID NAME NAME_PARENT ATTRIBUTE_PARENT
---------- ---------- ----------- ----------------
6 A6 A3 VALX
Note that this will still return a null ATTRIBUTE_PARENT, if the entire tree is walked without ever finding an ancestor with non-null ATTRIBUTE. If in fact you only want to show something in the output if an ancestor has a non-null ATTRIBUTE (and allow the output to have no rows if there is no such ancestor), you can change the where clause to where t.attr is not null. In most cases, though, you would probably want the behavior as I coded it.
I used the tables and data as posted in @TheImpaler's answer (thank you for the create table and insert statements!)
As I commented under his answer: recursive with clause is in the SQL Standard, so it has some advantages over connect by. However, whenever the same job can be done with connect by, it's worth at least testing it that way too. In many cases, due to numerous optimizations Oracle has come up with over time, connect by will be much faster.
One reason some developers avoid connect by is that they don't spend the time to learn the various features (like the ones I used here). Not a good reason, in my opinion.

SQL Tree / Hierarchical Data

This is my first post. I am trying to build a SQL tree table that can be traversed. For example, if a person clicks on a drop-down list called Categories, it displays Electric and InterC. If the user then clicks on Electric, it drops down Relays and Switches; clicking on Relays drops down X Relays, and clicking on Switches drops down Y Switches. My attempt is below, but the part I don't understand is this: if I have another category, InterC, how do I make that another level of drop-downs?
Table Category
create table test(id int,parentId int,name varchar(50))
insert test select 1, 0,'Electric'
insert test select 2, 1,'Relays'
insert test select 3, 1,'Switches'
insert test select 5, 2,'X Relays'
insert test select 6, 2,'Y Switches'
insert test select 7, 0,'InterC'
insert test select 8, 1,'x Sockets'
insert test select 9, 1,'y Sockets'
insert test select 10, 2,'X Relays'
insert test select 11, 2,'Y Relays'
;
WITH tree (id, parentid, level, name) as (
SELECT id, parentid, 0 as level, name
FROM test WHERE parentid = 0
UNION ALL
SELECT c2.id, c2.parentid, tree.level + 1, c2.name
FROM test c2
INNER JOIN tree ON tree.id = c2.parentid
)
SELECT *
FROM tree
order by parentid
Your hierarchical T-SQL query should return all the records in the table, both those under Electric and InterC.
However, you should make parentId nullable and have the root records have a null rather than 0. That will let you add a foreign key that protects your data integrity (it won't be possible to add orphaned records by mistake).
Your hierarchy query returns all of your records. I'm guessing that you want to return just one hierarchy at a time; for that, add a where condition to the starting query.
WITH tree (id, parentid, level, name) as (
SELECT id, parentid, 0 as level, name
FROM test
WHERE name = @category AND
parentId is null
UNION ALL
SELECT c2.id, c2.parentid, tree.level + 1, c2.name
FROM test c2
INNER JOIN tree ON tree.id = c2.parentid
)
SELECT *
FROM tree
order by parentid
Then set @category to 'Electric' or 'InterC' to get one or the other hierarchy.
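For a quick way to try the idea outside SQL Server, here is a hedged sketch with Python's sqlite3 module. The sample rows are an assumption: they are adjusted from the question so that the roots have a NULL parent and the sockets hang under InterC. The helper name subtree and the :category placeholder are likewise made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE test (
        id       INTEGER PRIMARY KEY,
        parentId INTEGER REFERENCES test(id),  -- NULL marks a root category
        name     TEXT
    );
    INSERT INTO test VALUES
        (1, NULL, 'Electric'), (2, 1, 'Relays'), (3, 1, 'Switches'),
        (5, 2, 'X Relays'), (6, 3, 'Y Switches'),
        (7, NULL, 'InterC'), (8, 7, 'x Sockets'), (9, 7, 'y Sockets');
""")


def subtree(category):
    """Return (name, level) rows for one root category and everything below it."""
    return con.execute("""
        WITH RECURSIVE tree(id, parentId, level, name) AS (
            SELECT id, parentId, 0, name
            FROM test
            WHERE name = :category AND parentId IS NULL
            UNION ALL
            SELECT c2.id, c2.parentId, tree.level + 1, c2.name
            FROM test c2
            JOIN tree ON tree.id = c2.parentId
        )
        SELECT name, level FROM tree ORDER BY level, id
    """, {"category": category}).fetchall()


print(subtree("InterC"))
print(subtree("Electric"))
```

Each call returns only the rows under the chosen root, which is what a drop-down needs when the user expands one category.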

Flatten the tree path in SQL server Hierarchy ID

I am using the SQL Server hierarchyid data type to model a taxonomy structure in my application.
The same name can appear at different levels of the taxonomy.
During setup, this data needs to be uploaded from an Excel sheet.
Before inserting any node, I would like to check whether a node at that particular path already exists, so that I don't duplicate entries.
What is the easiest way to check whether the node at a particular absolute path already exists?
For example, before inserting "Retail" under "Bank 2", I should be able to check that "/Bank 2/Retail" does not already exist.
Is there any way to produce a flattened representation of the entire tree structure so that I can check for the absolute path and then proceed?
Yes, you can do it using a recursive CTE.
In each iteration of the query you can append a new level of the hierarchy name.
There are lots of examples of this technique on the internet.
For example, with this sample data:
CREATE TABLE Test
(id INT,
parent_id INT null,
NAME VARCHAR(50)
)
INSERT INTO Test VALUES(1, NULL, 'L1')
INSERT INTO Test VALUES(2, 1, 'L1-A')
INSERT INTO Test VALUES(3, 2, 'L1-A-1')
INSERT INTO Test VALUES(4, 2, 'L1-A-2')
INSERT INTO Test VALUES(5, 1, 'L1-B')
INSERT INTO Test VALUES(6, 5, 'L1-B-1')
INSERT INTO Test VALUES(7, 5, 'L1-B-2')
you can write a recursive CTE like this:
WITH H AS
(
-- Anchor: the first level of the hierarchy
SELECT id, parent_id, name, CAST(name AS NVARCHAR(300)) AS path
FROM Test
WHERE parent_id IS NULL
UNION ALL
-- Recursive: join the original table to the anchor, and combine data from both
SELECT T.id, T.parent_id, T.name, CAST(H.path + '\' + T.name AS NVARCHAR(300))
FROM Test T INNER JOIN H ON T.parent_id = H.id
)
-- You can query H as if it was a normal table or View
SELECT * FROM H
WHERE PATH = 'L1\L1-A' -- for example to see if this exists
The result of the query (without the where filter) looks like this:
id  parent_id  name    path
1   NULL       L1      L1
2   1          L1-A    L1\L1-A
5   1          L1-B    L1\L1-B
6   5          L1-B-1  L1\L1-B\L1-B-1
7   5          L1-B-2  L1\L1-B\L1-B-2
3   2          L1-A-1  L1\L1-A\L1-A-1
4   2          L1-A-2  L1\L1-A\L1-A-2
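If the check happens in application code during the upload, the same flattening can be sketched in Python. This is only an illustration under the assumption that the rows have already been read as (id, parent_id, name) tuples; build_paths is a hypothetical helper, and the backslash separator just mirrors the query above.

```python
# (id, parent_id, name) rows matching the sample data above
ROWS = [
    (1, None, "L1"), (2, 1, "L1-A"), (3, 2, "L1-A-1"), (4, 2, "L1-A-2"),
    (5, 1, "L1-B"), (6, 5, "L1-B-1"), (7, 5, "L1-B-2"),
]


def build_paths(rows, sep="\\"):
    """Map every node id to its absolute path from the root."""
    by_id = {row[0]: row for row in rows}
    paths = {}

    def path_of(node_id):
        # Memoised walk towards the root: parent path first, then own name.
        if node_id not in paths:
            _, parent_id, name = by_id[node_id]
            if parent_id is None:
                paths[node_id] = name
            else:
                paths[node_id] = path_of(parent_id) + sep + name
        return paths[node_id]

    for node_id in by_id:
        path_of(node_id)
    return paths


paths = build_paths(ROWS)
existing = set(paths.values())
print("L1\\L1-A" in existing)   # path already present, skip the insert
print("L1\\L1-C" in existing)   # path missing, safe to insert
```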

Removing duplicate subtrees from CONNECT-BY query in oracle

I have a hierarchical table in the format
CREATE TABLE tree_hierarchy (
id NUMBER (20)
,parent_id NUMBER (20)
);
INSERT INTO tree_hierarchy (id, parent_id) VALUES (2, 1);
INSERT INTO tree_hierarchy (id, parent_id) VALUES (4, 2);
INSERT INTO tree_hierarchy (id, parent_id) VALUES (9, 4);
When I run this query:
SELECT id,parent_id,
CONNECT_BY_ISLEAF leaf,
LEVEL,
SYS_CONNECT_BY_PATH(id, '/') Path,
SYS_CONNECT_BY_PATH(parent_id, '/') Parent_Path
FROM tree_hierarchy
WHERE CONNECT_BY_ISLEAF<>0
CONNECT BY PRIOR id = PARENT_id
ORDER SIBLINGS BY ID;
the result I get looks like this:
"ID" "PARENT_ID" "LEAF" "LEVEL" "PATH" "PARENT_PATH"
9 4 1 3 "/2/4/9" "/1/2/4"
9 4 1 2 "/4/9" "/2/4"
9 4 1 1 "/9" "/4"
But I need an Oracle SQL query that returns only this:
"ID" "PARENT_ID" "LEAF" "LEVEL" "PATH" "PARENT_PATH"
9 4 1 3 "/2/4/9" "/1/2/4"
This is a simplified example; I have more than 1000 records of this kind. When I run the above query, it generates many duplicates. Can anyone give me a generic query that returns the complete path from leaf to root without duplicates? Thanks in advance for the help.
The root node of a finite hierarchy must always be known.
According to the definition (http://en.wikipedia.org/wiki/Tree_structure),
the root node is a node that has no parent.
To check whether a given node is a root node, take its "parent_id" and check whether a record with this id exists in the table; if none exists, the node is a root.
The query might look like this:
SELECT id,parent_id,
CONNECT_BY_ISLEAF leaf,
LEVEL,
SYS_CONNECT_BY_PATH(id, '/') Path,
SYS_CONNECT_BY_PATH(parent_id, '/') Parent_Path
FROM tree_hierarchy th
WHERE CONNECT_BY_ISLEAF<>0
CONNECT BY PRIOR id = PARENT_id
START WITH not exists (
select 1 from tree_hierarchy th1
where th1.id = th.parent_id
)
ORDER SIBLINGS BY ID;
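The root test used in the START WITH clause (a row is a root when its parent_id matches no id in the table) can be checked with a small Python sketch. ROWS restates the three sample rows; leaf_paths is a made-up name, and the walk is only a rough stand-in for what CONNECT BY does.

```python
from collections import defaultdict

# (id, parent_id) rows from the question
ROWS = [(2, 1), (4, 2), (9, 4)]


def leaf_paths(rows):
    """Return the root-to-leaf paths, starting only from true root rows."""
    ids = {node_id for node_id, _ in rows}
    children = defaultdict(list)
    for node_id, parent_id in rows:
        children[parent_id].append(node_id)

    # START WITH equivalent: rows whose parent_id is not an id in the table.
    stack = [(node_id, "/" + str(node_id))
             for node_id, parent_id in rows if parent_id not in ids]
    paths = []
    while stack:
        node_id, path = stack.pop()
        kids = children.get(node_id, [])
        if not kids:
            paths.append(path)  # leaf reached (CONNECT_BY_ISLEAF = 1)
        for kid in kids:
            stack.append((kid, path + "/" + str(kid)))
    return sorted(paths)


print(leaf_paths(ROWS))
```

Starting only from the root rows is what removes the shorter "/4/9" and "/9" paths from the output.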
You should explicitly specify the id from which to build the path. Without START WITH, every row is used as a starting point, so your query builds a path to the leaf from each of them. You need to use "start with". Let's try it like this:
SELECT id,parent_id,
CONNECT_BY_ISLEAF leaf,
LEVEL,
SYS_CONNECT_BY_PATH(id, '/') Path,
SYS_CONNECT_BY_PATH(parent_id, '/') Parent_Path
FROM tree_hierarchy
WHERE CONNECT_BY_ISLEAF<>0
CONNECT BY PRIOR id = PARENT_id
START WITH id = 2
ORDER SIBLINGS BY ID;

Cumulative number of files in subfolders

I have a table with a list of files. The columns are id_folder, id_parrent_folder, and size (file size):
create table sample_data (
id_folder bigint ,
id_parrent_folder bigint,
size bigint
);
I would like to know how many files are in every subfolder (including the current folder) for each folder, starting with a given folder. Given the sample data posted below, I expect the following output:
id_folder files
100623 35
100624 14
Sample data:
insert into sample_data values (100623,58091,60928);
insert into sample_data values (100623,58091,59904);
insert into sample_data values (100623,58091,54784);
insert into sample_data values (100623,58091,65024);
insert into sample_data values (100623,58091,25600);
insert into sample_data values (100623,58091,31744);
insert into sample_data values (100623,58091,27648);
insert into sample_data values (100623,58091,39424);
insert into sample_data values (100623,58091,30720);
insert into sample_data values (100623,58091,71168);
insert into sample_data values (100623,58091,68608);
insert into sample_data values (100623,58091,34304);
insert into sample_data values (100623,58091,46592);
insert into sample_data values (100623,58091,35328);
insert into sample_data values (100623,58091,29184);
insert into sample_data values (100623,58091,38912);
insert into sample_data values (100623,58091,38400);
insert into sample_data values (100623,58091,49152);
insert into sample_data values (100623,58091,14444);
insert into sample_data values (100623,58091,33792);
insert into sample_data values (100623,58091,14789);
insert into sample_data values (100624,100623,16873);
insert into sample_data values (100624,100623,32768);
insert into sample_data values (100624,100623,104920);
insert into sample_data values (100624,100623,105648);
insert into sample_data values (100624,100623,31744);
insert into sample_data values (100624,100623,16431);
insert into sample_data values (100624,100623,46592);
insert into sample_data values (100624,100623,28160);
insert into sample_data values (100624,100623,58650);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
I've tried to use the example from the PostgreSQL docs, but it (obviously) can't work this way. Any help is appreciated.
-- Edit
I've tried the following query:
WITH RECURSIVE included_files(id_folder, parrent_folder, dist_last_change) AS (
SELECT
id_folder,
id_parrent_folder,
size
FROM
sample_data p
WHERE
id_folder = 100623
UNION ALL
SELECT
p.id_folder,
p.id_parrent_folder,
p.size
FROM
included_files if,
sample_data p
WHERE
p.id_parrent_folder = if.id_folder
)
select * from included_files
This won't work: the table holds one row per file, so every row of a child folder joins to many rows of its parent folder, and the rows of child folders are multiplied.
With your sample data, this returns what you want. I'm not 100% sure though that it will cover all possible anomalies in your tree:
with recursive folder_sizes as (
select id_folder, id_parrent_folder, count(*) as num_files
from sample_data
group by id_folder, id_parrent_folder
),
folder_tree as (
select id_folder, id_parrent_folder, num_files as total_files
from folder_sizes
where id_parrent_folder = 100623
union all
select c.id_folder, c.id_parrent_folder, c.num_files + p.total_files as total_files
from folder_sizes c
join folder_tree p on p.id_parrent_folder = c.id_folder
)
select id_folder, id_parrent_folder, total_files
from folder_tree;
Here is a SQLFiddle demo: http://sqlfiddle.com/#!12/bb942/2
This only covers a single level hierarchy though (because of the id_parent_folder = 100623 condition). To cover any number of levels, I can only think of a two step approach, that first collects all sub-folders and then walks that tree up again, to calculate the total number of files.
Something like this:
with recursive folder_sizes as (
select id_folder, id_parrent_folder, count(*) as num_files
from sample_data
group by id_folder, id_parrent_folder
),
folder_tree_down as (
select id_folder, id_parrent_folder, num_files, id_folder as root_folder, 1 as level
from folder_sizes
union all
select c.id_folder, c.id_parrent_folder, c.num_files, p.root_folder, p.level + 1 as level
from folder_sizes c
join folder_tree_down p on p.id_folder = c.id_parrent_folder
),
folder_tree_up as (
select id_folder, id_parrent_folder, num_files as total_files, level
from folder_tree_down
where root_folder = 100623
union all
select c.id_folder, c.id_parrent_folder, c.num_files + p.total_files as total_files, p.level
from folder_tree_down c
join folder_tree_up p on p.id_parrent_folder = c.id_folder
)
select id_folder, id_parrent_folder, total_files
from folder_tree_up
where level > 1;
That produces the same output as the first statement, but I think it should work with an unlimited number of levels.
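The two-step idea (first collapse the per-file rows into one count per folder, then add up the counts over the subtree) can be cross-checked with a short Python sketch. It assumes the rows are available as (id_folder, id_parrent_folder) pairs; cumulative_counts is a hypothetical helper and not a translation of the SQL above.

```python
from collections import Counter, defaultdict

# One (id_folder, id_parrent_folder) pair per file, matching the sample data:
# 21 files in folder 100623 and 14 files in its subfolder 100624.
FILES = [(100623, 58091)] * 21 + [(100624, 100623)] * 14


def cumulative_counts(file_rows, root):
    """Count files per folder, including the files of all its subfolders."""
    own = Counter(folder for folder, _ in file_rows)   # step 1: aggregate
    children = defaultdict(set)
    for folder, parent in file_rows:
        children[parent].add(folder)

    totals = {}

    def total(folder):
        # Step 2: a folder's total is its own files plus its children's totals.
        if folder not in totals:
            totals[folder] = own[folder] + sum(total(c) for c in children[folder])
        return totals[folder]

    total(root)
    return totals


counts = cumulative_counts(FILES, 100623)
print(counts)
```

Aggregating before walking the tree is what avoids the row multiplication described in the question.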
A very nice problem to think about; I upvoted!
As I see it, there are two cases to think about:
multi-level paths and
multi-child nodes.
So far I've come up with the following query:
WITH RECURSIVE tree AS (
SELECT id_folder id, array[id_folder] arr
FROM sample_data sd
WHERE NOT EXISTS (SELECT 1 FROM sample_data s
WHERE s.id_parrent_folder=sd.id_folder)
UNION ALL
SELECT sd.id_folder,t.arr||sd.id_folder
FROM tree t
JOIN sample_data sd ON sd.id_folder IN (
SELECT id_parrent_folder FROM sample_data WHERE id_folder=t.id))
,ids AS (SELECT DISTINCT id, unnest(arr) ua FROM tree)
,agg AS (SELECT id_folder id,count(*) cnt FROM sample_data GROUP BY 1)
SELECT ids.id, sum(agg.cnt)
FROM ids JOIN agg ON ids.ua=agg.id
GROUP BY 1
ORDER BY 1;
I've added the following rows to the sample_data:
INSERT INTO sample_data VALUES (100625,100623,123);
INSERT INTO sample_data VALUES (100625,100623,456);
INSERT INTO sample_data VALUES (100625,100623,789);
INSERT INTO sample_data VALUES (100626,100625,1);
This query is not optimal, though, and will slow down as the number of rows grows.
Full-scale tests
To simulate the original situation, I wrote a small Python script that scans the filesystem and stores it in the database (hence the delay; I'm not yet good at Python scripting).
The following tables had been created:
CREATE TABLE fs_file(file_id bigserial, name text, type char(1), level int4);
CREATE TABLE fs_tree(file_id int8, parent_id int8, size int8);
Scanning the whole filesystem of my MBP took 7.5 minutes and produced 870k entries in the fs_tree table, which is quite similar to the original task. After the upload, the following was run:
CREATE INDEX i_fs_tree_1 ON fs_tree(file_id);
CREATE INDEX i_fs_tree_2 ON fs_tree(parent_id);
VACUUM ANALYZE fs_file;
VACUUM ANALYZE fs_tree;
I tried running my first query on this data and had to kill it after approximately 1 hour. The improved one takes around 2 minutes (on my MBP) to do the job on the whole filesystem. Here it is:
WITH RECURSIVE descent AS (
SELECT fs.file_id grp, fs.file_id, fs.size, 1 k, 0 AS lvl
FROM fs_tree fs
WHERE fs.parent_id = (SELECT file_id FROM fs_file WHERE name = '/')
UNION ALL
SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
fs.file_id, fs.size, k.k, d.lvl+1
FROM descent d
JOIN fs_tree fs ON d.file_id=fs.parent_id
CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp, file_id, size, k, lvl
FROM descent
ORDER BY 1,2,3;
The query uses my table names, but it shouldn't be difficult to change them. It builds a group for each file_id found in fs_tree. To get the desired output, you can do something like:
SELECT grp AS file_id, count(*), sum(size)
FROM descent GROUP BY 1;
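The generate_series(0,1) cross join is the least obvious part: each child row is emitted twice, once inside the group it inherits from its ancestor (k = 0) and once as the start of its own group (k = 1). Here is a rough Python emulation, using the rows of the small fs_tree dump from this answer and taking file_id 1 as the root. The function names descent and totals are illustrative, and the k and lvl columns are dropped for brevity.

```python
# (file_id, parent_id, size) rows from the small fs_tree dump
FS_TREE = [
    (1, None, 0), (2, 1, 0), (3, 2, 936), (4, 2, 706), (5, 1, 4261),
    (6, 1, 4261), (7, 1, 793004), (8, 1, 491), (9, 1, 491),
]


def descent(fs_tree, root_id):
    """Return (grp, file_id, size) rows: every entry listed under each group it belongs to."""
    children = {}
    for file_id, parent_id, size in fs_tree:
        children.setdefault(parent_id, []).append((file_id, size))

    # Non-recursive term: each direct child of the root opens its own group.
    working = {(f, f, size) for f, size in children.get(root_id, [])}
    rows = set(working)
    while working:
        step = set()  # a set plays the role of SELECT DISTINCT
        for grp, file_id, _ in working:
            for child_id, child_size in children.get(file_id, []):
                step.add((grp, child_id, child_size))       # k = 0: stay in the ancestor's group
                step.add((child_id, child_id, child_size))  # k = 1: open the child's own group
        rows |= step
        working = step
    return rows


def totals(rows):
    """Equivalent of: SELECT grp, count(*), sum(size) ... GROUP BY 1."""
    out = {}
    for grp, _, size in rows:
        count, total = out.get(grp, (0, 0))
        out[grp] = (count + 1, total + size)
    return out


result = totals(descent(FS_TREE, 1))
print(result[2])  # directory "jobs": itself plus its two files
```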
Some notes:
the query works only if there are no duplicates. I think that is the right way to go, because it is impossible to have two equally named entries in a single directory;
the query doesn't care about the depth or sibling count of the tree, though these do affect performance;
for me it was a good experience, as similar functionality is also needed for task planning systems (I'm working with one at the moment);
where tasks are concerned, a single entry can have multiple parents (but not the other way around) and the query will still work;
this problem can be solved in other ways too, such as traversing the tree in ascending order or using pre-calculated values to avoid the final grouping step, but that goes beyond a simple question, so I leave it as an exercise for you.
Recommendations
To make this query work with your table, you should first prepare your data by aggregating it:
WITH RECURSIVE
fs_tree AS (
SELECT id_folder file_id, id_parrent_folder parent_id,
sum(size) AS size, count(*) AS cnt
FROM sample_data GROUP BY 1,2)
,descent AS (
SELECT fs.file_id grp, fs.file_id, fs.size, fs.cnt, 1 k, 0 AS lvl
FROM fs_tree fs
WHERE fs.parent_id = 58091
UNION ALL
SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
fs.file_id, fs.size, fs.cnt, k.k, d.lvl+1
FROM descent d
JOIN fs_tree fs ON d.file_id=fs.parent_id
CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp file_id, sum(size) size, sum(cnt) cnt
FROM descent
GROUP BY 1
ORDER BY 1,2,3;
In order to speed things up, you can implement Materialized Views and pre-calculate some metrics.
Sample data
Here's a small dump that will show the data inside the tables:
INSERT INTO fs_file VALUES (1, '/Users/viy/prj/logs', 'D', 0),
(2, 'jobs', 'D', 1),
(3, 'pg_csv_load', 'F', 2),
(4, 'pg_logs', 'F', 2),
(5, 'logs.sql', 'F', 1),
(6, 'logs.sql~', 'F', 1),
(7, 'pgfouine-1.2.tar.gz', 'F', 1),
(8, 'u.sql', 'F', 1),
(9, 'u.sql~', 'F', 1);
INSERT INTO fs_tree VALUES (1, NULL, 0),
(2, 1, 0),
(3, 2, 936),
(4, 2, 706),
(5, 1, 4261),
(6, 1, 4261),
(7, 1, 793004),
(8, 1, 491),
(9, 1, 491);
Note that I've slightly updated the CREATE statements.
And this is the script I've used to scan the filesystem:
#!/usr/bin/python
import os
import psycopg2
import sys
from stat import *

def walk_tree(full, parent, level, call_back):
    '''recursively descend the directory tree rooted at top,
    calling the callback function for each regular file'''
    if not os.access(full, os.R_OK):
        return
    for f in os.listdir(full):
        path = os.path.join(full, f)
        if os.path.islink(path):
            # It's a link, register and continue
            e = entry(f, "L", level)
            call_back(parent, e, 0)
            continue
        mode = os.stat(path).st_mode
        if S_ISDIR(mode):
            e = entry(f, "D", level)
            call_back(parent, e, 0)
            # It's a directory, recurse into it
            try:
                walk_tree(path, e, level+1, call_back)
            except OSError:
                pass
        elif S_ISREG(mode):
            # It's a file, call the callback function
            call_back(parent, entry(f, "F", level), os.stat(path).st_size)
        else:
            # It's unknown, just register
            e = entry(f, "U", level)
            call_back(parent, e, 0)

def register(parent, entry, size):
    db_cur.execute("INSERT INTO fs_tree VALUES (%s,%s,%s)",
                   (entry, parent, size))

def entry(name, type, level):
    db_cur.execute("""INSERT INTO fs_file(name,type, level)
                      VALUES (%s, %s, %s) RETURNING file_id""",
                   (name, type, level))
    return db_cur.fetchone()[0]

db_con=psycopg2.connect("dbname=postgres")
db_cur=db_con.cursor()
if len(sys.argv) != 2:
    raise SyntaxError("Root directory expected!")
if not S_ISDIR(os.stat(sys.argv[1]).st_mode):
    raise SyntaxError("A directory is wanted!")
e=entry(sys.argv[1], "D", 0)
register(None, e, 0)
walk_tree(sys.argv[1], e, 1, register)
db_con.commit()
db_cur.close()
db_con.close()
This script is for Python 3.2 and is based on the example from the official Python documentation.
Hope this clarifies things for you.