Cumulative number of files in subfolders - sql

I have a table with a list of files. It has the columns id_folder, id_parrent_folder and size (file size):
create table sample_data (
id_folder bigint ,
id_parrent_folder bigint,
size bigint
);
I would like to know how many files there are in each folder, including all of its subfolders, starting with a given folder. Given the sample data posted below, I expect the following output:
id_folder files
100623 35
100624 14
Sample data:
insert into sample_data values (100623,58091,60928);
insert into sample_data values (100623,58091,59904);
insert into sample_data values (100623,58091,54784);
insert into sample_data values (100623,58091,65024);
insert into sample_data values (100623,58091,25600);
insert into sample_data values (100623,58091,31744);
insert into sample_data values (100623,58091,27648);
insert into sample_data values (100623,58091,39424);
insert into sample_data values (100623,58091,30720);
insert into sample_data values (100623,58091,71168);
insert into sample_data values (100623,58091,68608);
insert into sample_data values (100623,58091,34304);
insert into sample_data values (100623,58091,46592);
insert into sample_data values (100623,58091,35328);
insert into sample_data values (100623,58091,29184);
insert into sample_data values (100623,58091,38912);
insert into sample_data values (100623,58091,38400);
insert into sample_data values (100623,58091,49152);
insert into sample_data values (100623,58091,14444);
insert into sample_data values (100623,58091,33792);
insert into sample_data values (100623,58091,14789);
insert into sample_data values (100624,100623,16873);
insert into sample_data values (100624,100623,32768);
insert into sample_data values (100624,100623,104920);
insert into sample_data values (100624,100623,105648);
insert into sample_data values (100624,100623,31744);
insert into sample_data values (100624,100623,16431);
insert into sample_data values (100624,100623,46592);
insert into sample_data values (100624,100623,28160);
insert into sample_data values (100624,100623,58650);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
I've tried to use the example from the PostgreSQL docs, but it (obviously) can't work this way. Any help appreciated.
-- Edit
I've tried the following query:
WITH RECURSIVE included_files(id_folder, parrent_folder, dist_last_change) AS (
SELECT
id_folder,
id_parrent_folder,
size
FROM
sample_data p
WHERE
id_folder = 100623
UNION ALL
SELECT
p.id_folder,
p.id_parrent_folder,
p.size
FROM
included_files if,
sample_data p
WHERE
p.id_parrent_folder = if.id_folder
)
select * from included_files
This won't work: sample_data holds one row per file, so every child row joins against every row of its parent folder, and as a result the rows in child folders are multiplied.

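With your sample data, this returns what you want (note that the queries below spell the column id_parent_folder, while your table has id_parrent_folder). I'm not 100% sure, though, that it will cover all possible anomalies in your tree: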
with recursive folder_sizes as (
select id_folder, id_parent_folder, count(*) as num_files
from sample_data
group by id_folder, id_parent_folder
),
folder_tree as (
select id_folder, id_parent_folder, num_files as total_files
from folder_sizes
where id_parent_folder = 100623
union all
select c.id_folder, c.id_parent_folder, c.num_files + p.total_files as total_files
from folder_sizes c
join folder_tree p on p.id_parent_folder = c.id_folder
)
select id_folder, id_parent_folder, total_files
from folder_tree;
Here is a SQLFiddle demo: http://sqlfiddle.com/#!12/bb942/2
This only covers a single-level hierarchy, though (because of the id_parent_folder = 100623 condition). To cover any number of levels, I can only think of a two-step approach that first collects all sub-folders and then walks that tree up again to calculate the total number of files.
Something like this:
with recursive folder_sizes as (
select id_folder, id_parent_folder, count(*) as num_files
from sample_data
group by id_folder, id_parent_folder
),
folder_tree_down as (
select id_folder, id_parent_folder, num_files, id_folder as root_folder, 1 as level
from folder_sizes
union all
select c.id_folder, c.id_parent_folder, c.num_files, p.root_folder, p.level + 1 as level
from folder_sizes c
join folder_tree_down p on p.id_folder = c.id_parent_folder
),
folder_tree_up as (
select id_folder, id_parent_folder, num_files as total_files, level
from folder_tree_down
where root_folder = 100623
union all
select c.id_folder, c.id_parent_folder, c.num_files + p.total_files as total_files, p.level
from folder_tree_down c
join folder_tree_up p on p.id_parent_folder = c.id_folder
)
select id_folder, id_parent_folder, total_files
from folder_tree_up
where level > 1;
That produces the same output as the first statement, but I think it should work with an unlimited number of levels.
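For comparison, the two-step idea (count the files per folder, then add each folder's count to every one of its ancestors) can be sketched in plain Python. This is only an illustration under assumptions: `cumulative_counts` is a hypothetical helper, the rows mirror the sample data (21 files in 100623, 14 in 100624), and a folder is only known if it contains at least one file, just as in the table.

```python
from collections import Counter

def cumulative_counts(rows):
    """rows: iterable of (id_folder, id_parent_folder, size), one row per file."""
    own = Counter()      # files stored directly in each folder
    parent_of = {}       # folder -> parent folder
    for folder, parent, _size in rows:
        own[folder] += 1
        parent_of[folder] = parent

    total = Counter()
    for folder, n in own.items():
        # Credit this folder's files to itself and to every known ancestor.
        node = folder
        seen = set()     # guards against accidental cycles in the data
        while node in own and node not in seen:
            seen.add(node)
            total[node] += n
            node = parent_of[node]
    return dict(total)

# Sizes do not matter for counting, so a placeholder size of 1 is used.
rows = [(100623, 58091, 1)] * 21 + [(100624, 100623, 1)] * 14
print(cumulative_counts(rows))   # {100623: 35, 100624: 14}
```

The walk stops at the first folder that has no rows of its own (58091 here), which matches what the recursive queries can see in sample_data.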

Very nice problem to think about, I upvoted!
As I see it, 2 cases to think about:
multi-level paths and
multi-child nodes.
So far I've come up with the following query:
WITH RECURSIVE tree AS (
SELECT id_folder id, array[id_folder] arr
FROM sample_data sd
WHERE NOT EXISTS (SELECT 1 FROM sample_data s
WHERE s.id_parrent_folder=sd.id_folder)
UNION ALL
SELECT sd.id_folder,t.arr||sd.id_folder
FROM tree t
JOIN sample_data sd ON sd.id_folder IN (
SELECT id_parrent_folder FROM sample_data WHERE id_folder=t.id))
,ids AS (SELECT DISTINCT id, unnest(arr) ua FROM tree)
,agg AS (SELECT id_folder id,count(*) cnt FROM sample_data GROUP BY 1)
SELECT ids.id, sum(agg.cnt)
FROM ids JOIN agg ON ids.ua=agg.id
GROUP BY 1
ORDER BY 1;
I've added the following rows to the sample_data:
INSERT INTO sample_data VALUES (100625,100623,123);
INSERT INTO sample_data VALUES (100625,100623,456);
INSERT INTO sample_data VALUES (100625,100623,789);
INSERT INTO sample_data VALUES (100626,100625,1);
This query is not optimal, though, and will slow down as the number of rows grows.
Full-scale tests
To simulate the original situation, I've written a small Python script that scans the filesystem and stores it in the database (hence the delay; I'm not yet good at Python scripting).
The following tables had been created:
CREATE TABLE fs_file(file_id bigserial, name text, type char(1), level int4);
CREATE TABLE fs_tree(file_id int8, parent_id int8, size int8);
Scanning the whole filesystem of my MBP took 7.5 minutes and I have 870k entries in the fs_tree table, which is quite similar to the original task. After the upload, the following was run:
CREATE INDEX i_fs_tree_1 ON fs_tree(file_id);
CREATE INDEX i_fs_tree_2 ON fs_tree(parent_id);
VACUUM ANALYZE fs_file;
VACUUM ANALYZE fs_tree;
I tried running my first query on this data and had to kill it after approximately 1 hour. The improved one takes around 2 minutes (on my MBP) to do the job on the whole filesystem. Here it comes:
WITH RECURSIVE descent AS (
SELECT fs.file_id grp, fs.file_id, fs.size, 1 k, 0 AS lvl
FROM fs_tree fs
WHERE fs.parent_id = (SELECT file_id FROM fs_file WHERE name = '/')
UNION ALL
SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
fs.file_id, fs.size, k.k, d.lvl+1
FROM descent d
JOIN fs_tree fs ON d.file_id=fs.parent_id
CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp, file_id, size, k, lvl
FROM descent
ORDER BY 1,2,3;
The query uses my table names, but it shouldn't be difficult to change them. It will build a set of groups for each file_id found in fs_tree. To get the desired output, you can do something like:
SELECT grp AS file_id, count(*), sum(size)
FROM descent GROUP BY 1;
Some notes:
the query will work only if there are no duplicates. I think this is the right way to go, because it is impossible to have 2 equally named entries in a single directory;
the query doesn't care about the depth or sibling count of the tree, though these do have an impact on performance;
for me it was a good experience, as similar functionality is also needed for task planning systems (I'm working with one at the moment);
where tasks are concerned, a single entry can have multiple parents (but not otherwise) and the query will still work;
this problem can be solved in other ways too, like traversing the tree in ascending order, or using pre-calculated values to avoid the final grouping step, but this is getting a bit bigger than a simple question, so I leave it as an exercise for you.
Recommendations
To get this query to work, you should prepare your data by aggregating it:
WITH RECURSIVE
fs_tree AS (
SELECT id_folder file_id, id_parrent_folder parent_id,
sum(size) AS size, count(*) AS cnt
FROM sample_data GROUP BY 1,2)
,descent AS (
SELECT fs.file_id grp, fs.file_id, fs.size, fs.cnt, 1 k, 0 AS lvl
FROM fs_tree fs
WHERE fs.parent_id = 58091
UNION ALL
SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
fs.file_id, fs.size, fs.cnt, k.k, d.lvl+1
FROM descent d
JOIN fs_tree fs ON d.file_id=fs.parent_id
CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp file_id, sum(size) size, sum(cnt) cnt
FROM descent
GROUP BY 1
ORDER BY 1,2,3;
In order to speed things up, you can implement Materialized Views and pre-calculate some metrics.
Sample data
Here's a small dump that will show the data inside the tables:
INSERT INTO fs_file VALUES (1, '/Users/viy/prj/logs', 'D', 0),
(2, 'jobs', 'D', 1),
(3, 'pg_csv_load', 'F', 2),
(4, 'pg_logs', 'F', 2),
(5, 'logs.sql', 'F', 1),
(6, 'logs.sql~', 'F', 1),
(7, 'pgfouine-1.2.tar.gz', 'F', 1),
(8, 'u.sql', 'F', 1),
(9, 'u.sql~', 'F', 1);
INSERT INTO fs_tree VALUES (1, NULL, 0),
(2, 1, 0),
(3, 2, 936),
(4, 2, 706),
(5, 1, 4261),
(6, 1, 4261),
(7, 1, 793004),
(8, 1, 491),
(9, 1, 491);
Note that I've slightly updated the CREATE statements.
And this is the script I've used to scan the filesystem:
#!/usr/bin/python
import os
import psycopg2
import sys
from stat import *

def walk_tree(full, parent, level, call_back):
    '''recursively descend the directory tree rooted at top,
    calling the callback function for each regular file'''
    if not os.access(full, os.R_OK):
        return
    for f in os.listdir(full):
        path = os.path.join(full, f)
        if os.path.islink(path):
            # It's a link, register and continue
            e = entry(f, "L", level)
            call_back(parent, e, 0)
            continue
        mode = os.stat(path).st_mode
        if S_ISDIR(mode):
            e = entry(f, "D", level)
            call_back(parent, e, 0)
            # It's a directory, recurse into it
            try:
                walk_tree(path, e, level+1, call_back)
            except OSError:
                pass
        elif S_ISREG(mode):
            # It's a file, call the callback function
            call_back(parent, entry(f, "F", level), os.stat(path).st_size)
        else:
            # It's unknown, just register
            e = entry(f, "U", level)
            call_back(parent, e, 0)

def register(parent, entry, size):
    db_cur.execute("INSERT INTO fs_tree VALUES (%s,%s,%s)",
                   (entry, parent, size))

def entry(name, type, level):
    db_cur.execute("""INSERT INTO fs_file(name,type, level)
                      VALUES (%s, %s, %s) RETURNING file_id""",
                   (name, type, level))
    return db_cur.fetchone()[0]

db_con=psycopg2.connect("dbname=postgres")
db_cur=db_con.cursor()

if len(sys.argv) != 2:
    raise SyntaxError("Root directory expected!")
if not S_ISDIR(os.stat(sys.argv[1]).st_mode):
    raise SyntaxError("A directory is wanted!")

e=entry(sys.argv[1], "D", 0)
register(None, e, 0)
walk_tree(sys.argv[1], e, 1, register)

db_con.commit()
db_cur.close()
db_con.close()
This script is for Python 3.2 and is based on the example from the official Python documentation.
Hope this clarifies things for you.

Related

PostgreSQL merge recursive query and JOIN

I have the following schema:
CREATE TABLE tbl_employee_team
(
employee_id int,
teams_id int
);
INSERT INTO tbl_employee_team
VALUES
(1, 2),
(1, 3),
(1, 4);
CREATE TABLE tbl_team_list_serv
(
service_id int,
team_id int
);
INSERT INTO tbl_team_list_serv
VALUES
(7, 2),
(9, 3),
(10, 4);
CREATE TABLE tbl_service
(
id int,
parent int
);
INSERT INTO tbl_service
VALUES
(5, null),
(6, 5),
(7, 6),
(8, null),
(9, 8),
(10, null);
For the sake of simplicity I declared:
1 as employee_id
2, 3, 4 as team_id
5 -> 6 -> 7 as service (5 is the main service)
8 -> 9 (8 is the main service)
10 (10 is the main service)
To retrieve the services the employee belongs to I query
SELECT ls.service_id FROM tbl_team_list_serv ls
JOIN tbl_employee_team t ON ls.team_id=t.teams_id WHERE t.employee_id = 1
To get the main service from the services I use
WITH RECURSIVE r AS
(
SELECT id, parent, 1 AS level
FROM tbl_service
WHERE id = 7 /*(here's I need to assign to every id from the JOIN)*/
UNION
SELECT tbl_service.id, tbl_service.parent, r.level + 1 AS level
FROM tbl_service
JOIN r
ON r.parent = tbl_service.id
)
SELECT id FROM r WHERE r.level = (SELECT max(level) FROM r)
My question is how do I merge the two queries?
Based on the data above I want to finally get a list of ids which is in this case:
5, 8, 10
Also, I want my recursive query to return the last row (I don't think that the solution with level is elegant)
SQLFiddle can be found here
Thanks in advance
I feel like you already did most of the work for this question. This is just a matter of the following tweaks:
Putting the logic for the first query in the anchor part of the CTE.
Adding the original service id as a column to remember the hierarchy.
Tweaking the final logic to get one row per original service.
As a query:
WITH RECURSIVE r AS (
SELECT ls.service_id as id, s.parent, 1 as level, ls.service_id as orig_service_id
FROM tbl_team_list_serv ls JOIN
tbl_employee_team t
ON ls.team_id = t.teams_id JOIN
tbl_service s
ON ls.service_id = s.id
WHERE t.employee_id = 1
UNION ALL
SELECT s.id, s.parent, r.level + 1 AS level, r.orig_service_id
FROM tbl_service s JOIN
r
ON r.parent = s.id
)
SELECT r.id
FROM (SELECT r.*,
MAX(level) OVER (PARTITION BY orig_service_id) as max_level
FROM r
) r
WHERE r.level = max_level;
Here is a db<>fiddle.
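To make the logic of the merged query easier to follow, here is a minimal Python sketch of the same steps: find the employee's services, then follow each one's parent links up to its main service. The function name `main_services` and the in-memory structures are assumptions made for illustration; the data is copied from the question, and the sketch assumes the parent links contain no cycles.

```python
def main_services(employee_id, employee_team, team_service, parent_of):
    """Return the main (root) service for each service reachable by the employee."""
    teams = {team for emp, team in employee_team if emp == employee_id}
    services = [svc for svc, team in team_service if team in teams]
    roots = []
    for service in services:
        node = service
        # Follow parent links until a service without a parent is reached.
        while parent_of[node] is not None:
            node = parent_of[node]
        roots.append(node)
    return roots

employee_team = [(1, 2), (1, 3), (1, 4)]        # (employee_id, teams_id)
team_service = [(7, 2), (9, 3), (10, 4)]        # (service_id, team_id)
parent_of = {5: None, 6: 5, 7: 6, 8: None, 9: 8, 10: None}

print(main_services(1, employee_team, team_service, parent_of))   # [5, 8, 10]
```

Keeping track of the starting service while walking upwards plays the same role as the orig_service_id column in the CTE.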

Comparing a value of a row with the value of the previous row

I have a table in SQL Server that stores geology samples, and there is a rule that must be adhered to.
The rule is simple, a "DUP_2" sample must always come after a "DUP_1" sample (sometimes they are loaded inverted)
CREATE TABLE samples (
id INT
,name VARCHAR(5)
);
INSERT INTO samples VALUES (1, 'ASSAY');
INSERT INTO samples VALUES (2, 'DUP_1');
INSERT INTO samples VALUES (3, 'DUP_2');
INSERT INTO samples VALUES (4, 'ASSAY');
INSERT INTO samples VALUES (5, 'DUP_2');
INSERT INTO samples VALUES (6, 'DUP_1');
INSERT INTO samples VALUES (7, 'ASSAY');
id name
1 ASSAY
2 DUP_1
3 DUP_2
4 ASSAY
5 DUP_2
6 DUP_1
7 ASSAY
In this example I would like to show all rows where name equals 'DUP_2' and the name of the preceding row (by id) is different from 'DUP_1'.
In this case, it would be row 5 only.
I would appreciate very much if you help me.
You can use the LAG() window function or you can use LEAD() - they are identical except for the way in which they are ordered. That is - LAG(name) OVER ( ORDER BY id ) is the same as LEAD(name) OVER ( ORDER BY id DESC ). (You can read more about these functions here.)
WITH s1 ( id, name, prior_name ) AS (
SELECT id, name, LAG(name) OVER ( ORDER BY id ) AS prior_name
FROM samples
)
SELECT id, name
FROM s1
WHERE name = 'DUP_2'
AND COALESCE(prior_name, 'DUMMY') != 'DUP_1';
The reason for the COALESCE() at the end with the DUMMY value is that the first value won't have a LAG(); it will be NULL; and we want to return the DUP_2 record in this case since it doesn't follow a DUP_1 record.
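As a rough illustration of what LAG(name) OVER (ORDER BY id) contributes, the following Python sketch pairs every row with the name of its predecessor and applies the same filter. The helper name `misplaced_dup2` is hypothetical. One difference worth noting: in Python `None != "DUP_1"` is simply True, whereas in SQL a comparison with NULL is not true, which is why the query needs COALESCE (or an explicit IS NULL test).

```python
def misplaced_dup2(samples):
    """samples: list of (id, name). Returns ids of DUP_2 rows not preceded by DUP_1."""
    ordered = sorted(samples)                                   # ORDER BY id
    # Shift the names down by one row; the first row has no predecessor.
    prior_names = [None] + [name for _id, name in ordered[:-1]]
    return [
        sample_id
        for (sample_id, name), prior in zip(ordered, prior_names)
        if name == "DUP_2" and prior != "DUP_1"
    ]

samples = [
    (1, "ASSAY"), (2, "DUP_1"), (3, "DUP_2"), (4, "ASSAY"),
    (5, "DUP_2"), (6, "DUP_1"), (7, "ASSAY"),
]
print(misplaced_dup2(samples))   # [5]
```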
You can use lag():
select s.*
from (select s.*,
lag(name) over (order by id) as prev_name
from samples s
) s
where name = 'DUP_2' and (prev_name <> 'DUP_1' or prev_name is null)

Presto filter an array during aggregation

I would like to filter an aggregated array depending on all values associated with an id. The values are strings and can be of three types: all-x:y, x:y and empty (here x and y are arbitrary substrings of values).
I have a few conditions:
If an id has x:y then the result should contain x:y.
If an id always has all-x:y then the resulting aggregation should have all-x:y
If an id sometimes has all-x:y then the resulting aggregation should have x:y
For example with the following
WITH
my_table(id, my_values) AS (
VALUES
(1, ['all-a','all-b']),
(2, ['all-c','b']),
(3, ['a','b','c']),
(1, ['all-a']),
(2, []),
(3, ['all-c']),
),
The result should be:
(1, ['all-a','b']),
(2, ['c','b']),
(3, ['a','b','c']),
I have worked multiple hours on this but it seems like it's not feasible.
I came up with the following, but it feels like it cannot work because I cannot check the presence of all-x in all arrays, which is what would go in <<IN ALL>>:
SELECT
id,
SET_UNION(
CASE
WHEN SPLIT_PART(my_table.values,'-',1) = 'all' THEN
CASE
WHEN <<my_table.values IN ALL>> THEN my_table.values
ELSE REPLACE(my_table.values,'all-')
END
ELSE my_table.values
END
) AS values
FROM my_table
GROUP BY 1
I would need to check that all arrays values for the specific id contains all-x and that's where I'm struggling to find a solution.
After a few hours of searching how to do so I am starting to believe that it is not feasible.
Any help is appreciated. Thank you for reading.
This should do what you want:
WITH my_table(id, my_values) AS (
VALUES
(1, array['all-a','all-b']),
(2, array['all-c','b']),
(3, array['a','b','c']),
(1, array['all-a']),
(2, array[]),
(3, array['all-c'])
),
with_group_counts AS (
SELECT *, count(*) OVER (PARTITION BY id) group_count -- to see if the number of all-X occurrences match the number of rows for a given id
FROM my_table
),
normalized AS (
SELECT
id,
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'), -- if its an all-X value and every original row for the given id contains it ...
value,
if(starts_with(value, 'all-'), substr(value, 5), value)) AS extracted
FROM with_group_counts CROSS JOIN UNNEST(with_group_counts.my_values) t(value)
)
SELECT id, array_agg(DISTINCT extracted)
FROM normalized
GROUP BY id
The trick is to compute the number of total rows for each id in the original table via the count(*) OVER (PARTITION BY id) expression in the with_group_counts subquery. We can then use that value to determine whether a given value should be treated as an all-x or the x should be extracted. That's handled by the following expression:
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'),
value,
if(starts_with(value, 'all-'), substr(value, 5), value))
For more information about window functions in Presto, check out the documentation. You can find the documentation for UNNEST here.
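The counting trick can also be sketched in Python, which may make the two window counts easier to see. This is an illustration only: `normalize` is a hypothetical helper, and the two Counter objects stand in for count(*) OVER (PARTITION BY id) and count(*) OVER (PARTITION BY id, value). Like the SQL version, it assumes a value does not repeat within a single array.

```python
from collections import Counter, defaultdict

def normalize(rows):
    """rows: list of (id, list_of_values). Returns {id: set_of_values}."""
    # Number of original rows per id.
    group_count = Counter(row_id for row_id, _values in rows)
    # Number of rows per id in which each value occurs (after unnesting).
    value_count = Counter(
        (row_id, value) for row_id, values in rows for value in values
    )
    result = defaultdict(set)
    for (row_id, value), n in value_count.items():
        if value.startswith("all-") and n != group_count[row_id]:
            # The all-x value is missing from at least one row: keep only x.
            value = value[len("all-"):]
        result[row_id].add(value)
    return dict(result)

my_table = [
    (1, ["all-a", "all-b"]), (2, ["all-c", "b"]), (3, ["a", "b", "c"]),
    (1, ["all-a"]), (2, []), (3, ["all-c"]),
]
print(normalize(my_table))
```

The empty array for id 2 contributes no values but still counts as a row, so all-c is seen in only one of two rows and is reduced to c.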

Recursive query used for transitive closure

I've created a simple example to illustrate transitive closure using recursive queries in PostgreSQL.
However, something is off with my recursive query. I'm not familiar with the syntax yet so this request may be entirely noobish of me, and for that I apologize in advance. If you run the query, you will see that node 1 repeats itself in the path results. Can someone please help me figure out how to tweak the SQL?
/*          1
          /   \
         2     3
        / \   /
       4   5 6
      /
     7
    / \
   8   9
*/
create table account(
acct_id INT,
parent_id INT REFERENCES account(acct_id),
acct_name VARCHAR(100),
PRIMARY KEY(acct_id)
);
insert into account (acct_id, parent_id, acct_name) values (1,1,'account 1');
insert into account (acct_id, parent_id, acct_name) values (2,1,'account 2');
insert into account (acct_id, parent_id, acct_name) values (3,1,'account 3');
insert into account (acct_id, parent_id, acct_name) values (4,2,'account 4');
insert into account (acct_id, parent_id, acct_name) values (5,2,'account 5');
insert into account (acct_id, parent_id, acct_name) values (6,3,'account 6');
insert into account (acct_id, parent_id, acct_name) values (7,4,'account 7');
insert into account (acct_id, parent_id, acct_name) values (8,7,'account 8');
insert into account (acct_id, parent_id, acct_name) values (9,7,'account 9');
WITH RECURSIVE search_graph(acct_id, parent_id, depth, path, cycle) AS (
SELECT g.acct_id, g.parent_id, 1,
ARRAY[g.acct_id],
false
FROM account g
UNION ALL
SELECT g.acct_id, g.parent_id, sg.depth + 1,
path || g.acct_id,
g.acct_id = ANY(path)
FROM account g, search_graph sg
WHERE g.acct_id = sg.parent_id AND NOT cycle
)
SELECT path[1] as Child,parent_id as Parent,path || parent_id as path FROM search_graph
ORDER BY path[1],depth;
You can simplify in several places (assuming acct_id and parent_id are NOT NULL):
WITH RECURSIVE search_graph AS (
SELECT parent_id, ARRAY[acct_id] AS path
FROM account
UNION ALL
SELECT g.parent_id, sg.path || g.acct_id
FROM search_graph sg
JOIN account g ON g.acct_id = sg.parent_id
WHERE g.acct_id <> ALL(sg.path)
)
SELECT path[1] AS child
, path[array_upper(path,1)] AS parent
, path
FROM search_graph
ORDER BY path;
The columns acct_id, depth, cycle are just noise in your query.
The WHERE condition has to exit the recursion one step earlier, before the duplicate entry from the top node is in the result. That was an "off-by-one" in your original.
The rest is formatting.
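The "stop one step earlier" point can be illustrated with a small Python sketch that builds the same paths: a node is checked against the path before it is appended, so the self-referencing top node never appears twice. The helper name `ancestor_paths` and the dictionary are assumptions for illustration; the data mirrors the account table.

```python
def ancestor_paths(parent_of):
    """parent_of: {acct_id: parent_id}. Returns every path from a node upwards."""
    paths = []
    for start in parent_of:
        path = [start]
        paths.append(list(path))
        nxt = parent_of[start]
        # Check membership *before* appending, so no node is ever repeated.
        while nxt in parent_of and nxt not in path:
            path.append(nxt)
            paths.append(list(path))
            nxt = parent_of[nxt]
    return paths

parent_of = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 4, 8: 7, 9: 7}
paths = ancestor_paths(parent_of)
print(len(paths))            # 28
print([7, 4, 2, 1] in paths) # True
```

Checking after appending instead would produce paths such as [2, 1, 1], which is the duplicate seen in the original query.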
If you know the only possible cycle in your graph is a self-reference, we can have that cheaper:
WITH RECURSIVE search_graph AS (
SELECT parent_id, ARRAY[acct_id] AS path, acct_id <> parent_id AS keep_going
FROM account
UNION ALL
SELECT g.parent_id, sg.path || g.acct_id, g.acct_id <> g.parent_id
FROM search_graph sg
JOIN account g ON g.acct_id = sg.parent_id
WHERE sg.keep_going
)
SELECT path[1] AS child
, path[array_upper(path,1)] AS parent
, path
FROM search_graph
ORDER BY path;
SQL Fiddle.
Note there would be problems (at least up to pg v9.4) for data types with a modifier (like varchar(5)) because array concatenation loses the modifier but the rCTE insists on types matching exactly:
Surprising results for data types with type modifier
You have account 1 set as its own parent. If you set that account's parent to null, you can avoid having that account as both the start and end node (the way your logic is set up, you'll include a cycle but then won't add on to that cycle, which seems reasonable). It also looks a little nicer to change your final "path" column to something like case when parent_id is not null then path || parent_id else path end, to avoid having the null at the end.

postgres - with recursive

I expected the following to return all the tuples, resolving each parent in the hierarchy up to the top, but it only returns the lowest levels (whose ID is specified in the query). How do I return the whole tree for a given level_id?
create table level(
level_id int,
level_name text,
parent_level int);
insert into level values (197,'child',177), ( 177, 'parent', 3 ), ( 2, 'grandparent', null );
WITH RECURSIVE recursetree(level_id, levelparent) AS (
SELECT level_id, parent_level
FROM level
where level_id = 197
UNION ALL
SELECT t.level_id, t.parent_level
FROM level t, recursetree rt
WHERE rt.level_id = t.parent_level
)
SELECT * FROM recursetree;
First of all, your (2, 'grandparent', null) should be (3, 'grandparent', null) if it really is a grandparent. Secondly, your (implicit) join condition in the recursive half of your query is backwards: you want to get the parent out of rt.levelparent rather than t.parent_level:
WITH RECURSIVE recursetree(level_id, levelparent) AS (
SELECT level_id, parent_level
FROM level
WHERE level_id = 197
UNION ALL
SELECT t.level_id, t.parent_level
FROM level t JOIN recursetree rt ON rt.levelparent = t.level_id
-- join condition fixed and ANSI-ified above
)
SELECT * FROM recursetree;
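A quick way to try the corrected join direction without a PostgreSQL server is to run it against an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite stands in for PostgreSQL (its WITH RECURSIVE support is sufficient for this query), and the grandparent row uses id 3 as suggested above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE level (level_id INT, level_name TEXT, parent_level INT);
    INSERT INTO level VALUES
        (197, 'child', 177), (177, 'parent', 3), (3, 'grandparent', NULL);
""")

# Walk upwards: each step looks up the row whose id is the current row's parent.
rows = con.execute("""
    WITH RECURSIVE recursetree(level_id, levelparent) AS (
        SELECT level_id, parent_level
        FROM level
        WHERE level_id = 197
        UNION ALL
        SELECT t.level_id, t.parent_level
        FROM level t JOIN recursetree rt ON rt.levelparent = t.level_id
    )
    SELECT level_id, levelparent FROM recursetree
""").fetchall()
con.close()

print(rows)
```

The recursion ends at the grandparent because its levelparent is NULL, and NULL never matches any level_id in the join.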