SQL join within join?

I need your help building a SQL statement I can't wrap my head around.
In a database, I have four tables: files, folders, folder_files and links.
I have many files. One of them is called "myFile.txt".
I have many folders. "myFile.txt" is in some of them. The first folder it appears in is called "firstFolder".
I have many links to many folders. The first link to "firstFolder" is called "firstLink".
The data structure for the example would be:
// files
Id: 10
Name: "myFile.txt"
// folders
Id: 20
Name: "firstFolder"
// folder_files (join table)
Id: 30
Folder_Id: 20 (meaning "firstFolder")
File_Id: 10 (meaning "myFile.txt")
// links
Id: 40
Name: "firstLink"
Folder_Id: 20 (meaning "firstFolder")
FIRST QUESTION: How do I get the record for "myFile.txt" AND the Name and Id of "firstLink" (the first link), querying on file Id = 10, based on the lowest Id of the folder and the link?
SECOND QUESTION: How do I get the record for "myFile.txt" AND the Name and Id of "firstLink" (the first link), querying on all files, based on the lowest Id of the folder and the link?
Put another way: how do I get the first link to the first folder containing "myFile.txt"?
Resulting in a record that looks like:
Id: 10
Name: "myFile.txt"
LinkId: 40
LinkName: "firstLink"
Thanks!

You should try to think about how you want your result set to look. SQL is designed to describe result sets. If you can write out a hypothetical result set, you might have an easier time writing SQL that will render that result set.
I had a hard time understanding what you are looking for, but I'm sure it's a fairly straightforward problem. I could help you more easily if you described your results more clearly, although you might not need my help anymore!
For example (going with your original schema), Q1 and Q2 both return:
files.Id, files.Name, links.Id, links.Name (4 columns)
Q1:
SELECT
files.Id, files.Name, links.Id AS LinkId, links.Name AS LinkName
FROM
files
INNER JOIN
folder_files
ON files.Id = folder_files.File_Id
INNER JOIN
links
ON links.Folder_Id = folder_files.Folder_Id
WHERE
files.Id = 10
ORDER BY
folder_files.Folder_Id ASC, links.Id ASC
LIMIT 1;
(JOIN with folders table not necessary)
Q2:
Change both ASC to DESC

This selects all links for file id 10:
select links.id, links.name
from files
left join folder_files on files.id = folder_files.file_id
left join folders on folder_files.folder_id = folders.id
left join links on links.folder_id = folders.id
where files.id=10;
Change the where clause, add limit or whatever for other things you want. It should be simple to modify this.

I would try this:
select f.*
, l.Id as LinkId
, l.Name as LinkName
from links l
inner join Folder_Files ff on ff.Folder_Id = l.Folder_Id
inner join Files f on f.Id = ff.File_Id
where f.Id = 10
Resulting in:
Id | Name | LinkId | LinkName
10 | myFile.txt | 40 | firstLink
Is this what you want?

Taking into account:
more folders per file
more links per folder
taking the lowest id folder for link, and lowest id link for folder
With help of: mysql: group by ID, get highest priority per each ID
The answer for ALL files in the files table (go for JohnB's solution for a single file; it would be faster):
SELECT file_id, file_name, link_id, link_name FROM (
SELECT file_id, file_name, link_id, link_name,
@r := CASE WHEN @prev_file_id = file_id
THEN @r + 1
ELSE 1
END AS r,
@prev_file_id := file_id
FROM (
SELECT
f.id as file_id, f.name as file_name, l.id as link_id, l.name as link_name
FROM files f
JOIN folder_files ff
ON ff.file_id = f.id
JOIN links l
ON l.folder_id = ff.folder_id
ORDER BY f.id, ff.folder_id, l.id -- keep each file's rows together; first folder first, first link to first folder second
) derived1,
(SELECT @prev_file_id := NULL, @r := 0) vars
) derived2
WHERE r = 1;
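For what it's worth, on engines that support window functions (MySQL 8.0+, SQLite 3.25+, PostgreSQL) the same "first link of the first folder, per file" result can be had without user variables. This is only a sketch of that alternative, not the method used above. It runs against an in-memory SQLite database through Python's sqlite3 module so it can be tried as is; the second file, second folder and extra links are made-up rows added for illustration.

```python
import sqlite3

# In-memory database with the question's schema. The rows beyond
# myFile.txt / firstFolder / firstLink are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE folders (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE folder_files (id INTEGER PRIMARY KEY, folder_id INTEGER, file_id INTEGER);
CREATE TABLE links (id INTEGER PRIMARY KEY, name TEXT, folder_id INTEGER);

INSERT INTO files VALUES (10, 'myFile.txt'), (11, 'otherFile.txt');
INSERT INTO folders VALUES (20, 'firstFolder'), (21, 'secondFolder');
INSERT INTO folder_files VALUES (30, 20, 10), (31, 21, 10), (32, 21, 11);
INSERT INTO links VALUES (40, 'firstLink', 20), (41, 'secondLink', 20),
                         (42, 'thirdLink', 21);
""")

# Number each file's (folder, link) combinations, lowest folder id first,
# then lowest link id, and keep only the first combination per file.
query = """
SELECT file_id, file_name, link_id, link_name
FROM (
    SELECT f.id AS file_id, f.name AS file_name,
           l.id AS link_id, l.name AS link_name,
           ROW_NUMBER() OVER (
               PARTITION BY f.id
               ORDER BY ff.folder_id, l.id
           ) AS rn
    FROM files f
    JOIN folder_files ff ON ff.file_id = f.id
    JOIN links l ON l.folder_id = ff.folder_id
) ranked
WHERE rn = 1
ORDER BY file_id;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that because of the inner joins, a file whose lowest-id folder has no links falls through to its next folder that does have one, and a file with no linked folder at all is not returned.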

Related

How to format results in Postgres?

I need to show the location of content within a course that lives in a Virtual Learning Environment (VLE) - basically it is a teaching and learning website that is backed by a Postgres database.
I created the following query that returns the raw data. A sample of this is below
WITH RECURSIVE folders AS(
SELECT ccs.pk1, ccs.parent_pk1, ccs.position, ccs.folder_ind, ccs.title
FROM course_contents ccs
INNER JOIN course_main cm ON cm.pk1 = ccs.crsmain_pk1
WHERE cm.course_id = '<COURSE ID>'
and ccs.pk1 = '<INDEX>' -- primary key
UNION ALL
SELECT cc.pk1, cc.parent_pk1, cc.position, cc.folder_ind, cc.title
FROM course_contents cc
JOIN folders ON folders.pk1 = cc.parent_pk1
)
SELECT f.*
FROM folders f
SAMPLE DATA
PK1 | PARENT_PK1 | POSITION | FOLDER | TITLE
11497702 | NULL | 0 | Y | Assessment
11497708 | 11497702 | 0 | N | Using the Assessment Tools
11497709 | 11497702 | 1 | N | Past Exams Papers
11497710 | 11497702 | 2 | N | Using the Assessment Tools
11497711 | 11497702 | 3 | N | Past Exams Papers
I would like to display it like the table below. Something to note is that there are multiple levels - files in folders, and folders in folders, etc.
PATH
Assessment - Using the Assessment Tools
Assessment - Past Exams Papers
Assessment - Using the Assessment Tools
Assessment - Past Exams Papers
I've used the following line
SELECT STRING_AGG(f.title,' - ') AS path
FROM folders f
GROUP BY f.parent_pk1
but it displays like
PATH
Assessment
Using the Assessment Tools - Past Exams Papers - Using the Assessment Tools - Past Exams Papers
Any help would be very much appreciated.
Thanks
Hmmm . . . I think you want something like this rather than aggregation:
select f.path
from (select f.folder_ind,
first_value(f.title) over (order by position) || '-' || first_value(f.title) over (order by position desc) as path
from folders f
) f
where f.folder_ind = 'N';
Thanks to Gordon's answer above I have found the answer I was looking for.
I made a few changes to Gordon's code. Please see below. I do need to test this with more data though to see if it fully replicates the content structure.
select f.*
from (select first_value(f.title) over (order by position) || ' > ' ||
f.title
from folders f
where f.folder_ind = 'N'
) f
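Since the question mentions folders inside folders, another option is to build the path inside the recursive CTE itself, appending each title to its parent's path. This is a rough sketch of that idea, not the code from the answers above. It uses SQLite through Python's sqlite3 module so it can be run directly (the `||` concatenation and `WITH RECURSIVE` work the same way in Postgres). Starting from `parent_pk1 IS NULL` instead of the course id / pk1 filter is an assumption, and the 'Week 1' folder and 'Quiz' item are invented rows to show a second level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course_contents (
    pk1 INTEGER PRIMARY KEY, parent_pk1 INTEGER,
    position INTEGER, folder_ind TEXT, title TEXT);
INSERT INTO course_contents VALUES
    (11497702, NULL,     0, 'Y', 'Assessment'),
    (11497708, 11497702, 0, 'N', 'Using the Assessment Tools'),
    (11497709, 11497702, 1, 'N', 'Past Exams Papers'),
    -- hypothetical nested folder and item, not in the original sample
    (11497720, 11497702, 4, 'Y', 'Week 1'),
    (11497721, 11497720, 0, 'N', 'Quiz');
""")

# The anchor row starts the path with the root title; every recursive
# step appends the child's title to the path of its parent.
query = """
WITH RECURSIVE folders AS (
    SELECT pk1, folder_ind, title AS path
    FROM course_contents
    WHERE parent_pk1 IS NULL
    UNION ALL
    SELECT cc.pk1, cc.folder_ind, folders.path || ' - ' || cc.title
    FROM course_contents cc
    JOIN folders ON folders.pk1 = cc.parent_pk1
)
SELECT path FROM folders WHERE folder_ind = 'N' ORDER BY pk1;
"""
paths = [row[0] for row in conn.execute(query)]
for p in paths:
    print(p)
```

Filtering on folder_ind = 'N' only in the final SELECT keeps the folder rows available while the paths are being built.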

Counting between two different tables Oracle

I have two ORACLE tables, FOLDER and FILES. Each folder contains several files.
I am trying to count how many folders contain each number of files, i.e. the number of folders x that contain y files.
For example, 50 folders contain 10 files, 35 folders contain 8 files, and so on.
Can I get some help with the query, please? My attempt:
select count(fl.id_folder) ,count(fi.fileID) from FOLDER fl inner join FILES fi on fl.id_folder=fi.fileID group by fl.id_folder;
You can use two levels of aggregation. Assuming that table files has a column called id_folder, you would do:
select cnt_files, count(*) cnt_folders
from (
select count(*) cnt_files
from files
group by id_folder
) t
group by cnt_files
We can write the query using group by as follows:
Select cnt_files, count(1) as num_of_folders
from
(select fl.id_folder, count(fi.fileid) as cnt_files
from FOLDER fl
Left join FILES fi on fl.id_folder=fi.id_folder
Group by fl.id_folder)
Group by cnt_files;
Note: I have used the LEFT JOIN to consider all the folders (with and without files in them).
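To see the two levels of aggregation on a small data set, here is a sketch that runs in SQLite through Python's sqlite3 module. It assumes, as the answers do, that the files table carries an id_folder column; the four folders and five files are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE folder (id_folder INTEGER PRIMARY KEY);
CREATE TABLE files (fileid INTEGER PRIMARY KEY, id_folder INTEGER);
INSERT INTO folder VALUES (1), (2), (3), (4);
-- folders 1 and 2 hold two files each, folder 3 holds one, folder 4 is empty
INSERT INTO files (id_folder) VALUES (1), (1), (2), (2), (3);
""")

# Inner level: files per folder (LEFT JOIN keeps the empty folder with 0).
# Outer level: how many folders share each file count.
query = """
SELECT cnt_files, COUNT(*) AS cnt_folders
FROM (
    SELECT fl.id_folder, COUNT(fi.fileid) AS cnt_files
    FROM folder fl
    LEFT JOIN files fi ON fi.id_folder = fl.id_folder
    GROUP BY fl.id_folder
) t
GROUP BY cnt_files
ORDER BY cnt_files;
"""
distribution = conn.execute(query).fetchall()
for cnt_files, cnt_folders in distribution:
    print(f"{cnt_folders} folder(s) contain {cnt_files} file(s)")
```

Counting fi.fileid rather than `*` in the inner query is what makes the empty folder come out as 0 instead of 1.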

Get content data from specific files from bigquery-public-data:github_repos different results with JOIN and WHERE

The most common way of getting content data for specific files in bigquery-public-data:github_repos by name is like this:
SELECT *
FROM [bigquery-public-data:github_repos.sample_contents]
WHERE id IN (SELECT id FROM (
SELECT *
FROM [bigquery-public-data:github_repos.sample_files]
WHERE path = 'README.md'
))
This query gives me 14557 results.
I thought that running the query below would give me the same number of results:
SELECT contents.*
FROM [bigquery-public-data:github_repos.sample_contents] contents
INNER JOIN [bigquery-public-data:github_repos.sample_files] files
ON contents.id = files.id
WHERE files.path = 'README.md'
But it ends up with 14645 results.
Why is there a difference between these two results, and which one is the proper one for selecting the content data of README.md files?
EDIT:
It looks like forked files without modification have the same id across other repos (forks).
The first query returns each content row whose id matches a file with path = 'README.md' exactly once, no matter how many times that file id is present in the files table.
The second query returns the same content row as many times as the respective file id appears in the files table, because of the JOIN.
You can run the query below to validate this:
SELECT EXACT_COUNT_DISTINCT(contents.id)
FROM [bigquery-public-data:github_repos.sample_contents] contents
INNER JOIN [bigquery-public-data:github_repos.sample_files] files
ON contents.id = files.id
WHERE files.path = 'README.md'
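The effect is easy to reproduce on a toy data set. The sketch below is not BigQuery; it uses two tiny stand-in tables in SQLite (via Python's sqlite3 module) with made-up rows, where one README.md blob is shared by a repo and its fork.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Miniature stand-ins for sample_contents and sample_files (hypothetical rows).
CREATE TABLE contents (id TEXT PRIMARY KEY, content TEXT);
CREATE TABLE files (repo_name TEXT, path TEXT, id TEXT);
INSERT INTO contents VALUES ('a1', 'readme A'), ('b2', 'readme B');
INSERT INTO files VALUES
    ('orig/repo',  'README.md', 'a1'),
    ('fork/repo',  'README.md', 'a1'),  -- unmodified fork shares the id
    ('other/repo', 'README.md', 'b2');
""")

# IN: each content row is returned at most once.
in_count = conn.execute("""
    SELECT COUNT(*) FROM contents
    WHERE id IN (SELECT id FROM files WHERE path = 'README.md')
""").fetchone()[0]

# JOIN: one output row per matching (content, file) pair.
join_count = conn.execute("""
    SELECT COUNT(*) FROM contents c
    JOIN files f ON c.id = f.id
    WHERE f.path = 'README.md'
""").fetchone()[0]

# Counting distinct ids over the JOIN brings the number back in line with IN.
distinct_count = conn.execute("""
    SELECT COUNT(DISTINCT c.id) FROM contents c
    JOIN files f ON c.id = f.id
    WHERE f.path = 'README.md'
""").fetchone()[0]

print(in_count, join_count, distinct_count)
```

So the IN form is the one to use if each distinct content should appear once, while the JOIN form is the one to use if a row per repository file is wanted.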

How to group by more than one row value?

I am working with POSTGRESQL and I can't find out how to solve a problem. I have a model called Foobar. Some of its attributes are:
FOOBAR
check_in:datetime
qr_code:string
city_id:integer
In this table there is a lot of redundancy (qr_code is not unique) but that is not my problem right now. What I am trying to get are the foobars that have same qr_code and have been in a well known group of cities, that have checked in at different moments.
I got this by querying:
SELECT * FROM foobar AS a
WHERE a.city_id = 1
AND EXISTS (
SELECT * FROM foobar AS b
WHERE a.check_in < b.check_in
AND a.qr_code = b.qr_code
AND b.city_id = 2
AND EXISTS (
SELECT * FROM foobar as c
WHERE b.check_in < c.check_in
AND c.qr_code = b.qr_code
AND c.city_id = 3
AND EXISTS(...)
)
)
where '...' represents more queries to get more persons with the same qr_code, different check_in date and those well known cities.
My problem is that I want to group this by qr_code, and I want to show the check_in fields of each qr_code like this:
2015-11-11 14:14:14 => [2015-11-11 14:14:14, 2015-11-11 16:16:16, 2015-11-11 17:18:20] (this for each different qr_code)
where the data at the left is the 'smaller' date for that qr_code, and the right part are all the other dates for that qr_code, including the first one.
Is this possible to do with a sql query only? I am asking this because I am actually doing this app with rails, and I know that I can make a different approach with array methods of ruby (a solution with this would be well received too)
You could solve that with a recursive CTE - if I interpret your question correctly:
Assuming you have a given list of cities that must be visited in order by the same qr_code. Your text doesn't say so, but your query indicates as much.
WITH RECURSIVE
c AS (SELECT '{1,2,3}'::int[] AS cities) -- your list of city_id's here
, route AS (
SELECT f.check_in, f.qr_code, 2 AS idx
FROM foobar f
JOIN c ON f.city_id = c.cities[1]
UNION ALL
SELECT f.check_in, f.qr_code, r.idx + 1
FROM route r
JOIN foobar f USING (qr_code)
JOIN c ON f.city_id = c.cities[r.idx]
WHERE r.check_in < f.check_in
)
SELECT qr_code, array_agg(check_in) AS check_in_list
FROM (
SELECT *
FROM route
ORDER BY qr_code, idx -- or check_in
) sub
GROUP BY 1
HAVING count(*) = (SELECT array_length(cities, 1) FROM c);
Provide the list as array in the first (non-recursive) CTE c.
In the recursive part start with any rows in the first city and travel along your array until the last element.
In the final SELECT aggregate your check_in column in order. Only return qr_code that have visited all cities of the array.
Similar:
Recursive query used for transitive closure
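Since an application-side solution was also welcome, here is a rough sketch of the same idea outside SQL. It is written in Python rather than Ruby, and the function name `ordered_visits` and the sample rows are made up for illustration. Unlike the recursive CTE, which follows every possible route, it keeps only one chain per qr_code: the earliest check-in that matches each city in turn.

```python
from collections import defaultdict
from datetime import datetime


def ordered_visits(rows, cities):
    """Return {qr_code: [check_in, ...]} for codes that visited `cities` in order.

    rows is an iterable of (check_in, qr_code, city_id) tuples.
    """
    by_code = defaultdict(list)
    for check_in, qr_code, city_id in sorted(rows):  # sorted by check_in first
        by_code[qr_code].append((check_in, city_id))

    result = {}
    for qr_code, visits in by_code.items():
        matched = []  # check-ins matched so far, one per city, in order
        for check_in, city_id in visits:
            if len(matched) < len(cities) and city_id == cities[len(matched)]:
                # require a strictly later check-in than the previous city's
                if not matched or check_in > matched[-1]:
                    matched.append(check_in)
        if len(matched) == len(cities):
            result[qr_code] = matched
    return result


rows = [
    (datetime(2015, 11, 11, 14, 14, 14), "QR1", 1),
    (datetime(2015, 11, 11, 16, 16, 16), "QR1", 2),
    (datetime(2015, 11, 11, 17, 18, 20), "QR1", 3),
    (datetime(2015, 11, 11, 9, 0, 0), "QR2", 2),   # wrong order for QR2
    (datetime(2015, 11, 11, 10, 0, 0), "QR2", 1),
]
result = ordered_visits(rows, [1, 2, 3])
for qr_code, check_ins in result.items():
    print(qr_code, check_ins[0], "=>", check_ins)
```

The first element of each list is the earliest check-in for that qr_code, which gives the "earliest date => all dates" shape asked for.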

How to find bad references in a table in Oracle

I have a data problem I need to clean up. Basically I have two tables storing "package" information, one table for documents and one table for audit information. I have entries in the package tables that reference documents that no longer exist and have been replaced (same name but different id) and I want to write a query to find all the bad ones and which new document should replace them. The only thing linking these two is a string value in the audit table which stores the document name (not id).
I've set up a sample schema here: http://sqlfiddle.com/#!4/997bda/1
package_s is the single values for a package in our application
package_r is the repeating values for a package in our application
(these are joined with the same value in the id column)
audit_info is all the audit information in a package
docs is all the documents that can be attached to a package
This query finds the packages with bad attachments (may be more than one per package)
select distinct ps.pkgname, pr.doc_list
from package_s ps, package_r pr
where ps.id = pr.id
and not exists (
select 1 from docs
where pr.doc_list = id
)
order by 1,2 asc
;
I need to build a query with the following rules:
I need to return at least the package id, the position value and the new document id (I will build an update statement to put this new document id in the row matching the package id / position in the package_r table)
the way to get the document name from the audit information is:
SUBSTR(description,0,INSTR(description,'[')-2)
If the document was Added and then Removed, it should be ignored (string_1)
string_2 must not be 'Supporting'
the new document must match
state = 'Master'
latest = 1
pub = '0'
Right now I have a semi-working script that works on a per package basis, but the problem is affecting 2000+ packages. I find the audit entries that don't match documents correctly attached to the package and then search for those names in the document table. The problem with this is since there is no direct link between the package and document tables, if there are multiple problem attachments on one package, each "new" document is returned once per position value, i.e.
package id | bad doc id | position | new doc id
p1 | d1 | -1 | d1-new
p1 | d1 | -1 | d4-new
p1 | d4 | -2 | d1-new
p1 | d4 | -2 | d4-new
It doesn't matter which new id goes into which position value, but the duplication result problem like this makes it hard to mass generate update scripts, some manual filtering would be required.
This is a somewhat complex and unique data issue, so any help would be greatly appreciated.
This query works according to the information provided:
with ai as (
select a1.audited_id id, dc.id doc_id, dc.docname,
row_number() over (partition by a1.audited_id order by dc.id) rn
from audit_info a1
join docs dc
on dc.state = 'Master' and dc.latest = 1 and dc.pub = '0'
and dc.docname = substr(a1.description, 1, instr(a1.description, '[')-2)
where string_1 = 'Added' and string_2 <> 'Supporting'
and not exists (
select * from audit_info a2
where a2.audited_id = a1.audited_id and string_1 = 'Removed'
and a2.description = a1.description )
and not exists ( -- here matching docs are eliminated
select 1 from package_r pr
where pr.id = a1.audited_id and pr.doc_list = dc.id ) ),
p as (
select ps.id, ps.pkgname, pr.doc_list, pr.position,
row_number() over (partition by ps.id order by doc_list) rn
from package_s ps
join package_r pr on pr.id = ps.id
where not exists ( select * from docs where pr.doc_list = docs.id )
)
select p.id, p.pkgname, p.doc_list, p.position
, ai.docname, ai.doc_id
from p join ai on ai.id = p.id and p.rn = ai.rn
order by p.id, p.doc_list, ai.doc_id
Output:
ID PKGNAME DOC_LIST POSITION DOCNAME DOC_ID
-- ------- -------- -------- ------- ------
p1 000001 d3 -3 doc3 d3-new
p1 000001 d4 -4 doc4 d4-new
p2 000002 d5 -2 doc5 d5-new
p4 000004 d6 -1 doc6 d6-new
Edit: Answers to issues reported in comments
"it is identifying packages that do not have bad values, and then the doc_list column is blank,"
Note that the query for identifying packages (my subquery p) is basically your query; I just added a counter there.
I guess that some process or application, or someone working manually, cleared the doc_list column in package_r.
If you don't want such entries, just add the condition and trim(doc_list) is not null to subquery p.
"for the ones it gets right on the package part (they have a bad value) it is bringing back the wrong docname/doc_id to replace the bad value with, it is a different doc_id in the list."
I understand this only partially. Can you add such entries to your examples (in the Fiddle, or just edit your question and add the problematic input rows and the expected output for them)?
"It doesn't matter which new id goes into which position value".
I made the assignment this way: if we had two old docs with names "ABC" and "DEF", and the corrected docs have names "XXA" and "DE12",
then they will be linked as "ABC"->"DE12" and "DEF"->"XXA" (alphabetical ordering seems more rational than totally random).
To make the assignment random, change order by ... to order by null in both row_number() functions.
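The pairing trick can be seen in isolation on the four-row duplication example from the question. The sketch below is a simplification, not the full query: `bad_refs` and `new_docs` are made-up stand-ins for what subqueries p and ai produce, and it runs in SQLite (3.25+ for window functions) through Python's sqlite3 module rather than in Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-ins for the results of subqueries p and ai.
CREATE TABLE bad_refs (pkg_id TEXT, doc_list TEXT, position INTEGER);
CREATE TABLE new_docs (pkg_id TEXT, doc_id TEXT);
INSERT INTO bad_refs VALUES ('p1', 'd1', -1), ('p1', 'd4', -2);
INSERT INTO new_docs VALUES ('p1', 'd1-new'), ('p1', 'd4-new');
""")

# Joining on the package id alone pairs every bad reference with every
# candidate document: 2 x 2 = 4 rows, as in the question.
naive = conn.execute("""
    SELECT b.pkg_id, b.doc_list, b.position, n.doc_id
    FROM bad_refs b JOIN new_docs n ON n.pkg_id = b.pkg_id
""").fetchall()

# Numbering both sides per package and joining on that number as well
# pairs the 1st bad reference with the 1st new document, and so on.
paired = conn.execute("""
    WITH b AS (
        SELECT pkg_id, doc_list, position,
               ROW_NUMBER() OVER (PARTITION BY pkg_id ORDER BY doc_list) AS rn
        FROM bad_refs),
    n AS (
        SELECT pkg_id, doc_id,
               ROW_NUMBER() OVER (PARTITION BY pkg_id ORDER BY doc_id) AS rn
        FROM new_docs)
    SELECT b.pkg_id, b.doc_list, b.position, n.doc_id
    FROM b JOIN n ON n.pkg_id = b.pkg_id AND n.rn = b.rn
    ORDER BY b.pkg_id, b.doc_list
""").fetchall()

print(len(naive), "rows without numbering")
for row in paired:
    print(row)
```

If a package has more bad references than candidate documents (or the other way round), the surplus rows simply find no partner in the inner join and drop out.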