Use recursion to get all children of node in BigQuery SQL table

I'm working with a dataset in BigQuery that has parent-child relationships, but doesn't indicate final_parent...
My data looks something like this:
| id | parent |
| -----| --------|
| AA | AB |
| AB | AC |
| .. | .. |
The rows are either questions or answers. All answers roll up to a single question, but you can also answer an answer, so there is a recursive graph structure. What I want is to get all the answers to a single question, starting with the row id of that question.
I generated the following query - I think it is logically correct for the task:
WITH RECURSIVE tbl_1 AS(
(SELECT *
FROM source_table
WHERE (id = xxxxxxxxxxx) OR (parent = xxxxxxxxxxx))
UNION ALL
(SELECT *
FROM source_table
WHERE (parent IN (SELECT DISTINCT id FROM tbl_1)
AND (id NOT IN (SELECT DISTINCT id FROM tbl_1))))
)
SELECT *
FROM tbl_1
However I get the following error...
ERROR:
400 A recursive reference from inside an expression subquery is not allowed at [9:49]
I think this is just something that hasn't been implemented yet in BigQuery? Any advice on how to do it despite this? Thanks so much!!

Try below
with recursive tbl as (
select *, 1 pos from your_table
where question not in (select answer from your_table)
union all
select t1.question, t2.answer, pos + 1
from tbl t1
join your_table t2
on t2.question = t1.answer
)
select question, string_agg(answer order by pos) answers
from tbl
group by question
[The dummy data and output screenshots from the original answer were not preserved.]
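For the id/parent layout in the question, the same idea can be sketched as follows: put the recursive reference in a JOIN instead of an IN subquery. This is only a sketch that runs against SQLite through Python's sqlite3 module, not against BigQuery, and the `descendants` helper and the `depth` column are names made up for illustration. It also assumes the data has no cycles.

```python
import sqlite3

def descendants(rows, root_id):
    """Return (id, parent, depth) for root_id and everything below it.

    rows is a list of (id, parent) tuples; the table name matches the
    question, the helper itself is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_table (id TEXT PRIMARY KEY, parent TEXT)")
    conn.executemany("INSERT INTO source_table VALUES (?, ?)", rows)
    sql = """
    WITH RECURSIVE tree(id, parent, depth) AS (
        -- anchor: the question row itself
        SELECT id, parent, 0 FROM source_table WHERE id = ?
        UNION ALL
        -- recursive step: rows whose parent is already in the tree
        SELECT s.id, s.parent, t.depth + 1
        FROM source_table AS s
        JOIN tree AS t ON s.parent = t.id
    )
    SELECT id, parent, depth FROM tree ORDER BY depth, id
    """
    result = conn.execute(sql, (root_id,)).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    sample = [("AC", None), ("AB", "AC"), ("AA", "AB"), ("ZZ", None), ("ZY", "ZZ")]
    for row in descendants(sample, "AC"):
        print(row)
```

Because each row has a single parent and the join only walks downward, the `id NOT IN (...)` guard from the question is not needed as long as the hierarchy is acyclic.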

Related

Where clause to select rows with only unique values

First, let me describe my problem. I need to ignore all repeated values in my select query. So, for example, if I have something like this:
| Other columns| THE COLUMN I'm working with |
| ............ | Value 1 |
| ............ | Value 2 |
| ............ | Value 2 |
I'd like to get the result containing only the row with "Value 1"
Now because of the specifics of my task I need to validate it with subquery.
So I've figured out something like this:
NOT EXISTS (SELECT 1 FROM TABLE fpd WHERE fpd.value = fp.value HAVING count(*) > 2)
It works like I want, but I'm aware of it being slow. I've also tried putting 1 instead of 2 in the HAVING comparison, but it just returns zero results. Could you explain where the value 2 comes from?
I would suggest window functions:
select t.*
from (select t.*, count(*) over (partition by value) as cnt
from fpd t
) t
where cnt = 1;
Alternatively, you can use not exists with a primary key:
where not exists (select 1
from fpd fpd2
where fpd2.value = fp.value and
fpd2.primarykey <> fp.primarykey
)
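Both suggestions can be checked against the three-row example from the question. The sketch below runs them in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer); the `primarykey` column and the `rows_with_unique_value` helper are assumptions made for the demonstration.

```python
import sqlite3

def rows_with_unique_value(rows):
    """Run both queries on (primarykey, value) rows and return both results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fpd (primarykey INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany("INSERT INTO fpd VALUES (?, ?)", rows)

    # Window function: count rows per value, keep values that occur once.
    window_sql = """
        SELECT primarykey, value
        FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY value) AS cnt FROM fpd AS t)
        WHERE cnt = 1
        ORDER BY primarykey
    """
    # NOT EXISTS: keep a row only if no other row shares its value.
    not_exists_sql = """
        SELECT fp.primarykey, fp.value
        FROM fpd AS fp
        WHERE NOT EXISTS (
            SELECT 1 FROM fpd AS fpd2
            WHERE fpd2.value = fp.value AND fpd2.primarykey <> fp.primarykey
        )
        ORDER BY fp.primarykey
    """
    by_window = conn.execute(window_sql).fetchall()
    by_not_exists = conn.execute(not_exists_sql).fetchall()
    conn.close()
    return by_window, by_not_exists

if __name__ == "__main__":
    print(rows_with_unique_value([(1, "Value 1"), (2, "Value 2"), (3, "Value 2")]))
```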
SELECT DISTINCT myColumn FROM myTable

Hierarchy Without CTE - Get Direct Children

I have a table for assets:
id|name|parentId
The view I'm trying to build is for an asset is:
{
'Id': ......,
'Name': ....,
'ChildrenIds': []
}
I need a query that selects TOP 50 assets and its direct children (so results could be more than 50 results).
I have a CTE that works, but it's slow (5 seconds; parentId and id are indexed):
WITH MyCte as
(
SELECT TOP 50 a.Id, a.Name, a.ParentAssetId
FROM assets a
UNION ALL
SELECT a2.Id, a2.Name, a2.ParentAssetId
FROM assets a2
INNER JOIN MyCte cte ON cte.Id = a2.ParentAssetId
)
SELECT * From MyCte;
This join query does half of what I want.
SELECT TOP 50 a.Id, a.Name, a.ParentAssetId
FROM assets a
LEFT JOIN assets a2 ON a2.ParentAssetId = a.Id
The problem with the JOIN is that it gives me 50 results, and that's it. I need the descendant info to build a view. I could do 2 queries, but I'd rather not do that.
Any suggestions?
Maybe there is a better way for me to build this view? Without the 50 + N requirement? You can use a GROUP BY with STRING_AGG, but I worry about the size limitation.
SAMPLE DATA:
1,Site1,NULL
2,Site2,1
3,Site3,1
4,Site4,2
5,Site5,NULL
TOP 3 ORDER BY id DESC results will return
1,Site1,NULL
2,Site2,1
3,Site3,1
4,Site4,2
BUT I guess ideally something like this:
1,Site1,NULL|2,Site2,1|3,Site3,1
2,Site2,1|4,Site4,2
3,Site3,1
You can use a temp table to achieve what you need.
SELECT TOP (50) a.Id, a.Name, a.ParentAssetId
INTO #Assets
FROM assets a;
INSERT INTO #Assets
SELECT a2.Id, a2.Name, a2.ParentAssetId
FROM #Assets a
JOIN assets a2 ON a2.ParentAssetId = a.Id;
SELECT *
FROM #Assets;
Note that this is not deterministic because there's no ORDER BY when using TOP.
You could use this CTE and make a view from it:
WITH MyCte as
(
SELECT TOP 50 a.Id, a.Name, a.ParentAssetId
FROM assets a
)
SELECT cte.*, a1.Id as ChildId, a1.Name as ChildName
FROM MyCte cte
INNER JOIN assets a1
ON a1.ParentAssetId=cte.Id
Admittedly this will give you a different kind of result set than the UNION CTE in your question, but I'm assuming that you can make a simple adjustment to your consumer application to handle it. It might even be easier/more performant for the app this way, since the relationships are present in the row, and don't have to be extrapolated.
That said, if you are working with a recent enough version of SQL Server, you might look into the built-in JSON functions, since it looks like that is the output you are ultimately trying to generate.
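To illustrate how the joined rows can be folded into the Id/Name/ChildrenIds view from the question, here is a rough sketch. It uses SQLite through Python's sqlite3 module instead of SQL Server, so `ORDER BY id LIMIT ?` stands in for `TOP`, and the `make_db` and `top_assets_with_children` helpers are hypothetical names.

```python
import sqlite3

def make_db():
    """Build an in-memory assets table from the sample data in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assets (id INTEGER PRIMARY KEY, name TEXT, parentId INTEGER)")
    conn.executemany(
        "INSERT INTO assets VALUES (?, ?, ?)",
        [(1, "Site1", None), (2, "Site2", 1), (3, "Site3", 1), (4, "Site4", 2), (5, "Site5", None)],
    )
    return conn

def top_assets_with_children(conn, limit):
    """Return one dict per top asset, with the ids of its direct children."""
    sql = """
    WITH top_assets AS (
        SELECT id, name FROM assets ORDER BY id LIMIT ?
    )
    SELECT t.id, t.name, c.id
    FROM top_assets AS t
    LEFT JOIN assets AS c ON c.parentId = t.id
    ORDER BY t.id, c.id
    """
    views = {}
    for asset_id, name, child_id in conn.execute(sql, (limit,)):
        view = views.setdefault(asset_id, {"Id": asset_id, "Name": name, "ChildrenIds": []})
        if child_id is not None:  # LEFT JOIN yields NULL for assets without children
            view["ChildrenIds"].append(child_id)
    return list(views.values())

if __name__ == "__main__":
    for view in top_assets_with_children(make_db(), 3):
        print(view)
```

The LEFT JOIN keeps assets that have no children, which an INNER JOIN would drop, and the explicit ORDER BY makes the choice of the top rows deterministic.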
According to what you provide, and if I understand, I think you're looking for
WITH CTE AS
(
SELECT TOP 3 *
FROM T
ORDER BY ID DESC
)
SELECT *
FROM CTE
UNION
SELECT *
FROM T
WHERE ID IN (SELECT ParentId FROM CTE);
Returns:
+----+-------+----------+
| ID | Name | ParentId |
+----+-------+----------+
| 1 | Site1 | |
| 2 | Site2 | 1 |
| 3 | Site3 | 1 |
| 4 | Site4 | 2 |
| 5 | Site5 | |
+----+-------+----------+
Here is a db<>fiddle
UPDATE:
Since you're looking for a way to pass an INT value representing the number of rows used in TOP, you can create an inline table-valued function as
CREATE FUNCTION dbo.MyFunction (@Rows INT = 1)
RETURNS TABLE
AS
RETURN
(
WITH CTE AS
(
SELECT TOP (@Rows) *
FROM T
ORDER BY ID DESC
)
SELECT *
FROM CTE
UNION
SELECT *
FROM T
WHERE ID IN (SELECT ParentId FROM CTE)
);
and just call it as
SELECT *
FROM dbo.MyFunction(2)
Demo

Slightly different greatest-n-per-group

I have read this comment, which explains the greatest-n-per-group problem and its solution. Unfortunately, I am facing a slightly different problem, and I am failing to find a solution for it.
Let's suppose I have a table with some basic info regarding users. Due to implementation, this info may or may not repeat itself:
+----+-------------------+----------------+---------------+
| id | user_name | user_name_hash | address |
+----+-------------------+----------------+---------------+
| 1 | peter_jhones | 0xFF321345 | Some Av |
| 2 | sally_whiterspoon | 0x98AB5454 | Certain St |
| 3 | mark_jackobson | 0x0102AB32 | Some Av |
| 4 | mark_jackobson | 0x0102AB32 | Particular St |
+----+-------------------+----------------+---------------+
As you can see, mark_jackobson appears twice, although its address is different in each appearance.
Every now and then, an ETL process queries new user_names and fetches the most recent record of each. Afterwards, it stores the user_name_hash in a table to signal that it has already imported that user_name:
+----------------+
| user_name_hash |
+----------------+
| 0xFF321345 |
| 0x98AB5454 |
+----------------+
Everything begins with the following query:
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table
This way, I am able to select the new hashes from my table. Since I need to query the most recent occurrence of a hash, I wrap it as a sub-query:
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash
Perfect! With the ids of my new users, I can query the addresses as follows:
SELECT
address,
user_name_hash
FROM my_table
WHERE Id IN (
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash)
From my perspective, the above query works, but it does not seem optimal. Reading this comment, I noticed I could query the same data using joins. Since I am failing to write the desired query, could anyone help me out and point me in the right direction?
This is the query I have attempted, without success.
SELECT
tb1.address,
tb1.user_name_hash
FROM my_table tb1
INNER JOIN my_table tb2
ON tb1.user_name_hash = tb2.user_name_hash
LEFT JOIN my_hash_table ht
ON tb1.user_name_hash = ht.user_name_hash AND tb1.id > tb2.id
WHERE ht.user_name_hash IS NULL;
Thanks in advance.
EDIT > I am working with PostgreSQL
I believe you are looking for something like this:
SELECT
address,
user_name_hash
FROM my_table t1
JOIN (
SELECT MAX(id) maxid
FROM my_table t2
WHERE NOT EXISTS (
SELECT 1
FROM my_hash_table t3
WHERE t2.user_name_hash = t3.user_name_hash
)
GROUP BY user_name_hash
) t ON t1.ID = t.maxid
I'm using NOT EXISTS instead of EXCEPT since it is more clear to the optimizer.
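As a quick check, the query above can be run against the sample rows from the question. The sketch below uses SQLite through Python's sqlite3 module instead of PostgreSQL; the `newest_unimported_addresses` helper is a hypothetical name, and the column types are assumptions.

```python
import sqlite3

def newest_unimported_addresses():
    """Return (address, user_name_hash) for the newest row of each new hash."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE my_table (
            id INTEGER PRIMARY KEY, user_name TEXT, user_name_hash TEXT, address TEXT
        );
        CREATE TABLE my_hash_table (user_name_hash TEXT PRIMARY KEY);
        INSERT INTO my_table VALUES
            (1, 'peter_jhones',      '0xFF321345', 'Some Av'),
            (2, 'sally_whiterspoon', '0x98AB5454', 'Certain St'),
            (3, 'mark_jackobson',    '0x0102AB32', 'Some Av'),
            (4, 'mark_jackobson',    '0x0102AB32', 'Particular St');
        INSERT INTO my_hash_table VALUES ('0xFF321345'), ('0x98AB5454');
    """)
    sql = """
    SELECT t1.address, t1.user_name_hash
    FROM my_table AS t1
    JOIN (
        -- newest id per hash, restricted to hashes not yet imported
        SELECT MAX(t2.id) AS maxid
        FROM my_table AS t2
        WHERE NOT EXISTS (
            SELECT 1 FROM my_hash_table AS t3
            WHERE t3.user_name_hash = t2.user_name_hash
        )
        GROUP BY t2.user_name_hash
    ) AS t ON t1.id = t.maxid
    ORDER BY t1.id
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    print(newest_unimported_addresses())
```

Only mark_jackobson's hash is missing from my_hash_table, so the row with the higher id (Particular St) is the one returned.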
You can get a better performance using a left outer join (to get the newest records not already imported) and then compute the max id for these records (subquery in the HAVING clause).
SELECT t1.address,
t1.user_name_hash,
MAX(id) AS maxid
FROM my_table t1
LEFT JOIN my_hash_table th ON t1.user_name_hash = th.user_name_hash
WHERE th.user_name_hash IS NULL
GROUP BY t1.address,
t1.user_name_hash
HAVING MAX(t1.id) = (SELECT MAX(t2.id)
FROM my_table t2
WHERE t2.user_name_hash = t1.user_name_hash)

PostgreSQL if query?

Is there a way to select records using an if statement?
My table looks like this:
id | num | dis
1 | 4 | 0.5234333
2 | 4 | 8.2234
3 | 8 | 2.3325
4 | 8 | 1.4553
5 | 4 | 3.43324
And I want to select the num and dis where dis is the lowest number... So, a query that will produce the following results:
id | num | dis
1 | 4 | 0.5234333
4 | 8 | 1.4553
If you want all the rows with the minimum value within the group:
SELECT id, num, dis
FROM table1 T1
WHERE dis = (SELECT MIN(dis) FROM table1 T2 WHERE T1.num = T2.num)
Or you could use a join to get the same result:
SELECT T1.id, T1.num, T1.dis
FROM table1 T1
JOIN (
SELECT num, MIN(dis) AS dis
FROM table1
GROUP BY num
) T2
ON T1.num = T2.num AND T1.dis = T2.dis
If you only want a single row from each group, even if there are ties then you can use this:
SELECT id, dis, num FROM (
SELECT id, dis, num, ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis) rn
FROM table1
) T1
WHERE rn = 1
Unfortunately this won't be very efficient. If you need something more efficient then please see Quassnoi's page on selecting rows with a groupwise maximum for PostgreSQL. Here he suggests several ways to perform this query and explains the performance of each. The summary from the article is as follows:
Unlike MySQL, PostgreSQL implements several clean and documented ways to select the records holding group-wise maximums, including window functions and DISTINCT ON.
However, due to the lack of the loose index scan support by the PostgreSQL’s optimizer and the less efficient usage of indexes in PostgreSQL, the queries using these functions take too long.
To work around these problems and improve the queries against the low cardinality grouping conditions, a certain solution described in the article should be used.
This solution uses recursive CTE’s to emulate loose index scan and is very efficient if the grouping columns have low cardinality.
Use this:
SELECT DISTINCT ON (num) id, num, dis
FROM tbl
ORDER BY num, dis
Or if you intend to use other RDBMS in future, use this:
select * from tbl a where dis =
(select min(dis) from tbl b where b.num = a.num)
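The portable correlated-subquery form and the ROW_NUMBER() form can be compared on the sample table from the question. This is a sketch that runs in SQLite through Python's sqlite3 module (SQLite 3.25 or newer for window functions); DISTINCT ON is PostgreSQL-specific and is not shown. The `lowest_dis_per_num` helper is a hypothetical name.

```python
import sqlite3

def lowest_dis_per_num():
    """Return the rows with the lowest dis per num, computed two ways."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, num INTEGER, dis REAL)")
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?, ?)",
        [(1, 4, 0.5234333), (2, 4, 8.2234), (3, 8, 2.3325), (4, 8, 1.4553), (5, 4, 3.43324)],
    )
    # Correlated subquery: returns every row tied for the minimum.
    correlated_sql = """
        SELECT id, num, dis
        FROM table1 AS t1
        WHERE dis = (SELECT MIN(dis) FROM table1 AS t2 WHERE t2.num = t1.num)
        ORDER BY id
    """
    # ROW_NUMBER: returns exactly one row per num, even with ties.
    ranked_sql = """
        SELECT id, num, dis
        FROM (
            SELECT id, num, dis,
                   ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis) AS rn
            FROM table1
        )
        WHERE rn = 1
        ORDER BY id
    """
    correlated = conn.execute(correlated_sql).fetchall()
    ranked = conn.execute(ranked_sql).fetchall()
    conn.close()
    return correlated, ranked

if __name__ == "__main__":
    print(lowest_dis_per_num())
```

On this data there are no ties, so both queries return the same two rows; they differ only when several rows share the minimum dis within a num.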
If you need to have IF logic you can use PL/pgSQL.
http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html
But try to solve your issue with SQL first if possible; it will be faster. Use PL/pgSQL only when SQL can't solve your problem.

Concat groups in SQL Server [duplicate]

If I have a table like this:
+------------+
| Id | Value |
+------------+
| 1 | 'A' |
|------------|
| 1 | 'B' |
|------------|
| 2 | 'C' |
+------------+
How can I get a resultset like this:
+------------+
| Id | Value |
+------------+
| 1 | 'AB' |
|------------|
| 2 | 'C' |
+------------+
I know this is really easy to do in MySQL using GROUP_CONCAT, but I need to be able to do it in MSSQL 2005
Thanks
(Duplicate of How to use GROUP BY to concatenate strings in SQL Server?)
For a clean and efficient solution you can create a user-defined aggregate function; there is even an example that does just what you need.
You can then use it like any other aggregate function (with a standard query plan):
This will do:
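SELECT mt.ID,
-- Prefixing each value with ', ' leaves the column unnamed, so FOR XML PATH('')
-- emits plain text; SUBSTRING(..., 3, 2000) then strips the leading ', '.
SUBSTRING((SELECT ', ' + mt2.Value
FROM MyTable AS mt2
WHERE mt2.ID = mt.ID
ORDER BY mt2.Value
FOR XML PATH('')), 3, 2000) AS JoinedValue
FROM MyTable AS mt
GROUP BY mt.ID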
See:
http://blog.shlomoid.com/2008/11/emulating-mysqls-groupconcat-function.html
Often asked here.
The most efficient way is using the FOR XML PATH trick.
This just came to me as one possible solution. I have no idea as to performance, but I thought it would be an interesting way to solve the problem. I tested that it works in a simple situation (I didn't code to account for NULLs). Feel free to give it a test to see if it performs well for you.
The table that I used included an id (my_id). That could really be any column that is unique within the group (grp_id), so it could be a date column or whatever.
;WITH CTE AS (
SELECT
T1.my_id,
T1.grp_id,
CAST(T1.my_str AS VARCHAR(MAX)) AS my_str
FROM
dbo.Test_Group_Concat T1
WHERE NOT EXISTS (SELECT * FROM dbo.Test_Group_Concat T2 WHERE T2.grp_id = T1.grp_id AND T2.my_id < T1.my_id)
UNION ALL
SELECT
T3.my_id,
T3.grp_id,
CAST(CTE.my_str + T3.my_str AS VARCHAR(MAX))
FROM
CTE
INNER JOIN dbo.Test_Group_Concat T3 ON
T3.grp_id = CTE.grp_id AND
T3.my_id > CTE.my_id
WHERE
NOT EXISTS (SELECT * FROM dbo.Test_Group_Concat T4 WHERE
T4.grp_id = CTE.grp_id AND
T4.my_id > CTE.my_id AND
T4.my_id < T3.my_id)
)
SELECT
CTE.grp_id,
CTE.my_str
FROM
CTE
INNER JOIN (SELECT grp_id, MAX(my_id) AS my_id FROM CTE GROUP BY grp_id) SQ ON
SQ.grp_id = CTE.grp_id AND
SQ.my_id = CTE.my_id
ORDER BY
CTE.grp_id
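To see how the recursive CTE accumulates the string, here is a rough port to SQLite run through Python's sqlite3 module. It is not T-SQL: `||` replaces `+` for concatenation and `WITH RECURSIVE` replaces `WITH`. The sample rows mirror the Id/Value table from the question, and the `concat_groups` helper is a hypothetical name.

```python
import sqlite3

def concat_groups(rows):
    """Concatenate my_str per grp_id in my_id order using a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test_group_concat (my_id INTEGER PRIMARY KEY, grp_id INTEGER, my_str TEXT)"
    )
    conn.executemany("INSERT INTO test_group_concat VALUES (?, ?, ?)", rows)
    sql = """
    WITH RECURSIVE cte(my_id, grp_id, my_str) AS (
        -- anchor: the first row (lowest my_id) of each group
        SELECT t1.my_id, t1.grp_id, t1.my_str
        FROM test_group_concat AS t1
        WHERE NOT EXISTS (
            SELECT 1 FROM test_group_concat AS t2
            WHERE t2.grp_id = t1.grp_id AND t2.my_id < t1.my_id
        )
        UNION ALL
        -- recursive step: append the next row of the same group
        SELECT t3.my_id, t3.grp_id, cte.my_str || t3.my_str
        FROM cte
        JOIN test_group_concat AS t3
          ON t3.grp_id = cte.grp_id AND t3.my_id > cte.my_id
        WHERE NOT EXISTS (
            SELECT 1 FROM test_group_concat AS t4
            WHERE t4.grp_id = cte.grp_id
              AND t4.my_id > cte.my_id
              AND t4.my_id < t3.my_id
        )
    )
    -- the row with the highest my_id per group holds the full string
    SELECT cte.grp_id, cte.my_str
    FROM cte
    JOIN (SELECT grp_id, MAX(my_id) AS my_id FROM cte GROUP BY grp_id) AS sq
      ON sq.grp_id = cte.grp_id AND sq.my_id = cte.my_id
    ORDER BY cte.grp_id
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    print(concat_groups([(1, 1, "A"), (2, 1, "B"), (3, 2, "C")]))
```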