I'm trying to model a hierarchy of "settings", where values are defined at a root level and can be overridden by more specific children. If a child does not specify a value, its parent's value should be used.
I'll use a trite example to illustrate the problem. The hierarchy here is only three levels deep, but I would like a solution that works for N levels.
Given that I am storing the information as follows:
id| parent_id| setting
----------------------
1| NULL| false
2| 1| true
3| 2| NULL
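For reference, a minimal setup for this sample data could look like the following (the table name mytable is taken from the recursive query further down; the BOOLEAN column and PostgreSQL syntax are assumptions):
CREATE TABLE mytable
(
    id        INT PRIMARY KEY,
    parent_id INT REFERENCES mytable (id),  -- NULL for the root
    setting   BOOLEAN                       -- NULL means "inherit from the parent"
);

INSERT INTO mytable (id, parent_id, setting) VALUES
    (1, NULL, false),
    (2, 1, true),
    (3, 2, NULL);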
What I want, from a procedural perspective, is this: take a child node in the tree and, if its "setting" value is NULL, look at its parent for a value, recursively, until a value is found or the root is reached. Essentially, for the data above, I want to produce the following set, so I can attach a simple WHERE clause to get the applicable setting for any given id.
id| setting
-----------
1| false
2| true
3| true
I have a view which "flattens" the hierarchy into ancestors and descendants:
ancestor| descendant| ancestor_setting| descendant_setting
----------------------------------------------------------
1| 2| false| true
1| 3| false| NULL
2| 3| true| NULL
NULL| 1| NULL| false
NULL| 2| NULL| true
NULL| 3| NULL| NULL
In this way, you can query all levels of the hierarchy as a set, which I hoped would be useful in getting an answer.
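For completeness, here is one way such a view could be defined, as a sketch assuming the mytable layout above and a dialect with recursive CTEs (e.g. PostgreSQL):
CREATE VIEW hierarchy AS
WITH RECURSIVE closure (ancestor, descendant) AS
(
    SELECT parent_id, id               -- direct parent/child pairs (ancestor is NULL for the root)
    FROM mytable
    UNION ALL
    SELECT m.parent_id, c.descendant   -- walk upwards to each more distant ancestor
    FROM closure c
    JOIN mytable m
    ON m.id = c.ancestor
)
SELECT c.ancestor,
       c.descendant,
       a.setting AS ancestor_setting,
       d.setting AS descendant_setting
FROM closure c
LEFT JOIN mytable a
ON a.id = c.ancestor
JOIN mytable d
ON d.id = c.descendant;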
So far, I've only been able to select a "branch" from the tree using this view:
SELECT COALESCE(ancestor, descendant) id,
CASE WHEN ancestor IS NULL THEN descendant_setting
ELSE ancestor_setting
END setting
FROM hierarchy
WHERE descendant = 3
id| setting
-----------
1| false
2| true
3| NULL
I've tried to think up ways to use this "flattened" structure to form a simple set of joins. Whilst I can get all the records back this way (and then filter them procedurally on a client), I want to know if there's a way to produce the expected set, so that I can get back the expected setting for a single id.
This can be done with a single recursive CTE that walks down from the root, carrying each parent's resolved setting along the way:
WITH RECURSIVE
q AS
(
    SELECT id, setting
    FROM mytable
    WHERE parent_id IS NULL
    UNION ALL
    SELECT m.id, COALESCE(m.setting, q.setting) -- inherit the parent's resolved value when NULL
    FROM q
    JOIN mytable m
    ON m.parent_id = q.id
)
SELECT *
FROM q
This produces exactly the expected set, so the applicable setting for any single node is just a matter of filtering the final SELECT, e.g. WHERE id = 3.
I have a parent-child id_table hierarchy - e.g.
|parent|child|
|------|-----|
| | 0|
| 0| 1|
| 0| 2|
| 0| 3|
| 1| 4|
| 1| 5|
| 2| 6|
| 4| 7|
| 4| 8|
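A minimal setup for this sample data might look like the following (the string column type is an assumption, inferred from the t1.child = '0' predicate below, as is support for multi-row VALUES):
CREATE TABLE id_table
(
    parent VARCHAR(10),          -- NULL for the root
    child  VARCHAR(10) NOT NULL
);

INSERT INTO id_table (parent, child) VALUES
    (NULL, '0'),
    ('0', '1'), ('0', '2'), ('0', '3'),
    ('1', '4'), ('1', '5'),
    ('2', '6'),
    ('4', '7'), ('4', '8');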
I'm building a visual tree hierarchy, where above data would be formatted as:
|parent|child1|child2|child3
|------|------|------|------
|     0|     1|     4|     7
|     0|     1|     4|     8
|     0|     1|     5|
|     0|     2|     6|
|     0|     3|      |
Now I want to modify this query to include a standalone row for each parent, without the child, so the above data would become:
|parent|child1|child2|child3
|------|------|------|------
| 0| | |
| 0| 1| |
| 0| 1| 4|
| 0| 1| 4| 7
| 0| 1| 4| 8
| 0| 1| 5|
| 0| 2| |
| 0| 2| 6|
| 0| 3| |
To get the first result, I am building the data with repeated left joins (using the first data example above), as to my understanding I can't do this with recursion, e.g.:
SELECT t1.child AS parent,
       t2.child AS child1,
       t3.child AS child2,
       t4.child AS child3
FROM id_table t1
LEFT JOIN id_table t2
    ON t1.child = t2.parent
LEFT JOIN id_table t3
    ON t2.child = t3.parent
LEFT JOIN id_table t4
    ON t3.child = t4.parent
WHERE t1.child = '0'
This gets me the second example, but I'm lacking a record for each parent as well, as shown in the third example.
I assume this is probably a simple question, I'm just struggling with the syntax. TIA for any help.
EDIT: I had a prior question for a similar implementation in SAS EG: SQL - Recursive Tree Hierarchy with Record at Each Level. However, that was with the SAS SQL implementation, which is much more restricted; with that method I eventually had to create temp tables at each level and then union the end result, which was messy. I'm trying to find a cleaner solution.
GROUP BY ROLLUP can be used to create those extra rows:
SELECT DISTINCT -- ROLLUP emits duplicate subtotal rows; DISTINCT removes them
t1.child AS Parent
,t2.child AS child1
,t3.child AS child2
,t4.child AS child3
-- one more column for each additional level
FROM id_table t1
LEFT JOIN id_table t2
ON t1.child = t2.Parent
LEFT JOIN id_table t3
ON t2.child = t3.Parent
LEFT JOIN id_table t4
ON t3.child = t4.Parent
-- one additional join for each new level
WHERE t1.child = '0'
GROUP BY ROLLUP (t1.child,t2.child,t3.child,t4.child)
HAVING t1.child IS NOT NULL -- drop the grand-total row produced by ROLLUP
Or a recursive query to traverse the hierarchy, build the path, and then split it into columns:
WITH RECURSIVE cte AS
( -- traverse the hierarchy and build the path
SELECT 1 AS lvl
,child
,Cast(child AS VARCHAR(500)) AS Path -- must be large enough for concatenating all levels
FROM id_table
WHERE Parent IS NULL
UNION ALL
SELECT lvl+1
,t.child
,cte.Path || ',' || Trim(t.child)
FROM cte JOIN id_table AS t
ON cte.child = t.Parent
WHERE lvl < 20 -- just in case there's an endless loop
)
SELECT
StrTok(Path, ',', 1) AS parent
,StrTok(Path, ',', 2) AS child1
,StrTok(Path, ',', 3) AS child2
,StrTok(Path, ',', 4) AS child3
-- one additional StrTok for each new level
FROM cte
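For the sample data, the generated paths split into exactly the desired rows, including the standalone row at each parent level (row order may differ):
|parent|child1|child2|child3
|------|------|------|------
|     0|      |      |
|     0|     1|      |
|     0|     1|     4|
|     0|     1|     4|     7
|     0|     1|     4|     8
|     0|     1|     5|
|     0|     2|      |
|     0|     2|     6|
|     0|     3|      |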
I don't know which one is more efficient.
It's hard to put into words what I am trying to do. My knowledge of SQL is too weak to use the right terminology, so I will try to illustrate it with an example.
Say I have a big table consisting of the columns "value", "user_id" and "type":
value|user_id|type|
100| 1| 1|
200| 1| 1|
100| 1| 2|
722| 1| 3|
48| 2| 1|
724| 2| 2|
175| 2| 3|
1) I calculate sum "value" for each "user_id" for each "type".
SELECT SUM("value"), "user_id", "type" from "table" group by "user_id", "type"
giving me
value|user_id|type|
300| 1| 1|
100| 1| 2|
722| 1| 3|
48| 2| 1|
724| 2| 2|
175| 2| 3|
2) I want to obtain rank of "user_id" for each "type" based on the "value".
For type 1, value for user 1 is greater than for user 2, so user 1 ranks as 1 and user 2 ranks 2.
For type 2, value for user 2 is greater...
In other words I want to produce table for user 1:
rank|type
1|1
2|2
1|3
and for user 2:
rank|type
2|1
1|2
2|3
I would really appreciate help with this.
You can use the result of an aggregate function as the argument of a window function. The window function is processed after the GROUP BY:
SELECT sum(value), user_id, type,
rank() over (partition by type order by sum(value) desc)
from the_table
group by user_id, type
order by user_id, type;
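For the sample data this returns the following, matching the two per-user tables above:
value|user_id|type|rank
  300|      1|   1|   1
  100|      1|   2|   2
  722|      1|   3|   1
   48|      2|   1|   2
  724|      2|   2|   1
  175|      2|   3|   2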
Considering the table:
from pyspark.sql import functions as sf

df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df.show()
+---+-----+---------+
| id|error|timestamp|
+---+-----+---------+
| 1| 1| 1|
| 5| 0| 2|
| 27| 1| 1|
| 1| 0| 3|
| 5| 1| 1|
| 1| 0| 2|
+---+-----+---------+
I would like to pivot on the timestamp column while keeping some other aggregated information from the original table. The result I am interested in can be achieved by
df1=df.groupBy('id').agg(sf.sum('error').alias('Ne'),sf.count('*').alias('cnt'))
df2=df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').filter(sf.col('cnt')>1).show()
with the resulting table:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
However, there are at least two issues with the mentioned solution:
I am filtering by cnt at the end of the script. If I could do this at the beginning, I would avoid almost all of the processing, because a large portion of the data is removed by this filter. Is there any way to do this, other than the collect and isin methods?
I am doing groupBy on id twice: first to aggregate the columns I need in the results, and a second time to get the pivot columns. Finally, I need a join to merge these columns. I feel that I must be missing something, because it should be possible to do this with just one groupBy and without a join, but I cannot figure out how.
I don't think you can get around the join: the pivot needs the timestamp values, while the first grouping must not consider them. To create the Ne and cnt values you have to group the dataframe by id alone, which loses the timestamp column; to preserve those values as columns you have to do the pivot separately, as you did, and join it back.
The one improvement that can be made is to move the filter into the creation of df1. As you said, this alone could already improve performance, since df1 should be much smaller after the filtering on your real data.
from pyspark.sql.functions import *
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df1=df.groupBy('id').agg(sum('error').alias('Ne'),count('*').alias('cnt')).filter(col('cnt')>1)
df2=df.groupBy('id').pivot('timestamp').agg(count('*')).fillna(0)
df1.join(df2, on='id').show()
Output:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
Actually, it is indeed possible to avoid the join by using a Window:
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

w1 = Window.partitionBy('id')
w2 = Window.partitionBy('id', 'timestamp')
df.select('id', 'timestamp',
sf.sum('error').over(w1).alias('Ne'),
sf.count('*').over(w1).alias('cnt'),
sf.count('*').over(w2).alias('cnt_2')
).filter(sf.col('cnt')>1) \
.groupBy('id', 'Ne', 'cnt').pivot('timestamp').agg(sf.first('cnt_2')).fillna(0).show()
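Assuming the same sample data, this should produce the same rows as the join-based version above (row order may differ):
+---+---+---+---+---+---+
| id| Ne|cnt|  1|  2|  3|
+---+---+---+---+---+---+
|  5|  1|  2|  1|  1|  0|
|  1|  1|  3|  1|  1|  1|
+---+---+---+---+---+---+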
With a Spark dataframe, I want to update a row's value based on other rows with the same id.
For example,
I have records below,
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result as below
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize: the value column is null in some rows, and I want to fill those in whenever another row with the same id has a valid value.
In SQL, I could simply write an UPDATE statement with an inner join, but I haven't found an equivalent in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in sql)
Let's use the SQL method to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: this code assumes that there is only one non-null value for any particular id. Because we aggregate over the partition, we have to use an aggregation function, and I have used sum. If there are two non-null values for some id, they will be summed up. If an id could have multiple non-null values, it's better to use min/max, so that we get one of the values rather than their sum.
df1=sqlContext.sql(
'select id, max(value) over (partition by id) as value from table_view'
)
You can use a window function to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
[1,10],
[1,None],
[1,None],
[2,20],
[2,None],
[2,None],
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
.withColumn('value', F.first('value').over(window)) \
.show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in Scala.
I've come up with two approaches to the same idea and would like to avoid any obvious pitfalls by using one over the other. I have a table (tbl_post) where a single row can have many relationships to other tables (tbl_category, tbl_site, tbl_team). I have a relationship table to join these but don't know which structure to go with, conditional or direct? Hopefully the following will explain...
tbl_post (simple post, can be associated with many categories, teams and sites)
* id
* title
* content
tbl_category
* id
* title
* other category only columns
tbl_team
* id
* title
* other team only columns
tbl_site
* id
* title
* other site only columns
----------------------------------------------------------
tbl_post_relationship
* id (pk)
* post_id (fk tbl_post)
* related_id (fk, dependent on related_type: either tbl_category, tbl_site or tbl_team)
* related_type (category, site or team)
____________________________________
|id|post_id|related_id|related_type|
|--|-------|----------|------------|
| 1| 1| 6| category|
| 2| 1| 4| site|
| 3| 1| 9| category|
| 4| 1| 3| team|
------------------------------------
SELECT c.*
FROM tbl_category c
JOIN tbl_post_relationship r ON
r.post_id = 1
AND r.related_type = 'category'
AND c.id = r.related_id
------------- OR ---------------
tbl_post_relationship
* id (pk)
* post_id (fk tbl_post)
* category_id (fk tbl_category)
* site_id (fk tbl_site)
* team_id (fk tbl_team)
________________________________________
|id|post_id|category_id|site_id|team_id|
|--|-------|-----------|-------|-------|
| 1| 1| 6| NULL| NULL|
| 2| 1| NULL| 4| NULL|
| 3| 1| 9| NULL| NULL|
| 4| 1| NULL| NULL| 3|
----------------------------------------
SELECT c.*
FROM tbl_category c
JOIN tbl_post_relationship r ON
r.post_id = 1
AND r.category_id = c.id
So with the one approach I'll end up with lots of columns full of NULLs (and there might be more tables), and with the other I end up with one simple table to maintain, but every join is based on a "type". I also know I could have a table per relationship, but again that feels like too many tables. Any ideas / thoughts?
You are best off with one table per relationship. You should not worry about the number of tables. The drawbacks of a single relationship table are several, and quite risky:
1) You cannot enforce foreign keys if the related table varies from row to row, so your data integrity is at risk... and sooner or later you will have orphaned data.
2) Queries are more complex because you have to use related_type to filter out the relations in many places.
3) Query maintenance is more costly, for the same reasons as 2), and because you have to explicitly use the related_type constants in many places... it'll be hell when you need to change them or add new ones.
I'd suggest you use the orthodox design... just go with 3 distinct relationship tables: post_category, post_team, post_site.
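A minimal sketch of that design (the column names and integer key types are assumptions based on the schemas above):
-- One relationship table per related entity: real foreign keys keep the
-- data consistent, and the composite primary keys prevent duplicate links.
CREATE TABLE post_category (
    post_id     INT NOT NULL REFERENCES tbl_post (id),
    category_id INT NOT NULL REFERENCES tbl_category (id),
    PRIMARY KEY (post_id, category_id)
);

CREATE TABLE post_site (
    post_id INT NOT NULL REFERENCES tbl_post (id),
    site_id INT NOT NULL REFERENCES tbl_site (id),
    PRIMARY KEY (post_id, site_id)
);

CREATE TABLE post_team (
    post_id INT NOT NULL REFERENCES tbl_post (id),
    team_id INT NOT NULL REFERENCES tbl_team (id),
    PRIMARY KEY (post_id, team_id)
);
Each query then joins through exactly one of these tables, with no related_type filter needed, e.g. SELECT c.* FROM tbl_category c JOIN post_category pc ON pc.category_id = c.id WHERE pc.post_id = 1.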