Restore Previous Version Data from a Table with Spark SQL

I am using the Spark 2.4 SQL query API.
I have a table whose key columns are (id + version). There can be multiple records for the same id, but each record is made unique by the version column. There is also a column called delete_flg, which signifies whether that id + version combination is active or deleted at the source. If delete_flg = 'Y', the record has been deleted at the source, and the latest version immediately preceding that record's version must be selected for loading the target.
SOURCE:
id | version | name | delete_flg
--------------------------------
10 | 1 | John | N
10 | 2 | Mike | N
10 | 3 | Henry| N
Scenario 1: the latest record (by version) is selected, since its delete_flg = 'N':
Target:
id | version | name
--------------------
10 | 3 | Henry
Scenario 2: Now suppose that the next day, yesterday's latest record arrives flagged as deleted (delete_flg = 'Y'):
SOURCE:
id | version | name | delete_flg
--------------------------------
10 | 3 | Henry| Y
In this case, version = 3 has been deleted at source (delete_flg = Y). Hence, version = 2 should be selected for the target.
Target:
id | version | name
--------------------
10 | 2 | Mike
Note that the target has now been set to the record immediately preceding the deleted record (i.e. version = 2 -> "Mike").
Please advise how to approach the query. Any input is appreciated.

Your input format is not entirely clear; I see three possible options. I chose the first one and present a solution for it; see the comments in the code. If your input is different, you should be able to adapt the approach, as the generated data covers all three cases.
There are two parts: data generation and then the query, which you can also run with %sql.
I did it this way with enhanced input, but it could probably also be done with partitionBy and RANK (a window-function sketch is shown after the results below). The crux is the (10, 3, ...) entries, which complicate things.
Generate the test data (not part of your solution):
import org.apache.spark.sql.functions._
import spark.implicits._

// Option 1: full history in one table, including the cancelling 'Y' entries
val df_all = sc.parallelize(Seq(
  (10, 0, "mark", "N"),
  (10, 1, "john", "Y"),
  (10, 2, "barry", "N"),
  (10, 3, "pete", "N"),
  (10, 3, "pete", "Y"),
  (10, 4, "new pete", "N"),
  (20, 1, "john", "N"),
  (20, 2, "prev barry", "N"),
  (20, 3, "pete", "N"),
  (20, 3, "pete", "Y"),
  (30, 1, "first pete", "N")
)).toDF("id", "v", "n", "d")

// Option 2: the data already at rest on the target side
val df_at_rest = sc.parallelize(Seq(
  (10, 1, "john", "N"),
  (10, 2, "pete", "N"),
  (10, 3, "peter", "N")
)).toDF("id", "v", "n", "d")

// Option 3: an incoming delta containing a delete
val df_delta = sc.parallelize(Seq(
  (10, 4, "pete", "Y")
)).toDF("id", "v", "n", "d")

df_all.createOrReplaceTempView("dfa")
df_at_rest.createOrReplaceTempView("dfr")
df_delta.createOrReplaceTempView("dfd")
Query of interest for Option 1
spark.sql(""" select id, v, n
from dfa a, (select id as id1, max(v) as max_v
from dfa
where d = 'N'
group by id) B
where a.id = b.id1
and a.v = b.max_v
and a.d = 'N'
and a.id not in (select a.id
from dfa a, (select id as id2, max(v) as max_v2
from dfa
where d = 'N'
group by id) B
where a.id = b.id2
and a.v = b.max_v2
and a.d = 'Y')
union all
select id, v, n
from dfa E, (
select id2, max(v) as max_v2
from dfa, (select id2, max_v2
from dfa a, (select id as id2, max(v) as max_v2
from dfa
where d = 'N'
group by id) B
where a.id = b.id2
and a.v = b.max_v2
and a.d = 'Y') C
where id = C.id2
and v < C.max_v2
group by id2) D
where E.id = D.id2
and E.v = D.max_v2
""").show(false)
returns:
+---+---+----------+
|id |v |n |
+---+---+----------+
|10 |4 |new pete |
|30 |1 |first pete|
|20 |2 |prev barry|
+---+---+----------+
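For reference, here is the partitionBy/RANK route hinted at above; a minimal sketch, not the answer's original method, using row_number rather than RANK to avoid ties. It treats any (id, v) combination that also has a 'Y' row as deleted and then keeps the highest surviving 'N' version per id, which also copes with several versions in a row being deleted:
spark.sql("""
select id, v, n
from (
  select a.*,
         row_number() over (partition by id order by v desc) as rn
  from dfa a
  where a.d = 'N'
    -- drop 'N' versions cancelled by a matching 'Y' row
    and not exists (select 1
                    from dfa x
                    where x.id = a.id
                      and x.v = a.v
                      and x.d = 'Y')
) t
where rn = 1
""").show(false)
On the Option 1 data this returns the same three rows as the query above.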

Related

How to create an array from flattened data in BigQuery

There is a lot of information online about going from arrays or structs to flattened data, but I need to do the opposite and I am having a hard time achieving it. I am using Google BigQuery.
I have something like:
| Id | Value1 | Value2 |
| 1 | 1 | 2 |
| 1 | 3 | 4 |
| 2 | 5 | 6 |
| 2 | 7 | 8 |
I would like to get for the example above:
1, [(1, 2), (3, 4)]
2, [(5, 6), (7, 8)]
If I try to put an array in the SELECT together with a GROUP BY, it is not a valid statement.
For example:
SELECT Id, [ STRUCT(Value1, Value2) ] as Value
FROM `table.dataset`
GROUP BY Id
Which returns:
1, (1, 2)
1, (3, 4)
2, (5, 6)
2, (7, 8)
Which is not what I am looking for. The structure I got is: Id, Value.Value1, Value.Value2 and I want Id, [ Value(V1, V2), Value(V1, V2), ... ]
You can do that with SELECT Id, ARRAY_AGG(STRUCT(Value1, Value2)) ... GROUP BY Id
Below is for BigQuery Standard SQL
#standardSQL
select id, array_agg((select as struct t.* except(id))) as `value`
from `project.dataset.table` t
group by id
Applied to the sample data in your question, the output is id 1 with [(1, 2), (3, 4)] and id 2 with [(5, 6), (7, 8)].
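For completeness, the same aggregation written with explicit column references, per the comment above:
#standardSQL
select Id, array_agg(struct(Value1, Value2)) as `value`
from `project.dataset.table`
group by Id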

Roll up multiple rows into one when joining in SQL Server

I have a table, Foo
ID | Name
-----------
1 | ONE
2 | TWO
3 | THREE
And another, Bar:
ID | FooID | Value
------------------
1 | 1 | Alpha
2 | 1 | Alpha
3 | 1 | Alpha
4 | 2 | Beta
5 | 2 | Gamma
6 | 2 | Beta
7 | 3 | Delta
8 | 3 | Delta
9 | 3 | Delta
I would like a query that joins these tables, returning one row for each row in Foo, rolling up the 'value' column from Bar. I can get back the first Bar.Value for each FooID:
SELECT *
FROM Foo f
OUTER APPLY
(
    SELECT TOP 1 Value FROM Bar WHERE FooId = f.ID
) AS b
Giving:
ID | Name | Value
---------------------
1 | ONE | Alpha
2 | TWO | Beta
3 | THREE | Delta
But that's not what I want, and I haven't been able to find a variant that brings back a rolled-up value, that is, the single Bar.Value if it is the same for every corresponding row, or a static string such as '(multiple)' if not:
ID | Name | Value
---------------------
1 | ONE | Alpha
2 | TWO | (multiple)
3 | THREE | Delta
I have found some solutions that would bring back concatenated values ('Alpha, Alpha, Alpha', 'Beta, Gamma, Beta', etc.), albeit not very elegantly, but that's not what I want either.
One method, using a CASE expression and assuming that [Value] cannot be NULL:
WITH Foo AS
(SELECT *
FROM (VALUES (1, 'ONE'),
(2, 'TWO'),
(3, 'THREE')) V (ID, [Name])),
Bar AS
(SELECT *
FROM (VALUES (1, 1, 'Alpha'),
(2, 1, 'Alpha'),
(3, 1, 'Alpha'),
(4, 2, 'Beta'),
(5, 2, 'Gamma'),
(6, 2, 'Beta'),
(7, 3, 'Delta'),
(8, 3, 'Delta'),
(9, 3, 'Delta')) V (ID, FooID, [Value]))
SELECT F.ID,
F.[Name],
CASE COUNT(DISTINCT B.[Value]) WHEN 1 THEN MAX(B.Value) ELSE '(Multiple)' END AS [Value]
FROM Foo F
JOIN Bar B ON F.ID = B.FooID
GROUP BY F.ID,
F.[Name];
You can also try the following, which builds a distinct comma-separated list per FooID and flags it as '(Multiple)' when a comma (i.e. more than one distinct value) appears:
SELECT F.ID,
       F.Name,
       (CASE WHEN B.Value LIKE '%,%' THEN '(Multiple)' ELSE B.Value END) AS Value
FROM Foo F
OUTER APPLY
(
    -- build a distinct ', '-separated list; SUBSTRING starts at 3
    -- to strip the full leading separator
    SELECT SUBSTRING((
        SELECT DISTINCT ', ' + ISNULL(Value, ',') FROM Bar WHERE FooId = F.ID
        FOR XML PATH('')
    ), 3, 9999) AS Value
) AS B
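On SQL Server 2017 or later, STRING_AGG can replace the FOR XML PATH trick; a minimal sketch of the same idea (a hypothetical rewrite, not from the original answers):
SELECT f.ID,
       f.Name,
       CASE WHEN b.Vals LIKE '%,%' THEN '(Multiple)' ELSE b.Vals END AS Value
FROM Foo f
OUTER APPLY
(
    -- aggregate this row's distinct Bar values into one comma-separated string
    SELECT STRING_AGG(d.Value, ',') AS Vals
    FROM (SELECT DISTINCT Value FROM Bar WHERE FooID = f.ID) d
) b;
A comma can only appear here as the separator, so the LIKE '%,%' test reliably detects multiple distinct values (this assumes the Value column itself contains no commas).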

SQL pass the row or add subsequent row based on column values

I have a select statement that returns some rows as follows:
a| b| c| d|
1, 2, 5, 1
1, 5, 7, 1
2, 5, 4, 1
2, 5, 2, 2
3, 5, 9, 1
What I need is: if col a and col b in subsequent rows match, they represent the same object, and c needs to be summed across those rows and returned as a single row. If the next row doesn't match, it should be returned as-is. I'm not sure I need column d; it just was generated from a different select.
The results for the query would look like
a| b| c| d
1, 2, 5, 1
1, 5, 7, 1
2, 5, 6, 1 // the combination of rows 3 & 4
3, 5, 9, 1
Sorry, I think too much like a programmer. If someone can give me a place to start I would be very appreciative.
select a, b, sum(c) as c from XXX group by a, b
The shift that a programmer has to make when starting to learn SQL is to stop thinking of the data set procedurally: you're not iterating over the data row by row; you operate on it as a set. It can be a difficult transition to make.
Information on aggregation:
https://www.w3schools.com/sql/sql_groupby.asp
http://www.sql-tutorial.com/sql-aggregate-functions-sql-tutorial
This is pretty basic SQL; I'd suggest working through some tutorials or HackerRank-style exercises to get more familiar with the language.
Use a Group By with SUM and MIN functions.
SQL Fiddle
PostgreSQL 9.6 Schema Setup:
CREATE TABLE t
(a int, b int, c int, d int)
;
INSERT INTO t
(a, b, c, d)
VALUES
(1, 2, 5, 1),
(1, 5, 7, 1),
(2, 5, 4, 1),
(2, 5, 2, 2),
(3, 5, 9, 1)
;
Query 1:
SELECT a
,b
,SUM(c) AS c
,MIN(d) AS d
FROM t
GROUP BY a
,b
ORDER BY a
Results:
| a | b | c | d |
|---|---|---|---|
| 1 | 2 | 5 | 1 |
| 1 | 5 | 7 | 1 |
| 2 | 5 | 6 | 1 |
| 3 | 5 | 9 | 1 |

SQL performing day difference by matching value

My goal is to get the duration from the 1st OLD or 1st NEW status to the 1st END. For example, given Table1:
ID Day STATUS
111 1 NEW
111 2 NEW
111 3 OLD
111 4 END
111 5 END
112 1 OLD
112 2 OLD
112 3 NEW
112 4 NEW
112 5 END
113 1 NEW
113 2 NEW
The desired outcome would be:
STATUS Count
NEW    2     (1 for ID 111: NEW on day 1 to END on day 4; 1 for ID 112: NEW on day 3 to END on day 5)
OLD    2     (1 for ID 111: OLD on day 3 to END on day 4; 1 for ID 112: OLD on day 1 to END on day 5)
The following is T-SQL (SQL Server) and is NOT available in MySQL. The choice of DBMS matters in a question like this because there are so many DBMS-specific choices to make. The query below requires a window function, row_number() over(), and a common table expression, neither of which exists yet in MySQL (but they will one day). This solution also uses cross apply, which is (to date) SQL Server specific, although there are alternatives in Postgres and Oracle 12 using lateral joins.
SQL Fiddle
MS SQL Server 2014 Schema Setup:
CREATE TABLE Table1
(id int, day int, status varchar(3))
;
INSERT INTO Table1
(id, day, status)
VALUES
(111, 1, 'NEW'),
(111, 2, 'NEW'),
(111, 3, 'OLD'),
(111, 4, 'END'),
(111, 5, 'END'),
(112, 1, 'OLD'),
(112, 2, 'OLD'),
(112, 3, 'NEW'),
(112, 4, 'NEW'),
(112, 5, 'END'),
(113, 1, 'NEW'),
(113, 2, 'NEW')
;
Query 1:
with cte as (
  -- keep only the first row per (id, status)
  select *
  from (
    select t.*
         , row_number() over(partition by id, status order by day) rn
    from table1 t
  ) d
  where rn = 1
)
select
  t.id, t.day, ca.nxtDay, t.Status, ca.nxtStatus
from cte t
outer apply (
  -- for a first 'NEW' row, find the first 'END' row of the same id
  select top(1) Status, day
  from cte nxt
  where t.id = nxt.id
    and t.status = 'NEW' and nxt.status = 'END'
  order by day
) ca (nxtStatus, nxtDay)
where nxtStatus IS NOT NULL or Status = 'OLD'
order by id, day
Results:
| id | day | nxtDay | Status | nxtStatus |
|-----|-----|--------|--------|-----------|
| 111 | 1 | 4 | NEW | END |
| 111 | 3 | (null) | OLD | (null) |
| 112 | 1 | (null) | OLD | (null) |
| 112 | 3 | 5 | NEW | END |
As you can see, counting the Status column in that result gives NEW = 2 and OLD = 2.
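To get from there to the requested counts, one more aggregation is enough; a minimal sketch, assuming the query above has been saved as a view named matched (a name introduced here for illustration):
select Status, count(*) as [Count]
from matched
group by Status;
And the lateral-join alternative mentioned above would look roughly like this in Postgres (a sketch over the same schema, not taken from the original answer):
with cte as (
  -- keep only the first row per (id, status)
  select *
  from (
    select t.*, row_number() over (partition by id, status order by day) as rn
    from table1 t
  ) d
  where rn = 1
)
select t.id, t.day, ca.nxtday, t.status, ca.nxtstatus
from cte t
left join lateral (
  -- for a first 'NEW' row, find the first 'END' row of the same id
  select nxt.status as nxtstatus, nxt.day as nxtday
  from cte nxt
  where nxt.id = t.id
    and t.status = 'NEW' and nxt.status = 'END'
  order by nxt.day
  limit 1
) ca on true
where ca.nxtstatus is not null or t.status = 'OLD'
order by t.id, t.day;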

Efficient Way to do Very Complicated SQL Grouping

Say you have a table like this:
ID | Type | Reference #1 | Reference #2
0 | 1 | [A] | {a}
1 | 2 | [B] | {b}
2 | 2 | [B] | {c}
3 | 1 | [C] | {d}
4 | 1 | [D] | {d}
5 | 1 | [E] | {d}
6 | 1 | [C] | {e}
Is there any good way to group by "Reference #1" and "Reference #2" as a "fallback", for lack of a better way of putting it...
For example, I would like to group the following IDs together:
{0} [Unique Reference #1],
{1,2} [Same Reference #1],
{3,4,5,6} [{3,4,5} have same Reference #2 and {3,6} have same Reference #1]
I am at a total loss as to how to do this... Any thoughts?
In mellamokb's query, the groupings are dependent on the order of the input. For example:
VALUES
(0, 1, '[A]', '{a}'),
(1, 2, '[B]', '{b}'),
(2, 2, '[B]', '{c}'),
(3, 1, '[C]', '{d}'), // group 3
(4, 1, '[D]', '{d}'), // group 3
(5, 1, '[E]', '{d}'), // group 3
(6, 1, '[C]', '{e}'); // group 3
produces a different result than
VALUES
(0, 1, '[A]', '{a}'),
(1, 2, '[B]', '{b}'),
(2, 2, '[B]', '{c}'),
(3, 1, '[C]', '{e}'), //group 3
(4, 1, '[D]', '{d}'), // group 4
(5, 1, '[E]', '{d}'), // group 4
(6, 1, '[C]', '{d}'); // group 3
This might be intended, if there is some natural order to the references that you could specify, but it's a problem if there is not. The way to 'solve' this, or to specify the problem more precisely, is to say that all equal Reference1s create a set of elements whose members are themselves and those elements whose Reference2 is equal to that of at least one member of the set.
In SQL:
with groupings as (
  select
    ID, Reference1, Reference2,
    -- first pass: smallest ID sharing either reference with this row
    (select min(ID) from Table1 t2
     where t2.Reference1 = t1.Reference1 or t2.Reference2 = t1.Reference2) as minID
  from
    Table1 t1
)
select
  t1.ID, t1.Reference1, t1.Reference2, t1.minID as round1,
  -- second pass: propagate the smallest group id across rows linked by Reference2
  (select min(t2.minID)
   from groupings t2
   where t1.Reference2 = t2.Reference2
  ) as minID
from
  groupings t1
This should produce the full grouping each time.
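To materialize the groups themselves, aggregate on that minID; a minimal sketch, assuming the query above has been saved as a view named resolved (a name introduced here for illustration):
select minID as group_id,
       min(ID) as first_member,
       count(*) as member_count
from resolved
group by minID;
For the sample data this yields group ids 0, 1 and 3, matching the desired {0}, {1,2} and {3,4,5,6} grouping.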