Spark DataFrame: Ignore columns with empty IDs in groupBy - dataframe

I have a dataframe e.g. with this structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1 | 123 | 1 | | | A1 | B1 | | ... <- only P1_x columns filled
1 | 123 | 2 | | | A2 | B2 | | ... <- only P1_x filled
1 | 123 | 3 | | | A3 | B3 | | ... <- only P1_x filled
1 | 123 | | 1 | | | | A4 | ... <- only P2_x filled
1 | 123 | | 2 | | | | A5 | ... <- only P2_x filled
1 | 123 | | | 1 | | | | ... <- only P3_x filled
I need to combine the rows that have the same ID, Date and Px_ID values, ignoring empty Px_ID values when comparing the key columns.
In the end I need a dataframe like this:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1 | 123 | 1 | 1 | 1 | A1 | B1 | A4 | ...
1 | 123 | 2 | 2 | | A2 | B2 | A5 | ...
1 | 123 | 3 | | | A3 | B3 | | ...
Is this possible and how? Thank you!

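I found a solution for this problem: since the non-relevant x_ID columns are empty, one possible way is to create a new column combined_ID that contains a concatenation of all x_ID column values (this will only contain one value, since only one x_ID is non-empty in each row). Note that this relies on the empty cells being empty strings; concat returns null if any of its inputs is null:
import org.apache.spark.sql.functions.{col, concat}
val xIdArray = Seq("P1_ID", "P2_ID", "P3_ID").map(col) // concat expects Columns, not Strings
myDF = myDF.withColumn("combined_ID", concat(xIdArray : _*))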
This changes the DF to following structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ... | combined_ID
===========================================================================
1 | 123 | 1 | | | A1 | B1 | | ... | 1
1 | 123 | 2 | | | A2 | B2 | | ... | 2
1 | 123 | 3 | | | A3 | B3 | | ... | 3
1 | 123 | | 1 | | | | A4 | ... | 1
1 | 123 | | 2 | | | | A5 | ... | 2
1 | 123 | | | 1 | | | | ... | 1
Now I can simply group my DF by ID, Date and combined_ID and aggregate all the relevant columns with, e.g., the max function to get the values of the non-empty cells:
import org.apache.spark.sql.functions.{col, max}
val groupByColumns : Seq[String] = Seq("ID", "Date", "combined_ID")
val aggColumns : Seq[String] = Seq("P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A", ...)
val aggExprs = aggColumns.map(c => max(col(c)).as(c)) // one max(...) per column, keeping the original name
myDF = myDF.groupBy(groupByColumns.head, groupByColumns.tail : _*).agg(aggExprs.head, aggExprs.tail : _*)
Result:
ID | Date | combined_ID | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
===========================================================================
1 | 123 | 1 | 1 | 1 | 1 | A1 | B1 | A4 | ...
1 | 123 | 2 | 2 | 2 | | A2 | B2 | A5 | ...
1 | 123 | 3 | 3 | | | A3 | B3 | | ...
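To make the idea easier to check outside Spark, here is a minimal plain-Python sketch of the same two steps (build a combined ID, then group and take the max per column). It is not the Spark code above, only an illustration; the function name combine_rows and the choice to model empty cells as "" are assumptions made for this example.

```python
from collections import defaultdict

# Sample rows from the question; empty cells are modelled as "" (an assumption).
columns = ["ID", "Date", "P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A"]
rows = [
    (1, 123, "1", "", "", "A1", "B1", ""),
    (1, 123, "2", "", "", "A2", "B2", ""),
    (1, 123, "3", "", "", "A3", "B3", ""),
    (1, 123, "", "1", "", "", "", "A4"),
    (1, 123, "", "2", "", "", "", "A5"),
    (1, 123, "", "", "1", "", "", ""),
]


def combine_rows(rows):
    """Group rows by (ID, Date, combined_ID) and keep the max of every other column."""
    groups = defaultdict(list)
    for row in rows:
        rec = dict(zip(columns, row))
        # Only one Px_ID is non-empty per row, so the concatenation is that single value.
        combined_id = rec["P1_ID"] + rec["P2_ID"] + rec["P3_ID"]
        groups[(rec["ID"], rec["Date"], combined_id)].append(rec)

    result = []
    for key in sorted(groups):
        merged = {"ID": key[0], "Date": key[1]}
        for name in columns[2:]:
            # "" sorts before any non-empty string, so max() picks the filled cell.
            merged[name] = max(rec[name] for rec in groups[key])
        result.append(merged)
    return result


combined = combine_rows(rows)
for rec in combined:
    print(rec)
```

The max trick only works because an empty string compares lower than any non-empty value; with real nulls a "first non-null" aggregate would be the safer choice.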

Related

SQL how to find a multi column maximum in a group?

How can I write an SQL query (DB2) that will run on this table:
+---+---+---+----+
| A | B | C | V  |
+---+---+---+----+
| 1 | 1 | 1 | k1 |
| 1 | 1 | 2 | k1 |
| 1 | 2 | 3 | k2 |
| 2 | 3 | 4 | k2 |
| 1 | 2 | 3 | k3 |
| 1 | 3 | 5 | k3 |
| 1 | 4 | 6 | k3 |
+---+---+---+----+
and produce this result
+---+---+---+----+
| A | B | C | V  |
+---+---+---+----+
| 1 | 1 | 2 | k1 |
| 2 | 3 | 4 | k2 |
| 1 | 4 | 6 | k3 |
+---+---+---+----+
That is, it should select, within each group, the row with the maximum "tuple" (A, B, C). For two rows R1 and R2:
if R1.A <> R2.A return Row where A = Max(R1.A,R2.A)
if R1.B <> R2.B return Row where B = Max(R1.B,R2.B)
return Row where C = Max(R1.C,R2.C)
I think row_number() does what you want -- if by "group" you mean V:
select t.*
from (select t.*,
             row_number() over (partition by v order by a desc, b desc, c desc) as seqnum
      from t
     ) t
where seqnum = 1;
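Ordering by a desc, b desc, c desc and keeping the first row is the same as taking the lexicographic maximum of (A, B, C) per group. The following plain-Python sketch shows that comparison on the sample data; it is only an illustration of the ordering, not a replacement for the query, and the function name max_row_per_group is made up for this example.

```python
# Sample rows (A, B, C, V) from the question.
rows = [
    (1, 1, 1, "k1"),
    (1, 1, 2, "k1"),
    (1, 2, 3, "k2"),
    (2, 3, 4, "k2"),
    (1, 2, 3, "k3"),
    (1, 3, 5, "k3"),
    (1, 4, 6, "k3"),
]


def max_row_per_group(rows):
    """Keep, for each V, the row whose (A, B, C) tuple is largest."""
    best = {}
    for a, b, c, v in rows:
        key = (a, b, c)
        # Python compares tuples element by element: A first, then B, then C.
        if v not in best or key > best[v]:
            best[v] = key
    return [(*best[v], v) for v in sorted(best)]


winners = max_row_per_group(rows)
for row in winners:
    print(row)
```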

How to use Count for specific condition

How can I count and show how many Opportunities have Stage 3 but don't have Stage 2?
+-------+-------+
| OppID | Stage |
+-------+-------+
| ABC | 1 |
| ABC | 2 |
| ABC | 3 |
| ABC | 4 |
| CDF | 3 |
| CDF | 4 |
| EFG | 1 |
| EFG | 2 |
| EFG | 3 |
| HIJ | 2 |
| HIJ | 3 |
| LMI | 1 |
| LMI | 2 |
| LMI | 4 |
+-------+-------+
The count result is 1
+-------+-------+
| OppID | Stage |
+-------+-------+
| CDF | 3 |
| CDF | 4 |
+-------+-------+
You could use NOT EXISTS and COUNT(DISTINCT ...) as follows:
SELECT COUNT(DISTINCT OppID)
FROM tbl AS t1
WHERE NOT EXISTS (SELECT 1 FROM tbl AS t2 WHERE t1.OppID = t2.OppID and t2.Stage = 2) and t1.Stage = 3
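If you want to try the query without a database server, a small sketch using Python's built-in sqlite3 module is shown below. SQLite is only a stand-in here (the question does not say which database is used), and the table name tbl is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (OppID TEXT, Stage INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [
        ("ABC", 1), ("ABC", 2), ("ABC", 3), ("ABC", 4),
        ("CDF", 3), ("CDF", 4),
        ("EFG", 1), ("EFG", 2), ("EFG", 3),
        ("HIJ", 2), ("HIJ", 3),
        ("LMI", 1), ("LMI", 2), ("LMI", 4),
    ],
)

# Count opportunities that have a Stage 3 row but no Stage 2 row.
count = conn.execute(
    """
    SELECT COUNT(DISTINCT OppID)
    FROM tbl AS t1
    WHERE t1.Stage = 3
      AND NOT EXISTS (SELECT 1 FROM tbl AS t2
                      WHERE t1.OppID = t2.OppID AND t2.Stage = 2)
    """
).fetchone()[0]
print(count)
```

On the sample data only CDF qualifies, which matches the expected count of 1.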

Need to shift the data to next column, unfortunately added data in wrong column

I have a table test
+----+---------+-------+
| ID | Name1   | Name2 |
+----+---------+-------+
| 1  | Andy    | NULL  |
| 2  | Kevin   | NULL  |
| 3  | Phil    | NULL  |
| 4  | Maria   | NULL  |
| 5  | Jackson | NULL  |
+----+---------+-------+
I am expecting output like
+----+-------+---------+
| ID | Name1 | Name2   |
+----+-------+---------+
| 1  | NULL  | Andy    |
| 2  | NULL  | Kevin   |
| 3  | NULL  | Phil    |
| 4  | NULL  | Maria   |
| 5  | NULL  | Jackson |
+----+-------+---------+
I accidentally inserted the data into the wrong column, and now I want to shift it to the next column.
You can use an UPDATE statement with no WHERE condition, to cover the entire table.
UPDATE test
SET Name2 = Name1,
Name1 = NULL
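A quick way to check the statement before running it on real data is to try it on a copy. The sketch below does that with Python's built-in sqlite3 module; SQLite is an assumption here (the question does not name the database). SQLite evaluates the right-hand sides before assigning, so Name2 receives the old Name1 value. Some engines (MySQL, for example) apply the assignments left to right, so keeping Name2 = Name1 ahead of Name1 = NULL, as written above, is the safe order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (ID INTEGER, Name1 TEXT, Name2 TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, NULL)",
    [(1, "Andy"), (2, "Kevin"), (3, "Phil"), (4, "Maria"), (5, "Jackson")],
)

# No WHERE clause: every row is updated.
conn.execute("UPDATE test SET Name2 = Name1, Name1 = NULL")

shifted = conn.execute("SELECT ID, Name1, Name2 FROM test ORDER BY ID").fetchall()
for row in shifted:
    print(row)
```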

Creating a "First" identifier in an SQL query

I need to be able to identify the first time a set of rows (a question set) occurs in a view of a table. I'm not sure how to frame the question clearly, so I will try to set it out below.
Example data
+-----------+------------+--------+------+----------+--------+
| Reference | Date | PerRef | QSet | Question | Answer |
+-----------+------------+--------+------+----------+--------+
| 1 | 01/01/2015 | a | QS1 | Q1 | A1 |
| 1 | 01/01/2015 | a | QS1 | Q2 | A2 |
| 1 | 01/01/2015 | a | QS1 | Q3 | A3 |
| 2 | 02/01/2015 | a | QS1 | Q1 | A1 |
| 2 | 02/01/2015 | a | QS1 | Q2 | A2 |
| 2 | 02/01/2015 | a | QS1 | Q3 | A3 |
| 3 | 02/01/2015 | b | QS2 | Q1 | A1 |
| 3 | 02/01/2015 | b | QS2 | Q2 | A2 |
| 3 | 02/01/2015 | b | QS2 | Q3 | A3 |
| 4 | 02/01/2015 | b | QS3 | Q1 | A1 |
| 4 | 02/01/2015 | b | QS3 | Q2 | A2 |
| 4 | 02/01/2015 | b | QS3 | Q3 | A3 |
+-----------+------------+--------+------+----------+--------+
For this table I would want to generate a new column on a select query which would look like this:
+-----------+------------+---------+-------+----------+--------+----------+--------+
| Reference | Date | PerRef | QSet | Question | Answer | Earliest | Latest |
+-----------+------------+---------+-------+----------+--------+----------+--------+
| 1 | 01/01/2015 | a | QS1 | Q1 | A1 | 1 | 2 |
| 1 | 01/01/2015 | a | QS1 | Q2 | A2 | 1 | 2 |
| 1 | 01/01/2015 | a | QS1 | Q3 | A3 | 1 | 2 |
| 2 | 02/01/2015 | a | QS1 | Q1 | A1 | 2 | 1 |
| 2 | 02/01/2015 | a | QS1 | Q2 | A2 | 2 | 1 |
| 2 | 02/01/2015 | a | QS1 | Q3 | A3 | 2 | 1 |
| 3 | 02/01/2015 | b | QS2 | Q1 | A1 | 1 | 1 |
| 3 | 02/01/2015 | b | QS2 | Q2 | A2 | 1 | 1 |
| 3 | 02/01/2015 | b | QS2 | Q3 | A3 | 1 | 1 |
| 4 | 05/01/2015 | b | QS3 | Q1 | A1 | 1 | 1 |
| 4 | 05/01/2015 | b | QS3 | Q2 | A2 | 1 | 1 |
| 4 | 05/01/2015 | b | QS3 | Q3 | A3 | 1 | 1 |
+-----------+------------+---------+-------+----------+--------+----------+--------+
I have this in both Earliest and Latest as it then lets me pull back the first completed or the latest. Generally queries will bring back just one instance of a given question set against a person, either the first completed or the latest. Having both will of course also make it possible to bring back the earliest OR the latest and see the change between the two.
So, to rephrase: I need to create a counter based on the Reference, grouped by PerRef and QSet, ordered by the date (either ascending or descending).
Of course, if there is a better way please say so, as I may be making this overly complex. Is there a way to select the first or last Reference grouped on PerRef and QSet?
I managed to find a solution using row_number():
SELECT t.*, orderT.Earliest
FROM #tmp_table t
JOIN (SELECT Reference, PerRef, Qset,
             row_number() OVER (PARTITION BY PerRef, Qset ORDER BY Date) AS Earliest
      FROM #tmp_table
      GROUP BY Reference, Date, PerRef, Qset) orderT
  ON t.Reference = orderT.Reference
 AND t.PerRef = orderT.PerRef
 AND t.Qset = orderT.Qset
What it does is:
1. Compute the Earliest column for each Reference/PerRef/Qset combination
2. "Apply" it to each line using JOIN on Reference/PerRef/Qset
Check out row_number() documentation if you have trouble understanding the query:
https://msdn.microsoft.com/en-us/library/ms186734.aspx
Generating the Latest column using a similar method shouldn't be a problem.
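For reference, here is a sketch that computes both counters in one pass, run against SQLite through Python's sqlite3 module. Several things are assumptions made for the example: SQLite (3.25 or later, for window functions) stands in for SQL Server, the temp table becomes a plain table named tmp_table, dates are stored as ISO strings so that they sort chronologically, and SELECT DISTINCT replaces the GROUP BY used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tmp_table (Reference INTEGER, Date TEXT, PerRef TEXT, "
    "QSet TEXT, Question TEXT, Answer TEXT)"
)

# One question set per Reference, three questions each (dates in ISO format).
sets = [
    (1, "2015-01-01", "a", "QS1"),
    (2, "2015-01-02", "a", "QS1"),
    (3, "2015-01-02", "b", "QS2"),
    (4, "2015-01-02", "b", "QS3"),
]
data = [
    (ref, date, per, qset, f"Q{n}", f"A{n}")
    for ref, date, per, qset in sets
    for n in (1, 2, 3)
]
conn.executemany("INSERT INTO tmp_table VALUES (?, ?, ?, ?, ?, ?)", data)

query = """
SELECT t.Reference, t.Question, o.Earliest, o.Latest
FROM tmp_table AS t
JOIN (
    SELECT Reference, PerRef, QSet,
           ROW_NUMBER() OVER (PARTITION BY PerRef, QSet ORDER BY Date)      AS Earliest,
           ROW_NUMBER() OVER (PARTITION BY PerRef, QSet ORDER BY Date DESC) AS Latest
    FROM (SELECT DISTINCT Reference, Date, PerRef, QSet FROM tmp_table) AS d
) AS o
  ON t.Reference = o.Reference AND t.PerRef = o.PerRef AND t.QSet = o.QSet
ORDER BY t.Reference, t.Question
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Filtering on Earliest = 1 or Latest = 1 then gives the first or the most recent completion of each question set per person.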

How to aggregate column on changing criteria in SQL (multiple SUMIFS)

Consider the following simplified example:
Table JobTitles
| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A | A1 | 1 | 5 |
| A | A2 | 6 | 10 |
| A | A3 | 11 | 15 |
| B | B1 | 2 | 4 |
| B | B2 | 5 | 7 |
| B | B3 | 8 | 11 |
| C | C1 | 5 | 12 |
| C | C2 | 13 | 14 |
| C | C3 | 15 | 18 |
Table Transactions:
| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A | 2 | 5 |
| A | 3 | 10 |
| A | 12 | 5 |
| A | 12 | 10 |
| B | 3 | 5 |
| B | 3 | 10 |
| B | 10 | 5 |
| C | 16 | 10 |
| C | 17 | 5 |
| C | 17 | 10 |
| C | 17 | 5 |
Desired Output:
| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A | A1 | 1 | 5 | 15 |
| A | A2 | 6 | 10 | 0 |
| A | A3 | 11 | 15 | 15 |
| B | B1 | 2 | 4 | 15 |
| B | B2 | 5 | 7 | 0 |
| B | B3 | 8 | 11 | 5 |
| C | C1 | 5 | 12 | 0 |
| C | C2 | 13 | 14 | 0 |
| C | C3 | 15 | 18 | 30 |
To me this is JobTitles LEFT OUTER JOIN Transactions with some type of moving criteria for the TransDate -- that is, I want to SUM Transaction.Amt if Transactions.TransDate is between JobTitles.StartDate and JobTitles.EndDate per each PersonID.
Feels like some type of partition or window function, but my SQL skills are not strong enough to create an elegant solution. In Excel, this equates to:
SUMIFS(Transaction[Amt], JobTitles[PersonID], Results[#[PersonID]], Transactions[TransDate], ">" & Results[#[StartDate]], Transactions[TransDate], "<=" & Results[#[EndDate]])
Moreover, I want to be able to perform this same logic over several flavors of Transaction tables.
The basic query is:
select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
     Transactions t
     on jt.PersonId = t.PersonId and
        t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
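To confirm that this produces the desired output, the query can be run against the sample data, for example with Python's built-in sqlite3 module as sketched below. SQLite is just a convenient stand-in (the question does not name the database), and the ORDER BY is added only to make the output order predictable. Note that BETWEEN is inclusive on both ends, whereas the Excel formula uses a strict ">" on StartDate; the sample data gives the same totals either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE JobTitles (PersonID TEXT, JobTitle TEXT, StartDate INTEGER, EndDate INTEGER)"
)
conn.execute("CREATE TABLE Transactions (PersonID TEXT, TransDate INTEGER, Amt INTEGER)")
conn.executemany(
    "INSERT INTO JobTitles VALUES (?, ?, ?, ?)",
    [
        ("A", "A1", 1, 5), ("A", "A2", 6, 10), ("A", "A3", 11, 15),
        ("B", "B1", 2, 4), ("B", "B2", 5, 7), ("B", "B3", 8, 11),
        ("C", "C1", 5, 12), ("C", "C2", 13, 14), ("C", "C3", 15, 18),
    ],
)
conn.executemany(
    "INSERT INTO Transactions VALUES (?, ?, ?)",
    [
        ("A", 2, 5), ("A", 3, 10), ("A", 12, 5), ("A", 12, 10),
        ("B", 3, 5), ("B", 3, 10), ("B", 10, 5),
        ("C", 16, 10), ("C", 17, 5), ("C", 17, 10), ("C", 17, 5),
    ],
)

# LEFT JOIN keeps job titles without transactions; COALESCE turns their NULL sum into 0.
totals = conn.execute(
    """
    SELECT jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate,
           COALESCE(SUM(t.Amt), 0) AS Amt
    FROM JobTitles AS jt
    LEFT JOIN Transactions AS t
      ON jt.PersonID = t.PersonID
     AND t.TransDate BETWEEN jt.StartDate AND jt.EndDate
    GROUP BY jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate
    ORDER BY jt.PersonID, jt.StartDate
    """
).fetchall()
for row in totals:
    print(row)
```

Since the same logic is needed for several transaction tables, the query text could be built once with the table name as a parameter of a small helper, as long as that name comes from a trusted list rather than user input.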