I need to be able to identify the first time a set of rows (a question set) occurs in a view of a table. I'm not sure how to frame the question clearly, so I will try to set it out below.
Example data
+-----------+------------+--------+------+----------+--------+
| Reference | Date | PerRef | QSet | Question | Answer |
+-----------+------------+--------+------+----------+--------+
| 1 | 01/01/2015 | a | QS1 | Q1 | A1 |
| 1 | 01/01/2015 | a | QS1 | Q2 | A2 |
| 1 | 01/01/2015 | a | QS1 | Q3 | A3 |
| 2 | 02/01/2015 | a | QS1 | Q1 | A1 |
| 2 | 02/01/2015 | a | QS1 | Q2 | A2 |
| 2 | 02/01/2015 | a | QS1 | Q3 | A3 |
| 3 | 02/01/2015 | b | QS2 | Q1 | A1 |
| 3 | 02/01/2015 | b | QS2 | Q2 | A2 |
| 3 | 02/01/2015 | b | QS2 | Q3 | A3 |
| 4 | 05/01/2015 | b | QS3 | Q1 | A1 |
| 4 | 05/01/2015 | b | QS3 | Q2 | A2 |
| 4 | 05/01/2015 | b | QS3 | Q3 | A3 |
+-----------+------------+--------+------+----------+--------+
For this table I want to generate new columns in a SELECT query so that the result looks like this:
+-----------+------------+---------+-------+----------+--------+----------+--------+
| Reference | Date | PerRef | QSet | Question | Answer | Earliest | Latest |
+-----------+------------+---------+-------+----------+--------+----------+--------+
| 1 | 01/01/2015 | a | QS1 | Q1 | A1 | 1 | 2 |
| 1 | 01/01/2015 | a | QS1 | Q2 | A2 | 1 | 2 |
| 1 | 01/01/2015 | a | QS1 | Q3 | A3 | 1 | 2 |
| 2 | 02/01/2015 | a | QS1 | Q1 | A1 | 2 | 1 |
| 2 | 02/01/2015 | a | QS1 | Q2 | A2 | 2 | 1 |
| 2 | 02/01/2015 | a | QS1 | Q3 | A3 | 2 | 1 |
| 3 | 02/01/2015 | b | QS2 | Q1 | A1 | 1 | 1 |
| 3 | 02/01/2015 | b | QS2 | Q2 | A2 | 1 | 1 |
| 3 | 02/01/2015 | b | QS2 | Q3 | A3 | 1 | 1 |
| 4 | 05/01/2015 | b | QS3 | Q1 | A1 | 1 | 1 |
| 4 | 05/01/2015 | b | QS3 | Q2 | A2 | 1 | 1 |
| 4 | 05/01/2015 | b | QS3 | Q3 | A3 | 1 | 1 |
+-----------+------------+---------+-------+----------+--------+----------+--------+
I have both Earliest and Latest because that lets me pull back either the first completed question set or the latest. Queries will generally bring back just one question set of one type against a person, either the first completed or the latest; having both columns also enables bringing back the earliest OR latest and seeing the change between the two.
So to try to rephrase this: I need to create a counter based on the Reference, grouped by PerRef and QSet, and ordered by Date (either ascending or descending).
Of course, if there is a better way, please say so, as I may be making this overly complex. Is there a way to select the first or last Reference grouped on PerRef and QSet?
I managed to find a solution using row_number():
SELECT t.*, orderT.Earliest
FROM #tmp_table t
JOIN (
    SELECT Reference, PerRef, Qset,
           ROW_NUMBER() OVER (PARTITION BY PerRef, Qset ORDER BY Date) AS Earliest
    FROM #tmp_table
    GROUP BY Reference, Date, PerRef, Qset
) orderT
  ON t.Reference = orderT.Reference
 AND t.PerRef    = orderT.PerRef
 AND t.Qset      = orderT.Qset
What it does is:
1. Compute the Earliest column for each Reference/PerRef/Qset combination
2. "Apply" it to each line using JOIN on Reference/PerRef/Qset
Check out row_number() documentation if you have trouble understanding the query:
https://msdn.microsoft.com/en-us/library/ms186734.aspx
Generating the Latest column using a similar method (ORDER BY Date DESC) shouldn't be a problem.
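As a runnable sketch of the full idea (illustrated here in SQLite, which also supports ROW_NUMBER(); the table name `answers` and the ISO date literals are stand-ins for the question's data), both counters can be produced in one pass with two window calls in opposite ORDER BY directions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answers (Reference INT, Date TEXT, PerRef TEXT, QSet TEXT, Question TEXT, Answer TEXT);
INSERT INTO answers VALUES
 (1,'2015-01-01','a','QS1','Q1','A1'),(1,'2015-01-01','a','QS1','Q2','A2'),
 (2,'2015-01-02','a','QS1','Q1','A1'),(2,'2015-01-02','a','QS1','Q2','A2'),
 (3,'2015-01-02','b','QS2','Q1','A1');
""")
rows = conn.execute("""
SELECT t.*, o.Earliest, o.Latest
FROM answers t
JOIN (
    -- collapse to one row per Reference, then count per person/question set
    SELECT Reference, PerRef, QSet,
           ROW_NUMBER() OVER (PARTITION BY PerRef, QSet ORDER BY Date ASC)  AS Earliest,
           ROW_NUMBER() OVER (PARTITION BY PerRef, QSet ORDER BY Date DESC) AS Latest
    FROM answers
    GROUP BY Reference, Date, PerRef, QSet
) o
  ON t.Reference = o.Reference AND t.PerRef = o.PerRef AND t.QSet = o.QSet
ORDER BY t.Reference, t.Question
""").fetchall()
for r in rows:
    print(r)
```

With this in place, "first completed" is simply `WHERE Earliest = 1` and "latest" is `WHERE Latest = 1`.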
I have the following table:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 4 |
| 10 | A2 | 4 |
| 10 | H4 | 5 |
| 10 | K0 | 5 |
| 10 | Z3 | 5 |
| 10 | F1 | 5 |
| 10 | A1 | 5 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 7 |
| 10 | C | 7 |
| 10 | C | 8 |
| 10 | D | 8 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 2 |
| 20 | C | 2 |
| 20 | C | 3 |
| 20 | D | 3 |
+-----+----+---------+
Within every grp, my goal is to merge all the sub_grp sharing at least one id.
More than 2 sub_grp can be merged together.
The expected result should be:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 1 |
| 10 | A2 | 1 |
| 10 | H4 | 2 |
| 10 | K0 | 2 |
| 10 | Z3 | 2 |
| 10 | F1 | 2 |
| 10 | A1 | 2 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 6 |
| 10 | C | 6 |
| 10 | C | 6 |
| 10 | D | 6 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 1 |
| 20 | C | 1 |
| 20 | C | 1 |
| 20 | D | 1 |
+-----+----+---------+
Here is a SQL Fiddle with the test values: http://sqlfiddle.com/#!9/13666c/2
I am trying to solve this with either a stored procedure or queries.
This is an evolution from my previous problem: Merge rows containing same values
My understanding of the problem
Merge sub_grp (for a given grp) if any one of the IDs in one sub_grp match any one of the IDs in another sub_grp. A given sub_grp can be merged with only one other (the earliest in ascending order) sub_grp.
Disclaimer
This code may work; it is not tested, as the OP did not provide DDL and data scripts.
Solution
UPDATE t
SET t.sub_grp = final.new_sub_grp
FROM tbl AS t
-- For each grp, sub_grp combination return a matching new_sub_grp
JOIN
(   SELECT a.grp, a.sub_grp, MatchGrp.sub_grp AS new_sub_grp
    FROM tbl AS a
    -- CROSS APPLY lets the inner query reference the outer alias a;
    -- it also excludes cases where there is no matching sub_grp and thus nothing to update.
    CROSS APPLY
    -- Find the earliest matching sub-group (if more than one sub-group is a match) where one of the IDs matches
    (   SELECT TOP 1 b.grp, b.sub_grp
        FROM tbl AS b
        -- b.sub_grp < a.sub_grp: only look at the earlier sub-groups, avoiding the "double linking"
        WHERE b.grp = a.grp AND b.sub_grp < a.sub_grp AND b.ID = a.ID
        ORDER BY b.grp, b.sub_grp ) AS MatchGrp
    -- Only return one record per grp, sub_grp combo
    GROUP BY a.grp, a.sub_grp, MatchGrp.sub_grp ) AS final
  ON final.grp = t.grp AND final.sub_grp = t.sub_grp
You can re-number the sub-groups afterwards in a separate UPDATE statement with the help of the DENSE_RANK window function.
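As a sketch of that re-numbering step (shown in SQLite for illustration; the sample data represents one grp after merging, where the surviving sub_grp values 1, 2, 3, 6 have gaps), DENSE_RANK compacts the values back to 1, 2, 3, 4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (grp INT, id TEXT, sub_grp INT);
INSERT INTO tbl VALUES
 (10,'A2',1),(10,'C2',1),(10,'B4',2),(10,'B3',3),(10,'A',6),(10,'D',6);
""")
rows = conn.execute("""
SELECT grp, id,
       -- DENSE_RANK assigns consecutive numbers per grp in sub_grp order,
       -- so gaps left by the merge (here: 6) are closed
       DENSE_RANK() OVER (PARTITION BY grp ORDER BY sub_grp) AS new_sub_grp
FROM tbl
ORDER BY grp, new_sub_grp, id
""").fetchall()
for r in rows:
    print(r)
```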
I have a dataframe e.g. with this structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1 | 123 | 1 | | | A1 | B1 | | ... <- only P1_x columns filled
1 | 123 | 2 | | | A2 | B2 | | ... <- only P1_x filled
1 | 123 | 3 | | | A3 | B3 | | ... <- only P1_x filled
1 | 123 | | 1 | | | | A4 | ... <- only P2_x filled
1 | 123 | | 2 | | | | A5 | ... <- only P2_x filled
1 | 123 | | | 1 | | | | ... <- only P3_x filled
I need to combine the rows, that have the same ID, Date and Px_ID values, but without caring for empty values in the Px_ID when comparing the key columns.
In the end I need a dataframe like this:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1 | 123 | 1 | 1 | 1 | A1 | B1 | A4 | ...
1 | 123 | 2 | 2 | | A2 | B2 | A5 | ...
1 | 123 | 3 | | | A3 | B3 | | ...
Is this possible and how? Thank you!
I found a solution for this problem. Since the non-relevant x_ID columns are empty, one possible approach is to create a new column combined_ID that contains a concatenation of all the x_ID column values (this will contain only one value, since only one x_ID is non-empty in each row):
import org.apache.spark.sql.functions.{col, concat}

// concat expects Column arguments, so wrap the column names with col()
val xIdArray = Seq("P1_ID", "P2_ID", "P3_ID").map(col)
myDF = myDF.withColumn("combined_ID", concat(xIdArray : _*))
This changes the DF to following structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ... | combined_ID
===========================================================================
1 | 123 | 1 | | | A1 | B1 | | ... | 1
1 | 123 | 2 | | | A2 | B2 | | ... | 2
1 | 123 | 3 | | | A3 | B3 | | ... | 3
1 | 123 | | 1 | | | | A4 | ... | 1
1 | 123 | | 2 | | | | A5 | ... | 2
1 | 123 | | | 1 | | | | ... | 1
Now I can simply group my DF by ID, Date, and combined_ID, and aggregate all the relevant columns with e.g. the max function to pick up the values of the non-empty cells:
val groupByColumns : Seq[String] = Seq("ID", "Date", "combined_ID")
val aggColumns : Seq[String] = Seq("P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A", ...)
// agg expects Column expressions, so build max(...) expressions from the column names
val aggExprs = aggColumns.map(c => max(c).as(c))
myDF = myDF.groupBy(groupByColumns.head, groupByColumns.tail : _*).agg(aggExprs.head, aggExprs.tail : _*)
Result:
ID | Date | combined_ID | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
===========================================================================
1 | 123 | 1 | 1 | 1 | 1 | A1 | B1 | A4 | ...
1 | 123 | 2 | 2 | 2 | | A2 | B2 | A5 | ...
1 | 123 | 3 | 3 | | | A3 | B3 | | ...
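The same combine-then-group-then-max idea can be sketched in plain SQL (SQLite here for illustration; this assumes the blanks are empty strings, which MAX sorts below any non-empty value, and the table/column names mirror the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE df (ID INT, Date INT, P1_ID TEXT, P2_ID TEXT, P1_A TEXT, P2_A TEXT);
INSERT INTO df VALUES
 (1,123,'1','','A1',''),
 (1,123,'2','','A2',''),
 (1,123,'','1','','A4'),
 (1,123,'','2','','A5');
""")
rows = conn.execute("""
SELECT ID, Date, P1_ID || P2_ID AS combined_ID,
       -- MAX picks the non-empty value out of each group
       MAX(P1_ID) AS P1_ID, MAX(P2_ID) AS P2_ID,
       MAX(P1_A)  AS P1_A,  MAX(P2_A)  AS P2_A
FROM df
GROUP BY ID, Date, P1_ID || P2_ID
ORDER BY combined_ID
""").fetchall()
for r in rows:
    print(r)
```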
I'm trying to produce rolled-up summaries of the following data, both including only the group in question and excluding that group. I think this can be done with a window function, but I'm having trouble getting the syntax down (in my case, Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but having trouble with creating the appropriate joining key.
You can do:
select year(date), month(date), product,
count(*) as ct, avg(rating) as avg_rating,
sum(count(*)) over (partition by year(date), month(date)) - count(*) as ct_other,
((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
(sum(count(*)) over (partition by year(date), month(date)) - count(*))
) as avg_other
from t
group by year(date), month(date), product;
The rating for the "other" rows is a bit tricky. You need to add everything up for the month, subtract out the current group's values, and compute the average as that sum divided by that count.
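A runnable sketch of that arithmetic (SQLite here for illustration, with the per-group totals staged in a subquery and then windowed; the data is the subset of the question's 2018-01 rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (date TEXT, product TEXT, rating INT);
INSERT INTO t VALUES
 ('2018-01-01','A',1),('2018-01-02','A',3),('2018-01-20','A',4),
 ('2018-01-27','A',5),('2018-01-29','A',4),
 ('2018-01-13','D',1),('2018-01-14','D',2),('2018-01-24','D',3),('2018-01-31','D',4);
""")
rows = conn.execute("""
SELECT year, month, product, ct, avg_rating,
       -- month total minus this product's count = everyone else's count
       SUM(ct) OVER (PARTITION BY year, month) - ct AS other_ct,
       -- month rating total minus this product's total, divided by the other count
       (SUM(sum_rating) OVER (PARTITION BY year, month) - sum_rating) * 1.0 /
       (SUM(ct)         OVER (PARTITION BY year, month) - ct) AS avg_rating_other
FROM (SELECT strftime('%Y', date) AS year, strftime('%m', date) AS month, product,
             COUNT(*) AS ct, AVG(rating) AS avg_rating, SUM(rating) AS sum_rating
      FROM t
      GROUP BY year, month, product)
ORDER BY product
""").fetchall()
for r in rows:
    print(r)
```

Note that when a product is alone in its month, other_ct is 0 and the division yields NULL, matching the expected output.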
Forgive me if I'm having trouble explaining this, but basically I have:
+-------+-----------+-----------+-------------+-----------+
| Place | Agreement | Open Date | Closed Date | FlagDate  |
+-------+-----------+-----------+-------------+-----------+
| 1     | a         | 1/29/2011 | 7/29/2011   | 2/29/2011 |
| 1     | b         | 2/15/2016 | 7/30/2016   | 2/29/2011 |
| 1     | c         | 3/29/2015 | 8/02/2015   | 2/29/2011 |
| 2     | d         | 4/29/2011 | 7/28/2011   | 6/29/2012 |
| 2     | e         | 5/29/2012 | 7/30/2012   | 6/29/2012 |
| 3     | f         | 6/29/2011 | 8/01/2011   | 7/15/2011 |
| 3     | g         | 7/29/2012 | 8/02/2012   | 7/15/2011 |
+-------+-----------+-----------+-------------+-----------+
There can be multiple Agreements for a single Place. The FlagDate is the same for each individual Place (i.e., every row for Place 1 has a FlagDate of 2/29/2011).
What I would like to do is simply select the Agreement where the FlagDate falls between the Open Date and Closed Date (the open and closed dates never overlap across different agreements). So the output for the above would be:
+-------+-----------+-----------+-------------+-----------+
| Place | Agreement | Open Date | Closed Date | FlagDate  |
+-------+-----------+-----------+-------------+-----------+
| 1     | a         | 1/29/2011 | 7/29/2011   | 2/29/2011 |
| 2     | e         | 5/29/2012 | 7/30/2012   | 6/29/2012 |
| 3     | f         | 6/29/2011 | 8/01/2011   | 7/15/2011 |
+-------+-----------+-----------+-------------+-----------+
You can use the BETWEEN operator in your WHERE clause to do this:
SELECT * FROM yourtable WHERE flagdate BETWEEN opendate AND closeddate;
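A minimal sketch of this (SQLite for illustration; the dates are rewritten as ISO strings so that BETWEEN compares them correctly, and since 2/29/2011 is not a real calendar date, 2011-02-28 stands in for it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agreements (place INT, agreement TEXT, opendate TEXT, closeddate TEXT, flagdate TEXT);
INSERT INTO agreements VALUES
 (1,'a','2011-01-29','2011-07-29','2011-02-28'),
 (1,'b','2016-02-15','2016-07-30','2011-02-28'),
 (2,'d','2011-04-29','2011-07-28','2012-06-29'),
 (2,'e','2012-05-29','2012-07-30','2012-06-29');
""")
# BETWEEN is inclusive on both ends, which matches "falls between Open and Closed"
rows = conn.execute(
    "SELECT place, agreement FROM agreements "
    "WHERE flagdate BETWEEN opendate AND closeddate ORDER BY place"
).fetchall()
for r in rows:
    print(r)
```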
Consider the following simplified example:
Table JobTitles
| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A | A1 | 1 | 5 |
| A | A2 | 6 | 10 |
| A | A3 | 11 | 15 |
| B | B1 | 2 | 4 |
| B | B2 | 5 | 7 |
| B | B3 | 8 | 11 |
| C | C1 | 5 | 12 |
| C | C2 | 13 | 14 |
| C | C3 | 15 | 18 |
Table Transactions:
| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A | 2 | 5 |
| A | 3 | 10 |
| A | 12 | 5 |
| A | 12 | 10 |
| B | 3 | 5 |
| B | 3 | 10 |
| B | 10 | 5 |
| C | 16 | 10 |
| C | 17 | 5 |
| C | 17 | 10 |
| C | 17 | 5 |
Desired Output:
| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A | A1 | 1 | 5 | 15 |
| A | A2 | 6 | 10 | 0 |
| A | A3 | 11 | 15 | 15 |
| B | B1 | 2 | 4 | 15 |
| B | B2 | 5 | 7 | 0 |
| B | B3 | 8 | 11 | 5 |
| C | C1 | 5 | 12 | 0 |
| C | C2 | 13 | 14 | 0 |
| C | C3 | 15 | 18 | 30 |
To me this is JobTitles LEFT OUTER JOIN Transactions with some type of moving criteria for the TransDate -- that is, I want to SUM Transaction.Amt if Transactions.TransDate is between JobTitles.StartDate and JobTitles.EndDate per each PersonID.
Feels like some type of partition or window function, but my SQL skills are not strong enough to create an elegant solution. In Excel, this equates to:
SUMIFS(Transaction[Amt], JobTitles[PersonID], Results[#[PersonID]], Transactions[TransDate], ">" & Results[#[StartDate]], Transactions[TransDate], "<=" & Results[#[EndDate]])
Moreover, I want to be able to perform this same logic over several flavors of Transaction tables.
The basic query is:
select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
Transactions t
on jt.PersonId = t.PersonId and
t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
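A runnable sketch of that query (SQLite here, loaded with a subset of the question's data; the LEFT JOIN keeps job titles with no matching transactions, and COALESCE turns their NULL sums into 0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE JobTitles (PersonID TEXT, JobTitle TEXT, StartDate INT, EndDate INT);
CREATE TABLE Transactions (PersonID TEXT, TransDate INT, Amt INT);
INSERT INTO JobTitles VALUES
 ('A','A1',1,5),('A','A2',6,10),('A','A3',11,15),
 ('B','B1',2,4),('B','B2',5,7),('B','B3',8,11);
INSERT INTO Transactions VALUES
 ('A',2,5),('A',3,10),('A',12,5),('A',12,10),
 ('B',3,5),('B',3,10),('B',10,5);
""")
rows = conn.execute("""
SELECT jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate,
       COALESCE(SUM(t.Amt), 0) AS Amt
FROM JobTitles jt
LEFT JOIN Transactions t
  ON jt.PersonID = t.PersonID
 AND t.TransDate BETWEEN jt.StartDate AND jt.EndDate
GROUP BY jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate
ORDER BY jt.PersonID, jt.JobTitle
""").fetchall()
for r in rows:
    print(r)
```

Running the same pattern over another "flavor" of Transactions table only requires swapping the table name in the LEFT JOIN.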