Consider the following simplified example:
Table JobTitles
| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A | A1 | 1 | 5 |
| A | A2 | 6 | 10 |
| A | A3 | 11 | 15 |
| B | B1 | 2 | 4 |
| B | B2 | 5 | 7 |
| B | B3 | 8 | 11 |
| C | C1 | 5 | 12 |
| C | C2 | 13 | 14 |
| C | C3 | 15 | 18 |
Table Transactions:
| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A | 2 | 5 |
| A | 3 | 10 |
| A | 12 | 5 |
| A | 12 | 10 |
| B | 3 | 5 |
| B | 3 | 10 |
| B | 10 | 5 |
| C | 16 | 10 |
| C | 17 | 5 |
| C | 17 | 10 |
| C | 17 | 5 |
Desired Output:
| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A | A1 | 1 | 5 | 15 |
| A | A2 | 6 | 10 | 0 |
| A | A3 | 11 | 15 | 15 |
| B | B1 | 2 | 4 | 15 |
| B | B2 | 5 | 7 | 0 |
| B | B3 | 8 | 11 | 5 |
| C | C1 | 5 | 12 | 0 |
| C | C2 | 13 | 14 | 0 |
| C | C3 | 15 | 18 | 30 |
To me this is JobTitles LEFT OUTER JOIN Transactions with some type of moving criteria for the TransDate -- that is, I want to SUM Transaction.Amt if Transactions.TransDate is between JobTitles.StartDate and JobTitles.EndDate per each PersonID.
Feels like some type of partition or window function, but my SQL skills are not strong enough to create an elegant solution. In Excel, this equates to:
SUMIFS(Transaction[Amt], JobTitles[PersonID], Results[#[PersonID]], Transactions[TransDate], ">" & Results[#[StartDate]], Transactions[TransDate], "<=" & Results[#[EndDate]])
Moreover, I want to be able to perform this same logic over several flavors of Transaction tables.
The basic query is:
select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
Transactions t
on jt.PersonId = t.PersonId and
t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
Related
I have the following table:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 4 |
| 10 | A2 | 4 |
| 10 | H4 | 5 |
| 10 | K0 | 5 |
| 10 | Z3 | 5 |
| 10 | F1 | 5 |
| 10 | A1 | 5 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 7 |
| 10 | C | 7 |
| 10 | C | 8 |
| 10 | D | 8 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 2 |
| 20 | C | 2 |
| 20 | C | 3 |
| 20 | D | 3 |
+-----+----+---------+
Within every grp, my goal is to merge all the sub_grp sharing at least one id.
More than 2 sub_grp can be merged together.
The expected result should be:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 1 |
| 10 | A2 | 1 |
| 10 | H4 | 2 |
| 10 | K0 | 2 |
| 10 | Z3 | 2 |
| 10 | F1 | 2 |
| 10 | A1 | 2 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 6 |
| 10 | C | 6 |
| 10 | C | 6 |
| 10 | D | 6 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 1 |
| 20 | C | 1 |
| 20 | C | 1 |
| 20 | D | 1 |
+-----+----+---------+
Here is a SQL Fiddle with the test values: http://sqlfiddle.com/#!9/13666c/2
I am trying to solve this either with a stored procedure or queries.
This is an evolution from my previous problem: Merge rows containing same values
My understanding of the problem
Merge sub_grp (for a given grp) if any one of the IDs in one sub_grp match any one of the IDs in another sub_grp. A given sub_grp can be merged with only one other (the earliest in ascending order) sub_grp.
Disclaimer
This code may work. Not tested as OP did not provide DDLs and data scripts.
Solution
UPDATE final
SET sub_grp = new_sub_grp
FROM
-- For each grp, sub_grp combination return a matching new_sub_grp
( SELECT a.grp, a.sub_grp, MatchGrp.sub_grp AS new_sub_grp
FROM tbl AS a
-- Inner join will exclude cases where there are no matching sub_grp and thus nothing to update.
INNER JOIN
-- Find the earliest (if more than one sub-group is a match) matching sub-group where one of the IDs matches
( SELECT TOP 1 grp, sub_grp
FROM tbl AS b
-- b.sub_grp > a.sub_grp - this will only look at the earlier sub-groups avoiding the "double linking"
WHERE b.grp = a.grp AND b.sub_grp > a.sub_grp AND b.ID = a.ID
ORDER BY grp, sub_grp ) AS MatchGrp ON 1 = 1
-- Only return one record per grp, sub_grp combo
GROUP BY grp, sub_grp, MatchGrp.sub_grp ) AS final
You can re-number sub groups afterwards as a separate update statement with the help of DENSE_RANK window function.
I have the following sql to calculate a % total:
SELECT tblTourns_atp.ID_Ti,
Sum([FS_1]/(SELECT Sum(FSOF_1)
FROM stat_atp
WHERE stat_atp.ID_T = tblTourns_atp.ID_T)) AS S1_IP
FROM stat_atp
INNER JOIN tblTourns_atp ON stat_atp.ID_T = tblTourns_atp.ID_T
GROUP BY tblTourns_atp.ID_Ti
I'm getting the 'aggregate error' because it wants the ID_T fields either grouped or in an aggregate function. I've read loads of examples but none of them seem to apply when the offending field is the subject of 'WHERE'.
Tables and output as follows:
+----------+------+--------+--+---------------+-------+--+--------+--------+
| stat_atp | | | | tblTourns_atp | | | Output | |
+----------+------+--------+--+---------------+-------+--+--------+--------+
| ID_T | FS_1 | FSOF_1 | | ID_T | ID_Ti | | ID_Ti | S1_IP |
| 1 | 20 | 40 | | 1 | 1 | | 1 | 31.03% |
| 2 | 30 | 100 | | 2 | 1 | | 2 | 28.57% |
| 3 | 40 | 150 | | 3 | 1 | | 3 | 33.33% |
| 4 | 30 | 100 | | 4 | 2 | | | |
| 5 | 30 | 100 | | 5 | 2 | | | |
| 6 | 40 | 150 | | 6 | 2 | | | |
| 7 | 20 | 40 | | 7 | 3 | | | |
| 8 | 30 | 100 | | 8 | 3 | | | |
| 9 | 40 | 150 | | 9 | 3 | | | |
| 10 | 20 | 40 | | 10 | 3 | | | |
+----------+------+--------+--+---------------+-------+--+--------+--------+
Since you already have an inner join between the two tables, a separate subquery isn't required:
select t.id_ti, sum(s.fs_1)/sum(s.fsof_1) as pct
from tbltourns_atp t inner join stat_atp s on t.id_t = s.id_t
group by t.id_ti
I have a hive external table with data say, (version less than 0.14)
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10011 | 10 | 3 | 0 |
| 10012 | 9 | 0 | 40 |
| 10015 | 10 | 3 | 0 |
| 10017 | 9 | 0 | 40 |
+--------+------+------+------+
And I have a delta file having data given below.
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10012 | 50 | 3 | 10 | --> update
| 10013 | 29 | 0 | 40 | --> insert
| 10014 | 10 | 3 | 0 | --> update
| 10013 | 19 | 0 | 40 | --> update
| 10015 | 70 | 3 | 0 | --> update
| 10016 | 17 | 0 | 40 | --> insert
+--------+------+------+------+
How can I update my hive table with the delta file, without using sqoop. Any help on how to proceed will be great! Thanks.
This is because there is duplicates in the file. How do you know which you should keep? The last one?
In that case you can use, for example, the row_number and then get the maximum value. Something like that.
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
( SELECT *, row_number() over( partition by id ) as row_num
,COUNT(*) OVER (PARTITION BY id) AS cnt
FROM temp_table
) tmp
ON initial.id=tmp.id
WHERE row_num=cnt
OR row_num IS NULL;
Output:
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+
You can load the file to a temporary table in hive and then execute a FULL OUTER JOIN between the two tables.
Query Example:
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
temp_table tmp on initial.id=tmp.id;
Output
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 29 | 0 | 40 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+
I have a following selection:
| EmployeeId | DepartmentName | Salary |
----------------------------------------
| 1 | A | 10 |
| 2 | A | 10 |
| 3 | A | 15 |
| 4 | A | 20 |
| 5 | A | 20 |
| 6 | B | 15 |
| 7 | B | 25 |
| 8 | B | 25 |
I want to group all rows by salary in each department and add this number of group to the selection.
Example of the result selection:
| EmployeeId | DepartmentName | Salary | Group |
------------------------------------------------
| 1 | A | 10 | 1 |
| 2 | A | 10 | 1 |
| 3 | A | 15 | 2 |
| 4 | A | 20 | 3 |
| 5 | A | 20 | 3 |
| 6 | B | 15 | 1 |
| 7 | B | 25 | 2 |
| 8 | B | 25 | 2 |
SELECT *
, DENSE_RANK() OVER (PARTITION BY DepartmentName
ORDER BY Salary) AS [Group]
FROM TableName
I already have the argument for the st column..
mysql> SELECT * FROM t ORDER BY LEFT(st,LOCATE(' ',st)), CAST(SUBSTRING(st,LOCATE(' ',st)+1) AS SIGNED);
+----+------+
| id | st |
+----+------+
| 1 | a 1 |
| 3 | a 6 |
| 4 | a 11 |
| 2 | a 11 |
| 5 | b 1 |
| 7 | b 6 |
| 8 | b 12 |
| 6 | b 12 |
+----+------+
this is what it should happen..
+----+------+
| id | st |
+----+------+
| 1 | a 1 |
| 3 | a 6 |
| 2 | a 11 |
| 4 | a 11 |
| 5 | b 1 |
| 7 | b 6 |
| 6 | b 12 |
| 8 | b 12 |
+----+------+
can I do this or not via SQL?
Well from what you show the only difference is that "ties" are broken by the ID column. If that's the case you can just use that as the third sort field:
SELECT *
FROM t
ORDER BY LEFT(st,LOCATE(' ',st)),
CAST(SUBSTRING(st,LOCATE(' ',st)+1) AS SIGNED),
ID