I am looking for a Hive query that can perform the transformation below on the following table.
Input:
cust1 jan 100
cust1 feb 110
cust2 mar 150
cust2 apr 140
cust2 feb 170
Result:
cust1 100, 110
cust2 170, 150, 140
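collect_list() can be used to perform the operation. Note that Hive does not guarantee the order of the elements within each list, so the values may not come back in the order shown under Result: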
Query
select col1,collect_list(col3) from tabledata group by col1;
Output
cust1 [100,110]
cust2 [150,140,170]
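As a rough cross-check outside Hive, the grouping above can be sketched in Python with the standard sqlite3 module. SQLite has no collect_list(), so group_concat() stands in for it here; that substitution, and the column names col1 to col3 for the sample rows, are assumptions of this sketch rather than part of the Hive answer.

```python
import sqlite3

# Sample rows from the question; col1..col3 are assumed column names.
ROWS = [
    ("cust1", "jan", 100),
    ("cust1", "feb", 110),
    ("cust2", "mar", 150),
    ("cust2", "apr", 140),
    ("cust2", "feb", 170),
]


def collect_by_customer(rows):
    """Return {customer: [values]} using GROUP BY plus group_concat()."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tabledata (col1 TEXT, col2 TEXT, col3 INTEGER)")
    conn.executemany("INSERT INTO tabledata VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT col1, group_concat(col3) FROM tabledata GROUP BY col1"
    ).fetchall()
    conn.close()
    # group_concat() returns a comma-separated string; split it back into ints.
    return {cust: [int(v) for v in joined.split(",")] for cust, joined in result}


grouped = collect_by_customer(ROWS)
```

As with collect_list(), the order of the values inside each group is not guaranteed here, so any comparison should sort the lists first.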
Related
Please consider this table:
Year Month Value YearMonth
2011 1 70 201101
2011 1 100 201101
2011 2 200 201102
2011 2 50 201102
2011 3 80 201103
2011 3 250 201103
2012 1 100 201201
2012 2 200 201202
2012 3 250 201203
I want to get a cumulative sum based on each year. For the above table I want to get this result:
Year Month Sum
-----------------------
2011 1 170
2011 2 420 <--- 250 + 170
2011 3 750 <--- 330 + 250 + 170
2012 1 100
2012 2 300 <--- 200 + 100
2012 3 550 <--- 250 + 200 + 100
I wrote this code:
Select c1.YearMonth, Sum(c2.Value) CumulativeSumValue
From #Tbl c1, #Tbl c2
Where c1.YearMonth >= c2.YearMonth
Group By c1.YearMonth
Order By c1.YearMonth Asc
But its CumulativeSumValue is calculated twice for each YearMonth:
YearMonth CumulativeSumValue
201101 340 <--- 170 * 2
201102 840 <--- 420 * 2
201103 1500
201201 850
201202 1050
201203 1300
How can I achieve my desired result?
I wrote this query:
select year, (Sum (aa.[Value]) Over (partition by aa.Year Order By aa.Month)) as 'Cumulative Sum'
from #Tbl aa
But it returned multiple records for 2011:
Year Cumulative Sum
2011 170
2011 170
2011 420
2011 420
2011 750
2011 750
2012 100
2012 300
2012 550
You are creating a cartesian product here. In your ANSI-89 implicit JOIN (you really need to stop using those and switch to ANSI-92 syntax) you are joining on c1.YearMonth >= c2.YearMonth.
For your first month you have two rows with the same value of the year and month, so each of those 2 rows joins to the other 2; this results in 4 rows:
Year Month Value1 Value2
2011 1 70 70
2011 1 70 100
2011 1 100 70
2011 1 100 100
When you SUM this value you get 340, not 170, as you have 70+70+100+100.
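The doubling can be reproduced with a small sketch in Python's sqlite3 module, using only the two rows for 201101. The table name Tbl (instead of the temporary table #Tbl) and the use of SQLite rather than SQL Server are assumptions made so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tbl (Year INTEGER, Month INTEGER, Value INTEGER, YearMonth INTEGER)")
# Only the two January 2011 rows from the question.
conn.executemany(
    "INSERT INTO Tbl VALUES (?, ?, ?, ?)",
    [(2011, 1, 70, 201101), (2011, 1, 100, 201101)],
)

# The same join condition as the original query, without the aggregation,
# so that every joined pair of rows is visible.
pairs = conn.execute(
    "SELECT c1.Value, c2.Value "
    "FROM Tbl c1 JOIN Tbl c2 ON c1.YearMonth >= c2.YearMonth "
    "ORDER BY c1.Value, c2.Value"
).fetchall()
conn.close()

# Summing the c2 side is what Sum(c2.Value) does for this YearMonth.
total = sum(value2 for _, value2 in pairs)
```

Here pairs holds the four combinations listed in the table above, and total comes to 340 rather than 170.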
Instead of a triangular JOIN, however, you should be using a windowed SUM. As you also want to aggregate each month into a single row, you'll need to aggregate inside the windowed SUM as well, like so:
SELECT V.YearMonth,
SUM(SUM(V.Value)) OVER (PARTITION BY Year ORDER BY V.YearMonth) AS CumulativeSum
FROM (VALUES (2011, 1, 70, 201101),
(2011, 1, 100, 201101),
(2011, 2, 200, 201102),
(2011, 2, 50, 201102),
(2011, 3, 80, 201103),
(2011, 3, 250, 201103),
(2012, 1, 100, 201201),
(2012, 2, 200, 201202),
(2012, 3, 250, 201203)) V (Year, Month, Value, YearMonth)
GROUP BY V.YearMonth,
V.Year;
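For readers who want to try the idea without a SQL Server instance, here is a minimal sketch using Python's sqlite3 module. It splits the nested SUM(SUM(...)) into two explicit steps (a per-month total in a CTE, then a running sum over it), which should be equivalent for this data; the table name Tbl, the CTE name monthly, and the function name are assumptions of the sketch. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

ROWS = [
    (2011, 1, 70, 201101), (2011, 1, 100, 201101),
    (2011, 2, 200, 201102), (2011, 2, 50, 201102),
    (2011, 3, 80, 201103), (2011, 3, 250, 201103),
    (2012, 1, 100, 201201), (2012, 2, 200, 201202),
    (2012, 3, 250, 201203),
]

QUERY = """
    WITH monthly AS (
        -- Step 1: collapse each month into a single row.
        SELECT Year, YearMonth, SUM(Value) AS MonthTotal
        FROM Tbl
        GROUP BY Year, YearMonth
    )
    -- Step 2: running total of the monthly rows, restarted for each year.
    SELECT YearMonth,
           SUM(MonthTotal) OVER (PARTITION BY Year ORDER BY YearMonth) AS CumulativeSum
    FROM monthly
    ORDER BY YearMonth
"""


def cumulative_by_year(rows):
    """Return (YearMonth, CumulativeSum) tuples, one per month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tbl (Year INTEGER, Month INTEGER, Value INTEGER, YearMonth INTEGER)")
    conn.executemany("INSERT INTO Tbl VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


cumulative = cumulative_by_year(ROWS)
```

The result has one row per YearMonth, and the running total restarts at 100 for 201201 because of the PARTITION BY Year.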
I am using software that is partly SQL Server based. The software is made by another company, so I do not have full access to the SQL editing part. In simple terms, I have a database, and it is stored using SQL formats. I can fiddle with some areas, but there are limitations on how much I can customize and on the type of syntax that can be used.
Aggregate functions such as SUM() cannot be used, and I am trying to find an alternative method to reach a similar or identical result. I know that the normal way to sum is as below, but SUM() and GROUP BY cannot be used.
I am still inexperienced in SQL, so I kindly ask for your advice.
Thank you very much in advance.
DBMS: Microsoft SQL Server
SELECT *,SUM(value)
FROM table
GROUP BY ID1
Note:
ID1 continues to expand
ID2 consists of a fixed set of values (AA, BB, CC, DD, EE) ONLY
I don't need to group it but I don't know how to do it without grouping at the moment
TABLE
ID1 ID2 Value
001 AA 10
001 BB 21
001 CC 2
001 DD 16
002 AA 7
002 CC 8
003 AA 10
003 BB 9
003 AA 11
RESULT
ID1 ID2 Value SUM
001 AA 10 49
001 BB 21 49
001 CC 2 49
001 DD 16 49
002 AA 7 15
002 CC 8 15
003 AA 10 30
003 BB 9 30
003 AA 11 30
Here is an option using the window function sum() over(). Notice there is no need for a GROUP BY or subquery.
Window functions are invaluable, and it is well worth your time to get comfortable with them.
Example
Select *
,Sum = sum(Value) over (partition by ID1)
From YourTable
Results
ID1 ID2 Value Sum
001 AA 10 49
001 BB 21 49
001 CC 2 49
001 DD 16 49
002 AA 7 15
002 CC 8 15
003 AA 10 30
003 BB 9 30
003 AA 11 30
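The same pattern can be tried out in Python's sqlite3 module; this is only a sketch under a few assumptions. SQLite (3.25 or later) does not accept the T-SQL `Sum = ...` alias form, so the column is aliased with AS and called Total here, and the table name YourTable is just the placeholder from the answer.

```python
import sqlite3

ROWS = [
    ("001", "AA", 10), ("001", "BB", 21), ("001", "CC", 2), ("001", "DD", 16),
    ("002", "AA", 7), ("002", "CC", 8),
    ("003", "AA", 10), ("003", "BB", 9), ("003", "AA", 11),
]


def totals_per_id1(rows):
    """Return every row with the per-ID1 total attached, without GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE YourTable (ID1 TEXT, ID2 TEXT, Value INTEGER)")
    conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT ID1, ID2, Value, "
        "       SUM(Value) OVER (PARTITION BY ID1) AS Total "
        "FROM YourTable "
        "ORDER BY ID1, ID2, Value"
    ).fetchall()
    conn.close()
    return result


with_totals = totals_per_id1(ROWS)
```

Every input row is kept, and the fourth column repeats the total for its ID1, which is the behaviour the question asked for.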
I am trying to join the columns "Type2" and "Measurement2" from table "Update" to the table "Have". I want the columns to align where column "Subject1" in table "Have" matches column "Subject2" in table "Update", and column "Procedure1" in table "Have" matches column "Procedure2" in table "Update". Thank you in advance.
data Have;
input Subject1 Type1 :$12. Date1 &:anydtdte. Procedure1 :$12. Measurement1;
format Date1 yymmdd10.;
datalines;
500 Initial 15 AUG 2017 Invasive 20
500 Initial 15 AUG 2017 Surface 35
428 Initial 3 JUL 2017 Outer 10
765 Initial 20 JUL 2019 Other 19
610 Initial 17 Mar 2018 Invasive 17
;
data Update;
input Subject2 Type2 :$12. Date2 &:anydtdte. Procedure2 :$12. Measurement2;
format Date2 yymmdd10.;
datalines;
500 Followup 15 AUG 2018 Invasive 54
428 Followup 15 AUG 2018 Outer 29
765 Seventh 3 AUG 2018 Other 13
500 Followup 3 JUL 2018 Surface 98
610 Third 20 AUG 2019 Invasive 66
;
Are you just looking for a join between two tables?
Select distinct have.*, update.type2, update.measurement2
from have
left join update
on
have.subject1 = update.subject2
and have.procedure1 = update.procedure2
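To see what a two-key left join of this kind returns, here is a small sketch in Python's sqlite3 module rather than SAS. The second table is named updates because UPDATE is a keyword in SQLite, and the date columns are left out for brevity; both choices are assumptions of the sketch.

```python
import sqlite3

HAVE = [
    (500, "Initial", "Invasive", 20),
    (500, "Initial", "Surface", 35),
    (428, "Initial", "Outer", 10),
    (765, "Initial", "Other", 19),
    (610, "Initial", "Invasive", 17),
]
UPDATES = [
    (500, "Followup", "Invasive", 54),
    (428, "Followup", "Outer", 29),
    (765, "Seventh", "Other", 13),
    (500, "Followup", "Surface", 98),
    (610, "Third", "Invasive", 66),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE have (Subject1 INTEGER, Type1 TEXT, Procedure1 TEXT, Measurement1 INTEGER)")
conn.execute("CREATE TABLE updates (Subject2 INTEGER, Type2 TEXT, Procedure2 TEXT, Measurement2 INTEGER)")
conn.executemany("INSERT INTO have VALUES (?, ?, ?, ?)", HAVE)
conn.executemany("INSERT INTO updates VALUES (?, ?, ?, ?)", UPDATES)

# Match on both keys: subject and procedure.
joined = conn.execute(
    "SELECT h.Subject1, h.Procedure1, h.Measurement1, u.Type2, u.Measurement2 "
    "FROM have h "
    "LEFT JOIN updates u "
    "  ON h.Subject1 = u.Subject2 AND h.Procedure1 = u.Procedure2 "
    "ORDER BY h.Subject1, h.Procedure1"
).fetchall()
conn.close()
```

Because both keys are in the ON clause, subject 500 gets a different update row for its Invasive and Surface procedures instead of both rows matching both updates.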
Combining two data sets based on a key (your subject and procedure) is performed using a MERGE according to the group variables named in a BY statement. Both data sets need the same BY variables.
Example code:
MERGE requires sorted data, so that will have to occur first.
Data set option rename= is used to create common names for the BY statement.
proc sort data=Have; by Subject1 Procedure1;
proc sort data=Updates; by Subject2 Procedure2;
data combined;
* trick: force these variables to be first two columns in output data set;
retain subject procedure;
merge
have (rename=(subject1=subject procedure1=procedure))
updates (rename=(subject2=subject procedure2=procedure))
;
by subject procedure;
run;
Example data:
data Have;
attrib
Subject1 length=8
Type1 length=$12
Date1 informat=anydtdte. format=yymmdd10.
Procedure1 length=$12
Measurement1 length=8
;
input
Subject1& Type1& Date1& Procedure1& Measurement1&; datalines;
500 Initial 15 AUG 2017 Invasive 20
500 Initial 15 AUG 2017 Surface 35
428 Initial 3 JUL 2017 Outer 10
765 Initial 20 JUL 2019 Other 19
610 Initial 17 Mar 2018 Invasive 17
;
data Updates;
attrib
Subject2 length=8
Type2 length=$12
Date2 informat=anydtdte. format=yymmdd10.
Procedure2 length=$12
Measurement2 length=8
;
input
Subject2& Type2& Date2& Procedure2& Measurement2&; datalines;
500 Followup 15 AUG 2018 Invasive 54
428 Followup 15 AUG 2018 Outer 29
765 Seventh 3 AUG 2018 Other 13
500 Followup 3 JUL 2018 Surface 98
610 Third 20 AUG 2019 Invasive 66
;
I have data such as the following.
Table 1 (after converting the data into the format I need, using the subquery after the left join in the query later in this question):
The source initially holds ticket details such as date, ticket number, and ticket type.
Monthyear Premiumold Silverold
-----------------------------------
Jan 2019 233 156
Feb 2019 344 258
Mar 2019 222 298
Table 2 (predicted values, pushed from a different source in the same format):
Monthyear Premium silver
----------------------------
Apr 2019 284 312
May 2019 267 344
Jun 2019 223 356
Jul 2019 244 367
Aug 2019 234 373
I want to get this data to be in a format such as:
Monthyear Premiumold Silverold Premium silver
---------------------------------------------------------
Jan 2019 233 156 NULL NULL
Feb 2019 344 258 NULL NULL
Mar 2019 222 298 NULL NULL
Apr 2019 NULL NULL 284 312
May 2019 NULL NULL 267 344
Jun 2019 NULL NULL 223 356
Jul 2019 NULL NULL 244 367
Aug 2019 NULL NULL 234 373
This basically puts the months together and leaves NULL wherever data isn't present.
I have tried:
select *
from
((select Monthyear, Premium, Silver
from [dbo].[Predicted]) c
left join
(select
case when (tickettype = 'Premium')
then count(number)
end as Premiumold,
case when (tickettype = 'Silver')
then count(number)
end as Silverold,
concat(convert(char(3), a.date, 0), ' ', year(a.date)) as Monthyear
from
openquery(SNOW, 'select number,date, ticket_type from ticketdata
where date between ''2019-01-01 00:00:00'' and ''2019-02-28 23:59:59''')a
group by concat(convert(char(3), a.sys_created_on, 0),' ',year(a.date)),tickettype) as b
on c.Monthyear = b.Monthyear)
This obviously isn't returning what I want.
Please help me with this.
Thanks!
Try this.
select ISNULL(a.monthyear,b.monthyear),a.Premiumold,silverold,Premium,silver from Table1 a
full join Table2 b on a.monthyear=b.MonthYear
Use UNION ALL:
select Monthyear, Premiumold, Silverold, null as Premium, null as silver
from Table1
union all
select Monthyear, null, null, Premium, silver from Table2
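A minimal sketch of the UNION ALL approach, run through Python's sqlite3 module, is shown below. The table names Table1 and Table2 follow the first answer, and the use of SQLite instead of SQL Server is an assumption made so the example is self-contained.

```python
import sqlite3

TABLE1 = [("Jan 2019", 233, 156), ("Feb 2019", 344, 258), ("Mar 2019", 222, 298)]
TABLE2 = [
    ("Apr 2019", 284, 312), ("May 2019", 267, 344), ("Jun 2019", 223, 356),
    ("Jul 2019", 244, 367), ("Aug 2019", 234, 373),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Monthyear TEXT, Premiumold INTEGER, Silverold INTEGER)")
conn.execute("CREATE TABLE Table2 (Monthyear TEXT, Premium INTEGER, silver INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", TABLE1)
conn.executemany("INSERT INTO Table2 VALUES (?, ?, ?)", TABLE2)

# Each half pads the columns it does not have with NULL,
# so both halves end up with the same five columns.
combined = conn.execute(
    "SELECT Monthyear, Premiumold, Silverold, NULL AS Premium, NULL AS silver "
    "FROM Table1 "
    "UNION ALL "
    "SELECT Monthyear, NULL, NULL, Premium, silver "
    "FROM Table2"
).fetchall()
conn.close()
```

NULL comes back as None in Python. Without an ORDER BY the row order is not guaranteed, and the "Mon YYYY" text does not sort chronologically, so a real query would likely need a proper date column to order by.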
I have two tables:
Meter
ID SerialNumber
=======================
1 ABC1
2 ABC2
3 ABC3
4 ABC4
5 ABC5
6 ABC6
RegisterLevelInformation
ID MeterID ReadValue Consumption PreviousReadDate ReadType
============================================================================
1 1 250 250 1 jan 2015 EST
2 1 550 300 1 feb 2015 ACT
3 1 1000 450 1 apr 2015 EST
4 2 350 350 1 jan 2015 EST
5 2 850 500 1 feb 2015 ACT
6 2 1000 150 1 apr 2015 ACT
7 3 1500 1500 1 jan 2015 EST
8 3 2500 1000 1 mar 2015 EST
9 3 5000 2500 4 apr 2015 EST
10 4 250 250 1 jan 2015 EST
11 4 550 300 1 feb 2015 ACT
12 4 1000 450 1 apr 2015 EST
13 5 350 350 1 jan 2015 ACT
14 5 850 500 1 feb 2015 ACT
15 5 1000 150 1 apr 2015 ACT
16 6 1500 1500 1 jan 2015 EST
17 6 2500 1000 1 mar 2015 EST
18 6 5000 2500 4 apr 2015 EST
I am trying to group by meter serial and return the last actual read date for each of the meters but I am unsure as to how to accomplish this. Here is the sql I have thus far:
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode, PreviousReadDate
order by a.SerialNumber
I can't seem to get the MAX function to take effect in returning only the latest actual reading row and it returns all dates and the same meter serial is displayed several times.
If I use the following sql:
select a.SerialNumber, count(*) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
group by a.SerialNumber
order by a.SerialNumber
then each serial is shown only once. Any help would be greatly appreciated.
Like @PaulGriffin said in his comment, you need to remove the PreviousReadDate column from your GROUP BY clause.
Why are you experiencing this behaviour?
Your GROUP BY currently partitions the rows by (SerialNumber, ReadTypeCode, PreviousReadDate), so the query returns one row of SerialNumber, ReadTypeCode, MAX(PreviousReadDate) for each distinct combination of those three values. Because PreviousReadDate is itself one of the grouping columns, every group contains only a single date, and MAX() of a single value is just that value. The output is therefore the same as it would be without MAX().
What you wanted to achieve
Get the MAX value of PreviousReadDate for every pair of (SerialNumber, ReadTypeCode). So these two columns are what your GROUP BY clause should include.
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode
order by a.SerialNumber
This is the correct SQL query for what you want.
Difference example
ID MeterID ReadValue Consumption PreviousReadDate ReadType
============================================================================
1 1 250 250 1 jan 2015 EST
2 1 550 300 1 feb 2015 ACT
3 1 1000 450 1 apr 2015 EST
Here, if you apply the query that groups by 3 columns, you would get this result:
SerialNumber | ReadTypeCode | PreviousReadDate
ABC1 | EST | 1 jan 2015 -- which is MAX of 1 value (1 jan 2015)
ABC1 | ACT | 1 feb 2015
ABC1 | EST | 1 apr 2015
But when you group only by SerialNumber, ReadTypeCode, it would yield this result (considering the sample data that I posted):
SerialNumber | ReadTypeCode | PreviousReadDate
ABC1 | EST | 1 apr 2015 -- which is MAX of 2 values (1 jan 2015, 1 apr 2015)
ABC1 | ACT | 1 feb 2015 -- which is MAX of 1 value (because ReadTypeCode is different from the row above)
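The difference between the two groupings can be reproduced with a short sketch in Python's sqlite3 module. It uses only the three rows for meter ABC1, omits the WHERE filter just like the example above, joins on Meter.ID and uses the ReadType column as they appear in the table listing, and stores dates as ISO text so that MAX() compares them correctly; all of these are assumptions of the sketch.

```python
import sqlite3


def max_read_dates(group_by_date):
    """Run the query with or without PreviousReadDate in the GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Meter (ID INTEGER, SerialNumber TEXT);
        CREATE TABLE RegisterLevelInformation (
            ID INTEGER, MeterID INTEGER, PreviousReadDate TEXT, ReadType TEXT);
        INSERT INTO Meter VALUES (1, 'ABC1');
        INSERT INTO RegisterLevelInformation VALUES
            (1, 1, '2015-01-01', 'EST'),
            (2, 1, '2015-02-01', 'ACT'),
            (3, 1, '2015-04-01', 'EST');
    """)
    group_cols = "a.SerialNumber, b.ReadType"
    if group_by_date:
        # The problematic version: the aggregated column is also a grouping column.
        group_cols += ", b.PreviousReadDate"
    rows = conn.execute(
        "SELECT a.SerialNumber, b.ReadType, MAX(b.PreviousReadDate) "
        "FROM Meter AS a "
        "LEFT JOIN RegisterLevelInformation AS b ON a.ID = b.MeterID "
        "GROUP BY " + group_cols + " "
        "ORDER BY 1, 2, 3"
    ).fetchall()
    conn.close()
    return rows


three_columns = max_read_dates(group_by_date=True)
two_columns = max_read_dates(group_by_date=False)
```

With the date in the GROUP BY, all three readings come back; without it, there is one row per (SerialNumber, ReadType) carrying the latest date.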
Explanation of your second query
In this query, you are right: each serial is shown only once.
select a.SerialNumber, count(*) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
group by a.SerialNumber
order by a.SerialNumber
But this query would also produce odd results that you don't expect if you added more columns to the GROUP BY (which is what you did in your first query; try it yourself).
You need to remove PreviousReadDate from your Group By clause.
This is what your query should look like:
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode
order by a.SerialNumber
To understand how the group by clause works when you mention multiple columns, follow this link: Using group by on multiple columns
You will understand what was wrong with your query and why it returns all dates and the same meter serial is displayed several times.
Good luck!
Kudos! :)