How to get max of group by query - hive

Here is my dataset:
00000000040112 2702 00000000040112 AVAILABLE 1566921227223 -6.0 LB
00000000040112 2702 00000000040112 AVAILABLE 1566921247222 -9.0 LB
00030400791888 6065 00030400791888 AVAILABLE 1566919357992 45.0 EA
00030400791888 6065 00030400791888 AVAILABLE 1566919547809 72.0 EA
I am trying to get the max from each group, so based on the above data the expected result would be like this:
00000000040112 2702 00000000040112 AVAILABLE 1566921247222 -9.0 LB
00030400791888 6065 00030400791888 AVAILABLE 1566919547809 72.0 EA
My query which does not produce the correct result is:
select
primegtin, nodeid, gtin, inventory_state,
max(last_updated_time),
quantity_by_gtin, quantity_uom
from pit_by_prime_gtin
where
year=2019 and month =8 and day =27 and hour=15
group by
primegtin, nodeid, gtin, inventory_state,
last_updated_time,
quantity_by_gtin, quantity_uom ;
What could be wrong with it?

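You need to remove the column you are aggregating from the GROUP BY clause. Grouping by last_updated_time puts every distinct timestamp into its own group, so max() has nothing to reduce. quantity_by_gtin also differs from row to row, so it has to be aggregated or dropped from the grouping as well.
In your example it should probably be something like: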
select
primegtin, nodeid, gtin, inventory_state,
max(last_updated_time),
max(quantity_by_gtin), quantity_uom
from pit_by_prime_gtin
where
year=2019 and month =8 and day =27 and hour=15
group by
primegtin, nodeid, gtin, inventory_state,
quantity_uom ;
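One caveat about the query above: max(quantity_by_gtin) is evaluated independently of max(last_updated_time), so for the first group in the sample data it returns -6.0, not the -9.0 that belongs to the latest row. If the goal is the complete latest row per group, a common pattern is to join the table back to its per-group maximum. Below is a minimal sketch of that pattern, run through Python's sqlite3 module as a stand-in for Hive. The in-memory table and the omission of the year/month/day/hour partition filter are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pit_by_prime_gtin (
        primegtin TEXT, nodeid INTEGER, gtin TEXT, inventory_state TEXT,
        last_updated_time INTEGER, quantity_by_gtin REAL, quantity_uom TEXT
    )
""")
conn.executemany(
    "INSERT INTO pit_by_prime_gtin VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("00000000040112", 2702, "00000000040112", "AVAILABLE", 1566921227223, -6.0, "LB"),
        ("00000000040112", 2702, "00000000040112", "AVAILABLE", 1566921247222, -9.0, "LB"),
        ("00030400791888", 6065, "00030400791888", "AVAILABLE", 1566919357992, 45.0, "EA"),
        ("00030400791888", 6065, "00030400791888", "AVAILABLE", 1566919547809, 72.0, "EA"),
    ],
)

# Step 1 (subquery m): latest timestamp per group.
# Step 2 (join): pull the full row that carries that timestamp.
latest_rows = conn.execute("""
    SELECT t.primegtin, t.nodeid, t.gtin, t.inventory_state,
           t.last_updated_time, t.quantity_by_gtin, t.quantity_uom
    FROM pit_by_prime_gtin t
    JOIN (
        SELECT primegtin, nodeid, gtin, inventory_state,
               MAX(last_updated_time) AS max_time
        FROM pit_by_prime_gtin
        GROUP BY primegtin, nodeid, gtin, inventory_state
    ) m
      ON  t.primegtin = m.primegtin
      AND t.nodeid = m.nodeid
      AND t.gtin = m.gtin
      AND t.inventory_state = m.inventory_state
      AND t.last_updated_time = m.max_time
    ORDER BY t.primegtin
""").fetchall()

for row in latest_rows:
    print(row)
```

The same SELECT should carry over to Hive with the partition filter added to both the inner and outer query. A row_number() window over the group, ordered by last_updated_time descending, is another common way to express it.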

In Oracle SQL, is it possible to have an effective-dated query that also shows rows that occur between dates?

I have the following tables:
ProductGroup
|GroupID|Product Group|Product Date|
|-|-------|--------|
|A|Bicycles|1/1/2018|
|A|Two-Wheels|12/1/2018|
|A|Sport Bicycles|6/1/2019|
|A|Fast Bicycles|1/1/2020|
SubGroup
|SubgroupID|GroupID|SubGroup|SubGroupDate|
|-|-|-----|-----|
|1|A |wheels |06/01/2015|
|2|A |tires |10/01/2015|
|3|A |spokes |01/01/2017|
|4|A |chains |01/01/2019|
|5|A |brakes |03/01/2019|
I join them using a maximum effective dated query:
Select ProductName, ProductDate, tSubGroup.SubGroup, tSubGroup.SubGroupDate
FROM ProductGroup
Left Join (SELECT SubGroupName, SubGroupDate
FROM SubGroup
WHERE SubGroup.SubGroupDate = (Select max(SubGroupDate)
FROM SubGroup B
where b.SubGroupName = SubGroup.SubGroupDate
) ) tSubGroup on tSubGroup.GroupID = ProductGroup.GroupID
and tSubGroup.SubGroupDate <= ProductGroup.ProductDate
I get the following results as expected:
|ProductGroup |ProductDate |SubGroup |SubGroupDate|
|---|---|---|---|
|Bicycles |01/01/2018 |Spokes |01/01/2017|
|Two-Wheels |12/01/2018 |Spokes |01/01/2017|
|Sport Bicycles |06/01/2019 |Chains |01/01/2019|
|Fast Bicycles |01/01/2020 |Brakes |03/01/2019|
But what I want is this:
|ProductGroup |ProductDate |SubGroup |SubGroupDate|
|----|---|---|----|
|Bicycles |01/01/2018 |Spokes |01/01/2017|
|Two-Wheels |12/01/2018 |Spokes |01/01/2017|
|Sport Bicycles |06/01/2019 |Chains |01/01/2019|
|Sport Bicycles |06/01/2019 |Brakes |03/01/2019|
|Fast Bicycles |01/01/2020 |Brakes |03/01/2019|
In these results, there are two rows for Sport Bicycles, because the SubGroup Chains (1/1/2019) and Brakes (3/1/2019) occurred between the ProductGroup Two-Wheels (12/1/2018) and Sport Bicycles (6/1/2019).
I don't know how to combine the two tables to get all the rows, yet making sure there are no extra rows added. I tried using a FULL JOIN and different variations of LAG, LEAD, RANK, etc. I am just stuck on figuring this out.
Is there a way to produce the results I am looking for? Also, there will be three more columns to add to this when I am done.
Thank you for the help.
Hmmm . . . I think you want overlapping intervals. If so, lead() can help:
select pg.*, sg.*
from (select pg.*,
lead(productdate) over (partition by groupid order by productdate) as next_productdate
from productgroup pg
) pg join
(select sg.*,
lead(subgroupdate) over (partition by groupid order by subgroupdate) as next_subgroupdate
from SubGroup sg
) sg
on sg.groupid = pg.groupid and
sg.subgroupdate >= pg.productdate and
(sg.subgroupdate < pg.next_productdate or pg.next_productdate is null)
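For comparison, here is a sketch of a different formulation that reproduces the result table the question asks for. It is not the lead()-based join above: it swaps in LAG() to find the previous product date, keeps every subgroup row that falls after that date and up to the current product date, and always keeps the latest subgroup in effect. It runs on Python's sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in for Oracle. The snake_case table and column names and the ISO date strings are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_group (group_id TEXT, product_name TEXT, product_date TEXT);
    CREATE TABLE sub_group (subgroup_id INTEGER, group_id TEXT,
                            subgroup_name TEXT, subgroup_date TEXT);
    INSERT INTO product_group VALUES
        ('A', 'Bicycles',       '2018-01-01'),
        ('A', 'Two-Wheels',     '2018-12-01'),
        ('A', 'Sport Bicycles', '2019-06-01'),
        ('A', 'Fast Bicycles',  '2020-01-01');
    INSERT INTO sub_group VALUES
        (1, 'A', 'wheels', '2015-06-01'),
        (2, 'A', 'tires',  '2015-10-01'),
        (3, 'A', 'spokes', '2017-01-01'),
        (4, 'A', 'chains', '2019-01-01'),
        (5, 'A', 'brakes', '2019-03-01');
""")

rows = conn.execute("""
    WITH pg AS (
        SELECT group_id, product_name, product_date,
               LAG(product_date) OVER (PARTITION BY group_id
                                       ORDER BY product_date) AS prev_product_date
        FROM product_group
    )
    SELECT pg.product_name, pg.product_date, sg.subgroup_name, sg.subgroup_date
    FROM pg
    JOIN sub_group sg
      ON  sg.group_id = pg.group_id
      AND sg.subgroup_date <= pg.product_date
    WHERE sg.subgroup_date > pg.prev_product_date      -- changed since the previous product row
       OR sg.subgroup_date = (                          -- or is the subgroup currently in effect
              SELECT MAX(s2.subgroup_date)
              FROM sub_group s2
              WHERE s2.group_id = pg.group_id
                AND s2.subgroup_date <= pg.product_date)
    ORDER BY pg.product_date, sg.subgroup_date
""").fetchall()

for row in rows:
    print(row)
```

For the first product row prev_product_date is NULL, so the first condition is unknown and only the effective-dated subgroup is kept, which matches the single Spokes row in the desired output.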

Sum of column returning all null values in PySpark SQL

I am new to Spark and this might be a straightforward problem.
I have a DataFrame named sql_left, built with Spark SQL. Here is a sample row generated using sql_left.take(1):
[Row(REPORT_ID='2016-30-15/08/2019', Stats Area='2 Metropolitan', Suburb='GREENACRES', Postcode=5086, LGA Name='CITY OF PORT ADELAIDE ENFIELD', Total Units=3, Total Cas=0, Total Fats=0, Total SI=0, Total MI=0, Year=2016, Month='November', Day='Wednesday', Time='01:20 am', Area Speed=50, Position Type='Not Divided', Horizontal Align='Straight road', Vertical Align='Level', Other Feat='Not Applicable', Road Surface='Sealed', Moisture Cond='Dry', Weather Cond='Not Raining', DayNight='Night', Crash Type='Hit Parked Vehicle', Unit Resp=1, Entity Code='Driver Rider', CSEF Severity='1: PDO', Traffic Ctrls='No Control', DUI Involved=None, Drugs Involved=None, ACCLOC_X=1331135.04, ACCLOC_Y=1677256.22, UNIQUE_LOC=13311351677256, REPORT_ID='2016-30-15/08/2019', Unit No=2, No Of Cas=0, Veh Reg State='UNKNOWN', Unit Type='Motor Vehicle - Type Unknown', Veh Year='XXXX', Direction Of Travel='East', Sex=None, Age=None, Lic State=None, Licence Class=None, Licence Type=None, Towing='Unknown', Unit Movement='Parked', Number Occupants='000', Postcode=None, Rollover=None, Fire=None)]
Note: the Age column contains 'XXX', 'NULL', and integer-like values such as 023, 034, etc.
The printSchema shows Age,Total Cas as integers.
I've tried the below code to first join two tables:
sql_left = spark.sql('''
SELECT *
FROM sql_crash c Left JOIN sql_units u ON c.REPORT_ID=u.REPORT_ID''')
sql_left.createOrReplaceTempView("mytable")
And below code to generate Total Cas:
sql_result = spark.sql('''select concat_ws(' ', Day, Month,Year,Time) as Date_Time,Age,"Licence Type","Unit Type",Sex,COALESCE(sum("Total Cas"),0) as Total_casualities from mytable where Suburb in ('ADELAIDE','ADELAIDE AIRPORT','NORTH ADELAIDE','PORT ADELAIDE') Group by Date_Time, Age,"Licence Type","Unit Type",Sex order by Total_casualities desc''')
sql_result.show(20,truncate=False)
The output I'm getting is below with sum as 0.
+--------------------------------+---+------------+---------+-------+-----------------+
|Date_Time |Age|Licence Type|Unit Type|Sex |Total_casualities|
+--------------------------------+---+------------+---------+-------+-----------------+
|Friday December 2016 02:45 pm |XXX|Licence Type|Unit Type|Unknown|0.0 |
|Saturday September 2017 06:35 pm|023|Licence Type|Unit Type|Male |0.0 |
+--------------------------------+---+------------+---------+-------+-----------------+
I've tried multiple options, however nothing worked out.
My main problem here is Total_casualities is returning 0.0 for all rows if I use COALESCE(sum("Total Cas"),0). If I don't use COALESCE, it is displaying values as NULL.
Help is much appreciated.
Instead of putting Total Cas in double quotes ("Total Cas"), put it in backticks,
i.e. `Total Cas`
Note: in Spark SQL, a column name that contains a space must be quoted with backticks. A name in double quotes is treated as a string literal, which is why the sum comes back NULL. The same thing happens to the other quoted columns (Licence Type, Unit Type): the output shows the literal text instead of the column's value.
sql_result = spark.sql('''select concat_ws(' ', Day, Month,Year,Time) as Date_Time,Age,`Licence Type`,`Unit Type`,Sex,sum(`Total Cas`) as Total_casualities from mytable where Suburb in ('ADELAIDE','ADELAIDE AIRPORT','NORTH ADELAIDE','PORT ADELAIDE') Group by Date_Time, Age,`Licence Type`,`Unit Type`,Sex order by Total_casualities desc''')

Oracle, SQL Conditional exclusion based on the items in the joined table

The following query excludes all FUEL products; however, I am trying to exclude a product only if R.OPERATING_UNIT = 'WP' and PRODUCT_CAT = 'FUEL' in the joined table. I don't know how to express that condition, and I would like to know the most efficient way to do it. Below are the query, the RESOURCE and PRODUCT tables, and the desired result set. I simplified both the tables and the query for the sake of explanation.
SELECT R.DEPTID,
R.FISCAL_YEAR,
sum(R.AMOUNT) total
FROM RESOURCE R
WHERE
R.PRODUCT_ID NOT IN (
SELECT PRODUCT_ID FROM PRODUCT WHERE PRODUCT_CAT='FUEL' )
group by R.FISCAL_YEAR,R.DEPTID
the RESOURCE table
DPTID FISCAL_YEAR OPERATING_UNIT AMOUNT PRODUCT
PTT 2017 WP 1200 31000
PTT 2017 SP 3000 32000
PTT 2017 GP 1000 31000
PTT 2017 WP 1000 32000
FPP 2017 WP 1000 32000
FPP 2018 GP 2000 33000
FPP 2017 SP 1000 32000
FPP 2018 WP 2200 31000
PRODUCT Table:
PRODUCT PRODUCT_CAT
31000 FUEL
32000 NON-FUEL
33000 MATERIAL
Result set. Note that it is ignoring WP when calculating the sum.
2017 PTT 5000 (ignored 1200 since operating unit=wp and product is 31000->FUEL but included wp and 32000)
2017 FPP 2000
2018 FPP 2000 (it did not consider the 2200 since operating unit=wp and product is 31000->FUEL)
The WP filter should work after you change the statement below
NOT IN (
SELECT PRODUCT_ID FROM PRODUCT WHERE PRODUCT_CAT='FUEL' )
to IN, combine it with the operating unit test, and negate the pair, so that only rows that are both 'WP' and 'FUEL' are dropped.
SELECT R.DEPTID,
R.FISCAL_YEAR,
sum(R.AMOUNT) total
FROM RESOURCE R
WHERE NOT (
R.OPERATING_UNIT = 'WP' and
R.PRODUCT_ID IN
(
SELECT PRODUCT_ID FROM PRODUCT WHERE PRODUCT_CAT='FUEL' ) )
group by R.FISCAL_YEAR,R.DEPTID
Adding two not-equal conditions joined by OR filters out only the rows where both OPERATING_UNIT = 'WP' and PRODUCT_CAT = 'FUEL':
SELECT r.DEPTID
,r.FISCAL_YEAR
,SUM(r.AMOUNT) AS TOTAL
FROM [Resource] r
INNER JOIN [Product] p ON r.PRODUCT = p.PRODUCT
WHERE r.OPERATING_UNIT != 'WP'
OR p.PRODUCT_CAT != 'FUEL'
GROUP BY r.DEPTID
,r.FISCAL_YEAR
I used the query below to view the joined data and verify that the filter keeps the 6 of 8 rows I wanted.
SELECT *
FROM [Resource] r
INNER JOIN [Product] p ON r.PRODUCT = p.PRODUCT
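The OR-based filter can be checked against the sample data from the question. The sketch below loads both tables into an in-memory SQLite database through Python's sqlite3 module (a stand-in for Oracle, so the bracketed identifiers are written without brackets) and compares the totals with the desired result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (deptid TEXT, fiscal_year INTEGER,
                           operating_unit TEXT, amount INTEGER, product INTEGER);
    CREATE TABLE product (product INTEGER, product_cat TEXT);
    INSERT INTO resource VALUES
        ('PTT', 2017, 'WP', 1200, 31000),
        ('PTT', 2017, 'SP', 3000, 32000),
        ('PTT', 2017, 'GP', 1000, 31000),
        ('PTT', 2017, 'WP', 1000, 32000),
        ('FPP', 2017, 'WP', 1000, 32000),
        ('FPP', 2018, 'GP', 2000, 33000),
        ('FPP', 2017, 'SP', 1000, 32000),
        ('FPP', 2018, 'WP', 2200, 31000);
    INSERT INTO product VALUES
        (31000, 'FUEL'), (32000, 'NON-FUEL'), (33000, 'MATERIAL');
""")

# A row survives unless it is BOTH operating unit 'WP' AND category 'FUEL'
# (De Morgan: NOT (a AND b) is the same as (NOT a) OR (NOT b)).
totals = conn.execute("""
    SELECT r.deptid, r.fiscal_year, SUM(r.amount) AS total
    FROM resource r
    INNER JOIN product p ON r.product = p.product
    WHERE r.operating_unit != 'WP'
       OR p.product_cat != 'FUEL'
    GROUP BY r.deptid, r.fiscal_year
    ORDER BY r.fiscal_year, r.deptid
""").fetchall()

for row in totals:
    print(row)
```

The two WP/FUEL rows (1200 and 2200) are the only ones dropped, which gives 5000 for PTT 2017, 2000 for FPP 2017, and 2000 for FPP 2018.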
For ease of writing - and reading - the exclusion condition, it would be nice if we could work with tuples. And we can. One benefit is that it will be easy, in the future, to add other pairs of operating unit and product category to the exclusion list, without having to write lengthy conditions with lots of OR and AND.
If you run a query like this, and then you take a look at the EXPLAIN PLAN for the query, you will see that the parser expanded the tuple condition to a long logical expression with OR (and AND, if more than one tuple is excluded) - so the end result is the same, but the code looks more natural.
select r.deptid, r.fiscal_year, sum(r.amount) as total
from resource r inner join product p on r.product = p.product
where (r.operating_unit, p.product_cat) not in ( ('WP', 'FUEL') )
group by r.deptid, r.fiscal_year
;
Regarding NULL: if either r.operating_unit or p.product_cat can be NULL, you need to state how such rows should be handled. If, for example, the operating unit is WP but the product category is NULL, the corresponding row will be excluded by the query above, because the NOT IN condition evaluates to unknown rather than true. That may be the proper handling: the unit is definitely 'WP', and since the product category is unknown, it may be 'FUEL', so we may choose to exclude the row. Obviously, if both columns are NOT NULL then this is not an issue.
Note - I hope you don't really have a table PRODUCT with a column PRODUCT; that will lead to confusion which almost always then leads to bugs.
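The NULL handling described above can be seen in a small experiment. This sketch uses Python's sqlite3 module as a stand-in for Oracle and writes the exclusion in its expanded logical form, NOT (unit = 'WP' AND cat = 'FUEL'), rather than the tuple syntax. The table name joined and the label column are hypothetical, used only to identify the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "joined" is a hypothetical table representing rows after the join.
conn.execute(
    "CREATE TABLE joined (label TEXT, operating_unit TEXT, product_cat TEXT)"
)
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?)",
    [
        ("WP / FUEL", "WP", "FUEL"),          # condition is false   -> excluded
        ("WP / NON-FUEL", "WP", "NON-FUEL"),  # condition is true    -> kept
        ("WP / NULL", "WP", None),            # condition is unknown -> excluded
        ("SP / NULL", "SP", None),            # 'SP' = 'WP' is false, so true -> kept
        ("NULL / FUEL", None, "FUEL"),        # condition is unknown -> excluded
    ],
)

# WHERE keeps only rows for which the predicate is true, not unknown.
kept = [
    row[0]
    for row in conn.execute("""
        SELECT label
        FROM joined
        WHERE NOT (operating_unit = 'WP' AND product_cat = 'FUEL')
        ORDER BY label
    """)
]
print(kept)
```

If rows with an unknown category should be kept instead, the NULL case has to be spelled out explicitly, for example by adding OR product_cat IS NULL to the condition.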

I am trying to write a SQL nested query that finds/uses a max value to find the entry just before the max value

I am fairly new to SQL and am trying to write a query that finds the last time a water meter was read so I can see the value. There is a table of properties that have meters and another table of meters that stores the inputs from engineers. Every input is listed as a sequence, a keyword lists the type of input and expression lists their entry. The max sequence will not always be the answer.
What I am looking for is the last time they read the water meter, and then also the electricity value from that reading, which is stored in the previous entry (sequence). To make it harder, engineers input the sequence number themselves: some go by ones (1,2,3) and others go by twos (2,4,6), so the previous entry may be minus one or minus two.
I can write the queries to find the max sequence and another one to find the entry one previous or two previous but cannot figure out how to make it into one query.
To find the max sequence for site 12345, I have:
SELECT MAX(M.SEQUENCE) maxseq
FROM METERS M JOIN PROPERTY P ON M.PROPNUM = P.PROPNUM
WHERE (P.CORP_ID ='12345' AND M.KEYWORD = 'WTR')
I manually search for the entry before to get the electricity entry with the following query.
SELECT P.NAME, P.CORP_ID, M.KEYWORD, M.SEQUENCE, M.EXPRESSION
FROM METERS M JOIN PROPERTY P ON M.PROPNUM = P.PROPNUM
WHERE (P.CORP_ID ='12345')
ORDER BY M.SEQUENCE
I have tried different nested queries but have not been able to write anything that will work.
The data that I am interested in for the meters table looks like:
PROPNUM SEQUENCE KEYWORD EXPRESSION
10a124 95 ELC 9845
10a124 96 WTR 4521
10a124 97 SVC A105
10a124 98 HEALTH GOOD
10a124 99 DAY 150209
10a124 100 HEALTH GOOD
10a124 101 ELC 10283
10a124 102 WTR 4621
I use the property table to find the PROPNUM for the site I am interested as I have the site's ID (CORP_ID) but not its PROPNUM value.
The result I would like to get back would look like below.
NAME WTR_EXPRESSION ELC_EXPRESSION
SMITH 4621 10283
You can inner join the METERS table to the PROPERTY table once for each KEYWORD, and require that the SEQUENCE for 'ELC' (guessing at the KEYWORD) is less than the 'WTR' SEQUENCE. Since you are on SQL Server, we can do this in a CTE and inner join that data set back to the METERS table to display the EXPRESSION values for each KEYWORD in a single row:
;with wtr_elc as (
select
p.PROPNUM,
p.NAME,
max(w.SEQUENCE) as max_wtr_seq,
max(e.SEQUENCE) as max_elc_seq
from PROPERTY as p
inner join METERS as w
on w.PROPNUM = p.PROPNUM
and w.KEYWORD = 'WTR'
inner join METERS as e
on e.PROPNUM = p.PROPNUM
and e.KEYWORD = 'ELC'
and e.SEQUENCE < w.SEQUENCE
where p.CORP_ID ='12345'
group by
p.PROPNUM,
p.NAME)
select
wtr_elc.NAME,
wtr.EXPRESSION as WTR_EXPRESSION,
elc.EXPRESSION as ELC_EXPRESSION
from METERS as wtr
inner join wtr_elc
on wtr_elc.PROPNUM = wtr.PROPNUM
and wtr_elc.max_wtr_seq = wtr.SEQUENCE
inner join METERS elc
on wtr_elc.PROPNUM = elc.PROPNUM
and wtr_elc.max_elc_seq = elc.SEQUENCE
and elc.KEYWORD = 'ELC'
where wtr.KEYWORD = 'WTR'
If you want to do this for more or all PROPERTY records, you can modify the where clause in the CTE.
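To see the CTE approach at work on the sample rows from the question, here is a sketch that runs it through Python's sqlite3 module as a stand-in for SQL Server. The PROPERTY row linking PROPNUM 10a124 to NAME 'SMITH' and CORP_ID '12345' is an assumption pieced together from the question's queries and expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROPERTY (PROPNUM TEXT, NAME TEXT, CORP_ID TEXT);
    CREATE TABLE METERS (PROPNUM TEXT, SEQUENCE INTEGER, KEYWORD TEXT, EXPRESSION TEXT);
    -- Assumed mapping between the site ID and its PROPNUM.
    INSERT INTO PROPERTY VALUES ('10a124', 'SMITH', '12345');
    INSERT INTO METERS VALUES
        ('10a124',  95, 'ELC',    '9845'),
        ('10a124',  96, 'WTR',    '4521'),
        ('10a124',  97, 'SVC',    'A105'),
        ('10a124',  98, 'HEALTH', 'GOOD'),
        ('10a124',  99, 'DAY',    '150209'),
        ('10a124', 100, 'HEALTH', 'GOOD'),
        ('10a124', 101, 'ELC',    '10283'),
        ('10a124', 102, 'WTR',    '4621');
""")

result = conn.execute("""
    WITH wtr_elc AS (
        -- Latest WTR sequence, and latest ELC sequence that precedes a WTR entry.
        SELECT p.PROPNUM, p.NAME,
               MAX(w.SEQUENCE) AS max_wtr_seq,
               MAX(e.SEQUENCE) AS max_elc_seq
        FROM PROPERTY AS p
        INNER JOIN METERS AS w
                ON w.PROPNUM = p.PROPNUM AND w.KEYWORD = 'WTR'
        INNER JOIN METERS AS e
                ON e.PROPNUM = p.PROPNUM AND e.KEYWORD = 'ELC'
               AND e.SEQUENCE < w.SEQUENCE
        WHERE p.CORP_ID = '12345'
        GROUP BY p.PROPNUM, p.NAME
    )
    SELECT wtr_elc.NAME,
           wtr.EXPRESSION AS WTR_EXPRESSION,
           elc.EXPRESSION AS ELC_EXPRESSION
    FROM METERS AS wtr
    INNER JOIN wtr_elc
            ON wtr_elc.PROPNUM = wtr.PROPNUM
           AND wtr_elc.max_wtr_seq = wtr.SEQUENCE
    INNER JOIN METERS AS elc
            ON wtr_elc.PROPNUM = elc.PROPNUM
           AND wtr_elc.max_elc_seq = elc.SEQUENCE
           AND elc.KEYWORD = 'ELC'
    WHERE wtr.KEYWORD = 'WTR'
""").fetchall()

print(result)
```

Because the ELC row is found by comparing sequence numbers rather than subtracting a fixed step, it does not matter whether the engineer numbered entries by ones or by twos.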

Pivot Table from Three Tables without Aggregate

Apologies for my lack of knowledge on this, I've researched here and elsewhere but have hit a brick wall (my brain). I'm trying to display the rates for a villa in a table like this:
SPRING SUMMER FALL WINTER MAX GUESTS
2 Rooms $343 $288 $389 $467 2
3 Rooms $456 $415 $536 $756 4
Whole Villa $809 $789 $906 $1023 6
I assume that PIVOT is the answer to my woes. I'm using SQL Server 2008 on MS Server 2008 R2
The seasons, packages and prices are stored in 3 tables like this:
CONFIGURATIONS
--------------
configurationID
configurationName
maximumGuests
SEASONS
-------
seasonID
seasonName
CONFIGURATIONSEASONRATES
------------------------
seasonID
configurationID
price
I got as far as this based on the examples I've been able to find:
SELECT 'Packages', 'Summer', 'Winter', 'Christmas', 'Tropical'
FROM
(SELECT ACCOMMODATION_configurations.configurationName, price
FROM ACCOMMODATION_configurations INNER JOIN
ACCOMMODATION_configurationSeasonRates ON
ACCOMMODATION_configurations.configurationID = ACCOMMODATION_configurationSeasonRates.configurationID INNER JOIN
ACCOMMODATION_seasons ON ACCOMMODATION_configurationSeasonRates.seasonID = ACCOMMODATION_seasons.seasonID) as somethingNice
PIVOT (sum(price) for ACCOMMODATION_configurations.configurationName IN (['Summer'],['Winter'],['Christmas'],['Tropical'])) as anyThing
But I get an error saying
The column prefix 'ACCOMMODATION_configurations' does not match with a table name or alias used in the query
I then tried replacing SELECT 'Packages' with SELECT ACCOMMODATION_configurations.configurationName but then I am told that:
SELECT ACCOMMODATION_configurations.configurationName cannot be bound
Thanks in advance for any help!
Inside your PIVOT clause, remove the ACCOMMODATION_configurations prefix. Also remove the single quotes around the values in the FOR ... IN list and in the outer SELECT list (quoted, they are string literals rather than column names), and add the Packages column to the inner select.
So the code will be:
-- this select will display the packages and the seasons
SELECT Packages, Summer, Winter, Christmas, Tropical
FROM
(
-- add Packages to this select list
SELECT Packages, ac.configurationName, price
FROM ACCOMMODATION_configurations ac
INNER JOIN ACCOMMODATION_configurationSeasonRates sr
ON ac.configurationID = sr.configurationID
INNER JOIN ACCOMMODATION_seasons s
ON sr.seasonID = s.seasonID
) as somethingNice
PIVOT
(
sum(price)
for configurationName IN ([Summer],[Winter],[Christmas],[Tropical])
) as anyThing
Edit: based on your comment, it seems you might want:
-- this select will display the packages and the seasons
SELECT Packages, Summer, Winter, Christmas, Tropical
FROM
(
SELECT ac.configurationName as Packages, price, seasonName
FROM ACCOMMODATION_configurations ac
INNER JOIN ACCOMMODATION_configurationSeasonRates sr
ON ac.configurationID = sr.configurationID
INNER JOIN ACCOMMODATION_seasons s
ON sr.seasonID = s.seasonID
) as somethingNice
PIVOT
(
sum(price)
for seasonName IN ([Summer],[Winter],[Christmas],[Tropical])
) as anyThing
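The same reshaping can also be written without PIVOT, using conditional aggregation (one CASE expression per season inside an aggregate). The sketch below is that swapped-in technique, not the PIVOT query above; it runs on Python's sqlite3 module as a stand-in for SQL Server and also carries maximumGuests through to the output. The prices and season names come from the table in the question; the ID values and the unprefixed table names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE configurations (configurationID INTEGER, configurationName TEXT,
                                 maximumGuests INTEGER);
    CREATE TABLE seasons (seasonID INTEGER, seasonName TEXT);
    CREATE TABLE configurationSeasonRates (seasonID INTEGER, configurationID INTEGER,
                                           price INTEGER);
    -- ID values are assumed; names, guests and prices follow the question.
    INSERT INTO configurations VALUES
        (1, '2 Rooms', 2), (2, '3 Rooms', 4), (3, 'Whole Villa', 6);
    INSERT INTO seasons VALUES
        (1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter');
    INSERT INTO configurationSeasonRates VALUES
        (1, 1, 343), (2, 1, 288), (3, 1, 389), (4, 1, 467),
        (1, 2, 456), (2, 2, 415), (3, 2, 536), (4, 2, 756),
        (1, 3, 809), (2, 3, 789), (3, 3, 906), (4, 3, 1023);
""")

# One output row per configuration; each CASE picks out a single season's price.
rates = conn.execute("""
    SELECT c.configurationName AS Packages,
           MAX(CASE WHEN s.seasonName = 'Spring' THEN r.price END) AS Spring,
           MAX(CASE WHEN s.seasonName = 'Summer' THEN r.price END) AS Summer,
           MAX(CASE WHEN s.seasonName = 'Fall'   THEN r.price END) AS Fall,
           MAX(CASE WHEN s.seasonName = 'Winter' THEN r.price END) AS Winter,
           c.maximumGuests AS MaxGuests
    FROM configurations c
    INNER JOIN configurationSeasonRates r ON c.configurationID = r.configurationID
    INNER JOIN seasons s ON r.seasonID = s.seasonID
    GROUP BY c.configurationID, c.configurationName, c.maximumGuests
    ORDER BY c.configurationID
""").fetchall()

for row in rates:
    print(row)
```

With one price per configuration and season, MAX simply returns that price, so the aggregate is only a formality, much like sum(price) in the PIVOT version.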