get latest column value from hive table conditionally on other columns - sql

I have a Hive table 'Orders' with four columns (id String, name String, Order String, ts String). Sample data from the table is shown below.
-------------------------------------------
id  name  order      ts
-------------------------------------------
1   abc   completed  2018-04-12 08:15:26
2   def   received   2018-04-15 06:20:17
3   ghi   processed  2018-04-16 11:36:56
4   jkl   received   2018-04-05 12:23:34
3   ghi   received   2018-03-23 16:43:46
1   abc   processed  2018-03-17 18:39:22
1   abc   received   2018-02-25 20:07:56
The Order column has three states: received -> processed -> completed. There are many orders for a single name, and each one goes through these three stages. I need the latest value of order for a given 'id' and 'name'. This may seem like a novice question, but I am stuck on it.
I tried writing queries like the ones below, but they are not working, and I couldn't use the max function directly on the 'ts' column as it is in String format. Please advise on the best method.
Thanks in advance.
Queries I tried
SELECT
ORDER
FROM Orders
WHERE id = '1'
AND name = 'ghi'
AND ts = (
SELECT max(unix_timestamp(ts, 'yyyy-MM-dd HH:mm:SS'))
FROM Orders
)
Error while compiling statement: FAILED: ParseException line 2:0 cannot recognize input near 'select' 'max' '(' in expression specification
SELECT
ORDER
FROM Orders
WHERE id = '1'
AND name = 'ghi'
AND max(unix_timestamp(ts, 'yyyy-MM-dd HH:mm:SS'))
Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 1:93 Not yet supported place for UDAF 'max'
select o.order from Orders o
inner join (
select id, name, order, max(ts) as ts
from Orders
group by id, name, order
) ord on o.id = ord.id and o.name = ord.name and o.ts = ord.ts where o.id = '1' and o.name = 'abc'
This query executed, but the output is not the single latest order stage. Instead it returns every order stage with its own latest timestamp.
Please help.

For a given order, you want one row. Hence, you can use order by and limit:
SELECT o.*
FROM Orders o
WHERE id = '1' AND -- id is declared as a String in the question
name = 'ghi'
ORDER BY ts DESC
LIMIT 1;
This should also perform well, because only the rows that match the filter need to be sorted.
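A minimal sketch of the ORDER BY / LIMIT approach, run on SQLite through Python's sqlite3 module rather than on Hive (an assumption made only so the example is self-contained). The rows are the sample data from the question, and the helper name latest_order is hypothetical. The column is quoted as "order" because ORDER is a reserved word. Since every ts string uses the same fixed-width yyyy-MM-dd HH:mm:ss layout, sorting the strings as text gives chronological order, so no conversion is needed here.

```python
import sqlite3

# Sample rows copied from the question: (id, name, order, ts).
ROWS = [
    ("1", "abc", "completed", "2018-04-12 08:15:26"),
    ("2", "def", "received", "2018-04-15 06:20:17"),
    ("3", "ghi", "processed", "2018-04-16 11:36:56"),
    ("4", "jkl", "received", "2018-04-05 12:23:34"),
    ("3", "ghi", "received", "2018-03-23 16:43:46"),
    ("1", "abc", "processed", "2018-03-17 18:39:22"),
    ("1", "abc", "received", "2018-02-25 20:07:56"),
]


def latest_order(conn, order_id, name):
    """Return the most recent order state for (id, name), or None if no row matches."""
    row = conn.execute(
        'SELECT "order" FROM Orders WHERE id = ? AND name = ? '
        "ORDER BY ts DESC LIMIT 1",
        (order_id, name),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Orders (id TEXT, name TEXT, "order" TEXT, ts TEXT)')
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?)", ROWS)

print(latest_order(conn, "1", "abc"))  # completed
```

Note that id '1' belongs to name 'abc' in the sample data, so filtering on id '1' together with name 'ghi' matches no rows at all.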

You can use the RANK analytic function to solve this, as below. The window must be ordered by ts descending so that rank 1 is the latest row, and order is a reserved word in Hive, so the column name is quoted with backticks:
select id, name, `order`, ts
from (select id, name, `order`, ts, rank() over (partition by id, name order by ts desc) r from orders) k
where r = 1
and id = '1'
and name = 'ghi'
If you want the latest record for every id and name, simply drop the two filters on id and name.
All the best!!!
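As a rough check of the window-function approach for all (id, name) pairs at once, here is a sketch that runs on SQLite via Python's sqlite3 module instead of Hive (an assumption for the sake of a runnable example; window functions need SQLite 3.25 or newer). The rows are the question's sample data, and the function name latest_per_pair is hypothetical. The window is ordered by ts DESC so that rank 1 is the newest row in each partition.

```python
import sqlite3

# Sample rows copied from the question: (id, name, order, ts).
ROWS = [
    ("1", "abc", "completed", "2018-04-12 08:15:26"),
    ("2", "def", "received", "2018-04-15 06:20:17"),
    ("3", "ghi", "processed", "2018-04-16 11:36:56"),
    ("4", "jkl", "received", "2018-04-05 12:23:34"),
    ("3", "ghi", "received", "2018-03-23 16:43:46"),
    ("1", "abc", "processed", "2018-03-17 18:39:22"),
    ("1", "abc", "received", "2018-02-25 20:07:56"),
]

# Rank rows inside each (id, name) partition, newest first, and keep rank 1.
LATEST_PER_PAIR = """
SELECT id, name, "order", ts
FROM (
    SELECT id, name, "order", ts,
           RANK() OVER (PARTITION BY id, name ORDER BY ts DESC) AS r
    FROM Orders
) AS k
WHERE r = 1
ORDER BY id
"""


def latest_per_pair(rows):
    """Load rows into an in-memory table and return the newest row per (id, name)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE Orders (id TEXT, name TEXT, "order" TEXT, ts TEXT)')
    conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(LATEST_PER_PAIR).fetchall()
    conn.close()
    return result


for row in latest_per_pair(ROWS):
    print(row)
```

RANK returns every tied row when two rows in a partition share the same latest ts; ROW_NUMBER would pick exactly one of them instead.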

Related

How to return null values if there is no data to display in BigQuery [duplicate]

I want my query to return null values when BigQuery reports "There is no data to display".
A row of nulls only comes back when the query selects nothing but aggregate functions. How can I modify the query below so that it returns null values?
My query:
select oid, date, coalesce(sum(quantity_sold),0) as quantity_sold
from table
where oid = 'xxx' and (date >= 'xxx' and date <= 'xxx')
group by 1,2
I found this similar SO question, but its answer creates a column containing the message "Results not found" and assigns null values to the other columns. You can apply the same approach, drop the message column, and keep only the null values. Your query would then look like this:
with sample_data as (
select 123 as oid, '2022-01-01' as date, 23 as quantity_sold
union all select 111 as oid, '2022-01-02' as date, 24 as quantity_sold
),
actual_query as (
select oid,date,coalesce(sum(quantity_sold),0) as quantity_sold
from sample_data
where oid = 534 and (date >= '2021-03-23' and date <= '2021-04-23')
group by 1,2
)
-- query below is the modified query from the linked SO question above
select actual_query.*
from actual_query
union all
select actual_query.* -- all the `actual_query` columns will be `NULL`
from (select 1) left join
actual_query
on 1 = 0 -- always false
where not exists (select 1 from actual_query);
NOTE: I created random values for sample data that could mimic the message "There is no data to display" when I ran your query.
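The fallback-row pattern can be tried outside BigQuery. The sketch below runs the same shape of query on SQLite through Python's sqlite3 module (an assumption; BigQuery itself is not involved and its behavior may differ in details). The two sample rows come from the answer above; the function name query_or_null_row and the alias dummy are hypothetical.

```python
import sqlite3

# Same structure as the answer: the real query, plus a UNION ALL branch that
# contributes one all-NULL row only when the real query returns nothing.
QUERY = """
WITH actual_query AS (
    SELECT oid, date, COALESCE(SUM(quantity_sold), 0) AS quantity_sold
    FROM sample_data
    WHERE oid = :oid AND date >= :start AND date <= :end
    GROUP BY oid, date
)
SELECT actual_query.* FROM actual_query
UNION ALL
SELECT actual_query.*
FROM (SELECT 1) AS dummy
LEFT JOIN actual_query ON 1 = 0  -- never matches, so the columns are NULL
WHERE NOT EXISTS (SELECT 1 FROM actual_query)
"""


def query_or_null_row(oid, start, end):
    """Run QUERY against the two sample rows from the answer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample_data (oid INTEGER, date TEXT, quantity_sold INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sample_data VALUES (?, ?, ?)",
        [(123, "2022-01-01", 23), (111, "2022-01-02", 24)],
    )
    rows = conn.execute(QUERY, {"oid": oid, "start": start, "end": end}).fetchall()
    conn.close()
    return rows


print(query_or_null_row(534, "2021-03-23", "2021-04-23"))  # no match: one NULL row
print(query_or_null_row(123, "2022-01-01", "2022-01-31"))  # match: the real row
```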

How to join tables in SQL?

I have two tables named shoes_type and shoes_list. The shoes_type table includes shoes_id, shoes_size, shoes_type, date, project_id. The shoes_list table has shoes_quantity, shoes_id, shoes_color, date, project_id.
I need to get the sum of shoes_quantity based on the shoes_type, shoes_size, date, and also project_id.
I get how to sum the shoes_quantity based on color by doing:
select shoes_color, sum(shoes_quantity)
from shoes_list group by shoes_color
Basically what I want to see is the total quantity of shoes based on the type, size, date and project_id. The type and size information are available on shoes_type table, while the quantity is coming from the shoes_list table. I expect to see something like:
shoes_type  shoes_size  total quantity  date      project_id
heels       5           3               19/10/02  1
sneakers    5           3               19/10/02  1
sneakers    6           1               19/10/05  1
heels       7           5               19/10/03  1
While for the desired result, I have tried:
select shoes_type, shoes_size, date, project_id, sum(shoes_quantity)
from shoes_type st
join shoes_list sl
on st.project_id = sl.project_id
and st.shoes_id = sl.shoes_id
and st.date = sl.date
group by shoes_type, shoes_size, date, project_id
Unfortunately, I got an error that says that the column reference "date" is ambiguous.
How should I fix this?
Thank you.
The date column exists in both tables, so you have to specify which table to select it from. Because the tables are aliased in your query, replace date with st.date or sl.date.
Qualify all column references to remove the "ambiguous" column error:
select st.shoes_type, st.shoes_size, st.date, st.project_id, sum(sl.shoes_quantity)
from shoes_type st join
shoes_list sl
on st.project_id = sl.project_id and
st.shoes_id = sl.shoes_id and
st.date = sl.date
group by st.shoes_type, st.shoes_size, st.date, st.project_id;
If you want all columns from shoes_type, you might find that a correlated subquery is faster:
select st.*,
(select sum(sl.shoes_quantity)
from shoes_list sl
where st.project_id = sl.project_id and
st.shoes_id = sl.shoes_id and
st.date = sl.date
)
from shoes_type st;
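To see the fully qualified join in action, here is a small sketch using SQLite through Python's sqlite3 module (an assumption; the question does not say which database is in use). The table and column names follow the question, but the inserted rows are hypothetical values chosen so that two of the expected totals appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE shoes_type (
        shoes_id INTEGER, shoes_size INTEGER, shoes_type TEXT,
        date TEXT, project_id INTEGER
    );
    CREATE TABLE shoes_list (
        shoes_quantity INTEGER, shoes_id INTEGER, shoes_color TEXT,
        date TEXT, project_id INTEGER
    );
    -- Hypothetical sample rows, not taken from the question.
    INSERT INTO shoes_type VALUES
        (1, 5, 'heels', '19/10/02', 1),
        (2, 5, 'sneakers', '19/10/02', 1);
    INSERT INTO shoes_list VALUES
        (2, 1, 'red', '19/10/02', 1),
        (1, 1, 'black', '19/10/02', 1),
        (3, 2, 'white', '19/10/02', 1);
    """
)

# Every column is qualified with its table alias, so "date" and "project_id"
# are no longer ambiguous even though both tables have them.
TOTALS_QUERY = """
SELECT st.shoes_type, st.shoes_size, st.date, st.project_id,
       SUM(sl.shoes_quantity) AS total_quantity
FROM shoes_type st
JOIN shoes_list sl
  ON st.project_id = sl.project_id
 AND st.shoes_id = sl.shoes_id
 AND st.date = sl.date
GROUP BY st.shoes_type, st.shoes_size, st.date, st.project_id
ORDER BY st.shoes_type, st.shoes_size
"""

totals = conn.execute(TOTALS_QUERY).fetchall()
for row in totals:
    print(row)
```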

querying data with sqlite

I have data in an sqlite db that contains the following columns:
date | name | id | code
all as TEXT (I sourced it from a csv file) and I want to build a query that finds all names that have code ABC120 but not ABC306 nor ABC305 on the same date and group the result GROUP BY name.
How do I do this?
If you want to use GROUP BY you must group by name, date first and set the conditions in the HAVING clause, but also you must use DISTINCT so the results do not contain duplicate names:
select distinct name
from tablename
group by name, date
having sum(code = 'ABC120') > 0 and sum(code in ('ABC305', 'ABC306')) = 0;
You can get the same results with EXISTS:
select distinct t.name
from tablename t
where t.code = 'ABC120'
and not exists (select 1 from tablename where name = t.name and date = t.date and code in ('ABC305', 'ABC306'))
You can use having:
select date, name
from t
where code in ('ABC120', 'ABC306', 'ABC305')
group by date, name
having min(code) = 'ABC120' and max(code) = 'ABC120';
Note: because of the three codes you chose, you could just use max(code) = 'ABC120', since 'ABC120' sorts before the other two. However, that does not generalize to other code values.
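The GROUP BY / HAVING query and the NOT EXISTS query can be compared side by side with Python's sqlite3 module. This is only a sketch: the table name tablename follows the first answer, and the rows are hypothetical data invented to cover the interesting cases (a name with both codes on the same date, and a name with the excluded code on a different date).

```python
import sqlite3

# Hypothetical rows: (date, name, id, code), all stored as TEXT.
ROWS = [
    ("2020-01-01", "alice", "1", "ABC120"),
    ("2020-01-01", "alice", "2", "ABC305"),  # same date, so alice is excluded
    ("2020-01-02", "alice", "3", "ABC306"),
    ("2020-01-01", "bob", "4", "ABC120"),    # qualifies
    ("2020-01-01", "carol", "5", "ABC306"),  # never has ABC120
    ("2020-01-02", "dave", "6", "ABC120"),   # qualifies on this date
    ("2020-01-03", "dave", "7", "ABC305"),   # different date, does not matter
]

GROUP_BY_QUERY = """
SELECT DISTINCT name
FROM tablename
GROUP BY name, date
HAVING SUM(code = 'ABC120') > 0
   AND SUM(code IN ('ABC305', 'ABC306')) = 0
ORDER BY name
"""

NOT_EXISTS_QUERY = """
SELECT DISTINCT t.name
FROM tablename t
WHERE t.code = 'ABC120'
  AND NOT EXISTS (
      SELECT 1 FROM tablename
      WHERE name = t.name AND date = t.date
        AND code IN ('ABC305', 'ABC306')
  )
ORDER BY t.name
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (date TEXT, name TEXT, id TEXT, code TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?)", ROWS)

names_group_by = [r[0] for r in conn.execute(GROUP_BY_QUERY)]
names_not_exists = [r[0] for r in conn.execute(NOT_EXISTS_QUERY)]
print(names_group_by, names_not_exists)
```

The SUM(code = 'ABC120') form relies on SQLite evaluating a comparison to 0 or 1, so it does not carry over unchanged to every other database.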

SQL Server select max date per ID

I am trying to select the max date record for each service_user_id and each finance_charge_id, together with the amount that is linked to the highest date.
select distinct
s.Finance_Charge_ID, MAX(s.start_date), s.Amount
from
Service_User_Finance_Charges s
where
s.Service_User_ID = '156'
group by
s.Finance_Charge_ID, s.Amount
The issue is that I receive multiple entries where the amount is different. I only want to receive the amount on the latest date for each finance_charge_id
At the moment I receive the below which is incorrect (the third line should not appear as the 1st line has a higher date)
Finance_Charge_ID  (No column name)  Amount
2                  2014-10-19        1.00
3                  2014-10-16        500.00
2                  2014-10-01        1000.00
Remove the Amount column from the group by to get the correct rows. You can then join that query onto the table again to get all the data you need. Here is an example using a CTE to get the max dates:
WITH MaxDates_CTE (Finance_Charge_ID, MaxDate) AS
(
select s.Finance_Charge_ID,
MAX(s.start_date) MaxDate
from Service_User_Finance_Charges s
where s.Service_User_ID = '156'
group by s.Finance_Charge_ID
)
SELECT *
FROM Service_User_Finance_Charges
JOIN MaxDates_CTE
ON MaxDates_CTE.Finance_Charge_ID = Service_User_Finance_Charges.Finance_Charge_ID
AND MaxDates_CTE.MaxDate = Service_User_Finance_Charges.start_date
This can be done using a window function which removes the need for a self join on the grouped data:
select Finance_Charge_ID,
start_date,
amount
from (
select s.Finance_Charge_ID,
s.start_date,
max(s.start_date) over (partition by s.Finance_Charge_ID) as max_date,
s.Amount
from Service_User_Finance_Charges s
where s.Service_User_ID = 156
) t
where start_date = max_date;
As the window function does not require you to use group by you can add any additional column you need in the output.
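A quick way to check the window-function version is to run it on the three rows shown in the question. The sketch below uses SQLite through Python's sqlite3 module rather than SQL Server (an assumption for the sake of a runnable example; window functions need SQLite 3.25 or newer). The fourth row, for a different Service_User_ID, is hypothetical and only shows that the filter is applied before the window is computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Service_User_Finance_Charges ("
    "Service_User_ID INTEGER, Finance_Charge_ID INTEGER, "
    "start_date TEXT, Amount REAL)"
)
conn.executemany(
    "INSERT INTO Service_User_Finance_Charges VALUES (?, ?, ?, ?)",
    [
        (156, 2, "2014-10-19", 1.00),
        (156, 3, "2014-10-16", 500.00),
        (156, 2, "2014-10-01", 1000.00),
        (999, 2, "2014-12-01", 7.00),  # hypothetical row for another user
    ],
)

# Attach the per-charge maximum date to every row, then keep the rows whose
# own start_date equals that maximum.
LATEST_AMOUNTS = """
SELECT Finance_Charge_ID, start_date, Amount
FROM (
    SELECT s.Finance_Charge_ID,
           s.start_date,
           MAX(s.start_date) OVER (PARTITION BY s.Finance_Charge_ID) AS max_date,
           s.Amount
    FROM Service_User_Finance_Charges s
    WHERE s.Service_User_ID = 156
) t
WHERE start_date = max_date
ORDER BY Finance_Charge_ID
"""

latest_amounts = conn.execute(LATEST_AMOUNTS).fetchall()
for row in latest_amounts:
    print(row)
```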

Variant use of the GROUP BY clause in TSQL

Imagine the following schema and sample data (SQL Server 2008):
OriginatingObject
----------------------------------------------
ID
1
2
3
ValueSet
----------------------------------------------
ID OriginatingObjectID DateStamp
1 1 2009-05-21 10:41:43
2 1 2009-05-22 12:11:51
3 1 2009-05-22 12:13:25
4 2 2009-05-21 10:42:40
5 2 2009-05-20 02:21:34
6 1 2009-05-21 23:41:43
7 3 2009-05-26 14:56:01
Value
----------------------------------------------
ID ValueSetID Value
1 1 28
etc (a set of rows for each related ValueSet)
I need to obtain the ID of the most recent ValueSet record for each OriginatingObject. Do not assume that the higher the ID of a record, the more recent it is.
I am not sure how to use GROUP BY properly in order to make sure the set of results grouped together to form each aggregate row includes the ID of the row with the highest DateStamp value for that grouping. Do I need to use a subquery or is there a better way?
You can do it with a correlated subquery, or with a multiple-column IN and a GROUP BY.
Note that a simple GROUP BY only gets you the list of OriginatingObjectIDs and their latest DateStamps. To pull the matching ValueSet IDs, the cleanest solution is to use a subquery.
Multiple-column IN with GROUP BY (probably faster, but SQL Server does not accept a multiple-column IN, so this form is only for databases that support row-value comparisons):
SELECT O.ID, V.ID
FROM OriginatingObject AS O, ValueSet AS V
WHERE O.ID = V.OriginatingObjectID
AND
(V.OriginatingObjectID, V.DateStamp) IN
(
SELECT OriginatingObjectID, Max(DateStamp)
FROM ValueSet
GROUP BY OriginatingObjectID
)
Correlated Subquery:
SELECT O.ID, V.ID
FROM OriginatingObject AS O, ValueSet AS V
WHERE O.ID = V.OriginatingObjectID
AND
V.DateStamp =
(
SELECT Max(DateStamp)
FROM ValueSet V2
WHERE V2.OriginatingObjectID = O.ID
)
SELECT OriginatingObjectID, id
FROM (
SELECT id, OriginatingObjectID, RANK() OVER(PARTITION BY OriginatingObjectID
ORDER BY DateStamp DESC) as ranking
FROM ValueSet) AS ranked -- SQL Server requires an alias on a derived table
WHERE ranking = 1;
This can be done with a correlated sub-query. No GROUP-BY necessary.
SELECT
vs.ID,
vs.OriginatingObjectID,
vs.DateStamp,
v.Value
FROM
ValueSet vs
INNER JOIN Value v ON v.ValueSetID = vs.ID
WHERE
NOT EXISTS (
SELECT 1
FROM ValueSet
WHERE OriginatingObjectID = vs.OriginatingObjectID
AND DateStamp > vs.DateStamp
)
This returns a single ValueSet per OriginatingObjectID only if no two ValueSet rows for the same OriginatingObjectID share the same DateStamp. If the latest DateStamp is tied, every tied row is returned.
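The NOT EXISTS approach can be checked against the ValueSet sample data from the question. The sketch below runs on SQLite through Python's sqlite3 module rather than SQL Server 2008 (an assumption for the sake of a runnable example) and leaves out the join to Value, since the question lists only one Value row. The alias newer is hypothetical.

```python
import sqlite3

# ValueSet rows copied from the question: (ID, OriginatingObjectID, DateStamp).
VALUE_SETS = [
    (1, 1, "2009-05-21 10:41:43"),
    (2, 1, "2009-05-22 12:11:51"),
    (3, 1, "2009-05-22 12:13:25"),
    (4, 2, "2009-05-21 10:42:40"),
    (5, 2, "2009-05-20 02:21:34"),
    (6, 1, "2009-05-21 23:41:43"),
    (7, 3, "2009-05-26 14:56:01"),
]

# A row is the latest for its object when no other row for the same object
# has a later DateStamp.
LATEST_VALUE_SET = """
SELECT vs.OriginatingObjectID, vs.ID
FROM ValueSet vs
WHERE NOT EXISTS (
    SELECT 1
    FROM ValueSet newer
    WHERE newer.OriginatingObjectID = vs.OriginatingObjectID
      AND newer.DateStamp > vs.DateStamp
)
ORDER BY vs.OriginatingObjectID
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ValueSet (ID INTEGER, OriginatingObjectID INTEGER, DateStamp TEXT)"
)
conn.executemany("INSERT INTO ValueSet VALUES (?, ?, ?)", VALUE_SETS)

latest = conn.execute(LATEST_VALUE_SET).fetchall()
print(latest)  # [(1, 3), (2, 4), (3, 7)]
```

For object 2 the latest row is ID 4, not the higher ID 5, which matches the question's warning that a higher ID does not imply a more recent record.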