Hive: how to get the second max ("less than max" subquery not working)

I searched Stack Overflow and found this solution in SQL:
select max(concat(snapshot_year_month,snapshot_day)) from db.table where concat(snapshot_year_month,snapshot_day) < (select max(concat(snapshot_year_month,snapshot_day)) from db.table)
However, this does not work in Hive. It fails with the error:
UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#118829
How can I get the second max in Hive?

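You can use ranking functions for this.
select snapshot_date
from (select concat(snapshot_year_month,snapshot_day) as snapshot_date,
             row_number() over(order by concat(snapshot_year_month,snapshot_day) desc) as rnum
      from db.table
     ) t
where rnum=2
This assumes the concatenated value is unique. If it is not unique, use dense_rank instead, so that duplicates of the max all share rank 1.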
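To see why uniqueness matters, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Hive (it needs version 3.25 or later for window functions), `||` replaces `concat`, and the table name `snapshots` and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshots (snapshot_year_month TEXT, snapshot_day TEXT)")
conn.executemany(
    "INSERT INTO snapshots VALUES (?, ?)",
    # Hypothetical data: the max date (2023-02-01) appears twice.
    [("202301", "15"), ("202302", "01"), ("202302", "01"), ("202212", "31")],
)

# Same shape as the ranking query above; {fn} is swapped between the two functions.
query = """
SELECT snapshot_date
FROM (
    SELECT snapshot_year_month || snapshot_day AS snapshot_date,
           {fn}() OVER (ORDER BY snapshot_year_month || snapshot_day DESC) AS rnum
    FROM snapshots
) t
WHERE rnum = 2
"""

# row_number numbers the duplicate max as row 2, so it returns the max again.
second_by_row_number = conn.execute(query.format(fn="ROW_NUMBER")).fetchall()
# dense_rank gives both copies of the max rank 1, so rank 2 is the true second max.
second_by_dense_rank = conn.execute(query.format(fn="DENSE_RANK")).fetchall()

print(second_by_row_number)
print(second_by_dense_rank)
```

With duplicated maxima, only the dense_rank variant returns the second distinct value. If the second value itself is duplicated, dense_rank returns one row per duplicate.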

Related

BigQuery - Extract last entry of each group

I have a table where multiple records are inserted for each product group. Now I want to extract (SELECT) only the last entry of each group. For more, see the screenshot. The yellow highlighted records should be returned by the select query.
The HAVING MAX and HAVING MIN clauses for the ANY_VALUE function are now in preview
HAVING MAX and HAVING MIN were just introduced for some aggregate functions - https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them, the query can be very simple. Consider the approach below:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
If applied to the sample data in your question, the output is
You might consider the query below as well:
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with aggregate functions than with window functions, the query below might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
You can use a window function to partition the rows by a key and then select the required row according to the ORDER BY field.
For example:
select * from (
  select *,
    rank() over (partition by product order by DateTime desc) as rank
  from `project.dataset.table`)
where rank = 1
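The window-function approach can be tried out locally with a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for BigQuery (an assumption, and it needs SQLite 3.25 or later). The table `sample_table`, its columns `id`, `product`, `dt`, `price`, and the rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_table (id INTEGER, product TEXT, dt TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO sample_table VALUES (?, ?, ?, ?)",
    [
        (1, "apple", "2023-01-01 10:00:00", 1.0),
        (1, "apple", "2023-01-02 09:00:00", 1.5),
        (2, "pear", "2023-01-01 08:00:00", 2.0),
        (2, "pear", "2023-01-03 12:00:00", 2.5),
        (2, "pear", "2023-01-02 12:00:00", 2.2),
    ],
)

# Rank rows within each (id, product) group, newest first, and keep rank 1.
latest = conn.execute(
    """
    SELECT id, product, dt, price
    FROM (
        SELECT *,
               RANK() OVER (PARTITION BY id, product ORDER BY dt DESC) AS rnk
        FROM sample_table
    )
    WHERE rnk = 1
    ORDER BY id
    """
).fetchall()

for row in latest:
    print(row)
```

Because RANK gives tied rows the same rank, a group whose newest timestamp appears twice would return both rows. ROW_NUMBER would return exactly one.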
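You can use this query to select the last record (note that TOP is SQL Server syntax and is not valid in BigQuery, and as written it asks for a single row rather than one row per group):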
Select Top(1) * from Tablename group by ID order by DateTime Desc

How to get sample data from a table or a view in Aster Teradata without using order by?

I am trying to get sample data from a table in Aster Teradata with the following code, which relies on ORDER BY:
SELECT "col"
FROM (SELECT "col",
             Row_number()
               OVER (
                 ORDER BY 1) AS RANK
      FROM "nisha_test"."test_table") a
WHERE rank <= 10000
I want to get 10000 random rows without using ORDER BY.
If you want a sample you should use the built-in sample feature.
For Aster (or Vantage MLE, but with a slightly different syntax) there's a RandomSample operator, e.g.
SELECT * FROM RandomSample (
ON (SELECT 1) PARTITION BY 1 -- dummy data, but needed
InputTable ('nisha_test.test_table')
NumSample ('10000')
)
For Teradata there's the SAMPLE clause, e.g.
select *
from nisha_test.test_table
SAMPLE 10000
You can also use the QUALIFY clause in Teradata to remove the outer SELECT:
SELECT col
FROM nisha_test.test_table
QUALIFY ROW_NUMBER() OVER (ORDER BY NULL) <= 10000
In Teradata, I think you can use a constant value in the ORDER BY. You may even be able to exclude the ORDER BY altogether: ROW_NUMBER() OVER()
We can use the LIMIT keyword to get an arbitrary set of rows from a table or a view in Aster DB. Note that the rows returned are arbitrary, not a true random sample.
select * from "nisha_test"."test_table" limit 10000;

Get minimum without using row number/window function in Bigquery

I have a table like the one shown below
What I would like to do is get the minimum value for each subject. I am able to do this with the row_number function, but I would like to do it with a GROUP BY and min() approach, and that doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together and to tell the groups apart. So why am I not able to use the other columns in the select clause? If I add the other columns to the GROUP BY, I may not get the expected output, because time_1 has different values.
I expect my output to look like what is shown below
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After aggregating, you take the first element of the array. The .* expands the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like the query below?
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
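One behaviour of the join approach worth checking is what happens when the minimum value occurs more than once within a subject. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for BigQuery (an assumption, SQLite 3.25 or later). The table `measurements` and its rows are hypothetical, and the `time_1` tiebreaker in the ROW_NUMBER query is added only to make the result deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (subject_id INTEGER, id INTEGER, value INTEGER, time_1 TEXT)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    # Hypothetical data: subject 1 has its minimum value (3) twice.
    [(1, 10, 5, "t1"), (1, 10, 3, "t2"), (1, 10, 3, "t3"), (2, 20, 7, "t4"), (2, 20, 9, "t5")],
)

# Join each row to its subject's minimum: every row that ties for the minimum survives.
join_rows = conn.execute(
    """
    SELECT a.subject_id, a.value, a.time_1
    FROM measurements a
    INNER JOIN (
        SELECT subject_id, MIN(value) AS min_value
        FROM measurements
        GROUP BY subject_id
    ) b ON a.subject_id = b.subject_id AND a.value = b.min_value
    ORDER BY a.subject_id, a.time_1
    """
).fetchall()

# ROW_NUMBER keeps exactly one row per subject; time_1 breaks ties deterministically.
row_number_rows = conn.execute(
    """
    SELECT subject_id, value, time_1
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value, time_1) AS rn
        FROM measurements
    )
    WHERE rn = 1
    ORDER BY subject_id
    """
).fetchall()

print(join_rows)
print(row_number_rows)
```

If exactly one row per subject is required, the join needs an extra tiebreak step, whereas ROW_NUMBER handles it through its ORDER BY.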
If you do not need to select the Time_1 column, the following query will work (as far as I can see, the values in min_time and max_time are the same within each group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach may be to apply something like CAST(Time_1 AS DATE) to your time column, if you can. This considers only the date part and ignores the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check that the CAST ... AS DATE syntax in BigQuery
-- matches what I have written here; it may differ slightly.
Below is for BigQuery Standard SQL and is the most efficient way to handle cases like the one in your question:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a Resources exceeded error.
Note: a self join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

SELECT statement in WHERE clause on BigQuery not working

I'm trying to run the following query on Google BigQuery:
SELECT SUM(var1) AS Revenue
FROM [table1]
WHERE timeStamp = (SELECT MAX(timeStamp) FROM [table1])
I'm getting the following error:
Error: Encountered "" at line 3, column 19. Was expecting one of:
Is this not supported in BigQuery? If so, would there be an elegant alternative?
Subselect in a comparison predicate is not supported, but you can use IN.
SELECT SUM(var1) AS Revenue
FROM [table1]
WHERE timeStamp IN (SELECT MAX(timeStamp) FROM [table1])
I would use Rank() to get the max timestamp, and filter the #1s in the where clause.
select SUM(var1) AS Revenue
From
(SELECT var1
,RANK() OVER (ORDER BY timestamp DESC) as RNK
FROM [table1]
)
where RNK=1
I don't know how this performs in BigQuery, but in other database technologies it would be more efficient, as it involves a single table scan rather than two.
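As a quick sanity check that the IN rewrite and the RANK rewrite select the same rows, here is a small sketch using SQLite through Python's sqlite3 module. SQLite is only a stand-in (an assumption, version 3.25 or later) and says nothing about BigQuery legacy SQL's own restrictions or performance. The column name `ts` and the rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (var1 INTEGER, ts INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    # Hypothetical data: two rows share the latest timestamp (200).
    [(10, 100), (20, 200), (5, 200), (7, 150)],
)

# Variant 1: filter with IN against the max timestamp.
in_sum = conn.execute(
    "SELECT SUM(var1) FROM table1 WHERE ts IN (SELECT MAX(ts) FROM table1)"
).fetchone()[0]

# Variant 2: rank by timestamp descending and keep every row with rank 1.
rank_sum = conn.execute(
    """
    SELECT SUM(var1)
    FROM (
        SELECT var1, RANK() OVER (ORDER BY ts DESC) AS rnk
        FROM table1
    )
    WHERE rnk = 1
    """
).fetchone()[0]

print(in_sum, rank_sum)
```

RANK, not ROW_NUMBER, is the right choice here: rows that tie on the latest timestamp all get rank 1 and are all included in the sum.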

How to select distinct rows from table without using functions(min,max etc)?

I have seen a lot of questions about this lately, but I think there should be something easier than grouping by one column and wrapping all the other selected fields in Min, Max or Average functions. For example, I have a big table with 20 columns. I don't think wrapping 19 columns in functions is the right choice.
I have tried DISTINCT, but it still returns duplicate values.
Also putting every field of the select in the Group By doesn't work either because Oracle complains:
ORA-00932: inconsistent datatypes: expected - got CLOB
Any idea?
You can select distinct rows using the row_number() function:
select t.*
from (select t.*,
row_number() over (partition by <columns that should be different>
order by NULL) as seqnum
from t
) t
where seqnum = 1;
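A minimal sketch of this deduplication pattern, run on SQLite through Python's sqlite3 module as a stand-in for Oracle (an assumption, SQLite 3.25 or later). The table `t`, its columns `a`, `b`, `note`, and the rows are hypothetical. The window's ORDER BY is left out here, which SQLite allows, so which duplicate survives is arbitrary, just as with `order by NULL`.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    # Hypothetical data: (1, 'x') and (2, 'z') each appear twice with different notes.
    [(1, "x", "first"), (1, "x", "second"), (2, "y", "only"), (2, "z", "a"), (2, "z", "b")],
)

# Number the rows within each (a, b) group and keep one row per group.
deduped = conn.execute(
    """
    SELECT a, b, note
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY a, b) AS seqnum
        FROM t
    )
    WHERE seqnum = 1
    ORDER BY a, b
    """
).fetchall()

for row in deduped:
    print(row)
```

Only the columns listed in PARTITION BY are compared, so a column that cannot be grouped or compared (such as a CLOB) can still be returned as long as it is not part of the partition key.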