SQL - aggregate function to get value from same row as MAX()

SQL - aggregate function to get value from same row as MAX() - sql-server-2005

I have one table with columns channel, value and timestamp, and another table with 7 other columns with various data.
I'm joining these two together, and I want to select the maximum value of the value column within an hour, and the timestamp of the corresponding row. This is what I've tried, but it (obviously) doesn't work.
SELECT
v.channel,
MAX(v.value),
v.timestamp,
i.stuff,
...
FROM
Values v
INNER JOIN
#Information i
ON i.type = v.type
GROUP BY channel, DATEPART(HOUR, timestamp), i.stuff, ...
I'm (not very surprisingly) getting the following error:
"dbo.Values.timestamp" is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
How should I do this correctly?

You could use the RANK() or DENSE_RANK() features to get the results as appropriate. Something like:
;WITH RankedResults AS
(
SELECT
channel,
value,
timestamp,
type,
RANK() OVER (PARTITION BY DATEPART(hour,timestamp) ORDER BY value desc) as Position
FROM
Values
)
SELECT
v.channel,
v.value,
v.timestamp,
i.stuff
/* other columns */
FROM
RankedResults v
inner join
#Information i
on
v.type = i.type
WHERE
v.Position = 1
(whether to use RANK or DENSE_RANK depends on what you want to do in the case of ties, really)
(Edited the SQL to include the join, in response to Tomas' comment)

you must include 'v.timestamp' in the Group By clause.
Hope this will help for you.

Related

BigQuery - Extract last entry of each group

I have one table where multiple records inserted for each group of product. Now, I want to extract (SELECT) only the last entries. For more, see the screenshot. The yellow highlighted records should be return with select query.

The HAVING MAX and HAVING MIN clause for the ANY_VALUE function is now in preview
HAVING MAX and HAVING MIN were just introduced for some aggregate functions - https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
with them query can be very simple - consider below approach
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
if applied to sample data in your question - output is

You might consider below as well
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with an aggregate function than a window function, below might be an another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
Query results

You can use window function to do partition based on key and selecting required based on defining order by field.
For Example:
select * from (
select *,
rank() over (partition by product, order by DateTime Desc) as rank
from `project.dataset.table`)
where rank = 1

You can use this query to select last record of each group:
Select Top(1) * from Tablename group by ID order by DateTime Desc

Why can't I use my column alias in WHERE clause?

I want to compare a value of my current row with the value of the row before. I came up with this, but it won't work. It can't find PREV_NUMBER_OF_PEOPLE so my WHERE clause is invalid. I'm not allowed to use WITH. Does anyone have an idea?
SELECT
ID
,NUMBER_OF_PEOPLE
,LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
,DATE
FROM (
SELECT * FROM DATAFRAME
WHERE DATE>=CURRENT_DATE-90
ORDER BY DATE DESC
) AS InnerQuery
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE

You have several issues with your query:
The filtering conditions should be in the outer query.
The new column definition should be in the inner query.
The order by should be in the outer query.
With these changes, it should work fine:
SELECT ID, NUMBER_OF_PEOPLE, PREV_NUMBER_OF_PEOPLE, DATE
FROM (SELECT D.*,
LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
FROM DATAFRAME D
) AS InnerQuery
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE AND
DATE >= CURRENT_DATE - 90
ORDER BY DATE DESC;
You need the filtering after the LAG() so you can include the earliest day in the date range. If you filter in the inner query, the LAG() will return NULL in that case.
You need to define the alias in the subquery so you can refer to it in the WHERE. Aliases defined in a SELECT cannot be used in the corresponding WHERE. This is a SQL rule, not due to the database you are using.

You could use common table expression (CTE's) to split the query processing.
Something like this:
WITH cte1 AS
(
SELECT * -- field list is advised...
FROM DATAFRAME
WHERE DATE >= CURRENT_DATE-90
),
cte2 AS
(
SELECT ID
,NUMBER_OF_PEOPLE
,LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
,DATE
FROM cte1
)
SELECT ID
,NUMBER_OF_PEOPLE
,PREV_NUMBER_OF_PEOPLE
,DATE
FROM cte2
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE
ORDER BY DATE DESC;

Logical query processing is the conceptual interpretation of the query that defines the correct result, and unlike the keyed-in order of the query clauses, it starts by evaluating the FROM clause. Understanding logical query processing is crucial for correct understanding of T-SQL.
The main statement used to retrieve data in T-SQL is the SELECT statement. Following are the main query clauses specified in the order that you are supposed to type them (known as “keyed-in order”):
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
But as mentioned, the logical query processing order, which is the conceptual interpretation order, is different. It starts with the FROM clause. Here is the logical query processing order of the six main query clauses:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
You can use a CTE :
WITH CTE1 AS (
SELECT * FROM DATAFRAME
WHERE DATE>=CURRENT_DATE-90
),
CTE2 AS (
SELECT
ID
,NUMBER_OF_PEOPLE
,LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
,DATE
FROM CT2
)
SELECT * FROM CT2
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE

Just move the lag() into the derived table.
SELECT *
FROM (
SELECT id,
number_of_people,
lag(number_of_people) over (order by date) as prev_number_of_people,
date
FROM dataframe
WHERE date >= current_date - 90
) AS InnerQuery
WHERE number_of_people <> prev_number_of_people
ORDER BY date DESC

Get a new column with updated values, where each row change in time depending on the actual column?

I have some data that includes as columns an ID, Date and Place denoted by a number. I need to simulate a real time update where I create a new column that says how many different places are at the moment, so each time a new place appear in the column, the new column change it's value and shows it.
This is just a little piece of the original table with hundreds of millions of rows.
Here is an example, the left table is the original one and the right table is what I need.
I tried to do it with this piece of code but I cannot use the function DISTINCT with the OVER clause.
SELECT ID, Dates, Place,
count (distinct(Place)) OVER (PARTITION BY Place ORDER BY Dates) AS
DiffPlaces
FROM #informacion_prendaria_muestra
order by ID;

I think it will be possible by using DENSE_RANK() in SQL server
you can try this
SELECT ID, Dates, Place,
DENSE_RANK() OVER(ORDER BY Place) AS
DiffPlaces
FROM #informacion_prendaria_muestra

I think you can use a self join query like this - without using windows functions -:
select
t.ID, t.[Date], t.Place,
count(distinct tt.Place) diffPlace
from
yourTable t left join
yourTable tt on t.ID = tt.ID and t.[Date] >= tt.[Date]
group by
t.ID, t.[Date], t.Place
order by
Id, [Date];
SQL Fiddle Demo

Get minimum without using row number/window function in Bigquery

I have a table like as shown below
What I would like to do is get the minimum of each subject. Though I am able to do this with row_number function, I would like to do this with groupby and min() approach. But it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see just the two columns (subject_id and id) is enough to group the items together. They will help differentiate the group. But why am I not able to use the other columns in select clause. If I use the other columns, I may not get the expected output because time_1 has different values.
I expect my output to be like as shown below

In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(1)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After you concatenate the arrays, you want the first element. The .* transforms the record referred to by a to the component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you an easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;

Are you looking for something like below-
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not required to select Time_1 column's value, this following query will work (As I can see values in column min_time and max_time is same for the same group)-
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach is if you can apply something like CAST(Time_1 AS DATE) on your time column. This will consider only the date part regardless of the time part. The query will be
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Make sure the syntax of CAST AS DATE
-- in BigQuery is as I written here or bit different.

Below is for BigQuery Standard SQL and is most efficient way for such cases like in your question
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases lead to Resources exceeded error.
Note: self join is also very ineffective way of achieving your objective

A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

How to get the row number when using alias in orderby

I have a query. I use an alias in order by when using row_number and I got
[42703] ERROR: column "total_comments" does not exist error Position: 335
How can I fix this?
select
cr_seller_history_id,
c.created_at,
company_name,
business_name,
brand,
kep_mail,
address,
phone,
mail,
slug,
name,
point,
contact_positive,
contact_negative,
product_number,
(product_positive + product_negative) as total_comments,
ROW_NUMBER() OVER(ORDER BY total_comments) as rank
from cr_companies a
INNER JOIN cr_sellers b ON a.cr_company_id = b.cr_company_id
INNER JOIN cr_seller_histories c ON b.cr_seller_id = c.cr_seller_id
WHERE DATE(c.created_at) = DATE 'yesterday'
ORDER BY total_comments DESC NULLS LAST

The other solutions are a subquery, CTE, or a lateral join. So, you can write:
select . . .
v.total_comments,
row_number() over (order by v.total_comments) as rank
from cr_companies c join
cr_sellers s
on c.cr_company_id = s.cr_company_id join
cr_seller_histories sh
on s.cr_seller_id = sh.cr_seller_id, lateral
(values (product_positive + product_negative)) v(total_comments)
where DATE(c.created_at) = date 'yesterday'
order by v.total_comments desc nulls last;
Notice that I also changed the table aliases to be abbreviations for the table names. This is a best practice and makes it much easier to write, read, and modify queries.

the problem is in:
ROW_NUMBER() OVER(ORDER BY total_comments) as rank
you can't use alias like this - order by accepts alias in select, not in window function:
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-SELECT-LIST
An output column's name can be used to refer to the column's value in
ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses;
there you must write out the expression instead.
instead try:
ROW_NUMBER() OVER(ORDER BY (product_positive + product_negative)) as rank
or use subquery - then alias can be used in window function

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SQL - aggregate function to get value from same row as MAX() - sql-server-2005

you must include 'v.timestamp' in the Group By clause. Hope this will help for you.

Related

BigQuery - Extract last entry of each group

Why can't I use my column alias in WHERE clause?

Get a new column with updated values, where each row change in time depending on the actual column?

Get minimum without using row number/window function in Bigquery

How to get the row number when using alias in orderby

Categories

Resources