Finding Error when running STRING_AGG function - google-bigquery

I have a question about a query in BigQuery. I tried to run the query below:
SELECT id, STRING_AGG(DISTINCT status, ', ' ORDER BY timestamp) AS grouping
FROM table
GROUP BY id
But it would not run and gave me this error:
An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY expressions that are arguments to the function
Could anyone help me to fix the error? Thank you in advance!

Do you want the distinct statuses ordered by timestamp?
If so, note that an ORDER BY inside a subquery is not guaranteed to carry through to the aggregate. Instead, first reduce each (id, status) pair to its earliest timestamp, then aggregate in that order.
WITH first_seen AS (
SELECT id, status, MIN(timestamp) AS first_timestamp
FROM table
GROUP BY id, status
)
SELECT id, STRING_AGG(status, ', ' ORDER BY first_timestamp) AS grouping
FROM first_seen
GROUP BY id
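The error presumably exists because, once DISTINCT collapses duplicate statuses, one status can map to several timestamps, so "order by timestamp" becomes ambiguous. Picking a single timestamp per status (here the earliest, which is an assumption about what the asker wants) removes the ambiguity. A minimal Python sketch of that semantics, with hypothetical sample rows:

```python
from typing import Dict, List, Tuple

# Hypothetical sample rows: (id, status, timestamp)
rows = [
    (1, "open", 3),
    (1, "new", 1),
    (1, "open", 2),
    (2, "closed", 5),
    (2, "new", 4),
    (2, "closed", 6),
]


def distinct_statuses_by_first_seen(data) -> Dict[int, str]:
    # Step 1: keep the earliest timestamp for each (id, status) pair.
    first_seen: Dict[Tuple[int, str], int] = {}
    for id_, status, ts in data:
        key = (id_, status)
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts

    # Step 2: per id, sort the distinct statuses by that timestamp and join.
    per_id: Dict[int, List[Tuple[int, str]]] = {}
    for (id_, status), ts in first_seen.items():
        per_id.setdefault(id_, []).append((ts, status))
    return {
        id_: ", ".join(status for _, status in sorted(pairs))
        for id_, pairs in per_id.items()
    }


result = distinct_statuses_by_first_seen(rows)
print(result)
```

Using the latest timestamp instead would only require flipping the comparison in step 1.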

Related

BigQuery - Extract last entry of each group

I have a table in which multiple records are inserted for each product group. Now I want to extract (SELECT) only the last entry of each group. For more, see the screenshot. The yellow highlighted records should be returned by the select query.
The HAVING MAX and HAVING MIN clause for the ANY_VALUE function is now in preview
HAVING MAX and HAVING MIN were just introduced for some aggregate functions: https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them the query can be very simple. Consider the approach below:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
if applied to sample data in your question - output is
You might consider below as well
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with aggregate functions than window functions, the query below might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
You can use a window function that partitions by the key and orders by the date field, then keep only the first row of each partition.
For example:
select * from (
select *,
rank() over (partition by product order by DateTime Desc) as rank
from `project.dataset.table`)
where rank = 1
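You can use this query to select the last record of each group (BigQuery does not support TOP, so the row is picked with ROW_NUMBER):
SELECT * FROM Tablename
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DateTime DESC) = 1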
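To try the "last row per group" pattern without a BigQuery project, here is a minimal sketch that runs the ROW_NUMBER variant against an in-memory SQLite database from Python. The table contents and the Price column are made up for illustration, and it assumes an SQLite build with window function support (3.25 or later). SQLite has no QUALIFY, so the filter goes in an outer query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_table (ID INTEGER, Product TEXT, DateTime TEXT, Price REAL)"
)
# Hypothetical rows; ISO-formatted text timestamps sort chronologically.
conn.executemany(
    "INSERT INTO sample_table VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2023-01-01 10:00:00", 10.0),
        (1, "A", "2023-01-02 10:00:00", 11.0),
        (2, "B", "2023-01-01 09:00:00", 20.0),
        (2, "B", "2023-01-03 09:00:00", 22.0),
        (2, "B", "2023-01-02 09:00:00", 21.0),
    ],
)

# Number the rows of each (ID, Product) group from newest to oldest,
# then keep only the newest one.
latest = conn.execute(
    """
    SELECT ID, Product, DateTime, Price
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY ID, Product
                   ORDER BY DateTime DESC
               ) AS rn
        FROM sample_table
    )
    WHERE rn = 1
    ORDER BY ID
    """
).fetchall()

for row in latest:
    print(row)
```

ROW_NUMBER returns exactly one row per group even when two rows share the same DateTime, whereas RANK would return every tied row.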

Invalid group by expression error when using any_value with max and window function in Snowflake

I was given a query and I am attempting to modify it in order to get the most recent version of each COMP_ID. The original query:
SELECT
ANY_VALUE(DATA_INDEX)::string AS DATA_INDEX,
COMP_ID::string AS COMP_ID,
ANY_VALUE(ACCOUNT_ID)::string AS ACCOUNT_ID,
ANY_VALUE(COMP_VERSION)::string AS COMP_VERSION,
ANY_VALUE(NAME)::string AS NAME,
ANY_VALUE(DESCRIPTION)::string AS DESCRIPTION,
MAX(OBJECT_DICT:"startshape-type")[0]::string AS STARTSHAPE_TYPE,
MAX(OBJECT_DICT:"startshape-connector-type")[0]::string AS STARTSHAPE_CONNECTOR_TYPE ,
MAX(OBJECT_DICT:"startshape-action-type")[0]::string AS STATSHAPE_ACTION_TYPE,
MAX(OBJECT_DICT:"overrides-enabled")[0]::string AS OVERRIDES_ENABLED
FROM COMP_DATA
GROUP BY COMP_ID
ORDER BY COMP_ID;
I then attempted to use a window function to grab only the highest version for each comp_id.
This is the modified query:
SELECT
ANY_VALUE(DATA_INDEX)::string AS DATA_INDEX,
COMP_ID::string AS COMP_ID,
ANY_VALUE(ACCOUNT_ID)::string AS ACCOUNT_ID,
ANY_VALUE(COMP_VERSION)::string AS COMP_VERSION,
ANY_VALUE(NAME)::string AS NAME,
ANY_VALUE(DESCRIPTION)::string AS DESCRIPTION,
MAX(OBJECT_DICT:"startshape-type")[0]::string AS STARTSHAPE_TYPE,
MAX(OBJECT_DICT:"startshape-connector-type")[0]::string AS STARTSHAPE_CONNECTOR_TYPE ,
MAX(OBJECT_DICT:"startshape-action-type")[0]::string AS STATSHAPE_ACTION_TYPE,
MAX(OBJECT_DICT:"overrides-enabled")[0]::string AS OVERRIDES_ENABLED,
ROW_NUMBER() OVER (PARTITION BY COMP_ID ORDER BY COMP_VERSION DESC) AS ROW_NUM
FROM COMP_DATA
QUALIFY 1 = ROW_NUM;
When attempting to compile the below error is given:
SQL compilation error: [COMP_DATA.COMP_ID] is not a valid group by expression
I had originally thought the issue was the ANY_VALUE on COMP_VERSION, but after removing the ANY_VALUE the same error was given. The only way I found to not get an error was removing the 4 MAX fields and all of the ANY_VALUE()'s, as shown below:
SELECT
DATA_INDEX::string AS DATA_INDEX,
COMP_ID::string AS COMP_ID,
ACCOUNT_ID::string AS ACCOUNT_ID,
COMP_VERSION::string AS COMP_VERSION,
NAME::string AS NAME,
DESCRIPTION::string AS DESCRIPTION,
ROW_NUMBER() OVER (PARTITION BY COMP_ID ORDER BY COMP_VERSION DESC) AS ROW_NUM
FROM COMP_DATA
QUALIFY 1 = ROW_NUM;
Of course this is not at all sufficient since I need the 4 max fields.
I have also tried creating the table with the max fields and from that new table using the window function to select the highest COMP_VERSION of each COMP_ID, but the same error was given.
When you added your QUALIFY clause, you dropped the GROUP BY clause from your SQL. With an aggregate function like MAX, either every selected expression must be an aggregate or the query needs a GROUP BY clause.
So if you only want the best row per group, as you note, your aggregate functions need to be explicitly windowed. Thus:
SELECT
data_index::string AS data_index,
comp_id::string AS comp_id,
account_id::string AS account_id,
comp_version::string AS comp_version,
name::string AS name,
description::string AS description,
MAX(object_dict:"startshape-type")OVER(PARTITION BY comp_id)[0]::string AS startshape_type,
MAX(object_dict:"startshape-connector-type")OVER (PARTITION BY comp_id)[0]::string AS startshape_connector_type ,
MAX(object_dict:"startshape-action-type")OVER (PARTITION BY comp_id)[0]::string AS statshape_action_type,
MAX(object_dict:"overrides-enabled")OVER(PARTITION BY comp_id)[0]::string AS overrides_enabled
FROM COMP_DATA
QUALIFY 1 = ROW_NUMBER() OVER (PARTITION BY comp_id ORDER BY comp_version DESC);
There is a small chance you will need to wrap each windowed MAX in parentheses, like this:
(MAX(object_dict:"overrides-enabled")OVER(PARTITION BY comp_id))[0]::string AS overrides_enabled,
But I suspect it will work out of the box. I also assumed you don't need the row number in the output, so I moved it into the QUALIFY clause (it would always be 1).
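The same idea (a windowed aggregate next to a ROW_NUMBER filter) can be tried locally. The sketch below uses SQLite through Python rather than Snowflake, so it is an analogy, not the exact query: the table, its columns, and the data are hypothetical, the variant-path syntax is replaced by a plain column, and because SQLite has no QUALIFY the filter sits in an outer query. It assumes SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comp_data "
    "(comp_id TEXT, comp_version INTEGER, name TEXT, shape_type TEXT)"
)
# Hypothetical data: the newest version of c1 has no shape_type of its own.
conn.executemany(
    "INSERT INTO comp_data VALUES (?, ?, ?, ?)",
    [
        ("c1", 1, "first", "start"),
        ("c1", 2, "second", None),
        ("c2", 1, "only", "stop"),
    ],
)

rows = conn.execute(
    """
    SELECT comp_id, comp_version, name, max_shape
    FROM (
        SELECT comp_id,
               comp_version,
               name,
               -- Windowed aggregate: computed over every row of the comp_id,
               -- so no GROUP BY is needed.
               MAX(shape_type) OVER (PARTITION BY comp_id) AS max_shape,
               ROW_NUMBER() OVER (
                   PARTITION BY comp_id
                   ORDER BY comp_version DESC
               ) AS rn
        FROM comp_data
    )
    WHERE rn = 1
    ORDER BY comp_id
    """
).fetchall()

for row in rows:
    print(row)
```

Each output row carries the non-aggregated columns of the highest comp_version, while max_shape is taken across all versions of that comp_id.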

HIVE - Getting ALL columns of the table with COUNT(*) with DISTINCT values

I have the table below, called Current_Table.
I want to get output in which:
The column personalemailtrim is DISTINCT
The column Occurrences is a count greater than 1
The rows are ordered by the column personalemailtrim
My query so far is wrong on many levels: GROUP BY can't be combined with DISTINCT, and using COUNT(*) with GROUP BY doesn't give me any results.
SELECT id,
personalemailtrim,
personworksatnumberofbsbs,
region,
district,
branch,
num,
countofapptsatbsb,
COUNT(personalemailtrim) occurrences
FROM Current_table
GROUP BY id,
personalemailtrim,
personworksatnumberofbsbs,
region,
district,
branch,
num,
countofapptsatbsb
HAVING COUNT(*) > 1
ORDER BY personalemailtrim
Any help is really appreciated. I have tried breaking the query down in several ways, but I am stuck.
To elaborate further, the expected output should look like below.
As you can see,
Occurrences are > 1
personalemailtrim is now DISTINCT
I think you want:
select t.*
from (select t.*,
row_number() over (partition by personalemailtrim order by id) as seqnum
from Current_table t
) t
where seqnum = 1 and occurrences > 1;
This assumes that occurrences is the same for each personalemailtrim, which is consistent with your data and with your question.
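If occurrences is not already a column in the table (the sample data is not shown here, so this is an assumption), it can be computed in the same subquery with COUNT(*) OVER. A minimal sketch of that variation, run against SQLite from Python instead of Hive, with a made-up three-column version of the table; it assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE current_table (id INTEGER, personalemailtrim TEXT, branch TEXT)"
)
# Hypothetical rows: a@x.com appears twice, b@x.com once, c@x.com three times.
conn.executemany(
    "INSERT INTO current_table VALUES (?, ?, ?)",
    [
        (1, "a@x.com", "north"),
        (2, "a@x.com", "south"),
        (3, "b@x.com", "east"),
        (4, "c@x.com", "west"),
        (5, "c@x.com", "west"),
        (6, "c@x.com", "north"),
    ],
)

rows = conn.execute(
    """
    SELECT id, personalemailtrim, branch, occurrences
    FROM (
        SELECT t.*,
               -- How many rows share this email address.
               COUNT(*) OVER (PARTITION BY personalemailtrim) AS occurrences,
               -- Position of this row within its email group.
               ROW_NUMBER() OVER (
                   PARTITION BY personalemailtrim ORDER BY id
               ) AS seqnum
        FROM current_table t
    )
    WHERE seqnum = 1 AND occurrences > 1
    ORDER BY personalemailtrim
    """
).fetchall()

for row in rows:
    print(row)
```

Every column of the first row per email is kept, and emails that occur only once are filtered out.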

SQL Select a distinct row based on two columns which has min value in third column

EDIT: I'm using PostgreSQL
My query needs to return all the unique rows for the id column and the type column. When there are multiple rows with the same id and type it will return the row with the smallest value in the time column.
SELECT id, type, value FROM TableName
GROUP BY MIN(time)
ORDER BY id ASC, type ASC
This is what I have so far but I feel like I'm using GROUP BY the wrong way
I think you can use ROW_NUMBER to mark the rows within each combination of id and type with the smallest time having rn = 1, then use WHERE clause to filter the table:
SELECT id, type, value FROM
(SELECT id, type, value,
ROW_NUMBER() OVER(PARTITION BY id, type ORDER BY time) AS rn
FROM TableName) a
WHERE rn = 1
Postgres supports DISTINCT ON. This is usually the most efficient way to do what you want:
SELECT DISTINCT ON (id, type) id, type, value
FROM TableName
ORDER BY id, type, time ;

How do I use ROW_NUMBER()?

I want to use ROW_NUMBER() to get:
1. The max(ROW_NUMBER()), which I guess would also be the count of all rows.
I tried doing:
SELECT max(ROW_NUMBER() OVER(ORDER BY UserId)) FROM Users
but it didn't seem to work...
2. The ROW_NUMBER() for a given piece of information, i.e. if I have a name and I want to know what row the name came from.
I assume it would be something similar to what I tried for #1:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
but this didn't work either...
Any Ideas?
For the first question, why not just use?
SELECT COUNT(*) FROM myTable
to get the count.
And for the second question, the primary key of the row is what should be used to identify a particular row. Don't try and use the row number for that.
If you return ROW_NUMBER() in your main query,
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS RowNumber, Field1, Field2, Field3
FROM [User]
then, when you want to go 5 rows back, you can take the current row number and use the following query to find the row at CurrentRow - 5:
SELECT us.Id
FROM (SELECT ROW_NUMBER() OVER (ORDER BY Id) AS Row, Id
FROM [User]) us
WHERE Row = CurrentRow - 5
Though I agree with others that you could use count() to get the total number of rows, here is how you can use row_number():
To get the total number of rows:
with temp as (
select row_number() over (order by id) as rownum
from table_name
)
select max(rownum) from temp
To get the row numbers where name is Matt:
with temp as (
select name, row_number() over (order by id) as rownum
from table_name
)
select rownum from temp where name like 'Matt'
You can further use min(rownum) or max(rownum) to get the first or last row for Matt respectively.
These were very simple implementations of row_number(). You can use it for more complex grouping. Check out my response on Advanced grouping without using a sub query
If you need to return the table's total row count, there is an alternative to the SELECT COUNT(*) statement.
Because SELECT COUNT(*) does a full table scan to return the row count, it can take a very long time on a large table. In that case you can use the sysindexes system table instead. It has a ROWS column that contains the total row count for each table in your database. You can use the following select statement:
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This will drastically reduce the time your query takes.
You can use this to get the first record that matches the WHERE clause:
SELECT TOP(1) * , ROW_NUMBER() OVER(ORDER BY UserId) AS rownum
FROM Users
WHERE UserName = 'Joe'
ORDER BY rownum ASC
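ROW_NUMBER() returns a unique number for each row, starting with 1. You can use it by simply writing the following, where Column_Name stands for the column to order by (unquoted, since a quoted string would be a constant rather than a column):
ROW_NUMBER() OVER (ORDER BY Column_Name DESC) as ROW_NUMBER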
This may not be directly related to the question, but I found it useful when using ROW_NUMBER and no particular ordering is needed:
SELECT *,
ROW_NUMBER() OVER (ORDER BY (SELECT 100)) AS Any_ID
FROM #Any_Table
select
Ml.Hid,
ml.blockid,
row_number() over (partition by ml.blockid order by Ml.Hid desc) as rownumber,
H.HNAME
from MIT_LeadBechmarkHamletwise ML
join [MT.HAMLE] h on ML.Hid=h.HID
SELECT num, UserName FROM
(SELECT UserName, ROW_NUMBER() OVER(ORDER BY UserId) AS num
From Users) AS numbered
WHERE UserName='Joe'
You can use ROW_NUMBER to limit a query result, for example for paging.
Example:
SELECT * FROM (
select row_number() OVER (order by createtime desc) AS ROWINDEX,*
from TABLENAME ) TB
WHERE TB.ROWINDEX between 1 and 10
With the above query, I get page 1 of the results from TABLENAME.
If you absolutely want to use ROW_NUMBER for this (instead of count(*)) you can always use:
SELECT TOP 1 ROW_NUMBER() OVER (ORDER BY Id)
FROM USERS
ORDER BY ROW_NUMBER() OVER (ORDER BY Id) DESC
You need to create a virtual table (a common table expression) using WITH ... AS, as in the query below.
Using this virtual table, you can perform CRUD operations with respect to row_number.
Query:
WITH numbered AS
(SELECT row_number() OVER(ORDER BY UserId) rn, * FROM Users)
SELECT * FROM numbered WHERE UserName='Joe'
You can use INSERT, UPDATE or DELETE in the last statement instead of SELECT.
The SQL ROW_NUMBER() function sorts the rows of a record set and assigns each one a sequence number. It is used to number rows, for example to identify the 10 rows with the highest order amount, or to identify each customer's highest-value order.
If you want to sort the dataset and number the rows separately within each category, use ROW_NUMBER() with a PARTITION BY clause. For example, this numbers each customer's orders by amount, where the dataset contains all orders:
SELECT
SalesOrderNumber,
CustomerId,
SubTotal,
ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SubTotal DESC) rn
FROM Sales.SalesOrderHeader
But as I understand it, you want to calculate the number of rows grouped by a column. To visualize the requirement: if you want to see the count of all orders of the related customer as a separate column beside the order info, you can use the COUNT() aggregate function with a PARTITION BY clause.
For example,
SELECT
SalesOrderNumber,
CustomerId,
COUNT(*) OVER (PARTITION BY CustomerId) CustomerOrderCount
FROM Sales.SalesOrderHeader
This query:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
will return one row for every user whose UserName is 'Joe' (and no rows if there is no such user).
The rows are numbered in UserId order; the row number starts at 1 and increments for each row where UserName='Joe'.
If it does not work for you, then either your WHERE clause has an issue or there is no UserId column in the table. Check the spelling of both column names, UserId and UserName.
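The behaviour described here follows from WHERE being applied before the window function is evaluated: the numbering only sees the rows that survive the filter. To get Joe's position within the whole table, number the rows in a subquery first and filter afterwards. A minimal sketch using SQLite from Python, with a hypothetical four-row Users table; it assumes SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (UserId INTEGER, UserName TEXT)")
# Hypothetical users; Joe is the third row by UserId.
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Joe"), (4, "Zed")],
)

# Filter first, then number: only Joe's row is left, so it gets number 1.
filtered_then_numbered = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (ORDER BY UserId)
    FROM Users
    WHERE UserName = 'Joe'
    """
).fetchall()

# Number all rows first, then filter: Joe keeps his position in the table.
numbered_then_filtered = conn.execute(
    """
    SELECT num
    FROM (
        SELECT UserName, ROW_NUMBER() OVER (ORDER BY UserId) AS num
        FROM Users
    ) AS numbered
    WHERE UserName = 'Joe'
    """
).fetchall()

print(filtered_then_numbered, numbered_then_filtered)
```

Which of the two is wanted depends on whether the row number should be relative to the filtered result or to the full table.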