porting some stuff to BigQuery, and came across an issue.
We have a bunch of data with no unique key value. Unfortunately, some report logic requires a unique value for each row.
So in systems like Oracle I would just use the ROWNUM or ROWID pseudo-columns.
In Vertica, which doesn't have those pseudo-columns, I would use ROW_NUMBER() OVER(). But in BigQuery that is failing with the error:
'dataset:bqjob_r79e7b4147102bdd7_0000016482b3957c_1': Resources exceeded during query execution: The query could not be executed in the allotted memory.
OVER() operator used too much memory..
The value does not have to be persistent, just a unique value within the query results.
Would like to avoid extract-process-reload if possible.
So is there any way to assign a unique value to query result rows in BigQuery SQL?
Edit: Sorry, should have clarified. Using Standard SQL, not Legacy.
For ROW_NUMBER() OVER() to scale, you'll need to use PARTITION BY.
See https://stackoverflow.com/a/16534965/132438
#standardSQL
SELECT *
, FORMAT('%i-%i-%i', year, month, ROW_NUMBER() OVER(PARTITION BY year, month)) id
FROM `publicdata.samples.natality`
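If the id only needs to be unique within the results (and not sequential), another option is GENERATE_UUID(), which sidesteps the OVER() memory limit entirely. A sketch against the same sample table:
#standardSQL
SELECT *
, GENERATE_UUID() id
FROM `publicdata.samples.natality`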
Related
We are working on converting Informatica mappings to Google BigQuery SQL. In one of the mappings there are a couple of ports/columns, say A and B, which are not grouped by in the Aggregator transformation and have no aggregation function (SUM, AVG, etc.) applied to them.
According to senior devs in my org, in Informatica we will get the last values of these ports/columns as a result after the aggregator. My question is: how do we reproduce this behaviour in BigQuery SQL? We cannot reference columns in the select statement that are not present in the GROUP BY clause, and we don't want to group by these columns.
For getting the last value of a column we have the LAST_VALUE() analytic function in BigQuery, but even then we cannot use GROUP BY and an analytic function in the same select statement.
I would really appreciate some help!
Use some aggregation function.
In Informatica you will get the LAST value. This is not deterministic. It basically means that either
you have the same value across the whole column,
you don't care which one you get, or
you have a specific order by which the last value is taken.
The first two cases mean you can use MIN / MAX / whatever: the result will be the same, or you don't care.
If the last one is your case, ARRAY_AGG should help you, as per this answer.
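A sketch of the ordered case in BigQuery, assuming a hypothetical table with group-by column id, pass-through ports a and b, and a ts column that defines which row is "last":
#standardSQL
SELECT id,
  ARRAY_AGG(a ORDER BY ts DESC LIMIT 1)[OFFSET(0)] AS last_a,  -- last a per id
  ARRAY_AGG(b ORDER BY ts DESC LIMIT 1)[OFFSET(0)] AS last_b   -- last b per id
FROM mytable
GROUP BY id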
To convert an Infa mapping with an aggregator to BigQuery SQL, I would use ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn and then put a filter rn = 1 outside.
In the Informatica aggregator, id is the group-by column.
The equivalent SQL should look like this:
SELECT a, b, id
FROM
  (SELECT a, b, id,
          ROW_NUMBER() OVER (PARTITION BY id ORDER BY id DESC) AS rn -- this mimics the Informatica aggregator; id is the group-by port. If you have a sorter before the aggregator, add all its ports to the ORDER BY in the same sequence but with reversed order (asc/desc)
   FROM mytable) rs
WHERE rs.rn = 1 -- this ensures the latest row is picked.
I'm trying to understand how a window function works internally. Given this sample data:
ID,Amt
A,1
B,2
C,3
D,4
E,5
If I run this, it will give the sum of all amounts in the total column against every record:
SELECT ID, SUM(AMT) OVER () total FROM table
but when I run this, it gives me a cumulative sum:
SELECT ID, SUM(AMT) OVER (ORDER BY ID) total FROM table
I am trying to understand what is happening with OVER() versus OVER(ORDER BY ID).
What I've understood is that when no partition is defined in OVER, it treats everything as a single partition. But I can't understand why, when we add ORDER BY ID within OVER(), it starts doing a cumulative sum.
Can anyone share what's happening behind the scenes for this ?
That is an interesting case; based on the documentation, here are the explanation and an example.
If PARTITION BY is not specified, the function treats all rows of the
query result set as a single partition. The function will be applied to
all rows in the partition if you don't specify an ORDER BY clause.
So if you specify ORDER BY, then
If it is specified, and a ROWS/RANGE is not specified, then the default
RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as the default for the
window frame by the functions that can accept an optional ROWS/RANGE
specification (for example MIN or MAX).
So technically these two commands are the same:
SELECT ID, SUM(AMT) OVER (ORDER BY ID) total FROM table
SELECT ID, SUM(AMT) OVER (ORDER BY ID RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) total FROM table
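With the five sample rows from the question, both queries produce the same running total, since each ID is its own peer group:
ID,total
A,1
B,3
C,6
D,10
E,15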
You can read more in the documentation: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15
This is not specific to Oracle; it's part of the SQL standard and behaves the same way in many databases, including Oracle, DB2, PostgreSQL, SQL Server, MySQL, MariaDB, H2, etc.
By definition, when you include the ORDER BY clause the engine will produce "running values" (cumulative aggregation) inside each partition; without the ORDER BY clause it produces the same, single value that aggregates the whole partition.
Now, the partition itself is mainly defined by the PARTITION BY clause. In its absence, the whole result set is considered as a single partition.
Finally, as a more advanced topic, the partition can be further tweaked using a "frame" clause (ROWS and RANGE) and by a "frame exclusion" clause (EXCLUDE).
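For instance, a frame clause can turn the running sum from the question into a two-row moving sum (the frame choice here is just for illustration):
SELECT ID, SUM(AMT) OVER (ORDER BY ID ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) total FROM table
On the sample data this yields 1, 3, 5, 7, 9: each row's amount plus the previous row's.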
I have a database that stores data from sensors in a factory. The DB contains about 1.6 million rows per sensor per day. I have the following index on the DB.
CREATE INDEX sensor_name_time_stamp_index ON sensor_data (time_stamp, sensor_name);
I will be running the following query once per day.
SELECT
time_stamp, value
FROM
(SELECT
time_stamp,
value,
lead(value) OVER (ORDER BY value DESC) as prev_result
FROM
sensor_data WHERE time_stamp between '2021-02-24' and '2021-02-25' and sensor_name = 'sensor8'
ORDER BY
time_stamp DESC) as result
WHERE
result.value IS DISTINCT FROM result.prev_result
ORDER BY
result.time_stamp DESC;
The query returns a list of rows where the value is different from the previous row.
This query takes about 23 seconds to run.
Running on PostgreSQL 10.12 on Aurora serverless.
Questions: Besides the index, are there any other optimisations that I can perform on the DB to make the query run faster?
To support the query optimally, the index must be defined the other way around:
CREATE INDEX ON sensor_data (sensor_name, time_stamp);
Otherwise, PostgreSQL will have to read all index values in the time interval, then discard the ones for the wrong sensor, then fetch the rows from the table.
With the proper column order, only the required rows are scanned in the index.
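You can verify the difference by comparing plans; with the rewritten index the plan should show an index scan whose condition includes both sensor_name and time_stamp. A sketch:
EXPLAIN (ANALYZE, BUFFERS)
SELECT time_stamp, value
FROM sensor_data
WHERE time_stamp BETWEEN '2021-02-24' AND '2021-02-25'
  AND sensor_name = 'sensor8';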
You asked for other optimizations: Since you have to sort rows, increasing work_mem can be beneficial. Other than that, more memory and faster storage will definitely not harm.
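A minimal sketch of that (the 256MB figure is only an illustration; size it to what the instance can spare):
-- raise the per-sort memory budget for this session only
SET work_mem = '256MB';
-- ... run the daily query ...
RESET work_mem;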
I have a table which contains a lot of data (10 crore, i.e. 100 million rows). For paging purposes I have used
ROW_NUMBER() OVER (ORDER BY id), but the select is getting very slow. Could you please suggest an alternative solution? (Using Vertica.)
You can create a temp table with an identity column. Good read:
How to dynamically number rows in a SELECT Transact-SQL statement
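A sketch of that approach in Vertica, assuming your version allows an IDENTITY column on a temp table (table and column names are made up):
-- the IDENTITY column numbers rows as they are inserted
CREATE LOCAL TEMPORARY TABLE paged_rows (
    rn IDENTITY(1,1),
    id INT
) ON COMMIT PRESERVE ROWS;

INSERT INTO paged_rows (id)
SELECT id FROM mytable ORDER BY id;

-- fetch one page, e.g. rows 101 to 200
SELECT id FROM paged_rows WHERE rn BETWEEN 101 AND 200;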
I have a table in Oracle consisting of around 55 million records, partitioned on a date column.
This table stores around 600,000 records per day, based on some position.
Now, some analytic functions are used in one select query in a procedure, e.g. LEAD, LAG, ROW_NUMBER() OVER (PARTITION BY col1, date ORDER BY col1, date), which is taking too much time due to the PARTITION BY and ORDER BY clauses on the date column.
Is there any alternative way to optimize the query?
Have you considered using a materialized view to store the results of your analytical functions?
More information about MVs:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm
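A minimal sketch of that in Oracle (the table and amt column are hypothetical, the refresh policy is illustrative, and the date column is renamed date_col since DATE is a reserved word):
-- precompute the analytic columns once; queries then read the MV
CREATE MATERIALIZED VIEW positions_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT col1,
       date_col,
       amt,
       LAG(amt)  OVER (PARTITION BY col1, date_col ORDER BY col1, date_col) AS prev_amt,
       LEAD(amt) OVER (PARTITION BY col1, date_col ORDER BY col1, date_col) AS next_amt,
       ROW_NUMBER() OVER (PARTITION BY col1, date_col ORDER BY col1, date_col) AS rn
FROM my_partitioned_table;
Refresh it after each day's load (e.g. DBMS_MVIEW.REFRESH('POSITIONS_MV')) so the window functions are computed once instead of on every query.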