I want to generate unique ids while inserting into a BigQuery table. ROW_NUMBER() OVER() fails with a "resources exceeded" error. Forums recommend using ROW_NUMBER() OVER(PARTITION BY ...). Unfortunately, PARTITION BY can't be used on its own, as it may produce the same row numbers for different partition-key values. Please note that the data I am trying to insert is at least a few hundred million rows every day.
Unfortunately, PARTITION BY can't be used, as it may produce the same row numbers for different partition-key values
Yes, you will get the same numbers in different partitions, so you can build a compound key instead, as in the much-simplified example below. It is just meant to show the approach; you should be able to tweak it for your specific case.
#standardSQL
WITH `project.dataset.table` AS (
SELECT value, CAST(10*RAND() AS INT64) partitionid
FROM UNNEST(GENERATE_ARRAY(1, 100)) value
)
SELECT
partitionid,
value,
-- Build a compound key: partition prefix + per-partition row number
CONCAT(
CAST(1000 + partitionid AS STRING),
CAST(10000 + ROW_NUMBER() OVER(PARTITION BY partitionid ORDER BY value) AS STRING)
) id
FROM `project.dataset.table`
-- ORDER BY id
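For a real load, the same idea can be applied at insert time. The sketch below is hypothetical: the staging and target table names and the payload column are assumptions, and the padding constants must be sized so no partition's row count ever overflows its fixed width.
#standardSQL
INSERT INTO `project.dataset.target` (id, payload)
SELECT
  CONCAT(
    CAST(10000 + partitionid AS STRING),  -- partition prefix
    CAST(10000000 + ROW_NUMBER() OVER(PARTITION BY partitionid ORDER BY payload) AS STRING)  -- per-partition counter
  ) AS id,
  payload
FROM (
  -- Spread rows across ~1000 random partitions so each ROW_NUMBER() stays small
  SELECT payload, CAST(1000 * RAND() AS INT64) AS partitionid
  FROM `project.dataset.staging`
)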
Update: This was answered here.
I am putting together a somewhat complex query to do event detection, join(s), and time-based binning with a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me elegantly detect time-series events, but whenever I try to use a match_recognize expression within a common table expression (with .. as ..), I receive the following error:
SQL compilation error: MATCH_RECOGNIZE not supported in this context.
I've done a lot of searching/reading, but haven't found any documented limitations on match_recognize in CTEs. Here's my query:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
),
label_events as (
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
)
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
And I get the same error as above with this.
Is this a limitation that I'm not seeing, or am I doing something wrong?
A non-recursive CTE can always be rewritten as an inline view:
--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
from dataset) clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
) mr
-- ) -- if other transformations are required
It is not ideal, but at least it allows the query to run.
Per a comment by Filipe Hoffa on this thread: MATCH_RECOGNIZE with CTE in Snowflake
This seemed to be an undocumented limitation of Snowflake at the time. A two- or three-step solution has worked well for me:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
)
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
);
-- Capture the query id of the match_recognize step so its cached result can be reused below
set quid=last_query_id();
with label_events as (
select *
from table(result_scan($quid))
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
I prefer to use a variable here, because I can re-run the second query multiple times during development/debugging without having to re-run the first query.
It is also important to note that cached GEOGRAPHY objects in Snowflake are converted to GEOJSON, so when retrieving these with result_scan, you must typecast them back to the GEOGRAPHY type.
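For example, assuming a hypothetical geom column in the cached result, the cast back could look like this:
select id, to_geography(geom) as geom
from table(result_scan($quid));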
When creating a table, let's say "orders", with partitioning in the following way, my result gets truncated compared to creating it without partitioning (commenting and uncommenting the two PARTITION BY lines).
I suspect that it might have something to do with the BQ limits (found here), but I can't figure out what. The ts column is a TIMESTAMP and order_id is a UUID string.
That is, the COUNT(DISTINCT order_id) on the last line yields very different results: when the table is partitioned it returns far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer'; I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What is the number you'd get if you do query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;
It turns out that there's a 60-day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
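For reference, one way to update it from SQL (a sketch; the bq command-line tool works too) is to alter the table's partition_expiration_days option; setting it to NULL should remove the expiration, or you can set a number of days large enough to cover the full history:
#standardSQL
-- Remove the 60-day partition expiration inherited from the dataset default
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = NULL);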
This is a very simple question, but I can't seem to find documentation on it. How would one query rows by index (i.e. select the 10th through 20th row in a table)?
I know there's a row_number function, but it doesn't seem to do what I want.
Do not specify any partition, so your row number will be an integer between 1 and your number of records:
SELECT row_num FROM (
SELECT row_number() over () as row_num
FROM your_table
)
WHERE row_num between 100000 and 100010
I seem to have found a roundabout and clunky way of doing this in Athena, so any better answers are welcome. This approach requires you have some numeric column in your table already, in this case named some_numeric_column:
SELECT some_numeric_column, row_num FROM (
SELECT some_numeric_column,
row_number() over (order by some_numeric_column) as row_num
FROM your_table
)
WHERE row_num between 100000 and 100010
To explain: you first select some numeric column in your data, then create a column (called row_num) of row numbers based on the order of that numeric column. You then wrap it all in an outer SELECT, because Athena doesn't support creating and then filtering on the row_num column within a single query; if you don't wrap it in a second SELECT, Athena will spit out errors about not finding a column named row_num.
Is there an aggregate function that returns any value from a group? I could use MIN or MAX, but would rather avoid the overhead if possible, given it's a text field.
My situation is an error log summary. The errors are grouped by the type of error and an example of the error text is displayed for each group. It doesn't matter which error message is used as the example.
SELECT
ref_code,
log_type,
error_number,
COUNT(*) AS count,
MIN(data) AS example
FROM data
GROUP BY
ref_code,
log_type,
error_number
What can I replace MIN(data) with so that I don't have to compare hundreds of thousands of varchar(2000) values?
You can use MIN coupled with KEEP, like this:
MIN(data) keep (dense_rank first order by rowid) AS EXAMPLE
The idea behind this is that the database engine will sort the data by ROWID instead of by the VARCHAR(2000) values, which theoretically should be faster. You can replace ROWID with the primary key value and check whether that is faster.
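Dropped into the query from the question, that would look something like this (unchanged apart from the KEEP clause on the example column):
SELECT
  ref_code,
  log_type,
  error_number,
  COUNT(*) AS count,
  -- Pick the data value from the ROWID-first row instead of comparing the varchar(2000) values
  MIN(data) KEEP (DENSE_RANK FIRST ORDER BY ROWID) AS example
FROM data
GROUP BY
  ref_code,
  log_type,
  error_number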
Going by the proposed answers, it appears that MIN(data) (or MAX(data)) is the fastest way to achieve what I want; I was trying to over-optimise unnecessarily.
I'll try out any other answers that come up while I have access to this database, but in the meantime, this comes out on top.
Thank you for everyone's effort!
Well, since you asked about OVER with PARTITION BY and ORDER BY, below is a version that does your GROUP BY but also uses ROW_NUMBER() with OVER (PARTITION BY ... ORDER BY ...) to number the first ref_code, log_type, error_number combination it comes across as row number 1 (with whatever data column is there). It then restarts the numbering at 1 for the next distinct ref_code, log_type, error_number combination it finds (again with whatever data column happens to be there). You can then simply pull the data field at row number 1 as a representative data field for a given ref_code, log_type, error_number combination.
It's still lacking something: it would be more elegant if I didn't have the double pass (once for aggregation and once for ROW_NUMBER()); however, it might perform very well nonetheless. I'll have to think about it some more to see if I can eliminate the double pass.
But it avoids any comparison of the large data field, and it represents a way to do what you asked: pulling one representative sample from the data field alongside the aggregated fields.
SELECT
t.ref_code,
t.log_type,
t.error_number,
t.count,
d.data
FROM
(
SELECT
ref_code,
log_type,
error_number,
COUNT(*) as count
FROM data
GROUP BY
ref_code,
log_type,
error_number
) t
INNER JOIN
(
SELECT
ref_code,
log_type,
error_number,
data,
ROW_NUMBER() OVER
(
PARTITION BY
ref_code,
log_type,
error_number
ORDER BY
ref_code,
log_type,
error_number
) as row_number
FROM data
) d on
d.ref_code = t.ref_code and
d.log_type = t.log_type and
d.error_number = t.error_number and
row_number = 1
Final caveat: I don't have Oracle to try this on. But I did put it together from reading Oracle documentation.
I added the version below after I thought further about how to eliminate the GROUP BY, which I only had in there for COUNT(*). I don't know if it's any faster, though.
SELECT *
FROM
(
SELECT
ref_code,
log_type,
error_number,
data,
ROW_NUMBER() OVER
(
PARTITION BY
ref_code,
log_type,
error_number
ORDER BY
ref_code,
log_type,
error_number
) as row_number,
COUNT(*) OVER
(
PARTITION BY
ref_code,
log_type,
error_number
ORDER BY
ref_code,
log_type,
error_number
) as count
FROM data
) t
WHERE row_number = 1
I need a computed column formula that gives me this format: yyMMdd##.
I have an identity column (DataID) and a date column (DataDate).
This is what I have so far:
(((right(CONVERT([varchar](4),datepart(year,[DataDate]),0),(2))+
right(CONVERT([varchar](4),datepart(month,[DataDate]),0),(2)))+
right(CONVERT([varchar](4),datepart(day,[DataDate]),0),(2)))+
right('00'+CONVERT([varchar](2),[DataID],0),(2)))
And this gives me:
12111201
12111202
12111303
12111304
12111405
12111406
12111407
12111508
What I want is:
12111201
12111202
12111301
12111302
12111401
12111402
12111403
12111501
I'm assuming you want a sequence starting at 1 for each date - right? If not, please explain what you really want / need.
You won't be able to do this with an IDENTITY column and a computed column specification; an IDENTITY column returns constantly increasing numbers.
What you could do is not store those values on disk at all, but instead use a CTE with the ROW_NUMBER() OVER (PARTITION BY ...) construct to create those numbers on the fly, whenever you need to select them. Or have a job that sets those values based on such a CTE on a regular basis (e.g. once every hour or so).
That CTE might look something like this - again, assuming that DataDate is indeed of type DATE (and not DATETIME or something like that):
;WITH CTE AS
(
SELECT
DataID, DataDate,
RowNum = ROW_NUMBER() OVER (PARTITION BY DataDate ORDER BY DataID)
FROM
dbo.YourTable
)
SELECT
DataID, DataDate, RowNum
FROM
CTE
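And if you want the yyMMdd## string itself rather than the raw row number, you could build it in the outer SELECT. This is just a sketch (DisplayID is a made-up alias; CONVERT style 12 renders the date as yymmdd, and the ## part assumes at most 99 rows per date):
;WITH CTE AS
(
    SELECT
        DataID, DataDate,
        RowNum = ROW_NUMBER() OVER (PARTITION BY DataDate ORDER BY DataID)
    FROM
        dbo.YourTable
)
SELECT
    DataID, DataDate,
    -- style 12 formats the date as yymmdd; RIGHT('00' + ..., 2) zero-pads the per-date counter
    CONVERT(varchar(6), DataDate, 12) + RIGHT('00' + CAST(RowNum AS varchar(10)), 2) AS DisplayID
FROM
    CTE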