ROW_NUMBER() in BigQuery changes every time I rerun the query

I am using BigQuery and am using ROW_NUMBER to give my data an identifier. However, every time I rerun the query, ROW_NUMBER gives me a different outcome.
My table has 12 fields in total, and I use this expression:
ROW_NUMBER() OVER(ORDER BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) row_number
I have run the query multiple times, and the outcome is different on every run.
For example:
1st Run: Merchant A has Row 1212
2nd Run: The exact same Merchant A has Row 2938
Is there anything I'm doing wrong here? Thanks

It turns out you shouldn't use ordinal numbers in the ORDER BY of a window function. Inside an OVER() clause, the integers are treated as constant literals, not column positions, so every row ties and the ordering is nondeterministic.
Using the field names instead works like magic: ROW_NUMBER() keeps giving the same answer.
Instead of this:
ROW_NUMBER() OVER(ORDER BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) row_number
Use this:
ROW_NUMBER() OVER(ORDER BY field_name_1, field_name_2) row_number
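Note that the numbering is only stable across runs if the ORDER BY columns uniquely identify each row; ties are still broken arbitrarily. A minimal sketch of a full query, where merchant_id, transaction_ts, and the table path are hypothetical names:
SELECT
  *,
  -- order by a column set that uniquely identifies each row,
  -- so the assigned numbers are stable across reruns
  ROW_NUMBER() OVER(ORDER BY merchant_id, transaction_ts) AS row_number
FROM `project.dataset.merchants`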

Related

SQL script to produce the output shown in the screenshot

I want to write a SQL script that produces the output shown in the screenshot. Thank you.
I've tried the MAX() function to aggregate the ESSBASE_MONTH field, to make it distinct and display a single month in the output instead of multiple months. I have yet to figure out how to put 0 in any month where the EMPID did not make a sale, like December under "Total GreaterThan 24 HE Account" and "Total_HE_Accounts".
The field names are not very informative; however, based on the screenshot, this is the best answer I could come up with.
Assuming the table name is SALES:
select
ADJ_EMPID,
ESSBASE_MONTH,
MAX(YTD_COUNT) AS YTD_COUNT,
SUM(TOTAL_24) AS TOTAL_24,
SUM(TOTAL_ACC) AS TOTAL_ACC
from SALES
group by
ADJ_EMPID,
ESSBASE_MONTH
The above will aggregate the monthly sales data as expected.
To add the 'missing' rows, such as December, you can union the base data with a virtual table:
select
MAX(MONTH_NUMBER) AS MONTH_NUMBER,
ADJ_EMPID,
ESSBASE_MONTH,
MAX(YTD_COUNT) AS YTD_COUNT,
SUM(TOTAL_24) AS TOTAL_24,
SUM(TOTAL_ACC) AS TOTAL_ACC
from (
select
1 as MONTH_NUMBER,
*
from SALES
union all
select * from (values
(1, '300014366', 'January', 0, 0, 0),
(2, '300014366', 'February', 0, 0, 0),
-- add the other missing months as required
(11, '300014366', 'November', 0, 0, 0),
(12, '300014366', 'December', 0, 0, 0)
) TEMP_TABLE (MONTH_NUMBER, ADJ_EMPID, ESSBASE_MONTH, YTD_COUNT, TOTAL_24, TOTAL_ACC)
) as AGGREGATED_DATA
group by
ADJ_EMPID,
ESSBASE_MONTH
order by MONTH_NUMBER;
TEMP_TABLE is a virtual table that contains all the months with sales of zero. A special field MONTH_NUMBER is added to sort the months in the proper order.
It is not the easiest query to understand, and the requirement is not exactly feasible either.
Link to fiddledb for a working solution with PostgreSQL 15.
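As a design note: hard-coding one set of virtual rows per employee does not scale. A cross join against a month lookup generates the missing rows for every employee at once; a rough sketch under the same assumed SALES columns (PostgreSQL syntax, like the fiddle):
select
  e.ADJ_EMPID,
  m.ESSBASE_MONTH,
  -- months with no sales rows fall back to zero
  coalesce(max(s.YTD_COUNT), 0) as YTD_COUNT,
  coalesce(sum(s.TOTAL_24), 0) as TOTAL_24,
  coalesce(sum(s.TOTAL_ACC), 0) as TOTAL_ACC
from (select distinct ADJ_EMPID from SALES) e
cross join (values
  (1, 'January'), (2, 'February'), (3, 'March'), (4, 'April'),
  (5, 'May'), (6, 'June'), (7, 'July'), (8, 'August'),
  (9, 'September'), (10, 'October'), (11, 'November'), (12, 'December')
) m (MONTH_NUMBER, ESSBASE_MONTH)
left join SALES s
  on s.ADJ_EMPID = e.ADJ_EMPID
 and s.ESSBASE_MONTH = m.ESSBASE_MONTH
group by e.ADJ_EMPID, m.ESSBASE_MONTH, m.MONTH_NUMBER
order by e.ADJ_EMPID, m.MONTH_NUMBER;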

Google BigQuery Resources exceeded during query execution. How to split large window frames with partition in SQL

I'm running out of memory with my query on Google BigQuery.
I have to calculate multiple window functions like running sums over multiple different time frames.
My data mainly consists of an id (string), a value (number), a type ('in' or 'out'; could be converted to bool if needed), and a timestamp.
I read that there is no way to increase memory per slot, so the only way to execute the query is to cut it into smaller pieces that can be sent to different slots. One way to do this is GROUP BY or OVER (PARTITION BY ...), but I have no idea how to rewrite my query to make use of them.
I have some calculations that need PARTITION BY, but for others I want the overall total. For example:
Imagine I have a large table (> 1 billion rows) where I want to calculate a rolling sum over all values for different time frames, independent of id.
WITH data AS (
SELECT *
FROM UNNEST([
STRUCT
('A' as id,1 as value, 'out' as type, 1 as time),
('A', -1, 'in', 2),
('B', 2, 'out', 2),
('C', 1, 'out', 3),
('B', -1, 'in', 4),
('A', 2, 'out', 4),
('C', 5, 'out', 5),
('B', 3, 'out', 6),
('A', 1, 'out', 6),
('A', -4, 'in', 6),
('C', -3, 'in', 7)
])
)
SELECT
id
, value
, type
, time
, SUM(value) OVER (ORDER BY time RANGE UNBOUNDED PRECEDING) as total
, SUM(value) OVER (ORDER BY time RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) as total_last_day
, SUM(value) OVER (ORDER BY time RANGE BETWEEN 3 PRECEDING AND 2 PRECEDING) as total_prev_day
FROM data
How could I split this query to make use of PARTITION BY or GROUP BY in order to fit within the memory limits?
Try the approach below - I think it has a good chance of resolving your issue.
SELECT *
FROM data
JOIN (
SELECT time
, SUM(time_value) OVER (ORDER BY time RANGE UNBOUNDED PRECEDING) as total
, SUM(time_value) OVER (ORDER BY time RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) as total_last_day
, SUM(time_value) OVER (ORDER BY time RANGE BETWEEN 3 PRECEDING AND 2 PRECEDING) as total_prev_day
FROM (
SELECT time, SUM(value) time_value
FROM data
GROUP BY time
)
)
USING (time)
This works because the inner GROUP BY collapses the data to one row per time value, so the window functions run over far fewer rows. Since the RANGE frames treat rows with the same time as peers anyway, applying this to the sample data in your question yields the same output as the original query.
Window functions, especially with ORDER BY or PARTITION BY, are in general very heavy, and using them on big data can take a long time.
It looks like your expected result is keyed off of the id in your sample query.
There are a few things you can check:
See if your source data can be clustered by id, so processing is faster to start with.
If that does not work, see whether adding the following style of filter helps:
select ...
where mod(farm_fingerprint(id), 5) = 0
Store the result in a table and keep appending the next mods (= 1 to 4) to that table. The "% 5 = 0" mod is given as a sample; you have to experiment with it, knowing your source data. Using mod here splits your source data into 5 smaller buckets, so you have to append them all later; a sketch of the full script follows below.
That way, the amount of data needed in BigQuery's internal memory is smaller, and it might process the results within its limits.
If any of the above ideas work, you can do it all in a SQL script, creating temp tables and working with those.
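A minimal sketch of that bucketing script, assuming the heavy sums are partitioned by id as in your sample; the table names are placeholders, and ABS() is added because FARM_FINGERPRINT can return negative values:
-- bucket 0: create the results table from the first slice of ids
CREATE TABLE dataset.results AS
SELECT
  id, value, type, time,
  SUM(value) OVER(PARTITION BY id ORDER BY time) AS running_total
FROM dataset.source
WHERE MOD(ABS(FARM_FINGERPRINT(id)), 5) = 0;
-- buckets 1 to 4: append the remaining slices one at a time
INSERT INTO dataset.results
SELECT
  id, value, type, time,
  SUM(value) OVER(PARTITION BY id ORDER BY time) AS running_total
FROM dataset.source
WHERE MOD(ABS(FARM_FINGERPRINT(id)), 5) = 1;
-- ...repeat for = 2, = 3, and = 4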

Bigquery equivalent for pandas fillna(method='ffill') [duplicate]

I have a Big Query table that looks like this:
(Screenshot of the table: https://ibb.co/1ZXMH71)
As you can see most values are empty.
I'd like to forward-fill those empty values, meaning using the last known value ordered by time.
Apparently, there is a function for that called FILL
https://cloud.google.com/dataprep/docs/html/FILL-Function_57344752
But I have no idea how to use it.
This is the query I tried in the web UI:
SELECT sns_6,Time
FROM TABLE_PATH
FILL sns_6,-1,0 order: Time
the error I get is:
Syntax error: Unexpected identifier "sns_6" at [3:6]
What I want is to get a new table where the column sns_6 is filled with the last known value.
As a bonus: I'd like this to happen for all columns, but because FILL only supports a single column, for now I'll have to iterate over all the columns. If anyone has an idea of how to do that iteration, it would be a great bonus.
Below is for BigQuery Standard SQL
I'd like to forward-fill those empty values, meaning using the last known value ordered by time
#standardSQL
SELECT time,
LAST_VALUE(sns_1 IGNORE NULLS) OVER(ORDER BY time) sns_1,
LAST_VALUE(sns_2 IGNORE NULLS) OVER(ORDER BY time) sns_2
FROM `project.dataset.table`
I'd like this to happen for all columns
You can add as many of the lines below as you have columns to fill (obviously, you need to replace sns_N with the real column name):
LAST_VALUE(sns_N IGNORE NULLS) OVER(ORDER BY time) sns_N
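For the bonus part, a rough sketch of one way to skip the manual iteration, using BigQuery scripting to build the column list from INFORMATION_SCHEMA; `project.dataset.table` and the `time` ordering column are placeholders:
DECLARE fill_exprs STRING;
-- build one LAST_VALUE(... IGNORE NULLS) expression per column, except the ordering column
SET fill_exprs = (
  SELECT STRING_AGG(
    FORMAT('LAST_VALUE(%s IGNORE NULLS) OVER(ORDER BY time) AS %s', column_name, column_name),
    ', ')
  FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'table' AND column_name != 'time'
);
-- run the generated forward-fill query
EXECUTE IMMEDIATE FORMAT('SELECT time, %s FROM `project.dataset.table`', fill_exprs);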
I'm not sure what your screen shot has to do with your query.
I think this will do what you want:
SELECT sns_6, Time,
LAST_VALUE(sns_6 IGNORE NULLS) OVER (ORDER BY Time) as imputed_sns_6
FROM TABLE_PATH;
EDIT:
This query works fine when I run it:
select table_path.*, last_value(sn_6 ignore nulls) over (order by time)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;

SQL query involving count, group by and substring

I would like to group rows of this table according to the dates which form the start of SessionID and, for each day, count how many rows there are for each set of ReqPhone values. Each set of ReqPhone values is defined by the first four digits of ReqPhone. In other words, I would like to know how many rows there are for ReqPhone starting with 0925, 0927 and 0940, how many rows there are for ReqPhone starting with 0979, 0969 and 0955, and so on.
I have been trying all kinds of group by and count but still haven't arrived at the right query.
Can anybody enlighten me?
Update:
In my country, the government assigns telecoms phone numbers starting with certain digits. Therefore, if you know the starting digits, you know which telecom someone is using. I am trying to count how many messages are sent each day through each telecom.
You can group by the phone prefix and the day derived from SessionID (SQL Server syntax; this assumes SessionID is a datetime value):
SELECT SUBSTRING(ReqPhone, 1, 4),
DATEADD(DAY,0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
COUNT(*) AS tally
FROM yourTable
GROUP BY SUBSTRING(ReqPhone, 1, 4),
DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))
Or, equivalently, using LEFT():
SELECT LEFT(ReqPhone, 4),
DATEADD(DAY,0, DATEDIFF(DAY, 0, SessionID)) AS dayCreated,
COUNT(*) AS tally
FROM yourTable
GROUP BY LEFT(ReqPhone,4),
DATEADD(DAY, 0, DATEDIFF(DAY, 0, SessionID))
This will help you calculate the count of rows grouped by the ReqPhone prefix. This query runs successfully in Oracle DB.
SELECT COUNT(SESSIONID), REQP
FROM (SELECT SESSIONID,SUBSTR(REQPHONE,1,4) AS REQP FROM SCHEMA_NAME.TABLE_NAME)
GROUP BY REQP
Note: please use a column that is unique in the COUNT expression.
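If SessionID really begins with a date string rather than being a datetime (as the question suggests), a simpler sketch that groups on the leading characters directly; the 8-character yyyymmdd prefix is an assumption:
SELECT SUBSTR(SessionID, 1, 8) AS day_created,
       SUBSTR(ReqPhone, 1, 4) AS telecom_prefix,
       COUNT(*) AS tally -- one row per day per prefix
FROM yourTable
GROUP BY SUBSTR(SessionID, 1, 8), SUBSTR(ReqPhone, 1, 4)
ORDER BY day_created, telecom_prefix;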