Generating a stable uuid ID across with clauses in GBQ [duplicate] - google-bigquery

This question already has an answer here:
BigQuery GENERATE_UUID() and CTE's
(1 answer)
Closed 1 year ago.
I'm trying to generate a UUID within a with clause on GBQ and then use it in a subsequent with clause for a join.
with q1 as (
select generate_uuid() as uuid
), q2 as (
select uuid from q1
)
select * from q1
union all
select * from q2
This returns two distinct UUIDs.
How would I go about generating an ID that stays the same across with clauses?

I'll start with the reason behind this discrepancy.
TL;DR: as of today there is no option to force BigQuery to materialize CTE results. It would be useful when a CTE is referenced more than once in a statement.
See the query below:
with cte_1 as (
select count(1) as row_count
from `bigquery-public-data.austin_311.311_service_requests` as sr
)
, cte_2 as (
select row_count
from cte_1
)
select row_count
from cte_1
union all
select row_count
from cte_2;
When the execution plan is examined, you'll see two Input stages for the referenced sr table.
It would be great if we had an option to materialize CTE results. As I remember, Oracle does this implicitly when a CTE is used more than once, or explicitly via hints.
Materializing q1 explicitly to a table and then using it twice might be a workaround; I'd prefer a temporary table, as sketched below.
The drawback is that the cost may increase if your project uses on-demand pricing (rather than flat-rate).
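Here is a minimal sketch of that temporary-table workaround, assuming BigQuery scripting (a multi-statement query); the temp table only lives for the duration of the script:
CREATE TEMP TABLE q1 AS
SELECT generate_uuid() AS uuid;   -- evaluated exactly once and materialized

SELECT uuid FROM q1               -- both reads now see the same stored value
UNION ALL
SELECT uuid FROM q1;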

Because a subquery executes each time it is referenced, generate_uuid() in your query will be called once for q1 and again for q2, producing two different values.
I suggest you save the generated UUID to a table, then query from that table to make sure the UUID is the same in both places.
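A hedged sketch of that suggestion; my_dataset.uuid_seed is a placeholder table name:
CREATE OR REPLACE TABLE my_dataset.uuid_seed AS
SELECT generate_uuid() AS uuid;   -- the UUID is stored, not recomputed

SELECT uuid FROM my_dataset.uuid_seed
UNION ALL
SELECT uuid FROM my_dataset.uuid_seed;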

Related

SQL CTE for a numbers table is fast for static value, but slow for table value

I'm trying to output the values of a table multiple times, based on a column in that table.
I tried to use CTE to make a numbers table on the fly:
WITH cte AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY (select 0)) AS i
FROM
sys.columns c1 CROSS JOIN sys.columns c2 CROSS JOIN sys.columns c3
)
select *
from myTable, cte
WHERE i <= myTable.timesToRepeatColumn
and myTable.id = '209386'
This SQL seems to take forever to run, so it seems to be trying to run the entire CTE before joining.
If I replace myTable.timesToRepeatColumn with a static value (say 10000), the query returns virtually instantly. So it seems to be doing the where i <= before fully cross-joining the CTE's table.
How can I tell SQL to do the where statement first like it does with a static number?
You can use a recursive CTE to achieve your goal:
WITH cte AS (
    SELECT
        id, timesToRepeatColumn          -- list myTable's columns explicitly; SELECT * would not match the recursive member once level is added
        , timesToRepeatColumn as level
    FROM
        myTable                          -- fixed typo (was "myTablewhere")
    WHERE myTable.id = '209386'
    UNION ALL
    SELECT
        id, timesToRepeatColumn
        , level - 1 as level
    FROM
        cte
    WHERE
        level > 1                        -- stops after timesToRepeatColumn copies of each row
)
SELECT * FROM cte
OPTION (MAXRECURSION 0)                  -- lift the default limit of 100 recursion levels for large repeat counts
CTEs in SQL Server are not necessarily run 'independently'. SQL (in SQL Server, etc) is declarative, which means you tell it what you want, not how to do it.
If the query optimiser determines that it can do it better by doing something differently, it will.
A good example is
IF EXISTS(SELECT * FROM test) PRINT 'X';
IF (SELECT COUNT(*) FROM test) > 0 PRINT 'Y';
IF (SELECT COUNT(*) FROM test) > 1 PRINT 'Z';
If it was doing exactly what you told it to do, the query plans for the second and third would basically be the same. However, when you run it, the query plans for the first and second are the same; the third differs.
When you hard-code the value (e.g., 10,000), the query optimiser can use that hardcoded value to determine what to do. In this case, it probably determines it doesn't need to run the full CTE, just run it until it gets 10,000 rows.
However, if you use a value that can vary (e.g., myTable.timesToRepeatColumn), then the query optimiser often makes a query plan that would work for any value. As such, it makes a query plan that is not fantastic for your situation - probably creating the full CTE in memory before using it. If sys.columns has 100 rows, that's 100^3 rows it creates. If it's 1,000, that's 1,000^3, i.e., 1,000,000,000. Likely you have more than 1,000 rows.

Execute Subquery refactoring first before any other SQL

I have a very complex view which is of the form below:
create or replace view loan_vw as
select * from (with loan_info as (select loan_table.*,commission_table.*
from loan_table,
commission_table where
contract_id=commission_id)
select /*complex transformations */ from loan_info
where type <> 'PRINCIPAL'
union all
select /*complex transformations */ from loan_info
where type = 'PRINCIPAL')
Now if I do the select below, the query hangs:
select * from loan_vw where contract_id='HA001234TY56';
But if I hardcode the value inside the subquery refactoring, or use a package-level variable in the same session, the query returns in a second:
create or replace view loan_vw as
select * from (with loan_info as (select loan_table.*,commission_table.*
from loan_table,
commission_table where
contract_id=commission_id
and contract_id='HA001234TY56'
)
select /*complex transformations */ from loan_info
where type <> 'PRINCIPAL'
union all
select /*complex transformations */ from loan_info
where type = 'PRINCIPAL')
Since I use Business Objects, I cannot use a package-level variable.
So my question is: is there a hint in Oracle that tells the optimizer to apply the contract_id filter inside the subquery refactoring of loan_vw first?
As requested, the analytic function used is the one below:
select value_date, item, credit_entry, item_paid
from (
select value_date, item, credit_entry, debit_entry,
greatest(0, least(credit_entry, nvl(sum(debit_entry) over (), 0)
- nvl(sum(credit_entry) over (order by value_date
rows between unbounded preceding and 1 preceding), 0))) as item_paid
from your_table
)
where item is not null;
After following the advice given by Boneist and MarcinJ, I removed the subquery refactoring (CTE) and wrote one long query like the one below, which improved the performance from 3 minutes to 0.156 seconds:
create or replace view loan_vw as
select /*complex transformations */
from loan_table,
commission_table where
contract_id=commission_id
and loan_table.type <> 'PRINCIPAL'
union all
select /*complex transformations */
from loan_table,
commission_table where
contract_id=commission_id
and loan_table.type = 'PRINCIPAL'
Are these transformations really so complex that you have to use UNION ALL? It's really hard to optimize something you can't see, but have you maybe tried getting rid of the CTE and implementing your calculations inline?
CREATE OR REPLACE VIEW loan_vw AS
SELECT loan.contract_id
, CASE commission.type -- or wherever this comes from
WHEN 'PRINCIPAL'
THEN SUM(whatever) OVER (PARTITION BY loan.contract_id, loan.type) -- total_whatever
ELSE SUM(something_else) OVER (PARTITION BY loan.contract_id, loan.type) -- total_something_else
END AS whatever_something
FROM loan_table loan
INNER
JOIN commission_table commission
ON loan.contract_id = commission.commission_id
Note that if your analytic functions don't have PARTITION BY contract_id you won't be able to use an index on that contract_id column at all.
Take a look at this db fiddle (you'll have to click on ... on the last result table to expand the results). Here, the loan table has an indexed (PK) contract_id column, but also a some_other_id column that is also unique but not indexed, and the predicate on the outer query is still on contract_id. If you compare the plans for partition by contract and partition by other id, you'll see that the index is not used at all in the partition-by-other-id plan: there's a TABLE ACCESS with FULL options on the loan table, as compared to an INDEX - UNIQUE SCAN in the partition-by-contract plan. That's obviously because the optimizer cannot resolve the relation between contract_id and some_other_id on its own, and so it needs to run SUM or AVG over the entire window instead of limiting window row counts through index usage.
What you can also try - if you have a dimension table with those contracts - is to join it to your results and expose the contract_id from the dimension table instead of the most likely huge loan fact table. Sometimes this can lead to an improvement in cardinality estimates through the usage of a unique index on the dimension table.
Again, it's really hard to optimize a black box, without a query or even a plan, so we don't know what's going on. CTE or a subquery can get materialized unnecessarily for example.
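For illustration only, a hedged sketch of that dimension-table idea; contract_dim is a hypothetical dimension table keyed on contract_id, and the selected columns are the ones from the example query above:
SELECT d.contract_id, v.value_date, v.item, v.credit_entry, v.item_paid
FROM contract_dim d                      -- small dimension table with a unique index on contract_id
JOIN loan_vw v
  ON v.contract_id = d.contract_id
WHERE d.contract_id = 'HA001234TY56';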
Thanks for the update to include an example of the column list.
Given your updated query, I would suggest changing your view (or possibly creating a second view for querying single contract_ids, if your original view could be used to query for multiple contract_ids - unless, of course, the results of the original view only make sense for individual contract_ids!) to something like:
CREATE OR REPLACE VIEW loan_vw AS
WITH loan_info AS (SELECT l.*, c.* -- for future-proofing, you should list the column names explicitly; if this statement is rerun and there's a column with the same name in both tables, it'll fail.
FROM loan_table l
INNER JOIN commission_table c ON l.contract_id = c.commission_id -- you should always alias the join condition columns for ease of maintenance.
)
SELECT value_date,
item,
credit_entry,
debit_entry,
GREATEST(0,
LEAST(credit_entry,
NVL(SUM(debit_entry) OVER (PARTITION BY contract_id), 0)
- NVL(SUM(credit_entry) OVER (PARTITION BY contract_id ORDER BY value_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0))) AS item_paid
FROM loan_info
WHERE TYPE <> 'PRINCIPAL'
UNION ALL
SELECT ...
FROM loan_info
WHERE TYPE = 'PRINCIPAL';
Note that I've converted your join into ANSI syntax, because it's easier to understand than the old style joins (easier to separate join conditions from predicates, for a start!).

SQL Server - Pagination Without Order By Clause

My situation is that a SQL statement which is not predictable is given to the program, and I need to do pagination on top of it. The final SQL statement would be similar to the following one:
SELECT * FROM (*Given SQL Statement*) b
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;
The problem here is that the *Given SQL Statement* is unpredictable. It may or may not contain an order by clause. I am not able to change the result of this SQL statement, and I need to do pagination on it.
I searched for solutions on the Internet, but all of them suggested using an arbitrary column, like the primary key, in the order by clause. But that would change the original order.
The short answer is that it can't be done, or at least can't be done properly.
The problem is that SQL Server (or any RDBMS) does not and can not guarantee the order of the records returned from a query without an order by clause.
This means that you can't use paging on such queries.
Furthermore, if you use an order by clause on a column whose values appear multiple times in your result set, the order of the result set is still not guaranteed inside groups of identical values in said column - quick example:
;WITH cte (a, b)
AS
(
SELECT 1, 'a'
UNION ALL
SELECT 1, 'b'
UNION ALL
SELECT 2, 'a'
UNION ALL
SELECT 2, 'b'
)
SELECT *
FROM cte
ORDER BY a
Both result sets are valid, and you can't know in advance which one you will get:
a b
-----
1 b
1 a
2 b
2 a
a b
-----
1 a
1 b
2 a
2 b
(and of course, you might get other sorts)
The problem here is that the *Given SQL Statement* is unpredictable. It may or may not contain an order by clause.
Your inner query (the unpredictable SQL statement) should not contain an order by; even if it does, the order is not guaranteed.
To get a guaranteed order, you have to order by some column. For the results to be deterministic, the ordered column(s) should be unique, as in the sketch below.
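A minimal sketch of that point; sort_col and unique_id are placeholder column names that the given statement would have to expose:
SELECT *
FROM ( /* given SQL statement, without its own ORDER BY */ ) AS b
ORDER BY b.sort_col, b.unique_id   -- the unique column breaks ties, so every page is deterministic
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;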
Please note: what I'm about to suggest is probably horribly inefficient and should really only be used to help you go back to the project leader and tell them that pagination of an unordered query should not be done. Having said that...
From your comments you say you are able to change the SQL statement before it is executed.
You could write the results of the original query to a temporary table, adding a row number field to be used for subsequent pagination ordering.
Therefore any original ordering is preserved and you can now paginate.
But of course the reason for needing pagination in the first place is to avoid sending large amounts of data to the client application. Although this does prevent that, you will still be copying data to a temp table which, depending on the row size and count, could be very slow.
You also have the problem that the page size is coming from the client as part of the SQL statement. Parsing the statement to pick that out could be tricky.
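A hedged sketch of that temp-table approach; #page_buffer and the ROW_NUMBER trick are my assumptions, not the answerer's exact code:
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn, q.*   -- rn is an arbitrary but stable numbering of the result
INTO #page_buffer
FROM ( /* given SQL statement */ ) AS q;

SELECT *
FROM #page_buffer
ORDER BY rn
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;

DROP TABLE #page_buffer;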
As others have noted, paginating without a sorted query is not safe anyway. But since you know about it and have searched for it, I can suggest using a query like this (though it is not recommended as a good approach):
;with cte as (
select *,
row_number() over (order by (select 0)) rn
from (
-- Your query
) t
)
select *
from cte
where rn between (@pageNumber-1)*@pageSize+1 and @pageNumber*@pageSize
I finally found a simple way to do it without any order by on a specific column:
declare @start AS INTEGER = 1, @count AS INTEGER = 5;

select *
from (select *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS fakeCounter
      from (select * from mytable) AS t) AS t2
order by fakeCounter
OFFSET @start ROWS FETCH NEXT @count ROWS ONLY
where select * from mytable can be any query

SQL Eliminate Duplicates with NO ID

I have a table with the following Columns...
Node, Date_Time, Market, Price
I would like to delete all but one record for each Node, Date_Time pair.
SELECT Node, Date_Time, MAX(Price)
FROM Hourly_Data
Group BY Node, Date_Time
That gets the results I would like to see, but I can't figure out how to remove the other records.
Note - There is no ID for this table
Here are steps that are more of a workaround than a single simple command, but they will work in any relational database:
Create new table that looks just like the one you already have
Insert the data computed by your group-by query to newly created table
Drop the old table
Rename new table to the name the old one used to have
Just remember that locking takes place and you need to have some maintenance time to perform this action.
There are simpler ways to achieve this, but they are DBMS specific.
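A hedged sketch of those steps in SQL Server syntax (adapt the DDL to your DBMS); like the query in the question, it keeps only Node, Date_Time and the maximum Price:
SELECT Node, Date_Time, MAX(Price) AS Price
INTO Hourly_Data_New                                   -- steps 1-2: create and fill the new table
FROM Hourly_Data
GROUP BY Node, Date_Time;

DROP TABLE Hourly_Data;                                -- step 3: drop the old table

EXEC sp_rename 'Hourly_Data_New', 'Hourly_Data';       -- step 4: rename the new table to the old name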
Here is an easy SQL Server method that creates a row number within a CTE and deletes from it. I believe this method also works for most RDBMSs that support window functions and common table expressions.
;WITH cte AS (
SELECT
*
,RowNum = ROW_NUMBER() OVER (PARTITION BY Node, Date_Time ORDER BY Price DESC)
FROM
Hourly_Data
)
DELETE
FROM
cte
WHERE
RowNum > 1

Delete using CTE slower than using temp table in Postgres

I'm wondering if somebody can explain why this runs so much longer using CTEs rather than temp tables... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).
This is Postgres 9.5.
The CTE version is this:
with targets as
(
select
id,
row_number() over(partition by uuid order by created_date desc) as rn
from
customer
)
delete from
customer
where
id in
(
select
id
from
targets
where
rn > 1
);
I killed that version this morning after running for over an hour.
The temp table version is this:
create temp table
targets
as select
id,
row_number() over(partition by uuid order by created_date desc) as rn
from
customer;
delete from
customer
where
id in
(
select
id
from
targets
where
rn > 1
);
This version finishes in about 7 seconds.
Any idea what may be causing this?
The CTE is slower because it has to be executed unaltered (via a CTE scan).
TFM (section 7.8.2) states:
Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output.
Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output.
It is thus an optimisation barrier; for the optimiser, dismantling the CTE is not allowed, even if it would result in a smarter plan with the same results.
The CTE solution can be refactored into a joined subquery, though (similar to the temp table in the question). In Postgres, a joined subquery is usually faster than the EXISTS() variant nowadays.
DELETE FROM customer del
USING ( SELECT id
, row_number() over(partition by uuid order by created_date desc)
as rn
FROM customer
) sub
WHERE sub.id = del.id
AND sub.rn > 1
;
Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.
CREATE TEMP VIEW targets
AS SELECT id
, row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;
EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
FROM targets
WHERE rn > 1
);
[UPDATED: I was wrong that CTEs always need to be executed to completion; that is only the case for data-modifying CTEs.]
Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTE, but it is likely in memory, so if your server is memory starved and the resultset of your CTE is very large then you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
An alternative way to doing that delete which might be faster than both of your methods:
DELETE FROM customer c          -- keeps only the newest row per uuid
WHERE
    EXISTS (SELECT 1 FROM customer c2 WHERE c2.uuid = c.uuid AND c2.created_date > c.created_date);
That won't handle situations where there are exact ties on created_date, but that can be solved by adding the id to the subquery as well, as in the sketch below.
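A hedged sketch of that tie-break, reusing the Postgres form of the delete above; it keeps the row with the highest id among rows that share uuid and created_date:
DELETE FROM customer c
WHERE EXISTS (SELECT 1
              FROM customer c2
              WHERE c2.uuid = c.uuid
                AND (c2.created_date > c.created_date
                     OR (c2.created_date = c.created_date AND c2.id > c.id)));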