Create a view that does partition filtering in Synapse OPENROWSET - azure-synapse

I have a view that is defined as:
create view [dbo].[darts] as
SELECT pricedate, [hour], node_id, dalmp,
       cast(result.filepath(1) as int) as [Year],
       cast(result.filepath(2) as int) as [Month]
FROM OPENROWSET(
    BULK 'year=*/month=*/*.parquet',
    DATA_SOURCE = 'mysource',
    FORMAT = 'parquet'
)
with (
    pricedate date,
    node_id bigint,
    [hour] int,
    dalmp float
) as [result]
where cast(result.filepath(1) as int) = datepart(year, pricedate)
  and cast(result.filepath(2) as int) = datepart(month, pricedate)
What I want is to run a query against this view like:
select * from darts where pricedate='2022-11-01'
and have the where clause of the view definition force it to look only in year=2022/month=11, but that doesn't happen unless I state the partition columns explicitly, such as:
select * from darts where pricedate='2022-11-01' and Year=2022 and Month=11
For clarity, when I say the first query doesn't work, I mean that it does no partition pruning: it scans all files/data, whereas the second query scans only the fraction I expect.
Are there any extra modifiers, syntax, or functional form I could use in my view definition that would force partition pruning in the case of my first query?
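One thing that may help: the view's WHERE clause ties filepath() to pricedate per row, which the engine can only evaluate after reading each file, so it gives the optimizer nothing to match against the folder names up front. A sketch of an alternative, assuming inline table-valued functions are available in your serverless pool (dartsForDate is a hypothetical name, and this is not a confirmed fix):
create function dbo.dartsForDate (@pricedate date)
returns table
as
return
(
    select pricedate, [hour], node_id, dalmp
    from openrowset(
        bulk 'year=*/month=*/*.parquet',
        data_source = 'mysource',
        format = 'parquet'
    )
    with (
        pricedate date,
        node_id bigint,
        [hour] int,
        dalmp float
    ) as [result]
    -- the filepath() predicates compare against the parameter, so the
    -- year/month folders to read are determined before files are opened
    where cast(result.filepath(1) as int) = datepart(year, @pricedate)
      and cast(result.filepath(2) as int) = datepart(month, @pricedate)
);
Usage would then be select * from dbo.dartsForDate('2022-11-01'); instead of filtering the view.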

Related

Compare 2 versions of a table in BigQuery

I want to compare two versions of a table: the data as it was just before the last modification against the latest data. Here is a sample SQL script that compares the two:
WITH
before_mod AS (
  SELECT *
  FROM `big-query-112.temp.tableB`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB({{ lastModification }}, INTERVAL 2 SECOND)
),
after_mod AS (
  SELECT * FROM `big-query-112.temp.tableB`
),
row_changed AS (
  SELECT * FROM before_mod
  EXCEPT DISTINCT
  SELECT * FROM after_mod
)
SELECT * FROM row_changed
This SQL first creates two CTEs:
before_mod -> holds a snapshot of the table as it was at that specific point in time.
after_mod -> the actual data in tableB.
Then the "row_changed" table is created by selecting all rows from "before_mod" that are not in "after_mod".
The problem is that BigQuery does not allow using different timestamps with FOR SYSTEM_TIME AS OF.
Exception: If a 'FOR SYSTEM_TIME AS OF' expression is used, all references of a table should use the same TIMESTAMP value.
I also tried putting before_mod in a view and then querying the view (SQL below):
CREATE OR REPLACE VIEW `big-query-112.temp.tableB_before_mod_temp` AS (
  SELECT *
  FROM `big-query-112.temp.tableB`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB('2023-02-04 13:12:35 UTC', INTERVAL 0 SECOND)
);

WITH
before_mod AS (
  SELECT * FROM `big-query-112.temp.tableB_before_mod_temp`
),
after_mod AS (
  SELECT * FROM `big-query-112.temp.tableB`
),
row_changed AS (
  SELECT * FROM before_mod
  EXCEPT DISTINCT
  SELECT * FROM after_mod
)
SELECT * FROM row_changed
The problem with this one is that it does not show the rows that are different; it seems the view reads the table only as of that specific time.
Also, I cannot use a materialized view. Exception: Invalid value: Materialized view query cannot reference historical versions of the table definition
Is there a way to compare two versions of the table without creating a copy?
NOTE: The table does not have an ID (the way the table is generated makes it hard to add an id that is always the same for a specific row).
Also, querying SELECT * FROM `big-query-112.temp.tableB_before_mod_temp` on its own shows the expected results.
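One workaround that may fit, assuming BigQuery scripting and temporary tables are available to you: materialize the snapshot into a temp table in one statement, then compare against the live table in a second statement. Each statement then references tableB with at most one timestamp, so the same-TIMESTAMP restriction no longer applies. A sketch, keeping the {{ lastModification }} template value from above:
-- statement 1: freeze the snapshot taken just before the last modification
CREATE TEMP TABLE before_mod AS
SELECT *
FROM `big-query-112.temp.tableB`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB({{ lastModification }}, INTERVAL 2 SECOND);

-- statement 2: rows that existed before the modification but not after
SELECT * FROM before_mod
EXCEPT DISTINCT
SELECT * FROM `big-query-112.temp.tableB`;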

Teradata: values of columns interchange on PIVOT operation

I have a table (DB.TAB_UNPIVOTED) in Teradata with millions of rows and 4 columns: ID, Week, Sales & Profits. Below is a small microcosm of the problem.
For the sake of illustration, DB.TAB_UNPIVOTED looks like this:
I want to PIVOT this table. Following is my code:
SET SQL_STMT = ('CREATE TABLE DB.TAB_PIVOTED AS
( SELECT * FROM DB.TAB_UNPIVOTED
  PIVOT
  ( SUM(Sales) AS Sales, SUM(Profits) AS Profits
    FOR KW_Prefix IN (
      ''CW_1'' AS CW_1,
      ''CW_2'' AS CW_2,
      ''CW_3'' AS CW_3
    )
  ) AS dt
) WITH DATA;') ;
EXECUTE IMMEDIATE SQL_STMT;
The strange thing is that, on pivoting, every time I get a different table DB.TAB_PIVOTED, with values interchanged across columns. For example, one run's output can be:
But the next time it could be a different result, with values interchanged, though Profits & Sales keep their pairing. In one output they could be under CW_1, whereas on another run they could be under CW_3:
This problem may not reproduce on small data, but with my data in the millions and CW varying from 1 to 52, I see it all the time.
Does anyone have an idea where in the pivoting I am making a mistake?
Inputs would be much appreciated.
Update:
I tried running the code directly as well, instead of via EXECUTE IMMEDIATE SQL_STMT, but got the same strange results.
CREATE TABLE DB.TAB_PIVOTED AS
( SELECT * FROM DB.TAB_UNPIVOTED
  PIVOT
  ( SUM(Sales) AS Sales, SUM(Profits) AS Profits
    FOR KW_Prefix IN (
      'CW_1' AS CW_1,
      'CW_2' AS CW_2,
      'CW_3' AS CW_3
    )
  ) AS dt
) WITH DATA;
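Until the root cause is found, a manual pivot with CASE expressions pins every value to an explicitly named column, so the column/value pairing cannot drift between runs. A sketch, assuming KW_Prefix holds the literal week labels and ID is the grouping key:
CREATE TABLE DB.TAB_PIVOTED AS
( SELECT ID,
    -- each target column names its week label explicitly
    SUM(CASE WHEN KW_Prefix = 'CW_1' THEN Sales   END) AS CW_1_Sales,
    SUM(CASE WHEN KW_Prefix = 'CW_1' THEN Profits END) AS CW_1_Profits,
    SUM(CASE WHEN KW_Prefix = 'CW_2' THEN Sales   END) AS CW_2_Sales,
    SUM(CASE WHEN KW_Prefix = 'CW_2' THEN Profits END) AS CW_2_Profits,
    SUM(CASE WHEN KW_Prefix = 'CW_3' THEN Sales   END) AS CW_3_Sales,
    SUM(CASE WHEN KW_Prefix = 'CW_3' THEN Profits END) AS CW_3_Profits
  FROM DB.TAB_UNPIVOTED
  GROUP BY ID
) WITH DATA;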

BigQuery - Create a table from results of a query that uses complex CTEs?

I have a multi-CTE query over large underlying datasets that is run too frequently. I could just create a table of the results of that query for people to use instead, and refresh it daily. But I'm lost on the syntax to create such a table.
CREATE OR REPLACE TABLE dataset.target_table
AS
with cte_one as (
select
stuff
from big.table
),
...
cte_five as (
select
stuff
from other_big.table
),
final as (
select *
from cte_five left join cte_x on cte_five.id = cte_x.id
)
SELECT
*
FROM final
That is basically what I have. It actually even creates the target table with the right schema, but doesn't insert any rows... Any hints? Thanks
If you really want to do this in one step, you can just do SELECT INTO...
with cte_one as (
select
stuff
from big.table
),
...
cte_five as (
select
stuff
from other_big.table
),
final as (
select *
from cte_five left join cte_x on cte_five.id = cte_x.id
)
SELECT
*
INTO dataset.target_table
FROM final
That said, since this isn't just a one-off need, I recommend creating the landing table once initially and then scheduling a daily flush and fill (TRUNCATE + INSERT) to refresh the data. That gives you more explicit control over the data types and also lets you work with a persistent object rather than something rebuilt from scratch daily.
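A minimal sketch of that flush-and-fill pattern, with the CTE chain collapsed into a single placeholder CTE (names are illustrative):
-- scheduled daily refresh: empty the landing table, then reload it
TRUNCATE TABLE dataset.target_table;

INSERT INTO dataset.target_table
WITH final AS (
  SELECT stuff
  FROM big.table
)
SELECT * FROM final;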

Can't use Date/DateTime as arg in argMinMerge/argMaxMerge?

In one of the Altinity webinars they give an example of using the argMin/argMax aggregate functions to find the first/last value in a table. They use the following table:
CREATE TABLE cpu_last_point_idle_agg (
created_date AggregateFunction(argMax, Date, DateTime),
...
)
Engine = AggregatingMergeTree
Then they create the materialized view:
CREATE MATERIALIZED VIEW cpu_last_point_idle_mw
TO cpu_last_point_idle_agg
AS SELECT
argMaxState(created_date, created_at) AS created_date,
...
And finally the view:
CREATE VIEW cpu_last_point_idle AS
SELECT
argMaxMerge(created_date) AS created_date,
...
However, when I try to replicate this approach, I get an error.
My table:
CREATE TABLE candles.ESM20_mthly_data (
ts DateTime Codec(Delta, LZ4),
open AggregateFunction(argMin, DateTime, Float64),
...
)
Engine = AggregatingMergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY ts
PRIMARY KEY(ts);
My Materialized View:
CREATE MATERIALIZED VIEW candles.ESM20_mthly_mw
TO candles.ESM20_mthly_data
AS SELECT
ts,
argMinState(ts, src.price) AS open,
...
FROM source_table as src
GROUP BY toStartOfInterval(src.ts, INTERVAL 1 month) as ts;
My view:
CREATE VIEW candles.ESM20_mthly
AS SELECT
ts,
argMinMerge(ts) as open,
...
FROM candles.ESM20_mthly_mw
GROUP BY ts;
I get an error:
Code: 43. DB::Exception: Received from localhost:9000. DB::Exception: Illegal type DateTime('UTC') of argument for aggregate function with Merge suffix must be AggregateFunction(...).
I tried using both Date and DateTime, with the same result. If I flip the arg and value it works, but of course it doesn't give me what I want. Are dates no longer supported by these aggregate functions? How do I make this work?
I am using ClickHouse server version 20.12.3 revision 54442.
First of all, argMin(a, b) takes a when b is min, so the value and the ordering argument are reversed in both the type declaration and the State call:
--AggregateFunction(argMin, DateTime, Float64),
++AggregateFunction(argMin, Float64, DateTime),
--argMinState(ts, src.price) AS open,
++argMinState(src.price, ts) AS open,
The second issue is that the Merge suffix must be applied to the AggregateFunction column (open), not to the plain DateTime column ts, which is exactly what the "Illegal type DateTime" error is complaining about:
--argMinMerge(ts) as open,
++argMinMerge(open) as final_open,
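Putting both fixes together, the corrected definitions would look roughly like this (a sketch trimmed to the open column; the columns elided above are omitted):
CREATE TABLE candles.ESM20_mthly_data (
    ts DateTime Codec(Delta, LZ4),
    -- value type first, ordering type second
    open AggregateFunction(argMin, Float64, DateTime)
)
Engine = AggregatingMergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY ts;

CREATE MATERIALIZED VIEW candles.ESM20_mthly_mw
TO candles.ESM20_mthly_data
AS SELECT
    toStartOfInterval(src.ts, INTERVAL 1 month) AS ts,
    -- take price at the earliest ts within the month
    argMinState(src.price, src.ts) AS open
FROM source_table AS src
GROUP BY ts;

CREATE VIEW candles.ESM20_mthly
AS SELECT
    ts,
    -- merge the stored states, not the plain ts column
    argMinMerge(open) AS open
FROM candles.ESM20_mthly_data
GROUP BY ts;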

Setting a variable in a SQL WHERE clause to be used in SELECT

I'm using Transact-SQL with Microsoft SQL Server, and we have a query that looks like this:
SELECT Cast(Cast(Cast(XMLBlob as XML).query(N'//continent/forest/tree/age/text()') as nvarchar) as bigint),
       AnotherField
FROM [MyDB].[dbo].[mytable]
WHERE Cast(Cast(Cast(XMLBlob as XML).query(N'//continent/forest/tree/age/text()') as nvarchar) as bigint)
      between 10 and 100
The XML cast is an expensive operation, and since it's used in both the WHERE and SELECT, it seems like I should be able to save it away as a variable in the WHERE (which, by order of operations, is evaluated before the SELECT), and use it in the SELECT instead of having to cast again. Is this possible?
You could use an inner query where you retrieve the XML value. Then outside the inner query you both return that bigint value and filter the values you want:
SELECT innerTable.Age, innerTable.AnotherField
FROM (
    SELECT Cast(Cast(Cast(XMLBlob as XML).query(N'//continent/forest/tree/age/text()') as nvarchar) as bigint) AS Age,
           AnotherField
    FROM [MyDB].[dbo].[mytable]
) AS innerTable
WHERE innerTable.Age between 10 and 100
By the way... why do you need a bigint to store Age? If you are storing years, it looks like overkill, even for those trees that live thousands of years :)
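An equivalent sketch using CROSS APPLY, which names the computed value without nesting the whole query; as with the derived table, the optimizer may still choose to inline the expression rather than compute it exactly once:
SELECT x.Age, t.AnotherField
FROM [MyDB].[dbo].[mytable] AS t
CROSS APPLY (
    -- name the expensive expression once so both SELECT and WHERE reuse it
    SELECT Cast(Cast(Cast(t.XMLBlob as XML).query(N'//continent/forest/tree/age/text()') as nvarchar) as bigint) AS Age
) AS x
WHERE x.Age between 10 and 100;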