USQL Nesting TVFs and Queries is giving horrendous results - azure-data-lake

I 'think' this problem relates to the query optimization that Azure Data Lake Analytics does; but let's see...
I have 2 separate queries (TVFs) doing aggregations, and then a final Query to join the 2 together for final results.
So ...
Table > Header Query
Table > Detail Query
Result = Header Query + Detail Query
To test the whole logic out, I run the minor queries separately with a filter, storing the results to file, and then use the hard files as sources for the final query; these are the total durations (minutes).
Header Query 1.4 (408 rows)
Detail Query 0.9 (3298 rows)
Final Query 0.9 (408 rows)
So I know that as a maximum, I can get my result in around 3.5 minutes.
However, I don't really want to create new intermediary files.
I want to use the TVFs directly to feed the final query.
With the TVFs in the final query, the Job Graph gets to around 97% progress within about 1.5 minutes.
But then, all hell breaks loose!
The last node is an Aggregate with 2,500 vertices that reports 16 minutes of compute time.
So my question... WHY??
Is this a case of me not understanding some fundamental concepts of how Azure works?
So, can anyone explain what's going on?
Any help appreciated.
Final Query:
#Header =
    SELECT [CTNNumber],
           [CTNCycleNo],
           [SeqStart],
           [SeqEnd],
           [StartUTC],
           [EndUTC],
           [StartLoc],
           [StartType],
           [EndLoc],
           [EndType],
           [Start Step],
           [Start Ctn Status],
           [Start Fill Status],
           [EndStep],
           [End Ctn Status],
           [End Fill Status]
    FROM [Play].[getCycles3]("") AS X;

#Detail =
    SELECT [CTNNumber],
           [SeqNo],
           [LocationType],
           [LocationID],
           [BizstepDescription],
           [ContainerStatus],
           [FillStatus],
           [UTCTimeStampforEvent]
    FROM [Play].[getRaw]("") AS Z;

#result =
    SELECT H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd],
           COUNT([D].[SeqNo]) AS [SeqCount]
           //, COUNT(DISTINCT [LocationID]) AS [#Locations]
    FROM #Header AS [H]
    INNER JOIN #Detail AS [D]
            ON [H].[CTNNumber] == [D].[CTNNumber]
    WHERE [D].[SeqNo] >= [H].[SeqStart]
      AND [D].[SeqNo] <= [H].[SeqEnd]
    GROUP BY H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd];

My apologies for the late reply. I was travelling and this just slipped off the radar.
The problem is most likely that the optimizer got a wrong estimate and decided to over-partition. If you read the data from a file, the optimizer has more precise information available at compile time.
Can you add an OPTION(ROWCOUNT=x) hint to the queries that are generating the input for the join to get a better plan?
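Something like the following, perhaps (a sketch only, with elided column lists; the row counts come from the intermediate-file test runs above, and I'm assuming the hint attaches at the end of each rowset assignment, T-SQL style):
// Hedged sketch: hint each intermediate rowset with the cardinality
// observed when the queries were run separately.
#Header =
    SELECT [CTNNumber], [CTNCycleNo], [SeqStart], [SeqEnd] // ... remaining columns as above
    FROM [Play].[getCycles3]("") AS X
    OPTION(ROWCOUNT=408);

#Detail =
    SELECT [CTNNumber], [SeqNo] // ... remaining columns as above
    FROM [Play].[getRaw]("") AS Z
    OPTION(ROWCOUNT=3298);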

So I entered this one as a ticket with Microsoft.
Here is their response, which I implemented and was successful.
From: ########microsoft.com
Subject: RE: ########### USQL Job displays bizarre job plan and runtime Back
So the issue here is cardinality. When the script gets broken into parts the U-SQL compiler has a new input to work with at each intermediate step, and it knows the actual data size and distribution of that new input, so the optimizer is able to accurately allocate resources.
However, when the script is put all into one, the optimizer’s estimation is very different, because it doesn’t know what the exact size and distribution of those intermediate steps is going to be. So we expect there to be some difference in estimations between a script that has been broken into parts and a script that is all in one.
Your best defense against optimization differences like these is to CREATE STATISTICS on table columns. In this case you should CREATE STATISTICS on the CTNNumber column and see if that improves performance on the single job script.
Here is some documentation surrounding CREATE STATISTICS:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-statistics-transact-sql
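A hedged sketch of such a statement, assuming U-SQL's CREATE STATISTICS mirrors the T-SQL form, with [Play].[SourceTable] as a hypothetical stand-in for whichever table the TVFs actually read:
// [Play].[SourceTable] is a placeholder; substitute the table behind the TVFs.
CREATE STATISTICS [stat_CTNNumber]
ON [Play].[SourceTable] ([CTNNumber])
WITH FULLSCAN;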

Related

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, as I have mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries on his GitHub repo, but am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the Min/Avg/Max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries that I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used in the report, you can check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension                       Value
origin                          "https://example.com"
effective_connection_type.name  4G
form_factor.name                "phone"
first_paint.histogram.start     1000
first_paint.histogram.end       1200
first_paint.histogram.density   0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a "first paint time" measurement in the range of 1000-1200 milliseconds when loading "http://example.com" on a "phone" device over a "4G"-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram's "end" value is less than or equal to 1200.
For the metrics, in the initial link there is a section called Methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source tables, per country and per site, rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this example relevant:
SELECT SUM(bin.density) AS density
FROM `chrome-ux-report.chrome_ux_report.201710`,
     UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE bin.start < 1000
  AND origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
About query cost estimation, see the estimating query costs guide in the official BigQuery documentation. But these tables, by their nature, consume a lot of processing, so filter as much as possible.
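As a hedged sketch along those lines, reading a single country table instead of the summary (the table name, the availability of the largest_contentful_paint field in that month, and the origin are all assumptions):
-- Share of page loads with LCP under 2500 ms for one origin,
-- from one country table rather than a cross-country scan.
SELECT SUM(bin.density) AS fast_lcp_density
FROM `chrome-ux-report.country_us.202001`,
     UNNEST(largest_contentful_paint.histogram.bin) AS bin
WHERE bin.start < 2500
  AND origin = 'https://example.com';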

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: one Meterpoint is a submeter of another Meterpoint, and I'd be interested in the delta of those two Meterpoints, to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution using a subquery, but it does not appear to be very efficient.
SELECT A.dat_Time,
       (A.int_Counts - (SELECT B.int_Counts
                        FROM tbl_Metering AS B
                        WHERE B.fk_MetPoint = 2
                          AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS Delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END)
                 OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row; a GROUP BY would do, but this is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However, if this is a metering project, then we should expect a high volume of records, if not now, then certainly over time.
I would recommend you look into altering the input so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably occurs only once for each record, whereas your select will be executed many times.
This can be performed using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading, to greatly simplify many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario, we were deliberately batching multiple sequential readings into a single record.
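As a hedged illustration of storing the delta at write time, here is an AFTER INSERT trigger sketch in SQL Server syntax (simpler to sketch than the INSTEAD OF variant mentioned above); the added Delta column and the meter-point IDs 1 and 2 are assumptions:
-- Assumes tbl_Metering has been extended with a Delta column, that
-- fk_MetPoint 1 is the main meter and 2 the submeter, and that both
-- readings for a timestamp arrive in the same insert statement.
CREATE TRIGGER trg_Metering_ComputeDelta
ON tbl_Metering
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE m
    SET m.Delta = m.int_Counts - s.int_Counts
    FROM tbl_Metering AS m
    JOIN inserted AS i ON i.ID = m.ID
    JOIN tbl_Metering AS s ON s.dat_Time = m.dat_Time
                          AND s.fk_MetPoint = 2
    WHERE m.fk_MetPoint = 1;
END;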
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.

ST_INTERSECTS issue BigQuery Waze

I'm using the following code to query a dataset based on a polygon:
SELECT *
FROM `waze-public-dataset.partner_name.view_jams_clustered`
WHERE ST_INTERSECTS(geo, ST_GEOGFROMTEXT("POLYGON((-99.54913355822276 27.60526592074579,-99.52673174853038 27.60526592074579,-99.52673174853038 27.590813604291416,-99.54913355822276 27.590813604291416,-99.54913355822276 27.60526592074579))")) IS TRUE
The validation message says that "This query will process 1 TB when run".
It seems like there's no problem. However, when I remove the WHERE ST_INTERSECTS clause, the validation message says exactly the same thing: "This query will process 1 TB when run", the same 1 TB, so I'm guessing that the ST_INTERSECTS function is not working.
When you actually run this query, the amount charged is usually much less, as expected for a spatially clustered table. I've run a select count(*) ... query against one partner dataset, and while the editor UI announced 9 TB before running the query, the query reported around 150 MB processed after running.
The savings come from the clustered table, but the specific clusters that intersect the polygon used in the filter depend on the actual data in the table and on how the clusters were created. The clusters, and thus the cost of the query, can only be determined when the query runs. The editor UI in this case shows the maximum possible cost of the query.
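If you want to verify the actual bytes processed after a job runs, one option (assuming your jobs run in the US region) is to query the INFORMATION_SCHEMA jobs view:
-- Actual bytes processed/billed for recent jobs in this project.
SELECT job_id, creation_time, total_bytes_processed, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;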

How to get a count of the number of times a sql statement has executed in X hours?

I'm using an Oracle DB. I want to be able to count the number of times that a SQL statement was executed in X hours. For instance, how many times has the statement Select * From ExampleTable been executed in the past 5 hours?
I tried looking in V$SQL, V$SQLSTATS, and V$SQLAREA, but they only keep a record of a statement's total number of executions. They don't store the times at which the individual executions occurred. Is there any view I missed, or something else that keeps track of each individual statement execution plus a timestamp, so that I can query for those that occurred X hours ago? Thanks for the help.
The views in the Automatic Workload Repository (AWR) store historical SQL execution information, specifically the view DBA_HIST_SQLSTAT.
The view is not perfect; it contains a summary of the top SQL statements. This is almost perfect information for performance tuning - in practice, sampling will catch any performance problem. But if you're looking for a perfect record of every SQL execution, as far as I know the only way to get that information is through tracing, which is buggy and slow.
Hopefully this query is good enough:
select begin_interval_time, end_interval_time, executions_delta, dba_hist_sqlstat.*
from dba_hist_sqlstat
join dba_hist_snapshot
on dba_hist_sqlstat.snap_id = dba_hist_snapshot.snap_id
and dba_hist_sqlstat.instance_number = dba_hist_snapshot.instance_number
order by begin_interval_time desc, sql_id;
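For the original 5-hour question, a hedged refinement of the above, summing the per-snapshot execution deltas for one statement ('your_sql_id' is a placeholder you would first look up, e.g. in V$SQL):
-- Total executions of one statement across snapshots in the past 5 hours.
select sum(s.executions_delta) as executions_last_5_hours
from dba_hist_sqlstat s
join dba_hist_snapshot n
  on s.snap_id = n.snap_id
 and s.instance_number = n.instance_number
where n.begin_interval_time >= systimestamp - interval '5' hour
  and s.sql_id = 'your_sql_id';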
Apologies for putting this in an answer instead of a comment (I don't have the required reputation), but I think you may be out of luck. Here is an AskTOM thread asking basically the same question: AskTOM. Tom says that unless you are using ASH, that just isn't something the database is designed to do.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can share the concrete data and SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-create privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing. Thanks to the sampling, your query results should be roughly representative of what you will get. Note that the number you're modding with has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
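For instance, in MySQL a hedged alternative to the mod trick is RAND(); note the sample differs on every run unless you materialize it, as below:
-- Roughly 0.1% random sample, materialized so repeated tests see the same rows.
create table sample_table as
select unique_id, data_value
from X
where rand() < 0.001;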
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs the whole result set before producing output, as can happen for queries with GROUP BY, ORDER BY, or HAVING clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by setting "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface, you have to use an unbuffered version of the function that executes the query.