Flink SQL interval join not triggering

I have a simple interval join between two unbounded streams. This works with small workloads, but with a larger (production) workload it no longer works. From observing the output I can see that the Flink SQL job triggers/emits records only once the entire topic has been scanned (and consequently read into memory?), but I would want the job to emit the record as soon as a single match is found, since in my production environment the job cannot withstand reading the entire table into memory.
The interval join which I'm making is very similar to the examples provided here: https://github.com/ververica/flink-sql-cookbook/blob/main/joins/02_interval_joins/02_interval_joins.md
SELECT
  o.id AS order_id,
  o.order_time,
  s.shipment_time,
  TIMESTAMPDIFF(DAY, o.order_time, s.shipment_time) AS day_diff
FROM orders o
JOIN shipments s ON o.id = s.order_id
WHERE
  o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time;
Except my time interval is as small as possible (a couple of seconds). I also have a watermark of 5 seconds on the Flink SQL source tables.
How can I instruct Flink to emit/trigger the records as soon as it has made a single match in the join? Currently the job tries to scan the entire table before emitting any records, which is not feasible with my data volumes. From my understanding it should only need to scan up to the interval (time window) and check that, and once the interval has passed the record should be emitted/triggered.
Also, from observing the cluster I can see that the watermark is moving, but no records are being emitted.
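For reference, a minimal sketch of how such a 5-second watermark would be declared on a source table (the datagen connector here is only a stand-in for the real source connector; names follow the query above):
CREATE TABLE orders (
  id STRING,
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'
);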

Maybe some data was discarded; you can check whether your event times are reasonable. In this scenario, you can try using a regular join and setting a 3-day state TTL (table.exec.state.ttl = 3 days), which can ensure output for every joined record.
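A minimal sketch of that suggestion, reusing the table and column names from the question (the exact TTL value syntax may vary by Flink version):
SET 'table.exec.state.ttl' = '3 d';

SELECT
  o.id AS order_id,
  o.order_time,
  s.shipment_time,
  TIMESTAMPDIFF(DAY, o.order_time, s.shipment_time) AS day_diff
FROM orders o
JOIN shipments s ON o.id = s.order_id;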


Big Query External table - Query performance degrades with increased number of files in the Source URI

I have an external BigQuery table created to read "Parquet" files from a GCS bucket.
The folder layout in the GCS bucket is as follows:
gs://mybucket/root/year=2022/model=abc/
gs://mybucket/root/year=2022/model=.../
gs://mybucket/root/year=2021/model=abc/
gs://mybucket/root/year=2021/model=.../
The layout is organized in such a way that it follows the Hive partitioning layout, as explained in the BigQuery documentation. The columns "year" and "model" are seen as partition columns in the external table.
**External Data Configuration**
Source URI(s)- gs://mybucket/root/*
Source format - PARQUET
Hive Partitioning Mode - CUSTOM
Hive Partitioning Source URI Prefix - gs://mybucket/root/{year:INTEGER}/{model:STRING}
Hive Partitioning Column(s)- year, model
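For reference, roughly the same configuration expressed as a DDL sketch (project and dataset names are placeholders, and the exact options may differ slightly from my actual setup):
CREATE EXTERNAL TABLE `myproject.mydataset.myTable`
WITH PARTITION COLUMNS (
  year INT64,
  model STRING
)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://mybucket/root/*'],
  hive_partition_uri_prefix = 'gs://mybucket/root/'
);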
Problem: When I run queries on the external table as given below, I have observed that every query runs for an initial 2-3 minutes before the actual run happens. The BigQuery console shows "Query pending" during this time, and as soon as it turns to "Query running" the output gets displayed with minimal slot time consumption (slot time shows as 1-2 seconds).
SELECT * FROM myTable WHERE year = 2022 AND model = 'abc'
The underlying file count varies and increases for every year and model. For years with more Parquet files, the initial time is sometimes around 4-5 minutes.
My understanding from the documentation is that if the partition columns are present in the query, some sort of partition pruning happens, and I would expect the query to be responsive immediately:
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#partition_pruning
But my observations are contrary to this. If the source URI is restricted to one year, so the table reads data from only that year, the initial query time (where it remains "Query pending" in the console) is reduced to 1-2 minutes (or even less):
Source URI(s)- gs://mybucket/root/year=2022/*
Question: Is this the expected behavior? As the volume of files in the GCS bucket increases, the query takes even longer to run (especially the initial time; the actual run time doesn't change much), even though the WHERE clause has the year and model partition columns applied.
This is likely expected behavior. Before partition pruning can happen, the objects in GCS need to be listed, which is likely where the time is being taken. We are working on improvements in this area. The fact that slot time is so low is a good indicator that pruning is in fact happening (since most files are not being read, there isn't a lot of slot time to consume).

Azure Stream Analytics takes too long to process the events

I'm trying to configure an Azure Stream Analytics job, but I'm consistently getting bad performance. I receive data from a client system that pushes it into an Event Hub, and the ASA job queries that data into an Azure SQL database.
A few days ago I noticed that it was generating a large number of InputEventLateBeyondThreshold errors. Here is an example event out of ASA. The Timestamp element is set by the client system.
{
  "Tag": "MANG_POWER_5",
  "Value": 1.08411181,
  "ValueType": "Analogue",
  "Timestamp": "2022-02-01T09:00:00.0000000Z",
  "EventProcessedUtcTime": "2022-02-01T09:36:05.1482308Z",
  "PartitionId": 0,
  "EventEnqueuedUtcTime": "2022-02-01T09:00:00.8980000Z"
}
You can see that the event arrives pretty quickly, but it takes more than 30 minutes to process it. To try and avoid InputEventLateBeyondThreshold errors, I have increased the late event threshold. This may be contributing to the increased processing time, but having it too low also increases the number of InputEventLateBeyondThreshold errors.
The Watermark Delay is consistently high, and yet SU usage is around 5%. I have increased the SU to as high as I can for this query.
I'm trying to figure out, why it takes so long to process the events once they have arrived.
This is the query I'm using:
WITH PIDataSet AS (SELECT * FROM [<event-hub>] TIMESTAMP BY timestamp)
--Write data to SQL joining with a lookup
SELECT
    i.Timestamp as timestamp,
    i.Value as value
INTO [<sql-database>]
FROM PIDataSet as i
INNER JOIN [tagmapping-ref-alias] tm ON tm.sourcename = i.Tag
--Write data to AzureTable joining with a lookup
SELECT
    DATEDIFF(second, CAST('1970-01-01' as DateTime), I1.Timestamp) As Rowkey,
    I2.TagId as PartitionKey,
    I1.Value as Value,
    UDF.formatTime(I1.Timestamp) as DeviceTimeStamp
INTO [<azure-table>]
FROM PIDataSet as I1
JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag
--Get an hourly count into a SQL Table.
SELECT
    I2.TagId,
    System.Timestamp() as WindowEndTime,
    COUNT(I2.TagId) AS InputCount
INTO [tagmeta-ref-alias]
FROM PIDataSet as I1
JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag
GROUP BY I2.TagId, TumblingWindow(Duration(hour, 1))
When you set up a 59-minute out-of-order window, you set up a 59-minute buffer for that input. When records land in that buffer, they wait 59 minutes until they get out. What you get in exchange is the opportunity to re-order those events so they look in order to the job.
Using it at 1 hour is an extreme setting that, by definition, automatically gives you 59 minutes of watermark delay. This is very surprising, and I'm wondering why you need a value that high.
Edit
Now looking at the late arrival policy.
You are using event time (TIMESTAMP BY timestamp), which means that your events can now be late; see this doc and this one.
What this means is that when a record arrives more than 1 hour late (so its timestamp is more than 1 hour older than the wall clock on our servers, in UTC), we adjust its timestamp to our wall clock minus 1 hour and send it to the query. It also means that your tumbling window always has to wait an additional hour to be sure it's not missing those late records.
Here is what I would do: restore the default settings (no out-of-order window, 5 seconds for late events, adjust events). Then, when you get InputEventLateBeyondThreshold, it means that the job received a timestamp that was more than 5 seconds in the past. You're not losing the data; we are adjusting its System.Timestamp to a more recent value (but not the Timestamp field, we don't change it).
What we then need to understand is why it takes more than 5 seconds for a record in your pipeline to go from production to consumption. Is it because you have big delays in your ingestion pipeline, or because you have a time skew on your producer clock? Do you know?
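One way to check, as a rough sketch (the diagnostic output alias is hypothetical), is to write the gap between the producer timestamp and the Event Hub enqueue time to a separate output, without TIMESTAMP BY so the late-arrival policy does not rewrite anything:
SELECT
    i.Tag,
    i.Timestamp AS ProducerTime,
    i.EventEnqueuedUtcTime AS EnqueuedTime,
    DATEDIFF(second, CAST(i.Timestamp AS datetime), i.EventEnqueuedUtcTime) AS IngestionLagSeconds
INTO [<diagnostic-output>]
FROM [<event-hub>] AS i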

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: one Meterpoint is a submeter of another Meterpoint. I'd be interested in the delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution using a subquery, but it appears to be not very efficient.
SELECT
    A.dat_Time,
    (A.int_Counts - (SELECT B.int_Counts FROM tbl_Metering AS B WHERE B.fk_MetPoint = 2 AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) as delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) OVER (PARTITION BY dat_Time) as int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row (using a GROUP BY at the least), but it is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However, if this is a metering project, then we should expect a high volume of records, if not now, then certainly over time.
I would recommend you look into altering the input such that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably will only ever occur once for each record, whereas your select will be executed many times.
This can be performed using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading to greatly simplify many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
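If you go the trigger route, a rough sketch of what an INSTEAD OF INSERT trigger could look like, assuming an extra int_Delta column has been added to tbl_Metering and that ID is an identity column (everything beyond the question's column names is hypothetical):
CREATE TRIGGER trg_Metering_Insert
ON tbl_Metering
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint, int_Delta)
    SELECT
        i.dat_Time,
        i.int_Counts,
        i.fk_MetPoint,
        -- value delta against the previous sequential reading for the same meter point
        -- (NULL for the very first reading of a meter point)
        i.int_Counts - (
            SELECT TOP (1) m.int_Counts
            FROM tbl_Metering AS m
            WHERE m.fk_MetPoint = i.fk_MetPoint
              AND m.dat_Time < i.dat_Time
            ORDER BY m.dat_Time DESC
        )
    FROM inserted AS i;
END;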

How to limit BigQuery query size for testing a query sample through the web user-interface?

I would like to know if it is possible to limit the BigQuery query size when running a query through the web user interface.
My idea is just to test the query, but instead of querying all my tables, I would like to query only a part of them, for instance a limited number of rows.
LIMIT does not reduce my query cost, so the idea is to find a function similar to "row_number" or "fetch".
Sorry I'm a marketer and not a developer, so thank you in advance for your kind help.
How to limit BigQuery query size for testing ... ?
1 - Try to minimize the number of tables involved in your testing.
In your query there are 60+ tables involved, covering dates from 2016-12-11 up to today:
SELECT <fields_list> FROM
  TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
                   TIMESTAMP('20161211'),
                   TIMESTAMP('20170315'))
Instead you can use same day as start and end of time range, thus drastically reducing number of involved tables (down to just one table) and overall scan size. For example
SELECT <fields_list> FROM
  TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
                   TIMESTAMP('20161211'),
                   TIMESTAMP('20161211'))
2 - Minimize the number of rows. The ability to do so really depends on how your table is being loaded with data. If the table is loaded incrementally, you can use so-called table decorators.
Note: this technique only works within the last 7 days.
For example, the query below will scan only the data that was in the table one hour ago (a so-called snapshot decorator):
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000]
This works well with the most recent day's table, especially at the start of the day when the table is not yet big.
To limit further, you can use the version below (a so-called range decorator), which gives you the data added between one hour ago and half an hour ago:
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000--1800000]
Finally, #0 is a special case that references the oldest possible snapshot of the table: either 7 days in the past, or the table's creation time if the table is less than 7 days old. For example
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170210#0]
3 - Test against a sampled table. If you expect to experiment with your query again and again, you can first prepare a downsized version of your table with just as many rows as you need, applying sampling logic that fits your business logic. To limit the number of rows you can use a LIMIT clause; to get random rows you can use the RAND function, for example.
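As a rough illustration in the same legacy SQL dialect as the examples above (the 10% sampling rate and row cap are arbitrary, and the destination table is set under options as noted below):
SELECT <fields_list>
FROM [XXX:85801771.ga_sessions_20161211]
WHERE RAND() < 0.1
LIMIT 10000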
After the sampled table is prepared, run all your queries against it until you have the final version; after that, you can run it against your original table(s).
And by the way, to create the sampled table you need to set a destination table under options in the Web UI.

Find external process using database at a particular time and day every month

On the last Sunday of each of the last 2 months, at 9:47 AM, we are getting blocking in SQL Server. It lasts 2-3 hours and then disappears on its own. We don't get any blocking at all at any other time of the month.
How can I check what is happening at that particular time?
I tried the query below, but it does not show anything being executed that day at that particular time. The last entry is 35 minutes before the problem starts, and the next entry is 30 minutes after the problem started.
SELECT deqs.last_execution_time AS [Time], dest.text AS [Query]
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
ORDER BY deqs.last_execution_time DESC
I also checked the scheduled jobs in SQL Server Agent, but nothing beyond what normally runs (backups, CDC, index rebuilds, etc.) is scheduled at that time.
Therefore, my question is: how can I find which external process is using the database at that particular time on the last Sunday of the month? I would be grateful for any tips and suggestions.
sys.dm_exec_query_stats only shows cached execution plans. That's unlikely to help you.
Instead, since it's predictable when this happens, you can use sys.dm_exec_connections to get the active connections (you can then use Task Manager to find the process using the same ports, if it's running on a machine you have access to). To find what kind of query is executing, use sys.dm_exec_requests.
This has to be run while the actual load is happening - so you'll probably want to either do it manually when the problem occurs, or you'll need to schedule its execution and log the results.
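As a rough sketch of such a capture query (schedule it around 9:47 AM on the last Sunday and log the results to a table if you can't run it by hand):
SELECT
    r.session_id,
    r.blocking_session_id,
    r.status,
    r.wait_type,
    c.client_net_address,
    c.client_tcp_port,
    t.text AS query_text
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_connections AS c
    ON c.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;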