Delta job on BigQuery misses records - google-bigquery

I'm facing an odd issue with a small delta job that I implemented on top of a streaming BigQuery table with Apache Beam.
I'm streaming data into a BigQuery table, and every hour I run a job that copies any new records from that streaming table to a reconciled table. The delta is built on top of a CreatedDatetime column I introduced on the streaming table: when a record gets loaded to the streaming table, it gets the current UTC timestamp. So the delta naturally takes all records whose CreatedDatetime is newer than the last run, up to the time the batch is running:
CreatedDatetime >= LastDeltaDate AND
CreatedDatetime < NowUTC
The logic for LastDeltaDate is as follows:
1. Start: LastDeltaDate = 2017-01-01 00:00:00
2. 1st Delta Run:
- NowUTC = 2017-10-01 06:00:00
- LastDeltaDate = 2017-01-01 00:00:00
- at the end of the successful run LastDeltaDate = NowUTC
3. 2nd Delta Run:
- NowUTC = 2017-10-01 07:00:00
- LastDeltaDate = 2017-10-01 06:00:00
- at the end of the successful run LastDeltaDate = NowUTC
...
Now, every other day I find records that are in my streaming table but never arrived in my reconciled table. When I check the timestamps, I see they are nowhere near the batch run, and when I check the Google Dataflow log I can see that no records were returned for the query at that time; yet when I run the same query now, I get the records. Is there any way a streamed record could show up in a query much later, or is it possible that Apache Beam is processing a record but not writing it for a long time? I'm not applying any windowing strategy.
Any ideas?

When performing streaming inserts, there is a delay before those rows become available to batch exports and queries, as described in the data availability section of the documentation.
So, at time T2 you may have streamed a bunch of rows into BigQuery that are still stored in the streaming buffer. You then run a batch job from time T1 to T2, but only see the rows up to T2 minus the buffer delay. As a result, rows that are sitting in the buffer during each delta run will be missed.
You may need to make your selection of NowUTC aware of the streaming buffer, so that the next run picks up the rows that were still in the buffer during the previous one.
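A rough sketch of that idea in Python with the google-cloud-bigquery client, assuming the delta runs as a scheduled job with a fixed safety margin larger than the typical buffer latency; the project/dataset/table names and the 90-minute margin are placeholders, not values from the question:

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

BUFFER_SAFETY_MARGIN = timedelta(minutes=90)  # assumed upper bound on streaming-buffer latency

def run_delta(client: bigquery.Client, last_delta_date: datetime) -> datetime:
    # Keep the upper bound behind the wall clock so rows still sitting in the
    # streaming buffer fall into the next window instead of being skipped.
    now_utc = datetime.now(timezone.utc) - BUFFER_SAFETY_MARGIN

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("last_delta_date", "TIMESTAMP", last_delta_date),
            bigquery.ScalarQueryParameter("now_utc", "TIMESTAMP", now_utc),
        ],
        destination=bigquery.TableReference.from_string("my_project.my_dataset.reconciled_table"),
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    query = """
        SELECT *
        FROM `my_project.my_dataset.streaming_table`
        WHERE CreatedDatetime >= @last_delta_date
          AND CreatedDatetime < @now_utc
    """
    client.query(query, job_config=job_config).result()

    # Persist now_utc as the next LastDeltaDate only after the job succeeds.
    return now_utc

If you would rather not guess at a margin, tables.get also reports the streaming buffer's oldestEntryTime, which could be used as the upper bound instead.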

Related

SSAS Tabular Data Cube - Fact table partition of 160K rows takes almost an hour to process... Why?

I have an SSAS Tabular data model I developed with VS. In this cube there are many fact and dimension tables with lots of measures. However, there is one fact table with 158 million rows in total, and processing all 158 million rows in that one fact table takes over an hour. To speed up the processing time I decided to create two partitions based on a date column. Partition 1 has historical data and when loaded has 157 million rows; Partition 2 (one month of data) has about 160,000 rows, so very small. I only want to process Partition 2 daily. Unfortunately, when I process just Partition 2, the processing time is still almost an hour. How can it be that simply refreshing a 160K-row partition takes 58 minutes? It seems like it is still trying to process the full table…
I will say that when I process a separate table that only has 200K rows in total, I am able to process it in under 30 seconds. Shouldn't Partition 2 above also process in under a minute? What would I be doing wrong here, and why would it take so long to process such a small partition?
In Summary:
Table A = 158,000,000 rows = 1 hour 13 min to process the full table
- Partition 1 = 157,840,000 rows = 1 hour to process FULL
- Partition 2 = 160,000 rows = 58 minutes to process FULL
Table B = 200,000 rows = 30 seconds to process FULL
- Partition 1 = 200,000 rows = 30 seconds to process!
Shouldn't Table A/Partition 2 take 30 seconds to process just like Table B?
I just want to do a full process of Partition 2 of Table A... I expected the processing time to be under 5 minutes, similar to the Table B result. Instead, processing Partition 2 with its 160K rows takes almost the same time as the entire Table A (Partition 1 + 2).
If you have calculated columns or DAX tables that refer to this table, they will have to be processed after the partition loads, which can result in an extended load time. You might be able to test for this by creating a new table with a filter matching the partition and seeing how long it takes to load.
I would also make sure the sort on the table is set to the date.

Amount of overlaps per minute

I would like to write an SQL statement to find the number of users that are using a channel by date and time. Let me give you an example:
Let's call this table Data:
Date Start End
01.01.2020 17:00 17:30
01.01.2020 17:01 17:03
01.01.2020 17:29 18:30
Data is a table that shows when a user started a connection on a channel and the time the connection was closed. A connection can be made at any time, which means from 00:00 until the next day.
What I am trying to achieve is to count the maximum number of connections that were made over a long period of time, let's say 1st February to 1st April.
My idea was to make another table with timestamps in Excel. The table would contain a timestamp for every minute of a specific date.
Then I tried to make a statement like:
SELECT *
FROM Data,Timestamps
WHERE Timestamps.Time BETWEEN Data.Start AND Data.End
Logically this statement does what it is supposed to do. The only problem is that it is not really performant: with the number of timestamps and the amount of data I have to check, it never finishes.
Could anybody help me with this problem? Any other ideas I can try or how to improve my statement?
Regards!
Why do you create the extra table in Excel and not directly in MS Access, and why not set up an index on the timestamps? That will speed it up considerably.
By the way, I think your statement will repeat every user that happened to match your Start .. End period, so the number of rows produced will be enormous. You should rather try:
SELECT Timestamps.Time, COUNT(*)
FROM Data,Timestamps
WHERE Timestamps.Time BETWEEN Data.Start AND Data.End
GROUP BY Timestamps.Time;
But sorry if the syntax in MS Access is different.
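If the join is still too slow, the same per-minute counting can also be done outside the database. A small Python sketch, assuming the connections have been exported as (start, end) datetime pairs; the sample values mirror the Data table above:

from collections import Counter
from datetime import datetime, timedelta

def connections_per_minute(connections):
    """Count how many connections are open during each minute."""
    counts = Counter()
    for start, end in connections:
        minute = start.replace(second=0, microsecond=0)  # truncate to whole minutes
        while minute <= end:
            counts[minute] += 1
            minute += timedelta(minutes=1)
    return counts

connections = [
    (datetime(2020, 1, 1, 17, 0), datetime(2020, 1, 1, 17, 30)),
    (datetime(2020, 1, 1, 17, 1), datetime(2020, 1, 1, 17, 3)),
    (datetime(2020, 1, 1, 17, 29), datetime(2020, 1, 1, 18, 30)),
]
per_minute = connections_per_minute(connections)
peak_minute, peak_count = max(per_minute.items(), key=lambda kv: kv[1])
print(peak_minute, peak_count)  # 2 concurrent connections, first reached at 17:01

This is the same counting the GROUP BY above does, just performed client-side, so it avoids the huge cross join.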

Powerbi Date table makes all related visuals run very slow

The following is in direct query mode:
I have a date/time table that 10 separate queries have a relationship to. All data is from the current day. The date/time table just returns all times on 15 minute intervals up to the current timestamp. This works fine.
I have about 10+ queries that each have a record for whatever the latest 15-minute interval timestamp was. Each of these queries has a relationship to the date/time table.
I am wondering why any visual that uses the date/time table is incredibly slow. Say I have a visual that shows the timestamps on the x axis, and the data on a line (E.g. at 3:45 I had 10 records. At 4 I had 5 records. At 4:15 I had 1 record). If I use the date/time table time on the x axis, it is very slow. If I use the timestamp from the query on the x axis, it is very fast.
Is it doing something with all of the related queries? Only about 2-3 of them should actually be running at any given time based on what page I am on.

Time interval based buckets in redis sorted set

Is there any way to generate time-interval-based buckets using redis sorted sets? I want to create a different sorted set for each time interval (let's say 15 minutes).
t1, t2 are scores
Key SortedSet
bucket#V1 (t1,1),(t2,2)..... (committed bucket)
bucket#V1+15 (t3,1),(t4,2)..... (committed bucket)
bucket#V1+30 (t5,1),(t6,2)..... (current running bucket)
i.e. every 15 minutes it should automatically create a new key and start ingesting data into a new sorted set. V1+15 should start after 15 minutes...
The second challenge is how to query committed buckets (not the running bucket where data is still being ingested).
The end goal is to query the committed buckets first, then query the data in each bucket using time-range queries (based on score, i.e. ZRANGEBYSCORE).
Your key is going to be something like DateTime.Now.SecondsSinceEpoch / TimeSpan.FromMinutes(15), formatted as a fixed-length string. Here is a PowerShell script showing how to get the interval key value; you can use a similar routine for storing data or requesting any past (or future) interval. Here the interval is 3 seconds and the epoch value is Jan 1, 1970, but you can use anything you want.
1..7 | % { $interval = [int] (((Get-Date) - (Get-Date '1970-01-01')).TotalSeconds / 3) ; $interval ; Start-Sleep -Seconds 1 }
514663592
514663593
514663593
514663593
514663594
514663594
514663594
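The same interval arithmetic works from application code too. A rough Python sketch with redis-py, assuming a 15-minute interval and a bucket# key prefix (both placeholders, not requirements from the question); committed buckets are simply all buckets whose interval number is lower than the current one:

import time
import redis

INTERVAL_SECONDS = 15 * 60
r = redis.Redis()

def bucket_key(interval_number: int) -> str:
    # Fixed-width interval number so keys stay sortable as strings.
    return "bucket#%012d" % interval_number

def current_interval() -> int:
    return int(time.time()) // INTERVAL_SECONDS

def add_event(member: str, score: float) -> None:
    # Writes always go into the bucket of the current interval.
    r.zadd(bucket_key(current_interval()), {member: score})

def query_committed(min_score: float, max_score: float, lookback: int = 4):
    # Only buckets whose interval has already ended are queried,
    # so the bucket still being ingested is never touched.
    results = []
    for interval in range(current_interval() - lookback, current_interval()):
        results.extend(r.zrangebyscore(bucket_key(interval), min_score, max_score))
    return results

Putting an EXPIRE on old bucket keys keeps the number of sorted sets bounded.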

Copying table records after a specific time interval in PostgreSQL

I have a table whose records I want to copy to a database on a remote server after a specific interval of time. The number of records in the table is very high (in the range of millions), and the table can have 40-50 columns.
I thought about running pg_dump after each interval, but that sounds inefficient, as the whole table would be dumped again and again.
Assume the time interval is 4 hours and the life cycle of the database starts at 10:00.
No. of Records at 10:00 - 0
No. of Records at 14:00 - n
No. of Records at 18:00 - n+m
No. of Records at 22:00 - n+m+l
The script (shell) that I want to write should select 0 rows at 10:00, n rows at 14:00, m rows at 18:00 and l rows at 22:00.
Is there any way to copy only the rows that have been added between the time intervals, to avoid the redundant rows that would come from taking a pg_dump every 4 hours?
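One possible sketch of such an incremental copy in Python with psycopg2, assuming the table has a created_at timestamp column (or a serial id) to use as a watermark; the table name, column name and connection strings are placeholders, not values from the question:

import io
import psycopg2
from psycopg2 import sql

def copy_new_rows(last_watermark):
    """Append rows newer than last_watermark from the local table to the remote copy."""
    src = psycopg2.connect("dbname=source host=localhost")
    dst = psycopg2.connect("dbname=target host=remote.example.com")
    buf = io.StringIO()  # for very large deltas, spool through a temp file instead
    with src, dst, src.cursor() as read_cur, dst.cursor() as write_cur:
        # Stream only the delta out of the source table...
        copy_out = sql.SQL(
            "COPY (SELECT * FROM big_table WHERE created_at > {}) TO STDOUT"
        ).format(sql.Literal(last_watermark))
        read_cur.copy_expert(copy_out.as_string(read_cur), buf)
        buf.seek(0)
        # ...and append it to the table on the remote server.
        write_cur.copy_expert("COPY big_table FROM STDIN", buf)
    # The caller records the upper bound used for this run (e.g. the current
    # timestamp) and passes it as last_watermark on the next 4-hour run.

This only ships the rows added since the previous run, so each 4-hour cycle copies n, m and l rows respectively instead of dumping the whole table.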