Azure data sync is in processing state, how do I know how long it will take to complete? - azure-sql-database

I am syncing around 12 GB of data from the hub to the destination, and both databases are Azure SQL databases. I see a lag of more than 18 hours, and the sync has been in the Processing state for more than 14 hours. How do I know how long it will take to complete?
The current log status is:
1549802 rows have been processed in 52609 seconds.
Thank you

Data Sync won't know the two databases are identical until it compares the data row by row. That is a very costly process and can take a long time if you have large databases or tables. My recommendation is to have data on only one side and keep the same tables empty in the other database. In that case, Data Sync will use a bulk load during initialization, which is much faster than a row-by-row comparison.
Scaling up the service tiers of the Azure SQL databases during the initial sync should speed things up too.
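If it helps, a minimal T-SQL sketch of a temporary scale-up (the database name and target tier are placeholders; scale back down once the initial sync has finished):

-- Hypothetical database name and tier; run against the Azure logical server.
-- The scale operation is online, but connections may be dropped briefly when it completes.
ALTER DATABASE [HubDb]
MODIFY (EDITION = 'Premium', SERVICE_OBJECTIVE = 'P2');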
To get the state (Ready, Not Ready) of a sync group you can use the PowerShell cmdlet Get-AzureRmSqlSyncGroup. I don't know of a way to get the progress of the first synchronization.

Related

Running an SQL query in the background

I'm trying to update a modest dataset of 60k records with a value which takes a little time to compute. From a small trial run of 6k records in the production environment, it took 4 minutes to complete, so the full execution should take around 40 minutes.
However, this trial run showed that SQL timeouts were occurring on user requests accessing data in related tables (but not necessarily on the actual rows being updated).
My question is: is there a way to run non-urgent queries as a background operation on the SQL Server without causing timeouts or locking tables for extended periods? The data in the column being updated is not essential to return with its new value during this period; if a request happened to come in for one of these rows, returning the old value would be perfectly acceptable rather than locking the set until the update completes. (I'm not sure of the ins and outs of how this works; obviously I do want to prevent data corruption. Perhaps there is a way of queuing any additional changes in the background.)
This is possibly a situation where the NOLOCK hint is appropriate. You can read about SQL Server isolation levels in the documentation, and searching for "SQL Server NOLOCK" will give you plenty of material on why you should not overuse the construct.
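A minimal sketch of what that looks like, assuming a hypothetical dbo.Orders table:

-- Table and column names are illustrative only; NOLOCK reads without waiting on the
-- update's locks, at the risk of returning uncommitted ("dirty") values.
SELECT OrderId, ComputedValue
FROM dbo.Orders WITH (NOLOCK)
WHERE CustomerId = 42;

-- The same behaviour for every query in the session:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

If dirty reads are a concern, enabling READ_COMMITTED_SNAPSHOT on the database gives readers the last committed version of each row instead of blocking on the writer, which matches the "returning the old value is acceptable" requirement in the question.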
I might also investigate whether you need a SQL query to compute the values at all. A single query that takes 4 minutes on 6k records is a long time. You might want to consider reading the data into an application (say, using Python, R, or whatever) and doing the data manipulation there. It may also be possible to speed up the query processing itself.

Slow query times on small BigQuery table while streaming

I am streaming data into three BigQuery tables from Cloud Dataflow sinks, but am seeing very slow query results on one of the target tables: a "small" one, about 1.5 million rows. If I stop the Dataflow streaming job and come back to the table some time later, the same query runs quickly. Here's the query in Standard SQL dialect:
SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) as `sum`
FROM `myproject.myset`.`mysmalltable`
GROUP BY 1, 2
The table schema is:
appname:STRING
start_time_to_minute:TIMESTAMP
sum_client_bytes:INTEGER
sum_server_bytes:INTEGER
Job ID: bquijob_568af346_15c82f9e5ec - it takes 12s.
This table is growing by about 2,000 rows per minute, via streaming. Another target table in the same project grows much more quickly via streaming, maybe 200,000 rows per minute. If I run the query above on mysmalltable while streaming, it can take close to a minute. We have experienced query times of several minutes on similar queries.
Job ID: bquijob_7b4ea8a1_15c830b7f12, it takes 42.8s
If I add a filter, things get worse, e.g.
WHERE REGEXP_CONTAINS(`appname`, 'Oracle')
Job ID: bquijob_502b0a06_15c830d9ecf, it takes 57s
One yesterday took over 6 minutes:
Job ID: bquijob_49761f0d_15c80c6f415, it took 6m38s
I understand that in order to support querying "live" data, BigQuery has a much less efficient data provider that operates on top of the streaming buffer. Is this implicated here? Is there a way we can make these queries run reliably in under 30s, for example by somehow avoiding the streaming buffer and using data that is more than a minute old? If the streaming buffer is implicated, it still doesn't quite add up for me, since I would have thought most of the data being read out of mysmalltable would still be in native format.
I appreciate any guidance!
I've also seen this behaviour. The way I worked around it (I won't say solved, because that is mostly on Google's side) was to use micro-batching instead of streaming inserts. When concurrency is low, streaming inserts work really well, but with real big data (hundreds of thousands of rows in my case) the best way is to use micro-batching. I'm using the FILE_LOADS option with a 3-minute window, and it works really well. Hopefully that helps you.

BigQuery range decorator duplicate issue

We are facing issues with BigQuery range decorators on a streaming table: the range decorator queries return duplicate data.
My case:
My BQ table is getting data regularly from customer events through streaming inserts. Another job periodically fetches time-bound data from the table using a range decorator and sends it to Dataflow jobs, like this.
The first time, fetching all the data from the table using:
SELECT * FROM [project_id:alpha.user_action@1450287482158]
When I ran this query I got 91 records.
After 15 minutes, another query based on the last interval:
SELECT * FROM [alpha.user_action@1450287482159-1450291802380]
This also gave the same result: 91 records.
However, when I tried to run the first query again to cross-check:
SELECT * FROM [project_id:alpha.user_action@1450287482158]
it returned no data.
Any help on this?
First off, have you tried using streaming dataflow? That might be a better fit (though your logic is not expressible as a query). Streaming dataflow also supports Tee-ing your writes, so you can keep both raw data and aggregate results.
On to your question:
Unfortunately this is a collision of two concepts that were built concurrently and somewhat independently, thus resulting in ill-defined interactions.
Time range table decorators were designed/built in a world where only load jobs existed. As such, blocks of data are atomically committed to a table at a single point in time. Time range decorators work quite well with this, as there are clear boundaries of inclusion/exclusion, and the relationship is stable.
Streaming Ingestion + realtime query is somewhat counter to the "load job" world. BigQuery is buffering your data for some period of time, making it available for analysis, and then periodically flushing the buffers onto the table using the traditional storage means. While the data is buffered, we have "infinite" time granularity. However, once we flush the buffer onto the table, that infinite granularity is compressed into a single time, which is currently the flush time.
Thus, using time range decorators on streaming tables can unfortunately result in some unexpected behaviors, as the same data may appear in two non-overlapping time windows (once while it is buffered, and once when it is flushed).
Our recommendation if you're trying to achieve windowed queries over recent data is to do the following:
Include a system timestamp in your data.
For the table decorator timestamps, include some buffer around the actual window to account for clock skew between your clock and Google's, and late arrivals from retry. This buffer should be both before and after your target window.
Modify your query to apply your actual time window, as sketched below.
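A rough illustration in legacy SQL against the user_action table from the question (the server_ts column, the five-minute padding, and the exact boundary values are hypothetical): the decorator range is the target window padded by five minutes on each side, while the WHERE clause trims the result back to the exact window using the system timestamp recommended above.

SELECT *
FROM [project_id:alpha.user_action@1450287182159-1450292102380]
WHERE server_ts >= MSEC_TO_TIMESTAMP(1450287482159)
  AND server_ts < MSEC_TO_TIMESTAMP(1450291802380)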
It should be noted that, depending on your actual use case, this may not address your problem. If you can give more details, there might be a way to achieve what you need.
Sorry for the inconvenience.

How to view throughput using a SELECT?

This seems like it should be simple. I am trying to get an idea of the read transfer rate on a database using a SELECT statement, e.g. I want 'x MB/sec'. The table used in this query has millions of rows.
I have tried using SET STATISTICS IO ON/OFF on either side of my SELECT, but this doesn't return the transfer rate; I just get the number of rows affected and the number of reads and writes.
A simple way I can think of is to load up perfmon on your machine and watch it while you're doing the select. This will give you the transfer rate to your machine from the DB.
If you want to know the IO to disk on the DB, then you'll probably have to stop all other loads, load up perfmon on the DB, and watch it while you're executing the select. This result is highly dependent on how much of the data is already in the cache.
If you can't isolate your select, then you can average what your baseline is and see how much more throughput there is during your select.
If you can't pull up perfmon, then you can see if the relevant counters are in sys.dm_os_performance_counters (http://technet.microsoft.com/en-us/library/ms187743.aspx).
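As a rough sketch of that last suggestion (the 10-second sample interval is an arbitrary choice; the Buffer Manager counters in this DMV are cumulative, so you sample twice and take the difference, and SQL Server pages are 8 KB):

-- Take a first sample of the cumulative Buffer Manager counters.
SELECT counter_name, cntr_value
INTO #before
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN ('Page reads/sec', 'Page writes/sec', 'Readahead pages/sec');

-- Run the workload you want to measure, or simply wait out the interval.
WAITFOR DELAY '00:00:10';

-- Diff against a second sample to get pages/sec, then convert to MB/sec (8 KB pages).
SELECT a.counter_name,
       (a.cntr_value - b.cntr_value) / 10.0            AS pages_per_sec,
       (a.cntr_value - b.cntr_value) / 10.0 * 8 / 1024 AS mb_per_sec
FROM sys.dm_os_performance_counters AS a
JOIN #before AS b ON b.counter_name = a.counter_name
WHERE a.object_name LIKE '%Buffer Manager%';

DROP TABLE #before;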

max memory per query

How can I configure the maximum memory that a query (a SELECT) can use in SQL Server 2008?
I know there is a way to set a minimum value, but what about the maximum? I would like to use this because I have many processes running in parallel. I know about the MAXDOP option, but that is for processors.
Update:
What I am actually trying to do is run a data load continuously, in ETL form (extract, transform, and load). While the data is being loaded I want to run some SELECT queries, all of them expensive (containing GROUP BY). The most important process for me is the data load. I get an average speed of 10,000 rows/sec, and when I run the queries in parallel it drops to 4,000 rows/sec or even lower. I know more details should be provided, but this is a more complex product that I work on and I cannot go into more detail. Another thing I can guarantee is that the load speed does not drop because of lock problems, as I have monitored for those and removed them.
There isn't any way of setting a maximum memory limit at the per-query level that I can think of.
If you are on Enterprise Edition you can use Resource Governor to set a maximum amount of memory that a particular workload group can consume, which might help.
In SQL Server 2008 you can use Resource Governor to achieve this. There you can set request_max_memory_grant_percent to cap the memory grant (this is a percentage of the pool size specified by the pool's max_memory_percent value). The setting is not applied to an individual query directly; it applies to every request classified into that workload group.
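If it helps, a minimal sketch of that setup; the pool, group, login name, and percentages are hypothetical examples, and the classifier function must live in master:

-- Pool and workload group that cap memory grants for reporting-style queries.
CREATE RESOURCE POOL ReportingPool WITH (MAX_MEMORY_PERCENT = 40);

CREATE WORKLOAD GROUP ReportingGroup
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)   -- percent of the pool's memory per request
USING ReportingPool;
GO

-- Classifier function (create in master): route the example 'reporting_user' login
-- into ReportingGroup; everything else stays in the default group.
CREATE FUNCTION dbo.rg_classifier() RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    RETURN (CASE WHEN SUSER_SNAME() = N'reporting_user'
                 THEN N'ReportingGroup' ELSE N'default' END);
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;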
In addition to Martin's answer
If your queries are all the same or similar, working on the same data, then they will be sharing memory anyway.
Example:
A busy web site with 100 concurrent connections running 6 different parametrised queries between them on broadly the same range of data.
6 execution plans
100 user contexts
one buffer pool with assorted flags and counters to show usage of each data page
If you have 100 different queries, or they are not parametrised, then fix the code.
Memory per query is something I've never thought about or cared about since the last millennium.
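To illustrate the parametrisation point, a minimal sketch using sp_executesql (the table, columns, and parameter are hypothetical); every caller running this with a different @CustomerId reuses the same cached plan:

-- One plan in the cache regardless of how many distinct customer IDs are queried.
DECLARE @sql NVARCHAR(MAX) =
    N'SELECT OrderId, Total FROM dbo.Orders WHERE CustomerId = @CustomerId;';
EXEC sp_executesql @sql, N'@CustomerId INT', @CustomerId = 42;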