Interpreting and fine-tuning the BATCHSIZE parameter? - sql

So I am playing around with the BULK INSERT statement and am beginning to love it. What was taking the SQL Server Import/Export Wizard 7 hours takes only 1-3 hours using BULK INSERT. However, what I am observing is that the time to completion depends heavily on the BATCHSIZE specification.
Following are the times I observed for a 5.7 GB file containing 50 million records:
BATCHSIZE = 50000, Time Taken: 17.30 mins
BATCHSIZE = 10000, Time Taken: 14:00 mins
BATCHSIZE = 5000, Time Taken: 15:00 mins
This makes me curious: is it possible to determine a good value for BATCHSIZE, and if so, what factors does it depend on? Can it be approximated without having to run the same load tens of times?
My next run will be a 70 GB file containing 780 million records. Any suggestions would be appreciated. I will report back the results once I finish.
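For reference, a stripped-down sketch of the kind of statement being timed here (the file path, table name, and field terminators are placeholders, not details from the post):

BULK INSERT dbo.TargetTable
FROM 'D:\data\bigfile.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 10000
);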

There is some information here, and it appears the batch size should be as large as is practical; the documentation states that, in general, the larger the batch size, the better the performance, but you are not experiencing that at all. It seems that 10k is a good batch size to start with, but I would look at optimizing the bulk insert from other angles as well, such as putting the database into the SIMPLE recovery model or specifying a TABLOCK hint during your import.
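If you try those two suggestions, the statements would look roughly like the sketch below (the database name is a placeholder, and the BULK INSERT options are the same hypothetical ones shown above, with TABLOCK added):

-- Switch to the SIMPLE (or BULK_LOGGED) recovery model before the load to reduce log overhead.
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;

-- Re-run the load with a table-level lock held for the duration of the bulk operation.
BULK INSERT dbo.TargetTable
FROM 'D:\data\bigfile.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 10000,
    TABLOCK
);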

Related

BigQuery - how to decrease slot time of Coalesce execution step?

I have a pretty complex query with about 70 execution steps. The query has been somewhat optimized for performance, so most of the steps in the execution plan run pretty fast - except the Coalesce steps, which take about 10 to 100 times more slot time than the others. As far as I understand, a Coalesce step prepares data for the following Join step, but why does it take so long even when the actual number of records processed by the step is low? The most extreme case I saw looks like this (ZERO records processed by this step, but it still takes 8 seconds!):
S46: Coalesce
Slot time: 8223 ms
Duration: 92 ms
Bytes Shuffled: 0 B
I wasn't able to find any hints about this "Coalesce" step and ways to optimize it in Google's documentation, so perhaps you can give me some advice about it or point me to documentation that explains it.

Check the execution time of a query accurate to the microsecond

I have a query in SQL Server 2019 that does a SELECT on the primary key fields of a table. This table has about 6 million rows of data in it. I want to know exactly how fast my query is, down to the microsecond (or at least to the nearest 100 microseconds). My query is faster than a millisecond, but all I can find in SQL Server are query measurements accurate to the millisecond.
What I've tried so far:
SET STATISTICS TIME ON
This only shows milliseconds
Wrapping my query like so:
DECLARE @Start DATETIME2(7), @End DATETIME2(7);
SELECT @Start = SYSDATETIME();
SELECT TOP 1 b.COL_NAME FROM BLAH b WHERE b.[key] = 0x1234;
SELECT @End = SYSDATETIME();
SELECT DATEDIFF(MICROSECOND, @Start, @End);
This shows that no time has elapsed at all. But this isn't accurate, because if I add WAITFOR DELAY '00:00:00.001', which should add a measurable millisecond of delay, it still shows 0 for the DATEDIFF. Only if I wait for 2 milliseconds do I see it show up in the DATEDIFF.
Looking up the execution plan and getting the total_worker_time from the sys.dm_exec_query_stats table.
Here I see 600 microseconds; however, the Microsoft docs seem to indicate that this number cannot be trusted:
total_worker_time ... Total amount of CPU time, reported in microseconds (but only accurate to milliseconds)
I've run out of ideas and could use some help. Does anyone know how I can accurately measure my query in microseconds? Would extended events help here? Is there another performance monitoring tool I could use? Thank you.
This is too long for a comment.
In general, you don't look for performance measurements at microsecond granularity. There is just too much variation, based on what else is happening in the database, on the server, and on the network.
Instead, you set up a loop and run the query thousands -- or even millions -- of times and then average over the executions. There are further nuances, such as clearing caches if you want to be sure that the query is using cold caches.
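A minimal sketch of that approach, reusing the (presumably anonymized) table, column, and key names from the question - adjust the iteration count and the variable type to match your data:

-- Time many executions and average; includes loop overhead, so treat the result as an upper bound.
DECLARE @Iterations INT = 100000;
DECLARE @i INT = 0;
DECLARE @Dummy SQL_VARIANT;  -- assign the result so each SELECT does not return a result set
DECLARE @Start DATETIME2(7) = SYSDATETIME();
WHILE @i < @Iterations
BEGIN
    SELECT TOP 1 @Dummy = b.COL_NAME FROM BLAH b WHERE b.[key] = 0x1234;
    SET @i = @i + 1;
END
SELECT DATEDIFF_BIG(MICROSECOND, @Start, SYSDATETIME()) * 1.0 / @Iterations AS AvgMicrosecondsPerExecution;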

Google DataPrep is extremely slow

In Google Dataflow, I have a job that basically looks like this:
Dataset: 100 rows, 1 column.
Recipe: 0 steps
Output: New Table.
But it takes 6-8 minutes to run. What could be the issue?
For a Dataprep/Dataflow setup, run times are usually measured in minutes, not seconds.
These tools are built for large data sets, and the duration stays roughly constant even if you have 10 times the data.
Dataprep generates a Dataflow workflow for you and provisions a few VMs; that alone takes time, and that phase is usually in the minute range. Only a bit later does it scale up to 50 or 1,000 machines.

DB2 query taking too much time to complete - DATA_PHYSICAL_READS is high

My query takes a very long time, approximately 8 minutes, to complete.
I checked and found that most of the time is being spent on physical data reads; DATA_PHYSICAL_READS is very high.
Can someone suggest how to lower this?
SELECT BP_NAME, DATA_PHYSICAL_READS, DATA_HIT_RATIO_PERCENT, INDEX_PHYSICAL_READS, INDEX_HIT_RATIO_PERCENT
FROM SYSIBMADM.MON_BP_UTILIZATION
BP_NAME        DATA_PHYSICAL_READS   DATA_HIT_RATIO_PERCENT   INDEX_PHYSICAL_READS   INDEX_HIT_RATIO_PERCENT
IBMDEFAULTBP   6186475               99.96                    3700                   99.2
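The bufferpool view only shows totals, so one way to see which statements are actually driving those physical reads is to query the package cache. This is a sketch, not something from the thread; it assumes DB2 for LUW, where the MON_GET_PKG_CACHE_STMT table function is available:

-- Top 10 cached statements by physical data-page reads.
SELECT SUBSTR(STMT_TEXT, 1, 100) AS STMT,
       POOL_DATA_P_READS,
       POOL_INDEX_P_READS,
       TOTAL_ACT_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
ORDER BY POOL_DATA_P_READS DESC
FETCH FIRST 10 ROWS ONLY;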

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (compressed CSV) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
We perform a transformation on 2 of the fields and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (which is due to increase to 50 operations) on those 2 fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirement is to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex nor CPU-intensive, i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using the command line/console. You'd probably save some dollars of instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M rows per minute. At 31M rows of input, that would take ~5 minutes of flat-out writes alone. When you add back the discrete processing time per element and then the synchronization time of the graph (read from GCS -> dispatch -> ...), this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
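As a rough illustration of the wildcard aggregation just mentioned, using today's standard SQL wildcard-table syntax (the project, dataset, table prefix, and column names here are made up, not from the thread):

SELECT campaign_id, COUNT(*) AS impressions
FROM `my-project.adserver.logs_*`
WHERE _TABLE_SUFFIX BETWEEN '20150101' AND '20150131'
GROUP BY campaign_id;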
Net-net, increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
Make sense?