Google DataPrep is extremely slow - google-bigquery

In Google Dataflow, i have a job that basically looks like this:
Dataset: 100 rows, 1 column.
Recipe: 0 steps
Output: New Table.
But it takes between 6-8 minutes to run. What could be the issue?

Usually times are in minutes, not in seconds for Dataprep/dataflow setup.
These solutions are for large data sets and the duration stays constant even if you have 10 times the size.
DataPrep creates for you a DataFlow workflow, and provisions a few VMs for you, that takes time, usually that phase could be in the minute mark. And only a bit later is scaling that up to 50 or 1000 boxes.

Related

How do I tell I find all AWS metrics using high-resolution?

i run into this error in AWS cloudwatch
which does not make sense as I think/thought we had 0 high resolution metrics(high resolution only records for 3 hours). We typically just do 1 minute interval reporting. How do I find all metrics with high resolution? In this way I am hoping I can edit them to not high resolution.
I searched around a ton on the documentation and I looked into micrometer code which seems to default to highResolution = false and a step of 2 minutes. (We are using micrometer). I am trying to figure out next steps on figuring out why AWS thinks this data is high resolution data.
I was also thinking 'ok, perhaps it would roll up to 1 minute data then 5 minute data' so in my query I tried 1 minute and 5 minute but I still get the error of only 3 hours of data.
Error is thrown because you're using the query syntax (SELECT ...) and that only supports the latest 3 hours of data. The feature is called Metrics Insights, you can see the limits here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-limits.html
Error is not related to high resolution metrics. Even if they were high resolution, when you're setting the period to 5 min, you would only retrieve datapoints aggregated to 5 min granularity.

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count BigQuery still needs to allocate resources (such as: slot units, shards etc). You might be reaching some limits which cause a delay. For example, the slots default per project is 2,000 units.
BigQuery execution plan provides very detail information about the process which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link
This Slide by Google might also help you
For more details see this video about how to understand the execution plan

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using command line/console. You'd probably save some dollars of instance's uptime. This is what I've been doing after running into issues with Dataflow/BigQuery interface. Also from my experience there is some overhead bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table OR 6M/per minute. At 31M rows of input that would take ~ 5 minutes of just flat out writes. When you add back the discrete processing time per element & then the synchronization time (read from GCS->dispatch->...) of the graph this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the mean time while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
Make sense?

How to reduce time allotted for a batch of HITs?

today I created a small batch of 20 categorization HITs with the name Grammatical or Ungrammatical using the web UI. Can you tell me the easiest way to manage this batch so that I can reduce its time allotted to 15 minutes from 1 hour and remove also remove the categorization of masters. This is a very simple task that's set to auto-approve within 1 hour, and I am fine with that. I just need to make it more lucrative for people to attempt this at the penny rate.
You need to register a new HITType with the relevant properties (reduced time and no masters qualification) and then perform a ChangeHITTypeOfHIT operation on all of the HITs in the batch.
API documentation here: http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_ChangeHITTypeOfHITOperation.html

Interpreting and finetuning the BATCHSIZE parameter?

So I am playing around with the BULK INSERT statement and am beginning to love it. What was talking the SQL Server Import/Export Wizard 7 hours is taking only 1-3 hours using BULK INSERT. However, what I am observing is that the time to completion is heavily dependent on the BATCHSIZE specification.
Following are the times I observed for a 5.7 GB file containing 50 million records:
BATCHSIZE = 50000, Time Taken: 17.30 mins
BATCHSIZE = 10000, Time Taken: 14:00 mins
BATCHSIZE = 5000 , Time Taken: 15:00 mins
This only makes me curious: Is it possible to determine a good number for BATCHSIZE and if so, what factors does it depend on and can it be approximated without having to run the same query tens of times?
My next run would be a 70 GB file containing 780 million records. Any suggestions would be appreciated? I will report back the results once I finish.
There is some information here and it appears the batch size should be as large as is practical; documentation states in general the larger the batch size the better the performance, but you are not experiencing that at all. Seems that 10k is a good batch size to start with, but I would look at optimizing the bulk insert from other angles such as putting the database into simple mode or specifying a tablock hint during your import race.