DIU does not increase beyond 4 on copy activity - azure-data-factory-2

I am trying to copy data from GCP (BigQuery) to Azure Storage Gen2 parquet files with the configuration below. I increased the DIU from 4 to 16, but during runtime the DIU does not go beyond 4. Can you please help with how to increase the DIU to make my process faster?
Using preserve hierarchy
Data size: 12 million records, about 3 GB
Throughput: 2.5 MBps

To increase the DIU for a copy activity, click on the activity; under the Settings tab you will find the Data integration unit selector.

I have the same issue. My source is Azure Blob (CSV files), staging is also Azure Blob, and the final destination is Snowflake on Azure. All are in the same region/zone.
I have set the DIU to 20 and the degree of copy parallelism to 4, but at runtime only 4 DIUs are utilized and the parallelism is 1.

Sorry, I don't have enough reputation to write a comment, so I'm posting this as an answer in case it helps.
Please read this thread; it is similar to yours, with a data size of 3 GB. So I assume that when your data size increases, your DIU will also increase.
Alternatively, you can increase the Degree of copy parallelism (DoCP). I tried this in my case on a 1.5 GB dataset with ADLS as the source and Azure Table storage as the sink. With the default DoCP of 4, the copy activity takes 10 minutes, with throughput starting at 8 Mbps and ending at 1.3 Mbps. With a DoCP of 16, the throughput ends at around 2 Mbps and the copy activity takes ~4 minutes to complete. Both runs used 4 DIUs.
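For reference, both settings can also be set directly on the copy activity definition rather than through the UI. A minimal sketch of the relevant part of the activity JSON, assuming a BigQuery source and a parquet sink (dataset references and connection details are omitted; the two tuning properties are dataIntegrationUnits and parallelCopies):

{
    "name": "CopyBigQueryToADLS",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "GoogleBigQuerySource" },
        "sink": { "type": "ParquetSink" },
        "dataIntegrationUnits": 16,
        "parallelCopies": 8
    }
}

Keep in mind that the configured DIU value is an upper bound; for small datasets (a few GB) the service may still decide that 4 DIUs are enough.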

Related

DefaultPartitioner vs TimeBasedPartitioner S3 upload performance difference with 100 partitions and 50K flush size

I'm using a 100-partition topic with 3 replicas and 2 in-sync replicas in an MSK Serverless cluster.
My EC2 instance running the Confluent S3 sink connector ingests 56 GB of data from my MSK cluster in 15 minutes but uploads only 37 GB to S3 in the same time frame. The instance's resources are underutilized and I'm using an S3 endpoint, which makes me think this upload differential is due to my flush size and partitioning scheme.
My S3 sink connector config.
tasks.max=50
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
flush.size=50000
rotate.interval.ms=-1
rotate.schedule.interval.ms=-1
Based on my understanding, the current config waits for 50,000 messages to accumulate for each partition before uploading the file to S3. So, if I use a time-based hourly partitioner, would this 50k message limit be reached much more quickly, since there is only 1 partition for the 15-minute time frame instead of 100?
Thanks in advance.
Each task has its own flush buffer. The hourly partitioner will buffer either the whole hour or flush each set of 50,000 records within the hour partition, whichever occurs first.
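If you want to try the hourly partitioner, here is a sketch of what the relevant connector properties could look like (property names follow the Confluent S3 sink connector; the values are illustrative, and rotate.schedule.interval.ms is an optional way to force a wall-clock-based flush so records don't sit in the buffer for the whole hour):

partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record
flush.size=50000
rotate.schedule.interval.ms=600000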

Does dataframe.repartition(x) make execution faster?

I have a Spark script that reads data from Amazon S3 and then writes it to another bucket using the parquet format.
This is what the code looks like:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods.repartition(25).write.format("parquet").mode("OverWrite").save("AnotherLocationInS3")
My question is: how does the repartition argument (here 25) affect the execution time? Should I increase it so the script runs faster?
Second question: Would it be better if I cache my df before the last line?
Thank you
In typical setups neither repartition nor cache will help you in this specific case. Since you read data from a non-splittable format:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods will have only one partition.
In such a case, repartitioning would make sense only if you performed some actual processing on this data.
However, if you just write to a distributed file system, repartitioning simply doubles the cost - you have to send the data to other nodes first (which involves serialization, deserialization, network transfer, and writing to disk) and then still write to the distributed file system.
There are of course edge cases where this makes sense. If the network connecting your cluster nodes is much faster than the network connecting your cluster to the S3 nodes, the effective latency might be a bit lower.
As for caching - there is no value in caching here at all. Caching a Dataset is expensive and makes sense only if the persisted data is reused.
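A quick way to see this for yourself, assuming a normal SparkSession named spark (the paths are the ones from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single .csv.gz file is not splittable, so the read yields one partition
df_ods = spark.read.csv("LocationInFirstBucket.csv.gz", header=True, sep=";")
print(df_ods.rdd.getNumPartitions())  # usually 1 for a single gzip file

# repartition(25) forces a full shuffle of that single partition across the
# cluster before the write; it only pays off if real processing happens after it
df_ods.repartition(25).write.mode("overwrite").parquet("AnotherLocationInS3")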
Answer 1: Whether to repartition to 25, more, or fewer depends on how much data you have and on the number of executors you provided. If your Spark code runs on a cluster with more than one executor and the data is not repartitioned, repartitioning will speed up the write by parallelizing it across executors.
Answer 2: There is no need to cache the df before the last line because you use only a single action in your code. If you perform multiple actions on your DataFrame and don't want it to be recomputed for each action, then cache it.
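For illustration, a minimal sketch of the case where caching does pay off, i.e. when the same DataFrame feeds more than one action (the column and paths here are made up; assumes an existing SparkSession named spark):

# Two actions reuse the same filtered DataFrame, so caching avoids recomputation
df = spark.read.csv("some/input.csv", header=True, sep=";")
filtered = df.filter(df["status"] == "OK").cache()

print(filtered.count())                                   # first action materializes and caches
filtered.write.mode("overwrite").parquet("some/output")   # second action reuses the cached data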
The thing here is that Spark can only parallelize writing up to a certain point, since one file can't be written by multiple executors at the same time.
Repartitioning helps with this parallelization because it will write 25 different files (one for each partition). If you increase the number of partitions, you increase the number of written files, hence speeding up the execution. This comes at a price, because the reading time will increase with the number of files to be read.
The limit is the number of executors you are running your job with; e.g. if you are running with 25 executors, setting repartition to 26 will not help, because writing the 26th partition has to wait for one of the previous 25 to finish.
For the other question, I don't think .cache() will help you because Spark is lazy; maybe this article can help you further.

Google Dataflow not reading more than 3 input compressed files at once when there are multiple sources

Background: I have 30 days of data in 30 separate compressed files stored in Google Cloud Storage. I have to write them to 30 different partitions of the same BigQuery table. Each compressed file is around 750 MB.
I did 2 experiments on the same data set on Google Dataflow today.
Experiment 1: I read each day's compressed file using TextIO, applied a simple ParDo transform to prepare TableRow objects, and wrote them directly to BigQuery using BigQueryIO. So basically 30 pairs of parallel, unconnected sources and sinks were created. But I found that at any point in time only 3 files were being read, transformed, and written to BigQuery. The ParDo transformation and BigQuery write speed of Google Dataflow was around 6000-8000 elements/sec at any point in time.
So only 3 sources and sinks out of 30 were being processed at any time, which significantly slowed the process. In over 90 minutes, only 7 out of the 30 files were written to separate BigQuery partitions of the table.
Experiment 2: Here I first read each day's data from the same compressed files for all 30 days, applied the ParDo transformation to these 30 PCollections, and stored the 30 resulting PCollections in a PCollectionList object. All 30 TextIO sources were being read in parallel.
I then wrote each PCollection in the PCollectionList, corresponding to each day's data, to BigQuery directly using BigQueryIO. So 30 sinks were again being written to in parallel.
I found that out of the 30 parallel sources, again only 3 were being read and having the ParDo transformation applied, at a speed of around 20000 elements/sec. At the time of writing this question, when 1 hour had already elapsed, reading from all the compressed files had not even covered 50% of the files, and writing to the BigQuery table partitions had not even started.
These problems seem to occur only when Google Dataflow reads compressed files. I had asked a question about its slow reading of compressed files (Relatively poor performance when reading compressed files vis-a-vis normal text files kept in Google Storage using Google Dataflow) and was told that parallelizing the work would make reading faster, since only 1 worker reads each compressed file, and multiple sources would mean multiple workers being given a chance to read multiple files. But this does not seem to be working either.
Is there any way to speed up this whole process of reading from multiple compressed files and writing to separate partitions of the same BigQuery table in a single Dataflow job?
Each compressed file will be read by a single worker. The initial number of workers for a job can be increased with the numWorkers pipeline option, and the maximum number that can be scaled up to can be set with the maxNumWorkers pipeline option.
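For example, with 30 input files you could launch the job with explicit worker counts along these lines (values are illustrative; the other required options such as project and staging location are omitted):

--runner=DataflowPipelineRunner --numWorkers=30 --maxNumWorkers=30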

datastax : Spark job fails : Removing BlockManager with no recent heart beats

I'm using DataStax Enterprise 4.6. I have created a Cassandra table and stored 20 million records. I'm trying to read the data using Scala. The code works fine for a few records, but when I try to retrieve all 20 million records it displays the following error.
WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 172.20.98.17, 34224, 0) with no recent heart beats: 140948ms exceeds 45000ms
15/05/15 19:34:06 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(C15759,34224) not found
Any help?
This problem is often tied to GC pressure.
Tuning your Timeouts
Increase the spark.storage.blockManagerHeartBeatMs so that Spark waits for the GC pause to end.
SPARK-734 recommends setting -Dspark.worker.timeout=30000 -Dspark.akka.timeout=30000 -Dspark.storage.blockManagerHeartBeatMs=30000 -Dspark.akka.retry.wait=30000 -Dspark.akka.frameSize=10000
Tuning your jobs for your JVM
spark.cassandra.input.split.size - lets you change the level of parallelization of your Cassandra reads. Bigger split sizes mean that more data has to reside in memory at the same time.
spark.storage.memoryFraction and spark.shuffle.memoryFraction - the fraction of the heap that will be occupied by RDD storage (as opposed to shuffle memory and Spark overhead). If you aren't doing any shuffles, you could increase this value. The Databricks folks say to make it similar in size to your old gen.
spark.executor.memory - obviously this depends on your hardware. Per Databricks you can go up to 55 GB. Make sure to leave enough RAM for C* and for your OS and OS page cache. Remember that long GC pauses happen on larger heaps.
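Pulling these knobs together, a rough sketch of what they could look like in spark-defaults.conf (all values are illustrative and need to be sized to your data and hardware):

spark.storage.blockManagerHeartBeatMs   30000
spark.worker.timeout                    30000
spark.akka.timeout                      30000
spark.cassandra.input.split.size        10000
spark.storage.memoryFraction            0.6
spark.shuffle.memoryFraction            0.2
spark.executor.memory                   8g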
Out of curiosity, are you frequently going to be extracting your entire C* table with Spark? What's the use case?

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our ad server logs (compressed CSV) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform a transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 regex operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months and years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load them into BigQuery using the command line/console. You'd probably save some dollars in instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M rows per minute. At 31M rows of input, that would take ~5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time (read from GCS -> dispatch -> ...) of the graph, this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (a common model for the typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but there is nothing official to share.
Net-net, increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sink - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
Make sense?