Writing speed to a Delta table significantly increases after copying it in Databricks

I am merging a PySpark DataFrame into a Delta table. The output Delta table is partitioned by DATE. The following query takes 30s to run:
from delta.tables import DeltaTable

# The merge condition must be a single string, hence the triple quotes.
query = (
    DeltaTable.forPath(spark, PATH_TO_THE_TABLE)
    .alias("actual")
    .merge(
        spark_df.alias("sdf"),
        """actual.DATE >= current_date() - INTERVAL 1 DAYS
           AND (actual.feat1 = sdf.feat1)
           AND (actual.TIME = sdf.TIME)
           AND (actual.feat2 = sdf.feat2)""",
    )
    .whenNotMatchedInsertAll()
)
query.execute()
After copying the content of the Delta table to another location, the above query becomes 60 times faster (i.e. it takes 0.5s on the same cluster) when using NEW_PATH instead of PATH_TO_THE_TABLE. Here is the command used to copy the Delta table:
(
    spark.read.format("delta")
    .load(PATH_TO_THE_TABLE)
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("DATE")
    .save(NEW_PATH)
)
How can I make querying the first Delta table as fast as the new one? I understand that Delta has a versioning system and I suspect it is the reason the query takes so long. I tried vacuuming the Delta table (which lowered the query time to 20s), but I am still far from 0.5s.
Stack:
Python 3.7;
PySpark 3.0.1;
Databricks Runtime 7.3 LTS

I see that you compare actual.DATE >= current_date(). As this is the most important part of your query, please try to periodically run OPTIMIZE with Z-ordering on that field:
OPTIMIZE actual ZORDER BY (actual.DATE)
You can also try to fully vacuum the Delta table:
VACUUM actual RETAIN 0 HOURS
To do that, you need to set spark.databricks.delta.retentionDurationCheck.enabled to false.
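A minimal PySpark sketch of that suggestion, reusing PATH_TO_THE_TABLE from the question (assumes the Delta Lake library that ships with the Databricks runtime):
from delta.tables import DeltaTable

# Allow a retention period below the default 7 days. Use with care: vacuumed
# files are no longer available for time travel or for concurrent readers of
# older snapshots.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Remove data files that are no longer referenced by the current table version.
DeltaTable.forPath(spark, PATH_TO_THE_TABLE).vacuum(0)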
If you don't need the benefits of Delta (transactions, concurrent writes, time travel history, etc.), you can just use plain Parquet.

Related

BigQuery external table - query performance degrades with an increasing number of files in the source URI

I have an external BigQuery table created to read Parquet files from a GCS bucket.
The folder layout in the GCS bucket is as follows:
gs://mybucket/root/year=2022/model=abc/
gs://mybucket/root/year=2022/model=.../
gs://mybucket/root/year=2021/model=abc/
gs://mybucket/root/year=2021/model=.../
The layout is organized so that it follows the Hive partitioning layout, as explained in the BigQuery documentation. The columns "year" and "model" are recognized as partition columns in the external table.
**External Data Configuration**
Source URI(s) - gs://mybucket/root/*
Source format - PARQUET
Hive Partitioning Mode - CUSTOM
Hive Partitioning Source URI Prefix - gs://mybucket/root/{year:INTEGER}/{model:STRING}
Hive Partitioning Column(s) - year, model
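For context, roughly the same configuration expressed with the google-cloud-bigquery Python client (the project, dataset, and table names below are hypothetical placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Hive partitioning options matching the configuration above.
hive_opts = bigquery.external_config.HivePartitioningOptions()
hive_opts.mode = "CUSTOM"
hive_opts.source_uri_prefix = "gs://mybucket/root/{year:INTEGER}/{model:STRING}"

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://mybucket/root/*"]
external_config.hive_partitioning = hive_opts

table = bigquery.Table("my-project.my_dataset.myTable")  # placeholder table ID
table.external_data_configuration = external_config
client.create_table(table)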
Problem: When I run queries on the external table as given below, I have observed that every query runs for an initial 2-3 minutes before the actual run happens. The BigQuery console shows "Query pending" during this time, and as soon as it turns to "Query running" the output gets displayed with minimal slot time consumption (slot time shows as 1-2 seconds).
SELECT * FROM myTable WHERE year = 2022 AND model = 'abc'
The underlying file count varies and increases for every year and model. For years with more Parquet files, the initial time is sometimes around 4-5 minutes.
My understanding from the documentation is that, if the partition columns are present in the query, some sort of partition pruning happens, and I would expect the query to be responsive immediately:
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#partition_pruning
But my observations are contrary to this. If the source URIs are restricted to one year, so the table reads data from a single year, the initial query time (where it remains "Query pending" on the console) is reduced to 1-2 minutes (or even less):
Source URI(s)- gs://mybucket/root/year=2022/*
Question: Is this the expected behavior? As the volume of files in the GCS bucket increases, the query takes even longer to run (especially the initial time; the actual run time doesn't change much), even though the WHERE clause filters on the year and model partition columns.
This is likely expected behavior. Before partition pruning can happen, objects in GCS need to be listed, which is likely where the time is being taken. We are working on improvements in this area. The fact that slot time is so low is a good indicator that pruning is in fact happening (since most files are not being read, there isn't a lot of slot time to consume).

BigQuery. Long execution time on small datasets

I created a new Google Cloud project and set up a BigQuery database. I tried different queries, and they all take too long to execute. Currently we don't have a lot of data, so I expected high performance.
Below are some examples of queries and their execution time.
Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):
select date(installtime) regtime
,count(distinct userclientid) users
,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1
The query failed in 1 hour + with error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."
Query execution plan: https://prnt.sc/t30bkz
Query #2 (Job Id bquxjob_41f963ae_172cd41083f):
select fd.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1
Execution took 59.3 sec, 7.7 MB processed, which is too slow.
Query Execution plan: https://prnt.sc/t309t4
Query #3 (Job Id bquxjob_3b19482d_172cd31f629)
select date(installtime) regtime
,count(distinct userclientid) users
from DWH.DimUser du
group by 1
Execution time: 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.
Tables used :
DimUser - Table size 870.71 MB, Number of rows 2,771,379
FactAdRevenue - Table size 6.98 MB, Number of rows 53,816
FactMarketingSpent - Table size 68.57 MB, Number of rows 453,600
The question is: what am I doing wrong that makes the query execution time so long? If everything is OK, I would be glad to hear any advice on how to reduce execution time for such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.
Thank you!
P.S. Previously I used BigQuery for other projects, and the performance and execution times were incredibly good for tables of 50+ TB.
Posting the same reply I've given in the GCP Slack workspace:
In both of your first two queries it looks like you have one particular worker that is overloaded. You can see this because, in the compute section, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (looking at the 2nd input) to a table of ~50k (looking at the first input). This is not good practice; you should switch it so that the larger table is the leftmost table. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
You may also have heavy skew in your join keys (e.g. 90% of rows are on 1/1/2020, or NULL). Check this.
For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BQ tends to get better if you perform the same query over and over, so this will get quicker.
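A rough sketch of both suggestions applied to the queries above, run through the google-cloud-bigquery Python client (the rewrites are illustrative only; whether they help depends on the actual data distribution):
from google.cloud import bigquery

client = bigquery.Client()

# Query #2 with the larger table (FactMarketingSpent, ~450k rows) as the
# leftmost input of the join.
query2 = """
SELECT fd.date, SUM(fd.revenue) AS adrevenue, SUM(fm.advcost) AS advspent
FROM DWH.FactMarketingSpent fm
JOIN DWH.FactAdRevenue fd ON fm.date = fd.date
GROUP BY 1
"""

# Query #3 with an approximate distinct count instead of an exact one.
query3 = """
SELECT DATE(installtime) AS regtime, APPROX_COUNT_DISTINCT(userclientid) AS users
FROM DWH.DimUser
GROUP BY 1
"""

for sql in (query2, query3):
    print(list(client.query(sql).result())[:5])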

bq decorator behavior against disk and streaming buffer

I am trying to utilize BigQuery's table decorators, but there are some behaviors I want to confirm.
After some experimenting, we found that the query result for the same absolute interval is not the same when the query is executed at different times. If I query a live-streaming table over a very recent hour interval by, say, batch-issuing 60 queries, each with a granularity of one minute, I see a very even distribution of output across the queries. However, if I query the same hour interval after, say, 2 hours, the output becomes very skewed: I see many minutes with output size 0 and suddenly a spike where one one-minute interval contains almost all the data of that hour.
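For clarity, a rough sketch of what "batch-issuing 60 queries, each with a granularity of one minute" could look like with legacy-SQL range decorators (the [table@startMs-endMs] form, times in milliseconds since the epoch; the dataset, table, and date below are placeholders):
from datetime import datetime, timedelta
from google.cloud import bigquery

client = bigquery.Client()
start = datetime(2016, 3, 1, 8, 0)  # hypothetical date; 8:00 AM local time

# One query per one-minute window over the 8:00-9:00 interval.
legacy = bigquery.QueryJobConfig(use_legacy_sql=True)
for i in range(60):
    lo = start + timedelta(minutes=i)
    hi = lo + timedelta(minutes=1)
    sql = "SELECT COUNT(*) FROM [mydataset.livetable@{}-{}]".format(
        int(lo.timestamp() * 1000), int(hi.timestamp() * 1000)
    )
    rows = list(client.query(sql, job_config=legacy).result())
    print(lo.strftime("%H:%M"), rows[0][0])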
For example, when querying the same table over the absolute interval from 8:00 AM to 9:00 AM:
If I execute the queries at, say, 9:10 AM, I get an output row count distribution like:
8:01AM - 8:02AM: 12
8:02AM - 8:03AM: 9
8:03AM - 8:04AM: 10
8:04AM - 8:05AM: 22
8:05AM - 8:06AM: 15
…
If I execute the queries at, say, 11:00 AM, I get output row counts like:
8:01AM - 8:02AM: 0
8:02AM - 8:03AM: 0
8:03AM - 8:04AM: 0
8:04AM - 8:05AM: 0
8:05AM - 8:06AM: 0
…
8:20AM - 8:21AM: 123
…
I assume the difference is caused by whether the data is in the streaming buffer or on disk. However, this undermines the idempotency of querying a given range of the same table and adds a lot of complexity to using the decorator. Therefore I want to have some expected behaviors clarified.
1. Is the difference in the query results really caused by whether the data resides in the streaming buffer or on disk?
2. Assuming the difference is because of (1): when data is flushed from the buffer to disk, will that data be reallocated to a snapshot with a future timestamp, or could it be reallocated to a past timestamp? This relates to whether it is possible to miss streaming data when using the decorator.
3. When exposed to query results, is it guaranteed that the reallocation of data is atomic? That is, will the same row only ever be versioned with one server timestamp?
4. Assuming the scenario where data is reallocated to a snapshot with a future timestamp: can BigQuery provide a transactional read across a group of queries? Say I am batching multiple queries over a table, each covering a unique minute interval, and buffer flushing happens in the background while they execute. Is it possible for the same data to appear in more than one query's output? This relates to whether it is possible to get duplicate data when using the decorator.
EDIT:
Some additional observations. I found that after some time the query results stabilize, i.e. the result of the same query no longer changes across executions. I assume this is because the data "snapshot" of that time range has been finalized. So is it possible to know how often BigQuery flushes data from the streaming buffer and how often data gets snapshotted (or whatever mechanism determines the query result of the bq decorator)? In other words, is there a guaranteed cutoff time after which the output of the bq decorator is finalized?

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 regex operations (set to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3M records), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months and years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex nor CPU intensive, i.e. just a mapping of Strings to Strings.
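The pipeline above appears to use the Java Dataflow SDK (String in, TableRow out); purely for illustration, here is a rough Apache Beam Python sketch of the same shape (the bucket path, field positions, regexes, table name, and schema are all hypothetical):
import re
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_row(line):
    # Hypothetical CSV layout: rewrite two fields with regexes into new columns.
    fields = line.split(",")
    creative, url = fields[3], fields[7]
    return {
        "creative_id": re.sub(r"\D", "", creative),
        "domain": re.sub(r"^https?://([^/]+).*$", r"\1", url),
        "raw_url": url,
    }

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/dfp-logs/2015-06-01/*.csv.gz")
     | "Transform" >> beam.Map(to_row)  # stands in for the ParDo/DoFn
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:adserver.logs_20150601",
           schema="creative_id:STRING,domain:STRING,raw_url:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))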
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load them into BigQuery using the command line/console. You'd probably save some money on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M per minute. At 31M rows of input, that would take ~5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time (read from GCS -> dispatch -> ...) of the graph, this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
Make sense?
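If it helps, a sketch of that sharding idea in Apache Beam Python terms (the file patterns and table names are hypothetical, the transform is a placeholder, and the original suggestion targets the Java TextIO):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# One pipeline per hourly file pattern, each writing to its own table, so no
# single table absorbs the whole write load; the shards can be recombined
# later with table wildcards on the BigQuery side.
for hour in range(24):
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Read" >> beam.io.ReadFromText(
               "gs://my-bucket/dfp-logs/2015-06-01/*-%02d.csv.gz" % hour)
         # Placeholder for the real two-field regex transform.
         | "Transform" >> beam.Map(lambda line: {"line": line})
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:adserver.logs_20150601_%02d" % hour,
               schema="line:STRING",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))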

Optimizing R code for ETL

I have both an R script and a Pentaho (PDI) ETL transformation for loading data from a SQL database and performing a calculation. The initial data set has 1.28 million rows of 21 variables and is equivalent in both R and PDI. In fact, I originally wrote the R code and then subsequently "ported" to a transformation in PDI.
The PDI transformation runs in 30s (and includes an additional step of writing the output to a separate DB table). The R script takes between 45m and one hour total. I realize that R is a scripting language and thus interpreted, but it seems like I'm missing some optimization opportunities here.
Here's an outline of the code:
Read data from a SQL DB into a data frame using sqlQuery() from the RODBC package (~45s)
str_trim() two of the columns (~2 - 4s)
split() the data into partitions to prepare for performing a quantitative calculation (separate function) (~30m)
run the calculation function in parallel for each partition of the data using parLapply() (~15-20m)
rbind the results together into a single resulting data frame (~10 - 15m)
I've tried using ddply() instead of split(), parLapply() and rbind(), but it ran for several hours (>3) without completing. I've also modified the SQL select statement to return an artificial group ID that is the dense rank of the rows based on the unique pairs of two columns, in an effort to increase performance. But it didn't seem to have the desired effect. I've tried using isplit() and foreach() %dopar%, but this also ran for multiple hours with no end.
The PDI transformation is running Java code, which is undoubtedly faster than R in general. But it seems that the equivalent R script should take no more than 10 minutes (i.e. 20X slower than PDI/Java) rather than an hour or longer.
Any thoughts on other optimization techniques?
Update: Step 3 above, split(), was resolved by using indexes as suggested here: Fast alternative to split in R
Update 2: I tried using mclapply() instead of parLapply(), and it's roughly the same (~25m).
Update 3: rbindlist() instead of rbind() runs in under 2s, which resolves step 5.