How to prevent a Spark Parquet scan on every query - apache-spark-sql

I created a Delta Lake table in Databricks using a SQL command like the following:
CREATE TABLE mytable
USING DELTA
LOCATION '/mnt/s3-mount-point/mytable/'
AS
SELECT
A,
B,
C
FROM t1
I then optimized the table:
OPTIMIZE mytable
ZORDER BY (A)
When I query the table using a simple query like the one below, Spark fires off 26,537 ScanParquet tasks. The table I'm working with is on the order of 3.5 TB stored in 9,000 files, and the cardinality of A is greater than 2 million:
SELECT
A,
B,
C
FROM mytable
WHERE A = 'value'
A few questions:
Why is the number of tasks to scan this table significantly greater than the number of files?
Why is every file being scanned if the table is Z-ordered on this column? Shouldn't the data for 'value' be colocated in the same file? In this case, rows with 'value' represent roughly 5E-7 of the whole data, so they should be located in the same Parquet file.
Why is the entire table scanned again when I perform the same query? This is true whether or not I cache the table in the Delta Cache. I can't cache it in memory since it would far exceed the memory available to the workers.

The files are optimized to ~1 GB each (most likely). Spark splits the files into 128 MB chunks for the read.
If you're using Databricks, I don't expect the entire table to be scanned. You can validate that it's not reading all the files by looking at the Spark UI: go to the SQL tab and check the number of files read in the scan stage.
Lazy evaluation re-runs the DAG. If you're using a Delta Cache-enabled instance type, it should read from the cache on the second read. Check the Spark UI again, go to the SQL tab, and look at the cache read metrics.
If you're seeing something that contradicts what I said, let me know, and I'd like to explore further.
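As a quick check (a minimal sketch, assuming a Databricks notebook and the table path from the question), you can confirm the split size that drives the task count and how many files the Delta table actually has:
# Default split size is 128 MB, which is why ~3.5 TB of data produces tens of thousands of scan tasks
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
# DESCRIBE DETAIL reports the current file count and total size of the Delta table
detail = spark.sql("DESCRIBE DETAIL delta.`/mnt/s3-mount-point/mytable/`")
detail.select("numFiles", "sizeInBytes").show()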

Related

Which method is more memory efficient: createOrReplaceTempView or saveAsTable?

I have a DataFrame created from a Hive table. I'm making some changes to it, and then saving it back to Hive as a new table. Which method should I use? Assume this DataFrame has 70 million records; I want to make the saving process memory and time efficient.
For example, with a DataFrame named df:
1. df.createOrReplaceTempView("new_table_view") followed by spark.sql("CREATE TABLE new_table AS SELECT * FROM new_table_view")
2. df.write.saveAsTable("new_table")
The way I see it, there's no way operation 1 can be more efficient.
createOrReplaceTempView creates a temporary table in memory; you can read about it in this previous question.
As such, between (1) reading from disk to create a temp table in memory and then writing that same table to disk, and (2) reading from disk and writing straight to disk, number 2 seems the obvious favorite.
If this answer doesn't satisfy you, you can always try both ways and check the total time and memorySeconds consumed in the YARN application UI.
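In case it helps to see option 2 spelled out, here is a minimal sketch; the mode and format are illustrative choices, not requirements:
# Write the transformed DataFrame directly as a managed Hive table; no intermediate temp view needed
(df.write
   .mode("overwrite")        # or "errorifexists" to avoid clobbering an existing table
   .format("parquet")        # pick whatever format your downstream readers expect
   .saveAsTable("new_table"))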

What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Our team drops parquet files on blob, and one of their main usages is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.
We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:
Cell1:
%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument('JobId')}/results_data.parquet"
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")
The cell following this can now start doing SQL queries. e.g.:
SELECT * FROM RESULTS LIMIT 5
This takes a bit of time to get set up, but not too much. I'm concerned about two things:
Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results DataFrame, which is only used to create the RESULTS temp view?
Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I don't have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?
For your first question:
Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.
As a quick example, you can create a table using SQL or Python:
-- SQL
CREATE TABLE <example-table>(id STRING, value STRING)
# Python
dataframe.write.saveAsTable("<example-table>")
Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.
-- SQL
SELECT * FROM <example-table>
# Python
spark.sql("SELECT * FROM <example-table>")
For your second question:
Performance depends on multiple factors, but in general here are some tips.
If your tables are large (tens or hundreds of GB at least), you can partition by a predicate commonly used by your analysts to filter data. For example, if they typically include a WHERE clause with a date range or a state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE right-sizes files for Spark and ZORDER improves data skipping (a short sketch follows below).
Choose Delta Cache Accelerated instance types for the cluster that your analysts will be working on.
I know you said there's no need to persist beyond the notebook, but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.
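Here is a minimal sketch that puts those tips together, assuming the parquet layout from the question; the table name, partition column, and Z-order column are placeholders:
# Load the job's parquet results and persist them as a partitioned Delta table
df = spark.read.parquet("/mnt/myjob/results_data.parquet")   # path is a placeholder
(df.write
   .format("delta")
   .partitionBy("state")                                      # a column analysts commonly filter on
   .saveAsTable("results_delta"))
# Compact small files and cluster by another commonly-filtered column to improve data skipping
spark.sql("OPTIMIZE results_delta ZORDER BY (event_date)")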

Understanding data scanned when querying ORC with Presto/Athena

I have a large amount of data in ORC files in AWS S3. The data in ORC files is sorted by uuid. I create an AWS Athena (Presto) table on top of them and run the following experiment.
First, I retrieve the first row to see how much data gets scanned:
select * from my_table limit 1
This query reports 18 MB of data being scanned.
I record the uuid from the row returned from the first query and run the following query:
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query reports 8.5 GB of data being scanned.
By design, both queries return the same result but the second query scans 500 times more data!
Any ideas why this is happening? Is this something inherent to ORC design or is it specific to how Presto interacts with S3?
[EDIT after ilya-kisil's response]
Let's change the last query to only select the uuid column:
select uuid from my_table where uuid=<FIRST_ROW_UUID> limit 1
For this query, the amount of data scanned drops to about 600 MB! This means that the bulk of the 8.5 GB scanned in the second query is attributed to gathering values from all columns for the record found and not to finding this record.
Given that all values in the record add up to no more than 1 MB, scanning almost 8 GB of data to put these values together seems extremely excessive. This seems like some idiosyncrasy of ORC or columnar formats in general and I am wondering if there are standard practices, e.g. ORC properties, that help reduce this overhead?
Well, this is fairly simple. The first query just picks a random record from your data; on top of that, it is not guaranteed to have read the very first record, since ORC files are splittable and can be processed in parallel. The second query, on the other hand, looks for a specific record.
Here is an analogy. Let's assume you have 100 coins, each with a UUID and some other info imprinted on its back. All of them are face up on a table, so you can't see their UUIDs.
select * from my_table limit 1
This query is like flipping some random coin, looking at what is written on its back, and putting it back on the table face up. Next, someone comes along and shuffles all of the coins.
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query is like wanting to look at the information written on the back of a specific coin. It is unlikely that you would flip the correct coin on your first try, so you would need to "scan" more coins (data).
One of the common ways to reduce the amount of data scanned is to partition your data, i.e. put it into separate "folders" (not files) in your S3 bucket. The "folder" names can then be used as virtual columns in your table definition, i.e. additional metadata for your table. Have a look at this post, which goes into more detail on how to optimise queries in Athena.
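For instance, one way to get such a layout with Spark is to rewrite the data partitioned by a coarse key derived from the uuid. This is only a hedged sketch: the bucket paths, column name, and prefix length are assumptions, and you would still need to register the partitions with Athena:
# Rewrite the ORC data into partition "folders" keyed on the first two characters of the uuid,
# so a point lookup only has to scan the matching folder
df = spark.read.orc("s3://my-bucket/orc_unpartitioned/")
(df.withColumn("uuid_prefix", df["uuid"].substr(1, 2))
   .write
   .partitionBy("uuid_prefix")
   .orc("s3://my-bucket/orc_partitioned/"))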

How to efficiently find the latest partition from a S3 dataset using Spark

I have a dataset that has data added almost every day, and it needs to be processed every day as part of a larger ETL.
When I select the partition directly, the query is really fast:
SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`
Yet, the issue is that the event type does not generate data on some Sundays, resulting in a missing partition for that particular day. Because of this, I cannot use the previous statement to run my daily job.
Another attempt led me to have Spark find the latest partition of the dataset, in order to be sure the bigger query wouldn't fail:
SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This works every time, but it is unbelievably slow.
I found numerous articles and reference on how to build and maintain partitions, yet nothing about how to read them correctly.
Any idea how to have this done properly?
(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reason for slowness
1. The subquery needs to be executed completely before the actual query execution starts.
2. The above query will list all the S3 files to retrieve the partition information. If the folder has a large number of files, this process will take a long time; the time taken for listing is directly proportional to the number of files.
We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme; let's say the name of the table is tbl.
You could perform an
ALTER TABLE tbl RECOVER PARTITIONS
which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing to make it faster.
Then we could fire
SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)
which will get the partition information from the metastore alone, without having to list the object store, which I believe is an expensive operation.
The cost incurred in this approach is that of recovering the partitions.
After that, all queries will be faster (when data for a new partition arrives, we need to recover the partitions again).
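Put together in PySpark, assuming the table tbl above already exists in the metastore (only the names come from this answer, the rest is a sketch):
# Refresh partition metadata (re-run when new daily partitions land), then query only the latest one
spark.sql("ALTER TABLE tbl RECOVER PARTITIONS")
latest = spark.sql("SELECT MAX(partition) AS p FROM tbl").first()["p"]
daily_df = spark.table("tbl").where(f"partition = '{latest}'")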
Workaround when you don't have Hive:
import org.apache.hadoop.fs.{FileSystem, Path}; import java.net.URI
FileSystem.get(URI.create("s3://datalake/partitoned_table"), spark.sparkContext.hadoopConfiguration).listStatus(new Path("s3://datalake/partitoned_table/"))
The above code will give you a list of partition directories, for example: List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21, ...)
This is very efficient because it only fetches metadata from the S3 location.
Just take the latest partition and use it in your SQL.
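If you would rather stay in Python, a hedged equivalent using boto3 could look like the following (the bucket and prefix are taken from the question's path; everything else is a sketch):
import boto3
# List only the top-level partition "folders" under the dataset prefix (metadata only, no data is read)
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="datalake", Prefix="partitoned_table/", Delimiter="/")
prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
# e.g. ['partitoned_table/partition=2019-05-20/', 'partitoned_table/partition=2019-05-21/', ...]
latest = max(prefixes)        # lexicographic max works for yyyy-MM-dd partition values
df = spark.read.json(f"s3://datalake/{latest}")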

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many), and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what should I look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst the nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on every node for distribution, so this shouldn't have been too bad.
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of query A. It is possible that Redshift was able to optimize the count query by reading very little data from disk, because it realized that very few columns (or perhaps none) needed to be read to produce the count. This really depends on the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (and that wasn't fully processed as part of the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
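If redistribution turns out to be the culprit, a common workaround is to create the target table with explicit keys and then INSERT the query results into it, rather than letting CTAS pick defaults. The following is only a hedged sketch: psycopg2, the connection details, and the table/column definitions are illustrative assumptions, not taken from the question:
import psycopg2
# Stand-in for query A; the real query would go here
QUERY_A = "SELECT t1.id, t1.amount FROM t1 LEFT JOIN t2 ON t1.id = t2.id"
conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="mydb", user="me", password="secret")
with conn, conn.cursor() as cur:
    # Declare the target table up front so DISTSTYLE/SORTKEY match the source tables
    cur.execute("""
        CREATE TABLE tbl (
            id     BIGINT,
            amount NUMERIC(18, 2)
        )
        DISTSTYLE ALL
        SORTKEY (id);
    """)
    # Load it with a plain INSERT ... SELECT instead of CTAS
    cur.execute(f"INSERT INTO tbl {QUERY_A};")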
CREATE TABLE AS writes all of the data returned by the query to disk, while the count query does not; that explains the difference. Writing all of the rows is a far more expensive operation than computing a row count.