BigQuery's hive partitioned table performance - google-bigquery

I am trying to take advantage of hive-partitioned external tables, and I have run into the problem that querying Parquet files directly from GCS is several times faster than querying the same data through the hive-partitioned external table.
My data is stored in Parquet format with the following structure:
gs://mybucket/dataset/dt=2019-06-17/h=5/m=0/000
gs://mybucket/dataset/dt=2019-06-17/h=5/m=0/001
gs://mybucket/dataset/dt=2019-06-17/h=5/m=0/...
"h" stands for an hour, and "m" stands for a minute. "mybucket" is on region "us-central1".
Querying parquet files directly takes 3 seconds:
bq --project_id=chronosphere-production --location us-central1 query --nouse_cache --use_legacy_sql=false --external_table_definition='trace::PARQUET=gs://mybucket/dataset/dt=2019-06-17/h=5/*' "SELECT name, count(*) AS c FROM trace GROUP BY name ORDER BY c DESC LIMIT 20"
The other query runs on the same data but uses the hive-partitioned table, whose source URI is gs://mybucket/dataset/{dt:DATE}/{h:INTEGER}/{m:INTEGER}. It takes 12 seconds:
bq --location us-central1 query --nouse_cache --use_legacy_sql=false "SELECT name, count(*) as c FROM \`dataset.hive_table\` WHERE dt='2019-06-17' AND h=5 GROUP BY name ORDER BY c DESC LIMIT 20"
Both queries scan the same amount of data/rows and return the same result, but the difference in response time is huge. Any ideas what the reason for such a big difference could be?
BTW, if I create a non-hive-partitioned table that points to gs://mybucket/dataset/dt=2019-06-17/h=5, it performs as well as querying the Parquet files directly. I think the small remaining gap is fine, as it is just temporary-table vs. permanent-table performance.
Any help would be very appreciated.
EDIT:
It feels like it is related to the file count, but I'm still not sure what the root cause is and whether it can be solved.
Here are some folder/file count numbers:
dt=* folder count = 3
h=* folder count per dt folder = 24
m=* folder count per h folder = 60
files per m folder ~40
My query scans ~32M rows / 500Mb of data.
I assume that when I provide a filter like WHERE dt='2019-06-17' AND h=5, BigQuery should go directly to gs://mybucket/dataset/dt=2019-06-17/h=5/ and start searching for files from there, but it seems that is not what it does.

I'm guessing there's a significant number of files in the bucket, as this is largely a comparison of Cloud Storage object-listing performance. It's not clear from the description how many objects are involved or how they're distributed across your partitioning scheme. Is that a typical date, or an unusual one in terms of data distribution?
In the hive-partitioned case, BigQuery must get the larger list of bucket objects (e.g. gs://mybucket/dataset/*) and then filter it, whereas in the non-hive cases you describe, you're effectively pushing a more targeted filter down to Cloud Storage's list operations (e.g. gs://mybucket/dataset/dt=2019-06-17/h=5/*).
The calculus here is whether the performance implications outweigh other factors like convenience, manageability, etc. There's likely a middle ground to consider as well: experiment with using more of the dt prefix to define yearly or monthly tables and see whether that gives a more satisfactory performance tradeoff.
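For instance, one way to probe that middle ground is a temporary external table whose glob covers a whole month, mirroring the command from the question (a sketch; the monthly glob and table name are hypothetical):
bq --location us-central1 query --nouse_cache --use_legacy_sql=false --external_table_definition='trace_201906::PARQUET=gs://mybucket/dataset/dt=2019-06-*' "SELECT name, count(*) AS c FROM trace_201906 GROUP BY name ORDER BY c DESC LIMIT 20"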

Related

What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Our team drops Parquet files on blob storage, and one of their main uses is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They do this in Azure Databricks.
We've mounted the blob storage and can access the Parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:
Cell1:
%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument('JobId')}/results_data.parquet"  # single quotes inside the f-string
df_results = spark.read.load(results_path)  # Parquet is the default format on Databricks
df_results.createOrReplaceTempView("RESULTS")
The cell following this one can then run SQL queries, e.g.:
SELECT * FROM RESULTS LIMIT 5
This takes a bit of time to get going, but not too much. I'm concerned about two things:
Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results DataFrame, which is only used to create the RESULTS temp view?
Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I don't have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?
For your first question:
Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.
As a quick example, you can create a table using SQL or Python:
# SQL
CREATE TABLE <example-table>(id STRING, value STRING)
# Python
dataframe.write.saveAsTable("<example-table>")
Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.
# SQL
SELECT * FROM <example-table>
# Python
spark.sql("SELECT * FROM <example-table>")
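Incidentally, if the goal is just to skip creating the DataFrame and temp view, Spark SQL can also query a Parquet path directly (a sketch; the mount path mirrors the question, with the JobId filled in by hand):
# SQL
SELECT * FROM parquet.`/mnt/<JobId>/results_data.parquet` LIMIT 5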
For your second question:
Performance depends on multiple factors, but in general, here are some tips.
If your tables are large (at least tens or hundreds of GB), you can partition by a predicate commonly used by your analysts to filter data. For example, if you typically include a WHERE clause with a date range or state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark and ZORDER improves data skipping.
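As a minimal sketch of those two tips (table and column names here are hypothetical): create a Delta table partitioned by a commonly filtered column, then compact and Z-order it for better data skipping.
# SQL
CREATE TABLE analyst_results
USING DELTA
PARTITIONED BY (report_date)
AS SELECT * FROM RESULTS;

OPTIMIZE analyst_results ZORDER BY (customer_id);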
Choose Delta Cache Accelerated instance types for the cluster your analysts will be working on.
I know you said there's no need to persist beyond the notebook but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.

GCP BigQuery - LIMIT but full table read - How to limit queried data to a minimum

It looks like LIMIT has no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How can I limit the amount of queried data to a minimum (even though one whole partition would probably always be needed),
without using "preview" or similar,
and without knowing the partitioning/clustering of the data?
And how can I check the real approximate amount before executing a query?
The execution details state that only 163514 rows were queried as input (not 244928379 rows).
If you want to limit the amount of data BigQuery uses for a query, you have these two options:
Table Partitioning
BigQuery can partition data either by a DATE/DATETIME/TIMESTAMP column you provide or by ingestion time (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo INT64, txdate DATE)
PARTITION BY txdate
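With that in place, a filter on the partitioning column prunes all other partitions, so only the matching one is scanned and billed (a sketch using the column names from the DDL above):
SELECT foo FROM mydataset.mytable WHERE txdate = DATE '2021-06-01'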
Wildcard tables (sharding: splitting the data into multiple tables)
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' with a similar schema (usually people use the same one). For instance, dataset.tablename_eur for European data and dataset.tablename_jap for data from Japan.
You can query one of those tables directly: SELECT col1, col2 FROM dataset.tablename_eur; or all of them at once: SELECT col1, col2 FROM `dataset.tablename_*`
Wildcard tables can also be partitioned by date.
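To restrict a wildcard query (and thus the bytes billed) to specific shards, you can filter on the _TABLE_SUFFIX pseudo-column; a sketch with the hypothetical shard names from above:
SELECT col1, col2 FROM `dataset.tablename_*` WHERE _TABLE_SUFFIX IN ('eur', 'jap')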
You pay for the volume of data loaded into the workers. Of course, if you do nothing in your request and just ask for the first 20 results, the query stops earlier and not all the data is processed, but it is all loaded, and you pay for that!
Have a look at this; I ran a similar request:
Now, let's go to the logs:
The total bytes billed is ~800 MB
So you have to think differently when you work with BigQuery: it's an analytics database, not designed for small requests (too slow to start; the latency is at least 500 ms due to worker warm-up).
My table contains 3M+ rows, and only 10% were processed.
And you pay for the reservation and the load cost (moving data has a cost, and reserving slots also has a cost).
That's why there are a lot of tips for saving money on Google BigQuery. Some examples come from a former BigQuery Dev Advocate.
As of December 2021, I notice that SELECT * with LIMIT will not scan the whole table, and you pay only for a small number of rows; obviously, if you add ORDER BY, it will scan everything.
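One more note on the last part of the question (checking the approximate amount before execution): the bq CLI supports a dry run, which reports the estimated bytes processed without actually running, or billing, the query. A sketch:
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM `bigquery-public-data.github_repos.commits`'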

Understanding data scanned when querying ORC with Presto/Athena

I have a large amount of data in ORC files in AWS S3. The data in ORC files is sorted by uuid. I create an AWS Athena (Presto) table on top of them and run the following experiment.
First, I retrieve the first row to see how much data gets scanned:
select * from my_table limit 1
This query reports 18 MB of data being scanned.
I record the uuid from the row returned from the first query and run the following query:
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query reports 8.5 GB of data being scanned.
By design, both queries return the same result but the second query scans 500 times more data!
Any ideas why this is happening? Is this something inherent to ORC design or is it specific to how Presto interacts with S3?
[EDIT after ilya-kisil's response]
Let's change the last query to only select the uuid column:
select uuid from my_table where uuid=<FIRST_ROW_UUID> limit 1
For this query, the amount of data scanned drops to about 600 MB! This means that the bulk of the 8.5 GB scanned in the second query is attributed to gathering values from all columns for the record found and not to finding this record.
Given that all values in the record add up to no more than 1 MB, scanning almost 8 GB of data to put these values together seems extremely excessive. This seems like some idiosyncrasy of ORC or columnar formats in general and I am wondering if there are standard practices, e.g. ORC properties, that help reduce this overhead?
Well, this is fairly simple: the first query just picks an arbitrary record from your data. On top of that, it is not guaranteed to read the very first record, since ORC files are splittable and can be processed in parallel. The second query, on the other hand, looks for a specific record.
Here is an analogy. Let's assume you have 100 coins with a UUID and some other info imprinted on their backs. All of them are face up on a table, so you can't see their UUIDs.
select * from my_table limit 1
This query is like flipping some random coin, reading what is written on the back, and putting it back on the table face up. Then someone comes along and shuffles all the coins.
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query is like wanting to read the information written on the back of one specific coin. It is unlikely that you would flip the correct coin on your first try, so you would need to "scan" more coins (data).
One of the common ways to reduce the amount of scanned data is to partition your data, i.e. put it into separate "folders" (not files) in your S3 bucket. Then the "folder" names can be used as virtual columns within your table definition, i.e. additional metadata for your table. Have a look at this post, which goes into more detail on how to optimise queries in Athena.
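As a sketch of that idea (table, column, and bucket names are hypothetical), a partitioned Athena table over the same data could look like this, with partition metadata loaded afterwards via MSCK REPAIR TABLE:
CREATE EXTERNAL TABLE my_table_partitioned (
  uuid string
)
PARTITIONED BY (dt string)
STORED AS ORC
LOCATION 's3://my-bucket/my-table/';

MSCK REPAIR TABLE my_table_partitioned;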

How to efficiently find the latest partition from a S3 dataset using Spark

I have a dataset that has data added almost every day, and it needs to be processed every day as part of a larger ETL.
When I select the partition directly, the query is really fast:
SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`
Yet, the issue is that the event type does not generate data on some Sundays, resulting in a non-existent partition on those days. Because of this, I cannot use the previous statement to run my daily job.
Another attempt was to have Spark find the latest partition of the dataset, to be sure the bigger query wouldn't fail:
SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This works every time, but it is unbelievably slow.
I found numerous articles and reference on how to build and maintain partitions, yet nothing about how to read them correctly.
Any idea how to have this done properly?
(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reason for slowness
1. The subquery needs to execute completely before the actual query execution starts.
2. The above query lists all the S3 files to retrieve the partition information. If the folder has a large number of files, this takes a long time; listing time is directly proportional to the number of files.
We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme; let's say the table is named tbl.
You could perform an
ALTER TABLE tbl RECOVER PARTITIONS
which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing faster.
Then we could run
SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)
which gets the partition information from the metastore alone, without having to list the object store, which I believe is the expensive operation.
The cost incurred in this approach is that of recovering partitions, after which all queries will be faster. (When data for a new partition arrives, we need to recover partitions again.)
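If the table is registered this way, another metastore-only option for finding the newest partition is SHOW PARTITIONS (a sketch; partition values like partition=2019-05-20 sort lexicographically, so max works for ISO dates):
spark.sql("SHOW PARTITIONS tbl").collect().map(_.getString(0)).max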
Workaround when you don't have Hive:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = spark.sparkContext.hadoopConfiguration
FileSystem.get(URI.create("s3://datalake/partitoned_table"), conf).listStatus(new Path("s3://datalake/partitoned_table/"))
The above code will give you the list of partition directories, for example: List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21, ...)
This is very efficient because it only fetches metadata from the S3 location.
Just take the latest partition and use it in your SQL.
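Putting it together (a sketch continuing from the imports above; the partition= directory layout is assumed from the example listing):
val base = "s3://datalake/partitoned_table"
val fs = FileSystem.get(URI.create(base), spark.sparkContext.hadoopConfiguration)
val latest = fs.listStatus(new Path(base))
  .map(_.getPath.getName)              // e.g. "partition=2019-05-20"
  .filter(_.startsWith("partition="))
  .map(_.stripPrefix("partition="))
  .max                                 // ISO dates sort correctly as strings
spark.sql(s"SELECT * FROM JSON.`$base/partition=$latest`")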

Table size in AWS-Athena

Is there a SQL-based way to retrieve the size of all tables within a database in AWS Athena?
I'm more familiar with MSSQL, where it is relatively easy to write such a query.
The quick way is via S3: ... > Show Properties > Location, then look up the size in the S3 console.
Explainer
You can run SELECT * FROM some_table for each table and look at the result metadata for the amount of data scanned, but it would be an expensive way to do it.
Athena doesn't really know about the data in your tables the way an RDBMS does; it's only when you query a table that Athena goes out to look at the data. It's really S3 that you should ask. You can list all objects in the location(s) of your tables and sum their sizes, but that might be a time-consuming way of doing it if there are many objects.
The least expensive and least time-consuming way, when there are many hundreds of thousands of objects, is to enable S3 Inventory on the bucket that contains the data for your tables, then use the inventory to sum up the sizes for each table. You can get the inventory in CSV, ORC, or Parquet format, and all of them work well with Athena, so even if you have a lot of files in your bucket you can still query the inventory very efficiently.
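For example, with an Athena table built over the inventory files (a sketch; the table name and key prefix are hypothetical, but key and size are standard inventory fields):
SELECT sum(size) / 1e9 AS gb
FROM s3_inventory
WHERE key LIKE 'warehouse/my_table/%'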
You can read more about S3 Inventory here: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html