What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Our team drops parquet files on blob, and one of their main usages is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.
We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:
Cell1:
%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument('JobId')}/results_data.parquet"
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")
The cell following this can now start doing SQL queries. e.g.:
SELECT * FROM RESULTS LIMIT 5
This takes a bit of time to spin up, but not too much. I'm concerned about two things:
Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results DataFrame, which is only used to create the RESULTS temp view?
Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I don't have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?

For your first question:
Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.
As a quick example, you can create a table using SQL or Python:
# SQL
CREATE TABLE <example-table>(id STRING, value STRING)
# Python
dataframe.write.saveAsTable("<example-table>")
Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.
# SQL
SELECT * FROM <example-table>
# Python
spark.sql("SELECT * FROM <example-table>")
For your second question:
Performance depends on multiple factors but in general, here are some tips.
If your tables are large (tens, hundreds of GB at least), you can partition by a predicate commonly used by your analysts to filter data. For example, if you typically include a WHERE clause that includes a date range or state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark and ZORDER improves data skipping.
Choose Delta Cache accelerated instance types for the cluster that your analysts will be working on.
I know you said there's no need to persist beyond the notebook, but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on; a short sketch of that approach follows.
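As a hedged illustration only (the schema, table, and column names below are assumptions, not from your setup), converting the Parquet output to a Delta table and optimizing it might look like:
-- Create a Delta table from the existing Parquet files (names are illustrative).
CREATE TABLE analytics.results
USING DELTA
AS SELECT * FROM parquet.`/mnt/job_123/results_data.parquet`;
-- Compact small files and co-locate rows on a commonly filtered column.
OPTIMIZE analytics.results ZORDER BY (customer_id);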

Related

Hive performance improvement

I want to join a 1 TB table with another table that also has 1 TB of data in Hive. Could you please suggest some best practices to follow?
I want to know how performance improves in Hive if both tables are partitioned. Basically, how does MapReduce work in this case?
Below are a few performance-improvement rules to follow when dealing with large data:
Tez Execution Engine in Hive, or Hive on Spark
Use the Tez execution engine (Hortonworks) to increase the performance of your Hive queries. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs of general data-processing tasks; you can think of it as a more flexible and powerful successor to the MapReduce framework.
Alternatively, use Hive on Spark (Cloudera).
Tez also offers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads, allowing data-access applications to work with petabytes of data across thousands of nodes.
Enable one of the engines with:
SET hive.execution.engine=tez;
SET hive.execution.engine=spark;
Use a Suitable File Format in Hive
Choosing an appropriate file format for your data can drastically increase query performance. The ORC (Optimized Row Columnar) file format is usually the best choice: it stores data more efficiently than other formats and can reduce the size of the original data by up to 75%, which also speeds up processing. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. It stores row data in groups called stripes, along with a file footer, which helps Hive process the data efficiently.
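For example, an existing text-format table can be rewritten as ORC with a CTAS statement (the table names here are illustrative):
-- Rewrite a text-format table into ORC for faster scans and smaller storage.
CREATE TABLE consumers_orc STORED AS ORC AS
SELECT * FROM consumers_text;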
Hive Partitioning
Without partitioning, Hive reads all the data in the table's directory and only then applies the query filters. Because all of the data has to be read, this is slow and expensive.
Users frequently need to filter the data on specific column values, and to apply partitioning effectively they need to understand the domain of the data they are analyzing.
With partitioning, the entries for the different values of the partition column are segregated and stored in their respective partitions. When a query fetches values from the table, only the required partitions are read, which reduces the time the query takes to return a result.
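As a rough sketch (table and column names are assumptions), a table partitioned by date only scans the matching partition directories when queries filter on that column:
CREATE TABLE transactions (
  txn_id      BIGINT,
  consumer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (txn_date STRING)
STORED AS ORC;
-- Only the partition for 2024-01-15 is read:
SELECT SUM(amount) FROM transactions WHERE txn_date = '2024-01-15';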
Bucketing in Hive
Suppose you have a huge dataset and, even after partitioning on a particular field or fields, the partitions are still larger than expected and you want to divide them further. To address this, Hive offers bucketing, which lets you split a table's data into more manageable parts.
With bucketing, the user can also control the number of those parts (buckets).
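A sketch of a partitioned and bucketed table definition (the names and bucket count are illustrative):
CREATE TABLE transactions_bucketed (
  txn_id      BIGINT,
  consumer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (consumer_id) INTO 64 BUCKETS
STORED AS ORC;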
Vectorization In Hive
Vectorized query execution improves the performance of operations such as scans, aggregations, filters, and joins by performing them in batches of 1024 rows at a time instead of a single row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time and is easily enabled with two settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization in Hive (CBO)
Hive optimizes each query's logical and physical execution plan before submitting it for final execution. Historically, these optimizations were not based on the cost of the query.
Cost-based optimization (CBO), a more recent addition to Hive, performs further optimizations based on query cost, which can result in different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.
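For example (the table name is illustrative):
ANALYZE TABLE consumers_orc COMPUTE STATISTICS;
ANALYZE TABLE consumers_orc COMPUTE STATISTICS FOR COLUMNS;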
Hive Indexing
Indexing is one of the best ways to increase query performance. Creating an index on a table produces a separate index table that acts as a reference to the original table.
A Hive table can have a large number of rows and columns. Without an index, queries that only touch a few columns still have to scan all the data in the table, which takes a large amount of time.
The major advantage of indexing is that a query on an indexed table does not need to scan all the rows: it checks the index first and then goes to the relevant data to perform the operation.
Maintaining indexes therefore makes it easier for a Hive query to look at the index first and complete the needed operation in less time.
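A sketch of creating and building an index (the names are illustrative; note that, as far as I know, Hive indexes were removed in Hive 3.0, so check your version):
CREATE INDEX idx_consumer_id
ON TABLE transactions (consumer_id)
AS 'COMPACT' WITH DEFERRED REBUILD;
-- Populate (build) the index:
ALTER INDEX idx_consumer_id ON transactions REBUILD;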

How to relationalize, join and aggregate multiple files from S3

I have a bucket in S3 containing hundreds of folders, each of which contains files with the same structure: CSV representations of relational DB tables. The folders differ in the content of the data, but overlaps might occur.
In each folder, I want to join 3 tables and store the output in a dedicated table. The dedicated table should eventually hold the joined data from all the folders. Duplicates might occur between folders, but the records have a unique key that can help with the aggregation.
The data for a specific folder, across all its files, can reach 5 GB of disk space. Two of the files contain hundreds of thousands of records. The third file can reach up to 20M records.
The result should be stored in AWS RDS, on a PostgreSQL instance. However, I am considering switching to Redshift. Would it be better for this scale of data?
The 3 tables are:
Providers
Consumers
Transactions
All of them are indexed by the key which is used in the join.
My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the DB. Then, create the joined table for the 3 tables using SQL, and finally add the joined data to the aggregated table that should contain the data from all folders.
I am currently trying to handle 1 folder, in order to understand better how to make the process optimal, both in time and space.
After loading, I noticed the DB uses around 2x the disk space I expected. Why does joining cost so much disk space? Is there a way of loading and joining with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried to use CREATE UNLOGGED TABLE but it didn't have much effect.
CREATE UNLOGGED TABLE agg_data AS
SELECT *
FROM transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);
This works OK for 1 folder, time-wise. It does take a lot more disk space than I assumed it would.
How will this work at a mass scale, for hundreds of folders? How will the aggregation behave over time, as I will need to search for duplicated records in a continuously growing table?
To summarize my questions:
How to choose between RDS and Redshift? My concerns are tens of millions of records in the target table, and the need to drop duplicates while adding new data to the target table.
Why does joining data take so much DB storage? Is there a way to minimize it, for data that is temporary?
What is an efficient way of inserting new data to the destination table while dropping duplications?
Will it be better to join and store the files in S3 using AWS Glue, and then load them to the target db? Currently it does not seem like an option, as Glue takes forever to join the data.
I would recommend using Amazon Athena to join the files and produce the desired output.
First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing at the folder. All files within the folder will be treated as containing data for the table and they should all be of the same format.
If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.
Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.
See: Creating a Table from Query Results (CTAS) - Amazon Athena
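As a rough sketch of such a CTAS query (the database, output bucket, and column names are assumptions):
CREATE TABLE mydb.joined_results
WITH (
  format = 'PARQUET',
  external_location = 's3://my-output-bucket/joined/'
) AS
SELECT t.*, c.consumer_name, p.provider_name
FROM transactions t
JOIN consumers c USING (consumer_id)
JOIN providers p USING (provider_id);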
Glue can also be used to program ETL scripts in Python (see Program AWS Glue ETL Scripts in Python - AWS Glue), but I haven't tried this, so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.
Once you have the output data, you can then load it into the database of your choice. Which database you choose depends upon your use case. I would suggest starting with Amazon RDS for PostgreSQL since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (e.g. billions of rows instead of millions), you can move to Amazon Redshift.
General comment: It is rather strange that you wish to join those 3 tables since there will presumably be a lot of duplicated data (very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective as to which columns you wish to include.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table and, to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add a partitioning column and clustering columns (BigQuery allows up to four) as well, and possibly even use yearly suffixed tables. That way you have several dimensions that let you scan only a limited number of rows for rematerialization, as sketched below.
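A minimal sketch of such a table definition (the dataset, table, and column names are assumptions):
CREATE TABLE mydataset.raw_events (
  event_time TIMESTAMP,
  event_type STRING,
  user_id    STRING,
  payload    STRING   -- the JSON blob
)
PARTITION BY DATE(event_time)
CLUSTER BY event_type, user_id;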
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to Pub/Sub or Dataflow, process them there, and write to BigQuery with the new schema. Such a pipeline can create tables on the fly with whatever schema you code into your engine.
By the way, you can remove columns; that's rematerialization: you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
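For example, rematerializing a table to drop a deprecated column could look like this (names are illustrative):
CREATE OR REPLACE TABLE mydataset.raw_events AS
SELECT * EXCEPT (deprecated_form_field)
FROM mydataset.raw_events;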
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
read the event/JSON from Pub/Sub
flatten the events and filter down to the columns you want to insert into the BQ table
with Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types); in Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON
get the failed insert records from Dynamic Destinations and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues)
read the file, update the schema once, and load the file into that BQ table
I have implemented this logic in my use case and it is working perfectly fine.

Use case of using Big Query or Big table for querying aggregate values?

I have a use case: designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the options below would be better for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing the HBase shell on Compute Engine to query the Bigtable data.
Based on my analysis below for this specific use case, I see that Cloud Storage can be queried through BigQuery. Also, Bigtable supports CSV imports and querying. The BigQuery quotas documentation mentions a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, which I assume means I could run multiple load jobs if loading more than 15 TB.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above usecase?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive cost improvements on the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and costs.
Instead of CREATE TABLE you can do imports via the API; those are free (instead of paying the query cost of a CREATE TABLE).
15 TB can be handled easily by BigQuery.
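A minimal sketch of that flow (the dataset, bucket, schema, and column names are assumptions):
-- External table reading the CSVs straight from GCS:
CREATE EXTERNAL TABLE mydataset.raw_events_csv (
  event_time TIMESTAMP,
  user_id    STRING,
  value      FLOAT64
)
OPTIONS (
  format = 'CSV',
  uris   = ['gs://my-bucket/events/*.csv']
);
-- Materialize into a native, partitioned and clustered table for better performance and cost:
CREATE TABLE mydataset.events
PARTITION BY DATE(event_time)
CLUSTER BY user_id
AS SELECT * FROM mydataset.raw_events_csv;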

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know, there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.__TABLES__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __TABLES__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__TABLES__]
WHERE table_id = 'wikipedia'
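And spelling out the first suggestion (copying the table while adding a timestamp) as a legacy SQL query, with illustrative field and table names, run with allow_large_results and a destination table set:
SELECT
  field1,
  field2,
  CURRENT_TIMESTAMP() AS load_time
FROM [mydataset.temp_table]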
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).