I am trying to learn Apache Hive and was going through O'Reilly's Programming Hive, and I had some trouble understanding partitioning in Hive. The following is the query:
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
Here I am creating partitions based on country and state. But there is no field called country among the table's columns, so how does partitioning work in this case? How does Hive manage this?
Also, can anyone please share some datasets to work on?
How is the data loaded into this kind of table?
PARTITIONED BY doesn't mean that you are going to split the data based on existing columns, but that you are going to add these "columns" as a way of organizing your table (or, more precisely, the file structure that stores the data).
The partition keys affect how Hive stores the data. In this case, Hive creates nested subfolders under "employees", one level per partition key and one folder per value (for example country=US/state=CA), and exposes the partition keys as regular columns, which you can use in your (more efficient) SELECT queries (WHERE country = something AND state = other), as well as in your data loads.
By specifying these keys in your loads and selects, Hive can store and retrieve the data faster, because a query that filters on them only has to read the matching folders.
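To make the folder layout concrete, here is a minimal Python sketch (not Hive itself) of how partition key values map to directories and why a filter on those keys lets whole folders be skipped. The warehouse path and the helper names partition_dir and prune are assumptions made up for illustration. In Hive, a load would name the target partition explicitly, along the lines of LOAD DATA ... INTO TABLE employees PARTITION (country = 'US', state = 'CA').

```python
from pathlib import PurePosixPath


def partition_dir(table_root, partition_spec):
    """Build the directory for one partition, e.g. .../country=US/state=CA."""
    path = PurePosixPath(table_root)
    for key, value in partition_spec:
        path = path / f"{key}={value}"
    return str(path)


def prune(partition_dirs, **filters):
    """Keep only directories whose key=value segments match every filter."""
    wanted = {f"{key}={value}" for key, value in filters.items()}
    return [d for d in partition_dirs if wanted <= set(d.split("/"))]


root = "/user/hive/warehouse/employees"  # assumed warehouse path
dirs = [
    partition_dir(root, [("country", "US"), ("state", "CA")]),
    partition_dir(root, [("country", "US"), ("state", "NY")]),
    partition_dir(root, [("country", "CA"), ("state", "ON")]),
]

# Only one of the three folders has to be read for this filter.
us_ca = prune(dirs, country="US", state="CA")
```

The partition values live in the folder names rather than in the data files, which is why country does not need to appear in the column list.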
Kaggle competitions have plenty of datasets (on many different topics) that you can use. They are mainly aimed at machine learning, but nothing prevents you from using them for your own practice.
Our team drops parquet files on blob, and one of their main usages is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.
We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:
Cell1:
%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument('JobId')}/results_data.parquet"
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")
The cell following this can now start doing SQL queries. e.g.:
SELECT * FROM RESULTS LIMIT 5
This takes a bit of time to get up, but not too much. I'm concerned about two things:
Am I loading this in the most efficient way possible, or is there a way to skip creating the df_results DataFrame, which is only used to create the RESULTS temp view?
Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?
For your first question:
Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.
As a quick example, you can create a table using SQL or Python:
# SQL
CREATE TABLE <example-table>(id STRING, value STRING)
# Python
dataframe.write.saveAsTable("<example-table>")
Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.
# SQL
SELECT * FROM <example-table>
# Python
spark.sql("SELECT * FROM <example-table>")
For your second question:
Performance depends on multiple factors but in general, here are some tips.
If your tables are large (tens or hundreds of GB at least), you can partition by a column your analysts commonly filter on. For example, if queries typically include a WHERE clause with a date range or a state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark and ZORDER improves data skipping.
Choose Delta Cache Accelerated instance types for the cluster that your analysts will be working on.
I know you said there's no need to persist beyond the notebook but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.
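If a notebook-scoped view is still preferred, one possible sketch is to define the temp view directly over the Parquet path in Spark SQL and cache it, which avoids the intermediate DataFrame variable. The helper name temp_view_statements and the job id value are assumptions for illustration. The statements are only built as strings here, so the snippet runs without a cluster.

```python
def temp_view_statements(view_name, parquet_path, cache=True):
    """Return Spark SQL that defines a temp view directly over a Parquet path."""
    statements = [
        f"CREATE OR REPLACE TEMPORARY VIEW {view_name} "
        f"USING parquet OPTIONS (path '{parquet_path}')"
    ]
    if cache:
        # CACHE TABLE is eager unless LAZY is given: it scans the data once up front.
        statements.append(f"CACHE TABLE {view_name}")
    return statements


job_id = "1234"  # hypothetical JobId widget value
stmts = temp_view_statements("RESULTS", f"/mnt/{job_id}/results_data.parquet")

# In a Databricks notebook, where `spark` is predefined:
# for stmt in stmts:
#     spark.sql(stmt)
```

Caching only pays off if the data fits in cluster memory and is queried repeatedly, so it is worth measuring before adopting it.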
I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.
The usual way of doing this would be to create a time-limited view on top of the table, e.g.:
CREATE VIEW time_limited_view
AS SELECT * FROM my_table
WHERE partition_date >= '2020-01-01'
This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:
AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type
The reason for this is that Impala does not allow complex types in the select list. Any view I build which selects the complex columns will cause errors like this. If I flatten/unnest the complex types, this would of course get around this issue. However due to the layers of nesting involved I would like to keep the table structure as is.
Another suggested workaround I have seen is Ranger row-level filtering, but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated.
While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.
Rather than using a view, I can simply use an external table. So firstly I would create a table in database_1 using Hive, which has a corresponding location, location_1, in hdfs. This is my "production" database/table which I use for ETL and contains a very large amount of data. Only certain users have access to this database.
CREATE TABLE database_1.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Next, I create a second, external table in the same location in hdfs. However this table is stored in a database with a much broader user group (database_2).
CREATE EXTERNAL TABLE database_2.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore and, as far as end users can tell, that's all that is available in the table. I can even make this part of my ETL job: each time new data is added, I add that partition to the external table and then drop the partition from a week ago. The result is a rolling window of one week's data made available to this user group without having to duplicate a load of data to a separate location.
This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.
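As a rough sketch of the rolling-window step in the ETL job, the following Python builds the two Hive statements for a given load date. The helper name rolling_window_ddl is made up for illustration. It also assumes the partition folders under location_1 follow the default date_col=YYYY-MM-DD naming, so that ADD PARTITION needs no explicit LOCATION clause.

```python
from datetime import date, timedelta


def rolling_window_ddl(table, partition_col, newest, window_days=7):
    """Return DDL that exposes `newest` and retires the day that left the window."""
    expired = newest - timedelta(days=window_days)
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_col}='{newest.isoformat()}')",
        f"ALTER TABLE {table} DROP IF EXISTS "
        f"PARTITION ({partition_col}='{expired.isoformat()}')",
    ]


# After loading 2020-01-08, expose it and hide 2020-01-01.
ddl = rolling_window_ddl("database_2.tablename", "date_col", date(2020, 1, 8))
```

Running both statements after each daily load keeps exactly window_days partitions visible in database_2, while database_1 still sees everything.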
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, adding a new page, form, or user-info field to the website results in new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're going to eventually result in tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Then the batch jobs that we run every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our bigquery schema based off the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add 1 partitioning column and up to 4 clustering columns (the most BigQuery allows) as well, and possibly even use yearly suffixed tables. This way you have several dimensions along which to limit the number of rows scanned for rematerialization.
The other would be to change your model, and do an event processing middle-layer. You could first wire all your events either to Dataflow or Pub/Sub then process it there and write to bigquery as a new schema. This script would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, where you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
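A sketch of what such a rematerialization query could look like, built as a string in Python. The table name, the column names, and the helper rematerialize_sql are hypothetical. SELECT * EXCEPT only covers top-level columns, so dropping nested fields would need a more involved query. If the table is partitioned or clustered, the statement most likely has to repeat the same PARTITION BY / CLUSTER BY clauses, which is what table_options is for.

```python
def rematerialize_sql(table, drop_columns, table_options=""):
    """Build a BigQuery statement that rewrites `table` without `drop_columns`."""
    if not drop_columns:
        raise ValueError("drop_columns must not be empty")
    except_list = ", ".join(drop_columns)
    options = f" {table_options}" if table_options else ""
    return (
        f"CREATE OR REPLACE TABLE `{table}`{options} AS "
        f"SELECT * EXCEPT ({except_list}) FROM `{table}`"
    )


sql = rematerialize_sql(
    "my_dataset.events",                     # hypothetical table
    ["old_form_field", "legacy_user_flag"],  # hypothetical deprecated columns
    table_options="PARTITION BY DATE(ts) CLUSTER BY event_type",
)
```

Because the query scans the whole table, it makes sense to batch column removals rather than run it for each deprecated field.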
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter on the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destination and write them to a file per event type, following some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
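The flatten, filter, and route steps can be sketched in plain Python as below. This is not the Beam pipeline itself: the event shape, the "type" field, the events_ table prefix, and the helper names flatten and route are all assumptions for illustration. In a real pipeline the routing function would be handed to the BigQuery sink's dynamic-destination mechanism instead of being called directly.

```python
def flatten(event, parent="", sep="_"):
    """Flatten nested dicts into a single level; lists are kept as values."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def route(event, allowed_columns, table_prefix="events_"):
    """Return (destination table, row) for one event."""
    row = {k: v for k, v in flatten(event).items() if k in allowed_columns}
    return f"{table_prefix}{event['type']}", row


event = {"type": "signup", "user": {"id": 7, "geo": {"country": "NL"}}, "debug": True}
table, row = route(event, allowed_columns={"type", "user_id", "user_geo_country"})
```

Here the signup event lands in events_signup with only the whitelisted columns, and the debug field is dropped before insertion.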
I'm learning table partitioning.
When I read this page, it said that
The TransactionHistoryArchive table must have the same design schema as the TransactionHistory table. There must also be an empty partition to receive the new data. In this case, TransactionHistoryArchive is a partitioned table that consists of just two partitions.
And with the following picture, we can see that TransactionHistory has 12 partitions, but TransactionHistoryArchive just has 2 partitions.
Illustration http://i.msdn.microsoft.com/dynimg/IC38652.gif
How is this possible? Please help me understand it.
As long as two individual partitions have identical schema and the same boundary values you can switch them. They don't need to have the same partition scheme or function.
This is because SQL Server ensures that the binary data of those partitions on disk is compatible. That's the magic of partitioning and why you can move arbitrary amounts of data as a quick metadata-only operation.
I have a Hive table where, for each user ID, I have a ts column holding a time series stored as an array. I want to maintain the time series as a window of the most recent values.
(a) how do I append a new number to the end of each column from another table joined by ID?
(b) how do I drop the leading number?
Data in Hive is typically stored in HDFS. HDFS has limited append capabilities. If the constant modification of data is at the core of your analytics systems, then perhaps you should consider using alternatives like HBase or Cassandra.
However, if the data updates are a small part of your workflow, I would encourage you to continue using Hive (in order to make use of its SQL-like functionality) but reconsider your design for storing these updates.
A quick solution to the above problem would be to have more than one record per user ID in your table. Each record would hold one timeseries entry for that user ID. When you want to do your last-N analysis on the timeseries, select from the table using DISTRIBUTE BY on the user ID column. Your custom reducer will simply pick out the last N (or fewer, if the timeseries has fewer than N entries) timestamps and return them.
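The window semantics themselves (append a value, drop the leading one) can be sketched in Python with a bounded deque. This is only a stand-in for the logic a custom reducer would apply per user ID, and the function names last_n and slide are made up for illustration.

```python
from collections import deque


def last_n(values, n):
    """Keep only the most recent n values, preserving order."""
    return list(deque(values, maxlen=n))


def slide(window, new_value, n):
    """Append new_value; the oldest value is dropped once the window is full."""
    bounded = deque(window, maxlen=n)
    bounded.append(new_value)
    return list(bounded)


ts = [3, 5, 8, 13]
updated = slide(ts, 21, n=4)       # full window: 3 is dropped, 21 is appended
short = slide([3, 5], 8, n=4)      # window not yet full: nothing is dropped
recent = last_n([1, 2, 3, 4, 5], 3)
```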
Harish Butani also did some work on Windowing functions in Hive. You can also take a look at his work and associated documentation to gain some more insight. Good luck, Alexy!