Rationale behind partition-specific schema in Hive/Glue tables

I'm trying to understand the rationale behind the partition-specific schema managed for Hive/Glue tables. I couldn't find any documentation that talks about this specifically, but during my search I found a couple of Hive JIRAs (attached as references) which hint at its purpose. From what I gathered, the partition schema is a snapshot of the table schema at the time the partition is registered, and it allows Hive to support schema evolution without invalidating existing table partitions and the underlying data. It also enables Hive to support different partition-level and table-level file formats, giving clients more flexibility.
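To make the schema-evolution point concrete, this is the kind of scenario I have in mind (table and column names are hypothetical, and the comments reflect my current understanding rather than documented behaviour):

-- Table created with two data columns, partitioned by dt
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

ALTER TABLE events ADD PARTITION (dt='2020-01-01');  -- registered with the 2-column schema

-- Table schema evolves; without CASCADE this touches only the table-level metadata,
-- so the partition above presumably keeps its original 2-column "snapshot"
ALTER TABLE events ADD COLUMNS (source STRING);

ALTER TABLE events ADD PARTITION (dt='2020-01-02');  -- registered with the 3-column schema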
The exact purpose is still not clear to me, so I'm requesting the experts to comment on the following questions:
What is the rationale behind maintaining a partition-specific schema?
How do Hive/Glue behave when there is a discrepancy between the partition and table schemas? Do the resolution criteria depend on the underlying data file format?
What are the repercussions of not maintaining a partition-specific schema in the table metadata?
Experimentation and observations:
I ran an experiment in which I tested a few count, count-with-partition-filter, and schema-description queries against a Glue table with no explicit schema definition in its partition properties (the underlying data files are written in Parquet), using spark-shell, the Hive CLI, and Athena. The results were consistent with the ones computed from the original table.
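For reference, the queries I ran in each engine were roughly of this form (database, table, and partition names are placeholders):

SELECT count(*) FROM my_glue_db.my_table;
SELECT count(*) FROM my_glue_db.my_table WHERE dt = '2021-01-01';  -- partition filter
DESCRIBE my_glue_db.my_table;                                      -- schema description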
References:
https://issues.apache.org/jira/browse/HIVE-6131
https://issues.apache.org/jira/browse/HIVE-6835
https://issues.apache.org/jira/browse/HIVE-8839
Thanks!

Related

Materialized view vs table - performance

I am quite new to Redshift but have considerable experience in the BI area. I need help from an expert Redshift developer. Here's my situation:
I have an external (S3) database added to Redshift. It will undergo very frequent changes, approximately every 15 minutes. I will run a lot of concurrent queries directly from Qlik Sense against this external DB.
As best practices say that Redshift + Spectrum works best when the smaller table resides in Redshift, I decided to move some calculated dimension tables locally and leave the outer tables in S3. The challenge I have is whether materialized views or tables are better suited for this.
I have already tested both, with DISTSTYLE ALL and a proper SORTKEY, and the tests show that MVs are faster. I just don't understand why that is. Considering the dimension tables are fairly small (< 3 million rows), I have the following questions:
Shall we use MVs and refresh them via a scheduled task, or use a table and perform some form of ETL (maybe via a stored procedure) to refresh it?
If a table is used: I tried casting the varchar keys (heavily used in joins) to bigint to force AZ64 encoding, but queries perform worse than without casting (where the encoding is LZO). Is this because the keys are stored as varchar in the external DB?
If an MV is used: I also tried the above casting in the query behind the MV, but the encoding says NONE (figured out by checking the table created behind the scenes). Moreover, even without casting, most of the key columns used in joins have no encoding. Might this be the reason why MVs are faster than the table? And shouldn't I expect the opposite, i.e. no encoding = worse performance?
Some more info: in S3 we store the data as Parquet files, with proper partitioning. In Redshift, the tables are sorted on the same columns as the S3 partitioning, plus some more columns. All queries join on these columns in the same order and also filter on them in the WHERE clause, so the queries are well structured.
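To make the comparison concrete, the two variants I am testing look roughly like this (object names are made up):

-- Variant 1: local dimension table, refreshed by some ETL process
CREATE TABLE dim_customer
DISTSTYLE ALL
SORTKEY (region_key, customer_key)
AS
SELECT region_key, customer_key, customer_name
FROM spectrum_schema.customer_ext;

-- Variant 2: materialized view over the external (Spectrum) table,
-- refreshed on a schedule (e.g. every 15 minutes)
CREATE MATERIALIZED VIEW mv_dim_customer
DISTSTYLE ALL
SORTKEY (region_key, customer_key)
AS
SELECT region_key, customer_key, customer_name
FROM spectrum_schema.customer_ext;

REFRESH MATERIALIZED VIEW mv_dim_customer;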
Let me know if you need any other details.

Performance of partitioned view of an unpartitioned table

I have an unpartitioned Hive table which is used to create a partitioned view. The table has just some metadata columns, and the actual data is stored in an array, which makes querying difficult. Hence the data is exploded into a view which is then used for all querying purposes. This view is partitioned on the date the data arrives. In this scenario, will performance be affected because the original table is unpartitioned? Should the original table be partitioned too?
If the underlying tables are not partitioned, view partitioning is not useful at all.
Of course, the table should be partitioned if you want partition pruning to work; otherwise a full scan will be performed.
On the other hand, if the table is partitioned, the view is not partitioned, and the query has predicates on the table's partition columns, the optimizer is clever enough to derive partition info from the view definition, push the predicates down, and partition pruning works. This makes view partitioning a rather useless feature, and manually managed view partitions add unnecessary complication. It is better to use partitioned tables and normal, non-partitioned views.
Why might you need a partitioned view if partition pruning works with a non-partitioned view? One possible use case is when a restricted user can see only the view, not the underlying tables, and different tools can derive partition information from the view metadata alone and suggest filtering on partitions while knowing nothing about the underlying tables. From the restricted user's perspective, the view is the same as a table, and they should see its partitioning schema.
See HIVE-1079:
For the manual approach, a very simple thing we can start with is just letting users add partitions to views explicitly as a way of indicating that the underlying data is ready.
See also HIVE-1941 and this design document
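In DDL terms, that manual approach looks roughly like this (table, view, and column names are hypothetical):

CREATE VIEW sales_v
PARTITIONED ON (dt)
AS SELECT item, amount, dt FROM sales;

-- view partitions are registered by hand to signal that the underlying data is ready
ALTER VIEW sales_v ADD PARTITION (dt='2021-01-01');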
BTW, materialized views are already implemented in Hive 3.0.0, and partitioning them makes much more sense because the data in them is stored according to the partition schema specified in the DDL.
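For example, something along these lines (names are made up):

CREATE MATERIALIZED VIEW sales_mv
PARTITIONED ON (dt)
STORED AS ORC
AS SELECT item, amount, dt FROM sales;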

Looking for a non-cloud RDBMS to import partitioned tables (in CSV format) with their directory structure

Context: I have been working on Cloudera/Impala in order to use a big database and create more manageable "aggregate" tables which contain substantially less information. These more manageable tables are of the order of tens to hundreds of gigabytes, and there are about two dozen tables. I am looking at about 500 gigabytes of data which will fit on a computer in my lab.
Question: I wish to use a non-cloud RDBMS in order to further work on these tables locally from my lab. The original Impala tables, most of them partitioned by date, have been exported to CSV, in such a way that the "table" folder contains a subfolder for each date, each subfolder containing a unique csv file (in which the partitioned "date" column is absent, since it is in its dated subfolder). Which would be an adequate RDBMS and how would I import these tables?
What I've found so far:
there seem to be several GUIs or commands for MySQL which simplify importing, e.g.:
How do I import CSV file into a MySQL table?
Export Impala Table from HDFS to MySQL
How to load Excel or CSV file into Firebird?
However, these do not address my specific situation since (1) I only have access to Impala on the cluster, i.e. I cannot add any tools, so the heavy lifting must be done on the lab computer, and (2) they say nothing about importing an already partitioned table with its existing directory/partition structure.
Constraints:
Lab computer is on Ubuntu 20.04
Ideally, I would like to avoid having to load each CSV/partition manually, as I have tens of thousands of dates. I am hoping for an RDBMS which already recognizes the partitioned directory structure...
The RDBMS itself should have a fairly recent set of functions available, including lead/lag/first/last window functions. Aside from that, it needn't be too fancy.
I'm open to using Spark as an "overkill SQL engine" if that's the best way; I'm just not too sure whether it is the best approach for a single computer (not a cluster). Also, if need be (though I would ideally like to avoid this), I can export my Impala tables in another format in order to ease the import phase, e.g. a different format for text-based tables, Parquet, etc.
Edit 1
As suggested in the comments, I am currently looking at Apache Drill. It is correctly installed, and I have successfully run the basic queries from the documentation/tutorials. However, I am now stuck on how to actually "import" my tables (actually, I only need to "use" them, since Drill seems able to run queries directly on the filesystem). To clarify:
I currently have two "tables" in the directories /data/table1 and /data/table2.
Those directories contain subdirectories corresponding to the different partitions, e.g. /data/table1/thedate=1995, /data/table1/thedate=1996, etc., and the same goes for table2.
within each subdirectory, I have a file (without an extension) that contains the CSV data, without headers.
My understanding (I'm still new to Apache Drill) is that I need to create a File System Storage Plugin somehow for Drill to understand where to look and what it's looking at, so I created a pretty basic plugin (a quasi copy/paste from this one) using the web interface on the Plugin Management page. The net result is that I can now type use data; and Drill understands that. I can then say show files in data and it correctly lists table1 and table2 as my two directories. Unfortunately, I am still missing two key things to be able to query these tables successfully:
Running select * from data.table1 fails with an error, and I've tried table1 and dfs.data.table1 and I get a different error for each command (object 'data' not found, object 'table1' not found, and schema [[dfs,data]] is not valid with respect to either root schema or current default schema, respectively). I suspect this is because there are sub-directories within table1?
I still have not said anything about the structure of the CSV files, and that structure would need to incorporate the fact that there is a "thedate" field whose value is in the sub-directory name...
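Concretely, what I am hoping to end up with is something along these lines (paths approximate; this assumes Drill can be told, e.g. via the workspace's defaultInputFormat setting, that the extension-less files are CSV):

-- dir0 is Drill's implicit column holding the first-level sub-directory name,
-- and header-less CSV data comes back as the implicit "columns" array
SELECT dir0, columns[0], columns[1]
FROM dfs.`/data/table1`;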
Edit 2
After trying a bunch of things, I still had no luck using text-based files; however, using Parquet files worked:
I can query a parquet file
I can query a directory containing a partitioned table, each subdirectory being in the format thedate=1995, thedate=1996, as stated earlier.
I used the advice here in order to be able to query a table the usual way, i.e. without using dir0 but using thedate. Essentially, I created a view:
create view drill.test as select dir0 as thedate, * from dfs.data/table1_parquet_partitioned
Unfortunately, thedate is now text that says thedate=1994, rather than just 1994 (int). So I renamed the directories to contain only the date; however, this was still not a good solution, as the type of thedate was not an int and therefore I could not use it to join with table2 (which has thedate as a column). So finally, what I did was cast thedate to an int in the view.
=> This is all fine since, although these are not CSV files, this alternative is doable for me. However, I am wondering: by using such a view, with a cast inside, will I still benefit from partition pruning? The answer in the referenced Stack Overflow link suggests that partition pruning is preserved by the view, but I am unsure about this when the column is used in an expression... Finally, given that the only way I can make this work is via Parquet, it raises the question: is Drill the best solution for this in terms of performance? So far I like it, but migrating the database to it will be time-consuming, and I would like to choose the best destination without too much trial and error...
I ended up using Spark. The only alternative I currently know about, which was brought to my attention by Simon Darr (whom I wish to thank again!), is Apache Drill. Pros and cons for each solution, as far as I could test:
Neither solution was great for offering a simple way to import the existing schema when the database is exported in text (in my case, CSV files).
Both solutions import the schema correctly using parquet files, so I have decided I must recreate my tables in the parquet format from my source cluster (which uses Impala).
The remaining problem is with respect to the partitioning: I was at long last able to figure out how to import partitioned files on Spark, and the process of adding that partition dimension is seamless (I got help from here and here for that part), whereas I was not able to find a way to do this convincingly using Drill (although the creation of a view, as suggested here, does help somewhat):
On Spark. I used: spark.sql("select * from parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`"). Note that it is important not to use the * wildcard, as I first did, because if you use the wildcard each parquet file is read without looking at the directory it belongs to, so it doesn't take into account the directory structure for the partitioning or add those fields to the schema. Without the wildcard, the directory name with the field_name=value syntax is correctly added to the schema, and the value types themselves are correctly inferred (in my case, int, because I use the thedate=intvalue syntax).
On Drill, the trick of creating a view is a bit messy since it involves, first, taking a substring of dir0 to extract the field value and, second, casting that field to the correct type for the schema. I am really not certain this sort of view would enable partition pruning in subsequent queries, so I was not fond of this hack. NB: there is likely another way to do this properly; I simply haven't found it.
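For illustration, the sort of view I mean looks roughly like this (workspace and path approximate):

-- SUBSTR(dir0, 9) skips the 'thedate=' prefix of the directory name,
-- and the CAST turns the remainder into an INT so it can be joined on
CREATE OR REPLACE VIEW dfs.tmp.table1_v AS
SELECT CAST(SUBSTR(dir0, 9) AS INT) AS thedate, t.*
FROM dfs.`/data/table1_parquet_partitioned` t;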
I learned along the way about Drill (which seems great for logs and other data that don't have a known structure), and I learned that Spark can do a lot of what Drill does if the data is structured (I had no idea it could read CSV or Parquet files directly without an underlying DB system). I also did not know that Spark was so easy to install on a standalone machine: after following the steps here, I simply created a script in my bashrc which launches the master, a worker, and the shell all in one go (although I cannot comment on the performance of using a standalone computer for this; perhaps Spark is bad at this). Having used Spark a bit in the past, this solution still seems best for me given my options. If there are any other solutions out there, keep them coming, as I won't accept my own answer just yet (I need a few days to change all my tables to Parquet anyway).

Get Hive partition schemas

As far as I understand, Hive keeps track of the schema for all partitions, handling schema evolution.
Is there any way to get the schema for a particular partition? For example, if I want to compare the schema of some old partition with the latest one.
The SHOW TABLE EXTENDED command does give you a bunch of information about the partition's columns and their types; you could probably use that.
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTable/PartitionExtended
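For example, for a hypothetical table sales partitioned by dt:

SHOW TABLE EXTENDED LIKE 'sales' PARTITION (dt='2020-01-01');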

Is having many partition keys in Azure Table Storage a good design for read queries?

I know that having many partition keys reduces the ability to use batch processing (entity group transactions, EGT) in Azure Table Storage. However, I wonder whether there is any performance issue in terms of reading as well. For example, suppose I designed my Azure table such that every new entity has a new partition key, and I end up having 1M or more partition keys. Is there any performance disadvantage for read queries?
If the operation you perform most often is a point query (PartitionKey and RowKey specified), the unique-partition-key design is quite good. However, if your typical operation is a table scan (no PartitionKey specified), the design will be awful.
You can refer to the "Design for querying" chapter in the Azure Table Design Guide for the details.
A point query is the most efficient query: it retrieves a single entity by specifying a single PartitionKey and RowKey using equality predicates. If your PartitionKey is unique, you may consider using a constant string as the RowKey so that you can leverage point queries. The choice of design also depends on how you plan to read/retrieve your data. If you always plan to use point queries to retrieve the data, this design makes sense.
Please see the "New PartitionKey Value for Every Entity" section in the following article: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx. In short, it will scale very well, since our storage system can load-balance across many partitions. However, if your application requires you to retrieve data without specifying a PartitionKey, it will be inefficient because it will result in a table scan.
Please send me an email at ascl#microsoft.com if you want to discuss your table design further.