I have a few hive tables that are insert-overwrite from spark and hive. Those tables are also accessed by analysts on presto. Naturally, we're running into some windows of time that users are hitting an incomplete data set because presto is ignoring locks.
The options I can think of:
Fork the presto-hive connector to support hive S and X locks appropriately. This isn't too bad, but time consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed, then once a new partition is written, alter the hive table to see it. Then we can just have views on top of the data that will properly reconcile the latest version of each row.
Stop doing insert-overwrite on s3 which has a long window of copy from hive staging to the target table. If we move to hdfs for all insert-overwrite, we still have the issue, but it's over the span of time that it takes to do a hdfs mv which is significantly faster. (probably bad: there's still a window that we can get incomplete data)
My question is how do people generally handle that? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third party tool that can query the hive metastore and interact with the hdfs/s3 directly while not respecting hive locks.
Related
I am using spark sql on databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).
The amount of time it takes to run the metastore validation checks is scaling linearly with the number of columns included in my query - is there any way to skip this step? Or pre-compute the checks? Or to at least make the metastore only check once per table rather than once per column?
A small example is that when I run the below, even before calling display or collect, the metastore checker happens once:
new_table = table.withColumn("new_col1", F.col("col1")
and when I run the below, the metastore checker happens multiple times, and therefore takes longer:
new_table = (table
.withColumn("new_col1", F.col("col1")
.withColumn("new_col2", F.col("col2")
.withColumn("new_col3", F.col("col3")
.withColumn("new_col4", F.col("col4")
.withColumn("new_col5", F.col("col5")
)
The metastore checks it's doing look like this in the driver node:
20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: xxx
The view to the user on databricks is:
Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...
I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and if I have to just plan for the overhead of the metastore checks.
I am surprised by this behavior as it does not fit with the Spark processing model and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark but I doubt that since PySpark is just an API for creating Spark plans.
What is happening, however, is that after every withColumn(...) the plan is analyzed. If the plan is large, this can take a while. There is a simple optimization, however. Replace multiple withColumn(...) calls for independent columns with df.select(F.col("*"), F.col("col2").as("new_col2"), ...). In this case, only a single analysis will be performed.
In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.
Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions and the schema changes for the Athena table we drop the table, recreate the table with say the new column(s) tacked to the end and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work what so every after we let it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but can't find documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performed S3 listings and constructed a list of partitions with their locations, and used the Glue API BatchCreatePartition to add the partitions to your table.
When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartitions is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTION … and waiting for query execution to finish – and ridiculously faster than MSCK REPAIR TABLE ….
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTION … or Glue's BatchCreatePartitions can be used directly with no scripting necessary.
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.
We are new to BigQuery and are trying to figure out the best way to use it for real time analytics. We are sending a stream of logs from our back-end services to Kafka, and we want to stream those into BigQuery using streaming inserts. For queryability we are both partitioning by time, and sharding tables by event type (for use with wildcard queries). We put a view overtop of the family of tables created so that they look like 1 table and use the _TABLE_SUFFIX (well, when they roll out the feature, for now using UNION ALL) and _PARTITIONTIME columns to reduce the set of rows scanned for queries. So far so good.
What we are unsure of how to handle properly is schema changes. The schema of our log messages changes frequently. Having a manual process to keep BigQuery in sync is not tenable. Ideally our streaming pipeline would detect the change and apply the schema update (for adding columns) or table creation (for adding an event type) as necessary. We have tooling up-stream so that we know all schema updates will be backwards compatible.
My understanding is that all of the shards must have the same schema. How do we apply the schema update in such a fashion that:
We don't break queries that are run during the update.
We don't break streaming inserts.
Is #1 possible? I don't believe we can atomically change the schema of all the sharded tables.
For #2 I presume we have to stop our streaming pipelines while the schema update process is occurring.
Thanks,
--Ben
Wildcard tables with _TABLE_SUFFIX is available https://cloud.google.com/bigquery/docs/querying-wildcard-tables and you can use it even if the schemas of the tables are different, they just need to have compatible schemas. With UNION ALL, you need all the tables to have the same schema so it will not work if you're updating schemas at the same time.
Streaming insert will also work if you only specify a subset of fields. However you cannot add new fields as part of the streaming insert, you'll have to update table first and then insert the data with new schema.
I saw at this link which affects Impala version 1.1:
Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.
Does this still hold true for later versions of Impala?
According to Cloudera's Impala guide (Cloudera Enterprise 5.8) but stayed the same for 5.9:
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA
waits to reload the metadata when needed for a subsequent query, but
reloads all the metadata for the table, which can be an expensive
operation, especially for large tables with many partitions. REFRESH
reloads the metadata immediately, but only loads the block location
data for newly added data files, making it a less expensive operation
overall. If data was altered in some more extensive way, such as being
reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a
performance penalty from reduced local reads. If you used Impala
version 1.0, the INVALIDATE METADATA statement works just like the
Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is
optimized for the common use case of adding new data files to an
existing table, thus the table name argument is now required.
and related to working on existing tables:
The table name is a required parameter [for REFRESH]. To flush the metadata for all
tables, use the INVALIDATE METADATA command.
Because REFRESH table_name only works for tables that the current
Impala node is already aware of, when you create a new table in the
Hive shell, enter INVALIDATE METADATA new_table before you can see the
new table in impala-shell. Once the table is known by Impala, you can
issue REFRESH table_name after you add data files for that table.
So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.
As per Impala document Invalidate Metada and Refresh
INVALIDATE METADATA Statement
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive client, such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
REFRESH Statement
The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.
Usage notes:
The table name is a required parameter, and the table must already exist and be known to Impala.
Only the metadata for the specified table is reloaded.
Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
Deleting, adding, or modifying files.
For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, inserting data from Hive via INSERT or LOAD DATA.
Deleting, adding, or modifying partitions.
For example, after issuing ALTER TABLE or other table-modifying SQL statement in Hive
Invalidate metadata is used to refresh the metastore and the data (structure & data)==complete flush
Refresh is used to update only the data = lightweight flush
I would like to change the bucket name in location of many Hive tables. Is it possible for us to connect to mySQL database and update it? I think it is possible.But I would like to know if it is safe to do it in production database.
Yes, it is possible, and I have seen it done; but
(a) the Metastore schema is not documented, and each Hive version brings some minor changes, so you have to do your own exploration to find where/how the StorageDescriptor objects are persisted -- then some unit tests / non-regression tests on a Dev system -- plus, don't forget to run a full DB backup before tinkering with your Prod system (and to rehearse an emergency restoration on your Dev system, too!)
(b) you have to update the StorageDescriptor for tables, but also for partitions -- remember that for partitioned tables, the table-level LOCATION is just used as default root dir for future partitions; once created, a partition retains its location until it is ALTERed explicitly.
For the record, the preferred method for bulk updates is (in theory) the Hive MetaTool but unfortunately, it does not support the kind of updates that you need.Right now it's only good for changing the NameNode alias in all HDFS paths, because that was a real pain point...
A valid alternative to brutal SQL Updates would be to develop a custom Java program, using the Hive MetaStore API, to scan all tables & partitions then read their StorageDescriptor then run RegEx changes on their Location then write back the changes (which is exactly what the MetaTool does, only at a lower level). But that would be overkill.
Finally, a possible compromise would be a SQL Select on the appropriate MySQL table, to generate (with regexp_replace()) a chain of ALTER Table/Partition LOCATION commands to run later in the Hive CLI.Plus a chain of ALTER to revert to the original locations, in case you have to do an emergency rollback :-/