Update Hive metadata location for many tables - hive

I would like to change the bucket name in the location of many Hive tables. Is it possible to connect to the MySQL metastore database and update it directly? I think it is possible, but I would like to know whether it is safe to do in a production database.

Yes, it is possible, and I have seen it done; but
(a) the Metastore schema is not documented, and each Hive version brings some minor changes, so you have to do your own exploration to find where/how the StorageDescriptor objects are persisted -- then some unit tests / non-regression tests on a Dev system -- plus, don't forget to run a full DB backup before tinkering with your Prod system (and to rehearse an emergency restoration on your Dev system, too!)
(b) you have to update the StorageDescriptor for tables, but also for partitions -- remember that for partitioned tables, the table-level LOCATION is just used as default root dir for future partitions; once created, a partition retains its location until it is ALTERed explicitly.
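To illustrate point (b): in a typical MySQL-backed Metastore the locations live in the SDS table, referenced by both TBLS and PARTITIONS. A read-only sketch to scope the change before touching anything (the table/column names have to be verified against your own Hive version, and 's3://old-bucket/' is a placeholder):

    -- List every table- and partition-level location still pointing at the old bucket.
    SELECT t.TBL_NAME, NULL AS PART_NAME, s.LOCATION
    FROM   TBLS t
    JOIN   SDS  s ON s.SD_ID = t.SD_ID
    WHERE  s.LOCATION LIKE 's3://old-bucket/%'
    UNION ALL
    SELECT t.TBL_NAME, p.PART_NAME, s.LOCATION
    FROM   PARTITIONS p
    JOIN   TBLS t ON t.TBL_ID = p.TBL_ID
    JOIN   SDS  s ON s.SD_ID  = p.SD_ID
    WHERE  s.LOCATION LIKE 's3://old-bucket/%';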
For the record, the preferred method for bulk updates is (in theory) the Hive MetaTool, but unfortunately it does not support the kind of updates that you need. Right now it's only good for changing the NameNode alias in all HDFS paths, because that was a real pain point...
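For reference, that NameNode use case looks roughly like this (check hive --service metatool -help on your version; the URIs are placeholders):

    hive --service metatool -listFSRoot
    hive --service metatool -updateLocation hdfs://new-nameservice hdfs://old-namenode:8020 -dryRun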
A valid alternative to brutal SQL updates would be to develop a custom Java program, using the Hive MetaStore API, to scan all tables & partitions, read their StorageDescriptor, run RegEx changes on their Location, then write back the changes (which is exactly what the MetaTool does, only at a lower level). But that would be overkill.
Finally, a possible compromise would be a SQL SELECT on the appropriate MySQL table, to generate (with regexp_replace()) a chain of ALTER Table/Partition LOCATION commands to run later in the Hive CLI. Plus a chain of ALTERs to revert to the original locations, in case you have to do an emergency rollback :-/
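A rough sketch of that generator, for the table-level locations only (the Metastore table names and the bucket names are assumptions to verify; plain REPLACE() is enough for a simple bucket swap). The partition-level variant is messier, because PART_NAME has to be turned back into a PARTITION (...) spec:

    -- Generate, but do not execute, the ALTER statements.
    SELECT CONCAT('ALTER TABLE ', d.NAME, '.', t.TBL_NAME,
                  ' SET LOCATION ''',
                  REPLACE(s.LOCATION, 's3://old-bucket/', 's3://new-bucket/'),
                  ''';')
    FROM   TBLS t
    JOIN   DBS  d ON d.DB_ID = t.DB_ID
    JOIN   SDS  s ON s.SD_ID = t.SD_ID
    WHERE  s.LOCATION LIKE 's3://old-bucket/%';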

Related

Hive Insert-Only Transactional Tables

What are the specific benefits of using a Hive Insert-Only Transactional Table? Most of the documentation just indicates that if you don't need Delete or Alter functionality, then create this table. Does this speed up processing? Reduce Overhead?
Currently, full ACID tables are only supported in the ORC file format. Micromanaged, a.k.a. INSERT-only, transactional tables support any other storage format.
So, if you have all your tables stored in ORC format, you can go ahead with full ACID. If you have other storage types, and you need to be able to do INSERT statements, micromanaged tables can help you there.
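For illustration (table names are made up; syntax as of Hive 3.x), the two flavours are declared roughly like this:

    -- Full ACID: ORC only.
    CREATE TABLE events_acid (id BIGINT, payload STRING)
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Insert-only (micromanaged): any format, e.g. Parquet.
    CREATE TABLE events_insert_only (id BIGINT, payload STRING)
    STORED AS PARQUET
    TBLPROPERTIES ('transactional'='true',
                   'transactional_properties'='insert_only');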
Also: for full ACID tables, compaction is done by a MapReduce job. You can configure Hive to use the query-based compactor for major compactions (as in creating a new base), but minor compactions (as in merging delta files) are still done with MR, and MR only.
For micromanaged tables, the compaction is query-based. So if you are using Hive on Tez, or Hive on Spark, and you do not want to have MR at all, that is fine. But for full ACID tables, if you want minor compactions, you'll need MapReduce.
Insider note: query-based minor compaction for full ACID tables will be supported really soon, and I am pretty sure Parquet is going to support ACID tables very soon.
What you read everywhere is that unlike full transactional tables, "insert-only transactional tables" support data insert operations only.
But that doesn't say much. What one wants to know is:
What is a transactional insert operation?
To say that an operation is a transaction means basically that it follows the ACID principles and especially its most important property: atomicity (the A from ACID).
In his great book Designing Data-Intensive Applications, Martin Kleppmann explains the atomicity property well:
"if a transaction was aborted, the application can be sure that it
didn’t change anything, so it can safely be retried.
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity."
Desigining Data Itensive Applications, March 2017, 1st edition chapter 7, p. 234
In Hive, this is done by creating a delta directory for each insert transaction, which keeps the new data isolated until the transaction is completed. If there is an error, the directory is deleted; otherwise it is kept and its contents become part of the table.
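A hypothetical illustration (directory names depend on the Hive version and the write IDs assigned):

    INSERT INTO events_insert_only VALUES (1, 'a'), (2, 'b');
    -- While the transaction is open, the new files sit in their own delta
    -- directory under the table location, for example:
    --   .../events_insert_only/delta_0000001_0000001_0000/000000_0
    -- On abort the directory is removed; on commit it becomes visible to
    -- readers and is eventually merged by compaction.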

Using H2 1.4 database can I write new rows if reading other rows

Using an H2 1.4 database, can I write new rows while reading other rows?
i.e. if I have 1000 rows in a table, and a SELECT query is running that is fetching primary keys 1-10, would it be possible for an INSERT query to insert some new rows at the same time, or would it have to wait for (all) SELECT queries on that table to finish?
What is the situation with an UPDATE of rows in the table that are not being retrieved by any SELECT query?
I ask because with H2 1.3 I noticed that my application threads that accessed the database seemed to spend a lot of time blocking; it seems better now that I have upgraded to 1.4. But my application is multithreaded and the threads always deal with different rows, so it is important for me to better understand how locking works in H2 (with the MVStore; I was previously using the PageStore with 1.3), and whether H2 can lock individual rows when UPDATING or whether it has to lock the whole table.
It depends on the storage engine that you choose. All the information below applies to the most recent version (1.4.199); old versions have some differences.
With the default MVStore engine, data modification operations and SELECT … FOR UPDATE lock the modified (or selected) rows. Other transactions can't modify locked rows in parallel, but they can read their values. Note that the read committed isolation level is used by default and other isolation levels are not really supported by this engine. With read committed, other transactions will not see concurrently modified values; they will see the old ones. New values become visible only when the modifying transaction commits its work. With this engine the database runs in multi-threaded mode by default, so a long-running command will not block other sessions.
With the legacy PageStore engine (add ;MV_STORE=FALSE to the connection URL if you want to create a database with this engine), whole tables are locked for writing. This means that you really need to lock the tables in the same order (alphabetical or some other) in all your transactions, otherwise a deadlock is possible. With this engine the database runs in single-threaded mode by default; you can enable multi-threaded mode explicitly, but it is not safe with this engine. Different sessions can't do their work concurrently, and a long-running command will block all other sessions.
Databases are not converted from the old (PageStore) format to the new (MVStore) format when you open them with a new version of H2; you have to do it yourself. Also, old databases may have serious problems with new versions, so it's recommended to export them to SQL with the old version of H2 using the SCRIPT TO 'filename.sql' command and load this script into a new database with the new version of H2 using the RUNSCRIPT FROM 'filename.sql' command. You need to do this even if you choose to use the old engine. If you have persistent databases, don't forget to create regular backup copies (with the BACKUP TO 'filename.zip' command, for example).
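A sketch of that export/import cycle (the file names are placeholders):

    -- With the OLD version of H2, connected to the old database:
    SCRIPT TO 'dump.sql';
    BACKUP TO 'old-db-backup.zip';   -- optional binary backup of the old files
    -- With the NEW version of H2, connected to a freshly created database
    -- (add ;MV_STORE=FALSE to the JDBC URL only if you want the legacy PageStore engine):
    RUNSCRIPT FROM 'dump.sql';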
You can find more details in the documentation:
https://h2database.com/html/advanced.html#mvcc
https://h2database.com/html/features.html#multiple_connections

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time where users hit an incomplete data set, because Presto ignores Hive locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but time-consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the Hive table to see it. Then we can have views on top of the data that properly reconcile the latest version of each row (roughly the sketch after this list).
Stop doing insert-overwrite on S3, which has a long copy window from Hive staging to the target table. If we move to HDFS for all insert-overwrites, we still have the issue, but it's over the span of time that it takes to do an hdfs mv, which is significantly faster (probably still bad: there's still a window where we can get incomplete data).
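For option 3, something like this is what I have in mind (table, column and bucket names are made up):

    -- Versioned table: every load writes a brand new partition.
    CREATE TABLE orders_versions (
      order_id BIGINT,
      status   STRING,
      updated  TIMESTAMP
    )
    PARTITIONED BY (load_id STRING)
    STORED AS ORC;

    -- After a job finishes writing s3://bucket/orders/load_id=20240101/,
    -- expose it in a single metadata operation:
    ALTER TABLE orders_versions ADD PARTITION (load_id='20240101')
      LOCATION 's3://bucket/orders/load_id=20240101/';

    -- View that reconciles the latest version of each row.
    CREATE VIEW orders AS
    SELECT order_id, status, updated
    FROM (
      SELECT order_id, status, updated,
             ROW_NUMBER() OVER (PARTITION BY order_id
                                ORDER BY updated DESC) AS rn
      FROM orders_versions
    ) t
    WHERE rn = 1;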
My question is: how do people generally handle this? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked more generally for any third-party tool that can query the Hive metastore and interact with HDFS/S3 directly while not respecting Hive locks.

How to add or route PostgreSQL Data to New Hard Drive

I'm using Windows Server 2008 R2 Standard.
I'm running PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 32-bit.
I have a C:/ and a D:/ drive:
C:/ --> 6.7 GB free space (almost full, and my server performance is suffering)
D:/ --> 141 GB free space
Currently my PostgreSQL data is stored on C:/. Now I want to route or add a path to D:/ without migrating the data from C:/ to D:/, because my PostgreSQL data is already around 148 GB, which is too heavy and massive to move.
If this succeeds, will I still be able to run a query like SELECT * FROM table_bla_bla and have it return results from both drives?
Please do not suggest switching from PostgreSQL to another database.
I am running 39,763 GPS meter devices that send data to my server.
I have to take care of this server because our expert has passed away.
You need to use tablespaces.
Create the tablespace, for example CREATE TABLESPACE second_drive LOCATION 'D:/postgresdata/' (see this other answer if you get permission denied errors)
ALTER TABLE table_bla_bla SET TABLESPACE second_drive
Tablespaces allow you to decide which tables go on which drives and that can help speed up performance by ensuring you control where reads and writes go, but it also helps with space.
Postgres places individual tables in TABLESPACEs (which relate to a single disk), which is enough if you have multiple tables and you can achieve what you need by moving some tables to the other disk.
On the other hand, if you have a large table that you need to split over multiple disks, you need to use Postgres's Horizontal Partitioning capability.
This builds on tablespaces by allowing you to create a master table table_bla_bla which is actually just a facade on top of two or more tables that actually hold the data. These data tables can then be put on different tablespaces, effectively splitting your data over disks.
For this you would:
1. Rename your current table_bla_bla to something like table_bla_bla_c.
2. Create a new table_bla_bla master table.
3. Alter table_bla_bla_c to mark that it inherits from table_bla_bla.
4. Create a new table_bla_bla_d table that inherits from table_bla_bla and specify the tablespace as the D drive.
5. Apply partitioning triggers and check constraints as per the partitioning documentation.
Once this is in place, you can arrange it so that any inserts into table_bla_bla cause new records to be created on the D drive. Selects on table_bla_bla will read from both disks.
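For example, a rough sketch of those steps on PostgreSQL 9.0 (table and tablespace names are placeholders, so test on a copy first):

    -- 1. Keep the existing data where it is, under a new name.
    ALTER TABLE table_bla_bla RENAME TO table_bla_bla_c;

    -- 2. Empty master table with the same structure.
    CREATE TABLE table_bla_bla (LIKE table_bla_bla_c INCLUDING ALL);

    -- 3. Make the old table a child of the master.
    ALTER TABLE table_bla_bla_c INHERIT table_bla_bla;

    -- 4. New child table stored on the D: drive tablespace.
    CREATE TABLE table_bla_bla_d ()
      INHERITS (table_bla_bla)
      TABLESPACE second_drive;

    -- 5. Route new inserts on the master to the child on D:.
    CREATE OR REPLACE FUNCTION table_bla_bla_insert() RETURNS trigger AS $$
    BEGIN
      INSERT INTO table_bla_bla_d VALUES (NEW.*);
      RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER table_bla_bla_insert_trg
      BEFORE INSERT ON table_bla_bla
      FOR EACH ROW EXECUTE PROCEDURE table_bla_bla_insert();

Check constraints for constraint exclusion only matter if the children are split by a column value rather than by age, so they are left out of this sketch.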

Can I exclude a custom schema from a Schema comparison in SSDT?

We have a SQL Server database that is very dynamic, always creating new tables and dropping existing ones in a custom schema called 'temp' (we have a dbo schema and a temp schema). We also use SSDT to maintain and monitor changes in our schema, but we are unable to use the update feature on a schema comparison: if a new table (say temp.MyTable) is created after the schema comparison is made and before the update is attempted, SSDT invalidates the schema comparison because something has changed. At the moment, our only solution is to run the schema comparisons around midnight when system activity is practically non-existent, which is not ideal for the person who has to do the schema comparison.
My question is: is there a way we can exclude tables that are part of the 'temp' schema from the schema comparison?
How are you doing the deployment? As a test, I used sqlpackage.exe to publish a dacpac while constantly creating new tables, and it deployed without complaining.
However, there are a couple of things you can do. The first is to stop the deployment from blocking when drift is detected:
/p:BlockWhenDriftDetected=False
This is set to true by default.
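As an example (the dacpac, server and database names are placeholders), the publish call would look roughly like:

    sqlpackage.exe /Action:Publish ^
      /SourceFile:MyDatabase.dacpac ^
      /TargetServerName:MyServer /TargetDatabaseName:MyDatabase ^
      /p:BlockWhenDriftDetected=False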
The second thing is to ignore the temp schema. I don't think this will help unless you also stop the drift blocking, but you might want to use this filter to stop all changes to the temp schema:
http://agilesqlclub.codeplex.com/
Ed