Hive Insert-Only Transactional Tables - hive

What are the specific benefits of using a Hive Insert-Only Transactional Table? Most of the documentation just indicates that if you don't need Delete or Alter functionality, then create this table. Does this speed up processing? Reduce Overhead?

Currently, full ACID tables are only supported in the ORC file format. Micromanaged, a.k.a. INSERT-only, transactional tables support any other storage format.
So, if you have all your tables stored in ORC format, you can go ahead with full ACID. If you have other storage types, and you need to be able to do INSERT statements, micromanaged tables can help you there.
Also: for full ACID tables, compaction is done by a MapReduce job. You can configure Hive to use the query-based compactor for major compactions (as in creating a new base), but minor compactions (as in merging delta files) are still done with MR, and MR only.
For micromanaged tables, the compaction is query-based. So, if you are using Hive on Tez or Hive on Spark, and you do not want to have MR at all, that is fine. But for full ACID tables, if you want minor compactions, you'll need MapReduce.
Insider note: query based minor compaction for full ACID tables will be supported really soon, and I am pretty sure Parquet is going to support ACID tables very soon.
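For concreteness, here is a minimal sketch of the two kinds of DDL (table and column names are made up, and it assumes a Hive 3.x metastore with ACID enabled); the only real difference is the table properties and the allowed storage formats:

    -- Full ACID table: ORC is required.
    CREATE TABLE acid_events (id INT, payload STRING)
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Insert-only (micromanaged) transactional table: any format, e.g. Parquet.
    CREATE TABLE insert_only_events (id INT, payload STRING)
    STORED AS PARQUET
    TBLPROPERTIES ('transactional'='true',
                   'transactional_properties'='insert_only');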

What you read everywhere is that unlike full transactional tables, "insert-only transactional tables" support data insert operations only.
But that doesn't say much. What one wants to know is:
What is a transactional insert operation?
To say that an operation is a transaction means basically that it follows the ACID principles and especially its most important property: atomicity (the A from ACID).
In his great book Designing Data-Intensive Applications, Martin Kleppmann explains the atomicity property well:
"if a transaction was aborted, the application can be sure that it
didn’t change anything, so it can safely be retried.
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity."
Desigining Data Itensive Applications, March 2017, 1st edition chapter 7, p. 234
In Hive, this is done by creating a delta directory for each insert transaction, which keeps the new data isolated until the transaction is completed. If the transaction fails, the delta directory is deleted; if it commits, the directory is kept and its contents become visible as part of the table.
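As a rough illustration (directory and file names are simplified, and the write-id numbering is made up; the real values come from the transaction manager):

    -- An insert into an insert-only transactional table...
    INSERT INTO insert_only_events VALUES (1, 'a'), (2, 'b');

    -- ...lands in a fresh delta directory under the table location, roughly:
    --   /warehouse/insert_only_events/
    --     delta_0000001_0000001/000000_0   <- first committed insert
    --     delta_0000002_0000002/000000_0   <- a later insert
    -- Readers ignore deltas from aborted or still-open transactions, so a
    -- failed insert never becomes visible; the cleaner removes it later.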

Related

How to handle hive locking across hive and presto

I have a few hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time during which users hit an incomplete data set, because Presto ignores Hive locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but time-consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the hive table to see it. Then we can just have views on top of the data that will properly reconcile the latest version of each row (sketched below, after the options).
Stop doing insert-overwrite on S3, which has a long window of copying from the Hive staging directory to the target table. If we move to HDFS for all insert-overwrite, we still have the issue, but it's over the span of time it takes to do an hdfs mv, which is significantly faster. (Probably bad: there's still a window in which we can get incomplete data.)
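A hedged HiveQL sketch of what option 3 could look like (all names and the load_version partitioning scheme are made up for illustration):

    -- Each load writes a brand-new partition instead of overwriting in place,
    -- so readers never see a half-written overwrite.
    CREATE TABLE events_versioned (
      id         BIGINT,
      payload    STRING,
      updated_at TIMESTAMP
    )
    PARTITIONED BY (load_version STRING)
    STORED AS ORC;

    INSERT INTO events_versioned PARTITION (load_version = '2019-06-01T02:00')
    SELECT id, payload, updated_at FROM staging_events;

    -- View that reconciles the latest version of each row across partitions.
    CREATE VIEW events_latest AS
    SELECT id, payload, updated_at
    FROM (
      SELECT id, payload, updated_at,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_version DESC) AS rn
      FROM events_versioned
    ) v
    WHERE rn = 1;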
My question is how do people generally handle that? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third party tool that can query the hive metastore and interact with the hdfs/s3 directly while not respecting hive locks.

How to manage Schema Drift while streaming to BigQuery sharded table

We are new to BigQuery and are trying to figure out the best way to use it for real-time analytics. We are sending a stream of logs from our back-end services to Kafka, and we want to stream those into BigQuery using streaming inserts. For queryability we are both partitioning by time and sharding tables by event type (for use with wildcard queries). We put a view on top of the family of tables created so that they look like one table, and use the _TABLE_SUFFIX (well, once they roll out the feature; for now using UNION ALL) and _PARTITIONTIME columns to reduce the set of rows scanned for queries. So far so good.
What we are unsure of how to handle properly is schema changes. The schema of our log messages changes frequently. Having a manual process to keep BigQuery in sync is not tenable. Ideally our streaming pipeline would detect the change and apply the schema update (for adding columns) or table creation (for adding an event type) as necessary. We have tooling up-stream so that we know all schema updates will be backwards compatible.
My understanding is that all of the shards must have the same schema. How do we apply the schema update in such a fashion that:
We don't break queries that are run during the update.
We don't break streaming inserts.
Is #1 possible? I don't believe we can atomically change the schema of all the sharded tables.
For #2 I presume we have to stop our streaming pipelines while the schema update process is occurring.
Wildcard tables with _TABLE_SUFFIX are available (https://cloud.google.com/bigquery/docs/querying-wildcard-tables), and you can use them even if the schemas of the tables are different; they just need to have compatible schemas. With UNION ALL, you need all the tables to have the same schema, so it will not work if you're updating schemas at the same time.
Streaming inserts will also work if you only specify a subset of fields. However, you cannot add new fields as part of the streaming insert; you'll have to update the table first and then insert the data with the new schema.
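For illustration, a standard-SQL wildcard query over hypothetical shards (project, dataset, and field names are made up); the matched tables only need compatible, not identical, schemas, and BigQuery takes the schema of the most recently created matching table for the wildcard query:

    -- Hypothetical shards: mydataset.events_login, mydataset.events_purchase, ...
    SELECT
      event_id,
      user_id,
      _TABLE_SUFFIX AS event_type
    FROM `myproject.mydataset.events_*`
    WHERE _TABLE_SUFFIX IN ('login', 'purchase')
      AND _PARTITIONTIME >= TIMESTAMP('2017-01-01');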

Update Hive metadata location for many tables

I would like to change the bucket name in the location of many Hive tables. Is it possible for us to connect to the MySQL database and update it directly? I think it is possible, but I would like to know whether it is safe to do in a production database.
Yes, it is possible, and I have seen it done; but
(a) the Metastore schema is not documented, and each Hive version brings some minor changes, so you have to do your own exploration to find where/how the StorageDescriptor objects are persisted -- then some unit tests / non-regression tests on a Dev system -- plus, don't forget to run a full DB backup before tinkering with your Prod system (and to rehearse an emergency restoration on your Dev system, too!)
(b) you have to update the StorageDescriptor for tables, but also for partitions -- remember that for partitioned tables, the table-level LOCATION is just used as default root dir for future partitions; once created, a partition retains its location until it is ALTERed explicitly.
For the record, the preferred method for bulk updates is (in theory) the Hive MetaTool, but unfortunately it does not support the kind of updates that you need. Right now it's only good for changing the NameNode alias in all HDFS paths, because that was a real pain point...
A valid alternative to brutal SQL Updates would be to develop a custom Java program, using the Hive MetaStore API, to scan all tables & partitions then read their StorageDescriptor then run RegEx changes on their Location then write back the changes (which is exactly what the MetaTool does, only at a lower level). But that would be overkill.
Finally, a possible compromise would be a SQL SELECT on the appropriate MySQL tables, to generate (with regexp_replace()) a chain of ALTER Table/Partition LOCATION commands to run later in the Hive CLI. Plus a chain of ALTERs to revert to the original locations, in case you have to do an emergency rollback :-/
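As a rough sketch of that generation step (the metastore table and column names below, DBS/TBLS/SDS/PARTITIONS, match the common layout, but verify them against your Hive version's schema; the bucket names are placeholders):

    -- Generate ALTER TABLE statements for table-level locations.
    SELECT CONCAT('ALTER TABLE ', d.NAME, '.', t.TBL_NAME,
                  ' SET LOCATION ''',
                  REPLACE(s.LOCATION, 's3://old-bucket', 's3://new-bucket'),
                  ''';') AS ddl
    FROM TBLS t
    JOIN DBS d ON d.DB_ID = t.DB_ID
    JOIN SDS s ON s.SD_ID = t.SD_ID
    WHERE s.LOCATION LIKE 's3://old-bucket%';

    -- Same idea for partitions, which must be altered one by one. Note that
    -- PART_NAME is stored as e.g. 'dt=2019-01-01/hr=00' and still needs to be
    -- rewritten into a valid partition spec (quoted values, commas).
    SELECT CONCAT('ALTER TABLE ', d.NAME, '.', t.TBL_NAME,
                  ' PARTITION (', p.PART_NAME, ') SET LOCATION ''',
                  REPLACE(s.LOCATION, 's3://old-bucket', 's3://new-bucket'),
                  ''';') AS ddl
    FROM PARTITIONS p
    JOIN TBLS t ON t.TBL_ID = p.TBL_ID
    JOIN DBS d ON d.DB_ID = t.DB_ID
    JOIN SDS s ON s.SD_ID = p.SD_ID
    WHERE s.LOCATION LIKE 's3://old-bucket%';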

Backing up portion of data in SQL

I have a huge schema containing billions of records. I want to purge data older than 13 months from it and maintain it as a backup in such a way that it can be recovered whenever required.
What is the best way to do this in SQL? Can we create a separate copy of this schema and add a delete trigger on all tables so that when the trigger fires, the purged data gets inserted into the new schema?
Will there be only one record per delete statement if we use triggers, or will all affected records be inserted?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud based on your date criteria without any applications or users being aware of it when querying the database. No backups required, and it is very easy to set up.
There is no need for triggers; you can use a job running every day that will put outdated data into archive tables.
The best way, I guess, is to create a copy of the current schema. In the main part, delete everything that is older than 13 months; in the archive part, delete everything from the last 13 months.
Then create a stored procedure (or several) that will collect the data, put it into the archive, and delete it from the main table. Put this into a daily job.
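A minimal T-SQL sketch of such a procedure, assuming a hypothetical dbo.BigTable with a CreatedDate column and an identically structured Archive.BigTable in an archive schema, moving rows in batches to keep the transaction log manageable:

    CREATE PROCEDURE dbo.ArchiveOldRows
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @cutoff DATE = DATEADD(MONTH, -13, CAST(GETDATE() AS DATE));
        DECLARE @rows INT = 1;

        WHILE @rows > 0
        BEGIN
            -- Delete up to 10,000 rows at a time from the main table and insert
            -- the deleted rows into the archive table in the same statement.
            DELETE TOP (10000) FROM dbo.BigTable
            OUTPUT deleted.* INTO Archive.BigTable
            WHERE CreatedDate < @cutoff;

            SET @rows = @@ROWCOUNT;
        END
    END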
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table, probably based on a date column, partitioned by month. Switching the data in a given partition out of the table is a metadata-only operation and is extremely fast (if the partition scheme and function are set up properly). I have managed 300 GB tables using partitioning and it has been very effective. Be careful with the partition function so that dates at each edge are handled correctly.
Some of the other proposed solutions involve deleting millions of rows, which could take a long, long time to execute. Model the different solutions using Profiler and/or Extended Events to see which is the most efficient.
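A hedged T-SQL sketch of the partition-switch approach (all names and boundary dates are illustrative; SWITCH requires an empty target table with identical structure on the same filegroup as the source partition):

    -- Monthly partition function and scheme.
    CREATE PARTITION FUNCTION pfMonthly (DATE)
    AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

    CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- Main table partitioned by the date column.
    CREATE TABLE dbo.BigTable (
        Id          BIGINT       NOT NULL,
        CreatedDate DATE         NOT NULL,
        Payload     VARCHAR(MAX) NULL
    ) ON psMonthly (CreatedDate);

    -- Empty staging table with the same structure, used to switch a month out.
    CREATE TABLE dbo.BigTable_ArchiveStage (
        Id          BIGINT       NOT NULL,
        CreatedDate DATE         NOT NULL,
        Payload     VARCHAR(MAX) NULL
    ) ON [PRIMARY];

    -- Switching the partition that contains January 2016 is metadata-only.
    ALTER TABLE dbo.BigTable
    SWITCH PARTITION $PARTITION.pfMonthly('2016-01-15')
    TO dbo.BigTable_ArchiveStage;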
I agree with the above about not creating a trigger. Triggers fire with every insert/update/delete, making them very slow.
You may be best served with a data archive stored procedure.
Consider using multiple databases: the current database that holds your current data, plus one or more archive databases that you move records into from your current database, with some sort of nightly or monthly stored procedure process that moves the data over.
You can use the exact same schema as your production system.
If the data is already in the database, there is no need for a bulk copy. From there you can back up your archive database so it is off the SQL Server, and restore the database if needed to make the data available again. This is much faster and more manageable than bulk copy.
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archive, the implementation in SQL 2016 does not appear to support archive and purge.

SQL Server 2008, Compression / Zip and fast query

I have an application that produces approximately 15,000 rows in a table named ExampleLog for each task. Each task has a TaskID that is saved in a table named TaskTable, so it's possible to retrieve data from the ExampleLog table to run some queries.
The problem is that the ExampleLog table is getting very big, since I run at least one task every day. At the moment my ExampleLog table is over 60 GB.
I would like to take the 15,000 rows that belong to a TaskID, compress or zip them, and save the compressed data somewhere inside the database as a BLOB or as FILESTREAM. But it is important for me to be able to query the compressed or zipped data easily and run queries against it in an efficient manner. (I don't know if that's possible, or whether I would lose too much in terms of performance.)
PS: The compressed data should not be considered as backup data.
Can someone recommend a good approach or technique to solve this problem? My focus is on the speed of the queries running against ExampleLog and the space taken on disk.
I'm using SQL Server 2008 on Windows 7
Consider Read-Only Filegroups and Compression.
Using NTFS Compression with Read-Only User-defined Filegroups and Read-Only Databases
SQL Server supports NTFS compression of read-only user-defined filegroups and read-only databases. You should consider compressing read-only data in the following situations: You have a large volume of static or historical data that must be available for limited read-only access. You have limited disk space.
Also, you can try to estimate the gains from page compression applied to the log table using the Data Compression Wizard.
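The same estimate can be scripted; for example (the schema name is assumed, and note that data compression in SQL Server 2008 requires Enterprise Edition):

    -- Estimate the savings first (does not change the table).
    EXEC sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'ExampleLog',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    -- Apply page compression (rebuilds the heap or clustered index).
    ALTER TABLE dbo.ExampleLog REBUILD WITH (DATA_COMPRESSION = PAGE);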
Denis's answer could not solve my problem completely; however, I will use it for some optimization inside the DB.
Regarding the problem of storing the data in packages/groups, there are two solutions to my problem:
The first solution is the use of the Partitioned Table and Index Concepts.
For example, if a current month of data is primarily used for INSERT, UPDATE, DELETE, and MERGE operations while previous months are used primarily for SELECT queries, managing this table may be easier if it is partitioned by month. This benefit can be especially true if regular maintenance operations on the table only have to target a subset of the data. If the table is not partitioned, these operations can consume lots of resources on an entire data set. With partitioning, maintenance operations, such as index rebuilds and defragmentations, can be performed on a single month of write-only data, for example, while the read-only data is still available for online access.
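For instance (the index name and partition number are illustrative, and assume a partition-aligned index), maintenance can then target a single partition instead of the whole table:

    -- Rebuild only one month's partition of a hypothetical index on the log
    -- table, leaving the older read-only partitions untouched.
    ALTER INDEX IX_ExampleLog_TaskID ON dbo.ExampleLog
    REBUILD PARTITION = 42;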
The second solution is to take, from the code (C# in my case), a List or Dictionary of rows for a task, compress them, and save them inside a FILESTREAM column (SQL Server) on the DB server. The data will later be retrieved by Id; the zip will be decompressed and the data will be ready to use.
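A bare-bones sketch of the storage side of that solution (names are hypothetical; FILESTREAM must be enabled on the instance and the database needs a FILESTREAM filegroup):

    -- One row per task: the zipped log rows are stored as a FILESTREAM blob.
    CREATE TABLE dbo.TaskLogArchive (
        TaskID     INT              NOT NULL PRIMARY KEY,
        RowGuid    UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        ZippedRows VARBINARY(MAX)   FILESTREAM NULL
    );

    -- The application later fetches the blob by TaskID and unzips it client-side.
    SELECT ZippedRows FROM dbo.TaskLogArchive WHERE TaskID = 42;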
We have decided to use the second solution.