What are the Pros and Cons of hive external and managed tables?
We want to do updates and inserts in Hive tables but wonder which approach to take for these (Managed tables or create a workaround with refreshing external tables after manual file updates), especially after adding many files over time.. Will one approach or the other become too slow (e.g. too many files/too many updates to track via metastore and therefore master node becomes slow?)?
Thanks.
There are a number limitations to do DMLs on Hive. Please read the documentation link for more details - https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It’s always recommend not to use DML on Hive managed tables especially if the data volume is huge or if the table grows in size over time these operations would become too slow. Although, these operations would be considerably faster if done on a partition/bucket instead of the full tables. Nevertheless it better to handle the edits in file and do a full refresh via external table and only use DML on managed tables as last resort.
Related
I have daily MySQL DB snapshots stored on S3. This daily DB snapshot is a backup of 1000 tables in our DB, using mysqldump, size is about 300M daily (stored 1 year of snapshots, which is about 110G).
Now we want to load these snapshots daily to snowflake for reporting purpose. How do we create tables in snowflake? Shall we create 1000 tables? Will snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce you metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics or which will be accessed by BI tools (including direct SQL) I would hold in Snowflake - otherwise you wont get the performance that Snowflake can deliver and there is little point using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
Hope this helps?
We are trying to store analytics data into RedShift on-the-fly. However single inserts work slowly with RedShift due to storage nature.
One solution is to collect these insert in our application and then to upload them into RedShift as a bulk. This will however require some nasty architectural changes in our app so I'm looking for other approaches.
For example, is there a way to create fast staging table in RedShift - such that it does not use compressed columnar (?) storage and allows speedy inserts, provided that we are not going to put many records into it, and it is merged into main table, say, after inserting each thousand of records?
Unfortunately you are barking up the wrong architectural tree. For quick single event insert/updates you may want to consider capturing your data in Amazon DynamoDB and then pulling that data into Redshift in batch for analysis. Here's a link on to how to load data from DynamoDB into Redshift.
I am in the process of building a Data Warehouse(DW) and I have a question about loading data; I would appreciate if you guys provide your thoughts on this.
I am planning to load all the tables one-to-one in a staging database first and then load the data into the DW from the staging database. I thought about hitting the OLTP system directly(no staging) but am not 100% sure this would be the best approach from a performance perspective.
Let me give you an example: In our OLTP database, we have a view called Customers that I’ll be pulling into our DW. The view on OLTP database is pretty complex and a select statement takes 8 minutes. So If I load this table directly into the DW and do an incremental load, am thinking this would take more time than loading the view into a staging table first. Also, since the load is going to take time, the DW availability would also be affected as the data won’t be available to users for querying.
What do you guys suggest? Is the staging approach outdated now? I want to understand what the pros and cons are. Thanks in advance for your help
I help maintain a data warehouse and while we don't use a staging database, we do use staging/working/intermediate/whatever_you_want_to_call_it tables.
The gist of what we do is this. We receive the raw data as a series of delimited files. We then do whatever we deem necessary to these files to produce load files. We then populate our working tables from the load files and do whatever we have to do to further prepare the data. Then we populate the real tables from the working tables.
We also do everything as a scheduled job, early in the morning before people come to work, to minimize the liklihood of people trying to query the warehouse while data is being loaded.
I would like to know if there is an inherent flaw with the following way of using a database...
I want to create a reporting system with a web front end, whereby I query a database for the relevant data, and send the results of the query to a new data table using "SELECT INTO". Then the program would make a query from that table to show a "page" of the report. This has the advantage that if there is a lot of data, this can be presented a little at a time to the user as pages. The same data table can be accessed over and over while the user requests different pages of the report. When the web session ends, the tables can be dropped.
I am prepared to program around issues such as tracking the tables and ensuring they are dropped when needed.
I have a vague concern that over a long period of time, the database itself might have some form of maintenance problems, due to having created and dropped so many tables over time. Even day by day, lets say perhaps 1000 such tables are created and dropped.
Does anyone see any cause for concern?
Thanks for any suggestions/concerns.
Before you start implementing your solution consider using SSAS or simply SQL Server with a good model and properly indexed tables. SQL Server, IIS and the OS all perform caching operations that will be hard to beat.
The cause for concern is that you're trying to write code that will try and outperform SQL Server and IIS... This is a classic example of premature optimization. Thousands and thousands of programmer hours have been spent on making sure that SQL Server and IIS are as fast and efficient as possible and it's not likely that your strategy will get better performance.
First of all: +1 to #Paul Sasik's answer.
Now, to answer your question (if you still want to go with your approach).
Possible causes of concern if you use VARBINARY(MAX) columns (from the MSDN)
If you drop a table that contains a VARBINARY(MAX) column with the
FILESTREAM attribute, any data stored in the file system will not be
removed.
If you do decide to go with your approach, I would use global temporary tables. They should get DROPped automatically when there are no more connections using them, but you can still DROP them explicitly.
In your query you can check if they exist or not and create them if they don't exist (any longer).
IF OBJECT_ID('mydb..##temp') IS NULL
-- create temp table and perform your query
this way, you have most of the logic to perform your queries and manage the temporary tables together, which should make it more maintainable. Plus they're built to be created and dropped, so it's quite safe to think SQL Server would not be impacted in any way by creating and dropping a lot of them.
1000 per day should not be a concern if you talk about small tables.
I don't know sql-server, but in Oracle you have the concept of temporary table(small article and another) . The data inserted in this type of table is available only on the current session. when the session ends, the data "disapear". In this case you don't need to drop anything. Every user insert in the same table, and his data is not visible to others. Advantage: less maintenance.
You may check if you have something simmilar in sql-server.
I have a database server with few main databases, and few dozens of small ones.
These small databases are kind of intermediary/staging databases for data import from various sources into main database. Data import is a daily task. They are all quite similar in structure as the implementation of these data imports are similar, so basically they have a configuration tables, which define mapping, conversions etc, and the data tables, which contain the results of the import.
Some time ago there have been only the handful of small ones, but now I have more then 20 of them will grow further with the number of supported data feeds.
I have just migrated all the server environment to SQL Server 2008, and having some time now for clean-up/refactoring, I am thinking to merge all of data-import databases into just one database, and use database schema to separate them.
Question-0: Any other ideas for the described situation?
Question-1: Shall I change from a separate database to a separate schema?
Question-2: !!!: Any tricky thing to be careful about in database schema implementation?
Edit-1: highlighted question-2 as the most 'unanswered' currently.
In your instance, I would probably put merge the databases into one. I don't really see a reason to have them separated, and merging them will reduce the amount of work you have to do to support backups etc. If you were importing data from a data source once and then never using the staging tables again, I could see the reason to bring up separate databases to handle the data transformation. Since you use these tables on an ongoing basis, I would much rather keep them together so that I only have to go to one place to find the full end to end state of the production data and the data load states.
2008 is really good at handling database partitioning too, if the db gets too large, or you need to separate data for security reasons you get the benefit of having a single db with the advantages like having several smaller ones. You won't get that with multiple smaller dbs.
When we migrated we had a very similar situation and I ended up moving everything into one some-what large Importing database like you have hinted towards. We did not, however, separate them using schemas.
Because the database is the unit of referential integrity and backup, if you are bringing in large amounts of data for staging which does not need to be backed up on the same schedule, it might be easiest to keep it in a separate DB.
You can use a single DB with multiple file groups and different backups, but it will require a lot more design.
The basic factors this will depend on are: recovery model, backup objectives, usage patterns and amount of effort to design and maintain your file group design.
All the prior answers work for me, particularly your comment about selectively combining databases -- if some are very busy, very large, or process sensitive data, you might want to keep them separate, or in separate groupings. This would make it easier to configure backups/restores and disk/drive allocation (give the busy ones their own set of spindles).
Like possibly most database developers, I have dealt almost exclusively with objects in the dbo schema, but I have done some recent work with other schemas. The main gotcha I've encountered is remembering to always specify the schema when referring to any database object. Never assume that any given connection will reference an object in the schema you want it to--always be clear and precise!
I would put all your import staging tables in one database separate from your regular production databse as the backup needs may be very different. This database should also contains things like your configuration management for SSIS packages, any logging tables, any import metadata tables (we keep track of every run of the imports and the status of that run as well as a bazillion other things about the import like the filename, the normal file size, etc. Comes in handy for researching problems and for adding checks to the processing. We usea a schema that is by client and then an additional schema for objects realted to the importing/exporting process (logs, meta data etc.)