Looking at
https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration
I'm wondering why there's no support for cell level visibility. Any thoughts?
It's because Hive works with Accumulo by creating a Hive table based on an existing Accumulo table, allowing you to perform Hive queries on that data.
Unfortunately, Accumulo's cell level security relies heavily on the way Accumulo tables are structured and how scans are performed, and mapping it to a Hive table is impractical in a lot of ways. Instead, Hive tables are created from Accumulo data by performing a scan as a particular Accumulo user. Whatever data is visible to that user will appear in the Hive table, with no further security checks.
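For context, here is a minimal sketch of what the mapping looks like (the table and column names are made up, and the DDL loosely follows the wiki page linked in the question). The scan behind this table runs with the authorizations of whatever Accumulo user Hive is configured to connect as (via properties such as accumulo.user.name), which is where the visibility filtering happens:

```sql
-- Hedged sketch: a Hive table backed by an existing Accumulo table.
-- Rows are fetched by scanning as the configured Accumulo user, so only
-- cells visible to that user's authorizations ever reach Hive.
CREATE EXTERNAL TABLE accumulo_events (
  rowid STRING,
  name  STRING,
  value STRING
)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
  'accumulo.columns.mapping' = ':rowid,meta:name,meta:value'
)
TBLPROPERTIES (
  'accumulo.table.name' = 'events'
);
```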
Ultimately, if the cell level visibility feature of Accumulo is an important part of your application, using Hive to make queries easier might not be the best idea.
I have daily MySQL DB snapshots stored on S3. Each daily snapshot is a backup of the 1000 tables in our DB, taken with mysqldump; it is about 300M per day (we keep 1 year of snapshots, which is about 110G).
Now we want to load these snapshots into Snowflake daily for reporting purposes. How do we create tables in Snowflake? Shall we create 1000 tables? Will Snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now, you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce your metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics, or which will be accessed by BI tools (including direct SQL), I would hold in Snowflake - otherwise you won't get the performance that Snowflake can deliver and there is little point in using Snowflake; you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
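If you do go this route, the load itself is straightforward: stage the files and COPY them in. A minimal sketch with made-up bucket, table and format names, assuming the daily dumps are exported as CSV (raw mysqldump .sql files would need converting first):

```sql
-- Hedged sketch: assumes one CSV file per source table per day under
-- s3://my-bucket/snapshots/<date>/.
CREATE OR REPLACE STAGE mysql_snapshots
  URL = 's3://my-bucket/snapshots/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- Load one day's extract of one source table into its Snowflake counterpart.
COPY INTO orders
  FROM @mysql_snapshots/2021-01-01/orders.csv
  FILE_FORMAT = (FORMAT_NAME = 'csv_format');
```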
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
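For that external-table route, a minimal sketch over the same (made-up) stage as above, this time assuming a Parquet export (a CSV layout works the same way, with value:c1-style column references):

```sql
-- Hedged sketch: an external table over the S3 files, queried in place.
CREATE OR REPLACE EXTERNAL TABLE orders_ext
  WITH LOCATION = @mysql_snapshots/orders/
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = PARQUET);

-- Each row is exposed as a VARIANT column called VALUE; pull out fields like this:
SELECT value:order_id::NUMBER AS order_id,
       value:amount::NUMBER   AS amount
FROM orders_ext;
```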
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc. (a simple SQL sketch follows the list below).
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case, but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as to move your data around
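Whichever of these drives the process, the calculation itself usually boils down to SQL running inside Snowflake. A minimal sketch with made-up table and column names:

```sql
-- Hedged sketch: recompute yesterday's figures and append them to a metrics table.
INSERT INTO daily_revenue (report_date, total_revenue, order_count)
SELECT
    CAST(order_ts AS DATE) AS report_date,
    SUM(amount)            AS total_revenue,
    COUNT(*)               AS order_count
FROM orders
WHERE CAST(order_ts AS DATE) = CURRENT_DATE - 1
GROUP BY CAST(order_ts AS DATE);
```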
Hope this helps?
I have huge data from different DB sources (Oracle, Mongo, Cassandra) and also eventing data available in Kafka. We use Tableau for analytics and are facing performance issues with this much data. So, we are planning to store the data in some other way and still use Tableau for visualization. We have multiple options now and need some help to finalize the approach.
Option 1:-
Read DB data and store them in Parquet file and then expose it over Spark SQL or HiveQL or Presto SQL and let Tableau connect to this SQL.
Option 2:-
Read DB data and store them in Parquet file in S3 and then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:-
Read DB data and store them in Parquet file in S3 and then move to Redshift for analytics and let Tableau connect to Redshift.
Not sure if any of the above approaches will be a good solution for streaming data (Kafka) analytics as well.
Note:- I have multiple big tables and need joins between them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore which stores the table definitions. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB-scale data). The main difference is the pricing model (Athena is priced per data processed per request and is serverless, whereas Spark may imply infrastructure cost).
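In either case the table definition is just metadata over your files. A minimal sketch of what an Athena (or Hive) external table over the Parquet files might look like, with made-up names:

```sql
-- Hedged sketch: an external table over Parquet files in S3, queryable from
-- Athena, Hive or Spark SQL through the same metastore/catalog.
CREATE EXTERNAL TABLE analytics.events (
  event_id STRING,
  user_id  STRING,
  event_ts TIMESTAMP,
  amount   DOUBLE
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
```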
However, both options may not perform well, as they are not OLAP systems designed for self-service BI (they are better suited to ad hoc queries over huge data).
You may also have trouble managing your data model using flat files with tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
Option 3
Option 3 is better as it is based on Redshift, which is designed for OLAP workloads. You can connect Tableau directly to Redshift, but you'll suffer from latency and you may have trouble managing your cluster load depending on the number of users and/or requests. But it can work the way you describe it.
Then, if you have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it which will avoid performing the same query on Redshift each time a dashboard is opened in Tableau (in that case Redshift also caches queries).
Then, as you need to perform multiple joins, you'll be able to optimize data storage for such queries in Redshift by setting the right distribution and sort keys.
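A minimal sketch of the sort of key choices this refers to, with made-up names - distributing both big tables on the join key so the join stays local to each slice:

```sql
-- Hedged sketch: co-locate two large tables on the join key and sort on the
-- columns most often filtered or joined on.
CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  event_ts TIMESTAMP,
  amount   DOUBLE PRECISION
)
DISTKEY (user_id)
SORTKEY (event_ts);

CREATE TABLE users (
  user_id   BIGINT,
  signup_ts TIMESTAMP,
  country   VARCHAR(64)
)
DISTKEY (user_id)
SORTKEY (user_id);
```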
To conclude, you can also directly access flat files from Redshift using Redshift Spectrum (via Athena/Glue metastore).
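Spectrum only needs an external schema pointing at the Glue/Athena catalog; a minimal sketch with a placeholder database name and IAM role ARN:

```sql
-- Hedged sketch: expose the Glue/Athena catalog to Redshift, then join the
-- S3-backed tables with native Redshift tables in ordinary SQL.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

SELECT u.country, SUM(e.amount) AS total_amount
FROM spectrum.events e
JOIN users u ON u.user_id = e.user_id
GROUP BY u.country;
```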
Documentation:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/
What are the Pros and Cons of hive external and managed tables?
We want to do updates and inserts in Hive tables, but wonder which approach to take (managed tables, or a workaround of refreshing external tables after manual file updates), especially after adding many files over time. Will one approach or the other become too slow (e.g. too many files/too many updates to track via the metastore, so the master node becomes slow)?
Thanks.
There are a number of limitations on doing DML in Hive. Please read the documentation link for more details - https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's generally recommended not to use DML on Hive managed tables; especially if the data volume is huge, or the table grows in size over time, these operations become too slow. They are considerably faster if done on a partition/bucket instead of the full table, though. Nevertheless, it is better to handle the edits in the files and do a full refresh via an external table, and only use DML on managed tables as a last resort.
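For reference, a minimal sketch of the two shapes being compared, with made-up names. ACID DML needs a transactional managed table stored as ORC, whereas the external-table route only tracks metadata over files you maintain yourself:

```sql
-- Hedged sketch: managed, transactional table - supports INSERT/UPDATE/DELETE, must be ORC.
CREATE TABLE sales_managed (
  id     BIGINT,
  amount DOUBLE,
  status STRING
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- External table: you replace or append the files yourself and refresh the partitions.
CREATE EXTERNAL TABLE sales_external (
  id     BIGINT,
  amount DOUBLE,
  status STRING
)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION '/data/sales/';

MSCK REPAIR TABLE sales_external;  -- pick up newly added partition directories
```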
Can one Hive instance store different tables across HDFS clusters, and then run HiveQL on those tables?
My use case is that I have one Hive table on one HDFS cluster. I want to do some processing on it with HiveQL and have the output written to another HDFS cluster. I wish to achieve this directly with Hive alone, without running through some dump/copy/import process. So is that possible? I don't really think it is, however, I noticed a design page at:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27837073
In it, it says:
"Note that, even today, different partitions/tables can span multiple dfs's, and hive does not enforce any restrictions. Those dfs's can be in different data centers also"
Apart from this, I failed to google anything related.
Does anyone have any ideas on this? Thanks.
There are multiple ways to handle this. You can go with mirroring (using tools like Apache Falcon); in this case you have the data stored in both clusters. If you want to query across clusters without mirroring, then use a tool like Apache Drill, which can join data from different datasources - it currently supports Hive, Mongo, JSON, Kudu, etc.
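If you want to try the approach the wiki note in your question hints at, one hedged sketch (namenode hostnames, ports and paths are placeholders, and whether this performs acceptably across clusters or data centers depends on your setup) is to point the output table's LOCATION at the second cluster and write into it with plain HiveQL:

```sql
-- Hedged sketch: the output table's data lives on the other HDFS cluster.
CREATE EXTERNAL TABLE results (
  key   STRING,
  total BIGINT
)
STORED AS ORC
LOCATION 'hdfs://namenode-cluster2:8020/warehouse/results';

-- Reads from the local cluster, writes to the remote one.
INSERT OVERWRITE TABLE results
SELECT key, COUNT(*) AS total
FROM source_table
GROUP BY key;
```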
We are trying to store analytics data into Redshift on the fly. However, single inserts work slowly in Redshift due to the nature of its storage.
One solution is to collect these inserts in our application and then upload them into Redshift in bulk. This would however require some nasty architectural changes in our app, so I'm looking for other approaches.
For example, is there a way to create a fast staging table in Redshift - one that does not use compressed columnar (?) storage and allows speedy inserts, provided that we are not going to put many records into it, and that is merged into the main table, say, after every thousand records or so?
Unfortunately you are barking up the wrong architectural tree. For quick single-event inserts/updates you may want to consider capturing your data in Amazon DynamoDB and then pulling that data into Redshift in batch for analysis. Here's a link on how to load data from DynamoDB into Redshift.
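The batch pull from DynamoDB into Redshift is a single COPY statement; a minimal sketch with made-up table names and a placeholder IAM role ARN:

```sql
-- Hedged sketch: bulk-load events captured in DynamoDB into a Redshift table.
-- READRATIO caps how much of the DynamoDB table's provisioned read throughput
-- the COPY is allowed to consume.
COPY analytics_events
FROM 'dynamodb://events'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
READRATIO 50;
```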