I've configured Hortonworks HDP with Ambari services, and later added HAWQ and PXF. Through some research I've seen that it is possible to query data stored in Hive through HCatalog, and since I have already loaded the dataset into Hive this would ease the work. However, I am doing some benchmarking, so can someone tell me whether using HCatalog affects HAWQ in terms of performance?
When HAWQ accesses PXF tables using the HCatalog integration, it determines the format of the underlying table (and even partition) and uses a profile optimized for that particular format, so there should be no performance degradation.
To add to Oleksandr's point, when HAWQ queries HCatalog, the Hive catalog metadata is stored only in memory, not on disk within HAWQ, so there is no contention with native HAWQ tables. That said, external HCatalog queries won't be as performant as native HAWQ queries.
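For illustration, a minimal sketch of how such a query might look; "default" and "sales" are hypothetical Hive database and table names, resolved through HAWQ's reserved hcatalog schema:

    -- Sketch only: "default" and "sales" are hypothetical Hive database/table names.
    -- HAWQ resolves them through the reserved hcatalog schema via PXF.
    SELECT COUNT(*)
    FROM hcatalog.default.sales;

The same data could also be registered explicitly as a PXF external table, but the hcatalog path avoids maintaining a separate DDL definition for a table that already exists in Hive.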
I am facing a bottleneck at the metastore (MariaDB) level. While analyzing the metastore queries generated by the Hive metastore service, I found that the same get-table queries are being fired multiple times.
Is there any caching feature available at metastore layer?
What are the pros and cons of Hive external and managed tables?
We want to do updates and inserts in Hive tables but wonder which approach to take: managed tables, or a workaround of refreshing external tables after manual file updates, especially after adding many files over time. Will one approach or the other become too slow (e.g. too many files or too many updates to track via the metastore, so the master node becomes slow)?
Thanks.
There are a number of limitations on doing DML in Hive. Please read the documentation for more details: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It is generally recommended not to use DML on Hive managed tables, especially if the data volume is huge; if the table grows in size over time, these operations will become too slow. These operations are considerably faster when done on a partition/bucket instead of the full table. Nevertheless, it is better to handle the edits in the files and do a full refresh via an external table, and only use DML on managed tables as a last resort.
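As a rough sketch of the two approaches (table names, columns and locations below are made up): a transactional managed table that permits DML, versus an external table that is refreshed after files are edited in place:

    -- Managed ACID table: DML works, but is subject to the limitations in the
    -- Hive transactions documentation linked above (ORC format, compaction, etc.).
    CREATE TABLE sales_managed (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    UPDATE sales_managed SET amount = 0 WHERE id = 42;

    -- External table: edit or replace files under the location, then refresh metadata.
    CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    STORED AS PARQUET
    LOCATION '/data/sales';

    MSCK REPAIR TABLE sales_external;  -- pick up newly added partitions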
I am designing my database structure and wondering whether it is possible to run a single query against two separate Redshift clusters.
If so, are there any limitations on region, Availability Zones, VPC groups, etc.?
No, it's not possible in Redshift directly. Additionally, you cannot query across multiple databases on the same cluster.
UPDATE: Redshift announced a preview of cross-database queries on 2020-10-15: https://docs.aws.amazon.com/redshift/latest/dg/cross-database-overview.html
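With that preview feature, a query in one database can reference another database on the same cluster using three-part names, roughly like this (the database, schema and table names are placeholders):

    -- Sketch of Redshift cross-database syntax: database.schema.table
    -- "other_db", "public" and "orders" are placeholder names.
    SELECT o.order_id, o.total
    FROM other_db.public.orders AS o
    WHERE o.total > 100;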
You could use an external tool such as Amazon Athena or Presto running on an EMR cluster to do this. You would define each Redshift cluster as an external data source. Be careful, though: you will lose most of Redshift's performance optimizations, and a lot of data will have to be pulled back into Athena / Presto to answer your queries.
As an alternative to cross-cluster queries, consider placing your data onto S3 in well partitioned Parquet or ORC files and using Redshift Spectrum (or Amazon Athena) to query them. This approach allows multiple clusters to query a common data set while maintaining good query performance. https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
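A hedged sketch of what that looks like with Redshift Spectrum; the Glue database name, IAM role ARN and table names below are placeholders:

    -- Register an external schema backed by the AWS Glue Data Catalog.
    -- 'spectrum_db' and the IAM role ARN are placeholders.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Any cluster that defines a similar external schema can query the same S3 data.
    SELECT event_date, COUNT(*)
    FROM spectrum.events
    GROUP BY event_date;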
Using federated queries in Amazon Redshift, a second cluster's tables can be accessed as an external schema.
You can refer to the documentation: https://docs.aws.amazon.com/redshift/latest/dg/federated_query_example.html
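A sketch loosely modeled on the linked documentation example; the endpoint, database name, IAM role and Secrets Manager ARN are all placeholders, and the URI would point at the PostgreSQL-compatible endpoint you want to expose:

    -- Modeled on the linked documentation example; all identifiers and ARNs
    -- below are placeholders.
    CREATE EXTERNAL SCHEMA remote_pg
    FROM POSTGRES
    DATABASE 'remote_db' SCHEMA 'public'
    URI 'remote-endpoint.example.com' PORT 5439
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret';

    -- Tables behind the external schema can then be joined with local tables.
    SELECT r.id, l.name
    FROM remote_pg.customers AS r
    JOIN local_customers AS l ON l.id = r.id;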
Looking at
https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration
I'm wondering why there's no support for cell-level visibility. Any thoughts?
It's because Hive works with Accumulo by creating a Hive table based on an existing Accumulo table, allowing you to perform Hive queries on that data.
Unfortunately, Accumulo's cell-level security relies pretty heavily on the way Accumulo tables are structured and how scans are performed. Mapping it onto a Hive table is just really impractical in a lot of ways. Instead, Hive tables are created from Accumulo data by performing a scan as an Accumulo user; whatever data is visible to that user will appear in the Hive table, with no further security checks.
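For context, the mapping described on the wiki page looks roughly like this (the table, column family and qualifier names here are made up). Note there is nowhere in the definition to express per-cell visibility labels; the scan simply runs with the connecting user's authorizations:

    -- Sketch of a Hive table backed by an existing Accumulo table; names are
    -- hypothetical. The scan runs as the configured Accumulo user, so only data
    -- visible to that user ends up in Hive, with no per-cell checks afterwards.
    CREATE EXTERNAL TABLE accumulo_users (rowid STRING, name STRING, age INT)
    STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
    WITH SERDEPROPERTIES (
      'accumulo.columns.mapping' = ':rowid,person:name,person:age'
    )
    TBLPROPERTIES ('accumulo.table.name' = 'users');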
Ultimately, if Accumulo's cell-level visibility feature is an important part of your application, using Hive to make queries easier might not be the best idea.
I know the question is a little bit strange. I love Hadoop and HDFS, but recently I've been working on Spark SQL with the Hive metastore.
I want to use Spark SQL as a vertical SQL engine to run OLAP queries across different data sources like RDBs, Mongo, Elastic ... without an ETL process. I then register the different schemas as external tables in the metastore with the corresponding Hive storage handlers.
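For example, one such registration might look roughly like this; the Elasticsearch storage handler from the es-hadoop project, the index name, node address and columns are just an illustration:

    -- Registered in the Hive metastore; Spark SQL can then query it without HDFS
    -- holding any of the data. Index and host names are placeholders.
    CREATE EXTERNAL TABLE logs_es (ts TIMESTAMP, level STRING, message STRING)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
      'es.resource' = 'logs/doc',
      'es.nodes'    = 'es-host:9200'
    );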
Moreover, HDFS is not used as a data source in my work, and MapReduce (M/R) has already been replaced by the Spark engine. That sounds to me as if Hadoop/HDFS is useless except as a base for installing Hive. I don't want to buy into all of them.
I wonder what kind of issues would happen if I only start the Hive metastore service, without Hadoop/HDFS, to support Spark SQL. Would I be putting myself into the jungle?
What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" in the page).
Also this may help.
This configuration is only suggested if you are experimenting locally. But in this case you only need the metastore.
Also, from here:
Spark SQL uses a Hive metastore even when we don't configure it to. When not configured, it uses a default Derby DB as the metastore.
So this seems to be quite legal:
Arrange your metastore in Hive
Start Hive in local mode
Make Spark use the Hive metastore
Use Spark as an SQL engine for all data sources supported by Hive.