TiDB as metastore for Hive

Is anybody using TiDB as the metastore for their Hive installation?
We're looking for experiences and testimonials, as we are evaluating it to solve our current metastore bottleneck with MySQL.

This was published a while ago: https://www.pingcap.com/blog/create-scale-out-hive-cluster-with-distributed-mysql-compatible-database/
A search also found this: https://dzone.com/articles/cluster-diagnostics-troubleshoot-cluster-issues-us-2
I think TiDB can provide high availability, online upgrades, and scalability, all of which would be good for Hive.
Disclaimer: I'm working for PingCAP, the company behind TiDB.
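Since TiDB speaks the MySQL wire protocol, pointing the metastore at it should, in principle, only require a JDBC connection-string change. A minimal hive-site.xml sketch, assuming a hypothetical TiDB endpoint tidb.example.com listening on TiDB's default SQL port 4000:

    <!-- Hypothetical hive-site.xml fragment: the metastore talks to TiDB
         through the standard MySQL JDBC driver, just on port 4000. -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://tidb.example.com:4000/hive_metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive_password</value>
    </property>

As with a plain MySQL backend, you would still initialize the metastore tables once with Hive's schematool (schematool -dbType mysql -initSchema).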

Related

Hive queries taking so long

I have a CDP environment running Hive. For some reason some queries run quickly while others take more than 5 minutes, even a plain "select current_timestamp" or the like.
I can see that my cluster usage is quite low, so I don't understand why this is happening.
How can I use my cluster fully? I've read some posts on the Cloudera website, but they haven't helped much; after all the tuning, everything behaves the same.
Something to note is that I see the following message in the Hive logs:
"Get Query Coordinator (AM) 350"
and then the reported execution time for the query itself is quite low.
I am using Tez. Any idea what I can look at?
Besides taking care of the overall tuning: https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279
please check my answer to this same issue here: Enable hive parallel processing.
That post explains what you need to do to enable parallel processing; a short sketch of the relevant settings follows below.
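For reference, "parallel processing" here refers to Hive's standard settings for letting independent stages of a query run concurrently. A minimal sketch of what that boils down to, as session-level settings (the same properties can also go in hive-site.xml):

    -- Allow independent stages of a query plan to run in parallel
    SET hive.exec.parallel=true;
    -- Maximum number of stages that may run at once (8 is the usual default)
    SET hive.exec.parallel.thread.number=8;

Separately, since your "Get Query Coordinator (AM)" log line suggests time is being spent waiting for a Tez application master rather than executing the query, pre-initialized HiveServer2 Tez sessions (hive.server2.tez.initialize.default.sessions together with hive.server2.tez.sessions.per.default.queue) may also be worth a look; that is an assumption based on the symptom, not on the linked post.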

Setting up environment for Hadoop datawarehousing (Hive)

I am new to Hadoop and trying to learn it from the data warehousing and analytics angle.
Can someone advise me on how to set up my practice machines, especially with regard to:
1. Number of machines/nodes required to start learning
2. Is it advisable to set up on Windows?
3. What software needs to be installed
4. Availability of test/sample data
I would also like advice on the best way to perform BI tasks with Hive.
Thank you.
I would suggest downloading the Cloudera VM if you are more interested in the Hadoop machinery itself. Another way to jump-start immediately is to use Amazon EMR (Elastic MapReduce): there is an option to create an interactive Hive cluster there and start playing with datasets stored in S3.
Regarding the number of nodes: it depends on your goals. If you want to get a feel for Hadoop performance, try at least 4-6 nodes.
Both approaches listed above are good if you do not have access to an organization's internal Hadoop/Hive cluster; even if you do, I would suggest trying them first to gain some hands-on experience before using a shared environment. A small example of querying S3 data from an EMR Hive cluster follows below.
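Once an interactive Hive cluster is up on EMR, getting started with S3-hosted sample data is just a matter of declaring an external table over it. A minimal sketch, assuming a hypothetical bucket containing comma-delimited files:

    -- Hypothetical example: expose CSV files already sitting in S3 as a Hive table
    CREATE EXTERNAL TABLE page_views (
      user_id   BIGINT,
      url       STRING,
      view_time STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-sample-bucket/page_views/';

    -- Then query it like any other table
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;

Nothing is loaded anywhere: the table is only metadata and the files stay in S3, which makes this a cheap way to practice HiveQL against real datasets.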

how to represent hive data for mahout's recommendations?

I am a newbie in Hadoop and big data analysis. I am reading the book "Mahout in Action".
It has a topic that explains how to supply a recommender with data from a database; the book shows a programmatic approach to connecting MySQL with Mahout.
My question is: is it possible to connect Hive with Mahout the way we connect MySQL? If yes, then how?
These are two quite different things: you're talking about using a non-distributed recommender with MySQL, versus a Hadoop-based distributed recommender with the output of Hive on HDFS. If you're using the non-distributed recommenders, the simplest thing is to get your Hive output as simple CSV data on HDFS and use that as input.
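To make that concrete: you could export the ratings from Hive as comma-separated text (on newer Hive, INSERT OVERWRITE DIRECTORY with ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; on older versions, a SELECT that concatenates the columns with CONCAT_WS), copy the result out of HDFS, and hand the file to Mahout's FileDataModel, which expects one userID,itemID,preference line per rating. A sketch in the style of the book's non-distributed examples; the file path and user ID are hypothetical:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class HiveCsvRecommender {
      public static void main(String[] args) throws Exception {
        // ratings.csv: "userID,itemID,preference" lines exported from Hive
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 recommendations for user 42
        List<RecommendedItem> items = recommender.recommend(42, 5);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

The point is that once the data is a flat preferences file, Mahout no longer cares whether it originally came from MySQL or from Hive.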

Hadoop and MS SQL Server Best Practices

I've been following Hadoop for a while; it seems like a great technology. Map/Reduce and clustering are just good stuff. But I haven't found any articles on using Hadoop with SQL Server.
Let's say I have a huge claims table (600 million rows) and I want to take advantage of Hadoop. I was thinking, and correct me if I'm wrong, that I could query my table, extract all of my data, and insert it into Hadoop in chunks of some format (XML, JSON, CSV). Then I could take advantage of Map/Reduce and clustering with at least 6 machines, and leave my SQL Server free for other tasks. I'm just throwing a bone here; I just want to know if anybody has done such a thing.
Importing and exporting data to and from a relational database is a very common use case for Hadoop. Take a look at Cloudera's Sqoop utility, which will aid you in this process:
http://incubator.apache.org/projects/sqoop.html
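Sqoop runs the extraction itself as a Map/Reduce job, so a 600-million-row table is pulled over in parallel slices rather than through one enormous query. A hedged sketch of what the import might look like against a SQL Server source; the host, database, table, and column names are hypothetical, and you would need Microsoft's JDBC driver on Sqoop's classpath:

    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=insurance" \
      --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
      --username etl_user -P \
      --table claims \
      --split-by claim_id \
      --num-mappers 6 \
      --target-dir /data/claims \
      --as-textfile

Each mapper pulls a disjoint claim_id range and writes delimited files under /data/claims, which your Map/Reduce jobs (or Hive tables) can then consume directly.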

Katta in production environment

According to the website, Katta is scalable, failure-tolerant, distributed, indexed data storage.
I would like to know if it is ready to be deployed in a production environment. Is anyone already using it who has advice? Any pitfalls? Recommendations? Testimonials? Please share.
Any answer would be greatly appreciated.
We have tried Katta and, for what it's worth, found it very stable and relatively easy to manage (compared to managing plain vanilla Lucene).
The only pitfall I can think of is the lack of real-time updates: when we tested it (about 9-10 months back), an update meant rebuilding the index in a separate process (a Hadoop job or what have you) and swapping in the new live index, which was a deal-breaker for us.
If you are looking into distributed Lucene, you should really try out ElasticSearch or Solandra.