So I have two machines, and I am trying to connect to the Hive server on one machine from the other. I simply enter
$ hive -h <IP> -p <PORT>
However, it says I need to install Hadoop. I only want to connect remotely, so why would I need Hadoop? Is there any way to bypass this?
The hive program depends on the hadoop program, because it works by reading from HDFS, launching map-reduce jobs, etc. (In Hive, unlike a typical database server, the command-line interface actually does all the query processing, translating it to the underlying implementation; so you don't usually really run a "Hive server" in the way you seem to be expecting.) This doesn't mean that you need to actually install a Hadoop cluster on this machine, but you will need to install the basic software to connect to your Hadoop cluster.
One way to bypass this is to run the Hive JDBC/Thrift server on the box that has the Hadoop infrastructure (that is, to run the hive program with the command-line options that start it as a Hive server on the desired port) and then connect to it using your favorite JDBC-capable SQL client. This more closely approximates the database-server model of typical DBMSes, though it still differs in that it leaves open the possibility of other Hive connections that don't go through this server. (Note: this used to be a bit tricky to set up; I'm not sure if it's easier now than it used to be.)
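For example, on a reasonably recent Hive release (this sketch assumes HiveServer2 and its bundled beeline client are available; port 10000 is the usual default, adjust as needed):
# on the box that has the Hadoop/Hive infrastructure: start the Thrift server
hive --service hiveserver2 &
# on the remote machine: connect over JDBC with beeline
beeline -u jdbc:hive2://<server-ip>:10000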
And this is probably obvious, but for completeness: another way to bypass this restriction is to use ssh, and actually run hive on the box that has the Hadoop infrastructure. :-)
Newer versions of the Hive CLI actually allow connecting to a remote Thrift server; see the beginning of https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli. The remote machine must be running a Hive server for this to work.
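As a rough sketch (this assumes the classic Hive CLI talking to the original Thrift Hive server; the port is only an example and the exact server-side flags may vary by version):
# on the remote machine that has Hadoop: start the Thrift Hive server
hive --service hiveserver -p 10000 &
# on the local machine: point the Hive CLI at it
hive -h <remote-ip> -p 10000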
You don't need your local box to be a part of a Hadoop cluster. However, you may need Hadoop programs/jars for Hive to work. If you install Hive from a standard repository, it should include a Hadoop distribution.
When connecting to a Databricks cluster from a local IDE, I take it that only Spark-related commands are executed remotely (on the cluster). What about single-node operations such as scikit-learn or to_pandas? If these functions only use the local machine, the resource pool would be tiny. How can I also utilize the remote driver for executing single-node operations?
It's not possible, by design of Databricks Connect: with it, the local machine is always the Spark driver, and the worker nodes of the Databricks cluster are used as Spark executors. So all local operations, like .collect, will bring data to your machine and will run locally.
You may want to look at the dbx tool from Databricks Labs. It recently got a dbx sync command that automatically syncs code changes to a Databricks repo, so you can write code in the IDE and run it in a Databricks notebook; in that case it will use the driver of the Databricks cluster. (It won't let you interactively debug code, but at least the code gets executed in the cloud, not on your machine.)
I am trying to run a Spark cluster with some Windows instances on Amazon EC2 infrastructure, but I am facing extremely long deployment times.
My project needs to run on a Windows environment, and therefore I am using an alternative AMI, indicated with the -a flag provided by Spark's spark-ec2 script. When I run the script, the process gets stuck waiting for the instances to be up and running, with the following message:
Waiting for all instances in cluster to enter 'ssh-ready' state.............
When I use the default AMI, instead, the cluster launches normally after a few minutes of waiting.
I have searched for similar problems reported by other users, and so far I have only been able to find this statement about long deployment times with custom AMIs (see Josh Rosen's answer).
I am using version 1.2.0 of Spark. The call that launches the cluster looks something like the following:
./spark-ec2 -k MyKeyPair \
  -i MyKeyPair.pem \
  -s 10 \
  -a ami-905fe9e7 \
  --instance-type=t1.micro \
  --region=eu-west-1 \
  --spark-version=1.2.0 \
  launch MyCluster
The AMI indicated above refers to:
Microsoft Windows Server 2012 R2 Base - ami-905fe9e7
Desc: Microsoft Windows 2012 R2 Standard edition with 64-bit architecture. [English]
Any help or clarification about this issue would be greatly appreciated.
I think I have figured out the problem. It seems Spark does not support the creation of clusters on a Windows environment with its default scripts. I think it is still possible to create a cluster with some manual tweaking, but that goes beyond my limited knowledge. Here is the official post that explains it.
Instead, as a temporary solution, I am considering using a Microsoft Azure cluster; Microsoft has just released an experimental tool that makes it possible to use a variant of Apache Hadoop (Spark) on their HDInsight clusters. Here is the article that explains it in more detail.
I'm new to NoSQL DBs and Apache HBase but I want to learn it.
I was wondering if I can use HBase with just one server, because from what I know so far there are three modes in which HBase can run:
1. Standalone
2. Pseudo-distributed
3. Fully-distributed
So on a single server I'm only able to use standalone or pseudo-distributed mode, but here's the problem: I've found that these two modes are not supposed to be used in a production environment.
The question is: can I use the fully-distributed configuration with a single server, or am I forced to buy more servers in order to run HBase in a fully-distributed production environment?
Thank you so much in advance.
A pseudo-distributed configuration is just a fully-distributed configuration running on a single host. You can find a detailed explanation here: http://hbase.apache.org/book/standalone_dist.html
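As a minimal sketch of what the manual describes (the property name comes from the standard HBase configuration and belongs inside the <configuration> element of conf/hbase-site.xml):
#   <property>
#     <name>hbase.cluster.distributed</name>
#     <value>true</value>
#   </property>
# with that set, even a single host runs the separate daemons of a distributed
# deployment instead of the all-in-one standalone JVM; then start HBase:
bin/start-hbase.sh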
It's up to you whether to run it in production, but it's strongly discouraged. If your scale is that small, perhaps you should consider simpler options that suit your needs (our good old friend the RDBMS, maybe?).
Just wondering: can I deploy OpenERP (Odoo) on Heroku and use Postgres as its DBMS? Has anybody done this before?
Looking forward to a response.
Two years late, but it is now possible. Shameless plug:
https://github.com/odooku/odooku
As sepulchered said, file storage is one of the first problems.
This can be solved by using S3 as a fallback, in combination with a big /tmp cache on Heroku.
Second problem: database permissions. For now I've patched Odoo to work with a single database. You can also use AWS RDS with Heroku, which completely solves the single-database problem.
Third problem: long polling runs on a secondary port. However, Odoo can be run in "gevent mode", which is also currently being patched for better compatibility with Heroku's timeouts.
Fourth problem: Heroku's Python buildpack is insufficient for compiling Odoo's dependencies. This was easily fixed with a custom buildpack.
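For reference, pointing a Heroku app at a custom buildpack only takes a couple of commands; the URL below is a placeholder, use whichever buildpack the odooku project recommends.
# replace the default Python buildpack with a custom one
heroku buildpacks:clear
heroku buildpacks:set https://github.com/your-org/heroku-buildpack-odoo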
Hope this helps anyone in the future.
Well, actually no, but maybe.
Here is why:
1. OpenERP requires access to the filesystem, and Heroku (as far as I know) doesn't provide persistent file storage.
2. The PostgreSQL add-on for a Heroku application doesn't give you the ability to create databases (and OpenERP creates one database for each company instance).
But I think that you can install it on Heroku by collecting its dependencies in a requirements.txt and providing that.
Then you'll have to do something about file storage; I think it's possible to add a feature to OpenERP (as it's open source) for storing files on a remote server (cloud storage, etc.).
And lastly you'll have to provide a PostgreSQL server with permission to create databases (I think there are cloud solutions for this).
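A minimal sketch of the Heroku side, assuming a standard Python app with a requirements.txt and Procfile already in the repository (the app name and add-on plan below are just examples):
# create the app and attach a PostgreSQL add-on
heroku create my-openerp-app
heroku addons:create heroku-postgresql:hobby-dev
# deploy; Heroku's Python buildpack installs whatever requirements.txt lists
git push heroku master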
PS: OpenERP is not intended to be installed on cloud platforms; the easiest deployment is on some sort of server (e.g. a VPS) where you can control the filesystem and the database server.
Hope it helps somehow.
I have a Python/Redis service running on my desk that I want to move to my Blue Domino-hosted site. I've got Python available on the server, but not Redis. They don't give me root access to my Debian VM, so I can't fetch, extract, and install it myself from a Unix prompt.
Their tech support might do the install for me, but they need me to point them to the server requirements, which I don't see on the Redis download page.
I could probably FTP binaries to the site if they were available, but that's dicey.
Has anyone dealt with this?
Installing Redis from source is actually quite easy. It doesn't have any dependencies, so just download the tarball, unpack it, and follow the install instructions. I'm always wary of doing that sort of thing, but with Redis it really was a breeze. If you don't dare to do it yourself, their tech support should be able to.
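A minimal sketch of that build, assuming the stable tarball from redis.io and a box with gcc and make; nothing here needs root if you run the binaries straight out of the build tree:
wget http://download.redis.io/redis-stable.tar.gz
tar xzf redis-stable.tar.gz
cd redis-stable
make
# the server and client binaries end up under src/
src/redis-server --version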
If it is an Intel/AMD server, you can compile Redis somewhere else (a 32-bit build, for example) and upload the binary. Then start it from Python. I did this myself a couple of weeks ago.
For the port you will need to use something above 1024 (a non-privileged port); I don't recommend using the default port. Remember to change the loglevel too. Daemonizing works fine as non-root as well.
Some servers block all external ports, so you will not be able to connect to Redis from outside, but this is a problem only if you connect from a different machine. For the same machine it should be OK, since the connection is "internal".
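A minimal sketch of starting it that way, assuming Redis was built from source as in the previous answer; the port number is just an example of a non-privileged, non-default port:
# run as an ordinary user, off the default port, quietly, in the background
src/redis-server --port 16379 --loglevel warning --daemonize yes
# quick check that it is answering on the same machine
src/redis-cli -p 16379 ping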
However, I am unsure how the hosting administrator will react when they see the process running :) I personally would kill it immediately.
There is another option as well: check out a service like Redis4you.com. But their free account is small; you will probably need to spend some money for more RAM.
Is your hosting provider looking for a minimum set of system requirements for running Redis? That is indeed not listed on the Redis website, probably because there aren't many exotic requirements, and it also depends a lot on your use case. Basically, what you need to run Redis is:
Operating system: Unix-like; Linux is recommended (one reason to favor Linux I've heard of is the performance of its TCP/IP stack).
Tools: GCC, make, (git).
Memory: lots (no, seriously, this depends on your use case, but because Redis keeps everything in memory you need at least more RAM than the size of your dataset).
Disk: disk access for making snapshots.
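A quick way to sanity-check those requirements on the target box, assuming a typical Linux environment:
# compiler toolchain available?
gcc --version
make --version
# enough free RAM for the dataset, and some disk space for snapshots?
free -m
df -h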
The problem seems to be tied to something non-traditional about my Blue Domino hosting. Since this project is a new venture, I think the best course for me is to rent a small Linux VM from Rackspace and forget about the BD hosting.