In the book Hadoop: The Definitive Guide, there is a passage about the Hive remote metastore, quoted below:
"Going a step further, there’s another metastore configuration called a remote metastore,
where one or more metastore servers run in separate processes to the Hive service.
This brings better manageability and security because the database tier can be completely
firewalled off, and the clients no longer need the database credentials."
Does anyone know what the above paragraph means? Why can "the database tier be completely
firewalled off," and why do "the clients no longer need the database credentials"?
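For reference, here is roughly what the two configurations look like in hive-site.xml. This is only a sketch; the host names, ports, and credentials are made up. The contrast is that a remote-metastore client needs nothing but a Thrift URI, while a local/embedded-metastore client must carry the JDBC URL and credentials itself, so every client machine needs network access to the database:

```xml
<!-- Remote metastore: the client only knows the Thrift endpoint.
     Only the metastore server holds the JDBC credentials, so the
     database can be firewalled off from everything but that server. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>

<!-- Local/embedded metastore: every client carries the database URL
     and credentials, so every client must be able to reach the DB. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db-host.example.com:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```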
I have an instance of ActiveMQ 5.16.4 running that is using MySQL as a persistent data storage. Recently the MySQL server had some issues, and ActiveMQ lost its connection to MySQL. That caused multiple Spring microservices to throw errors because ActiveMQ wasn't working.
Is it possible to have master/slave ActiveMQ running where the master and slave use separate persistent storage?
I have done some research and found "pure master slave", but it says that it is deprecated, not recommended for use, and will be removed in 5.8. It says to use shared storage, which I am trying to avoid (because my problem is: what if the storage itself is down?).
What are my options to keep ActiveMQ running if it loses its connection to the database?
If you're using ActiveMQ "Classic" (i.e. 5.x) then your only option is to use shared storage between the master and the slave. This could be a shared file system or a relational database. This, of course, is a single point of failure.
However, there are both file system and database technologies that can mitigate this risk. For example you could use a replicated file system (e.g. Ceph or GlusterFS) or a replicated database (e.g. MySQL).
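For reference, a shared-storage master/slave pair in 5.x is just two brokers whose persistence adapters point at the same location; the first broker to acquire the lock becomes the master. A minimal sketch, assuming a shared mount at a made-up path:

```xml
<!-- activemq.xml on each broker (give each a distinct brokerName).
     /mnt/shared should live on a replicated file system (e.g. Ceph
     or GlusterFS) so the storage itself is not a single point of
     failure. -->
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
  <persistenceAdapter>
    <kahaDB directory="/mnt/shared/activemq/kahadb"/>
  </persistenceAdapter>
</broker>
```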
You might also consider using ActiveMQ Artemis (i.e. the next-generation broker from ActiveMQ) which supports replication natively.
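With Artemis, replication is configured in broker.xml rather than through shared storage; the live broker streams its journal to the backup over the network. A minimal sketch (the ha-policy element goes inside the core element; connector and cluster details are omitted):

```xml
<!-- broker.xml on the live broker -->
<ha-policy>
  <replication>
    <master/>
  </replication>
</ha-policy>

<!-- broker.xml on the backup broker -->
<ha-policy>
  <replication>
    <slave/>
  </replication>
</ha-policy>
```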
This is regarding a use case where we are trying to use Redis in PCF (Pivotal Cloud Foundry). In our use case, we will refresh the Redis cache once or twice daily with the required data, and an API will then query Redis and provide the response.
Of particular concern for us is that we want API queries to be served from Redis only, which means Redis has to be available at all times. But whenever we are refreshing the Redis DB, Redis would not be able to serve the APIs, since it is refreshing the keys. To avoid that, we wanted to set up Redis in cluster mode or master-slave mode, so that while one instance is being written to, the other can be read from.
How can we set up a Redis cluster or master-slave mode in PCF and then fulfil our requirement?
Please provide any other suggestions as well that you may have.
At the time I write this, the Redis for Pivotal Platform product does not support clustering. See Availability, in the docs here -> https://docs.pivotal.io/redis/2-3/erc.html#offerings.
All Redis for Pivotal Platform services are single VMs without clustering capabilities. This means that planned maintenance jobs (e.g., upgrades) can result in 2–10 minutes of downtime, depending on the nature of the upgrade. Unplanned downtime (e.g., VM failure) also affects the Redis service.
Redis for Pivotal Platform has been used successfully in enterprise-ready apps that can tolerate downtime. Pre-existing data is not lost during downtime with the default persistence configuration. Successful apps include those where the downtime is passively handled or where the app handles failover logic.
If you require clustered Redis, you'd need to look at a different offering. Redis Labs has some offerings that integrate with PCF, you could use a Cloud Provider's Redis offering, or you could host your own.
If the solution you use isn't integrated into PCF, you can create a user-provided service with cf cups and provide the Redis credentials to your application that way. It will function just like a Redis service instance created through the marketplace.
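For example, a minimal sketch with made-up service names and credentials (your app then reads them from VCAP_SERVICES, exactly as it would for a marketplace instance):

```
cf create-user-provided-service my-redis -p '{"host":"redis.example.com","port":"6379","password":"secret"}'
cf bind-service my-app my-redis
cf restage my-app
```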
I am dealing with the infrastructure for a new project. It is a standard Laravel stack: PHP, a SQL database, and Nginx. For the PHP + Nginx part, we are using a Kubernetes cluster, so scaling and blue/green deployments are taken care of.
When it comes to the database, I am a bit unsure. We don't want to run SQL on Kubernetes, so the current idea is to go for the Google Cloud SQL managed service (are competitors better for blue/green deployments of SQL?). The question is: can it sync the data between the old and new versions of the database nodes?
Let's say that we have 3 active Pods and at least 2 active database nodes (and a load balancer).
So the standard deployment should look like this:
A Pod with the new code is created.
A new database node is created with the current data.
The new Pod gets new environment variables to connect to the new database (sketched below).
Database migrations are run on the new database node.
A health check for the new Pod is run; if it passes, the Pod starts to receive traffic.
One of the old Pods is taken offline.
This iteration repeats until all of the Pods and database nodes have been replaced.
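To make step 3 concrete, the per-iteration database wiring I have in mind would look roughly like this (all names, images, and addresses are made up):

```yaml
# Hypothetical Deployment fragment for the NEW Pods; the only thing that
# changes per blue/green iteration is the image tag and the DB_HOST value.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: laravel-app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: laravel
      slot: green
  template:
    metadata:
      labels:
        app: laravel
        slot: green
    spec:
      containers:
        - name: php
          image: registry.example.com/laravel-app:v2   # the new code
          env:
            - name: DB_HOST
              value: "10.0.0.42"                       # the NEW database node
```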
The question is: can this work with the database? Let's imagine there is a user on the website who is writing data through the last OLD database node; when they are switched to a NEW database node, that data is simply not there until the last database node is upgraded. Can the nodes be synced behind the scenes? Does the Google Cloud SQL managed service provide that?
Or is there a completely different and better solution to this problem?
Thank you!
I'm not 100% sure if this is what you are looking for, but to my understanding, Cloud SQL replicas would be a better solution. You can have read replicas [1], which are copies of the master instance and come with different options [2]:
A read replica is a copy of the master that reflects changes to the master instance in almost real time. You create a replica to offload read requests or analytics traffic from the master. You can create multiple read replicas for a single master instance.
or a failover replica [3], so that if the master goes down, the data continues to be available:
If an instance configured for high availability experiences an outage or becomes unresponsive, Cloud SQL automatically fails over to the failover replica, and your data continues to be available to clients. This is called a failover.
You can combine these if you need to.
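If it helps, both can be set up from the gcloud CLI; the instance names here are placeholders:

```
# Create a read replica of an existing primary instance
gcloud sql instances create my-replica --master-instance-name=my-primary

# Enable high availability on the primary (current Cloud SQL versions use
# a regional standby rather than a legacy failover replica)
gcloud sql instances patch my-primary --availability-type=REGIONAL
```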
Why does Apache Hive need Apache Thrift? Thrift's site says that it can compile to multiple languages, but I can't understand where it fits in and why Hive needs it.
Thanks
Cited from Safari Books Online:
Chapter 16. Hive Thrift Service
Hive has an optional component known as HiveServer or HiveThrift that
allows access to Hive over a single port. Thrift is a software
framework for scalable cross-language services development. See
http://thrift.apache.org/ for more details. Thrift allows clients
using languages including Java, C++, Ruby, and many others, to
programmatically access Hive remotely.
The CLI is the most common way to access Hive. However, the design of
the CLI can make it difficult to use programmatically. The CLI is a
fat client; it requires a local copy of all the Hive components and
configuration as well as a copy of a Hadoop client and its
configuration. Additionally, it works as an HDFS client, a MapReduce
client, and a JDBC client (to access the metastore). Even with the
proper client installation, having all of the correct network access
can be difficult, especially across subnets or datacenters.
Couldn't have said it better. Emphasis mine.
https://cwiki.apache.org/confluence/display/Hive/HiveServer
HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache Thrift™ (http://thrift.apache.org/), therefore it is sometimes called the Thrift server although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift.
For more details on how to connect to HiveServer (the Thrift server), see the link above.
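To make "programmatically access Hive remotely" concrete, here is a minimal sketch of a Java client talking to HiveServer2 over its Thrift-based JDBC driver. The host, user, and table name are assumptions, and you need hive-jdbc and its dependencies on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 speaks Thrift on port 10000 by default; the client
        // needs only the JDBC driver, not a full Hive/Hadoop installation.
        String url = "jdbc:hive2://hive-host.example.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```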
Currently I have the following setup:
A hardware load balancer directing traffic to two physical servers, each with two instances of WebLogic running.
This works OK. I'd like to be able to shut down one of the servers without dropping active sessions. Right now, if I shut down one of the physical servers, any traffic that was going there gets bounced back to a login screen.
I'm looking for the simplest way of accomplishing this with the smallest performance hit.
Things I've considered so far:
1. See if I can somehow store the session information on the load balancer and, through some load balancer magic, have it notice a server is dead and try another one with the same session information (not sure this is possible).
2. Configure WebLogic clustering. I'm not sure what the performance hit would be. I'm guessing this is what I'll end up with, but I'm still fishing for alternatives.
3. ?
What I currently have is an over-designed DR solution (which was the requirement), but I'd like to move it more in the direction of HA (for the flexibility).
Edit: Also, is it worthwhile to create two clusters and replicate the sessions between them (I was thinking one cluster per site; the sites are close enough)? This would cover the event of one cluster failing.
You could try setting up JDBC session storage, pointing (of course) both instances to the same datasource, without setting up a cluster, but I think the right approach would be setting up a WebLogic cluster.
A nice thing about clustering WebLogic servers is that (from the link above, emphasis mine):
Sessions can be shared across clustered WebLogic Servers. Note that session persistence is no longer a requirement in a WebLogic Cluster. Instead, you can use in-memory replication of state. For more information, see Using WebLogic Server Clusters.
We've got a write-up of this on our blog, http://blog.c2b2.co.uk/2012/10/basic-clustering-with-weblogic-12c-and.html, which provides step-by-step instructions on setting up web session failover in a cluster.
Clusters are not heavyweight, assuming you don't store huge amounts of data in the session, as it will be replicated across the cluster.
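For what it's worth, once the cluster exists, turning on in-memory session replication for a web app is a small change in its weblogic.xml deployment descriptor. A sketch (adjust the namespace to your WebLogic release):

```xml
<!-- weblogic.xml: keep a replica of each HTTP session on a secondary
     server when the app is deployed to a cluster -->
<weblogic-web-app xmlns="http://xmlns.oracle.com/weblogic/weblogic-web-app">
  <session-descriptor>
    <persistent-store-type>replicated_if_clustered</persistent-store-type>
  </session-descriptor>
</weblogic-web-app>
```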