DataStax Enterprise: node vs instance, correct AMI image, why do I need storage

Currently, we are evaluating DataStax Enterprise as our provider of Cassandra and Spark. We are considering deploying a DataStax cluster on AWS.
I have the following questions:
1) In step 1 of the DataStax on EC2 installation manual, I need to choose the correct AMI image. Currently there are 7 of them. Which is the correct one?
(DataStax Auto-Clustering AMI 2.5.1-pv, DataStax Auto-Clustering AMI 2.6.3-1204-pv, DataStax Auto-Clustering AMI 2.6.3-1404-pv....)
2) The moment we launch the cluster, do we pay only for the AWS instances, or also a DataStax Enterprise licensing fee? I know there is a 30-day enterprise free trial, but nowhere in the installation process did I see a step where we can ask for the free trial. Is there some online calculator that we can use to estimate the cost of a cluster on a monthly basis (based on the instance types we create)?
3) In step 3 of the installation process, Configure Instance Details, I am confused by the terms instance and node. What is the difference between them? What happens if I choose:
a) 1 instance, --totalnodes 3 (in the user data)
b) 3 instances, --totalnodes 3
c) 1 instance, --totalnodes 0 --analyticsnodes 3
d) 3 instances, --totalnodes 0 --analyticsnodes 3
4) We are interested in the use case where each of our 3 Cassandra nodes runs Spark. Is the proper user data configuration:
--totalnodes 0 --analyticsnodes 3
Will we then have 0 nodes with only Cassandra, and 3 nodes that have both Cassandra and Spark? What number of instances should we specify then?
5) In step 4 of the installation process, Add Storage, we are asked to add storage to the instance. But why do we need this storage? When choosing an instance type, for example m3.large, I already know that my instance comes with 32 GB of SSD storage, so what is this storage for?
Thank you for your answers. If there is a mailing list to which I can send these questions, I would appreciate it.

Use whichever AMI has the highest version number and the virtualization type you prefer (-pv or -hvm): http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html
You only pay for EC2 usage. DSE is free for testing and development. You do not need to request a trial license. If you want a production license or if you want to become a startup member, contact DataStax.
The AMI will install one "DSE node" per "EC2 instance", so if you want a six-node cluster you need to specify 6 instances. To use your examples:
a) 1 instance, --totalnodes 3 (in the user data)
This won't work: the AMI installs one node per instance, so --totalnodes must match the number of instances.
b) 3 instances, --totalnodes 3
This will give you a three-node Cassandra cluster (running on three instances). You have not specified search or analytics nodes, so by default you will just get Cassandra nodes.
c) 1 instance, --totalnodes 0 --analyticsnodes 3
Won't work. --totalnodes should equal the number of instances, and the number of analytics nodes can't be greater than the total nodes.
d) 3 instances, --totalnodes 0 --analyticsnodes 3
Won't work. The number of analytics nodes can't be greater than the number of total nodes.
If you want a three-node cluster and you want all of them running both Cassandra and Spark use this:
3 instances, --totalnodes 3 --analyticsnodes 3
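For reference, a complete user data line for that cluster might look like the one below (the cluster name is just a placeholder, and --version enterprise selects DSE; check the AMI documentation for the exact options your AMI version supports):
--clustername MyCluster --version enterprise --totalnodes 3 --analyticsnodes 3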
Adding storage is optional, and only possible with certain instance types. You should notice with m3.large that there is a default storage configuration (the instance's own 32 GB SSD) and you can't actually make any changes to it.
Hope this helps!

Related

Redis - is there a way to divide one Redis server instance into 2 sub-Redis servers?

I was wondering if there is a way to divide one Redis server into 2 different sub-servers? Specifically is there a way to have:
Sub-server1:
a 1 (a - key, 1 - value)
b 2
c 3
Sub-server2:
a 4
b 5
c 6
so that I can search for keys in one sub-server (for example use something like "GET a in Sub-server2", and get value 4)?
I am including jedis tag because the final goal is to make that work in java.
Redis already has the notion of databases. A single Redis instance has 16 separate databases by default (numbered 0 to 15).
You can map your idea of a sub-server onto these databases. For example, use database 1 as sub-server1 and database 2 as sub-server2.
Jedis also supports databases. Choose a constructor that has a database parameter, pass the proper database value, and you're done!
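Here is a minimal sketch with Jedis (host and port are placeholders; select() switches the connection to the given database, which is what the database constructor parameter does at connect time):

import redis.clients.jedis.Jedis;

public class SubServerDemo {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379); // placeholder host/port

        jedis.select(1);     // database 1 acts as sub-server1
        jedis.set("a", "1");

        jedis.select(2);     // database 2 acts as sub-server2
        jedis.set("a", "4");

        System.out.println(jedis.get("a")); // prints 4 (still on database 2)

        jedis.close();
    }
}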

Ideal value for Kafka Connect Distributed tasks.max configuration setting?

I am looking to productionize and deploy my Kafka Connect application. However, there are two questions I have about the tasks.max setting, which is required and of high importance, but whose documentation is vague about what value to actually set.
If I have a topic with n partitions that I wish to consume data from and write to some sink (in my case, I am writing to S3), what should I set tasks.max to? Should I set it to n? Should I set it to 2n? Intuitively it seems that I'd want to set the value to n, and that's what I've been doing.
What if I change my Kafka topic and increase the partitions on the topic? Will I have to pause my Kafka connector and increase tasks.max if I set it to n? If I have set a value of 2n, should my connector automatically increase the parallelism it operates with?
In a Kafka Connect sink, the tasks are essentially consumer threads and receive partitions to read from. If you have 10 partitions and tasks.max set to 5, each task will receive 2 partitions to read from and will track the offsets. If you have configured tasks.max to a number above the partition count, Connect will launch a number of tasks equal to the partitions of the topics it's reading.
If you change the partition count of the topic you'll have to relaunch your connect task; if tasks.max is still greater than the partition count, Connect will again start as many tasks as there are partitions.
Edit: just discovered ConnectorContext: https://kafka.apache.org/0100/javadoc/org/apache/kafka/connect/connector/ConnectorContext.html
The connector will have to be written to include this, but it looks like Connect has the ability to reconfigure a connector if there's a topic change (partitions added/removed).
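To make the first point concrete, here is what a distributed S3 sink submission (POSTed to the Connect REST API) might look like for a 10-partition topic with tasks.max = n. The connector name, topic, bucket, and region are hypothetical; the property names are those of Confluent's S3 sink connector:

{
  "name": "my-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "10",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}

With this config each of the 10 tasks reads exactly one partition; raising tasks.max above 10 would not add parallelism.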
We had a problem with the distribution of the workload between the Kafka Connect (5.1.2) instances, caused by tasks.max being higher than the number of partitions.
In our case, there were 10 Kafka Connect tasks and 3 partitions of the topic to be consumed. 3 of those 10 tasks were assigned to the 3 partitions of the topic and the other 7 were not assigned to any partition (which is expected), but Kafka Connect distributed the tasks evenly across instances without considering their workload. So we ended up with a distribution where some instances stayed idle (because all of their tasks were among the 7 unassigned ones) while other instances did more work than the rest.
To work around the issue, we set tasks.max equal to the number of partitions of our topics.
It was really unexpected for us to see that Kafka Connect does not consider task assignments while rebalancing. Also, I couldn't find any detailed documentation for the tasks.max setting.

Migrating from Titan to DataStax Enterprise Graph

I'm migrating from Titan to DataStax. I have a graph with around 50 million nodes composed of Persons, Addresses, Phones, etc.
I want to calculate a Person node's connections (how many persons have the same phone, address, etc.).
In Titan I wrote a Hadoop job that goes over all the person nodes, and then I could write a Gremlin script to see how many persons share the same phone with a particular node.
So as input properties I have:
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseInputFormat
titan.hadoop.input.conf.storage.backend=hbase
As a query filter I query only the person nodes:
titan.hadoop.graph.input.vertex-query-filter=v.query().has('type',Compare.EQUAL,'person')
And to run a script I use:
titan.hadoop.output.conf.script-file=scripts/calculate.groovy
This will calculate, for every node, the number of shared phone connections that the person has:
object.phone_shared = object.as('x').out('person_phones').in('person_phones').except('x').count()
Is there a way to write this kind of script in DataStax to go over the person nodes? I see that DataStax uses Spark analytics to count the nodes, for example:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/graphAnalytics/northwindDemoGraphSnapshot.html
but I didn't find any further documentation on how to run custom scripts using analytics.
Thanks
The answer happens to be on the page you linked. It seems like it might just be a little easier than you are used to with Titan. The key is on step 8 where you configure the Traversal to use the preconfigured OLAP/Analytics TraversalSource, which is named a (for Analytics).
Alias the traversal to the Northwind analytics OLAP traversal source
a. Alias g to the OLAP traversal source for one-off analytic queries:
gremlin> :remote config alias g northwind.a
This basically says:
"When I execute a Traversal on TraversalSource g, I want it to be aliased to northwind.a on the server".
Once you do that, all Traversals of g will be executed using northwind.a and thus the Spark analytics engine.
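For example, once the alias is in place, a whole-graph OLAP version of the shared-phone count from the question might look something like this (the vertex label and edge label come from the question and may differ in your schema; TinkerPop 3 uses where(neq(...)) where Titan's Gremlin 2 used except()):

gremlin> g.V().hasLabel('person').as('x').out('person_phones').in('person_phones').where(neq('x')).count()

Because g is aliased to northwind.a (or your own graph's analytics source), the traversal runs on the Spark analytics engine instead of in real time.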

carbon-relay Replication across Datacenters

I recently "inherited" a Carbon/Graphite setup from a colleague which I have to redesign. The current setup is:
Datacenter 1 (DC1): 2 servers (server-DC1-1 and server-DC1-2) with 1 carbon-relay and 4 carbon caches
Datacenter 2 (DC2): 2 servers (server-DC2-1 and server-DC2-2) with 1 carbon-relay and 4 carbon caches
All 4 carbon-relays are configured with a REPLICATION_FACTOR of 2, consistent hashing, and all 16 carbon-caches (2 DCs * 2 servers * 4 caches) as destinations. This had the effect that some metrics exist only on 1 server (both replicas were probably hashed to caches on the same server). With over 1 million metrics this problem affects about 8% of all metrics.
What I would like to do is a multi-tiered setup with redundancy, so that I mirror all metrics across the datacenters, and inside each datacenter use consistent hashing to distribute the metrics evenly across the 2 servers.
For this I need help with the configuration (mainly) of the relays. Here is a picture of what I have in mind:
The clients would send their data to the tier1relays in their respective Datacenters ("loadbalancing" would occur on client side, so that for example all clients with an even number in the hostname would send to tier1relay-DC1-1 and clients with an odd number would send to tier1relay-DC1-2).
The tier2relays use consistent hashing to distribute the data in the datacenter evenly across the 2 servers. For example the "pseudo" configuration for tier2relay-DC1-1 would look like this:
RELAY_METHOD = consistent-hashing
DESTINATIONS = server-DC1-1:cache-DC1-1-a, server-DC1-1:cache-DC1-1-b, (...), server-DC1-2:cache-DC1-2-d
What I would like to know: how do I tell tier1relay-DC1-1 and tier1relay-DC1-2 that they should send all metrics to the tier2relays in both DC1 and DC2 (replicating the metrics across the DCs), while doing some kind of "loadbalancing" between tier2relay-DC1-1 and tier2relay-DC1-2?
On another note: I would also like to know what happens inside the carbon-relay if I use consistent hashing but one or more of the destinations are unreachable (server down) - do the metrics get hashed again (against the reachable caches) or will they simply be dropped for the time being? (Or to ask the same question from a different angle: when a relay receives a metric, does it hash the metric against the list of all configured destinations or against the currently available destinations?)
Have a look at carbon-c-relay, which does exactly what you need: https://github.com/grobian/carbon-c-relay
It also gives you a great boost in performance.
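As an illustration, a carbon-c-relay config for tier1relay-DC1-1 might look roughly like this (host names and port follow the question; any_of load-balances with failover inside a cluster, and listing both clusters in one match rule duplicates every metric across the DCs - verify the exact semantics against the project's README):

cluster dc1-tier2
    any_of
        tier2relay-DC1-1:2003
        tier2relay-DC1-2:2003
    ;
cluster dc2-tier2
    any_of
        tier2relay-DC2-1:2003
        tier2relay-DC2-2:2003
    ;
match *
    send to
        dc1-tier2
        dc2-tier2
    ;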

ElastiCache cloudwatch metrics for Redis: currItems for a single database

I have set up a metric in the AWS interface for the ElastiCache Redis cluster. I'm looking for a value of CurrItems above a certain number for a given period (say 1000 for 1 minute).
The issue is that I have two databases in Redis, named 0 and 1. I would like to get only the CurrItems for database 0, not database 1, since database 1 holds values for a longer period of time and makes the whole metric look much bigger than it should be (I only care about the current items of database 0).
Is there a way to create a metric that would only get the CurrItems of database 0?
You will have to create an application for this or use existing Redis tools.
https://stackoverflow.com/questions/8614737/what-are-some-recommended-tools-to-monitor-a-redis-database
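As a sketch of such an application: ElastiCache exposes a standard Redis endpoint, so a scheduled job (cron, Lambda) could read the key count of database 0 and publish it as a custom CloudWatch metric to alarm on instead of the built-in CurrItems. The endpoint, namespace, and metric name below are placeholders:

redis-cli -h my-cluster.xxxxxx.0001.use1.cache.amazonaws.com -n 0 dbsize
aws cloudwatch put-metric-data --namespace "Custom/Redis" --metric-name CurrItemsDb0 --value <count-from-dbsize>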
If you are using New Relic