Specify worker pool for Apache TinkerPop's Spark-Gremlin - DataStax Enterprise Graph Analytics

I need to designate a specific worker pool for Gremlin OLAP queries. When I run Gremlin OLAP queries from the Gremlin console or DataStax Studio they run under the default pool, which is not what I want. I want them to run under a specific worker pool, e.g. gremlin_olap, or at least be able to specify the memory and executors they use. I tried a few settings in dse.yaml (under resources/dse/conf) and olap.properties (under resources/graph/conf) and restarted the cluster, but I still cannot force the queries onto the gremlin_olap worker pool.
olap.properties
spark.scheduler.pool=gremlin_olap
spark.executor.cores=2
spark.executor.memory=2g
dse.yaml
resource_manager_options:
  worker_options:
    cores_total: 0.7
    memory_total: 0.6
    workpools:
      - name: alwayson_sql
        cores: 0.25
        memory: 0.25
      - name: gremlin_olap
        cores: 0.25
        memory: 0.25
Gremlin console
bin/dse gremlin-console
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.tinkergraph
gremlin> :remote config alias g identity.a
==>g=identity.a
gremlin> g.V().groupCount().by(label)
==>{identity=50000}
gremlin>
Am I missing something?

These directions should help:
https://docs.datastax.com/en/dse/6.8/dse-dev/datastax_enterprise/graph/graphAnalytics/graphAnalyticsSparkGraphComputer.html#SettingSparkpropertiesfromGremlin
This doesn't exactly create a Spark resource pool, but it does affect the resources that the Gremlin OLAP Spark application will use. Because DSE Graph only ever spins up one of these applications, it has much the same effect as a dedicated Spark resource pool.
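For reference, the approach on that page amounts to setting Spark properties on the graph's configuration from the console before the OLAP application starts. The following is only a sketch built from the question's own alias (identity.a) and example values; check the linked page for the exact property names your DSE version supports, and note that if the single OLAP Spark application is already running it may need to be restarted for new resource settings to take effect.
gremlin> :remote config alias g identity.a
==>g=identity.a
gremlin> graph.configuration().setProperty("spark.executor.memory", "2g")
gremlin> graph.configuration().setProperty("spark.executor.cores", 2)
gremlin> g.V().groupCount().by(label)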

Related

Apache Drill - Hive Integration: Drill Not listing Tables

I have been trying to integrate Apache Drill with Hive using the Hive storage plugin configuration. I configured the storage plugin with all the necessary properties. On the Drill shell, I can view the Hive databases using:
Show Databases;
But when I try to list tables using:
Show Tables;
I get no results (no list of tables).
Below are the steps I followed from the Apache Drill documentation and other sources:
I created a Drill distributed cluster by updating drill-override.conf on all nodes with the same cluster ID and the ZooKeeper IP and port, then invoking drillbit.sh on each node.
I started the Drill shell using drill-conf and made sure the Hive metastore service is active as well.
Below is the Hive storage plugin configuration for Drill (from its Web UI):
{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://node02.cluster.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://node02.cluster.com/hive",
    "hive.metastore.warehouse.dir": "/apps/hive/warehouse",
    "fs.default.name": "hdfs://node01.cluster.com:8020",
    "hive.metastore.sasl.enabled": "false"
  },
  "enabled": true
}
All the properties were set after referring to hive-site.xml.
As far as I can tell, that is what others have done to integrate Drill with Hive. Am I missing something here?
Regarding versions: Drill 1.14, Hive 1.2 (Hive metastore: MySQL).
We also have HiveServer2 on the same nodes; is that causing any issues?
I just want to integrate Drill with Hive 1.2, am I doing it right?
Any pointers will be helpful; I have spent nearly two days trying to get this right.
Thanks for your time.
Starting from version 1.13, Drill uses the Hive 2.3.2 client. It is recommended to use Hive 2.3 to avoid unpredictable issues.
Regarding your setup, please remove all configProps except hive.metastore.uris (see the trimmed example below). The other settings can keep their defaults (as defined in HiveConf.java) or be specified in your hive-site.xml.
Also, if Show Tables; still returns an empty result even after executing use hive, check Drill's log files for errors. If you find one, you can create a Jira ticket to improve Drill's output so that it reflects the issue.
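To make that concrete, a trimmed plugin definition would look roughly like this, reusing the metastore URI from the question and leaving everything else to the defaults or to hive-site.xml:
{
  "type": "hive",
  "configProps": {
    "hive.metastore.uris": "thrift://node02.cluster.com:9083"
  },
  "enabled": true
}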

Stream data from DataStax to Gephi using Gremlin

I am using the gremlin-console (v3.2.7) bundled with DataStax Enterprise. On start it automatically connects to a remote Gremlin Server. Next, I create an alias to access the right graph: :remote config alias g graph.g. Then I connect to Gephi (v0.9.2): :remote connect tinkerpop.gephi. From this moment on, however, I can no longer traverse graph g, and :> g fails with java.lang.StackOverflowError as well. These are the two connections:
gremlin> :remote list
==>0 - Gremlin Server - [localhost/127.0.0.1:8182]-[<uuid>]
==>*1 - Gephi - [workspace1]
My question is whether there is a way to stream data from one remote connection to the other with the setup outlined above (DataStax -> Gephi), and if so, how? If not, is there a workaround?
Note: all connections are successful, and local Gephi streaming tested with TinkerFactory.createModern() works flawlessly.
The Gephi plugin requires a local Graph instance. When you connect the Gremlin Console using :remote you no longer have that (i.e. the Graph instance is on a server somewhere, and via :> you are sending the request to the server to be processed over there).
For DSE Graph, Neptune, CosmosDB and similar graphs that only offer a remote Graph instance, the only way to make the Gephi plugin work is to take a subgraph and bring it down to your Gremlin Console. Then, as you've found, TinkerGraph (i.e. the holder of the subgraph) works just fine with the Gephi plugin.
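A rough sketch of that subgraph workflow in the console, assuming the remote connection is configured to return objects rather than strings and relying on the console keeping the last remote result in the result variable; the limit(1000) is just a placeholder for however much of the graph you want to visualize:
gremlin> :> g.E().limit(1000).subgraph('sg').cap('sg').next()
gremlin> sg = result[0].getObject()      // a local TinkerGraph holding the subgraph
gremlin> :remote connect tinkerpop.gephi
gremlin> :> sg                           // stream the local copy to Gephi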

How to get a graph with transaction support from a remote Gremlin Server?

I have the following configuration: a remote Gremlin Server (TinkerPop 3.2.6) backed by JanusGraph.
I have the gremlin-console (with the Janus plugin) plus this configuration in conf/remote.yaml:
hosts: [10.1.3.2] # IP of the gremlin-server host
port: 8182
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
So I want to connect through the Gremlin Server (not to JanusGraph directly via graph = JanusGraphFactory.build().set("storage.backend", "cassandra").set("storage.hostname", "127.0.0.1").open()) and get a graph that supports transactions.
Is that possible? Because as far as I can see, none of the TinkerFactory graphs support transactions.
As I understand it, to use JanusGraph through the Gremlin Server you should:
Define the IP and port in the config file of the gremlin-console:
conf/remote.yaml
Connect from the Gremlin console to the Gremlin Server:
:remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/10.1.23.113:8182
...and work in remote mode (using :> or :remote console), i.e. send ALL commands (scripts) to the gremlin-server.
:> graph.addVertex(...)
or
:remote console
==>All scripts will now be sent to Gremlin Server - [10.1.2.222/10.1.2.222:818]
graph.addVertex(...)
You don't need to define variables for the graph and the traversal; instead use
graph - for the graph
g - for the traversal
In this case, you can use all the graph features provided by JanusGraph.
TinkerPop provides the Cluster object to hold the connection configuration. From a Cluster object, a GraphTraversalSource can be spawned.
this.cluster = Cluster.build()
        .addContactPoints("192.168.0.2","192.168.0.1")
        .port(8082)
        .credentials(username, password)
        .serializer(new GryoMessageSerializerV1d0(GryoMapper.build().addRegistry(JanusGraphIoRegistry.getInstance())))
        .maxConnectionPoolSize(8)
        .maxContentLength(10000000)
        .create();

this.gts = AnonymousTraversalSource
        .traversal()
        .withRemote(DriverRemoteConnection.using(cluster));
The gts object is thread-safe, and with a remote connection each query is executed in a separate transaction. Ideally gts should be a singleton.
Make sure to call gts.close() and cluster.close() on application shutdown, otherwise it may lead to a connection leak.
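As a small usage sketch (the person label is a placeholder), every traversal spawned from gts is sent to the server and committed there as its own transaction:
// each terminal step (toList(), next(), iterate(), ...) is one remote round trip,
// executed and committed server-side as an independent transaction
gts.V().hasLabel("person").limit(10).toList();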
I believe that connecting a Java application to a running Gremlin Server using withRemote() will not support transactions. I have had trouble finding information on this as well, but as far as I can tell, if you want to do anything but read the graph, you need to use "embedded JanusGraph" and have your remotely hosted persistent data stored in a "storage backend" that you connect to from your application, as you describe in the second half of your question.
https://groups.google.com/forum/#!topic/janusgraph-users/t7gNBeWC844
The discussion linked above mentions auto-committing of single transactions in remote mode, but it doesn't seem to do that when I try.
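For completeness, a minimal sketch of that embedded approach, reusing the JanusGraphFactory call from the question (storage backend cassandra, host 127.0.0.1); with an embedded graph you get explicit transaction control back, which a remote connection does not give you:
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.JanusGraphTransaction;

public class EmbeddedJanusExample {
    public static void main(String[] args) {
        // open JanusGraph embedded in the application, talking directly to the storage backend
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cassandra")
                .set("storage.hostname", "127.0.0.1")
                .open();

        JanusGraphTransaction tx = graph.newTransaction();
        try {
            tx.addVertex("person");   // "person" is just a placeholder label
            tx.commit();              // explicit commit, unlike the per-request transactions of withRemote()
        } catch (Exception e) {
            tx.rollback();
        } finally {
            graph.close();
        }
    }
}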

Amazon EMR allocates container with 1 core on slaves

I am seeing strange behaviour where YARN cannot allocate all containers properly.
I have maximizeResourceAllocation set to true, and when I start an m4.4xlarge slave instance it requests all 32 cores from YARN for each executor container.
However, that fails for one container because the ApplicationMaster process uses 1 core:
1 container with 1 vcore for the ApplicationMaster process
32-core container for executor 1
32-core container for executor 2
32-core container for executor 3 fails, because YARN only has 31 cores left to give.
My step should execute in client mode. It is the same with the Zeppelin instances.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("spark-submit",
        "--class", ".....",
        "--deploy-mode", "client",
        "/home/hadoop/jars/.....jar");
I use a bootstrap step to get the jars into /home/hadoop/jars, because if you want to use S3 paths only cluster deploy mode is allowed, which would tie up one of my executors with the SparkContext (driver) process.
All this means that if I only have one slave, nothing happens at all. And if I have three, one does no work at all, which is a waste of money.
I could in theory calculate executor cores minus one and force that setting in spark-submit (see the sketch below), but this is supposed to work out of the box.
How can I tell YARN to put this 1-core ApplicationMaster process on the master node, or not create it at all, or start executors with different core counts?
I only use 1-4 instances, so having one of them effectively sitting idle is really not OK.
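For reference, the workaround mentioned above would look roughly like this with the same step definition; 31 is only an example of "instance vcores minus the one the ApplicationMaster takes", and the extra --conf flag is the standard spark-submit way to override executor cores:
// hypothetical variant of the step above (HadoopJarStepConfig comes from the AWS EMR SDK):
// cap executor cores so an executor container can share a node with the 1-vcore ApplicationMaster
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("spark-submit",
        "--class", ".....",
        "--deploy-mode", "client",
        "--conf", "spark.executor.cores=31",
        "/home/hadoop/jars/.....jar");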

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in a local virtual machine and another in a remote cloud. Both clusters are in standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to do the following:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark master URL to spark://local1:7077 or spark://remote1:7077, and then run the code from IntelliJ IDEA. That is, I don't want to use spark-submit to submit a job.
But I ran into a problem:
When I use the local cluster, everything goes well. Both running the code from IntelliJ IDEA and using spark-submit submit the job to the cluster and finish it.
But when I use the remote cluster, I get a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
It says sufficient resources, not sufficient memory!
This log keeps printing and nothing further happens. Both spark-submit and running the code from IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to the remote cluster?
If so, what configuration does it need?
What are the possible causes of my problem?
How can I handle it?
Thanks a lot!
Update
There is a similar question here, but I think my case is different. When I run my code in IntelliJ IDEA with the Spark master set to the local virtual machine cluster, it works; against the remote cluster I get the Initial job has not accepted any resources;... warning instead.
I want to know whether a security policy or firewall could be causing this?
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads in the Spark developer community on the topic.
My answer here addresses a portion of your post, specifically:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IntelliJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the parameters you requested are actually available on the cluster.
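When launching directly from IntelliJ (i.e. without spark-submit), one option is to set the resource requests on the SparkConf in code. The sketch below uses the example values above and the master URL from the question; whether they fit depends on what the workers actually advertise, which is exactly what the Spark UI check is for. Note that spark.driver.memory generally has to be set before the driver JVM starts, so from an IDE it belongs in the run configuration's VM options (e.g. -Xmx4g) rather than in SparkConf.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteWordCount {
    public static void main(String[] args) {
        // request no more memory/cores per executor than the workers report as free,
        // otherwise the "Initial job has not accepted any resources" warning appears
        SparkConf conf = new SparkConf()
                .setAppName("word-count")
                .setMaster("spark://remote1:7077")
                .set("spark.executor.memory", "8g")
                .set("spark.executor.cores", "8");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... word-count logic goes here ...
        sc.stop();
    }
}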