Stream data from DataStax to Gephi using Gremlin

I am using the gremlin-console (v3.2.7) bundled with DataStax Enterprise. On start it automatically connects to a remote Gremlin Server. Next, I create an alias to access the right graph: :remote config alias g graph.g. Then, I connect to Gephi: :remote connect tinkerpop.gephi. From this moment on, however, I can no longer traverse graph g, and :> g consequently fails with java.lang.StackOverflowError as well. These are the two connections:
gremlin> :remote list
==>0 - Gremlin Server - [localhost/127.0.0.1:8182]-[<uuid>]
==>*1 - Gephi - [workspace1]
My question is whether there is a way to stream data from one remote connection to another using the setup outlined above (DataStax -> Gephi) and, if so, how? If not, is there a workaround?
Note: all connections are successful, and local Gephi streaming tested with TinkerFactory.createModern() works flawlessly.

The Gephi plugin requires a local Graph instance. When you connect the Gremlin Console using :remote you no longer have that (i.e. the Graph instance is on a server somewhere, and via :> you're sending the request to the server to be processed over there).
For DSE Graph, Neptune, CosmosDB and similar graphs that only offer a remote Graph instance, the only way to make the Gephi plugin work is to take a subgraph and bring it down to your Gremlin Console. Then, as you've found, TinkerGraph (i.e. the holder of the subgraph) works just fine with the Gephi plugin.

Related

Superset with Apache Spark on Hive

I have Apache Superset installed via Docker on my local machine. I have a separate production 20-node Spark cluster with Hive as the metastore. I want my Superset to be able to connect to Hive and run queries via Spark SQL.
For connecting to Hive, I tried the following
Add Database --> SQLAlchemy URI
hive://hive#<hostname>:10000/default
but it gives an error when I test the connection. I believe I have to do some tunneling, but I am not sure how.
I have the Hive thrift server as well.
Please let me know how to proceed.
What is the error you are receiving? Although the docs do not mention this, the best way to provide the connection URL is in the following format:
hive://<url>/default?auth=NONE (when there is no security)
hive://<url>/default?auth=KERBEROS
hive://<url>/default?auth=LDAP
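If you want to sanity-check the URI outside Superset first, here is a minimal sketch using SQLAlchemy; it assumes the pyhive driver is installed (pip install 'pyhive[hive]') and that HiveServer2/Thrift is reachable on port 10000, with <hostname> as a placeholder:
# Hedged connectivity check for the same SQLAlchemy URI that Superset would use.
# Assumes pyhive is installed; replace <hostname> with your HiveServer2 host.
from sqlalchemy import create_engine, text

engine = create_engine("hive://<hostname>:10000/default?auth=NONE")

# If this trivial query fails, the problem is connectivity/auth, not Superset itself.
with engine.connect() as conn:
    print(conn.execute(text("SHOW TABLES")).fetchall())
If this works from inside the Superset container but the Superset UI still fails, the issue is likely the Docker networking described in the next answer.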
First, you should connect the two containers to the same Docker network.
Let's say you have container_superset running Superset and container_spark running Spark.
Run: docker network ls # displays the containers' networks
Select the name of the Superset network (it should be something like superset_default).
Then run: docker run --network="superset_default" --name=NameTheContainerHere --publish port1:port2 imageName
---> port1:port2 is the port mapping and imageName is the Spark image.

How to set a specific port for single-user Jupyterhub server REST API calls?

I have set up Spark SQL on JupyterHub using the Apache Toree SQL kernel. I wrote a Python function to update Spark configuration options in the kernel.json file, so that my team can change the configuration based on their queries and cluster configuration. But I have to shut down the running notebook and re-open it, or restart the kernel, after running the Python function. In this way, I'm forcing the Toree kernel to read the JSON file to pick up the new configuration.
I thought of implementing this shutdown and restart of the kernel programmatically. I found the JupyterHub REST API documentation and am able to implement it by invoking the related APIs. But the problem is that the single-user server API port is set randomly by the Spawner object of JupyterHub, and it keeps changing every time I spin up a cluster. I want this to be fixed before launching the JupyterHub service.
Here is a solution I tried based on Jupyterhub docs:
sudo echo "c.Spawner.port = 35289
c.Spawner.ip = '127.0.0.1'" >> /etc/jupyterhub/jupyterhub_config.py
But this did not work as the port was again set by the Spawner randomly. I think there is a way to fix this. Any help on this would be greatly appreciated. Thanks
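A possible workaround (not from the original thread): instead of pinning the single-user server's port, drive the stop/start cycle through the Hub's own REST API, which is reachable at a stable /hub/api path regardless of which port the Spawner picks. A minimal sketch, assuming a recent JupyterHub and a valid API token; the hub URL, user name and token are placeholders:
# Hedged sketch: restart a user's single-user server via the JupyterHub REST API.
# HUB_API, USER and TOKEN are placeholders; a token can be created with `jupyterhub token <user>`.
import time
import requests

HUB_API = "http://localhost:8000/hub/api"   # Hub URL through the proxy; adjust to your deployment
USER = "myuser"
TOKEN = "<api-token>"
HEADERS = {"Authorization": f"token {TOKEN}"}

# Stop the running single-user server so Toree re-reads kernel.json on the next start.
requests.delete(f"{HUB_API}/users/{USER}/server", headers=HEADERS).raise_for_status()

# Wait until the Hub reports the server as stopped, then start it again.
while requests.get(f"{HUB_API}/users/{USER}", headers=HEADERS).json().get("server"):
    time.sleep(1)
requests.post(f"{HUB_API}/users/{USER}/server", headers=HEADERS).raise_for_status()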

How to get graph with transaction support from remote gremlin server?

I have the following configuration: a remote Gremlin Server (TinkerPop 3.2.6) with JanusGraph as the graph database.
I have the gremlin-console (with the Janus plugin) plus this conf in remote.yaml:
hosts: [10.1.3.2] # IP of gremlin-server host
port: 8182
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
So I want to make the connection through the Gremlin Server (not to JanusGraph directly via graph = JanusGraphFactory.build().set("storage.backend", "cassandra").set("storage.hostname", "127.0.0.1").open();) and get a graph which supports transactions.
Is that possible? Because, as far as I can see, none of the TinkerFactory graphs support transactions.
As I understand it, to use the Janus graph through the Gremlin Server you should:
Define the IP and port in the config file of the gremlin-console:
conf/remote.yaml
Connect with the Gremlin Console to the Gremlin Server:
:remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/10.1.23.113:8182
...and work in remote mode (using :> or :remote console), i.e. send ALL commands (or scripts) to the gremlin-server.
:> graph.addVertex(...)
or
:remote console
==>All scripts will now be sent to Gremlin Server - [10.1.2.222/10.1.2.222:818]
graph.addVertex(...)
You don't need to define variables for the graph and the traversal; instead use
graph. - for the graph
g. - for the traversal
In this case, you can use all graph features that are provided by JanusGraph.
TinkerPop provides a Cluster object to hold the connection configuration. Using the Cluster object, a GraphTraversalSource can be spawned.
this.cluster = Cluster.build()
.addContactPoints("192.168.0.2","192.168.0.1")
.port(8082)
.credentials(username, password)
.serializer(new GryoMessageSerializerV1d0(GryoMapper.build().addRegistry(JanusGraphIoRegistry.getInstance())))
.maxConnectionPoolSize(8)
.maxContentLength(10000000)
.create();
this.gts = AnonymousTraversalSource
.traversal()
.withRemote(DriverRemoteConnection.using(cluster));
The gts object is thread-safe. With a remote connection, each query is executed in a separate transaction. Ideally, gts should be a singleton object.
Make sure to call gts.close() and cluster.close() upon application shutdown, otherwise it may lead to a connection leak.
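For reference, a rough Python counterpart of the same remote setup, assuming a recent gremlinpython driver; host, port, credentials and the 'g' traversal-source name are placeholders, and the JanusGraph Gryo registry from the Java snippet has no direct equivalent here because the Python driver speaks GraphSON by default:
# Hedged sketch: spawn a remote GraphTraversalSource from Python with gremlinpython.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection(
    "ws://10.1.3.2:8182/gremlin", "g",      # 'g' must match a traversal source bound on the server
    username="user", password="password")
g = traversal().withRemote(conn)            # each traversal sent through it runs in its own transaction

print(g.V().limit(5).valueMap(True).toList())   # sample read query

conn.close()                                # close on shutdown to avoid leaking connections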
I believe that connecting a java application to a running gremlin server using withRemote() will not support transactions. I have had trouble finding information on this as well but as far as I can tell, if you want to do anything but read the graph, you need to use "embedded janusgraph" and have your remotely hosted persistent data stored in a "storage backend" that you connect to from your application as you describe in the second half of your question.
https://groups.google.com/forum/#!topic/janusgraph-users/t7gNBeWC844
Some discussion I found around it here ^^ mentions auto-committing single transactions in remote mode, but it doesn't seem to do that when I try.

How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster).
I want to fetch spark job statistics using REST APIs in json format.
I am able to fetch basic statistics using a REST URL call:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json. But this gives only very basic statistics.
However, I want to fetch per-executor or per-RDD statistics.
How can I do that using REST calls, and where can I find the exact REST URLs to get these statistics?
The $SPARK_HOME/conf/metrics.properties file sheds some light regarding the URLs, i.e.:
5. MetricsServlet is added by default as a sink in master, worker and client driver, you can send http request "/metrics/json" to get a snapshot of all the registered metrics in json format. For master, requests "/metrics/master/json" and "/metrics/applications/json" can be sent seperately to get metrics snapshot of instance master and applications. MetricsServlet may not be configured by self.
but that fetches HTML pages, not JSON. Only "/metrics/json" fetches stats in JSON format.
On top of that, knowing the application_id programmatically is a challenge in itself when running in yarn-cluster mode.
I checked the REST API section of the Spark Monitoring page, but that didn't work when running the Spark job in yarn-cluster mode. Any pointers/answers are welcome.
You should be able to access the Spark REST API using:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/
From here you can select the app-id from the list and then use the following endpoint to get information about executors, for example:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors
I verified this with my spark streaming application that is running in yarn cluster mode.
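If you prefer to consume these endpoints from code rather than a browser, a small sketch with Python requests against the same proxied URLs; the host and application id are the example values from the question, so adjust them to your cluster:
# Hedged sketch: fetch per-executor statistics through the YARN proxy's Spark REST API.
import requests

YARN_RM = "http://yarn-cluster:8088"
APP_ID = "application_1446697245218_0091"
base = f"{YARN_RM}/proxy/{APP_ID}/api/v1/applications"

apps = requests.get(base).json()            # applications known to this Spark UI (usually just one)
spark_app_id = apps[0]["id"]

executors = requests.get(f"{base}/{spark_app_id}/executors").json()
for e in executors:
    print(e["id"], e["rddBlocks"], e["memoryUsed"], e["totalTasks"])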
I'll explain how I arrived at the JSON response using a web browser. (This is for a Spark 1.5.2 streaming application in yarn-cluster mode).
First, use the hadoop url to view the RUNNING applications. http://{yarn-cluster}:8088/cluster/apps/RUNNING.
Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.
Next, click on the TrackingUrl link. This uses a proxy, and the port is different in my case: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now, append api/v1/applications to this URL: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/api/v1/applications.
You should see a JSON response with the application name supplied to SparkConf and the start time of the application.
I was able to reconstruct the metrics in the columns seen in the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.
The script I used is available here. I wrote a short post describing and tying its functionality back to the Spark codebase. This does not need any web-scraping.
It works for Spark 2.0.0 and YARN 2.7.2, but may work for other version combinations too.
You'll need to scrape through the HTML page to get the relevant metrics. There isn't a Spark REST endpoint for capturing this info.

How to intercept remote nodes in Riak using riak_test module?

I have a problem when trying to use the Erlang testing module riak_test to simulate connections among remote nodes.
It is possible to connect remote nodes within a test to local nodes (deployed by rt:deploy_nodes), but it is impossible to call functions of the rt module, in particular to add interceptors for the remote nodes, without error.
Is there some solution or method to also intercept remote nodes using the Riak testing module?
I need to use interceptors on remote nodes to retrieve some information about Riak node states.
More specifically: riak@10.X.X.X is my remote referenced node.
In the test it is possible to connect this node to the local devX@127.0.0.1 nodes deployed in the test, but when my test program calls rt_intercept:add(riak@10.X.X.X, {}) I get the error:
{{badmatch,
{badrpc,
{'EXIT',
{undef,
[{intercept,add,
[riak_kv_get_fsm,riak_kv_get_fsm_intercepts,
[{{waiting_vnode_r,2},waiting_vnode_r_tracing},
{{client_info,3},client_info_tracing},
{{execute,2},execute_preflist}]],
[]},
{rpc,'-handle_call_call/6-fun-0-',5,
[{file,"rpc.erl"},{line,203}]}]}}}},
[{rt_intercept,add,2,[{file,"src/rt_intercept.erl"},{line,57}]},
{remoteRiak,'-confirm/0-lc$^2/1-2-',1,
[{file,"tests/remoteRiak.erl"},{line,49}]},
{remoteRiak,'-confirm/0-lc$^2/1-2-',1,
[{file,"tests/remoteRiak.erl"},{line,49}]},
{remoteRiak,confirm,0,[{file,"tests/remoteRiak.erl"},{line,49}]}]}
The rt_intercept:add function uses rpc:call to run the intercept:add function in the target node's VM. This means that the target node must either have the intercept module loaded or have it in its code path. You can add a path using add_paths in the config for the target node.