openCypher client for AWS Neptune

I'm looking for a graph exploration tool similar to https://github.com/prabushitha/gremlin-visualizer for querying AWS Neptune with openCypher, so I can take advantage of the new offering:
https://aws.amazon.com/blogs/database/announcing-opencypher-for-amazon-neptune-building-better-graph-applications-with-opencypher-and-gremlin-together/.
I'm familiar with the Jupyter notebook https://github.com/aws/graph-notebook but I'm looking for other alternatives.

With the recent release of openCypher on Neptune, we have provided support for writing and visualizing the results of openCypher queries via the Jupyter notebook you mentioned. That tool is good for authoring and visualizing queries, but it does not have graph exploration functionality for clicking on and expanding connected nodes/edges.
However, with the release of openCypher, Neptune supports interoperability between Gremlin and openCypher on top of the same data. This means you can load the data once and use either query language, so any graph exploration tooling that works with Gremlin, such as https://github.com/prabushitha/gremlin-visualizer or https://www.tomsawyer.com/graph-database-browser, can provide those exploration capabilities without having to reload the data.
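To make that concrete, here is a minimal sketch of hitting the same (hypothetical) Neptune cluster with both query languages. The endpoint name is a placeholder, the airport label comes from the sample air-routes data, and the example assumes IAM database authentication is disabled.

```python
# Minimal sketch: querying one Neptune cluster with openCypher (HTTPS endpoint)
# and with Gremlin (WebSocket endpoint). Endpoint and labels are placeholders.
import requests
from gremlin_python.driver import client, serializer

NEPTUNE = "your-neptune-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com"  # placeholder

# openCypher over the HTTPS endpoint
resp = requests.post(
    f"https://{NEPTUNE}:8182/openCypher",
    data={"query": "MATCH (n:airport) RETURN n.code LIMIT 5"},
)
print(resp.json())

# The same loaded data queried with Gremlin, so Gremlin-based visualizers keep working
gremlin_client = client.Client(
    f"wss://{NEPTUNE}:8182/gremlin",
    "g",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)
print(gremlin_client.submit("g.V().hasLabel('airport').values('code').limit(5)").all().result())
gremlin_client.close()
```

Because both requests run against the same graph, an openCypher client and a Gremlin-based explorer can be used side by side.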

Related

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL more than in PySpark since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs Scala/Python code)
easy to run in the GUI if needed
fast without having to really reason about partitioning, caching and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow often looks like:
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something?
Are there any cons to using BigQuery this way instead of Spark?
Thanks
One con I can see is the extra time the Hadoop cluster needs to spin up and finish the job. By making a direct request to BigQuery, that overhead goes away.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ client libraries and separate your current tasks:
BigQuery Client Libraries. They are optimized to connect to BQ. There is a QuickStart, and you can use different programming languages such as Python or Java, among others (a minimal Python sketch follows this list).
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ, you can use the Dataproc-BQ connector. While this connector is installed in Dataproc by default, you can also install it on-premises so that you can continue running your Spark ML jobs with BQ data. Just in case it helps, you might also want to consider GCP services such as AutoML, BQ ML, and AI Platform Notebooks, which are specialized services for machine learning and AI.
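As a rough illustration of the client-library path (not part of the original answer), the snippet below queries BigQuery directly with the google-cloud-bigquery package; the public dataset and query are just examples.

```python
# Rough sketch: querying BigQuery directly with the Python client library,
# without spinning up a Spark/Dataproc job. Dataset and query are illustrative.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.name, row.total)
```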
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I'll summarize my view of the pros and cons of each system; I admit your environment could be different, so something I list as a pro might not be one for you.
Pros of Spark:
better testing of the code: it's simpler to build unit tests and run them with mocked data and classes than to try to do this with BigQuery (a small sketch follows this answer)
it's possible to use SQL (SparkSQL) for operations and even combine operations over different data sources (DB, files, BQ)
we have JSON files in a format that BigQuery cannot parse, even though the files themselves are valid JSON
it's more natural to implement complicated logic in some cases, for example traversing arrays in nested fields and other involved calculations
better custom monitoring is possible: when we need to check specific metrics in the pipeline, it's easier to emit the related metrics (StatsD, etc.)
more natural for CI/CD processes
Pros of BigQuery (all with a note: if all data is available):
simplicity of SQL, when all data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scenes, very performant
With both approaches it's possible to check results quickly in a GUI; for example, a Jupyter notebook lets you run PySpark interactively. I can't add notes about the ML-related traits, though.
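As a small sketch of the testing point above: because a Spark transformation is just a function over DataFrames, it can be unit-tested locally with mocked data. The function and test data here are made up for illustration.

```python
# Minimal sketch of the "easier unit testing" point: the transformation is a
# plain function over DataFrames, so it can be tested locally with mocked data.
# The function name and sample rows are made up for illustration.
from pyspark.sql import SparkSession, functions as F


def add_full_name(df):
    """Business logic under test: concatenate first and last name."""
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))


def test_add_full_name():
    spark = SparkSession.builder.master("local[1]").appName("test").getOrCreate()
    mocked = spark.createDataFrame(
        [("Ada", "Lovelace"), ("Alan", "Turing")],
        ["first_name", "last_name"],
    )
    result = add_full_name(mocked).select("full_name").collect()
    assert [r.full_name for r in result] == ["Ada Lovelace", "Alan Turing"]
    spark.stop()
```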

What are the differences between `WriteToBigQuery` and `BigQuerySink`

Following this answer, I wonder what the principal differences (if any) are between WriteToBigQuery and BigQuerySink in the Apache Beam Python SDK.
What are the considerations or limitations of using one over the other?
Looking at sources:
BigQuerySink triggers a Dataflow native sink for BigQuery that only supports batch pipelines. Instead of using this sink directly, please use WriteToBigQuery transform that works for both batch and streaming pipelines.
Otherwise, they both seem to do a similar thing underneath.
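For reference, a hedged sketch of the recommended transform; the project, dataset, table, and schema are placeholders.

```python
# Sketch: beam.io.WriteToBigQuery works in both batch and streaming pipelines,
# unlike the legacy BigQuerySink. Project/dataset/table names are placeholders.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([{"name": "Ada", "score": 10}])
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="name:STRING,score:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```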

How to Visualize data in Apache Kudu?

Is it possible to visualize data in Apache Kudu? Is there any guideline for it?
Kudu itself does not have any built-in data visualization tool, just as Oracle, an RDBMS, does not come with one either. However, there are a few options:
Build a custom visualization tool yourself using the Java, Python, or C++ APIs: https://kudu.apache.org/docs/developing.html.
Impala is a SQL engine with built-in integration with Kudu, and it has ODBC/JDBC drivers. Thus you can hook almost any BI tool up to Impala to query the data in Kudu and build visualizations (see the sketch below).
You can also use Arcadia Data, a visualization tool that connects directly to Kudu tables without an Impala connection. Arcadia is built specifically for big data applications and runs on a distributed cluster (a distributed visualization tool for distributed computing).
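As a hypothetical illustration of the Impala option, the snippet below queries a Kudu-backed table through Impala with the impyla package, the same SQL path a BI tool would take over ODBC/JDBC; host, port, and table names are placeholders.

```python
# Hypothetical sketch: query a Kudu-backed table through Impala using impyla
# (a DB-API client). Host, port, table, and columns are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()
cur.execute("SELECT city, COUNT(*) AS events FROM kudu_events GROUP BY city LIMIT 10")
for city, events in cur.fetchall():
    print(city, events)
cur.close()
conn.close()
```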

Is Neo4j's Cypher query language open-source?

What is the status of Neo4j's query language Cypher? I really like it, but I would like to avoid Neo4j lock-in. Are there other Cypher interfaces, like there are for Gremlin?
Regards
Cypher is totally OSS, see https://github.com/neo4j/community/tree/master/cypher . Right now there is one implementation, but potentially there can be more. It's just too early in its evolution to make it a standard; we are still heavily experimenting with it.
Check out Pixy, a declarative graph query language that works on any Blueprints-compatible graph database. It is built on Gremlin/Pipes from the Tinkerpop software stack.
Pixy enables complex pattern matching and logic programming on graph databases by translating PROLOG-style rules and goals into Gremlin pipelines that represent graph traversal operations. It has some additional advantages over Cypher beyond avoiding vendor lock-in.
Pixy is available under the Apache 2.0 license.
openCypher has since been implemented by many databases. According to the openCypher site, these are some of them:
Agens Graph: A multi-model database
Amazon Neptune
AnzoGraph: A native massively parallel (MPP) graph analytical database
ArcadeDB
CAPS: Cypher for Apache Spark
Cypher for Gremlin
Katana Graph
Memgraph: An in-memory, transactional graph database
Neo4j: A native, transactional property graph database
RedisGraph: A graph module for Redis
SAP HANA Graph
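As a small, illustrative sketch of that portability (not from the original answer): the same openCypher query text can be sent to more than one implementation. Neo4j's official Python driver speaks the Bolt protocol, which Memgraph also accepts; the URIs, credentials, and data model below are placeholders.

```python
# Illustrative sketch: send the same openCypher query to two Bolt-speaking
# implementations (Neo4j and Memgraph). URIs, credentials, labels are placeholders.
from neo4j import GraphDatabase

QUERY = "MATCH (p:Person)-[:KNOWS]->(f) RETURN p.name, f.name LIMIT 5"

for uri in ("bolt://neo4j-host:7687", "bolt://memgraph-host:7687"):
    driver = GraphDatabase.driver(uri, auth=("user", "password"))
    with driver.session() as session:
        for record in session.run(QUERY):
            print(uri, record["p.name"], record["f.name"])
    driver.close()
```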

What is the relationship between Sesame & Alibaba?

I am a beginner in this and I am having a hard time understanding it.
What are AliBaba and Sesame?
Of the two, which one does the query optimization and which one handles creating repositories?
Any kind of input will be fine. Thanks.
"AliBaba is a RESTful subject-oriented client/server library for distributed persistence of files and data using RDF metadata. AliBaba is the beta version of the next generation of the Elmo codebase. It is a collection of modules that provide simplified RDF store abstractions to accelerate development and facilitate application maintenance."
http://www.openrdf.org/alibaba.jsp
"Sesame is a de-facto standard framework for processing RDF data. This includes parsing, storing, inferencing and querying of/over such data. It offers an easy-to-use API that can be connected to all leading RDF storage solutions."
http://www.openrdf.org/about.jsp
I imagine the query engine, query optimization, and storage are part of Sesame, not AliBaba. AliBaba is application code that sits on top of Sesame.
There are also alternatives in Java, such as Apache Jena:
http://incubator.apache.org/jena/
Guess what I use? ;-)