Is Neo4j's Cypher query language open-source?

What is the status of Neo4j's query language, Cypher? I really like it, but I would like to avoid Neo4j lock-in. Are there other Cypher implementations, the way there are multiple implementations of Gremlin?
Regards

Cypher is fully open source; see https://github.com/neo4j/community/tree/master/cypher . Right now there is one implementation, but potentially there can be more. It's just too early in its evolution to make it a standard; we are still heavily experimenting with it.
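For example, running Cypher from Python via the official neo4j driver looks like this. A minimal sketch; the URI, credentials, and :Person data are assumptions for a default local install:

```python
from neo4j import GraphDatabase

# bolt URI and credentials are assumptions for a local default install
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # plain Cypher: find who each person knows
    result = session.run(
        "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name AS a, b.name AS b"
    )
    for record in result:
        print(record["a"], "knows", record["b"])

driver.close()
```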

Check out Pixy, a declarative graph query language that works on any Blueprints-compatible graph database. It is built on Gremlin/Pipes from the TinkerPop software stack.
Pixy enables complex pattern matching and logic programming on graph databases by translating Prolog-style rules and goals to Gremlin pipelines that represent graph traversal operations. It also has some advantages over Cypher beyond avoiding vendor lock-in.
Pixy is available under the Apache 2.0 license.

openCypher has since been implemented by many databases. According to the openCypher site, these are some of them:
Agens Graph: A multi-model database
Amazon Neptune
AnzoGraph: A native massively parallel (MPP) graph analytical database
ArcadeDB
CAPS: Cypher for Apache Spark
Cypher for Gremlin
Katana Graph
Memgraph: An in-memory, transactional graph database
Neo4j: A native, transactional property graph database
RedisGraph: A graph module for Redis
SAP HANA Graph
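To see that portability in practice, here is the same kind of Cypher query sent to RedisGraph via the redisgraph-py client. A minimal sketch, assuming a local Redis instance with the RedisGraph module loaded; the graph name and data are made up:

```python
import redis
from redisgraph import Graph

# assumes Redis with the RedisGraph module on localhost:6379
r = redis.Redis(host="localhost", port=6379)
g = Graph("social", r)

g.query("CREATE (:Person {name:'Alice'})-[:KNOWS]->(:Person {name:'Bob'})")

# the Cypher text is the same as you would send to Neo4j
result = g.query("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name")
for row in result.result_set:
    print(row)
```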


BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs. Scala/Python code)
easy to run in the GUI if needed
fast, without having to really reason about partitioning, caching, and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow is often like:
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something?
Are there any cons to using BigQuery this way instead of Spark?
Thanks
One con I can see is the additional time the Hadoop cluster needs to spin up and finish each job. Making a direct request to BigQuery removes this extra time.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ client libraries and separate your current tasks (see the sketches after this list):
BigQuery client libraries. They are optimized to connect to BQ. Here is a QuickStart, and you can use different programming languages like Python or Java, among others.
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ, you can use the Dataproc BigQuery connector. This connector is installed in Dataproc by default, but you can also install it on-premises so that you can continue running your SparkML jobs with BQ data. Just in case it helps, you might also want to consider some GCP services like AutoML, BQ ML, and AI Platform Notebooks; they are specialized services for machine learning and AI.
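As a rough illustration, querying BQ directly with the client library (a sketch; the public table is just an example, and default GCP credentials are assumed):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials are configured
rows = client.query(
    "SELECT word, word_count "
    "FROM `bigquery-public-data.samples.shakespeare` LIMIT 5"
).result()
for row in rows:
    print(row.word, row.word_count)
```

And reading BQ data into Spark through the connector looks roughly like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# spark-bigquery connector, preinstalled on Dataproc
df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data.samples.shakespeare")
      .load())
df.show(5)
```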
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I'll summarize my view of the pros and cons of one system against the other, with the admission that your environment could be different, so something I count as a pro might not be one for you.
Pros of Spark:
better testing of the code; it's simpler to build unit tests and run them with mocked data and classes than to attempt this with BigQuery (see the sketch after this list)
it's possible to use SQL (Spark SQL) for operations, and even to combine operations over different data sources (DBs, files, BQ)
we have JSON files in a layout that BigQuery cannot parse (even though the files contain valid JSON), while Spark handles them
it's more natural to implement complicated logic for some cases, for example traversing arrays in nested fields and other involved calculations
better custom monitoring is possible; when we need to check specific metrics in the pipeline, we can emit the related metrics (StatsD, etc.) more easily
more natural for CI/CD processes
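For instance, a sketch of the kind of unit test that is straightforward with Spark; the transformation and the data here are made up:

```python
from pyspark.sql import SparkSession, functions as F

def add_full_name(df):
    # hypothetical transformation under test
    return df.withColumn("full_name", F.concat_ws(" ", "first", "last"))

def test_add_full_name():
    spark = (SparkSession.builder
             .master("local[1]")
             .appName("unit-test")
             .getOrCreate())
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first", "last"])
    assert add_full_name(df).first()["full_name"] == "Ada Lovelace"
```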
Pros of BigQuery (all with the caveat that the data is available):
simplicity of SQL, when all the data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scenes, very performant
With both approaches it's possible to check results quickly in a GUI; for example, a Jupyter notebook lets you run PySpark instantly. I can't comment on the ML-related traits, though.

What is AgensGraph?

I heard about AgensGraph, but I wonder what exactly it is.
If anyone knows, please let me know.
I got "What is AgensGraph" from AgensGraph documentation, which you can find this document from a following link: http://bitnine.net/support/documents_backup/quick-start-guide-html/
AgensGraph is a new-generation multi-model graph database for the modern complex data environment. AgensGraph is a multi-model database which supports the relational and graph data models at the same time. It enables developers to integrate the legacy relational data model and the novel graph data model in one database. AgensGraph supports ANSI SQL and openCypher (http://www.opencypher.org). SQL queries and Cypher queries can be integrated into a single query in AgensGraph.
AgensGraph is based on the powerful PostgreSQL RDBMS, so it is very robust, fully featured, and ready for enterprise use. It is optimized for handling complex connected graph data, but at the same time it provides plenty of powerful database features essential to the enterprise database environment, like ACID transactions, multi-version concurrency control, stored procedures, triggers, constraints, sophisticated monitoring, and a flexible data model (JSON). Moreover, AgensGraph can leverage the rich ecosystem of PostgreSQL and can be extended with many outstanding external modules, like PostGIS.
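Since AgensGraph is built on PostgreSQL, it speaks the PostgreSQL wire protocol, so ordinary PostgreSQL clients should be able to reach it. A minimal sketch with psycopg2; the connection details, graph name, and labels are all assumptions for illustration:

```python
import psycopg2

# connection details are assumptions for a local AgensGraph install
con = psycopg2.connect("host=localhost port=5432 dbname=mydb user=agens")
cur = con.cursor()

# AgensGraph accepts Cypher alongside SQL once a graph path is set
cur.execute("SET graph_path = network")
cur.execute("MATCH (p:person)-[:knows]->(f:person) RETURN p.name, f.name")
for row in cur.fetchall():
    print(row)

con.close()
```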

Graph Database: TinkerPop/Blueprints vs W3C Linked data

Looking for an infrastructure for network analysis of heterogeneous networks (multiple node types (multi-mode), multiple edge types (multi-relational), and multiple descriptive features (multi-featured)), I've noticed that there are two standard stacks in the graph database world:
On one hand we have the TinkerPop/Blueprints property graph model. It is supported by Neo4j, OrientDB, GraphDB, Dex, Titan, InfiniteGraph, etc.
The TinkerPop stack includes the Blueprints property graph model interface, the Gremlin graph traversal language, and the Furnace graph algorithms package.
On the other hand we have W3C's Linked Data technology stack, which is supported by AllegroGraph, 4store, Oracle Database Semantic Technologies, OWLIM, SYSTap BigData, etc.
Semantic data is represented using RDF/RDFS/OWL and can be queried using SPARQL. On top of that, it offers rules and reasoning capabilities.
Now, suppose that I want to represent heterogeneous data in a graph database and analyse such data (statistics, relation discovery, structure, evolution, etc.). (I know these terms are broad and vague.) What are the relative strengths of each model for various types of network analysis tasks? Do the two models complement each other?
A couple of things: your exemplars of linked data stacks are all triple stores. You would start building a linked data application by first getting your triple store set up, but calling a database a linked data stack is incorrect, imo. That's also an incomplete list of triple stores; there are also Sesame, Jena, Mulgara, and Stardog. Sesame and Jena kind of pull double duty: they're the two de facto standard Java APIs for the semantic web, but both provide triple stores that come bundled with the APIs. I also know that both Cray and IBM are working on triple stores, but I don't know much about either at this point. I do know that Stardog works well with the TinkerPop stack, and that it's basically drop in and start writing Gremlin queries against the RDF.
I think the strengths of RDF/OWL are that you 1) get a real query language, 2) they're W3C standards, and 3) you get reasoning, if the triple store supports it, for free (more or less; you still have to write an ontology).
With RDF/OWL/SPARQL being standards, it's quite easy to pick up and move to a new triple store with a different feature set should you need to: your data is already in a common format that everyone understands, and any application logic encoded as queries is completely portable. In most cases you'd be writing against either the Sesame or Jena APIs, or working over the SPARQL protocol, so you might only need to change your config/init. I think that's a big win in the early prototyping phases.
I also think that RDF/OWL, especially combined with reasoning and the kinds of complex SPARQL queries you can write with the new SPARQL 1.1, suits itself well to building complicated analytic applications. Also, the impression most people have that RDF triple stores don't scale is no longer correct; most triple stores at this point easily scale into the billions of triples and have very competitive throughput numbers as well.
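To make the portability point concrete, here is a small SPARQL example with Python's rdflib; the data and query are made up, but the query text itself is standard SPARQL and should run unchanged against any compliant store:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")
g = Graph()
# a tiny in-memory RDF graph: two people, one relationship
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.knows, EX.bob))
g.add((EX.bob, RDF.type, FOAF.Person))
g.add((EX.bob, FOAF.name, Literal("Bob")))

# standard SPARQL: portable across triple stores
q = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?n1 ?n2 WHERE {
  ?a foaf:knows ?b .
  ?a foaf:name ?n1 .
  ?b foaf:name ?n2 .
}
"""
for row in g.query(q):
    print(row.n1, "knows", row.n2)
```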
So based on what I think you might be doing, I think semweb might be a better bet for you. I did a similar project a few years back using RDF & RDFS for the backend, fronted by a simple Pylons-based webapp, and was very happy with the results.

What is the relationship between Sesame & Alibaba?

I am a beginner in this and am having a hard time understanding it.
What are AliBaba and Sesame?
Of the two, which one does query optimization, and which one handles creating repositories?
Any kind of input will be fine. Thanks.
"AliBaba is a RESTful subject-oriented client/server library for distributed persistence of files and data using RDF metadata. AliBaba is the beta version of the next generation of the Elmo codebase. It is a collection of modules that provide simplified RDF store abstractions to accelerate development and facilitate application maintenance."
http://www.openrdf.org/alibaba.jsp
"Sesame is a de-facto standard framework for processing RDF data. This includes parsing, storing, inferencing and querying of/over such data. It offers an easy-to-use API that can be connected to all leading RDF storage solutions."
http://www.openrdf.org/about.jsp
I imagine the query engine, query optimization, and storage are part of Sesame, not AliBaba. AliBaba is application code that sits on top of Sesame.
There are also alternatives in Java, such as Apache Jena:
http://incubator.apache.org/jena/
Guess what I use? ;-)

Representing a DAG (directed acyclic graph)

I need to store dependencies in a DAG. (We're mapping a new school curriculum at a very fine-grained level.)
We're using Rails 3.
Considerations
Wider than it is deep
Very large
I estimate 5-10 links per node. As the system grows this will increase.
Many reads, few writes
the most common operations are lookups:
dependencies of the first and second degree
searching/verifying dependencies
I know SQL; I'll consider NoSQL.
I'm looking for pointers to good comparisons of the implementation options.
I'm also interested in something we can start with quickly, but that will be less painful to transition to something more robust/scalable later.
I found this example of modeling a directed acyclic graph in SQL:
http://www.codeproject.com/KB/database/Modeling_DAGs_on_SQL_DBs.aspx?msg=3051183
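For the first- and second-degree lookups, the approach in that article boils down to a recursive query over an edge table. A minimal sketch with SQLite; the table layout and curriculum data are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (parent TEXT, child TEXT);
    INSERT INTO edges VALUES
      ('algebra', 'fractions'),
      ('fractions', 'division'),
      ('fractions', 'multiplication');
""")

# dependencies of 'algebra' up to depth 2, via a recursive CTE
rows = con.execute("""
    WITH RECURSIVE deps(node, depth) AS (
      SELECT child, 1 FROM edges WHERE parent = ?
      UNION ALL
      SELECT e.child, d.depth + 1
      FROM edges e JOIN deps d ON e.parent = d.node
      WHERE d.depth < 2
    )
    SELECT node, depth FROM deps
""", ("algebra",)).fetchall()

print(rows)  # expected: [('fractions', 1), ('division', 2), ('multiplication', 2)]
```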
I think the upcoming version (in beta at the moment) of the Ruby bindings for the graph database Neo4j should be a good fit. It's meant for use with Rails 3. The underlying data model uses nodes and directed relationships/edges, with key/value-style attributes on both. To scale read-mostly architectures, Neo4j uses a master/slave replication setup.
You could use OrientDB as the graph database. It's highly optimized for relationships, since they are stored as links rather than JOINs. Loading a bidirectional graph with 1,000 vertices takes only a few milliseconds.
The language binding for Rails is not yet available, but you can use it via HTTP RESTful calls.
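For example, a sketch of querying OrientDB over its REST interface from Python; the database name, credentials, and query are assumptions for illustration:

```python
import requests

# OrientDB serves a REST API on port 2480 by default;
# database name and credentials here are placeholders
resp = requests.get(
    "http://localhost:2480/query/mydb/sql/select from V limit 5",
    auth=("admin", "admin"),
)
print(resp.json())
```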
You might want to take a look at the acts-as-dag gem.
https://github.com/resgraph/acts-as-dag
There is also some good writing on DAGs in SQL for people who might need background on this:
http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o