Hybrid Query Example in AgensGraph - sql

I am using agensgraph but I dont know how to write a hybrid query, any examples of hybrid query in agensgraph would help a lot.

In AgensGraph you can write hybrid queries in two ways:
Let's say you are creating the followings:
CREATE GRAPH AG;
CREATE VLABEL dev;
CREATE (:dev {name: 'someone', year: 2015});
CREATE (:dev {name: 'somebody', year: 2016});
CREATE TABLE history (year, event)
AS VALUES (1996, 'PostgreSQL'), (2016, 'AgensGraph');
1- Cypher in SQL
Syntax:
SELECT [column_name]
FROM ({table_name|SQL-query|CYPHERquery})
WHERE [column_name operator value];
Example:
SELECT n->>'name' as name
FROM history, (MATCH (n:dev) RETURN n) as dev
WHERE history.year > (n->>'year')::int;
Result:
name ----
someone
(1 row)
2- SQL in Cypher
Syntax:
MATCH [table_name]
WHERE (column_name operator {value|SQLquery|CYPHERquery})
RETURN [column_name];
Example:
MATCH (n:dev)
WHERE n.year < (SELECT year FROM history WHERE event =
'AgensGraph')
RETURN properties(n) AS n;
Result:
n ----
{"name": "someone", "year": 2015}
(1 row)
You can find more information here

I found more info on the hybrid query language in these slides. Every other bit of information I have been able to find is just the same example that Eya posted, in different places.
I agree that more information about the hybrid queries in AgensGraph would be great, as it seems like a killer feature of software.

Let’s assume that we have a network management system and we are keeping our network topology in graph part of the AgensGraph (Graph Format) and our time-series data (such as date&time information regarding specific devices) in the relational part of the AgensGraph (Table Format). So, in this case, we know that we have a graph, tables and if we want, we can write a hybrid query to fetch data from both models.
In our graph, we have different devices that are connected to each other such as a modem, IoT sensors, etc. for each of these devices, we also have some information respectively stored in tables - related to those devices such as download speed, the upload speed or CPU usage.
In the following hybrid queries, our goal is to collect the information regarding specific devices by querying both from the graph and the tables simultaneously.
Cypher in SQL
In this hybrid query, we are looking to find modem devices which are having issues and their abnormality type is 2 (2 indicates that this device is having some issues regarding its download and upload speed) and after we find those devices, our goal is to return their id, download, and upload speed to investigate the issue. As you can see in the following query our inner query is Cypher and our outer query is SQL.
SELECT id,sysdnbps, sysupbps
from public.modemrdb where to_jsonb(id) in
(SELECT id FROM (MATCH(m:modem) where
m.abnormaltype=2
return m.name)
AS s(id));
SQL in Cypher
In this hybrid query, we are looking to find modem devices which their CPU usages are more than 80 (not in range of threshold) which indicate there is an issue with these devices and after we find those devices, our goal is to return that modems and any IoT devices that are connected to them. As you can see in the following example our inner query is SQL and our outer query is Cypher.
MATCH p=(n:modem)-[r*1..2]->(iot)
WHERE n.name in
(SELECT to_jsonb(id)
FROM public.modemrdb
WHERE syscpuusage >= 80)
RETURN p;
This can be another example of a hybrid query.

Related

How to group similar GTFS trips

I need to group GTFS trips to human understandable "route variants". As one route can have run different trips based on day/time etc.
Is there any preferred way to group similar trips? Trip shape_id looks promising, but is there any guarantee that all similar trips has same shape_id?
My GTFS data is imported my sql database and the database structure is the same as GTFS txt files.
UPDATE
Im not looking sql query example, im looking high level example how to group similar trips to user friendly "route variants".
Many route planning apps (like Moovit) use GTFS data as source and they display different route variants to users.
There is no official way to do this. The best way is probably to group by the ordered list of stops on each trip, sometimes known as the "stopping pattern" of the trip. The idea is discussed at a conceptual level here by Mapzen.
In practice, I have created concatenated strings of all stops on a given trip (from stop_times), and grouped by that to define similar trips. E.g., if the stops on a given trip are A, B, C, D, and E, create a string A-B-C-D-E or A_B_C_D_E and group trips on that string. This functionality is not part of the SQL spec, although MySQL implements it as GROUP_CONCAT and PostgreSQL uses arrays and array_to_string. You may also want to add route_id and shape_id into the grouping as well, to handle some corner cases.

How to search github projects ordered by number of commits?

I was thinking of trying out BigQuery and GithubArchive, but I'm not sure how to compose a query that would let me search for a term in code or project and order the results by number of commits descending.
Thanks for any tips
The GithubArchive data loaded into BigQuery doesn't have copy of the source code, so search term in code wouldn't be possible. But if you wanted to search for a term in repository description, and then pick top repositories by number of commits, here is an example how to do it (the term is "SQL" in this example):
select count(*) c, repository_url, repository_description
from [githubarchive:github.timeline]
where type = 'PushEvent' and repository_description contains 'SQL'
group by 2, 3
order by c desc
limit 10
This results in
14925 https://github.com/danberindei/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
9377 https://github.com/postgres/postgres Mirror of the official PostgreSQL GIT repository. Note that this is just a *mirror* - we don't work with pull requests on github. To contribute, please see http://wiki.postgresql.org/wiki/Submitting_a_Patch
4876 https://github.com/galderz/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
4747 https://github.com/triAGENS/ArangoDB ArangoDB is a multi-purpose, open-source database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript/Ruby extensions. Use ACID transaction if you require them. Scale horizontally and vertically with a few mouse clicks.
3590 https://github.com/webnotes/erpnext Open Source, web-based ERP based on Python, Javascript and MySQL.
3489 https://github.com/anistor/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
3263 https://github.com/youtube/vitess vitess provides servers and tools which facilitate scaling of MySQL databases for large scale web services.
3071 https://github.com/infinispan/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
2631 https://github.com/theory/sqitch Simple SQL change management
2358 https://github.com/zzzeek/sqlalchemy Mirror of SQLAlchemy
SELECT COUNT(1) c, repository_url, repository_description
FROM [githubarchive:github.timeline]
WHERE type = 'PushEvent'
AND REGEXP_MATCH(repository_description, r'(?i)SQL')
GROUP BY 2, 3
ORDER BY c DESC
LIMIT 10
BigQuery supports Regular Expression so you can greatly improve / narrow down your search result having flexibility of using search pattern vs. seach term
Below references can help you further:
BigQuery Regular expression functions
re2 Syntax

Is Bigtable (or BigQuery) the right platform for correlation analysis of logs?

I'm faced with the challenge of analysing different system logfiles based on following requirements:
several hundred systems
millions of logs every day in different formats
Beside many other objectives my biggest challenge is a realtime correlation analysis of all incoming logs on all current system logs and also on partially historical log events.
Currently we're focusing on MongoDB, ElasticSearch, Hadoop, ... to meet this challenge.
On the other hand I've read some interesting things about Google Bigtable and Bigquery.
So my question is, is Bigtable and/or Bigquery a solution worth looking at, in order to do this realtime analysis ?
I've no experience with these two products, so I'm hoping for some tips whether these Google solutions could be an alternative for my requirements.
THX & BR
bdriven
EDIT:
too broad. you need to show actual analisis you need to make. bigquery will be much much cheaper that homemade with nosql
Our goal is, to develop a system, which is able to generate warnings based on current log events (or a combination of different log events) and their past interactions on other systems behavior.
Therefore we have to be able to do fast correlation analysis for current events against huge amounts of unstructured historical data.
I know that this requirement description is probably not the most specific one, but we're right at the beginning of this project.
So my goal with this question is to get some arguments for our next team meeting, whether we should consider to take a closer look at Bigtable / Bigquery or not.
One of my favorite features of BigQuery is its ability to run correlations.
Here's a correlations with BigQuery tutorial I wrote a couple years ago: http://nbviewer.ipython.org/gist/fhoffa/6459195
For example, to rank and find the most correlated airports in terms of flight delays:
SELECT a.departure_state, b.departure_state, corr(a.avg, b.avg) corr, COUNT(*) c
FROM
(SELECT date, departure_state, AVG(departure_delay) avg , COUNT(*) c
FROM [bigquery-samples:airline_ontime_data.flights]
GROUP BY 1,2 HAVING c > 5
) a
JOIN
(SELECT date, departure_state ,
AVG(departure_delay) avg, COUNT(*) c FROM [bigquery-samples:airline_ontime_data.flights]
GROUP BY 1,2 HAVING c > 5 ) b
ON a.date=b.date
WHERE a.departure_state < b.departure_state
GROUP EACH BY 1, 2
HAVING c > 5
ORDER BY corr DESC;
Try it yourself in the next 5 minutes! A quick getting started tutorial: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/

Microsoft Decision Trees: support cases for a specific node

I'm using Microsoft Decision Trees in Microsoft Analysis Services Data Mining, and need to show the historical data (the support cases from the training data used to train the decision tree) for a given leaf node in my mining model. Is there a way to access those records directly based on the NodeID using a DMX query, or is the only way to get the NODE_DESCRIPTION for the node, replace not = with <> and execute a query against my live database with that as my WHERE clause?
Courtesy of rok1 on the MSDN forums: http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/e6502263-a2b9-4fa1-b60b-04414e3efd29
SELECT * FROM [ModelName].Cases
where ISTrainingCase()
and IsInNode('0') --your intended node

Efficient way to compute accumulating value in sqlite3

I have an sqlite3 table that tells when I gain/lose points in a game. Sample/query result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start w/ 1000 points,
here's one way to do it.
SELECT DISTINCT(time), (SELECT
1000+SUM(p2) FROM events e WHERE p1='barrycarter' AND action='points'
AND e.time <= e2.time) AS points FROM events e2 WHERE p1='barrycarter'
AND action='points' ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has #variables so you can do things like:
SELECT time, #tot := #tot+points ...
but I'm using sqlite3 and the above isn't ANSI standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any
single SELECT query on "http://ccgames.db.94y.info/". I want to give
them useful access to my data, but not to the point of allowing
scripting or allowing multiple queries with state. So I need a single
SQL query that can do accumulation. See also:
Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not solvable using SQLite alone, or it will be terribly slow as you have found. The query you have written is a triangular cross join that will not scale, or rather, will scale badly.
The most efficient way to tackle the problem is through the program that is making use of SQLite, e.g. if you were using Web SQL in HTML5, you can easily accumulate in JavaScript.
There is a discussion about this problem in the sqlite mailing list.
Your 2 options are:
Iterate through all the rows with a cursor and calculate the running sum on the client.
Store sums instead of, or as well as storing points. (if you only store sums you can get the points by doing sum(n) - sum(n-1) which is fast).