Migrating from Titan to DataStax Enterprise Graph

I'm migrating from Titan to DataStax Enterprise Graph. I have a graph with around 50 million nodes, composed of Persons, Addresses, Phones, etc.
I want to calculate each Person node's connections (how many persons share the same phone, address, etc.).
In Titan I wrote a Hadoop job that goes over all the person nodes, and then I could write a Gremlin script to see how many persons have the same phone as that particular node.
So as input properties I have:
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseInputFormat
titan.hadoop.input.conf.storage.backend=hbase
As a query filter I select only the person nodes:
titan.hadoop.graph.input.vertex-query-filter=v.query().has('type',Compare.EQUAL,'person')
And to run a script I use:
titan.hadoop.output.conf.script-file=scripts/calculate.groovy
This will calculate, for every node, the number of shared phone connections the person has:
object.phone_shared= object.as('x').out('person_phones').in('person_phones').except('x').count()
Is there a way to write this kind of script in DataStax to go over the person nodes? I see that DataStax uses Spark analytics, for example to count the nodes:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/graphAnalytics/northwindDemoGraphSnapshot.html
but I didn't find any more documentation on how to run custom scripts using analytics.
Thanks

The answer happens to be on the page you linked. It seems like it might just be a little easier than you are used to with Titan. The key is in step 8, where you configure the Traversal to use the preconfigured OLAP/Analytics TraversalSource, which is named a (for Analytics).
Alias the traversal to the Northwind analytics OLAP traversal source
a. Alias g to the OLAP traversal source for one-off analytic queries:
gremlin> :remote config alias g northwind.a
This basically says:
"When I execute a Traversal on TraversalSource g, I want it to be aliased to northwind.a on the server".
Once you do that, all Traversals of g will be executed using northwind.a and thus the Spark analytics engine.
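For example, a rough OLAP equivalent of the Titan script from your question might look like this. This is only a sketch: it assumes your graph is named mygraph, that persons are modelled with a person vertex label carrying a name property, and that the person_phones edge label from your question is unchanged.
gremlin> :remote config alias g mygraph.a
g.V().hasLabel('person').
  project('person', 'phone_shared').
    by('name').
    by(__.as('x').out('person_phones').in('person_phones').where(neq('x')).count())
Note that the TinkerPop 2 step except('x') from your Titan script becomes where(neq('x')) in the TinkerPop 3 dialect that DSE Graph uses.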

Related

Hybrid Query Example in AgensGraph

I am using AgensGraph but I don't know how to write a hybrid query; any examples of hybrid queries in AgensGraph would help a lot.
In AgensGraph you can write hybrid queries in two ways:
Let's say you create the following:
CREATE GRAPH AG;
CREATE VLABEL dev;
CREATE (:dev {name: 'someone', year: 2015});
CREATE (:dev {name: 'somebody', year: 2016});
CREATE TABLE history (year, event)
AS VALUES (1996, 'PostgreSQL'), (2016, 'AgensGraph');
1- Cypher in SQL
Syntax:
SELECT [column_name]
FROM ({table_name|SQL-query|CYPHERquery})
WHERE [column_name operator value];
Example:
SELECT n->>'name' as name
FROM history, (MATCH (n:dev) RETURN n) as dev
WHERE history.year > (n->>'year')::int;
Result:
  name
---------
 someone
(1 row)
2- SQL in Cypher
Syntax:
MATCH [table_name]
WHERE (column_name operator {value|SQLquery|CYPHERquery})
RETURN [column_name];
Example:
MATCH (n:dev)
WHERE n.year < (SELECT year FROM history WHERE event = 'AgensGraph')
RETURN properties(n) AS n;
Result:
                 n
------------------------------------
 {"name": "someone", "year": 2015}
(1 row)
You can find more information here
I found more info on the hybrid query language in these slides. Every other bit of information I have been able to find is just the same example that Eya posted, in different places.
I agree that more information about hybrid queries in AgensGraph would be great, as it seems like a killer feature of the software.
Let's assume that we have a network management system and we keep our network topology in the graph part of AgensGraph (graph format) and our time-series data (such as date and time information regarding specific devices) in the relational part of AgensGraph (table format). In this case we have a graph and tables, and if we want, we can write a hybrid query that fetches data from both models.
In our graph, we have different devices that are connected to each other, such as a modem, IoT sensors, etc. For each of these devices, we also have related information stored in tables, such as download speed, upload speed, or CPU usage.
In the following hybrid queries, our goal is to collect the information regarding specific devices by querying both the graph and the tables simultaneously.
Cypher in SQL
In this hybrid query, we want to find modem devices that are having issues and whose abnormality type is 2 (which indicates a problem with download and upload speed); after we find those devices, we return their id, download speed, and upload speed to investigate the issue. As you can see in the following query, the inner query is Cypher and the outer query is SQL.
SELECT id,sysdnbps, sysupbps
from public.modemrdb where to_jsonb(id) in
(SELECT id FROM (MATCH(m:modem) where
m.abnormaltype=2
return m.name)
AS s(id));
SQL in Cypher
In this hybrid query, we want to find modem devices whose CPU usage is more than 80 (outside the threshold range), which indicates there is an issue with these devices; after we find them, we return those modems and any IoT devices connected to them. As you can see in the following example, the inner query is SQL and the outer query is Cypher.
MATCH p=(n:modem)-[r*1..2]->(iot)
WHERE n.name in
(SELECT to_jsonb(id)
FROM public.modemrdb
WHERE syscpuusage >= 80)
RETURN p;
This can be another example of a hybrid query.

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph that share the same address with other persons, where the number of persons is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I noticed this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working.
I'm wondering how I can create an index to be able to run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true as under the covers it is turning on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use an index on the textual fields of your address nodes.
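A rough Gremlin sketch of that reified-address idea (assumptions only: an address vertex label, a has_address edge from person to address, an address property on address vertices, and a name property on person vertices; between(3, 6) keeps addresses shared by 3, 4, or 5 persons):
g.V().hasLabel('address').
  where(__.in('has_address').count().is(between(3, 6))).
  project('address', 'persons').
    by('address').
    by(__.in('has_address').values('name').fold())
This only sketches the counting logic; whether it can run without scans still depends on how the starting address vertices are selected (e.g. via an indexed property).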

Recursive Hierarchy Ranking

I have no idea if I wrote that correctly. I want to start learning higher-end data mining techniques, and I'm currently using SQL Server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
  - Maintenance
    - Management
      - Supervisory
      - Manager
      - Executive
    - Vendors
      - Secure
      - Per Diem
      - Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted hierarchies.
Each ID card is tagged to a level, so in the Maintenance example, Per Diem:Vendors:Maintenance:Root. Others may be tagged just to Vendors, some to general Maintenance itself (no one has Root, thank god).
So let's say I have 20 ID cards selected. These are available personnel I can task to a job, but since they have different areas of security, I want to find commonalities they can all work on together, as a 20-person group or whatever other groupings I can make.
So the intended output would be:
CommonMatch = Per Diem
  CardID = 1
  CardID = 3
CommonMatch = Vendors
  CardID = 1
  CardID = 3
  CardID = 20
So in the example above, while I could have two people working on Per Diem work, because that is their lowest common security similarity, there is also card holder #20, who has rights to the predecessor group (Vendors) that 1 and 3 also share, so I could have three of them work at that level.
I'm not looking for anyone to do the work for me (although examples are always welcome), more to be pointed in the right direction on what I should be studying and what this kind of problem is called, etc. I know CTEs are one way to go, but that seems like only a tool in a much bigger process that needs to be done.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to lay out the hierarchy tree and then assign each card ID to the path implied by its security clearance level. For this purpose, each node of the hierarchy tree becomes a container of card IDs (e.g. each node of the hierarchy tree holds a) its own name (as a unique identifier), b) pointers to other nodes, and c) a list of card IDs assigned to its "name").
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a matter of traversing the tree from that specific level downwards to the tree's leaves, collecting the card IDs from the node containers as they are encountered.
Suppose that we have this access tree:
A
+- B
+- C
+- D
   +- E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B, C, and E only make sense as tags; there is no structural information associated with them. We therefore first need to "build" the tree. The following example uses NetworkX, but the same thing can be achieved in a multitude of ways:
import networkx
G = networkx.DiGraph() #Establish a directed graph
G.add_edge("A","B")
G.add_edge("A","C")
G.add_edge("A","D")
G.add_edge("D","E")
Now, assign the card IDs to the node containers (in NetworkX, node attributes can hold any valid Python object, so I am going to use a very simple list stored under a "cards" attribute):
G.nodes["B"]["cards"] = [1, 2, 3]
G.nodes["C"]["cards"] = [4, 8]
G.nodes["E"]["cards"] = [10, 12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards, either via Depth First Search (DFS) or Breadth First Search (BFS), and collect the card IDs from the containers. I am going to use DFS here, purely because NetworkX has a function that directly returns the visited nodes in visiting order.
# dfs_preorder_nodes returns a generator; this is an efficient way of iterating very large
# collections in Python, but I am casting it to a list here so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G, "A"))  # Start from node "A" and DFS downwards
cardIDs = []
# I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
    # Nodes like "A" and "D" have no "cards" attribute, so default to an empty list
    cardIDs.extend(G.nodes[aNodeID].get("cards", []))
At the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you still traverse it the same way, requiring only a single point of entry (the top-level branch).
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment, but relational databases do not handle graph-type data with great ease. You might get away with it for something like a simple tree (like what you have here, for example), but the hassle of supporting this probably justifies undertaking the process outside of the database (e.g. use the database just for retrieving the data and carry out the graph-type processing in a different environment; doing a DFS in SQL is the sort of hassle I am referring to above).
Hope this helps.

how to list job ids from all users?

I'm using the Java API to query for all job IDs using the code below:
Bigquery.Jobs.List list = bigquery.jobs().list(projectId);
list.setAllUsers(true);
but it doesn't list the job IDs that were run by a Client ID for web applications (i.e. Metric Insights). I'm using private key authentication.
Using the command-line tool, 'bq ls -j' in turn gives me only the Metric Insights job IDs but not the ones run with the private key auth. Is there a "get all" method?
The reason I'm doing this is to get better visibility into which queries are eating up our data usage. We have multiple sources of queries: Metric Insights, in-house automation, some done manually, etc.
As of version 2.0.10, the bq client has support for API authorization using service account credentials. You can specify using a specific service account with the following flags:
bq --service_account your_service_account_here@developer.gserviceaccount.com \
--service_account_credential_store my_credential_file \
--service_account_private_key_file mykey.p12 <your_commands, etc>
Type bq --help for more information.
My hunch is that listing jobs for all users is broken, and nobody has mentioned it since there is usually a workaround. I'm currently investigating.
Jordan -- It sounds like you're homing in on what we want to do. For all access that we've allowed into our project/dataset, we want to produce an aggregate report of the "totalBytesProcessed" for all queries executed.
The problem we're struggling with is that we have a handful of distinct Java programs accessing our data, a third-party service (Metric Insights), and 7-8 individual users who have query access via the web interface. Fortunately the incoming data only has one source, so explaining the cost for that is simple. For queries, though, I am kind of blind at the moment (and it appears queries will be the bulk of the monthly bill).
It would be ideal if I could get the underlying data for this report with just one listing made with some single top-level auth. With that, I think I can attribute each query to a source from the timestamps and the actual SQL text.
One thing that might make this problem far easier is if there were more information in the job record (or some text adornment in the job_id for queries). I don't see that I can assign my own job IDs to queries (perhaps I missed it?), and perhaps recording some source information in the job record would be possible? Just thinking out loud now...
There are three tables you can query for this.
region-**.INFORMATION_SCHEMA.JOBS_BY_{USER, PROJECT, ORGANIZATION}
Where ** should be replaced by your region.
Example query for JOBS_BY_USER in the eu region:
select
  count(*) as num_queries,
  date(creation_time) as date,
  sum(total_bytes_processed) as total_bytes_processed,
  sum(total_slot_ms) as total_slot_ms_cost
from
  `region-eu.INFORMATION_SCHEMA.JOBS_BY_USER` as jobs_by_user,
  jobs_by_user.referenced_tables
group by
  2
order by 2 desc, total_bytes_processed desc;
Documentation is available at:
https://cloud.google.com/bigquery/docs/information-schema-jobs

Microsoft Decision Trees: support cases for a specific node

I'm using Microsoft Decision Trees in Microsoft Analysis Services Data Mining, and need to show the historical data (the support cases from the training data used to train the decision tree) for a given leaf node in my mining model. Is there a way to access those records directly based on the NodeID using a DMX query, or is the only way to get the NODE_DESCRIPTION for the node, replace 'not =' with '<>', and execute a query against my live database with that as my WHERE clause?
Courtesy of rok1 on the MSDN forums: http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/e6502263-a2b9-4fa1-b60b-04414e3efd29
SELECT * FROM [ModelName].Cases
WHERE IsTrainingCase()
AND IsInNode('0') -- your intended node