Counting Ignite Physical Servers - SQL

Newbie question, but I'm not seeing a clear answer in the docs: I want to run a query on Ignite (2.13) that returns the number of physical Ignite servers, even though Ignite is running within containers. I suspect this will require some inference, as Ignite reports an IP address per server (container or physical).
Something like SELECT * FROM SYS.NODES; but somehow collapsing containers on the same server together.
Any thoughts? Thx!

There is no such built-in machinery; why would you need it?
I suppose you might do the following instead:
mark all containers running on the same machine with a single user attribute, like MY_SERVER_NUMBER_1, MY_SERVER_NUMBER_2;
query the nodes and filter by the unique attribute name, something like (see the sketch below for the startup side):
select count(distinct NAME) from sys.NODE_ATTRIBUTES where NAME like 'MY_SERVER_NUMBER_%'
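A minimal sketch of that startup side, assuming each container is launched with an environment variable (MY_SERVER_ID here is an illustrative name, not a built-in) identifying its physical host:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartContainerNode {
    public static void main(String[] args) {
        // Assumed to be set per physical machine, e.g. "1", "2", ...
        String serverId = System.getenv("MY_SERVER_ID");

        IgniteConfiguration cfg = new IgniteConfiguration();
        // Every container on the same machine registers the same attribute name,
        // so counting distinct names in SYS.NODE_ATTRIBUTES counts physical machines.
        cfg.setUserAttributes(
            Collections.singletonMap("MY_SERVER_NUMBER_" + serverId, "true"));

        Ignite ignite = Ignition.start(cfg);
    }
}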

Related

Ignite loading similar data to particular instance

So I'm really new to Apache Ignite here. What I'm trying to do is load data having similar properties to a single instance, rather than it being loaded to random instances. For example, say that I have some data of this form:
ROLL_NO
34569
12349
34439
45329
32359
43549
53259
34229
As you can see, the above data all ends with 9. Say that I have two Ignite instances A and B currently running. Is there any way I can load these data ending with 9 to either instance A or instance B, and NOT BOTH?
Please let me know if this is possible and, if so, how to accomplish this. Thanks in advance.
First of all, Ignite is a key-value storage, so you need to define what is a key and what is a value. The key should contain some ID that will uniquely identify an entry, and an affinity key that can be the same for multiple entries. All entries with the same affinity key will reside in the same partition. Please refer to this page for more details: https://apacheignite.readme.io/docs/affinity-collocation
You need to set an AffinityKeyMapper for your cache. Read the javadoc for details:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cache/affinity/AffinityKeyMapper.java
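A minimal sketch of that idea, using the @AffinityKeyMapped annotation (a declarative alternative to implementing AffinityKeyMapper); the RollKey class and the last-digit affinity field are illustrative assumptions:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;

public class RollNoAffinity {
    // Composite key: rollNo uniquely identifies an entry, lastDigit drives placement.
    static class RollKey {
        private final int rollNo;

        @AffinityKeyMapped
        private final int lastDigit;

        RollKey(int rollNo) {
            this.rollNo = rollNo;
            this.lastDigit = rollNo % 10; // all roll numbers ending in 9 share a partition
        }

        @Override public boolean equals(Object o) {
            return o instanceof RollKey && ((RollKey) o).rollNo == rollNo;
        }

        @Override public int hashCode() {
            return rollNo;
        }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<RollKey, String> cache = ignite.getOrCreateCache("rolls");
            // Both keys end in 9, so both entries land in the same partition and
            // therefore on the same instance (A or B, but never split across both).
            cache.put(new RollKey(34569), "some data");
            cache.put(new RollKey(12349), "more data");
        }
    }
}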

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph who share the same address with other persons, where the number of persons at that address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I noticed this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working.
I'm wondering how I can create an index to be able to run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true as under the covers it is turning on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution to your problem.
The best way to do this would be to reify addresses as nodes and look for address nodes with an in-degree between 3 and 5.
You can use an index on the textual fields of your address nodes.
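A minimal sketch of that in-degree query with the TinkerPop Java API, assuming addresses are vertices reached over has_address edges (obtaining the GraphTraversalSource is omitted):

import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class SharedAddresses {
    // Addresses with 3 to 5 incoming has_address edges, then back to the persons there.
    static List<Vertex> personsAtSharedAddresses(GraphTraversalSource g) {
        return g.V().hasLabel("address")
                .where(__.inE("has_address").count().is(P.between(3, 6))) // between is [3, 6)
                .in("has_address")
                .toList();
    }
}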

Migrating from Titan to DataStax Enterprise Graph

I'm migrating from Titan to DataStax. I have a graph with around 50 million nodes composed of Persons, Addresses, Phones, etc.
I want to calculate a Person node's connections (how many persons have the same phone, address, etc.).
In Titan I wrote a Hadoop job that goes over all the person nodes; then I could write a Gremlin script to see how many persons have the same phone for a particular node.
So as input properties I have:
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseInputFormat
titan.hadoop.input.conf.storage.backend=hbase
As a query filter I select only the person nodes:
titan.hadoop.graph.input.vertex-query-filter=v.query().has('type',Compare.EQUAL,'person')
And to run a script I use:
titan.hadoop.output.conf.script-file=scripts/calculate.groovy
This will calculate, for every node, the number of shared phone connections that the person has:
object.phone_shared= object.as('x').out('person_phones').in('person_phones').except('x').count()
Is there a way to write this kind of script in DataStax to go over the person nodes? I see that DataStax uses Spark analytics to count the nodes, for example:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/graphAnalytics/northwindDemoGraphSnapshot.html
but I didn't find any more documentation on how to run custom scripts using analytics.
Thanks
The answer happens to be on the page you linked. It seems like it might just be a little easier than what you are used to with Titan. The key is in step 8, where you configure the traversal to use the preconfigured OLAP/Analytics TraversalSource, which is named a (for Analytics).
Alias the traversal to the Northwind analytics OLAP traversal source
a. Alias g to the OLAP traversal source for one-off analytic queries:
gremlin> :remote config alias g northwind.a
This basically says:
"When I execute a Traversal on TraversalSource g, I want it to be aliased to northwind.a on the server".
Once you do that, all Traversals of g will be executed using northwind.a and thus the Spark analytics engine.
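For example, once the alias is in place, an otherwise ordinary traversal like this one (the label is illustrative) is executed as an OLAP job on Spark:

gremlin> g.V().hasLabel('person').count()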

How to list job IDs from all users?

I'm using the Java API to query for all job IDs using the code below:
Bigquery.Jobs.List list = bigquery.jobs().list(projectId);
list.setAllUsers(true);
but it doesn't list the job IDs that were run by a Client ID for web applications (i.e. Metric Insights). I'm using private key authentication.
Using the command line tool ('bq ls -j') in turn gives me only the Metric Insights job IDs, but not the ones run with the private key auth. Is there a get-all method?
The reason I'm doing this is to get better visibility into which queries are eating up our data usage. We have multiple sources of queries: Metric Insights, in-house automation, some done manually, etc.
As of version 2.0.10, the bq client has support for API authorization using service account credentials. You can specify a particular service account with the following flags:
bq --service_account your_service_account_here@developer.gserviceaccount.com \
--service_account_credential_store my_credential_file \
--service_account_private_key_file mykey.p12 <your_commands, etc>
Type bq --help for more information.
My hunch is that listing jobs for all users is broken, and nobody has mentioned it since there is usually a workaround. I'm currently investigating.
Jordan -- it sounds like you're homing in on what we want to do. For all access that we've allowed into our project/dataset, we want to produce an aggregate report of the "totalBytesProcessed" for all queries executed.
The problem we're struggling with is that we have a handful of distinct Java programs accessing our data, a 3rd-party service (Metric Insights), and 7-8 individual users who have query access via the web interface. Fortunately, the incoming data only has one source, so explaining the cost for that is simple. For queries, though, I am somewhat blind at the moment (and it appears queries will be the bulk of the monthly bill).
It would be ideal if I could get the underlying data for this report with just one listing made with some single top-level auth. With that, I think from the timestamps and the actual SQL text I can attribute each query to a source.
One thing that might make this problem far easier is if there were more information in the job record (or some text adornment in the job_id for queries). I don't see that I can assign my own jobIDs on queries (perhaps I missed it?) and perhaps recording some source information in the job record would be possible? Just thinking out loud now...
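As an aside on the job-ID question: the Java client used above does let you supply your own job ID via a JobReference when inserting a query job, so a source tag could be embedded there. A sketch (the prefix scheme and the query are illustrative):

import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.JobReference;

// Insert a query job whose caller-chosen job ID identifies the source system.
Job job = new Job()
    .setJobReference(new JobReference()
        .setProjectId(projectId)
        .setJobId("metric_insights_" + System.currentTimeMillis()))
    .setConfiguration(new JobConfiguration()
        .setQuery(new JobConfigurationQuery().setQuery("SELECT 17")));
bigquery.jobs().insert(projectId, job).execute();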
There are three tables you can query for this.
region-**.INFORMATION_SCHEMA.JOBS_BY_{USER, PROJECT, ORGANIZATION}
Where ** should be replaced by your region.
Example query for JOBS_BY_USER in the eu region:
select
  count(*) as num_queries,
  date(creation_time) as date,
  sum(total_bytes_processed) as total_bytes_processed,
  sum(total_slot_ms) as total_slot_ms_cost
from
  `region-eu.INFORMATION_SCHEMA.JOBS_BY_USER` as jobs_by_user,
  jobs_by_user.referenced_tables
group by
  2
order by
  2 desc, total_bytes_processed desc;
Documentation is available at:
https://cloud.google.com/bigquery/docs/information-schema-jobs
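If the goal is attributing usage to each source, as discussed above, these views also expose user_email, job_type, and the query text, so a per-user rollup is possible. A sketch against the eu region (column names per the linked documentation):

select
  user_email,
  count(*) as num_queries,
  sum(total_bytes_processed) as total_bytes_processed
from `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
where job_type = 'QUERY'
group by user_email
order by total_bytes_processed desc;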

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on Neo4j and PHP with about 200K nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity, like users and companies, which holds the nodes of that type. So the users index holds 130K nodes, and the rest are in companies.
With Cypher we query like this:
START u=node:users('id:*')
RETURN count(u)
And the result is:
Returned 1 row. Query took 4080ms
The server is configured with the defaults plus a few small tweaks, but 4 seconds is too slow for our needs. Consider that the database will grow by 20K nodes a month, so we need this query to perform very well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution:
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you just need this single query most of the time, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and, for good measure, refresh it every night by running the query above?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are done guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    ...
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
        ((Integer)countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or, if you are using 2.0:
company1:COMPANY
The second would also allow you to automatically update your index in a separate background thread, which, by the way, is in my opinion one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" generally takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)