Can GraphDB load 10 million statements with OWL reasoning?

I am struggling to load most of the Drug Ontology OWL files and most of the ChEBI OWL files into a GraphDB Free v8.3 repository with Optimized OWL Horst reasoning on.
Is this possible? Should I do something other than "be patient"?
Details:
I'm using the loadrdf offline bulk loader to populate an AWS r4.16xlarge instance with 488.0 GiB of RAM and 64 vCPUs.
Over the weekend, I played around with different pool buffer sizes and found that most of these files individually load fastest with a pool buffer of 2,000 or 20,000 statements instead of the suggested 200,000. I also added -Xmx470g to the loadrdf script. Most of the OWL files would load individually in less than one hour.
Around 10 pm EDT last night, I started loading all of the files listed below simultaneously. Now, 11 hours later, there are still millions of statements to go, and the load rate is around 70 statements/second. It appears that only 30% of my RAM is being used, but the CPU load is consistently around 60.
Are there websites that document other people doing something of this scale?
Should I be using a different reasoning configuration? I chose this configuration because it was the fastest-loading OWL configuration in my experiments over the weekend. I think I will need to look for relationships that go beyond rdfs:subClassOf.
Files I'm trying to load:
+-------------+------------+---------------------+
| bytes | statements | file |
+-------------+------------+---------------------+
| 471,265,716 | 4,268,532 | chebi.owl |
| 61,529 | 451 | chebi-disjoints.owl |
| 82,449 | 1,076 | chebi-proteins.owl |
| 10,237,338 | 135,369 | dron-chebi.owl |
| 2,374 | 16 | dron-full.owl |
| 170,896 | 2,257 | dron-hand.owl |
| 140,434,070 | 1,986,609 | dron-ingredient.owl |
| 2,391 | 16 | dron-lite.owl |
| 234,853,064 | 2,495,144 | dron-ndc.owl |
| 4,970 | 28 | dron-pro.owl |
| 37,198,480 | 301,031 | dron-rxnorm.owl |
| 137,507 | 1,228 | dron-upper.owl |
+-------------+------------+---------------------+

@MarkMiller, you can take a look at the Preload tool, which is part of the GraphDB 8.4.0 release. It's specially designed to handle large amounts of data at constant speed. Note that it works without inference, so you'll need to load your data, then change the ruleset and re-infer the statements.
http://graphdb.ontotext.com/documentation/free/loading-data-using-preload.html
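For reference, a Preload run followed by the ruleset switch described in the next answer might look roughly like the sketch below; the flag names here are my assumption, so check the linked documentation for the exact options in your GraphDB version.
# Offline bulk load with no inference (flag names are an assumption -- see the preload docs)
./preload -f -c repo-config.ttl /data/owl/*.owl
# Afterwards, change the repository's ruleset and re-infer via SPARQL, as shown in the next answer.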

Just typing out @Konstantin Petrov's correct suggestion with tidier formatting. All of these queries should be run in the repository of interest... at some point while working this out, I misled myself into thinking that I should be connected to the SYSTEM repo when running these queries.
All of these queries also require the following prefix definition:
prefix sys: <http://www.ontotext.com/owlim/system#>
This doesn't directly address the timing/performance of loading large datasets into an OWL reasoning repository, but it does show how to switch to a higher level of reasoning after loading lots of triples into a no-inference ("empty" ruleset) repository.
You could start by querying for the current reasoning level/ruleset, and then run this same SELECT statement after each INSERT.
SELECT ?state ?ruleset {
?state sys:listRulesets ?ruleset
}
Add a predefined ruleset
INSERT DATA {
_:b sys:addRuleset "rdfsplus-optimized"
}
Make the new ruleset the default
INSERT DATA {
_:b sys:defaultRuleset "rdfsplus-optimized"
}
Re-infer... could take a long time!
INSERT DATA {
[] sys:reinfer []
}

Related

Kusto convert Bytes to MeBytes by division

Trying to graph Bandwidth consumed using Azure Log Analytics
Perf
| where TimeGenerated > ago(1d)
| where CounterName contains "Network Send"
| summarize sum(CounterValue) by bin(TimeGenerated, 1m), _ResourceId
| render timechart
This generates a reasonable chart except the y axis runs from 0 - 15,000,000,000. I tried
Perf
| where TimeGenerated > ago(1d)
| where CounterName contains "Network Send"
| extend MeB_bandwidth_out = todouble(CounterValue)/1,048,576
| summarize sum(MeB_bandwidth_out) by bin(TimeGenerated, 1m), _ResourceId
| render timechart
but I get exactly the same chart. I've tried without the todouble(), or doing it after the division, but nothing changes. Any hint why this is not working?
It's a bit hard to say without seeing a sample of the data, but here are a couple of ideas:
Try removing the commas from 1,048,576 (see the corrected query below).
If that doesn't work, remove the last line (| render timechart) from both queries, run them, and compare the results to see why the data doesn't make sense.
P.S. Regardless, there's a good chance that you can replace contains with has to significantly improve performance (note that has looks for full words, while contains doesn't, so they are not the same; be careful).
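Putting the first suggestion together with the original query, the corrected version would look something like this (keeping contains here; swapping it for has, as noted in the P.S., is a separate optimization):
Perf
| where TimeGenerated > ago(1d)
| where CounterName contains "Network Send"
| extend MeB_bandwidth_out = todouble(CounterValue) / 1048576
| summarize sum(MeB_bandwidth_out) by bin(TimeGenerated, 1m), _ResourceId
| render timechart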

How to see the data (primary and backup) on each node in Apache Ignite?

I have 6 Ignite nodes, all connected to form a cluster. I have also set the number of backup copies to 2. Now I have sent 20 entries to the cluster to check the partitioning and the data (primary and backup). I can see the count using the cache -a -r command.
Is there a command or another way to see the actual data on each node, both the primary data and the backup copies?
You could use cache -scan -c=cacheName
Entries in cache: SQL_PUBLIC_PERSON
+=============================================================================================================================================+
| Key Class | Key | Value Class | Value |
+=============================================================================================================================================+
| java.lang.Integer | 1 | o.a.i.i.binary.BinaryObjectImpl | SQL_PUBLIC_PERSON_.. [hash=357088963, NAME=Name1] |
+---------------------------------------------------------------------------------------------------------------------------------------------+
Use help cache to see all cache-related commands.
See: https://apacheignite-tools.readme.io/docs/command-line-interface
You also have the option of turning on SQL: https://apacheignite-sql.readme.io/docs/schema-and-indexes
and: https://apacheignite-sql.readme.io/docs/getting-started
then use JDBC/SQL to see the entries in your cache.
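A minimal sketch of the JDBC route, assuming the thin JDBC driver (bundled in ignite-core) on its default port 10800 and the SQL_PUBLIC_PERSON cache from the scan above; the host and table name are assumptions:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ScanPersonCache {
    public static void main(String[] args) throws Exception {
        // Ignite thin JDBC driver; point it at any server node (host/port are assumptions).
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800/");
             Statement stmt = conn.createStatement();
             // The SQL_PUBLIC_PERSON cache from the example maps to table PERSON in the PUBLIC schema.
             ResultSet rs = stmt.executeQuery("SELECT * FROM PERSON")) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.append(meta.getColumnName(i)).append('=').append(rs.getObject(i)).append(' ');
                }
                System.out.println(row);
            }
        }
    }
}
Note that a SQL query returns entries from across the cluster; it does not distinguish which node holds the primary or backup copy.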

Chrome Selenium IDE 3.4.5 extension "execute script" command not storing as variable

I am trying to create a random number generator:
Command | Tgt | Val |
store | tom | tester
store | dominic | envr
execute script | Math.floor(Math.random()*11111); | number
type | id=XXX | ${tester}.${dominic}.${number}
Expected result:
tom.dominic.0 <-- random number
Instead I get this:
tom.dominic.${number}
I looked through all the resources, and it seems a recent Selenium update/version has changed the approach; I cannot find a solution.
I realize this question is 2 years old, but it's a fairly common one, so I'll answer it and see if there are other answers that address it.
If you want to assign the result of a script run by the "execute script" command in Selenium IDE to a Selenium variable, you have to return the value from the JavaScript. So instead of
execute script | Math.floor(Math.random()*11111); | number
you need
execute script | return Math.floor(Math.random()*11111); | number
Also, in your final assignment that puts the 3 pieces together, you need ${envr} instead of ${dominic} (envr is the variable name; dominic is the value stored in it).
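For clarity, here is the full set of steps from the question with both fixes applied (id=XXX is just the placeholder locator from the question):
Command | Tgt | Val |
store | tom | tester
store | dominic | envr
execute script | return Math.floor(Math.random()*11111); | number
type | id=XXX | ${tester}.${envr}.${number}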

What is the easiest way to find the biggest objects in Redis?

I have a 20GB+ rdb dump in production.
I suspect there's a specific set of keys bloating it.
I'd like to have a way to always spot the 100 biggest objects, either from static dump analysis or by asking the server itself, which by the way holds over 7M objects.
Dump analysis tools like rdbtools are not helpful in this (I think) really common use case!
I was thinking of writing a script that iterates over the whole keyset with "redis-cli debug object", but I have the feeling there must be some tool I'm missing.
An option was added to redis-cli: redis-cli --bigkeys
Sample output based on https://gist.github.com/michael-grunder/9257326
$ ./redis-cli --bigkeys
# Press ctrl+c when you have had enough of it... :)
# You can use -i 0.1 to sleep 0.1 sec every 100 sampled keys
# in order to reduce server load (usually not needed).
Biggest string so far: day:uv:483:1201737600, size: 2
Biggest string so far: day:pv:2013:1315267200, size: 3
Biggest string so far: day:pv:3:1290297600, size: 5
Biggest zset so far: day:topref:2734:1289433600, size: 3
Biggest zset so far: day:topkw:2236:1318723200, size: 7
Biggest zset so far: day:topref:651:1320364800, size: 20
Biggest string so far: uid:3467:auth, size: 32
Biggest set so far: uid:3029:allowed, size: 1
Biggest list so far: last:175, size: 51
-------- summary -------
Sampled 329 keys in the keyspace!
Total key length in bytes is 15172 (avg len 46.12)
Biggest list found 'day:uv:483:1201737600' has 5235597 items
Biggest set found 'day:uvx:555:1201737600' has 47 members
Biggest hash found 'day:uvy:131:1201737600' has 2888 fields
Biggest zset found 'day:uvz:777:1201737600' has 1000 members
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
19 lists with 5236744 items (05.78% of keys, avg size 275618.11)
50 sets with 112 members (15.20% of keys, avg size 2.24)
250 hashs with 6915 fields (75.99% of keys, avg size 27.66)
10 zsets with 1294 members (03.04% of keys, avg size 129.40)
redis-rdb-tools does have a memory report that does exactly what you need. It generates a CSV file with the memory used by every key. You can then sort it and find the top N keys.
There is also an experimental memory profiler that started to do what you need. It's not yet complete, and so isn't documented. But you can try it - https://github.com/sripathikrishnan/redis-rdb-tools/tree/master/rdbtools/cli. And of course, I'd encourage you to contribute as well!
Disclaimer: I am the author of this tool.
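A minimal sketch of that workflow, assuming the rdb command installed by redis-rdb-tools; the dump path is a placeholder, and the sort key assumes size_in_bytes is the 4th CSV column:
# Generate the per-key memory report as CSV
rdb -c memory /var/redis/dump.rdb > memory.csv
# Sort by the size column (descending) and keep the 100 biggest keys
sort -t, -k4 -n -r memory.csv | head -n 100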
I am pretty new to bash scripting. I came up with this:
for line in $(redis-cli keys '*' | awk '{print $1}'); do echo "$(redis-cli DEBUG OBJECT "$line" | awk '{print $5}' | sed 's/serializedlength://g') $line"; done | sort -h
This script:
Lists all the keys with redis-cli keys "*"
Gets the size of each key with redis-cli DEBUG OBJECT
Sorts the output by the serialized length prepended to each key name
This may be very slow because bash loops through every single Redis key. With 7M keys, you may need to cache the output of keys to a file first.
If your keys follow the pattern "A:B" or "A:B:*", I wrote a tool that analyzes both the existing content and also monitors things such as hit rate, number of gets/sets, network traffic, lifetime, etc. The output is similar to the one below.
https://github.com/alexdicianu/redis_toolkit
$ ./redis-toolkit report -type memory -name NAME
+----------------------------------------+----------+-----------+----------+
| KEY | NR KEYS | SIZE (MB) | SIZE (%) |
+----------------------------------------+----------+-----------+----------+
| posts:* | 500 | 0.56 | 2.79 |
| post_meta:* | 440 | 18.48 | 92.78 |
| terms:* | 192 | 0.12 | 0.63 |
| options:* | 109 | 0.52 | 2.59 |
Try redis-memory-analyzer - a console tool that scans the Redis key space in real time and aggregates memory usage statistics by key pattern. You can use this tool on production servers without a maintenance window. It shows you detailed statistics about each key pattern in your Redis server.
You can also scan the Redis DB for all or only selected Redis types, such as "string", "hash", "list", "set", "zset". Matching by pattern is also supported.
RMA also tries to discern key names by pattern; for example, if you have keys like 'user:100' and 'user:101', the application will pick out the common pattern 'user:*' in the output, so you can analyze the most memory-hungry data in your instance.

Objective C: Accelerometer directions

I'm trying to detect a shaking motion in an up and down direction like this:
/ \
|
________________________________
| |
| |
| |
|o |
| |
| |
|______________________________|
|
\ /
- (void)bounce:(UIAcceleration *)acceleration {
    NSLog(@"%f", acceleration.x);
}
I was thinking it's the x axis, but that responds when you turn the device so it is parallel to the floor. How do I detect this?
I don't have a direct answer for you, but a great experiment is to make a quick app that spews the numbers out onto the screen or into a file (or both). Then shake it the way you want to detect, and see which axis has the greatest change.
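A minimal sketch of that experiment, using the same deprecated UIAccelerometer/UIAcceleration API as the question (the 0.05 s update interval is an arbitrary choice): log all three axes and watch which one swings the most during an up-and-down shake.
// In a view controller that adopts UIAccelerometerDelegate
- (void)viewDidLoad {
    [super viewDidLoad];
    UIAccelerometer *accelerometer = [UIAccelerometer sharedAccelerometer];
    accelerometer.updateInterval = 0.05; // 20 samples per second
    accelerometer.delegate = self;
}

- (void)accelerometer:(UIAccelerometer *)accelerometer didAccelerate:(UIAcceleration *)acceleration {
    // For a device held upright in portrait, up-and-down motion shows up mostly on one axis
    // (typically y); the log below will confirm which one changes the most when you shake.
    NSLog(@"x: %f  y: %f  z: %f", acceleration.x, acceleration.y, acceleration.z);
}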