loadrdf-tool aborts or stalls loading 1B+ triples - graphdb

I have set up GraphDB on a Windows server with 32 GB of memory. I have modified the loadrdf cmd and added "-Xms16G -Xmx24G".
Now I'm trying to import the entire Wikidata RDF dump but am having difficulties. First I tried with an entity-index-size value of 10,000,000, which worked fine until the loadrdf tool aborted after a little more than 1 billion triples. Then I tried an entity-index-size value of 2,000,000,000, but this works even worse: it has currently processed 500,000,000 triples, but the load speed has dropped to 7,000 statements/s.
Are there any other settings/configurations I should be aware of that could make the import work?

The public Wikidata RDF dump has about 2 billion statements and probably around 500M unique RDF resources. By default, the entity pool structure (the index of all unique RDF resources) is stored in off-heap memory, and you will need to reserve at least 8 GB for it. Add at least 3 GB more for the OS, which means you will actually need to decrease the heap to "-Xmx20G".
To speed up data loading, the GraphDB documentation recommends using an SSD, which boosts loading speed thanks to its much lower seek time.
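In practice that means trimming the heap in loadrdf.cmd and sizing the entity index to the actual number of unique resources rather than 2 billion. A minimal sketch (the JAVA_OPTS variable and the owlim:entity-index-size parameter name should be checked against your GraphDB version's loadrdf.cmd and repository template):

:: loadrdf.cmd (excerpt) -- leave ~12 GB outside the heap for the off-heap entity pool and the OS
set JAVA_OPTS=-Xms16g -Xmx20g

# repository config (Turtle excerpt) -- sized for the ~500M unique resources in the Wikidata dump
owlim:entity-index-size "500000000" ;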

Related

GraphDB Free 8.5 re-inferring uses a single core?

I am trying to load several large biomedical ontologies into a GraphDB OWL-Horst optimized repository, along with tens of millions of triples using terms from those ontologies. I can load these data into an RDFS-Plus optimized repo in less than 1 hour, but I can't load even one of the ontologies (chebi_lite) if I let it go overnight. That's using loadrdf on a 64-core, 256 GB AWS server.
My earlier question, Can GraphDB load 10 million statements with OWL reasoning?, led to the suggestion that I use the preload command, followed by re-inferring. The preload indeed went very quickly, but when it comes to re-inferring, only one core is used. I haven't been able to let it go for more than an hour yet. Is re-inferring using just one core a consequence of using the free version? Might loadrdf work better if I just configured it better?
When I use loadrdf, all cores go close to 100%, but memory usage never goes over 10% or so. I have tinkered with the JVM memory settings a little, but I don't really know what I'm doing. For example:
-Xmx80g -Dpool.buffer.size=2000000
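For reference, the preload-then-reinfer flow looks roughly like this (a sketch: myrepo is a placeholder repository id, 7200 is GraphDB's default port, the sys:reinfer update is the one described in the GraphDB documentation, and preload flags vary by version):

# bulk load without inference, then trigger re-inference over the whole repository
preload -f -i myrepo chebi_lite.owl
curl -X POST http://localhost:7200/repositories/myrepo/statements \
     -H "Content-Type: application/sparql-update" \
     --data 'INSERT DATA { [] <http://www.ontotext.com/owlim/system#reinfer> [] }'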

Are there any in-memory (persistent) solutions faster than Aerospike on a single node?

I am working on a cloud application that requires low latency and very high reads/writes per second. I will only have around 1 million records stored persistently, but this may fluctuate significantly as the application runs.
After YCSB benchmarking of Aerospike and Redis (invocation sketched below), I found that Aerospike beats both Redis and MongoDB in single-node performance for a 60/40 read/write workload.
Some points to note:
Fetching all my data using a single 32-bit integer key (no advanced queries)
Running on a single machine with 8 GB RAM and an SSD (small number of records)
Multiple clients need access to the database at once (via LAN)
I'm also assuming that key-value stores will outperform document stores and are the best fit, considering I do not need advanced queries.
Before committing to Aerospike, are there any other solutions that may be a better fit for my scenario, considering that I am only running a single node with a small-ish number of records?
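For reference, a 60/40 read/write comparison along these lines can be driven with YCSB roughly as follows (a sketch: as.host and as.namespace are property names from YCSB's Aerospike binding and may differ by version):

# load 1M records, then run a 60/40 read/update workload against Aerospike
bin/ycsb load aerospike -s -P workloads/workloada \
    -p as.host=127.0.0.1 -p as.namespace=test -p recordcount=1000000
bin/ycsb run aerospike -s -P workloads/workloada \
    -p as.host=127.0.0.1 -p as.namespace=test \
    -p readproportion=0.6 -p updateproportion=0.4 -p operationcount=1000000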
Not that I'm aware of. I think Aerospike is the fastest.
However, for some use cases you can consider Tarantool.
Here's one of the benchmarks: https://medium.com/@rvncerr/tarantool-vs-competitors-racing-in-microsoft-azure-ebde9c5d619

Postgres Not Using Memory

I am running some spatial queries on tables with close to a billion records. However, I cannot understand why Postgres is not using the memory on the dedicated server (with 32 GB of memory). I tuned the server based on the suggestions here, but couldn't see any difference in running time at all, and I see under 100 MB of memory usage. I would expect Postgres to consume more memory by loading bigger chunks of data, thus reducing disk reads and time. What could I be doing wrong here? (The usual knobs are sketched below.)
Already looked at these posts:
https://dba.stackexchange.com/questions/18484/tuning-postgresql-for-large-amounts-of-ram
http://patshaughnessy.net/2016/1/22/is-your-postgres-query-starved-for-memory
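For context, the knobs that usually matter on a dedicated 32 GB box look something like this (starting points, not tuned values; note that Postgres deliberately leaves most caching to the OS, so a small process footprint by itself isn't proof of a problem):

-- run via psql; shared_buffers requires a server restart to take effect
ALTER SYSTEM SET shared_buffers = '8GB';         -- ~25% of RAM
ALTER SYSTEM SET effective_cache_size = '24GB';  -- what the planner assumes the OS caches
ALTER SYSTEM SET work_mem = '256MB';             -- per sort/hash node, per backend
ALTER SYSTEM SET maintenance_work_mem = '2GB';
-- then check where the time actually goes with EXPLAIN (ANALYZE, BUFFERS) on the spatial query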

using part of RAM as local storage

I'm setting up a Virtuoso server on my local machine; the database is not big (about 2 GB).
The application I'm using the server for needs to make a very large number of queries, and the results need to come back fast.
The HDD I'm using is mechanical, so it's not that fast. I am now trying to find a way to allocate part of my main memory as local storage so that I can put the database file on it.
Is there an easy way to do that?
That's not what RAM is for.
If your server ever lost power, you would lose all of the data.
If you want a faster HDD, get one with a higher RPM, or get an SSD.
Take a look at the Performance Tuning Guide...
It details how to configure exactly what you are looking for.
Data is still held on disk, but the more of it that can be loaded into memory, the better the performance.
Get all your data into memory and that's probably as fast as it gets :-)
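Concretely, the settings live in virtuoso.ini; a sketch for a machine with 8 GB of RAM, using the buffer counts the tuning guide recommends (each buffer is an 8 KB page; scale the numbers to your RAM):

; virtuoso.ini excerpt
[Parameters]
NumberOfBuffers    = 680000   ; ~5.4 GB of page buffers -- plenty to hold the 2 GB database
MaxDirtyBuffers    = 500000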
There's a piece of software called RamDisk Plus that lets you create a disk partition right out of your RAM.
You can see a demo here:
http://www.youtube.com/watch?v=vAdRsQJBEBE
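For completeness, if the box runs Linux rather than Windows, no extra software is needed; a tmpfs mount is a RAM-backed filesystem you can point the server at (a sketch; paths are placeholders, and the contents vanish on reboot, so keep a copy of the database file on disk):

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=3g tmpfs /mnt/ramdisk
cp /data/virtuoso.db /mnt/ramdisk/    # then point the server config at the copy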

Improving Solr performance

I have deployed a 5-sharded infrastructure where:
shard1 has 3124422 docs
shard2 has 920414 docs
shard3 has 602772 docs
shard4 has 2083492 docs
shard5 has 11915639 docs
Total index size: 100 GB
The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I run the server using Jetty (from Solr example download) with:
java -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar
The response time for a query is around 2-3 seconds. Nevertheless, if I execute several queries at the same time, performance degrades immediately:
1 simultaneous query: 2516ms
2 simultaneous queries: 4250, 4469 ms
3 simultaneous queries: 5781, 6219, 6219 ms
4 simultaneous queries: 6484, 7203, 7719, 7781 ms...
Using JConsole to monitor the server's Java process, I checked that heap memory and CPU usage don't reach their upper limits, so the server shouldn't be overloaded. Can anyone suggest how I should tune the instance so that it is not so heavily dependent on the number of simultaneous queries?
Thanks in advance
You may want to consider creating slaves for each shard so that you can support more reads (see http://wiki.apache.org/solr/SolrReplication); however, the performance you're getting isn't reasonable to begin with.
With the response times you're seeing, it feels like your disk must be the bottleneck. It might be cheaper just to load up each shard with enough memory to hold its full index (20 GB each?). You could look at disk access using the 'sar' utility from the sysstat package (example below). If you're consistently getting over 30% disk utilization on any platter while searches are ongoing, that's a good sign that you need to add some memory and let the OS cache the index.
Has it been a while since you've run an optimize? Part of the long lookup times may be the result of a heavily fragmented index spread all over the platter.
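For example, to watch per-device utilization while queries are running (flags are from sysstat's sar; the ~30% threshold is the rule of thumb above):

sar -d -p 5 10    # 10 samples, 5 seconds apart; watch the %util column
# sustained %util above ~30% on the index disk during searches means the OS cache is too small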
As I stated on the Solr mailing list, where you asked the same question 3 days ago, Solr/Lucene benefits tremendously from SSDs. While sharding across more machines or adding boatloads of RAM will help with I/O, the SSD option is comparatively cheap and extremely easy.
Buy an Intel X25 G2 ($409 at NewEgg for 160 GB) or one of the new SandForce-based SSDs, put your existing 100 GB of indexes on it, and see what happens. That's half a day's work, tops. If it bombs, scavenge the drive for your workstation. You'll be very happy with the performance boost it gives you.