RavenDB - Fastest Insert Performance - What is the benchmark?

I'm working on a prototype, using RavenDB, for my company to evaluate. We will have many threads inserting thousands of rows every few seconds, and many threads reading at the same time. I've done my first simple insert test, and before going much further, I want to make sure I'm using the recommended way of getting the best performance for RavenDB inserts.
I believe there is a bulk insert option. I haven't investigated that yet, as I'm not sure if that's necessary. I'm using the .NET API, and my code looks like this at the moment:
Debug.WriteLine("Number of Marker objects: {0}", markerList.Count);
StopwatchLogger.ExecuteAndLogPerformance(() =>
{
    IDocumentSession ravenSession = GetRavenSession();
    markerList.ForEach(marker => ravenSession.Store(marker));
    ravenSession.SaveChanges();
}, "Save Marker data in RavenDB");
The StopwatchLogger simply invokes the action while putting a stopwatch around it:
internal static void ExecuteAndLogPerformance(Action action, string descriptionOfAction)
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    action();
    stopwatch.Stop();
    Debug.WriteLine("{0} -- Processing time: {1} ms", descriptionOfAction, stopwatch.ElapsedMilliseconds);
}
Here is the output from a few runs. Note, I'm writing to a local instance of RavenDB (build 701). I know performance will be worse over the network, but I'm testing locally first.
One run:
Number of Marker objects: 671
Save Marker data in RavenDB -- Processing time: 1308 ms
Another run:
Number of Marker objects: 670
Save Marker data in RavenDB -- Processing time: 1266 ms
Another run:
Number of Marker objects: 667
Save Marker data in RavenDB -- Processing time: 625 ms
Another run:
Number of Marker objects: 639
Save Marker data in RavenDB -- Processing time: 639 ms
Ha. 639 objects in 639 ms. What are the odds of that? Anyway, that's one insert per millisecond, which would be 1000 every second.
The Marker object/document doesn't have much to it. Here is an example of one that has already been saved:
{
    "ID": 14740009,
    "SubID": "120403041588",
    "ReadTime": "2012-04-03T13:51:45.0000000",
    "CdsLotOpside": "163325",
    "CdsLotBackside": "163325",
    "CdteLotOpside": "167762",
    "CdteLotBackside": "167762",
    "EquipmentID": "VA_B"
}
Is this expected performance?
Is there a better way (best practice) to insert to gain speed?
Are there insert benchmarks available somewhere that I can target?

First, make sure the number of items you save in a single batch doesn't get too big. There is no hard limit, but performance suffers and the operation will eventually crash if the transaction grows too large. A value of around 1,024 items per batch is safe, though it really depends on the size of your documents.
1,000 documents per second is well below what a single RavenDB instance can actually reach. You should do the inserts in parallel, and you can do some tweaking with the configuration options. For instance, you could increase the values of the settings beginning with Raven/Esent/. As with SQL Server, it is also a good idea to put the logs and indexes on different hard drives. Depending on your concrete scenario you may also want to temporarily disable indexing while you're doing the inserts.
However, in most cases you don't need to worry about that. If you need really high insert throughput you can use multiple sharded instances and, in theory, get an unlimited number of inserts per second (just add more instances).
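A language-agnostic sketch of that batching advice (the question's code is C#; batched and save_batch below are illustrative helpers, where save_batch stands in for one Store()/SaveChanges() round trip per batch):
def batched(items, batch_size=1024):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def save_in_batches(markers, save_batch, batch_size=1024):
    # save_batch is a placeholder callback: open a session, Store() each
    # item, then SaveChanges(), so no single transaction grows too large.
    for batch in batched(markers, batch_size):
        save_batch(batch)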

Related

BigQuery: is there any way to break a large result into smaller chunks for processing?

Hi, I am new to BigQuery. If I need to fetch a very large set of data, say more than 1 GB, how can I break it into smaller pieces for quicker processing? I will need to process the result and dump it into a file or Elasticsearch, so I need an efficient way to handle it. I tried the QueryRequest.setPageSize option, but that doesn't seem to work: I set it to 100, yet it doesn't break on every 100 records. I put this line in to see how many records I get back before I turn to a new page:
result = result.getNextPage();
It displays a seemingly random number of records: sometimes 1000, sometimes 400, etc.
Thanks
Not sure if this helps you, but in our project we have something that seems to be similar: we process lots of data in BigQuery and need to use the final result for later usage (it is roughly 15 GB for us when compressed).
What we did was to first save the results to a table with AllowLargeResults set to True, and then export the result, compressed, into Cloud Storage using the Python API.
It automatically breaks the results into several files.
After that we have a Python script that downloads all the files concurrently, reads through the whole thing and builds some matrices for us.
I don't quite remember how long it takes to download all the files, I think it's around 10 minutes. I'll try to confirm this one.
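For reference, the export step could look roughly like this with the modern google-cloud-bigquery Python client (a hedged sketch: project, dataset, table, and bucket names are placeholders, and the exact options depend on your client-library version):
from google.cloud import bigquery

client = bigquery.Client()

# 1) Materialize the large query result into a destination table
#    (with legacy SQL you would also set allow_large_results=True).
destination = bigquery.TableReference.from_string("my-project.my_dataset.query_result")
job_config = bigquery.QueryJobConfig(destination=destination)
client.query("SELECT ...", job_config=job_config).result()

# 2) Export the table to Cloud Storage. The '*' wildcard makes BigQuery
#    shard the export into multiple compressed files automatically.
extract_config = bigquery.ExtractJobConfig(destination_format="CSV", compression="GZIP")
client.extract_table(
    destination,
    "gs://my-bucket/query_result/part-*.csv.gz",
    job_config=extract_config,
).result()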

Redis 1+ min query time on lists with 15 larger JSON objects (20MB total)

I use Redis to cache database inserts. For this I created a list CACHE into which I push serialized JSON lists. In pseudocode:
let entries = [{a}, {b}, {c}, ...];
redis.rpush("CACHE", JSON.stringify(entries));
The idea is to run this code for an hour, then later do an
let all = redis.lrange("CACHE", 0, LIMIT);
processAndInsert(all);
redis.ltrim("CACHE", 0, all.length);
Now the thing is that each entries list can be relatively large (but far below 512 MB or whatever Redis limit I read about). Each of a, b, c is an object of probably 20 bytes, and entries itself can easily hold 100k+ objects / 2 MB.
My problem now is that even for very short CACHE lists of only 15 entries, a simple lrange can take many minutes(!), even from the redis-cli (my node.js process actually dies with a "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory", but that's a side comment).
The debug output for the list looks like this:
127.0.0.1:6379> debug object "CACHE"
Value at:00007FF202F4E330 refcount:1 encoding:linkedlist serializedlength:18104464 lru:12984004 lru_seconds_idle:1078
What is happening? Why is this so massively slow, and what can I do about it? This does not seem to be a normal slowness, something seems to be fundamentally wrong.
I am using a local Redis 2.8.2101 (x64), ioredis 1.6.1, node.js 0.12 on a relatively hardcore Windows 10 gaming machine (i5, 16GB RAM, 840 EVO SSD, ...) by the way.
Redis is great at doing lots of small operations, but not so great at doing a small number of "very big" operations.
I think you should re-evaluate your algorithm and try to break your data apart into smaller chunks. Not only will you save bandwidth, you will also avoid locking your Redis instance for long stretches of time.
Redis offers many data structures you should be able to use for more fine-grained control over your data.
Still, in this case, since you are running Redis locally and assuming you are not running anything but this code, I doubt that either the bandwidth or Redis itself is the problem. I suspect this line:
JSON.stringify()
is the main culprit behind the slow execution.
JSON serialization of 20 MB of data is not trivial: the process needs to allocate many small strings and has to walk the whole array, inspecting each item individually. All of this takes a long time for an object this big.
Again, if you broke your data apart and did smaller operations with Redis, you wouldn't need the JSON serializer at all.
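To make the chunking idea concrete, here is a minimal sketch in Python with redis-py (the question uses node.js/ioredis; process_and_insert is a placeholder for the actual database insert):
import json
import redis

r = redis.Redis()

def cache_entries(entries, batch_size=1000):
    # Push each entry as its own list element instead of one huge JSON blob.
    for i in range(0, len(entries), batch_size):
        batch = entries[i:i + batch_size]
        r.rpush("CACHE", *(json.dumps(e) for e in batch))

def drain_cache(process_and_insert, chunk_size=1000):
    # Drain the list in small chunks so no single LRANGE returns megabytes.
    while True:
        chunk = r.lrange("CACHE", 0, chunk_size - 1)
        if not chunk:
            break
        process_and_insert([json.loads(raw) for raw in chunk])
        r.ltrim("CACHE", len(chunk), -1)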

camel split big sql result in smaller chunks

Because of memory limitations I need to split a result from the sql component (a List<Map<column, value>>) into smaller chunks (a few thousand rows each).
I know about
from(sql:...).split(body()).streaming().to(...)
and I also know
.split().tokenize("\n", 1000).streaming()
but the latter does not work with List<Map<>> and also returns a String.
Is there an out-of-the-box way to create those chunks? Or do I need to add a custom aggregator right after the split? Or is there another way?
Edit
Additional info as requested by soilworker:
At the moment the sql endpoint is configured this way:
SqlEndpoint endpoint = context.getEndpoint("sql:select * from " + lookupTableName + "?dataSource=" + LOOK_UP_DS,
SqlEndpoint.class);
// returns complete result in one list instead of one exchange per line.
endpoint.getConsumerProperties().put("useIterator", false);
// poll interval
endpoint.getConsumerProperties().put("delay", LOOKUP_POLL_INTERVAL);
The route using this should poll once a day (we will add a CronScheduledRoutePolicy soon) and fetch a complete table (view). All the data is converted to CSV with a custom processor and sent via a custom component to proprietary software. The table has 5 columns (small strings) and around 20M entries.
I don't know if there is a memory issue, but I know that on my local machine 3 GB isn't enough. Is there a way to approximate the memory footprint, to know whether a certain amount of RAM would be enough?
Thanks in advance
maxMessagesPerPoll will help you get the result in batches.

Memory efficient (constant) and speed optimized iteration over a large table in Django

I have a very large table.
It's currently in a MySQL database.
I use django.
I need to iterate over each element of the table to pre-compute some particular data (maybe if I was better I could do otherwise but that's not the point).
I'd like to keep the iteration as fast as possible with a constant usage of memory.
As is already clearly explained in Limiting Memory Use in a *Large* Django QuerySet and Why is iterating through a large Django QuerySet consuming massive amounts of memory?, a simple iteration over all objects in Django will kill the machine, as it will retrieve ALL objects from the database.
Towards a solution
First of all, to reduce memory consumption you should make sure DEBUG is False (or monkey-patch the cursor: turn off SQL logging while keeping settings.DEBUG?) so that Django isn't storing queries on the connection for debugging.
But even with that,
for model in Model.objects.all()
is a no go.
Not even with the slightly improved form:
for model in Model.objects.all().iterator()
Using iterator() will save you some memory by not caching the results internally (though not necessarily on PostgreSQL!), but it will apparently still retrieve the whole set of objects from the database.
A naive solution
The solution in the first question is to slice the results into chunks of chunk_size using a counter. There are several ways to write it, but basically they all come down to an OFFSET + LIMIT query in SQL.
Something like:
qs = Model.objects.all()
counter = 0
count = qs.count()
while counter < count:
    for model in qs[counter:counter + chunk_size].iterator():
        yield model
    counter += chunk_size
While this is memory efficient (memory usage is bounded by chunk_size), it's really poor in terms of speed: as OFFSET grows, both MySQL and PostgreSQL (and likely most DBs) will start choking and slowing down.
A better solution
A better solution is available in this post by Thierry Schellenbach.
It filters on the PK, which is way faster than offsetting (how fast probably depends on the DB)
import gc

pk = 0
last_pk = qs.order_by('-pk')[0].pk
queryset = qs.order_by('pk')
while pk < last_pk:
    for row in queryset.filter(pk__gt=pk)[:chunksize]:
        pk = row.pk
        yield row
    gc.collect()
This is starting to get satisfactory. Now Memory = O(C), and Speed ~= O(N)
Issues with the "better" solution
The better solution only works when the PK is available in the QuerySet.
Unfortunately, that's not always the case, in particular when the QuerySet contains combinations of distinct (group_by) and/or values (ValuesQuerySet).
For that situation the "better solution" cannot be used.
Can we do better?
Now I'm wondering if we can go faster and avoid the issue regarding QuerySets without PK.
Maybe using something that I found in other answers, but only in pure SQL: using cursors.
Since I'm quite bad with raw SQL, in particular in Django, here comes the real question:
how can we build a better Django QuerySet iterator for large tables?
My take from what I've read is that we should use server-side cursors (apparently (see references) using a standard Django Cursor would not achieve the same result, because by default both python-MySQL and psycopg connectors cache the results).
Would this really be a faster (and/or more efficient) solution?
Can this be done using raw SQL in django? Or should we write specific python code depending on the database connector?
Server Side cursors in PostgreSQL and in MySQL
That's as far as I could get for the moment...
a Django chunked_iterator()
Now, of course, the best would be to have this method work like queryset.iterator(), rather than iterate(queryset), and be part of Django core or at least a pluggable app.
Update: Thanks to "T" in the comments for finding a Django ticket that carries some additional information. Differences in connector behaviors make it so that probably the best solution would be to create a specific chunked method rather than transparently extending iterator (sounds like a good approach to me).
An implementation stub exists, but there hasn't been any work in a year, and it does not look like the author is ready to jump on that yet.
Additional Refs:
Why does MYSQL higher LIMIT offset slow the query down?
How can I speed up a MySQL query with a large offset in the LIMIT clause?
http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
postgresql: offset + limit gets to be very slow
Improving OFFSET performance in PostgreSQL
http://www.depesz.com/2011/05/20/pagination-with-fixed-order/
How to get a row-by-row MySQL ResultSet in python Server Side Cursor in MySQL
Edits:
Django 1.6 is adding persistent database connections
Django Database Persistent Connections
This should facilitate, under some conditions, using cursors. Still, it's beyond my current skills (and time to learn) to implement such a solution.
Also, the "better solution" definitely does not work in all situations and cannot be used as a generic approach, only a stub to be adapted case by case...
Short Answer
If you are using PostgreSQL or Oracle, you can use Django's built-in iterator:
queryset.iterator(chunk_size=1000)
This causes Django to use server-side cursors and not cache models as it iterates through the queryset. As of Django 4.1, this will even work with prefetch_related.
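For example (a minimal sketch; the Author/Book models, the process() call, and the chunk size are hypothetical):
# On PostgreSQL/Oracle this streams rows through a server-side cursor; since
# Django 4.1 the prefetch_related() lookups are fetched per chunk as well.
for author in Author.objects.prefetch_related("books").iterator(chunk_size=1000):
    process(author)  # placeholder for whatever per-row work you need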
For other databases, you can use the following:
def queryset_iterator(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        for obj in page:
            yield obj
            pk = obj.pk
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]
If you want to get back pages rather than individual objects to combine with other optimizations such as bulk_update, use this:
def queryset_to_pages(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        yield page
        pk = max(obj.pk for obj in page)
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]
Performance Profiling on PostgreSQL
I profiled a number of different approaches on a PostgreSQL table with about 200,000 rows on Django 3.2 and Postgres 13. For every query, I added up the sum of the ids, both to ensure that Django was actually retrieving the objects and so that I could verify correctness of iteration between queries. All of the timings were taken after several iterations over the table in question to minimize caching advantages of later tests.
Basic Iteration
The basic approach is just iterating over the table. The main issue with this approach is that the amount of memory used is not constant; it grows with the size of the table, and I've seen this run out of memory on larger tables.
x = sum(i.id for i in MyModel.objects.all())
Wall time: 3.53 s, 22MB of memory (BAD)
Django Iterator
The Django iterator (at least as of Django 3.2) fixes the memory issue with minor performance benefit. Presumably this comes from Django spending less time managing cache.
assert sum(i.id for i in MyModel.objects.all().iterator(chunk_size=1000)) == x
Wall time: 3.11 s, <1MB of memory
Custom Iterator
The natural comparison point is attempting to do the paging ourselves with progressively advancing queries on the primary key. While this is an improvement over naive iteration in that it has constant memory, it actually loses to Django's built-in iterator on speed because it makes more database queries.
def queryset_iterator(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        for obj in page:
            yield obj
            pk = obj.pk
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]
assert sum(i.id for i in queryset_iterator(MyModel.objects.all())) == x
Wall time: 3.65 s, <1MB of memory
Custom Paging Function
The main reason to use the custom iteration is so that you can get the results in pages. This function is very useful to then plug in to bulk-updates while only using constant memory. It's a bit slower than queryset_iterator in my tests and I don't have a coherent theory as to why, but the slowdown isn't substantial.
def queryset_to_pages(queryset, page_size=1000):
    page = queryset.order_by("pk")[:page_size]
    while page:
        yield page
        pk = max(obj.pk for obj in page)
        page = queryset.filter(pk__gt=pk).order_by("pk")[:page_size]
assert sum(i.id for page in queryset_to_pages(MyModel.objects.all()) for i in page) == x
Wall time: 4.49 s, <1MB of memory
Alternative Custom Paging Function
Given that Django's queryset iterator is faster than doing the paging ourselves, the queryset pager can alternatively be implemented to use it. It's a little bit faster than doing the paging ourselves, but the implementation is messier. Readability matters, which is why my personal preference is the previous paging function, but this one can be better if your queryset doesn't have a primary key in the results (for whatever reason).
def queryset_to_pages2(queryset, page_size=1000):
    page = []
    page_count = 0
    for obj in queryset.iterator():
        page.append(obj)
        page_count += 1
        if page_count == page_size:
            yield page
            page = []
            page_count = 0
    yield page
assert sum(i.id for page in queryset_to_pages2(MyModel.objects.all()) for i in page) == x
Wall time: 4.33 s, <1MB of memory
Bad Approaches
The following are approaches you should never use (many of which are suggested in the question) along with why.
Do NOT Use Slicing on an Unordered Queryset
Whatever you do, do NOT slice an unordered queryset. This does not correctly iterate over the table. The reason is that the slice operation issues a SQL LIMIT + OFFSET query based on your queryset, and Django querysets have no order guarantee unless you use order_by. Additionally, PostgreSQL does not have a default order, and the Postgres docs specifically warn against using LIMIT + OFFSET without ORDER BY. As a result, each time you take a slice you get a non-deterministic slice of your table, which means your slices may overlap and may not, between them, cover all rows of the table. In my experience, this only happens if something else is modifying data in the table while you are doing the iteration, which only makes the problem more pernicious, because it means the bug might not show up if you are testing your code in isolation.
def very_bad_iterator(queryset, page_size=1000):
    counter = 0
    count = queryset.count()
    while counter < count:
        for model in queryset[counter:counter + page_size].iterator():
            yield model
        counter += page_size
assert sum(i.id for i in very_bad_iterator(MyModel.objects.all())) == x
Assertion Error; i.e. INCORRECT RESULT COMPUTED!!!
Do NOT use Slicing for Whole-Table Iteration in General
Even if we order the queryset, list slicing is abysmal from a performance perspective. This is because SQL offset is a linear time operation, which means that a limit + offset paged iteration of a table will be quadratic time, which you absolutely do not want.
def bad_iterator(queryset, page_size=1000):
    counter = 0
    count = queryset.count()
    while counter < count:
        for model in queryset.order_by("id")[counter:counter + page_size].iterator():
            yield model
        counter += page_size
assert sum(i.id for i in bad_iterator(MyModel.objects.all())) == x
Wall time: 15s (BAD), <1MB of memory
Do NOT use Django's Paginator for Whole-Table Iteration
Django comes with a built-in Paginator. It may be tempting to think that it is appropriate for doing a paged iteration of a database, but it is not. The point of Paginator is to return a single page of a result to a UI or an API endpoint. It is substantially slower than any of the good approaches for iterating over a table.
from django.core.paginator import Paginator

def bad_paged_iterator(queryset, page_size=1000):
    p = Paginator(queryset.order_by("pk"), page_size)
    for i in p.page_range:
        yield p.get_page(i)
assert sum(i.id for page in bad_paged_iterator(MyModel.objects.all()) for i in page) == x
Wall time: 13.1 s (BAD), <1MB of memory
The essential answer: use raw SQL with server-side cursors.
Sadly, until Django 1.5.2 there is no formal way to create a server-side MySQL cursor (not sure about other database engines). So I wrote some magic code to solve this problem.
For Django 1.5.2 and MySQLdb 1.2.4, the following code will work. Also, it's well commented.
Caution: This is not based on public APIs, so it will probably break in future Django versions.
# This script should be tested under a Django shell, e.g., ./manage.py shell
from types import MethodType

import MySQLdb.cursors
import MySQLdb.connections
from django.db import connection
from django.db.backends.util import CursorDebugWrapper


def close_sscursor(self):
    """An instance method which replaces the close() method of the old cursor.

    Closing the server-side cursor with the original close() method would be
    quite slow and memory-intensive if the large result set was not exhausted,
    because fetchall() would be called internally to get the remaining records.
    Notice that the close() method is also called when the cursor is garbage
    collected.

    This method is more efficient at closing the cursor, but if the result set
    is not fully iterated, the next cursor created from the same connection
    won't work properly. You can avoid this by either (1) closing the connection
    before creating a new cursor, or (2) iterating the result set to the end
    before closing the server-side cursor.
    """
    if isinstance(self, CursorDebugWrapper):
        self.cursor.cursor.connection = None
    else:
        # This is for a CursorWrapper object
        self.cursor.connection = None


def get_sscursor(connection, cursorclass=MySQLdb.cursors.SSCursor):
    """Get a server-side MySQL cursor."""
    if connection.settings_dict['ENGINE'] != 'django.db.backends.mysql':
        raise NotImplementedError('Only the MySQL engine is supported')
    cursor = connection.cursor()
    if isinstance(cursor, CursorDebugWrapper):
        # Get the real MySQLdb.connections.Connection object
        conn = cursor.cursor.cursor.connection
        # Replace the internal client-side cursor with a server-side cursor
        cursor.cursor.cursor = conn.cursor(cursorclass=cursorclass)
    else:
        # This is for a CursorWrapper object
        conn = cursor.cursor.connection
        cursor.cursor = conn.cursor(cursorclass=cursorclass)
    # Replace the old close() method
    cursor.close = MethodType(close_sscursor, cursor)
    return cursor


# Get the server-side cursor
cursor = get_sscursor(connection)

# Run a query with a large result set. Notice that the memory consumption is low.
cursor.execute('SELECT * FROM million_record_table')

# Fetch a single row, fetchmany() rows, or iterate it via "for row in cursor:"
cursor.fetchone()

# You can interrupt the iteration at any time. This calls the new close() method,
# so no warning is shown.
cursor.close()

# The connection must be closed to let new cursors work properly; see the
# comments of close_sscursor().
connection.close()
There is another option available. It wouldn't make the iteration faster (in fact it would probably slow it down), but it would make it use far less memory. Depending on your needs, this may be appropriate.
large_qs = MyModel.objects.all().values_list("id", flat=True)
for model_id in large_qs:
    model_object = MyModel.objects.get(id=model_id)
    # do whatever you need to do with the model here
Only the ids are loaded into memory, and the objects are retrieved and discarded as needed. Note the increased database load and slower runtime, both tradeoffs for the reduction in memory usage.
I've used this when running async scheduled tasks on worker instances, for which it doesn't really matter if they are slow, but if they try to use way too much memory they may crash the instance and therefore abort the process.

Lucene SimpleFacetedSearch Facet count exceeded 2048

I've stumbled into an issue using Lucene.net in one of my projects, where I'm using the SimpleFacetedSearch feature for faceted search.
I get an exception thrown
Facet count exceeded 2048
I have 3 columns which I'm faceting; as soon as I add another facet, I get the exception.
If I remove all the other facets, the new facet works.
Drilling down into the source of SimpleFacetedSearch, I can see that its constructor checks that the number of facets doesn't exceed MAX_FACETS, a constant set to 2048.
foreach (string field in groupByFields)
{
    ...
    num *= fieldValuesBitSets1.FieldValueBitSetPair.Count;
    if (num > SimpleFacetedSearch.MAX_FACETS)
        throw new Exception("Facet count exceeded " + (object) SimpleFacetedSearch.MAX_FACETS);
    fieldValuesBitSets.Add(fieldValuesBitSets1);
    ...
}
However, as it's public, I am able to set it like so:
SimpleFacetedSearch.MAX_FACETS = int.MaxValue;
Does anyone know why it is set to 2048 and if there are issues changing it? I was unable to find any documentation on it.
No, there shouldn't be any issue in changing it. But remember that using bitsets (as SimpleFacetedSearch does internally) is more performant when the search results are big but the facet count doesn't exceed some number (say, 1,000 facets and 10M hits).
If you have many more facets but the search results are not big, you can iterate over the results (in a collector) and build the facets yourself. That way you may get better performance (say, 100K facets and 1,000 hits).
So 2048 may be a tuned threshold beyond which performance starts to suffer.
The problem that MAX_FACETS is there to avoid is one of memory usage and performance.
Internally SimpleFacetedSearch uses bitmaps to record which documents each facet value appears in. There is one bit per document and each value has a separate bitmap, so if you have a lot of values the amount of memory needed grows quickly, especially if you also have a lot of documents: memory = values * documents / 8 bytes.
My company has indexes with millions of documents and 10's of thousands of values which would require many GB's of memory.
I've created another implementation which I've called SparseFacetedSearcher. It records the doc IDs for each value, so you only pay per hit rather than a bit per document. If each document has exactly one value (like a product category), the break-even point is at more than 32 values (more than 32 product categories).
In our case the memory usage has dropped to a few hundred MB.
Feel free to have a look at https://github.com/Artesian/SparseFacetedSearch
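As a rough back-of-the-envelope check of the formula and the break-even claim above (the document and value counts below are made up for illustration, not taken from the post):
documents = 5_000_000   # documents in the index (illustrative)
values = 20_000         # distinct facet values (illustrative)

# SimpleFacetedSearch: one bitmap per value, one bit per document.
bitmap_bytes = values * documents / 8
print(f"bitmap approach: {bitmap_bytes / 1024 ** 3:.1f} GiB")   # ~11.6 GiB

# Sparse approach: one 4-byte doc ID per hit; with exactly one value per
# document that is one ID per document.
sparse_bytes = documents * 4
print(f"sparse approach: {sparse_bytes / 1024 ** 2:.0f} MiB")   # ~19 MiB

# Break-even with one value per document: the bitmaps cost values/8 bytes per
# document versus 4 bytes per document for doc IDs, so the bitmaps win only
# while there are 32 or fewer values.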