How well does a unique hash index perform in comparison to the record ID?

Following CQRS practices, I will need to supply a custom generated ID (like a UUID) in any create command. This means when using OrientDB as storage, I won't be able to use its generated RIDs, but rather perform lookups on a manual index using the UUIDs.
Now, the OrientDB docs state that fetching records by RID is O(1), i.e. independent of the database size, presumably because the RID already describes the physical location of the record. Is that also the case when using a UNIQUE_HASH_INDEX?
Is it worth bending CQRS practices to request a RID from the database when assembling the create command, or is the performance difference negligible?
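For context, the two lookups being compared would look roughly like this (a minimal sketch with pyorient; the class name, property name, credentials and UUID below are made up for illustration):

import pyorient

client = pyorient.OrientDB("localhost", 2424)
client.db_open("appdb", "admin", "admin")

# One-time setup: a UNIQUE_HASH_INDEX on the externally supplied UUID
client.command("CREATE CLASS Item")
client.command("CREATE PROPERTY Item.uuid STRING")
client.command("CREATE INDEX Item.uuid UNIQUE_HASH_INDEX")

# RID-based lookup: the RID encodes the physical position of the record
by_rid = client.query("SELECT FROM #12:0")

# UUID-based lookup: resolved through the hash index first
some_uuid = "2d1e0f1c-6f7a-4b61-9d0e-3c5a8f1b2c4d"   # made-up UUID
by_uuid = client.query("SELECT FROM Item WHERE uuid = '%s'" % some_uuid)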

I have tested the performance of record retrieval based on RIDs and on an indexed UUID field, using a database holding 180,000 records. For the measurement, 30,000 records were looked up, clearing the local cache between retrievals. This is the result:
RID: about 0.2s per record
UUID: about 0.3s per record
I ran the queries repeatedly while populating the database in steps of 30,000 records. In both cases the retrieval time wasn't significantly influenced by the database size. Don't mind the relatively high absolute times, as this experiment was done on an overloaded PC; it's the relation between the two that is relevant.
To answer my own question, a UNIQUE_HASH_INDEX-based query is close enough to an RID-based query.

Related

Performance of Lucene queries in Ignite

I have a very simple object as the key in my cache, and I want to be able to iterate over the key/value pairs where a string matches a field in the keys.
Here is how the field is declared in the class
@AffinityKeyMapped @QueryTextField String crawlQueueID;
I run many queries and expect a small amount of documents to match. The queries take a relatively large amount of time, which is surprising given that there are maybe only 100K pairs locally in the cache. My queries are local, I want to hit only the K/V stored in the local node.
According to the profiler I am using, 80% of the CPU is spent here
GridLuceneIndex.java:285 org.apache.lucene.search.IndexSearcher.search(Query, int)
Knowing Lucene's performance, I am really surprised. Any suggestions?
BTW I want to sort the results based on a numerical field in the value object. Can this be done via annotations?
I could have one cache per value of the field I am querying against but given that there are potentially hundreds of thousands or even millions of different values, that would probably be too many caches for Ignite to handle.
EDIT
Looking at the code that handles the Lucene indexing and querying, the index gets reloaded for every query. Given that I do hundreds of them in a row, we probably don't benefit from any caching or optimisation of the index structure in Lucene.
Additionally, there is a range query running as a filter to check the TTL. Filter queries are faster, but on a fresh IndexReader there would not be much caching either. Of course, if no TTL is needed for a given table, this should not be required.
Judging by the documentation about SQL indexing:
Ignite automatically creates indexes for each primary key and affinity
key field.
the indexing is done on the key alone. In my case, the field I want to sort by is in the value object, so that would not work.

Optimal DB for bulk inserts with unique constraint on binary data field

I am trying to find the optimal DB for inserting binary data many times when there's a unique constraint on it - each record is identified by a hash that can't be repeated. I'm only interested in inserting the data and receiving feedback when a record can't be inserted because another record has the same identifying hash. I'm not going to do anything else with the data - no querying, aggregations, or other computations.
There is a huge number of inserts, and I am not sure that a regular SQL DB can keep up with it.
The DB will also constantly grow in size (data will not be deleted; maybe there will be some kind of retention every few years), so I can't use a DB that has a hard memory limit.
I've tried running on Postgres with an index on the binary field, but it was too slow (around 2 ms per insert when inserting in batches of 50,000).
I've also tried MongoDB (which seemed to have slightly better performance), but it stopped responding after many inserts for an unknown reason.
I've also tried writing the hash field as a Base64-encoded string instead of a binary field, but that didn't perform well either.
I'm looking for the best simple DB that can support a very large amount of data (terabytes over time) with good performance for a simple unique constraint.
Thanks in advance
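For reference, the insert-with-duplicate-feedback pattern I have in mind looks roughly like this in Postgres with psycopg2 (just a sketch, not a claim that Postgres is the right choice; the table, column names and stand-in data are made up):

import hashlib
import os

import psycopg2
from psycopg2.extras import execute_values

# Assumed table: CREATE TABLE blobs (hash BYTEA PRIMARY KEY, payload BYTEA)
conn = psycopg2.connect("dbname=blobstore")

payloads = [os.urandom(256) for _ in range(50000)]          # stand-in data
rows = [(hashlib.sha256(p).digest(), p) for p in payloads]  # (hash, payload)

with conn, conn.cursor() as cur:
    # ON CONFLICT DO NOTHING skips duplicate hashes; RETURNING reports the
    # hashes that were actually inserted, so duplicates are the difference.
    inserted = execute_values(
        cur,
        "INSERT INTO blobs (hash, payload) VALUES %s"
        " ON CONFLICT (hash) DO NOTHING RETURNING hash",
        rows,
        fetch=True,
    )
    duplicates = len(rows) - len(inserted)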

Correct modeling in Redis for writing single entity but querying multiple

I'm trying to move data that currently sits in a SQL DB to Redis, in order to gain much higher throughput, because it's a very high-throughput workload. I'm aware of the downsides of persistence, storage costs, etc...
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender.
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each fetching 20 rows.
I'm trying to get my head around the right modeling of this in order to get the most out of it.
If there were only updates, I would use hashes.
But because of the 10% of reads, I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record in a blob associated with a key (i.e. the string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash datatype). All write operations can be done in one roundtrip (even if they are partial). Several read operations can also be done in one roundtrip, provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU-consuming than using MGET, but not that much. In terms of latency, it should not be significantly different, unless you execute hundreds of thousands of them per second on the Redis instance.
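A minimal sketch of both options with redis-py (assuming redis-py 3.5+; the key names and field values are made up from the table described above):

import json
import redis

r = redis.Redis()

# Option 1: no partial updates - whole record as one string value per key
r.set("user:blob:42", json.dumps({"name": "Alice", "phone": "555-0100", "gender": "F"}))
rows = r.mget(["user:blob:42", "user:blob:43", "user:blob:44"])  # 20 reads, one roundtrip

# Option 2: partial updates needed - one hash per user
r.hset("user:hash:42", mapping={"name": "Alice", "phone": "555-0100", "gender": "F"})
r.hset("user:hash:42", "phone", "555-0199")        # update a single field

pipe = r.pipeline()                                # pipeline the HGETALLs
for uid in (42, 43, 44):
    pipe.hgetall("user:hash:%d" % uid)
rows = pipe.execute()                              # still one roundtrip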

Improving query performance of a database table with a large number of columns and rows (50 columns, 5mm rows)

We are building a caching solution for our user data. The data is currently stored in Sybase and is distributed across 5-6 tables. The query service built on top of it uses Hibernate, and we are getting very poor performance. Loading the data into the cache would take in the range of 10-15 hours.
So we have decided to create a denormalized table of 50-60 columns and 5mm rows in another relational database (UDB), populate that table first, and then populate the cache from the new denormalized table using JDBC, so that the time to build the cache is lower. This gives us much better performance, and now we can build the cache in around an hour, but that still does not meet our requirement of building the cache within 5 minutes. The denormalized table is queried using the following query
select * from users where user_id in (...)
Here user_id is the primary key. We also tried the query
select * from users where user_location in (...)
and also created a non-unique index on location, but that did not help either.
So is there a way we can make these queries faster? If not, we are also open to considering some NoSQL solutions.
Which NoSQL solution would be suited to our needs? Apart from the large table, we would be making around 1mm updates to the table on a daily basis.
I have read about MongoDB and it seems that it might work, but no one has posted any experience with MongoDB with so many rows and so many daily updates.
Please let us know your thoughts.
The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write-heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations, and based on the information given, there is no reason it would not work.
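As a rough illustration of that kind of denormalized cache with pymongo (the field names come from the queries above; everything else, including the database name and sample values, is assumed):

from pymongo import ASCENDING, MongoClient

db = MongoClient()["user_cache"]

# One denormalized document per user, bulk-loaded from the relational source
db.users.create_index([("user_id", ASCENDING)], unique=True)
db.users.create_index([("user_location", ASCENDING)])
db.users.insert_many([
    {"user_id": 1, "user_location": "NYC", "name": "A"},
    {"user_id": 2, "user_location": "LDN", "name": "B"},
])

# Equivalent of: select * from users where user_id in (...)
batch = list(db.users.find({"user_id": {"$in": [1, 2]}}))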

What's the fastest way to copy data from one table to another in Django?

I have two models -
ChatCurrent - (which stores the messages for the current active chats)
ChatArchive - (which archives the messages for the chats that have ended)
The reason I'm doing this is so that the ChatCurrent table always has a minimum number of entries, making queries on the table fast (I don't know if this works, please let me know if I've got this wrong).
So I basically want to copy (cut) data from the ChatCurrent to the ChatArchive model. What would be the fastest way to do this? From what I've read online, it seems I might have to execute a raw SQL query; if you would be kind enough to state the query, I'd be grateful.
Additional details -
Both the models have the same schema.
My opinion is that today there is no reason to denormalize a database in this way to improve performance. Indexes, or partitioning plus indexes, should be enough.
Also, in case you prefer, for semantic reasons, to have two tables (models) like Chat and ChatHistory (or ChatCurrent and ChatArchive) as you say, and to manage this with Django, I think the right way to keep consistency is to create a ToArchive() method on ChatCurrent. This method moves chat entries to the historical chat model. You can perform this operation in the background, e.g. run the swap in a Celery task, so online users don't have to wait on the request. Inside the Celery task, the fastest way to copy the data is raw SQL. Remember that you can encapsulate the SQL in a stored procedure.
Edited to include reply to your comment
You can call ToArchive() from the ChatCurrent.save() method:
class ChatCurrent(models.Model):
    closed = models.BooleanField()

    def save(self, *args, **kwargs):
        super(ChatCurrent, self).save(*args, **kwargs)
        if self.closed:
            self.ToArchive()

    def ToArchive(self):
        from django.db import connection, transaction
        cursor = connection.cursor()
        cursor.execute("insert into blah blah")
        transaction.commit_unless_managed()
        # self.delete()  # if needed (perhaps deleted by the raw SQL)
Try something like this:
INSERT INTO "ChatArchive" ("column1", "column2", ...)
SELECT "column1", "column2", ...
FROM "ChatCurrent" WHERE yourCondition;
and then just
DELETE FROM "ChatCurrent" WHERE yourCondition;
The thing you are trying to do is table partitioning.
Most databases support this feature without the need for manual book keeping.
Partitioning will also yield much better results than manually moving parts of the data to a different table. By using partitioning you avoid:
- Data inconsistency, which is easy to introduce because you will move records in bulk and then remove a lot of them from the source table. It's easy to make a mistake and copy only a portion of the data.
- A performance drop: moving the data around, and the associated overhead from transactions, will generally negate any benefit you got from reducing the size of the ChatCurrent table.
For a really quick rundown: table partitioning allows you to tell the database that parts of the data are stored and retrieved together. This significantly speeds up queries, as the database knows it only has to look into a specific part of the data set. Example: chats from the current day, last hour, last month, etc. You can additionally store each partition on a different drive; that way you can keep your current chatter on a fast SSD drive and your history on regular, slower disks.
Please refer to your database manual to know the details about how it handles partitioning.
Example for PostgreSQL: http://www.postgresql.org/docs/current/static/ddl-partitioning.html
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
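For illustration, declarative range partitioning in PostgreSQL 10+ looks roughly like this (run through psycopg2 here to keep the examples in Python; the table and column names are made up):

import psycopg2

conn = psycopg2.connect("dbname=chatdb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE chat (
            body    TEXT,
            created TIMESTAMPTZ NOT NULL
        ) PARTITION BY RANGE (created);
    """)
    # One partition per month; an old partition can later be detached or
    # dropped in one cheap operation instead of a bulk DELETE.
    cur.execute("""
        CREATE TABLE chat_2024_01 PARTITION OF chat
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)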
def copyRecord(self, recordId):
    emailDetail = EmailDetail.objects.get(id=recordId)
    copyEmailDetail = CopyEmailDetail()
    for field in emailDetail.__dict__.keys():
        copyEmailDetail.__dict__[field] = emailDetail.__dict__[field]
    copyEmailDetail.save()
    logger.info("Record Copied %d" % copyEmailDetail.id)
As per the above solutions, don't copy over.
If you really want to have two separate tables to query, store your chats in a single table (and, for preference, use all the database techniques mentioned here), and then have a Current and an Archive table whose objects simply point to Chat objects.
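A minimal sketch of that layout with Django models (the model and field names are made up; the point is that the current/archive tables only hold references, while the messages stay in one table):

from django.db import models

class Chat(models.Model):
    # All messages live here, regardless of whether the chat has ended.
    text = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class CurrentChat(models.Model):
    chat = models.OneToOneField(Chat, on_delete=models.CASCADE)

class ArchivedChat(models.Model):
    chat = models.OneToOneField(Chat, on_delete=models.CASCADE)

Archiving then becomes moving a small pointer row instead of copying message data.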