Suggestions to improve Azure Table Storage query performance - azure-storage

We have a table in Azure Table Storage which currently has 50,000 items since it is newly implemented.
PartitionKey: DateTime value in form of string
RowKey: numeric value in form of string
We use TableQuery to generate a filter condition. The PartitionKey filter is something like: PartitionKey ge '201801240000000000' && PartitionKey lt '201806220000000000'
Unfortunately, we cannot use a RowKey filter because we want data between two dates.
Fetching around one month of data takes about 5 seconds, and fetching around 3 months takes proportionally longer.
Though we have a caching strategy in place, fetching the data the first time takes a long time, and the same happens whenever the date filter changes.
Any suggestions to improve performance would be appreciated.

As far as I can see from your post, the biggest issue is that your query spans multiple partitions. This is not optimal for performance. Based on the list below, you're somewhere between a Partition Scan and a Table Scan, since you specify the PartitionKey but your filter spans a range of PartitionKey values.
A Point Query is the most efficient lookup to use and is recommended to be used for high-volume lookups or lookups requiring lowest latency. Such a query can use the indexes to locate an individual entity very efficiently by specifying both the PartitionKey and RowKey values. For example: $filter=(PartitionKey eq 'Sales') and (RowKey eq '2')
Second best is a Range Query that uses the PartitionKey and filters on a range of RowKey values to return more than one entity. The PartitionKey value identifies a specific partition, and the RowKey values identify a subset of the entities in that partition. For example: $filter=PartitionKey eq 'Sales' and RowKey ge 'S' and RowKey lt 'T'
Third best is a Partition Scan that uses the PartitionKey and filters on another non-key property and that may return more than one entity. The PartitionKey value identifies a specific partition, and the property values select for a subset of the entities in that partition. For example: $filter=PartitionKey eq 'Sales' and LastName eq 'Smith'
A Table Scan does not include the PartitionKey and is very inefficient because it searches all of the partitions that make up your table in turn for any matching entities. It will perform a table scan regardless of whether or not your filter uses the RowKey. For example: $filter=LastName eq 'Jones'
Queries that return multiple entities return them sorted in PartitionKey and RowKey order. To avoid resorting the entities in the client, choose a RowKey that defines the most common sort order.
Source: Azure Storage Table Design Guide: Designing Scalable and Performant Tables
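For completeness, here is a minimal sketch of what the first two query types could look like with the .NET TableQuery API (the entity type and the 'Sales' values are illustrative, not from the original post):

using Microsoft.WindowsAzure.Storage.Table;

// Point query: PartitionKey + RowKey, the cheapest possible lookup.
TableOperation point = TableOperation.Retrieve<DynamicTableEntity>("Sales", "2");

// Range query: fixed PartitionKey, RowKey restricted to a range.
TableQuery<DynamicTableEntity> range = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Sales"),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "S"),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, "T"))));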
Another very useful article is this one: What PartitionKey and RowKey are for in Windows Azure Table Storage, especially where it illustrates how partitions are distributed:
Based on the size and load of a partition, partitions are fanned out across machines. Whenever a partition gets a high load or grows in size, the Windows Azure storage management can kick in and move a partition to another machine.
Edit:
If there are multiple ways you would like to query your data, think about storing it in multiple ways. Especially since storage is cheap, storing data multiple times is not that bad. This way you optimize for reads. This is what's known as the Materialized View pattern, which can "help support efficient querying and data extraction, and improve application performance".
However, keep in mind that this is simple for static data. If you have data that changes a lot, keeping the copies in sync when storing it multiple times might become a hassle.
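As a rough sketch (assuming the classic .NET Table SDK; the table, class, and parameter names are invented for illustration), the write path for such a materialized view could insert the same entity into two tables, each keyed for a different query:

using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class MaterializedViewWriter
{
    // Persist the same logical record twice, partitioned for two read patterns:
    // one table keyed by date, one keyed by customer.
    public static Task WriteBothViewsAsync(
        CloudTable ordersByDate, CloudTable ordersByCustomer,
        string date, string customerId, string orderId)
    {
        var byDate = new DynamicTableEntity(date, orderId);
        var byCustomer = new DynamicTableEntity(customerId, orderId);
        return Task.WhenAll(
            ordersByDate.ExecuteAsync(TableOperation.InsertOrReplace(byDate)),
            ordersByCustomer.ExecuteAsync(TableOperation.InsertOrReplace(byCustomer)));
    }
}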

rickvdbosch's answer is spot on.
Here are some additional thoughts, assuming this is an application. One approach would be to read smaller PartitionKey ranges in parallel. For example, assuming the range being processed is June/2018, we would have:
Thread-1 => PartitionKey ge '20180601' && PartitionKey lt '20180605'
Thread-2 => PartitionKey ge '20180605' && PartitionKey lt '20180610'
Thread-3 => PartitionKey ge '20180610' && PartitionKey lt '20180615'
Thread-4 => PartitionKey ge '20180615' && PartitionKey lt '20180620'
Thread-5 => PartitionKey ge '20180620' && PartitionKey lt '20180625'
Thread-6 => PartitionKey ge '20180625' && PartitionKey lt '20180701'
Moreover, it is possible to be even more aggressive and read smaller partitions (e.g. daily) in parallel without using the TableQuery constructs.
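A minimal sketch of those parallel range reads, assuming the classic WindowsAzure.Storage table SDK (the entity type and range boundaries are placeholders):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public class MyEntity : TableEntity { /* your properties */ }

public static class ParallelRangeReader
{
    // Issues one segmented query per PartitionKey sub-range and merges the results.
    public static async Task<List<MyEntity>> ReadRangesAsync(
        CloudTable table, IEnumerable<(string From, string To)> ranges)
    {
        var results = await Task.WhenAll(ranges.Select(r => ReadRangeAsync(table, r.From, r.To)));
        return results.SelectMany(list => list).ToList();
    }

    private static async Task<List<MyEntity>> ReadRangeAsync(CloudTable table, string from, string to)
    {
        var filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, from),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, to));

        var query = new TableQuery<MyEntity>().Where(filter);
        var entities = new List<MyEntity>();
        TableContinuationToken token = null;
        do
        {
            // Each segment returns up to 1,000 entities plus a continuation token.
            var segment = await table.ExecuteQuerySegmentedAsync(query, token);
            entities.AddRange(segment.Results);
            token = segment.ContinuationToken;
        } while (token != null);
        return entities;
    }
}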
Note that neither approach described above handles a highly unbalanced partitioning strategy. For example, if 95% of the data for June/2018 sits in the range '20180605' to '20180610', or in a single day, there may or may not be any perceived improvement in overall execution time compared to a serial read, especially because of the parallelism overhead (threads, memory allocation, synchronization, etc.).
Now, assuming this is a .NET application running on Windows and the approach described above appeals to your scenario, consider:
Increasing the max number of connections;
Disabling the Nagle algorithm;
Find below a code snippet with the changes for the application configuration. Please note that:
It is possible to specify a concrete address (e.g. https://stackoverflow.com) for maxconnection instead of using "*".
It is recommended to run performance tests to benchmark the appropriate maxconnection value before releasing to production.
Find more details about the connection management at https://learn.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/network/connectionmanagement-element-network-settings.
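For reference, a minimal sketch of those settings in app.config/web.config (the maxconnection value is illustrative; benchmark it as noted above):

<configuration>
  <system.net>
    <settings>
      <!-- Disables the Nagle algorithm for outgoing connections -->
      <servicePointManager useNagleAlgorithm="false" />
    </settings>
    <connectionManagement>
      <!-- Raises the per-endpoint connection limit; "*" applies to all addresses -->
      <add address="*" maxconnection="96" />
    </connectionManagement>
  </system.net>
</configuration>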

Related

Performance of Lucene queries in Ignite

I have a very simple object as the key in my cache, and I want to be able to iterate over the key/value pairs where a string matches a field in my keys.
Here is how the field is declared in the class
@AffinityKeyMapped @QueryTextField String crawlQueueID;
I run many queries and expect a small number of documents to match. The queries take a relatively long time, which is surprising given that there are maybe only 100K pairs locally in the cache. My queries are local; I want to hit only the K/V pairs stored on the local node.
According to the profiler I am using, 80% of the CPU is spent here
GridLuceneIndex.java:285 org.apache.lucene.search.IndexSearcher.search(Query, int)
Knowing Lucene's performance, I am really surprised. Any suggestions?
BTW I want to sort the results based on a numerical field in the value object. Can this be done via annotations?
I could have one cache per value of the field I am querying against but given that there are potentially hundreds of thousands or even millions of different values, that would probably be too many caches for Ignite to handle.
EDIT
Looking at the code that handles the Lucene indexing and querying, the index gets reloaded for every query. Given that I run hundreds of them in a row, we probably don't benefit from any caching or optimization of the index structure in Lucene.
Additionally, a range query runs as a filter to check the TTL. Filter queries are faster, but on a fresh IndexReader there would not be much caching either. Of course, if no TTL is needed for a given table, this should not be required.
Judging by the documentation about SQL indexing:
Ignite automatically creates indexes for each primary key and affinity key field.
the indexing is done on the key alone. In my case, the value I want to use for sorting is in the value object, so that would not work.

Data modeling for Aerospike

I am doing an investigation of Aerospike.
We need to use it as a cache for data (no need for persistence), as the data only lives for a very short period of time. (We create it, we read it, and then the goal is to delete it as fast as possible, based on some processing in a service.)
Our data look something like this:
Record :
- RecordId
- ClientId
- Partition
- Region
- Size
- May have X number of custom attributes (I will probably limit the number of the attributes)
ClientId here represents the multitenancy we want to implement. We will only ever query records that belong to one specific ClientId.
We need to query those records on different fields. I know that this is not easy for Aerospike, as it only supports one filter on a secondary index per query.
As we need to support a large number of records (probably in the range of several million), we want to partition our records based on their Partition field. That should allow the queries to run faster and make post-processing easier.
Each record would have the same format within a partition, but the format may differ from one partition to another.
To solve this problem I want to model my data in Aerospike like this:
Sets :
Partition_{ClientId} : (string equality filter)
Key : RecordId
Bin : Partition
Index : Partition
Region_{ClientId} (string equality filter)
Key : RecordId
Bin : Region
Index : Region
Size_{ClientId} (integer range search)
Key : RecordId
Bin : Size
Index : Size
With as many sets necessary to filter my data.
The point of having RecordId as the key in every set is that the results of the different queries can be intersected on it. We would query the different sets and intersect the results to get the filtered records.
First question: I am doing this because, from what I read, there is no easy way to filter a set based on several filters. Is this a correct assumption?
Second question: based on that model we would reach the limit of sets in one namespace much faster. Is there another way to model the same sort of data while still being efficient?
You can have at most 1023 sets and define at most 256 secondary indexes (SIs). If the number of partitions is limited (under 1023), use that as the secondary index. SIs are built in process RAM and give the advantage of a fast first grouping of eligible records for your query. Then filter using Expressions on ClientID and whatever other conditions you need.
The records also have metadata: the expiration time (TTL) in your case, or the LastUpdateTime of the record (or neither). If you can first filter on metadata that gives a definitive GO/NOGO, do so: metadata is in RAM (assuming Community Edition), so that is fast, and it saves reading the record from disk for the other bin-value filters. Bin data is on disk, assuming you are using storage-engine device; if this is a cache and you are using storage-engine memory, then bin data retrieval will also be faster.
So you can execute queries like this: for PartitionId == 220, give me all records for ClientID == 3005 where the remaining life (TTL) is greater than 3600 seconds, Region == "North", and Size > 300. That is, you can build any combination of logic that evaluates to true or false on the record metadata and/or the bin values, or on bin values only. For this example query, you only need an SI on PartitionId.
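A rough sketch of that query using the Aerospike C# client and its Exp filter-expression API (the namespace, set, and bin names are invented for illustration):

using Aerospike.Client;

public class RecordQuery
{
    public static void Run()
    {
        var client = new AerospikeClient("127.0.0.1", 3000);

        // The secondary-index filter does the first grouping: PartitionId == 220.
        var stmt = new Statement();
        stmt.SetNamespace("cache");
        stmt.SetSetName("records");
        stmt.SetFilter(Filter.Equal("PartitionId", 220));

        // Everything else, including record metadata (remaining TTL),
        // is evaluated server-side by a filter expression.
        var policy = new QueryPolicy
        {
            filterExp = Exp.Build(
                Exp.And(
                    Exp.EQ(Exp.IntBin("ClientID"), Exp.Val(3005)),
                    Exp.GT(Exp.TTL(), Exp.Val(3600)),
                    Exp.EQ(Exp.StringBin("Region"), Exp.Val("North")),
                    Exp.GT(Exp.IntBin("Size"), Exp.Val(300))))
        };

        using (RecordSet rs = client.Query(policy, stmt))
        {
            while (rs.Next())
            {
                Record record = rs.Record;
                // process record.bins here
            }
        }
    }
}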

How well does a unique hash index perform in comparison to the record ID?

Following CQRS practices, I will need to supply a custom generated ID (like a UUID) in any create command. This means when using OrientDB as storage, I won't be able to use its generated RIDs, but rather perform lookups on a manual index using the UUIDs.
Now in the OrientDB docs it states that the performance of fetching records using the RID is independent of the database size O(1), presumably because it already describes the physical location of the record. Is that also the case when using a UNIQUE_HASH_INDEX?
Is it worth bending CQRS practices to request a RID from the database when assembling the create command, or is the performance difference negligible?
I have tested the performance of record retrieval based on RIDs and indexed UUID fields using a database holding 180,000 records. For the measurement, 30,000 records have been looked up, while clearing the local cache between each retrieval. This is the result:
RID: about 0.2s per record
UUID: about 0.3s per record
I've run queries throughout populating the database, in 30,000-record steps. The retrieval time wasn't significantly influenced by the database size in either case. Don't mind the relatively high times, as this experiment was done on an overloaded PC; it's the relation between the two that is relevant.
To answer my own question: a UNIQUE_HASH_INDEX based query is close enough to RID-based queries.
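For reference, the two lookups being compared would look roughly like this in OrientDB SQL (class name, property name, and values are illustrative):

Direct RID lookup:

SELECT FROM #12:345

UUID lookup through a unique hash index:

CREATE PROPERTY Record.uuid STRING
CREATE INDEX Record.uuid ON Record (uuid) UNIQUE_HASH_INDEX
SELECT FROM Record WHERE uuid = '9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d'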

Structuring a large DynamoDB table with many searchable attributes?

I've been struggling with the best way to structure my table. It's intended to hold many, many GBs of data (I haven't been given a more detailed estimate). The table will hold claims data (example here), with a partition key being the resourceType and a sort key being the id (although these could potentially be changed). The end user should be able to search by a number of attributes (institution, provider, payee, etc., totaling ~15).
I've been toying with combining global and local secondary indexes in order to achieve this functionality on the backend. What would be the best way to structure the table to allow a user to search the data by one or more of these attributes in essentially any combination?
If you use resourceType as a partition key you are essentially throwing away the horizontal scaling features that DynamoDB provides out of the box.
The reason to partition your data is such that you distribute it across many nodes in order to be able to scale without incurring a performance penalty.
It sounds like you're looking to put all claim documents into a single partition so you can do "searches" by arbitrary attributes.
You might be better off combining your DynamoDB table with something like ElasticSearch for quick, arbitrary search capabilities.
Keep in mind that DynamoDB can only accommodate approximately 10GB of data in a single partition and that a single partition is limited to up to 3000 reads per second, and up to 1000 writes per second (reads + 3 * writes <= 3000).
Finally, you might consider storing your claim documents directly into ElasticSearch.
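A rough sketch of that combination, assuming the AWS SDK for .NET and the NEST Elasticsearch client (the table, index, and attribute names are invented for illustration):

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Nest;

public class ClaimWriter
{
    private readonly IAmazonDynamoDB dynamo = new AmazonDynamoDBClient();
    private readonly ElasticClient es = new ElasticClient();

    // Writes the authoritative record to DynamoDB, then indexes a searchable
    // copy in Elasticsearch for arbitrary-attribute queries.
    public async Task SaveClaimAsync(string claimId, string institution, string provider)
    {
        await dynamo.PutItemAsync(new PutItemRequest
        {
            TableName = "Claims",
            Item = new Dictionary<string, AttributeValue>
            {
                ["id"] = new AttributeValue { S = claimId },
                ["institution"] = new AttributeValue { S = institution },
                ["provider"] = new AttributeValue { S = provider }
            }
        });

        await es.IndexAsync(
            new { id = claimId, institution, provider },
            i => i.Index("claims").Id(claimId));
    }
}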

How to design Redis data structures in order to perform queries similar to DB queries?

I have tables like Job and JobInfo, and I want to perform queries like the one below:
"SELECT J.JobID FROM Job J, JobInfo B WHERE B.JobID = J.JobID AND BatchID=5850 AND B.Status=0 AND J.JobType<>2"
How shall I go about designing my Redis data types so that I can map such queries onto Redis?
If I try to map the rows of table Job into a Redis hash, e.g. (hash j jobid 1 status 2), and similarly the rows of table JobInfo into another Redis hash, e.g. (hash jinfo jobid 1 jobtype 3), then my tables can be sets of hashes: the Job table can be a set with entries JobSet:jobid, and the JobInfo table can be a set with entries like JobInfoSet:jobid.
But I am confused about what happens when I do a SINTER on JobSet and JobInfoSet: how am I going to query the result to get keys? The hash content of the JobSet entries is not identical to the hash content of the JobInfoSet entries (they may have different key-value pairs).
So what exactly am I going to get as the output of SINTER, and how am I going to query that output as key-value pairs?
So the tables will be a collection of Redis hashes.
Redis is not designed to structure data in a SQL way. Besides being an in-memory key-value store, it supports five data structures: Strings, Hashes, Lists, Sets and Sorted Sets. At a high level this is a sufficient hint that Redis is designed to solve the performance problems that arise from heavy computation on relational data models. However, if you want to execute SQL queries on an in-memory structure, you may want to look at memsql.
Let's break down the SQL statement into different components and I'll try to show how redis can accomplish various parts.
Select J.JobID, J.JobName from Job J;
We translate each row in "Job" into a hash in Redis, using the SQL primary key as the natural key in Redis.
For example:
SQL
==JobId==|==Name==
   123   |  Fred
Redis
HSET Job:123 Name Fred
which can be conceptualized as
Job-123 => {"Name":"Fred"}
Thus we can store columns as hash fields in redis
Let's say we do the same thing for JobInfo. Each JobInfo object has its own ID
JobInfo-876 => {"meta1": "some value", "meta2": "bla", "JobID": "123"}
In SQL we would normally create a secondary index on JobInfo.JobID, but in NoSQL land we maintain our own secondary indexes.
Sorted Sets are great for this.
Thus when we want to fetch JobInfo objects by some field (JobID in this case), we can add them to a sorted set like this:
ZADD JobInfo-JobID 123 JobInfo-876
This results in a set with 1 element in it {JobInfo-876} which has a score of 123. I realize that forcing all JobIDs into the float range for the score is a bad idea, but work with me here.
Now when we want to find all JobInfo objects for a given JobID we just do a log(N) lookup into the index.
ZRANGEBYSCORE JobInfo-JobID 123 123
which returns "JobInfo-876"
Now to implement simple joins we simply reuse this JobInfo-JobID index by storing Job keys by their JobIDs.
ZADD JobInfo-JobID 123 Job-123
Thus when doing something akin to
SELECT J.JobID, J.Name, B.meta1 FROM Job, JobInfo USING (JobID).
This would translate to scanning through the JobInfo-JobID secondary index and reorganizing the Job and JobInfo objects returned.
ZRANGEBYSCORE JobInfo-JobID -inf +inf WITHSCORES
123 -> (Job-123, JobInfo-876)
These objects all share the same JobID. Client-side you'd then asynchronously fetch the needed fields. Or you could embed these lookups in a Lua script, though such a script could make Redis hang for a long time; normally Redis tries to be fair with clients and prefers short batched queries over one long query.
Now we come to a big problem: what if we want to combine secondary indexes? Let's say we have a secondary index on JobInfo.Status, and another on Job.JobType.
If we make a set of all Jobs with the right JobType and use that as a filter on the shared JobInfo-JobID secondary index, then we not only eliminate the bad Job elements but also every JobInfo element. We could, I guess, fetch the scores (JobIDs) of the intersection and re-fetch all JobInfo objects with those scores, but we lose some of the filtering we did.
It is at this point where redis breaks down.
Here is an article on secondary indexes from the creator of redis himself: http://redis.io/topics/indexes
He touches on multi-dimensional indexes for filtering purposes. As you can see, he designed the data structures in a very versatile way. The most appealing point is that sorted set elements with the same score are stored in lexicographical order, so you can give all elements a score of 0 and piggyback on Redis's speed, using it more like CockroachDB, which relies on a global order to implement many SQL features.
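As a tiny illustration of that same-score trick (key and member names are made up): pack the fields into the member string, give every element score 0, and query by prefix:

ZADD myindex 0 "jobtype:1:jobid:123"
ZADD myindex 0 "jobtype:1:jobid:789"
ZADD myindex 0 "jobtype:2:jobid:456"
ZRANGEBYLEX myindex "[jobtype:1:" "[jobtype:1:\xff"

The last command returns only the two jobtype:1 members.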
The other answer is completely correct for Redis up to version 3.4.
The latest releases of Redis, from 4.0 onward, include support for modules.
Modules are extremely powerful, and it happens that I just wrote a small module to embed SQLite into Redis itself: rediSQL.
With that module you can actually use a fully functional SQL database inside your Redis instance.