I have a WorkItem doc with the properties defined below. There are 800K+ docs. I need to create a query that can quickly return the number of unique AccountIds for work items created for the API client's local day. API clients exist in different time zones. E.g. Client #1 may request the count for 01/10/22 (+1). Client #2 may request it for offset 0.
Client time zones are 0 to +3 and, if necessary, I could hard-code each time zone into the index definition.
What is the most efficient way to get the count for a given local date and offset? Without an index I can get the count by using the .NET methods Select(m => m.AccountId), Distinct() and Count(), but the query is slow. For efficiency, I wanted to do all of the work in a map/reduce index rather than just create an index on CreatedDateTimeUTC to filter the data down.
WorkItem document:
{
  "Id": "<Guid>",
  "CreatedDateTimeUTC": "<UTC Date Time>",
  "AccountId": "<Account Foreign Key>"
}
I'm using DotNet Core 6 and RavenDB 5.3 (Cluster on AWS)
Background
I am currently trying to figure out the best way of calculating some stats in a Lambda function based on the DB design I have. Let's say I have records of users from China, which has 23 provinces that I have stored in an array; I want to determine the total number of both females and males, as well as the number of users in each province.
Given a GSI table with 200,000 items of about 100 bytes each, as seen below, with the province attribute being the partition key:
{
  "createdAt": {
    "S": "2020-08-05T19:21:07.532Z"
  },
  "gender": {
    "S": "Male"
  },
  "updatedAt": {
    "S": "2020-08-05T19:21:07.532Z"
  },
  "province": {
    "S": "Heilongjiang"
  }
}
I am considering using two methods for this calculation:
1. Query method
I plan on looping over the province array and providing a partition key to the query method on each iteration, which would end up making too many requests (23 to be precise, and that is only if each response stays under the 1 MB limit; otherwise I have to keep paginating until there is no more LastEvaluatedKey for the current query). See the sketch after this list.
2. Scan method
In this method, I would scan the whole table, making requests iteratively until there is no more LastEvaluatedKey.
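For reference, here is roughly what I mean by the query method in Python/boto3; the table name, GSI name and attribute names are placeholders, not my actual schema:

# Rough sketch of option 1 (names are placeholders): query the GSI once per
# province and follow LastEvaluatedKey until each query is exhausted.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")  # placeholder table name

def count_province(province):
    counts = {}
    kwargs = {
        "IndexName": "province-index",  # placeholder GSI name
        "KeyConditionExpression": Key("province").eq(province),
    }
    while True:
        resp = table.query(**kwargs)
        for item in resp["Items"]:
            counts[item["gender"]] = counts.get(item["gender"], 0) + 1
        if "LastEvaluatedKey" not in resp:
            return counts
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]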
Knowing that both the scan and query methods can return only 1 MB of data per request, which method would be the most appropriate to use in this particular use case?
I am considering going for the scan method, seeing as I would need to read all the data in the table in order to calculate the stats anyway; however, I am afraid of how slow the operation will become as the table grows.
PS: Suggestions for a different keySchema for better access would also be very appreciated.
Neither.
Use DDB Streams + Lambda to update your stats as records are created/updated/deleted in your DDB table.
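For illustration, a minimal Python handler along those lines; the stats table name and its key layout are assumptions, not anything from your schema:

# Hypothetical Lambda handler subscribed to the table's DynamoDB stream.
# It keeps running totals in a small "user_stats" table (assumed name) so
# reads never need to scan or query the main table.
import boto3

stats = boto3.resource("dynamodb").Table("user_stats")  # assumed aggregate table

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # MODIFY/REMOVE would be handled similarly if needed
        new_image = record["dynamodb"]["NewImage"]
        province = new_image["province"]["S"]
        gender = new_image["gender"]["S"]
        # one counter per province and one per gender
        for key in ("province#" + province, "gender#" + gender):
            stats.update_item(
                Key={"stat": key},
                UpdateExpression="ADD #c :one",
                ExpressionAttributeNames={"#c": "count"},
                ExpressionAttributeValues={":one": 1},
            )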
See also
Using Global Secondary Indexes for Materialized Aggregation Queries
How to do basic aggregation with DynamoDB?
We have a Couchbase Server Community Edition 5.1.1, build 5723.
In our Cars bucket we have the Car Make and the Cars it manufactured.
The connection between the two is the Id of the Car Make that we save as another field in the Car document (like a foreign key in a MySQL table).
The bucket has only 330,000 documents.
Queries are taking a lot of time - dozens of seconds for a very simple query such as
select * from cars where model="Camry" <-- we expect to have about 50,000 results for that
We perform the queries in 2 ways:
The Couchbase UI
A Spring Boot app, which consistently gets a TimeoutException after 7.5 seconds
We thought the issue is a missing index to the bucket.
So we added an index:
CREATE INDEX cars_idx ON cars(makeName, modelName, makeId, _class) USING GSI;
We can see that index when running
SELECT * FROM system:indexes
What are we missing here? Are these reasonable query times for such a workload in a NoSQL DB?
Try
CREATE INDEX model_idx ON cars(model);
Your index does not cover the model field.
You should also have an index for the Spring Data Couchbase "_class" property:
CREATE INDEX `type_idx` ON `cars`(`_class`)
So, this is how we solved the issue:
Using this link and paralen's answer, we created several indexes that sped up the queries.
We altered our code to use pagination when we know the returned result set will be big, and came up with something like this:
int pageNumber = 0;
Slice<Car> slice;
do {
    Pageable pageable = PageRequest.of(pageNumber++, SLICE_SIZE, Sort.by("id"));
    slice = carsRepository.findAllByModelName("Camry", pageable);
    List<Car> cars = slice.getContent();
    // process the current slice of cars here
} while (slice.hasNext());
How does one go about calculating the bit size of each record in BigQuery sharded tables across a range of time?
Objective: how much has it grown over time
Nuances: Of the 70-some fields, some records would have nulls for most, some would have long text strings grabbed directly from the raw logs, and some fields could be float/integer/date types.
I'm wondering if there's an easy way to do a proxy count of the bit size for one day, which I could then expand to a range of time.
Example from my experience:
One of my tables is a daily sharded table with a daily size of 4-5 TB. The schema has around 780 fields. I wanted to understand the cost (bit-size) of each data point [it was then used for calculating ROI based on cost/usage].
So, let me give you an idea of how the cost (bit-size) side of it was approached.
The main piece here is the use of the dryRun property of the Jobs: query API.
Setting dryRun to true makes BigQuery (instead of actually running the job) return statistics about the job, such as how many bytes would be processed. And that's exactly what is needed here!
So, for example, the request below is designed to get the cost of trafficSource.referralPath in the ga_sessions table for 2017-01-05:
POST https://www.googleapis.com/bigquery/v2/projects/yourBillingProject/queries?key={YOUR_API_KEY}
{
  "query": "SELECT trafficSource.referralPath FROM `yourProject.yourDataset.ga_sessions_20170105`",
  "dryRun": true,
  "useLegacySql": false
}
You can get this value by parsing totalBytesProcessed out of the response. See an example of such a response below:
{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "yourBillingProject"
  },
  "totalBytesProcessed": "371385",
  "jobComplete": true,
  "cacheHit": false
}
So, you can write a relatively simple script in the client of your choice that:
reads the schema of your table - you can use the Tables: get API for this, or if the schema is known and readily available you can simply hardcode it
loops through each and every field in the schema
inside the loop, calls the query API and extracts the size of the respective field (as outlined above), then logs it (or just collects it in memory)
As a result of the above, you will have a list of all fields with their respective sizes.
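To make this concrete, here is a minimal Python sketch of that loop using the google-cloud-bigquery client; the project, dataset and table names are the same placeholders as above, and this version only walks top-level fields:

# Sketch: dry-run one query per top-level field of a single daily shard and
# record how many bytes each would scan. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="yourBillingProject")
table_id = "yourProject.yourDataset.ga_sessions_20170105"
table = client.get_table(table_id)

sizes = {}
for field in table.schema:  # nested fields would need an extra traversal
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(f"SELECT {field.name} FROM `{table_id}`", job_config=job_config)
    sizes[field.name] = job.total_bytes_processed  # bytes that would be processed

for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size} bytes")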
If you then need to analyze how those sizes change over time, you can wrap the above in yet another loop, iterating through as many days as you need and collecting stats for each day.
If you are not interested in a day-by-day analysis, you can just make sure your query covers the whole range you are interested in. This can be done with a wildcard table.
I consider this a relatively easy way to go.
Personally, I remember doing this in Go, but it doesn't matter - you can use any client you are most comfortable with.
Hope this helps!
I need to conduct a series of database performance tests using jMeter.
The database has ~32m accounts, and ~15 billion transactions.
I have configured a JDBC Connection Configuration and a JDBC Request with a single SELECT statement and a hardcoded vAccountNum, and this works fine.
SELECT col1,col2,col3,col4,col5 from transactions where account=vAccountNum
I need to measure how many result sets can be completed in five minutes for one session, then add sessions and tune until server resources are exhausted.
What is the best way to randomize vAccountNum so that I can get an equal distribution of accounts returned?
Depending on what type vAccountNum is, the choices are:
Various JMeter Functions, like:
__Random function - generates a random number within a defined range
__threadNum function - returns the current thread's number (1 for the first thread, 2 for the second, etc.)
__counter function - a simple counter that is incremented by 1 each time it is called
CSV Data Set Config - reads pre-defined vAccountNum values from a CSV file. In that case, make sure you provide enough account numbers so you won't be hammering the server with the same query, which would likely be served from cache (see the sketch below for one way to produce such a file).
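Purely as an illustration of the CSV option, a small export script could pull a sample of real account numbers into a file for CSV Data Set Config; the driver, connection string, table name and sample size below are all assumptions to adapt to your environment:

# Hypothetical export helper (not part of JMeter): write a sample of account
# numbers to accounts.csv. sqlite3 is only a stand-in driver - use the Python
# client for whatever database actually holds the transactions table.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")  # assumed local replica/extract
rows = conn.execute("SELECT DISTINCT account FROM transactions LIMIT 100000")

with open("accounts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["vAccountNum"])  # first line becomes the JMeter variable name
    for (account,) in rows:
        writer.writerow([account])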
I'm trying to figure out how to handle my data structure within Redis. What I am trying to accomplish is how I can count events with two parameters, and then query Redis for that data by date. Here's an example: events come in with two different parameters, let's call them site and event type, and also with the time that event occurred. From there, I will need to be able to query Redis for how many events occurred over a span of dates, grouped together by site and event type.
Here's a brief example data set:
Oct 3, 2012:
site A / event A
site A / event A
site B / event A
Oct 4, 2012:
site B / event B
site A / event A
site B / event A
... and so on.
In my query I would like to know the total number of events over the date span, which will be a span of five weeks. In the example above, this would be something like:
site A / event A ==> 3 events
site B / event A ==> 2 events
site B / event B ==> 1 event
I have looked at using Redis' Sorted Set feature, Hashes, and so on. Sorted Sets seem like the best fit, but querying the data with Redis' ZUNIONSTORE command does not seem great because these events will span five weeks, which makes for at least 35 arguments to the ZUNIONSTORE command.
Any hints, thoughts, ideas, etc?
Thanks so much for your help.
Unlike a typical RDBMS or MongoDB, Redis has no rich query language you can use. With such stores, you accumulate the raw data in the store and then run queries to calculate statistics. Redis is not suited to this model.
With Redis, you are supposed to calculate your statistics on-the-fly and store them directly instead of the raw data.
For instance, supposing we are only interested in statistics over a range of weeks, I would structure the data as follows:
because all the criteria are discrete, simple hash objects can be used instead of zsets
one hash object per week
in each hash object, one counter per (site, event) pair; optionally, one counter per site and/or one counter per event
So when an event occurs, I would pipeline the following commands to Redis:
hincrby W32 site_A:event_A 1
hincrby W32 site_A:* 1
hincrby W32 *:event_A 1
Please note there is no need to initialize those counters. HINCRBY will create them (and the hash object) if they do not exist.
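In Python with redis-py, that write path could look roughly like this (same key naming as above; the wildcard-style field names are just a naming convention, not Redis syntax):

# Sketch of the write path: one pipelined HINCRBY per counter per event.
import redis

r = redis.Redis()

def record_event(week, site, event):
    pipe = r.pipeline(transaction=False)
    pipe.hincrby(week, f"{site}:{event}", 1)
    pipe.hincrby(week, f"{site}:*", 1)   # per-site counter
    pipe.hincrby(week, f"*:{event}", 1)  # per-event counter
    pipe.execute()

record_event("W32", "site_A", "event_A")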
To retrieve the statistics for one week:
hgetall W32
In the statistics, you have the counters per site/event, per site only, per event only.
To retrieve the statistics for several weeks, pipeline the following commands:
hgetall W32
hgetall W33
hgetall W34
hgetall W35
hgetall W36
and perform the aggregation on the client side (quite simple if the language supports associative arrays such as maps, dictionaries, etc.).
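For example, a minimal redis-py sketch of that client-side aggregation over the five weekly hashes:

# Fetch the five weekly hashes in one pipeline and merge the counters.
import redis
from collections import Counter

r = redis.Redis(decode_responses=True)
weeks = ["W32", "W33", "W34", "W35", "W36"]

pipe = r.pipeline(transaction=False)
for week in weeks:
    pipe.hgetall(week)

totals = Counter()
for week_stats in pipe.execute():          # one dict per week
    for counter, value in week_stats.items():
        totals[counter] += int(value)

print(totals["site_A:event_A"])            # e.g. total for site A / event A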