Query vs Scan operation in DynamoDB - optimization

Background
I am currently trying to figure out the best way of calculating some stats in a Lambda function, based on the DB design I have. Let's say I have records of users from China, which has 23 provinces that I store in an array. I want to determine the total number of females and males, as well as the number of users in each province.
Given a GSI with 200,000 items of roughly 100 bytes each, as seen below, with the province attribute as the partition key:
{
  "createdAt": {
    "S": "2020-08-05T19:21:07.532Z"
  },
  "gender": {
    "S": "Male"
  },
  "updatedAt": {
    "S": "2020-08-05T19:21:07.532Z"
  },
  "province": {
    "S": "Heilongjiang"
  }
}
I am considering using two methods for this calculation:
1. Query method
I plan on looping over the province array and providing a partition key to the query on each iteration, which would end up making many requests (23 to be precise, and that is only if each response stays under the 1 MB limit; otherwise I would have to keep paginating until there is no more LastEvaluatedKey for the current query).
2. Scan method
In this method, I would make Scan requests iteratively against the table until there is no more LastEvaluatedKey, as in the sketch below.
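For illustration only, a minimal boto3 sketch of this pagination loop might look like the following; the "Users" table name, the "province-index" GSI name, and the projected attributes are assumptions, and the Query variant from method 1 would look the same except for calling query() with a KeyConditionExpression per province.

import boto3
from collections import Counter

table = boto3.resource("dynamodb").Table("Users")  # table name is an assumption

gender_counts, province_counts = Counter(), Counter()
kwargs = {"IndexName": "province-index", "ProjectionExpression": "gender, province"}

while True:
    page = table.scan(**kwargs)  # each page is capped at 1 MB
    for item in page["Items"]:
        gender_counts[item["gender"]] += 1
        province_counts[item["province"]] += 1
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(gender_counts, province_counts)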
Knowing that both Scan and Query can return at most 1 MB of data per request, which method would be the most appropriate for this particular use case?
I am leaning towards the Scan method, seeing as I would need to read all the data in the table to calculate the stats anyway; however, I am afraid of how slow the operation will become as the table grows.
PS: Suggestions for a different key schema that gives better access would also be very much appreciated.

Neither.
Use DDB Streams + Lambda to update your stats as records are created/updated/deleted in your DDB table.
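A minimal sketch of such a stream-triggered Lambda in Python/boto3 might look like this. The "UserStats" table, its "pk" key, and the counter attribute names are assumptions; MODIFY events are ignored for brevity, and the stream is assumed to be configured with NEW_AND_OLD_IMAGES.

import boto3

stats_table = boto3.resource("dynamodb").Table("UserStats")  # assumed stats table

def handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]  # INSERT | MODIFY | REMOVE
        image = record["dynamodb"].get("NewImage") or record["dynamodb"].get("OldImage")
        delta = {"INSERT": 1, "REMOVE": -1}.get(event_name, 0)
        if delta == 0 or image is None:
            continue
        province = image["province"]["S"]
        gender = image["gender"]["S"]
        # Atomically bump the per-province total and the per-gender counter
        # on a single stats item for that province.
        stats_table.update_item(
            Key={"pk": f"STATS#{province}"},
            UpdateExpression="ADD userCount :d, #g :d",
            ExpressionAttributeNames={"#g": f"count{gender}"},  # e.g. countMale (assumed naming)
            ExpressionAttributeValues={":d": delta},
        )

Reading the stats then becomes a handful of GetItem calls (one per province) instead of scanning the whole table.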
See also
Using Global Secondary Indexes for Materialized Aggregation Queries
How to do basic aggregation with DynamoDB?

Related

Count of distinct values for different time zones

I have a WorkItem doc with the properties defined below. There are 800K+ docs. I need to create a query that can quickly return the number of unique AccountIds for work items created during the API client's local day. API clients exist in different time zones, e.g. Client #1 may request the count for 01/10/22 (+1), while Client #2 may request it for offset 0.
Client time zones range from 0 to +3 and, if necessary, I could hard-code each time zone into the index definition.
What is the most efficient way to get the count for a given local date and offset? Without using an index I can get the count with the .NET methods Select(m => m.AccountId), Distinct() and Count(), but the query is slow. For efficiency I wanted to do all of the work in a map/reduce index rather than just create an index on CreatedDateTimeUTC to filter the data down.
WorkItem document:
{
  "Id": "<Guid>",
  "CreatedDateTimeUTC": "<UTC Date Time>",
  "AccountId": "<Account Foreign Key>"
}
I'm using DotNet Core 6 and RavenDB 5.3 (Cluster on AWS)

Unnesting a large number of columns in BigQuery and BigTable

I have a table in BigTable, with a single column family, containing some lead data. I was following the Google Cloud guide to querying BigTable data from BigQuery (https://cloud.google.com/bigquery/external-data-bigtable) and so far so good.
I've created the table definition file, as the docs require:
{
  "sourceFormat": "BIGTABLE",
  "sourceUris": [
    "https://googleapis.com/bigtable/projects/{project_id}/instances/{instance_id}/tables/{table_id}"
  ],
  "bigtableOptions": {
    "readRowkeyAsString": "true",
    "columnFamilies": [
      {
        "familyId": "leads",
        "columns": [
          {
            "qualifierString": "Id",
            "type": "STRING"
          },
          {
            "qualifierString": "IsDeleted",
            "type": "STRING"
          },
          ...
        ]
      }
    ]
  }
}
But then, things started to go south...
This is how the BigQuery "table" ended up looking:
Each row is a rowkey and inside each column there's a nested cell, where the only value I need is the value from leads.Id.cell (in this case)
After a bit of searching I found a solution to this:
https://stackoverflow.com/a/70728545/4183597
So in my case it would be something like this:
SELECT
  ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.Id.cell)), "") AS Id,
  ...
FROM xxx
The problem is that I'm dealing with a dataset with more than 600 columns per row. It is infeasible (and impossible, given BigQuery's subquery limits) to repeat this process more than 600 times per row/query.
I couldn't think of a way to automate this query, or of any other method to unnest this many cells (my SQL knowledge stops here).
Is there any way to do an unnesting like this for 600+ columns with an SQL/BigQuery query, preferably in a more efficient way? If not, I'm thinking of doing a daily batch process, using a simple Python connector from BigTable to BigQuery, but I'm afraid of the costs this would incur.
Any documentation, reference or idea will be greatly appreciated.
Thank you.
In general, you're setting yourself up for a world of pain when you try to query a NoSQL database (like BigTable) using SQL. Unnesting data is a very expensive operation in SQL because you're effectively performing a cross join (which is many-to-many) every time UNNEST is called, so trying to do that 600+ times will give you either a query timeout or a huge bill.
The BigTable API will be way more efficient than SQL since it's designed to query NoSQL structures. A common pattern is to have a script that runs daily (such as a Python script in a Cloud Function) and uses the API to get that day's data, parse it, and then output that to a file in Cloud Storage. Then you can query those files via BigQuery as needed. A daily script that loops through all the columns of your data without requiring extensive data transforms is usually cheap and definitely less expensive than trying to force it through SQL.
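As a rough illustration only, a daily export job of that kind in Python (using the google-cloud-bigtable and google-cloud-storage clients) might look like the sketch below; the project, instance, table, and bucket names are placeholders, and a per-day row filter is omitted for brevity.

import json
from google.cloud import bigtable, storage

def export_leads(event=None, context=None):
    """Read rows from Bigtable and write them to GCS as newline-delimited JSON,
    which BigQuery can load or query directly as an external table."""
    table = bigtable.Client(project="my-project").instance("my-instance").table("leads")

    lines = []
    # NOTE: a per-day filter (timestamp RowFilter or rowkey prefix) would
    # normally be passed to read_rows(); omitted here for brevity.
    for row in table.read_rows():
        record = {"rowkey": row.row_key.decode()}
        for qualifier, cells in row.cells.get("leads", {}).items():
            record[qualifier.decode()] = cells[0].value.decode()  # latest cell only
        lines.append(json.dumps(record))

    storage.Client().bucket("my-export-bucket").blob("leads/export.json") \
        .upload_from_string("\n".join(lines))

This flattens each rowkey into one record with one key per qualifier, so the 600+ columns never need to be unnested in SQL at all.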
That being said, if you're really set on using SQL, you might be able to use BigQuery's JSON functions to extract the nested data you need. It's hard to visualize what your data structure is without sample data, but you may be able to read the whole row in as a single column of JSON or a string. Then if you have a predictable path for the values you are looking to extract, you could use a function like JSON_EXTRACT_STRING_ARRAY to extract all of those values into an array. A Regex function could be used similarly as well. But if you need to do this kind of parsing on the whole table in order to query it, a batch job to transform the data first will still be much more efficient.

Neo4j Optimisation For Creating Relationships Between Nodes

I am working with around 50,000 tweets stored as nodes, each having data similar to that shown below.
{
  "date": "2017-05-26T09:50:44.000Z",
  "author_name": "djgoodlook",
  "share_count": 0,
  "mention_name": "firstpost",
  "tweet_id": "868041705257402368",
  "mention_id": "256495314",
  "location": "pune india",
  "retweet_id": "868039862774931456",
  "type": "Retweet",
  "author_id": "103535663",
  "hashtag": "KamalHaasan"
}
I have tried to create relationships between tweets having the same location by using the following command:
MATCH (a:TweetData),(b:TweetData)
WHERE a.location = b.location AND NOT a.tweet_id = b.tweet_id
CREATE (a)-[r:SameLocation]->(b)
RETURN r
Using this command I wasn't able to create the relationships; it took more than 20 hours and still didn't produce any results, whereas a similar command for the hashtag relationship worked fine and took around 5 minutes.
Is there any other method to create these relationships, or any way to optimise this query?
Yes. First, make sure you have an index on :TweetData(location); that's the most important change, since without it every single node lookup will have to scan all 50k :TweetData nodes for a common location (that's 50k² lookups).
Next, it's better to ensure one node's id is less than the other, otherwise you'll get the same pairs of nodes twice, with just the order reversed, resulting in two relationships for every pair, one in each direction, instead of just the single relationship you want.
Lastly, do you really need to return all relationships? That may kill your browser, maybe return just the count of relationships added.
MATCH (a:TweetData)
MATCH (b:TweetData)
WHERE a.location = b.location AND a.tweet_id < b.tweet_id
CREATE (a)-[r:SameLocation]->(b)
RETURN count(r)
One other thing to (strongly) consider is instead of tracking common locations this way, create a :Location node instead, and link all :TweetData nodes to it.
You will need an index or unique constraint on :Location(name), then:
MATCH (a:TweetData)
MERGE (l:Location {name:a.location})
CREATE (a)-[:LOCATION]->(l)
This approach also lends itself more easily to batching, if 50k nodes at once is too much: you can just add SKIP and LIMIT after your MATCH on a, as in the sketch below.
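If you go that route, a rough batching sketch with the official Python driver could look like the following; the bolt URI, credentials, batch size, and the 3.x-era constraint syntax are all assumptions to adapt to your setup.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

BATCH = """
MATCH (a:TweetData)
WITH a ORDER BY a.tweet_id SKIP $skip LIMIT $limit
MERGE (l:Location {name: a.location})
CREATE (a)-[:LOCATION]->(l)
RETURN count(*) AS processed
"""

with driver.session() as session:
    # Run once: unique constraint so the MERGE lookups are indexed
    # (Neo4j 3.x syntax; newer versions use CREATE CONSTRAINT ... FOR ... REQUIRE).
    session.run("CREATE CONSTRAINT ON (l:Location) ASSERT l.name IS UNIQUE")
    skip, limit = 0, 10000
    while True:
        processed = session.run(BATCH, skip=skip, limit=limit).single()["processed"]
        if processed == 0:
            break
        skip += limit

driver.close()

The ORDER BY keeps the SKIP/LIMIT batches deterministic across transactions, so each :TweetData node is linked exactly once.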

Calculating a proxy bit size in a BigQuery table

How does one go about calculating the bit size of each record in BigQuery sharded tables across a range of time?
Objective: how much has it grown over time
Nuances: of the 70-some fields, some records would have nulls for most of them, some would have long string text grabbed directly from the raw logs, and some could be float/integer/date types.
Wondering if there's an easy way to do a proxy count of the bit size for one day and then I can expand that to a range of time.
Example from my experience:
One of my tables is a daily sharded table with a daily size of 4-5 TB. The schema has around 780 fields. I wanted to understand the cost of each data point (bit size) [it was then used for calculating ROI based on cost/usage].
So, let me give you an idea of how the cost (bit-size) side of it was approached.
The main piece here is the use of the dryRun property of the Jobs: query API.
Setting dryRun to true makes BigQuery (instead of actually running the job) return statistics about the job, such as how many bytes would be processed. And that's exactly what is needed here!
So, for example, the request below is designed to get the cost of trafficSource.referralPath in the ga_sessions table for 2017-01-05:
POST https://www.googleapis.com/bigquery/v2/projects/yourBillingProject/queries?key={YOUR_API_KEY}
{
  "query": "SELECT trafficSource.referralPath FROM `yourProject.yourDataset.ga_sessions_20170105`",
  "dryRun": true,
  "useLegacySql": false
}
You can get this value by parsing totalBytesProcessed out of the response. See an example of such a response below:
{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "yourBillingProject"
  },
  "totalBytesProcessed": "371385",
  "jobComplete": true,
  "cacheHit": false
}
So, you can write a relatively simple script in the client of your choice that:
reads the schema of your table – you can use the Tables: get API for this, or if the schema is known and readily available you can simply hardcode it
loops through each and every field in the schema
inside the loop – calls the query API with dryRun and extracts the size of the respective field (as outlined above), and of course logs it (or just collects it in memory)
As a result of the above, you will have a list of all fields with their respective sizes.
If you now need to analyze how those sizes change over time, you can wrap the above in yet another loop where you iterate through as many days as you need and collect stats for each day.
If you are not interested in day-by-day analysis, you can just make sure your query actually covers the range you are interested in. This can be done with a wildcard table.
I consider this a relatively easy way to go.
Personally, I remember doing this with Go, but it doesn't matter – you can use any client you are most comfortable with.
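For example, a rough sketch of that loop with the Python BigQuery client (reusing the yourProject.yourDataset.ga_sessions_20170105 table from the request above) might be:

from google.cloud import bigquery

client = bigquery.Client(project="yourBillingProject")

# Tables: get equivalent – read the table schema.
table = client.get_table("yourProject.yourDataset.ga_sessions_20170105")

sizes = {}
for field in table.schema:  # top-level fields; nested RECORDs can be expanded further
    job = client.query(
        f"SELECT {field.name} FROM `yourProject.yourDataset.ga_sessions_20170105`",
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    sizes[field.name] = job.total_bytes_processed  # bytes this field would scan

for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(name, size)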
Hope this will help you!

Best way to store data and have an ordered list at the same time

I have data that changes often enough that I don't want it in my Postgres tables.
I would like to build top rankings out of this data.
I'm trying to figure out a way to do this, considering:
Ease of use
Performance
1. Using hashes + a CRON job to build sorted sets frequently
In this case, I have a lot of user data stored in hashes like this:
u:25463:d = { "xp": 45124, "lvl": 12, "like": 15, "liked": 2 }
u:2143:d = { "xp": 4523, "lvl": 10, "like": 12, "liked": 5 }
If I want to get the top 15 highest-level people, I don't think I can do this with a single command. I think I'll need to SCAN all the u:x:d keys and build sorted sets out of them. Am I mistaken?
What about performance in this case?
2. Multiple sorted sets
In this case, I duplicate data.
I still have the hashes from the first case, but I also update the data in the different sorted sets, so I don't need a CRON job to build them.
I feel like the best approach is the first one, but what if I have 1,000,000 users?
Or is there another way?
One possibility would be to use a single sorted set + hashes.
The sorted set would just be used as a lookup, it would store the key of a user's hash as the value and their level as the score.
Any time you add a new player / update their level, you would both set the hash, and insert the item into the sorted set. You could do this in a transaction based pipeline, or a lua script to be sure they both run at the same time, keeping your data consistent.
Getting the top players would mean grabbing the top entries in the sorted set, and then using the keys from that set, to go lookup the full data on those players with the hashes.
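For illustration, a rough redis-py sketch of that pattern, reusing the u:<id>:d hash keys from the question and assuming a lvl:leaderboard sorted set as the lookup, might look like this:

import redis

r = redis.Redis()

def upsert_user(user_id, data):
    """Write the user hash and index its level in one atomic pipeline."""
    key = f"u:{user_id}:d"
    with r.pipeline(transaction=True) as pipe:
        pipe.hset(key, mapping=data)
        pipe.zadd("lvl:leaderboard", {key: data["lvl"]})  # member = hash key, score = level
        pipe.execute()

def top_by_level(n=15):
    """Top-N hash keys by level, then fetch each user's full hash."""
    keys = r.zrevrange("lvl:leaderboard", 0, n - 1)
    return [r.hgetall(k) for k in keys]

upsert_user(25463, {"xp": 45124, "lvl": 12, "like": 15, "liked": 2})
print(top_by_level())

Since the sorted set stays ordered as you write, the top-N read is cheap regardless of how many users you have, with no CRON rebuild needed.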
Hope that helps.