aws dynamodb global secondary index cannot query existing items

E.g.:

myTable:
id    field1
1     guy

id is the primary key and field1 is the key of a global secondary index:

- AttributeName: id
  KeyType: HASH
GlobalSecondaryIndexes:
  - IndexName: Field1Index
    KeySchema:
      - AttributeName: field1
        KeyType: HASH
When I use
dynamodb.scan()
or
dynamodb.get({TableName: 'myTable', Key: {id: 1}})
I can find this record.
But when I query the index with
dynamodb.query({
  TableName: 'myTable',
  IndexName: 'Field1Index',
  KeyConditionExpression: '#field1 = :field1',
  ExpressionAttributeNames: {
    '#field1': 'field1'
  },
  ExpressionAttributeValues: {
    ':field1': field1
  }
})
only older items can be found; recently written items cannot.
I don't know whether it is related to myTable's data volume exceeding 8000 items, but I should mention it up front; I should also add that data is always written to this table through batch uploads.
What I currently suspect is that these items were never registered in the secondary index. There are 8,357 items in the table but only 7,599 items in the secondary index, so the items missing from the secondary index are exactly the data that cannot be queried.
I also tried deleting the problem data and adding it again, but the problem is still not solved.
Does anyone know what the reason might be?

It looks like the index key violation issue described here; the detection tool can find the affected items, though it may print various warnings depending on your Java version:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.ViolationDetection.html
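The usual cause of items being absent from a GSI is that the index key attribute (field1 here) is missing from those items or stored with a different data type than the index expects; such items are simply not written to the index. As a quick check before running the violation detector, here is a minimal sketch (AWS SDK for Java v1; table and attribute names taken from the question, pagination omitted) that scans for items with no field1 at all:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class FindMissingGsiKeys {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Items without a "field1" attribute can never appear in Field1Index.
        ScanRequest scan = new ScanRequest()
                .withTableName("myTable")
                .withFilterExpression("attribute_not_exists(field1)");

        ScanResult result = client.scan(scan);
        result.getItems().forEach(System.out::println);
    }
}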

Related

Best way to store data and have ordered list at the same time

I have data that changes too often to live in my Postgres tables.
I would like to build "top N" rankings out of that data.
I'm trying to figure out a way to do this considering:
Ease of use
Performance
1. Using hashes + a CRON to build sorted sets frequently
In this case, I have a lot of user data stored in hashes like this:
u:25463:d = { "xp": 45124, "lvl": 12, "like": 15, "liked": 2 }
u:2143:d = { "xp": 4523, "lvl": 10, "like": 12, "liked": 5 }
If I want to get the top 15 highest-lvl people, I don't think I can do this with a single command. I think I'll need to SCAN all the u:x:d keys and build sorted sets out of them. Am I mistaken?
What about performance in this case?
2. Multiple sorted sets
In this case, I duplicate data.
I still have the hashes from the first case, but I also update the data in the different sorted sets, so I don't need a CRON to build them.
I feel like the best approach is the first one, but what if I have 1,000,000 users?
Or is there another way ?
One possibility would be to use a single sorted set + hashes.
The sorted set would just be used as a lookup; it would store the key of a user's hash as the member and their level as the score.
Any time you add a new player or update their level, you would both set the hash and insert the item into the sorted set. You could do this in a transaction-based pipeline, or a Lua script, to be sure they both run at the same time, keeping your data consistent.
Getting the top players would mean grabbing the top entries in the sorted set, and then using the keys from that set, to go lookup the full data on those players with the hashes.
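For example, here is a minimal sketch in Java with the Jedis client (the hash key shape follows the question's u:<id>:d convention; the sorted-set name idx:lvl is made up):

import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class Leaderboard {
    private static final String LEVEL_INDEX = "idx:lvl"; // hypothetical sorted-set key

    // Write the user's hash and the sorted-set entry in one MULTI/EXEC transaction.
    static void updatePlayer(Jedis jedis, long userId, int lvl, int xp) {
        String hashKey = "u:" + userId + ":d";
        Transaction tx = jedis.multi();
        tx.hset(hashKey, "lvl", String.valueOf(lvl));
        tx.hset(hashKey, "xp", String.valueOf(xp));
        tx.zadd(LEVEL_INDEX, lvl, hashKey); // score = level, member = the hash key
        tx.exec();
    }

    // Top 15 levels: read the sorted set, then fetch each user's full hash.
    static void printTop15(Jedis jedis) {
        for (String hashKey : jedis.zrevrange(LEVEL_INDEX, 0, 14)) {
            Map<String, String> data = jedis.hgetAll(hashKey);
            System.out.println(hashKey + " -> " + data);
        }
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            updatePlayer(jedis, 25463, 12, 45124);
            updatePlayer(jedis, 2143, 10, 4523);
            printTop15(jedis);
        }
    }
}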
Hope that helps.

How to add filter to match the value in a map bin Aerospike

I have a requirement where I have to find a record in Aerospike based on an attributeId. The data in Aerospike is in the format below:
{
name=ABC,
id=xyz,
ts=1445879080423,
inference={2601=0.6}
}
Now I will be getting the value "2601" programmatically, and I should find this record based on that value. But the problem is that the value is a key in a map, and the map may have more than one entry, like:
inference={{2601=0.6},{2830=0.9},{2931=0.8}}
So how can I find this record using the attributeId in Java? Any suggestions are much appreciated.
A little-known feature of Aerospike is that, in addition to an index on a bin's value, you can define an index on:
List values
Map Keys
Map Values
Using an index defined on the map keys of your "inference" bin, you will be able to query (filter) based on the key's name, as in the sketch below.
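A minimal sketch with the Aerospike Java client (the namespace "test" and set "records" are made up; this assumes the map keys are stored as strings; if they are integers, use IndexType.NUMERIC and the long overload of Filter.contains):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Record;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexCollectionType;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class InferenceQuery {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // One-time setup: a secondary index on the KEYS of the "inference" map bin.
        client.createIndex(null, "test", "records", "inference_keys_idx",
                "inference", IndexType.STRING, IndexCollectionType.MAPKEYS).waitTillComplete();

        // Find records whose inference map contains the key "2601".
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("records");
        stmt.setFilter(Filter.contains("inference", IndexCollectionType.MAPKEYS, "2601"));

        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                Record record = rs.getRecord();
                System.out.println(record.bins);
            }
        } finally {
            rs.close();
            client.close();
        }
    }
}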
I hope this helps

Use multiple fields as key in aerospike loader

I want to upload a PSV file with records holding key statistics for a physician, a location and a practice, stored per day.
A unique key for such an entry would consist of a:
physician name,
practice name,
location name, and
date of service.
Four fields altogether.
The configuration file example for the Aerospike loader shows only a version with a single key, and I am not seeing the syntax for multiple key columns.
Can someone please advise me whether this is possible (a configuration listing multiple key fields using columns from the loaded file), and also show me an example?
Join the keys into one string. For readability, use a separator like ":".
It may be useful to know that Aerospike does not store original keys; it stores digests (hashes) instead.
There is no simple answer as to the "best way" and it depends on what you want to query at speed and scale. Your data model will reflect how you want to read the data and at what latency and throughput.
If you want high speed (1-5ms latency) and high throughput (100k per second) of a particular piece of data, you will need to aggregate the data as you write it to Aerospike and store it using a composite key that will allow you to get that data quickly e.g. doctor-day-location.
If you want a statistical analysis over a period of time, and the query can take a few seconds to several minutes, then you can store the data in a less structured format and run Aerospike aggregations on it, or even use Hadoop or Spark directly on the Aerospike data.
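For the composite-key approach, here is a minimal sketch with the Aerospike Java client (the namespace, set, bin names and example values are made up; the date is part of the key so each day gets its own record):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class PhysicianStats {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Composite key: physician, practice, location and date joined with ":".
        String userKey = String.join(":", "DrSmith", "MainStPractice", "Boston", "2015-10-26");
        Key key = new Key("test", "physician_stats", userKey);

        // Write the day's statistics under that key, then read them back.
        client.put(null, key, new Bin("patients_seen", 42), new Bin("revenue", 1234));
        Record record = client.get(null, key);
        System.out.println(record.bins);

        client.close();
    }
}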
You can create a byte buffer, convert the key fields into bytes, and append them to the buffer. When reading, you will need to know the data types (the layout) of the keys to extract them from the byte buffer.
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.{IntegerType, LongType, StringType}
import com.aerospike.client.Key

// Build a composite byte-array key from the key columns of a row.
val keyVal = new ArrayBuffer[Byte]
for (j <- 0 until keyIndex.length) {
  val field = schema(keyIndex(j))
  field.dataType match {
    case StringType =>
      // Append the UTF-8 bytes of the string value.
      keyVal ++= row(keyIndex(j)).asInstanceOf[String].getBytes("UTF-8")
    case IntegerType =>
      // Append the 4-byte encoding of the int value.
      keyVal ++= ByteBuffer.allocate(4).putInt(row(keyIndex(j)).asInstanceOf[Int]).array()
    case LongType =>
      // Append the 8-byte encoding of the long value.
      keyVal ++= ByteBuffer.allocate(8).putLong(row(keyIndex(j)).asInstanceOf[Long]).array()
  }
}
val key: Key = new Key(namespace, set, keyVal.toArray)
keyIndex = array containing the indexes of the key fields.
schema = the schema of the fields.
row = a single record to be written.
When extracting the values, if you know the layout of the key (say you built it from an int, an int and a long), you can read the first 4 bytes as an int, the next 4 bytes as an int, and the last 8 bytes as a long.
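For instance, in Java, if the key was built from an int, an int and a long (16 bytes total; keyBytes is a hypothetical array holding the composite key produced above), java.nio.ByteBuffer reads the pieces back in order:

import java.nio.ByteBuffer;

// keyBytes is the 16-byte composite key built above
ByteBuffer buf = ByteBuffer.wrap(keyBytes);
int first = buf.getInt();   // first 4 bytes
int second = buf.getInt();  // next 4 bytes
long third = buf.getLong(); // last 8 bytes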

Cannot insert new value to BigQuery table after updating with new column using streaming API

I'm seeing some strange behaviour with my BigQuery table. I've just added a new column to the table; it looks good in the interface and when getting the schema via the API.
But when adding a value to the new column I get the following error:
{
"insertErrors" : [ {
"errors" : [ {
"message" : "no such field",
"reason" : "invalid"
} ],
"index" : 0
} ],
"kind" : "bigquery#tableDataInsertAllResponse"
}
I'm using the Java client and the streaming API; the only thing I added is:
tableRow.set("server_timestamp", 0)
Without that line it works correctly :(
Do you see anything wrong with it? (The name of the column is server_timestamp, and it is defined as an INTEGER.)
Updating this answer since BigQuery's streaming system has seen significant updates since Aug 2014 when this question was originally answered.
BigQuery's streaming system caches the table schema for up to 2 minutes. When you add a field to the schema and then immediately stream new rows to the table, you may encounter this error.
The best way to avoid this error is to delay streaming rows with the new field for 2 minutes after modifying your table.
If that's not possible, you have a few other options:
Use the ignoreUnknownValues option. This flag will tell the insert operation to ignore unknown fields, and accept only those fields that it recognizes. Setting this flag allows you to start streaming records with the new field immediately while avoiding the "no such field" error during the 2 minute window, but note that the new field values will be silently dropped until the cached table schema updates!
Use the skipInvalidRows option. This flag will tell the insert operation to insert as many rows as it can, instead of failing the entire operation when a single invalid row is detected. This option is useful if only some of your data contains the new field, since you can continue inserting rows with the old format, and decide separately how to handle the failed rows (either with ignoreUnknownValues or by waiting for the 2 minute window to pass).
If you must capture all values and cannot wait for 2 minutes, you can create a new table with the updated schema and stream to that table. The downside to this approach is that you need to manage multiple tables generated by this approach. Note that you can query these tables conveniently using TABLE_QUERY, and you can run periodic cleanup queries (or table copies) to consolidate your data into a single table.
Historical note: A previous version of this answer suggested that users stop streaming, move the existing data to another table, re-create the streaming table, and restart streaming. However, due to the complexity of this approach and the shortened window for the schema cache, this approach is no longer recommended by the BigQuery team.
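For reference, here is a minimal sketch of options 1 and 2 with the newer google-cloud-bigquery Java client (the dataset and table names are made up; the client used in the original question may be the older generated API client, whose method names differ):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class StreamWithNewColumn {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table");

        InsertAllRequest request = InsertAllRequest.newBuilder(tableId)
                // Option 1: drop fields the cached schema doesn't know about yet instead of failing.
                .setIgnoreUnknownValues(true)
                // Option 2: keep inserting valid rows even if some rows are rejected.
                .setSkipInvalidRows(true)
                .addRow(Map.of("server_timestamp", 0))
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            response.getInsertErrors().forEach((index, errors) ->
                    System.err.println("Row " + index + ": " + errors));
        }
    }
}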
I was running into this error. It turned out that I was building the insert object as if I were in "raw" mode but had forgotten to set the flag raw: true. This caused BigQuery to take my insert data and nest it again under a json: {} node.
In other words, I was doing this:
table.insert({
  insertId: 123,
  json: {
    col1: '1',
    col2: '2',
  }
});
when I should have been doing this:
table.insert({
  insertId: 123,
  json: {
    col1: '1',
    col2: '2',
  }
}, {raw: true});
The Node BigQuery library didn't realize that it was already in raw mode and was then trying to insert this:
{
  insertId: '<generated value>',
  json: {
    insertId: 123,
    json: {
      col1: '1',
      col2: '2',
    }
  }
}
So in my case the errors were referring to the fact that the insert was expecting my schema to have two columns in it (insertId and json).

DynamoDB: Have sequencing within Items

I am developing forums on DynamoDB.
There is a table posts which contains all the posts in a thread.
I need to have a notion of sequence in the posts, i.e. I need to know which post came first and which came later.
My service would be running in a distributed env.
I am not sure that using a timestamp is the best solution for deciding the sequence, as the hosts might have slightly different times and might be off by milliseconds or seconds.
Is there another way to do this?
Can I get DynamoDB to populate the date so it is consistent?
Or is there a sequence generator that I can use in a distributed env?
You can't get DynamoDB to auto-populate dates. You can use other services to provide you with auto-generated numbers, or use DynamoDB's atomic increment to create your own unique IDs.
This can become a bottleneck if your forum is very successful (needs lots of numbers per second). I think you should start with a timestamp and later add complexity to your ID generation (concatenate timestamp+uuid or timestamp+atomic counter).
It is always a best practice to sync your servers' clocks (ntpd).
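As a rough sketch of that concatenation idea (makePostId is a hypothetical helper; the zero-padded millisecond timestamp keeps the IDs sortable as strings, and the UUID breaks ties between hosts writing in the same millisecond):

import java.util.UUID;

public class PostIds {
    static String makePostId() {
        // 13 digits of epoch milliseconds, zero-padded, then a random UUID.
        return String.format("%013d-%s", System.currentTimeMillis(), UUID.randomUUID());
    }

    public static void main(String[] args) {
        System.out.println(makePostId());
    }
}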
Use a dedicated sequence table. If you have only one sequence (say, PostId), then there's going to be only one row with two attributes in the table.
Yes, there's extra cost and effort in managing another table, but this is the best solution I know of by far, and I haven't seen anyone else mention it.
The table should have a key attribute as its primary partition key, and a numeric value attribute with an initial value of 1 (or whatever you want the initial value to be).
Every time you want to get the next available key, you tell DynamoDB to do this:
Increment the value where key = PostId by 1, and return the value before incrementing.
Note that this is one single atomic operation. DynamoDB handles the auto-incrementing, so there are no concurrency issues.
In code, there is more than one way of implementing this. Here's one example:
Map<String, AttributeValue> key = new HashMap<>();
key.put("key", new AttributeValue("PostId"));

Map<String, AttributeValueUpdate> item = new HashMap<>();
item.put("value",
        new AttributeValueUpdate()
                .withAction(AttributeAction.ADD)
                .withValue(new AttributeValue().withN("1"))); // ADD of a number is an atomic increment

UpdateItemRequest request = new UpdateItemRequest("Sequences", key, item)
        .withReturnValues(ReturnValue.ALL_OLD);
UpdateItemResult result = dynamoDBClient.updateItem(request);

Integer postId = Integer.parseInt(result.getAttributes().get("value").getN()); // <- this is the sequential ID you want to set to your post
Another variation on Chen's suggestion is to have strict ordering of posts within a given forum thread, as opposed to globally across all threads. One way to do this is to have a Reply table with a hash key of ThreadId and a range key of ReplyId. The ReplyId would be a Number attribute starting at 0. Every time someone replies, your app does a Query on the Reply table for the single most recent reply on that thread (ScanIndexForward: false, Limit: 1, and the ThreadId of the thread). To insert the new reply, use the ReplyId returned by that Query plus 1. Then use PutItem with a conditional write, so that if someone else replies at the same time, an error is returned and your app can start again with the Query. A sketch of this flow follows below.
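Here is a minimal sketch of that query-then-conditional-put flow with the AWS SDK for Java v1 low-level client (the Reply table, ThreadId and ReplyId attributes follow the description above; "thread-123" and the Body attribute are made up):

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class ReplySequencing {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Find the most recent reply in the thread (highest ReplyId).
        QueryRequest query = new QueryRequest("Reply")
                .withKeyConditionExpression("ThreadId = :t")
                .addExpressionAttributeValuesEntry(":t", new AttributeValue("thread-123"))
                .withScanIndexForward(false)
                .withLimit(1);
        QueryResult queryResult = client.query(query);

        long nextReplyId = queryResult.getItems().isEmpty()
                ? 0
                : Long.parseLong(queryResult.getItems().get(0).get("ReplyId").getN()) + 1;

        // Conditional write: fails if another reply claimed this ReplyId in the meantime.
        Map<String, AttributeValue> reply = new HashMap<>();
        reply.put("ThreadId", new AttributeValue("thread-123"));
        reply.put("ReplyId", new AttributeValue().withN(Long.toString(nextReplyId)));
        reply.put("Body", new AttributeValue("hello"));

        try {
            client.putItem(new PutItemRequest("Reply", reply)
                    .withConditionExpression("attribute_not_exists(ReplyId)"));
        } catch (ConditionalCheckFailedException e) {
            // Someone replied concurrently; re-run the query and retry.
        }
    }
}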
If you want the simplest initial solution possible, then the timestamp+uuid concatenation Chen suggests is the simplest approach. A global atomic counter item will be a scaling bottleneck, as Chen mentions, and based on what you've described, a global sequence number isn't required for your app.