Randomly select DynamoDB entry - react-native

I have a DynamoDB table called URLArray that contains a list of URLs (myURL) and a unique video name (myKey) for each entry.
I need to do two things:
1. When a user clicks the next video button, a random entry needs to be selected from this URLArray. There could potentially be tens of thousands of rows.
2. The user is logged into the app. Every time they finish watching a video, the video's unique video name is recorded. So when the user has seen a video, it's added to a list in a table called Users, under the user's info row.
So the random entry that gets selected when the user clicks the next video button in point 1 has to be compared against the list of videos they've already seen, to make sure it doesn't randomly appear again for that particular user.
What I do so far is woefully inefficient; it works, but it's not great:
By the way, I'm using AppSync + GraphQL to interact with the DynamoDB table. I first get a local copy of the URLArray:
// Gets a list of the Key/URL pairs in the URLArray table via GraphQL ****Called in the constructor, so the URLArray data is available by componentDidMount()****
listUrlArrays = async () => {
  try {
    const apiData = await API.graphql(graphqlOperation(ListUrlArrays)); // GraphQL query
    this.setState({
      URLData: apiData.data.listURLArrays.items,              // URLData[] is available to the entire class via state
      urlArrayLength: apiData.data.listURLArrays.items.length // how many videos are in the database
    });
  } catch (err) {
    console.log('Error fetching URLArray:', err);
  }
}
As an overview, when the user clicks for the next video:
// When clicking next video
async nextVideo() {
  await this.logVideosSeen(); // add myKey to the list of videos in the *Users* table that the logged-in user has now seen
  await this.getURL();        // get the NEXT upcoming video's details for the Video Player, making sure it hasn't been seen before
}
// This will update listOfVideosSeen[] in the Users table with the unique myKey of each video the logged-in user has seen
logVideosSeen = async () => {
.......
}
async getURL() {
  var dbIndex = this.getUniqueRandomNumber(this.state.urlArrayLength); // Choose a number between 0 and N, the number of videos in URLArray
  // hasVideoBeenSeen() fetches the list of videos the user has already seen from the `Users` table (via the GraphQL getUsers query),
  // keeps a local copy of that list (which can get big), and uses JavaScript's indexOf() to check whether myKey already exists in it.
  while (await this.hasVideoBeenSeen(this.state.URLData[dbIndex].myKey)) { // while true, i.e. the user has seen that video before
    dbIndex = this.getUniqueRandomNumber(this.state.urlArrayLength); // get another random number to fetch a new myKey
  }
  // Once we exit the loop we know we have a not-seen-before myKey; proceed to set it to play...
  if (dbIndex != null) {
    this.setState({ playURL: this.state.URLData[dbIndex].vidURL }); // The URL from the local URLArray that we're going to play next
  }
}
I can share a little more code if needed, but essentially I wanted to know how to:
1. Let a Lambda function select a random number based on the current URLArray size (I may need to keep a local copy of URLArray anyway). But I think point 2 here is where it's really inefficient...
2. Let a Lambda function do the check (the while loop) against the Users table to see whether myKey has already been seen, mainly to shift this computational burden to the cloud instead of the local device the app runs on.
EDIT (after a think):
Thanks for the suggestion, Seth. I have been thinking about it for some time, and while the randomness requirement still holds true, I think there is some truth in what you've suggested. The reason I need randomness is so that two users sat side by side, for example, can't predict which video is coming next; it shouldn't be a predictable sequence of videos. I'm also not sure I can use the Scan function with AWS Amplify/GraphQL. So remember there are two things going on here: (1) a video upload, recording it in URLArray sensibly for future reference, and (2) users viewing a previously unseen random video and then moving on to another unseen random video.
*(1)
I like your idea of using a number to index the URLArray, and it's helped make life a bit easier. So the first URL is at index 0, the next at 1, and so on...
My thinking here (to avoid doing a ListUrlArrays() and bringing the WHOLE array locally to the phone) is to create a GSI called VideoNumber for the URLArray table. This would be a unique VideoNumber column holding a number 0-N. So imagine the URLArray table above having another column called VideoNumber, with row 1 having VideoNumber set to 0, row 2 having VideoNumber set to 1, and so on. THEN all I would need to do locally on the device is generate a random number between 0 and N, call a getURLArrayIdbyVideoNumber() query specific to that GSI with the number we just generated, and it'll unlock the information I need from the row. Voila! I think that shifts most of that heavy burden away now.
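A minimal sketch of what that per-number lookup could look like from the app, assuming Amplify generates a query for the hypothetical VideoNumber GSI (the query and field names below are illustrative, not the actual schema):
// Hypothetical: fetch a single URLArray row by its VideoNumber GSI value.
import { API, graphqlOperation } from 'aws-amplify';

const UrlArrayByVideoNumber = /* GraphQL */ `
  query UrlArrayByVideoNumber($videoNumber: Int!) {
    urlArrayByVideoNumber(videoNumber: $videoNumber) {
      items { myKey vidURL videoNumber }
    }
  }
`;

async function getVideoByNumber(totalVideos) {
  const n = Math.floor(Math.random() * totalVideos); // random 0..N-1
  const result = await API.graphql(graphqlOperation(UrlArrayByVideoNumber, { videoNumber: n }));
  // The GSI query returns a list; with unique video numbers it should contain at most one item.
  return result.data.urlArrayByVideoNumber.items[0];
}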
Question: Before each video is uploaded, how do I easily get the current total number of rows N in the table (i.e. the row count)? I would then increment it by one.
The other thing I can do is save this current count in another DynamoDB table that I use for persisted data, read the number from there before the upload, and write N+1 after the upload to increment it (2 DynamoDB operations per upload). It's not ideal.
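If you do go the counter-table route, DynamoDB's atomic counter pattern can cut this to a single write: an UpdateItem with an ADD expression increments the value and returns the new number in the same call. A rough sketch with the AWS SDK DocumentClient (the Counters table and its attribute names are made up for illustration):
// Sketch: atomically increment the video counter and get the new value back.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function nextVideoNumber() {
  const result = await docClient.update({
    TableName: 'Counters',                 // hypothetical counter table
    Key: { counterName: 'URLArrayCount' },
    UpdateExpression: 'ADD #count :one',
    ExpressionAttributeNames: { '#count': 'count' },
    ExpressionAttributeValues: { ':one': 1 },
    ReturnValues: 'UPDATED_NEW',
  }).promise();
  return result.Attributes.count; // use (count - 1) as the new row's VideoNumber if you want 0-based numbering
}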
*(2)
When a user has finished watching a video, I can log, in a list under the user's information in DynamoDB, which videos they've already seen. So, for example, the seen list could now be [3,12,73,108,57] for the 5 videos they've seen so far. When the user clicks nextVideo() we generate a random newNumber and immediately compare it against the seen list. I use seenList.indexOf(newNumber), and it will either go again or stop if newNumber doesn't exist in the list. THEN I can go through the GSI query and retrieve the relevant information from URLArray to display the video.
I think this indexOf() is the biggest computational burden on the device, and it obviously gets a little slower as the seen list grows. But it should be quicker with pure integer numbers than with the alphanumeric myKey I was using before. Any other suggestion would be welcome :)
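One small tweak that could take some of the sting out of those lookups: keep the seen list in a Set, so the membership check is O(1) instead of an indexOf() scan. A quick sketch in plain JavaScript (names illustrative):
// Sketch: pick a random video number the user hasn't seen, using a Set for O(1) lookups.
// seenList comes from the Users table, e.g. [3, 12, 73, 108, 57]; totalVideos is N.
function pickUnseenNumber(seenList, totalVideos) {
  const seen = new Set(seenList);
  if (seen.size >= totalVideos) return null; // nothing left to show

  let candidate;
  do {
    candidate = Math.floor(Math.random() * totalVideos);
  } while (seen.has(candidate));
  return candidate;
}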
I've yet to try it, but it was just an idea, as I need to keep the random element. But first, do you know how I can easily find the number of rows (the table count) of URLArray?

I think you'll have an easier time coming up with a solution to this problem if you drop the randomness requirement. It sounds like the more important requirement is presenting the user with a video they haven't seen before.
If that's correct, it sounds like your access pattern could be stated as
Fetch previously unseen video for user
which is an easier problem to solve.
Unlike SQL databases, there are often many ways to implement a given access pattern in DynamoDB. My answer here is just one way.
Imagine your URLArray table as a giant array. The first URL is at index 0, the next URL at index 1, the one after that at index 2, and so on. Each user of your application would start by watching the video at URL index 0, then URL index 1, etc. This would ensure the user never sees the same video twice. You would not need to store a list of all the videos they've seen. Instead, you could store the index of the last video they saw.
Your application could grab the first n videos from the table to present to your users. Once that list was exhausted, it could go grab the next n videos. And so on...
What I've described here is essentially how pagination is implemented in DynamoDB. To bring this abstraction back to the world of DynamoDB, your algorithm could look something like this:
1. Scan the URLArray table for the first "page" of URLs (a scan operation with no filter criteria).
2. Along with the results, DynamoDB will respond with a LastEvaluatedKey, which allows you to retrieve the next page of results starting from that position.
3. Present your user with each video you pulled back from the scan operation, making sure to record the id (the Primary Key) of the last video they saw.
4. When you exhaust the URLs from step 1, execute another scan operation with the ExclusiveStartKey set to the LastEvaluatedKey returned in step 2.
5. When users return to your application, scan for the next page from the URLArray table with ExclusiveStartKey set to the id of the last video they viewed.
This uses the scan operation to work through your URLArray table one page at a time. Your application would effectively be searching the table from top to bottom, keeping track of where each user is at any given time. When a user revisits your application, just start where they left off.
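A rough sketch of that page-by-page scan with the AWS SDK DocumentClient (table and key names assumed, not taken from your actual schema):
// Sketch: fetch one "page" of URLs from URLArray, starting after the last key we saw.
// Pass lastKey = undefined for the very first page.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function getUrlPage(lastKey, pageSize = 10) {
  const result = await docClient.scan({
    TableName: 'URLArray',
    Limit: pageSize,
    ...(lastKey && { ExclusiveStartKey: lastKey }),
  }).promise();

  return {
    items: result.Items,              // the videos to present
    lastKey: result.LastEvaluatedKey, // store this (or the last item's key) per user for the next page
  };
}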
In response to your edit:
If your use case requires the next video to be unpredictable (e.g. no 2 users can predict what video is next), you have a few problems to solve at the same time:
Selecting an item in an unpredictable/random manner
Tracking what a user has already seen
Putting those two requirements together makes for a tricky access pattern. Let's say you have N videos in your table, and the user has viewed N-1 of these videos leaving only one video unseen. If you are fetching your next video randomly and need to ensure it has not yet been seen, how will you find the last unseen video? How many times would you need to guess before you came across the only unseen video? What query/scan operation could you perform that does this in a single request to DDB? I'm not saying it's impossible, it's just...complicated.
I think it's better to come up with a strategy that is unpredictable to the user, but predictable to you, when it comes to selecting the next unseen video.
For example, you could pre-calculate a random order of indexes from 1..N ahead of time, which would represent the order you present the videos for a given user. You could go through that list sequentially, keeping track of the last seen index. That way, you'd always know which video was next and that the video hadn't previously been seen by this user. Fetching that video would be a simple query operation to DDB.
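For instance (one way to realize this, not necessarily the only one), the pre-calculated order could be a Fisher-Yates shuffle of the indexes, generated once per user, stored with the user, and walked through sequentially:
// Sketch: build a per-user random viewing order over video indexes 0..N-1 (Fisher-Yates shuffle).
function buildViewingOrder(totalVideos) {
  const order = Array.from({ length: totalVideos }, (_, i) => i);
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order; // store with the user, along with a pointer to the last seen position
}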
You also asked how to find the number of items in DynamoDB. Unfortunately, there is no DynamoDB equivalent of the SQL count operation, and the answer to this question is not straightforward. For the benefit of the community (and to get a diverse set of answers), I'd suggest you ask a separate question on Stack Overflow regarding the number of items in a DDB table.

Related

Checking Whether Table Data Exists, Updating / Inserting Into Two Tables & Posting End Outcome

I am working on my cron system, which gathers information via an API call. For the most part it has been fairly straightforward, but now I am faced with multiple difficulties, as the API response depends on who is making the request. The cron runs through each user's API key, and certain information will be visible or hidden to them, and vice versa for the public.
There are teams, and users are part of teams. A user can stealth their move, in which case all the information is shown to them and their team but is not visible to their opponent. Both teams share the same attack id and have access to the same information, just one side can see more of it than the other.
Defendant's Point Of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": "",
"attacker_team_id": "",
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1
}
}
Attacker's Point Of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": 123,
"attacker_team_id": 2
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1,
"boosters": {
"fair_fight": 3,
"retaliation": 1,
"group_attack": 1
}
}
}
So, if the defendant's API key is used first, id 12345 will already be in the team_attacks table but will not include the attacker_id and attacker_team_id. For each insert thereafter, I need to check whether the new insert's ID already exists and whether there is any additional information to add to the row.
Here is the part of my code that loops through the API response and obtains the data; it loops through all the attacks per API key:
else if ($category === "attacks") {
    $database = new Database();
    foreach ($data as $attack_id => $info) {
        $database->query('INSERT INTO team_attacks (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id) VALUES (:attack_id, :attacker_id, :attacker_team_id, :defender_id, :defender_team_id)');
        $database->bind(':attack_id', $attack_id);
        $database->bind(':attacker_id', $info["attacker_id"]);
        $database->bind(':attacker_team_id', $info["attacker_team_id"]);
        $database->bind(':defender_id', $info["defender_id"]);
        $database->bind(':defender_team_id', $info["defender_team_id"]);
        $database->execute();
    }
}
I have also been submitting to the news table; typically I have simply been submitting "X new entries have been added" or similar. However, I haven't a clue whether there is a way, during the above, to check for new entries and for updated entries separately, so I can produce two news feed lines:
2 attacks have been updated.
49 new attack records added.
For this part I was simply counting how many items are in the array, but that only works for the first ever upload; I know I cannot simply count the array length on future inserts, which require additional checks.
If the attack_id does NOT already exist, I also need to submit the boosters into another table. For this I was adding them to an array during the above loop and then looping through that to submit them, but this also depends on the check above, rather than simply attempting an upload for each one without any checks. Boosters will share the attack_id.
With over 1,000 teams, each of which will potentially have at least one member join my site, I need this to be as efficient as possible. The API gives the last 100 attacks per call, and I want this to run in my cron, which collects any new data every 30 seconds, so I may need to sort through potentially 100,000 records.
In SQL, you can check conditions when inserting new data using merge:
https://en.wikipedia.org/wiki/Merge_(SQL)
Depending on the database you are using, the name and syntax of the command might be different. Common names for the command are also upsert and replace.
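For MySQL in particular, INSERT ... ON DUPLICATE KEY UPDATE covers this, and its affected-rows count even tells you whether a row was inserted (1) or updated (2), which maps directly onto the two news-feed lines. A rough sketch in Node with the mysql2 client (table and column names taken from the question; the connection details and the COALESCE logic are assumptions):
// Sketch: upsert each attack and count inserts vs. updates in one pass.
// Assumes attack_id is the PRIMARY KEY (or a UNIQUE key) on team_attacks.
const mysql = require('mysql2/promise');

async function upsertAttacks(attacks) {
  const db = await mysql.createConnection({ host: 'localhost', user: 'app', database: 'game' });
  let inserted = 0, updated = 0;

  for (const [attackId, info] of Object.entries(attacks)) {
    const [result] = await db.execute(
      `INSERT INTO team_attacks (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id)
       VALUES (?, ?, ?, ?, ?)
       ON DUPLICATE KEY UPDATE
         attacker_id = COALESCE(NULLIF(VALUES(attacker_id), ''), attacker_id),
         attacker_team_id = COALESCE(NULLIF(VALUES(attacker_team_id), ''), attacker_team_id)`,
      [attackId, info.attacker_id, info.attacker_team_id, info.defender_id, info.defender_team_id]
    );
    // MySQL reports affectedRows = 1 for a fresh insert, 2 for an update that changed something, 0 if nothing changed.
    if (result.affectedRows === 1) inserted++;
    else if (result.affectedRows === 2) updated++;
  }

  await db.end();
  return { inserted, updated }; // e.g. "49 new attack records added", "2 attacks have been updated"
}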
But: if you are aiming for high performance and near-real-time behaviour, consider using a cache holding critical aggregated data instead of doing the aggregation 100,000 times per minute.
This may or may not be the "answer" you're looking for. The question(s) imply use of a single table for both teams. It's worth considering one table per team for writes to avoid write contention altogether. The two data sets could be combined at query time in order to return "team" results via the API. At scale, you could have another process calculating and storing combined team results in an API-specific cache table that serves the API request.

How can I query data from HBase table in millisecond?

I'm writing an interface to query paginated data from an HBase table. I query pages of data by various conditions, but it's very slow. My rowkey looks like this: 12345678:yyyy-mm-dd, i.e. 8 random digits plus a date. I tried using Redis to cache all the rowkeys and do the pagination there, but it's difficult to query the data by the other conditions.
I have also considered designing a secondary index in HBase, but when I discussed it with colleagues they felt a secondary index would be hard to maintain.
So, who can give me some ideas?
First thing: AFAIK, a random number + date rowkey pattern may lead to hotspotting if you scale to large data volumes.
Regarding Pagination :
I'd suggest Solr + HBase (if you are using Cloudera, that's Cloudera Search). It gives good performance (proven in our case) when querying 100 rows per page, and with a web service call we populated an AngularJS dashboard.
Also, most importantly, you can move back and forth between pages without any issues.
To achieve this, you need to create collections (from the HBase data) and can use the SolrJ API.
HBase alone with the scan API doesn't work well for quick queries.
Apart from that, please see my answer to How to achieve pagination in HBase?, which goes into more detail on the implementation.
An HBase-only solution could be Hindex (a coprocessor-based secondary index solution).
In HBase, to achieve good read performance you want your data to be retrieved by a small number of gets (requests for a single row) or a small scan (a request over a range of rows). HBase stores your data sorted by key, so the most important idea is to come up with a row key that allows that.
Your key seems to contain only a random integer and a date, so I assume your queries are about paginating over records marked with a time.
First idea: in a typical pagination scenario you access just one page at a time and navigate from page 1 to page 2 to page 3, etc. Given you want to paginate over all records for date 2015-08-16, you could use a scan of 50 rows with start key '\0:2015-08-16' (which is smaller than any row key for 2015-08-16) to retrieve the first page. After retrieving the first page you have the last key of that page, say '12345:2015-08-16'. You can use it (or '12346:2015-08-16') as the start key of another 50-row scan to retrieve page 2, and so on. So with this approach you query each page quickly, as a single scan with a predefined number of returned rows. You can pass the last row key of a page as a parameter to your paging API, or just put it in Redis so the next paging API call will find it there.
All this works perfectly well until some user comes in and clicks directly on page 100, or tries to jump to page 5 from page 2. In that scenario you can use a similar scan with nSkippedPages * 50 rows. This will not be as fast as sequential access, but it's not the usual page-usage pattern. You can then use Redis to cache the last row key of each page result in a structure like pageNumber -> rowKey. Then, if the next user comes and clicks on page 100, they will see the same performance as in the usual click page 1, click page 2, click page 3 scenario.
Then, to make things faster for users who click on page 99 for the first time, you could write a separate daemon which retrieves every 50th row and puts the results in Redis as a page index. Launch it every 10-15 minutes and accept that your page index has at most 10-15 minutes of stale data.
You could also design a separate API which preloads the row keys for a bulk of N pages (say about 100 pages; it could be async, i.e. it doesn't wait for the actual preload to complete). What it would do is just a scan with KeyOnlyFilter and 50*N results, then pick out the row key at each page boundary. So it accepts a row key and populates the Redis row-key cache for N pages. Then, when a user lands on the first page, you fetch the row keys of the first 100 pages for them, so whenever they click a page link shown on the page, that page's start row key is already available. With the right preload bulk size you could approach your required latency.
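A small sketch of the pageNumber -> startRowKey cache idea in Node with ioredis; scanRows() stands in for whatever HBase client call you use to scan 'limit' rows from a given start key, so treat it as a placeholder:
// Sketch: resolve a page's start row key from Redis, falling back to a skipping scan.
// scanRows(startRow, limit) is a hypothetical HBase scan helper returning an array of rows.
const Redis = require('ioredis');
const redis = new Redis();
const PAGE_SIZE = 50;

async function getPage(pageNumber, scanRows, firstKeyOfDay) {
  const cached = await redis.hget('pageIndex:2015-08-16', String(pageNumber));
  if (cached) {
    return scanRows(cached, PAGE_SIZE); // fast path: single small scan
  }

  // Slow path: scan forward from the beginning of the day, skipping the earlier pages.
  const rows = await scanRows(firstKeyOfDay, pageNumber * PAGE_SIZE);
  const page = rows.slice((pageNumber - 1) * PAGE_SIZE);

  // Remember where this page starts so the next visitor gets the fast path.
  if (page.length > 0) {
    await redis.hset('pageIndex:2015-08-16', String(pageNumber), page[0].rowKey);
  }
  return page;
}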
Limit could be implemented using Scan.setMaxResults() or using PageFilter.
"skip nPages * 50 rows" and especially "output every 50th row" functionality seems to be trickier e.g. for latter you may end-up performing full scan which retrieves the keys or writing map-reduce to do it and for first it is not clear how to do it without sending rows over network since request can be distributed across several regions.
If you are looking for secondary indexes that are maintained in HBase there are several open source options (Splice Machine, Lilly, etc.). You can do index lookups in a few milliseconds.

Which datastructure should I use in Redis for a notification system?

I am trying to make a notification system with Redis rather than using MySQL, which is what I use for the rest of the system. The reason for this is that I don't really need to save that much data, so it can be kept in memory, and I want it to be lightweight and fast.
The notifications will be kept temporarily. What I mean by that is that I do not want to save all notifications, only something like the 50 latest unseen notifications for each user. So the first thing I thought about was using a linked list with a capped length of 50.
I would need to save this information for the notification:
postId
commentId
type
time
userId
username
image
So perhaps a JSON serialized string like this:
{"postId":1,"commentId":10,"type":1,"time":1462960058,"userId":2,"username":"Alexander","image":"ntfpRrgx.png"}
The notifications would be output like this on the client side:
Alexander commented on your post.
Alexander replied to your comment.
Where the type determines what kind of notification it is. I can handle "type" checks client side and output the notification format accordingly. But here is the part I am having difficulty with.
1) I need to be able to save the notifications in an ordered way so that I know which notification is newest.
2) I need to be able to know when a notification has been seen, so that it is not registered as not seen anymore.
3) I need to have a count of unseen notifications that I can show to the user. And If the user clicks on a notification, I need to mark that as a seen notification and decrement the count of unseen notifications.
4) I need to be able to mark all notifications as marked seen if the user wishes to do that.
5) I need to be able to get a subset of the notifications, whether seen or unseen, like an offset and limit on MySQL. For example, the user sees the newest 5 notifications, but he could click a next button and see the next 5, and the next 5 and so on.
I have no idea how to do all of this on Redis.
The key for the list or set could be user:1:notification. I know a list is sorted, and we can add and remove from the head and tail. But how do I achieve all these points?
1: You can use Redis sorted set (zset) operations, with the timestamp as the score and the event id (or the entire event JSON) as the member.
ZADD my-set-key timestamp event-id
Then, to get a page of the newest items, you use the ZREVRANGE command. If you choose to use the event id as the member, you need an additional structure to store the event fields; I would recommend a hash per event: HSET eventid field value.
2: You can remove an item by member (event-id)
ZREM my-set-key event-id
3: Assuming your zset only keeps unseen, then you can use ZCARD to get size of the set
ZCARD my-set-key
4: You can remove an entire set in one shot using
DEL my-set-key
5: You can paginate using zrange/zrevrange:
ZREVRANGE my-set-key start-position to-position
If you need to keep both seen and unseen items, then you need an extra zset where you only add items and never remove them once they've been seen.
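Putting those commands together, a rough sketch in Node with the ioredis client (key names follow the user:1:notification idea from the question; the exact layout is an assumption):
// Sketch: unseen notifications kept in a sorted set scored by timestamp.
const Redis = require('ioredis');
const redis = new Redis();

const unseenKey = (userId) => `user:${userId}:notifications:unseen`;

// Add a notification (storing the whole JSON blob as the member for simplicity).
async function addNotification(userId, notification) {
  const key = unseenKey(userId);
  await redis.zadd(key, notification.time, JSON.stringify(notification));
  await redis.zremrangebyrank(key, 0, -51); // keep only the 50 newest
}

// Page through newest-first: page 0 = items 0..4, page 1 = items 5..9, and so on.
async function getNotifications(userId, page, pageSize = 5) {
  const start = page * pageSize;
  const raw = await redis.zrevrange(unseenKey(userId), start, start + pageSize - 1);
  return raw.map((s) => JSON.parse(s));
}

// Unseen badge count.
async function unseenCount(userId) {
  return redis.zcard(unseenKey(userId));
}

// Mark one notification as seen (remove it), or clear them all.
async function markSeen(userId, notificationJson) {
  await redis.zrem(unseenKey(userId), notificationJson);
}

async function markAllSeen(userId) {
  await redis.del(unseenKey(userId));
}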

Concrete5 - unique block on every page

I have a block that displays the number of clicks on the page where it is installed.
I want to display that block on every page.
I have tried creating a stack and then including that stack in a global area. The problem is that when the stack is included, it includes the same block on every page.
So instead of having a different block on every page, I have the same block on every page. That results in counting the clicks across all pages instead of only the current one.
How can I include a block on every page, but have each one get a unique ID, as it does when added manually?
Thanks.
There's no way to have a unique block on every page without adding it manually. Even if you make it a "page type default", you'll end up with a shared block ID until you make the first edit (at which point it'll get its own ID). And if you hardcode it into the page's PHP code, it won't have a block ID at all.
With that being said, I don't know why you need a unique block ID. Obviously, each page is going to have a unique ID, so your block should be able to store clicks (I'm not sure exactly what you mean here, but it's probably irrelevant) against the page ID (collection ID in c5 parlance), and retrieve them against that, too.
Edit:
Considering your comment, and based on my understanding of what you're trying to do, there's no reason why you can't combine the block ID (which will, as you say, be duplicated across pages, but will be different for each block ON a page) and the page ID. So if you place two blocks in a stack, they'll get IDs 1 and 2, and they'll have IDs 1 and 2 on every page the stack is on. So when you're trying to "record" whatever data they produce, you combine the bID with the cID to get 1-103 and 2-103, and 1-4719, etc.
Edit 2:
So if your difficulty isn't so much "keying" the data but physically storing it, then see jordanlev's comment below. You won't be using the $btTable table, as that's keyed off the bID. Instead you'll use db.xml to create a new table which can take your new key, plus whatever other data you want to store. It's then your responsibility to query and update it with Loader::db(). See his block, or the core's "survey" block, for examples of how to manage your own db table.

How to keep a list of 'used' data per user

I'm currently working on a project in MongoDB where I want to get a random sampling of new products from the DB. But my problem is not MongoDB specific; I think it's a general database question.
The scenario:
Let's say we have a collection (or table) of products. And we also have a collection (or table) of users. Every time a user logs in, they are presented with 10 products. These products are selected randomly from the collection/table. Easy enough, but the catch is that every time the user logs in, they must be presented with 10 products that they have NEVER SEEN BEFORE. The two obvious ways I can think of to solve this problem are:
1. Every user begins with their own private list of all products. Each time they get one of these products, the product is removed from their private list. The result is that the next time products are chosen from this previously trimmed list, it already contains only new items.
2. Every user has a private list of previously viewed products. When a user logs in, they select 10 random products from the master list, compare the id of each against their list of previously viewed products, and if the item appears on the previously viewed list, the application throws it away, selects a new one, and iterates until there are 10 new items, which it then adds to the previously viewed list for next time.
The problem with #1 is that it seems like a tremendous waste. You would basically be duplicating the list data for n number of users. Also, removing/adding new items to the system would be a nightmare, since it would have to iterate through all users. #2 seems preferable, but it too has issues: you could end up making a lot of extra and unnecessary calls to the DB in order to guarantee 10 new products. As a user goes through more and more products, there are fewer new ones to choose from, so the chance of having to throw one away and get a new one from the DB greatly increases.
Is there an alternative solution? My first and primary concern is performance. I will give up disk space in order to optimize performance.
Those two ways are a complete waste of both primary and secondary memory.
You want to show 10 never-before-seen products, but is this a real must? If you have a lot of products, 10 random ones have a high chance of being unique.
3. You could list 10 random products; even though that's not as easy as in MySQL, it's still less complicated than 1 and 2.
If you don't care how random the sequence of IDs is, you could do this:
Create a single randomized table of just product id's and a sequential integer surrogate key column. Start each customer at a random point in the list on first login and cycle through the list ordered by that key. If you reach the end, start again from the top.
The customer record would contain a single value for the last product they saw (the surrogate from the randomized list, not the actual id). You'd then pull the next ten on login and do a single update to the customer. It wouldn't really be random, of course. But this kind of table-seed strategy is how a lot of simpler pseudo-random number generators work.
The only problem I see is if your product list grows more quickly than your users log in. Then they'd never see the portions of the list which appear before wherever they started. Even so, with a large list of products and very active users this should scale much better than storing everything they've seen. So if it doesn't matter that products appear in a set pseudo-random sequence, this might be a good fit for you.
Edit:
If you stored the first record they started with as well, you could still generate the list of all things seen. It would be everything between that value and last viewed.
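A minimal sketch of that idea with the MongoDB Node driver, assuming a randomizedProducts collection with a sequential seq field and a user document holding startSeq/lastSeenSeq (all names made up for illustration):
// Sketch: walk a pre-randomized product list in seq order, 10 at a time, per user.
// 'db' is a connected Db instance obtained from MongoClient in the mongodb driver.
async function nextTenProducts(db, userId) {
  const users = db.collection('users');
  const randomized = db.collection('randomizedProducts');

  const user = await users.findOne({ _id: userId });
  const lastSeen = user.lastSeenSeq ?? user.startSeq; // startSeq chosen randomly at first login

  // Grab the next 10 entries in the randomized order; wrap-around at the end is omitted for brevity.
  const batch = await randomized
    .find({ seq: { $gt: lastSeen } })
    .sort({ seq: 1 })
    .limit(10)
    .toArray();

  if (batch.length > 0) {
    await users.updateOne(
      { _id: userId },
      { $set: { lastSeenSeq: batch[batch.length - 1].seq } }
    );
  }
  return batch; // each entry carries the actual product id
}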
How about doing this: create a collection prodUser where you have just the id of the product and the list of customer IDs of those who have seen that product:
{
  prodID : 1,
  userID : []
}
When a customer logs in, you find the 10 prodIDs which have not been assigned to that user:
db.prodUser.find({
  userID : {
    $nin : [yourUser]
  }
})
(For some reason $not is not working :-(. I do not have time to figure out why; if you do, please let me know.) After showing the person their products, you can update their entries in the prodUser collection. To mitigate Mongo's inability to find random elements, you can insert elements in random order and just take the first 10.
Everything should work really fast.
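The update step mentioned above could look roughly like this (MongoDB Node driver again, collection and field names as in the example):
// Sketch: after showing products, record the user against each product that was shown.
// 'db' is a connected Db instance from the mongodb driver.
async function markProductsShown(db, userId, shownProdIds) {
  await db.collection('prodUser').updateMany(
    { prodID: { $in: shownProdIds } },
    { $addToSet: { userID: userId } } // $addToSet avoids duplicate user entries
  );
}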