My .Net code below is always returning a search.Matches.Count of 0 even though the movie is in the table. I've literally searched the whole internet but have not been able to get an answer, even on Amazon's AWS Developer website.
Please let me know what am I doing wrong? I appreciate your help. I'm totally new to this.
client = New AmazonDynamoDBClient(config)
table = Table.LoadTable(client, "MovieTable")
scanFilter = New ScanFilter
With scanFilter
.AddCondition("KeyCode", ScanOperator.NotEqual, MovieName)
.AddCondition("Status", ScanOperator.Equal, "In")
End With
search = table.Scan(scanFilter)
If search.Matches.Count = 1 then getMovieName
As the documentation explains, "Scan", a function which is supposed to go through the entire database, cannot go through the entire database at one fell swoop. Instead, it goes through it 1MB at a time, and after 1MB of data it returns to the caller, and you're supposed to ask to continue in the next page (again, see the documentation on how).
In your case, you have a very specific filter which matches only one item, but still - Scan will return after having read 1MB of data, even if none of the items in this 1MB match your request. It doesn't wait until 1MB of results have been collected! So in your use case it is not surprising that you're getting an empty result set, with LastEvaluatedKey set signalling that there are more pages to read.
By the way In your use case, where you are looking for just one item, doing a Scan of the entire database is obviously not a great choice (unless you're only doing this for debugging). a GetItem or Query operation will make more sense, if you can, and maybe a secondary index would be useful if you're searching by items not in the key.
Related
So, I had a nice thing going on a jsFiddle where I listed all my fiddles on one page:
jsfiddle.net/show
However, they have been changing things slowly this year, and I've already had to make some changes to keep it running. The newest change is rather annoying. Of course, I like to see ALL my fiddles all at once, make it easier to just hit ctrl+f and find what I might be looking for, but they' made it hard to do now. Used to I could just set the limit to 99999, and see everything, but now it appears I can't go past how many i actually have (186 atm).
I tried using a start to limit solution, but when it got to last 10|50 (i tried like start={x}&limit10 and start={x}&limit50) it would die. Namely because last pull had to be exact count. Example, I have 186, and use the by 10's solution, then it would die at start=180&limit=10.
I've search the API docs but can't seem to find a row count or anything of that manner. Anyone know of a good feasible solution that wont have me overloading there servers doing a constant single row check?
I'm having the same problem as you are. Then I checked the docs (Displaying user’s fiddles - Result) and found out that if you include callback=Api parameter, an additional overallResultSetCount field is included in the JSON response. I checked your fiddles and currently you have total of 229 public fiddles.
The solution I can think of will force you to only request twice. The first request's parameters doesn't matter as long as you have callback=Api. Then you send the second request in which your limit will be overallResultSetCount value.
Edit:
It's not in the documentation, however, I think the result set is limited to 200 entries only (hence your start/limit is from 0 - 199). I tried to query more than the 200 range but I get a Error 500. I couldn't find another user whose fiddle count is more than 200 (most of the usernames I tested only have less than 100 fiddles like zalun, oskar, and rpflorence).
Based on this new observation, you can update your script like this:
I have tested that if the total fiddle count is less than 200,
adding start=0&limit=199 parameter will only return all the
fiddles. Hence, you can add that parameter on your initial call.
Check if your total result set is more than 200. If yes, update your
parameters to reflect the range for the remaining result set (in
this case, start=199&limit=229) and add the new result set to your
old result set. Else, show/print the result set you initially got from your first query.
Repeat steps 1 and 2, if your total count reaches 400, 600, etc (any
multiple of 200).
I have about 20,000 documents stored in elastic search, at about 200kb each.
I have a search which has 733 hits total, I'm running that takes about 50ms to complete when returning 10 results.
If I set the size to 1000 so that it returns all results, the search takes 3-5 seconds to return.
Normally I would see that this is because it has to continue searching until it finds all of them, which takes extra time. However when returning 10 results only, the search still says 733 hits in total, so it already knows which documents are to be returned!
Note that I am not returning the _source field here, all I want it the list of _ids back, so I can't imagine that it would have to read any more data from the disk, as all the _ids are surely stored in the indices anyway.
Am I missing something in the way this works?
(My _ids are guids that we use internally).
EDIT: Since posting I've re-indexed with two changes to the mapping:
Set _source to false, so now the actual documents aren't stored.
Changed the index for the field that I was searching on to be not_analyzed.
This solves the problem, now I'm getting all 733 _ids back in ~50ms. Not sure which change solved it though. I'll take one of them back out and re-index.
It will take that Time. Because it need to fetch all data from ES and calculate score for your query.
Try
1)set fields to not analyzed which you Don search in.
2)change the store type of ES from simplfs to mmaps.. ( mention "index.store.type:mmaps" in elasticsearch.yml..)
3)configure less shard as much possible.. Shard more must be equal to move on nodes you gonna use..
I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.
I have been trying to get the full list of playlists matching a certain keyword. I have discovered however that using start-index past 100 brings the same set of results as using start-index=1. It does not matter what the max-results parameter is - still the same results. The total results returned however is way above 100, thus it cannot be that the query returned only 100 results.
What might the problem be? Is it a quota of some sort or any other authentication restriction?
As an example - the queries bring the same result set, whether you use start-index=1, or start-index=101, or start-index = 201 etc:
http://gdata.youtube.com/feeds/api/playlists/snippets?q=%22Jan+Smit+Laura%22&max-results=50&start-index=1&v=2
Any idea will be much appreciated!
Regards
Christo
I made an interface for my site, and the way I avoided this problem is to do a query for a large number, then store the results. Let your web page then break up the results and present them however is needed.
For example, if someone wants to do a search of over 100 videos, do the search and collect the results, but only present them with the first group, say 10. Then when the person wants to see the next ten, you get them from the list you stored, rather than doing a new query.
Not only does this make paging faster, but it cuts down on the constant queries to the YouTube database.
Hope this makes sense and helps.
I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple, assume a directory tool, which reads the current hard drive. When a directory is found, it should be either created, or, if already present, updated.
First lets only focus on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
var result = DatabaseController.Session
.CreateCriteria(typeof (DatabaseDirectory))
.Add(Restrictions.Eq("FullName", dI.FullName))
.List<DatabaseDirectory>().FirstOrDefault();
if (result == null)
{
result = new DatabaseDirectory
{
CreationTime = dI.CreationTime,
Existing = dI.Exists,
Extension = dI.Extension,
FullName = dI.FullName,
LastAccessTime = dI.LastAccessTime,
LastWriteTime = dI.LastWriteTime,
Name = dI.Name
};
}
return result;
}
Is this the way to go regarding:
Speed
Separation of Concern
What comes to mind is the following: A scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan, and look them up this way. On the other hand, this may be not suitable for large sets of data, like files (which will be 600.000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (other reasons). So the real question is, what is the fastest way to "check for the existance of an object, where the ID is not known, just a property of that object"
And batching might be possible, by selecting a big group with something like "starts with C:Testfiles\" but the problem then remains, how do I know in advance how big this set will be. I cant select "max 1000" and check in this buffered dictionary, because i might "hit next to the searched dir"... I hope this problem is clear. The most important part, is, is buffering really affecting performance this much. If so, does it make sense to load the whole DB in a dictionary, containing only PATH and ID (which will be OK, even if there are 1.000.000 object, I think..)
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
Dictionary<string, DirectoryInfo> directories;
session.CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
.SetParameterList("ids", directories.Keys)
.List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.