How scroll in elasticsearch handles continually updating data?

How scroll in elasticsearch handles continually updating data? - indexing

Here is an example with reindexing. But what if date field is updating during reindex? And after several scroll requests I need to set date inteval from date to now. How elasticsearch scroll handle this situation: it just scroll documents with old date value or scroll forever untill update requests ends?

The way the scan-and-scroll API works is described one link away from your link, i.e. at http://www.elastic.co/guide/en/elasticsearch/guide/master/scan-scroll.html
On that page, it's stated that
A scrolled search takes a snapshot in time — it doesn’t see any changes
that are made to the index after the initial search request has been
made. It does this by keeping the old datafiles around, so that it can
preserve its “view” on what the index looked like at the time it started.
So what this implies is that whether you use a specific date for your end date or simply now, it won't make any difference since the snapshot of documents taken into account by your scroll query will always be constant during the whole time the query runs.
Say you're issuing a scroll query right now (e.g. at 2015-05-11 06:22:27), then no new documents contributed to your index after that date will ever be returned.

Related

firestore how to check if new documents added matching a particular query

As I know that, instead of get() I can use onSapshot() for my queries and listen to the changes(additions, deletions, modifications) for that query. But when there are changes, onSnaphot() returns multiple full documents whether modified, deleted or added (a new snapshot). What I want to do is, to check if new documents have been added matching my query and show a notification to the user that there are new records available, and only when a button is clicked, they should be able to fetch and see the new records. I don't want to fetch from the firestore whole sets of documents when there are changes. I just want to know about those changes and fetch on demand. And fetch only the added ones.
How can I do that? Any ideas?
By the way, I am using react-native for my app.
P.S. - I know, I can run the query periodically and check if the id of the first item in the new query matches the first item id in the previous query, and so I can detect the added records and show my notification/button. But I am looking for a more elegant solution. I think the notification should be triggered from the backend instead of polling the backend periodically.
P.S. 2 - Using cloud functions would not seem to be a logical option since these queries will be different for each user of my app. Which would require running thousands of functions (hopefully more) on the firestore. Or would it?
Thanks!

There is no way in the Firestore API to get notified about changes to a query without actually retrieving the changed documents. But you can of course show just a notification to the user when your onSnapshot callback gets called, and then only show the actual data from those documents when the user chooses to refresh the UI.
On a second note, when you use a snapshot listener the Firestore client will only retrieve the modified documents on additional callbacks.
Say you attach a listener for a query that matches 10 documents. On the first onSnapshot callback, you will get 10 documents and will be charged for 10 document reads. Now say that one of the documents changes. When your onSnapshot callback gets invoked for this, you will see 10 documents again, but will be charged only 1 read - for the document that was changed.
If you only want to process the changes, have a look at the documentation on viewing changes between snapshots, which contains a good example of how to do this with the docChanges() method.

Pagination best practices and "pagination indexes"

Yesterday I went to add pagination to a chat service. I've implemented pagination a hundred times before using either page/per_page or start/end arguments, and I've always been aware that there is the potential for the order or offsets to change between page loads, but in the case of this chat service, that is much more likely to happen. My first approach was to use the updated_at column and have subsequent requests just pass something like "updated_after" but in this case, we often import multiple rooms at once that could easily have the same updated_at (down to the millisecond) for the same user.
One solution I found was to use an update count, or index that gets incremented (table wide) each time a row is updated. I've implemented it using a postgres SEQUENCE column that I set it to nextval('room_update_count') each time the room is updated (this could be a trigger but I wanted more control). Then I can just sort on that column and use a simple update_index < $next to get the following page.
However, I haven't been able to find any references to any technique like this. It seems to work perfectly but I'm skeptical that I'm not seeing some issue with it. I feel like I can't be the first person to come up with this idea, and that if I could just think of the correct name for it I would find other examples of it.
Is this a known technique? What are other ways to solve the pagination offset issue?

Call a Web Service to get search result in titanium

I have implemented a tableView with a searchBar added to it. I want to call a service when user start typing the search keyword in the search bar. I know that I can call the service in the change event listener that will call the service.
I know that for every change in the search bar it is not good to call a service. So what is the efficient approach of using search bar when the search result is coming from a service call or what we can do to make the search efficient.
For example: the search functionality on Apple's App store

I did something like this for one of my test projects. I would check in my change event that at least 3 characters were entered before I would attempt a look-up. I have no idea why I went with 3, but it seemed like a decent number of characters for filtering my data. I would also set a flag that a network request was in progress. So if they entered 3 characters, you could kick off the search if no look-up was already in progress. If there was a network request in progress, you could setup a wait interval to keep checking if the request came back and kick off an additional request when it is. I would send back short lists of items, which for me was 25 so that my table appeared fast.
Though I didn't do this, you could track the interval of time between characters typed to make sure the user is finished typing. For the best interval you will need to experiment with what is reasonable for an average user. Get some feedback from typically non-power users on this.
I can see a potential issue where you are in the middle of a look-up, but the user is still typing. You might need to track those character updates and perhaps kick off an additional search for the updated character string. You might even check the character search string you sent at the time with the current characters in the input box and choose to abandon the list of look-up items you already received and just do another search.
You might want to show the list of items you did receive just so the user knows the app is working, but immediately send another request for look-up items automatically. A user might eventually start hammering keys and think the app is unresponsive if you don't show something in the table once in a while.

What are the effects of the Last Modifed Header (LMH) changing too often on a dynamic site?

We have a web application that has a defect and is updating all pages Last Modified Header to the date of the last publish. We are in the process of fixing the defect, but we wanted to know if this defect might impact our SE results for this site.
Basically each time a page on the site get's updated all pages updates the last modified date even if the page has not been updated.
Is there any possibility of the Search Engine detecting the site as spam, since all pages are changing too often? -- Theory

It's unlikely to change much, since all the search engines will notice that your content hasn't actually changed. They will crawl at a rate commensurate with the observed rate of content change, more or less regardless of what you tell them, and small changes like that won't be marked as content changes in the index.

Changing the last modified date too often will NOT have a negative impact with the big 3.
The only way you can affect crawl rate via meta data (and sitemap.xml) is to reduce it. The reason for this is indicators that increase ranking/indexing are too easily abused. However, reducing spider rate is still an option for the resource conscious webmaster.

How to skip known entries when syncing with Google Reader?

for writing an offline client to the Google Reader service I would like to know how to best sync with the service.
There doesn't seem to be official documentation yet and the best source I found so far is this: http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
Now consider this: With the information from above I can download all unread items, I can specify how many items to download and using the atom-id I can detect duplicate entries that I already downloaded.
What's missing for me is a way to specify that I just want the updates since my last sync.
I can say give me the 10 (parameter n=10) latest (parameter r=d) entries. If I specify the parameter r=o (date ascending) then I can also specify parameter ot=[last time of sync], but only then and the ascending order doesn't make any sense when I just want to read some items versus all items.
Any idea how to solve that without downloading all items again and just rejecting duplicates? Not a very economic way of polling.
Someone proposed that I can specify that I only want the unread entries. But to make that solution work in the way that Google Reader will not offer this entries again, I would need to mark them as read. In turn that would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.
Cheers,
Mariano

To get the latest entries, use the standard from-newest-date-descending download, which will start from the latest entries. You will receive a "continuation" token in the XML result, looking something like this:
<gr:continuation>CArhxxjRmNsC</gr:continuation>`
Scan through the results, pulling out anything new to you. You should find that either all results are new, or everything up to a point is new, and all after that are already known to you.
In the latter case, you're done, but in the former you need to find the new stuff older than what you've already retrieved. Do this by using the continuation to get the results starting from just after the last result in the set you just retrieved by passing it in the GET request as the c parameter, e.g.:
http://www.google.com/reader/atom/user/-/state/com.google/reading-list?c=CArhxxjRmNsC
Continue this way until you have everything.
The n parameter, which is a count of the number of items to retrieve, works well with this, and you can change it as you go. If the frequency of checking is user-set, and thus could be very frequent or very rare, you can use an adaptive algorithm to reduce network traffic and your processing load. Initially request a small number of the latest entries, say five (add n=5 to the URL of your GET request). If all are new, in the next request,
where you use the continuation, ask for a larger number, say, 20. If those are still all new, either the feed has a lot of updates or it's been a while, so continue on in groups of 100 or whatever.
However, and correct me if I'm wrong here, you also want to know, after you've downloaded an item, whether its state changes from "unread" to "read" due to the person reading it using the Google Reader interface.
One approach to this would be:
Update the status on google of any items that have been read locally.
Check and save the unread count for the feed. (You want to do this before the next step, so that you guarantee that new items have not arrived between your download of the newest items and the time you check the read count.)
Download the latest items.
Calculate your read count, and compare that to google's. If the feed has a higher read count than you calculated, you know that something's been read on google.
If something has been read on google, start downloading read items and comparing them with your database of unread items. You'll find some items that google says are read that your database claims are unread; update these. Continue doing so until you've found a number of these items equal to the difference between your read count and google's, or until the downloads get unreasonable.
If you didn't find all of the read items, c'est la vie; record the number remaining as an "unfound unread" total which you also need to include in your next calculation of the local number you think are unread.
If the user subscribes to a lot of different blogs, it's also likely he labels them extensively, so you can do this whole thing on a per-label basis rather than for the entire feed, which should help keep the amount of data down, since you won't need to do any transfers for labels where the user didn't read anything new on google reader.
This whole scheme can be applied to other statuses, such as starred or unstarred, as well.
Now, as you say, this
...would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.
True enough. Neither keeping a local read/unread state (since you're keeping a database of all of the items anyway) nor marking items read in google (which the API supports) seems very difficult, so why doesn't this work for you?
There is one further hitch, however: the user may mark something read as unread on google. This throws a bit of a wrench into the system. My suggestion there, if you really want to try to take care of this, is to assume that the user in general will be touching only more recent stuff, and download the latest couple hundred or so items every time, checking the status on all of them. (This isn't all that bad; downloading 100 items took me anywhere from 0.3s for 300KB, to 2.5s for 2.5MB, albeit on a very fast broadband connection.)
Again, if the user has a large number of subscriptions, he's also probably got a reasonably large number of labels, so doing this on a per-label basis will speed things up. I'd suggest, actually, that not only do you check on a per-label basis, but you also spread out the checks, checking a single label each minute rather than everything once every twenty minutes. You can also do this "big check" for status changes on older items less often than you do a "new stuff" check, perhaps once every few hours, if you want to keep bandwidth down.
This is a bit of bandwidth hog, mainly because you need to download the full article from Google merely to check the status. Unfortunately, I can't see any way around that in the API docs that we have available to us. My only real advice is to minimize the checking of status on non-new items.

The Google API hasn't yet been released, at which point this answer may change.
Currently, you would have to call the API and dis-regard items already downloaded, which as you said isn't terribly efficient as you will be re-downloading items every time, even if you already have them.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas