RavenDB paging with session.load

I am running a batch job that selects a batch of 100 documents and then grabs all of the documents linked to them, possibly up to 25 for each.
I do the "join" using the ids from the first batch, so potentially I am calling session.load with 25 * 100 ids. I tried to implement paging, but it does not look possible using the load method, which returns an array.
What is the best practice here?

Best practice is to use the .Include method rather than doing what you are describing. You can read more in the documentation.
If you want to post some code of what you are doing now, I can provide a more detailed response.
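In the meantime, a rough sketch of the Include idea using the RavenDB Java client (the C# session API is analogous). The Batch and Child classes, the childIds field, and the database name are placeholders standing in for your own documents:

    import net.ravendb.client.documents.DocumentStore;
    import net.ravendb.client.documents.session.IDocumentSession;

    import java.util.List;

    public class BatchJob {

        // Placeholder documents standing in for your own types.
        public static class Batch {
            private List<String> childIds;
            public List<String> getChildIds() { return childIds; }
            public void setChildIds(List<String> childIds) { this.childIds = childIds; }
        }

        public static class Child {
        }

        public static void main(String[] args) {
            try (DocumentStore store = new DocumentStore("http://localhost:8080", "MyDatabase")) {
                store.initialize();
                try (IDocumentSession session = store.openSession()) {
                    // Page through the parent documents 100 at a time and ask the
                    // server to send the referenced documents back in the same
                    // response via include().
                    List<Batch> page = session.query(Batch.class)
                            .include("childIds")
                            .skip(0)
                            .take(100)
                            .toList();

                    for (Batch batch : page) {
                        for (String childId : batch.getChildIds()) {
                            // Served from the session cache, no extra round trip,
                            // because the document was included above.
                            Child child = session.load(Child.class, childId);
                            // ... process child ...
                        }
                    }
                }
            }
        }
    }

The point is that the 25 * 100 linked documents never need a separate load call against the server; they arrive with the page of parents, and you page by moving the skip value forward.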

Related

Implementation of last comment and comments count SQL

Description
I am developing an app that has Posts in it, and for each post users can comment and like it.
I am running a PG db on the server side and the data is structured in three different tables: post, post_comments (with ref to post), post_likes (with ref to post).
In the feed of the app I want to display all the posts with the comments count, last comment, number of likes and the last user that liked the post.
I was wondering what the best approach is for creating the API calls, and I currently have two ideas in mind:
First Idea
Make one large request using a query with multiple joins and parse the result accordingly.
The downside I see in this approach is that the query will be very heavy, which will affect the load time of the user's feed: it has to run over post_comments, post_likes, etc., count all the rows, and then also retrieve the latest rows.
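For illustration, a sketch of what that single query could look like, wrapped in plain JDBC (column names such as created_at, body and user_id are assumptions about the schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class FeedQuery {

        // One round trip: the counts and the "latest" rows are computed per post
        // by PostgreSQL itself.
        private static final String FEED_SQL = """
            SELECT p.id,
                   (SELECT COUNT(*) FROM post_comments c WHERE c.post_id = p.id) AS comments_count,
                   (SELECT c.body FROM post_comments c WHERE c.post_id = p.id
                     ORDER BY c.created_at DESC LIMIT 1)                         AS last_comment,
                   (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id)    AS likes_count,
                   (SELECT l.user_id FROM post_likes l WHERE l.post_id = p.id
                     ORDER BY l.created_at DESC LIMIT 1)                         AS last_like_user
              FROM post p
             ORDER BY p.created_at DESC
             LIMIT 50
            """;

        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/app", "app_user", "secret");
                 PreparedStatement ps = conn.prepareStatement(FEED_SQL);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("post %d: %d comments, %d likes, last comment: %s%n",
                            rs.getLong("id"),
                            rs.getLong("comments_count"),
                            rs.getLong("likes_count"),
                            rs.getString("last_comment"));
                }
            }
        }
    }

With indexes on post_comments(post_id, created_at) and post_likes(post_id, created_at), each per-post subquery is an index lookup rather than a scan, which is the main factor in how heavy this query actually ends up being.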
Second Idea
Add an extra table, which I will call post_meta, that stores exactly those parameters and gets updated when needed.
This approach makes the read query much lighter and faster (faster loading time), but increases the cost of adding and updating comments and likes.
I was wondering if someone could give me some insight into the preferred way to tackle this problem.
Thanks

Mule batch processing vs foreach vs splitter-aggregator

In Mule, I have quite a lot of records to process, where the processing includes some calculations, going back and forth to the database, etc. We can process collections of records with these options:
Batch processing
ForEach
Splitter-Aggregator
So what are the main differences between them? When should we prefer one to others?
Mule's batch processing option does not seem to support batch-job-scoped variable definitions, for example. Or what if I want to use multithreading to speed up the overall task? Or which option is better if I want to modify the payload during processing?
When you write "quite a lot", I assume it's too much for main memory. That rules out splitter/aggregator, because it has to collect all the records in order to return them as a list.
I assume you have your records in a stream or iterator; otherwise you probably have a memory problem anyway...
So when to use for-each and when to use batch?
For Each
The most simple solution, but it has some drawbacks:
It is single threaded (so may be too slow for your use case)
It is "fire and forget": You can't collect anything within the loop, e.g. a record count
There is no support for handling "broken" records
Within the loop, you can have several steps (message processors) to process your records (e.g. for the mentioned database lookup).
May be a drawback, may be an advantage: the loop is synchronous. (If you want to process asynchronously, wrap it in an async scope.)
Batch
A little more stuff to do / to understand, but more features:
When called from a flow, it is always asynchronous (this may be a drawback).
It can run standalone (e.g. with a poll inside to trigger it)
When the data generated in the loading phase is too big, it is automatically offloaded to disk.
Multithreading for free (number of threads configurable)
Handling for "broken records": Batch steps may be executed for good/broken records only.
You get statistics at the end (number of records, number of successful records, etc.)
So it looks like you are better off using batch.
With Splitter and Aggregator, you are responsible for writing the splitting logic and then joining the results back together at the end of processing. It is useful when you want to process records asynchronously, possibly on different servers. It is less reliable than the other options, but parallel processing is possible.
Foreach is more reliable, but it processes records iteratively using a single thread (synchronously), so parallel processing is not possible. Each record creates a single message by default.
Batch processing is designed to process millions of records quickly and reliably. By default, 16 threads will process your records.
Please go through the links below for more details:
https://docs.mulesoft.com/mule-user-guide/v/3.8/splitter-flow-control-reference
https://docs.mulesoft.com/mule-user-guide/v/3.8/foreach
I have been using the approach of passing the records as an array to a stored procedure.
You can call the stored procedure inside a for-each loop, setting the batch size of the loop accordingly to avoid round trips. I have used this approach and the performance is good. You may have to create another table to log results and put that logic in the stored procedure as well.
The link below has all the details:
https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
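Outside of Mule, a minimal JDBC sketch of the same pattern (the process_records procedure, the array element type and the connection details are assumptions; Oracle's driver does not support createArrayOf and has its own array-creation API, which is what the article above covers):

    import java.sql.Array;
    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.List;

    public class BulkCall {

        // Sends a whole batch of ids to the database in a single call instead of
        // doing one round trip per record.
        static void sendBatch(Connection conn, List<Long> ids) throws SQLException {
            Array idArray = conn.createArrayOf("bigint", ids.toArray());
            try (CallableStatement call = conn.prepareCall("{ call process_records(?) }")) {
                call.setArray(1, idArray);
                call.execute();
            } finally {
                idArray.free();
            }
        }

        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "app_user", "secret")) {
                sendBatch(conn, List.of(1L, 2L, 3L));
            }
        }
    }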

Why use multiple ElasticSearch indices for one web application?

In asking questions about using ES for web applications, suggestions have been made to have one index for things like user profiles, another index for data, and several other ones for logs.
With all of these on a cluster shared by several web applications, it seems like things could get messy or disorganized.
In that case, are people using one cluster per application? I am a bit confused because when I read articles about indexing logs, they seem to refer to storing the data in multiple indices, rather than types within an index.
Secondly, why not have one index per app, with types for logs, user profiles, data, etc.?
Is there some benefit to using multiple indices rather than many types within an index for a web application?
-- UPDATE --
To add to this, the comments in this question, Elastic search, multiple indexes vs one index and types for different data sets?, don't seem to go far enough in explaining why:
data retention: for application log/metric data, use different indexes if you require different retention periods
Is that recommended because it's just simpler to delete an entire index rather than a type within an index? Does it have to do with the way the data is stored and how space is recovered after deleting the data?
I found the primary reason for creating multiple indices that satisfies my quest for an answer in ElasticSearch's pagination documentation:
To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the requesting node, which then sorts all 50 results in order to select the overall top 10.

Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them!

You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query.
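The arithmetic in that quote is easy to reproduce. A small sketch of the per-request cost, using the five-shard example from the documentation:

    public class DeepPagingCost {

        // Each shard must produce its top (from + size) hits, and the coordinating
        // node then sorts shards * (from + size) candidates to keep just `size`.
        static long sortedAtCoordinator(int shards, int from, int size) {
            return (long) shards * (from + size);
        }

        public static void main(String[] args) {
            int shards = 5;
            int size = 10;
            for (int from : new int[]{0, 1_000, 10_000}) {
                long perShard = from + size;
                long sorted = sortedAtCoordinator(shards, from, size);
                System.out.printf("from=%6d: each shard returns %6d hits, coordinator sorts %6d and discards %6d%n",
                        from, perShard, sorted, sorted - size);
            }
            // from=10000 reproduces the example in the quote: 10,010 hits per shard,
            // 50,050 sorted at the coordinating node, 50,040 of them discarded.
        }
    }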

Applying pagination on Keen's extraction API

I have a large number of messages in a Keen collection and want to expose them to our end users through a paginated API. Is it possible to specify offset-like queries in Keen?
We previously had a traditional database, so we were able to support the above operations, and we are thinking of shifting to Keen because of its easier analysis capabilities.
It's not possible to paginate extractions.
We created the Extractions API to allow you to get your event data out of Keen IO any time you like. It's your data and we believe that you should always have full access to it! Think of extractions as a way to export data rather than a way to query it and you'll begin to understand how extractions are intended to be used.
Keen is great at collecting and analyzing data, but it's not great at being a database. You will struggle to provide the user experience your users deserve if you attempt to use extractions in a real-time, user-facing manner. Our recommendation for a use case like yours is to add a database layer that stores your entity data somewhere outside of Keen. Augment that entity data with the results of your queries from Keen and you'll be all set.
I hope this helps!
Terry's advice is sound, but if you can live with an approximation of pagination, then consider making multiple requests with non-overlapping timeframes.
For example, if you wanted to paginate over an hour's worth of data, you could issue extractions over one minute of data at a time until you reach the desired page size. You would keep track of where you left off in order to load the next "page", and so forth.
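A sketch of that windowing idea; fetchWindow is a hypothetical stub standing in for whatever call you use to run a Keen extraction limited to an absolute timeframe:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class TimeframePager {

        // Hypothetical stub: run an extraction restricted to [start, end) and
        // return the events. A real implementation would call Keen here.
        static List<String> fetchWindow(Instant start, Instant end) {
            return new ArrayList<>();
        }

        public static void main(String[] args) {
            final int pageSize = 100;
            Instant cursor = Instant.parse("2017-01-01T00:00:00Z"); // where the previous page ended
            Instant endOfData = Instant.parse("2017-01-01T01:00:00Z");

            // Build one "page" by walking forward in non-overlapping one-minute
            // windows until enough events have been collected or the data runs
            // out. Persist the cursor so the next page can resume from it.
            List<String> page = new ArrayList<>();
            while (page.size() < pageSize && cursor.isBefore(endOfData)) {
                Instant windowEnd = cursor.plus(Duration.ofMinutes(1));
                page.addAll(fetchWindow(cursor, windowEnd));
                cursor = windowEnd;
            }

            // Note this is an approximation: the last window may push the page
            // slightly past pageSize, and window boundaries, not exact offsets,
            // define where each page starts.
            System.out.println("page has " + page.size() + " events; next page starts at " + cursor);
        }
    }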

How to limit the number of records returned by a BAPI call using JCo3

When retrieving information from an SAP system, in certain cases we get results in the hundreds or thousands. In such cases, if we want to implement a kind of pagination mechanism, what options are available in JCo3?
First of all, how do we restrict the records to a desired number (100, 1000, etc.)? Where should we define this?
How do we continue to the next iteration of results with a limited number of records in each iteration/page?
That depends on the BAPI / function module you're using. If the BAPI supports pagination, fine; if it doesn't, JCo won't be able to help you out. You'll have to retrieve all the records and do the pagination in your application.
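As a sketch of that fallback with JCo3 (the destination name, the function name, and the parameter, table and field names are all placeholders; whether a row-limit import parameter exists at all depends on the particular BAPI):

    import com.sap.conn.jco.JCoDestination;
    import com.sap.conn.jco.JCoDestinationManager;
    import com.sap.conn.jco.JCoException;
    import com.sap.conn.jco.JCoFunction;
    import com.sap.conn.jco.JCoTable;

    public class BapiPager {

        public static void main(String[] args) throws JCoException {
            JCoDestination destination = JCoDestinationManager.getDestination("MY_SAP_SYSTEM");
            JCoFunction function = destination.getRepository().getFunction("SOME_BAPI_GETLIST");
            if (function == null) {
                throw new IllegalStateException("Function not found in the repository");
            }

            // If the BAPI offers a row-limit import parameter (some list BAPIs do),
            // set it so the server does the limiting for you.
            function.getImportParameterList().setValue("MAX_ROWS", 1000);

            function.execute(destination);

            // Otherwise the whole result comes back, and the "paging" happens
            // locally by slicing the returned table.
            JCoTable rows = function.getTableParameterList().getTable("RESULT_LIST");
            int pageSize = 100;
            int page = 0; // page index your application keeps track of between calls
            int start = page * pageSize;
            int end = Math.min(start + pageSize, rows.getNumRows());
            for (int i = start; i < end; i++) {
                rows.setRow(i);
                System.out.println(rows.getString("SOME_FIELD"));
            }
        }
    }

Slicing on the client keeps the SAP side simple, but every "page" still transfers the full result set, which is exactly the limitation described above.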