Elasticsearch Scroll API asynchronous execution

I'm running an Elasticsearch 5.6 cluster with an index size of about 70 GB per day. At the end of each day we are asked to produce summarizations of each hour over the last 7 days. We are using the Java High Level REST Client, and given the number of documents each query returns, it is critical to scroll the results.
In order to take advantage of the CPUs we have and reduce the reading time, we were thinking about using the asynchronous version of search scroll, but we are missing an example, and at least the logic inside it, to move forward.
We have already checked the related Elastic documentation, but it's too vague:
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/5.6/java-rest-high-search-scroll.html#java-rest-high-search-scroll-async
We also asked in the Elastic discussion forum, as suggested, but it looks like nobody can answer it:
https://discuss.elastic.co/t/no-code-for-example-of-using-scrollasync-with-the-java-high-level-rest-client/165126
Any help on this would be much appreciated, and I'm surely not the only one with this requirement.

Here is the example code:
public class App {

    public static void main(String[] args) throws IOException, InterruptedException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(HttpHost.create("http://localhost:9200")));
        client.indices().delete(new DeleteIndexRequest("test"), RequestOptions.DEFAULT);
        for (int i = 0; i < 100; i++) {
            client.index(new IndexRequest("test", "_doc").source("foo", "bar"), RequestOptions.DEFAULT);
        }
        client.indices().refresh(new RefreshRequest("test"), RequestOptions.DEFAULT);

        SearchRequest searchRequest = new SearchRequest("test").scroll(TimeValue.timeValueSeconds(30L));
        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        String scrollId = searchResponse.getScrollId();
        System.out.println("response = " + searchResponse);

        SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId)
                .scroll(TimeValue.timeValueSeconds(30));

        // I was missing waiting for the results: the latch is released in the listener below.
        final CountDownLatch countDownLatch = new CountDownLatch(1);
        client.scrollAsync(scrollRequest, RequestOptions.DEFAULT, new ActionListener<SearchResponse>() {
            @Override
            public void onResponse(SearchResponse searchResponse) {
                System.out.println("response async = " + searchResponse);
                countDownLatch.countDown();
            }

            @Override
            public void onFailure(Exception e) {
                e.printStackTrace();
                countDownLatch.countDown();
            }
        });

        // Here we wait
        countDownLatch.await();

        // Clear the scroll if we finish before the keep-alive expires.
        // Otherwise it will be cleared automatically once the keep-alive is reached.
        ClearScrollRequest request = new ClearScrollRequest();
        request.addScrollId(scrollId);
        client.clearScrollAsync(request, RequestOptions.DEFAULT, new ActionListener<ClearScrollResponse>() {
            @Override
            public void onResponse(ClearScrollResponse clearScrollResponse) {
            }

            @Override
            public void onFailure(Exception e) {
            }
        });
        client.close();
    }
}
Thanks to David Pilato on the elastic discussion forum.

"summarizations of each hour for the last 7 days"
It sounds like you want to run aggregations on the data rather than fetch the raw documents: probably a date histogram at the first level, to aggregate on intervals of 1 hour, and inside that date histogram inner aggregations to produce the summarizations (metrics or buckets, depending on what is needed).
Starting with Elasticsearch v6.1 you can use the Composite Aggregation in order to get all the result buckets using paging. From the docs I linked:
"the composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation similarly to what scroll does for documents."
Unfortunately this option doesn't exist before v6.1, so you'll either need to upgrade Elasticsearch to use it, or find another way, such as breaking the request into multiple queries that together cover the 7-day requirement.
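For reference, here is a rough, untested sketch (not from the answer above) of how the hourly composite aggregation could be paged with the Java High Level REST Client on 6.1+. The "my-index-*" pattern, the "@timestamp" field and the "bytes" metric field are placeholders for your own mapping:
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregation;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.DateHistogramValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;
import java.util.Collections;
import java.util.Map;

public class HourlySummaries {

    // Pages through all hourly buckets of the last 7 days with a composite aggregation.
    static void summarize(RestHighLevelClient client) throws IOException {
        Map<String, Object> afterKey = null;
        do {
            CompositeValuesSourceBuilder<?> byHour = new DateHistogramValuesSourceBuilder("hour")
                    .field("@timestamp")                                   // placeholder timestamp field
                    .dateHistogramInterval(DateHistogramInterval.HOUR);
            CompositeAggregationBuilder composite =
                    new CompositeAggregationBuilder("hourly", Collections.singletonList(byHour))
                            .size(1000)
                            .subAggregation(AggregationBuilders.sum("total").field("bytes")); // placeholder metric
            if (afterKey != null) {
                composite.aggregateAfter(afterKey);                        // resume where the previous page stopped
            }
            SearchRequest request = new SearchRequest("my-index-*")        // placeholder index pattern
                    .source(new SearchSourceBuilder()
                            .size(0)                                       // we only want the buckets, not the hits
                            .query(QueryBuilders.rangeQuery("@timestamp").gte("now-7d/d"))
                            .aggregation(composite));
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            CompositeAggregation agg = response.getAggregations().get("hourly");
            agg.getBuckets().forEach(bucket ->
                    System.out.println(bucket.getKey() + " -> " + bucket.getDocCount()));
            afterKey = agg.afterKey();                                     // null once an empty page comes back
        } while (afterKey != null);
    }
}
On 5.6 the same date histogram with inner metric aggregations still works as a regular aggregation; you just lose the after-key paging, which is why the answer suggests upgrading or splitting the request into several smaller queries.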

Related

How to span a ConcurrentDictionary across load-balanced servers when using a SignalR hub with Redis

I have an ASP.NET Core web application set up with SignalR scaled out with Redis.
Using the built-in groups works fine:
Clients.Group("Group_Name");
and it survives multiple load-balanced servers. I'm assuming that SignalR persists those groups in Redis automatically, so all servers know which groups we have and who is subscribed to them.
However, in my situation, I can't just rely on Groups (or Users), because there is no way to map a connectionId (say, when overriding OnDisconnectedAsync, where only the connection id is known) back to its group; you always need the Group_Name to identify the group. I need that to identify which part of the group is online, so that when OnDisconnectedAsync is called, I know which group this user belongs to and on which side of the conversation they are.
I've done some research, and the suggestions (including the Microsoft docs) all amount to using something like:
static readonly ConcurrentDictionary<string, ConversationInformation> connectionMaps;
in the hub itself.
Now, this is a great solution (and thread-safe), except that it exists only in the memory of one of the load-balanced servers, and the other servers have a different instance of this dictionary.
The question is: do I have to persist connectionMaps manually, using Redis for example?
Something like:
public class ChatHub : Hub
{
    static readonly ConcurrentDictionary<string, ConversationInformation> connectionMaps;

    ChatHub(IDistributedCache distributedCache)
    {
        connectionMaps = distributedCache.Get("ConnectionMaps");
        /// I think connectionMaps should not be static any more.
    }
}
And if yes, is it thread-safe? If not, can you suggest a better solution that works with load balancing?
I've been battling with the same issue on my end. What I've come up with is persisting the collections in the Redis cache, using a StackExchange.Redis.IDatabaseAsync together with locks to handle concurrency.
This unfortunately makes the entire process effectively synchronous, but I couldn't quite figure out a way around that.
Here's the core of what I'm doing. This acquires a lock and returns a deserialized collection from the cache:
private async Task<ConcurrentDictionary<int, HubMedia>> GetMediaAttributes(bool requireLock)
{
    if (requireLock)
    {
        var retryTime = 0;
        try
        {
            while (!await _redisDatabase.LockTakeAsync(_mediaAttributesLock, _lockValue, _defaultLockDuration))
            {
                // Wait till we can get a lock on the data, 100ms by default
                await Task.Delay(100);
                retryTime += 10;
                if (retryTime > _defaultLockDuration.TotalMilliseconds)
                {
                    _logger.LogError("Failed to get Media Attributes");
                    return null;
                }
            }
        }
        catch (TaskCanceledException e)
        {
            _logger.LogError("Failed to take lock within the default 5 second wait time " + e);
            return null;
        }
    }
    var mediaAttributes = await _redisDatabase.StringGetAsync(MEDIA_ATTRIBUTES_LIST);
    if (!mediaAttributes.HasValue)
    {
        return new ConcurrentDictionary<int, HubMedia>();
    }
    return JsonConvert.DeserializeObject<ConcurrentDictionary<int, HubMedia>>(mediaAttributes);
}
I update the collection like this after I've finished manipulating it:
private async Task<bool> UpdateCollection(string redisCollectionKey, object collection, string lockKey)
{
    var success = false;
    try
    {
        success = await _redisDatabase.StringSetAsync(redisCollectionKey, JsonConvert.SerializeObject(collection, new JsonSerializerSettings
        {
            ReferenceLoopHandling = ReferenceLoopHandling.Ignore
        }));
    }
    finally
    {
        await _redisDatabase.LockReleaseAsync(lockKey, _lockValue);
    }
    return success;
}
And when I'm done, I just make sure the lock is released for other instances to grab and use:
private async Task ReleaseLock(string lockKey)
{
    await _redisDatabase.LockReleaseAsync(lockKey, _lockValue);
}
I'd be happy to hear if you find a better way of doing this. I struggled to find any documentation on scaling out with data retention and sharing.

How to catch exceptions thrown by BigQueryIO.Write and rescue the data that failed to be written?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each record contains the ID of the table that it should be saved to.
There are various reasons why writing to BigQuery can fail:
The table ID format is wrong.
The dataset does not exist.
The dataset does not allow the pipeline to access it.
Network failure.
When one of these failures occurs, the streaming job retries the task and stalls. I tried using WriteResult.getFailedInserts() in order to rescue the bad data and avoid stalling, but it did not work well. Is there any good way to handle this?
Here is my code:
public class StarterPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

    public class MyData implements Serializable {
        String table_id;
    }

    public interface MyOptions extends PipelineOptions {
        @Description("PubSub topic to read from, specified as projects/<project_id>/topics/<topic_id>")
        @Validation.Required
        ValueProvider<String> getInputTopic();
        void setInputTopic(ValueProvider<String> value);
    }

    public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        Pipeline p = Pipeline.create(options);

        PCollection<MyData> input = p
            .apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
            .apply("ParseJSON", MapElements.into(TypeDescriptor.of(MyData.class))
                .via((String text) -> new Gson().fromJson(text, MyData.class)));

        WriteResult writeResult = input
            .apply("WriteToBigQuery", BigQueryIO.<MyData>write()
                .to(new SerializableFunction<ValueInSingleWindow<MyData>, TableDestination>() {
                    @Override
                    public TableDestination apply(ValueInSingleWindow<MyData> input) {
                        MyData myData = input.getValue();
                        return new TableDestination(myData.table_id, null);
                    }
                })
                .withSchema(new TableSchema().setFields(new ArrayList<TableFieldSchema>() {{
                    add(new TableFieldSchema().setName("table_id").setType("STRING"));
                }}))
                .withFormatFunction(new SerializableFunction<MyData, TableRow>() {
                    @Override
                    public TableRow apply(MyData myData) {
                        return new TableRow().set("table_id", myData.table_id);
                    }
                })
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));

        writeResult.getFailedInserts()
            .apply("LogFailedData", ParDo.of(new DoFn<TableRow, TableRow>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    TableRow row = c.element();
                    LOG.info(row.get("table_id").toString());
                }
            }));

        p.run();
    }
}
There is no easy way to catch exceptions thrown while writing to the output in a pipeline definition. I suppose you could do it by writing a custom PTransform for BigQuery, but there is no way to do it natively in Apache Beam. I also recommend against this because it undermines Cloud Dataflow's automatic retry functionality.
In your code example you have the failed-insert retry policy set to never retry. You can set the policy to always retry, but this only helps with something like an intermittent network failure (the 4th bullet point):
.withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
If the table ID format is incorrect (1st bullet point), then the CREATE_IF_NEEDED create disposition should allow the Dataflow job to automatically create a new table without error, even if the table ID is incorrect.
If the dataset does not exist or there is an access permission issue with the dataset (2nd and 3rd bullet points), then my opinion is that the streaming job should stall and ultimately fail. There is no way to proceed under any circumstances without manual intervention.
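As a complement (this is not from the answer above, just a common workaround): if bad table IDs are a frequent cause of stalls, you could validate the destination yourself before the BigQuery write and route invalid records to a dead-letter output using a multi-output ParDo. A rough sketch, with a made-up TableIdValidation class and an illustrative regex:
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class TableIdValidation {

    // Tags identifying the "good" and "bad" outputs of the validation step.
    static final TupleTag<String> VALID = new TupleTag<String>() {};
    static final TupleTag<String> INVALID = new TupleTag<String>() {};

    // Splits incoming table IDs into records BigQueryIO can accept and records that
    // should go to a dead-letter sink (a log, a Pub/Sub topic, a GCS file, ...).
    static PCollectionTuple validate(PCollection<String> tableIds) {
        return tableIds.apply("ValidateTableId", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String id = c.element();
                // Accepts "project:dataset.table" or "dataset.table"; adjust to your own naming rules.
                if (id != null && id.matches("^([\\w.-]+:)?\\w+\\.\\w+$")) {
                    c.output(id);
                } else {
                    c.output(INVALID, id);
                }
            }
        }).withOutputTags(VALID, TupleTagList.of(INVALID)));
    }
}
Records routed to INVALID can then be logged or written somewhere for later inspection, while only VALID records go into BigQueryIO; this keeps known-unrecoverable records from stalling the streaming job.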

HBase secondary index with observer coprocessor, .put on the index table results in recursion

In an HBase database I want to create a secondary index by using an additional "linking" table. I have followed the example given in this answer: Create secondary index using coprocesor HBase
I am not very familiar with the entire concept of HBase, and I have read some examples on the issue of creating secondary indexes. I am attaching the coprocessor to a single table only, like this:
disable 'Entry2'
alter 'Entry2', METHOD => 'table_att', 'COPROCESSOR' => '/home/user/hbase/rootdir/hcoprocessors.jar|com.acme.hobservers.EntryParentIndex||'
enable 'Entry2'
Its source code is as follows:
public class EntryParentIndex extends BaseRegionObserver {

    private static final Log LOG = LogFactory.getLog(CoprocessorHost.class);
    private HTablePool pool = null;

    private final static String INDEX_TABLE = "EntryParentIndex";
    private final static String SOURCE_TABLE = "Entry2";

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        pool = new HTablePool(env.getConfiguration(), 10);
    }

    @Override
    public void prePut(
            final ObserverContext<RegionCoprocessorEnvironment> observerContext,
            final Put put,
            final WALEdit edit,
            final boolean writeToWAL)
            throws IOException {
        try {
            final List<KeyValue> filteredList = put.get(Bytes.toBytes("data"), Bytes.toBytes("parentId"));
            byte[] id = put.getRow(); // Get the Entry ID
            KeyValue kv = filteredList.get(0); // Get the Entry PARENT_ID
            byte[] parentId = kv.getValue();

            HTableInterface htbl = pool.getTable(Bytes.toBytes(INDEX_TABLE));

            // Create the row key for the index table
            byte[] p1 = concatTwoByteArrays(parentId, ":".getBytes()); // Insert a colon between the two UUIDs
            byte[] rowkey = concatTwoByteArrays(p1, id);
            Put indexput = new Put(rowkey);

            // The following call is setting up a strange? recursion, resulting
            // ...in the same prePut method being invoked again and again. Interestingly
            // ...the recursion limits itself to 6 times. The actual row does not
            // ...get inserted into the INDEX_TABLE
            htbl.put(indexput);

            htbl.close();
        }
        catch (IllegalArgumentException ex) { }
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        pool.close();
    }

    public static final byte[] concatTwoByteArrays(byte[] first, byte[] second) {
        byte[] result = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, result, first.length, second.length);
        return result;
    }
}
This executes when I perform a put on the SOURCE_TABLE.
There is a comment in the code (please look for it): "The following call is setting up a strange...".
I added a debugging print to the log confirming that the prePut method is executed only on the SOURCE_TABLE, and never on the INDEX_TABLE. Yet I don't understand why this strange recursion is happening, given that in the coprocessor I only execute one put, on the INDEX_TABLE.
I have also confirmed that only a single put action is performed on the source table.
I have fixed my problem. It turned out that I was adding the same observer multiple times, mistakenly thinking it was getting lost after an HBase restart.
Also, the reason the .put call to the INDEX_TABLE was not working is that I only set a row key on it and no column value, mistakenly thinking that this is possible. HBase did not appear to throw any exception; it just did not perform the put, with no information given, which may be confusing for newcomers to this technology. (Note that the empty catch (IllegalArgumentException ex) block in the coprocessor above would also silently swallow any validation error HBase raised for the empty Put.)
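For reference, a minimal sketch of an index write that includes an actual cell; this is not the author's final code, and the "data" family and "childId" qualifier are made up:
import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexWriteSketch {

    // Writes one index row (rowkey = parentId + ":" + childId) and stores the child id
    // as a real cell so that the Put is not empty.
    static void writeIndexRow(HTableInterface indexTable, byte[] rowkey, byte[] childId)
            throws IOException {
        Put indexPut = new Put(rowkey);
        // A Put with a row key but no columns has nothing to store; add at least one cell.
        indexPut.add(Bytes.toBytes("data"), Bytes.toBytes("childId"), childId);
        try {
            indexTable.put(indexPut);
        } finally {
            indexTable.close(); // returns the table to the HTablePool
        }
    }
}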

Way of working with Lucene index custom Collector

Can someone help me understand how to work with custom implementations of the abstract Collector class in Lucene?
I've implemented two ways of querying the index with some test text:
1. Total hits is equal to 2. Both file names are the same, hence the results size is equal to 1 because I keep them in a set.
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
LOG.info("Total hits " + topDocs.totalHits);
ScoreDoc[] scoreDosArray = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDosArray) {
    Document doc = searcher.doc(scoreDoc.doc);
    String fileName = doc.get(FILENAME_FIELD);
    results.add(fileName);
}
2. countCollect is equal to 2. Both documents from which I get file names in the collect method of the Collector are unique, hence the final results size is also equal to 2. The countNextReader variable is equal to 10 at the end of the logic.
private Set<String> doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
    final Set<String> results = new HashSet<String>();
    Collector collector = new Collector() {
        private int base;
        private Scorer scorer;
        private int countCollect;
        private int countNextReader;

        @Override
        public void collect(int doc) throws IOException {
            Document document = searcher.doc(doc);
            String filename = document.get(FILENAME_FIELD);
            results.add(filename);
            countCollect++;
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            this.base = ctx.docBase;
            countNextReader++;
        }

        @Override
        public String toString() {
            LOG.info("CountCollect: " + countCollect);
            LOG.info("CountNextReader: " + countNextReader);
            return null;
        }
    };
    searcher.search(query, collector);
    collector.toString();
    return results;
}
I don't understand why, within the collect method, I get different documents and different file names compared with the previous implementation. I would expect the same result, wouldn't I?
The Collector#collect method is the hotspot of a search request. It's called for every document that matches the query, not only the ones that you get back. In fact, you usually get back only the top documents, which are effectively the ones that you show to the users.
I would suggest not doing things like:
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
which would force Lucene to return too many documents.
Anyway, if you only have two matching documents (or you are asking for all the documents that match), the number of documents that you get back and the number of calls to the collect method should be the same.
The setNextReader method is something completely different that you shouldn't care that much about. Have a look at this article if you want to know more about AtomicReader and so on. To keep it short, Lucene stores data as segments, which are mini searchable inverted indexes. Every query is executed on each segment sequentially. Every time the search switches to the next segment, the setNextReader method is called to allow you to do operations at a segment level in the Collector. For example, the internal Lucene document id is unique only within the segment, thus you need to add docBase to it to make it unique within the whole index. That's why you need to store it when the segment changes and take it into account. Your countNextReader variable just contains the number of segments that have been analyzed for your query; it doesn't have anything to do with your documents.
Looking deeper at your Collector code I also noticed you are not taking into account the docBase when retrieving documents by id. This should fix it:
Document document = searcher.doc(doc + docBase);
Also keep in mind that loading a stored field within a Collector is not really a wise thing to do. It's going to make your searches really slow, because stored fields are loaded from disk. You usually load stored fields only for the subset of documents that you want to return. Within a Collector you usually load the information needed to score documents, like payloads or similar things, often making use of the Lucene field cache too.
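Putting the docBase fix together with the structure from the question, a minimal corrected Collector could look like the sketch below (with the caveat from the previous paragraph that loading a stored field for every collected document is slow):
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;

public class FilenameCollectorSketch {

    // Collects the file names of all matching documents, resolving the per-segment
    // doc id to an index-wide id with docBase before loading the stored field.
    static Set<String> collectFilenames(final IndexSearcher searcher, Query query,
                                        final String filenameField) throws IOException {
        final Set<String> results = new HashSet<String>();
        searcher.search(query, new Collector() {
            private int docBase;

            @Override
            public void setNextReader(AtomicReaderContext ctx) throws IOException {
                // Called once per segment; remember its offset into the whole index.
                docBase = ctx.docBase;
            }

            @Override
            public void collect(int doc) throws IOException {
                // doc is segment-relative; add docBase to get the searcher-wide id.
                results.add(searcher.doc(doc + docBase).get(filenameField));
            }

            @Override
            public void setScorer(Scorer scorer) throws IOException {
                // Scores are not needed here.
            }

            @Override
            public boolean acceptsDocsOutOfOrder() {
                return true;
            }
        });
        return results;
    }
}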

Delaying writes to SQL Server

I am working on an app and need to keep track of how many views a page has, almost like how SO does it. It is a value used to determine how popular a given page is.
I am concerned that writing to the DB every time a new view needs to be recorded will impact performance. I know this is borderline premature optimization, but I have experienced the problem before. Anyway, the value doesn't need to be real time; it is OK if it is delayed by 10 minutes or so. I was thinking that caching the data and doing one large write every X minutes should help.
I am running on Windows Azure, so the AppFabric cache is available to me. My original plan was to create some sort of compound key (PostID:UserID) and tag the key with "pageview". AppFabric allows you to get all keys by tag, so I could let them build up and do one bulk insert into my table instead of many small writes. The table looks like this, but is open to change:
int PageID | guid userID | DateTime ViewTimeStamp
The website would still get the value from the database; the writes would just be delayed. Does that make sense?
I have just read that the Windows Azure AppFabric cache does not support tag-based searches, which pretty much negates my idea.
My question is: how would you accomplish this? I am new to Azure, so I am not sure what my options are. Is there a way to use the cache without tag-based searches? I am just looking for advice on how to delay these writes to SQL.
You might want to take a look at http://www.apathybutton.com (and the Cloud Cover episode it links to), which talks about a highly scalable way to count things. (It might be overkill for your needs, but hopefully it gives you some options.)
You could keep a queue in memory and, on a timer, drain the queue, collapse the queued items by totaling the counts per page, and write them in one SQL batch/round trip. For example, using a table-valued parameter (TVP) you could write the queued totals with one stored procedure call.
That of course doesn't guarantee the view counts get written, since they sit in memory and are written with a delay, but page counts shouldn't be critical data and crashes should be rare.
You might want to have a look at how the "diagnostics" feature in Azure works. Not because you would use diagnostics for what you are doing at all, but because it deals with a similar problem and may provide some inspiration. I am just about to implement a data auditing feature and I want to log that to table storage, so I also want to delay and batch the updates, and I have taken a lot of inspiration from diagnostics.
Now, the way diagnostics in Azure works is that each role starts a little background "transfer" thread. So, whenever you write any traces, they get stored in a list in local memory, and the background thread will (by default) bundle all the requests up and transfer them to table storage every minute.
In your scenario, I would let each role instance keep track of a count of hits and then use a background thread to update the database every minute or so.
I would probably use something like a static ConcurrentDictionary (or one hanging off a singleton) on each web role, with each hit incrementing the counter for the page identifier. You'd need some thread handling code to allow multiple requests to update the same counter in the list. Alternatively, just allow each "hit" to add a new record to a shared thread-safe list.
Then, have a background thread once per minute increment the database with the number of hits per page since last time, and reset the local counters to 0 or empty the shared list if you are going with that approach (again, be careful about the multithreading and locking).
The important thing is to make sure your database update is atomic; if you read the current count from the database, increment it and then write it back, you may have two different web role instances doing this at the same time and thus lose an update.
EDIT:
Here is a quick sample of how you could go about this.
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Threading;
using System;
using System.Collections.Generic;
using System.Linq;
class Program
{
    static void Main(string[] args)
    {
        // You would put this in your Application_Start for the web role
        Thread hitTransfer = new Thread(() => HitCounter.Run(new TimeSpan(0, 0, 1))); // You'd probably want the transfer to happen once a minute rather than once a second
        hitTransfer.Start();

        // Testing code - this just simulates various web threads being hit and adding hits to the counter
        RunTestWorkerThreads(5);
        Thread.Sleep(5000);

        // You would put the following line in your Application shutdown
        HitCounter.StopRunning(); // You could do some cleverer stuff with aborting threads, joining the thread etc but you probably won't need to

        Console.WriteLine("Finished...");
        Console.ReadKey();
    }

    private static void RunTestWorkerThreads(int workerCount)
    {
        Thread[] workerThreads = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workerThreads[i] = new Thread(
                (tagname) =>
                {
                    Random rnd = new Random();
                    for (int j = 0; j < 300; j++)
                    {
                        HitCounter.LogHit(tagname.ToString());
                        Thread.Sleep(rnd.Next(0, 5));
                    }
                });
            workerThreads[i].Start("TAG" + i);
        }
        foreach (var t in workerThreads)
        {
            t.Join();
        }
        Console.WriteLine("All threads finished...");
    }
}

public static class HitCounter
{
    private static System.Collections.Concurrent.ConcurrentQueue<string> hits;
    private static object transferlock = new object();
    private static volatile bool stopRunning = false;

    static HitCounter()
    {
        hits = new ConcurrentQueue<string>();
    }

    public static void LogHit(string tag)
    {
        hits.Enqueue(tag);
    }

    public static void Run(TimeSpan transferInterval)
    {
        while (!stopRunning)
        {
            Transfer();
            Thread.Sleep(transferInterval);
        }
    }

    public static void StopRunning()
    {
        stopRunning = true;
        Transfer();
    }

    private static void Transfer()
    {
        lock (transferlock)
        {
            var tags = GetPendingTags();
            var hitCounts = from tag in tags
                            group tag by tag
                            into g
                            select new KeyValuePair<string, int>(g.Key, g.Count());
            WriteHits(hitCounts);
        }
    }

    private static void WriteHits(IEnumerable<KeyValuePair<string, int>> hitCounts)
    {
        // NOTE: I don't usually use sql commands directly and have not tested the below
        // The idea is that the update should be atomic so even though you have multiple
        // web servers all issuing similar update commands, potentially at the same time,
        // they should all commit. I do urge you to test this part as I cannot promise this code
        // will work as-is
        //using (SqlConnection con = new SqlConnection("xyz"))
        //{
        //    foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        //    {
        //        var cmd = con.CreateCommand();
        //        cmd.CommandText = "update hits set count = count + @count where tag = @tag";
        //        cmd.Parameters.AddWithValue("@count", hitCount.Value);
        //        cmd.Parameters.AddWithValue("@tag", hitCount.Key);
        //        cmd.ExecuteNonQuery();
        //    }
        //}
        Console.WriteLine("Writing....");
        foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        {
            Console.WriteLine(String.Format("{0}\t{1}", hitCount.Key, hitCount.Value));
        }
    }

    private static IEnumerable<string> GetPendingTags()
    {
        List<string> hitlist = new List<string>();
        var currentCount = hits.Count();
        for (int i = 0; i < currentCount; i++)
        {
            string tag = null;
            if (hits.TryDequeue(out tag))
            {
                hitlist.Add(tag);
            }
        }
        return hitlist;
    }
}
}