Delaying writes to SQL Server

I am working on an app and need to keep track of how many views a page has, almost like how SO does it. It is a value used to determine how popular a given page is.
I am concerned that writing to the DB every time a new view needs to be recorded will impact performance. I know this is borderline premature optimization, but I have experienced the problem before. Anyway, the value doesn't need to be real time; it is OK if it is delayed by 10 minutes or so. I was thinking that caching the data and doing one large write every X minutes should help.
I am running on Windows Azure, so the AppFabric cache is available to me. My original plan was to create some sort of compound key (PostID:UserID) and tag the key with "pageview". AppFabric allows you to get all keys by tag, so I could let the entries build up and do one bulk insert into my table instead of many small writes. The table looks like this, but is open to change.
int PageID | guid userID | DateTime ViewTimeStamp
The website would still read the value from the database; only the writes would be delayed. Does that make sense?
I just read that the Windows Azure AppFabric cache does not support tag-based searches, which pretty much negates my idea.
My question is, how would you accomplish this? I am new to Azure, so I am not sure what my options are. Is there a way to use the cache without tag-based searches? I am just looking for advice on how to delay these writes to SQL.

You might want to take a look at http://www.apathybutton.com (and the Cloud Cover episode it links to), which talks about a highly scalable way to count things. (It might be overkill for your needs, but hopefully it gives you some options.)

You could keep a queue in memory and, on a timer, drain the queue, collapse the queued items by totaling the counts by page, and write them in one SQL batch/round trip. For example, using a table-valued parameter (TVP) you could write the queued totals with one sproc call.
That of course doesn't guarantee the view counts get written, since the queue lives in memory and is written lazily, but page counts shouldn't be critical data and crashes should be rare.
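For illustration, here is a minimal, untested sketch of what that single-round-trip write could look like with a TVP (requires SQL Server 2008 or later). The table type dbo.PageHitTotals, the stored procedure dbo.AddPageHits, the dbo.Hits table and the HitWriter class are hypothetical names; only the shape of the approach is shown.
// Hypothetical SQL objects assumed by this sketch:
//   CREATE TYPE dbo.PageHitTotals AS TABLE (PageID int, HitCount int);
//   CREATE PROCEDURE dbo.AddPageHits @Totals dbo.PageHitTotals READONLY AS
//       UPDATE h SET h.ViewCount = h.ViewCount + t.HitCount
//       FROM dbo.Hits h JOIN @Totals t ON t.PageID = h.PageID;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class HitWriter
{
    public static void WriteTotals(string connectionString, IEnumerable<KeyValuePair<int, int>> totalsByPage)
    {
        // Shape the in-memory totals into the structure the table type expects.
        var table = new DataTable();
        table.Columns.Add("PageID", typeof(int));
        table.Columns.Add("HitCount", typeof(int));
        foreach (var total in totalsByPage)
        {
            table.Rows.Add(total.Key, total.Value);
        }

        using (var con = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.AddPageHits", con))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            var parameter = cmd.Parameters.AddWithValue("@Totals", table);
            parameter.SqlDbType = SqlDbType.Structured;
            parameter.TypeName = "dbo.PageHitTotals";

            con.Open();
            cmd.ExecuteNonQuery(); // one round trip for the whole batch
        }
    }
}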

You might want to have a look at how the "diagnostics" feature in Azure works. Not because you would use diagnostics for what you are doing, but because it is dealing with a similar problem and may provide some inspiration. I am just about to implement a data auditing feature that logs to table storage, so I also want to delay and bunch the updates together, and I have taken a lot of inspiration from diagnostics.
Now, the way diagnostics in Azure works is that each role starts a little background "transfer" thread. Whenever you write any traces, they get stored in a list in local memory, and the background thread will (by default) bunch all the requests up and transfer them to table storage every minute.
In your scenario, I would let each role instance keep track of a count of hits and then use a background thread to update the database every minute or so.
I would probably use something like a static ConcurrentDictionary (or one hanging off a singleton) on each web role, with each hit incrementing the counter for the page identifier. You'd need some thread-handling code to allow multiple requests to update the same counter in the dictionary. Alternatively, just allow each "hit" to add a new record to a shared thread-safe list.
Then, have a background thread once per minute increment the database with the number of hits per page since last time, and reset the local counters to 0 or empty the shared list if you are going with that approach (again, be careful about the multi-threading and locking).
The important thing is to make sure your database update is atomic: if you read the current count from the database, increment it and then write it back, you may have two different web role instances doing this at the same time and thus lose an update.
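As a rough, untested sketch of the first (dictionary) alternative described above - the EDIT below uses the shared thread-safe list alternative instead - something like the following could work. PageCounter and its members are hypothetical names.
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class PageCounter
{
    private static readonly ConcurrentDictionary<int, int> counts =
        new ConcurrentDictionary<int, int>();

    // Called from each request: AddOrUpdate retries internally, so concurrent
    // requests for the same page never lose an increment.
    public static void LogHit(int pageId)
    {
        counts.AddOrUpdate(pageId, 1, (key, current) => current + 1);
    }

    // Called from the background transfer thread: remove each counter atomically
    // and return the hits accumulated since the last transfer.
    public static IDictionary<int, int> TakeSnapshot()
    {
        var snapshot = new Dictionary<int, int>();
        foreach (var pageId in counts.Keys)
        {
            int hitsSinceLastTransfer;
            if (counts.TryRemove(pageId, out hitsSinceLastTransfer) && hitsSinceLastTransfer > 0)
            {
                snapshot[pageId] = hitsSinceLastTransfer;
            }
        }
        return snapshot;
    }
}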
EDIT:
Here is a quick sample of how you could go about this.
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Threading;
using System;
using System.Collections.Generic;
using System.Linq;
class Program
{
    static void Main(string[] args)
    {
        // You would put this in your Application_Start for the web role
        Thread hitTransfer = new Thread(() => HitCounter.Run(new TimeSpan(0, 0, 1))); // You'd probably want the transfer to happen once a minute rather than once a second
        hitTransfer.Start();

        // Testing code - this just simulates various web threads being hit and adding hits to the counter
        RunTestWorkerThreads(5);
        Thread.Sleep(5000);

        // You would put the following line in your Application shutdown
        HitCounter.StopRunning(); // You could do some cleverer stuff with aborting threads, joining the thread etc but you probably won't need to
        Console.WriteLine("Finished...");
        Console.ReadKey();
    }

    private static void RunTestWorkerThreads(int workerCount)
    {
        Thread[] workerThreads = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workerThreads[i] = new Thread(
                (tagname) =>
                {
                    Random rnd = new Random();
                    for (int j = 0; j < 300; j++)
                    {
                        HitCounter.LogHit(tagname.ToString());
                        Thread.Sleep(rnd.Next(0, 5));
                    }
                });
            workerThreads[i].Start("TAG" + i);
        }

        foreach (var t in workerThreads)
        {
            t.Join();
        }
        Console.WriteLine("All threads finished...");
    }
}
public static class HitCounter
{
    private static ConcurrentQueue<string> hits;
    private static object transferlock = new object();
    private static volatile bool stopRunning = false;

    static HitCounter()
    {
        hits = new ConcurrentQueue<string>();
    }

    public static void LogHit(string tag)
    {
        hits.Enqueue(tag);
    }

    public static void Run(TimeSpan transferInterval)
    {
        while (!stopRunning)
        {
            Transfer();
            Thread.Sleep(transferInterval);
        }
    }

    public static void StopRunning()
    {
        stopRunning = true;
        Transfer();
    }

    private static void Transfer()
    {
        lock (transferlock)
        {
            var tags = GetPendingTags();
            var hitCounts = from tag in tags
                            group tag by tag into g
                            select new KeyValuePair<string, int>(g.Key, g.Count());
            WriteHits(hitCounts);
        }
    }

    private static void WriteHits(IEnumerable<KeyValuePair<string, int>> hitCounts)
    {
        // NOTE: I don't usually use sql commands directly and have not tested the below
        // The idea is that the update should be atomic so even though you have multiple
        // web servers all issuing similar update commands, potentially at the same time,
        // they should all commit. I do urge you to test this part as I cannot promise this code
        // will work as-is
        //using (SqlConnection con = new SqlConnection("xyz"))
        //{
        //    foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        //    {
        //        var cmd = con.CreateCommand();
        //        cmd.CommandText = "update hits set count = count + @count where tag = @tag";
        //        cmd.Parameters.AddWithValue("@count", hitCount.Value);
        //        cmd.Parameters.AddWithValue("@tag", hitCount.Key);
        //        cmd.ExecuteNonQuery();
        //    }
        //}

        Console.WriteLine("Writing....");
        foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        {
            Console.WriteLine(String.Format("{0}\t{1}", hitCount.Key, hitCount.Value));
        }
    }

    private static IEnumerable<string> GetPendingTags()
    {
        List<string> hitlist = new List<string>();
        var currentCount = hits.Count;
        for (int i = 0; i < currentCount; i++)
        {
            string tag = null;
            if (hits.TryDequeue(out tag))
            {
                hitlist.Add(tag);
            }
        }
        return hitlist;
    }
}
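One untested variation on the commented-out SQL in WriteHits, shown as a drop-in alternative inside the HitCounter class above: a merge makes the "increment or insert" step a single atomic statement, so a tag that has never been written before still gets a row. The table and column names (hits, tag, count) follow the commented-out SQL and are assumptions; the with (holdlock) hint guards against two role instances trying to insert the same new tag at once.
private static void WriteHitsWithUpsert(IEnumerable<KeyValuePair<string, int>> hitCounts)
{
    using (SqlConnection con = new SqlConnection("your connection string"))
    {
        con.Open();
        foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        {
            using (var cmd = con.CreateCommand())
            {
                cmd.CommandText =
                    "merge hits with (holdlock) as target " +
                    "using (values (@tag, @count)) as source (tag, hitcount) " +
                    "on target.tag = source.tag " +
                    "when matched then update set target.[count] = target.[count] + source.hitcount " +
                    "when not matched then insert (tag, [count]) values (source.tag, source.hitcount);";
                cmd.Parameters.AddWithValue("@count", hitCount.Value);
                cmd.Parameters.AddWithValue("@tag", hitCount.Key);
                cmd.ExecuteNonQuery();
            }
        }
    }
}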

Related

Elastic Search Scroll API Asynchronous execution

I'm running an Elasticsearch 5.6 cluster with roughly 70 GB of index per day. At the end of the day we are asked to build summarizations of each hour for the last 7 days. We are using the Java High Level REST Client, and given the number of docs each query returns, it is critical to scroll the results.
In order to take advantage of the CPUs we have and decrease the reading time, we were thinking about using the asynchronous version of the search scroll, but we are missing an example, or at least the logic inside it, to move forward.
We already checked the related Elastic documentation, but it's too vague:
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/5.6/java-rest-high-search-scroll.html#java-rest-high-search-scroll-async
We also asked in the Elastic discussion forum, as they suggest, but it looks like nobody can answer it:
https://discuss.elastic.co/t/no-code-for-example-of-using-scrollasync-with-the-java-high-level-rest-client/165126
Any help on this will be very much appreciated, and I'm surely not the only one with this requirement.
Here is the example code:
public class App {
public static void main(String[] args) throws IOException, InterruptedException {
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(HttpHost.create("http://localhost:9200")));
client.indices().delete(new DeleteIndexRequest("test"), RequestOptions.DEFAULT);
for (int i = 0; i < 100; i++) {
client.index(new IndexRequest("test", "_doc").source("foo", "bar"), RequestOptions.DEFAULT);
}
client.indices().refresh(new RefreshRequest("test"), RequestOptions.DEFAULT);
SearchRequest searchRequest = new SearchRequest("test").scroll(TimeValue.timeValueSeconds(30L));
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = searchResponse.getScrollId();
System.out.println("response = " + searchResponse);
SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId)
.scroll(TimeValue.timeValueSeconds(30));
//I was missing to wait for the results
final CountDownLatch countDownLatch = new CountDownLatch(1);
client.scrollAsync(scrollRequest, RequestOptions.DEFAULT, new ActionListener<SearchResponse>() {
@Override
public void onResponse(SearchResponse searchResponse) {
System.out.println("response async = " + searchResponse);
// Release the latch so the main thread can continue.
countDownLatch.countDown();
}
@Override
public void onFailure(Exception e) {
// Release the latch on failure too, otherwise await() below would block forever.
countDownLatch.countDown();
}
});
//Here we wait
countDownLatch.await();
//Clear the scroll if we finish before the time to keep it alive.
//Otherwise it will be clear when the time is reached.
ClearScrollRequest request = new ClearScrollRequest();
request.addScrollId(scrollId);
client.clearScrollAsync(request, new ActionListener<ClearScrollResponse>() {
@Override
public void onResponse(ClearScrollResponse clearScrollResponse) {
}
@Override
public void onFailure(Exception e) {
}
});
client.close();
}
}
Thanks to David Pilato on the elastic discussion forum.
summarizations of each hour for the last 7 day
It sounds like you would like to run some aggregations on the data and not get the raw docs. Probably, at the first level, a date histogram in order to aggregate on intervals of 1 hour. Inside that date histogram you need an inner aggregation to run your summarizations - either metrics or buckets, depending on what is needed.
Starting with Elasticsearch v6.1 you can use the Composite Aggregation in order to get all the result buckets using paging. From the docs I linked:
the composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation similarly to what scroll does for documents.
Unfortunately this option doesn't exist before v6.1, so you'll either need to upgrade ES to use it, or find another way, such as breaking the work into multiple queries that together cover the 7-day requirement.

How to span a ConcurrentDictionary across load-balancer servers when using SignalR hub with Redis

I have an ASP.NET Core web application set up with SignalR scaled out with Redis.
Using the built-in groups works fine:
Clients.Group("Group_Name");
and survives multiple load balancers. I'm assuming that SignalR persists those groups in Redis automatically, so all servers know what groups we have and who is subscribed to them.
However, in my situation, I can't just rely on Groups (or Users), as there is no way to map the connectionId back to its group (say, when overriding OnDisconnectedAsync, where only the connection id is known), and you always need the Group_Name to identify the group. I need that to identify which part of the group is online, so when OnDisconnectedAsync is called, I know which group this user belongs to and on which side of the conversation he is.
I've done some research, and they all suggested (including Microsoft Docs) to use something like:
static readonly ConcurrentDictionary<string, ConversationInformation> connectionMaps;
in the hub itself.
Now, this is a great solution (and thread-safe), except that it exists only on one of the load-balancer server's memory, and the other servers have a different instance of this dictionary.
The question is, do I have to persist connectionMaps manually? Using Redis for example?
Something like:
public class ChatHub : Hub
{
static readonly ConcurrentDictionary<string, ConversationInformation> connectionMaps;
ChatHub(IDistributedCache distributedCache)
{
connectionMaps = distributedCache.Get("ConnectionMaps");
/// I think connectionMaps should not be static any more.
}
}
And if yes, is it thread-safe? If not, can you suggest a better solution that works with load-balancing?
I have been battling with the same issue on this end. What I've come up with is to persist the collections in the Redis cache, using a StackExchange.Redis.IDatabaseAsync alongside locks to handle concurrency.
This unfortunately makes the whole process effectively synchronous, but I couldn't quite figure a way around it.
Here's the core of what I'm doing; this acquires a lock and returns a deserialised collection from the cache:
private async Task<ConcurrentDictionary<int, HubMedia>> GetMediaAttributes(bool requireLock)
{
if(requireLock)
{
var retryTime = 0;
try
{
while (!await _redisDatabase.LockTakeAsync(_mediaAttributesLock, _lockValue, _defaultLockDuration))
{
//wait till we can get a lock on the data, retrying every 100ms
await Task.Delay(100);
retryTime += 100; // track elapsed wait so we can give up once _defaultLockDuration has passed
if (retryTime > _defaultLockDuration.TotalMilliseconds)
{
_logger.LogError("Failed to get Media Attributes");
return null;
}
}
}
catch(TaskCanceledException e)
{
_logger.LogError("Failed to take lock within the default 5 second wait time " + e);
return null;
}
}
var mediaAttributes = await _redisDatabase.StringGetAsync(MEDIA_ATTRIBUTES_LIST);
if (!mediaAttributes.HasValue)
{
return new ConcurrentDictionary<int, HubMedia>();
}
return JsonConvert.DeserializeObject<ConcurrentDictionary<int, HubMedia>>(mediaAttributes);
}
I update the collection like so after I've finished manipulating it:
private async Task<bool> UpdateCollection(string redisCollectionKey, object collection, string lockKey)
{
var success = false;
try
{
success = await _redisDatabase.StringSetAsync(redisCollectionKey, JsonConvert.SerializeObject(collection, new JsonSerializerSettings
{
ReferenceLoopHandling = ReferenceLoopHandling.Ignore
}));
}
finally
{
await _redisDatabase.LockReleaseAsync(lockKey, _lockValue);
}
return success;
}
and when I'm done I just ensure the lock is released for other instances to grab and use
private async Task ReleaseLock(string lockKey)
{
await _redisDatabase.LockReleaseAsync(lockKey, _lockValue);
}
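To show how the three helpers above might be composed, here is a minimal, untested sketch of a hub method. OnMediaStateChanged and the HubMedia.IsPlaying property are hypothetical names; everything else reuses the members shown above.
// Hypothetical hub method: take the lock, load the shared collection, mutate it,
// then write it back (UpdateCollection releases the lock in its finally block).
public async Task OnMediaStateChanged(int mediaId, bool isPlaying)
{
    var mediaAttributes = await GetMediaAttributes(requireLock: true);
    if (mediaAttributes == null)
        return; // could not obtain the lock in time

    try
    {
        // Mutate the local copy.
        mediaAttributes.AddOrUpdate(
            mediaId,
            key => new HubMedia { IsPlaying = isPlaying },
            (key, existing) => { existing.IsPlaying = isPlaying; return existing; });

        // Persist the updated collection back to Redis; this also releases the lock.
        await UpdateCollection(MEDIA_ATTRIBUTES_LIST, mediaAttributes, _mediaAttributesLock);
    }
    catch
    {
        // Make sure the lock is not left dangling if anything above throws.
        await ReleaseLock(_mediaAttributesLock);
        throw;
    }
}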
I'd be happy to hear if you find a better way of doing this. I struggled to find any documentation on scale-out with data retention and sharing.

RavenDB processing all documents of a certain type

I have a problem with updating all documents in a collection. What I need to do: iterate through ~2 million docs, load each doc into memory, parse HTML from one of the fields of the doc, and save the doc back to the DB.
I tried take/skip paging logic, with and without indexes, by Id etc., but some records still remain unchanged (even when testing with 1000 records and 128 records per page). No new records are being inserted while the update runs. Simple patching (the patching API) does not work for this, as the update I need to perform is quite complex.
Please help with this. Thanks.
Code:
public static int UpdateAll<T>(DocumentStore docDB, Action<T> updateAction)
{
return UpdateAll(0, docDB, updateAction);
}
public static int UpdateAll<T>(int startFrom, DocumentStore docDB, Action<T> updateAction)
{
using (var session = docDB.OpenSession())
{
int queryCount = 0;
int start = startFrom;
while (true)
{
var current = session.Query<T>().Take(128).Skip(start).ToList();
if (current.Count == 0)
break;
start += current.Count;
foreach (var doc in current)
{
updateAction(doc);
}
session.SaveChanges();
queryCount += 2;
if (queryCount >= 30)
{
return UpdateAll(start, docDB, updateAction);
}
}
}
return 1;
}
Move your session.SaveChanges(); call to outside the while loop.
As per Raven's session design, you can only do 30 interactions with the database during any given instance of a session.
If you refactor your code to call SaveChanges() only once (or very few times) per using block, it should work.
For more information, check out the Raven docs: Understanding The Session Object - RavenDB.
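A rough, untested sketch of one way to apply that advice: open a fresh session per page so each session stays well under the request limit, and call SaveChanges() once per page. It keeps the generic shape of the UpdateAll<T> helper from the question; UpdateAllBatched is a hypothetical name.
public static void UpdateAllBatched<T>(DocumentStore docDB, Action<T> updateAction)
{
    const int pageSize = 128; // matches the page size used in the question
    int start = 0;
    while (true)
    {
        // A fresh session per page keeps each session well under the request limit.
        using (var session = docDB.OpenSession())
        {
            var current = session.Query<T>()
                .Skip(start)
                .Take(pageSize)
                .ToList();

            if (current.Count == 0)
                return;

            foreach (var doc in current)
            {
                updateAction(doc);
            }

            session.SaveChanges(); // one write round trip per page
            start += current.Count;
        }
    }
}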

RavenDB returns stale results after delete

We seem to have verified that RavenDB is returning stale results even when we use various flavors of "WaitForNonStaleResults". Following is a fully functional sample (written as a standalone test so that you can copy/paste it and run it as is).
public class Cart
{
public virtual string Email { get; set; }
}
[Test]
public void StandaloneTestForPostingOnStackOverflow()
{
var testDocument = new Cart { Email = "test@abc.com" };
var documentStore = new EmbeddableDocumentStore { RunInMemory = true };
documentStore.Initialize();
using (var session = documentStore.OpenSession())
{
using (var transaction = new TransactionScope())
{
session.Store(testDocument);
session.SaveChanges();
transaction.Complete();
}
using (var transaction = new TransactionScope())
{
var documentToDelete = session
.Query<Cart>()
.Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
.First(c => c.Email == testDocument.Email);
session.Delete(documentToDelete);
session.SaveChanges();
transaction.Complete();
}
RavenQueryStatistics statistics;
var actualCount = session
.Query<Cart>()
.Statistics(out statistics)
.Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
.Count(c => c.Email == testDocument.Email);
Assert.IsFalse(statistics.IsStale);
Assert.AreEqual(0, actualCount);
}
}
We have tried every flavor of WaitForNonStaleResults and there is no change. Waiting for non-stale results seems to work fine for the update, but not for the delete.
Update
Some things which I have tried:
Using separate sessions for each action. Outcome: no difference. Same successes and failures.
Putting Thread.Sleep(500) before the final query. Outcome: success. If I sleep the thread for half a second, the count comes back zero like it should.
Re: my comment above on stale results, AllowNonAuthoritiveInformation wasn't working. Needing to put WaitForNonStaleResults in each query, which is the usual "answer" to this issue, feels like a massive "code smell" (as much as I normally hate the term, it seems completely appropriate here).
The only real solution I've found so far is:
var store = new DocumentStore(); // do whatever
store.DatabaseCommands.DisableAllCaching();
Performance suffers accordingly, but I think slower performance is far less of a sin than unreliable if not outright inaccurate results.
This is an old question, but I recently ran across this problem as well. I was able to work around it by changing the convention on the DocumentStore used by the session to make it wait for non stale as of last write:
session.DocumentStore.DefaultQueryingConsistency = ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;
This made it so that I didn't have to customize every query run after. That said, I believe this only works for queries. It definitely doesn't work on patches as I have found out through testing.
I would also be careful about this and only use it around the code that's needed as it can cause performance issues. You can set the store back to its default with the following:
session.DocumentStore.DefaultQueryingConsistency = ConsistencyOptions.None;
The problem isn't related to deletes; it is related to using TransactionScope. The problem here is that DTC transactions complete in an asynchronous manner.
To fix this issue, what you need to do is call:
session.Advanced.AllowNonAuthoritiveInformation = false;
Which will force RavenDB to wait for the transaction to complete.
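As a minimal sketch of where that could go in the test above (only the one property line is new; everything else reuses objects already defined in the test):
// Inside the existing using (var session = documentStore.OpenSession()) block,
// before the final count query:
session.Advanced.AllowNonAuthoritiveInformation = false;

RavenQueryStatistics statistics;
var actualCount = session
    .Query<Cart>()
    .Statistics(out statistics)
    .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
    .Count(c => c.Email == testDocument.Email);

Assert.IsFalse(statistics.IsStale);
Assert.AreEqual(0, actualCount);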

Extremely slow manual indexing?

I am trying to add about 21,000 entities already in the database into an nhibernate-search Lucene index. When done, the indexes are around 12 megabytes. I think the time can vary quite a bit, but it's always very slow. In my last run (running with the debugger), it took over 12 minutes to index the data.
private void IndexProducts(ISessionFactory sessionFactory)
{
using (var hibernateSession = sessionFactory.GetCurrentSession())
using (var luceneSession = Search.CreateFullTextSession(hibernateSession))
{
var tx = luceneSession.BeginTransaction();
foreach (var prod in hibernateSession.Query<Product>())
{
luceneSession.Index(prod);
hibernateSession.Evict(prod);
}
hibernateSession.Clear();
tx.Commit();
}
}
The vast majority of the time is spent in tx.Commit(). From what I've read of Hibernate Search, this is to be expected. I've come across quite a few ways to help, such as the MassIndexer, flushToIndexes, batch modes, etc., but as far as I can tell these are Java-only options.
The session Clear and Evict calls are just desperate moves on my part - I haven't seen them make a difference one way or another.
Has anyone had success quickly indexing a large amount of existing data?
I've been able to speed up indexing considerably by using a combination of batching and transactions.
My initial code took ~30 minutes to index ~20,000 entities. Using the code below I got it down to ~4 minutes.
private void IndexEntities<TEntity>(IFullTextSession session) where TEntity : class
{
var currentIndex = 0;
const int batchSize = 500;
while (true)
{
var entities = session
.CreateCriteria<TEntity>()
.SetFirstResult(currentIndex)
.SetMaxResults(batchSize)
.List();
using (var tx = session.BeginTransaction())
{
foreach (var entity in entities)
{
session.Index(entity);
}
currentIndex += batchSize;
session.Flush();
tx.Commit();
session.Clear();
}
if (entities.Count < batchSize)
break;
}
}
It depends on the Lucene options you can set. See this page and check if nhibernate-search has wrappers for these options. If it doesn't, modify its source.