Extremely slow manual indexing? - nhibernate

I am trying to add about 21,000 entities already in the database into an nhibernate-search Lucene index. When done, the indexes are around 12 megabytes. I think the time can vary quite a bit, but it's always very slow. In my last run (running with the debugger), it took over 12 minutes to index the data.
private void IndexProducts(ISessionFactory sessionFactory)
{
    using (var hibernateSession = sessionFactory.GetCurrentSession())
    using (var luceneSession = Search.CreateFullTextSession(hibernateSession))
    {
        var tx = luceneSession.BeginTransaction();
        foreach (var prod in hibernateSession.Query<Product>())
        {
            luceneSession.Index(prod);
            hibernateSession.Evict(prod);
        }
        hibernateSession.Clear();
        tx.Commit();
    }
}
The vast majority of the time is spent in tx.Commit(). From what I've read of Hibernate search, this is to be expected. I've come across quite a few ways to help, such as MassIndexer, flushToIndexes, batch modes, etc. But as far as I can tell these are Java-only options.
The session clear and evict are just desperate moves by me - I haven't seen them make a difference one way or another.
Has anyone had success quickly indexing a large amount of existing data?

I've been able to speed up indexing considerably by using a combination of batching and transactions.
My initial code took ~30 minutes to index ~20,000 entities. Using the code below, I've got it down to ~4 minutes.
private void IndexEntities<TEntity>(IFullTextSession session) where TEntity : class
{
    var currentIndex = 0;
    const int batchSize = 500;
    while (true)
    {
        var entities = session
            .CreateCriteria<TEntity>()
            .SetFirstResult(currentIndex)
            .SetMaxResults(batchSize)
            .List();

        using (var tx = session.BeginTransaction())
        {
            foreach (var entity in entities)
            {
                session.Index(entity);
            }
            currentIndex += batchSize;
            session.Flush();
            tx.Commit();
            session.Clear();
        }

        if (entities.Count < batchSize)
            break;
    }
}

It depends on the Lucene options you can set. See this page and check whether nhibernate-search has wrappers for these options. If it doesn't, modify its source.
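For illustration, here is a rough sketch of the kind of IndexWriter knobs that page refers to, written against the Lucene.Net 2.9-style API that nhibernate-search builds on. Whether nhibernate-search exposes these as configuration settings depends on the version, so treat the method names and values as assumptions to verify and tune, not a drop-in fix:

// Illustrative only: tuning a Lucene.Net (2.9-style) IndexWriter directly.
// If nhibernate-search does not expose these knobs, they would have to be
// applied to the writer it creates internally (i.e. by modifying its source).
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

static IndexWriter CreateBulkLoadWriter(string indexPath)
{
    var directory = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath));
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
    var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

    writer.SetRAMBufferSizeMB(64);   // buffer more in RAM before flushing segments to disk
    writer.SetMergeFactor(30);       // merge segments less aggressively during a bulk load
    writer.SetMaxBufferedDocs(1000); // buffer more documents per flush
    return writer;
}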

Related

nhibernate : executing updates in batches

I am trying to do batch updates using NHibernate, but it is not batching the updates; it is doing individual writes for all the rows. I have to write around 10k rows to the DB.
using (var session = GetSessionFactory().OpenStatelessSession())
{
    session.SetBatchSize(100);
    using (var tx = session.BeginTransaction())
    {
        foreach (var pincode in list)
        {
            session.Update(pincode);
        }
        tx.Commit();
    }
}
I tried setting the batch size to 100 using session.SetBatchSize(100);, but that does not help. I also tried setting the batch size using cfg.SetProperty("adonet.batch_size", "100");, but that isn't helping either.
I am using GUID primary keys, so I don't understand the reason for the batch update failure. This is exactly the solution explained here, but it's not working for me.
NOTE: I have a version field for optimistic concurrency mapped on all the entities. Could that be the culprit for not getting batch updates?
EDIT
I tried using a stateful session, but that also did not help:
// example 2
using (var session = GetSessionFactory().OpenSession())
{
    session.SetBatchSize(100);
    session.FlushMode = FlushMode.Commit;
    foreach (var pincode in list)
    {
        session.Update(pincode);
    }
    session.Flush();
}

// example 3
using (var session = GetSessionFactory().OpenSession())
{
    session.SetBatchSize(100);
    using (var tx = session.BeginTransaction())
    {
        foreach (var pincode in list)
        {
            session.Update(pincode);
        }
        tx.Commit();
    }
}
Example 2, for some reason, is causing double round trips.
EDIT
After further research I found that each session.Update is actually updating the DB immediately:
using (var session = SessionManager.GetStatelessSession())
{
    session.SetBatchSize(100);
    foreach (var record in list)
    {
        session.Update(record);
    }
}
How can I avoid that?
EDIT
I tried playing with the flush mode as well, but that is not helping either:
using (var session = SessionManager.GetNewSession())
{
    session.FlushMode = FlushMode.Never;
    session.SetBatchSize(100);
    session.BeginTransaction();
    foreach (var pincode in list)
    {
        session.SaveOrUpdate(pincode);
    }
    session.Flush();
    session.Transaction.Commit();
}
EDIT 4
Even the version below is not working, even though I am fetching all the entities in the same session and updating and saving them in that same session only:
using (var session = SessionManager.GetSessionFactory().OpenSession())
{
    session.SetBatchSize(100);
    session.FlushMode = FlushMode.Commit;
    session.Transaction.Begin();
    var list = session.QueryOver<Pincode>().Take(1000).List();
    list.ForEach(x => x.Area = "Abcd" + DateTime.Now.ToString("HHmmssfff"));
    foreach (var pincode in list) session.SaveOrUpdate(pincode);
    session.Flush();
    session.Transaction.Commit();
}
You are using a stateless session. Since a stateless session has no state, it cannot remember anything to do later. Hence the update is executed immediately.
NHibernate does not batch versioned entities; that was the issue in my case.
There is no way to batch versioned entities; the only way to get batching is to make the entity non-versioned.
Note that:
Batches are not visible in SQL Server Profiler, so do not rely on that.
When inserting with identity (or native) id generators, NH turns off the ADO.NET batch size.
Additional notes:
Make sure that you do not issue a query for each changed entity, because NHibernate flushes before queries.
You probably should not call session.Update at all. In the best case it does nothing; in the worst case it really performs the update immediately, thus breaking batching.
When you have many objects in the session, don't forget to think about flushes and flush time. Sometimes flushing is more time consuming than updating. NH flushes before the commit, when you call Flush, and before queries, unless you turn that off or use a stateless session. Make sure that you only flush once.
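To make that concrete, here is a minimal sketch of a batch-friendly setup using the standard NHibernate configuration API. Pincode is the question's entity; the sketch assumes it is mapped without a version element and without an identity/native id generator (GUID ids are fine), per the notes above:

using System.Collections.Generic;
using NHibernate;
using NHibernate.Cfg;

public static class BatchUpdateExample
{
    public static ISessionFactory BuildFactory()
    {
        var cfg = new Configuration();
        cfg.Configure();                              // reads hibernate.cfg.xml
        cfg.SetProperty("adonet.batch_size", "100");  // turn on ADO.NET batching
        return cfg.BuildSessionFactory();
    }

    public static void BatchUpdate(ISessionFactory factory, IEnumerable<Pincode> pincodes)
    {
        // Batching only happens if Pincode is non-versioned and does not use
        // an identity/native id generator.
        using (IStatelessSession session = factory.OpenStatelessSession())
        using (ITransaction tx = session.BeginTransaction())
        {
            session.SetBatchSize(100);                // per-session override
            foreach (var pincode in pincodes)
            {
                session.Update(pincode);
            }
            tx.Commit();                              // updates go out in batches of 100
        }
    }
}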

RavenDB processing all documents of a certain type

I have a problem with updating all documents in a collection. What I need to do: iterate through ~2 million docs, load each doc into memory, parse HTML from one of the fields of the doc, and save the doc back to the DB.
I tried take/skip logic, with and without indexes, by Id, etc., but some records still remain unchanged (even when testing with 1000 records and 128 records per page). No new records are inserted while the documents are being updated. Simple patching (the patching API) does not work for this, as the update I need to perform is quite complex.
Please help with this. Thanks
Code:
public static int UpdateAll<T>(DocumentStore docDB, Action<T> updateAction)
{
    return UpdateAll(0, docDB, updateAction);
}

public static int UpdateAll<T>(int startFrom, DocumentStore docDB, Action<T> updateAction)
{
    using (var session = docDB.OpenSession())
    {
        int queryCount = 0;
        int start = startFrom;
        while (true)
        {
            var current = session.Query<T>().Take(128).Skip(start).ToList();
            if (current.Count == 0)
                break;

            start += current.Count;
            foreach (var doc in current)
            {
                updateAction(doc);
            }
            session.SaveChanges();
            queryCount += 2;
            if (queryCount >= 30)
            {
                return UpdateAll(start, docDB, updateAction);
            }
        }
    }
    return 1;
}
Move your session.SaveChanges(); to outside the while loop.
As per Raven's session design, you can only do 30 interactions with the database during any given instance of a session.
If you refactor your code to only SaveChanges() once (or very few times) per using block, it should work.
For more information, check out the Raven docs : Understanding The Session Object - RavenDB
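For illustration, a sketch of the loop restructured along those lines: a short-lived session per page of 128 documents, with exactly one SaveChanges() per session. The method name is hypothetical and the Skip/Take paging is kept from the question; only the session and SaveChanges handling changes:

using System;
using System.Linq;
using Raven.Client.Document;

public static void UpdateAllInPages<T>(DocumentStore docDB, Action<T> updateAction)
{
    int start = 0;
    while (true)
    {
        using (var session = docDB.OpenSession())
        {
            var page = session.Query<T>()
                .Skip(start)
                .Take(128)
                .ToList();

            if (page.Count == 0)
                break;

            foreach (var doc in page)
            {
                updateAction(doc);
            }

            session.SaveChanges(); // one write round trip per page, well under the request limit
            start += page.Count;
        }
    }
}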

RavenDB returns stale results after delete

We seem to have verified that RavenDB is getting stale results even when we use various flavors of "WaitForNonStaleResults". Following is the fully-functional sample code (written as a standalone test so that you can copy/paste it and run it as is).
public class Cart
{
    public virtual string Email { get; set; }
}

[Test]
public void StandaloneTestForPostingOnStackOverflow()
{
    var testDocument = new Cart { Email = "test@abc.com" };
    var documentStore = new EmbeddableDocumentStore { RunInMemory = true };
    documentStore.Initialize();
    using (var session = documentStore.OpenSession())
    {
        using (var transaction = new TransactionScope())
        {
            session.Store(testDocument);
            session.SaveChanges();
            transaction.Complete();
        }
        using (var transaction = new TransactionScope())
        {
            var documentToDelete = session
                .Query<Cart>()
                .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
                .First(c => c.Email == testDocument.Email);
            session.Delete(documentToDelete);
            session.SaveChanges();
            transaction.Complete();
        }
        RavenQueryStatistics statistics;
        var actualCount = session
            .Query<Cart>()
            .Statistics(out statistics)
            .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
            .Count(c => c.Email == testDocument.Email);
        Assert.IsFalse(statistics.IsStale);
        Assert.AreEqual(0, actualCount);
    }
}
We have tried every flavor of WaitForNonStaleResults and there is no change. Waiting for non-stale results seems to work fine for the update, but not for the delete.
Update
Some things which I have tried:
Using separate sessions for each action. Outcome: no difference. Same successes and failures.
Putting Thread.Sleep(500) before the final query. Outcome: success. If I sleep the thread for half a second, the count comes back zero like it should.
Re: my comment above on stale results, AllowNonAuthoritiveInformation wasn't working. Needing to put WaitForNonStaleResults on each query, which is the usual "answer" to this issue, feels like a massive "code smell" (as much as I normally hate the term, it seems completely appropriate here).
The only real solution I've found so far is:
var store = new DocumentStore(); // do whatever
store.DatabaseCommands.DisableAllCaching();
Performance suffers accordingly, but I think slower performance is far less of a sin than unreliable if not outright inaccurate results.
This is an old question, but I recently ran across this problem as well. I was able to work around it by changing the convention on the DocumentStore used by the session to make it wait for non stale as of last write:
session.DocumentStore.DefaultQueryingConsistency = ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;
This made it so that I didn't have to customize every query run after. That said, I believe this only works for queries. It definitely doesn't work on patches as I have found out through testing.
I would also be careful about this and only use it around the code that's needed as it can cause performance issues. You can set the store back to its default with the following:
session.DocumentStore.DefaultQueryingConsistency = ConsistencyOptions.None;
The problem isn't related to deletes; it is related to using TransactionScope. The problem here is that DTC transactions complete asynchronously.
To fix this issue, what you need to do is call:
session.Advanced.AllowNonAuthoritiveInformation = false;
Which will force RavenDB to wait for the transaction to complete.
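Applied to the test above, the setting goes on the session right after it is opened (property name spelled as in the client API of that era):

using (var session = documentStore.OpenSession())
{
    // Force the client to wait until the DTC transaction has actually completed
    // before treating loaded documents and query results as authoritative.
    session.Advanced.AllowNonAuthoritiveInformation = false;

    // ... store / delete / query as in the test above ...
}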

Delaying writes to SQL Server

I am working on an app and need to keep track of how many views a page has, almost like how SO does it. It is a value used to determine how popular a given page is.
I am concerned that writing to the DB every time a new view needs to be recorded will impact performance. I know this is borderline premature optimization, but I have experienced the problem before. Anyway, the value doesn't need to be real time; it is OK if it is delayed by 10 minutes or so. I was thinking that caching the data and doing one large write every X minutes should help.
I am running on Windows Azure, so the Appfabric cache is available to me. My original plan was to create some sort of compound key (PostID:UserID), and tag the key with "pageview". Appfabric allows you to get all keys by tag. Thus I could let them build up, and do one bulk insert into my table instead of many small writes. The table looks like this, but is open to change.
int PageID | guid userID | DateTime ViewTimeStamp
The website would still get the value from the database, writes would just be delayed, make sense?
I just read that the Windows Azure Appfabric cache does not support tag based searches, so it pretty much negates my idea.
My question is, how would you accomplish this? I am new to Azure, so I am not sure what my options are. Is there a way to use the cache without tag based searches? I am just looking for advice on how to delay these writes to SQL.
You might want to take a look at http://www.apathybutton.com (and the Cloud Cover episode it links to), which talks about a highly scalable way to count things. (It might be overkill for your needs, but hopefully it gives you some options.)
You could keep a queue in memory and on a timer drain the queue, collapse the queued items by totaling the counts by page and write in one SQL batch/round trip. For example, using a TVP you could write the queued totals with one sproc call.
That of course doesn't guarantee the view counts get written, since the queue is in memory and written out lazily, but page counts shouldn't be critical data and crashes should be rare.
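To make the TVP idea concrete, here is a hypothetical sketch. The table type dbo.PageViewCount and the stored procedure dbo.MergePageViewCounts are assumptions (you would have to create both); the point is that the whole drained queue goes to SQL in a single round trip:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void WriteCounts(string connectionString, IDictionary<int, int> countsByPage)
{
    // Shape the drained, collapsed counts as rows of the (assumed) table type.
    var table = new DataTable();
    table.Columns.Add("PageID", typeof(int));
    table.Columns.Add("ViewCount", typeof(int));
    foreach (var pair in countsByPage)
        table.Rows.Add(pair.Key, pair.Value);

    using (var con = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.MergePageViewCounts", con)) // hypothetical sproc
    {
        cmd.CommandType = CommandType.StoredProcedure;
        var p = cmd.Parameters.AddWithValue("@counts", table);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.PageViewCount"; // hypothetical user-defined table type

        con.Open();
        cmd.ExecuteNonQuery(); // one round trip for the whole batch
    }
}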
You might want to have a look at how the "diagnostics" feature in Azure works. Not because you would use diagnostics for what you are doing, but because it deals with a similar problem and may provide some inspiration. I am just about to implement a data auditing feature and I want to log that to table storage, so I also want to delay and bunch the updates together, and I have taken a lot of inspiration from diagnostics.
Now, the way Diagnostics in Azure works is that each role starts a little background "transfer" thread. So, whenever you write any traces then that gets stored in a list in local memory and the background thread will (by default) bunch all the requests up and transfer them to table storage every minute.
In your scenario, I would let each role instance keep track of a count of hits and then use a background thread to update the database every minute or so.
I would probably use something like a static ConcurrentDictionary (or one hanging off a singleton) on each web role, with each hit incrementing the counter for the page identifier. You'd need some thread-handling code to allow multiple requests to update the same counter in the list. Alternatively, just allow each "hit" to add a new record to a shared thread-safe list.
Then have a background thread increment the database once per minute with the number of hits per page since last time, and reset the local counters to 0, or empty the shared list if you are going with that approach (again, be careful about the multithreading and locking).
The important thing is to make sure your database update is atomic: if you read the current count from the database, increment it, and then write it back, you may have two different web role instances doing this at the same time and thus lose an update.
EDIT:
Here is a quick sample of how you could go about this.
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Threading;
using System;
using System.Collections.Generic;
using System.Linq;
class Program
{
    static void Main(string[] args)
    {
        // You would put this in your Application_Start for the web role
        Thread hitTransfer = new Thread(() => HitCounter.Run(new TimeSpan(0, 0, 1))); // You'd probably want the transfer to happen once a minute rather than once a second
        hitTransfer.Start();

        // Testing code - this just simulates various web threads being hit and adding hits to the counter
        RunTestWorkerThreads(5);
        Thread.Sleep(5000);

        // You would put the following line in your Application shutdown
        HitCounter.StopRunning(); // You could do some cleverer stuff with aborting threads, joining the thread etc but you probably won't need to
        Console.WriteLine("Finished...");
        Console.ReadKey();
    }

    private static void RunTestWorkerThreads(int workerCount)
    {
        Thread[] workerThreads = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workerThreads[i] = new Thread(
                (tagname) =>
                {
                    Random rnd = new Random();
                    for (int j = 0; j < 300; j++)
                    {
                        HitCounter.LogHit(tagname.ToString());
                        Thread.Sleep(rnd.Next(0, 5));
                    }
                });
            workerThreads[i].Start("TAG" + i);
        }
        foreach (var t in workerThreads)
        {
            t.Join();
        }
        Console.WriteLine("All threads finished...");
    }
}
public static class HitCounter
{
    private static System.Collections.Concurrent.ConcurrentQueue<string> hits;
    private static object transferlock = new object();
    private static volatile bool stopRunning = false;

    static HitCounter()
    {
        hits = new ConcurrentQueue<string>();
    }

    public static void LogHit(string tag)
    {
        hits.Enqueue(tag);
    }

    public static void Run(TimeSpan transferInterval)
    {
        while (!stopRunning)
        {
            Transfer();
            Thread.Sleep(transferInterval);
        }
    }

    public static void StopRunning()
    {
        stopRunning = true;
        Transfer();
    }

    private static void Transfer()
    {
        lock (transferlock)
        {
            var tags = GetPendingTags();
            var hitCounts = from tag in tags
                            group tag by tag into g
                            select new KeyValuePair<string, int>(g.Key, g.Count());
            WriteHits(hitCounts);
        }
    }

    private static void WriteHits(IEnumerable<KeyValuePair<string, int>> hitCounts)
    {
        // NOTE: I don't usually use sql commands directly and have not tested the below.
        // The idea is that the update should be atomic so even though you have multiple
        // web servers all issuing similar update commands, potentially at the same time,
        // they should all commit. I do urge you to test this part as I cannot promise this code
        // will work as-is.
        //using (SqlConnection con = new SqlConnection("xyz"))
        //{
        //    foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        //    {
        //        var cmd = con.CreateCommand();
        //        cmd.CommandText = "update hits set count = count + @count where tag = @tag";
        //        cmd.Parameters.AddWithValue("@count", hitCount.Value);
        //        cmd.Parameters.AddWithValue("@tag", hitCount.Key);
        //        cmd.ExecuteNonQuery();
        //    }
        //}
        Console.WriteLine("Writing....");
        foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
        {
            Console.WriteLine(String.Format("{0}\t{1}", hitCount.Key, hitCount.Value));
        }
    }

    private static IEnumerable<string> GetPendingTags()
    {
        List<string> hitlist = new List<string>();
        var currentCount = hits.Count();
        for (int i = 0; i < currentCount; i++)
        {
            string tag = null;
            if (hits.TryDequeue(out tag))
            {
                hitlist.Add(tag);
            }
        }
        return hitlist;
    }
}
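For completeness, a minimal sketch of the ConcurrentDictionary variant mentioned above, using the same illustrative hits(tag, count) table as the commented-out SQL in the sample; the atomic "count = count + @count" update is what keeps multiple web role instances from losing each other's writes:

using System.Collections.Concurrent;
using System.Data.SqlClient;

public static class PageHitCounter
{
    private static readonly ConcurrentDictionary<string, int> counts =
        new ConcurrentDictionary<string, int>();

    public static void LogHit(string tag)
    {
        counts.AddOrUpdate(tag, 1, (key, current) => current + 1);
    }

    // Called from a timer/background thread, e.g. once a minute.
    public static void Flush(string connectionString)
    {
        using (var con = new SqlConnection(connectionString))
        {
            con.Open();
            foreach (var tag in counts.Keys)
            {
                int pending;
                // TryRemove atomically takes whatever has accumulated; hits that
                // arrive afterwards simply start a fresh counter for that tag.
                if (!counts.TryRemove(tag, out pending) || pending == 0)
                    continue;

                using (var cmd = con.CreateCommand())
                {
                    // Server-side increment keeps the update atomic across instances.
                    cmd.CommandText = "update hits set count = count + @count where tag = @tag";
                    cmd.Parameters.AddWithValue("@count", pending);
                    cmd.Parameters.AddWithValue("@tag", tag);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}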

How do I read a large file from disk to database without running out of memory

I feel embarrassed to ask this question, as I feel like I should already know. However, given that I don't... I want to know how to read large files from disk into a database without getting an OutOfMemory exception. Specifically, I need to load CSV (or really tab-delimited) files.
I am experimenting with CSVReader and specifically this code sample, but I'm sure I'm doing it wrong. Some of their other code samples show how you can read streaming files of any size, which is pretty much what I want (only I need to read from disk), but I don't know what type of IDataReader I could create to allow this.
I am reading directly from disk, and my attempt to ensure I don't ever run out of memory by reading too much data at once is below. I can't help thinking that I should be able to use a BufferedFileReader or something similar, where I can point to the location of the file and specify a buffer size; and since CsvDataReader expects an IDataReader as its first parameter, it could just use that. Please show me the error of my ways, rid me of my GetData method with its arbitrary file-chunking mechanism, and help me out with this basic problem.
private void button3_Click(object sender, EventArgs e)
{
    totalNumberOfLinesInFile = GetNumberOfRecordsInFile();
    totalNumberOfLinesProcessed = 0;
    while (totalNumberOfLinesProcessed < totalNumberOfLinesInFile)
    {
        TextReader tr = GetData();
        using (CsvDataReader csvData = new CsvDataReader(tr, '\t'))
        {
            csvData.Settings.HasHeaders = false;
            csvData.Settings.SkipEmptyRecords = true;
            csvData.Settings.TrimWhitespace = true;
            for (int i = 0; i < 30; i++) // known number of columns for testing purposes
            {
                csvData.Columns.Add("varchar");
            }
            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(@"Data Source=XPDEVVM\XPDEV;Initial Catalog=MyTest;Integrated Security=SSPI;"))
            {
                bulkCopy.DestinationTableName = "work.test";
                for (int i = 0; i < 30; i++)
                {
                    bulkCopy.ColumnMappings.Add(i, i); // map First to first_name
                }
                bulkCopy.WriteToServer(csvData);
            }
        }
    }
}

private TextReader GetData()
{
    StringBuilder result = new StringBuilder();
    int totalDataLines = 0;
    using (FileStream fs = new FileStream(pathToFile, FileMode.Open, System.IO.FileAccess.Read, FileShare.ReadWrite))
    {
        using (StreamReader sr = new StreamReader(fs))
        {
            string line = string.Empty;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith("D\t"))
                {
                    totalDataLines++;
                    if (totalDataLines < 100000) // Arbitrary method of restricting how much data is read at once.
                    {
                        result.AppendLine(line);
                    }
                }
            }
        }
    }
    totalNumberOfLinesProcessed += totalDataLines;
    return new StringReader(result.ToString());
}
Actually, your code reads all the data from the file and keeps it in the TextReader (in memory). Then you read the data from the TextReader to save it to the server.
If the data is big, the data held in the TextReader causes the out-of-memory condition. Please try this way:
1) Read the data (each line) from the file.
2) Then insert each line into the server.
The out-of-memory problem will be solved, because only the current record is in memory while processing.
Pseudo code
begin tran
While (data = FilerReader.ReadLine())
{
insert into Table[col0,col1,etc] values (data[0], data[1], etc)
}
end tran
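A rough C# rendering of that pseudo code, assuming a tab-delimited file and placeholder column names (work.test is taken from the question); only the current line is ever held in memory:

using System.Data.SqlClient;
using System.IO;

static void LoadFileLineByLine(string pathToFile, string connectionString)
{
    using (var con = new SqlConnection(connectionString))
    {
        con.Open();
        using (var tx = con.BeginTransaction())
        {
            using (var reader = new StreamReader(pathToFile))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    var fields = line.Split('\t');
                    using (var cmd = con.CreateCommand())
                    {
                        cmd.Transaction = tx;
                        // Placeholder column names; map fields[] to your real schema.
                        cmd.CommandText = "insert into work.test (col0, col1) values (@col0, @col1)";
                        cmd.Parameters.AddWithValue("@col0", fields[0]);
                        cmd.Parameters.AddWithValue("@col1", fields[1]);
                        cmd.ExecuteNonQuery();
                    }
                }
            }
            tx.Commit();
        }
    }
}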
Probably not the answer you're looking for but this is what BULK INSERT was designed for.
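For reference, a minimal sketch of handing the whole job to SQL Server with BULK INSERT; the file path is an assumption and must be readable by the SQL Server instance itself, and the table name is taken from the question:

using System.Data.SqlClient;

static void BulkInsertFromServerPath(string connectionString)
{
    const string sql =
        @"BULK INSERT work.test
          FROM 'C:\data\import.txt'  -- hypothetical path, as seen by the SQL Server machine
          WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')";

    using (var con = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, con))
    {
        con.Open();
        cmd.ExecuteNonQuery();
    }
}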
I would just add a BufferedFileReader with a ReadLine method and do it exactly in the fashion above.
Basically, understand the responsibilities here:
The buffered file reader is the class that reads data from the file (buffer-wise); there should be a line reader too.
CSVReader is a utility class for reading the data, assuming that it's in the correct format.
SqlBulkCopy you are using anyway.
Second option:
You can go to the import facility of the database directly. If the format of the file is correct, and the whole point of the program is only this, that would be faster too.
I think you may have a red herring with the size of the data. Every time I come across this problem, it's not the size of the data but the number of objects created when looping over the data.
Look in your while loop adding records to the db within the method button3_Click(object sender, EventArgs e):
TextReader tr = GetData();
using (CsvDataReader csvData = new CsvDataReader(tr, '\t'))
Here you declare and instantiate two objects each iteration - meaning for each chunk of file you read you will instantiate 200,000 objects; the garbage collector will not keep up.
Why not declare the objects outside of the while loop?
TextReader tr = null;
CsvDataReader csvData = null;
This way, the GC will stand half a chance. You could prove the difference by benchmarking the while loop; you will no doubt notice a huge performance degradation after you have created just a couple of thousand objects.
pseudo code:
while (!EOF) {
while (chosenRecords.size() < WRITE_BUFFER_LIST_SIZE) {
MyRecord record = chooseOrSkipRecord(file.readln());
if (record != null) {
chosenRecords.add(record)
}
}
insertRecords(chosenRecords) // <== writes data and clears the list
}
WRITE_BUFFER_LIST_SIZE is just a constant that you set... bigger means bigger batches and smaller means smaller batches. A size of 1 is RBAR :).
If your operation is big enough that failing partway through is a realistic possibility, or if failing partway through could cost someone a non-trivial amount of money, you probably want to also write to a second table the total number of records processed so far from the file (including the ones you skipped) as part of the same transaction so that you can pick up where you left off in the event of partial completion.
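A hedged sketch of that checkpoint idea, with assumed table names (work.test for the data, work.import_progress for the running line count); the chunk and the progress row are written in the same transaction, so a restart can resume from the recorded line count:

using System.Data;
using System.Data.SqlClient;

static void WriteChunkWithCheckpoint(string connectionString, DataTable chunk, long linesProcessedSoFar)
{
    using (var con = new SqlConnection(connectionString))
    {
        con.Open();
        using (var tx = con.BeginTransaction())
        {
            // Bulk copy the chunk inside the same transaction as the checkpoint.
            using (var bulkCopy = new SqlBulkCopy(con, SqlBulkCopyOptions.Default, tx))
            {
                bulkCopy.DestinationTableName = "work.test";
                bulkCopy.WriteToServer(chunk);
            }

            // Record progress; assumes a row for this file already exists.
            using (var cmd = con.CreateCommand())
            {
                cmd.Transaction = tx;
                cmd.CommandText =
                    "update work.import_progress set lines_processed = @lines where file_name = @file";
                cmd.Parameters.AddWithValue("@lines", linesProcessedSoFar);
                cmd.Parameters.AddWithValue("@file", "import.txt"); // hypothetical file name
                cmd.ExecuteNonQuery();
            }

            tx.Commit();
        }
    }
}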
Instead of reading CSV rows one by one and inserting into the DB one by one, I suggest reading a chunk and inserting it into the database, repeating until the entire file has been read.
You can buffer in memory, say, 1000 CSV rows at a time, and then insert them into the database.
int MAX_BUFFERED = 1000;
int counter = 0;
List<List<String>> bufferedRows = new ...
while (scanner.hasNext())
{
    List<String> rowEntries = getData(scanner.getLine());
    bufferedRows.add(rowEntries);
    counter++;
    if (counter == MAX_BUFFERED)
    {
        // INSERT INTO DATABASE:
        // append all contents to a string buffer and create your SQL INSERT statement
        bufferedRows.clear(); // remove data so it can be GCed when the GC kicks in
        counter = 0;
    }
}
// insert any remaining buffered rows after the loop
}