Guidelines for using lucene.net in a web service app? - wcf

I've just started reading up on Lucene.net, and I would like some of my REST-based web services to use its powerful search facilities.
However, I came across a link which said that I should create a Windows service (with WCF) to do all the Lucene searching, indexing, etc., because IIS recycles the application pool, which will cause all sorts of locking issues.
My question is, is this correct? If so, is there another way of resolving this problem without creating a windows service (with WCF)? Also since I have REST based services, would I make a call from these services to the Windows WCF service which would make things slower?

Indexing
During your reading you would have picked up that indexing is done using the IndexWriter class. Lucene will only allow 1 IndexWriter instance open at a time. When using the default locking it creates a lock file in the index directory and prevents any other IndexWriter instances from being created. For this reason it may be better to implement indexing in a process that you have more control over.
If your indexing process is terminated with extreme prejudice and your IndexWriter does not get closed, the lock on your index folder remains and no other instances will be allowed. Because of this, Lucene allows you to lift a lock from an index folder (using IndexWriter.unlock). This is a dangerous method, because if two IndexWriters are open on the same index it will corrupt the index. If you have a Windows service that performs the indexing, and it's the only process in your solution that does the indexing (and any updates), you can confidently unlock the index folder on startup of the service. In a web service based environment where you perform indexing from a web method, controlling and recovering from locking issues becomes problematic.
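To make the locking behaviour above concrete, here is a minimal sketch in Python of a lock file that allows only one writer, plus the "clear a stale lock on startup" step. It is not Lucene's implementation; the function names and the exception class are made up for illustration, and it assumes a single dedicated indexing process.

```python
import os

LOCK_NAME = "write.lock"  # illustrative lock file name inside the index folder


class IndexLockedError(RuntimeError):
    """Raised when another writer already holds the lock (hypothetical type)."""


def acquire_write_lock(index_dir):
    """Create the lock file atomically; fail if it already exists."""
    path = os.path.join(index_dir, LOCK_NAME)
    try:
        # O_EXCL makes the creation fail if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise IndexLockedError(path)
    os.close(fd)
    return path


def release_write_lock(index_dir):
    """Remove the lock file when the writer is closed cleanly."""
    os.remove(os.path.join(index_dir, LOCK_NAME))


def clear_stale_lock(index_dir):
    """Call once at service startup, and only if this process is the sole writer.

    Returns True if a leftover lock was removed, False if there was none.
    """
    try:
        os.remove(os.path.join(index_dir, LOCK_NAME))
        return True
    except FileNotFoundError:
        return False
```

A crashed writer never reaches release_write_lock, so the file stays behind; that is why the startup cleanup is only safe when nothing else can be writing.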
Searching
The IndexSearcher class is used for the searches. Searching in read-only mode can be done from your service-based code. I don't think it's necessary to create a separate set of WCF methods for this purpose.
Optimization
The index may need to be optimized periodically for performance, depending on the volumes. Once again, with the indexing in a separate process you can schedule the optimization nightly, weekly or whatever is required. Optimization is done by a call to one method.
Indexing new data
How and when to get the indexing process to index new data.... I don't know what data you're indexing so it's hard to tell. In my scenario I have WCF methods that are responsible for input data - high volume. I require the data that has been received to be available for searching as soon as possible. So,
my Model layer has a notification layer that when new records of the required type have been successfully committed, a simple notification message is inserted into a local queue in MSMQ.
The reason for MSMQ is that the queue is persisted and transactional, and any messages in it are still available after a crash or system reboot, allowing me to never (cough!) lose any messages.
The indexing service takes the notification, builds the Lucene Document and indexes the data.
The indexing service can also be triggered to do a full re-index by deleting the existing index and crawling the DB.
EDIT:
Example architecture:
WCF Service Methods take on data, committing it to the Model layer. The Model layer notifies a listening client that a CRUD operation occurred successfully on items. The listening client posts the notification in a queue.
Windows Service handles Indexing of data, watching the queue for indexing requests.
ASP.Net app provides user interface with search features.
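The queue-driven flow above could be sketched roughly like this. This is Python with SQLite standing in for MSMQ as the persisted queue, purely as an assumption for illustration; the table name and the index_record callback are hypothetical.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the persisted notification queue."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, record_id TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn, record_id):
    """Called by the listening client after a successful commit in the Model layer."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO index_queue (record_id) VALUES (?)", (record_id,))


def drain(conn, index_record):
    """Run by the indexing service: index pending notifications, oldest first.

    A notification is deleted only after index_record succeeds, so a crash
    or an exception leaves it in the queue for the next run.
    """
    processed = 0
    while True:
        row = conn.execute(
            "SELECT id, record_id FROM index_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return processed
        index_record(row[1])
        with conn:
            conn.execute("DELETE FROM index_queue WHERE id = ?", (row[0],))
        processed += 1
```

With a file path instead of ":memory:", the pending notifications survive a restart of the indexing service, which is the property the answer relies on MSMQ for.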

You can simply disable application pool recycling and host your application/service in IIS.
To disable recycling on config changes, use the disallowRotationOnConfigChange parameter.
You can also split your application in two parts: Index updates and searches.
Handle index updates from a Windows service, and have your IIS portion handle searches (read-only). You would do this by having a mechanism that detects index updates and refreshes the IndexSearchers. This way, if the performance penalty of using services is a concern for you, it won't impact search time, which is the important aspect for the users. With this configuration you can even have a master index update node, and distribute searches across different web servers in a farm. The only downside is that you don't have the near-real-time searching functionality that's built into the IndexWriter class.
http://wiki.apache.org/lucene-java/NearRealtimeSearch
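A rough sketch of the "detect index updates and refresh the searcher" mechanism, written in Python rather than against the Lucene.net API. Both callbacks (read_version and open_searcher) are hypothetical placeholders for whatever your index exposes.

```python
import threading


class SearcherHolder:
    """Hands out a shared searcher and swaps it when the index version changes."""

    def __init__(self, read_version, open_searcher):
        self._read_version = read_version    # hypothetical: returns the current index version
        self._open_searcher = open_searcher  # hypothetical: opens a read-only searcher
        self._lock = threading.Lock()
        self._version = read_version()
        self._searcher = open_searcher()

    def get(self):
        """Return a searcher that reflects the latest version seen on disk."""
        current = self._read_version()
        with self._lock:
            if current != self._version:
                # The index changed since the searcher was opened: reopen it.
                # A real implementation would also close the old searcher
                # once in-flight searches have finished with it.
                self._searcher = self._open_searcher()
                self._version = current
            return self._searcher
```

Each web request would call get() and search with the result; the reopen cost is paid only by the first request after an update.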
That being said, I've never had performance issues with setups that have the Lucene functions exposed over a WCF service, especially if you're running either on the same machine with NetNamedPipe or on a local LAN with NetTcp.

Related

Shared Elasticsearch Index

I'm working on a new implementation where I have some queries regarding the pros and cons of having a shared Database in a microservice architecture.
Context:
Service A listens to an event from Kafka and based on the parameters updates a particular table. This table is owned entirely by Service A and not shared. Some of the data in this table needs to be accessed by other services based on the value of a particular field.
My Approach:
Once the table is updated, if we know that this data might be required by some other service (by checking the value of the field), write it to an ES index. I want to keep the ES index shared across services.
The other services would read the ES index whenever required. These services would use the index only for read while Service A is the only service which writes to the index.
Also, I've added a fallback API in Service A which hits the table in case ES is down. Please check out the diagram, I've added a link to that below.
Issues:
One issue I can think of is that if ES is completely down then Service A won't be able to write to ES and hence that row update will fail. How do I handle this?
I also need help figuring out the fundamental scalability and deployment issues that can be counter productive to a microservice architecture by introducing a shared ES index. I think I have eliminated some of the resiliency issues by adding a fallback API for the other services in case ES is down.
Please criticise my design. Design Diagram
I see three options:
Option A: Service A needs to implement something equivalent to the two-phase commit protocol where an event consumed from Kafka by Service A would not be acknowledged until both the DB and ES have acknowledged their write.
This puts a big burden on your service: if one of the two sub-systems (DB and/or ES) goes down, it would have to spend time retrying and would not be able to consume more events from Kafka. Events would start piling up in the topic. 2PC is also hard to implement correctly in a distributed environment.
Option B: Service A consumes from Kafka topic A, does its thing and produces another event in another Kafka topic B. Two other consumer groups responsible for updating the sub-systems would then consume those events from topic B; one would keep updating the DB and the other would keep updating ES. Service A can do its job rapidly and not have to worry about or get bogged down with updates. Each update can be retried independently by each consumer group without impacting upstream event consumption. Eventually, everything will be in sync.
Option C: This is a more lightweight variation of option B. Service A consumes events from the Kafka topic, does its job and updates the DB as it does now. Another process (CDC, Logstash, etc.) consumes updates from the DB, updates ES asynchronously, and is also responsible for retrying if ES is down. Eventually, everything will be in sync as well.
There are other options, but these 3 are the most obvious ones to me.
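To illustrate the asynchronous path in option C, here is a minimal sketch in Python of a sync step that reads changes after a checkpoint and retries with backoff when the index is unreachable. It is not how CDC tools or Logstash work internally; fetch_changes, write_to_index and the checkpoint scheme are assumptions made for the example.

```python
import time


def sync_changes(fetch_changes, write_to_index, checkpoint,
                 max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Push rows changed after `checkpoint` to the search index.

    fetch_changes(checkpoint) yields (sequence_number, row) pairs in order.
    write_to_index(row) raises ConnectionError while the index is down.
    Returns the new checkpoint, i.e. the last sequence number written.
    """
    for seq, row in fetch_changes(checkpoint):
        for attempt in range(1, max_attempts + 1):
            try:
                write_to_index(row)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    # Give up for now; the next run resumes from this checkpoint.
                    return checkpoint
                sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        checkpoint = seq
    return checkpoint
```

Because the DB write has already succeeded before this step runs, an ES outage only delays the index; it no longer fails the row update in Service A.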

.Net Core Hosted Services in a Load Balanced Environment

We are developing a Web API using .Net Core. To perform background tasks we have used Hosted Services.
The system is hosted in an AWS Elastic Beanstalk environment with a load balancer, so Beanstalk creates and removes instances of the system based on the load.
Our problem is,
Since the background services also run inside the API, when the load balancer increases the number of instances, the number of background services increases too, and the same task may be executed multiple times. Ideally there should be only one instance of the background services.
One way to tackle this is to stop executing background services when in a load balanced environment and have a dedicated non-load balanced single instance environment for background services only.
That is a somewhat ugly solution. So,
1) Is there a better solution for this?
2) Is there a way to identify the primary instance while in a load balanced environment? If so I can conditionally register Hosted services.
Any help is really appreciated.
Thanks
I am facing the same scenario and am thinking of implementing a custom service architecture that runs normally on all of the instances but takes advantage of a pub/sub broker and a distributed memory service, so that those small services contact each other and coordinate what is to be done. It's complicated to develop, yes, but a very robust solution IMO.
You'll "have to" use a distributed "lock" system. For example, you could use a distributed memory cache that holds a lock while a node of your cluster is working on the background task. If another node tries to do the same job, it will be blocked by the first node's lock as long as the work isn't done yet.
What I mean is, if your nodes don't share a "sync handler", you can't handle this kind of situation. It could be a SQL app lock, a distributed memory cache, or something else.
There is something called a Mutex, but even that won't control this in a multi-instance environment. However, there are ways to control it to some level (maybe even 100%). One way would be to keep a tracker in the database. E.g., if the job has to run daily, before starting your job in the background service you might want to query the database for an entry for today; if there is none, you insert an entry and start your job.
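The database tracker idea can be sketched like this in Python, with SQLite standing in for the shared database that all instances would use in practice. The table and function names are hypothetical. Note one deliberate difference from "query, then insert": the sketch lets a primary key reject the second insert, so two instances checking at the same moment cannot both win.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open the tracker table; in production this would live in the shared DB."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_runs ("
        "job_name TEXT NOT NULL, run_date TEXT NOT NULL, "
        "instance_id TEXT NOT NULL, "
        "PRIMARY KEY (job_name, run_date))"
    )
    conn.commit()
    return conn


def try_claim(conn, job_name, run_date, instance_id):
    """Return True if this instance claimed today's run, False if another did."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO job_runs (job_name, run_date, instance_id) "
                "VALUES (?, ?, ?)",
                (job_name, run_date, instance_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The (job_name, run_date) row already exists: someone else has the job.
        return False
```

Each hosted service would call try_claim before doing its work and simply skip the run when it returns False.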

SQL Server 2005, Caches and all that jazz

Background to question: I'm looking to implement a caching system for my website. Currently we're exploring memcache as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has query cache which although is not distributed works as a sort of 'stop gap' measure. Is MySQL query cache equivalent to the buffer cache in SQL Server?
So here are my questions:
Is there a way to know what is currently stored in the buffer cache?
Follow up to this, is there a way to force certain tables or result sets into the cache?
How much control do I have over what goes on in the buffer and procedure cache? I understand there used to be a DBCC PINTABLE command but that has since been discontinued.
Slightly off topic: Should the caching even exist on the database layer? Or is it more prudent to manage caches using Velocity/Memcache? If so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.
Thanks!
SQL Server implements a buffer pool the same way every database product under the sun does (more or less) since System R showed the way. The gory details are explained in Transaction Processing: Concepts and Techniques. In addition, it has a caching framework used by the procedure cache, the permission token cache and many, many other caching classes. This framework is best described in Clock Hands - what are they for.
But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios, where a more powerful back-end database is able to respond faster to more queries by using these caches. The modern application stack, however, tends to scale out the web servers, and the real problem is caching query results in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcached has a long history by now: its uses and shortcomings are understood, and there is significant know-how around how to use, deploy, manage and monitor it.
The biggest problem with caching in the application layer, and especially with distributed caching, is cache invalidation: how to detect the changes that occur in the back-end data and mark cached entries invalid so that new requests don't use stale data.
The simplest (for some definition of simple...) alternative is proactive invalidation from the application. The code knows when it changes an entity in the database, and after the change occurs it takes the extra step of marking the cached entries invalid. This has several shortcomings:
It is difficult to know exactly which cached entries are to be invalidated. Dependencies can be quite complex; things are always more than just a simple table/entry: there are aggregate queries, joins, partitioned data, etc.
Code discipline is required to ensure all paths that modify data also invalidate the cache.
Changes to the data that occur outside the application scope are not detected. In practice, there are always changes that occur outside the application scope: other applications using the same data, import/export and ETL jobs, manual intervention etc etc.
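A tiny sketch of proactive invalidation, in Python with plain dictionaries standing in for both the database and the cache (an assumption for illustration; the class name is made up). It also shows the last shortcoming in action: a change made behind the application's back is never noticed.

```python
class CachedRepository:
    """Read-through cache with invalidation on every write made through it."""

    def __init__(self, store):
        self._store = store  # dict standing in for the database
        self._cache = {}     # dict standing in for memcached / Velocity
        self.db_reads = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.db_reads += 1
        value = self._store[key]
        self._cache[key] = value
        return value

    def update(self, key, value):
        self._store[key] = value
        # The extra step: drop the cached entry after changing the data.
        self._cache.pop(key, None)


store = {"user:1": "Ann"}
repo = CachedRepository(store)
repo.get("user:1")           # read from the store, then cached
repo.update("user:1", "Bob") # invalidates the cached entry
store["user:1"] = "Eve"      # out-of-band change, e.g. an ETL job
```

After these calls, repo.get("user:1") re-reads the store once and returns the out-of-band value; had the entry already been cached when the ETL job ran, the stale value would have been served instead.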
A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies are around to support this, though, since it cannot work without active support from the database. SQL Server has Query Notifications for such scenarios; you can read more about them at The Mysterious Notification. Implementing QN-based caching in a standalone application is fairly complicated (and often done badly), but it works fine when implemented correctly. Doing so in a shared, scaled-out cache like Memcached is quite a feat, but it is doable.
Nai,
Answers to your questions follow:
From Wiki - Always correct... ? :-). For a more Microsoft answer, here is their description on Buffer Cache.
Buffer management
SQL Server buffers pages in RAM to minimize disc I/O. Any 8 KB page can be buffered in-memory, and the set of all pages currently buffered is called the buffer cache. The amount of memory available to SQL Server decides how many pages will be cached in memory. The buffer cache is managed by the Buffer Manager. Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version. The page is updated on the disc by the Buffer Manager only if the in-memory cache has not been referenced for some time. While writing pages back to disc, asynchronous I/O is used whereby the I/O operation is done in a background thread so that other operations do not have to wait for the I/O operation to complete. Each page is written along with its checksum when it is written. When reading the page back, its checksum is computed again and matched with the stored version to ensure the page has not been damaged or tampered with in the meantime.
For this answer, please refer to the above answer:
Either reading from or writing to any page copies it to the buffer cache. Subsequent reads or writes are redirected to the in-memory copy, rather than the on-disc version.
You can query the bpool_commit_target and bpool_committed columns in the sys.dm_os_sys_info dynamic management view to return the number of pages reserved as the memory target and the number of pages currently committed in the buffer cache, respectively.
I feel like Microsoft has had time to figure out caching for their product and should be trusted.
I hope this information was helpful,
Thanks!
Caching can mean many different things for an ASP.Net application, spread from the browser all the way to your hardware, with IIS, the application and the database in the middle.
The caching you are talking about is Database level caching, this is mostly transparent to your application. This level of caching will include buffer pools, statement caches etc. Make sure your DB server has plenty of RAM. In theory a DB server should be able to load the entire DB store in memory. There is not much you can do at this level unless you pre-fetch some anticipated data when you start the application and ensure that it is in DB cache.
On the other hand, there are in-memory distributed caching systems. Apart from memcached and Velocity, you can look at some commercial solutions like NCache or Oracle Coherence. I have no experience with either of them, so I can't recommend one. This level of caching promises scalability at a lower cost; scaling the DB tier is expensive by comparison. You may have to consider aspects like network bandwidth, though. This type of caching, especially with invalidation and expiry, can be complicated.
You can cache at Web Service tier using output caching at IIS level (in IIS 7) and ASP.Net level.
At the application level you can use ASP.Net cache. This is the one that you can control most and gives you good benefits.
Then there is caching going on at client web proxy tier that can be controlled by cache-control HTTP header.
Finally you have browser level caching, view state and cookies for small data.
And don't forget that hardware like SAN caches at physical disk access level too.
In summary, caching can occur at many levels, and it is for you to analyse and implement the best solution for your scenario. You have to find out the stability and volatility of your data, the expected load, etc. I believe caching at the ASP.Net level (especially for objects) gives you the most flexibility and control.
Your specific technical questions about SQL Server's buffer cache are going down the wrong path when it comes to "implement a caching system for my website".
Sure, SQL Server is going to cache data so it can improve its performance (and it does so rather well), but the point of implementing a caching layer on your web front-ends is to avoid having to talk to the database at all - because there is still overhead and resource contention even when your query is fulfilled entirely from SQL Server's cache.
What you want to be looking into is: memcached, Velocity, ASP.NET Cache, P&P Caching Application Block, etc.

WCF/Silverlight/SQL DB Caching Strategies

Ok, I have a pretty complex silverlight app that gets its data from a WCF service (asp.net hosted service layer) which in turn calls into a data layer that calls stored procedures in a SQL 2005 DB to extract the needed data. So the round trip goes like this:
Silverlight App --> WCF Service --> Data Layer --> DB --> Data Layer --> WCF Service transforms Data Entity into corresponding DTO (Data Transfer Object) or List<> thereof --> Silverlight App
Much of the data is highly relational (so it needs to exist in the DB), but it will change infrequently. It seems that I have several choices of locations to cache this "semi-constant" data:
1. I can cache it in the data layer. My data layer is already set up to use the SqlDependency class and cache the results from a stored procedure call. I think that this is or can be an application level cache.
2. I can cache the resulting DTO in an application level (or session level depending on the call) cache within the WCF service itself.
2(a) I could even take this a step further by serializing the XML for the resulting DTO(s) into a file on the WCF service side so that I could (a) check memory cache, then (b) check file cache and (c) hit the data layer.
3. I could do something similar to 2(a) with isolated storage on the client side within the SL app. I could serialize the data to the local isolated storage with a hash (or a moddate or something) and then just make a call to check that.
One more thing to add: I am hosting this WCF service in IIS7 with dynamic compression turned on so that the (often very large and easily compressed) XML response gets gzip-ed. Ideally, it would seem, I would like IIS to cache this gzip-ed result to avoid all the extra processing. I think that it may do this already but I am not sure.
I am pretty sure that the final answer to this is some flavor of "it depends", but I would love to hear how others are approaching this. A good tactical recipe of "do X, test performance with tool Y, then do Z if needed" would be great to have.
A few links (I will add to this as I research this):
WCF Caching Approach
If you have user data that changes quite rarely and needs a fast response, a custom mechanism based on local storage is a great advantage: it is considerably faster than waiting for a server round trip.
Dino Esposito published an interesting article about local storage and caching in MSDN Magazine. There you can also find an approach to caching assemblies (imagine loading just the minimum package required and then loading the rest of the assemblies in the background... a performance rocket, at the cost of more complexity in your code :)).
As you said, it is a matter of weighing the trade-offs and deciding.
HTH
Braulio
My approach would be this:
Determine if there is actually a problem with performance (isn't it already acceptable to my users?)
Measure the performance at each tier (how long does it take the database to come up with data? how long does it take the service to respond with data? how much time does it take from the service to the client?)
Based on the measurements I would then determine where to do my caching. Remember that the closer to your data storage you do caching, the easier it is, but the closer to the client you do caching, the better the performance gain (usually).
Also remember that caching should not be the first thing to do to improve performance. You should look into other performance gains as well. Are the stored procedures slow? Is there a lot of overhead in the WCF messages? Is there some inefficient processing in the service? Do I really need all that data in one message?
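As a rough illustration of "measure each tier before deciding where to cache", here is a small timing helper sketched in Python. The tier names are just labels; in the real app you would wrap the stored procedure call, the service call and the client round trip with whatever profiling your platform offers.

```python
import time
from contextlib import contextmanager

timings = {}  # tier name -> list of elapsed seconds


@contextmanager
def timed(tier):
    """Record how long the wrapped block takes under the given tier name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(tier, []).append(time.perf_counter() - start)


def slowest_tier():
    """Return the tier with the highest average elapsed time."""
    return max(timings, key=lambda t: sum(timings[t]) / len(timings[t]))
```

Usage would look like `with timed("database"): run_query()` around each call you want to compare (run_query being your own code); the tier that slowest_tier() reports is the first candidate for caching or tuning.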
HTH,
Jonathan
I think #2 is your best bet for maintainability and architecture. IIS provides caching, why not use it?
You don't want to have to reference System.Web from a data layer. Client side is not the best option either, because you'd have to write a bunch of additional code to keep the data synchronized.
Is System.Web caching even available to WCF when it's not running in ASP.NET compatible mode? Probably best not to depend on it and write your own.
On the other hand, look into Microsoft's Velocity project, which looks like it will produce a very interesting caching technology not dependent on ASP.NET.
We just recently implemented #3, the client-side caching using Isolated Storage.
In our app we have a lot of drop-downs and custom fields which the app used to get from the server every time it loaded. Moving this data to IS really helped. The app now makes a call to check if there were any changes on the server, and if not, loads the data from IS; otherwise (which is pretty rare) it refreshes IS.
That eliminated a lot of WCF calls and data transfers, the SL pages' loading time is shorter, and the app in general became more scalable because of the reduced network traffic and DB access.
Yes, there is some coding involved, but the benefits for the end users are substantial.
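The check-then-load flow described above could look roughly like this. It is a Python sketch with a JSON file standing in for Isolated Storage; fetch_version and fetch_data are hypothetical stand-ins for the two WCF calls (a cheap "has anything changed?" call and the full download).

```python
import json
import os


def load_lookups(cache_path, fetch_version, fetch_data):
    """Return lookup data, downloading it only when the server version changed."""
    cached = None
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cached = json.load(f)

    server_version = fetch_version()  # small, cheap call
    if cached is not None and cached["version"] == server_version:
        return cached["data"]         # local copy is still current

    data = fetch_data()               # large call, needed only rarely
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"version": server_version, "data": data}, f)
    return data
```

The version could be a hash or a last-modified date, as the question suggests; the only requirement is that it changes whenever the data does.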
Andrew
If you use RIA Services, then a simple approach is to have two separate edmx definitions. One for cached entities, one for transactional ones.
One domain context can reference the entities on another domain context via AddReference.
The cached entities could be loaded immediately after user has authenticated. For simplicity, transactional data should not load until cached entities have loaded.
Depending on the size of the cache, you may also wish to consider serializing these values to local storage.

Index replication and Load balancing

I am using the Lucene API in my web portal, which is going to have thousands of concurrent users.
Our web server will call the Lucene API, which will be sitting on an app server. We plan to use 2 app servers for load balancing.
Given this, what should be our strategy for replicating Lucene indexes to the second app server? Any tips, please?
You could use solr, which contains built in replication. This is possibly the best and easiest solution, since it probably would take quite a lot of work to implement your own replication scheme.
That said, I'm about to do exactly that myself, for a project I'm working on. The difference is that since we're using PHP for the frontend, we've implemented Lucene in a socket server that accepts queries and returns a list of DB primary keys. My plan is to push changes to the server and store them in a queue, where I'll first store them in the memory index, and then flush the memory index to disk when the load is low enough.
Still, it's a complex thing to do and I'm set on doing quite a lot of work before we have a stable final solution that's reliable enough.
From experience, Lucene should have no problem scaling to thousands of users. That said, if you're only using your second app server for load balancing and not for failover situations, you should be fine hosting Lucene on only one of those servers and accessing it from the second server via NFS (if you have a Unix environment) or a shared directory (in a Windows environment).
Again, this is dependent on your specific situation. If you're talking about having millions (5 or more) of documents in your index and needing your Lucene index to support failover, you may want to look into Solr or Katta.
We are working on a similar implementation to what you are describing as a proof of concept. What we see as an end-product for us consists of three separate servers to accomplish this.
There is a "publication" server, that is responsible for generating the indices that will be used. There is a service implementation that handles the workflows used to build these indices, as well as being able to signal completion (a custom management API exposed via WCF web services).
There are two "site-facing" Lucene.NET servers. Access to the API is provided via WCF Services to the site. They sit behind a physical load balancer and will periodically "ping" the publication server to see if there is a more current set of indices than what is currently running. If there is, the server requests a lock from the publication server and updates the local indices by initiating a transfer to a local "incoming" folder. Once there, it is just a matter of suspending the searcher while the index is attached. It then releases its lock and the other server is free to do the same.
Like I said, we are only approaching the proof of concept stage with this, as a replacement for our current solution, which is a load balanced Endeca cluster. The size of the indices and the amount of time it will take to actually complete the tasks required are the larger questions that have yet to be proved out.
Just some random things that we are considering:
The downtime of a given server could be reduced if two local folders are used on each machine receiving data to achieve a "round-robin" approach.
We are looking to see if the load balancer allows programmatic access to have a node remove and add itself from the cluster. This would lessen the chance that a user experiences a hang if he/she accesses during an update.
We are looking at "request forwarding" in the event that cluster manipulation is not possible.
We looked at solr, too. While a lot of it just works out of the box, we have some bench time to explore this path as a learning exercise - learning things like Lucene.NET, improving our WF and WCF skills, and implementing ASP.NET MVC for a management front-end. Worst case scenario, we go with something like solr, but have gained experience in some skills we are looking to improve on.
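The two-local-folders ("round-robin") idea mentioned in the list above might be sketched like this in Python. The class and folder names are made up, and the sketch assumes the new index has already been transferred to an incoming folder.

```python
import os
import shutil


class TwoSlotIndex:
    """Keeps two index folders; searches use one while the other is replaced."""

    def __init__(self, root):
        self.slots = [os.path.join(root, "slot_a"), os.path.join(root, "slot_b")]
        for slot in self.slots:
            os.makedirs(slot, exist_ok=True)
        self.active = 0  # which slot searches currently read from

    def active_path(self):
        return self.slots[self.active]

    def install(self, incoming_dir):
        """Copy a freshly transferred index into the idle slot, then switch to it."""
        target = 1 - self.active
        shutil.rmtree(self.slots[target])
        shutil.copytree(incoming_dir, self.slots[target])
        # The searcher only needs to be paused for this switch,
        # not for the duration of the copy.
        self.active = target
        return self.slots[target]
```

After install returns, the searcher would be reopened on active_path(); the previously active folder becomes the target of the next update.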
I'm creating the Indices on the publishing Backend machines into the filesystem and replicate those over to the marketing.
That way every single load-balanced, failover-capable node has its own index without network latency.
The only drawback is that you shouldn't try to recreate the index within the replicated folder, as you'll have the lock file lying around at every node, blocking the IndexReader until your reindex has finished.