Lucene index and Windows DFS replication

I want to replicate a Lucene index across my web servers periodically. Apart from Solr, can I set up DFS replication on my Windows 2008 servers and use that to replicate my indexes across my load-balanced web servers? Will that approach work, or will I have to write in parallel to two different index locations within my crawler code?
Any help is appreciated.
Thanks!

I'm not exactly sure of your question. You cannot have two writers writing to the same location no matter what file system you use. So if you want each of your NLB servers to write to the same location, no, that probably won't work.
If you just have one writer and want to replicate its results across many servers, that will work, but it will probably be quite a bit slower (one of the major Lucene recommendations is to not use networked file systems).

Related

Does SQL Server write log files in parallel when multiple databases are involved?

I've read quite a bit now on Stack Overflow and other sites that, with respect to SQL Server, giving a single database multiple log files does not help improve performance. A number of people have separately made blanket statements that splitting one database into many does not improve performance, but they haven't explained why. One of my colleagues insists that using multiple databases does in fact improve performance, because he says that log files can be written in parallel if multiple databases are used, thus reducing the transaction-log-related IO bottleneck. Unfortunately, I can't find anything online--either on SO or otherwise--to clearly support that position.
The web site and associated Windows services I'm developing will receive a huge amount of database traffic, so I've been told I need to split my database into multiple smaller databases so that the transaction logs don't cause a bottleneck (i.e., so that, for example, three heavily accessed tables in three separate databases can be updated simultaneously). I'm hesitant to do this because I'd lose the ability to use foreign keys and would thus lose referential integrity.
I sent my colleague a number of links that stated that multiple databases don't improve performance, but he responded back with this one:
https://dba.stackexchange.com/questions/62344/multiple-transaction-log-files-and-performance-impact
Notice how the top answer asserts "Transaction log writes are sequential. Only one of the log files will ever be written to at any one time, so having multiple files - in and of itself - can't possibly change your I/O patterns for that database."
Could anyone please shed some light on how transaction log IO works across multiple databases, and whether sequential logging is in fact a per-database limitation?

Here is another link that might be useful: https://www.sqlskills.com/blogs/jonathan/an-xevent-a-day-23-of-31-how-it-works-multiple-transaction-log-files/
The gist of it is that, at least with Microsoft SQL Server, you can see using Extended Events that one log file is filled before the engine moves to the next (it doesn't even alternate log records between multiple log files), which effectively prevents performance improvements.
I think there are probably edge cases, involving highly fragmented log files (i.e., with many small VLFs) and battery-backed caching controllers, where the fragmented writes, serial as far as the software is concerned, get parallelized by the controller(s). But I can't think of a real-world scenario where that would provide a performance advantage over using the same controllers and drives in a RAID setup.

SQL Server needs a single sequential log file per database in order to maintain its ACID properties. While you can create a second log file, you probably shouldn't except in an emergency, such as when the drive has run out of space and you don't have a better option at the time.
You can see for yourself what SQL Server does with a second log file on a test server. While under an active workload, watching in something like Resource Monitor or Perfmon, create a second log file. You'll see a brief amount of activity while it gets allocated and initialized, then the engine will ignore the new file as the virtual log file chain stays in the original file (as long as that file is healthy). If you have write contention on the log file, you may want to put that single file on a separate dedicated drive that is optimized for sequential write I/O performance, probably RAID 10.
For more on Write Ahead Logging and multiple log files see these articles:
https://technet.microsoft.com/en-us/library/ms186259%28v=sql.105%29.aspx
http://www.sqlskills.com/blogs/paul/multiple-log-files-and-why-theyre-bad/
Throwing out the foreign keys is usually a bad idea. SQL Server can actively use multiple database (not log) files; you can even pin specific objects to a specific file and partition single tables across different files.
How you configure things really depends on your I/O pattern. Of course, if your data doesn't need ACID properties and write performance is the most important thing, you could use some sort of NoSQL database. Microsoft has been doing a lot of work to bring Hadoop into the SQL Server ecosystem.
It is generally recommended to use multiple database files; check out Paul Randal's blog on SQLskills for more information.

What's the Point of Multiple Redis Databases?

So, I've come to a place where I want to segment the data I store in Redis into separate databases, as I sometimes need to use the KEYS command on one specific kind of data and want to separate it to make that faster.
If I segment into multiple databases, everything is still single threaded, and I still only get to use one core. If I just launch another instance of Redis on the same box, I get to use an extra core. On top of that, I can't name Redis databases, or give them any sort of more logical identifier. So, with all of that said, why/when would I ever want to use multiple Redis databases instead of just spinning up an extra instance of Redis for each extra database I want? And relatedly, why doesn't Redis try to utilize an extra core for each extra database I add? What's the advantage of being single threaded across databases?

You don't want to use multiple databases in a single Redis instance. As you noted, multiple instances let you take advantage of multiple cores. If you use database selection you will have to refactor when upgrading. Monitoring and managing multiple instances is neither difficult nor painful.
Indeed, you would get far better metrics on each db by segregation based on instance. Each instance would have stats reflecting that segment of data, which can allow for better tuning and more responsive and accurate monitoring. Use a recent version and separate your data by instance.
As Jonaton said, don't use the keys command. You'll find far better performance if you simply create a key index. Whenever adding a key, add the key name to a set. The keys command is not terribly useful once you scale up since it will take significant time to return.
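The key-index idea above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `InMemoryStore` is a hypothetical stand-in so the example runs without a server, and `set_indexed`, `keys_of_kind`, and the `index:*` names are made up for illustration. The method names mirror the `set`, `get`, `sadd`, and `smembers` calls of the redis-py client, so a real client could be passed in instead, assuming it is configured to decode responses.

```python
class InMemoryStore:
    """Hypothetical stand-in exposing the few redis-py style calls used below."""

    def __init__(self):
        self._strings = {}
        self._sets = {}

    def set(self, key, value):
        self._strings[key] = value

    def get(self, key):
        return self._strings.get(key)

    def sadd(self, name, *members):
        self._sets.setdefault(name, set()).update(members)

    def smembers(self, name):
        return set(self._sets.get(name, set()))


def set_indexed(client, index_name, key, value):
    """Store a value and record its key name in an index set."""
    client.set(key, value)
    client.sadd(index_name, key)


def keys_of_kind(client, index_name):
    """List keys of one kind by reading the index set, not by scanning the keyspace."""
    return client.smembers(index_name)


store = InMemoryStore()
set_indexed(store, "index:pages", "page:home", "<html>home</html>")
set_indexed(store, "index:pages", "page:about", "<html>about</html>")
set_indexed(store, "index:sessions", "session:42", "alice")

print(sorted(keys_of_kind(store, "index:pages")))  # ['page:about', 'page:home']
```

Against a real server the two writes in `set_indexed` are separate commands, so you would probably want to send them in a transaction, and remove the key name from the set whenever the key is deleted.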
Let the access pattern determine how to structure your data, rather than storing it the way you think works and then working around how to access and mince it later. You will see far better performance and find that the data-consuming code is often much cleaner and simpler.
Regarding single threading, consider that Redis is designed for speed and atomicity. Sure, actions modifying data in one db need not wait on another db, but what if that action is saving to the dump file, or processing transactions on slaves? At that point you start getting into the weeds of concurrent programming.
By using multiple instances you turn multithreading complexity into a simpler message-passing style system.

In principle, Redis databases on the same instance are no different than schemas in RDBMS database instances.
"So, with all of that said, why/when would I ever want to use multiple Redis databases instead of just spinning up an extra instance of Redis for each extra database I want?"
There's one clear advantage of using redis databases in the same redis instance, and that's management. If you spin up a separate instance for each application, and let's say you've got 3 apps, that's 3 separate redis instances, each of which will likely need a slave for HA in production, so that's 6 total instances. From a management standpoint, this gets messy real quick because you need to monitor all of them, do upgrades/patches, etc. If you don't plan on overloading redis with high I/O, a single instance with a slave is simpler and easier to manage provided it meets your SLA.

Even Salvatore Sanfilippo (creator of Redis) thinks it's a bad idea to use multiple DBs in Redis. See his comment here:
https://groups.google.com/d/topic/redis-db/vS5wX8X4Cjg/discussion
"I understand how this can be useful, but unfortunately I consider Redis multiple database errors my worst decision in Redis design at all... without any kind of real gain, it makes the internals a lot more complex. The reality is that databases don't scale well for a number of reason, like active expire of keys and VM. If the DB selection can be performed with a string I can see this feature being used as a scalable O(1) dictionary layer, that instead it is not.
With DB numbers, with a default of a few DBs, we are communication better what this feature is and how can be used I think. I hope that at some point we can drop the multiple DBs support at all, but I think it is probably too late as there is a number of people relying on this feature for their work."

I don't really know any benefits of having multiple databases on a single instance. I guess it's useful if multiple services use the same database server(s), so you can avoid key collisions.
I would not recommend building around the KEYS command, since it's O(n) and that doesn't scale well. What are you using it for that you could accomplish in another way? Maybe Redis isn't the best match for you if functionality like KEYS is vital.
I think they mention the benefits of a single-threaded server in their FAQ, but the main thing is simplicity: you don't have to bother with concurrency in any real way. Every action is blocking, so no two things can alter the database at the same time. Ideally you would have one (or more) instances per core of each server, and use a consistent hashing algorithm (or a proxy) to divide the keys among them. Of course, you'll lose some functionality: piping will only work for things on the same server, sorts become harder, etc.
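To illustrate dividing keys among several instances with consistent hashing, here is a rough sketch. The `HashRing` class, the node names, and the choice of 100 virtual points per node are all assumptions made for the example; real client libraries and proxies have their own implementations.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring mapping keys to node names."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random points
        # so that keys spread more evenly across nodes.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # Walk clockwise to the first point after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


nodes = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]  # hypothetical instances
ring = HashRing(nodes)
print(ring.node_for("user:1001"))
```

The point of the ring, compared with `hash(key) % number_of_nodes`, is that adding an instance only moves the keys that now belong to the new instance; every other key stays where it was.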

Redis databases can be used in the rare cases of deploying a new version of the application, where the new version requires working with different entities.

I know this question is years old, but there's another reason multiple databases may be useful.
If you use a "cloud Redis" from your favourite cloud provider, you probably have a minimum memory size and will pay for what you allocate. If however your dataset is smaller than that, then you'll be wasting a bit of the allocation, and so wasting a bit of money.
Using databases you could use the same Redis cloud-instance to provide service for (say) dev, UAT and production, or multiple instances of your application, or whatever else - thus using more of the allocated memory and so being a little more cost-effective.
A use-case I'm looking at has several instances of an application which use 200-300K each, yet the minimum allocation on my cloud provider is 1M. We can consolidate 10 instances onto a single Redis without really making a dent in any limits, and so save about 90% of the Redis hosting cost. I appreciate there are limitations and issues with this approach, but thought it worth mentioning.

I am using Redis to implement a blacklist of email addresses, and I have different TTL values for different levels of blacklisting, so having different DBs on the same instance helps me a lot.

Using multiple databases in a single instance may be useful in the following scenario:
Different copies of the same database could be used for production, development or testing using real-time data. People may use a replica to clone a Redis instance to achieve the same purpose. However, with the former approach it is easier for existing running programs to just select the right database to switch to the intended mode.

Our motivation has not been mentioned above. We use multiple databases because we routinely need to delete a large set of a certain type of data, and FLUSHDB makes that easy. For example, we can clear all cached web pages, using FLUSHDB on database 0, without affecting all of our other use of Redis.
There is some discussion here but I have not found definitive information about the performance of this vs scan and delete:
https://github.com/StackExchange/StackExchange.Redis/issues/873

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat-file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON-formatted files and some arrays in a simple project like this -- beyond my own preconceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.

If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).

It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat-files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions) and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gains in feature-set and performance.

Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.

What would be cool with a simple site that was fed via JSON and jQuery is that the site wouldn't need to reload on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (e.g. http://localhost/#about).
The problem is that if they are editing the raw JSON file, they can mess it up pretty quickly. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more involved than the site (though isn't that always the case with dynamic sites?).

What are the predicted data sizes for the CMS?
A large reason for using an RDBMS is quick, specific access to large amounts of data. The data format might not be large, but if there is a lot of data, then an RDBMS might be better in the long run.
Then again, if the CMS is designed right, then switching from flat files to an RDBMS should be as simple as using a different data access file.
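As a rough sketch of that "different data access file" idea: the rest of the CMS talks to one small interface, and the storage behind it can be swapped. Everything here is hypothetical (`PageStore`, `JsonFilePageStore`, `SqlPageStore`, `render_title`, the `pages` table), the project in question is PHP rather than Python, and the standard-library `sqlite3` module only stands in for whatever RDBMS would actually be used.

```python
import json
import os
import sqlite3
import tempfile
from abc import ABC, abstractmethod


class PageStore(ABC):
    """The only storage interface the rest of the CMS is allowed to use."""

    @abstractmethod
    def save(self, slug, page): ...

    @abstractmethod
    def load(self, slug): ...


class JsonFilePageStore(PageStore):
    """Flat-file backend: one JSON file per page."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, slug):
        # Assumes slug was validated upstream (no path separators).
        return os.path.join(self.directory, f"{slug}.json")

    def save(self, slug, page):
        with open(self._path(slug), "w", encoding="utf-8") as fh:
            json.dump(page, fh)

    def load(self, slug):
        with open(self._path(slug), encoding="utf-8") as fh:
            return json.load(fh)


class SqlPageStore(PageStore):
    """Database backend; sqlite3 is a stand-in for any RDBMS."""

    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (slug TEXT PRIMARY KEY, body TEXT)"
        )

    def save(self, slug, page):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (slug, body) VALUES (?, ?)",
            (slug, json.dumps(page)),
        )
        self.conn.commit()

    def load(self, slug):
        row = self.conn.execute(
            "SELECT body FROM pages WHERE slug = ?", (slug,)
        ).fetchone()
        return json.loads(row[0])


def render_title(store, slug):
    """Calling code depends only on PageStore, not on the backend."""
    return store.load(slug)["title"]


page = {"title": "About us", "body": "<p>Hello</p>"}

with tempfile.TemporaryDirectory() as tmp:
    file_store = JsonFilePageStore(tmp)
    file_store.save("about", page)
    file_title = render_title(file_store, "about")

sql_store = SqlPageStore(sqlite3.connect(":memory:"))
sql_store.save("about", page)
sql_title = render_title(sql_store, "about")

print(file_title, "|", sql_title)  # About us | About us
```

If the calling code never touches files or SQL directly, moving from flat files to a database later is mostly a matter of writing the second class and migrating the existing pages.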

While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.

As the original poster, I wasn't signed in, so I'm following up to the answers so far in an answer (sorry if this is bad form).
There may be instances where this is on a shared host.
Though the JSON files can technically be edited, this won't be the case. The admin interface will be robust enough to do all of the creating/editing of pages.
The size for each install will be relatively small: 1-2 admins, 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
Security will be a big issue. Any other options or suggestions on this specifically?

Well, isn't the problem that they are distrustful of any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, if you just present them with some very simple CMS (like CMS Made Simple, which I've heard is really simple and quick to learn) and they see that everything is easy, then maybe they just won't care what's behind it, whether it's a database or whatever!
They might listen to arguments like easier maintenance, lower cost of maintenance, a much better handover to another webmaster than with proprietary solutions (they are not dependent on you), etc.

Lucene: replicated (FS)Directory implementation

Are there any alternative implementations of Lucene's (FS)Directory, notably ones related to replication? What I am looking to do (but I'm looking for something existing before implementing my own :) is a directory that writes to multiple identical directories at the same time. The idea is that I can't deploy DFS or a SAN, so I'm thinking of a sort of "manual" replication to another node with the minimum possible delay. Thoughts?
Many thanks!

Usually people use Solr for this. If you can't use Solr's replication functionality, you can do what people did before Solr: rsync your directories.

Implementing a massive search application

We have an email service that hosts close to 10000 domains, and we store the headers of messages in a SQL Server database.
I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I implemented a SQL Server based search system where I would parse the message and store all the words in a database table along with the memberid and the messageid. The database was on a separate server from the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day.
We are a small company with limited resources with respect to servers, maintenance etc.
Thanks

Have a look at Hadoop. It's a complete "map-reduce" framework for working with huge datasets, inspired by Google. I think (but I could be wrong) Rackspace is using it for email search for their clients.

Lucene.NET will help you a lot, but no matter how you approach this, it's going to be a lot of work.

Consider not using SQL for this. It isn't helping.
GREP and other flat-file techniques for searching the text of the headers is MUCH faster and much simpler.

You can also check out the Java Lucene stuff, which might be useful to you. Both Katta, which is a distributed Lucene index, and Solr, which can use rsync for index syncing, might be useful. While I don't consider either to be very elegant, it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details it's hard to make a more specific recommendation.

If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At the very least you could break them up by language. If they're all English, well, find some way to split the data that makes sense based on common searches. I'm just guessing here, but maybe domains could be grouped by TLD (.com, .net, .org, etc.).
For fulltext search, compare SQL Server vs Lucene.NET vs cLucene vs MySQL vs PostgreSQL. Note full-text search will be faster if you don't need to rank the results. If a database is still slow look into performance tuning and if that fails look into a Linux-based db.
http://incubator.apache.org/lucene.net/
http://sourceforge.net/projects/clucene/

I wonder if BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.

Look into the SQL Server full text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.

Three solutions:
1. Use an existing text search engine (Lucene is the most mentioned; there are several more).
2. Store the whole message in the SQL database, and use the included full-text search (most DBs have it these days).
3. Don't create a new record for each word occurrence; just add a new value to a big field in the word record. Even better if you don't use SQL for this table: use a key-value store where the key is the word and the value is the list of occurrences. Check some inverted index literature for inspiration.
But to be honest, I think the only reasonable approach is #1.
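For a feel of what the word-to-occurrences structure in the third option looks like, here is a toy inverted index. It is only a sketch: the function names, the sample messages, and the very naive tokenizer are made up for illustration, and an in-memory dict stands in for the key-value store. A real engine such as Lucene does far more (ranking, stemming, compressed on-disk postings).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(messages):
    """Map each word to the set of message IDs that contain it."""
    index = defaultdict(set)
    for message_id, body in messages.items():
        for word in tokenize(body):
            index[word].add(message_id)
    return index


def search(index, query):
    """Return IDs of messages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


messages = {
    101: "Quarterly invoice attached",
    102: "Your invoice is overdue",
    103: "Lunch on Friday?",
}
index = build_index(messages)

print(sorted(search(index, "invoice")))          # [101, 102]
print(sorted(search(index, "overdue invoice")))  # [102]
```

There is one entry per distinct word rather than one row per word occurrence, which is what keeps the structure from exploding the way the 600-million-row table did. The message IDs returned by `search` are what you would then join against the headers table.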
