Can we use Google Translate API to index files for searching? - lucene

I am using Lucene 4.2.1 to index files. I need to index multilingual content, for which we pick an Analyzer based on the language to tokenize and index keywords. However, Lucene 4.2.1 does not have analyzers for some of the languages we need, such as Japanese and Korean. One solution would be to upgrade the Lucene version, but since that involves a lot of changes for deprecated functions, I'm trying to find a workaround. Does anyone have any suggestions? Thank you!
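For context, here is roughly how we pick an analyzer per language today (a simplified sketch; the language codes, the CJK stand-in and the fallback are illustrative, not our exact code):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class AnalyzerSelector {
        private static final Map<String, Analyzer> ANALYZERS = new HashMap<String, Analyzer>();
        static {
            ANALYZERS.put("de", new GermanAnalyzer(Version.LUCENE_42));
            ANALYZERS.put("fr", new FrenchAnalyzer(Version.LUCENE_42));
            // Crude stand-in: CJKAnalyzer only does bigram tokenization,
            // it is not a real Japanese/Korean morphological analyzer.
            ANALYZERS.put("ja", new CJKAnalyzer(Version.LUCENE_42));
            ANALYZERS.put("ko", new CJKAnalyzer(Version.LUCENE_42));
        }

        /** Falls back to StandardAnalyzer for languages we have nothing better for. */
        public static Analyzer forLanguage(String isoCode) {
            Analyzer a = ANALYZERS.get(isoCode);
            return a != null ? a : new StandardAnalyzer(Version.LUCENE_42);
        }
    }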

Personally, I would strongly suggest investing the time and upgrading to the most recent version. This "problem" is already solved in the newer versions, and building your own solution may be much more time consuming than upgrading.
IMO, working with such an old version is technical debt that should be paid down. Technical debt always comes back to bite you, and it usually costs more the longer it exists.

Not sure, but I found this on Google: the Google Cloud Translation API can dynamically translate text between thousands of language pairs, and it lets websites and programs integrate with the translation service programmatically. The Google Translation API is part of the larger Cloud Machine Learning API family. Please refer here too: https://cloud.google.com/translate/docs/
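If you went the translation route, the idea would be to translate foreign-language text into a language you already have an analyzer for (say English) before indexing it. A rough, untested sketch against the v2 REST endpoint (the URL, parameters and response handling should be checked against the current docs, and the API key is a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class TranslateBeforeIndexing {
        // Placeholder -- use your own Cloud Translation API key.
        private static final String API_KEY = "YOUR_API_KEY";

        /** Calls the Translation API and returns the raw JSON response. */
        static String translateToEnglish(String text) throws Exception {
            String url = "https://translation.googleapis.com/language/translate/v2"
                    + "?key=" + API_KEY
                    + "&target=en"
                    + "&q=" + URLEncoder.encode(text, "UTF-8");
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            StringBuilder json = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
            in.close();
            // The translated text is under data.translations[0].translatedText;
            // parse it with your JSON library of choice and feed it to an
            // English analyzer when building the Lucene document.
            return json.toString();
        }
    }

Keep in mind that translating before indexing changes what users can search against (the English translation rather than the original text), so it is a workaround rather than a substitute for proper language-specific analysis.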

Related

Is it worth reindexing when migrating from Lucene 3.x to 4.x?

I am going to migrate Lucene from version 3.5 to 4.7.
Since my index is really huge, I am wondering whether it is worth reindexing it.
I am mostly interested in whether it is worth it in terms of performance.
Any suggestions?
Regards
As usual, there is no simple answer to this.
The big change is that in v4.0 Lucene introduced the ability to provide a custom codec/postings format. Michael McCandless (one of the Lucene authors) explains the difference between 3.x and 4.0:
By default, Lucene uses the StandardCodec, which writes and reads in
nearly the same format as the current stable branch (3.x). Details for
a given term are stored in terms dictionary files, while the docs and
positions where that term occurs are stored in separate files.
That said, there are different codecs and each of them focuses on different things.
This presentation covers some postings formats and has some insights into which format is tuned for which scenario. If you are going to stay with the StandardCodec, I'd guess you won't gain much by reindexing.
Consider using the IndexUpgrader from 4.7 to upgrade your index, as there are some changes in the index format (the postings format, to be precise) between 3.x and 4.x. The default codec of Lucene 4.7 may not be able to read index files written by Lucene 3.x.
IndexUpgrader is a utility provided by Lucene.
http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/index/IndexUpgrader.html
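A minimal sketch of running the upgrader programmatically (the index path is a placeholder, and the exact constructor overloads are worth double-checking against the 4.7 javadoc; it can also be run from the command line via the IndexUpgrader main class):

    import java.io.File;
    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class UpgradeIndex {
        public static void main(String[] args) throws Exception {
            // Path to the existing 3.x index -- replace with your own location.
            Directory dir = FSDirectory.open(new File("/path/to/index"));
            // Rewrites every segment into the current (4.7) index format.
            new IndexUpgrader(dir, Version.LUCENE_47).upgrade();
            dir.close();
        }
    }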

Lucene in Java, C#.NET and C++: which is the best version for long-term use on a Windows server?

I am going to implement Lucene search in my project and I want to get off to the best start.
So I am choosing between the 3 versions of Lucene (Java / C#.NET / C++): which is the best version by these criteria:
1. performance
2. ease of implementation
3. plenty of documentation?
Assume the system is a Windows server and that it is for long-term use.
Thanks
I would say Java. Lucene was initially developed in Java, and I would expect a bigger community, more documentation and larger deployments on the Java side.
Granted, Windows is not usually considered a primary platform for deploying Java services, but it still works with flying colors. Many people use Windows for Java development and even deployment, so I don't expect any major issues.
Unless you've got a specific feature you need, I would define "best" as:
a) Whatever platform you are developing the program in -- there are lots of advantages to not having to switch tools/contexts/platforms to muck around with the search internals.
b) Whatever platform your ops people want to deal with -- lots of Windows ops people, for example, hate dealing with Java because it is a strange, foreign language to them.
c) All of the above being equal, Java is the real flagship Lucene project, the one everyone else is keeping up with, and the one with the most tools and resources. It is the way to go if you don't have any reason not to use Java. Solr is another advantage here -- you can pretty easily use a pre-wrapped, fully functional Lucene HTTP server.
In any case, keep in mind that, at least in theory, a Lucene index written on one platform is readable by the others, so you don't necessarily have to commit fully to a single platform.

Full text search for Rails 3

I'm evaluating full-text search methods for Rails 3 at the moment. Does anyone here have a recommendation? It seems to me that most of the well-known options (Sunspot, Sphinx, Ferret, Xapian) aren't ready for Rails 3 yet. Is that so? At the moment I've got plenty of resources left on the machine where I'd like to deploy my app, but nevertheless I'd like to keep the idle load of the search engine as low as possible. I'm planning to use PostgreSQL, if that's of any relevance here.
After some reading I'm almost sure that I'd like to use Sunspot or Xapian, but if there's any other (and better) solution, please tell me :-) Especially regarding Sunspot, I'm not sure whether it's wise to have a complete Tomcat running in addition to my Rails app. Does anyone have experience with this setup?
Thanks in advance,
Ulf
If you are using PostgreSQL you can get an awful lot out of its built-in text search capabilities before you need to reach for external libraries. I've been using tsearch queries for years with excellent results.
PostgreSQL full-text search analyses word proximity to calculate relevance and ranking, and it offers useful features like highlighting of search results.
It is also aware of language-specific normalisation rules; for example, it knows to ignore the s and es pluralisation suffixes in English, so searches for 'country' will also bring back highlighted results for 'countries', much the same way that Google does.
I'm not suggesting that you shouldn't use the libraries you've mentioned, but it is worth investigating the database to see if it will already fulfil the majority, if not all, of your requirements.
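As a rough illustration of that approach (the table and column names are made up; in a Rails app you would send the same SQL through ActiveRecord rather than JDBC):

    import java.sql.*;

    public class TsearchDemo {
        public static void main(String[] args) throws SQLException {
            // Assumed table: articles(id, title, body) -- purely illustrative.
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret");
            String sql =
                "SELECT id, title, " +
                "       ts_rank(to_tsvector('english', body), query) AS rank, " +
                "       ts_headline('english', body, query) AS snippet " +
                "FROM articles, plainto_tsquery('english', ?) AS query " +
                "WHERE to_tsvector('english', body) @@ query " +
                "ORDER BY rank DESC LIMIT 10";
            PreparedStatement ps = conn.prepareStatement(sql);
            ps.setString(1, "countries");   // stemming also matches 'country'
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("title") + " -> " + rs.getString("snippet"));
            }
            conn.close();
        }
    }

For anything beyond toy sizes you would keep a precomputed tsvector column with a GIN index on it rather than calling to_tsvector per row at query time.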
You can use Sunspot with Rails 3, no problem. We have done so successfully using the sunspot/sunspot_rails gems (1.2.rc4). And it's not too much of a hassle to run Solr within a Tomcat server.
For full-text search features you should use a search engine.
For example, you could use the Lucene library with JRuby.
If you'd like to stay with standard Ruby (CRuby), you could use Solr.
For Rails there are also some Solr plugins:
starting with http://wiki.apache.org/solr/SolRuby, for example, could be a good idea.
Sunspot is Rails 3 ready; we're already using it on a few Rails 3 apps. I've had a lot of success with Solr and Sunspot, so much so that we're starting a blog series on it.

Where do I begin learning Lucene.NET, Solr, Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large-scale search service that removes entries the end user doesn't have access to (i.e. a user can search all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, and which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are in Java, you can easily port them to .NET. Of course, there are also tons of resources on the web, such as SO and the Lucene mailing lists, which should help you along.
For the project you describe, you should look at Solr, since it abstracts away a lot of the scalability issues and, via SolrNet, integrates easily into your .NET app. To restrict access by level, your index documents should contain a field called "Level" (say), and behind the scenes you append a clause on that field to the user's query, using a boolean query construct, so that only documents at permitted levels are returned (see the rough sketch below).
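To illustrate the idea against the plain HTTP interface (the field name "level", the Solr URL and the range syntax are assumptions to adapt to your schema; with SolrNet you would attach the same filter query through its API):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class ClearanceSearch {
        public static void main(String[] args) throws Exception {
            String userQuery = "quarterly report";
            int clearance = 3; // the requesting user's clearance level

            // fq= is a Solr filter query; here it keeps only documents whose
            // (hypothetical) "level" field is at or above the user's clearance.
            String url = "http://localhost:8983/solr/select"
                    + "?q=" + URLEncoder.encode(userQuery, "UTF-8")
                    + "&fq=" + URLEncoder.encode("level:[" + clearance + " TO *]", "UTF-8")
                    + "&wt=json";

            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON response
            }
            in.close();
        }
    }

The important point is that the restriction is applied server-side on every request, not left to the client.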
At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are keen to learn about it, however, it too has a very useful book, you guessed it: "Hadoop in Action" (also from Manning Publications).
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc.) does, so the first thing to do is understand the purpose of each. Read the docs and blogs about them; if possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic and/or queries, then show them the document links of all matching PDFs.
I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
If you need something that you can download, install, and be searching with within 10 minutes, look at the free OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/). It uses Lucene, but is packaged so that it is configured and ready to run upon install, a much easier way to try Lucene.
Nutch + Lucene + the PDF plugin enabled in Nutch is your solution. Nutch allows you to parse PDFs by enabling the PDF plugin.
Lucene will let you index the crawled and parsed data, and Nutch has a servlet which gives you a search interface.
We use the same setup for our internal LANs.
None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well written examples on how to roll your own.
Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter weight alternative.
That said, I would still recommend looking at Solr - it's much, much easier to set up than Lucene, has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON
Google Search Appliance http://www.google.com/enterprise/gsa/
I think you want a system to manage your PDF files. Try the DSpace system (www.dspace.org). DSpace is a digital library platform whose search is based on Lucene.
Take a look at EPrints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDFs, and has fairly comprehensive full-text search functionality. It can also be easily customised and branded.
Why re-invent the wheel. Again.
Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in quite a readable fashion.
Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.
The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
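As a rough illustration of rolling your own (a sketch only: it assumes Apache PDFBox 1.x for text extraction and Lucene 4.x, and the paths and field names are placeholders):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PdfSearchSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/path/to/index"));
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);

            // 1. Index: extract the raw text from each PDF and store its path.
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_42, analyzer));
            for (File pdf : new File("/path/to/pdfs").listFiles()) {
                PDDocument pd = PDDocument.load(pdf);
                String text = new PDFTextStripper().getText(pd);
                pd.close();
                Document doc = new Document();
                doc.add(new StringField("path", pdf.getAbsolutePath(), Field.Store.YES));
                doc.add(new TextField("contents", text, Field.Store.NO));
                writer.addDocument(doc);
            }
            writer.close();

            // 2. Search: QueryParser handles AND/OR syntax out of the box.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            Query q = new QueryParser(Version.LUCENE_42, "contents", analyzer)
                    .parse("invoice AND 2013");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }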
A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, easy-to-use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. It has a graphical user interface, easily customizable search results, basic search analytics, a basic thesaurus, and a powerful API so you can do pretty much whatever you want if the out-of-the-box results are not to your liking. We've suggested this to a number of clients with fewer than half a million documents, and they love it.
If you've got a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.
It automatically indexes files as they're added, and I suspect you'll find it much more efficient to enhance or fix Beagle than to write your own search interface to Lucene.
Having the (IMHO) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It's a nice web interface to Spotlight, the Mac OS's built-in indexing service.