is CLucene is faster than java lucene? - lucene

i am using java lucene and i am moving my code from java to c++ for some reason so i need to know about the performance of clucene
can any one explain

According to a benchmark posted on CLucene's SourceForge wiki, CLucene outperforms Java Lucene by a factor of 2 to 3 during indexing, but search performance is only about 10% better.

The data Michael linked to is quite old and incomplete. The answer is yes mainly because C++ has no GC threads and memory allocations are made in C++ by hand. Even reference counting in C++ will be performed faster in C++ since its compiled to machine code, unlike Java which runs on a VM.
For more info see the free chapter on CLucene from Lucene In Action, available from http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/

Related

Lucene in Java, C#.Net and C++. Which is the best version for long-term use on Windows server?

I am going to implement Lucene search into my project and I want to make a best start.
So I consider between 3 versions of Lucene (Java/C#.Net/C++) which is the best version upon these criterias :
1.performance
2.easy to implement
3.plenty of documents ?
Assume the system is Window server, and I ask it for a long-term use.
Thanks
I would say Java. Lucene was initially developed in Java and I would think there are bigger community, more documentation and bigger deployments using Java.
Granted, Windows is not usually considered as primary platform for deploying Java services but it still would work with flying colors. Many people using Windows for Java development and even deployment so I don't expect any major issues.
Unless you've got a specific feature you need, I would look at best being:
a) Whatever platform you are developing the program in -- there are lots of advantages to not having to switch tools/contexts/platforms to muck around with the search internals.
b) Whatever platform your ops guys want to deal with -- I know lots of windows ops guys hate dealing with java as it is a strange foreign language. For example.
c) All of the above being equal, Java is the real flagship lucene project that everyone else is keeping up with with and that has the most tools & resources. It is the way to go if you don't have any reason not to use java. Solr is another advantage here -- you can pretty easily use a pre-wrapped fully functional lucene http server.
In any case, keep in mind that at least theoretically any lucene index written on one platform is readable by others so you don't necessarily have to fully commit to a single platform.

What are alternatives to the Java VM?

As Oracle sues Google over the Dalvik VM it becomes clear, that you cannot implement a Java VM without license from Oracle (EDIT: Matthew Flaschen points out, that the claims of Oracle may not be valid. Anyways we have currently a situation, where Oracle threats VM-implementations.). That may become the death for Open-Source-implementations of Java (like Apache Harmony).
I don't want to discuss the impact or the legitimation of this lawsuit. but as a Java-programmer I want to take a deeper look into the alternatives, to be prepared for every case. As I see the creation of a compiler as a minor problem, my main interest are alternative VM-implementations, that serve a similar purpose as the JVM.
The VM I'm looking for, should meet some conditions:
free of patent-issues
an Open-Source-implementation exists
potential for optimizations/good performance
platform independent (the VM can be ported to different platforms without bigger hurdles)
Please add some recommendations for me.
LLVM is a really good optimizing, low level virtual machine. It can support languages like C and C++, and does not have built in support for high level features like garbage collection.
VMKit is an implementation of the Java and CLI virtual machines on top of LLVM. Since it uses Java bytecode, this probably wouldn't help with the patent issues.
HLVM is another interesting high level virtual machine built on top of LLVM. It is probably different enough to avoid most well known patents, but it is mainly targeted at numerical computing and functional programming.
On the dynamically typed side, there is Parrot.
I am actually working on a compiler and VM for a language of my own design, but don't count on it ever being finished. ;-)
Keep in mind that any large piece of software will infringe on numerous patents, the important thing is how well known they are (and how much the patents' owners actively seek out infringers). Of course, the whole patent system is absurd, and we would be much better off getting rid of it.
I don't think there is any significant piece of software that is free from patent issues.
If you are an independent developer or working for a smaller company you probably won't get hit directly by the problems though. It's unlikely that big companies holding patents will go after lots of small claims - it's an expensive process and causes a lot of resentment. SCO tried something like that and it didn't work out too well for them.
I would concentrate on finding the best tool for the job without worrying too much about the patent issues, otherwise you will never get anything done.
GraalVM is a research project developed by Oracle Labs and already in production at Twitter. I can't believe my eyes that no one mentions anything about it, it’s so weird. Anyways, GraalVM is a well promising extension of the java virtual machine to support more language and execution modes for running applications like JavaScript, Python, Ruby, R, JVM-based languages, and LLVM-based languages such as C and C++.The GraalVM project includes a new high-performance Java compiler, itself called Graal, which can be used in a just-in-time configuration on the HotSpot VM, or in an ahead-of-time configuration on the SubstrateVM. The main goal of this project is to improve the performance of the java virtual machine base language to match the performance of native languages. Let’s sum up the novel features that this project offers and make a brief explanation according to the docs why you should adopt it.
Polyglot: All languages (even LLVM-based) share the same VM and its capabilities. Zero overhead interoperability between programming languages allows you to write polyglot applications and select the best language for your task
Native: Native images compiled with GraalVM ahead-of-time improve the startup time and reduce the memory footprint of JVM-based applications.
Embeddable: GraalVM can be embedded in both managed and native applications. There are existing integrations into OpenJDK, Node.js, Oracle Database, and MySQL GraalVM removes the isolation between programming languages and enables interoperability in a shared runtime. It can run either standalone or in the context of OpenJDK, Node.js, Oracle Database, or MySQL.
Performance: Graal benchmark reports show great performance improvements in almost all of its implementations thanks to the way that GraalVM performs object allocations
If someone don’t get convinced by now that is a good choice and it is a really awesome project you can see this talk by Christian Thalinger on “on why Graal is a good fit for Twitter”

fastest scripting programming language? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I have a web application project where performances count more
than anything else, and I have the choice of the technologies
to use.
The language shootout benchmarks that are not really related
to web applications.
What would you recommand as the best suitable candidates?
Thanks!
A friend suggested the gwan server on IRC. Looks to be what I
was searching but I never heard about it before. Anybody with
prior experience on this package? Ease of use, reliability?
Before I leave Apache, I would like to get your thoughts.
G-WAN is a neat webserver: it's based around the "C scripts" concept:
A C script is simply C source-code that is compiled by the webserver and then loaded in protected memory. It will get called by the webserver when a request to the servlet is made. The servlet, as it's compiled by a C compiler, is "as fast" as normally compiling a C program. However, the advantage of C scripts to, for instance, CGI or FastCGI, is that the compiled program is in the same memory space as the webserver. This reduces the overhead of communication (either by creating a process, in the case of CGI, for each request, or the socket for FastCGI).
The webserver is using the select/poll technique: non-blocking I/O. However, there's a neat thing to it. Every program can be written as if it was using blocking I/O. As the webserver itself compiles each C script, it can transform the program to use non-blocking I/O. As of this, it can link itself to third-party libraries (like database access) and still make use of the non-blocking I/O nature: no thread/process context switching.
The tools provided for programming the C scripts are, for instance, caching and safe buffers. The next (not yet released as of writing this post) version will also include a Key-Value store.
Performance-wise: there are some benchmarks available showing it's outperforming any other webserver, however I don't trust these. Try writing a small CPU intensive program in C and in, for instance, PHP. Let the C script run on G-WAN and the PHP script on Apache, and do a benchmark yourself.
There is more to it, but that's out of scope for this question.
Some downsides of G-WAN is that it is developed by only one person. There is a forum, however, where you can ask questions.
Ease of use is limited by your skill in C. The API provided, however, is simple. It still has some inconsistencies and (in my opinion) ugly parts, but that's not a problem. A more serious problem is that each version is not guaranteed to be backwards-compatible and you may have to rewrite.
If you want to be safe: make use of C's platform independentness: allow your code to be compiled to (Fast)CGI programs and also to be used by G-WAN. Might G-WAN fail, you can always fallback to Apache's (Fast)CGI (see http://www.fastcgi.com/ for API's).
If performance counts more than anything else, don't use a scripting language. Especially since you have full control over the technology stack. Compiled languages will perform better for CPU intensive operations.
LuaJit (Lua) is the fastest scripting language with JIT technology..
if you want the fastest for server side web application (that not always scripting), that would be g-wan.. you can use c, c++, java..
ASP.NET is also fast enough for almost anything, but quite pricey
php with hiphop would be easiest to learn and also fast enough..
it depends on how many request do you need.. and how fast you learn the language ^^
don't forget to cache static data (using memcache or nosql)
Begin by identifying if your application performance really depends on the language or on some other factor (like database requests for instance). Ability to cache results can also be a very important factor.
For performance the language used come quite far in the list of important points to check and the use case also influence which language is better. For example if you have many regex to check you should check regex support in the candidate language, etc...
For image processing, the most important point will probably be the underlying image library you use, usually written in C. I have the case of ImageMagick in mind, because I'm currently using it. It's available for as a library for most languages and the scripting language layer is only necessary to call functions of the library and used language at that level won't change much (but caching precomputed result images could change performance by a large margin). This use case would probably be similar for calling a cryptographic lib.
If performance is really such an issue, for image processing you could also consider using a lib that works with GPU accelerator cards (libs with cuda or openGPU support).
Javascript is constantly being scrutinized and optimized for use on mobile devices, so on actual full-size servers it runs EXTREMELY fast. Check out Node.JS, a project for implementing server side javascript to serve webpages: http://nodejs.org/
Well, if you use a database with a large volume of data you will spend more time there than running a php or asp or (insert other flavours here) script
If you can you should build a mockup of your app (or at least a segment of the more database or processor-intensive parts) and try to benchmark those
Update: Seem like Java 7 using NIO.2 has manage to outperform Gwan using C but almost 2x in timing, it is incredible but you can a few a simple tests.
The only downside of Java is not able to integrate shared libraries built on C. I'm ready to challenge someone to prove me wrong that Java NIO.2 is slower than C.
I recommend the Java programming language; it's not a scripting language, but it's probably the fastest programming language that can be used for programming web applications. I also recommend using a framework like Spring for a better programming experience (versus "raw" Java Servlet Programming).
The fasted scripting Language is ASP followed by PHP, but if you want applications that scale to unlimited speeds, use C++ or Java.
Google Search uses C++
Gmail uses Java
YouTube = Python
Twiiter used to use Ruby now they shifted to Java
Facebook = PHP at front end and some java at the backend
But i recommend PHP at the front end and C++ at the back-end

Use of general-purpose scripting languages

There are many scripting language communities claiming that the language can be used for everything but in fact, nearly everybody uses it for one specific thing, e.g.: web development. If I take a look at Ruby, for example, they tell you its general-purpose but actually everybody is using it with rails for web development only..
Can you list me some uses of popular general-purpose scripting languages for the local PC? (except embedding) Are there any?
Is the fast development usually worth having to bring the whole interpreter with your program? Then there would be some language-dependent performance and stability problems too in most cases..
best regards,
lamas
I tend to use Python for most things that aren't compute bound, i.e. they aren't restricted by how many computations you do per second. Some of the things I've used Python for are:
General scripts to manipulate images etc. with the Python Imaging Library.
GUI frontends for command line applications using the pexpect module.
Mathematical modeling of microbial systems.
Bioinformatics.
Some web programming.
etc...
When the program/algorithm is compute bound, I use C together with Python and Ctypes. Does this fit your definition of general purpose? It's certainly useful for a wide variety of applications, but not suitable if the program needs to crunch numbers fast.
Stability: Python 2.5/2.6 is rock solid. Never had a crash that wasn't caused by self-stupidity.
Fast development: It's definitely worth it for me. For the most part, in the field where I work, programmer time is orders of magnitude more valuable than processor time. I'm quite happy to let a program run for hours if I can write it in a few days instead of a few weeks.
I often use PHP for things that I used to use bat files for. Much easier to write. Ironically, the deployment scripts to create installable materials for my web apps from the subversion sources are written in PHP.
Python is popular in the gaming community. EVE Online is written in python.
claiming that they can be used for everything but I often can't find any examples for that
You are basing your question on an incorrect assumption. Although, as pointed out, a Turing complete language will be able to compute what you require ... languages are 'viewed' by most as the sum of their most useful features and productive semantics.
The reality is:
Most scripting languages can do the same things, or support the most common things via libraries.
Some languages make a subset of operations more convenient, take Perl and regular expressions as an example
CPU time is cheap, as is RAM. Simple to understand code is the priority for most people.
The rise of the scripting languages is natural. Trying to assert any one language, approach or level of execution is good for a range of situations is usually fruitless.
What do you want?
What is the best language for that?
Is is fast enough or small enough? Usually the answer is yes
Imagine trying to use Python where you should be using Erlang, or C instead of Lisp because you thought all languages are equal. They aren't, even though, you can achieve the same things in a problem domain, in most languages/platforms with varying levels of ballache dependant on the task.
I often use ruby for what other people would create bash/sh files for. I find Ruby syntax intuitive for batch tasks along with a lot of other sorts of tasks(it's my goto language)
Perl is extremely popular for general scripting in unixes, such as there are package managers and websites and maintenance scripts written in perl.
Python is extremely popular for both web and application use.
VBA Is popular for being abused to write programs inside of Access, and also was once commonly used in ASP for websites (right?)
Nobody mentioned AppleScript!
Hahah, no seriously, Perl runs everywhere, is installed by default on (almost) any Unix-family OS (and is easy to get on Windows), and is extremely useful for gluing things together. And if you browse a bit at CPAN you'll see that it's extremely general-purpose. "Swiss army chainsaw" was intended as a slur but I think of it fondly. Performance is good too, though it hardly ever actually matters. Larry Wall's goal was "make easy things easy and hard things possible".
OK OK, so I'm a fanboy still, sigh.

Embedded platform development in (!C)

I'm curious to see how popular the alternatives to C are in the embedded developer world e.g. Ada...
I've only ever used C (with a little bit of assembler), but then my targets have very limited resources. Is there a move else where in this space to something else? What is winning the ware in set top boxes?
If !C what was the underlying reason?
Compiler support for target
Trace \ static analysis tools
other...
Thanks.
Forth is quite popular for embedded development.
Also, while Smalltalk is probably not popular in the embedded community, embedded development is definitely popular in the Smalltalk community.
When you say "embedded development", keep in mind that you have to consider the scale of the project.
When programming something on the scale of a microcontroller or the firmware for an ASIC, you tend to see C and assembly dominate the scene. Embedded developers tend to "specialize" in these languages since compilers for them are available for nearly every embedded target platform. If your project migrates from, say, a chip with a PowerPC core to a chip with an ARM core, you can be fairly confident that your C code will not be overly difficult to port over. Some chips do have compilers available for other languages, but typically they do not match the C compiler in terms of efficiency of the resulting binary. Since embedded systems are often low on resources, system designers want to make their code as efficient as possible (also one reason why you see a lot of assembly language code). I have seen development tools available for languages such as C++, Pascal, Basic, and others, but they are typically niche tools that are not mature enough to match the efficiency of the available C compilers. Debugging tools for these languages also tend to be harder to find than what is available for C/assembly.
You also mentioned set-top boxes. Embedded systems on this scale can pack the equivalent power of a desktop computer from 7-8 years ago. Their available RAM, storage space, and processing power allows them to run full-featured operating systems and interpreters for higher-level languages. On these more powerful systems you will still see C and assembly language being used (for driver code, if nothing else), but other languages (such as Java, Lua, Tcl, Ruby, etc) are becoming more and more common. Using interpreted languages makes porting code from one platform to another even easier, as long as the platform has sufficient resources to handle the overhead of the language interpreter. Any low-level code that interfaces directly with hardware (drivers) with still typically use assembly or C since high-level languages don't always have the capability to do this sort of thing. Anything running as an application on top of the embedded operating system can usually be developed and tested inside an emulator or virtual machine, and so you will see a lot of code being developed in whatever language the developer happens to be comfortable with.
TLDR version: C is popular because is it a versatile language that nearly all developers are familiar with. Assembly is popular because it allows for low-level hardware access in ways that would otherwise be difficult or impossible. Interpreted/scripted languages such as Java are becoming more popular, but the resource requirements of the interpreters for these languages may be too much for some embedded systems to handle. The quality and variety of development/debugging tools availability for the C and assembly languages also makes these options attractive.
Perhaps not quite the large step from C you're looking for but C++ is also resonably popular for embedded projects.
I haven't used myself, but Bascom is quite popular for AVR microcontrollers. It is a Basic IDE that lets you interact with the peripherals very easily. I've met hardware people that successfully use it.
Yes. Java is becoming more popular - many processors have added instructions that pertain primarily to Java and similar languages (.net). Also, uclinux runs on microcontrollers, so you can use practically any language for some of the larger micros.
Basic is still common, as is assembly.
You'll see Ada in certain gov't projects.
And some engineers are even putting Lua and other interpreters on their micros so their customers can extend the functionality.
But C is still dominant.
-Adam
In the early 90 I did a lot of embedded development on the 8051 using Intel PLM51 and the DCX51 operating system.
PLM is very simple language – but very powerful
We now use C
If you work in the smartcard space, you get to use Java Card. Yep, Java, on an 8-bit micro. It's kinda fun, actually. I get to develop in Eclipse, test ( & debug!) on the PC simulator, and can be confident that it'll run the same on the card. It's just such a pity Java is a terrible language for embedded apps :)
I've used EC++ (Embedded C++) quite extensively.
Also, PICBasic has been popular with the PIC'ers for eons now.
I have used Ada in embedded project for military avionics because of customer requirements. There is lots of Ada tools for embedded development but most of it is very expensive. Personally I would just use C.
There is a Pascal compiler for 8051
JAL
There is a group of folks working to make Lua a viable option for embedded work. They are targeting primarily 32-bit ARMs with 256K FLASH and 64K RAM or better, and seem happy with their work so far.
They are partly inspired by the classic BASIC-Stamp, a BASIC interpreter running in a moderately powerful PIC with the program itself stored in a serial EEPROM device.
At work, I am still maintaining a customer's embedded system that is written in a compiled flavor of BASIC running in a Zilog Z180 CPU. 1980's technology all around, with most of the system still built out of 24-bin DIP packages in sockets. The compiler runs under CP/M-80 running in a Z80 simulator, that itself runs in the MS-DOS simulator built into Windows. Aside from the shear amazement that anything productive can be done this way (and that you can still buy 27C256 UV erasable EPROMS, and that my nearly 20 year old Data/IO PROM programmer still works) I really wish the customer could afford to move to a new hardware design so the system could be rewritten in a maintainable language.
Depends on the microcontroller, many of them have C but the compilers are horribly, assembler is usually easy and the best performing, most efficient, etc. Ones like the msp, avr, and arm are good for C compilers and for those I would and do use C (depending on the problem).
I would stick to C or assembler, you are wasting memory, performance, and resources using anything else.
Pascal, Modula2 work fine too. Essentially they are pretty much equivalent to C, except for the inability to do alloca (though some have that as extension).
But the core problem will be the problem with any !C compiler: what do you prefer, a better compiler/toolchain or the language of preference.
Despite I like the Wirthian languages most, I simply use C, and am living with the consequences, simply because the toolchain is better.
There have been examples in the past (Pascals, or even tightly compiled Basics), but C is mostly the norm. I never understood why.
I worked on a device which ran some incredibly old version of python (1.4 or something). There was no way to debug it (other than printing debug messages) so when your code hit an exception everything would just stop and you scratched your head for an hour. Whenever you made a change and upgraded the code it was running, it took about 10 minutes to interpret and compile it.
Needless to say we scrapped that and replaced the microcontroller with one that ran C.
See this related question:
What languages are used for real-time systems programming.
In response to your "why" question, from the standpoint of government/military acquisition, there is a perception that Java (language, platform, etc...) is the lingua franca these days and that economies of scale in the language will reduce acquisition and maintenance cost. There's also a hope that one can efficiently train a competent Java programmer to be a reasonable RT/embedded programmer in Java faster than if they are required to learn a new language. This rationale is suspect, in my opinion, but it does answer the "why" question.
If you include the iPhone as an embedded platform then Objective-C
Considering how many times I've had a Java out-of-memory exception on my phone(most of the time I do anything remotely interesting), I'd run away from Java like a bat out of a hot place.
I've heard that Erlang was designed for use for cell phones. I think Lisp is a good architecture for remote device support- if the device cna handle the run-time.
A lot of home-brew users and small companies needing a cheap solution have found Tiny Tiger and Basic STAMP (using BASIC) meets their needs.