IndexWriterConfig in Lucene - lucene

When an IndexWriterConfig object is created, the codec is loaded into memory from the resource file
META-INF/services/org.apache.lucene.codecs.Codec
Is there a way to specify the codec directly, so that Lucene does not have to locate and read this file at runtime? We would like to avoid that lookup for performance and security reasons.

Related

Binary open file in c++/cli or c#

I have a problem opening a binary file in C++/CLI. How do I open the whole file and save its contents to an array ^?
A typical method for reading the contents of a binary file is to:
1. Determine the length of the file.
2. Allocate dynamic memory for the file.
3. Block read the file, in binary mode, into memory.
Some operating systems may have a memory map capability. This allows a file to be treated as an array. The OS is in charge of reading the file into memory. It may read the entire file, or it may read pages as necessary (on demand).
See std::ifstream::read, std::ifstream::seekg and std::ifstream::tellg.

Lazily Read Stream of Bytes from File in Java 8

So Java 8 introduces a lot of lazily loaded Streams, including one for reading lines from a text file using a particular character encoding.
But after doing a LOT of reading I've determined that there is no out-of-the-box method to lazily read chunks of bytes from a file, and I'm a bit confused as to why this is the case. This is a pretty common use case, so there must be a good reason for it to not have been included, right?
My best solution to this seems to be a custom implementation of a Spliterator to read byte chunks using some guidance from this post:
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams
But I would love to know why Java 8 doesn't have this feature out of the box?
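For illustration, here is a minimal sketch of lazy chunked reading using only the Java 8 standard library. It is not a custom Spliterator as described above: it wraps a plain Iterator with Spliterators.spliteratorUnknownSize, which is simpler but yields a sequential stream only. The names ByteChunks, chunks and chunkSize are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper: exposes a file as a lazy, sequential Stream of byte[] chunks.
final class ByteChunks {

    static Stream<byte[]> chunks(Path file, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        final InputStream in = Files.newInputStream(file);

        Iterator<byte[]> iterator = new Iterator<byte[]>() {
            private byte[] pending;   // chunk read ahead by hasNext(), not yet handed out
            private boolean finished; // true once end of file has been reached

            @Override
            public boolean hasNext() {
                if (pending == null && !finished) {
                    pending = readChunk();
                    finished = (pending == null);
                }
                return pending != null;
            }

            @Override
            public byte[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                byte[] result = pending;
                pending = null;
                return result;
            }

            // Reads up to chunkSize bytes; returns null at end of file.
            private byte[] readChunk() {
                try {
                    byte[] buffer = new byte[chunkSize];
                    int filled = 0;
                    while (filled < chunkSize) {
                        int n = in.read(buffer, filled, chunkSize - filled);
                        if (n < 0) {
                            break;
                        }
                        filled += n;
                    }
                    if (filled == 0) {
                        return null;
                    }
                    // The last chunk may be shorter than chunkSize.
                    return filled == chunkSize ? buffer : Arrays.copyOf(buffer, filled);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };

        Spliterator<byte[]> spliterator = Spliterators.spliteratorUnknownSize(
                iterator, Spliterator.ORDERED | Spliterator.NONNULL);

        // onClose releases the file handle when the stream is closed.
        return StreamSupport.stream(spliterator, false).onClose(() -> {
            try {
                in.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

As with Files.lines, the returned stream holds an open file, so it should be used in a try-with-resources block. Nothing is read until a terminal operation pulls the first chunk.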

Objective-C - Finding directory size without iterating contents

I need to find the size of a directory (and its sub-directories). I can do this by iterating through the directory tree and summing up the file sizes etc. There are many examples on the internet but it's a somewhat tedious and slow process, particularly when looking at exceptionally large directory structures.
I notice that Apple's Finder application can instantly display a directory size for any given directory. This implies that the operating system is maintaining this information in real time. However, I've been unable to determine how to access this information. Does anyone know where this information is stored and if it can be retrieved by an Objective-C application?
IIRC Finder iterates too. In the old days, it used to use FSGetCatalogInfo (an old File Manager call) to do this quickly. I think there's a newer POSIX call for that these days that's the fastest, lowest-level API for this, especially if you're not interested in all the other info besides the size and really need blazing speed over easily maintainable code.
That said, if it is cached somewhere in a publicly accessible place, it is probably Spotlight. Have you checked whether the Spotlight info for a folder includes its size?
PS - One important thing to remember when determining the size of a file: Mac files can have two "forks", the data fork, and the resource fork (where e.g. Finder keeps the info if you override a particular file to open with another application than the default for its file type, and custom icons assigned to files). So make sure you add up both forks' sizes, or your measurements will be off.

Is it possible to store files in Apache Lucene?

I'm new to Apache Lucene.
Is it possible to store files (e.g. pdf, doc) in Apache Lucene and retrieve them later? Or do I have to store those files somewhere else and use Lucene only for indexing?
Technically you can, of course, store the contents of a file (e.g. in a StoredField or elsewhere), but I don't see any reason why you should. It adds no value, only the pain of serializing and deserializing file contents, and you will still have to keep the file name indexed somewhere else. Apart from the serialization/deserialization pain, your app will likely have to block longer while Lucene merges index segments.
The best approach IMO is to store the path to the file relative to some file repository root - e.g. if your file is in /home/users/bob/files/123/file.txt, you might want to store the files/123/file.txt part without tokenization (using StringField).
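As a rough sketch of the path handling only (no Lucene calls), the relative part can be computed with java.nio.file before it is handed to a StringField. The class name RelativePathKey and the method toKey are made up for this example, and the repository root /home/users/bob is assumed from the path above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns an absolute file path into a key relative to a repository root.
final class RelativePathKey {

    static String toKey(Path repositoryRoot, Path file) {
        Path root = repositoryRoot.toAbsolutePath().normalize();
        Path target = file.toAbsolutePath().normalize();
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException(file + " is not under " + repositoryRoot);
        }
        Path relative = root.relativize(target);

        // Join the name elements with '/' so the key looks the same on every platform.
        List<String> parts = new ArrayList<>();
        for (Path part : relative) {
            parts.add(part.toString());
        }
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        String key = toKey(Paths.get("/home/users/bob"),
                           Paths.get("/home/users/bob/files/123/file.txt"));
        System.out.println(key); // prints "files/123/file.txt"
    }
}
```

Storing the key relative to the root means the repository can be moved without reindexing; only the root needs to change.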

Are file extensions required to correctly serve web content?

We're using Amazon S3 to store and serve images, videos, etc. When uploading this content we also always set the correct content-type (image/jpeg, etc.).
My question is this: Is a file extension required (or recommended) with this sort of setup? In other words, will I potentially run into any problems by naming an image "example" versus "example.jpg"?
I haven't seen any issues with doing this in my tests, but wanted to make sure there aren't any exceptions that I may be missing.
Extensions are just a means by which the OS decides which program opens a file. As far as your scenario is concerned, as long as the content-type specifies the type, the extension doesn't matter. But why in the world would you name a jpg file as .txt, right?