Strings indexing tool for binary files

I often have to deal with very large binary files (from 50 to 500 GB) in different formats, which basically contain mixed data, including strings.
I need to index the strings inside the file, creating a database or an index, so I can do quick searches (basic searches or complex ones with regex). The output of the search should, of course, be the offset of the found string in the binary file.
Does anyone know of a tool, framework or library which can help me with this task?

You can run 'strings -t d' (Linux / OS X) on it to pull out strings with their corresponding offsets and then put that into Solr or Elasticsearch. If you want more than just ASCII, though, it gets more complex.
Autopsy has its own strings extraction code (for UTF-8 and UTF-16) and puts it into Solr (and uses Tika if the file format is supported), but it doesn't record the offset within the binary file, so it may not meet your needs.
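As a rough sketch of the first approach, here is what extracting printable ASCII strings together with their byte offsets could look like (similar in spirit to 'strings -t d'); the minimum length and the plain println are assumptions, and each (offset, string) pair would then be sent to Solr or Elasticsearch instead of being printed:

```java
// Rough sketch of an ASCII string extractor with byte offsets.
// Usage: java BinaryStringExtractor <binary-file>
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BinaryStringExtractor {
    private static final int MIN_LEN = 4; // assumed minimum string length, like the default of `strings`

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            StringBuilder current = new StringBuilder();
            long offset = 0, start = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b >= 0x20 && b <= 0x7E) {              // printable ASCII byte
                    if (current.length() == 0) start = offset;
                    current.append((char) b);
                } else {
                    if (current.length() >= MIN_LEN) {
                        // Here you would index (start, current) into Solr/Elasticsearch.
                        System.out.println(start + "\t" + current);
                    }
                    current.setLength(0);
                }
                offset++;
            }
            if (current.length() >= MIN_LEN) {
                System.out.println(start + "\t" + current); // flush a trailing string
            }
        }
    }
}
```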

Related

Apache Lucene: Creating an index between strings and doing intelligent searching

My problem is as follows: let's say I have three files, A, B, and C. Each of these files contains 100-150M strings (one per line). Each string is in the format of a hierarchical path, like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
We can think of file A as corresponding to circuit signals in a hardware modeling language, file B as corresponding to circuit signals in a schematic netlist, and file C as corresponding to circuit signals in a layout (for manufacturing).
Now a signal will have a mapping between File A <-> File B <-> File C. For example in this case, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me, and I don't expect the matcher to figure this out for me.
Now say I give '/arbiter/par0/unit1/sigA'. In this case, the matcher should return a direct match from file A, since it is found there. For files B/C a direct match is not possible, so it should return the best possible matches (e.g., by edit distance). So in this example, it could give /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of giving a full string to search, I could also give something like *par0*unit1*sigA, and it should give me all the possible matches from files A/B/C.
I am looking for solutions, and came across Apache Lucene. However, I am not totally sure if this would work. I am going through the docs to get some idea.
My main requirements are the following:
There will be 3 text files with full paths to signals. (I can adjust the format to make it more compact if it helps build the indexer more quickly.)
Building the index should be fairly fast (taking a couple of hours is acceptable). The files above are static (no modifications).
Searching should be comprehensive. It is OK if it takes ~1 s per search, but the matching should support direct match, regex match, and edit-distance matching. The main challenge is that each file can have 100-150 million signals.
Can someone tell me if such a use case can be easily addressed by Lucene? What would be the correct way to go about building an index and doing quick/fast searching? I would like to write some proof-of-concept code and test the performance. Thanks.
I think, based on your requirements, the best solution would be a PoC with a given test set of entries. Based on this, it should be possible to evaluate the indexing time you would like to achieve. Because you only use static information it's easier, since you don't have to care about topics like NRT (near-real-time search).
Personally, I have never used Lucene for such a big data set, but I think Lucene is able to handle it.
How I would do it (a minimal indexing/searching sketch follows this list):
Read tutorials and best practices about Lucene, indexing and searching, and understand how it works.
Define a data set for indexing, let's say 1,000 lines per file.
Define your Lucene document structure. This is really important, because your searches will be applied based on it. Take care of analyzer tasks like tokenization, if needed, and how. If you need full-text search, consider a TextField.
Write code for simple indexing.
Run small tests with indexing and inspect your index with Luke.
Write code for simple searching.
Define queries and your expected results, then execute the searches and check the results.
Try to structure your code: separate indexing and searching, as it will be easier to refactor.
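As a rough PoC sketch of the indexing and searching steps above (the field names, the KeywordAnalyzer choice, the input file name and the Lucene 8.x-style API are all assumptions on my side), indexing each signal path as one untokenized document field and querying it with exact, wildcard, and fuzzy queries could look like this:

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SignalIndexPoc {
    public static void main(String[] args) throws Exception {
        Path indexDir = Paths.get("signal-index");

        // Indexing: one document per signal path, stored as an exact, untokenized "path" field.
        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
            for (String line : Files.readAllLines(Paths.get("fileA.txt"))) { // hypothetical small test file
                Document doc = new Document();
                doc.add(new StringField("path", line, Field.Store.YES));
                doc.add(new StringField("source", "A", Field.Store.YES));    // which input file it came from
                writer.addDocument(doc);
            }
        }

        // Searching: direct match, wildcard match, and edit-distance (fuzzy) match.
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query exact    = new TermQuery(new Term("path", "/arbiter/par0/unit1/sigA"));
            Query wildcard = new WildcardQuery(new Term("path", "*par0*unit1*sigA"));
            Query fuzzy    = new FuzzyQuery(new Term("path", "/arbiter/par0/unit1/sigA"), 2);

            for (Query q : new Query[] {exact, wildcard, fuzzy}) {
                for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                    Document d = searcher.doc(hit.doc);
                    System.out.println(q + " -> " + d.get("source") + ": " + d.get("path"));
                }
            }
        }
    }
}
```

Note that FuzzyQuery caps the edit distance at 2, so approximate matching across differently prefixed paths (e.g. /arbiter/... vs. /top/arbiter/...) would probably need extra handling, such as indexing additional normalized fields.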

Notating large batch of files

I have about 30,000 different files, all with different file name formats. I want to put together a list of "unique" file name patterns, where the dates/etc. are replaced by generic characters/symbols.
For example:
20160105asdf_123456_CODE.txt
Would be notated into:
YYYYMMDD*_######_XXXX.txt
Any ideas on how to do this efficiently on a large scale? I thought about parsing it out per delimiter ("_"), but I'm sure there's something a lot easier out there.
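One possible approach (not from the original thread, purely an illustration) is a small chain of regex substitutions plus a grouping count; the placeholder rules below are guesses based on the single example above and would need adjusting to the real naming conventions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileNamePatterns {

    // Illustrative rules derived from the example above; adjust to your conventions.
    static String toPattern(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot >= 0 ? fileName.substring(0, dot) : fileName;
        String ext  = dot >= 0 ? fileName.substring(dot) : "";
        String pattern = base
                .replaceAll("[A-Z]", "X")                         // uppercase letters -> X
                .replaceAll("[a-z]+", "*")                        // lowercase runs -> *
                .replaceAll("(?<!\\d)\\d{8}(?!\\d)", "YYYYMMDD")  // 8-digit runs treated as dates
                .replaceAll("\\d", "#");                          // remaining digits -> #
        return pattern + ext;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, Long> counts;
        try (Stream<Path> files = Files.list(dir)) {
            counts = files.filter(Files::isRegularFile)
                          .map(p -> toPattern(p.getFileName().toString()))
                          .collect(Collectors.groupingBy(p -> p, Collectors.counting()));
        }
        // Print each unique pattern and how many files it covers,
        // e.g. "YYYYMMDD*_######_XXXX.txt  1".
        counts.forEach((pattern, n) -> System.out.println(pattern + "\t" + n));
    }
}
```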

How to store bytes or values in the most compact way?

I am wondering if there is a file format that will enable me to store values in the format indicated below. I want the file format to be as efficient as possible (i.e., no extra information apart from what I place inside it). This is for a concept I have of creating a more efficient method of storing images. Here is an example of the data I wish to store:
800 600 0000FF FF0000 00FF00 969696
...
I was originally considering placing them in a .txt file, but I do not think that storing, say, 1 million numbers (for a 1000x1000 image) in a .txt file is very compact.
So, what file format that can be written to in VB.net is the best for storing basic numbers?
EDIT 1: I plan to compress using GZip or some other compression afterwards.
Simply store them in binary format. Look at BinaryWriter.
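The answer above refers to .NET's BinaryWriter, which would be the natural choice from VB.NET; purely as a language-neutral illustration of the same idea (raw binary output with no textual overhead, compressed with GZip afterwards), here is a rough sketch using Java's DataOutputStream, with the file name and sample pixel values as placeholders:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompactImageWriter {
    public static void main(String[] args) throws IOException {
        int width = 800, height = 600;                              // header values from the question
        int[] pixels = {0x0000FF, 0xFF0000, 0x00FF00, 0x969696};    // sample 24-bit colors

        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream("image.bin.gz")))) {
            out.writeShort(width);            // 2 bytes instead of the text "800 "
            out.writeShort(height);           // 2 bytes
            for (int rgb : pixels) {          // 3 bytes per pixel (R, G, B)
                out.writeByte((rgb >> 16) & 0xFF);
                out.writeByte((rgb >> 8) & 0xFF);
                out.writeByte(rgb & 0xFF);
            }
        }
    }
}
```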

how to look for the content of text file in pentaho?

I have an ETL which gives a text file as output, and I have to check whether that text content contains the word "error" or "bad", using Pentaho.
Is there any simple way to find it?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (e.g., RegEx) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which it likely is if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
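As an illustration of the logic such a scripted step would implement (this is plain standalone Java, not actual PDI step code; the directory argument and the .txt extension filter are assumptions):

```java
// Scan text files under a directory and print the names of those containing "error" or "bad".
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class ErrorFileScanner {
    private static final Pattern WORDS =
            Pattern.compile("\\b(error|bad)\\b", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(p -> p.toString().endsWith(".txt"))
                 .filter(ErrorFileScanner::containsErrorWord)
                 .forEach(System.out::println);
        }
    }

    private static boolean containsErrorWord(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.anyMatch(line -> WORDS.matcher(line).find());
        } catch (IOException e) {
            return false; // unreadable file: treat as no match
        }
    }
}
```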
If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point at which you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, then probably your best bet is good old grep.

How to find line number or page number using Lucene

Can anyone help me?
For my project I use Lucene for indexing files. It only gives me the file name and location, with no mention of the line number or page number.
Is it possible with Lucene to find the line number or page number? Please help me with how to do it.
This ended up being too long for a comment so I just made it an answer.
Are you thinking of grep (*nix tool) output, where you grep a set of documents and get a result set that contains matches with a line number and text? E.g.:
46: I saw the brown fox jumping over the lazy dog
If so, Lucene doesn't work like that. On the OS, grep (to simplify) opens each document serially and runs your specified pattern against each line of the contents of each document. Hence, it can produce output like the example listed earlier, because it's working on the file as it exists on the machine. Lucene behaves differently.
When you index a file with Lucene, Lucene creates an inverted index, combining the contents of each document into a highly efficient structure that lets you quickly look up and find documents containing specific pieces of information. In turn, when you run a query against the Lucene inverted index, it will return its internal representation of all the documents that matched your query, as well as a relevancy score to provide some indication of how useful a document might be to you, based on the query. It does this by operating against its own internal inverted index structure, not by iterating over all the files in place like grep. Lucene possesses no knowledge of line or page numbers, so no, it's not possible to replicate grep with Lucene right out of the box.
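As a purely conceptual illustration of that point (this is not Lucene API code, just a toy term-to-document map), a lookup in an inverted index yields matching documents, not line numbers:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "doc1.txt", "the brown fox jumps over the lazy dog",
                "doc2.txt", "the quick cat sleeps");

        // Build: each term points at the set of documents containing it.
        Map<String, Set<String>> index = new HashMap<>();
        docs.forEach((name, text) -> {
            for (String term : text.split("\\s+")) {
                index.computeIfAbsent(term, t -> new HashSet<>()).add(name);
            }
        });

        // Query: which documents contain "fox"? Prints [doc1.txt], with no notion of line 46.
        System.out.println(index.getOrDefault("fox", Set.of()));
    }
}
```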