Objdump - How can I convert an ELF file to several binary files, covering only the existing sections (excluding the gaps)?

I have an ELF file that is 70 MB on disk, but it contains large gaps that would make a flat binary image more than 26 GB. I want to dump it to several binary files, hopping over the gaps and extracting only the existing sections. A script is fine too.
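One hedged way to do this with GNU binutils: list the sections with objdump -h, then pull each non-empty section out with objcopy -O binary --only-section. The sketch below wraps that in Python; the output naming scheme and the regex for objdump's column layout are my assumptions.

    import re
    import subprocess
    import sys

    def dump_sections(elf_path):
        # 'objdump -h' prints one row per section:
        #   Idx Name   Size   VMA   LMA   File off  Algn
        out = subprocess.check_output(["objdump", "-h", elf_path], text=True)
        for line in out.splitlines():
            m = re.match(r"\s*\d+\s+(\S+)\s+([0-9a-fA-F]+)\s", line)
            if not m:
                continue
            name, size = m.group(1), int(m.group(2), 16)
            if size == 0:
                continue
            out_file = elf_path + name.replace(".", "_") + ".bin"  # assumed naming
            # One output file per section, so the gaps between sections are
            # never written; the outputs stay small even if the address
            # space spans 26 GB.
            subprocess.check_call(["objcopy", "-O", "binary",
                                   "--only-section", name, elf_path, out_file])
            print(f"{name}: {size:#x} bytes -> {out_file}")

    if __name__ == "__main__":
        dump_sections(sys.argv[1])

Note that sections with no file contents (e.g. .bss) will come out empty; if that matters, filter on the CONTENTS flag in the same objdump -h output.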

Related

Strings indexing tool for binary files

I often have to deal with very large binary files (50 to 500 GB) in different formats, which basically contain mixed data, including strings.
I need to index the strings inside each file, creating a database or an index, so I can do quick searches (basic, or complex with regex). The output of a search should of course be the offset of the found string in the binary file.
Does anyone know a tool, framework or library which can help me with this task?
You can run 'strings -t d' (Linux / OS X) on it to pull out strings with their corresponding offset and then put that into Solr or Elastic. If you want more than just ASCII though, it gets more complex.
Autopsy has its own strings extraction code (for UTF-8 and UTF-16) and puts it into Solr (and uses Tika if the file format is supported), but it doesn't record the offset from a binary file, so it may not meet your needs.
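To make the strings -t d approach concrete, here is a minimal sketch that indexes the output into SQLite (stdlib) instead of Solr or Elastic; the table layout and batch size are my own choices:

    import sqlite3
    import subprocess

    def index_strings(binary_path, db_path="strings.db"):
        # 'strings -t d' prints: <decimal offset> <string>
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS strings (offset INTEGER, text TEXT)")
        proc = subprocess.Popen(["strings", "-t", "d", binary_path],
                                stdout=subprocess.PIPE, text=True)
        rows = []
        for line in proc.stdout:
            offset, _, text = line.lstrip().partition(" ")
            rows.append((int(offset), text.rstrip("\n")))
            if len(rows) >= 100_000:  # batch inserts; the input can be huge
                db.executemany("INSERT INTO strings VALUES (?, ?)", rows)
                rows.clear()
        db.executemany("INSERT INTO strings VALUES (?, ?)", rows)
        db.execute("CREATE INDEX IF NOT EXISTS idx_text ON strings(text)")
        db.commit()
        return db

Basic searches are then SELECT offset FROM strings WHERE text LIKE '%foo%'; SQLite has no regex operator by default, but the sqlite3 module lets you register one with create_function.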

Notating a large batch of files

I have about 30,000 files, all with different filename formats. I want to put together a list of "unique" filenames, given that the dates etc. are replaced by generic characters/symbols.
For example:
20160105asdf_123456_CODE.txt
Would be notated into:
YYYYMMDD*_######_XXXX.txt
Any ideas on how to do this efficiently at a large scale? I thought about splitting on the delimiter ("_"), but I'm sure there's something a lot easier out there.
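A hedged sketch of one way: apply a few regex substitutions (most specific first) and count the distinct patterns that result. The rules below are fitted to the single example above and are my guesses; real data will need more of them.

    import re
    from collections import Counter

    def generalize(name):
        # Order matters: uppercase codes first, so the date placeholder
        # inserted next isn't itself rewritten.
        name = re.sub(r"[A-Z]+", lambda m: "X" * len(m.group()), name)
        name = re.sub(r"(?:19|20)\d{6}", "YYYYMMDD", name)           # 8-digit dates
        name = re.sub(r"\d+", lambda m: "#" * len(m.group()), name)  # other numbers
        name = re.sub(r"[a-z]+(?=_)", "*", name)                     # free text
        return name

    names = ["20160105asdf_123456_CODE.txt", "20160212qwer_999999_DATA.txt"]
    for pattern, count in Counter(generalize(n) for n in names).most_common():
        print(count, pattern)  # 2 YYYYMMDD*_######_XXXX.txt

For 30,000 names this runs in well under a second, and the Counter output is exactly the list of "unique" patterns with their frequencies.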

Pandas: in-memory sorting of HDF5 files

I have the following problem:
I have a set of several HDF5 files with similar data frames which I want to sort globally based on multiple columns.
My input is the file names and an ordered list of columns I want to use for sorting.
The output should be a single hdf5 file containing all the sorted data.
Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.
Naively, I would first copy all the data into a single HDF5 file (which is not difficult) and then find a way to sort this huge file without loading it entirely into memory.
Is there a quick way to sort a pandas data structure stored in an HDF5 file based on multiple columns?
I have already seen ptrepack, but it seems to allow sorting only on a single column.
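ptrepack aside, the textbook answer here is an external merge sort: sort each file in memory (which fits, per the question), then stream a k-way merge into the output. A hedged sketch with pandas and heapq; it assumes each file stores one frame under a common key, and that column names are valid Python identifiers (itertuples renames anything else):

    import heapq
    import pandas as pd

    def external_sort(paths, sort_cols, out_path, key="df", chunk_rows=100_000):
        # Pass 1: each input fits in memory, so sort it and rewrite it
        # in 'table' format (required for chunked reads in pass 2).
        sorted_paths = []
        for i, path in enumerate(paths):
            df = pd.read_hdf(path, key)
            tmp = f"sorted_{i}.h5"
            df.sort_values(sort_cols).to_hdf(tmp, key=key, format="table", index=False)
            sorted_paths.append(tmp)

        def rows(path):
            # Stream one sorted file as namedtuples, chunk_rows at a time.
            for chunk in pd.read_hdf(path, key, chunksize=chunk_rows):
                yield from chunk.itertuples(index=False)

        def sort_key(row):
            return tuple(getattr(row, c) for c in sort_cols)

        merged = heapq.merge(*(rows(p) for p in sorted_paths), key=sort_key)

        # Pass 2: drain the k-way merge in batches into the output file.
        with pd.HDFStore(out_path, mode="w") as store:
            batch = []
            for row in merged:
                batch.append(row)
                if len(batch) >= chunk_rows:
                    store.append(key, pd.DataFrame(batch), index=False)
                    batch = []
            if batch:
                store.append(key, pd.DataFrame(batch), index=False)

The merge holds only one chunk per input plus one output batch in memory, so peak usage stays bounded no matter how large the combined dataset is.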

Compare files via CRC

I have two zip files. Inside each zip file there are multiple text and binary files. However, not all the files are the same: some differ due to timestamps and other data, others are identical.
Can I use CRC to definitively prove that specific files are identical?
Example: I have files A, B and C in both archives. Can I use CRC to prove that files A, B and C are identical in both archives?
Thank you.
Definitively? No - CRC collisions are perfectly possible, just very improbable.
If you need absolute proof then you're going to need to compare the files byte for byte. If you just mean within the expectations of everyday use, sure: if the file size is the same and the CRC is the same, then it's very likely the files are the same.
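For zip archives specifically, the cheap check doesn't even require extraction: each member's CRC-32 and uncompressed size are already stored in the archive's central directory. A sketch (the archive and member names are illustrative):

    import zipfile

    def compare_members(zip_a, zip_b, names):
        with zipfile.ZipFile(zip_a) as a, zipfile.ZipFile(zip_b) as b:
            for name in names:
                ia, ib = a.getinfo(name), b.getinfo(name)
                # Cheap check: stored CRC-32 plus uncompressed size.
                # Equal values make a match very likely, not certain.
                likely = ia.CRC == ib.CRC and ia.file_size == ib.file_size
                # Definitive check: compare the actual bytes. This reads
                # each member fully into memory; stream for huge files.
                same = likely and a.read(name) == b.read(name)
                print(name, "identical" if same else "different")

    compare_members("old.zip", "new.zip", ["A", "B", "C"])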

Efficient diff between a large file and smaller files

I wish to get some expert advice on this problem.
I have two text files, one very large (~GB) and the other small (~MB). These files essentially hold information per line. I can say that the bigger file contains a subset of the information in the smaller file. Each line in the files is organized as a tuple separated by spaces, and the diff is found by looking at one or more of the columns in those two files. Both files are sorted on one such column (document ID).
I implemented this by keeping an index of document ID and line number and doing a random access to that line in the larger file to start the diff, but this method is slow. I'd like to know a better mechanism for this scenario.
Thanks in advance.
If the files are known to be sorted in the same order by the same key, and the lines that share a common key are expected to match exactly, then comm is probably what you want - it has flags to allow you to show only the lines that are common between two files, or the lines that are in one file but not the other.
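If comm's whole-line comparison is too blunt (say you only want to key on the document-id column), the same idea is easy to hand-roll as a single sequential merge pass, which removes the slow random access entirely. A sketch; it assumes both files are sorted so that the keys compare consistently as strings (convert to int in records() if the IDs are numeric) and that each key appears at most once per file:

    def sorted_diff(path_a, path_b, key_col=0):
        def records(path):
            with open(path) as f:
                for line in f:
                    yield line.split()[key_col], line

        a, b = records(path_a), records(path_b)
        ka, la = next(a, (None, None))
        kb, lb = next(b, (None, None))
        while ka is not None and kb is not None:
            if ka == kb:              # key in both files: not a diff
                ka, la = next(a, (None, None))
                kb, lb = next(b, (None, None))
            elif ka < kb:             # key only in file A
                yield "a", la
                ka, la = next(a, (None, None))
            else:                     # key only in file B
                yield "b", lb
                kb, lb = next(b, (None, None))
        while ka is not None:         # drain whichever file is longer
            yield "a", la
            ka, la = next(a, (None, None))
        while kb is not None:
            yield "b", lb
            kb, lb = next(b, (None, None))

    for side, line in sorted_diff("big.txt", "small.txt", key_col=0):
        print(side, line, end="")

Like comm, this reads each file exactly once, so it runs in time linear in the two file sizes with constant memory.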