Search text among several parquet files
I need to search multiple mixed parquet files in a windows folder
Search for the desired text in all parquet files
Which tool has such a feature in Windows?
tnx
Related
There are countless questions and answers that present different solutions to merge 2 or more PDF files and how to extract specific pages and create a PDF with this subset.
Unfortunately I could not find a way (either using a library or command line tools, since it will be scripted) to merge files, such that the resulting file is a valid PDF and later "split back" this file in separate files, using the same page ranges, to obtain the exact same original files (at the binary level).
Is this possible?
Once you merged the PDF files you cannot split the result and obtain the exact same original files at binary level. Source PDF files are not included as opaque binaries blocks in the merged file.
One possible solution solution, as #mkl said, is to use a PDF portfolio to embed the source files as they are. When viewing the portfolio you will see each file as it is, not as a long merged PDF file.
there is pdfinfo for PDFs, exifinfo for images, ffprobe for multimedia and so on. are the a collection or a standardized toolset for extracting filetype dependent(like pdf, image, doc, odt) metadata of all files in linux?
or even distinct tool that are file-format specific for most common file types like ppt, epub and other file types we commonly find in internet downloads.
Exiftool can extract metadata from more than just images. See Supported File Types.
MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extensions. If you change a file with .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in Sqlite, which gives you a library to handle some (e.g.) *.sqlite file as an SQL database
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
I am using Google Custom Search along with the XML API. From the documentation linked below, I can see that the XML API supports searches for .xls files, but what about .xlsx files? Half of our files are now the newer .xlsx format and we need for them to turn up in our search results.
How does one search for .xlsx files with the XML API? This is not covered anywhere in the XML API documentation and searching for .xls files does not return any results for .xlsx files, when it should.
https://developers.google.com/custom-search/docs/xml_results
I figured it out. You can use filetype:xlsx even though the documentation does not include xlsx as a supported file type to search for.
Also, you can search for multiple file types. Here is the documentation on that: https://developers.google.com/custom-search/docs/xml_results#wsSpecialQueryTerms
I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to conduct full text searches on my SQL database (MySQL). What is the best way to index these file formats with Sphinx?
The method I use for this is pdf2text and antiword. I use both of these to dump the contents of the pdfs and word documents into the database. From there it's easy to crawl with Sphinx.
Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
Has anyone used Tika to index other types of documents, much like the SOLR plugin?
Apache Tika
Some links:
PDF2TEXT is in poppler or poppler-utils on Linux
ANTIWORD -- seems to be for old .doc, not newer .docx