Is there a tool to easily convert text files to JSONL documents? - training-data

I have several fairly large .txt files that I need to convert to JSONL documents so that I can prepare training data and upload files into an OpenAI environment.
I searched for various free web-based sites and conversion tools and could not find anything that actually worked.
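A minimal Python sketch can do the conversion locally, without any web tool, assuming each non-empty line of the .txt file should become one JSON object with a `text` field (the exact field names your OpenAI workflow expects, e.g. `prompt`/`completion` pairs, may differ — adjust the dict accordingly):

```python
import json

def txt_to_jsonl(txt_path, jsonl_path):
    """Write one JSON object per non-empty line of the input text file."""
    with open(txt_path, encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if line:
                # json.dumps escapes quotes and newlines,
                # so each record stays on exactly one line
                dst.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")
```

Running it over `notes.txt` produces a `notes.jsonl` where every line is a standalone JSON document, which is all the JSONL format requires.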

Related

Importing HDF5 files to BigQuery

The title pretty much summarizes the situation. BigQuery seems to accept a number of data formats for uploads, but HDF5 does not appear to be one of them. What is the "canonical" way to upload an HDF5 file? (Obviously I can convert the file to a bunch of CSVs, but that seems clunky.)
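Since BigQuery has no HDF5 loader, one practical route is to restage the data into a format it does accept, such as CSV. A sketch using the `h5py` library, assuming a simple two-dimensional dataset; the function name and `columns` parameter are illustrative, not from any standard tool:

```python
import csv
import h5py  # third-party: pip install h5py

def hdf5_dataset_to_csv(hdf_path, dataset, csv_path, columns=None):
    """Stream one HDF5 dataset out to a CSV file that BigQuery can load."""
    with h5py.File(hdf_path, "r") as f, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        if columns:
            writer.writerow(columns)
        for row in f[dataset]:        # assumes a 2-D dataset; h5py reads lazily,
            writer.writerow(row.tolist())  # so large files are not loaded at once
```

The streaming loop avoids the "convert the whole file in memory" problem, though for pandas-style tabular HDF5 files `pandas.read_hdf` followed by `to_csv` or `to_parquet` may be more convenient.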

Searching text inside AFP files

I've been asked to convert files from PDF to AFP, and I've managed it using the IBM AFP printer driver. I was wondering if there's a way to search inside the AFP file. I know I can do it on the PDF file, but I've been asked to cross-check the converted files by searching inside them.
Is there a reason why a 370 KB PDF file is converted to an 11.5 MB AFP file? Is it converted as an image? (That would explain why I couldn't search inside it.)
C is a good option for searching a string in AFP PTX records. However, it depends on how you are converting your PDF to AFP. If you use the IBM print drivers, the text will be rasterized, so you will not be able to search it.
AFP Explorer is one of the best freeware tools if this is a one-time request.
http://www.compulsivecode.com/project_afpexplorer.aspx
We use COMPART CPMCOPY and CPMILL to convert POS and PDF files into AFP, where you have MFF filters to get the required output. However, it is a licensed product.
IBM AFP printer driver can be configured, to some extent. Check this manual page: Creating AFP Resources Using the IBM AFP Printer Drivers for further details.
Make sure that "Print Text as Graphics" is turned off.
Some AFP viewers have the feature of text search within AFP files. Consider BTB Viewer (warning, it looks ridiculously outdated).
If you wish to develop your own solution, consider that, in general, searching for text in AFP documents is complicated: each "logical" text block can be split into a series of MO:DCA text instructions, each positioned individually, and it is not guaranteed that these instructions will appear sequentially. So expect problems when searching for multi-word strings.
"Conversion" of PDF to AFP is a generic term. It depends on what software was used to convert and what settings were used. For instance, consider embedded images. Since many AFP devices do not support JPEG compression for I:OCA, the conversion app may convert raster images to raw 24-bit bitmaps, which is ridiculously ineffective in terms of file size; an innocent background image of 1000×1000 px would take a whopping 3 MB (while the original JPEG stream can be tens of kilobytes).

Minimizing IO and Memory Usage When Extracting Pages from a PDF

I am working on a cloud-hosted web application that needs to serve up extracted pages from a library of larger PDFs. For example, 5 pages from a 50,000 page PDF that is > 1 GB in size.
To facilitate this, I am using iTextSharp to extract page ranges from the large PDFs using the advised approach found in this blog article.
The trouble I am running into is that during testing, I have found that the PdfReader is reading the entire source PDF in order to extract the few pages I need. I know enough about PDF structure to be dangerous, and I know that resources can be spread around such that random read access all over the file is going to be expected, but I was hoping to avoid the need to read ALL the file content.
I even found several mentions of RandomAccessFileOrArray being the silver bullet for high memory usage when opening large PDFs, but alas, even when I use it, the source PDF is still read in its entirety.
Is there a more efficient method (using iText or otherwise) to access just the content I need from the source PDF in order to extract a few pages?

Compress Files before save in silverlight

I have a file upload control that uploads files and saves their bytes into a database.
Now I want to compress the files before saving them into the database.
I have gone through the site below:
http://programmerpayback.com/2010/01/21/use-silverlight-to-resize-images-and-increase-compression-before-uploading/
The site above offers a solution for JPEG and PNG, but I want to compress files of any type, save their bytes into the database, and get the original files back when I read the bytes out again.
Please guide me on how to do this.
Thanks,
While JPEG and PNG are often better compression choices for images, ZIP files offer decent compression across all sorts of file types.
In Silverlight, you have a few options, the most popular being DotNetZip and #ziplib.
You can install both as NuGet packages.
Such libraries also have the benefit of being able to package multiple files together, something the image compression formats don't offer in any convenient way.
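The key property the question asks for — compress arbitrary bytes before storage, get the identical bytes back later — is lossless compression, illustrated here with Python's zlib (the Silverlight equivalent would go through DotNetZip or #ziplib, but the round trip is the same idea):

```python
import zlib

def compress_bytes(raw: bytes) -> bytes:
    """Deflate-compress a byte payload before storing it in the database."""
    return zlib.compress(raw, level=9)

def decompress_bytes(blob: bytes) -> bytes:
    """Restore the exact original bytes from the stored blob."""
    return zlib.decompress(blob)
```

Note that already-compressed formats (JPEG, PNG, ZIP, most video) will shrink little or not at all, so it can be worth storing a flag indicating whether a given blob was compressed.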

Is it possible to extract tiff files from PDFs without external libraries?

I was able to use Ned Batchelder's Python code, which I converted to C++, to extract JPGs from PDF files. I'm wondering if the same technique can be used to extract TIFF files, and if so, does anyone know the appropriate offsets and markers to find them?
Thanks,
David
PDF files may contain different image data (not surprisingly).
Most common cases are:
Fax data (CCITT Group 3 and 4)
raw raster data with decoding parameters and optional palette all compressed with Deflate or LZW compression
JPEG data
Recently, I (as a developer of a PDF library) have started noticing more and more PDFs with JBIG2 image data. Also, JPEG2000 data can sometimes be put into a PDF.
I should say that you probably can extract JPEG/JBIG2/JPEG2000 data into corresponding *.jpeg / *.jp2 / *.jpx files without external libraries, but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams, so you'll need to implement a sophisticated PDF parser.
Fax data (i.e. what you probably call TIFF) should at least be wrapped in a valid TIFF header. You can borrow some code for that from the open-source libtiff, for example.
And then comes raw raster data. I don't think it makes sense to try to extract such data without the help of a library. You could do it, of course, but it would take months of work.
So, if you are trying to extract only a specific kind of image data from a set of PDFs all created with the same generator, your task is probably feasible. In all other cases I would recommend saving time, money, and hair, and using a library for the task.
PDF files store JPEGs as actual JPEG data (DCT and JPX encoding), so in most cases you can rip the data out directly. With TIFFs, you are looking for CCITT data (but you will need to add a header to the data to make it a TIFF). I wrote two blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.
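The marker-scanning technique behind Ned Batchelder's code can be sketched in a few lines of Python: look for the JPEG start-of-image marker (FF D8 FF) and the following end-of-image marker (FF D9). This is deliberately naive — it will miss JPEGs inside compressed object streams and can be fooled by FF D9 bytes occurring inside other data, exactly the caveats raised above:

```python
def extract_jpegs(pdf_bytes: bytes):
    """Naively rip out byte ranges that look like complete JPEG streams."""
    jpegs = []
    pos = 0
    while True:
        start = pdf_bytes.find(b"\xff\xd8\xff", pos)   # SOI marker + next marker byte
        if start < 0:
            break
        end = pdf_bytes.find(b"\xff\xd9", start)       # EOI marker
        if end < 0:
            break
        jpegs.append(pdf_bytes[start:end + 2])
        pos = end + 2
    return jpegs
```

CCITT data has no such self-describing markers in the stream, which is why the TIFF case requires actually parsing the stream dictionary and then synthesizing a TIFF header around the extracted bytes.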