Importing HDF5 files to BigQuery - google-bigquery

The title pretty much summarizes the situation. BigQuery seems to accept a number of data formats for uploads, but HDF does not appear to be one of them. What is the "canonical" way to upload an HDF5 file? (Obviously I can convert the file to a bunch of CSVs, but that seems clunky.)

Related

Is there a tool to easily convert text files to JSONL documents?

I have several fairly large .txt files that I need to convert to JSONL documents so that I can prepare training data and upload files into an OpenAI environment.
I searched for various free web-based sites and conversion tools and could not find anything that actually worked.
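One way to do this without a web tool is a few lines of Python using only the standard library. The sketch below is a minimal example under stated assumptions: it treats blank-line-separated paragraphs as records and writes each one as a JSON object with a single "text" key. The key name and the paragraph splitting are illustrative choices, not a format required by any particular service, so check the target platform's documentation for the fields it actually expects.

```python
import json
import re
from pathlib import Path


def text_to_jsonl(text, key="text"):
    """Turn blank-line-separated paragraphs into JSONL (one JSON object per line).

    The record shape {key: paragraph} is an assumption made for illustration.
    """
    lines = []
    for block in re.split(r"\n\s*\n", text):
        paragraph = " ".join(block.split())  # collapse line breaks and repeated spaces
        if paragraph:
            lines.append(json.dumps({key: paragraph}, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")


def convert_file(src, dst, key="text"):
    """Read a UTF-8 text file and write the converted JSONL to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(text_to_jsonl(text, key), encoding="utf-8")
```

Calling convert_file("notes.txt", "notes.jsonl") (both file names are placeholders) would write one JSON object per paragraph.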

How to merge many VRT files into one

I have many VRT files generated using gdal_translate originally for adjacent images.
Is there a way to merge all those VRT files into one VRT file, so that when I run gdal2tiles.py I only need to give it this one composite VRT file?
I first thought gdalwarp would do the trick, but it turns out that gdalwarp merges the images into one single image. I don't want to merge the images; I would like to merge the VRT files.
The gdalbuildvrt utility has been part of GDAL since 1.6.1; it merges multiple input files into one VRT mosaic file. See the official documentation for usage details:
http://www.gdal.org/gdalbuildvrt.html
Most likely you just need to pass the output filename followed by all the individual input files.
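As a rough sketch of that invocation driven from Python: gdalbuildvrt takes the output VRT first and the inputs after it. The helper names, the glob pattern, and the "merged.vrt" file name are hypothetical, and the sketch assumes the GDAL command-line tools are on the PATH.

```python
import glob
import subprocess


def build_vrt_command(output_vrt, inputs):
    """Return the argv for gdalbuildvrt: output file first, then every input."""
    if not inputs:
        raise ValueError("need at least one input file")
    return ["gdalbuildvrt", output_vrt, *inputs]


def merge_and_tile(pattern, output_vrt="merged.vrt"):
    """Build one mosaic VRT from all files matching pattern, then tile it.

    Runs the external GDAL tools; raises CalledProcessError if either fails.
    """
    inputs = sorted(glob.glob(pattern))
    subprocess.run(build_vrt_command(output_vrt, inputs), check=True)
    subprocess.run(["gdal2tiles.py", output_vrt], check=True)
```

For example, merge_and_tile("tiles/*.vrt") would build merged.vrt from every VRT in a hypothetical tiles directory and hand that single file to gdal2tiles.py.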
You have tagged your question with the "maptiler" label, which refers to the http://www.maptiler.com/ product. MapTiler can render multiple files out of the box and does not use VRT internally at all. It is more efficient to supply the individual input files to MapTiler directly than to create a VRT and pass it to the software. A VRT introduces an artificial internal block size for reading the data, which slows down the tile rendering process, in some cases significantly.
Feel free to request a demo of MapTiler Pro and compare the speed, size and quality of the map tiles you receive - and post the results here.

Write KML Extended Data in a different way

I have some GPS raw data that I want to put on a KML file.
Currently I can generate the KML file with the Extended Data using the KML format described here https://developers.google.com/kml/documentation/kmlreference#trackexample and that's fine, but it takes too much time.
I am collecting six different types of extended data using an Arduino and writing them to an SD card, but the entire writing process for each sample is too slow (I write the data to six different files and then append each file to the final KML, using the gx:Track element).
Is there any other way to write all six parameters at the same time, in the KML format, using the Extended Data? Maybe using different tags, or the same tags in a different order?
I don't have enough cpu power to rework the file after collecting gps raw data, so I need to write it right the first time.
Write the KML entirely yourself; do not use a library. Then it is as fast as simply writing text to a file. If the bottleneck is the file system, then KML is not the right format: use a custom binary file and transform it to KML later on the server side.
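To illustrate the binary-log idea, here is a minimal sketch of the server-side conversion step. The record layout is an assumption invented for this example (a uint32 Unix timestamp, latitude and longitude as int32 values in units of 1e-7 degrees, and six int16 sensor readings, 24 bytes in total), and the field names are placeholders. The logger would append one such record per sample to a single file; the converter then emits a gx:Track with one gx:SimpleArrayData per field, following the structure of the track example in the KML reference. Altitude is not logged in this layout, so it is written as 0.

```python
import struct
from datetime import datetime, timezone

# Hypothetical fixed-size record: time, lat, lon, six sensor values (little-endian).
RECORD = struct.Struct("<Iii6h")
FIELD_NAMES = ["s1", "s2", "s3", "s4", "s5", "s6"]  # placeholder names


def records_to_kml(data, names=FIELD_NAMES):
    """Convert a log of packed records into a KML document with one gx:Track."""
    usable = len(data) - len(data) % RECORD.size  # ignore a truncated last record
    rows = list(RECORD.iter_unpack(data[:usable]))
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<kml xmlns="http://www.opengis.net/kml/2.2"'
        ' xmlns:gx="http://www.google.com/kml/ext/2.2">',
        "<Document>",
        '<Schema id="schema">',
    ]
    for name in names:
        out.append(f'<gx:SimpleArrayField name="{name}" type="int">'
                   f"<displayName>{name}</displayName></gx:SimpleArrayField>")
    out += ["</Schema>", "<Placemark>", "<gx:Track>"]
    for row in rows:  # all timestamps first
        stamp = datetime.fromtimestamp(row[0], tz=timezone.utc)
        out.append(f"<when>{stamp:%Y-%m-%dT%H:%M:%SZ}</when>")
    for row in rows:  # then all coordinates as "lon lat alt"
        out.append(f"<gx:coord>{row[2] / 1e7:.7f} {row[1] / 1e7:.7f} 0</gx:coord>")
    out += ["<ExtendedData>", '<SchemaData schemaUrl="#schema">']
    for index, name in enumerate(names):  # one value array per sensor field
        out.append(f'<gx:SimpleArrayData name="{name}">')
        out += [f"<gx:value>{row[3 + index]}</gx:value>" for row in rows]
        out.append("</gx:SimpleArrayData>")
    out += ["</SchemaData>", "</ExtendedData>", "</gx:Track>",
            "</Placemark>", "</Document>", "</kml>"]
    return "\n".join(out) + "\n"
```

With this split, the device only ever appends fixed-size records to one file, and the per-field regrouping that gx:Track requires happens later on a machine with CPU to spare.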

Is it possible to extract tiff files from PDFs without external libraries?

I was able to use Ned Batchelder's python code, which I converted to C++, to extract jpgs from pdf files. I'm wondering if the same technique can be used to extract tiff files and if so, does anyone know the appropriate offsets and markers to find them?
Thanks,
David
PDF files may contain different image data (not surprisingly).
Most common cases are:
Fax data (CCITT Group 3 and 4)
raw raster data with decoding parameters and optional palette all compressed with Deflate or LZW compression
JPEG data
Recently, I (as the developer of a PDF library) have started noticing more and more PDFs with JBIG2 image data. JPEG2000 data can also sometimes be put into a PDF.
I should say that you can probably extract JPEG/JBIG2/JPEG2000 data into corresponding *.jpeg / *.jp2 / *.jpx files without external libraries, but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams, so you'll need to implement a sophisticated PDF parser.
Fax data (i.e. what you probably call TIFF) should be at least packed into a valid TIFF. You can borrow some code for that from open source libtiff for example.
And then comes raw raster data. I don't think that it makes sense to try to extract such data without help of a library. You could do that, of course, but it will take months of work.
So, if you are trying to extract only a specific kind of image data from a set of PDFs all created with the same generator, then your task is probably feasible. In all other cases I would recommend saving time, money and hair by using a library for the task.
PDF files store JPEGs as actual JPEGs (DCT and JPX encoding), so in most cases you can rip the data out. With TIFFs, you are looking for CCITT data (but you will need to add a header to the data to make it a TIFF). I wrote two blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.

Are there different JPEG2000 file formats?

I've seen JPEG2000 files with both .J2K and .JP2 extensions, and codecs which read one won't always read the other. Can someone explain why there are multiple extensions for what I thought was a single format?
Because JPEG 2000 is both a codec and a file format. The standard is in many parts, with Part 1 giving (mostly) codec information (i.e. how to compress and decompress image data), with a container file format annex (JP2). Part 2 gives many extensions, and a more comprehensive container format (JPX).
JP2 is the "container" format for JPEG 2000 codestreams, and is modelled on the QuickTime format. I've not seen J2K (we used J2C during standardisation), but I presume it is raw compressed data without a wrapper. The point of the containers is that a "good" image comes with additional metadata - colour space information, tagging, etc. The JP2 base format allows many "boxes" of information in one file (including many images, if you like). It also allows the same container format to be used for many other parts of the standard (such as JP3D and JPIP). Really, you shouldn't see many unwrapped, raw data streams - it is, in my opinion, bad practice.
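Since extensions are not reliable, one way to tell the two apart is to look at the first bytes of the file. The sketch below is a minimal illustration with a made-up function name: files in the JP2 container family start with a fixed 12-byte signature box followed by a file type box whose brand distinguishes, for example, "jp2 " from "jpx ", while a raw codestream starts directly with the SOC and SIZ markers (FF 4F FF 51). It inspects only the header and does not validate the rest of the file.

```python
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")  # signature box of the container
SOC_SIZ = bytes.fromhex("ff4fff51")  # start of a raw JPEG 2000 codestream


def sniff_jpeg2000(head):
    """Classify the leading bytes of a file as container, raw codestream, or unknown."""
    if head.startswith(JP2_SIGNATURE):
        # The file type box follows the signature: 4-byte length, b"ftyp", 4-byte brand.
        if head[16:20] == b"ftyp" and len(head) >= 24:
            brand = head[20:24].decode("ascii", "replace").strip()
            return f"container ({brand})"
        return "container (brand not found)"
    if head.startswith(SOC_SIZ):
        return "codestream"
    return "unknown"
```

Reading the first 24 bytes of a file and passing them to sniff_jpeg2000 is enough for this check, which can help explain why a codec that expects one form rejects the other.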