How does Hive understand the size of input data?

I'm trying to understand Hive internals. What class/method does Hive use to determine the size of a dataset in S3?

Hive is built on top of Hadoop and uses Hadoop's filesystem API for input/output.
More precisely, it has an InputFormat and an OutputFormat, configurable when you create a table, that read and write data through a FileSystem object (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html).
The FileSystem object abstracts most aspects of file management, so Hive does not have to care whether a file is on S3 or HDFS; the Hadoop layer takes care of that.
Each file has a Path that is a URI (for instance, hdfs:///dir/file or s3://bucket/path).
The Path class resolves the concrete filesystem via its getFileSystem method, which would be S3FileSystem for an S3 URI.
From the FileSystem object, Hive can get a file's size from its FileStatus via the getLen method.
If you want to see where in the Hive source this is done, look at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is the default value of hive.input.format.
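As a rough illustration of that chain (not Hive's exact code path), here is a minimal sketch using the Hadoop FileSystem API; the path below is a placeholder and assumes the s3a connector is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputSizeProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder path; any hdfs:// or s3a:// URI resolves the same way.
        Path path = new Path("s3a://my-bucket/warehouse/my_table/part-00000");
        Configuration conf = new Configuration();

        // Path.getFileSystem picks the concrete FileSystem from the URI scheme
        // (e.g. S3AFileSystem for s3a://, DistributedFileSystem for hdfs://).
        FileSystem fs = path.getFileSystem(conf);

        // FileStatus carries the length that split and size calculations rely on.
        FileStatus status = fs.getFileStatus(path);
        System.out.println(path + " is " + status.getLen() + " bytes");
    }
}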

Related

Unzip files from S3 before putting them into Snowflake

I have data available in an S3 bucket we don't own, with a zipped folder containing files for each date.
We are using Snowflake as our data warehouse. Snowflake accepts gzip'd files, but does not ingest zip'd folders.
Is there a way to directly ingest the files into Snowflake that will be more efficient than copying them all into our own S3 bucket and unzipping them there, then pointing e.g. Snowpipe to that bucket? The data is on the order of 10GB per day, so copying is very doable, but would introduce (potentially) unnecessary latency and cost. We also don't have access to their IAM policies, so can't do something like S3 Sync.
I would be happy to write something myself, or use a product/platform like Meltano or Airbyte, but I can't find a suitable solution.
How about using SnowSQL to load the data into Snowflake, using a Snowflake stage (a table, user, or named stage) to hold the files?
https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage.html
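For illustration, here is a minimal sketch of that stage-based load through the Snowflake JDBC driver (SnowSQL runs the same PUT/COPY statements); the account, stage, table, and file-format details below are placeholders, not a definitive setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class LoadViaStage {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        Properties props = new Properties();
        props.put("user", "MY_USER");
        props.put("password", System.getenv("SNOWFLAKE_PASSWORD"));
        props.put("warehouse", "MY_WH");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Statement stmt = conn.createStatement()) {
            // Hold the files in a named internal stage...
            stmt.execute("CREATE STAGE IF NOT EXISTS my_stage");
            // ...upload the locally extracted, gzip'd files into it...
            stmt.execute("PUT file:///tmp/extracted/*.gz @my_stage AUTO_COMPRESS=FALSE");
            // ...then copy the staged files into the target table.
            stmt.execute("COPY INTO my_table FROM @my_stage FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)");
        }
    }
}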
I had a similar use case. I use an event-based trigger that runs a Lambda function every time there is a new zipped file in my S3 folder. The Lambda function opens the zipped file, gzips each individual file, and re-uploads them to a different S3 folder. Here's the full working code: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
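A rough Java sketch of the same idea (the linked article has the original, complete Python version), assuming the AWS SDK for Java v1 and placeholder bucket and key names; each entry is buffered in memory, so it has to fit within the Lambda's memory limit:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipToGzip {
    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();

    // Re-packages every entry of a zipped S3 object as its own gzip'd object.
    public static void repack(String bucket, String zipKey, String outPrefix) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(S3.getObject(bucket, zipKey).getObjectContent())) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;

                // Copy the uncompressed entry into an in-memory gzip stream (Java 11+ for transferTo).
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                    zip.transferTo(gz);
                }

                byte[] bytes = buf.toByteArray();
                ObjectMetadata meta = new ObjectMetadata();
                meta.setContentLength(bytes.length);
                S3.putObject(bucket, outPrefix + "/" + entry.getName() + ".gz",
                        new ByteArrayInputStream(bytes), meta);
            }
        }
    }
}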

Apache Flink - write Parquet file to S3

I have a Flink streaming pipeline that reads messages from Kafka; each message contains an S3 path to a log file. Using Flink async I/O I download the log file, parse it, and extract some key information from it. I now need to write this extracted data (HashMap<String, String>) as a Parquet file back to another bucket in S3. How do I do it?
I have completed everything up to the transformation and am using Flink 1.15. How to write in Parquet format is unclear to me, and some methods seem to be deprecated.
You should be using the FileSink. There are some examples in the documentation, but here's an example that writes protobuf data in Parquet format:
final FileSink<ProtoRecord> sink = FileSink
        .forBulkFormat(outputBasePath, ParquetProtoWriters.forType(ProtoRecord.class))
        .withRollingPolicy(OnCheckpointRollingPolicy.builder().build())
        .build();

stream.sinkTo(sink);
Flink includes support for Protobuf and Avro. Otherwise you'll need to implement a ParquetWriterFactory with a custom implementation of the ParquetBuilder interface.
The OnCheckpointRollingPolicy is the default for bulk formats like Parquet. There's no need to specify that unless you go further and include some custom configuration -- but I added it to the example to illustrate how the pieces fit together.
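If you do need the custom route, a minimal sketch of the wiring might look like this, where MyRecord and the MyRecordParquet.newWriter helper are hypothetical placeholders for your own type and your own org.apache.parquet.hadoop.ParquetWriter construction:

// MyRecord and MyRecordParquet.newWriter(OutputFile) are hypothetical placeholders.
ParquetBuilder<MyRecord> builder = out -> MyRecordParquet.newWriter(out);
ParquetWriterFactory<MyRecord> factory = new ParquetWriterFactory<>(builder);

final FileSink<MyRecord> customSink = FileSink
        .forBulkFormat(outputBasePath, factory)
        .build();

A DataStream<MyRecord> could then be written with sinkTo(customSink), exactly as in the example above.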

Snowflake and S3 Metadata

I have custom metadata properties on my s3 files such as:
x-amz-meta-custom1: "testvalue"
x-amz-meta-custom2: "whoohoo"
When these files are loaded into Snowflake, how do I access the custom properties associated with them? Google and the Snowflake documentation haven't turned up any gems yet.
Based on the docs, I think the only metadata you can access via the stage is the filename and row number (the METADATA$FILENAME and METADATA$FILE_ROW_NUMBER columns). https://docs.snowflake.com/en/user-guide/querying-metadata.html
You could possibly write something custom that picks up the S3 metadata, writes out the S3 filename along with that metadata, and then ingests it into another Snowflake table.
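For the "pick up the S3 metadata" part, here is a rough sketch with the AWS SDK for Java v1; the bucket and key are placeholders, and the output could be written out as CSV/JSON, loaded into a side table, and joined on METADATA$FILENAME:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.util.Map;

public class DumpS3UserMetadata {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // getUserMetadata() returns the x-amz-meta-* headers without the prefix,
        // e.g. {custom1=testvalue, custom2=whoohoo}.
        Map<String, String> meta = s3.getObjectMetadata("my-bucket", "path/to/file.csv")
                                     .getUserMetadata();

        // Emit filename plus metadata so it can be ingested into a separate Snowflake table.
        meta.forEach((k, v) -> System.out.println("path/to/file.csv," + k + "," + v));
    }
}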

How to merge multiple parquet files in Glue

I have a Glue job that writes Parquet files to S3 every 6 seconds, and S3 has a folder for each hour. At the end of the hour I want to merge all the files in that hour's partition and put the result in the same location. I don't want to use Athena tables because the job becomes slow. I am trying to use a Python Shell job, but so far I have not found a correct solution. Can someone help me with this?
The files are also Snappy-compressed.
Depending on how big your Parquet files are, and what the target size is – here's an idea to do this without Glue:
Set up an hourly CloudWatch cron rule that looks at the previous hour's directory and invokes a Lambda function.
Open each Parquet file and write its rows to a single new Parquet file (see the sketch at the end of this answer).
Write the resulting Parquet file to the S3 key and remove the parts.
Note there are some limitations/considerations with this design:
Your Parquet files need to stay within the limits of your Lambda's memory capacity. If you are aiming for merged parts of around 128 MB, this should be achievable.
The schemas of the separate Parquet files need to be identical for you to reliably "merge" them. If they are not, you need to look into each Parquet file's metadata footer, which contains the schema, and make sure the merged schema covers all the column chunks.
Because the S3 operation is not atomic, there may be a brief moment in which the new merged Parquet object has been uploaded but the old parts haven't been removed yet. If you don't need to query the data within that window, this shouldn't be a problem.
If you require Glue specifically, you may be able to just invoke a Glue job from the Lambda as opposed to trying to do it yourself from within Lambda.
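A rough sketch of the merge step referenced in the list above, assuming identical schemas and the parquet-avro and hadoop-aws libraries on the classpath; the paths are placeholders:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.util.List;

public class MergeParquetParts {
    // Reads every part file and rewrites the rows into a single Snappy-compressed file.
    public static void merge(List<Path> parts, Path merged) throws Exception {
        Configuration conf = new Configuration(); // picks up s3a:// credentials and settings
        ParquetWriter<GenericRecord> writer = null;
        try {
            for (Path part : parts) {
                try (ParquetReader<GenericRecord> reader =
                         AvroParquetReader.<GenericRecord>builder(part).withConf(conf).build()) {
                    GenericRecord record;
                    while ((record = reader.read()) != null) {
                        if (writer == null) {
                            // Open the writer lazily so we can reuse the schema of the first row.
                            writer = AvroParquetWriter.<GenericRecord>builder(merged)
                                    .withConf(conf)
                                    .withSchema(record.getSchema())
                                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                                    .build();
                        }
                        writer.write(record);
                    }
                }
            }
        } finally {
            if (writer != null) writer.close();
        }
    }
}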

Querying compressed files using BigQuery federated source

According to the BigQuery federated source documentation:
[...]or are compressed must be less than 1 GB each.
This would imply that compressed files are supported types for federated sources in BigQuery.
However, I get an error when trying to query a gz file in GCS.
I tested with an uncompressed file and it works fine. Are compressed files supported as federated sources in BigQuery, or have I misinterpreted the documentation?
Compression mode defaults to NONE and needs to be explicitly specified in the external table definition.
At the time of the question, this couldn't be done through the UI. This is now fixed and compressed data should be automatically detected.
For more background information, see:
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query
The interesting parameter is "configuration.query.tableDefinitions.[key].compression".
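As an illustration (the question used the UI, so take this as a sketch rather than the exact fix), here is how the compression could be set explicitly with the google-cloud-bigquery Java client; the GCS URI, schema, and table names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class GzExternalTable {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Schema schema = Schema.of(Field.of("col1", StandardSQLTypeName.STRING));

        // Explicitly declare GZIP compression for the external (federated) data source;
        // this is the client-side counterpart of tableDefinitions.[key].compression.
        ExternalTableDefinition definition = ExternalTableDefinition
                .newBuilder("gs://my-bucket/data.csv.gz", schema, FormatOptions.csv())
                .setCompression("GZIP")
                .build();

        bigquery.create(TableInfo.of(TableId.of("my_dataset", "gz_external"), definition));
    }
}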