Apache Flink - write Parquet file to S3

I have a Flink streaming pipeline that reads messages from Kafka; each message contains an S3 path to a log file. Using Flink async I/O I download the log file, parse it, and extract some key information from it. I now need to write this extracted data (a HashMap<String, String>) as a Parquet file to another bucket in S3. How do I do that?
I have completed everything up to the transformation, and I am using Flink 1.15. The Parquet writing step is unclear to me, and some of the methods appear to be deprecated.

You should be using the FileSink. There are some examples in the documentation, but here's an example that writes protobuf data in Parquet format:
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.protobuf.ParquetProtoWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

// outputBasePath is a Path, e.g. new Path("s3://bucket/output/")
final FileSink<ProtoRecord> sink = FileSink
        .forBulkFormat(outputBasePath, ParquetProtoWriters.forType(ProtoRecord.class))
        .withRollingPolicy(OnCheckpointRollingPolicy.builder().build())
        .build();

stream.sinkTo(sink);
Flink includes support for Protobuf and Avro. Otherwise you'll need to implement a ParquetWriterFactory with a custom implementation of the ParquetBuilder interface.
The OnCheckpointRollingPolicy is the default for bulk formats like Parquet. There's no need to specify that unless you go further and include some custom configuration -- but I added it to the example to illustrate how the pieces fit together.
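Since the data in the question is a HashMap<String, String> rather than Protobuf, the Avro route is probably the easier fit. Here is a rough sketch of that approach (not from the original answer): convert each map to an Avro GenericRecord and pass the schema to AvroParquetWriters. The record name, field names, and S3 path below are invented for illustration, and the flink-parquet and flink-avro dependencies are assumed to be on the classpath.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import java.util.HashMap;

public class ExtractedLogParquetSink {

    // Hypothetical schema: one string field per key extracted from the log file.
    static final Schema SCHEMA = SchemaBuilder.record("ExtractedLog")
            .namespace("com.example")
            .fields()
            .requiredString("requestId")
            .requiredString("status")
            .endRecord();

    static void attachSink(DataStream<HashMap<String, String>> extracted) {
        DataStream<GenericRecord> records = extracted
                .map(m -> {
                    GenericRecord r = new GenericData.Record(SCHEMA);
                    r.put("requestId", m.get("requestId"));
                    r.put("status", m.get("status"));
                    return r;
                })
                // GenericRecord is not handled by Flink's type extraction,
                // so the Avro type info has to be supplied explicitly.
                .returns(new GenericRecordAvroTypeInfo(SCHEMA));

        FileSink<GenericRecord> sink = FileSink
                .forBulkFormat(new Path("s3://my-output-bucket/extracted/"),
                        AvroParquetWriters.forGenericRecord(SCHEMA))
                .build();

        records.sinkTo(sink);
    }
}

With bulk formats like this the part files are only finalized when a checkpoint completes, so checkpointing has to be enabled for output to actually show up in S3.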

Related

FLINK: Is it possible in the same flink job to read data from kafka topic (file names) and then read files content from amazon s3?

I have a use case where I need to process data from files stored in S3 and write the processed data to local files.
Files are constantly added to the S3 bucket.
Each time a file is added to the bucket, its full path is published to a Kafka topic.
I want to achieve the following in a single job:
1. Read the file names from Kafka (an unbounded stream).
2. An evaluator that receives a file name, reads the content from S3 (a second source), and creates a DataStream.
3. Process the DataStream (adding some logic to each row).
4. Sink to a file.
I managed to do the first, third, and fourth parts of the design.
Is there a way to achieve this?
Thanks in advance.
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
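A rough sketch of what such a bounded per-file job could look like, assuming the S3 path is passed as a program argument (the class name, argument handling, and per-row logic are placeholders, and print() stands in for the real local file sink):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SingleFileJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // args[0] is the S3 path published to Kafka, e.g. "s3://my-bucket/input/part-0001.log"
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path(args[0]))
                .build();   // no monitoring interval, so the source is bounded

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-file")
                .map(line -> line.toUpperCase())   // the per-row logic goes here
                .print();                          // replace with a FileSink for real output

        env.execute("ingest " + args[0]);
    }
}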
This is possible to implement in general but, as David Anderson has already suggested, there is currently no straightforward way to do this with the vanilla Flink connectors.
Another approach would be to write the pipeline in Apache Beam, which already supports this and can use Flink as a runner (which is proof that this can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.
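For reference, a Beam pipeline along the lines suggested above might look roughly like this (the broker address, topic name, and per-row logic are made up, and the windowed file sink at the end is omitted); the matchAll / readMatches / readFiles chain is the part that reads file contents driven by an unbounded stream of paths:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaDrivenFileReader {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // File paths (e.g. "s3://bucket/key") arrive as the values of Kafka records.
        PCollection<String> paths = p
                .apply(KafkaIO.<String, String>read()
                        .withBootstrapServers("broker:9092")
                        .withTopic("file-names")
                        .withKeyDeserializer(StringDeserializer.class)
                        .withValueDeserializer(StringDeserializer.class)
                        .withoutMetadata())
                .apply(Values.create());

        paths
                .apply(FileIO.matchAll())       // resolve each path to file metadata
                .apply(FileIO.readMatches())    // open the matched files
                .apply(TextIO.readFiles())      // emit their lines
                .apply(MapElements.into(TypeDescriptors.strings())
                        .via(line -> line.toUpperCase()));   // per-row logic goes here

        p.run().waitUntilFinish();
    }
}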

Airbyte ETL: connection between HTTP API source and BigQuery

I have a task in hand where I am supposed to create a Python-based HTTP API source connector for Airbyte. The connector returns a response that contains links to some zip files.
Each zip file contains a CSV file, which is supposed to be uploaded to BigQuery.
I have now built the connector, and it returns the URL of the zip file.
The main question is how to send the underlying CSV file to BigQuery.
I can certainly unzip or even read the CSV file in the Python connector, but I am stuck on the part of sending it to BigQuery.
P.S. If you can tell me how to send the CSV to Google Cloud Storage instead, that would be awesome too.
When you are building an Airbyte source connector with the CDK, your connector code must output records that will be sent to the destination, BigQuery in your case. This decouples the extraction logic from the loading logic and makes your source connector destination-agnostic.
I'd suggest the following high-level logic in your source connector's implementation:
Call the source API to retrieve the zip's url
Download + unzip the zip
Parse the CSV file with Pandas
Output parsed records
This is under the assumption that all CSV files have the same schema. If not you'll have to declare one stream per schema.
A great guide with more details on how to develop a Python connector is available here.
Once your source connector outputs AirbyteRecordMessages you'll be able to connect it to BigQuery and choose the loading method that best fits your needs (Standard or GCS staging).

What are valid (and invalid) characters for Avro Schema namespaces

I have an Avro schema with a namespace of "ca.gms.api-event-log". I have used this schema to serialize messages into Kafka, successfully registered that schema with the Kafka Schema Registry and am using a Kafka Connector to send that data to Amazon S3 as .avro files. So far, no issues.
I am now attempting to copy that data from AWS S3 to Azure using Azure Data Factory, and it's complaining about the following:
Failed to deserialize Avro source file 'topics/api-event-log/partition=0/api-event-log+0+0000000000.avro'. This could be caused by invalid Avro data. Check the data and try again. Namespace 'ca.gms.api-event-log' contains invalid characters. . Activity ID: 12a7dda0-8cb7-4c79-a070-d366fddb1c00
Does "ca.gms.api-event-log" really contain invalid characters? Are hyphens not allowed? The Apache Avro spec seems to indicate any valid JSON string should work: https://avro.apache.org/docs/current/spec.html
I've noticed that hyphens are not allowed in the Python Avro client, but are accepted by the Java API.
So it ultimately depends on the parser being used, but I'd say the rule of thumb is to follow the naming rules of Java packages, where hyphens also aren't allowed.
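For reference, the spec's grammar for each dot-separated part of a name or namespace is [A-Za-z_][A-Za-z0-9_]*, which is what the stricter parsers enforce. A small standalone check that mirrors that rule (it deliberately does not call the Avro library, since the libraries disagree):

import java.util.regex.Pattern;

public class AvroNamespaceCheck {
    // One dot-separated segment of an Avro name/namespace per the spec's grammar.
    private static final Pattern NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidNamespace(String ns) {
        for (String part : ns.split("\\.", -1)) {
            if (!NAME.matcher(part).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidNamespace("ca.gms.api_event_log"));  // true
        System.out.println(isValidNamespace("ca.gms.api-event-log"));  // false, hyphen is outside the grammar
    }
}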
Note: You should probably try using a Kafka Connector capable of writing to Azure directly, rather than paying S3 storage + transfer fees. It's also not clear why the files are even being opened to check schemas if you are just copying raw files.

How does Hive understand the size of the input data?

I'm trying to understand Hive internals. Which class/method does Hive use to determine the size of a dataset in S3?
Hive is built on top of Hadoop and uses Hadoop's filesystem layer as its API for input/output.
More precisely, it has an InputFormat and an OutputFormat, configurable when you create a table, that read and write data through a FileSystem object (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html).
The FileSystem object abstracts most aspects of file management, so Hive does not have to worry about whether a file is on S3 or HDFS; the Hadoop layer takes care of that.
When dealing with files, each file has a path that is a URL (for instance, hdfs:///dir/file or s3://bucket/path).
The Path class resolves the filesystem using its getFileSystem method, which would return S3FileSystem for an S3 URL.
From the FileSystem object, Hive can get a file's size by retrieving its FileStatus and calling getLen.
If you want to see where in the Hive source this happens, it is usually in org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is the default setting for hive.input.format.
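To make the call chain concrete, here is a small standalone sketch of the Hadoop API described above (the bucket and key are made up); Hive's input formats go through the same FileSystem/FileStatus calls when they compute input sizes and splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputSizeProbe {
    public static void main(String[] args) throws Exception {
        Path path = new Path("s3a://my-bucket/warehouse/my_table/part-00000");

        // getFileSystem resolves the concrete implementation from the URL scheme
        // (an S3 filesystem for s3a://, DistributedFileSystem for hdfs://, ...).
        FileSystem fs = path.getFileSystem(new Configuration());

        FileStatus status = fs.getFileStatus(path);
        System.out.println(path + " is " + status.getLen() + " bytes");
    }
}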

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code that are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to keep writing locally and then rely on a separate multipart upload step -- in fact, I would think the code should just write directly to S3 and not worry about uploading at all.
Can anyone tell me whether Pipelines can use multipart uploading, and if not, whether the correct approach is to have the program write directly to S3, or to continue writing to local storage and then have a separate program invoked within the same Pipeline do the multipart uploading?
The answer, based on AWS support, is that 5 GB files indeed can't be uploaded directly to S3 this way. And there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5GB limit imposed by S3 for each file-part put.
You need to write your own script wrapping AWS CLI or S3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects in a folder.
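If the Java code were changed to write directly to S3, as the question considers, the AWS SDK's TransferManager takes care of the multipart mechanics automatically. A rough sketch with the v1 SDK (the bucket, key, and file path are invented):

import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class LargeFileUpload {
    public static void main(String[] args) throws InterruptedException {
        TransferManager tm = TransferManagerBuilder.standard().build();
        try {
            // TransferManager switches to multipart upload automatically for large files,
            // so files over 5 GB are handled without any extra code here.
            Upload upload = tm.upload("my-bucket", "exports/big-output.dat",
                    new File("/data/big-output.dat"));
            upload.waitForCompletion();
        } finally {
            tm.shutdownNow();
        }
    }
}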