Reading a YAML properties file from S3 - amazon-s3

I have a YAML properties file stored in an S3 bucket. In Mule 4 I can read this file using the S3 connector. I need to use the properties defined in this file (to read dynamic values and use them in Mule 4) in DB connectors. I am not able to create properties from this file in a way that lets me reference them, for example as ${dbUser}, in a Mule configuration or flow. Any guidance on how I can accomplish this?

You will not be able to use the S3 connector to do that. The connector can read the file in an operation at execution time, but property placeholders, like ${dbUser}, have to be defined earlier, at deployment time.
You might be able to read the value into a variable (for example: #[vars.dbUser]) and use the variable in the database connector configuration. That is called a dynamic configuration, because it is evaluated dynamically at execution time.

Related

Can Confluent's S3 Sink Connector for Kafka Connect write topics to a nested (not a top-level) folder in an S3 bucket using `topics.dir`?

For example, if I set topics.dir to the value thisistoplevel/thisisnested, will the connector work?
The documentation for the topics.dir configuration property says:
Top level directory to store the data ingested from Kafka.
But it seems like a strange restriction to make. Perhaps the wording "Top level directory" is meant to mean something more like "the top-most directory under which topics will be written".
Maybe someone else has tested this or uses a nested directory in production?
In the context of S3, there is no such thing as "nesting" of folders; S3 does not have "folders" at all. The value is really just a string prefix within the bucket.
Yes, the prefix can be as long as you need, and topic names will be appended to that path when the objects are written.
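For illustration, here is a minimal sketch of registering the S3 sink with a nested topics.dir via the Kafka Connect REST API. It assumes Java 15+ (for text blocks) and a Connect worker at localhost:8083; the topic, bucket, region, and the other connector settings are placeholders, not values from the question.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateNestedS3Sink {
    public static void main(String[] args) throws Exception {
        // Connector definition with a nested prefix in topics.dir.
        // Topic, bucket, region and Connect URL are placeholders.
        String body = """
            {
              "name": "s3-sink-nested",
              "config": {
                "connector.class": "io.confluent.connect.s3.S3SinkConnector",
                "topics": "my-topic",
                "s3.bucket.name": "my-bucket",
                "s3.region": "us-east-1",
                "storage.class": "io.confluent.connect.s3.storage.S3Storage",
                "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
                "flush.size": "1000",
                "topics.dir": "thisistoplevel/thisisnested"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());

        // Objects should then land under keys shaped roughly like:
        //   thisistoplevel/thisisnested/my-topic/partition=0/my-topic+0+0000000000.json
    }
}
```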

FLINK: Is it possible in the same Flink job to read data from a Kafka topic (file names) and then read the file contents from Amazon S3?

I have a use case where I need to process data from files stored in S3 and write the processed data to local files.
The S3 files are constantly added to the bucket.
Each time a file is added to the bucket, its full path is published to a Kafka topic.
I want to achieve the following in a single job:
1. Read the file names from Kafka (unbounded stream).
2. An evaluator that receives the file name, reads the content from S3 (second source) and creates a DataStream.
3. Process the DataStream (adding some logic to each row).
4. Sink to file.
I managed to do the first, third and fourth parts of the design.
Is there a way to achieve this?
Thanks in advance.
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
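As a rough sketch of that simpler alternative (not a definitive implementation): a bounded Flink job that takes one S3 path as a program argument, applies the per-row logic, and writes to a local file. It assumes a recent Flink version with the unified FileSource/FileSink APIs and an S3 filesystem plugin installed on the cluster; the output path and the toUpperCase step are placeholders for the real logic.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SingleFileIngestJob {
    public static void main(String[] args) throws Exception {
        // The S3 path of the file to ingest, taken from the Kafka message
        // by whatever launches this job.
        final String inputPath = args[0];

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bounded input, so run in batch mode; the job finishes once the file is processed.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path(inputPath))
                .build();

        FileSink<String> sink = FileSink
                .forRowFormat(new Path("file:///tmp/processed"), new SimpleStringEncoder<String>())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-file")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String line) {
                   return line.toUpperCase(); // stand-in for the real per-row logic
               }
           })
           .sinkTo(sink);

        env.execute("ingest-" + inputPath);
    }
}
```

The orchestration piece, consuming the Kafka topic of paths and submitting one such job per file, would then live outside Flink, for example in a small consumer that calls the Flink CLI or REST API.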
This is possible to implement in general, but as David Anderson has already suggested, there is currently no straightforward way to do this with the vanilla Flink connectors.
Another approach could be writing the pipeline in Apache Beam, which already supports this and can use Flink as a runner (which is proof that it can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.

Download a large number of files (400k) from an S3 bucket into Azure Data Lake Gen2 using Azure Data Factory

I need to download a large number of files (around 400k) from an S3 bucket. I have the paths stored in a CSV file. Some of the paths may not exist.
The two options I see are:
Use the ForEach activity and somehow pass the contents of the file to it. But I think this would flood my monitor pane with a huge number of runs, and it feels like it is meant for smaller pipelines.
Use the listOfFiles option which is supported in the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Data Lake Gen2 (if anybody knows why, please let me know as well).
I have tried using the listOfFiles way, but the pipeline fails once it finds the first missing file. The fault tolerance options contain a "skip missing file" option but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can I approach this issue with ADF? I'm looking for a solution that uses the predefined transformations; ideally I would like to not involve Azure Batch or Azure Functions for such a simple task.

Snowflake and S3 Metadata

I have custom metadata properties on my S3 files, such as:
x-amz-meta-custom1: "testvalue"
x-amz-meta-custom2: "whoohoo"
When these files are loaded into Snowflake, how do I access the custom properties associated with the files? Google and the Snowflake documentation haven't turned up any gems yet.
Based on the docs, I think the only metadata you can access via the stage is the filename and row number (METADATA$FILENAME and METADATA$FILE_ROW_NUMBER). https://docs.snowflake.com/en/user-guide/querying-metadata.html
You could possibly write something custom that picks up the S3 metadata, writes out the S3 filename along with that metadata, and then ingests it back into another Snowflake table.
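For the "pick up the S3 metadata" part, here is a minimal sketch using the AWS SDK for Java v2; the bucket and key are placeholders. HeadObject returns the user-defined metadata with the x-amz-meta- prefix stripped, and the printed rows could then be staged and loaded into a separate Snowflake table keyed by filename.

```java
import java.util.Map;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

public class DumpS3UserMetadata {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Placeholder bucket/key; in practice you would iterate over the
            // same files that the Snowflake stage loads.
            HeadObjectResponse head = s3.headObject(HeadObjectRequest.builder()
                    .bucket("my-bucket")
                    .key("data/file1.csv")
                    .build());

            // User metadata comes back without the x-amz-meta- prefix,
            // e.g. {custom1=testvalue, custom2=whoohoo}.
            Map<String, String> userMetadata = head.metadata();
            userMetadata.forEach((k, v) ->
                    System.out.println("data/file1.csv," + k + "," + v));
        }
    }
}
```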

AWS Data Pipeline postStepCommand unable to access INPUT1_STAGING_DIR

In the EMR activity of a Data Pipeline, I am trying to use postStepCommand (as documented here) to invoke a shell script. As part of it I am trying to access the standard directory paths ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR}.
But it seems like it's not able to access their values. Is this by design?