read parquet data from s3 bucket using NiFi - amazon-s3

Hi guys!
I'm just starting to learn NiFi, so please don't throw stones, just help or guide me. I need to read Parquet data from an S3 bucket, but I don't understand how to set up the ListS3 and FetchS3Object processors to read the data.
The full path looks like this:
s3://inbox/prod/export/date=2022-01-07/user=100/
2022-01-09 06:51:23 23322557 cro.parquet
I'll write the data to a SQL database; I don't have problems with that part.
I tried to configure the ListS3 processor myself, and I don't think the result is very good:
bucket inbox
aws_access_key_id
aws_secret_access_key
region US EAST
endpoint override URL http://s3.wi-fi.ru:8080

What I would do is test the Access Key ID and Secret Key outside of NiFi to make sure they work. If they work fine, then the issue is in the NiFi configuration. If they don't, get new values that do work and provide them to NiFi; it will then have a better shot of working.
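A minimal sketch of such a check outside NiFi, assuming Python with the third-party boto3 package is available. The helper names split_s3_uri and list_keys are made up for illustration; the bucket, path and endpoint override would be the ones from the question.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(uri, access_key, secret_key, endpoint_url=None):
    """List up to 10 object keys under the URI, without involving NiFi.

    Requires boto3 and network access to the S3 endpoint.
    """
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        endpoint_url=endpoint_url,  # the endpoint override URL, if one is used
    )
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=10)
    return [obj["Key"] for obj in response.get("Contents", [])]
```

If list_keys returns the parquet file, the keys and endpoint are fine. The same split should then carry over to ListS3: the bucket name goes into Bucket and the rest of the path into Prefix.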

FLINK: Is it possible in the same flink job to read data from kafka topic (file names) and then read files content from amazon s3?

I have a use case where I need to process data from files stored in S3 and write the processed data to local files.
The s3 files are constantly added to the bucket.
Each time a file is added to the bucket, the full path is published to a kafka topic.
I want to achieve the following in a single job:
1. Read the file names from Kafka (unbounded stream).
2. An evaluator that receives the file name, reads the content from S3 (second source) and creates a DataStream.
3. Process the DataStream (adding some logic to each row).
4. Sink to file.
I managed to do the first, third and fourth parts of the design.
Is there a way to achieve this?
Thanks in advance.
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
This is possible to implement in general, but as David Anderson has already suggested, there is currently no straightforward way to do this with the vanilla Flink connectors.
Another approach could be to write the pipeline in Apache Beam, which already supports this and can use Flink as a runtime (which is proof that this can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.
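The simpler alternative mentioned above, one bounded job per file, could be driven by a small launcher. This is only a sketch: the jar name ingest-job.jar and the --input argument are hypothetical, and the job itself would have to read that argument and use it as its source path. The Kafka consumer that supplies the paths is left out.

```python
def build_flink_command(s3_path, jar="ingest-job.jar", flink_bin="flink"):
    """Build the argv for one bounded Flink job that reads a single file.

    The jar name and the --input argument are hypothetical placeholders.
    """
    if not s3_path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {s3_path!r}")
    # "flink run -d" submits in detached mode; arguments after the jar
    # are passed on to the job's main method.
    return [flink_bin, "run", "-d", jar, "--input", s3_path]


def commands_for(paths):
    """Map each file path (e.g. one Kafka message value) to a submit command."""
    return [build_flink_command(p.strip()) for p in paths if p.strip()]
```

Each command could then be run with subprocess.run(cmd, check=True) as the paths arrive.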

terraform reference existing s3 bucket and dynamo table

From my Terraform script, I am trying to get hold of data for existing resources, such as the ARN of an existing DynamoDB table and the bucket ID of an existing S3 bucket. I've tried to use terraform_remote_state for S3, but it doesn't fit my requirements because it requires a key, and I haven't found anything yet that would work for DynamoDB.
Is there a solution that would work for both, or would there be two separate solutions?
Many thanks in advance.
Remote state is not the concept you need - that's for storage of the tfstate file. What you require is a "data source":
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/s3_bucket
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/dynamodb_table
In Terraform, you use "Resources" to declare what things need to be created (if they don't exist), and "Data Sources" to read information from things that already exist and are not managed by Terraform.

AWS Glue check file contents correctness

I have a project in AWS to insert data from some files, which will be in S3, into Redshift. The point is that the ETL has to be scheduled each day to find new files in S3 and then check whether those files are correct. However, this has to be done with custom code, as the files can have different formats depending on their kind, provider, etc.
I see that AWS Glue lets you schedule, crawl and do the ETL. However, I'm lost as to how one can write custom code for the ETL and parse the files to check their correctness before doing the copy instruction from S3 to Redshift. Do you know if that can be done, and how?
Another issue is that if the correctness check is OK, the system should upload the data from S3 to a website via some API. But if it's not, the file should be left in an ftp email. Here again, do you know if that can be done with AWS Glue as well, and how?
Many thanks!
You can write your Glue/Spark code, upload it to S3 and create a Glue job referring to this script/library. Anything you want to write in Python can be done in Glue. It's just a wrapper around Spark, which you can drive from Python (PySpark).
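As a rough illustration of the kind of custom check such a script could run before the copy to Redshift, here is a plain-Python sketch. The column layout in EXPECTED_COLUMNS and the rule that the second column must be numeric are made-up assumptions; real files would need their own rules per kind and provider.

```python
import csv
import io

# Hypothetical expected layout for one kind of file.
EXPECTED_COLUMNS = ["id", "amount", "currency"]


def validate_csv(text, expected_columns=EXPECTED_COLUMNS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != expected_columns:
        return [f"unexpected header: {header}"]
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected_columns):
            problems.append(
                f"line {line_no}: expected {len(expected_columns)} fields, got {len(row)}"
            )
            continue
        try:
            float(row[1])  # assumed rule: the second column must be numeric
        except ValueError:
            problems.append(f"line {line_no}: amount is not a number: {row[1]!r}")
    return problems
```

In a job, the text would come from the S3 object, and only files with an empty problem list would go on to the Redshift copy step; the others would be routed to the error handling.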

Extract data from Marklogic 8.0.6 to AWS S3

I'm using Marklogic 8.0.6 and we also have JSON documents in it. I need to extract a lot of data from Marklogic and store it in AWS S3. We tried to run "mlcp" locally and then upload the data to AWS S3, but it's very slow because it generates a lot of files.
Our Marklogic platform is already connected to S3 to perform backups. Is there a way to extract a specific database to AWS S3?
It would be OK for me to have one big file with one JSON document per line.
Thanks,
Romain.
I don't know about getting it to s3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
S3:// is a native file type in MarkLogic. So you can also iterate through all your docs and export them with xdmp:save("s3://...").
If you want to make aggregates, then you may want to marry this idea with Sam's suggestion of CORB2 to control the process and assist in grouping your whole database into multiple manageable aggregate documents. Then use a post-batch task to run xdmp:save.
Thanks guys for your answers. I did not know about CORB2, this is a great solution! But unfortunately, due to bad I/O, I prefer a solution that writes directly to S3.
I can use a basic MarkLogic query and dump to s3:// with the native connector, but I always hit a memory error, even when launching it with the "spawn" function to generate a background process.
Do you have any XQuery example to extract each document to S3 one by one without running into memory errors?
Thanks

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, has to either write directly to S3 or do what it used to and then use multipart uploading. In fact, I would think the code would just write directly to S3 and not worry about uploading.
Can anyone tell me if Pipelines can use multipart uploading and if not, can you suggest whether the correct approach is to have the program write directly to S3 or to continue to write to local storage and then perhaps have a separate program be invoked within the same Pipeline which will do the multipart uploading?
The answer, based on AWS support, is that indeed 5 gig files can't be uploaded directly to S3. And there is no way currently for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5GB limit imposed by S3 for each file-part put.
You need to write your own script wrapping AWS CLI or S3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects in a folder.
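A sketch of what such a wrapper script might do, written here with boto3 instead of the AWS CLI or S3cmd mentioned above (an assumption, not what AWS support suggested). plan_parts only illustrates how a large file splits into parts within the S3 multipart limits; upload_large_file relies on boto3's upload_file, which switches to multipart on its own above the configured threshold. The 64 MiB part size is an arbitrary choice.

```python
import math

MIB = 1024 * 1024
GIB = 1024 * MIB

# S3 multipart limits: each part is at most 5 GiB, every part except the
# last is at least 5 MiB, and one upload has at most 10,000 parts.
MAX_PART_SIZE = 5 * GIB
MIN_PART_SIZE = 5 * MIB
MAX_PARTS = 10_000


def plan_parts(file_size, part_size=64 * MIB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part_size must be between 5 MiB and 5 GiB")
    # Grow the part size if the file would otherwise need too many parts.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))
    return [
        (offset, min(part_size, file_size - offset))
        for offset in range(0, file_size, part_size)
    ]


def upload_large_file(path, bucket, key):
    """Upload a local file; boto3 uses multipart above the threshold.

    Requires the third-party boto3 package, credentials and network access.
    """
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(multipart_threshold=64 * MIB, multipart_chunksize=64 * MIB)
    boto3.client("s3").upload_file(path, bucket, key, Config=config)
```

A script like this could be invoked from a shell activity after the Java code has finished writing the local file.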