NiFi data insertion into s3 subdirectory - amazon-s3

I have a flow where I extract data from the database, convert the Avro to CSV format, and push the CSV into an S3 bucket that has subfolders in it. My S3 structure looks like the following:
As you can see in the screenshot above, my files are going into a blank folder (highlighted in red) instead of into the subfolder called 'Thermal'. Please see my PutS3Object settings:
The final S3 path I want my files to go into is: export-csv-vehicle-telemetry/vin11/Thermal
What settings should I change in my processor so the file goes directly inside the 'Thermal' folder?

Use the Bucket name as export-csv-vehicle-telemetry/vin15/Thermal instead of export-csv-vehicle-telemetry/vin15/Thermal/.
The trailing slash is not required when specifying the bucket name.
BTW, your screenshot shows a vin11 directory instead of vin15. Check whether that is correct.
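For reference, S3 "folders" are only key prefixes, so the same destination can be expressed outside NiFi as a bucket plus an object key. A minimal boto3 sketch using the target path from the question (the local file name is an assumption):

import boto3

s3 = boto3.client("s3")

# 'vin11/Thermal/' is not a real directory, just a prefix of the object key.
s3.upload_file(
    Filename="telemetry.csv",                  # hypothetical local CSV
    Bucket="export-csv-vehicle-telemetry",
    Key="vin11/Thermal/telemetry.csv",
)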

Related

AWS S3 batch operation error: Task target couldn't be URL decoded

I need to restore a lot of objects from AWS S3 Glacier Deep Archive, so I am trying to use an S3 Batch Operations job. For that I use Python code to create a manifest as a CSV with two columns: Bucket,Key.
My first issue: some keys contain a comma, so the job failed.
To (partially) solve this issue I just cut the CSV file to keep only the first two columns, hoping that not many files are involved.
But now I have another issue:
ErrorMessage: Task target couldn't be URL decoded
Any idea?
As mentioned at https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the object keys in the manifest CSV must be URL encoded. With URL encoding, the , character in a key name is converted to %2C, so the resulting file is valid CSV even when keys contain commas.
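A minimal sketch of generating such a manifest in Python (the bucket and key names below are made up for illustration):

import csv
from urllib.parse import quote

# Hypothetical input: (bucket, key) pairs, some keys containing commas or spaces.
objects = [
    ("my-archive-bucket", "reports/2023,Q1/summary.csv"),
    ("my-archive-bucket", "reports/2023 Q2/summary.csv"),
]

with open("manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for bucket, key in objects:
        # URL-encode the key so ',' becomes %2C (and ' ' becomes %20);
        # '/' is left unencoded so the key still reads as a path.
        writer.writerow([bucket, quote(key, safe="/")])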

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

I am trying to generate a file from a DataFrame that I have created in AWS Glue and give it a specific name. Most answers on Stack Overflow use filesystem modules, but here the CSV file is generated in S3. I also want to name the file while generating it, not rename it after it is generated. Is there any way to do that?
I have tried using df.save('s3://PATH/filename.csv'), which actually creates a new directory in S3 named filename.csv and then writes part-*.csv files inside that directory:
df.repartition(1).write.mode('append').format('csv').option("header", "true").save('s3://PATH')
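A common workaround (not part of the original post; the bucket and prefixes below are assumptions) is to let Spark write a single part file to a temporary prefix and then copy it to the desired key with boto3:

import boto3

# df: the Spark DataFrame from the question.
# Assumed locations; replace with your own bucket and prefixes.
bucket = "my-glue-output-bucket"
tmp_prefix = "tmp/export/"
target_key = "exports/filename.csv"

# 1) Write a single part file to a temporary prefix.
df.repartition(1).write.mode("overwrite").option("header", "true") \
    .csv(f"s3://{bucket}/{tmp_prefix}")

# 2) Find the part-*.csv object and copy it to the desired key.
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)
part_key = next(obj["Key"] for obj in listing["Contents"]
                if obj["Key"].endswith(".csv"))
s3.copy_object(Bucket=bucket, Key=target_key,
               CopySource={"Bucket": bucket, "Key": part_key})
s3.delete_object(Bucket=bucket, Key=part_key)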

Exporting large file from BigQuery to Google cloud using wildcard

I have an 8 GB table in BigQuery that I'm trying to export to Google Cloud Storage (GCS). If I specify the URI as is, I get an error:
Errors:
Table gs://***.large_file.json too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data. (error code: invalid)
Okay... I'm specifying * in the file name, but it exports into 2 files: one of 7.13 GB and one of ~150 MB.
UPD: I thought I should get about 8 files of 1 GB each? Am I wrong? Or what am I doing wrong?
P.S. I tried this in the web UI as well as with the Java library.
For exports above a certain size, BigQuery will write multiple GCS files - that's why it asks for the "*" glob.
Once you have multiple files in GCS, you can join them into 1 with the compose operation:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
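The same composition can also be done with the Python client library; a small sketch using the placeholder names from the gsutil example above:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket")   # placeholder bucket name, as in the gsutil example

# Compose up to 32 source objects into one composite object.
composite = bucket.blob("composite")
composite.compose([bucket.blob("obj1"), bucket.blob("obj2")])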
To export the table to GCS you go to the table and click EXPORT > Export to GCS.
This opens the export dialog.
In Select GCS location you define the bucket, the folder, and the file.
For instance, suppose you have a bucket named daria_bucket (use only lowercase letters, numbers, hyphens (-), and underscores (_); dots (.) may be used to form a valid domain name) and want to save the file(s) in the root of the bucket with the name test. Then you write (in Select GCS location):
daria_bucket/test.csv
Because the export is too big for a single file, you're getting that error. To fix it, you'll have to break the export into multiple files using a wildcard. So you'll need to add *, like this:
daria_bucket/test*.csv
This is going to store, inside the bucket daria_bucket, all the data extracted from the table in more than one file, named test000000000000, test000000000001, test000000000002, ... testX.
In my case (more than 1 year after you asked the question), using a random table of 1.25 GB, I got 16 files of about 80.3 MB each.
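The same wildcard export can be done programmatically; a minimal sketch with the Python BigQuery client, where the project, dataset, and table names are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Assumed identifiers for illustration.
table_id = "my-project.my_dataset.large_table"
destination_uri = "gs://daria_bucket/test*.csv"   # '*' lets BigQuery shard the export

extract_job = client.extract_table(table_id, destination_uri, location="US")
extract_job.result()  # waits for the export to finish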

cfdirectory replacing spaces with + characters when action=list on an S3 folder

I am uploading a file whose name contains spaces to Amazon S3 using cffile action="upload". The file name is burger+beans n beetroot.jpg.
As you can see, the name contains spaces and a plus sign.
When I read the directory, to list the contents, the file name returned by ColdFusion in the query is: burger+beans+n+beetroot.jpg. However, when viewing the file using Amazon S3 Browser, it is correctly listed as: burger+beans n beetroot.jpg. So it appears ColdFusion is replacing the spaces with + signs.
Does anyone know why this happens and if there is a way to disable this? I tried using both the DirectoryList() method as well as the <cfdirectory action="list"> tag, and both do this.
Please note: I am aware that the file name could be cleaned up before processing - that's a workaround, but not the solution I am looking for. Thanks!
I believe this is not a CF problem; it's an S3 problem. They send out their file names escaped. Which makes this a non-answer.
I created a folder in an S3 bucket. Then I uploaded a file named burger+beans n beetroot.jpg. I can see the file properly named in the AWS console. I select it, then in the Actions menu select Download. I get the modal window. Take a look at the URL in the browser footer - the file name is escaped.
I right-click their link and choose "Save Link As..." - the file name is escaped as well.
So I don't think there is anything you can do once the file is up there. You'll need to clean it before uploading. I know it's not what you want to hear.
Try URL encoding the filename, so the literal + sign is converted to %2B and can no longer be confused with an encoded space. You could use URLEncodedFormat, but make sure the path to the file isn't URL-encoded as well.
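To illustrate the point (a Python sketch, since the same encoding functions exist in most languages): once the listing has turned both the real + and the spaces into +, they can no longer be told apart, whereas encoding the name before upload keeps them distinct.

from urllib.parse import quote, unquote_plus

original = "burger+beans n beetroot.jpg"
listed   = "burger+beans+n+beetroot.jpg"   # what the directory listing returns

# Decoding the listed name turns every '+' into a space,
# so the genuine '+' in the original name is lost:
print(unquote_plus(listed))    # 'burger beans n beetroot.jpg'

# Encoding the name before upload keeps '+' (%2B) and space (%20) distinct:
print(quote(original))         # 'burger%2Bbeans%20n%20beetroot.jpg'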

BigQuery loading batch folders error

I'm trying to load the files from a group of folders in one go. All the folders that I want to load start with 20140911, so I set
sourceURI = 'gs://ybbi/bi_landing_zone/files_to_load/app/reports/app_network_analytics_report/201409011*'
but I get the error:
ERROR: Invalid path: gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/
20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0 is one of these folders (I don't know why the error prints the full name of this specific folder).
In some other folder trees it works, but for this specific folder tree it returns the same error.
I know that Cloud Storage doesn't have real folders and the 'folder' is just part of the object name, but you understand what I mean.
Is this a bug?
Without more information, it looks like you have an object named gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/ that is not a CSV/JSON file. Some tools create these dummy objects in order to simulate directories, and BigQuery requires every object that matches the input glob path to be an importable file.
One solution would be to change the glob path to include a narrower set of files. You can pass multiple paths if that makes things easier. For example, you could pass
gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/*
and
gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_some_other_path/*
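If you're loading programmatically, multiple URIs can be passed in a single load job; a minimal sketch with the Python BigQuery client, where the destination table and format settings are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# The two narrower paths suggested above.
source_uris = [
    "gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/*",
    "gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_some_other_path/*",
]

# Assumed destination table and source format.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    source_uris, "my-project.my_dataset.network_analytics", job_config=job_config
)
load_job.result()  # waits for the load to finish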