Exporting large file from BigQuery to Google cloud using wildcard

Exporting large file from BigQuery to Google cloud using wildcard - google-bigquery

I have 8Gb table in BigQuery that I'm trying to export to Google Cloud Storage (GCS). If I specify url as it is, I'm getting an error
Errors:
Table gs://***.large_file.json too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data. (error code: invalid)
Okay... I'm specifying * in a file name, but it exports it in 2 files: one 7.13Gb and one ~150Mb.
UPD. I thought I should get about 8 files, 1Gb each? Am I wrong? Or what am I doing wrong?
P.S. I tried this in WebUI mode as well as using Java library.

For files of certain size or larger, BigQuery will export to multiple GCS files - that's why it asks for the "*" glob.
Once you have multiple files in GCS, you can join them into 1 with the compose operation:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose

To export it to GCP you have to go to the table and click EXPORT > Export to GCS.
This opens the following screen
In Select GCS location you define the bucket, the folder and the file.
For instances, you have a bucket named daria_bucket (Use only lowercase letters, numbers, hyphens (-), and underscores (_). Dots (.) may be used to form a valid domain name.) and want to save the file(s) in the root of the bucket with the name test, then you write (in Select GCS location)
daria_bucket/test.csv
Because the file is too big, you're getting an error. To fix it, you'll have to break it down into more files using wildcard. So, you'll need to add *, just like that
daria_bucket/test*.csv
This is going to store, inside of the bucket daria_bucket, all the data extracted from the table in more than one file named test000000000000, test000000000001, test000000000002, ... testX.
In my case (more than 1 year after you've asked the question), using a random table of 1,25 GBs, got 16 files with 80,3 MBs each.

Related

Aws s3 batch operation error: Task target couldn't be URL decoded

I need to restore a lot of object from aws s3 glacier deep archive. So i try to use a s3 batch jobs. For that i use a python code to create a manifest as a csv with to columns Bucket,Key.
But my first issue : some Key contain a comma so the job failed.
To solve (partialy) this issue i just cut the csv file to keep only the first two columns hoping that there are not many files involved.
But now i have another issue:
ErrorMessage: Task target couldn't be URL decoded
Any Idea ?

As mentioned on https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the manifest CSV file must be URL encoded. The , character in a key name gets converted to %2C with URL encoding so the resulting file will be valid CSV even with commas in the key name

Redshift Unload command with CSV extension

I'm using the following Unload command -
unload ('select * from '')to 's3://**summary.csv**'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary.csv000
If I change and remove the file extension from the command like below
unload ('select * from '')to 's3://**summary**'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file create in S3 is summary000
Is there a way to get summary.csv, so I don't have to change the file extension before importing it into excel?
Thanks.

actually a lot of folks asked the similar question, right now it's not possible to have an extension for the files. (but parquet files can have)
The reason behind this is, RedShift by default export it in parallel which is a good thing. Each slice will export its data. Also from the docs,
PARALLEL
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
So it has to create new files after 6GB that's why they are adding numbers as a suffix.
How do we solve this?
No native options from RedShift, but we can do some workaround with lambda.
Create a new S3 bucket and a folder inside it specifically for this process.(eg: s3://unloadbucket/redshift-files/)
Your unload files should go to this folder.
Lambda function should be triggered based on S3 put object event.
Then the lambda function,
Download the file(if it is large use EFS)
Rename it with .csv
Upload to the same bucket(or different bucket) into a different path (eg: s3://unloadbucket/csvfiles/)
Or even more simple if you use shell/powershell script to do the following process
Download the file
Rename it with .csv

As per AWS Documentation around UNLOAD command, it's possible to save data as CSV.
In your case, this is what your code would look like:
unload ('select * from '')
to 's3://summary/'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key='''
CSV <<<
parallel off
allowoverwrite
CSV HEADER;

NiFi data insertion into s3 subdirectory

I have a flow where I am extracting data from the database, converting the Avro to the CSV format and pushing the CSV in an s3 bucket which has subfolder in it. My S3 structure is like the following:
As you can see in the above screenshot my files are going into a blank folder(highlighted by red) instead of going inside a subfolder called 'Thermal'. Please see my PutS3Object settings:
The final s3 path I want my files to go into is: export-csv-vehicle-telemetry/vin11/Thermal
What settings should I change in my processor so the file goes directly inside the 'Thermal' folder?

Use Bucket name as: export-csv-vehicle-telemetry/vin15/Thermal instead of export-csv-vehicle-telemetry/vin15/Thermal/
The extra slash at the end is not required while specifying bucket names.
BTW, Your image shows vin11 directory instead of vin15. Check if that is correct.

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

I am trying to generate a file from a Dataframe that I have created in AWS-Glue, I am trying to give it a specific name, I see most answers on stack overflow actually uses Filesystem modules, but here this particular csv file is generated in S3, also I want to give the file a name while generating it, and not rename it after it is generated, is there any way to do that?
I have tried using df.save(s3:://PATH/filename.csv) which actually generates a new directory in S3 named filename.csv and then generates part-*.csv inside that directory
df.repartition(1).write.mode('append').format('csv').save('s3://PATH').option("header", "true")

Split CSV file in records and save as a csv file format - Apache NIFI

What I want to do is the following...
I want to divide the input file into registers, convert each record into a
file and leave all the files in a directory.
My .csv file has the following structure:
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
ERP,FRANK,DIETSCH,5064 E METAIRIE AVE.,BRANDSVILLA,MO,65687,252-5592,1176 E THAYER ST.,COLUMBIA,MO,65215,557,291-9571,217-38-5525,129-10-0407,1/13/35,M,
As you can see it doesn't have Header row.
Here is my flow.
My problem is that when the Split Proccessor divides my csv into flows with 400 lines, it isn't save in my output directory.
It's first time using NIFI, sorry.

Make sure your RecordReader controller service is configured correctly(delimiter..etc) to read the incoming flowfile.
Records per split value as 1
You need to use UpdateAttribute processor before PutFile processor to change the filename to unique value (like UUID) unless if you are configured PutFile processor Conflict Resolution strategy as Ignore
The reason behind changing filename is SplitRecord processor is going to have same filename for all the splitted flowfiles.
Flow:
I tried your case and flow worked as expected, Use this template for your reference and upload to your NiFi instance, Make changes as per your requirements.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Exporting large file from BigQuery to Google cloud using wildcard - google-bigquery

Related

Aws s3 batch operation error: Task target couldn't be URL decoded

Redshift Unload command with CSV extension

NiFi data insertion into s3 subdirectory

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

Split CSV file in records and save as a csv file format - Apache NIFI

Categories

Resources