Cannot load backup data from GCS to BigQuery - google-bigquery

My backup consists of three items: two files ending with .backup_info and one folder, which contains another folder with 10 CSV files. What would be the format of the URL that specifies the backup file location?
I'm trying the following and every time I get a file-not-found error:
gs://bucket_name/name_of_the_file_which_ended_with_backup_info.info

When you go to look at the files from your backup, the bucket should have a structure like this:
Buckets/app-id-999999999-backups
And the filenames should look like:
2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Therefore the path will be:
gs://app-id-9999999999-backups/2017-08-20T02:05:19Z_app-id-9999999999_data.json.gz
Make sure you do not include the word "Buckets"; I'm guessing that is the source of the confusion.
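If it helps, a minimal sketch of loading that file into BigQuery with the Python client library might look like the following; the dataset and table names are hypothetical placeholders, and it assumes the export is gzipped newline-delimited JSON as the filename suggests.

# Minimal sketch: load a gzipped newline-delimited JSON export from GCS into BigQuery.
# The dataset/table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Full gs:// path to the export file (note: no "Buckets/" prefix).
uri = "gs://app-id-9999999999-backups/2017-08-20T02:05:19Z_app-id-9999999999_data.json.gz"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table("my_dataset.my_table").num_rows, "rows loaded")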

Related

save a csv file into s3 bucket from pyspark dataframe

I would like to save the content of a Spark dataframe into a csv file in an s3 bucket:
df_country.repartition(1).write.csv('s3n://bucket/test/csv/a',sep=",",header=True,mode='overwrite')
The problem is that it creates a file with a name like part-00000-fc644e84-7579-48.
Is there any way to fix the name of this file, for example test.csv?
Thanks
Best
This is not possible since every partition in the job will create its own file and must follow a strict convention to avoid naming conflicts. The recommended solution is to rename the file after it is created.
Also, if you know you are only writing one file per path (e.g. s3n://bucket/test/csv/a), then it doesn't really matter what the name of the file is; simply read in all the contents of that unique directory name.
Sources:
1. Specifying the filename when saving a DataFrame as a CSV
2. Spark dataframe save in single file on hdfs location
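If you go the rename route, a minimal sketch using boto3 (an assumption, not something mentioned in the question; the bucket and prefix names are placeholders) could look like this:

# Sketch: rename the single part-* file Spark wrote under a prefix to test.csv.
# boto3 is an assumption here; the bucket and prefix names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "bucket"
prefix = "test/csv/a/"

# Find the part file Spark produced under the output prefix.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if "part-" in o["Key"])

# S3 has no rename operation: copy to the desired name, then delete the original.
s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": part_key},
               Key=prefix + "test.csv")
s3.delete_object(Bucket=bucket, Key=part_key)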

Google Cloud Dataprep - Scan for multiple input csv and create corresponding bigquery tables

I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should be run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function is then run to move one file from the raw folder into the queue folder if the queue is empty, and do nothing otherwise.
Dataprep scans the queue folder as per the scheduler. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function is run whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and proceeds to move one file from the raw folder to the queue folder if there are any left.
Repeat until all tables are created.
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.
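As a very rough sketch of the Cloud Function in the third step (move one file from raw to queue only when the queue is empty), assuming a Python Cloud Function triggered on the bucket and the google-cloud-storage client; the bucket name and folder layout are placeholders:

# Sketch of a GCS-triggered Cloud Function: move one file from raw/ to queue/
# only if queue/ is currently empty. Bucket name and folder layout are assumptions.
from google.cloud import storage

BUCKET = "my-dataprep-bucket"

def promote_next_file(event, context):
    client = storage.Client()
    bucket = client.bucket(BUCKET)

    # Do nothing if a file is still waiting in queue/.
    queued = [b for b in client.list_blobs(BUCKET, prefix="queue/") if not b.name.endswith("/")]
    if queued:
        return

    raw = [b for b in client.list_blobs(BUCKET, prefix="raw/") if b.name.endswith(".csv")]
    if not raw:
        return

    # GCS has no move operation: copy the first raw file into queue/, then delete it.
    blob = raw[0]
    bucket.copy_blob(blob, bucket, blob.name.replace("raw/", "queue/", 1))
    blob.delete()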

Dynamically populate external tables location

I'm trying to use Oracle external tables to load flat files into a database, but I'm having a bit of an issue with the LOCATION clause. The files we receive have several pieces of information, including the date, appended to their names, so I was hoping to use wildcards in the LOCATION clause, but it doesn't look like I'm able to.
I think I'm right in assuming I'm unable to use wildcards. Does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify the input directory, file mask, external table, etc. When a file is found in the directory, the shell script generates a list of files matching the file mask. For each file found, it issues an ALTER TABLE command to change the location on the given external table to that file and launches the rest of the PL/SQL associated with that file. This can be repeated for each file found with the file mask. I guess the benefit of this is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end which appears to be the only way.
I have a file watcher that looks for files in a given input directory with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an ALTER TABLE on the external table with the list of new file names.
For me this wasn't much of an issue as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone searching for ages for a solution.
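For illustration only, the same idea can be scripted in Python rather than shell; this is a sketch assuming the cx_Oracle driver, and the directory, file mask, table name, package and connection details are all hypothetical:

# Sketch: point an external table at whatever files currently match a mask,
# then run the associated load. cx_Oracle and all names here are assumptions.
import glob
import os
import cx_Oracle

conn = cx_Oracle.connect("user/password@dbhost/service")
cur = conn.cursor()

files = sorted(os.path.basename(f) for f in glob.glob("/data/inbound/SALES_*.csv"))
if files:
    # DDL cannot use bind variables, so the file list is built into the statement.
    location = ", ".join(f"'{f}'" for f in files)
    cur.execute(f"ALTER TABLE sales_ext LOCATION ({location})")

    # ...then launch whatever PL/SQL processes the now-visible rows.
    cur.execute("BEGIN load_sales_pkg.process_staged_rows; END;")
    conn.commit()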

how to automate zipping of files from sql

I have an SQL table which has results for each period (e.g. Jan 2013). These are the steps that I follow:
1. Select the records for each period.
2. Put the results into a CSV file.
3. Copy the headers and save the CSV file to a text file with a different name.
4. Take the text file and gzip it.
5. Count the records in the text file and create a counts file (.txt).
6. Take the gzip file and the counts file and create a .tar file.
7. Create another counts file that points to the .tar file.
I have to do these steps for all the periods that are in that table.
Is there an easier way to do this, like a Perl/Python script or a Batch file or something?
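As a rough sketch of how the per-period steps might look in Python, where the query, table and file names are hypothetical and any DB-API database driver is assumed:

# Sketch: export one period to CSV, copy to a text file, gzip it, write counts
# files and build a .tar. The query, names and connection are placeholders.
import csv
import gzip
import shutil
import tarfile

def export_period(conn, period):
    cur = conn.cursor()
    cur.execute("SELECT * FROM results WHERE period = %s", (period,))  # placeholder style depends on the driver
    rows = cur.fetchall()
    headers = [d[0] for d in cur.description]

    csv_name = f"results_{period}.csv"
    txt_name = f"results_{period}.txt"
    with open(csv_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    shutil.copy(csv_name, txt_name)                       # same content, different name

    with open(txt_name, "rb") as src, gzip.open(txt_name + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)                      # gzip the text file

    counts_name = f"results_{period}_counts.txt"
    with open(counts_name, "w") as f:
        f.write(f"{txt_name}: {len(rows)} records\n")

    tar_name = f"results_{period}.tar"
    with tarfile.open(tar_name, "w") as tar:              # bundle gzip + counts
        tar.add(txt_name + ".gz")
        tar.add(counts_name)

    with open(f"results_{period}_tar_counts.txt", "w") as f:
        f.write(f"{tar_name}: {len(rows)} records\n")     # counts file pointing at the .tar

Looping over the distinct periods in the table (e.g. SELECT DISTINCT period FROM results) and calling this for each one would cover every period.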

How to do loop in pentaho for getting file names?

I have 100 000 files.
I want to get the names of those files and put them into a database.
I have to do it like this:
get 10 file names;
update/insert names into database; and
move those 10 files to another directory;
and loop these three steps till no files are found.
Is this possible?
I'm attaching a working example (I tested it with ~400 text files on kettle 4.3.).
transformation.ktr
job.kjb
Both transformation and job contain detailed notes on what to set and where.
transformation.ktr reads the first 10 filenames from the given source folder and creates the destination filepath for the file move. It outputs the filenames for insert/update (I used a Dummy step as a placeholder) and uses "Copy rows to result" to output the needed source and destination paths for moving the files.
job.kjb is where all the looping is done. It executes "transformation.ktr" (which does the insert/update for 10 files), and then moves those 10 files to the destination folder. After that, it checks whether there are any more files in the source folder. If there are, the process is repeated; if not, it declares success.
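For comparison only, here is the same batch-of-10 loop expressed as a plain Python sketch; the directory paths, table name and the sqlite3 stand-in database are all assumptions:

# For comparison: the batch-of-10 loop outside of Kettle.
# The paths, table name and sqlite3 stand-in database are placeholders.
import os
import shutil
import sqlite3   # stand-in for whatever database is being loaded

SRC, DST = "/data/incoming", "/data/processed"

conn = sqlite3.connect("files.db")
conn.execute("CREATE TABLE IF NOT EXISTS file_names (name TEXT)")

while True:
    batch = sorted(os.listdir(SRC))[:10]          # take up to 10 file names
    if not batch:
        break                                     # no files left: done
    conn.executemany("INSERT INTO file_names (name) VALUES (?)",
                     [(name,) for name in batch])
    conn.commit()
    for name in batch:                            # move those 10 files to another directory
        shutil.move(os.path.join(SRC, name), os.path.join(DST, name))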