Does a Hive table skip headers from all the files?

Hive has a table property, "skip.header.line.count"="1", that makes an external table skip the header row of a file.
So what is the behavior when the folder contains multiple files? I had this doubt and just verified it.

The header is skipped in every file in the folder.
If the folder the table points to contains multiple files, Hive skips the first N rows ("skip.header.line.count"="N") from each and every file in that folder.
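
For reference, this is roughly what such a table definition looks like; a minimal sketch submitted through the PyHive client, where the table name, columns, HDFS location and HiveServer2 endpoint are all made-up placeholders:

from pyhive import hive  # assumes HiveServer2 is reachable and PyHive is installed

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (
    id INT,
    amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales/'
TBLPROPERTIES ("skip.header.line.count"="1")
"""

conn = hive.Connection(host="localhost", port=10000, username="hive_user")  # placeholder endpoint
cur = conn.cursor()
cur.execute(ddl)  # the first line of every file under /data/sales/ is now skipped on read
cur.close()
conn.close()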

Related

Google Cloud Dataprep - Scan for multiple input csv and create corresponding bigquery tables

I have several CSV files on GCS which share the same schema but have different timestamps, for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should run every day with a scheduler.
Right now what I think could work is as follows:
The CSV files should have a timestamp column which is the same for every row in the same file.
Create 3 folders on GCS: raw, queue and wrangled.
Put the raw CSV files into the raw folder. A Cloud Function then moves one file from the raw folder into the queue folder if the queue folder is empty, and does nothing otherwise (see the sketch at the end of this question).
Dataprep scans the queue folder on its schedule. If a CSV file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. It creates a new BigQuery table named after the timestamp column in the CSV file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and moves one more file from the raw folder to the queue folder if there is any.
Repeat until all tables are created.
This seems overly complicated to me, and I'm not sure how to handle cases where the Cloud Functions fail to do their job.
Any other suggestions for my use case are appreciated.
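
To make the third step above a bit more concrete, a rough sketch of that first Cloud Function using the google-cloud-storage client could look like the following; the bucket name and the raw/ and queue/ prefixes are placeholders taken from the description, not code from an actual deployment:

from google.cloud import storage  # assumes the google-cloud-storage client library

BUCKET = "my-dataprep-bucket"  # placeholder bucket name

def promote_one_raw_file(event, context):
    """Move one file from raw/ to queue/ if queue/ is currently empty."""
    client = storage.Client()
    bucket = client.bucket(BUCKET)

    def files_under(prefix):
        # ignore zero-byte "folder" placeholder objects such as "queue/"
        return [b for b in client.list_blobs(BUCKET, prefix=prefix) if not b.name.endswith("/")]

    if files_under("queue/"):
        return  # something is already waiting, do nothing

    raw_blobs = files_under("raw/")
    if not raw_blobs:
        return  # nothing left to promote

    blob = raw_blobs[0]
    new_name = "queue/" + blob.name.split("/")[-1]
    bucket.copy_blob(blob, bucket, new_name)  # GCS has no rename: copy...
    blob.delete()                             # ...then delete the original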

Cannot load backup data from GCS to BigQuery

My backup has 3 items: 2 files ending with .backup_info and one folder, with another folder inside containing 10 CSV files. What would be the format of the URL that specifies the backup file location?
I'm trying the path below, and every time I get a file-not-found error.
gs://bucket_name/name_of_the_file_which_ended_with_backup_info.info
When you go to look at your file from your backup, it should have a structure like this:
Buckets/app-id-999999999-backups
And the filenames should look like:
2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Therefore the path will be:
gs://app-id-9999999999-backups/2017-08-20T02:05:19Z_app-id-9999999999_data.json.gz
Make sure you do not include the word "Buckets"; I am guessing that is the source of the confusion.
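Once the path is right, one way to load such an export is with the BigQuery Python client; a rough sketch, where the destination table is a placeholder and newline-delimited JSON is assumed for the .json.gz file:

from google.cloud import bigquery  # assumes the google-cloud-bigquery client library

client = bigquery.Client()

table_id = "my_project.my_dataset.restored_backup"  # placeholder destination table
uri = "gs://app-id-9999999999-backups/2017-08-20T02:05:19Z_app-id-9999999999_data.json.gz"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # assumed export format
    autodetect=True,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # waits for the load; a wrong path (e.g. one starting with "Buckets/") fails here

If the backup is actually a Datastore/Firestore export (the .backup_info files suggest it might be), the variant to try would be SourceFormat.DATASTORE_BACKUP with the URI pointing at the kind's .backup_info file instead.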

Deleting NDF from file group completely?

I want a script to delete an NDF file from a filegroup completely, without using the SHRINKFILE command.
A file can be removed from the database only if the file is empty. Without SHRINKFILE, the implication is that the file must be the only file in a user-defined filegroup, and you must first drop or move all the objects (or partitions) in that filegroup to a different filegroup. The empty file can then be dropped with ALTER DATABASE ... REMOVE FILE.
It seems your objective is to delete data older than 6 months. It would be easier to just delete/truncate the data and not bother with files/filegroups at all.
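
As a rough illustration of that sequence in script form (all of the database, table, index, filegroup and logical file names below are made up), driven here through pyodbc:

import pyodbc  # assumes the Microsoft ODBC driver and pyodbc are installed

# Hypothetical names: database MyDb; table dbo.Orders clustered on OrderID lives on
# filegroup FG_ARCHIVE, whose only file is the NDF with logical name ArchiveData.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;"
)
conn.autocommit = True  # ALTER DATABASE cannot run inside a user transaction
cur = conn.cursor()

# 1) Move the table off the filegroup by rebuilding its clustered index on PRIMARY.
cur.execute(
    "CREATE UNIQUE CLUSTERED INDEX CIX_Orders ON dbo.Orders(OrderID) "
    "WITH (DROP_EXISTING = ON) ON [PRIMARY];"
)

# 2) With the filegroup empty, drop the NDF and then the filegroup itself.
cur.execute("ALTER DATABASE MyDb REMOVE FILE ArchiveData;")
cur.execute("ALTER DATABASE MyDb REMOVE FILEGROUP FG_ARCHIVE;")

cur.close()
conn.close()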

When to delete a temp file?

Is there a standard or common way to decide when to delete a temporary file?
I'm currently writing a script in which I'm using many of them, but when should I remove them?
Delete the file right after you have used it / won't need it anymore, or
delete all temp files at the end of the script?
I would give the temp files a specific directory or a specific file extension, and then delete them at the end of the script (everything inside the temp dir, or everything with that extension). I prefer doing it this way because:
1) If there are a lot of files, you might forget one if you delete them by name.
2) If you still need one of them later in the code, it's still there.
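
A minimal Python sketch of that approach, using one dedicated temp directory that is cleaned up once at exit (the script and file names are placeholders):

import atexit
import os
import shutil
import tempfile

# One dedicated directory for all temp files of this run.
TMP_DIR = tempfile.mkdtemp(prefix="myscript_")

# Remove the whole directory when the script exits, however it exits.
atexit.register(shutil.rmtree, TMP_DIR, ignore_errors=True)

def temp_path(name):
    """Place every temporary file inside TMP_DIR so nothing is forgotten."""
    return os.path.join(TMP_DIR, name)

with open(temp_path("intermediate.csv"), "w") as f:
    f.write("some,intermediate,data\n")
# The file stays available for later steps and disappears with TMP_DIR at exit.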

How to do loop in pentaho for getting file names?

I have 100,000 files.
I want to get the names of those files and put them into a database.
I have to do it like this:
get 10 file names;
update/insert the names into the database; and
move those 10 files to another directory;
and loop these three steps until no files are found.
Is this possible?
I'm attaching a working example (I tested it with ~400 text files on Kettle 4.3).
transformation.ktr
job.kjb
Both transformation and job contain detailed notes on what to set and where.
transformation.ktr: It reads the first 10 filenames from the given source folder and creates the destination filepath for the file move. It outputs the filenames to insert/update (I used a Dummy step as a placeholder) and uses "Copy rows to resultset" to output the source and destination paths needed for moving the files.
job.kjb: All the looping is done in this job. It executes transformation.ktr (which does the insert/update for 10 files) and then moves those 10 files to the destination folder. After that, it checks whether there are any more files in the source folder. If there are, the process is repeated; if not, it declares success.
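
For readers who just want the shape of the loop outside of Kettle, a plain-Python sketch of the same batch-and-move logic could look like this; the folder paths and the SQLite table are placeholders, and the real solution above does this with the .ktr/.kjb pair:

import os
import shutil
import sqlite3

SOURCE_DIR = "/data/incoming"    # placeholder source folder
DEST_DIR = "/data/processed"     # placeholder destination folder
BATCH_SIZE = 10

conn = sqlite3.connect("filenames.db")
conn.execute("CREATE TABLE IF NOT EXISTS processed_files (name TEXT PRIMARY KEY)")

while True:
    batch = sorted(os.listdir(SOURCE_DIR))[:BATCH_SIZE]
    if not batch:
        break  # no files left: the job is done

    # insert/update the file names...
    conn.executemany(
        "INSERT OR REPLACE INTO processed_files (name) VALUES (?)",
        [(name,) for name in batch],
    )
    conn.commit()

    # ...then move those files out of the source folder.
    for name in batch:
        shutil.move(os.path.join(SOURCE_DIR, name), os.path.join(DEST_DIR, name))

conn.close()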