Command-line bq load that pulls a variable out of the load file name for the table name

I have a Google Storage directory that contains several files with the same layout and the same naming convention, one per state. I want to run one command-line bq load that creates a separate BigQuery table for each file, and that also pulls the 2-letter state abbreviation out of each load file name to include it in the created table name.
The file naming convention is:
PUF[STATE]Plan2021.csv (e.g. PUFCAPlan2021.csv, PUFMDPlan2021.csv, PUFMNPlan2021.csv, etc.)
I want a table created for each file like:
HIOS_STATE_PLAN_ATTR_CA_RAW_2021, HIOS_STATE_PLAN_ATTR_MD_RAW_2021, HIOS_STATE_PLAN_ATTR_MN_RAW_2021
So, in the example below, I'd want to read the "CA" out of the "PUFCAPlan2021.csv" file name and use it for [STATE] in the table name, for each of the files included in the directory. I have no idea whether this is possible.
REM create a raw file
ECHO Creating raw file
call bq load --skip_leading_rows=1 ^
meritagedata:PAYER.HIOS_STATE_PLAN_ATTR_[STATE]_RAW_2021 gs://payer_raw_files/HIOS_STATE_PLAN_ATTR/2021/PUFCAPlan2021.csv ^
I know I can do something like the below, but that still requires running a separate command line for each file in the directory. I'm wondering whether it's possible to create one load statement that will create a separate table for each file and insert the state abbreviation into the table name.
REM create a raw file
SET STATE=CA
ECHO Creating raw file
call bq load --skip_leading_rows=1 ^
meritagedata:PAYER.HIOS_STATE_PLAN_ATTR_%STATE%_RAW_2021 gs://payer_raw_files/HIOS_STATE_PLAN_ATTR/2021/PUF%STATE%Plan2021.csv ^
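As far as I know, bq load has no templating for the destination table name, so a single literal load statement can't do this; a small batch loop over the bucket listing can, though. A minimal sketch, assuming gsutil is on the PATH and every file matches the PUF[STATE]Plan2021.csv convention exactly (the schema argument elided above would follow each file URI as usual):
@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
REM Loop over every matching file in the bucket and load each one into its own table.
FOR /F "usebackq delims=" %%U IN (`gsutil ls gs://payer_raw_files/HIOS_STATE_PLAN_ATTR/2021/PUF*Plan2021.csv`) DO (
    REM Strip the directory prefix, leaving e.g. PUFCAPlan2021.csv.
    SET "URI=%%U"
    SET "FNAME=!URI:gs://payer_raw_files/HIOS_STATE_PLAN_ATTR/2021/=!"
    REM The 2-letter state sits right after the 3-character "PUF" prefix.
    SET "STATE=!FNAME:~3,2!"
    ECHO Creating raw table for state !STATE!
    call bq load --skip_leading_rows=1 ^
        meritagedata:PAYER.HIOS_STATE_PLAN_ATTR_!STATE!_RAW_2021 %%U
)
ENDLOCAL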

Related

Get file name from SAP Data Services

I'm unable to read a file name from Data Services when it contains a date_time component; I can read the date, but the time can vary. I've tried *.csv in the File name(s) property of the flat file, but that approach suits a static file name.
Example: File_20180520_200003.csv, File_20180519_192503.csv, etc.
My script:
$Filename= 'File_'|| to_char(sysdate()-1, 'YYYYMMDD')|| '_'|| '*.csv';
I want to find a way to match the 6-digit time portion (any number) with the * wildcard.
Finally, I've found a solution by using:
$Csv = word(exec('cmd','dir /b [$Filename]*.csv',8),2) ;
in the flat file's File name(s) property: I added $Csv there, and it works fine.
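For reference, the exec() call above just shells out to cmd; resolved, the directory lookup it runs looks roughly like this (the date literal below is a hypothetical resolved value of [$Filename]):
dir /b File_20180519_*.csv
REM prints the full matching name, e.g. File_20180519_192503.csv;
REM word(..., 2) then picks that name out of exec()'s return string.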

Sejda merging PDFs from CSV filelist names

I recently installed sejda-console for merging PDF files from the command line.
The names of the input pdf files are in a CSV file named filelist-inputs.csv like this:
./Temp/source/046032.pdf,./Temp/source/048155.pdf
./Temp/source/049278.pdf,./Temp/source/050818.pdf,./Temp/source/052962.pdf
./Temp/source/052962.pdf,./Temp/source/054117.pdf
I need one output PDF file for the first line of the CSV file list, another output PDF file for the second line, another for the third line, and so on...
I tried a command line like this:
~$ sejda-console merge -l filelist-inputs.csv -o ./Temp/target/merged[FILENUMBER####].pdf
But it only creates a single file literally named merged[FILENUMBER####].pdf, when I want 3 files:
merged0001.pdf
merged0002.pdf
merged0003.pdf
I've simplified the problem here; I actually need to merge more than 3500 PDF files into 700 output files.
Sejda takes all the values in the CSV and generates a single merged PDF; there isn't any option or setting in Sejda to achieve what you're asking. You will need some scripting to loop through the CSV lines, create a CSV per line, and feed each one to Sejda.
The output file name merged[FILENUMBER####].pdf is used literally because the PDF merge task generates one output file and expects an explicit output file name. Prefixes like [CURRENTPAGE] or [FILENUMBER] are valid when used as the -p argument of tasks that generate multiple output PDF files (split tasks, etc.).
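To illustrate the scripting the answer describes, here is a rough sketch in the Windows batch style used elsewhere on this page (the same loop is straightforward in any shell): it writes each CSV line to its own one-line file list and runs one merge per line.
@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
SET /A N=0
FOR /F "usebackq delims=" %%L IN ("filelist-inputs.csv") DO (
    SET /A N+=1
    REM Zero-pad the counter to four digits for the output name.
    SET "NUM=000!N!"
    SET "NUM=!NUM:~-4!"
    REM Write a one-line file list for this output, then merge it.
    >"filelist-!NUM!.csv" ECHO %%L
    call sejda-console merge -l "filelist-!NUM!.csv" -o "./Temp/target/merged!NUM!.pdf"
)
ENDLOCAL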

Using the BQ command line to change configuration.load.quote

I want to know how, using the BQ command-line tool, I can change the configuration of a BigQuery API job. For example, I want to change the configuration.load.quote property from the command-line tool. Is there any way? I need this to load a table whose fields contain double quotes (").
You cannot modify a job once it is created, but I guess what you want is to set the quote property when creating the job.
In most cases, bq help <command> will get you what you need. Here's the output of bq help load; as you can see, you just have to specify --quote="'" after the command but before the arguments. (A usage example follows the help output.)
$ bq help load
Python script for interacting with BigQuery.
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
load Perform a load operation of source into destination_table.
Usage:
load <destination_table> <source> [<schema>]
The <destination_table> is the fully-qualified table name of table to
create, or append to if the table already exists.
The <source> argument can be a path to a single local file, or a
comma-separated list of URIs.
The <schema> argument should be either the name of a JSON file or a
text schema. This schema should be omitted if the table already has
one.
In the case that the schema is provided in text form, it should be a
comma-separated list of entries of the form name[:type], where type
will default to string if not specified.
In the case that <schema> is a filename, it should contain a single
array object, each entry of which should be an object with properties
'name', 'type', and (optionally) 'mode'. See the online documentation
for more detail:
https://developers.google.com/bigquery/preparing-data-for-bigquery
Note: the case of a single-entry schema with no type specified is
ambiguous; one can use name:string to force interpretation as a
text schema.
Examples:
bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.new_tbl gs://mybucket/info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
bq load ds.small gs://mybucket/small.csv field1,field2,field3
Arguments:
destination_table: Destination table name.
source: Name of local file to import, or a comma-separated list of
URI paths to data to import.
schema: Either a text schema or JSON file, as above.
Flags for load:
/home/David/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_jagged_rows: Whether to allow missing trailing optional columns in
CSV import data.
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import
data.
-E,--encoding: <UTF-8|ISO-8859-1>: The character encoding used by the input
file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between
columns in the input file. "\t" and "tab" are accepted names for tab.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized
values in CSV or JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job
fails.
(default: '0')
(an integer)
--quote: Quote character to use to enclose records. Default is ". To indicate
no quote character at all, use an empty string.
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form
name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to
skip.
(an integer)
--source_format: <CSV|NEWLINE_DELIMITED_JSON|DATASTORE_BACKUP>: Format of
source data. Options include:
CSV
NEWLINE_DELIMITED_JSON
DATASTORE_BACKUP
gflags:
--flagfile: Insert flag definitions from the given file into the command line.
(default: '')
--undefok: comma-separated list of flag names that it is okay to specify on
the command line even if the program does not define a flag with that name.
IMPORTANT: flags in this list that have arguments MUST use the --flag=value
format.
(default: '')
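Putting it together (dataset, table, and bucket names below are placeholders, and the text schema mirrors the help examples above): the first command treats the single quote as the enclosing character, and the second passes an empty string to disable quoting entirely so embedded double quotes pass through as data.
bq load --quote="'" --skip_leading_rows=1 mydataset.mytable gs://mybucket/data.csv field1,field2,field3
bq load --quote="" --skip_leading_rows=1 mydataset.mytable gs://mybucket/data.csv field1,field2,field3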

BIDS Import from changing file name [wildcard?]

I'm attempting to create a process to import data. I created the entire process and it works, but I'm having trouble creating the variable to automatically find the file name of the CSV I want to import. Each time a new CSV is uploaded to me it has a timestamp on it. I want to be able to grab that file no matter what its name is and work with it.
So for example this week the file name would be
filename_4-14-2014.csv
And next week
filename_4_21_2014.csv
And so on, into eternity...
Is there a way to create a variable that picks up the full file name even though it's changing?
After doing some poking around, I've discovered the following...
You can use a File System Task to perform the copy operation I was referring to, with the input file and the output file set as variables. This way the file you use for import is always named the same and has the right data.
You just need to add the variables and a File System Task to your package.
OK, so to accomplish what I wanted, I created a Foreach Loop Container and had it look for any files ending in .csv in my specified folder by using a wildcard (denoted by an asterisk: *.csv).
Within the Foreach Loop Container, the steps are as follows.
Step 1: File System Task - rename file.
Step 2: Data Flow Task - Import data to sql
Step 3: File System Task - Copy the file to another folder, append datetime to filename
Step 4: File System Task - Delete source file.
I used variables to get all the file and folder names plus datetimes.
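SSIS handles all of this through designer tasks rather than code, but purely as an illustration, the same four steps read roughly like this cmd sketch (folder names are hypothetical):
@ECHO OFF
SETLOCAL
REM Naive datetime stamp; %DATE%/%TIME% formats are locale-dependent.
SET "STAMP=%DATE:/=-%_%TIME::=-%"
SET "STAMP=%STAMP: =_%"
FOR %%F IN (C:\inbox\*.csv) DO (
    REM Step 1: rename to the fixed name the import expects.
    COPY /Y "%%F" "C:\work\import.csv"
    REM Step 2: the data flow / import to SQL would run here.
    REM Step 3: copy the original to an archive folder, stamp appended.
    COPY /Y "%%F" "C:\archive\%%~nF_%STAMP%.csv"
    REM Step 4: delete the source file.
    DEL "%%F"
)
ENDLOCAL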

Hive Reading external table from compressed bz2 file

This is my scenario: I have a bz2 file in Amazon S3. Within the bz2 file there are files with .dat, .met, and .sta extensions; I am only interested in the files with the .dat extension. You can download this sample file to take a look at the bz2 file.
create external table cdr (
anum string,
bnum string,
numOfTimes int
)
row format delimited
fields terminated by ','
lines terminated by '\n'
location 's3://mybucket/dir'; -- the bz2 file is inside here
The problem is that when I execute the above command, some of the records/rows have issues:
1) all the data from the *.sta and *.met files is also included, and
2) the metadata of the file names is also included.
The only idea I had was to inspect INPUT_FILE_NAME, but all the records/rows had the same INPUT_FILE_NAME, which was the filename.tar.bz2.
Any suggestions are welcome. I am currently completely lost.
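One observation that may explain the symptoms: Hive can transparently decompress a plain .bz2 text file, but it does not understand tar archives, so a .tar.bz2 gets decompressed and then read as one raw tar stream. The member headers (the "metadata of the filenames") and every member file, .met and .sta included, come through as rows, and INPUT_FILE_NAME is the archive itself. A possible workaround, sketched below with hypothetical paths and assuming the aws CLI and tar are available, is to unpack the archive, re-upload only the .dat files to a clean prefix, and point the table's LOCATION there:
aws s3 cp s3://mybucket/dir/samplefile.tar.bz2 .
MKDIR extracted
tar -xjf samplefile.tar.bz2 -C extracted
REM Upload only the .dat members; then recreate the table with
REM location 's3://mybucket/dat-only'.
FOR /R extracted %%F IN (*.dat) DO aws s3 cp "%%F" s3://mybucket/dat-only/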