Automation shell script to copy a BigQuery dataset using the transfer service command - google-bigquery

I want to write a shell script which copies source datasets into a target dataset through the transfer command line.
Note: the copy should happen at the dataset level, as we have thousands of datasets in BigQuery.

Consider the following approach:
#!/bin/bash
Project_Id='input your project name'
Location='input your location'
Tar_Data_Set='input your target dataset name'

# List the datasets in the project; sed skips the two header lines that bq ls prints.
Data_Sets=$(bq ls -n 1000 --project_id="${Project_Id}" --location="${Location}" | sed -n '3,$p')
[ $? -ne 0 ] && echo "Input parameter error" && exit 1

for Data_Set in ${Data_Sets}
do
    # List the tables in each dataset; awk skips the header lines and keeps the first column (the table name).
    for Table_Name in $(bq ls "${Data_Set}" | awk '{if(NR>2){print $1}}')
    do
        # Prepend "echo" to the next line for a dry run before copying anything.
        bq cp -f "${Project_Id}.${Data_Set}.${Table_Name}" "${Project_Id}.${Tar_Data_Set}.${Table_Name}"
    done
done
Here, we assume the tables are copied under the same project id and that the target dataset already exists.
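If you specifically want the BigQuery Data Transfer Service (rather than a per-table bq cp), dataset-level copies can be created with bq mk --transfer_config and the cross_region_copy data source. The sketch below is based on the documented flags, with placeholder project and dataset names; verify the exact parameters against bq mk --transfer_config --help for your bq version.
# Hypothetical sketch: schedule a dataset copy with the Data Transfer Service.
# my-project, source_dataset and target_dataset are placeholders; the target dataset must already exist.
bq mk --transfer_config \
    --project_id=my-project \
    --data_source=cross_region_copy \
    --display_name='Copy source_dataset to target_dataset' \
    --target_dataset=target_dataset \
    --params='{"source_dataset_id":"source_dataset","source_project_id":"my-project"}'
You could wrap this in the same dataset loop as above if every source dataset needs its own transfer config.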

Related

Optimize Informix update

I have a bash script that will update a table based on a file. The way I have it, it opens and closes a dbaccess session for every line in the file, and I would like to understand how to open once, perform all the updates, and then close. It is fine for a few updates, but if it ever requires more than a few hundred it could be really taxing on the system.
#!/bin/bash
file=/export/home/dncs/tmp/file.csv
dateFormat=$(date +"%m-%d-%y-%T")
LOGFILE=/export/home/dncs/tmp/Log_${dateFormat}.log
echo "${dateFormat} : Starting work" >> $LOGFILE 2>&1
while IFS="," read mac loc; do
    if [[ "$mac" =~ ^([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$ ]]; then
        dbaccess thedb <<EndOfUpdate >> $LOGFILE 2>&1
UPDATE profile
SET local_code= '$loc'
WHERE mac_address = '$mac';
EndOfUpdate
    else
        echo "Error: $mac not valid format" >> $LOGFILE 2>&1
    fi
    IIH -i $mac >> $LOGFILE 2>&1
done <"$file"
Source file:
12:BF:20:04:BB:30,POR-4
12:BF:21:1C:02:B1,POR-10
12:BF:20:04:72:FD,POR-4
12:BF:20:01:5B:4F,POR-10
12:BF:20:C2:71:42,POR-7
This is more or less what I'd do:
#!/bin/bash
fmt_date() { date +"%Y-%m-%d.%T"; }
file=/export/home/dncs/tmp/file.csv
dateFormat=$(fmt_date)
LOGFILE="/export/home/dncs/tmp/Log_${dateFormat}.log"
exec >> $LOGFILE 2>&1
echo "${dateFormat} : Starting work"
valid_mac='/^\(\([0-9a-fA-F]\{2\}:\)\{5\}[0-9a-fA-F]\{2\}\),\([^,]*\)$/'
update_stmt="UPDATE profile SET local_code = '\3' WHERE mac_address = '\1';"
sed -n -e "$valid_mac s//$update_stmt/p" "$file" |
dbaccess thedb -
sed -n -e "$valid_mac d; s/.*/Error: invalid format: &/p" "$file"
sed -n -e "$valid_mac s//IIH -i \1/p" "$file" | sh
echo "$(fmt_date) : Finished work"
I changed the date format to a variant of ISO 8601; it is easier to parse. You can stick with your Y2K-non-compliant US-ish format if you prefer. The exec line arranges for standard output and standard error from here onwards to go to the log file. The sed commands all use the same structure, and all use the same pattern match stored in a variable; this makes consistency easier. The first sed script converts the data into UPDATE statements (which are fed to dbaccess). The second script identifies invalid MAC addresses; it deletes valid lines and maps the invalid ones into error messages. The third script ignores invalid MAC addresses but generates an IIH command for each valid one. The script records an end time, which allows you to assess how long the processing takes. Again, repetition is avoided by creating and using the fmt_date function.
Be cautious when testing this. I had a file named data containing:
87:36:E6:5E:AC:41,loc-OYNK
B2:4D:65:70:32:26,loc-DQLO
ZD:D9:BA:34:FD:97,loc-PLBI
04:EB:71:0D:29:D0,loc-LMEE
DA:67:53:4B:EC:C4,loc-SFUU
I replaced dbaccess with cat, and sh with cat, and relocated the log file to the current directory, leading to:
#!/bin/bash
fmt_date() { date +"%Y-%m-%d.%T"; }
#file=/export/home/dncs/tmp/file.csv
file=data
dateFormat=$(fmt_date)
#LOGFILE="/export/home/dncs/tmp/Log_${dateFormat}.log"
LOGFILE="Log-${dateFormat}.log"
exec >> $LOGFILE 2>&1
echo "${dateFormat} : Starting work"
valid_mac='/^\(\([0-9a-fA-F]\{2\}:\)\{5\}[0-9a-fA-F]\{2\}\),\([^,]*\)$/'
update_stmt="UPDATE profile SET local_code = '\3' WHERE mac_address = '\1';"
sed -n -e "$valid_mac s//$update_stmt/p" "$file" |
cat
#dbaccess thedb -
sed -n -e "$valid_mac d; s/.*/Error: invalid format: &/p" "$file"
#sed -n -e "$valid_mac s//IIH -i \1/p" "$file" | sh
sed -n -e "$valid_mac s//IIH -i \1/p" "$file" | cat
echo "$(fmt_date) : Finished work"
After I ran it, the log file contained:
2017-04-27.14:58:20 : Starting work
UPDATE profile SET local_code = 'loc-OYNK' WHERE mac_address = '87:36:E6:5E:AC:41';
UPDATE profile SET local_code = 'loc-DQLO' WHERE mac_address = 'B2:4D:65:70:32:26';
UPDATE profile SET local_code = 'loc-LMEE' WHERE mac_address = '04:EB:71:0D:29:D0';
UPDATE profile SET local_code = 'loc-SFUU' WHERE mac_address = 'DA:67:53:4B:EC:C4';
Error: invalid format: ZD:D9:BA:34:FD:97,loc-PLBI
IIH -i 87:36:E6:5E:AC:41
IIH -i B2:4D:65:70:32:26
IIH -i 04:EB:71:0D:29:D0
IIH -i DA:67:53:4B:EC:C4
2017-04-27.14:58:20 : Finished work
The UPDATE statements would have gone to DB-Access. The bogus MAC address was identified. The correct IIH commands would have been run.
Note that piping the output into sh requires confidence that the data you generate (the IIH commands) will be clean.
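If you would rather not pipe generated commands into sh at all, a minimal alternative (assuming the same valid_mac pattern and IIH utility as above) is to extract just the MAC addresses and invoke IIH from a read loop:
# Sketch: run IIH per valid MAC address without generating shell commands.
sed -n -e "$valid_mac s//\1/p" "$file" |
while read -r mac; do
    IIH -i "$mac"
done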

How to run an awk script on multiple files

I need to run a command on hundreds of files and I need help to get a loop to do this:
I have a list of input files: /path/dir/file1.csv, file2.csv, ..., fileN.csv
I need to run a script on all those input files
The script is invoked something like: command input=/path/dir/file1.csv output=output1
I have tried things like:
for f in /path/dir/file*.csv; do command ... but how do I get it to read each input file and write a new output file every time?
Thank you....
Try this (after changing /path/to/data to the correct path; same with /path/to/awkscript and the other placeholders, pointing at your test data):
#!/bin/bash
cd /path/to/data
for f in *.csv ; do
    echo "awk -f /path/to/awkscript \"$f\" > ${f%.csv}.rpt"
    #remove_me awk -f /path/to/awkscript "$f" > ${f%.csv}.rpt
done
make the script "executable" with
chmod 755 myScript.sh
The echo version will help you ensure the script is going to work as expected. You still have to carefully examine that output, or work on a copy of your data, so you don't wreck your baseline data.
You could take the output of the last iteration
awk -f /path/to/awkscript myFileLast.csv > myFileLast.rpt
And copy/paste to cmd-line to confirm it will work.
When you're comfortable that the awk script works as you need, comment out the echo awk ... line and remove the #remove_me prefix so the real awk line runs (and save your bash script).
for f in /path/to/files/*.csv ; do
    bname=$(basename "$f")
    pref=${bname%%.csv}
    awk -f /path/to/awkscript "$f" > /path/to/store/output/${pref}_new.txt
done
Hopefully this helps, I am on my blackberry so there may be typos

Google BigQuery - how to drop table with bq command?

Google BigQuery - the bq command enables you to create, load, query and alter tables.
I did not find any documentation regarding dropping a table; I'd be happy to know how to do it.
I found the bq tool much easier to use than writing a Python interface for each command.
Thanks.
Found it:
bq rm -f -t data_set.table_name
-t for table, -f for force; -r removes a dataset together with all the tables in it.
Great tool.
Is there a way to bulk delete multiple tables? – activelearner
In bash, you can do something like:
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do bq rm -ft my_dataset.$i; done;
Explanation:
bq ls -n 9999 my_dataset - list up to 9999 tables in my dataset
| grep keyword - pipe the results of the previous command into grep, search for a keyword that your tables have in common
| awk '{print $1}' - pipe the results of the previous command into awk and print only the first column
Wrap all that into a for loop
do bq rm -ft my_dataset.$i; done; - remove each table from your dataset
I would highly recommend running the commands to list out the tables you want to delete before you add the 'do bq rm'. This way you can ensure you are only deleting the tables you actually want to delete.
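For example, a dry run of the loop above (same placeholder names) just echoes the rm commands so you can inspect them before anything is deleted:
# Dry run: print the delete commands without executing anything.
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do echo bq rm -f my_dataset.$i; done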
UPDATE:
The argument -ft now returns an error and should be simply -f to force the deletion, without a prompt:
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do bq rm -f my_dataset.$i; done;
You can use Python code (on Jupyter Notebook) for the same purpose:
from google.cloud import bigquery # client library import needed for the snippet below

bigquery_client = bigquery.Client() # Create a BigQuery service object
dataset_id='Name of your dataset'
table_id='Table to be deleted'
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
bigquery_client.delete_table(table_ref) # API request
print('Table {}:{} deleted.'.format(dataset_id, table_id))
If you want to delete a complete dataset:
If the dataset still contains tables and you want to delete the dataset together with its tables in one go, the command is:
!bq rm -f -r serene-boulder-203404:Temp1 # It will remove complete data set along with the tables in it
If your dataset is empty, you can use the following command instead.
To use it, make sure that you have deleted all the tables in that dataset first; otherwise it will generate an error (dataset is still in use).
#Now remove an empty dataset using bq command from Python
!bq rm -f dataset_id
print("dataset deleted successfully !!!")
I used a command-line for loop (Windows cmd syntax) to delete a month of table data, but this is reliant on your table naming:
for %d in (01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31) DO bq rm -f -t dataset.tablename_201701%d
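A rough bash equivalent of that loop, assuming the same dataset.tablename_201701DD naming, would be:
# Delete tables dataset.tablename_20170101 .. dataset.tablename_20170131 (seq -w zero-pads the day).
for d in $(seq -w 1 31); do
    bq rm -f -t dataset.tablename_201701${d}
done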
Expanding on the excellent answer from @james: I simply needed to remove all tables in a dataset, but not remove the dataset itself. Hence the grep part was unnecessary for me; however, I still needed to get rid of the
table_id
------------------
header that bq returns when listing tables. For that I used sed to remove those first two lines:
for i in $(bq ls -n 9999 my_dataset | sed "1,2 d" | awk '{print $1}'); do bq rm -f my_dataset.$i; done;
Perhaps there's a bq option to not return that header, but if there is, I don't know it.
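One alternative worth checking (this assumes your bq version honours the global --format flag on bq ls, and that the first CSV column is the table id, so verify both before relying on it) is to request machine-readable output and drop its single header line:
# Sketch: csv output has one header line instead of the two-line pretty header.
for i in $(bq ls -n 9999 --format=csv my_dataset | tail -n +2 | cut -d, -f1); do
    bq rm -f my_dataset.$i
done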

How to get the output of a SQL query in a flat file using a shell script

I want to automate a SQL file which contains multiple SQL queries using a shell script, and at the same time I want to get the output of the last query from that SQL file into a flat file.
You can create a file that executes each query but only sends the output of the last one.
#!/bin/bash
mysql -h... -u... -p... -e 'query1' > /dev/null
mysql -h... -u... -p... -e 'query2' > /dev/null
mysql -h... -u... -p... -e 'query3' > /result.sql
This awk command will find the last semicolon-separated query in the file and pipe it to mysql
awk 'BEGIN {RS=";"} NF > 0 {query=$0; } END {print query}' file.sql | mysql -u username -ppassword > output.txt
The NF > 0 keeps it from setting query to the empty line after the last ;.
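As a quick illustration (the file.sql contents shown in the comments are made up), only the last non-empty record reaches mysql:
# Suppose file.sql contains:
#   CREATE TABLE t (id INT);
#   INSERT INTO t VALUES (1);
#   SELECT * FROM t;
# With RS=";" awk treats each statement as one record, keeps overwriting query,
# and the END block prints only the final non-empty one:
awk 'BEGIN {RS=";"} NF > 0 {query=$0} END {print query}' file.sql
# Prints (modulo surrounding whitespace): SELECT * FROM t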

ksh script optimization

I have a small script that simply reads each line of a file, retrieves the id field, runs a utility to get the name, and appends the name at the end. The problem is that the input file is huge (2GB). Since the output is the same as the input with a 10-30 character name appended, it is of the same order of magnitude. How can I optimize it to read large buffers, process in buffers, and then write buffers to the file, so that the number of file accesses is minimized?
#!/bin/ksh
while read line
do
id=`echo ${line}|cut -d',' -f 3`
NAME=$(id2name ${id} | cut -d':' -f 4)
if [[ $? -ne 0 ]]; then
NAME="ERROR"
echo "Error getting name from id2name for id: ${id}"
fi
echo "${line},\"${NAME}\"" >> ${MYFILE}
done < ${MYFILE}.csv
Thanks
You can speed things up considerably by eliminating the two calls to cut in each iteration of the loop. It also might be faster to move the redirection to your output file to the end of the loop. Since you don't show an example of an input line, or what id2name consists of (it's possible it's a bottleneck) or what its output looks like, I can only offer this approximation:
#!/bin/ksh
while IFS=, read -r field1 field2 id remainder   # use appropriate var names
do
    line=$field1,$field2,$id,$remainder
    # warning - reused variables
    IFS=: read -r field1 field2 field3 NAME remainder <<< $(id2name "$id")
    if [[ $? -ne 0 ]]; then
        NAME="ERROR"
        # if you want this message to go to stderr instead of being included in the output file include the >&2 as I've done here
        echo "Error getting name from id2name for id: ${id}" >&2
    fi
    echo "${line},\"${NAME}\""
done < "${MYFILE}.csv" > "${MYFILE}"
The OS will do the buffering for you.
Edit:
If your version of ksh doesn't have <<<, try this:
id2name "$id" | IFS=: read -r field1 field2 field3 NAME remainder
(If you were using Bash, this wouldn't work.)
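(For completeness: in bash, the pipeline above runs read in a subshell, so NAME would not survive the pipe. A bash-friendly variant, assuming the same colon-separated id2name output, uses process substitution:)
# Process substitution keeps read in the current shell, so NAME persists.
IFS=: read -r field1 field2 field3 NAME remainder < <(id2name "$id")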