Can the bq CLI list only views and exclude tables? - google-cloud-sdk

Listing the contents of a dataset is as simple as:
bq ls project_id:dataset_id
This includes both views and tables. Is there a way to filter this to only show views? The --filter parameter only appears to work on datasets and transfer jobs.
References:
https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_ls
https://cloud.google.com/bigquery/docs/listing-views

You have two options here:
Querying INFORMATION_SCHEMA.VIEWS (note that Google bills a minimum of 10 MB of data processed per query against INFORMATION_SCHEMA):
SELECT TABLE_NAME FROM `PROJECT_NAME`.dataset_name.INFORMATION_SCHEMA.VIEWS;
Using the bq utility in combination with grep or awk:
bq ls __dataset__ | grep -i VIEW
or with awk looking at the second column:
bq ls __dataset__ | awk '{ if($2 == "VIEW"){ print $1; } }'
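If you prefer structured output to grepping columns, bq can also emit JSON that you can filter with jq. A sketch, assuming jq is installed and that your bq version exposes a type field and tableReference.tableId in its JSON listing (worth verifying with bq ls --format=json on your side):
bq ls --format=json project_id:dataset_id | jq -r '.[] | select(.type == "VIEW") | .tableReference.tableId'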

Use multiline regex on cat output

I have the following file queries.sql that contains a number of queries, structured like this:
/* Query 1 */
SELECT cab_type_id,
Count(*)
FROM trips
GROUP BY 1;
/* Query 2 */
SELECT passenger_count,
Avg(total_amount)
FROM trips
GROUP BY 1;
/* Query 3 */
SELECT passenger_count,
Extract(year FROM pickup_datetime),
Count(*)
FROM trips
GROUP BY 1,
2;
Then I wrote a regex that finds all those queries in the file:
/\*[^\*]*\*/[^;]*;
What I'd like to achieve is the following:
Select all the queries with the regex.
Prefix each query with EXPLAIN ANALYZE.
Execute each query and output the results to a new file. That means query 1 will create a file q1.txt with the corresponding output, query 2 will create q2.txt, etc.
One of my main challenges (there are no problems, right? ;-)) is that I'm rather unfamiliar with the Linux bash I have to use.
I tried cat queries.sql | grep '/\*[^\*]*\*/[^;]*;' but that doesn't return anything.
So a solution could look like:
count = 0
for query in (cat queries.sql | grep 'somehow-here-comes-my-regex') do
count = $count+1
query = 'EXPLAIN ANALYZE '+query
psql -U postgres -h localhost -d nyc-taxi-data -c query > 'q'$count'.txt'
Except that this doesn't work, and I don't know how to make it work.
You have to omit the spaces around = in variable assignments.
The following script would help. Save it in a file eg.: explain.sh, make it executable using chmod 0700 explain.sh and run in the following way: ./explain.sh query.sql.
#!/bin/bash
qfile="$1"
# extract the query numbers
n="$(grep -oP '(?<=Query )[0-9]+ ' "$qfile")"
count=1
for q in $n; do
# Corrected solution, modified after the remarks of @EdMorton
qn="EXPLAIN ANALYZE $(awk -v n="Query $q" 'flag; $0 ~ n {flag=1} /;/{flag=0}' "$qfile")"
#qn="EXPLAIN ANALYZE $(awk -v n=$q "flag; /Query $q/{flag=1} /;/{flag=0}" $qfile)"
# psql -U postgres -h localhost -d nyc-taxi-data -c "$qn" > q$count.txt
echo "$qn" > q$count.txt
count=$(( $count + 1 ))
done
First of all, the script takes one argument (your example input query.sql file). It reads out the query numbers and saves them into a variable n. Then, in a for loop, it iterates through the query numbers and uses awk to extract query number n, prepending EXPLAIN ANALYZE to the beginning. Then you can run psql with the desired query. Here I commented out the psql part; this example script only creates qN.txt files for each explain query.
UPDATE:
The awk part: It is possible to use a shell variable in awk using the -v flag. Here we create an awk variable n with the value of the q shell variable. n is used to build the start pattern, i.e. Query 1. awk -v n="Query $q" 'flag; $0 ~ n {flag=1} /;/{flag=0}' $qfile matches everything between Query 1 and the first occurrence of a semi-colon (;), excluding the line containing Query 1 itself, from query.sql. The $(...) means command substitution in bash, so we can save the output of a shell command into a variable. Here we save the output of awk and prefix it with the EXPLAIN ANALYZE string.
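To make the flag logic concrete, here is a minimal stand-alone illustration with a hypothetical three-line input:
printf '/* Query 1 */\nSELECT 1;\nnoise\n' | awk -v n="Query 1" 'flag; $0 ~ n {flag=1} /;/{flag=0}'
This prints only SELECT 1;: the line matching the pattern sets the flag but is not itself printed (flag is still 0 when the bare flag condition is evaluated), and the semi-colon line is printed and then resets the flag.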
Here is a great answer about awk pattern matching.
It sounds like this is what you're looking for:
awk -v RS= -v ORS='\0' '{print "EXPLAIN ANALYZE", $0}' queries.sql |
while IFS= read -r -d '' query; do
psql -U postgres -h localhost -d nyc-taxi-data -c "$query" > "q$((++count)).txt"
done
The awk statement outputs each query as a NUL-terminated string; the shell loop reads them one at a time and calls psql on each. Note that awk's RS= paragraph mode splits records on blank lines, so this assumes the queries in queries.sql are separated by blank lines. Simple, robust, efficient, etc.
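If you want to inspect what gets sent to psql before running anything, a sketch that swaps the psql call for printf (under the same blank-line-separated assumption):
awk -v RS= -v ORS='\0' '{print "EXPLAIN ANALYZE", $0}' queries.sql |
while IFS= read -r -d '' query; do
printf -- '--- q%d ---\n%s\n' "$((++count))" "$query"
done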

Google BigQuery - how to drop table with bq command?

Google BigQuery's bq command enables you to create, load, query and alter tables.
I did not find any documentation regarding dropping table, will be happy to know how to do it.
I find the bq tool much easier to use than writing a Python interface for each command.
Thanks.
Found it:
bq rm -f -t data_set.table_name
-t for table, -f for force, -r removes all tables in the named dataset.
great tool.
Is there a way to bulk delete multiple tables? – activelearner
In bash, you can do something like:
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do bq rm -ft my_dataset.$i; done;
Explanation:
bq ls -n 9999 my_dataset - list up to 9999 tables in my dataset
| grep keyword - pipe the results of the previous command into grep, search for a keyword that your tables have in common
| awk '{print $1}' - pipe the results of the previous command into awk and print only the first column
Wrap all that into a for loop
do bq rm -ft my_dataset.$i; done; - remove each table from your dataset
I would highly recommend running the commands to list out the tables you want to delete before you add the 'do bq rm'. This way you can ensure you are only deleting the tables you actually want to delete.
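For example, running just the listing pipeline on its own first shows exactly which tables the loop would remove:
bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'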
UPDATE:
The argument -ft now returns an error and should be simply -f to force the deletion, without a prompt:
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do bq rm -f my_dataset.$i; done;
You can use Python code (e.g. in a Jupyter Notebook) for the same purpose:
from google.cloud import bigquery

bigquery_client = bigquery.Client() # Create a BigQuery service object
dataset_id='Name of your dataset'
table_id='Table to be deleted'
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
bigquery_client.delete_table(table_ref) # API request
print('Table {}:{} deleted.'.format(dataset_id, table_id))
If you want to delete a complete dataset, including any tables it still contains, in one go, the command is:
!bq rm -f -r serene-boulder-203404:Temp1 # It will remove complete data set along with the tables in it
If your dataset is empty, you can use the following command instead. Make sure that you have deleted all the tables in that dataset first; otherwise it will generate an error (dataset is still in use).
#Now remove an empty dataset using bq command from Python
!bq rm -f dataset_id
print("dataset deleted successfully !!!")
I used a Windows cmd.exe for loop to delete a month of table data, but this relies on your table naming:
for %d in (01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31) DO bq rm -f -t dataset.tablename_201701%d
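A bash equivalent, assuming the same tablename_YYYYMMDD naming and GNU seq (whose -w flag zero-pads the numbers to equal width):
for d in $(seq -w 1 31); do bq rm -f -t dataset.tablename_201701$d; done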
Expanding on the excellent answer from @james: I simply needed to remove all tables in a dataset without removing the dataset itself, so the grep part was unnecessary for me. However, I still needed to get rid of the
table_id
------------------
header that bq returns when listing tables. For that, I used sed to delete those first two lines:
for i in $(bq ls -n 9999 my_dataset | sed "1,2 d" | awk '{print $1}'); do bq rm -f my_dataset.$i; done;
Perhaps there's a bq option to suppress that header, but if there is, I don't know it.
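For what it's worth, tail can skip the same two header lines instead of sed (tail -n +3 starts output at the third line):
for i in $(bq ls -n 9999 my_dataset | tail -n +3 | awk '{print $1}'); do bq rm -f my_dataset.$i; done;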

Use awk to set the title of a data series in gnuplot, based on the filename

I'm trying to use gnuplot to compare several data files. I get those files through a list, using an ls command, and then I plot those files.
List = "`echo $(ls GRID/3D/Ratio_5/DATA.dat.* | sort -V )`"
This gives me a list of files DATA.dat.0001 DATA.dat.0002 ... DATA.dat.0100 representing different times, and then I can plot them using
plot for [i in List] i u 1:2 title system('basename '.i)
In that case, no issues. However, in some cases, I need to compare different Ratio_ directories (no time dependence here), so I create my list
List = "`echo $(ls GRID/3D/Ratio_*/DATA.dat | sort -V )`"
But the same plot command gives all my series the same name: DATA.dat. I tried
plot for [i in List] i u 1:2 title system('dirname '.i)
but it prints the whole path except DATA.dat, and I'm just interested in the Ratio_ number.
I created a function to extract the Ratio from the filename string:
Ratio(filename) = "awk -F / '{print $3}' ".filename
plot for [i in List] i u 1:2 title Ratio(i)
but it just prints the whole string awk -F / '{print $3}' GRID/3D/Ratio_1/DATA.dat for every data series. I tried to add a system call within the definition of Ratio(filename), without success:
Ratio(filename)=system("awk '{print $3}' " .filename)
plot for [i in List] i u 1:2 title Ratio(i)
but now each data series is named after the whole third column of the corresponding DATA.dat, with its several hundred lines.
I'm out of ideas. This is probably just a simple syntax problem, a misplaced quote, dot, or comma. I know I could call my gnuplot script from inside the 3D directory and use dirname, but that's not what I want: I'm calling this script along with many others placed in the same directory, to create plots for a scientific article, and placing those scripts in different places would be a mess.
Any help is appreciated. Thanks
When you do
Ratio(filename) = "awk -F / '{print $3}' ".filename
you are just creating a string, not executing it. Furthermore, when you do
Ratio(filename)=system("awk '{print $3}' " .filename)
you're asking awk to parse the file's content, rather than its name. Try:
Ratio(filename)=system("echo ".filename." | awk -F / '{print $3}'")
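An alternative without awk, reusing the basename/dirname idea from the question; a sketch that assumes the Ratio_ directory is always the one directly containing DATA.dat:
Ratio(filename) = system("basename $(dirname ".filename.")")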
You need to quote List to preserve the linefeeds in it when you pass it to awk. Compare the following:
List=$(ls GRID/3D/Ratio*/DATA.dat)
echo "$List"
GRID/3D/Ratio_1/DATA.dat
GRID/3D/Ratio_2/DATA.dat
echo $List
GRID/3D/Ratio_1/DATA.dat GRID/3D/Ratio_2/DATA.dat
and this:
awk -F\/ '{print $3}' <<< $List
Ratio_1
awk -F\/ '{print $3}' <<< "$List"
Ratio_1
Ratio_2

Need help executing SQL via shell script and using the result set

I currently have a request to build a shell script to get some data from a table using SQL (Oracle). The query I'm running returns a number of rows. Is there a way to use something like a result set?
Currently I'm redirecting it to a file, but I'm not able to reuse the data for further processing.
Edit: Thanks for the reply Gene. The result file looks like:
UNIX_PID 37165
----------
PARTNER_ID prad
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
/mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml
pradeep1
/mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654
----------
PARTNER_ID swam
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
smariswam2
/mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
There are multiple rows like this. My requirement is to write this program using only a shell script.
I need to take each of the PIDs and check if the process is running, which I can take care of.
My question is: how do I check each PID so I can loop and get the corresponding partner_id and xml_file name? Since it is a file, how can I get the exact corresponding values?
Your question is pretty short on specifics (a sample of the file to which you've redirected your query output would be helpful, as well as some idea of what you actually want to do with the data), but as a general approach, once you have your query results in a file, why not use the power of your scripting language of choice (ruby and perl are both good choices) to parse the file and act on each row?
Here is one suggested approach. It wasn't clear from the sample you posted, so I am assuming that this is actually what your sample file looks like:
UNIX_PID 37165 PARTNER_ID prad XML_FILE /mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml pradeep1 /mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654 PARTNER_ID swam XML_FILE smariswam2 /mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
I am also assuming that:
There is a line-feed at the end of the last line of your file.
The columns are separated by a single space.
Here is a suggested bash script (not optimal, I'm sure, but functional):
#! /bin/bash
cat myOutputData.txt |
while read line;
do
myPID=`echo $line | awk '{print $2}'`
isRunning=`ps -p $myPID | grep $myPID`
if [ -n "$isRunning" ]
then
echo "PARTNER_ID `echo $line | awk '{print $4}'`"
echo "XML_FILE `echo $line | awk '{print $6}'`"
fi
done
The script iterates through every line (row) of the input file. It uses awk to extract column 2 (the PID), and then does a check (using ps -p) to see if the process is running. If it is, it uses awk again to pull out and echo two fields from the file (PARTNER ID and XML FILE). You should be able to adapt the script further to suit your needs. Read up on awk if you want to use different column delimiters or do additional text processing.
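As a side note, the per-line check can be collapsed into a single awk invocation, since awk's system() function returns the exit status of the command it runs (a sketch, assuming the same one-row-per-record format and column positions as above):
awk '{ if (system("ps -p " $2 " > /dev/null") == 0) { print "PARTNER_ID", $4; print "XML_FILE", $6 } }' myOutputData.txt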
Things get a little more tricky if the output file contains one row for each data element (as you indicated). A good approach here is to use a simple state mechanism within the script and "remember" whether or not the most recently seen PID is running. If it is, then any data elements that appear before the next PID should be printed out. Here is a commented script to do just that with a file of the format you provided. Note that you must have a line-feed at the end of the last line of input data or the last line will be dropped.
#! /bin/bash
cat myOutputData.txt |
while read line;
do
# Extract the first (myKey) and second (myValue) words from the input line
myKey=`echo $line | awk '{print $1}'`
myValue=`echo $line | awk '{print $2}'`
# Take action based on the type of line this is
case "$myKey" in
"UNIX_PID")
# Determine whether the specified PID is running
isRunning=`ps -p $myValue | grep $myValue`
;;
"PARTNER_ID")
# Print the specified partner ID if the PID is running
if [ -n "$isRunning" ]
then
echo "PARTNER_ID $myValue"
fi
;;
*)
# Check to see if this line represents a file name, and print it
# if the PID is running
inputLineLength=${#line}
if (( $inputLineLength > 0 )) && [ "$line" != "XML_FILE" ] && [ -n "$isRunning" ]
then
isHyphens=`expr "$line" : -`
if [ "$isHyphens" -ne "1" ]
then
echo "XML_FILE $line"
fi
fi
;;
esac
done
I think that we are well into custom software development territory now so I will leave it at that. You should have enough here to customize the script to your liking. Good luck!

connect to a DB inside awk script

In a shell script we can connect to a database using sqlplus on Unix.
Can I perform the same thing inside an awk script?
I need to access the output of a SELECT query inside an awk script. Is that possible?
I'd do the query and feed the output of it into awk:
echo 'select onething from another;' | sqlplus -s user/pass | awk '{ weave awk magic here }'
Just like any other command:
pax> ls -alF | awk '{print $9}'
file1.txt
file2.txt
my_p0rn_dir/
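For Oracle specifically, a minimal sketch of that pattern (hypothetical scott/tiger credentials and mytable table; sqlplus -s suppresses the banner, and the set commands strip headers and row-count feedback so that awk sees only data rows):
sqlplus -s scott/tiger <<'EOF' | awk -F'|' '{print $2}'
set pagesize 0 feedback off
select col1 || '|' || col2 from mytable;
EOF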
Just use some sort of command line client for your SQL database (if available) and pipe the output to awk.
E.g. with sqlite (I don't know the equivalent SQL*Plus invocation):
echo "select * from foo;" | sqlite3 file.db | awk ...
awk can't do it on its own. This is the UNIX tools philosophy: instead of having a few tools that do many tasks, you use many little tools that each do one task and connect them together.