Use multiline regex on cat output - sql

I've the following file queries.sql that contains a number of queries, structured like this:
/* Query 1 */
SELECT cab_type_id,
Count(*)
FROM trips
GROUP BY 1;
/* Query 2 */
SELECT passenger_count,
Avg(total_amount)
FROM trips
GROUP BY 1;
/* Query 3 */
SELECT passenger_count,
Extract(year FROM pickup_datetime),
Count(*)
FROM trips
GROUP BY 1,
2;
Then I've written a regex, that finds all those queries in the file:
/\*[^\*]*\*/[^;]*;
What I'd like to achieve is the following:
Select all the queries with the regex.
Prefix each query with EXPLAIN ANALYZE
Execute each query and output the results to a new file. That means, query 1 will create a file q1.txt with the corresponding output, query 2 create q2.txt etc.
One of my main challenges (there are no problems, right? ;-)) is, that I'm rather unfamiliar with the linux bash I've to use.
I tried cat queries.sql | grep '/\*[^\*]*\*/[^;]*;' but that doesn't return anything.
So a solution could look like:
count = 0
for query in (cat queries.sql | grep 'somehow-here-comes-my-regex') do
count = $count+1
query = 'EXPLAIN ANALYZE '+query
psql -U postgres -h localhost -d nyc-taxi-data -c query > 'q'$count'.txt'
Except that this doesn't work, and I don't know how to make it work.

You have to omit spaces for variable assignments.
The following script would help. Save it in a file, e.g. explain.sh, make it executable with chmod 0700 explain.sh, and run it like this: ./explain.sh queries.sql.
#!/bin/bash
qfile="$1"
# query numbers taken from the "/* Query N */" markers
n="$(grep -oP '(?<=Query )[0-9]+ ' "$qfile")"
count=1
for q in $n; do
    # Corrected solution, modified after the remarks of #EdMorton
    qn="EXPLAIN ANALYZE $(awk -v n="Query $q" 'flag; $0 ~ n {flag=1} /;/{flag=0}' "$qfile")"
    #qn="EXPLAIN ANALYZE $(awk -v n=$q "flag; /Query $q/{flag=1} /;/{flag=0}" $qfile)"
    # psql -U postgres -h localhost -d nyc-taxi-data -c "$qn" > q$count.txt
    echo "$qn" > "q$count.txt"
    count=$(( count + 1 ))
done
First of all, the script takes one argument, the query file (queries.sql in your example). It reads the query numbers into the variable n. A for loop then iterates over those numbers, using awk to extract query number q and prepend EXPLAIN ANALYZE to it. You can then run psql with the resulting query. Here the psql call is commented out, so this example script only writes each EXPLAIN ANALYZE query to its own qN.txt file.
UPDATE:
The awk part: a shell variable can be passed to awk with the -v flag. Here we create an awk variable n whose value is built from the q shell variable; n is the starting pattern, e.g. Query 1. awk -v n="Query $q" 'flag; $0 ~ n {flag=1} /;/{flag=0}' $qfile prints everything between the Query 1 line and the first occurrence of a semicolon (;) in the query file, excluding the Query 1 line itself. $(...) is command substitution in bash, which lets us save the output of a shell command in a variable. Here we save the output of awk and prefix it with the EXPLAIN ANALYZE string.
Here is a great answer about awk pattern matching.
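The flag idiom described above can be tried on its own. This is a minimal sketch rather than part of the original script: extract_query is a hypothetical helper name, the sample file is cut down from the question, and the trailing space in the pattern is an added assumption so that Query 1 does not also match Query 10.

```shell
# Sample input modelled on the question's queries.sql.
flag_demo_dir=$(mktemp -d)
cat > "$flag_demo_dir/queries.sql" <<'EOF'
/* Query 1 */
SELECT cab_type_id,
Count(*)
FROM trips
GROUP BY 1;
/* Query 2 */
SELECT passenger_count,
Avg(total_amount)
FROM trips
GROUP BY 1;
EOF

# extract_query N FILE (hypothetical helper): print the lines of query N.
# "flag;" prints the current line while flag is set. The marker line turns
# flag on only after that test, so the marker itself is not printed. A line
# containing ";" is printed first and then turns flag off again.
extract_query() {
    awk -v n="Query $1 " 'flag; index($0, n) {flag=1} /;/ {flag=0}' "$2"
}

extract_query 2 "$flag_demo_dir/queries.sql"
```

Because the three rules run in order for every line, the position of the bare flag test decides whether the start and end lines are included.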

It sounds like this is what you're looking for:
awk -v RS= -v ORS='\0' '{print "EXPLAIN ANALYZE", $0}' queries.sql |
while IFS= read -r -d '' query; do
    psql -U postgres -h localhost -d nyc-taxi-data -c "$query" > "q$((++count)).txt"
done
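The awk statement outputs each query as a NUL-terminated string; the shell loop reads them as such, one at a time, and calls psql on each. Setting RS to the empty string puts awk in paragraph mode, so this relies on the queries in queries.sql being separated by blank lines. Simple, robust, efficient, etc...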
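If the queries are not separated by blank lines, one alternative is to split on the /* Query N */ markers instead. The following is only a sketch under that assumption: the explain_qN.sql file names and the temporary directory are made up for illustration, and the psql call is left as a comment because it needs a live database.

```shell
# Sample input without blank lines between the queries.
split_demo_dir=$(mktemp -d)
cat > "$split_demo_dir/queries.sql" <<'EOF'
/* Query 1 */
SELECT cab_type_id,
Count(*)
FROM trips
GROUP BY 1;
/* Query 2 */
SELECT passenger_count,
Avg(total_amount)
FROM trips
GROUP BY 1;
EOF

# Start a new output file at every "/* Query N */" marker and begin it with
# "EXPLAIN ANALYZE "; every following line is appended to that same file.
# With the default field splitting, $3 of the marker line is the number N.
awk -v dir="$split_demo_dir" '
    /^\/\* Query [0-9]+ \*\// {
        if (out != "") close(out)
        out = dir "/explain_q" $3 ".sql"
        printf "EXPLAIN ANALYZE " > out
        next
    }
    out != "" { print > out }
' "$split_demo_dir/queries.sql"

# Each generated file could then be run with something like (not executed here):
#   psql -U postgres -h localhost -d nyc-taxi-data -f "$f" > "${f%.sql}.txt"
cat "$split_demo_dir/explain_q1.sql"
```

Writing one file per query avoids passing multi-line strings through the shell at all, at the cost of leaving intermediate files behind.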

Related

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something similar?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing the current line and go to the next line of input
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
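To see the two-file idiom at work, here is a small self-contained sketch. The file names and the three sample rows are made up for the demonstration (the rows are copied from the question), and prefixing each match with FILENAME is an addition that mimics what grep prints when it searches several files.

```shell
lookup_demo_dir=$(mktemp -d)

# Two wanted gene IDs (stand-in for allGenes.txt).
printf '%s\n' ENSG00000237683 ENSG00000223659 > "$lookup_demo_dir/genes.txt"

# A cut-down stand-in for one *.v7.egenes.txt file.
cat > "$lookup_demo_dir/Demo.v7.egenes.txt" <<'EOF'
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
EOF

# First file: remember every ID as an array index. Remaining files: print the
# rows whose first field is one of those indexes, prefixed with the file name.
(
    cd "$lookup_demo_dir" &&
    awk 'NR==FNR { gene[$1]; next } ($1 in gene) { print FILENAME ": " $0 }' \
        genes.txt *.v7.egenes.txt > matches.txt
)
cat "$lookup_demo_dir/matches.txt"
```

Note that the NR==FNR test assumes the first file is not empty; with an empty gene list the first data file would be read as the list instead.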
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try a while read loop:
#!/bin/bash
while IFS= read -r line; do
    grep -nw -e "$line" Stomach.v7.egenes.txt >> output.txt
done < allGenes.txt
This tells the while loop to read every line of allGenes.txt and, for each one, check whether the egenes file contains matching lines. Would that do the trick?
EDIT:
New version, which searches all the egenes files:
#!/bin/bash
for name in $(cat allGenes.txt); do
    grep -nw -e "$name" *.v7.egenes.txt >> output.txt
done
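The loops above start one grep process per gene, which tends to be slow for long gene lists. A single pass with fixed-string matching, along the lines of the -F suggestion mentioned earlier, may be worth timing first. The sketch below uses made-up file names and shortened sample rows; unlike the awk approach, it matches the ID anywhere on the line, not only in the first column.

```shell
fixed_demo_dir=$(mktemp -d)
printf '%s\n' ENSG00000237683 ENSG00000223659 > "$fixed_demo_dir/genes.txt"
printf '%s\n' \
    'ENSG00000238009 RP11-34P13.7 1' \
    'ENSG00000237683 AL627309.1 1' > "$fixed_demo_dir/Demo.v7.egenes.txt"

# -F: treat every pattern as a fixed string instead of a regular expression
# -w: match whole words only
# -f: read the patterns from a file, one per line
grep -F -w -f "$fixed_demo_dir/genes.txt" "$fixed_demo_dir/Demo.v7.egenes.txt" \
    > "$fixed_demo_dir/matches.txt"
cat "$fixed_demo_dir/matches.txt"
```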

grep a sentence from an SQL FIle

I have a very large SQL file and want to extract a single CREATE statement from it. The command I use is:
cat dbdump.sql |grep "CREATE TABLE site_summaries" >newdump.sql
The problem with this approach is that the statement is spread over several lines and grep only returns the first line. How do I continue the grep until it reaches a semicolon denoting the end of the statement?
First of all, your command has a "useless use of cat"; next time just run grep '...' file rather than cat file | grep ...
grep checks its input line by line. You can use awk to achieve what you want easily:
awk -v RS=';' '/CREATE TABLE site_summaries/' foo.sql
made a little test:
kent$ cat f
foo
bar;
this that;
CREATE TABLE site_summaries
whatever
else you need;
trash here....;
kent$ awk -v RS=';' '/CREATE TABLE site_summaries/' f
CREATE TABLE site_summaries
whatever
else you need
If you want the ; still at the end of your extracted text, take this one:
awk -v RS=';' -v ORS=";\n" '/CREATE TABLE site_summaries/' file
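The same idea can be wrapped in a small function. This is a sketch only: extract_statement is a hypothetical name, the search text is matched literally with index() instead of as a regex, and leading whitespace is trimmed because each record starts with the newline that followed the previous semicolon. Like the one-liners above, it will split in the wrong place if a semicolon appears inside a string literal.

```shell
stmt_demo_dir=$(mktemp -d)
cat > "$stmt_demo_dir/dump.sql" <<'EOF'
foo
bar;
this that;
CREATE TABLE site_summaries
whatever
else you need;
trash here....;
EOF

# extract_statement TEXT FILE (hypothetical helper): print every
# semicolon-terminated statement that contains TEXT, with its ";" restored.
extract_statement() {
    awk -v RS=';' -v pat="$1" '
        index($0, pat) {
            sub(/^[ \t\n]+/, "")   # drop the newline left over from the previous record
            print $0 ";"
        }
    ' "$2"
}

extract_statement 'CREATE TABLE site_summaries' "$stmt_demo_dir/dump.sql"
```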

How to get Output of SQL query in a flat file using shell script

I want to use a shell script to automate running a SQL file that contains multiple SQL queries, and at the same time I want to get the output of the last query from that SQL file into a flat file.
You can create a file that executes each query but only sends the output of the last one.
#!/bin/bash
mysql -h... -u... -p... -e 'query1' > /dev/null
mysql -h... -u... -p... -e 'query2' > /dev/null
mysql -h... -u... -p... -e 'query3' > /result.sql
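This awk command finds the last semicolon-separated query in the file and pipes it to mysql (note that the password must follow -p without a space; otherwise mysql prompts for it and treats the next word as the database name):
awk 'BEGIN {RS=";"} NF > 0 {query=$0; } END {print query}' file.sql | mysql -u username -ppassword > output.txt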
The NF > 0 keeps it from setting query to the empty line after the last ;.
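The awk part can be checked without a database by printing the extracted query instead of piping it to mysql. A minimal sketch, where last_query is a hypothetical helper name and the three sample statements are invented; it also trims the leading newline and puts the semicolon back, which the one-liner above does not do.

```shell
last_demo_dir=$(mktemp -d)
printf '%s\n' 'SELECT 1;' 'SELECT 2;' 'SELECT 3;' > "$last_demo_dir/file.sql"

# last_query FILE (hypothetical helper): print the last non-empty
# semicolon-separated statement of FILE, with its ";" restored.
last_query() {
    awk 'BEGIN { RS = ";" }
         NF > 0 { query = $0 }      # skip the empty record after the final ";"
         END {
             sub(/^[ \t\n]+/, "", query)
             print query ";"
         }' "$1"
}

last_query "$last_demo_dir/file.sql"
```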

How can I convert SQL comments with -- to # using Perl?

UPDATE:
This is what works!
fgrep -ircl --include=*.sql -- -- *
I have various SQL files with '--' comments and we migrated to the latest version of MySQL and it hates these comments. I want to replace -- with #.
I am looking for a recursive, inplace replace one-liner.
This is what I have:
perl -p -i -e 's/--/# /g' `fgrep -- -- *`
A sample .sql file:
use myDB;
--did you get an error
I get the following error:
Unrecognized switch: --did (-h will show valid options).
p.s : fgrep skipping 2 dashes was just discussed here if you are interested.
Any help is appreciated.
The command-line arguments after the -e 's/.../.../' argument should be filenames. Use fgrep -l to return names of files that contain a pattern:
perl -p -i -e 's/--/# /g' `fgrep -l -- -- * `
I'd use a combination of find and inplace sed
find . -name '*.sql' -exec sed -i -e "s/^--/#/" '{}' \;
Note that it will only replace -- at the beginning of a line.
The regex becomes vastly more complex if you also want to handle a trailing comment like this, for example:
INSERT INTO stuff VALUES (...) -- values used for xyz
because the -- might as well be in your data (I guess you don't want to replace those)
INSERT INTO stuff VALUES (42, "<!-- sboing -->") -- values used for xyz
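The conservative behaviour can be tried on standard input before touching any files. The sketch below assumes that a comment may be preceded by indentation, which the find/sed command above does not handle; convert_leading_comments is a hypothetical name. It deliberately leaves every -- that appears later in a line alone. As background, MySQL only treats -- as a comment start when it is followed by whitespace, which is why a line such as --did you get an error is rejected.

```shell
# convert_leading_comments (hypothetical helper): read SQL on stdin and turn a
# "--" that starts a line (optionally after indentation) into "#".
convert_leading_comments() {
    sed 's/^\([[:space:]]*\)--/\1#/'
}

printf '%s\n' \
    'use myDB;' \
    '--did you get an error' \
    'INSERT INTO stuff VALUES (42, "<!-- sboing -->") -- values used for xyz' |
    convert_leading_comments
```

Only the second line changes here; the INSERT line is passed through untouched, including its trailing comment.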
The equivalent of that in script form is:
#!/usr/bin/perl -i
use warnings;
use strict;
while(<>) {
s/--/# /g;
print;
}
If I have several files with comments of the form --comment and feed any number of file names to this script, they are changed in place to # comment. You could use find, ls, grep, etc. to find the files...
There is nothing wrong per se with using a one-liner.
Is that what you are looking for?

Need help in executing the SQL via shell script and use the result set

I currently have a request to build a shell script to get some data from a table using SQL (Oracle). The query which I'm running returns a number of rows. Is there a way to use something like a result set?
Currently, I'm re-directing it to a file, but I'm not able to reuse the data again for the further processing.
Edit: Thanks for the reply Gene. The result file looks like:
UNIX_PID 37165
----------
PARTNER_ID prad
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
/mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml
pradeep1
/mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654
----------
PARTNER_ID swam
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
smariswam2
/mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
There are multiple rows like this. My requirement is only to use shell script and write this program.
I need to take each of the pid and check if the process is running, which I can take care of.
My question is how do I check for each PID so I can loop and get corresponding partner_id and the xml_file name? Since it is a file, how can I get the exact corresponding values?
Your question is pretty short on specifics (a sample of the file to which you've redirected your query output would be helpful, as well as some idea of what you actually want to do with the data), but as a general approach, once you have your query results in a file, why not use the power of your scripting language of choice (ruby and perl are both good choices) to parse the file and act on each row?
Here is one suggested approach. It wasn't clear from the sample you posted, so I am assuming that this is actually what your sample file looks like:
UNIX_PID 37165 PARTNER_ID prad XML_FILE /mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml pradeep1 /mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654 PARTNER_ID swam XML_FILE smariswam2 /mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
I am also assuming that:
There is a line-feed at the end of the last line of your file.
The columns are separated by a single space.
Here is a suggested bash script (not optimal, I'm sure, but functional):
#! /bin/bash
cat myOutputData.txt |
while read line;
do
    myPID=`echo $line | awk '{print $2}'`
    isRunning=`ps -p $myPID | grep $myPID`
    if [ -n "$isRunning" ]
    then
        echo "PARTNER_ID `echo $line | awk '{print $4}'`"
        echo "XML_FILE `echo $line | awk '{print $6}'`"
    fi
done
The script iterates through every line (row) of the input file. It uses awk to extract column 2 (the PID), and then does a check (using ps -p) to see if the process is running. If it is, it uses awk again to pull out and echo two fields from the file (PARTNER ID and XML FILE). You should be able to adapt the script further to suit your needs. Read up on awk if you want to use different column delimiters or do additional text processing.
Things get a little more tricky if the output file contains one row for each data element (as you indicated). A good approach here is to use a simple state mechanism within the script and "remember" whether or not the most recently seen PID is running. If it is, then any data elements that appear before the next PID should be printed out. Here is a commented script to do just that with a file of the format you provided. Note that you must have a line-feed at the end of the last line of input data or the last line will be dropped.
#! /bin/bash
cat myOutputData.txt |
while read line;
do
    # Extract the first (myKey) and second (myValue) words from the input line
    myKey=`echo $line | awk '{print $1}'`
    myValue=`echo $line | awk '{print $2}'`
    # Take action based on the type of line this is
    case "$myKey" in
        "UNIX_PID")
            # Determine whether the specified PID is running
            isRunning=`ps -p $myValue | grep $myValue`
            ;;
        "PARTNER_ID")
            # Print the specified partner ID if the PID is running
            if [ -n "$isRunning" ]
            then
                echo "PARTNER_ID $myValue"
            fi
            ;;
        *)
            # Check to see if this line represents a file name, and print it
            # if the PID is running
            inputLineLength=${#line}
            if (( $inputLineLength > 0 )) && [ "$line" != "XML_FILE" ] && [ -n "$isRunning" ]
            then
                isHyphens=`expr "$line" : -`
                if [ "$isHyphens" -ne "1" ]
                then
                    echo "XML_FILE $line"
                fi
            fi
            ;;
    esac
done
I think that we are well into custom software development territory now so I will leave it at that. You should have enough here to customize the script to your liking. Good luck!
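For comparison, the same state-machine idea can be written without spawning echo and awk for every line, by letting read split each line into a key and a value. This is a sketch under assumptions, not a drop-in replacement: report_running and is_running are hypothetical names, and kill -0 is swapped in for the ps/grep check. kill -0 also reports failure for a process that exists but belongs to another user, so the ps -p test may be the better choice in that situation.

```shell
# is_running PID (hypothetical helper): succeed if a signal could be sent
# to PID, without actually sending one.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# report_running FILE (hypothetical helper): print PARTNER_ID and XML_FILE
# lines for every UNIX_PID block whose process is currently running.
report_running() {
    running=no
    while read -r key value; do
        case "$key" in
            UNIX_PID)
                # Remember the state of this PID for the lines that follow
                if is_running "$value"; then running=yes; else running=no; fi
                ;;
            PARTNER_ID)
                if [ "$running" = yes ]; then echo "PARTNER_ID $value"; fi
                ;;
            XML_FILE|-*|"")
                # Column header, hyphen separator, or blank line: nothing to print
                ;;
            *)
                # Anything else is treated as a file name line
                if [ "$running" = yes ]; then echo "XML_FILE $key"; fi
                ;;
        esac
    done < "$1"
}
```

It would be called as report_running myOutputData.txt. Reading from a redirect instead of a pipe keeps the loop in the current shell, so variables set inside it remain visible afterwards.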