grep a sentence from an SQL file - sql

I have a very large SQL file and want to extract a single CREATE statement from it. The command I use is:
cat dbdump.sql |grep "CREATE TABLE site_summaries" >newdump.sql
The problem with this approach is that the statement is spread over several lines and grep only returns the first line. How do I continue the grep until it reaches a semicolon denoting the end of the statement?

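First of all, your command has a "useless use of cat"; next time just run grep '...' file rather than cat file | grep '...'.
grep matches line by line, so it cannot return a statement that spans several lines. You can use awk to achieve that easily, by making the semicolon the record separator: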
awk -v RS=';' '/CREATE TABLE site_summaries/' foo.sql
Here is a little test:
kent$ cat f
foo
bar;
this that;
CREATE TABLE site_summaries
whatever
else you need;
trash here....;
kent$ awk -v RS=';' '/CREATE TABLE site_summaries/' f
CREATE TABLE site_summaries
whatever
else you need
If you want the ; still at the end of your extracted text, take this one:
awk -v RS=';' -v ORS=";\n" '/CREATE TABLE site_summaries/' file
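One detail worth knowing about the RS=';' approach: each record starts with the newline that followed the previous semicolon, so the extracted text begins with an empty line. A minimal sketch that strips it, assuming a throwaway sample file (the table body here is invented for the demo) and assuming no semicolons occur inside string literals or comments of the statement:

```shell
# Build a small sample dump in a temporary file (contents are made up).
tmp=$(mktemp)
printf '%s\n' 'foo' 'bar;' 'CREATE TABLE site_summaries (' '  id INT' ');' 'trash here;' > "$tmp"

# RS=';' makes each statement one record; sub() drops the leading newline
# left over from the previous statement; ORS puts the semicolon back.
stmt=$(awk -v RS=';' -v ORS=';\n' \
    '/CREATE TABLE site_summaries/ { sub(/^\n+/, ""); print }' "$tmp")

printf '%s\n' "$stmt"
rm -f "$tmp"
```

If the dump contains semicolons inside quoted values, this simple record split would cut the statement short, so check the result before relying on it.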

Related

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something similar?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing the current line and go to the next line in the file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
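To make the two-file idiom concrete, here is a small self-contained sketch. The file contents are trimmed to three columns, Liver.v7.egenes.txt is a hypothetical second file added only for the demo, and the FILENAME prefix is an optional extra that mimics what grep prints when it is given several files:

```shell
# Work in a scratch directory with tiny made-up inputs.
dir=$(mktemp -d)
printf '%s\n' ENSG00000237683 ENSG00000231709 > "$dir/allGenes.txt"
printf '%s\n' 'ENSG00000238009 RP11-34P13.7 1' 'ENSG00000237683 AL627309.1 1' \
    > "$dir/Stomach.v7.egenes.txt"
printf '%s\n' 'ENSG00000231709 RP5-857K21.1 1' 'ENSG00000223659 RP5-857K21.5 1' \
    > "$dir/Liver.v7.egenes.txt"

# First file: remember every gene ID. Other files: print lines whose
# first field is a remembered ID, prefixed with the file they came from.
matches=$(cd "$dir" && awk '
    NR == FNR { gene[$1]; next }
    ($1 in gene) { print FILENAME ":" $0 }
' allGenes.txt *.v7.egenes.txt)

printf '%s\n' "$matches"
rm -rf "$dir"
```

Note that the NR==FNR test assumes allGenes.txt is not empty; with an empty first file the condition would also be true for the second file.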
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try with a while read loop :
#!/bin/bash
while read -r line; do
grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt
This tells the while loop to read every line of allGenes.txt and, for each line, check whether the egenes file contains matching lines. Would that do the trick?
EDIT :
New version :
#!/bin/bash
for name in $(cat allGenes.txt); do
grep -rnw *v7.egenes.txt* -e "$name" >> output.txt
done

Bash script process csv file line by line while updating $6 with different value but keeping other values unchanged

I am a beginner at bash scripting and I have been trying to fix this for more than 8 hours.
I have searched on Stack Overflow and tried to adapt the answers to my needs, but without success.
I want to use a bash script to change the date value in a csv file to the current date.
I am using a dummy .csv file ( http://eforexcel.com/wp/wp-content/uploads/2017/07/100-Sales-Records.zip ) and I want to change the 6th value (date) to the current date.
What I have been doing so far:
I have created one line csv to test the script
cat oneline.csv:
Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50
then I have tested the one line script:
echo `cat oneline.csv | awk -F, '{ print $1"," $2"," $3"," $4"," $5","}'` `date` `cat oneline.csv |awk -F, '{print $7"," $8"," $9"," $10"," $11"," $12"," $13"," $14"\n"}'
then I have this code for the whole 100 line files in source.sh:
#I want to change 6th value for every line of source.csv to current date and keep the rest and export it to output.csv
while read
do
part1=$(`cat source.csv | awk -F, '{ print $1"," $2"," $3"," $4"," $5","}'`)
datum=$(`date`)
part2=$(`cat source.csv |awk -F, '{print $7"," $8"," $9"," $10"," $11"," $12"," $13"," $14"\n"}'`)
echo `$part1 $datum $part2`
done
and I expect to run the command like ./source.sh > output.csv
What I want for the full 100 lines file is to have result like:
Food,Offline,H,Thu Jan 17 06:34:03 EST 2019,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50
Could you guide me how to change the code to get the result?
Refactor everything to a single Awk script; that also avoids the echo in backticks.
awk -v datum="$(date)" -F , 'BEGIN { OFS=FS }
{ $6 = datum } 1' source.csv >output.csv
Briefly, we split on comma (-F ,) and replace the value of the sixth field with the value of the variable we passed in with -v. OFS=FS sets the output field separator to the input field separator (comma). Then the 1 means "print unconditionally".
Generally speaking, you should probably avoid while read for processing a file line by line when a single Awk or sed invocation can do the job in one pass.
Tangentially, your quoting looks wacky; you don't want backticks around $part1 unless it is a command you want the shell to run (which in turn is probably a bad idea in itself). Also, backticks have long been deprecated in favor of $(command) syntax which is more legible and offers some syntactic advantages.
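As a sketch of the same Awk approach on the one-line sample: the date format below (+%m/%d/%Y) is an assumption chosen to resemble the other date columns in the file; plain date works the same way and produces the long form shown in the question.

```shell
# Current date in month/day/year form (format chosen for this demo only).
datum=$(date +%m/%d/%Y)

line='Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50'

# Split on commas, overwrite field 6, and print the rebuilt record.
# OFS = FS keeps the comma as the separator when awk reassembles the line.
updated=$(printf '%s\n' "$line" |
    awk -v datum="$datum" -F , 'BEGIN { OFS = FS } { $6 = datum } 1')

printf '%s\n' "$updated"
```

This relies on no field containing a quoted comma; a csv file with quoted fields would need a real csv parser rather than -F ,.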

Batch renaming files with text from file as a variable

I am attempting to rename the files {out1.hmm, out2.hmm, ... , outn.hmm} to unique identifiers based on the third line of each file {PF12574.hmm, PF09847.hmm, PF0024.hmm}. The script works on a single file, but the variable does not get overwritten and only one file remains after running the command below:
for f in *.hmm;
do output="$(sed -n '3p' < $f |
awk -F ' ' '{print $2}' |
cut -f1 -d '.' | cat)" |
mv $f "${output}".hmm; done;
The first line takes all the outn.hmm files as input. The second line sets a variable to the desired unique identifier; sed, awk, and cut are used to extract it. The variable is supposed to rename the current file to the unique identifier, but it seems to stay fixed, so each file overwrites the previous one.
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm
How can I overwrite the variable to get the following file structure:
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm PF09847.hmm PF0024.hmm
You're piping the empty output of the assignment statement (to the variable named "output") into the mv command. That variable is not set yet, so what I think will happen is that you will - one after the other - rename all the files that match *.hmm to the file named ".hmm".
Try ls -a to see if that's what actually happened.
The sed, awk, cut, and (unneeded) cat are a bit much. awk can do all you need. Then do the mv as a separate command:
for f in *.hmm
do
output=$(awk 'NR == 3 {print $2}' "$f")
mv "$f" "${output%.*}.hmm"
done
Note that the above does not do any checking to verify that output is assigned to a reasonable value: one that is non-empty, that is a proper "identifier", etc.
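A minimal sketch of what such checking could look like. The sample file contents (an ACC line as the third line) are an assumption made for the demo, and the two checks shown (non-empty identifier, target not already present) are only one reasonable choice:

```shell
# Scratch directory with made-up .hmm files; out3.hmm has no third line.
dir=$(mktemp -d)
printf '%s\n' 'HMMER3/f' 'NAME  demo1' 'ACC   PF12574.13' > "$dir/out1.hmm"
printf '%s\n' 'HMMER3/f' 'NAME  demo2' 'ACC   PF09847.6'  > "$dir/out2.hmm"
printf '%s\n' 'HMMER3/f' 'NAME  demo3'                    > "$dir/out3.hmm"

for f in "$dir"/out*.hmm; do
    # Second field of line 3, with any trailing ".version" removed.
    id=$(awk 'NR == 3 { print $2; exit }' "$f")
    id=${id%.*}
    if [ -z "$id" ]; then
        echo "skipping $f: no identifier on line 3" >&2
        continue
    fi
    target="$dir/$id.hmm"
    if [ -e "$target" ]; then
        echo "skipping $f: $target already exists" >&2
        continue
    fi
    mv -- "$f" "$target"
done

listing=$(ls "$dir")
printf '%s\n' "$listing"
rm -rf "$dir"
```

Files that fail a check keep their original name, so nothing is overwritten or renamed to ".hmm".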

Use multiline regex on cat output

I've the following file queries.sql that contains a number of queries, structured like this:
/* Query 1 */
SELECT cab_type_id,
Count(*)
FROM trips
GROUP BY 1;
/* Query 2 */
SELECT passenger_count,
Avg(total_amount)
FROM trips
GROUP BY 1;
/* Query 3 */
SELECT passenger_count,
Extract(year FROM pickup_datetime),
Count(*)
FROM trips
GROUP BY 1,
2;
Then I've written a regex, that finds all those queries in the file:
/\*[^\*]*\*/[^;]*;
What I'd like to achieve is the following:
Select all the queries with the regex.
Prefix each query with EXPLAIN ANALYZE
Execute each query and output the results to a new file. That means, query 1 will create a file q1.txt with the corresponding output, query 2 create q2.txt etc.
One of my main challenges (there are no problems, right? ;-)) is that I'm rather unfamiliar with the Linux bash shell I have to use.
I tried cat queries.sql | grep '/\*[^\*]*\*/[^;]*;' but that doesn't return anything.
So a solution could look like:
count = 0
for query in (cat queries.sql | grep 'somehow-here-comes-my-regex') do
count = $count+1
query = 'EXPLAIN ANALYZE '+query
psql -U postgres -h localhost -d nyc-taxi-data -c query > 'q'$count'.txt'
Except that it doesn't work and I don't know how to make it work.
In bash, variable assignments must not have spaces around the = sign.
The following script should help. Save it in a file, e.g. explain.sh, make it executable using chmod 0700 explain.sh, and run it like this: ./explain.sh queries.sql.
#!/bin/bash
qfile="$1"
# list of query numbers found in the file
n="$(grep -oP '(?<=Query )[0-9]+ ' "$qfile")"
count=1
for q in $n; do
# Corrected solution, modified after the remarks of @EdMorton
qn="EXPLAIN ANALYZE $(awk -v n="Query $q" 'flag; $0 ~ n {flag=1} /;/{flag=0}' "$qfile")"
#qn="EXPLAIN ANALYZE $(awk -v n=$q "flag; /Query $q/{flag=1} /;/{flag=0}" $qfile)"
# psql -U postgres -h localhost -d nyc-taxi-data -c "$qn" > q$count.txt
echo "$qn" > q$count.txt
count=$(( $count + 1 ))
done
First of all, the script takes one argument (your example input file, queries.sql). It reads the query numbers from the file and saves them in the variable n. Then a for loop iterates over the query numbers and uses awk to extract each query and prepend EXPLAIN ANALYZE to it. Then you can run psql with the resulting query. Here I commented out the psql part, so this example script only creates a qN.txt file for each EXPLAIN query.
UPDATE:
The awk part: a shell variable can be passed to awk with the -v flag. Here we create an awk variable n whose value is built from the q shell variable; n is the start pattern, e.g. Query 1. awk -v n="Query $q" 'flag; $0 ~ n {flag=1} /;/{flag=0}' $qfile prints everything in queries.sql between the Query 1 line and the first occurrence of a semicolon (;), excluding the Query 1 line itself. The $(...) is command substitution in bash, which lets us save the output of a shell command in a variable. Here we save the output of awk and prefix it with the EXPLAIN ANALYZE string.
Here is a great answer about awk pattern matching.
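The flag idiom can be tried in isolation on a shortened copy of the sample file. This is only a sketch: the temporary file and the trimmed queries are made up for the demo, and because n is used as a pattern, "Query 2" would also match "Query 20" in a longer file.

```shell
# Two-query sample file in the same layout as queries.sql.
qfile=$(mktemp)
printf '%s\n' \
    '/* Query 1 */' 'SELECT cab_type_id,' '  Count(*)' 'FROM trips' 'GROUP BY 1;' \
    '/* Query 2 */' 'SELECT passenger_count,' '  Avg(total_amount)' 'FROM trips' 'GROUP BY 1;' \
    > "$qfile"

# "flag;" prints the line while flag is set. The marker line switches the
# flag on after it has been tested, so the marker itself is not printed.
# A line containing ";" is printed first and then switches the flag off.
q2=$(awk -v n='Query 2' 'flag; $0 ~ n { flag = 1 } /;/ { flag = 0 }' "$qfile")

printf 'EXPLAIN ANALYZE %s\n' "$q2"
rm -f "$qfile"
```

The order of the three rules matters: moving the flag test after the marker rule would include the comment line in the output.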
It sounds like this is what you're looking for:
awk -v RS= -v ORS='\0' '{print "EXPLAIN ANALYZE", $0}' queries.sql |
while IFS= read -r -d '' query; do
psql -U postgres -h localhost -d nyc-taxi-data -c "$query" > "q$((++count)).txt"
done
The awk statement outputs each query as a NUL-terminated string, the shell loop reads it as such one at a time and calls psql on it. Simple, robust, efficient, etc...

Copy data from one database into another using bash

I need to copy data from one database into my own database. Because I want to run it as a daily cron job, I prefer to have it in bash. I also need to store the values in variables so I can run various checks/validations on them. This is what I have so far:
echo "SELECT * FROM table WHERE value='ABC' AND value2 IS NULL ORDER BY time" | mysql -u user -h ip db -p | sed 's/\t/,/g' | awk -F, '{print $3,$4,$5,$7 }' > Output
cat Output | while read line
do
Value1=$(awk '{print "",$1}')
Value2=$(awk '{print "",$2}')
Value3=$(awk '{print "",$3}')
Value4=$(awk '{print "",$4}')
echo "INSERT INTO db (value1,value2,value3,value4,value5) VALUES($Value1,$Value2,'$Value3',$Value4,'n')" | mysql -u rb db -p
done
I get the data I need from the database and store it in a new file, separated by spaces. Then I read the file line by line and store the values in variables, and finally I run an insert query with the right variables.
I think something goes wrong while storing the values, but I can't really figure out what.
None of these awk calls receives $line as input. Each one reads standard input instead, so the first awk consumes the rest of the file and the others get nothing. You can fix this as:
Value1=$(echo "$line" | awk '{print $1}')
Value2=$(echo "$line" | awk '{print $2}')
Value3=$(echo "$line" | awk '{print $3}')
Value4=$(echo "$line" | awk '{print $4}')
There's no reason to call awk four times in a loop; that could be very slow. If you don't need the temporary file "Output" for another reason, you don't need it at all: just pipe the output into the while loop. You may not need sed to change tabs into commas either (you could use tr, by the way), since awk splits fields on tabs and spaces by default. That only works if the values themselves contain no spaces, which appears to be true of at least some of your data.
echo "SELECT * FROM table WHERE value='ABC' AND value2 IS NULL ORDER BY time" |
mysql -u user -h ip db -p |
sed 's/\t/,/g' | # can this be eliminated?
awk -F, '{print $3,$4,$5,$7 }' | # if you eliminate the previous line then omit the -F,
while read line
do
tmparray=($line)
Value1=${tmparray[0]}
Value2=${tmparray[1]}
Value3=${tmparray[2]}
Value4=${tmparray[3]}
echo "INSERT INTO predb (value1,value2,value3,value4,value5) VALUES($Value1,$Value2,'$Value3',$Value4,'n')" | mysql -u rb db -p
done
That uses a temporary array to split the values out of the line. This is another way to do that:
set -- $line
Value1=$1
Value2=$2
Value3=$3
Value4=$4
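A third option, sketched here under the assumption that the values contain no spaces, is to let read do the splitting by giving it one variable name per field. The two input rows are invented for the demo, and printf stands in for the call to mysql:

```shell
# read splits each line on whitespace into the named variables;
# the last variable would receive any remaining fields.
sql=$(printf '%s\n' '10 20 abc 40' '11 21 def 41' |
    while read -r Value1 Value2 Value3 Value4; do
        printf "INSERT INTO db (value1,value2,value3,value4,value5) VALUES(%s,%s,'%s',%s,'n');\n" \
            "$Value1" "$Value2" "$Value3" "$Value4"
    done)

printf '%s\n' "$sql"
```

Because the values are pasted into the SQL text unescaped, this is only safe when the source data is trusted; the checks and validations mentioned in the question would go inside the loop, before the statement is built.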