Need help executing SQL via a shell script and using the result set

I currently have a request to build a shell script that gets some data from a table using SQL (Oracle). The query I'm running returns a number of rows. Is there a way to use something like a result set?
Currently, I'm redirecting the output to a file, but I'm not able to reuse the data for further processing.
Edit: Thanks for the reply Gene. The result file looks like:
UNIX_PID 37165
----------
PARTNER_ID prad
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
/mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml
pradeep1
/mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654
----------
PARTNER_ID swam
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
smariswam2
/mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
There are multiple rows like this. My requirement is to write this program using only shell script.
I need to take each of the pid and check if the process is running, which I can take care of.
My question is how do I check for each PID so I can loop and get corresponding partner_id and the xml_file name? Since it is a file, how can I get the exact corresponding values?

Your question is pretty short on specifics (a sample of the file to which you've redirected your query output would be helpful, as well as some idea of what you actually want to do with the data), but as a general approach, once you have your query results in a file, why not use the power of your scripting language of choice (ruby and perl are both good choices) to parse the file and act on each row?

Here is one suggested approach. It wasn't clear from the sample you posted, so I am assuming that this is actually what your sample file looks like:
UNIX_PID 37165 PARTNER_ID prad XML_FILE /mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml pradeep1 /mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654 PARTNER_ID swam XML_FILE smariswam2 /mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
I am also assuming that:
There is a line-feed at the end of the last line of your file.
The columns are separated by a single space.
Here is a suggested bash script (not optimal, I'm sure, but functional):
#! /bin/bash
cat myOutputData.txt |
while read -r line
do
    # Column 2 holds the PID
    myPID=$(echo "$line" | awk '{print $2}')
    # ps -p prints a row for the PID only if the process exists
    isRunning=$(ps -p "$myPID" | grep "$myPID")
    if [ -n "$isRunning" ]
    then
        echo "PARTNER_ID $(echo "$line" | awk '{print $4}')"
        echo "XML_FILE $(echo "$line" | awk '{print $6}')"
    fi
done
The script iterates through every line (row) of the input file. It uses awk to extract column 2 (the PID), and then does a check (using ps -p) to see if the process is running. If it is, it uses awk again to pull out and echo two fields from the file (PARTNER ID and XML FILE). You should be able to adapt the script further to suit your needs. Read up on awk if you want to use different column delimiters or do additional text processing.
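As a variation on the script above, here is a minimal sketch that lets `read` split the fields itself, so no awk call is needed per line. It assumes the same one-record-per-line layout, and it swaps `ps -p` for `kill -0`, which only reports success for processes you are allowed to signal (your own, or any process if you run as root). The function name `report_running` is made up for illustration.

```shell
# Sketch: print the partner ID and first XML file of every record whose PID
# is alive. Assumed layout, one record per line, space-separated:
#   UNIX_PID <pid> PARTNER_ID <partner> XML_FILE <file> [more files...]
report_running() {
    # _k1.._k3 swallow the literal keywords; _rest swallows any extra files
    while read -r _k1 pid _k2 partner _k3 xml_file _rest
    do
        # kill -0 sends no signal; it only tests whether the PID can be signalled
        if kill -0 "$pid" 2>/dev/null
        then
            echo "PARTNER_ID $partner"
            echo "XML_FILE $xml_file"
        fi
    done < "$1"
}
```

It would be called as `report_running myOutputData.txt`.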
Things get a little more tricky if the output file contains one row for each data element (as you indicated). A good approach here is to use a simple state mechanism within the script and "remember" whether or not the most recently seen PID is running. If it is, then any data elements that appear before the next PID should be printed out. Here is a commented script to do just that with a file of the format you provided. Note that you must have a line-feed at the end of the last line of input data or the last line will be dropped.
#! /bin/bash
cat myOutputData.txt |
while read -r line
do
    # Extract the first (myKey) and second (myValue) words from the input line
    myKey=$(echo "$line" | awk '{print $1}')
    myValue=$(echo "$line" | awk '{print $2}')
    # Take action based on the type of line this is
    case "$myKey" in
        "UNIX_PID")
            # Determine whether the specified PID is running
            isRunning=$(ps -p "$myValue" | grep "$myValue")
            ;;
        "PARTNER_ID")
            # Print the specified partner ID if the PID is running
            if [ -n "$isRunning" ]
            then
                echo "PARTNER_ID $myValue"
            fi
            ;;
        *)
            # Check to see if this line represents a file name, and print it
            # if the PID is running
            inputLineLength=${#line}
            if (( inputLineLength > 0 )) && [ "$line" != "XML_FILE" ] && [ -n "$isRunning" ]
            then
                # expr returns 1 when the line starts with a hyphen (a separator row)
                isHyphens=$(expr "$line" : -)
                if [ "$isHyphens" -ne 1 ]
                then
                    echo "XML_FILE $line"
                fi
            fi
            ;;
    esac
done
I think that we are well into custom software development territory now so I will leave it at that. You should have enough here to customize the script to your liking. Good luck!
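On the original "result set" question: bash has no cursor type, but arrays can play that role once the rows are in a one-row-per-line form. The sketch below is only an illustration under assumptions. `run_query` is a hypothetical stand-in for whatever command runs your SQL, and the `|` delimiter is an assumed output format that you would have to arrange in the query itself.

```shell
#!/bin/bash
# Hypothetical stand-in for the real query command. It prints one row per
# line as "pid|partner_id|xml_file"; replace the body with your client call.
run_query() {
    printf '%s\n' \
        '37165|prad|/mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml' \
        '27654|swam|/mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml'
}

# Load the rows into parallel arrays so they can be reused later in the script.
pids=() partners=() xml_files=()
while IFS='|' read -r pid partner xml_file
do
    pids+=("$pid")
    partners+=("$partner")
    xml_files+=("$xml_file")
done < <(run_query)

# Example reuse: walk the "result set" by index.
for i in "${!pids[@]}"
do
    echo "row $i: pid=${pids[$i]} partner=${partners[$i]} file=${xml_files[$i]}"
done
```

The loop reads from process substitution (`< <(...)`) rather than a pipe, so the arrays are filled in the current shell and remain available after the loop ends.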

Related

Looking for a good output format to use a value extracted from a file in new script/process in Nextflow

I can't seem to figure this one out:
I am writing some processes in Nextflow in which I extract a value from a .txt file (PROCESS1) and want to use it in a second process (PROCESS2). Extracting the value is no problem, but finding a suitable output format is. When I save the stdout (OPTION1) to a channel, there seems to be some kind of "\n" attached, which causes problems in my second script.
Alternatively because this was not working I wanted to save the output of PROCESS1 as a file (OPTION2). Also this is no problem but I can't find the correct way to read the content of the file in PROCESS2. I suspect it has something to do with "getText()" but I tried several things and they all failed.
Finally I wanted to try to save the output as a variable (OPTION3) but I don't know how to do this.
PROCESS1
process txid {
publishDir "$wanteddir", mode:'copy', overwrite: true
input:
file(report) from report4txid
output:
stdout into txid4assembly //OPTION 1
file(txid.txt) into txid4assembly //OPTION 2
val(txid) into txid4assembly //OPTION 3: doesn't work
shell:
'''
column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 //OPTION1
column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 > txid.txt //OPTION2
column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 > txid //OPTION3
'''
}
PROCESS2
process accessions {
publishDir "$wanteddir", mode:'copy', overwrite: true
input:
val(txid) from txid4assembly //OPTION1 & OPTION3
file(txid) from txid4assembly //OPTION2
output:
file("${txid}accessions.txt") into accessionlist
script:
"""
esearch -db assembly -query '${txid}[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' \
| esummary | xtract -pattern DocumentSummary -element AssemblyAccession > ${txid}accessions.txt
"""
}
RESULTING SCRIPT OF PROCESS2 AFTER OPTION 1 (remark: output = 573, lay-out unchanged)
esearch -db assembly -query '573
[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > 573
accessions.txt
Thank you for your help!
As you've discovered, your command line writes a trailing newline character. You could try removing it somehow, perhaps by piping to another command, or (better) by refactoring to properly parse your report files. Below is an example using awk to print the fifth column without a trailing newline character. This might work fine for a simple CSV report file, but the CSV parsing capabilities of AWK are limited. So if your reports could contain quoted fields etc., consider using a language that offers CSV parsing in its standard library (e.g. Python and the csv library, or Perl and the Text::CSV module). Nextflow makes it easy to use your favourite scripting language.
process txid {
publishDir "$wanteddir", mode:'copy', overwrite: true
input:
file(report) from report4txid
output:
stdout into txid4assembly
shell:
'''
awk -F, '$4 == "S" { printf("%s", $5); exit }' "!{report}"
'''
}
In the case where your file contains an "S" in the fourth column and the fifth column has some value with string length >= 1, this will give you a value that you can use in your 'accessions' process. But please be aware that this won't handle the case where the fourth column in your file is never equal to "S". Nor will it handle the case where your fifth column is empty (string length == 0). In these cases 'stdout' will be empty, so you'll get an empty value in your output channel. You may want to add some code to make sure that these edge cases are handled somehow.
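One way those edge cases might be guarded, sketched as a plain shell function that could be adapted into the process's shell block. The name `extract_txid` is made up, and the check assumes the taxonomy ID is purely numeric (as in the 573 example); inside a Nextflow `shell:` block the file argument would be `!{report}`.

```shell
# Sketch: print column 5 of the first row whose column 4 is "S", with no
# trailing newline, and fail loudly if no usable value is found.
extract_txid() {
    txid=$(awk -F, '$4 == "S" { printf("%s", $5); exit }' "$1")
    case "$txid" in
        ''|*[!0-9]*)
            # Empty (no "S" row, or empty column 5) or not purely numeric
            echo "no numeric txid found in $1" >&2
            return 1
            ;;
    esac
    printf '%s' "$txid"
}
```

Because the function returns non-zero when nothing usable is found, a task that calls it as its last command would fail instead of sending an empty value down the channel.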
I eventually fixed it by adding the following code, which keeps only the digits from my output:
... | tr -dc '0-9'

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like this *.v7.egenes.txt in the current directory. For example one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something similar?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing and go to next line in the file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
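To try the awk lookup on a small sample before running it on the full data, it can be wrapped in a function like the sketch below. The name `filter_by_genes` is made up, and the sketch assumes the gene list is not empty (with an empty list, NR == FNR would also hold for the first data file).

```shell
# Sketch: print every row whose first field appears in the gene list.
# Usage: filter_by_genes <gene list> <data file>...
filter_by_genes() {
    genes=$1
    shift
    # NR == FNR only holds while awk is reading the first file (the gene list)
    awk 'NR == FNR { gene[$1]; next } $1 in gene' "$genes" "$@"
}
```

For the files in the question the call would be `filter_by_genes allGenes.txt *.v7.egenes.txt > output.txt`. The grep variant mentioned above would be `grep -F -w -f allGenes.txt *.v7.egenes.txt`, which treats each gene ID as a fixed string rather than a regular expression.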
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try with a while read loop :
#!/bin/bash
while read -r line; do
grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt
Here we tell the while loop to read every line from allGenes.txt and, for each line, check whether there are matching lines in the egenes file. Would that do the trick?
EDIT :
New version :
#!/bin/bash
for name in $(cat allGenes.txt); do
grep -rnw *v7.egenes.txt* -e $name >> output.txt
done

Batch renaming files with text from file as a variable

I am attempting to rename the files titled {out1.hmm, out2.hmm, ... , outn.hmm} to unique identifiers based on the third line of each file {PF12574.hmm, PF09847.hmm, PF0024.hmm}. The script works on a single file; however, the variable does not get overwritten and only one file remains after running the command below:
for f in *.hmm;
do output="$(sed -n '3p' < $f |
awk -F ' ' '{print $2}' |
cut -f1 -d '.' | cat)" |
mv $f "${output}".hmm; done;
The first line takes all the outn.hmm files as input. The second line sets a variable to the desired unique identifier; sed, awk, and cut are used to extract it. The variable is supposed to rename the current file to the unique identifier, however the variable remains locked and each file overwrites the previous one.
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm
How can I overwrite the variable to get the following file structure:
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm PF09847.hmm PF0024.hmm
You're piping the empty output of the assignment statement (to the variable named "output") into the mv command. That variable is not set yet, so what I think will happen is that you will - one after the other - rename all the files that match *.hmm to the file named ".hmm".
Try ls -a to see if that's what actually happened.
The sed, awk, cut, and (unneeded) cat are a bit much. awk can do all you need. Then do the mv as a separate command:
for f in *.hmm
do
output=$(awk 'NR == 3 {print $2}' "$f")
mv "$f" "${output%.*}.hmm"
done
Note that the above does not do any checking to verify that output is assigned to a reasonable value: one that is non-empty, that is a proper "identifier", etc.
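A minimal sketch of what such checking might look like. The function name `rename_hmm` and the rule that an identifier may only contain letters, digits and underscores are assumptions for illustration, not something the question specifies.

```shell
# Sketch: rename one .hmm file after the identifier on its third line,
# skipping the file when the identifier is unusable or the target exists.
rename_hmm() {
    f=$1
    id=$(awk 'NR == 3 { print $2; exit }' "$f")
    id=${id%%.*}                # keep the part before the first dot, like cut -f1 -d.
    dir=$(dirname "$f")
    case "$id" in
        ''|*[!A-Za-z0-9_]*)
            echo "skipping $f: unusable identifier '$id'" >&2
            return 1
            ;;
    esac
    if [ -e "$dir/$id.hmm" ]
    then
        echo "skipping $f: $dir/$id.hmm already exists" >&2
        return 1
    fi
    mv -- "$f" "$dir/$id.hmm"
}

# Possible use:  for f in out*.hmm; do rename_hmm "$f"; done
```

Refusing to overwrite an existing target means that two files with the same identifier leave the second one untouched instead of silently replacing the first.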

grep a number from the line and append it to a file

I went through several grep examples, but don't see how to do the following.
Say, i have a file with a line
! some test here and number -123.2345 text
i can get this line using
grep ! input.txt
but how do i get the number (possibly positive or negative) from this line and append it to the end of another file? Is it possible to apply grep to grep results?
If yes, then i could get the number via something like
grep -Eo "[0-9]{1,}|\-[0-9]{1,}"
p/s/ i am using OS-X
p/p/s/ i'm trying to fetch data from several files and put into a single file for later plotting.
The format with your commands would be:
grep ! input.txt | grep -Eo "[0-9]{1,}|\-[0-9]{1,}" >> output
To grep from grep we use the pipe operator | this lets us chain commands together. To append this output to a file we use the redirection operator >>.
However, there are a couple of problems. Your regexp is better written as grep -Eoe '-?[0-9.]+', which allows for the decimal point and returns a single number instead of two. And if you want lines that start with !, then grep ^! is better, to avoid matching lines that contain ! but don't start with it. Better to do:
grep '^!' input | grep -Eoe '-?[0-9.]+' >> output
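Since the goal is to collect numbers from several files into one, the two greps can be wrapped in a small function, sketched below. The name `extract_numbers` is made up, and the pattern is a slightly stricter variation of the one above (an optional minus sign, digits, and an optional decimal part), so a stray dot on its own is not picked up.

```shell
# Sketch: print every number found on lines that start with "!".
# Usage: extract_numbers <file>...
extract_numbers() {
    # -h suppresses the file name prefix when several files are given
    grep -h '^!' "$@" | grep -Eo -e '-?[0-9]+(\.[0-9]+)?'
}
```

Appending to the collection file would then look like `extract_numbers input1.txt input2.txt >> output`.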
perl -lne 'm/.*?([\d\.\-]+).*/g;print $1' your_file >>anotherfile_to_append
$ foo="! some test here and number -123.2345 text"
$ echo $foo | sed -e 's/[^0-9\.-]//g'
-123.2345
Edit:-
for a file,
[ ]$ cat log
! some test here and number -123.2345 text
some blankline
some line without "the character" and with number 345.566
! again a number 34
[ ]$ sed -e '/^[^!]/d' -e 's/[^0-9.-]//g' log > op
[ ]$ cat op
-123.2345
34
Now let's see the toothpicks :) In '/^[^!]/d', the slashes delimit the pattern, ^ anchors it at the start of the line, [^!] matches any character other than !, and d deletes the line, so every line that starts with something other than ! is dropped. In the second expression, [^0-9.-] matches any character that is not a digit, a . or a -, and the empty replacement (//) deletes it, leaving only the number :)

ksh script optimization

I have a small script that simply reads each line of a file, retrieves the id field, runs a utility to get the name and appends the name at the end. The problem is that the input file is huge (2GB). Since the output is the same as the input with a 10-30 char name appended, it is of the same order of magnitude. How can I optimize it to read large buffers, process in buffers and then write buffers to the file so that the number of file accesses is minimized?
#!/bin/ksh
while read line
do
id=`echo ${line}|cut -d',' -f 3`
NAME=$(id2name ${id} | cut -d':' -f 4)
if [[ $? -ne 0 ]]; then
NAME="ERROR"
echo "Error getting name from id2name for id: ${id}"
fi
echo "${line},\"${NAME}\"" >> ${MYFILE}
done < ${MYFILE}.csv
Thanks
You can speed things up considerably by eliminating the two calls to cut in each iteration of the loop. It also might be faster to move the redirection to your output file to the end of the loop. Since you don't show an example of an input line, or what id2name consists of (it's possible it's a bottleneck) or what its output looks like, I can only offer this approximation:
#!/bin/ksh
while IFS=, read -r field1 field2 id remainder # use appropriate var names
do
line=$field1,$field2,$id,$remainder
# warning - reused variables
IFS=: read -r field1 field2 field3 NAME remainder <<< $(id2name "$id")
if [[ $? -ne 0 ]]; then
NAME="ERROR"
# if you want this message to go to stderr instead of being included in the output file include the >&2 as I've done here
echo "Error getting name from id2name for id: ${id}" >&2
fi
echo "${line},\"${NAME}\""
done < "${MYFILE}.csv" > "${MYFILE}"
The OS will do the buffering for you.
Edit:
If your version of ksh doesn't have <<<, try this:
id2name "$id" | IFS=: read -r field1 field2 field3 NAME remainder
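(If you were using Bash, this wouldn't work: by default Bash runs each part of a pipeline in a subshell, so the variables set by read would be lost when the pipeline ends.)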
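One caveat with both variants: the $? tested after read is read's exit status, not id2name's, so the error branch may never fire. A possible way around this, sketched under assumptions, is to capture the output first and test the command itself. The `id2name` function below is only a hypothetical stub standing in for the real utility so that the sketch runs on its own, and `lookup_name` is a made-up name.

```shell
# Hypothetical stub standing in for the real id2name utility; delete it
# when adapting the sketch, otherwise it shadows the real command.
id2name() {
    case "$1" in
        42) echo "a:b:c:Alice:rest" ;;
        *)  return 1 ;;
    esac
}

# Sketch: print the name for an id, or ERROR if id2name itself fails.
lookup_name() {
    # Capture the output first so the if tests id2name's own exit status
    if out=$(id2name "$1")
    then
        # read splits on ':' in the current shell; field 4 is the name
        IFS=: read -r _f1 _f2 _f3 NAME _rest <<EOF
$out
EOF
    else
        NAME="ERROR"
        echo "Error getting name from id2name for id: $1" >&2
    fi
    printf '%s\n' "$NAME"
}
```

The here-document is used instead of <<< or a pipe so that the same code behaves the same way in ksh and Bash.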