How to grep the outputs of awk, line by line? - awk

Let's say I have the following text file:
$ cat file1.txt outputs
MarkerName Allele1 Allele2 Freq1 FreqSE P-value Chr Pos
rs2326918 a g 0.8510 0.0001 0.5255 6 130881784
rs2439906 c g 0.0316 0.0039 0.8997 10 6870306
rs10760160 a c 0.5289 0.0191 0.8107 9 123043147
rs977590 a g 0.9354 0.0023 0.8757 7 34415290
rs17278013 t g 0.7498 0.0067 0.3595 14 24783304
rs7852050 a g 0.8814 0.0006 0.7671 9 9151167
rs7323548 a g 0.0432 0.0032 0.4555 13 112320879
rs12364336 a g 0.8720 0.0015 0.4542 11 99515186
rs12562373 a g 0.7548 0.0020 0.6151 1 164634379
Here is an awk command which prints MarkerName (column 1) whenever Pos (column 8) >= 11000000. Note that the header line slips through because comparing the non-numeric string "Pos" against 11000000 falls back to a string comparison:
$ awk '{ if($8 >= 11000000) { print $1 }}' file1.txt
This command outputs the following:
MarkerName
rs2326918
rs10760160
rs977590
rs17278013
rs7323548
rs12364336
rs12562373
Question: I would like to feed this into a grep statement to parse another text file, textfile2.txt. Somehow, one pipes the output from the previous awk command into grep AWKOUTPUT textfile2.txt
I would like each row of the awk command above to be grepped against textfile2.txt, i.e.
grep "rs2326918" textfile2.txt
## and then
grep "rs10760160" textfile2.txt
### and then
...
Naturally, I would save all resulting rows from textfile2.txt into a final file, i.e.
$ awk '{ if($8 >= 11000000) { print $1 }}' file1.txt | grep PIPE_OUTPUT_BY_ROW textfile2.txt > final.txt
How does one grep from a pipe line by line?
EDIT: To clarify, the one constraint I have is that file1.txt is actually the output of a previous pipe. (I'm trying to simplify the question somewhat.) How would that change the answer?

awk + grep solution:
grep -f <(awk '$8 >= 11000000{ print $1 }' file1.txt) textfile2.txt > final.txt
-f file - obtain patterns from file, one per line
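Since file1.txt is actually the output of a previous pipe (see the EDIT above), you can skip the intermediate file entirely. A minimal sketch, assuming a Linux system where /dev/stdin is available; previous_command is a placeholder for whatever pipeline produced file1.txt:
previous_command | awk '$8 >= 11000000{ print $1 }' | grep -f /dev/stdin textfile2.txt > final.txt
Here grep takes its pattern list from the pipe and scans textfile2.txt against all of the patterns in a single pass.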

You can use bash to do this:
bash-3.1$ echo "rs2326918" > filename2.txt
bash-3.1$ (for i in `awk '{ if($8 >= 11000000) { print $1 }}' file1.txt |
grep -v MarkerName`; do grep $i filename2.txt; done) > final.txt
bash-3.1$ cat final.txt
rs2326918
Alternatively,
bash-3.1$ cat file1.txt | (for i in `awk '{ if($8 >= 11000000) { print $1 }}' |
grep -v MarkerName`; do grep $i filename2.txt; done) > final.txt
The -v switch "inVerts" the match: grep prints only the lines that do not match the pattern. Here it filters the MarkerName header line out of the marker list before the loop runs.

awk alone can also do this for you:
$ awk 'NR>1 && NR==FNR { if ($8 >= 11000000) a[$1]++; next } \
{ for (i in a) { if ($0 ~ i) print } }' file1.txt file2.txt > final.txt
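Note that $0~i is a substring/regex match, so a marker like rs1234 would also match a line containing rs12345. If the marker appears as the first field of file2.txt (an assumption; adjust the field number to your data), an exact-match variant avoids that:
$ awk 'NR==FNR { if (FNR>1 && $8 >= 11000000) a[$1]; next } $1 in a' file1.txt file2.txt > final.txt
The FNR>1 test skips the MarkerName header, and $1 in a prints only lines whose first field exactly equals one of the collected markers.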

Related

AWK taking very long to process file

I have a large file with about 6 million records. I need to split this file into smaller files based on the first 17 characters, so that records whose first 17 characters are the same are grouped into a file of that name.
The command I use for this is :
awk -v FIELDWIDTHS="17" '{print > $1".txt"}' $file_name
The problem is that this is painfully slow. For a file with 800K records it took about an hour to complete.
Sample input would be:
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
BBBBBBBBBBBBBBBBBBBBBBBBBBBB18741274
CCCCCCCCCCCCCCCCCCCCCCCCCCCC38123922
Is there a faster solution to this problem?
I read that Perl can also be used to split files, but I couldn't find an option like FIELDWIDTHS in Perl.
Any help will be greatly appreciated.
uname : Linux
bash-4.1$ ulimit -n
1024
sort file |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
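Run against the sample input above, this creates one file per 17-character prefix, e.g.:
$ cat AAAAAAAAAAAAAAAAA.txt
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
Note that sort has changed the relative order of the A lines; see below if the original order matters.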
Performance improvements included:
By not referring to any field, it lets awk skip field splitting entirely
By sorting first and changing the output file name only when the key part of the input changes, it lets awk keep just one output file open at a time instead of having to manage opening/closing potentially thousands of output files
And it's portable to all awks since it doesn't use a gawk-specific extension like FIELDWIDTHS.
If the lines in each output file have to retain their original relative order after sorting, then you can use a decorate/sort/undecorate approach, something like this (assuming no white space in the input, just like in the example you provided):
awk '{print substr($0,1,17)".txt", NR, $0}' file |
sort -k1,1 -k2,2n |
awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
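To see what the decorate step adds, run just the first awk over the sample input: it prepends the destination file name (the sort key) and the input line number NR, and the final awk strips both off again. Because the data contains no white space, $3 in the last awk is the entire original line:
$ awk '{print substr($0,1,17)".txt", NR, $0}' file | head -3
AAAAAAAAAAAAAAAAA.txt 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAA.txt 2 AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBB.txt 3 BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643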
After borrowing #dawg's script to generate the same type of sample input file he has (perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file - thanks!), here are the timings for the above:
$ time sort ../file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
real 0m45.709s
user 0m15.124s
sys 0m34.090s
$ time awk '{print substr($0,1,17)".txt", NR, $0}' ../file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
real 0m49.190s
user 0m11.170s
sys 0m34.046s
and here is #dawg's approach for comparison, running on the same machine as the above with the same input ... I killed it after it had been running for 14+ minutes:
$ time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' ../file
real 14m23.473s
user 0m7.328s
sys 1m0.296s
I created a test file of this form:
% head file
SXXYTTLDCNKRTDIHE00004
QAMKKMCOUHJFSGFFA00001
XGHCCGLVASMIUMVHS00002
MICMHWQSJOKDVGJEO00005
AIDKSTWRVGNMQWCMQ00001
OZQDJAXYWTLXSKAUS00003
XBAUOLWLFVVQSBKKC00005
ULRVFNKZIOWBUGGVL00004
NIXDTLKKNBSUMITOA00003
WVEEALFWNCNLWRAYR00001
% wc -l file
600000 file
i.e., 120,000 different 17-letter prefixes, each with 00001 - 00005 appended in random order.
If you want a version for yourself, here is that test script:
perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file
If I run this:
% time awk -v FIELDWIDTHS="17" '{print > $1".txt"}' file
Well I gave up after about 15 minutes.
You can do this instead:
% time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file
You asked about Perl, and here is a similar program in Perl that is quite fast (unpack("A17", $_) extracts the first 17 characters, playing the role of FIELDWIDTHS):
perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); $seen{$p}++; }close fh' file
Here is a little script that compares Ed's awk to these:
#!/bin/bash
# run this in a clean directory Luke!
perl -le 'for (1..12000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' |
awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' |
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' |
sort -n |
cut -c8- > file.txt
wc -l file.txt
#awk -v FIELDWIDTHS="17" '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file
echo "abd awk"
time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file.txt
echo "abd Perl"
time perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); $seen{$p}++; }close fh' file.txt
echo "Ed 1"
time sort file.txt |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 2"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 3"
time awk '{print substr($0,1,17)".txt", NR, $0}' file.txt | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
Which prints:
60000 file.txt
abd awk
real 0m3.058s
user 0m0.329s
sys 0m2.658s
abd Perl
real 0m3.091s
user 0m0.332s
sys 0m2.600s
Ed 1
real 0m1.158s
user 0m0.174s
sys 0m0.992s
Ed 2
real 0m1.069s
user 0m0.175s
sys 0m0.932s
Ed 3
real 0m1.174s
user 0m0.275s
sys 0m0.946s

Pad a file using awk

So I have this working on a server with awk 3.1.7, and when I try using it on a server with awk 4.0.2 it does not seem to work. The goal is just to zero-pad a file name to 14 digits. I don't have the Perl version of rename, in case that is brought up.
Anyhow, this is my code:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
I get this error:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
mv: ‘800001.1.pull’ and ‘800001.1.pull’ are the same file
The same thing has been running on the other server for a few years and I don't get an error; there I get a file named:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
000000800001.1.pull
and this is why I said this:
I take it that the newer awk handles delimiters differently?
The expected output is 000000800001.1.pull
The original file name needs to be zero-padded to 14 characters.
Thanks!
Newer gawk no longer honors the 0 flag on %s conversions (POSIX defines that flag only for numeric conversions), so the padding arrives as spaces, which xargs then swallows, leaving the old and new names identical. Pad with spaces explicitly and convert them to zeros instead:
$ echo 800001.1.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }'
800001.1.pull 000000800001.1.pull
$ echo 800001.12.345.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }'
800001.12.345.pull 0800001.12.345.pull
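Plugged back into the original pipeline, the rename then works on the newer awk as well. A sketch (it assumes the file names contain no whitespace, since xargs -n 2 splits on it):
ls 800001.1.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }' | xargs -n 2 mv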

Replace end of line with comma and put parentheses in sed/awk

I am trying to process the contents of a file from this format:
this1,EUR
that2,USD
other3,GBP
to this format:
this1(EUR),that2(USD),other3(GBP)
The result should be a single line.
As of now I have come up with this chain of commands that works fine:
cat myfile | sed -e 's/,/\(/g' | sed -e 's/$/\)/g' | tr '\n' , | awk '{print substr($0, 0, length($0)- 1)}'
Is there a simpler way to do the same by just an awk command?
Another awk:
$ awk -F, '{ printf "%s%s(%s)", c, $1, $2; c = ","} END { print ""}' file
this1(EUR),that2(USD),other3(GBP)
The following awk may also help; it accumulates the whole line in val, inserting OFS (set to a comma) between records, and prints once at the END:
awk -F, '{val=val?val OFS $1"("$2")":$1"("$2")"} END{print val}' OFS=, Input_file
Toying around with separators and gsub:
$ awk 'BEGIN{RS="";ORS=")\n"}{gsub(/,/,"(");gsub(/\n/,"),")}1' file
this1(EUR),that2(USD),other3(GBP)
Explained:
$ awk '
BEGIN {
RS="" # record ends in an empty line, not newline
ORS=")\n" # the last )
}
{
gsub(/,/,"(") # replace commas with (
gsub(/\n/,"),") # and newlines with ),
}1' file # output
Using paste+sed
$ # paste -s will combine all input lines to single line
$ seq 3 | paste -sd,
1,2,3
$ paste -sd, ip.txt
this1,EUR,that2,USD,other3,GBP
$ # post processing to get desired format
$ paste -sd, ip.txt | sed -E 's/,([^,]*)(,?)/(\1)\2/g'
this1(EUR),that2(USD),other3(GBP)
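The sed expression consumes a comma, captures the run of non-commas after it, and also consumes the following comma when there is one, re-emitting the captured field in parentheses. A quick isolated check on a hypothetical input shows the pairing behavior:
$ echo 'a,B,c,D' | sed -E 's/,([^,]*)(,?)/(\1)\2/g'
a(B),c(D)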

How to merge these codes, awk then cut

I am using awk in Debian.
input
11.22.33.44#55878:
11.22.33.43#55879:
...
...
(smtp:55.66.77.88)
(smtp:55.66.77.89)
...
...
cpe-33-22-11-99.buffalo.res.rr.com[99.11.22.33]
cpe-34-22-11-99.buffalo.res.rr.com[99.11.22.34]
...
Parts of my sh code (running in Debian):
awk '/#/ {print > "file1";next} \
/smtp/ {print > "file2";next} \
{print > "file7"}' input
#
if [ -s file1 ] ; then
#IP type => 11.22.33.44#55878:
cut -d'#' -f1 file1 >> output
rm -f file1
fi
#
if [ -s file2 ] ; then
#IP type => (smtp:55.66.77.88)
cut -d':' -f2 file2 | cut -d')' -f1 >> output
rm -f file2
fi
#
if [ -s file7 ] ; then
#IP type => cpe-33-22-11-99.buffalo.res.rr.com[99.11.22.33]
cut -d'[' -f2 file7 | cut -d']' -f1 >> output
rm -f file7
fi
The output is then:
11.22.33.44
11.22.33.43
55.66.77.88
55.66.77.89
99.11.22.33
99.11.22.34
Is it possible to merge these steps using only awk, something like:
awk '/#/ {print | cut -d'#' -f1 > "file1";next} \
/smtp/ {print | cut -d':' -f2 | cut -d')' -f1 > "file2";next} \
{print | cut -d'[' -f2 file7 | cut -d']' > "file7"}' input
I am a newbie and have no idea how to do this; after searching other questions, I still found no help.
Any hint?
Thanks. Best regards.
$ awk -F'[][()#]|smtp:' '/#/{print $1;next} /smtp/{print $3;next} /\[/{print $2}' input
11.22.33.44
11.22.33.43
55.66.77.88
55.66.77.89
99.11.22.33
99.11.22.34
To save this in the file output:
awk -F'[][()#]|smtp:' '/#/{print $1;next} /smtp/{print $3;next} /\[/{print $2}' input >output
How it works
-F'[][()#]|smtp:'
This sets the field separator to (a) any of the characters ][()# or (b) the string smtp:.
/#/{print $1;next}
If the line contains #, then print the first field and skip to the next line.
/smtp/{print $3;next}
If the line contains smtp, then print the third field and skip to the next line.
/\[/{print $2}
If the line contains [, then print the second field.
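To see the splitting in action, you can dump every field for one of the sample lines (a quick sanity check):
$ echo '(smtp:55.66.77.88)' | awk -F'[][()#]|smtp:' '{for (i=1; i<=NF; i++) printf "$%d=[%s]\n", i, $i}'
$1=[]
$2=[]
$3=[55.66.77.88]
$4=[]
The ( and the smtp: separators produce two empty leading fields, which is why the IP lands in $3.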
Variation
There is more than one way to solve this problem. For example, using a slightly different field separator, we can still get the desired output:
$ awk -F'[][()#:]' '/#/{print $1;next} /smtp/{print $3;next} /\[/{print $2}' input
11.22.33.44
11.22.33.43
55.66.77.88
55.66.77.89
99.11.22.33
99.11.22.34

Merge command lines

I have these command lines:
grep -e "[0-9] ERROR" /home/aa/lab/utb/cic/nova-all.log | awk '{ print $6 }' | awk -F'-' '{print $3""$2""$1}' | cut -c 1-4,7-8 > part1date.txt
grep -e "[0-9] ERROR" /home/aa/lab/utb/cic/nova-all.log | awk '{ print $3" "$4" "$5" "$9 }' > part1rest.txt
grep -e "[0-9] ERROR" /home/aa/lab/utb/cic/nova-all.log | awk '{ s = ""; for (i = 15; i <= NF; i++) s = s $i " "; print s}' > part1end.txt
paste -d \ part1date.txt part1rest.txt part1end.txt > temp.txt
rm part1*
cat temp.txt
The first three commands each save their output in a text file.
Then I merge the columns of these files into one file to show the output.
Can someone help me do the same in one go, without saving the intermediate text files?
These commands are used to change lines of the standard output like this:
sep 10 11:13:55 node-20 nova-scheduler 2014-10-12 10:36:55.675 3817 ERROR nova.scheduler....
to this format:
ddmmyy hh:mm:ss node-xx PROCESS LOGLEVEL MESSAGE
that is, they reorder the columns and change the format of the date.
awk '/[0-9] ERROR/{gsub("-","",$6);$2=$6;$6=$9;for(i=0;++i<=NF;)$i=i<6?$(i+1):$(i+9);NF-=9;print}' file
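A rough breakdown of that one-liner, annotated (my reading of it, assuming the log lines follow the sample above with the message starting at field 15):
awk '/[0-9] ERROR/ {
    gsub("-","",$6)                  # 2014-10-12 -> 20141012
    $2 = $6                          # stash the reformatted date
    $6 = $9                          # stash the log level (ERROR)
    for (i=0; ++i<=NF;)              # shift fields 1-5 left by one to get
        $i = i<6 ? $(i+1) : $(i+9)   # date time node process loglevel, then
                                     # pull the message up from field 15 on
    NF -= 9                          # discard the leftover trailing fields
    print                            # 20141012 11:13:55 node-20 nova-scheduler ERROR ...
}' file
Note that as written the date comes out as yyyymmdd rather than the ddmmyy the original cut produced; adjust the date handling if that matters.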