Delete lines with a specific pattern name in the first column - awk
I want to delete all lines in which the number after MGD is not between 676 and 900.
I wrote this script:
#!/bin/bash
for index in {1..100} # I do this script on 100 files, that is why I use for loop
do
awk 'BEGIN { FS = "MGD" };
{if ($2 >= 676 && $2 <= 900) print}' eq2_15_333_$index.ndx | tee eq3_15_333_$index.ndx
done
Input example
MGD816 SOL77
MGD71 SOL117
MGD7 SOL13194
MGD18 SOL235
MGD740 SOL340
MGD697 SOL396
MGD70 SOL9910
Expected output
MGD816 SOL77
MGD740 SOL340
MGD697 SOL396
I don't know what my script is doing wrong: my output still contains lines with MGD7 or MGD71, yet MGD18 is correctly absent from it.
Edit
I tested this script and it works perfectly
awk '/^MGD/{val=substr($1,4);if(val+0 >= 676 && val+0 <= 900){print}}' new.txt | tee new2.txt
and I get this output:
MGD816 SOL77
MGD740 SOL340
MGD697 SOL396
Based on your shown samples, try the following. It is built directly on your attempt; I could not test it against your actual files, but it should work.
#!/bin/bash
for index in {1..100}
do
awk '/^MGD/{val=substr($1,4);if(val+0 >= 676 && val+0 <= 900){print}}' eq2_15_333_$index.ndx | tee eq3_15_333_$index.ndx
done
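One small design note (an aside, not a correction): tee also copies every matching line to the terminal. If only the output files are wanted, the same loop can use a plain redirection instead, as in this sketch (the awk body is essentially unchanged, and quoting the variable guards against odd file names):
#!/bin/bash
for index in {1..100}
do
    awk '/^MGD/{val=substr($1,4); if (val+0 >= 676 && val+0 <= 900) print}' "eq2_15_333_$index.ndx" > "eq3_15_333_$index.ndx"
done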
I want to explain why your original script, i.e.
awk 'BEGIN { FS = "MGD" }; {if ($2 >= 676 && $2 <= 900) print}'
did not work as expected: you set "MGD" as FS, so AWK split each line only at MGD. If you run awk 'BEGIN{FS="MGD"}{print $2}' file.txt and the content of file.txt is
MGD816 SOL77
MGD71 SOL117
MGD7 SOL13194
MGD18 SOL235
MGD740 SOL340
MGD697 SOL396
MGD70 SOL9910
the output is
816 SOL77
71 SOL117
7 SOL13194
18 SOL235
740 SOL340
697 SOL396
70 SOL9910
Because $2 is then a whole string such as "71 SOL117" rather than a plain number, awk compares it with 676 and 900 as a string, which is why MGD7 and MGD71 slipped through while MGD18 did not. If you want $2 to be just the first number, you should set FS to match either "MGD" or whitespace, i.e.
awk 'BEGIN{FS="MGD|[[:space:]]+"}...
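For completeness, here is one way that hint could be filled in; this is a sketch based only on the sample lines shown above, not something from the original answer. With that FS, $2 is the bare number, so the comparison is numeric and only the MGD676-MGD900 lines survive:
awk 'BEGIN { FS = "MGD|[[:space:]]+" } /^MGD/ && $2+0 >= 676 && $2+0 <= 900' eq2_15_333_1.ndx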
Related
AWK taking very long to process file
I have a large file with about 6 million records. I need to chunk this file into smaller files based on the first 17 chars, so records where the first 17 chars are the same will be grouped into a file with the same name. The command I use for this is:
awk -v FIELDWIDTHS="17" '{print > $1".txt"}' $file_name
The problem is that this is painfully slow. For a file with 800K records it took about an hour to complete. Sample input would be:
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
BBBBBBBBBBBBBBBBBBBBBBBBBBBB18741274
CCCCCCCCCCCCCCCCCCCCCCCCCCCC38123922
Is there a faster solution to this problem? I read that perl can also be used to split files, but I couldn't find an option like fieldwidths in perl. Any help will be greatly appreciated.
uname : Linux
bash-4.1$ ulimit -n
1024
sort file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
Performance improvements included:
By not referring to any field it lets awk not do field splitting.
By sorting first and changing output file names only when the key part of the input changes, it lets awk use only 1 output file at a time instead of having to manage opening/closing potentially thousands of output files.
And it's portable to all awks since it's not using a gawk-specific extension like FIELDWIDTHS.
If the lines in each output file have to retain their original relative order after sorting then it'd be something like this (assuming no white space in the input just like in the example you provided):
awk '{print substr($0,1,17)".txt", NR, $0}' file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
After borrowing @dawg's script (perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8- > /tmp/test/file - thanks!) to generate the same type of sample input file he has, here are the timings for the above:
$ time sort ../file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
real 0m45.709s
user 0m15.124s
sys 0m34.090s
$ time awk '{print substr($0,1,17)".txt", NR, $0}' ../file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
real 0m49.190s
user 0m11.170s
sys 0m34.046s
and, for comparison, @dawg's running on the same machine as the above with the same input ... I killed it after it had been running for 14+ minutes:
$ time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' ../file
real 14m23.473s
user 0m7.328s
sys 1m0.296s
I created a test file of this form:
% head file
SXXYTTLDCNKRTDIHE00004
QAMKKMCOUHJFSGFFA00001
XGHCCGLVASMIUMVHS00002
MICMHWQSJOKDVGJEO00005
AIDKSTWRVGNMQWCMQ00001
OZQDJAXYWTLXSKAUS00003
XBAUOLWLFVVQSBKKC00005
ULRVFNKZIOWBUGGVL00004
NIXDTLKKNBSUMITOA00003
WVEEALFWNCNLWRAYR00001
% wc -l file
600000 file
i.e., 120,000 different 17-letter prefixes with 01 - 05 appended in random order. If you want a version for yourself, here is that test script:
perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8- > /tmp/test/file
If I run this:
% time awk -v FIELDWIDTHS="17" '{print > $1".txt"}' file
Well, I gave up after about 15 minutes. You can do this instead:
% time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file
You asked about Perl, and here is a similar program in Perl that is quite fast:
perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); $seen{$p}++; }close fh' file
Here is a little script that compares Ed's awk to these:
#!/bin/bash
# run this in a clean directory Luke!
perl -le 'for (1..12000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8- > file.txt
wc -l file.txt
#awk -v FIELDWIDTHS="17" '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file
echo "abd awk"
time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file.txt
echo "abd Perl"
time perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); $seen{$p}++; }close fh' file.txt
echo "Ed 1"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 2"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 3"
time awk '{print substr($0,1,17)".txt", NR, $0}' file.txt | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
Which prints:
60000 file.txt
abd awk
real 0m3.058s
user 0m0.329s
sys 0m2.658s
abd Perl
real 0m3.091s
user 0m0.332s
sys 0m2.600s
Ed 1
real 0m1.158s
user 0m0.174s
sys 0m0.992s
Ed 2
real 0m1.069s
user 0m0.175s
sys 0m0.932s
Ed 3
real 0m1.174s
user 0m0.275s
sys 0m0.946s
linux csv file concatenate columns into one column
I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through. I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to recursively concatenate column 10 with column 11 per row such that every row has exactly 14 columns. In other words, this:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
will become:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row.
Examples (by request):
How many columns are in the row?
sed 's/[^,]//g' | wc -c
Get the first 10 columns:
cut -d, -f1-10
Get the last 4 columns:
rev | cut -d, -f1-4 | rev
Concatenate columns 10 and 11, showing columns 1-10 after that:
awk -F',' ' NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11}'
Awk solution:
awk 'BEGIN{ FS=OFS="," }
{
    diff = NF - 14;
    for (i=1; i <= NF; i++)
        printf "%s%s", $i, (diff > 1 && i >= 10 && i < (10+diff)? "": (i == NF? ORS : ","))
}' file
The output:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
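For readers following the field-by-field logic, here is a commented restatement of the same idea (a sketch, not part of the original answer). One detail worth noting: the diff > 1 guard arguably wants to be diff >= 1, since a 15-column row (exactly one extra column) also needs fields 10 and 11 merged:
awk 'BEGIN { FS = OFS = "," }
{
    diff = NF - 14                 # how many extra columns this row carries
    for (i = 1; i <= NF; i++)
        # suppress the comma after fields 10 .. 10+diff-1 so they fuse into one field;
        # print a newline after the last field and a comma everywhere else
        printf "%s%s", $i, (diff >= 1 && i >= 10 && i < 10+diff ? "" : (i == NF ? ORS : ","))
}' file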
With GNU awk for the 3rd arg to match() and gensub():
$ cat tst.awk
BEGIN{ FS="," }
match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) {
    $0 = a[1] gensub(/,/,"","g",a[3]) a[5]
}
{ print }
$ awk -f tst.awk file
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
If perl is okay - it can be used just like awk for stream processing:
$ cat ip.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
$ awk -F, '{print NF}' ip.txt
16
18
22
$ perl -F, -lane '$n = $#F - 4; print join ",", (@F[0..8], join("", @F[9..$n]), @F[$n+1..$#F]) ' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
-F, -lane: split on the comma, with the results saved in the @F array
$n = $#F - 4: the magic number, to ensure the output ends with 14 columns; $#F gives the index of the last element of the array (this won't work if the input line has fewer than 14 columns)
join helps to stitch array elements together with the specified string
@F[0..8]: array slice with the first 9 elements
@F[9..$n] and @F[$n+1..$#F]: the other slices as needed
Borrowing from Ed Morton's regex based solution:
$ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
$n=$#F-13: the magic number
^([^,]*,){9}\K: the first 9 fields
([^,]*,){$n}: the fields to change
$&=~tr|,||dr: use tr to delete the commas
e: this modifier allows use of Perl code in the replacement section
This solution also has the added advantage of working even if the input line has fewer than 14 fields.
You can try this gnu sed:
sed -E '
  s/,/\n/9g
  :A
  s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/
  tA
  s/\n/,/g
' infile
First variant - with awk:
awk -F, '
{
    for(i = 1; i <= NF; i++) {
        OFS = (i > 9 && i < NF - 4) ? "" : ","
        if(i == NF) OFS = "\n"
        printf "%s%s", $i, OFS
    }
}' input.txt
Second variant - with sed:
sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt
or, more straightforwardly (without a loop) and probably faster:
sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt
Testing
Input
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u
Output
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r
a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
Solved a similar problem using csvtool. Source file, copied from one of the other answers:
$ cat input.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
Concatenating columns:
$ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' -
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,,
1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,,
1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4
awk calculation fails in cases where zero is used
I am using awk to calculate the % of each id using the command below, which runs and is very close, except when the number being used in the calculation is zero. I am not sure how to code this condition into the awk, and it happens frequently. Thank you :).
file1
ABCA2 9 232
ABHD12 211 648
ABL2 83 0
file2
CC2D2A 442
(CCDC114) 0
awk with error
awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
> NR==FNR{f1[$1]=$2; next}
> $1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"}' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt
awk: cmd. line:3: (FILENAME=file1 FNR=3) fatal: division by zero attempted
When you have a script that fails while parsing 2 input files, I can't imagine why you'd only show 1 sample input file and no expected output, which ensures we can't test potential solutions against a sample you think is relevant and have no way of knowing whether our script does what you want. In general, though, to guard against a zero denominator you'd use code like:
awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
e.g.
$ echo '6 2' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
3
$ echo '6 0' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
NaN
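One way to fold that guard into the posted script (a sketch only, not from the original answer; it assumes, as the question's code suggests, that $3 of the second file is the denominator that can be zero and that printing NaN for those ids is acceptable):
awk 'function ceil(v) { return int(v) == v ? v : int(v+1) }
NR == FNR { f1[$1] = $2; next }          # first file: remember the count for each id
$1 in f1 {
    if ($3 == 0)
        print $1, "NaN"                   # skip the division when the denominator is zero
    else
        print $1, ceil(10000*(1 - f1[$1]/$3))/100 "%"
}' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt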
nested awk commands?
I have got the following two pieces of code:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
and
grep -v '#' neco.txt | grep -v 'seq-name' | grep -E '(\S+\s+){13}\bAC(.)+CA\b' | awk '$6 >= 49 { print }' | awk '$6 <= 180 { print }' | awk '$4 > 1 { print }' | awk '$5 < $nut { print }' | wc -l
I would like my script to replace "nut" at this place:
awk '$5 < $nut { print }'
with the number returned from this:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
However, $1 in the code just above should represent not a column from ids_lengths.txt, but the first column from neco.txt! (similarly to how I use $6 and $4 in the main code). Help with how to solve these nested awks will definitely be appreciated :-)
edit:
A line of my input file (neco.txt) looks like this:
FZWTUY402JKYFZ 2 100.000 3 11 9 4.500 7 0 0 0 . TG TGTGTGTGT
The biggest problem is that I want to filter those lines that have in the fifth column a number less than the number I get from another file (ids_lengths.txt) when searching with the first column (e.g. FZWTUY402JKYFZ). That's why I put the "nut" variable in my draft script :-)
ids_lengths.txt looks like this:
>FZWTUY402JKYFZ
153
>FZWTUY402JXI9S
42
>FZWTUY402JMZO4
158
You can combine the two grep -v operations and the four consecutive awk operations into one of each. This gives you useful economy without completely rewriting everything:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
grep -E -v '#|seq-name' neco.txt | grep -E '(\S+\s+){13}\bAC(.)+CA\b' | awk -vnut="$nut" '$6 >= 49 && $6 <= 180 && $4 > 1 && $5 < nut { print }' | wc -l
I would not bother to make a single awk script determine the value of nut and do the value-based filtering. It can be done, but it complicates things unnecessarily; unless you can demonstrate that the whole thing is a bottleneck for the performance of the production system, in which case you do work harder (though I'd probably use Perl in that case; it can do the whole lot in one command).
Approximately:
awk -v select="$1" '$0 ~ select && FNR == NR { getline; nut = $0; } FNR == NR {next} $4 > 1 && $5 < nut && $6 >= 49 && $6 <= 180 && ! /#/ && ! /seq-name/ && $NF ~ /^AC.+CA$/ {count++} END {print count}' ids_lengths.txt neco.txt
The regex will need to be adjusted to something that AWK understands. I can't see how the regex matches the sample data you provided. Part of the solution may be to use a field count as one of the conditions. Perhaps NF == 13 or NF >= 13.
Here's the script above broken out on multiple lines for readability:
awk -v select="$1" '
$0 ~ select && FNR == NR {
    getline
    nut = $0
}
FNR == NR { next }
$4 > 1 &&
$5 < nut &&
$6 >= 49 &&
$6 <= 180 &&
! /#/ &&
! /seq-name/ &&
$NF ~ /^AC.+CA$/ {
    count++
}
END {
    print count
}' ids_lengths.txt neco.txt
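Since the shell's $1 (the id to look up) is easy to confuse with awk's $1 (the first field of a record), a usage sketch may help: if the command above were saved in a wrapper script, say count_matches.sh (a hypothetical name, not from the answer), the id would be passed as the script's first argument and handed to awk through -v select:
./count_matches.sh FZWTUY402JKYFZ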
How to subtract a constant number from a column
Is there a way to subtract the smallest value from all the values of a column? I need to subtract the first number in the 1st column from all other numbers in the first column. I wrote this script, but it's not giving the right result:
$ awk '{$1 = $1 - 1280449530}' file
file contains:
1280449530 452
1280449531 2434
1280449531 2681
1280449531 2946
1280449531 1626
1280449532 3217
1280449532 4764
1280449532 4501
1280449532 3372
1280449533 4129
1280449533 6937
1280449533 6423
1280449533 4818
1280449534 4850
1280449534 8980
1280449534 8078
1280449534 6788
1280449535 5587
1280449535 10879
1280449535 9920
1280449535 8146
1280449536 6324
1280449536 12860
1280449536 11612
What you have essentially works, you're just not outputting it. This will output what you want:
awk '{print ($1 - 1280449530) " " $2}' file
You can also be slightly cleverer and not hardcode the shift amount:
awk '{
    if(NR == 1) {
        shift = $1
    }
    print ($1 - shift) " " $2
}' file
You were on the right track:
awk '{$1 = $1 - 1280449530; print}' file
Here is a simplified version of Michael's second example:
awk 'NR == 1 {origin = $1} {$1 = $1 - origin; print}' file
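For illustration (computed from the sample data in the question, not shown in the original answers), either of these one-liners starts its output like this:
0 452
1 2434
1 2681
1 2946
1 1626
2 3217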
bash shell script:
#!/bin/bash
exec 4<"file"                # open the input file on fd 4
read col1 col2 <&4           # first line: col1 is the value to subtract
while read -r n1 n2 <&4
do
    echo $((n1-$col1))
    # echo "scale=2;$n1 - $col1" | bc   # for dealing with decimals..
done
exec 4<&-                    # close fd 4
In vim you can select the column as a block with Ctrl-V, go to the bottom of the file with G, then press e to extend the selection to the end of the number; if you then type a count such as 56 followed by Ctrl-A, vim will add 56 to every number in the column (Ctrl-X subtracts instead).