Bash code for selecting a few columns from a variable - awk
In a file I have a list of coordinates stored (see figure, to the left).
From there I want to copy the coordinates only (red marked) and put them in another file.
I copy the correct section from the file using COORD=`grep -B${i} '&END COORD' ${cpki_file}`. Then I tried to use awk to extract the required numbers from the COORD variable. It does output all the numbers in the file, but deletes the spaces between values (figure, to the right).
How can I write out the red-marked section exactly as it appears?
N=200
NEndCoord=`grep -B${N} '&END COORD' ${cpki_file}|wc -l`
NCoord=`grep -B${N} '&END COORD' ${cpki_file}| grep -B200 '&COORD' |wc -l`
let i=$NEndCoord-$NCoord
COORD=`grep -B${i} '&END COORD' ${cpki_file}`
echo "$COORD" | awk '{ print $2 $3 $4 }'
echo "$COORD" | awk '{ print $2 $3 $4 }'>tmp.txt
When you start using combinations of grep, sed, awk, cut and the like, you should realize you can do it all in a single awk command. In the case of the OP, this would do exactly the same:
awk '/[&]END COORD/{p=0}
p { print $2,$3,$4 }
/[&]COORD/{p=1}' file
This parses the file while keeping track of a printing flag p. The flag is set when "&COORD" is found and unset when "&END COORD" is found. Printing is done only when the flag p is set. Since we don't want to print the line with "&END COORD", we have to reset the flag before we do the check for printing. The same holds for the line with "&COORD", but there we have to set the flag after we do the check for printing (it's a bit of a weird, reversed logic).
The problem with the above is that it will also process the lines
UNIT angstrom
If you want to have these removed, you might want to do a check on the total columns:
awk '/[&]END COORD/{p=0}
p && (NF==4){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
Or only print the lines which are non-empty and whose first field is not "UNIT":
awk '/[&]END COORD/{p=0}
p && (NF>0) && ($1 != "UNIT"){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
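As a sanity check, the column-count variant can be run on a small mock-up of the file (the real input from the figure isn't shown, so the layout — element symbol followed by x, y, z — is an assumption):

```shell
# Mock-up of the question's coordinate file; the layout is assumed.
cat > coord_sample.txt <<'EOF'
&COORD
O 1.000 2.000 3.000
H 0.500 0.500 0.500
UNIT angstrom
&END COORD
EOF

# Print columns 2-4 of the 4-column lines between the markers.
awk '/[&]END COORD/{p=0}
     p && (NF==4){ print $2,$3,$4 }
     /[&]COORD/{p=1}' coord_sample.txt
```

The UNIT line and the two marker lines are skipped because they fail either the flag test or the NF==4 test.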
sed one-liner:
sed -n '/^&COORD$/,/^UNIT/{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}' <infile.txt >outfile.txt
Explanation:
Invocation:
sed: stream editor
-n: do not print unless explicitly requested
Commands in sed:
/^&COORD$/,/^UNIT/: Selects the group of lines starting at &COORD and ending at the first line starting with UNIT.
{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}: Processes each selected line.
s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\): Regex capturing the space-delimited groups, except the first.
/\1\t\2\t\3/: Replace with tab delimited values of the captured groups.
p: Explicit printout.
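On a mock-up of the coordinate file (the real input isn't shown, so the layout is an assumption), the one-liner can be exercised like this; note that \+ in the pattern and \t in the replacement are GNU sed extensions:

```shell
# Mock-up input; the coordinate lines must precede the UNIT line for the
# /^&COORD$/,/^UNIT/ range to cover them.
cat > coord_sample.txt <<'EOF'
&COORD
O 1.000 2.000 3.000
H 0.500 0.500 0.500
UNIT angstrom
&END COORD
EOF

# Capture the last three space-delimited fields, emit them tab-separated.
sed -n '/^&COORD$/,/^UNIT/{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}' coord_sample.txt
```

The &COORD and UNIT lines fall inside the range but have too few space-delimited fields for the substitution to match, so -n keeps them out of the output.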
Related
Can I delete a field in awk?
This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted; it just became empty. I hope, when printing $0, that the result is:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good, though this is actually a tailor-made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove the 3rd field, use:
cut -d, -f 1,2,4- file
To remove the 4th field, use:
cut -d, -f 1-3,5- file
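If the field number varies, the same range arithmetic can be wrapped in a small helper; the function name cutfield is made up here, and field 1 needs a special case because cut has no 0- range:

```shell
# Hypothetical helper: delete the nth comma-separated field from stdin.
cutfield() {
    n=$1
    if [ "$n" -eq 1 ]; then
        cut -d, -f 2-
    else
        cut -d, -f "1-$((n - 1)),$((n + 1))-"
    fi
}

printf '0x01,0xDF,0x93,0x65,0xF8\n' | cutfield 2   # 0x01,0x93,0x65,0xF8
```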
I believe the simplest would be to use the sub function to replace the first occurrence of the consecutive ,, (which appears after you make the 2nd field NULL) with a single ,. But this assumes that you don't have any commas inside field values.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
2nd solution: OR you could use the match function to catch the regex from the first comma to the next comma's occurrence, and print what comes before and after the matched string.
awk '
match($0,/,[^,]*,/){
  print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
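Both variants can be sanity-checked on the first line of the sample input:

```shell
printf '0x01,0xDF,0x93,0x65,0xF8\n' > Input_file

# sub() variant: blank out $2, then squeeze the resulting ",," once.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file

# match() variant: remove the first ",...," run, keeping one comma.
awk 'match($0,/,[^,]*,/){
  print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
```

Both print 0x01,0x93,0x65,0xF8.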
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk; I was a little surprised it did work, but the surprise was a pleasant one, for a change.
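The shift-and-truncate idea generalizes to any column when the index is passed in with -v (the variable name c is my own choice, and the NF-assignment caveat from the answer still applies):

```shell
printf '0x01,0xDF,0x93,0x65,0xF8\n' > test.txt

# Shift every field from column c onward one place left, then drop the
# now-duplicated last field by decrementing NF.
awk -F, -v OFS=, -v c=2 '{ for (i = c; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
```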
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove the nth field would require you to put in a regex which matches the first n - 1 fields followed by the nth, then replace that with just the first n - 1. So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions,
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E, but again, this is not universally supported or portable.
In case it's not obvious: [^,] matches a single character which is not a (newline or) comma; and \1 recalls the text from the first parenthesized match (a back reference; \2 recalls the second, etc). Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
With GNU sed you can add a number flag to the s command to substitute the nth match of non-comma characters followed by a comma:
sed -E 's/[^,]*,//2' file
Using awk in a regex-free way, with the option to choose which column will be deleted:
awk '{
  col = 2
  n = split($0,arr,",")
  line = ""
  for (i = 1; i <= n; i++)
    line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
  print line
}' test.txt
Step by step:
{
  col = 2                  # defines which column will be deleted
  n = split($0,arr,",")    # each line is split into an array;
                           # n is the number of elements in the array
  line = ""                # this will be the new line
  for (i = 1; i <= n; i++) # roaming through all elements in the array
    line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
                           # appends a comma (except if line is still empty)
                           # and the current array element to the line
                           # (except when on the selected column)
  print line               # prints line
}
Another solution: you can just pipe the output to another sed and squeeze the delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of @RavinderSingh13 using the sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual (https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html) says:
"It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of RavinderSingh13 but without using sub() in this case, "the field is still there; it just has an empty value", delimited by the two commas:
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
My solution:
awk -F, '
{
  regex = "^"$1","$2
  sub(regex, $1, $0)
  print $0
}' test.txt
or as a one-liner:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that OFS="," was not necessary.
I would do it the following way. Let file.txt's content be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to nothing (the empty string), then for the 2nd and following columns I prepend , to each value. Finally I set what is now a comma plus value (the 2nd column) to nothing. Keep in mind this solution would need rework if you wish to remove the 1st column.
How do I obtain a specific row with the cut command?
Background
I have a file, named yeet.d, that looks like this:
JET_FUEL = /steel/beams
ABC_DEF = /michael/jackson
....50 rows later....
SHIA_LEBEOUF = /just/do/it
....73 rows later....
GIVE_FOOD = /very/hungry
NEVER_GONNA = /give/you/up
I am familiar with the -f and -d options of the cut command. The -f option allows you to specify which column(s) to extract, while the -d option allows you to specify what the delimiter is.
Problem
I want this output returned using the cut command:
/just/do/it
From what I know, this is part of the command I want to enter:
cut -f2 -d= yeet.d
given that I want the values to the right of the equals sign, with the equals sign as the delimiter. However, this would return:
/steel/beams
/michael/jackson
....50 rows later....
/just/do/it
....73 rows later....
/very/hungry
/give/you/up
which is more than what I want.
Question
How do I use the cut command to return only /just/do/it and nothing else from the situation above? This is different from "How to get second last field from a cut command" because I want to select a row within a large file, not just near the end or the beginning.
This looks like it would be easier to express with awk...
# awk -v _s="${_string}" '$3 == _s {print $3}' "${_path}"
## Above could be a more _scriptable_ form of the below example
awk -v _search="/just/do/it" '$3 == _search {print $3}' <<'EOF'
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
EOF
## Either way, output should be similar to
## /just/do/it
The -v _something="Some Thing" bit allows for passing Bash variables to awk.
The $3 == _search bit tells awk to match only when column 3 is equal to the search string.
To search for a sub-string within a line, one can use $0 ~ _search {print $3}, which tells awk to print column 3 for any matches.
And the <<'EOF' bit tells Bash not to expand anything within the opening and closing EOF tags.
... however, the above will still output duplicate matches; e.g. if yeet.d somehow contained...
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
AGAIN = /just/do/it
... there'd be two /just/do/it lines output by awk. The quickest way around that would be to pipe | to head -1, but the better way would be to tell awk to exit after it's been told to print...
_string='/just/do/it'
_path='yeet.d'
awk -v _s="${_string}" '$3 == _s {print $3; exit}' "${_path}"
... though that now assumes that only the first match is wanted; obtaining the nth is possible, though currently outside the scope of the question as of the last time it was read.
Updates
To trip awk on the first column while printing the third column, and exit after the first match, may look like...
_string='SHIA_LEBEOUF'
_path='yeet.d'
awk -v _s="${_string}" '$1 == _s {print $3; exit}' "${_path}"
... and to generalize even further...
_string='^SHIA_LEBEOUF '
_path='yeet.d'
awk -v _s="${_string}" '$0 ~ _s {print $3; exit}' "${_path}"
... because awk totally gets regular expressions, mostly.
It depends on how you want to identify the desired line.
You could identify it by the line number. In this case you can use sed:
cut -f2 -d= yeet.d | sed '53q;d'
This extracts the 53rd line.
Or you could identify it by a keyword. In this case use grep:
cut -f2 -d= yeet.d | grep just
This extracts all lines containing the word just.
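For completeness, awk can do the row selection and the field split in one step, so cut isn't needed at all; a sketch on a shortened yeet.d, with the key hard-coded as an example:

```shell
cat > yeet.d <<'EOF'
JET_FUEL = /steel/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
EOF

# Split on '=' with optional surrounding spaces; print the value for the
# wanted key and stop at the first match.
awk -F' *= *' '$1 == "SHIA_LEBEOUF" {print $2; exit}' yeet.d
```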
awk command to print only part of matching lines
awk command to compare lines in a file and print only the first line when the other lines merely extend it with new words. For example, file.txt contains:
i am going
i am going today
i am going with my friend
The output should be:
i am going
This will work for the sample input but perhaps will fail for the actual one; unless you have a representative input we wouldn't know...
$ awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going
You may want to play with p=$0 to restrict matching to a number of fields, if the line lengths are not in increasing order...
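Run against the sample from the question, the one-liner prints exactly the common prefix line:

```shell
cat > file <<'EOF'
i am going
i am going today
i am going with my friend
EOF

# p holds the previous line; f remembers whether p has already been
# printed as a prefix of a later line.
awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
```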
AWK - get value between two strings over multiple lines
input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output:
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding? Thanks!
If you want to print the line after the line containing >block2, then you could use: awk '/^>block2$/ { nr=NR+1 } NR == nr { print }' Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record. If you want all the lines between the line >block2 and >block3, then you'd use: awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }' For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
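Applied to the input from the question, the first command prints the single sequence line:

```shell
cat > input.txt <<'EOF'
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
EOF

# Remember the record number following the matching header, then print
# the record with that number.
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }' input.txt
```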
another awk $ awk 'c&&c--; /^>block2/{c=1}' file 222222222222222222222 c specifies how many lines you want to print after the match. If you want the text between two markers $ awk '/^>block3/{exit} s; /^>block2/{s=1}' file 222222222222222222222 if there are multiple instances and you want them all, just change exit to s=0
You probably meant: $ awk '/>/{f=/^>block2$/;next} f' file 222222222222222222222
Use AWK to search through fasta file, given a second file containing sequence names
I have 2 files. One is a fasta file containing multiple fasta sequences, while the other file includes the names of the candidate sequences I want to search for (file examples below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file and obtain the fasta sequences of all the candidates whose names are saved in the other file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of the first candidate in name.txt, like this:
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help fix the one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 a.fasta
-f name.txt picks the patterns from name.txt
-F treats them as literal strings rather than regular expressions
-A1 prints the matching line plus the subsequent line
If the names are not desired in the output, I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
    # Store the names in the array 'n'
    n[$0]
    next
}
# I use substr() to remove the leading '>' and check if the remaining
# string, which is the name, is a key of 'n'. getline retrieves the next
# line; if it succeeds the condition becomes true and awk will print that line.
substr($0,2) in n && getline
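A shortened mock-up (sequences truncated, and an extra Clone_25 record invented as padding) shows the grep behavior, including the -- group separator GNU grep inserts between non-adjacent match groups:

```shell
cat > name.txt <<'EOF'
Clone_23
Clone_27-1
EOF

cat > a.fasta <<'EOF'
>Clone_18
GTTACGGGGGACACATT
>Clone_23
GTTACGGGGGGCCGAAA
>Clone_25
GTTACGGGGGAATAACA
>Clone_27-1
GTTACGGGGACCACACC
EOF

# Print each matching header plus the sequence line after it.
grep -Ff name.txt -A1 a.fasta
```

Note that the -- separator lines would survive a later grep -v '>' filter; GNU grep's --no-group-separator option suppresses them.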
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta >Clone_23 GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA >Clone_27-1 GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC