This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted, it just became empty.
I hope, when printing $0, that the result is:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good though this is actually a tailor made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove 3rd field then use:
cut -d, -f 1,2,4- file
To remove 4th field use:
cut -d, -f 1-3,5- file
I believe simplest would be to use sub function to replace first occurrence of continuous ,,(which are getting created after you made 2nd field NULL) with single ,. But this assumes that you don't have any commas in between field values.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
2nd solution: OR you could use match function to catch regex from first comma to next comma's occurrence and get before and after line of matched string.
awk '
match($0,/,[^,]*,/){
print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk — I was a little surprised it did work, but the surprise was a pleasant one, for a change.
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove, the nth field would require you to put in a regex which matches the first n - 1 followed by the nth, then replace that with just the the first n - 1.
So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E but again, this is not universally supported or portable.
In case it's not obvious, [^,] matches a single character which is not (newline or) comma; and \1 recalls the text from first parenthesized match (back reference; \2 recalls the second, etc).
Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
With GNU sed you can add a number modifier to substitute nth match of non-comma characters followed by comma:
sed -E 's/[^,]*,//2' file
Using awk in a regex-free way, with the option to choose which line will be deleted:
awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
Step by step:
{
col = 2 # defines which column will be deleted
n = split($0,arr,",") # each line is split into an array
# n is the number of elements in the array
line = "" # this will be the new line
for (i = 1; i <= n; i++) # roaming through all elements in the array
line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
# appends a comma (except if line is still empty)
# and the current array element to the line (except when on the selected column)
print line # prints line
}
Another solution:
You can just pipe the output to another sed and squeeze the delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1 ' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of #RavinderSingh13 using sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual: https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html
It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of RavinderSingh13 but without using, in this case,sub() "The field is still there; it just has an empty value, delimited by the two colons":
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
My solution:
awk -F, '
{
regex = "^"$1","$2
sub(regex, $1, $0);
print $0;
}'
or one line code:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that OFS="," was not necessary
I would do it following way, let file.txt content be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to nothing (empty string), then for 2nd and following column I add , at start. Finally I set what is now comma and value to nothing. Keep in mind this solution would need rework if you wish to remove 1st column.
My script will be receiving various lengths of input and I want to strip the last field separated by a "/". An example of the input I will be dealing with is.
this/that/and/more
But the issue I am running into is that the length of the input will vary like so:
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
In any case, the expected output should be the whole string minus the last "/more".
Note: The word "more" will not be a constant these are arbitrary examples.
Example input:
this/that/and/more
this/that/maybe/more/and/more
Expected output:
this/that/and
this/that/maybe/more/and
What I know works for a string you know the length of would be
cut -d'/' -f[x]
With what I need is a '/' delimited AWK command I'm assuming like:
awk '{$NF=""; print $0}'
With awk as requested:
$ awk '{sub("/[^/]*$","")} 1' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
but this is the type of job sed is best suited for:
$ sed 's:/[^/]*$::' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
The above were run against this input file:
$ cat file
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
Depending on how you have the input in your script, bash's Shell Parameter Expansion may be convenient:
$ s1=this/that/maybe/more/and/more
$ s2=or/even/this/could/be/it/and/maybe/more
$ s3=short/more
$ echo ${s1%/*}
this/that/maybe/more/and
$ echo ${s2%/*}
or/even/this/could/be/it/and/maybe
$ echo ${s3%/*}
short
(Lots of additional info on parameter expansion at https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)
In your script, you could create a loop that removes the last character in the input string if it is not a slash through each iteration. Then, when the loop finds a slash character, exit the loop then remove the final character (which is supposed to be a slash).
Pseudo-code:
while (lastCharacter != '/') {
removeLastCharacter();
}
removeLastCharacter(); # removes the slash
(Sorry, it's been a while since I wrote a bash script.)
Another awk alternative using fields instead of regexs
awk -F/ '{printf "%s", $1; for (i=2; i<NF; i++) printf "/%s", $i; printf "\n"}'
Here is an alternative shell solution:
while read -r path; do dirname "$path"; done < file
I have a large file and would like to remove any lines from the file that contain an exact string listed in another file. However, the string must match exactly (I'm sorry I don't know how to describe this better).
Here is the file:
one#email.com,name,surname,city,state
two#email.com,name,surname,city,state
three#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
And here is the example list to filter with:
one#email.com
three#email.com
The desired output is:
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
I have tried to do this using the following:
grep -v -f 2.txt 1.txt > 3.txt
However this produces the output:
two#email.com,name,surname,city,state
I assume it's doing this because "anotherone#email.com" contains "one#email.com". I've searched for a way to include beginning of the line, but not found anything suitable.
I'm open to doing in something other than grep too, I used grep because I couldn't figure it out any other way.
Assuming that your input file contains three#gmail.com not three#email.com (typo perhaps)
$ grep -vw -f 2.txt 1.txt
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
-w, --word-regexp -
The expression is searched for as a word (as if surrounded by [[:<:]]' and[[:>:]]';
If you only like to print lines from first file that does not contains data from second file in the first field, then this should do:
$cat file
one#email.com,name,surname,city,state
two#email.com,name,surname,city,state
three#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
$cat filter
one#email.com
three#email.com
awk -F, 'NR==FNR {a[$0]++;next} !($1 in a)' filter file
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
For every line in the filter this creates an array a with the name and value to 1
Like a[one#email.com]=1 and a[three#email.com]=1
Then awk test line by line in the file against the array, giving
a[one#email.com]=1
a[two#email.com]=
a[three#email.com]=1
a[anotherone#email.com]=
Then print all line from the file without 1
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
For this particular case -- process first file by building an associative array with filter lines being the index. In subsequent files, test if the given line is not in the array indexes -- the default action of a pattern is to print.
awk -F, -v OFS=, '
BEGIN { split("", m) }
NR==FNR { m[$0] = ""; next }
!($1 in m)
' filter.txt file.txt
But... if we are looking to filter any occurrence of the string anywhere in the line (an unconstrained exact match) we need to do something less clever and more brute-force:
awk '
BEGIN {
split("", m)
n=0
}
NR==FNR {
m[n++] = $0
next
}
{
for (i=0; i<n; ++i) {
if (index($0, m[i]))
next
}
print
}
' filter.txt file.txt
Note that if the filter contains non-printable characters (e.g. non-unix line endings), we would need to deal with them by filtering them out (e.g. with sub(/\r/, "")).
I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that i was not clear and so i am pasting the whole code that i got from someone. The input file and desired output remains the same
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable or a fixed string. While no Awk will allow you to print to filename.fa because it is neither in quotes or a variable (varaible names can't have a . in them), BSD Awk does not allow manipulating strings when they currently act as a file name where GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print lines not started by > and in one line:
awk '!/^>/{printf $0}' genome.fa > filename.fa
With carriage return:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa