Append prefix to first column of a file with awk - awk

I have a couple of hundreds of files which I want to process with xargs. They all need a fix of their first column.
Therefore I need an awk command to append the prefix "ID_" to the first column of a file (except for the first header line). Can anyone help me with this?
Something along the line:
gawk -f ';' "{$1='ID_' $1; print $0}" file.csv > file_processed.csv
I am not expert for the command, though. And I would rather like to have some inplace processing instead of making a copy of each file. Beforehand, I made it in VIM, but then I only had one file.
:%s/^-/ID_/
I hope someone can help me here.

gawk 'BEGIN{FS=";"; OFS=";"} {if(NR>1) $1="ID_"$1; print}' file.csv > file_processed.csv
FS and OFS set the input and output field separators, respectively.
NR>1 checks whether current line number is larger than 1, so we don't modify the header line.
You can also modify the file in place with -i inplace option:
gawk -i inplace 'BEGIN{FS=";"; OFS=";"} {if(NR>1) $1="ID_"$1; print}' file.csv
Edit
After elaborating the original question, here's the final version:
gawk -i inplace 'BEGIN{FS=OFS=";"} NR>1{sub(/^-/,"ID_",$2)} 1' file.csv
which substitutes - in the beginning of second column with ID_.
NR>1 action applies for all but first (header) line. 1 invokes the default default print action.

If you just want to do something, particularly adding a prefix, on the first field, it is not different from adding the prefix to the whole line.
So you can just awk '$0 = "ID_" $0' file.csv it should do the work. If you want to make it "change in place", you can:
awk '$0="ID_"$0' csv >/tmp/foo && mv /tmp/foo file.csv
You can also make use of sed:
sed -i 's/^/ID_/' file
The -i does "in-place modification"
You mentioned vim, and gave s/^-/ID_/ cmd, it doesn't add the prefix (ID_), it will replace the leading - by the ID_, they are different.

Related

Awk-How remove pattern with brackets

I made a mistake, adding Ini file (Yes we're in 2022 :D) a section with errors
I added a line [End[edit=true]
How could remove this entire line using awk (I don't have any others choice 😕)
I don't understand how escape the [ in the AWK command line.
Could you please help me?
Thanks
I don't understand how escape the [ in the AWK command line.
If line is always literal [End[edit=true] then you do not need to, just request lines which are not that one following way, let file.ini content be
[someline=true]
[End[edit=true]
[another=true]
then
awk '$0!="[End[edit=true]"' file.ini
gives output
[someline=true]
[another=true]
Explanation: $0 denotes whole line, if it is not [End[edit=true] then it is printed.
(tested in GNU Awk 5.0.1)
A couple ideas where you escape the leading (left) brackets:
awk '/\[End\[edit=true]/ {next} 1' file
# or
awk '!/\[End\[edit=true]/' file
Once you've confirmed the results, and assuming you're using GNU awk, you can add -i inplace to update the file:
awk -i inplace '!/\[End\[edit=true]/' file

awk command to print columns with colum data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using above command to filter first three columns from file1 and put into file.
But only getting the column names and not the column data.
How to do that?
|~| - is the delimiter.
file1.txt has values as :
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
my expedted output is :
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
With your shown samples, please try following awk code. You need to set field separator to |~| and remove starting space from lines, then print the lines.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep spaces(which was in initial post before edit) then try following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: Had a chat with user in room and got to know why this code was not working for user because of gunzip -c file was being used wrongly, its output was being saved into a variable on which user was running awk program, so correcting that command generated right file and awk program ran fine on it. Adding this as a reference for future readers.
One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces all "|~|" with a "," setting the output file separator to "|~|". All leading whitespace is trimmed with sub().
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.
For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables fixed string input delimiter (which is given using the -d option). And by default, the output field separator will also be same as -d when -F is active. -f-3 is similar to cut syntax to specify first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports -z option to automatically handle common compressed formats based on filename extension (adding this since OP had an issue with compressed input).
Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter, and use the default awk separator, then print columns what you need.

Inserting characters into specified fields in large files

Here is my question. I have several hundred files that are too large to edit utilizing vi editor. I'm looking for a possible awk or sed command to manipulate my files. Bit of a rookie. I have a simplified file:
001|1|3|053412|16|1234|||
001|21|4|123618|15|88|||
The files were created, with the fourth field being in the wrong format.
The fourth field should be 05:34:12 reflecting HH:MM:SS. The time values are correct, I just need to insert the : in the appropriate places.
How do I insert the colons after the second character and the fourth characters in the fourth field? I cannot do it by character count since the values before and after the fourth field are variable.
gawk to the rescue!
$ awk -F\| -v OFS=\| '{$4=gensub(/(..)(..)(..)/,"\\1:\\2:\\3","g",$4)}1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
otherwise you can do the same with substr($4,1,2)":"...
With GNU awk for gensub() and inplace editing
awk -i inplace 'BEGIN{FS=OFS="|"} {$4=gensub(/(..)(..)/,"\\1:\\2:",1,$4)} 1' *
Similarly with GNU sed for EREs and inplace editing:
sed -i -E 's/(([^|]*\|){3}..)(..)/\1:\3:/' *
e.g.:
$ awk 'BEGIN{FS=OFS="|"} {$4=gensub(/(..)(..)/,"\\1:\\2:",1,$4)} 1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
$ sed -E 's/(([^|]*\|){3}..)(..)/\1:\3:/' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
With sed:
$ sed -r 's/([^|]*|[^|]*|[^|]*|)([0-9]{2})([0-9]{2})([0-9]{2})/\1\2:\3:\4/' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
This might work for you (GNU sed):
sed -r 's/^(([^|]*\|){3})(..)(..)/\1\3:\4:/' file
Use back references to group the first three fields and the following two pairs. Then format the 4th field as required.
Try this awk!
awk -F"|" -v OFS="|" '{r=split($4,T,"");for(i=2;i<=r;i+=2){if(i!=r)T[i]=T[i]":"}tmp="";for(i=1;i<=r;i++){tmp=tmp T[i]}$4=tmp;}1' file
001|1|3|05:34:12|16|1234|||
001|21|4|12:36:18|15|88|||
Longer whit explanation:
BEGIN{
FS=OFS="|"; #Field separator and output field separator
}
{
tmp="";
r=split($4,time_field,""); # Chunk field into pieces
for(i=2;i<=r;i+=2) # Loop two by two
{
if(i!=r)
{
time_field[i]=time_field[i]":"; # Add ":"
}
}
for(i=1;i<=r;i++) # Loop over again to rebuild
{
tmp=tmp time_field[i];
}
$4=tmp; #rebuid field
print
}
How you could use it in bash: Save it as whatever.awk
while IFS='' read -r file
do
awk -f whatever.awk "$file" > out_file
done < list_of_files_to_edit.txt
If you want edit the files in place, you can add the option -i to Kenavoz sed command. sed -ri ...

Concatenating lines using awk

I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that i was not clear and so i am pasting the whole code that i got from someone. The input file and desired output remains the same
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable or a fixed string. While no Awk will allow you to print to filename.fa because it is neither in quotes or a variable (varaible names can't have a . in them), BSD Awk does not allow manipulating strings when they currently act as a file name where GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print lines not started by > and in one line:
awk '!/^>/{printf $0}' genome.fa > filename.fa
With carriage return:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa

Combine grep -f and awk

I am using two commands:
awk '{ print $2 }' SomeFile.txt > Pattern.txt
grep -f Pattern.txt File.txt
With the first command I create a list of desirable patterns. With the second command I extract all lines in File.txt that match the lines in the Pattern.txt
My question is, is there a way to combine awk and grep in a pipeline so that I don't have to generate the intermediate Pattern.txt file?
Thanks!
You can do this all in one invocation of awk:
awk 'NR==FNR{a[$2];next}{for(i in a)if($0~i)print}' Somefile.txt File.txt
Populate keys in the array a from the second column of the first file. NR==FNR identifies the first file (total record number is equal to this file's record number). next skips the second block for the first file.
In the second block, loop through all the keys in the array and if the line matches any of them, print it. To avoid printing the line more than once if it matches more than one pattern, you could add a next here too, i.e. {for(i in a)if($0~i){print;next}}.
If the "patterns" are actually fixed strings, it is even simpler:
awk 'NR==FNR{a[$2];next}$0 in a' Somefile.txt File.txt
If your shell supports it, you can use process substitution:
grep -f <(awk '{ print $2 }' SomeFile.txt) File.txt
bash and zsh will support that, others will probably too, didn't tested.
Simpler as the above and supported by all shells would be to use a pipe:
awk '{ print $2 }' SomeFile.txt | grep -f - File.txt
- is used as the argument to -f. - has a special meaning here and stands for stdin. Thanks to Tom Fenech for mentioning that!