Replace using gsub in awk - awk

I'm trying to do a gsub in awk.
I want to replace a single space with an underscore, but the adjoining characters are replaced too:
awk -F" +" 'NF > 1 {gsub(/[[:alnum:]][ ][[:alnum:]]/, "_")}1' file
Input:
this is example
ca bc dec cat
251 otg op con
this is what I get:
this is example
ca bc de_at
251 otg o_on
Desired output:
this is example
ca bc dec_cat
251 otg op_con

This is one area where (non-GNU) awk isn't the best tool for the job. I'd suggest using sed instead:
$ sed '/ / s/\([[:alnum:]]\) \([[:alnum:]]\)/\1_\2/g' file
this is example
ca bc dec_cat
251 otg op_con
This performs substitutions only on lines containing at least one space, which is equivalent to the NF > 1 condition given your field separator.
The key here is to capture the characters before and after the space and then use them in the replacement. This can be done in GNU awk too, using gensub:
$ gawk -F" +" 'NF > 1 { $0 = gensub(/([[:alnum:]]) ([[:alnum:]])/, "\\1_\\2", 1) }1' file
this is example
ca bc dec_cat
251 otg op_con
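gensub returns the result of the substitution, so it must be reassigned to $0 in order to affect the output. The third argument (1) replaces only the first match on each line; pass "g" there to replace every match, as the sed command does.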
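If gensub is not available (it is a GNU awk extension), one portable sketch is to loop with match() and rebuild the line around each hit. The function name and the sample line are made up for illustration, and the ASCII bracket expression [A-Za-z0-9] stands in for [[:alnum:]] on the assumption that some older awks lack POSIX character classes.

```shell
# Hypothetical helper: replace every single space that sits between two
# alphanumeric characters with an underscore, using only POSIX awk features.
underscore_single_spaces() {
  awk '{
    # match() sets RSTART to the position of the character before the space
    while (match($0, /[A-Za-z0-9] [A-Za-z0-9]/))
      $0 = substr($0, 1, RSTART) "_" substr($0, RSTART + 2)
    print
  }'
}

# Sample line assumed to have two spaces between the first three words
printf 'ca  bc  dec cat\n' | underscore_single_spaces
```

Unlike the sed command with the g flag, the loop rescans the line after each replacement, so a b c becomes a_b_c rather than a_b c.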

Related

removing lines with special characters in awk

I have a text file like this:
VAREAKAVVLRDRKSTRLN 2888
ACP*VRWPIYTACGP 292
RDRKSTRLNSSHVVTSRMP 114
VAREA*KAVVLRDRRAHV*T 73
In the 1st column of some rows there is a "*". I want to remove all the lines with that '*'. Here is the expected output:
VAREAKAVVLRDRKSTRLN 2888
RDRKSTRLNSSHVVTSRMP 114
to do so, I am using this code:
awk -F "\t" '{ if(($1 == '*')) { print $1 "," $2} }' infile.txt > outfile.txt
this code does not return the expected output. how can I fix it?
You did
awk -F "\t" '{ if(($1 == '*')) { print $1 "," $2} }' infile.txt > outfile.txt
By writing $1 == "*" you are asking "is the first field exactly *?", not "does the first field contain *?". You can use the index function, which returns the position of the match if found and 0 otherwise. Let the content of infile.txt be
VAREAKAVVLRDRKSTRLN 2888
ACP*VRWPIYTACGP 292
RDRKSTRLNSSHVVTSRMP 114
VAREA*KAVVLRDRRAHV*T 73
then
awk 'index($1,"*")==0{print $1,$2}' infile.txt
output
VAREAKAVVLRDRKSTRLN 2888
RDRKSTRLNSSHVVTSRMP 114
Note that if you use index rather than a /.../ pattern, you do not have to care about characters with special meaning in regular expressions, e.g. the dot (.). Note also that for the data you have, you do not need to set the field separator (FS) explicitly. Important: ' is not a legal string delimiter in GNU AWK; use " for that purpose, unless your intent is to summon hard-to-find bugs.
(tested in gawk 4.2.1)
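To make the point about special characters concrete, here is a small sketch that filters on a dot instead of an asterisk. The two-line sample and the helper name are invented for the demonstration.

```shell
# Hypothetical sample: keep lines whose first field does NOT contain a dot.
sample() { printf 'a.b 1\naxb 2\n'; }

# index() treats "." as a plain character, so only the a.b line is dropped
by_index=$(sample | awk 'index($1, ".") == 0')

# an unescaped . in a regex matches any character, so every line is dropped
by_regex=$(sample | awk '$1 !~ /./')

printf 'index: [%s]\nregex: [%s]\n' "$by_index" "$by_regex"
```

The regex version would need /\./ or /[.]/ to behave like the index version.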
With your shown samples, please try the following awk program.
awk '$1!~/\*/' Input_file
The above prints the complete line when the first field does not contain *. If you want to print only the 1st and 2nd fields of those lines, try the following:
awk '$1!~/\*/{print $1,$2}' Input_file
Use grep like so to remove the lines that contain a literal asterisk (*). Note that it should be escaped with a backslash (\*) or put in a bracket expression ([*]) to prevent grep from interpreting * as a quantifier meaning zero or more repetitions of the preceding item:
printf 'A*B\nCD\n' | grep -v '[*]'
CD
Here, GNU grep uses the following options:
-v : Print lines that do not match.
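If escaping feels error-prone, grep's -F option (fixed strings) is another route, since the pattern is then not interpreted as a regular expression at all. A minimal sketch on the same two-line input:

```shell
# -F makes grep treat the pattern as a fixed string, so * needs no escaping
kept=$(printf 'A*B\nCD\n' | grep -vF '*')
printf '%s\n' "$kept"
```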

gawk - Delimit lines with custom character and no similar ending character

Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
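The same idea (print the separator before every line except the first) can be wrapped in a small function that takes the separator as an argument. This is only a sketch; the function name join_lines is made up, and note that awk's -v processes backslash escapes in the separator.

```shell
# Hypothetical helper: join stdin lines with the separator given as $1.
join_lines() {
  awk -v sep="$1" '
    NR > 1 { printf "%s", sep }   # separator goes before every line but the first
           { printf "%s", $0 }    # the line itself, without a newline
    END    { if (NR) print "" }   # one final newline, only if there was input
  '
}

printf 'one\ntwo\nthree\n' | join_lines '|'
printf 'one\ntwo\nthree\n' | join_lines ', '
```

Because the separator is printed before a line rather than after it, there is never a trailing one to strip.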
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
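The {$1=$1} part is what makes OFS take effect: awk only rebuilds the record with OFS when a field is assigned to. A short sketch comparing the two cases (the helper name lines is just for the demo):

```shell
lines() { printf 'one\ntwo\nthree\n'; }

# Without touching a field, awk prints the record exactly as it was read
unchanged=$(lines | awk -v FS='\n' -v OFS='|' -v RS='' '1')

# Assigning to any field makes awk rebuild $0 with OFS between the fields
rebuilt=$(lines | awk -v FS='\n' -v OFS='|' -v RS='' '{ $1 = $1 } 1')

printf '%s\n---\n%s\n' "$unchanged" "$rebuilt"
```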
The awk solutions work great. Here is a tr + sed solution (the | is left unescaped in the sed pattern, because GNU sed treats \| as alternation):
tr '\n' '|' < file | sed 's/|$//'
1|2|3
just flatten it :
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : --NF )'
ascii | = octal \174 = hex 0x7C. The reason for --NF is that, more often than not, the input includes a trailing newline, which makes the field count one too many and results in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
At the OFS spot, you can delimit it with any string combo you like instead of being constrained by tr, which has inconsistent behavior. For instance :
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
on bsd-tr, \n will get replaced by the unicode properly 1高2高3高 , but if you're on gnu-tr, it would only keep the leading byte of the unicode, and result in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if you attempt equiv-classes with an arbitrary non-ASCII byte, bsd-tr does nothing while gnu-tr would gladly oblige, even if it means slicing straight through UTF8-compliant characters :
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it the following way, using GNU AWK. Let the content of test.txt be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: for the first line, print the line's content without a trailing newline; for every other line, print | followed by the line's content, again without a trailing newline. Note that I assumed test.txt has no trailing newline; if that is not the case, test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
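The 3 in NR%3 is tied to the three-line sample. As a sketch of the same ORS trick with the group size passed in, here is a hypothetical rows_of function (name and parameter are assumptions, not part of the answer above):

```shell
# Hypothetical helper: put n input lines on each output row, joined by "|".
rows_of() {
  # ORS is "|" after most lines and a newline after every n-th line
  awk -v n="$1" '{ ORS = (NR % n ? "|" : "\n") } 1'
}

printf '1\n2\n3\n4\n' | rows_of 2
```

If the number of input lines is not a multiple of n, the last row ends with | and no newline, so this only gives clean output when the line count is known to divide evenly.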
With GNU sed, you can pass the -z option to make line breaks visible to the regex, so all you need to do is replace each newline except the last one at the end of the string. Two perl equivalents follow the sed command:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes in place, add the -i option (mind that the sed command is GNU sed specific):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt

Insert blank based on first digit of line

Input:
3abdce
412ae3
21dege
Expected Output - starting digit of line is removed and a blank inserted based on the offset specified by that digit
abd ce
12ae 3
1d ege
I can only remove the first character:
sed 's/^.\{1\}//g' file
GNU awk solution:
awk -v FS="" '{ print substr($0,2,$1), substr($0,$1+2) }' file
$1 - with an empty FS every character is a field, so $1 is the leading digit (the slice size)
The output:
abd ce
12ae 3
1d ege
this one should do the trick:
awk '{ split($0, a, ""); print substr($0, 2, a[1])" "substr($0, 2+a[1]) }' yourfile
Output:
abd ce
12ae 3
1d ege
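Both awk answers above rely on splitting on the empty string, which POSIX leaves unspecified. A sketch that reads the leading digit with substr instead, so it should not depend on that behaviour (the function name is invented here, and a single leading digit is assumed):

```shell
# Hypothetical helper: drop the leading digit n and insert a blank
# after the next n characters.
insert_blank() {
  awk '{
    n = substr($0, 1, 1) + 0          # leading digit as a number
    print substr($0, 2, n) " " substr($0, n + 2)
  }'
}

printf '3abdce\n412ae3\n21dege\n' | insert_blank
```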
If perl is okay
$ perl -F -lane 'print @F[1..$F[0]], " ", @F[$F[0]+1..$#F]' ip.txt
abd ce
12ae 3
1d ege
-F -lane splits each line on the empty string, so each character is a field, saved in the @F array
Then print as required, indexing starts from 0
Using gawk as it supports empty FS and OFS
awk -v FS="" -v OFS="" '{gsub($($1+1),"& ");gsub(/^./,"")}1' inputfile
abd ce
12ae 3
1d ege
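Here, FS and OFS are set to the empty string and two gsub calls do the required search and replace. Note that the first gsub uses the character at the offset as a regex and replaces every occurrence of it, so this relies on that character appearing only once in the line and not being a regex metacharacter.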

Print rows that has numbers in it

This is my data; I have more than 1000 rows. How do I get only the records where the Num column contains nothing but digits?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to linux programming, any help would be highly useful to me. thanks all
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/'
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
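For awks where POSIX character classes may not be available, roughly the same filter can be sketched with a plain | separator and explicit trimming. The function name is hypothetical, and the sample rows are the ones from the question.

```shell
# Hypothetical helper: print rows whose second |-separated field,
# after trimming blanks, consists of digits only.
only_numeric_second_field() {
  awk -F'|' '{
    v = $2
    sub(/^[ \t]+/, "", v)   # trim leading blanks
    sub(/[ \t]+$/, "", v)   # trim trailing blanks
    if (v ~ /^[0-9]+$/) print
  }'
}

printf '%s\n' 'Records | Num' '123 | 7 Y1 91' '7834 | 7PQ34-102' \
  'AB12AC|87 BWE 67' '5690278| 80505312' '7ER| 998' |
  only_numeric_second_field
```

The rows are printed unchanged and in input order; only the copy in v is trimmed.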
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

take out specific columns from multiple files

I have multiple files that look like the one below. They are tab-separated. For all the files I would like to take out column 1 and the column that starts with XF:Z:. This will give me output 1.
The file names are htseqoutput*.sam.sam where * varies. I am not sure which awk function to use, or whether the for-loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.
Your loop looks OK (you can use this sed command in place of the awk part) but be careful with your quotes. " should be used, not “/”.
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
You can try, if column with "XF:Z:" is always at the end
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
you get,
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
or, if the column's position varies from file to file (but is the same on every line within a file)
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam
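If the XF:Z: column could sit at a different position on each line, one option is to scan the fields of every line rather than only the first. The sketch below assumes tab-separated input and wraps the awk call in a made-up function, extract_xf, which is then used in the loop from the question.

```shell
# Hypothetical helper: print column 1 and the value of the XF:Z: tag,
# wherever that tag appears on the line.
extract_xf() {
  awk -F'\t' -v OFS='\t' '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^XF:Z:/) {
        print $1, substr($i, 6)   # drop the 5-character XF:Z: prefix
        break
      }
  }' "$@"
}

for f in htseqoutput*.sam.sam; do
  [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
  extract_xf "$f" > "out${f#htseqoutput}"
done
```

Lines without an XF:Z: field produce no output in this version.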