awk to capture element up to space or using special escape character - awk

Trying to extract the 5th element in $1 after the - up to the space or \\. If a / was used then the script awk -F'[-/]' 'NR==0{print; next} {print $0"\t""\t"$5}' file works as expected. Thank you :).
file --tab-delimited--
00-0000-L-F-Male \\path\to xxx xxx
00-0001-L-F-Female \\path\to xxx xxx
desired (last field has two tabs before)
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
awk
awk -F'-[[:space:]][[:space:]]+' 'NR==0{print; next} {print $0"\t""\t"$5}' file
00-0000-L-F-Male \\path\to xxx xxx
00-0001-L-F-Female \\path\to xxx xxx
awk 2
awk -F'[-\\]' 'NR==0{print; next} {print $0"\t""\t"$5}' file
awk: fatal: Unmatched [ or [^: /[-\]/

Using any awk:
$ awk -F'[-\t]' -v OFS='\t\t' '{print $0, $5}' file
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
Regarding your scripts:
awk
awk -F'-[[:space:]][[:space:]]+' 'NR==0{print; next} {print $0"\t""\t"$5}' file
-F'-[[:space:]][[:space:]]+' says that your fields are separated by a - followed by 2 or more spaces, which they aren't.
NR==0{foo} says "do foo for line number 0" but there is no line number 0 in any input.
awk 2
awk -F'[-\\]' 'NR==0{print; next} {print $0"\t""\t"$5}' file
-F'[-\\]' appears to be trying to set FS to a minus sign or a backslash, but you already told us your fields are tab-separated, not backslash-separated.
When setting FS this way it goes through a few different phases of interpretation, converting a shell string to an awk string, converting an awk string to a regexp, and using the regexp as a field separator, so you need several layers of escaping, not just 1, to make a backslash literal. If unsure, keep adding backslashes until the warnings and errors go away.
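The layers of interpretation described above can be seen in a minimal sketch. It uses a made-up one-line input, a-b\c, rather than the file from the question, and assumes an awk that processes escape sequences in the -F argument, as gawk does: the four backslashes inside the shell's single quotes become two in the awk string and one literal backslash in the bracket expression. The second command shows that NR starts at 1.

```shell
# Made-up sample line: a-b\c (the \\ in the printf format yields one backslash).
# The shell passes [-\\\\] through unchanged, awk's string processing reduces
# it to [-\\], and the regexp engine reads that as "minus or backslash".
third=$(printf 'a-b\\c\n' | awk -F'[-\\\\]' '{print $3}')
printf '%s\n' "$third"

# NR starts at 1, so a rule guarded by NR==0 never runs.
first_nr=$(printf 'x\ny\n' | awk 'NR==0{print "never printed"} NR==1{print NR}')
printf '%s\n' "$first_nr"
```

With only two backslashes, the bracket expression ends up as [-\], which is the "Unmatched [" error shown in the question.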

You may use this awk:
awk -F'\t' '{n=split($1, a, /-/); print $0 FS FS a[(n > 4 ? 5 : n)]}' file
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
The a[(n > 4 ? 5 : n)] expression gets the 5th element of the array if it has 5 or more elements; otherwise it gets the last element.
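To see the fallback in action, here is a small sketch with two made-up tab-separated lines (not the data from the question): one whose first field has five dash-separated parts and one with only three.

```shell
# Made-up input: the second line has only three dash-separated parts in $1.
result=$(printf '00-0000-L-F-Male\tp\n00-0001-L\tq\n' |
  awk -F'\t' '{
    n = split($1, a, /-/)                 # n is the number of parts in $1
    print $0 FS FS a[(n > 4 ? 5 : n)]     # 5th part if present, else the last one
  }')
printf '%s\n' "$result"
```

The first line gets Male appended and the second gets L, its last part.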

Presuming your file is '\t' separated with one-tab per field and you want an empty field before the Male/Female output, you can use:
awk -F"\t" '{ split($1,arr,"-"); print $0 "\t\t" arr[5] }' filetabs.txt
Example Use/Output
Where filetabs.txt contains your sample data with tab field-separators you would get:
$ awk -F"\t" '{ split($1,arr,"-"); print $0 "\t\t" arr[5] }' filetabs.txt
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female

With a perl one-liner, which supports lazy matching, we can try the following code. Written and tested with the shown samples only.
perl -pe 's/^((?:.*?-)+)([^[:space:]]+)([[:space:]]+.*)$/$1$2$3\t\t$2/' Input_file
OR the above could also be written as:
perl -pe 's/^((?:.*?-)+)(\S+)(\s+.*)$/$1$2$3\t\t$2/' Input_file
Explanation of the regex used above:
^( ##Start of the value; open the 1st capturing group.
(?: ##Open a non-capturing group.
.*?- ##Lazy match up to and including a -.
)+ ##Close the non-capturing group; match 1 or more occurrences of it.
) ##Close the 1st capturing group.
([^[:space:]]+) ##2nd capturing group: one or more non-whitespace characters.
([[:space:]]+.*)$ ##3rd capturing group: whitespace followed by the rest of the value.

awk/sed - replace column with pattern using variables from other columns

I have a tab delimited text file:
#CHROM  POS       ID      REF    ALT
1       188277    rs434   C      T
20      54183975  rs5321  CTAAA  C
and I am trying to replace the "ID" column with the specific pattern $CHROM_$POS_$REF_$ALT using sed or awk:
#CHROM  POS       ID                   REF    ALT
1       188277    1_188277_C_T         C      T
20      54183975  20_54183975_CTAAA_C  CTAAA  C
unfortunately, I managed only to delete this ID column with:
sed -i -r 's/\S+//3'
and all patterns I try do not work in all cases. To be honest I am lost in the documentation and I am looking for examples which could help me solve this problem.
Using awk, you can set the 3rd field to fields 1, 2, 4 and 5 joined with underscores, on every line except the first. column -t is used to present the output as a table:
awk '
BEGIN{FS=OFS="\t"}
NR>1 {
$3 = $1"_"$2"_"$4"_"$5
}1' file | column -t
Output
#CHROM POS ID REF ALT
1 188277 1_188277_C_T C T
20 54183975 20_54183975_CTAAA_C CTAAA C
Or writing all fields, with a custom value for the 3rd field:
awk '
BEGIN{FS=OFS="\t"}
NR==1{print;next}
{print $1, $2, $1"_"$2"_"$4"_"$5, $4, $5}
' file | column -t
GNU sed solution
sed '2,$s/\(\S*\)\t\(\S*\)\t\(\S*\)\t\(\S*\)\t\(\S*\)/\1\t\2\t\1_\2_\4_\5\t\4\t\5/' file.txt
Explanation: from line 2 to the last line, capture the 5 \t-separated columns (each holding zero or more non-whitespace characters) into groups, then replace them with the same columns joined by \t, except that the third is replaced by the _-join of the 1st, 2nd, 4th and 5th columns.
(tested in sed (GNU sed) 4.2.2)
awk -v OFS='\t' 'NR==1 {print $0}; NR>1 {print $1, $2, $1"_"$2"_"$4"_"$5, $4, $5}' inputfile.txt

gawk - Delimit lines with custom character and no similar ending character

Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
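As a rough illustration of why this works: with RS set to the empty string awk reads in paragraph mode, so the whole block of lines is one record, and FS='\n' makes each line a field. The assignment $1=$1 then forces awk to rebuild the record with OFS. The sketch below uses inline sample data instead of a file and compares the output with and without that assignment.

```shell
# Without $1=$1 the record is printed as read, newlines intact.
as_read=$(printf 'one\ntwo\nthree\n' | awk -v RS= -v FS='\n' -v OFS='|' '1')
# With $1=$1 the record is rebuilt and the fields are joined with OFS.
rebuilt=$(printf 'one\ntwo\nthree\n' | awk -v RS= -v FS='\n' -v OFS='|' '{$1=$1} 1')
printf '%s\n---\n%s\n' "$as_read" "$rebuilt"
```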
The awk solutions work great. Here is a tr + sed solution (note the unescaped |, since \| means alternation in GNU sed's basic regular expressions):
tr '\n' '|' < file | sed 's/|$//'
1|2|3
just flatten it :
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : --NF )'
ascii | = octal \174 = hex 0x7C. The reason for --NF is that more often than not, the input includes a trailing new line, which makes field count 1 too many and result in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
At the OFS spot, you can delimit it with any string combo you like instead of being constrained by tr, which has inconsistent behavior. For instance :
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
on bsd-tr, \n will get replaced by the unicode properly 1高2高3高 , but if you're on gnu-tr, it would only keep the leading byte of the unicode, and result in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if you attempt equiv-classes with an arbitrary non-ASCII byte, bsd-tr does nothing while gnu-tr would gladly oblige, even if it means slicing straight through UTF8-compliant characters :
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it the following way, using GNU AWK. Let the content of test.txt be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: if it is the first line, print that line's content without a trailing newline; otherwise print | followed by the line's content, again without a trailing newline. Note that I assumed test.txt has no trailing newline; if that is not the case, test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
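Note that this sets ORS back to RS on every third line, so it joins the input in groups of three rather than joining the whole file. A small sketch with six made-up input lines shows the grouping:

```shell
# Six sample lines: every third record ends with RS (a newline) instead of "|".
grouped=$(printf '%s\n' 1 2 3 4 5 6 | awk '{ORS = (NR%3 ? "|" : RS)} 1')
printf '%s\n' "$grouped"
```

For the three-line test.txt in the question this is exactly one group, which is why the result is one|two|three.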
With GNU sed, you can pass the -z option so that line breaks can be matched; then all you need is to replace every newline except the one at the end of the string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes in place, add the -i option (mind that the first command is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt

Using awk to count names in a column

I need to show the initials of people who have three proper names.
So far I print the two relevant columns with:
awk -F ';' '{print $1, $2}' users.txt
output:
xxx JoaoPedroVilar
xxa JoaoMiguel
RMF RitaPereira
....
My question is: I think I need some kind of count function that looks only at column $2 and finds the entries made up of more than 2 names, because the output should contain only the acronyms and names with more than 2 uppercase letters, like this:
xxx JoaoPedroVilar
RAT RicardoAntonioPereira
Sample data:
awk -F ';' '{print $1, $2}' users.txt
Output:
xxx NunoAndr�Ferreira
xxx HugoFernandes
xxx HugoGomes
xxx In�sSilva
xxx Jo�oTeixeira
xxx JoaquimGon�alves
JAR JoaquimRibeiro
xxx Jos�PedroRafael
xxx Jos�Soares
xxx LuisFernandes
xxx MiguelMadeira
xxx NunoAndr�Ferreira
xxx PedroLucasFarinha
The answer is:
awk -F';' -b '$2~/[A-Z]{1}.*[A-Z]{1}.*[A-Z]{1}.*/{print $1, $2}' users.txt
The -b option (gawk's --characters-as-bytes) makes awk treat every input byte as a character.
This regex can probably be refactored, but it does the trick (I think)
awk -F';' '$2~/[A-Z]{1}.*[A-Z]{1}.*[A-Z]{1}.*/{print $1, $2}' users.txt
That's just matching for 3 single upper case characters in your second column. Note that this is going to have false positives if you have names like ScottMcMasters or BobO'Neal, but trying to separate names that aren't already separated is never 100%.
Example:
 cat users.txt
xxx;JoaoPedroVilar
xxx;PedroAndrePereira
RAT;RicardoAntonioPereira
xxx;BobBob
xxx;SomeName
 awk -F";" '$2~/[A-Z]{1}.*[A-Z].*[A-Z].*/' users.txt
xxx;JoaoPedroVilar
xxx;PedroAndrePereira
RAT;RicardoAntonioPereira
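The {1} quantifiers are redundant, since [A-Z]{1} matches the same thing as [A-Z]. The sketch below uses the shortened pattern on a few made-up names (not from the question) and also reproduces the false positive mentioned above. LC_ALL=C is set on the assumption that plain ASCII ranges are wanted.

```shell
# Made-up sample data; ScottMcMasters has two names but three capitals.
matches=$(printf 'xxx;JoaoPedroVilar\nxxx;BobBob\nxxx;ScottMcMasters\n' |
  LC_ALL=C awk -F';' '$2 ~ /[A-Z].*[A-Z].*[A-Z]/ {print $1, $2}')
printf '%s\n' "$matches"
```

BobBob is dropped because it has only two capitals, while ScottMcMasters is kept even though it is a two-part name.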
Without testable input/output it's a guess but it sounds like you need:
awk -F';' '$2 ~ /([[:upper:]][^[:upper:]]+){2}[[:upper:]]/{print $1, $2}' file
e.g. using the output you posted elsewhere to create sample input:
$ cat file
xxx;NunoAndr�Ferreira
xxx;HugoFernandes
xxx;HugoGomes
xxx;In�sSilva
xxx;Jo�oTeixeira
xxx;JoaquimGon�alves
JAR;JoaquimRibeiro
xxx;Jos�PedroRafael
xxx;Jos�Soares
xxx;LuisFernandes
xxx;MiguelMadeira
xxx;NunoAndr�Ferreira
xxx;PedroLucasFarinha
.
$ awk -F';' '$2 ~ /([[:upper:]][^[:upper:]]+){2}[[:upper:]]/{print $1, $2}' file
xxx NunoAndr�Ferreira
xxx Jos�PedroRafael
xxx NunoAndr�Ferreira
xxx PedroLucasFarinha
If that doesn't work for you then try setting your locale to C first (and next try setting it to whatever locale understands the "control chars" in your files as letters) as you're probably just having a locale problem:
LC_ALL=C awk -F';' '$2 ~ /([[:upper:]][^[:upper:]]+){2}[[:upper:]]/{print $1, $2}' file
Another approach is to count the upper case chars:
$ awk -F';' 'gsub(/[A-Z]/,"&",$2)>2 {print $1,$2}'
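This relies on gsub returning the number of substitutions it made; replacing each match with & (the matched text) leaves the field's content unchanged. A minimal sketch with made-up input, assuming ASCII names and the C locale:

```shell
# gsub() returns how many uppercase letters it replaced (each with itself).
counted=$(printf 'xxx;JoaoPedroVilar\nxxx;BobBob\n' |
  LC_ALL=C awk -F';' '{ n = gsub(/[A-Z]/, "&", $2) } n > 2 { print $1, $2 }')
printf '%s\n' "$counted"
```

Only the line with more than two capitals in the second field is printed.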
with:
awk -F";" '$2~/[A-Z]{1}.*[A-Z].*[A-Z].*/' users.txt
Output:
PLF PedroLucasFarinha
but when I print just the names I get, for example:
awk -F ';' '{print $1, $2}' users.txt
Output:
xxx NunoAndr�Ferreira
xxx HugoFernandes
xxx HugoGomes
xxx In�sSilva
xxx Jo�oTeixeira
xxx JoaquimGon�alves
JAR JoaquimRibeiro
xxx Jos�PedroRafael
xxx Jos�Soares
xxx LuisFernandes
xxx MiguelMadeira
xxx NunoAndr�Ferreira
xxx PedroLucasFarinha
but with:
awk -F';' -b '$2~/[A-Z]{1}.*[A-Z]{1}.*[A-Z]{1}.*/{print $1, $2}' users.txt
the output is
xxx Jos�PedroRafael
xxx NunoAndr�Ferreira
xxx PedroLucasFarinha

list 3rd column of a file with spaces only

for listing 3rd column I am using
awk '{print $3}' inputfile.txt
and its output looks like
abc
xyz
lmn
pqr
But I need output like
abc xyz lmn pqr
How can I get this?
This might work for you (GNU sed):
sed -r 's/((\S*)\s){3}.*/\2/;1h;1!H;$!d;x;y/\n/ /' file
or more easily:
cut -d' ' -f3 file | paste -sd' ' -
print will always append a newline (actually, it will use ORS value). If you want more control, you can use printf:
awk '{printf "%s ", $3}'
This will also print an extra space character at the end, but for most use-cases this extra space is harmless.
Transliterate linefeeds into spaces
... | tr '\n' ' '
Use the awk Output Record Separator variable.
awk -v ORS=' ' '{print $3}' inputfile.txt
Avoiding adding a space to the beginning or end of the line:
awk '{printf "%s%s", fs, $3; fs=FS} END{print ""}' file

awk command to change field separator from tilde to tab

I want to replace the tilde delimiter with a tab using an awk command. The input and the output I expect are shown below.
input
~1~2~3~
Output
1 2 3
This won't work for me:
awk -F"~" '{ OFS ="\t"; print }' inputfile
It's really a job for tr:
tr '~' '\t'
but in awk you just need to force the record to be recompiled by assigning one of the fields to its own value:
awk -F'~' -v OFS='\t' '{$1=$1}1'
awk NF=NF FS='~' OFS='\t'
Result
1 2 3
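To see why the command in the question printed the line unchanged, compare the two cases in this sketch: print without touching a field outputs the record as read, while assigning any field makes awk rebuild $0 with OFS. The input is the sample line from the question, and tr is used here only to make the tabs visible as T.

```shell
# No field is assigned, so $0 is printed exactly as read.
unchanged=$(printf '~1~2~3~\n' | awk -F'~' -v OFS='\t' '{print}')
# $1=$1 forces awk to rebuild $0, joining the fields with OFS.
rebuilt=$(printf '~1~2~3~\n' | awk -F'~' -v OFS='\t' '{$1=$1} 1' | tr '\t' 'T')
printf '%s\n%s\n' "$unchanged" "$rebuilt"
```

The leading and trailing T come from the empty first and last fields around the outer tildes.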
Code for sed:
$ echo '~1~2~3~' | sed 'y/~/\t/'
1 2 3