I would like to know how to handle the situation below: the sample input is space-delimited, and I want to format it as comma-separated output.
All the text in a line up to the first field starting with a digit should be treated as a single field in the output. In the sample data there are always 3 numeric fields at the end of each line; in the real data there are 14 such fields.
Input.txt
mmm 4394850 4465411 2579770
xxx yyy 2155419 2178791 1516446
aaa bbb (incl. ccc) 14291585 14438704 6106341
U.U.(W) 6789781 6882021 5940226
nnn 7335050 7534302 2963345
I have tried the command below, but I know it is incomplete:
awk 'BEGIN {FS =" "; OFS = ","} {print $1,$2,$3,$4,$5,$6} ' Input.txt
Desired output:
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
With GNU awk for gensub(): match() locates the trailing run of digits and spaces, and gensub() then turns each space in that run into a comma:
$ awk '{match($0,/[0-9 ]+$/); print substr($0,1,RSTART-1) gensub(/ /,",","g",substr($0,RSTART,RLENGTH))}' file
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
With other awks, save the second substr() output in a variable and use gsub():
awk '{match($0,/[0-9 ]+$/); digs=substr($0,RSTART,RLENGTH); gsub(/ /,",",digs); print substr($0,1,RSTART-1) digs}' file
Assuming that it's the last 3 columns that are numerical (as in your example):
awk '{for(i=1;i<=NF;++i)printf "%s%s",$i,(i<NF-3?OFS:(i<NF?",":ORS))}' file
Basically, print each field followed by a space, a comma, or a newline, depending on the field number.
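If the count of trailing numeric fields differs (3 in the sample, 14 in the real data), the cutoff can be passed in as a variable; a minimal sketch of the same loop, where n is that count:
awk -v n=3 '{for(i=1;i<=NF;++i)printf "%s%s",$i,(i<NF-n?OFS:(i<NF?",":ORS))}' file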
Another GNU awk approach (gensub() is gawk-specific); assigning the result to $0 makes the modified, non-empty line trigger awk's default print:
awk '$0=gensub(/ ([0-9]+)/,",\\1","g")' file
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
Related question:
I am trying to extract the 5th element in $1, after the -, up to the space or \\. If a / were used as the separator, then the script awk -F'[-/]' 'NR==0{print; next} {print $0"\t""\t"$5}' file would work as expected. Thank you :).
file --tab-delimited--
00-0000-L-F-Male \\path\to xxx xxx
00-0001-L-F-Female \\path\to xxx xxx
Desired output (the last field is preceded by two tabs):
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
awk
awk -F'-[[:space:]][[:space:]]+' 'NR==0{print; next} {print $0"\t""\t"$5}' file
00-0000-L-F-Male \\path\to xxx xxx
00-0001-L-F-Female \\path\to xxx xxx
awk 2
awk -F'[-\\]' 'NR==0{print; next} {print $0"\t""\t"$5}' file
awk: fatal: Unmatched [ or [^: /[-\]/
Using any awk:
$ awk -F'[-\t]' -v OFS='\t\t' '{print $0, $5}' file
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
Regarding your scripts:
awk
awk -F'-[[:space:]][[:space:]]+' 'NR==0{print; next} {print $0"\t""\t"$5}' file
-F'-[[:space:]][[:space:]]+' says that your fields are separated by a - followed by 2 or more spaces, which they aren't.
NR==0{foo} says "do foo for line number 0" but there is no line number 0 in any input.
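If the intent was to pass a header line through untouched, the usual idiom is NR==1 (or FNR==1 per input file); a sketch combining that with the tab/minus splitting shown above, assuming the real file has a header line (the sample does not):
awk -F'[-\t]' -v OFS='\t\t' 'NR==1{print; next} {print $0, $5}' file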
awk 2
awk -F'[-\\]' 'NR==0{print; next} {print $0"\t""\t"$5}' file
-F'[-\\]' appears to be trying to set FS to a minus sign or a backslash, but you already told us your fields are tab-separated, not backslash-separated.
When you set FS this way it goes through several phases of interpretation: the shell string becomes an awk string, the awk string becomes a regexp, and the regexp is used as the field separator. So you need several layers of escaping, not just one, to make a backslash literal. If unsure, keep adding backslashes until the warnings and errors go away.
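For illustration only (not needed here, since the real fields are tab-separated), a sketch of what a literal backslash in FS looks like after all those layers:
awk -F'[-\\\\]' '{print NF}' file
The shell's single quotes pass [-\\\\] to awk unchanged; awk's string-escape processing reduces it to the ERE [-\\], which matches either - or a literal backslash. With only two backslashes, the string becomes [-\], where \] escapes the closing bracket, producing exactly the "Unmatched [ or [^" error shown above.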
You may use this awk:
awk -F'\t' '{n=split($1, a, /-/); print $0 FS FS a[(n > 4 ? 5 : n)]}' file
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
The expression a[(n > 4 ? 5 : n)] gets the 5th element from the array if there are 5 or more elements; otherwise it gets the last element.
Presuming your file is '\t'-separated, with one tab per field, and you want an empty field before the Male/Female output, you can use:
awk -F"\t" '{ split($1,arr,"-"); print $0 "\t\t" arr[5] }' filetabs.txt
Example Use/Output
Where filetabs.txt contains your sample data with tab field-separators you would get:
$ awk -F"\t" '{ split($1,arr,"-"); print $0 "\t\t" arr[5] }' filetabs.txt
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
With a perl one-liner, which supports lazy matching, we can try the following code. Written and tested on the shown samples only.
perl -pe 's/^((?:.*?-)+)([^[:space:]]+)([[:space:]]+.*)$/\1\2\3\t\t\2/' Input_file
Or the above could also be written as:
perl -pe 's/^((?:.*?-)+)(\S+)(\s+.*)$/\1\2\3\t\t\2/' Input_file
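Run on the shown samples, either one-liner should produce:
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female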
Explanation: a detailed explanation of the regex used above.
^( ##From starting of the value creating one capturing group here.
(?: ##Opening non-capturing group here.
.*?- ##Using lazy match till - here.
)+ ##Closing non-capturing group here with matching 1 OR more occurrences of this.
) ##Closing 1st capturing group here.
([^[:space:]]+) ##Creating 2nd capturing group and matching all non-spaces in it.
([[:space:]]+.*)$ ##Creating 3rd capturing group which matches spaces till end of the value.
file1
1 123 ab456 A G PASS AC=0.15;FB=1.5;BV=45; 0|0 0|0 0|1 0|0
4 789 ab123 C T PASS FB=90;AC=2.15;BV=12; 0|1 0|1 0|0 0|0
desired output
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
I used
awk '{print $1,$2,$3,$4,$5,$6,$7}' file1 > out1.txt
sed -i 's/;/\t/g' out1.txt
awk '{print $1,$2,$3,$4,$5,$6,$7,$8}' out1.txt
output generated
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS FB=90
I want to print the first 6 columns along with the AC=(*) value from the 7th column.
With your shown samples, please try the following awk code.
awk '
{
val=""
while(match($7,/AC=[^;]*/)){
val=(val?val:"")substr($7,RSTART,RLENGTH)
$7=substr($7,RSTART+RLENGTH)
}
print $1,$2,$3,$4,$5,$6,val
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
{
val="" ##Nullifying val here.
while(match($7,/AC=[^;]*/)){ ##Loop with match() to find each AC=... occurrence (up to a semicolon) in $7.
val=(val?val:"")substr($7,RSTART,RLENGTH) ##Append the matched text from the 7th column to val.
$7=substr($7,RSTART+RLENGTH) ##Re-assign the rest of the 7th column to itself.
}
print $1,$2,$3,$4,$5,$6,val ##Print the first six columns plus val.
}
' Input_file ##Mentioning Input_file name here.
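Run on the shown samples, this should print:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15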
$ awk '{
n=split($7,a,/;/) # split $7 on ;s
for(i=1;i<=n&&a[i]!~/^AC=/;i++); # just loop looking for AC
print $1,$2,$3,$4,$5,$6,a[i] # output
}' file
Output:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
If AC= is not found, an empty field is output instead.
Any time you have tag=value pairs in your data I find it best to first populate an array (f[] below) to hold those tag-value mappings so you can print/test/rearrange those values by their tags (names).
Using any awk in any shell on every Unix box:
$ cat tst.awk
{
n = split($7,tmp,/[=;]/)
for (i=1; i<n; i+=2) {
f[tmp[i]] = tmp[i] "=" tmp[i+1]
}
sub(/[[:space:]]*[^[:space:]]+;.*/,"")
print $0, f["AC"]
}
$ awk -f tst.awk file
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
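The payoff of the f[] mapping is that targeting a different tag needs no regexp changes; a sketch (a hypothetical tst2.awk) that prints FB instead:
$ cat tst2.awk
{
n = split($7,tmp,/[=;]/)
for (i=1; i<n; i+=2) {
f[tmp[i]] = tmp[i] "=" tmp[i+1]
}
sub(/[[:space:]]*[^[:space:]]+;.*/,"")
print $0, f["FB"]    # swap in any tag name here
}
$ awk -f tst2.awk file
1 123 ab456 A G PASS FB=1.5
4 789 ab123 C T PASS FB=90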
This might work for you (GNU sed):
sed -nE 's/^((\S+\s){6})\S*;?(AC=[^;]*);.*/\1\3/p' file
Turn off implicit printing with -n and enable extended regexps with -E.
Match the first six fields and their delimiters, and append the AC tag and its value from the seventh field.
With only GNU sed:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/' file1
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
Lines without an AC=... part in the 7th field will be printed without modification. If you prefer to remove the 7th field and everything after it in that case, use:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/;t;s/\S+;.*//' file1
Here is my input file:
$ cat abc
0 1
2 3
4 5
Why does the following give a one-column output instead of a two-column one?
$ cat abc | awk '{ print $1==0?"000":"111" $1==0? "222":"333" }'
000
333
333
Shouldn't the output be the following?
000 222
111 333
111 333
I think awk is going to parse this as:
awk '{ print ($1==0) ? "000" : (("111" $1==0) ? "222" : "333") }'
That is, when it prints the three zeros, it doesn't even consider the rest of the expression. And when it doesn't print the three zeros, it prints the triple threes, because "111" concatenated with any string never compares equal to zero.
You probably want to use:
awk '{ print ($1==0?"000":"111"), ($1==0? "222":"333") }'
where the comma puts a space (OFS or output field separator, to be precise) in the output between the two strings. Or you might prefer:
awk '{ print ($1==0?"000":"111") ($1==0? "222":"333") }'
which concatenates the two strings with no space.
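With the parentheses and the comma in place, the sample input should produce the expected two-column output:
$ awk '{ print ($1==0?"000":"111"), ($1==0? "222":"333") }' abc
000 222
111 333
111 333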
I would like to print lines that are unique by the first field, keeping only the occurrence with the latest date and time in the third field and removing the other duplicate occurrences.
There are around 50 million rows, and the file is not sorted ...
Input.csv
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
Desired Output:
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
I have attempted partial, incomplete commands, struggling with the date and time format of the file and its unsorted order ...
awk -F, '!seen[$1,$3]++' Input.csv
Looking for your suggestions ...
This awk command will do it for you:
awk -F, -v OFS=',' '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] {b[$1] =d; a[$1] = $0}
END{for(x in a)print a[x]}' file
The first line transforms the original $3 into a valid date string and gets the seconds since 1970 via the date command, so that we can compare later.
The two arrays, a and b, hold the final result and the latest date (in seconds) for each key.
The END block prints all rows from a.
Test with your example data:
kent$ cat f
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
kent$ awk -F, '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] { b[$1] =d;a[$1] = $0 }
END{for(x in a)print a[x]}' f
10 ab 25-SEP-14 08:09:26 abc xxx yyy zzz
20 ab 23-SEP-14 08:09:35 abc xxx yyy zzz
58 ab 22-JUL-14 05:07:07 abc xxx yyy zzz
62 ab 12-SEP-14 03:09:23 abc xxx yyy zzz
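A performance note: spawning date once per line is slow for 50 million rows, and each distinct command string opens its own pipe, which should be close()d to avoid running out of file descriptors. Since the timestamp format is fixed, a chronologically sortable key can be built in awk itself; a minimal sketch, assuming all two-digit years fall in the same century:
awk -F, 'BEGIN{n=split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC",m," ")
for(i=1;i<=n;i++) mon[m[i]]=sprintf("%02d",i)}        # month name -> 01..12
{split($3,t,/[-.:]/)                                  # 22-JUL-14.05:07:07 -> t[1..6]
key=t[3] mon[t[2]] t[1] t[4] t[5] t[6]                # YYMMDDHHMMSS, sorts as a string
if(!($1 in best) || key>best[$1]){best[$1]=key; line[$1]=$0}}
END{for(k in line) print line[k]}' Input.csv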
This should do it (note that the sort on the raw DD-MON-YY.HH:MM:SS string is lexical, not chronological; it happens to work for this sample, where later dates also sort later, but not in general):
sort -t , -k 3 file | awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}'
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like this: condition{action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet, it is equivalent to:
awk '{
if (NR == FNR) { arr[NR] = $1 }
if (NR != FNR && arr[FNR]) { print $0 }
}' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current record to STDOUT if the expression is true (non-zero).
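For instance, this condition-only one-liner relies on the same implicit print to show the lines of file1 whose first field is non-zero:
awk '$1 != 0' file1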
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR is different than FNR on the second file)
arr[NR]=$1 : store the first column of the current line in array arr at index NR
if NR!=FNR we are in the second file, and if the value stored in the array is 1 (truthy), then we print
Not as clean as an awk solution, and note that /0/ matches a 0 anywhere on the pasted line, so it relies on the data containing no other zeros:
$ paste file2 file1 | sed '/0/d' | cut -f1
B ccc
C ddd
You mentioned something about millions of lines; in order to do just a single pass through the files, I'd resort to python. Something like this perhaps (python 2.7):
from itertools import izip  # izip is lazy; zip() in python 2 would build the whole list in memory

with open("file1") as fd1, open("file2") as fd2:
    for l1, l2 in izip(fd1, fd2):
        if not l1.startswith('0'):
            print l2.strip()
awk '{
getline value <"file2"   # read the next line of file2 in lockstep with file1
if ($1)                  # a non-zero first field in file1 ...
print value              # ... prints the corresponding file2 line
}' file1
This reads both files one line at a time, so nothing needs to be held in memory.