I'm trying to add a delimiter to the following text format (the actual file has many more fields).
The width of each field is given by the length of the block of dashes ------------ below each header.
Input:
NAME ADDRESS PHONE
--------------------- ------------------------------------------------------------ ------------
CLARK KENT 344 Clinton Street, Apartment 3D, midtown Metropolis 11111111
TONY STARK Malibu Point 10880, 902XX 22222222
PETER PARKER 15th Street, Queens, New York City, New York 33333333
Output desired:
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
My attempt so far prints the length of each header, but I don't know how to insert the field separator | at those positions:
$ awk 'FNR == 2 {for(i=1; i<=NF; i++) {print length($i)}}'
21
60
12
I'd appreciate some help with this.
Setting FIELDWIDTHS in place (GNU awk):
$ awk -v OFS='|' 'NR==1 {h=$0; next}
NR==2 {for(i=1;i<=NF;i++) f=f FS 1+length($i);
FIELDWIDTHS=f;
$0=h}
{$1=$1}1' file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
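A quick note on the closing {$1=$1}1: assigning to a field makes awk rebuild $0 with the current OFS, and the bare 1 is an always-true pattern whose default action prints the rebuilt record. The same idiom on a throwaway input:
$ echo 'a   b   c' | awk -v OFS='|' '{$1=$1}1'
a|b|c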
Using GNU awk
wid=$(awk '
NR == 2 {
for (i=1; i<=NF; i++) printf "%d ", 1 + length($i)
exit
}
' file)
gawk -v FIELDWIDTHS="$wid" '
NR != 2 {
for (i=1; i<NF; i++) printf "%s|", $i
print $NF
}
' file
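For the sample above, the first pass turns the dash-line lengths shown earlier (21, 60 and 12) into the widths string that the second pass receives, so:
$ echo "$wid"
22 61 13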
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { OFS="|" }
NR==1 { hdr=$0; next }
NR==2 {
nf = split($0,f)
for (i=1; i<=nf; i++) {
FIELDWIDTHS = (i>1 ? FIELDWIDTHS " 1 " : "") length(f[i])
}
$0 = hdr
}
{
for (i=1; i<=NF; i+=2) {
printf "%s%s", $i, (i<NF ? OFS : ORS)
}
}
$ awk -f tst.awk file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
You may use this awk script that will work with any version of awk:
awk -v OFS='|' '
NR == 1 {
h = $0
next
}
NR == 2 {
for(i=1; i<NF; i++)
w[i] = (i == 1 ? 1 : w[i-1] + 1) + length($i)
$0 = h
}
{
for(i=1; i<=length(w); i++)
$0 = substr($0, 1, w[i]) "|" substr($0, w[i]+i)
} 1' file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
Old solution based on the sample data originally provided:
You may try this sed command, which matches a run of 2+ blanks followed by a non-blank character and inserts | between them:
sed -nE '/^-{3,}/! {s/([[:blank:]]{2,})([^[:blank:]])/\1|\2/gp;}' file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
Related
Below, I am trying to have this awk script display each student's average as well as the average of each exam. I know it's a matter of where each line of code is placed and how it's executed. This is what I need it to look like:
Name Exam 1 Exam 2 Exam 3 Exam 4 Average
Joe 0.0 75 87 91
John 0.0 86 72 83
Exam 1 Average: 0.0
Exam 2 Average: 80.5
Exam 3 Average: 79.5
#!/usr/bin/awk -f
NR == 1{
printf "%s \t %28s %7s %7s %7s %7s\n", "Name", "Exam 1", "Exam 2", "Exam 3", "Exam 4", "Av\
erage"
}
{
examSTUAVG = ($3 + $4 + $5) / 4;
printf "%s \t %28s %7s %7s %7s %7.1f\n", $1, "0", $3, $4, $5,examSTUAVG
{exam2Total += $3}
{exam3Total += $4}
{exam4Total += $5}
printf "Exam 1 Average is %19s\n", "0.0"
printf "Exam 2 Average is %19.1f\n", exam2Total / NR
printf "Exam 3 Average is %19.1f\n", exam3Total / NR
printf "Exam 4 Average is %19.1f\n", exam4Total / NR
}
{ print ""}
You need this script; you can save it as program.awk. I added an END block to print the average values at the end.
#!/usr/bin/awk -f
{
if(NR == 1){
printf "%s \t %28s %7s %7s %7s %7s\n", "Name", "Exam 1", "Exam 2", "Exam 3", "Exam 4", "Average"
}
else{
examSTUAVG = ($3 + $4 + $5) / 4;
printf "%s \t %28s %7s %7s %7s %7.1f\n", $1,$2, $3, $4, $5,examSTUAVG
{exam1Total += $2}
{exam2Total += $3}
{exam3Total += $4}
{exam4Total += $5}
}
}
END{
myrows=NR-1
printf "Exam 1 Average is %19.1f\n", exam1Total / myrows
printf "Exam 2 Average is %19.1f\n", exam2Total / myrows
printf "Exam 3 Average is %19.1f\n", exam3Total / myrows
printf "Exam 4 Average is %19.1f\n", exam4Total / myrows
}
The input is the file data.txt.
Name Exam 1 Exam 2 Exam 3 Exam 4 Average
Joe 0.0 75 87 91
John 0.0 86 72 83
And execute it as:
./program.awk data.txt
I got this output:
Name Exam 1 Exam 2 Exam 3 Exam 4 Average
Joe 0.0 75 87 91 63.2
John 0.0 86 72 83 60.2
Exam 1 Average is 0.0
Exam 2 Average is 80.5
Exam 3 Average is 79.5
Exam 4 Average is 87.0
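If the number of exam columns ever changes, the totals don't need to be hard-coded either; here is a minimal sketch of the same idea (my own variation, assuming column 1 of data.txt is the name and every remaining column is a score):
#!/usr/bin/awk -f
NR == 1 { print; next }                     # pass the header row through untouched
{
    sum = 0
    for (i = 2; i <= NF; i++) {             # every column after the name
        sum += $i
        total[i] += $i                      # running per-exam totals
    }
    printf "%s", $1
    for (i = 2; i <= NF; i++) printf " %7s", $i
    printf " %7.1f\n", sum / (NF - 1)       # this student's average
}
END {
    rows = NR - 1                           # data rows, excluding the header
    for (i = 2; i <= NF; i++)
        printf "Exam %d Average is %19.1f\n", i - 1, total[i] / rows
}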
I'm trying to match the lines containing (123) and then manipulate field 2, replacing x and + with spaces, which gives 4 columns. Then I swap column 3 with column 4.
Finally, I print the result sorted first by column 3 and then by column 4.
I'm able to get the desired output by piping the awk output to sort, like this:
$ echo "
0: 1920x1663+0+0 kpwr(746)
323: 892x550+71+955 kpwr(746)
211: 891x550+1003+410 kpwr(746)
210: 892x451+71+410 kpwr(746)
415: 891x451+1003+1054 kpwr(746)
1: 894x532+70+330 kpwr(123)
324: 894x532+1001+975 kpwr(123)
2: 894x631+1001+330 kpwr(123)
212: 894x631+70+876 kpwr(123)
61: 892x1+71+375 kpwr(0)
252: 892x1+71+921 kpwr(0)" |
awk '/\(123\)/{b = gensub(/(.+)x(.+)\+(.+)\+(.+)/, "\\1 \\2 \\4 \\3", "g", $2); print b}' |
sort -k3 -k4 -n
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
How can I get the same output using only awk, without needing to pipe to sort? Thanks for any help.
Here is how you can get it from GNU awk itself:
awk '/\(123\)/{
$2 = gensub(/(.+)x(.+)\+(.+)\+(.+)/, "\\1 \\2 \\4 \\3", "g", $2)
split($2, a) # split by space and store into array a
# store array by index 3 and 4
rec[a[3]][a[4]] = (rec[a[3]][a[4]] == "" ? "" : rec[a[3]][a[4]] ORS) $2
}
END {
PROCINFO["sorted_in"]="#ind_num_asc" # sort by numeric key ascending
for (i in rec) # print stored array rec
for (j in rec[i])
print rec[i][j]
}' file
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
Can you handle GNU awk?
$ gawk '
BEGIN {
PROCINFO["sorted_in"]="#val_num_asc" # for order strategy
}
/\(123\)$/ { # pick records
split($2,t,/[+x]/) # split 2nd field
if((t[4] in a) && (t[3] in a[t[4]])) { # if index collision
n=split(a[t[4]][t[3]],u,ORS) # split stacked element
u[n+1]=t[1] OFS t[2] OFS t[4] OFS t[3] # add new data
delete a[t[4]][t[3]] # del before rebuilding
for(i in u) # sort on whole record
a[t[4]][t[3]]=a[t[4]][t[3]] ORS u[i] # restack to element
} else
a[t[4]][t[3]]=t[1] OFS t[2] OFS t[4] OFS t[3] # no collision, just add
}
END {
PROCINFO["sorted_in"]="#ind_num_asc" # strategy on output
for(i in a)
for(j in a[i])
print a[i][j]
}' file
Output:
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
With colliding data like:
1: 894x532+70+330 kpwr(123) # this
1: 123x456+70+330 kpwr(123) # and this, notice order
324: 894x532+1001+975 kpwr(123)
2: 894x631+1001+330 kpwr(123)
212: 894x631+70+876 kpwr(123)
output would be:
123 456 330 70 # ordered by the whole record when collision
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
I was almost done writing mine when I saw my solution was the same as anubhava's, so here is a small tweak to his solution :) This one will take care of multiple lines with the same values.
awk '
BEGIN{
PROCINFO["sorted_in"]="#ind_num_asc"
}
/\(123\)/{
$2 = gensub(/(.+)x(.+)\+(.+)\+(.+)/, "\\1 \\2 \\4 \\3", "g", $2)
split($2, a," ")
arr[a[3]][a[4]] = (arr[a[3]][a[4]]!=""?arr[a[3]][a[4]] ORS:"")$2
}
END {
for (i in arr){
for (j in arr[i]){ print arr[i][j] }
}
}' Input_file
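For what it's worth, the same ordering can also be reached without multidimensional arrays by building a single zero-padded sort key per record; a rough sketch with GNU awk (the %06d padding is an assumption that the offsets fit in six digits):
gawk '
/\(123\)$/ {
    split($2, t, /[x+]/)                             # t[1]=width t[2]=height t[3]=x-offset t[4]=y-offset
    key = sprintf("%06d %06d %06d", t[4], t[3], NR)  # y-offset first, then x-offset; NR keeps duplicates apart
    out[key] = t[1] OFS t[2] OFS t[4] OFS t[3]
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"           # zero-padded keys sort correctly as strings
    for (k in out) print out[k]
}' file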
I am trying to use awk to remove the lines in file whose $2 does not match the digits after the NM_ but before the . in $2 of list. Thank you :).
file
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905 chr7 + 138145078 138270332 138145293
list
TRIM24 NM_015905.2
awk
awk -v OFS="\t" '{ sub(/\r/, "") } ; NR==FNR { N=$2 ; sub(/\..*/, "", $2); A[$2]=N; next } ; $2 in A { $2=A[$2] } 1' list file > out
current output
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905.2 chr7 + 138145078 138270332 138145293
desired output (line 1 removed as that is the line that does not match)
204 NM_015905.2 chr7 + 138145078 138270332 138145293
awk 'NR==FNR{split($2,f2,".");a[f2[1]];next} $2 in a' list file
$ awk -F'[ .]' 'NR==FNR{a[$2];next}$2 in a' list file
204 NM_015905 chr7 + 138145078 138270332 138145293
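Note that the desired output above also carries the .2 version from list into $2. If you want that as well, the original attempt only needs its unconditional 1 replaced by a filtering block, roughly:
awk -v OFS='\t' '
    { sub(/\r/, "") }                                            # keep the CR cleanup from the original attempt
    NR == FNR { N = $2; sub(/\..*/, "", $2); A[$2] = N; next }   # list: index by the unversioned id
    $2 in A { $2 = A[$2]; print }                                # file: keep matching lines only, restore the version
' list file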
I have a tab-separated file containing a series of lemmas with associated scores.
The file contains 5 columns; the first column is the lemma and the third contains the score. What I need to do is print the line as it is when the lemma is not repeated, and print the line with the highest score when the lemma is repeated.
IN
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 05 14 89 malo
díj 06a 2 101 malo
díj 06b 2 101 malo
díj 07 90 13 bueno
díj 08a 2 101 malo
díj 08b 2 101 malo
egér 06a 66 5 bueno
fonal 05 12 1 bueno
fonal 07 52 4 bueno
Desired output
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 07 90 13 bueno
egér 06a 66 5 bueno
fonal 07 52 4 bueno
This is what I have done, but it only works when the lemma is repeated only once (i.e. appears at most twice).
BEGIN {
OFS=FS="\t";
flag="";
}
{
id=$1;
if (id != flag)
{
if (line != "")
{
sub("^;","",line);
z=split(line,A,";");
if ((A[3] > A[8]) && (A[8] != ""))
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
else if ((A[8] > A[3]) && (A[8] != ""))
{
print A[6]"\t"A[7]"\t"A[8]"\t"A[9]"\t"A[10]
}
else
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
}
delete line;
flag=id;
}
line[$1]=line[$1]";"$2";"$3";"$4";"$5;
}
END {
line=line ";"$1";"$2";"$3";"$4";"$5
sub("^;","",line);
z=split(line,A,";");
if ((A[3] > A[8]) && (A[8] != ""))
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
else if ((A[8] > A[3]) && (A[8] != ""))
{
print A[6]"\t"A[7]"\t"A[8]"\t"A[9]"\t"A[10]
}
else
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5]
}
}
This one doesn't require the file to be sorted by lemma, but it keeps all the lines to be printed in memory (one per lemma), so it may not be appropriate for a file with millions of different lemmas.
It also does not preserve the order of the original file.
Finally, it assumes that all scores are strictly positive!
$ cat lemma.awk
BEGIN { FS = OFS = "\t" }
NR == 1 { print }
NR > 1 {
if ($3 > score[$1]) {
score[$1] = $3
line[$1] = $0
}
}
END { for (lemma in line) print line[lemma] }
$ awk -f lemma.awk lemma.txt
Lemma --- Score --- ---
cserép 06a 55 6 bueno
díj 07 90 13 bueno
fonal 07 52 4 bueno
darázs 05 38 1 bueno
egér 06a 66 5 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
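If the original order matters, one way (under the same positive-score assumption) is to remember the order in which each lemma first appears; a rough sketch building on the script above:
BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }                           # header
!($1 in seen) { seen[$1] = 1; order[++n] = $1 }   # remember first-seen order of each lemma
$3 > score[$1] { score[$1] = $3; line[$1] = $0 }  # keep the highest-scoring line per lemma
END { for (i = 1; i <= n; i++) print line[order[i]] }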
Tested with GNU awk:
prevLemma != $1 {
if( prevLemma ) {
print line;
}
prevLemma = $1;
prevScore = $3;
line = $0;
}
prevLemma == $1 { if( prevScore < $3 ) {
prevScore = $3;
line = $0;
}
}
END { print line;}
The assumption is that the file is sorted by lemma:
when the lemma changes (or at the very beginning, when the variable is empty), the lemma, score and line are saved
when the lemma changes (or in the END block), the line for the previous lemma is printed
when the current line belongs to the same lemma and has a higher score, the values are saved again
$ cat tst.awk
$1 != prev { printf "%s", maxLine; maxLine=""; max=$3; prev=$1 }
$3 >= max { max=$3; maxLine=$0 ORS }
END { printf "%s", maxLine }
$ awk -f tst.awk file
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 07 90 13 bueno
egér 06a 66 5 bueno
fonal 07 52 4 bueno
Use a script:
{
    if ($1 != $5) print $0
    else {
        score[NR] = $3
        print $0
    }
}
Actually, this might be better done with Perl.
I have two files such as the following:
file1
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
END
file2
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
Is it possible to copy the t=10 and t=20 lines that sit above each HELLO and paste them at the corresponding locations in file2, making it look like this:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
Of course, my files are not so small; imagine that I would like to do this over 100000 times in a file.
With the help of other members of the community I created this script, but it doesn't give the right result:
for frame in $(seq 1 1 2)
do
add=$(awk '/t=/{i++}i=='$frame' {print; exit}' $file1)
awk -v var="$add" 'NR>1 && NR%9==0 {print var} {print $0}' $file2
done
If anyone can help me, I would appreciate it.
Thanks in advance
You can try the following awk script. It reads file1 and saves the line preceding each HELLO line in an indexed array, then prints the next saved entry each time it finds a HELLO line in the second file:
awk '
NR == 1 { prev_line = $0 }
FNR == NR {
if ( $1 == "HELLO" ) {
hash[ i++ ] = prev_line
}
prev_line = $0
next
}
$1 == "HELLO" {
printf "%s\n", hash[ j++ ]
}
{ print }
' file1 file2
It yields:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
awk 'BEGIN{FS="\n";RS="END\n"}
NR==FNR{for(i=2;i<=NF;i++) a[$1]=a[$1]==""?$i:a[$1] FS $i;next}
{for (i in a) {if ($0~a[i]) printf i ORS $0 RS}
}' file1 file2
Result:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
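Another option, as a rough sketch, is to avoid storing anything and instead read file1 on demand with getline, advancing to its next t= line every time a HELLO shows up in file2 (this assumes the blocks appear in the same order in both files; the name passed via -v f= is whatever your first file is called):
awk -v f=file1 '
    $1 == "HELLO" {                                   # before each HELLO in file2 ...
        while ((getline line < f) > 0)                # ... keep reading file1 from where we left off
            if (line ~ /^t=/) { print line; break }   # emit the next t= header line
    }
    { print }                                         # print every line of file2 unchanged
' file2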