Copy lines from one file and paste them into another n times - awk

I have two files such as the following:
file1
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
END
file2
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
Is it possible to copy the t=10 and t=20 lines that sit above each HELLO and paste them at the corresponding locations in file2, producing:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
Of course my real files are not this small, and imagine that I would like to do this over 100000 times in a file.
With the help of other members of the community I created this script, but it doesn't give the right result:
for frame in $(seq 1 1 2)
do
    add=$(awk '/t=/{i++}i=='$frame' {print; exit}' $file1)
    awk -v var="$add" 'NR>1 && NR%9==0 {print var} {print $0}' $file2
done
If anyone can help me I would appreciate it.
Thanks in advance

You can try the following awk script. It reads file1, saving the line preceding each HELLO in an indexed array, and then prints the stored entries one by one each time it finds a HELLO line in the second file:
awk '
NR == 1 { prev_line = $0 }
FNR == NR {
    if ( $1 == "HELLO" ) {
        hash[ i++ ] = prev_line
    }
    prev_line = $0
    next
}
$1 == "HELLO" {
    printf "%s\n", hash[ j++ ]
}
{ print }
' file1 file2
It yields:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
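As a self-contained check, the same program can be run on scaled-down copies of the two files built with here-documents (the /tmp paths and the shortened blocks are illustrative, not from the question):

```shell
#!/bin/sh
# Scaled-down stand-ins for file1 and file2.
cat > /tmp/f1.txt <<'EOF'
t=10
HELLO
AAAAAA
END
t=20
HELLO
BBBBBB
END
EOF
cat > /tmp/f2.txt <<'EOF'
HELLO
AAAAAA
111111
END
HELLO
BBBBBB
222222
END
EOF
result=$(awk '
    NR == 1   { prev_line = $0 }
    FNR == NR {                       # first file: remember line above each HELLO
        if ($1 == "HELLO") hash[i++] = prev_line
        prev_line = $0
        next
    }
    $1 == "HELLO" { print hash[j++] } # second file: emit the stored line first
    { print }
' /tmp/f1.txt /tmp/f2.txt)
printf '%s\n' "$result"
```

The first HELLO in file2 gets t=10 prepended, the second gets t=20, and every other line passes through unchanged.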

awk 'BEGIN{FS="\n"; RS="END\n"}
NR==FNR{for(i=2;i<=NF;i++) a[$1]=a[$1]==""?$i:a[$1] FS $i; next}
{for (i in a) if ($0 ~ a[i]) printf i ORS $0 RS}' file1 file2
Result:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END

How to add field separator based on headers length?

I'm trying to add a delimiter to the following text format (the actual file has many more fields).
The width of each field is given by the length of the block of dashes ------------ below each header.
Input:
NAME ADDRESS PHONE
--------------------- ------------------------------------------------------------ ------------
CLARK KENT 344 Clinton Street, Apartment 3D, midtown Metropolis 11111111
TONY STARK Malibu Point 10880, 902XX 22222222
PETER PARKER 15th Street, Queens, New York City, New York 33333333
Output desired:
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
My attempt so far prints the length of each header, but I don't know how to insert the field separator | at those positions:
$ awk 'FNR == 2 {for(i=1; i<=NF; i++) {print length($i)}}' file
21
60
12
Please some help on this
Setting FIELDWIDTHS in place (GNU awk):
$ awk -v OFS='|' 'NR==1 {h=$0; next}
NR==2 {for(i=1;i<=NF;i++) f=f FS 1+length($i);
FIELDWIDTHS=f;
$0=h}
{$1=$1}1' file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
Using GNU awk
wid=$(awk '
NR == 2 {
for (i=1; i<=NF; i++) printf "%d ", 1 + length($i)
exit
}
' file)
gawk -v FIELDWIDTHS="$wid" '
NR != 2 {
for (i=1; i<NF; i++) printf "%s|", $i
print $NF
}
' file
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { OFS="|" }
NR==1 { hdr=$0; next }
NR==2 {
    nf = split($0,f)
    for (i=1; i<=nf; i++) {
        FIELDWIDTHS = (i>1 ? FIELDWIDTHS " 1 " : "") length(f[i])
    }
    $0 = hdr
}
{
    for (i=1; i<=NF; i+=2) {
        printf "%s%s", $i, (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
You may use this awk script that will work with any version of awk:
awk -v OFS='|' '
NR == 1 {
    h = $0
    next
}
NR == 2 {
    for (i = 1; i < NF; i++)
        w[i] = (i == 1 ? 1 : w[i-1] + 1) + length($i)
    n = NF - 1          # number of separators to insert
    $0 = h
}
{
    for (i = 1; i <= n; i++)
        $0 = substr($0, 1, w[i]) "|" substr($0, w[i] + i)
} 1' file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
Old solutions based on sample data provided
You may try this sed, which matches a run of 2+ blanks followed by a non-blank character and inserts | between them:
sed -nE '/^-{3,}/! {s/([[:blank:]]{2,})([^[:blank:]])/\1|\2/gp;}' file
NAME |ADDRESS |PHONE
CLARK KENT |344 Clinton Street, Apartment 3D, midtown Metropolis |11111111
TONY STARK |Malibu Point 10880, 902XX |22222222
PETER PARKER |15th Street, Queens, New York City, New York |33333333
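A portable variant of the same idea (any POSIX awk, no FIELDWIDTHS) computes the cut positions from the dashes line and splices the | in with substr. A sketch on hypothetical toy columns, not the question's data:

```shell
#!/bin/sh
cat > /tmp/hdr.txt <<'EOF'
NAME  AGE
----- ---
ALICE 30
BOB   41
EOF
result=$(awk '
    NR == 1 { hdr = $0; next }        # stash the header
    NR == 2 {                         # dashes line defines the widths
        n = split($0, f)
        pos = 0
        for (i = 1; i < n; i++) {
            pos += length(f[i]) + 1   # field width plus one separating space
            cut[i] = pos + 1          # column where the next field starts
        }
        $0 = hdr                      # re-emit the header in place of the dashes
    }
    {
        out = $0
        for (i = n - 1; i >= 1; i--)  # splice right-to-left so offsets stay valid
            out = substr(out, 1, cut[i] - 1) "|" substr(out, cut[i])
        print out
    }
' /tmp/hdr.txt)
printf '%s\n' "$result"
```

Splicing right-to-left means each insertion never shifts the cut positions still to be processed.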

Print sorted output with awk to avoid piping to the sort command

I'm trying to match the lines containing (123) and then manipulate field 2, replacing x and + with spaces, which gives 4 columns; then swap column 3 with column 4.
Finally, I want to print sorted first by column 3 and then by column 4.
I'm able to get that output by piping the awk output through sort, in this way:
$ echo "
0: 1920x1663+0+0 kpwr(746)
323: 892x550+71+955 kpwr(746)
211: 891x550+1003+410 kpwr(746)
210: 892x451+71+410 kpwr(746)
415: 891x451+1003+1054 kpwr(746)
1: 894x532+70+330 kpwr(123)
324: 894x532+1001+975 kpwr(123)
2: 894x631+1001+330 kpwr(123)
212: 894x631+70+876 kpwr(123)
61: 892x1+71+375 kpwr(0)
252: 892x1+71+921 kpwr(0)" |
awk '/\(123\)/{b = gensub(/(.+)x(.+)\+(.+)\+(.+)/, "\\1 \\2 \\4 \\3", "g", $2); print b}' |
sort -k3 -k4 -n
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
How can I get the same output using only awk without the need to pipe sort? Thanks for any help.
Here is how you can get it from GNU awk itself:
awk '/\(123\)/{
    $2 = gensub(/(.+)x(.+)\+(.+)\+(.+)/, "\\1 \\2 \\4 \\3", "g", $2)
    split($2, a)            # split by space and store into array a
    # store record by indices 3 and 4
    rec[a[3]][a[4]] = (rec[a[3]][a[4]] == "" ? "" : rec[a[3]][a[4]] ORS) $2
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"  # sort by numeric key, ascending
    for (i in rec)          # print stored array rec
        for (j in rec[i])
            print rec[i][j]
}' file
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
Can you handle GNU awk?:
$ gawk '
BEGIN {
    PROCINFO["sorted_in"]="#val_num_asc"         # for order strategy
}
/\(123\)$/ {                                     # pick records
    split($2,t,/[+x]/)                           # split 2nd field
    if((t[4] in a) && (t[3] in a[t[4]])) {       # if index collision
        n=split(a[t[4]][t[3]],u,ORS)             # split stacked element
        u[n+1]=t[1] OFS t[2] OFS t[4] OFS t[3]   # add new data
        delete a[t[4]][t[3]]                     # del before rebuilding
        for(i in u)                              # sort on whole record
            a[t[4]][t[3]]=a[t[4]][t[3]] ORS u[i] # restack to element
    } else
        a[t[4]][t[3]]=t[1] OFS t[2] OFS t[4] OFS t[3] # no collision, just add
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"         # strategy on output
    for(i in a)
        for(j in a[i])
            print a[i][j]
}' file
Output:
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
With colliding data like:
1: 894x532+70+330 kpwr(123) # this
1: 123x456+70+330 kpwr(123) # and this, notice order
324: 894x532+1001+975 kpwr(123)
2: 894x631+1001+330 kpwr(123)
212: 894x631+70+876 kpwr(123)
output would be:
123 456 330 70 # ordered by the whole record when collision
894 532 330 70
894 631 330 1001
894 631 876 70
894 532 975 1001
I was almost done writing mine when I saw that my solution was nearly identical to @anubhava's, so here is a small tweak on top of his solution :) This one will take care of multiple lines with the same values.
awk '
BEGIN{
    PROCINFO["sorted_in"]="#ind_num_asc"
}
/\(123\)/{
    $2 = gensub(/(.+)x(.+)\+(.+)\+(.+)/, "\\1 \\2 \\4 \\3", "g", $2)
    split($2, a," ")
    arr[a[3]][a[4]] = (arr[a[3]][a[4]]!=""?arr[a[3]][a[4]] ORS:"")$2
}
END {
    for (i in arr){
        for (j in arr[i]){ print arr[i][j] }
    }
}' Input_file
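All of the answers above lean on gawk-only features (gensub, arrays of arrays, PROCINFO["sorted_in"]). The traversal-order mechanism in isolation, as a minimal sketch that skips gracefully when gawk is not installed:

```shell
#!/bin/sh
# PROCINFO["sorted_in"] is a gawk extension; exit quietly elsewhere.
command -v gawk >/dev/null 2>&1 || exit 0
result=$(printf '10 b\n2 a\n1 c\n' | gawk '
    { line[$1] = $0 }                          # key each record by its 1st field
    END {
        PROCINFO["sorted_in"] = "#ind_num_asc" # iterate keys in numeric ascending order
        for (k in line) print line[k]
    }')
printf '%s\n' "$result"
```

Setting the strategy before the for-in loop is what makes the END block print 1, 2, 10 rather than the arbitrary hash order a plain for-in would give.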

AWK failing to sum floats

I am trying to sum the last 12 values in a field in a particular csv file, but AWK is failing to correctly sum the values. If I output the data to a new file then run the same AWK statement against the new file it works.
Here are the contents of the original file. The fields are separated by ";"
I want to sum the values in the 3rd field
$ tail -12 OriginalFile.csv
02/02/2020 10:30:00;50727.421;0.264;55772.084;0.360;57110.502;0.384
02/02/2020 10:35:00;50727.455;0.408;55772.126;0.504;57110.548;0.552
02/02/2020 10:40:00;50727.489;0.408;55772.168;0.504;57110.593;0.540
02/02/2020 10:45:00;50727.506;0.204;55772.193;0.300;57110.621;0.336
02/02/2020 10:50:00;50727.541;0.420;55772.236;0.516;57110.667;0.552
02/02/2020 10:55:00;50727.566;0.300;55772.269;0.396;57110.703;0.432
02/02/2020 11:00:00;50727.590;0.288;55772.300;0.372;57110.737;0.408
02/02/2020 11:05:00;50727.605;0.180;55772.321;0.252;57110.762;0.300
02/02/2020 11:10:00;50727.621;0.192;55772.344;0.276;57110.786;0.288
02/02/2020 11:15:00;50727.659;0.456;55772.389;0.540;57110.835;0.588
02/02/2020 11:20:00;50727.681;0.264;55772.417;0.336;57110.866;0.372
02/02/2020 11:25:00;50727.704;0.276;55772.448;0.372;57110.900;0.408
I used the following code to print the original value and the running sum of field 3 for each record, but it just returns the same summed value for every line:
$ awk 'BEGIN { FS = ";" } ; { sum += $3 } { print $3, sum }' OriginalFile.csv | tail -12
0.264 2.00198e+09
0.408 2.00198e+09
0.408 2.00198e+09
0.204 2.00198e+09
0.420 2.00198e+09
0.300 2.00198e+09
0.288 2.00198e+09
0.180 2.00198e+09
0.192 2.00198e+09
0.456 2.00198e+09
0.264 2.00198e+09
0.276 2.00198e+09
If I output the contents of the file into a different file, the same code works as expected
$ tail -12 OriginalFile.csv > testfile2.csv
$ awk 'BEGIN { FS = ";" } ; { sum += $3 } { print $3, sum }' testfile2.csv
0.264 0.264
0.408 0.672
0.408 1.08
0.204 1.284
0.420 1.704
0.300 2.004
0.288 2.292
0.180 2.472
0.192 2.664
0.456 3.12
0.264 3.384
0.276 3.66
How can I get the correct output from the original file without having to create a new file?
As @Shawn's excellent comment points out, the order in which you pipe your data is the problem. By the time awk reaches the 12th line from the end, sum is already 2.00198e+09; adding the many small fractions no longer changes the displayed value, so every line appears to show "the same output".
Simply:
tail -12 OriginalFile.csv | awk 'BEGIN { FS = ";" } ; { sum += $3 } { print $3, sum }'
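If spawning the extra tail process is undesirable, awk alone can keep a ring buffer of the last N values and sum only those at the end. A minimal sketch with n=2 on made-up toy data (the question would use n=12 against field 3 of the real file):

```shell
#!/bin/sh
cat > /tmp/vals.csv <<'EOF'
t1;x;1.5
t2;x;2.5
t3;x;3.0
t4;x;4.0
EOF
result=$(awk -F';' -v n=2 '
    { buf[NR % n] = $3 }              # cyclically overwrite: only the last n survive
    END {
        for (i in buf) sum += buf[i]  # sum the surviving n values
        printf "%.1f\n", sum
    }' /tmp/vals.csv)
printf '%s\n' "$result"
```

With n=2 only the final two values (3.0 and 4.0) remain in the buffer, so the sum is 7.0; the huge running total that swamped the fractions never gets built.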

Merge files where some columns matched

Match columns 1, 2 and 3 in both files; where they are equal, write the value of column 4 from file1 into file2.
If there is no match, write NA.
file1
31431 37150 100 10100
31431 37201 100 12100
31431 37471 100 14100
file2
31431 37150 100 14100
31431 37131 100 14100
31431 37201 100 14100
31431 37478 100 14100
31431 37471 100 14100
Desired output:
31431 37150 100 14100 10100
31431 37131 100 14100 NA
31431 37201 100 14100 12100
31431 37478 100 14100 NA
31431 37471 100 14100 14100
I tried
awk '
FNR==NR{
a[$1 $2 $3]=$4
next
}
($1 in a){
$1=a[$1]
found=1
}
{
$0=found==1?$0",":$0",NA"
sub(/^...../,"&,")
$1=$1
found=""
}
1
' FS=" " file1 FS=" " OFS="," file2
$ awk ' {k=$1 FS $2 FS $3}
NR==FNR {a[k]=$4; next}
{$(NF+1)=k in a?a[k]:"NA"}1' file1 file2
31431 37150 100 14100 10100
31431 37131 100 14100 NA
31431 37201 100 14100 12100
31431 37478 100 14100 NA
31431 37471 100 14100 14100
Could you please try the following.
awk 'FNR==NR{a[$1,$2,$3]=$NF;next} {print $0,($1,$2,$3) in a?a[$1,$2,$3]:"NA"}' Input_file1 Input_file2
Or, creating a variable for the fields as per Ed's comment:
awk '{var=$1 OFS $2 OFS $3} FNR==NR{a[var]=$NF;next} {print $0,var in a?a[var]:"NA"}' Input_file1 Input_file2
Output will be as follows.
31431 37150 100 14100 10100
31431 37131 100 14100 NA
31431 37201 100 14100 12100
31431 37478 100 14100 NA
31431 37471 100 14100 14100
Explanation: an explanation of the above code follows.
awk '
{
    var=$1 OFS $2 OFS $3  ##Create a variable named var from the first, second and third fields of the current line of Input_file1 or Input_file2.
}
FNR==NR{                  ##Condition FNR==NR is TRUE while the first file, Input_file1, is being read.
    a[var]=$NF            ##Create an array named a whose index is var and whose value is $NF of the current line.
    next                  ##The next keyword skips all further statements for this line.
}
{
    print $0,var in a?a[var]:"NA"  ##Print the current line followed by either a[var] or NA, depending on whether var is present in array a.
}' Input_file1 Input_file2 ##Mention the Input_file names here.
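The same lookup pattern, run end-to-end on tiny stand-ins for the two input files (the /tmp names are placeholders):

```shell
#!/bin/sh
cat > /tmp/in1.txt <<'EOF'
1 2 3 100
EOF
cat > /tmp/in2.txt <<'EOF'
1 2 3 999
4 5 6 999
EOF
result=$(awk '
    { k = $1 OFS $2 OFS $3 }        # composite key: first three fields
    NR == FNR { a[k] = $NF; next }  # first file: remember its column 4
    { print $0, (k in a ? a[k] : "NA") }
' /tmp/in1.txt /tmp/in2.txt)
printf '%s\n' "$result"
```

The first line of the second file shares a key with the first file, so 100 is appended; the second line has no match and gets NA.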

remove lines that do not match specific digits in list file using awk

I am trying to use awk to remove the lines in file whose second field does not match the digits after NM_ (but before the .) in $2 of list. Thank you :).
file
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905 chr7 + 138145078 138270332 138145293
list
TRIM24 NM_015905.2
awk
awk -v OFS="\t" '{ sub(/\r/, "") } ; NR==FNR { N=$2 ; sub(/\..*/, "", $2); A[$2]=N; next } ; $2 in A { $2=A[$2] } 1' list file > out
current output
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905.2 chr7 + 138145078 138270332 138145293
desired output (line 1 removed as that is the line that does not match)
204 NM_015905.2 chr7 + 138145078 138270332 138145293
awk 'NR==FNR{split($2,f2,".");a[f2[1]];next} $2 in a' list file
$ awk -F'[ .]' 'NR==FNR{a[$2];next}$2 in a' list file
204 NM_015905 chr7 + 138145078 138270332 138145293
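To get exactly the desired output (non-matching lines dropped and the versioned accession restored), the OP's own substitution only needs the print guard from the answers above. A sketch on shortened copies of the sample data:

```shell
#!/bin/sh
cat > /tmp/list.txt <<'EOF'
TRIM24 NM_015905.2
EOF
cat > /tmp/file.txt <<'EOF'
204 NM_003852 chr7
204 NM_015905 chr7
EOF
result=$(awk '
    NR == FNR {                    # list: map bare accession -> versioned one
        v = $2
        sub(/\..*/, "", $2)
        a[$2] = v
        next
    }
    $2 in a { $2 = a[$2]; print }  # file: keep only matches, restoring the version
' /tmp/list.txt /tmp/file.txt)
printf '%s\n' "$result"
```

Only the NM_015905 line survives, and its field 2 comes back with the .2 suffix from list.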