How to edit the number in the i-th column of the n-th row with the bash shell - awk

I have a data file like this:
1 7.1 3.506678
2 4.7 4.700778
3 5.9 5.901760
4 8.2 4.397694
.....
I would like to replace the third column of several lines (for example, the 2nd and 3rd lines) with the same number. Can anyone suggest a solution? This is what I tried:
for i in 2 3; do
    awk -v line="$i" 'NR==line {gsub(/4\.7/, "2\.3", $3)} 1'
done
Expected Output
1 7.1 3.506678
2 4.7 2.300778 #changed
3 5.9 2.301760 #changed
4 8.2 4.397694

First create a backup of your input file just in case.
Then maybe this loop could work?
for i in 3 4; do
    awk -v row="$i" -v column=3 -v new_string="some_value" 'BEGIN{OFS="\t"} NR==row{sub(/.*/,new_string,$column)} 1' stack.tsv > out.tsv
    cp out.tsv stack.tsv
done
INPUT:
cat stack.tsv
1 7.1 3
2 4.7 4
3 5.9 5
4 8.2 4
OUTPUT:
cat stack.tsv
1 7.1 3
2 4.7 4
3 5.9 some_value
4 8.2 some_value

$ awk 'BEGIN{FS=OFS=" "; a[2]; a[3]} NR in a{sub(/.../,2.3,$3)} 1' file
1 7.1 3.506678
2 4.7 2.300778
3 5.9 2.301760
4 8.2 4.397694
or in general for many lines:
$ awk 'BEGIN{split("2 3",tmp); for (i in tmp) a[tmp[i]]; FS=OFS=" "} NR in a{sub(/.../,2.3,$3)} 1' file
1 7.1 3.506678
2 4.7 2.300778
3 5.9 2.301760
4 8.2 4.397694

Repeatedly running Awk on the same file is inelegant, slow, and somewhat brittle. Refactor your script to perform all the changes in one go.
awk -v lines=2:3 'BEGIN { i = split(lines, n, ":"); for(j=1; j<=i; ++j) line[n[j]]++ }
    FNR in line { sub(/^[0-9]\.[0-9]/, "2.3", $3) } 1' file >newfile
Awk will mangle (or, well, normalize) the spacing on the lines it modifies; if this is a significant problem, there are ways to preserve the original spacing, but that will complicate the solution.
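If you do need to keep it, here is a minimal sketch assuming GNU awk, whose four-argument split() captures the original separators (the row numbers and replacement come from the question; treat this as illustrative, not a tested drop-in):
awk 'NR==2 || NR==3 {
    n = split($0, f, FS, seps)            # gawk only: seps[i] holds the spacing after f[i]
    sub(/^[0-9]+\.[0-9]/, "2.3", f[3])    # edit the third field in isolation
    $0 = seps[0]                          # leading whitespace, if any
    for (i = 1; i <= n; i++) $0 = $0 f[i] seps[i]
} 1' file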


How to swap the last two column pairs with awk?

I am trying this
awk '{B=$(NF-1);A=$NF; $NF=$(NF-2); $(NF-1) = $(NF-3); $(NF-2)=A; $(NF-3) = B; print;}' input_text.txt
but I get the error:
awk: cmd. line:1: (FILENAME=cazzo.txt FNR=2) fatal: attempt to access field -1
Sample input:
$ cat input_text.txt
1 7 9 11 0 5 2
The same happens if I replace the spaces with tabs in the input_text.txt file.
Expected output:
1 7 9 5 2 11 0
I am running with Cygwin on Windows 10.
You can try this awk for swapping values:
awk 'NF > 3 {a=$NF; b=$(NF-1); $NF=$(NF-2); $(NF-1)=$(NF-3); $(NF-3)=b; $(NF-2)=a} 1' file
1 7 9 5 2 11 0
If there are DOS line breaks then use:
awk -v RS='\r?\n' 'NF > 3 {a=$NF; b=$(NF-1); $NF=$(NF-2); $(NF-1)=$(NF-3); $(NF-3)=b; $(NF-2)=a} 1' file
If you have gnu awk then you can use this regex based approach:
awk -v RS='\r?\n' 'NF > 3 {
$0 = gensub(/(\S+\s+\S+)(\s+)(\S+\s+\S+)$/, "\\3\\2\\1", "1")} 1' file
1 7 9 5 2 11 0
To swap the last n fields with the n fields before them:
$ awk -v n=2 'NF>=(2*n){ for (i=NF-(n-1); i<=NF; i++) {t=$i; $i=$(i-n); $(i-n)=t} } 1' file
1 7 9 5 2 11 0
$ awk -v n=3 'NF>=(2*n){ for (i=NF-(n-1); i<=NF; i++) {t=$i; $i=$(i-n); $(i-n)=t} } 1' file
1 0 5 2 7 9 11
With your shown samples, please try the following code. It is generic: the two awk variables fromFields and toFields list the fields to exchange, pairwise. For example, to swap the 4th field with the 6th AND the 5th field with the 7th, set fromFields="4,5" and toFields="6,7". It is assumed that the values given are feasible for the Input_file.
awk -v fromFields="4,5" -v toFields="6,7" '
BEGIN{
  num1 = split(fromFields, arr1, ",")
  num2 = split(toFields, arr2, ",")
}
{
  for (i = 1; i <= num1; i++) {
    tmp = $arr1[i]
    $arr1[i] = $arr2[i]
    $arr2[i] = tmp
  }
}
1
' Input_file
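For the question's sample line this produces the expected result: field 4 swaps with field 6 (11 with 5) and field 5 with field 7 (0 with 2), so
1 7 9 11 0 5 2
becomes
1 7 9 5 2 11 0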

Which AWK program can do this manipulation?

Given a file containing a structure arranged like the following (with fields separated by SP or HT)
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
Which AWK program do I need to get the following output?
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
Thanks in advance for any and all help.
Postscript
Please bear in mind,
My input file is much larger than the one depicted in this question.
My computer science skills are seriously limited.
This task has been imposed on me.
awk -v n=4 '
    # join fields $start..$end into one line, ending with a newline
    function join(start, end,   result, i) {
        for (i=start; i<=end; i++)
            result = result $i (i==end ? ORS : FS)
        return result
    }
    {
        # append each group of n fields to its output block
        c = 0
        for (i=1; i<NF; i+=n) {
            c++
            col[c] = col[c] join(i, i+n-1)
        }
    }
    END {
        for (i=1; i<=c; i++)
            printf "%s", col[i]    # the value already ends with a newline
    }
' file
The awk info page has a short primer on awk, so read that too.
Benchmarking
create an input file with 1,000,000 columns and 8 rows (as specified by OP)
#!perl
my $cols = 2**20;    # 1,048,576
my $rows = 8;
my @alphabet = ( 'a'..'z', 0..9 );
my $size = scalar @alphabet;
for ($r = 1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}
$ perl createfile.pl > input.file
$ wc input.file
8 8388608 16777224 input.file
time various implementations: I use the fish shell, so the timing output is different from bash's
my awk
$ time awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 3.62 secs fish external
usr time 3.49 secs 0.24 millis 3.49 secs
sys time 0.11 secs 1.96 millis 0.11 secs
$ wc output.file
2097152 8388608 16777216 output.file
Timur's perl:
$ time perl -lan columnize.pl input.file > output.file
________________________________________________________
Executed in 3.25 secs fish external
usr time 2.97 secs 0.16 millis 2.97 secs
sys time 0.27 secs 2.87 millis 0.27 secs
Ravinder's awk
$ time awk -f columnize.ravinder input.file > output.file
________________________________________________________
Executed in 4.01 secs fish external
usr time 3.84 secs 0.18 millis 3.84 secs
sys time 0.15 secs 3.75 millis 0.14 secs
kvantour's awk, first version
$ time awk -f columnize.kvantour -v n=4 input.file > output.file
________________________________________________________
Executed in 3.84 secs fish external
usr time 3.71 secs 166.00 micros 3.71 secs
sys time 0.11 secs 1326.00 micros 0.11 secs
kvantour's second awk version: Ctrl-C interrupted after a few minutes
$ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
^C
________________________________________________________
Executed in 260.80 secs fish external
usr time 257.39 secs 0.13 millis 257.39 secs
sys time 1.68 secs 2.72 millis 1.67 secs
$ wc output.file
9728 38912 77824 output.file
The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.
dawg's python
$ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
[... 60 seconds later ...]
$ wc output.file
2049 8196 16392 output.file
another interesting data point: using different awk implementations. I'm on a Mac with GNU awk and mawk installed via homebrew
with many columns, few rows
$ time gawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 3.78 secs fish external
usr time 3.62 secs 174.00 micros 3.62 secs
sys time 0.13 secs 1259.00 micros 0.13 secs
$ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 17.73 secs fish external
usr time 14.95 secs 0.20 millis 14.95 secs
sys time 2.72 secs 3.45 millis 2.71 secs
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 2.01 secs fish external
usr time 1892.31 millis 0.11 millis 1892.21 millis
sys time 95.14 millis 2.17 millis 92.97 millis
with many rows, few columns, this test took over half an hour on a MacBook Pro, 6 core Intel cpu, 16GB ram
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 32.30 mins fish external
usr time 23.58 mins 0.15 millis 23.58 mins
sys time 8.63 mins 2.52 millis 8.63 mins
Use this Perl script:
perl -lane '
    push @rows, [@F];
    END {
        my $delim = "\t";
        my $cols_per_group = 2;
        my $col_start = 0;
        while ( 1 ) {
            for my $row ( @rows ) {
                print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
            }
            $col_start += $cols_per_group;
            last if ($col_start + $cols_per_group - 1) > $#F;
        }
    }
' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
This script reads the file into memory. This is okay for most modern computers and the file sizes in question.
Each line is split on whitespace (use -F'\t' for TAB as the delimiter) into the array @F. A reference to this array is stored for each line as an element of the array @rows. After the file is read, at the end of the script (in the END { ... } block), the contents of the file are printed in groups of columns, with $cols_per_group columns per group. Columns are delimited by $delim.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
Could you please try the following; it is written and tested in GNU awk with ONLY the shown samples.
awk '
{
  for (i=1; i<=NF; i+=2) {
    arr[i] = (arr[i] ? arr[i] ORS : "") $i OFS $(i+1)
  }
}
END{
  for (i=1; i<=NF; i+=2) {
    print arr[i]
  }
}' Input_file
Since we all love awk, here is another one:
awk -v n=2 '{for(i=1;i<=NF;++i) { j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}'
This will output exactly what is requested by the OP.
How does it work?
Awk performs actions per record/line it reads. For each record, it processes all the fields and appends them to a set of strings stored in an array a. It does this in such a way that a[0] contains the first n columns, a[1] the second set of n columns, and so on. The relation between field number and string number is the expression int((i-1)/n); for n=2, fields 1-2 land in a[0], fields 3-4 in a[1], etc.
When building the strings, we keep track of whether to append a field separator OFS or a newline (the record separator ORS). We decide this based on the field number modulo the group size n. Note that we always use ORS when we process the last field.
An alternative approach:
Thanks to a comment by Dawg, a flaw in the above code was found: it scales really badly when moving to large files. The real reason is not 100% known, but I assume it is the cost of constantly rewriting memory in operations such as a[j] = a[j] $i (i%n==0 || i==NF ? ORS : OFS). This can be eliminated by buffering the entire file and doing all the work at the end:
awk -v n=2 '{ a[NR]=$0 }
END{
  for (i=1; i<=NF; i+=n)
    for (j=1; j<=NR; ++j) {
      $0 = a[j]
      for (k=0; k<n && (i+k<=NF); ++k)
        printf "%s%s", $(i+k), ((i+k==NF || (i+k)%n==0) ? ORS : OFS)
    }
}' file
Note: the latter is only efficient for a small number of columns, because of the constant re-splitting done by $0=a[j]; the split takes much more time with a large number of fields. The complexity of this approach is O(NF^2 * NR).
A final alternative approach: while the first solution is fast for a large number of columns and a small number of rows, and the second is fast for a small number of columns and a large number of rows, the version below is not as fast but is stable, producing the same timing for a file and for its transpose.
awk -v n=2 '{
  for (i=1; i<=NF; i+=n) {
    s = ""
    for (k=0; k<n && (i+k<=NF); ++k)
      s = s $(i+k) ((i+k==NF || (i+k)%n==0) ? ORS : OFS)
    a[i,NR] = s
  }
}
END{ for (i=1; i<=NF; i+=n) for (j=1; j<=NR; ++j) printf "%s", a[i,j] }' file
My old answer is below and no longer applicable...
You can use this awk for a file that could be millions of rows or millions of columns. The basic scheme is to suck all the values into a single array then use indexing arithmetic and nested loops at the end to get the correct order:
$ cat col.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (col_offset=0; col_offset+cols <= NF; col_offset+=cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}
$ awk -f col.awk -v cols=2 file
4 5
m d
t 7
h 5
...
3 4
4 1
d f
5 9
q w
My old answer below was based on the dramatic slowdown seen in most of these awks with a large number of rows.
See this question for more discussion regarding the slowdown.
The comments in the original answer below are no longer applicable.
OLD ANSWER
Only here for consistency...
The awk solutions here are all good for small files. What they all have in common is that the file either needs to fit in RAM, or the OS's virtual memory must be an acceptable fallback when it does not. With a larger file, since the runtime of the awk grows far faster than linearly, you can get a very bad result: with a 12MB version of your file, an in-memory awk becomes unusably slow.
This is the case when there are millions of rows, not millions of columns.
The only alternative to an in-memory solution is reading the file multiple times or managing temp files yourself. (Or use a scripting language that manages VM internally, such as Perl or Python... Timur Shatland's Perl is fast even on huge files.)
awk does not have an easy mechanism to loop over a file multiple times until a process is done. You would need to use the shell to do that and invoke awk multiple times.
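A minimal sketch of that shell-driven approach (variable names are mine; one awk pass per group of n columns):
n=2
total=$(awk '{ print NF; exit }' file)
i=1
while [ "$i" -le "$total" ]; do
    awk -v start="$i" -v n="$n" '{
        out = $start
        for (k = start+1; k < start+n && k <= NF; k++) out = out OFS $k
        print out
    }' file
    i=$((i + n))
done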
Here is a python script that reads the file line by line and prints cols columns at a time until all the original columns have been printed:
$ cat pys.py
import sys

filename = sys.argv[1]
cols = int(sys.argv[2])
offset = 0
delimiter = " "

# number of columns, taken from the first line
with open(filename, "r") as f:
    max_cols = len(f.readline().split())

# re-read the file once per group of columns
while offset < max_cols:
    with open(filename, "r") as f:
        for line in f:
            col_li = line.rstrip().split()
            l = len(col_li)
            max_cols = max_cols if max_cols > l else l
            print(delimiter.join(col_li[offset:offset + cols]))
    offset += cols
It is counterintuitive, but it is often significantly faster and more efficient to read a file multiple times than it is to gulp the entire thing -- if that gulp then results in a bad result with larger data.
So how does this perform compared to one of the awks in this post? Let's time it.
Given your example, the in memory awk will likely be faster:
$ cat file
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
$ time python pys.py file 2 >file2
real 0m0.027s
user 0m0.009s
sys 0m0.016s
$ time awk -v n=2 '{for(i=1;i<=NF;++i) {
j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}' file >file3
real 0m0.009s
user 0m0.003s
sys 0m0.003s
And that is true. BUT, let's make the file 1000x bigger with this Python script:
txt='''\
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
'''
with open('/tmp/file', 'w') as f:
    f.write(txt*1000)  # change the 1000 to the multiple desired
# the file will have 8000 lines and about 125KB
Rerun those timings, same way, and you get:
#python
real 0m0.061s
user 0m0.044s
sys 0m0.015s
# awk
real 0m0.050s
user 0m0.043s
sys 0m0.004s
About the same time... Now make the file BIGGER by multiplying the original by 100,000 to get 800,000 lines and 12MB and run the timing again:
# python
real 0m3.475s
user 0m3.434s
sys 0m0.038s
#awk
real 22m45.118s
user 16m40.221s
sys 6m4.652s
With a 12MB file, the in-memory method becomes essentially unusable, since the VM system on this computer is subject to massive disc swapping to manage that particular type of memory allocation. It is likely O(n**2) or worse. This computer is a 2019 Mac Pro with a 16-core Xeon and 192GB of memory, so it is not the hardware...

awk: Search missing value in file

awk newbie here! I am asking for help to solve a simple specific task.
Here is file.txt
1
2
3
5
6
7
8
9
As you can see a single number (the number 4) is missing. I would like to print on the console the number 4 that is missing. My idea was to compare the current line number with the entry and whenever they don't match I would print the line number and exit. I tried
cat file.txt | awk '{ if ($NR != $1) {print $NR; exit 1} }'
But it prints only a newline.
I am trying to learn awk via this small exercise. I am therefore mainly interested in solutions using awk. I also welcome an explanation of why my code does not do what I would expect.
Try this -
awk '{ if (NR != $1) {print NR; exit 1} }' file.txt
4
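As for why your version prints only a newline: $NR does not mean "the record number" there; it means "field number NR". On line 2 it fetches field 2 of a one-field line, which is empty, so the comparison "" != "2" succeeds and the empty $NR is printed. A quick demo of what $NR expands to on each line:
awk '{ print NR, "->", $NR }' file.txt
1 -> 1
2 ->
3 ->
...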
Since you already have a solution, here is another approach: compare with the previous value.
awk '$1!=p+1{print p+1} {p=$1}' file
Your positional comparison won't work if you have more than one missing value.
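And a sketch of a variant that prints every number in a gap, however wide, not just the first:
awk '{ for (v = p+1; v < $1; v++) print v; p = $1 }' file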
Maybe this will help:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
4
Since I am learning awk as well
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
awk explanation: keep a running counter i, incremented once per line; when the counter does not match the number in $1, print the counter (the missing value) and resynchronize i to $1.
cat file
1
2
3
5
6
7
8
9
11
12
13
15
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
10
14

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print the fields of lines 6 and 7 in two columns. Selecting the lines can be done using NR. The tricky part is this: every second field should go into the second column, and an E should be inserted before the exponent's sign, so that the output file looks like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you can see that I want to keep only the first length($6)=10 characters of $6.
How is it possible to do it in awk?
You can do it all in awk, but it is perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed, or embed something similar in the awk script (using gsub):
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there are various ways to do it (the simplest being gsub(/ /,"",$<field number>) on the relevant fields), but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
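For example, stripping the blanks from the printed fields in the script above might look like this (a sketch):
NR>5 && NR<8 {
    for (i=1; i<NF; i+=4) {
        m = $i "E" $(i+1); e = $(i+2) "E" $(i+3)
        gsub(/ /, "", m); gsub(/ /, "", e)
        print m, e
    }
}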
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer using substr, so the following does the trick!
awk 'NR>=6 && NR<=7{ $6=substr($6,1,10)
    for (i=1; i<=6; i+=2)
        print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9) }' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

AWK: Divide any element of any row by some element of another row

I have got a text file with some structure like this:
2 2 4 5 6
1 9 7 6 2
1 5 2 8 5
I want to be able to divide any element of any row by an element of another row. For example if I wanted to divide the 3rd element of the 1st row by the 2nd element of the 3rd row that would give:
4/5 = 0.8
Couldn't figure out a smart way to do this with AWK. Suggestions?
This MAY be what you want but it's hard to tell without more details and the expected output:
$ awk -v num=1,5 -v den=3,3 '{for (i=1;i<=NF;i++) cell[NR","i]=$i} END{print (cell[den] ? cell[num]/cell[den] : "NaN")}' file
3
$ awk -v num=3,4 -v den=1,2 '{for (i=1;i<=NF;i++) cell[NR","i]=$i} END{print (cell[den] ? cell[num]/cell[den] : 0)}' file
4
If (i1, j1) and (i2, j2) are the coordinates of the numerator and the denominator, you can do this:
i1=1
j1=3
i2=3
j2=2
awk 'NR=='$i1'{a=$'$j1'} NR=='$i2' {b=$'$j2'} END {print a"/"b " = " a/b}' file
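A sketch of the same thing using awk -v variables instead of splicing the shell variables into the quoted program (r1/c1 and r2/c2 are names chosen here for the numerator and denominator coordinates):
awk -v r1="$i1" -v c1="$j1" -v r2="$i2" -v c2="$j2" '
    NR == r1 { a = $c1 }
    NR == r2 { b = $c2 }
    END { print a "/" b " = " a/b }
' file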