In-place remove last n lines of files without opening them more than once in gawk? - awk

https://www.baeldung.com/linux/remove-last-n-lines-of-file
awk -v n=3 'NR==FNR{total=NR;next} FNR==total-n+1{exit} 1' input.txt input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
Here is a way to remove the last n lines, but it is not done in-place, the file is read twice, and it only handles one file at a time.
How can I remove the last n lines of many files in-place, without opening them more than once, with one gawk command and without using any other external commands?

With your shown samples, please try the following awk code. It uses no external utilities, as requested; we can make use of awk's END block here.
awk -v n="3" '
{
    total = FNR
    lines[FNR] = $0
}
END{
    till = total - n
    for(i=1; i<=till; i++){
        print lines[i]
    }
}
' Input_file
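To actually overwrite the file rather than print to stdout, here is a sketch of a self-contained variant that writes back to FILENAME from the END block. It relies on the input being fully consumed before the output redirection truncates the file, so treat it as an illustration rather than a hardened tool:
awk -v n="3" '
{
    lines[FNR] = $0                 # buffer every line
}
END{
    for(i=1; i<=FNR-n; i++){        # emit all but the last n lines
        print lines[i] > FILENAME   # the redirection truncates, then rewrites the file
    }
}
' Input_file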

A single-pass awk solution that requires neither arrays nor gawk (unless your file is over 500 MB, in which case it might be slightly slower):
rm -f file.txt
jot -c 30 51 > file.txt
gcat -n file.txt | rs -t -c$'\n' -C'#' 0 5 | column -s'#' -t
1 3 7 9 13 ? 19 E 25 K
2 4 8 : 14 # 20 F 26 L
3 5 9 ; 15 A 21 G 27 M
4 6 10 < 16 B 22 H 28 N
5 7 11 = 17 C 23 I 29 O
6 8 12 > 18 D 24 J 30 P
mawk -v __='file.txt' -v N='13' 'BEGIN {
OFS = FS = RS
RS = "^$"
getline <(__); close(__)
print $!(NF -= NF < (N+=_==$NF) ? NF : N) >(__) }'
gcat -n file.txt | rs -t -c$'\n' -C'#' 6 | column -s'#' -t ;
1 3 7 9 13 ?
2 4 8 : 14 #
3 5 9 ; 15 A
4 6 10 < 16 B
5 7 11 = 17 C
6 8 12 >
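If the golfed form is hard to parse, here is a de-golfed sketch of the same single-pass idea (variable names are mine): slurp the whole file into one record whose fields are lines, drop the last N fields, then write the remainder back over the original file:
mawk -v file='file.txt' -v N='13' 'BEGIN {
    OFS = FS = "\n"              # fields are the lines of the file
    RS = "^$"                    # slurp the entire file as one record
    getline < file; close(file)  # read it exactly once
    if ($NF == "") N++           # a trailing newline leaves an empty last field
    NF -= (NF < N ? NF : N)      # chop the last N fields, rebuilding $0 via OFS
    print > file                 # overwrite the file with what is left
}'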
Speed is hardly a concern:
A 115K-row, 198 MB file took 0.254 secs:
rows = 115567. | UTF8 chars = 133793410. | bytes = 207390680.
( mawk2 -v __="${fn1}" -v N='13' ; )
0.04s user 0.20s system 94% cpu 0.254 total
rows = 115554. | UTF8 chars = 133779254. | bytes = 207370006.
A 5.98-million-row, 988 MB file took 1.44 secs:
rows = 5983333. | UTF8 chars = 969069988. | bytes = 1036334374.
( mawk2 -v __="${fn1}" -v N='13' ; )
0.33s user 1.07s system 97% cpu 1.435 total
rows = 5983320. | UTF8 chars = 969068062. | bytes = 1036332426.

Another way to do it, using GNU awk with the BEGINFILE and ENDFILE special patterns:
{ lines[++numLines] = $0 }
BEGINFILE { fname = FILENAME }
ENDFILE { prt() }
function prt(   lineNr, maxLines) {
    close(fname)          # input is fully read by ENDFILE time
    printf "" > fname     # truncate the file
    maxLines = numLines - n
    for ( lineNr=1; lineNr<=maxLines; lineNr++ ) {
        print lines[lineNr] > fname
    }
    close(fname)
    numLines = 0
}
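Assuming the script above is saved as, say, trim.awk (the name is mine), it handles any number of files in one invocation:
gawk -v n=3 -f trim.awk file1.txt file2.txt file3.txt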

I find that this is the most succinct solution to the problem.
$ gawk -i inplace -v n=3 -v ORS= -e '{ lines[FNR] = $0 RT }
ENDFILE {
    for (i=1; i<=FNR-n; ++i) {
        print lines[i]
    }
}' -- file{1..3}.txt

Related

How to calculate anomaly using awk

I have a file:
file.txt
1 32
2 34
3 32
4 43
5 25
6 34
7 65
8 34
9 23
10 44
I would like to find the anomaly on the second column:
My script below prints anomalies considering the values in rows 2 to 10; it does not consider row 1.
awk 'FNR==NR{
    f=1;
    if($1 >= 1 && $1 <= 10){
        count++;
        SUM+=$2;
    };
    next
}
FNR==1 && f==1{
    AVG=SUM/count;
    next
}
($1 >= 1 && $1 <= 10){
    print $1, $2-AVG
}
' file.txt file.txt
My desired output:
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
I got a solution for it:
awk '{f=$1>=1 && $1<=10}f && NR==FNR{sum+=$2; c++; next}f{ print $1, $2-(sum/c) }' file.txt file.txt
I am still wondering why the first script is not giving the correct answer.
Since this is just a two-column file, it can also be done in a single awk pass:
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i in map) print i, map[i] - mean}' file
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
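One caveat: for (i in map) traverses the array in an unspecified order, so the neatly sorted output above is not guaranteed. A sketch that indexes numerically instead, assuming the first column really is 1..NR as in the sample:
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i = 1; i <= NR; i++) print i, map[i] - mean}' file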
The first script in the OP is not giving the correct value because you skip the first line in the second pass over the file. This happens in the block FNR==1 && f==1{ AVG=SUM/count; next }: due to the next statement, the deviation from the mean is never computed for the first record.
This is an efficient computation of the deviation from the mean in a double pass:
awk '(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
If the file contains values bigger than 10 or smaller than 1 in the first column, but you only want to see results for values in the range [1,10], then you can do:
awk '($1<1 || $1>10) {next}
(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
There are still other optimizations that can be done, but these only become beneficial when working with extremely large files (many millions of lines).
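One such optimization is a single-pass variant that buffers the column values and defers all printing to the END block, trading the second read of the file for memory (a sketch, keeping the same [1,10] filter):
awk '($1>=1 && $1<=10){a[NR]=$1; b[NR]=$2; s+=$2; c++}
END {for(i=1;i<=NR;i++) if(i in a) print a[i], b[i]-s/c}' file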

How to print the next or previous line using awk?

I have a file with 8 columns
1743 abc 04 10 29 31 34 35
1742 def 11 19 21 23 27 52
1741 ghi 15 18 20 32 48 49
and I also have an awk one-liner that prints the complete line containing some specific numbers. The code is:
awk -v col=1 '{ delete c; for (i=col; i<=NF; ++i) ++c[$i];
if (c['"$1"']>0 && c['"$2"']>0 && c['"$3"']>0 && c['"$4"']>0) print }'
< input_file
(the variables $1, $2, $3 and $4 are there because I'm calling it from bash).
In the previous example, when I pass the numbers 11, 21, 27 and 52, I get the line 1742.
How can I print the next or the previous line? As in the example above: if I use the numbers 11, 21, 27 and 52, how do I get the line 1743 or the line 1741?
$ cat a.sh
echo "BEFORE"
awk -v p1="$1" -v p2="$2" -v p3="$3" -v p4="$4" -v col=1 -f before.awk file
echo "AFTER"
awk -v p1="$1" -v p2="$2" -v p3="$3" -v p4="$4" -v col=1 -f after.awk file
Quoting @triplee: "To print the previous line, remember the previous line in a variable."
$ cat before.awk
prev {
    delete c
    for (i=col; i<=NF; ++i) ++c[$i]
    if (c[p1]>0 && c[p2]>0 && c[p3]>0 && c[p4]>0) print prev
}
{ prev = $0 }
Again, @triplee: "To print the next line, remember that you want to, and print and reset this variable on the next iteration."
$ cat after.awk
f { print; f = 0 }
{
    delete c
    for (i=col; i<=NF; ++i) ++c[$i]
    if (c[p1]>0 && c[p2]>0 && c[p3]>0 && c[p4]>0) f = 1
}
$ ./a.sh 11 21 27 52
BEFORE
1743 abc 04 10 29 31 34 35
AFTER
1741 ghi 15 18 20 32 48 49
A different approach, with double scanning:
$ awk -v search="11 21 27 52" -v offset=-1 '
NR==FNR {n=split(search,s);
for(i=1;i<=n;i++) if(FS $0 FS !~ FS s[i] FS) next;
line=NR; next}
FNR==line+offset' file{,}
1743 abc 04 10 29 31 34 35
You can set offset to any value (not just -1, 0, 1).
N.B. It only finds one match, though; if there are multiple matches, only the last one will be reported. This can be handled by keeping the matched line numbers in an array instead of a scalar (the line variable here), as sketched below.
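A sketch of that array-based variant: the first pass records every matching line number (shifted by offset), and the second pass prints any line whose number was recorded:
$ awk -v search="11 21 27 52" -v offset=-1 '
NR==FNR {n=split(search,s);
for(i=1;i<=n;i++) if(FS $0 FS !~ FS s[i] FS) next;
hits[NR+offset]; next}
FNR in hits' file{,}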

How to sum a selection of columns?

I'd like to sum multiple columns in a text file similar to this:
GeneA Sample 34 7 8 16
GeneA Sample 17 7 10 91
GeneA Sample 42 9 8 11
I'd like to generate the sum at the bottom of columns 3-5 so it will look like:
GeneA Sample 34 7 8 16
GeneA Sample 17 7 10 91
GeneA Sample 42 9 8 11
93 23 26
I can use this for a single column but don't know how to specify a range of columns:
awk -F'\t' '{sum+=$3} END {print sum}' input_file > out
The easiest way is just to repeat the summing for each column, i.e.:
awk -F '\t' '{
s3 += $3
s4 += $4
s5 += $5
}
END {
print s3, s4, s5
}' input_file > out
In awk:
$ awk '
{
for(i=3;i<=NF;i++) # loop wanted fields
s[i]+=$i } # sum to hash, index on field #
END {
for(i=3;i<=NF;i++) # same old loop
printf "%s%s",s[i],(i==NF?ORS:OFS) } # output
' file
93 23 26 118
Currently the for loop goes through every numeric field. Change the loop bounds if needed.
$ awk -v OFS='\t' '{s3+=$3; s4+=$4; s5+=$5; $1=$1} 1;
END {print "","",s3,s4,s5}' file
GeneA Sample 34 7 8 16
GeneA Sample 17 7 10 91
GeneA Sample 42 9 8 11
93 23 26
Try this. Note that NF is the number of fields, and awk indexing starts at 1, so this example covers the range from column 3 to the last column.
awk '{ for(i=3;i<=NF;i++) sum[i] += $i } END { for(i=3;i<=NF;i++) printf( "%d ", sum[i] ); print "" }' input_file
If you want fewer columns, say 3 and 4, then I'd suggest:
awk '{ for(i=3;i<=4 && i<=NF;i++) sum[i] += $i } END { for(i=3;i<=4 && i<=NF;i++) printf( "%d ", sum[i] ); print "" }' input_file
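A parameterized sketch of the same idea, with the range passed in via variables (the names lo and hi are mine):
awk -v lo=3 -v hi=5 '{ for(i=lo;i<=hi && i<=NF;i++) sum[i] += $i } END { for(i=lo;i<=hi;i++) printf( "%d ", sum[i] ); print "" }' input_file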

awk: Repeat the values of a column n times

I have a column and I want to repeat it multiple times.
For example, I want the following column
1
2
3
4
5
to repeat n times and become
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
.
.
.
etc.
Is this possible?
Replace 3 with any value of n and $1 with the column you want to print:
$ for i in {1..3}; do awk '{print $1}' file; done
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
You could do it just with awk but you would have to store the whole file in memory:
$ awk '{a[NR]=$1}END{for(i=1;i<=3;i++)for(j=1;j<=NR;j++)print a[j]}' file
This is one of those occasions where using the shell constructs actually makes sense.
awk -v n=7 '{s = s $1 ORS} END{for (i=1;i<=n;i++) printf "%s", s}' file
Set n to whatever number you like.
Alternatively, with GNU awk:
awk -v n=7 -v RS='\0' -v ORS= 'END{for (i=1;i<=n;i++) print}' file
Here's a portable awk-based solution using only built-in variables, with no looping at all:
{m,n,g}awk 'NF += __-(OFS = $-_)^_' RS='^$' FS='^$' ORS= __=3
1 <<<<
2
3
4
1 <<<<
2
3
4
1 <<<<
2
3
4
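De-golfed: RS='^$' and FS='^$' make the whole file a single record and a single field, OFS is set to that whole text, and growing NF forces awk to rebuild $0 with the file contents as the glue between the new empty fields. The same effect, spelled out with an explicit loop for readability (RS="^$" as a slurp idiom is a gawk/mawk-ism, not strict POSIX):
awk -v n=3 'BEGIN{RS="^$"; ORS=""} {for(i=1;i<=n;i++) printf "%s", $0}' file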
To repeat each row N times consecutively before moving on to the next row, do:
{m,n,g}awk 'NF += __-(OFS = "\n" $-_)^_' FS='^$' __=3
1 <<<<
1 <<<<
1 <<<<
2
2
2
3
3
3
4
4
4
If you want to duplicate each row horizontally N times, it's the same concept, slightly modified:
{m,n,g}awk 'NF = __ *(OFS = $-_)^_' FS='^$' __=7
10101010101010
11111111111111
12121212121212
13131313131313
14141414141414
15151515151515
16161616161616
17171717171717
18181818181818
19191919191919
You can even make cute numeric pyramids by basing the repeat count on NR:
{m,n,g}awk 'NF = NR * (OFS=$-_)^_' FS='^$'
10
1111
121212
13131313
1414141414
151515151515
16161616161616
1717171717171717
181818181818181818
19191919191919191919
2020202020202020202020
So if you ever need 2^25 (aka 32^5) copies of the ASCII string of 2^25 itself, here's a pretty quick (0.45 secs) and easy way to get it done with no loops at all:
out9: 256MiB 0:00:00 [ 591MiB/s] [ 591MiB/s] [ <=> ]
( echo 33554432 | mawk2 'NF=(OFS=$-_)^_*$-_' FS='^$'; )
0.30s user 0.10s system 89% cpu 0.454 total
1 268435457
Even if you want to repeat pretty sizable files, it's reasonably fast: e.g. less than 13 secs for 9 copies of a 1.85 GB file:
{m,g}awk 'NF += __-(OFS = $-_)^_' RS='^$' FS='^$' ORS= __=9
out9: 16.6GiB 0:00:12 [1.30GiB/s] [1.30GiB/s] [ <=> ]
in0: 1.85GiB 0:00:00 [3.42GiB/s] [3.42GiB/s] [========>] 100%
5.48s user 4.44s system 77% cpu 12.809 total
112448475 17851902237 17851902237
If that's too slow for your needs, perhaps this way is a bit faster:
{m,g}awk '
BEGIN { FS = RS = "^$" }{__=sprintf("%*s",__, OFS = ORS = _)
gsub(".", "\\&", __) } sub(".+",__)' __=9
in0: 1.85GiB 0:00:00 [3.48GiB/s] [3.48GiB/s] [========>] 100%
out9: 16.6GiB 0:00:07 [2.12GiB/s] [2.12GiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C mawk2 -v __=9 ; )
0.72s user 4.34s system 64% cpu 7.855 total
112448475 17851902237 17851902237
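The second version leans on the fact that an unescaped & in sub()'s replacement text stands for the matched text: __ is first turned into N literal & characters, so sub(".+", __) replaces the whole slurped record with N copies of itself. A toy demonstration of the same mechanism, hard-coding three copies:
printf 'hello\n' | awk 'BEGIN{RS="^$"; ORS=""} {sub(/.+/,"&&&")}1'    # prints the input three times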

rearrange columns using awk or cut command

I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column; if you have 1000 columns, it moves the 1000th column.
EDIT
If the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
Here is an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give any input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could print the desired fields in a loop, as sketched below.
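A hedged sketch of that loop approach, producing the same 1 2 10 11 ... ordering as EDIT3 above:
awk 'BEGIN{FS=OFS="\t"}
{ out = $1 OFS $2 OFS $10 OFS $11
  for (i=3; i<=NF; i++) if (i != 10 && i != 11) out = out OFS $i
  print out }' file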
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL)
# example usage: cat file | mycut 3 2 1
# note: column indices here are 0-based, unlike cut's 1-based fields
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"
for line in sys.stdin:
    parts = line.split(delimiter)
    print("\t".join([parts[col] for col in columns]))
I'm thinking about adding the other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would need its own page.
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
    unset n
    n="{ print "
    while [ "$1" ]; do
        n="$n\$$1\" \" "
        shift
    done
    n="$n }"
    awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh