AWK split file every 50 occurrences of a string - awk

I have a beginner question about awk.
I am using the line below to split a file into separate files, using 'MATCH' as my delimiter.
awk 'BEGIN{flag=0} /MATCH/{flag++;next} {print $0 > (flag ".txt")}' file
My file is very long, but it has the form shown below:
MATCH
a
b
c
d
MATCH
a
b
I want to have the above awk line split my file every 50 'MATCH' occurrences. The current command creates a new file for each 'MATCH' occurrence. I am sure there is a simple way to achieve this, but I have not figured it out yet. I have tried using the line below with no luck.
awk 'BEGIN{flag=0} /MATCH/{flag++ == 50;next} {print $0 > (flag ".txt")}' file
I appreciate the help and guidance.

Untested, using any awk:
awk '
/MATCH/ && ( ( (++matchCnt) % 50 ) == 1 ) {
    close(out)
    out = (++outCnt) ".txt"
}
{ print > out }
' file
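To see why the % 50 == 1 test opens a new output file on the 1st, 51st, 101st, ... match, here is a tiny illustration (not part of the solution itself):
$ awk 'BEGIN { for (m = 1; m <= 120; m++) if (m % 50 == 1) print "MATCH #" m " -> new output file" }'
MATCH #1 -> new output file
MATCH #51 -> new output file
MATCH #101 -> new output file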

Assumptions:
the number of lines in a MATCH block is not known beforehand
the number of lines in a MATCH block could vary
the MATCH lines are to be copied to the output files
Sample input with 9 MATCH blocks:
$ cat file
MATCH
1.1
1.2
MATCH
2.1
2.2
MATCH
3.1
3.2
MATCH
4.1
4.2
MATCH
5.1
5.2
MATCH
6.1
6.2
MATCH
7.1
7.2
MATCH
8.1
8.2
MATCH
9.1
9.2
One awk idea:
awk -v blkcnt=3 '                  # for OP case set blkcnt=50
BEGIN   { outfile = ++fcnt ".txt" }
/MATCH/ { if (++matchcnt > blkcnt) {
              close(outfile)
              outfile = ++fcnt ".txt"
              matchcnt = 1
          }
          # next                   # uncomment if the "MATCH" lines are *NOT* to be copied to the output files
        }
        { print $0 > outfile }
' file
For blkcnt=3 this generates:
$ head -40 {1..3}.txt
==> 1.txt <==
MATCH
1.1
1.2
MATCH
2.1
2.2
MATCH
3.1
3.2
==> 2.txt <==
MATCH
4.1
4.2
MATCH
5.1
5.2
MATCH
6.1
6.2
==> 3.txt <==
MATCH
7.1
7.2
MATCH
8.1
8.2
MATCH
9.1
9.2
For blkcnt=4 this generates:
$ head -40 {1..3}.txt
==> 1.txt <==
MATCH
1.1
1.2
MATCH
2.1
2.2
MATCH
3.1
3.2
MATCH
4.1
4.2
==> 2.txt <==
MATCH
5.1
5.2
MATCH
6.1
6.2
MATCH
7.1
7.2
MATCH
8.1
8.2
==> 3.txt <==
MATCH
9.1
9.2

If I've understood correctly, the first 50 blocks of a,b,c,d lines should be written to 1.txt, the next 50 to 2.txt and so on.
This can be achieved by building the filename from the integer value of ((flag-1)/50) and adding 1 to it (assuming you want the file series to begin with 1 and not 0), so that flag values 1 through 50 all map to 1.txt.
The BEGIN block can be removed as variables are set to 0 when first created if no value is given and they are used numerically.
Thus the following should achieve the desired output:
awk '/MATCH/{flag++;next} {print $0 > ((int((flag-1)/50)+1) ".txt")}' file
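A quick sanity check of the filename arithmetic (flag values picked arbitrarily):
$ echo "1 50 51 100 101" | awk '{ for (i = 1; i <= NF; i++) print $i, (int(($i-1)/50)+1) ".txt" }'
1 1.txt
50 1.txt
51 2.txt
100 2.txt
101 3.txt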

So while this isn't a complete solution, it does show how to capture each group of rows introduced by a "MATCH" line; once you have counted off every 50 groups, you can print them out in one shot, bearing in mind that you need to trim off the trailing "MATCH" and save it for the next round.
nice jot 53 | mawk 'NR % 6 != 1 || ($!NF = "MATCH")^_' |
mawk '{ printf(" :: input row(s) = %8u\n ::" \
" output row # = %8u\n " \
"-------------------\n %s%s " \
"----END-NEW-ROW----\n\n", NF^!!NF, NR, $!(NF = NF), ORS)
}' RS='(^)?MATCH\r?\n' ORS='MATCH\n' FS='\n' OFS='\f'
:: input row(s) = 1
:: output row # = 1
-------------------
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 2
-------------------
2
3
4
5
6
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 3
-------------------
8
9
10
11
12
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 4
-------------------
14
15
16
17
18
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 5
-------------------
20
21
22
23
24
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 6
-------------------
26
27
28
29
30
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 7
-------------------
32
33
34
35
36
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 8
-------------------
38
39
40
41
42
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 9
-------------------
44
45
46
47
48
MATCH
----END-NEW-ROW----
:: input row(s) = 5
:: output row # = 10
-------------------
50
51
52
53
MATCH
----END-NEW-ROW----

Inplace remove last n lines of files without opening them more than once in gawk?

https://www.baeldung.com/linux/remove-last-n-lines-of-file
awk -v n=3 'NR==FNR{total=NR;next} FNR==total-n+1{exit} 1' input.txt input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
Here is a way to remove the last n lines. But it is not done inplace and the file is read twice, and it only deals with one file at a time.
How can I inplace remove the last n lines of many files without opening them more than once with one gawk command but without using any other external commands?
With your shown samples, please try the following awk code. It uses no external utilities, as requested, and makes use of awk's END block.
awk -v n="3" '
{
  total=FNR
  lines[FNR]=$0
}
END{
  till=total-n
  for(i=1;i<=till;i++){
    print lines[i]
  }
}
' Input_file
A single-pass awk solution that requires neither arrays nor gawk (unless your file is over 500 MB, in which case it might be slightly slower):
rm -f file.txt
jot -c 30 51 > file.txt
gcat -n file.txt | rs -t -c$'\n' -C'#' 0 5 | column -s'#' -t
1 3 7 9 13 ? 19 E 25 K
2 4 8 : 14 # 20 F 26 L
3 5 9 ; 15 A 21 G 27 M
4 6 10 < 16 B 22 H 28 N
5 7 11 = 17 C 23 I 29 O
6 8 12 > 18 D 24 J 30 P
mawk -v __='file.txt' -v N='13' 'BEGIN {
OFS = FS = RS
RS = "^$"
getline <(__); close(__)
print $!(NF -= NF < (N+=_==$NF) ? NF : N) >(__) }'
gcat -n file.txt | rs -t -c$'\n' -C'#' 6 | column -s'#' -t ;
1 3 7 9 13 ?
2 4 8 : 14 #
3 5 9 ; 15 A
4 6 10 < 16 B
5 7 11 = 17 C
6 8 12 >
Speed is hardly a concern :
115K rows 198 MB file took 0.254 secs
rows = 115567. | UTF8 chars = 133793410. | bytes = 207390680.
( mawk2 -v __="${fn1}" -v N='13' ; )
0.04s user 0.20s system 94% cpu 0.254 total
rows = 115554. | UTF8 chars = 133779254. | bytes = 207370006.
5.98 million rows 988 MB file took 1.44 secs
rows = 5983333. | UTF8 chars = 969069988. | bytes = 1036334374.
( mawk2 -v __="${fn1}" -v N='13' ; )
0.33s user 1.07s system 97% cpu 1.435 total
rows = 5983320. | UTF8 chars = 969068062. | bytes = 1036332426.
Another way to do it, using GAWK's BEGINFILE and ENDFILE special patterns:
{ lines[++numLines] = $0 }
BEGINFILE { fname = FILENAME }
ENDFILE   { prt() }

function prt(   lineNr, maxLines) {
  close(fname)
  printf "" > fname
  maxLines = numLines - n
  for ( lineNr=1; lineNr<=maxLines; lineNr++ ) {
    print lines[lineNr] > fname
  }
  close(fname)
  numLines = 0
}
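Assuming the snippet above is saved as remove_last_n.awk (a made-up name), it could be invoked with the line count and the target files, for example:
gawk -v n=3 -f remove_last_n.awk file1.txt file2.txt file3.txt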
I find that this is the most succinct solution to the problem.
$ gawk -i inplace -v n=3 -v ORS= -e '{ lines[FNR] = $0 RT }
ENDFILE {
  for (i=1; i<=FNR-n; ++i) {
    print lines[i]
  }
}' -- file{1..3}.txt

How to calculate anomaly using awk

I have a file:
file.txt
1 32
2 34
3 32
4 43
5 25
6 34
7 65
8 34
9 23
10 44
I would like to find the anomaly in the second column:
My script below prints anomalies considering the row-2 to row-10 values; it is not considering the row-1 value.
awk 'FNR==NR{
       f=1;
       if($1 >= 1 && $1 <= 10){
         count++;
         SUM+=$2;
       };
       next
     }
     FNR==1 && f==1{
       AVG=SUM/count;
       next
     }
     ($1 >= 1 && $1 <= 10){
       print $1, $2-AVG
     }
' file.txt file.txt
My desired output:
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
I got a solution for it:
awk '{f=$1>=1 && $1<=10}f && NR==FNR{sum+=$2; c++; next}f{ print $1, $2-(sum/c) }' file.txt file.txt
I am still wondering why the first script is not giving the correct answer.
Since this is just a 2-column file, this can also be done in a single awk pass:
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i in map) print i, map[i] - mean}' file
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
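One caveat: for (i in map) does not guarantee any particular order, so the lines above may come out shuffled in some awks. Since the first column here is simply 1..NR (and the mean is 366/10 = 36.6, hence 32 - 36.6 = -4.6 for the first row), a minimal ordered variant (untested sketch) is:
awk '{map[$1] = $2; s += $2}
     END {mean = s/NR; for (i = 1; i <= NR; i++) print i, map[i] - mean}' file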
The first script in the OP is not giving the correct value because you skip the first line in the second pass over your file. This happens in the statement (FNR==1 && f==1) { AVG=SUM/count; next }: due to the next statement, the deviation from the mean is never computed for the first record.
This is an efficient computation of the deviation from the mean in a double pass:
awk '(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
If the file contains values bigger than 10 or smaller than 1 in the first column, but you only want to see this for values in the range [1,10], then you can do:
awk '($1<1 || $1>10) {next}
(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
There are still other optimizations that can be done, but these only become beneficial when working with extremely large files (many millions of lines).

Awk - Conditionally print an element from a certain row, based on the condition of a different element in a different row

Say I have a lot of files with a consistent number of columns and rows, and a sample one looks like this:
1 2 3
4 5 6
7 8 9
I want to print column 3 of row 2, but only if column 3 of row 3 == 4 (in this case it is 9). I'm using this logic as a means to determine whether the file is valid for my use-case, and to extract the relevant field if it is.
My attempt, based on other answers to people asking how to isolate certain rows, was this: awk 'BEGIN{FNR=3} $3=="4"{FNR=2;print $2}'
so you are looking for something like this?
awk 'FNR==2{ x = $3 }FNR==3 && $3=="4"{ print x }' file.txt
cat file.txt
1 2 3
4 5 6
7 8 4
Output:
6
cat file.txt
1 2 3
4 5 6
7 8 9
Output:
Nothing since column 3 of row 3 is 9
awk 'FNR==3 && $3==4{print p} {p=$3}' *
Here's another, which doesn't care about the order in which the records appear. In the OP the problem was to print a value (v) from the 2nd record based on the tested value (t) in the 3rd record. This solution allows the test value to appear in an earlier record than the value to be printed:
$ awk '
FNR==2 {          # record holding the value to print
  v=$3
  f=1             # flag indicating the value v has been read
}
FNR==3 {          # record holding the value to test
  t=$3
  g=1             # test value read indicator
}
f && g {          # once the value and the test value are acquired
  if(t==4)        # test the test value
    print v       # output
  exit            # and exit
}' file
6
Record order reversed (FNR values changed in the code):
$ cat file2
1 2 3
7 8 4 # records
4 5 6 # reversed
$ awk 'FNR==3{v=$3;f=1}FNR==2{t=$3;g=1}f&&g{if(t==4)print v;exit}' file2
6
Flags f and g are used instead of testing v and t directly, in case either value happens to be empty ("").

print a line from every 5 elements of a column

I am looking for a way to select a column (e.g. the eighth column) of a data file and write the first five numbers of that column in a row, the next five numbers in a second row, and so on.
I have been testing with awk and printf without success.
The awk way to do this is to switch between OFS and ORS as the output separator, using the modulus operator:
$ seq 1 20 | awk '{printf "%s", $1 (NR % 5 ? OFS : ORS)}'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Change $1 to $8 for the eighth column, for example, and NR % 5 to NR % 10 for rows of 10 instead of 5. The seq command just generates a single column of numbers from 1 to 20 for demonstration.
I also find using xargs useful for this kind of thing:
$ seq 1 20 | awk '{print $1}' | xargs -n5
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
The awk isn't necessary for the example, as seq only produces a single column; for your question, change $1 to $8 to select only the eighth column of your input. With this approach you could also swap awk out for cut, as sketched below.
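For example, assuming the data is tab-separated and lives in a file called data.txt (name made up), the cut variant might look like:
$ cut -f8 data.txt | xargs -n5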
This will also produce the requested format:
seq 1 20 | awk '{printf("%s ", $1); if (NR % 5 == 0) printf("\n")}'
where $1 indicates the column number, which can be changed when passing a file to the awk line.

rearrange columns using awk or cut command

I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column. If you have 1000 columns, it does this with the 1000th column.
EDIT
if the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give any input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could print those fields in a loop, as sketched below.
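For instance, a loop-based sketch (untested) that reproduces the same column order as the one-liner above might look like:
awk 'BEGIN{FS=OFS="\t"}
     {
       out = $1 OFS $2 OFS $10 OFS $11   # columns 1, 2, then 10 and 11
       for (i = 3; i <= NF; i++)         # remaining columns in order,
         if (i != 10 && i != 11)         # skipping 10 and 11
           out = out OFS $i
       print out
     }' file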
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL

signal(SIGPIPE, SIG_DFL)                   # don't raise on a broken pipe (e.g. when piped to head)

# example usage: cat file | mycut 3 2 1
columns = [int(x) for x in sys.argv[1:]]   # cut-style 1-based field numbers
delimiter = "\t"

for line in sys.stdin:
    parts = line.rstrip("\n").split(delimiter)
    print("\t".join(parts[col - 1] for col in columns))
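Assuming the script above is saved as mycut, made executable, and placed on the PATH (all hypothetical), usage mirrors the comment at the top:
$ printf 'a\tb\tc\n' | mycut 3 2 1
c	b	a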
I am thinking about adding other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would need its own page.
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
    unset n;
    n="{ print ";
    while [ "$1" ]; do
        n="$n\$$1\" \" ";
        shift;
    done;
    n="$n }";
    awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange the last six and the first five items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh