How to make awk only run after certain headings?

I have a file with tables of values under certain headings. Like so:
Series A
a 01 20
b 21 43
c 44 59
d 60 72
Series B
a 01 25
b 26 48
c 49 70
d 71 90
My question is, how can I make awk only draw information from the specified series? So if awk is told to look at "Series B", column 2, it will output that column.

If the data is separated by a blank line then you can use something like this -
[jaypal:~/Temp] cat file
Series A
a 01 20
b 21 43
c 44 59
d 60 72
Series B
a 01 25
b 26 48
c 49 70
d 71 90
[jaypal:~/Temp] awk '/Series B/,/^$/ { if (NF==3) print $2 }' file
01
26
49
71
Or you can do something similar in sed:
[jaypal:~/Temp] sed -n '/Series B/,/^$/s/\(.*[^ ]\) \(.*[^ ]\) \(.*\)/\2/p' file
01
26
49
71
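Both of the above rely on a blank line (or the end of the file) to close the range. If the headings follow one another with no blank lines in between, as in the question's sample, a sketch using a flag that is reset at every heading works as well (assuming every heading starts with Series and the wanted one matches $0 exactly):
awk -v s='Series B' -v c=2 '/^Series/ { on = ($0 == s); next } on { print $c }' file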

Assuming the tables are separated by empty line(s):
awk '$0 ~ p, !NF {
if (!($0 ~ p) && NF) print $c
}' p='Series B' c=2 infile
$0 ~ p, !NF - the pair of patterns separated by a comma specify a range of records.
The first one evaluates to true if the current record $0 matches the pattern indicated by
the value of the variable p. The second one is true when the value of NF (Number of Fields)
is 0 (empty or blank line).
if (!($0 ~ p) && NF) print $c - if the current record doesn't match the pattern p
(i.e. skip the header), and (&&) the current record contains at least one column (NF != 0)
print the desired column.
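For example, pointing the same one-liner at Series A, column 3 (still assuming a blank line separates the tables) should print:
awk '$0 ~ p, !NF {
  if (!($0 ~ p) && NF) print $c
  }' p='Series A' c=3 infile
20
43
59
72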

I prefer this way:
awk -v sect="$1" -v col="$2" '
  /Series/ { header = $0; next }
  NF > 1 {
    set[header] = set[header] $col "|";
  }
  END { print set[sect]; }
' $3 | tr '|' '\n'
This introduces an associative array and a temporary separator (not needed if you don't want to sort on "Series"; just use "\n" and omit the tr). It doesn't rely on blank lines: "|" is used as a temporary separator and tr turns it back into newlines.
This is my usual approach because in real life I can build a custom "sorting column", pipe to Unix sort, and then cut it back off. nawk (e.g. on AIX) has no built-in sort, and Unix sort is far superior to any hand-made sort in awk.
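As a sketch of how this is meant to be called (assuming the snippet is saved as a script, hypothetically named series.sh, so that $1, $2 and $3 are the series name, the column number and the data file), this should print:
./series.sh 'Series B' 2 infile
01
26
49
71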

Related

How to print, with awk, duplicated fields in a column that have a specific amount of duplication

Data:
id  name    city       language  area_code
01  Juan    Cali       ES        44
01  José    Cali       ES        44
01  Pedro   Cali       ES        44
02  Albert  Edinburgh            19
02  Mark               En        19
03  Raisa   Hellsinki  FI        22
03  Lisa    Hellsinki
04  Gian    Roma       IT        33
05  Loris   Sicilia
05  Vera    Sicilia              31
The file containing this data is in the following format:
01;Juan;Cali;ES;44
01;José;Cali;ES;44
01;Pedro;Cali;ES;44
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
04;Gian;Roma;IT;33
05;Loris;Sicilia;;
05;Vera;Sicilia;;31
In this data, the rows with id = 02, 03 and 05 have the id field duplicated exactly twice, so no matter what the rest of the data says, I need to be able to select only those rows whose id is duplicated exactly twice. The expected result would be:
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
05;Loris;Sicilia;;
05;Vera;Sicilia;;31
So far I have only found a way to select rows that are duplicated any number of times, with this code:
awk -F';' -v OFS=';' 'a[$1]++{print $0}' data.file
But I haven't been able to figure out the way to obtain only those lines with the id duplicated twice...
Update: like U2, I still haven't found what I'm looking for, but I have a new awk command that I think is closer:
awk -F';' -v OFS=';' '{a[$1]++; if (a[$1] == 2) {print $0}}' data.file
It correctly leaves out the row with id 04, but it includes id 01, which is repeated three times rather than exactly twice...
In 2 passes:
$ awk -F';' 'NR==FNR{cnt[$1]++; next} cnt[$1]==2' file file
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
05;Loris;Sicilia;;
05;Vera;Sicilia;;31
or in 1 pass if your input is grouped by the first field as shown in your example (you can always sort it if not):
$ awk -F';' '
$1 != prev { if (cnt == 2) print buf; prev=$1; buf=$0; cnt=1; next }
{ buf=buf ORS $0; cnt++ }
END { if (cnt == 2) print buf }
' file
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
05;Loris;Sicilia;;
05;Vera;Sicilia;;31
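If the input is not grouped by the first field and you would rather not sort it, a buffering variant in a single pass is also possible. A minimal sketch, assuming the whole file fits in memory (note that for (id in buf) visits ids in an unspecified order, so the output may not keep the original row order):
awk -F';' '
    { cnt[$1]++; buf[$1] = ($1 in buf) ? buf[$1] ORS $0 : $0 }
    END { for (id in buf) if (cnt[id] == 2) print buf[id] }
' data.file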
By and large, it is quicker to sort post-filtered rows. Here is a single-pass awk-based solution with no pre-sorting or post-sorting needed: the output follows the original input row order, so pre-sorted input means pre-sorted output. The downside of this approach is that it needs 2 arrays.
{m,g}awk -F';.+$' '{
sub("$", NR "=", ____[$((_____[NR] = $-_)^_)])
} END {
FS = OFS = ORS (___=_="")
$-_=_
for (__ in ____) {
if ((_=____[__]) ~"^[0-9]+=[0-9]+=$") {
$+_=_____[+_]
sub("^[0-9]+[=]",___,_)
$+_=_____[+_]
} }
print $((__=(_=FS)"+")*gsub((_)__,_)*sub("^"__,___)) }'
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
05;Loris;Sicilia;;
05;Vera;Sicilia;;31

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed; I can have more than 7 rows and more than 4 columns):
R H A 23
S E A 45
T E A 34
U   A 35
Y T A 35
O E A 353
J G B 23
I want to select the second column when the third column is A, but keep the newline or whitespace characters.
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
Which has a weird % and is missing the space.
You may use this gnu-awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
awk splits each record using the width values provided in the FIELDWIDTHS variable.
1 1 1 1 1 1 * means each of the first 6 columns is a single character wide and the remaining text goes into the 7th column. Since there is a space after each value, $2, $4 and $6 each hold a single space, while $1, $3 and $5 hold the values from the input.
$5 == "A" {s = s $3}: Here we check whether $5 is A and, if that condition is true, keep appending the value of $3 to a variable s. In the END block we just print variable s.
Without fixed-width parsing, awk would treat the A in the 4th row as $2.
Alternatively, if we make the spaces part of the column values, use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file
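FIELDWIDTHS is a gawk extension; with other awks, a rough substr()-based sketch works for this particular layout (assuming single-character values at fixed positions 1, 3 and 5), and should also print HEE TE:
awk 'substr($0,5,1) == "A" { s = s substr($0,3,1) } END { print s }' file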

How to print columns containing value

Let's say I have a data file containing the following:
1 2 3 4 5
67 88 12 32 22
9 99 34 59 86
17 0 78 0 77
11 0 0 0 43
I would like to have code that searches through each column for the number 0. If the number 0 is found, the code will print that entire column to a separate file.
With this data, the outputted file would look like so:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
It'd be great if the code didn't require knowing the exact number of columns and/or rows.
This will do what you want. It does not require knowing anything about how many rows or columns are present.
$ awk 'FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next} {tab="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",tab,$i; tab="\t"};print ""}' file file
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
How it works
Because the file name is specified twice on the command line, the awk script will read the file twice, the first time to look for zeros, the second time to print.
FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next}
On the first run through the file, a[i] is set to one for any column i that has a zero in it.
This code only applies to the first run through because of the condition FNR==NR. NR is the total number of records (lines) that we have read so far. FNR is the number of records (lines) that we have read so far from the current file. Thus, when FNR==NR, we are still reading the first file. The next at the end of the commands tells awk to skip the remaining commands and start over on the next line.
tab="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",tab,$i; tab="\t"};print ""
When we are reading through the file for the second time, we print out each column i for which a[i] is non-zero. I chose tab-separated output but, by simply adjusting the printf statement, any format could be used.
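For readability, here is the same logic written out as a standalone awk script (a functionally equivalent sketch; zerocols.awk is just a hypothetical file name):
# zerocols.awk -- run as: awk -f zerocols.awk file file
# first pass: mark every column that contains a 0
FNR == NR {
    for (i = 1; i <= NF; i++)
        if ($i == 0) a[i] = 1
    next
}
# second pass: print only the marked columns, tab-separated
{
    tab = ""
    for (i = 1; i <= NF; i++)
        if (a[i]) { printf "%s%s", tab, $i; tab = "\t" }
    print ""
}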
sed '#n
# init and load line in buffer (1st line copied, other added)
s/.*/>& /;1!H;1h
# at end of file, load buffer in working area
$ {x
:cycle
# keep column if zero inside
/>[[:blank:]]*0[[:blank:]]/ s/>\(\([[:blank:]]*[0-9]\{1,\}\)[[:blank:]][[:graph:][:blank:]]*\)/\2>\1/g
# remove treated column
s/>[[:blank:]]*[0-9]\{1,\}\([[:blank:]]\{1,\}[[:graph:][:blank:]]*\)/>\1/g
# is there another column to treat ?
/>[[:blank:]]*[0-9][[:graph:][:blank:]]/ b cycle
# print result after cleanup
s/>//gp
}' YourFile
Self-commented sed; POSIX version, so use --posix with GNU sed.
Here is an interesting way of doing it with GNU awk:
parse.awk
# Record number of columns (assuming all rows have the same number of fields)
NR == 1 { n = NF }

# First parse: remember which columns contain `pat`
FNR == NR {
  for(i=1; i<=NF; i++)
    if($i == pat) {
      h[i] = i
      last = i>last ? i : last
    }
  next
}

# Before second parse: switch to reading one field at a time
ENDFILE {
  RS = "[ \t\n]+"
}

# Second parse: print field if current-record-number modulo
# number-of-columns is in the `h` hash
{ m = FNR % n }

m in h {
  ORS = (m == last) ? "\n" : OFS   # print newline after last column
  print $1
}
Run it like this for example:
awk -f parse.awk pat=0 infile infile
Output:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
Or with OFS='\t':
awk -f parse.awk pat=0 OFS='\t' infile infile
Output:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0

move certain columns to end using awk

I have a large tab-delimited file with 1000 columns. I want to rearrange it so that certain columns are moved to the end.
Could anyone help using awk?
Example input:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Move columns 5,6,7,8 to the end.
Output:
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
This prints columns 1 to a, then b to the last, and then columns a+1 to b-1:
$ awk -v a=4 -v b=9 '{for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i};for (i=a+1;i<b;i++) {printf "%s\t",$i};print""}' file
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
The columns are moved in this way for every line in the input file, however many lines there are.
How it works
-v a=4 -v b=9
This defines the variables a and b which determine the limits on which columns will be moved.
for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i}
This prints all columns except the ones from a+1 to b-1.
In this loop, i is incremented by one except when i==a in which case it is incremented by b-a so as to skip over the columns to be moved. This is done with awk's ternary statement:
i += i==a ? b-a : 1
+= simply means "add to." i==a ? b-a : 1 is the ternary statement. The value that it returns depends on whether i==a is true or false. If it is true, the value before the colon is returned. If it is false, the value after the colon is returned.
for (i=a+1;i<b;i++) {printf "%s\t",$i}
This prints columns a+1 to b-1.
print""
This prints a newline character to end the line.
Alternative solution that avoids printf
This approach assembles the output into the variable out and then prints with a plain print command, avoiding printf and the need for percent signs:
awk -v a=4 -v b=9 '{out="";for (i=1;i<=NF;i+=i==a?b-a:1) out=out $i"\t";for (i=a+1;i<b;i++) out=out $i "\t";print out}' file
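A small variant of that sketch builds the line with a separator variable so there is no trailing tab (purely cosmetic; same a and b assumptions as above):
awk -v a=4 -v b=9 '{
  out = sep = ""
  for (i = 1; i <= NF; i += i==a ? b-a : 1) { out = out sep $i; sep = "\t" }
  for (i = a+1; i < b; i++) out = out sep $i
  print out
}' file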
One way to swap 2 columns ($5 becomes $20 and $20 becomes $5) while the rest stay unchanged:
$ awk '{x=$5; $5=$20; $20=x; print}' file.txt
For 4 columns (two swapped pairs):
$ awk '{
    x=$5; $5=$20; $20=x;
    y=$9; $9=$10; $10=y;
    print
}' file.txt
My approach:
awk 'BEGIN{ f[5];f[6];f[7];f[8] } \
{ for(i=1;i<=NF;i++) if(!(i in f)) printf "%s\t", $i; \
for(c in f) printf "%s\t", $c; printf "\n"} ' file
It's split into 3 parts:
The BEGIN{} part determines which fields should be moved to the end: the indexes of the array f are the columns to move. In the example they are 5, 6, 7 and 8.
Cycle through every field (it doesn't matter if there are 1000 fields or more) and check whether it is in the array. If not, print it.
Now we need the skipped fields: cycle through the f array and print those values.
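Here is a sketch of the same idea with the column list passed in as a variable rather than hard-coded in BEGIN; it also keeps the moved columns in their listed order, since for (c in f) visits keys in an unspecified order (move='5,6,7,8' is just the example list):
awk -v move='5,6,7,8' '
  BEGIN { n = split(move, m, ","); for (j = 1; j <= n; j++) f[m[j]] }
  {
    out = ""
    for (i = 1; i <= NF; i++) if (!(i in f)) out = out $i "\t"
    for (j = 1; j <= n; j++) out = out $(m[j]) "\t"
    sub(/\t$/, "", out)
    print out
  }' file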
Another way in awk
Swap fields A through B with the last B-A+1 fields:
awk -vA=4 -vB=8 '{x=B-A;for(i=A;i<=B;i++){y=$i;$i=$(t=(NF-x--));$t=y}}1' file
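For reference, an expanded, commented sketch of what that one-liner does (my reading of it; it should behave the same for A=4, B=8):
awk -v A=4 -v B=8 '{
  x = B - A                    # offset from the end, counts down
  for (i = A; i <= B; i++) {
    t = NF - x--               # matching field counted from the tail
    y = $i; $i = $t; $t = y    # swap $i with $t
  }
  print
}' file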
Put N fields from the end into position A:
awk -vA=3 -vB=8 '{split($0,a," ");x=A++;while(x++<B)$x=a[NF-(B-x)];while(B++<NF)$B=a[A++]}1' file

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines of A and B; I want to randomly select lines from File2, but with exactly the same number of A and B lines as in File1, which means just 3 random lines with A and two with B.
Anyone have an idea of how to do this with awk?
Thanks
#!/bin/bash
# split file2 into xx00 (the A lines) and xx01 (the B lines); csplit reports
# byte counts, so take the line counts of each piece with wc -l instead
csplit -s file2 '/^B /'
acount=$(wc -l < xx00)
bcount=$(wc -l < xx01)
awk -v "acount=$acount" -v "bcount=$bcount" '
NR == FNR {
    arr[$1]++;                 # how many A and B lines we need, from file1
    next
}
! setup {
    setup = 1
    srand()                    # seed the generator so each run differs
    # pick arr["A"] distinct random line numbers out of 1..acount
    while (arandcount < arr["A"]) {
        line = int(rand() * acount) + 1
        if (! alines[line]) {
            alines[line] = 1
            arandcount++
        }
    }
    # pick arr["B"] distinct random line numbers out of 1..bcount
    while (brandcount < arr["B"]) {
        line = int(rand() * bcount) + 1
        if (! blines[line]) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && FNR in alines {
    print
}
FILENAME == "xx01" && FNR in blines {
    print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits file2 into those two pieces at the first line matching the regex. Since csplit reports byte counts rather than line counts, wc -l is used to get the number of lines in each piece, and those counts are passed into the AWK program as variables.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks the line numbers to select by repeatedly choosing a random number between 1 and the line count of the corresponding piece, until as many distinct line numbers have been picked as file1 has lines of that "type". This block only gets executed once because of the setup flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and prints it if it is.
This might work for you:
grep '^A' file2 | sort -R | head -$(grep -c '^A' file1) >file3
grep '^B' file2 | sort -R | head -$(grep -c '^B' file1) >>file3
N.B. This assumes file1 is sorted.
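If GNU shuf is available, the same idea can be written a little more directly (shuf -n picks the requested number of random lines; same assumptions about file1 and file2 as above):
grep '^A' file2 | shuf -n "$(grep -c '^A' file1)" > file3
grep '^B' file2 | shuf -n "$(grep -c '^B' file1)" >> file3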