How to print columns containing value - awk

Let's say I have a data file containing the following:
1 2 3 4 5
67 88 12 32 22
9 99 34 59 86
17 0 78 0 77
11 0 0 0 43
I would like code that searches through each column for the number 0. If a 0 is found, the code should print that entire column to a separate file.
With this data, the output file would look like this:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
It'd be great if the code didn't require knowing the exact number of columns and/or rows in advance.

This will do what you want. It does not require knowing anything about how many rows or columns are present.
$ awk 'FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next} {tab="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",tab,$i; tab="\t"};print ""}' file file
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
How it works
Because the file name is specified twice on the command line, the awk script will read the file twice, the first time to look for zeros, the second time to print.
FNR==NR{for (i=1;i<=NF;i++)if ($i==0)a[i]=1;next}
On the first run through the file, a[i] is set to one for any column i that has a zero in it.
This code only applies to the first run through because of the condition FNR==NR. NR is the total number of records (lines) that we have read so far. FNR is the number of records (lines) that we have read so far from the current file. Thus, when FNR==NR, we are still reading the first file. The next at the end of the commands tells awk to skip the remaining commands and start over on the next line.
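A quick way to see the difference between the two counters (just a throwaway illustration, assuming the five-line sample data above is saved as file) is to print them while reading the file twice:
$ awk '{print FILENAME, NR, FNR}' file file
file 1 1
file 2 2
file 3 3
file 4 4
file 5 5
file 6 1
file 7 2
file 8 3
file 9 4
file 10 5
During the second reading FNR restarts at 1 while NR keeps climbing, so FNR==NR holds only during the first pass.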
tab="";for (i=1;i<=NF;i++)if (a[i]){printf "%s%s",tab,$i; tab="\t"};print ""
When we are reading through the file for the second time, we print out each column i for which a[i] is non-zero. I chose tab-separated output but, by simply adjusting the printf statement, any format could be used.
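For readability, here is the same two-pass logic written out as a standalone script (a sketch only; the file name zerocols.awk is just an example — run it as awk -f zerocols.awk file file):
# Pass 1: mark every column that contains a 0
FNR == NR {
    for (i = 1; i <= NF; i++)
        if ($i == 0)
            a[i] = 1
    next
}
# Pass 2: print only the marked columns, tab-separated
{
    tab = ""
    for (i = 1; i <= NF; i++)
        if (a[i]) {
            printf "%s%s", tab, $i
            tab = "\t"
        }
    print ""
}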

sed '#n
# init and load line in buffer (1st line copied, other added)
s/.*/>& /;1!H;1h
# at end of file, load buffer in working area
$ {x
:cycle
# keep column if zero inside
/>[[:blank:]]*0[[:blank:]]/ s/>\(\([[:blank:]]*[0-9]\{1,\}\)[[:blank:]][[:graph:][:blank:]]*\)/\2>\1/g
# remove treated column
s/>[[:blank:]]*[0-9]\{1,\}\([[:blank:]]\{1,\}[[:graph:][:blank:]]*\)/>\1/g
# is there another column to treat?
/>[[:blank:]]*[0-9][[:graph:][:blank:]]/ b cycle
# print result after cleanup
s/>//gp
}' YourFile
Self-commented sed.
POSIX version, so use --posix on GNU sed.

Here is an interesting way of doing it with GNU awk:
parse.awk
# Record the number of columns (assuming every record has the same number of fields)
NR == 1 { n = NF }

# First parse: remember which columns contain `pat`
FNR == NR {
  for(i=1; i<=NF; i++)
    if($i == pat) {
      h[i] = i
      last = i>last ? i : last
    }
  next
}

# Before second parse: switch to reading one field at a time
ENDFILE {
  RS = "[ \t\n]+"
}

# Second parse: print the field if current-record-number modulo
# number-of-columns is in the `h` hash
{ m = FNR % n }

m in h {
  ORS = (m == last) ? "\n" : OFS   # print newline after last column
  print $1
}
Run it like this for example:
awk -f parse.awk pat=0 infile infile
Output:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0
Or with OFS='\t':
awk -f parse.awk pat=0 OFS='\t' infile infile
Output:
2 3 4
88 12 32
99 34 59
0 78 0
0 0 0

Related

Delete every occurrence in a file

Let's say I have a file with content in this format:
6 8
6 9
12 20
6
8
9
12
20
35
I want to delete a line if a number on it (from the 1st or 2nd column) is also found on a later line, whether in the first or 2nd column, including the line where the number first appears.
So I should have this as a result:
35
I've tried using
awk '{for(i=1;i<=NF;i++){if($i in a){next};a[$i]}} 1'
Is there a simple way to do this?
This line should help:
awk 'NR==FNR{a[$0]++;next}{for(i=1;i<=NF;i++)if(a[$i]>1)next}7' RS=" |\n" file RS="\n" file
With this input file:
100 200 300
6 8
6 9
12 20
6
8
9
12
20
35
The one-liner above will output:
100 200 300
35
You want to find numbers that only occur once. Let me add "42" to your file on the same line as "35":
$ cat file
6 8
6 9
12 20
6
8
9
12
20
35 42
To find all the numbers that appear only once:
$ tr -s "[:blank:]" "\n" < file | sort -n | uniq -u
35
42
But that doesn't show the line in the file where they appear, so
$ grep -n -Fw -f <(tr -s "[:blank:]" "\n" < file | sort -n | uniq -u) file
9:35 42
I added the -n option just for effect, to show that we have line 9 from the file.
The code you originally wrote works perfectly, but under this condition: print the current line if and only if none of its numbers appeared on a previous line. However, you asked about following lines, not previous ones.
The simplest solution for you would be to tac the input, run your awk, and tac the result back:
$ tac <file> | awk '{for(i=1;i<=NF;i++){if($i in a) next;a[$i]}}1' | tac
If you want to use a single awk, then you have to store the lines first and process them afterwards. It is also a bit more cumbersome because:
you need to count how many times each number appears
and you need a double-pass algorithm (by reading the file twice or by storing it in memory)
This gives you the following:
double pass:
$ awk '(NR==FNR){ for(i=1;i<=NF;i++) a[$i]++; next }
       { b=0; for(i=1;i<=NF;i++) b+=--a[$i] }
       !b' <file> <file>
memory:
$ awk '{ m[NR]=$0; for(i=1;i<=NF;i++) a[$i]++; next }
       END { for (n=1;n<=NR;++n) {
               b=0; $0=m[n];
               for(i=1;i<=NF;i++) b+=--a[$i]
               if (!b) { print }
             }
       }' <file>
This outputs the expected result:
6
8
9
12
20
35
If you want to remove all lines which contain a number that appears more than once in the file, then you simply do:
$ awk '(NR==FNR) { for(i=1;i<=NF;i++) a[$i]++; next }
{ b=1; for(i=1;i<=NF;i++) b = b && (a[$i]==1) }
b' <file> <file>
Note: judging from your input, you might also be suffering from the Windows \r problem, so you might want to do gsub("\r","") before everything else.
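For example (only a sketch of where such a call would go), a rule added at the very top of the program strips the carriage returns from every line before any other processing:
$ awk '{ gsub(/\r/,"") }                               # strip Windows CR first
       (NR==FNR) { for(i=1;i<=NF;i++) a[$i]++; next }
       { b=1; for(i=1;i<=NF;i++) b = b && (a[$i]==1) }
       b' <file> <file>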
Another in awk, processes the data once:
awk '
{
  for(i=1;i<=NF;i++) {          # process numbers in record
    r[$i]=r[$i] NR OFS          # add NR on a record list array
    if(p=split(r[$i],t)>1) {    # if there are more than one record in r
      for(j in t)               # delete records on the list
        delete a[t[j]]
      f=1                       # flag up
    }
  }
  if(f=="")                     # if flag is down
    a[NR]=$0                    # store current record
  f=""                          # reset flag
}
END {
  for(i=1;i<=NR;i++)            # in the end
    if(i in a)                  # print all records stored
      print a[i]
}' file
35
A revised version:
awk '
{
  for(i=1;i<=NF;i++) {
    if($i in r) {               # just store the NR of the last duplicate
      delete a[r[$i]]
      f=1
    }
    r[$i]=NR
  }
  if(f=="")
    a[NR]=$0
  f=""
}
END {
  for(i=1;i<=NR;i++)
    if(i in a)
      print a[i]
}' file
35
You could try something like this:
cat yourFile | tr ' ' '\n' | sort | uniq -u

Adding numbers from the fourth column of the first file to the second file using awk.

Hoping somebody can help me with this task. I have two input files,
File1
# name length av.qual #-reads mx.cov. av.cov GC% CnIUPAC CnFunny CnN CnX CnGap CnNoCov
10-1_rep_c1 1406 80 8017 4637 1641.26 31.98 1 0 4 0 7 0
10-1_rep_c2 832 80 1641 1462 557.34 32.13 1 0 0 0 5 0
10-1_rep_c3 1284 83 4674 2338 1040.80 24.75 7 0 0 0 8 0
10-1_rep_c4 750 83 2335 2017 886.31 24.73 2 0 0 0 3 0
10-1_rep_c5 1180 78 2326 1486 572.51 19.76 1 0 0 0 7 0
File2
>10-1_rep_c1
ttttttttttttttacaataaaatgcrccattattcctttcgtactaaacaatgccttat
ggccaccagatagaaaccaatctgactcacgtcgattttaactcaaatcatgtaaaattc
>10-1_rep_c2
aacagcagaattaatattgttcacaggtttttataaaacgacctattaatgaatttccat
cccctaaaaatggtcggcttacttgatgtaaccaccccctctagttaataataattgtat
>10-1_rep_c3
aattataaaaagaatttttaaagcataaattattagtaattttaagagaaattaaaggta
ttataaaagagtaatagtactgacaaggaaaaacttttatataaaaaaaagaaaatttaa
The outfile I would like to have is,
>10-1_rep_c1_8017
ttttttttttttttacaataaaatgcrccattattcctttcgtactaaacaatgccttat
ggccaccagatagaaaccaatctgactcacgtcgattttaactcaaatcatgtaaaattc
>10-1_rep_c2_1641
aacagcagaattaatattgttcacaggtttttataaaacgacctattaatgaatttccat
cccctaaaaatggtcggcttacttgatgtaaccaccccctctagttaataataattgtat
>10-1_rep_c3_4674
aattataaaaagaatttttaaagcataaattattagtaattttaagagaaattaaaggta
ttataaaagagtaatagtactgacaaggaaaaacttttatataaaaaaaagaaaatttaa
So the fourth column of the first file is appended to the header of each DNA sequence in the second file.
This is an alternative using awk:
awk 'FNR==NR{a[">"$1]="_"$4;next}{print $0a[$0]}' File1 File2
Try the following and let me know if this helps you.
awk 'FNR==NR{a[$1]=$4;next} ($2 in a){print ">"$2"_"a[$2];next} 1' file1 FS=">" file2
Explanation: The first condition, FNR==NR, is only TRUE while the first file, file1, is being read. Both FNR and NR count the lines read from the input, the only difference being that NR keeps increasing across all the files read, while FNR is RESET each time a new input file starts to be read. So while reading the first file we create an array named a whose index is $1 and whose value is $4 (as per your request); the next keyword then makes sure the cursor goes no further and all the following statements are skipped.
The next condition checks whether $2 of file2 is in the array (the field separator for file2 is set to > to remove it from the mix; NOTE: we can set field separators differently for different input files).
If it is present, we print ">", the 2nd field of file2, "_" and array a's value whose index is $2 of file2, then use next to skip the remaining statements. The trailing 1 prints the other lines of file2 unchanged: awk works on a condition-then-action model, and when a condition is TRUE with no action defined, the default action, print, happens for the current line.
Finally, the first input file is named as file1; after it the FS (field separator) value is set to ">" (explained above), and then the second input file is given as file2.
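As a small illustration of the per-file field separator trick (a sketch using the File1 and File2 shown above): each FS=value assignment placed on the command line takes effect for the file names that follow it.
awk '{ print FILENAME ": " NF " fields" }' FS=" " File1 FS=">" File2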

move certain columns to end using awk

I have a large tab-delimited file with 1000 columns. I want to rearrange it so that certain columns are moved to the end.
Could anyone help using awk?
Example input:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Move columns 5,6,7,8 to the end.
Output:
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
This prints columns 1 to a, then b to the last, and then columns a+1 to b-1:
$ awk -v a=4 -v b=9 '{for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i};for (i=a+1;i<b;i++) {printf "%s\t",$i};print""}' file
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
The columns are moved in this way for every line in the input file, however many lines there are.
How it works
-v a=4 -v b=9
This defines the variables a and b which determine the limits on which columns will be moved.
for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i}
This prints all columns except the ones from a+1 to b-1.
In this loop, i is incremented by one except when i==a, in which case it is incremented by b-a so as to skip over the columns to be moved. This is done with awk's ternary (conditional) operator:
i += i==a ? b-a : 1
+= simply means "add to." i==a ? b-a : 1 is the ternary expression. The value it returns depends on whether i==a is true or false. If it is true, the value before the colon is returned; if it is false, the value after the colon is returned.
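As a standalone illustration (values chosen to match a=4 and b=9 from the command above):
$ awk 'BEGIN { i=4; a=4; b=9; print (i==a ? b-a : 1) }'
5
$ awk 'BEGIN { i=3; a=4; b=9; print (i==a ? b-a : 1) }'
1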
for (i=a+1;i<b;i++) {printf "%s\t",$i}
This prints columns a+1 to b-1.
print""
This prints a newline character to end the line.
Alternative solution that avoids printf
This approach assembles the output into the variable out and then prints with a plain print command, avoiding printf and the need for percent signs:
awk -v a=4 -v b=9 '{out="";for (i=1;i<=NF;i+=i==a?b-a:1) out=out $i"\t";for (i=a+1;i<b;i++) out=out $i "\t";print out}' file
One way to swap 2 columns ($5 becomes $20 and $20 becomes $5), with the rest staying unchanged:
$ awk '{x=$5; $5=$20; $20=x; print}' file.txt
for 4 columns:
$ awk '{
x=$5; $5=$20; $20=x;
y=$9; $9=$10; $10=y;
print
}' file.txt
My approach:
awk 'BEGIN{ f[5];f[6];f[7];f[8] } \
{ for(i=1;i<=NF;i++) if(!(i in f)) printf "%s\t", $i; \
for(c in f) printf "%s\t", $c; printf "\n"} ' file
It's split into 3 parts:
The BEGIN{} part determines which fields should be moved to the end. The indices of the array f are the columns to move; in the example they are 5, 6, 7 and 8.
Cycle through every field (it doesn't matter if there are 1000 fields or more) and check whether it is in the array. If not, print it.
Now we need the skipped fields. Cycle through the f array and print those values.
Another way in awk
Switch fields A through B with the last N fields
awk -vA=4 -vB=8 '{x=B-A;for(i=A;i<=B;i++){y=$i;$i=$(t=(NF-x--));$t=y}}1' file
Put N fields from the end into position A
awk -vA=3 -vB=8 '{split($0,a," ");x=A++;while(x++<B)$x=a[NF-(B-x)];while(B++<NF)$B=a[A++]}1' file

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines of A and B. I want to randomly select lines from File2, but exactly the same number of A and B lines as in File1, which means just 3 random lines with A and two with B.
Anyone have an idea on how to do this with awk?
Thanks
#!/bin/bash
read -r acount bcount <<< $(csplit file2 '/^B /')
awk -v "acount=$acount" -v "bcount=$bcount" '
NR == FNR {
arr[$1]++;
next
}
! setup {
setup = 1
while (arandcount < acount) {
line = int(rand() * acount) + 1
if (! alines[line]) {
alines[line] = 1
arandcount++
}
}
while (brandcount < bcount) {
line = int(rand() * bcount) + 1
if (! blines[line]) {
blines[line] = 1
brandcount++
}
}
}
FILENAME == "xx00" && FNR in alines {
print
}
FILENAME == "xx01" && FNR in blines {
print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits the input file on the regex and outputs the line counts of each output file. The read command puts those counts into the variables. The variables are passed into the AWK program.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks the line numbers to select by choosing a random number between 1 and the count of lines of the "type". This block only gets executed once because of the flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and prints it if it is.
This might work for you:
grep '^A' file2 | sort -R | head -$(grep -c '^A' file1) >file3
grep '^B' file2 | sort -R | head -$(grep -c '^B' file1) >>file3
N.B. This assumes file1 is sorted.

How to make awk only run after certain headings?

I have a file with tables of values under certain headings. Like so:
Series A
a 01 20
b 21 43
c 44 59
d 60 72
Series B
a 01 25
b 26 48
c 49 70
d 71 90
My question is, how can I make awk only draw information from the specified series? So if awk is told to look at "Series B", column 2, it will output that column.
If the data is separated by a blank line then you can use something like this -
[jaypal:~/Temp] cat file
Series A
a 01 20
b 21 43
c 44 59
d 60 72
Series B
a 01 25
b 26 48
c 49 70
d 71 90
[jaypal:~/Temp] awk '/Series B/,/^$/ { if (NF==3) print $2 }' file
01
26
49
71
Or you can do something similar in sed:
[jaypal:~/Temp] sed -n '/Series B/,/^$/s/\(.*[^ ]\) \(.*[^ ]\) \(.*\)/\2/p' file
01
26
49
71
Assuming the tables are separated by empty line(s):
awk '$0 ~ p, !NF {
if (!($0 ~ p) && NF) print $c
}' p='Series B' c=2 infile
$0 ~ p, !NF - the pair of patterns separated by a comma specify a range of records.
The first one evaluates to true if the current record $0 matches the pattern indicated by
the value of the variable p. The second one is true when the value of NF (Number of Fields)
is 0 (empty or blank line).
if (!($0 ~ p) && NF) print $c - if the current record doesn't match the pattern p
(i.e. skip the header), and (&&) the current record contains at least one column (NF != 0)
print the desired column.
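As a minimal illustration of the range form on its own (run against the sample file from the question), the pair of patterns selects everything from the matching heading up to the next blank line or the end of file:
$ awk '/Series B/, !NF' file
Series B
a 01 25
b 26 48
c 49 70
d 71 90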
I prefer this way:
awk -v sect="$1" -v col="$2" '
/Series/ {header = $0; next}
NF > 1 {
set[header] = set[header]$col"|";
}
END {print set[sect];}
' $3 | tr '|' '\n'
I introduce an associative array and a temporary separator (not needed if you don't need to sort on "Series"; in that case just use "\n" and omit the tr). It doesn't rely on a blank line: "|" is used as the temporary separator and tr translates it back into newlines.
This is my usual solution, since in real life I can create a custom "sorting column", pipe to unix sort, and then cut it back out. Nawk (alas, on AIX) doesn't have a built-in sort, and unix sort is far superior to any hand-made sort in awk.
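A rough sketch of that decorate-sort-undecorate idea (not taken from the answer above, just to illustrate it): prepend a synthetic sort key with awk, let unix sort order the lines by it, then cut the key back off. Here the key is the 3rd column, sorted numerically:
awk '{ print $3 "\t" $0 }' file | sort -k1,1n | cut -f2-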