Printing field by column with variable line field - awk

I have a command that returns an output similar to;
*************
* something *
*************
| Header | Title |
Column1|Column2 | Column3 |Column4| Column5 |Column6|Column7| Column8 |
--------------------------------------------------------------------------------
val1 val2 val3 x y i j 1(a) 2 1(a) 2 val4
val5 val6 val7 w x y z i j k 2(b) 2 1(b) 1 val8
..
..
Total lines: xx
I want to print just Column6, for example, but because the rows do not have a fixed number of space-separated fields, awk '{print $x}' won't work for me. I need a way to print the output by a named column (e.g. Column6 or Column8). Maybe by printing the Column6 field from the right, which would be field $5 counting from the right? Is there a way to count fields from the right rather than from the left, which is what these commands do by default?
Any help would be appreciated.

Use NF for this
awk '{print $(NF-5)}'
This will print the 6th-to-last field, for example.
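A quick way to see which field that picks, using a throwaway line of placeholder letters:
$ echo "a b c d e f g h" | awk '{print $(NF-5)}'
c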

I have been solving a similar problem. Assuming Column6 is always there, you can use the following to find its index:
echo "Column1|Column2|Column6|Column8" | sed 's/Column6.*//;s/[^|]\+//g' | wc -c
Then you can simply construct the awk query
X=$(echo ...)
SCRIPT="{ print \$${X}; }"
echo "Column1 |Column2 |Column6 |Column8" | awk "${SCRIPT}"
|Column6
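An equivalent way to pass the computed index is awk's -v option, which avoids building the script as a string (same toy header as above):
X=$(echo "Column1|Column2|Column6|Column8" | sed 's/Column6.*//;s/[^|]\+//g' | wc -c)
echo "Column1 |Column2 |Column6 |Column8" | awk -v x="$X" '{print $x}'
|Column6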

Rewritten in GNU awk:
$ cat program.awk
BEGIN { FS="|" }
$0 ~ c {                          # process record with header
    split($0,a,"|")               # split to get header indexes
    for(i in a) {                 # loop all column names
        gsub(/^ *| *$/,"",a[i])   # trim space off
        if(a[i]==c) ci=i          # ci is the one
    }
    while(i=index($0,FS)) {       # build FIELDWIDTHS to separate fields
        FIELDWIDTHS = FIELDWIDTHS i " "
        $0=substr($0,i+1)
    }
}
ci && $ci !~ /^-+$/ && $0=$ci     # implicit printing
Run it:
$ awk -v c="Column6" -f program.awk file
1(a) 2
2(b) 2
If you want to edit the output column, the last line in program.awk is the place to do it. For example, if you'd like to lose the parenthesized part of Column6, you could create an action part with {sub(/\(.*\)/,""); print} for it.
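As a sketch, that last line would then become something like:
ci && $ci !~ /^-+$/ && $0=$ci { sub(/\(.*\)/,""); print }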

Related

Inserting text at specific position in file names which match a pattern

I am editing a large text file called "test.txt" on a Mac. Most lines start with #, but some lines are a tab separated list of fields:
val1 val2 val3 val4 val5 val6 val7 val8 val9
What I would like to do is find specific lines where val2 = foo and val3 = bar (or just grep for the string foo \t bar), and then on these lines only, replace whatever val9 is with the string val9=val9. So if val9 is 'g1.t1', I would replace it with 'g1.t1=g1.t1'.
I was able to come up with the following command:
fgrep -l -w 'foo bar' test.txt | xargs sed -i "" 's/\([^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\)\t\([^\t]*\)/\1\t\2=\2/'
to find these lines and make these modifications, but this just prints out the modified lines.
I want to write the entire file back out to a new file called "test_edited.txt", with only these changes made. I feel like the solution I've come up with, by relying on piping the output of fgrep to sed, doesn't allow for this. But maybe I'm missing something?
Any suggestions welcome!
Thanks!
awk is more suitable for this job than grep + xargs + sed with a very clumsy-looking regular expression:
awk 'BEGIN{FS=OFS="\t"}
$2 == "foo" && $3 == "bar" {$9 = $9 "=" $9} 1' file
# if you want to save changes back to original file use:
awk 'BEGIN{FS=OFS="\t"}
$2 == "foo" && $3 == "bar" {$9 = $9 "=" $9} 1' file > _tmp &&
mv _tmp file
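To write the result to the new file mentioned in the question instead, simply redirect:
awk 'BEGIN{FS=OFS="\t"}
$2 == "foo" && $3 == "bar" {$9 = $9 "=" $9} 1' test.txt > test_edited.txt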

Counting the number of unique values based on two columns in bash

I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique values that belong to the first column, all in one command with pipelines. As you may see, there can be some duplicates like "A 1234". I had some ideas with awk or cut, but neither of them seems to work. They just print out all unique pairs, while I need the count of unique values from the second column for each value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
With a complete awk solution, could you please try the following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: Adding detailed explanation for above.
awk '                ##Starting awk program from here.
BEGIN{
    FS=OFS="\t"
}
!found[$0]++{        ##Checking if the whole line (1st and 2nd columns) is NOT present in the found array; if so, do the following.
    val[$1]++        ##Creating val with the 1st column as index and increasing its value here.
}
END{                 ##Starting END block of this program from here.
    for(i in val){   ##Traversing through array val here.
        print i,val[i]   ##Printing i and value of val with index i here.
    }
}
' Input_file         ##Mentioning Input_file name here.
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{               # using GNU awk and tab as delimiter
    a[$1][$2]                 # hash to 2D array
}
END {
    for(i in a)               # for all values in the first field
        print i,length(a[i])  # output value and the size of the related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
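If you want the key before the count, as in the desired output, one more awk at the end can swap them:
$ sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'
A 3
B 2
C 1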
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
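For an unsorted file that would be, for example:
$ datamash -s -g1 countunique 2 < input.txt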
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)

awk/sed remove duplicates and merge permuted columns

I have the following file:
ABC MNH 1
UHR LOI 2
QWE LOI 3
MNH ABC 4
PUQ LOI 5
MNH ABC 6
QWE LOI 7
LOI UHR 8
I want to remove all duplicates (based on the first two columns - e.g., row 6 is a duplicate of row 4). Also I want to merge entries where columns 1 and 2 are permuted (e.g., rows 1 and 4). This means that this list should result in:
ABC MNH 1 4
UHR LOI 2 8
QWE LOI 3
PUQ LOI 5
However, this file is huge. About 2-3 TB. Can this be done with awk/sed?
I don't understand why what you posted is your expected output, so you may have to massage it, but IMHO this is the right way to approach the problem: only "sort" has to handle storing the multi-TB input internally (and sort is designed to do that with paging etc.), while the awk scripts just process one line at a time and keep very little in memory:
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
    if ( NR>1 ) {
        print rec
    }
    rec = $0
    sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
    prev = curr
    next
}
{ rec = rec OFS $NF }
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4 6
PUQ LOI 5
QWE LOI 3 7
LOI UHR 8 2
An alternative implementation after discussing with @kvantour in the comments below (requires GNU sort for -s, a stable sort):
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -s -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
    if ( NR>1 ) {
        print rec
    }
    rec = $0
    sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
    sub(/[[:space:]]+[^[:space:]]+$/,"",rec)
    delete seen
    prev = curr
}
!seen[$3,$4]++ { rec = rec OFS $NF }
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4
PUQ LOI 5
QWE LOI 3
UHR LOI 2 8
The always helpful GNU datamash to the rescue!
$ sort -k1,2 -u input.txt |
awk -v OFS="\t" '$2 < $1 { tmp = $1; $1 = $2; $2 = tmp } { print $1, $2, $3 }' |
sort -k1,2 |
datamash groupby 1,2 collapse 3 |
tr ',' ' '
ABC MNH 1 4
LOI PUQ 5
LOI QWE 3
LOI UHR 2 8
Broken down, this:
Sorts the input file based on the first two columns and removes duplicates.
If the second column is less than the first column, swaps the two (So MNH ABC 6 becomes ABC MNH 6), and outputs tab-separated columns (Which is what datamash works with by default).
Sorts that so all the transformed rows are in order (But this time keeping duplicates).
Uses datamash to produce a single line for all the rows sharing the same first two columns, with a comma-separated list of the values of the third column as the third column of the output (like ABC MNH 1,4).
Turns those commas into spaces.
Most memory-efficient solutions will require the data to be sorted, and while the sort program is quite good at doing that, it'll still use a bunch of temporary files so you'll need 2-3 or so terabytes of free disk space.
If you're going to be doing a lot of stuff with the same data, it's probably worth sorting it once and reusing that file instead of sorting it every time as the first step of a pipeline:
$ sort -k1,2 -u input.txt > unique_sorted.txt
$ awk ... unique_sorted.txt | ...
If there are enough duplicates and enough RAM that it's feasible to hold the results in memory, it can be done in one pass through the input file, removing duplicates as it goes and then iterating through all the remaining pairs of values:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my %keys;
while (<>) {
    chomp;
    my ($col1, $col2, $col3) = split ' ';
    $keys{$col1}{$col2} = $col3 unless exists $keys{$col1}{$col2};
}
$, = " ";
while (my ($col1, $sub) = each %keys) {
    while (my ($col2, $col3) = each %$sub) {
        next unless defined $col3;
        if ($col1 lt $col2 && exists $keys{$col2}{$col1}) {
            $col3 .= " $keys{$col2}{$col1}";
            $keys{$col2}{$col1} = undef;
        } elsif ($col2 lt $col1 && exists $keys{$col2}{$col1}) {
            next;
        }
        say $col1, $col2, $col3;
    }
}
This produces output in arbitrary unsorted order for efficiency's sake.
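If sorted output is needed, the result can just be piped through sort afterwards (the script name merge_pairs.pl here is only illustrative):
$ perl merge_pairs.pl input.txt | sort -k1,2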
And an approach using sqlite (Also requires lots of extra free disk space, and that the columns are separated by tabs, not arbitrary whitespace):
#!/bin/sh
input="$1"
sqlite3 -batch -noheader -list temp.db 2>/dev/null <<EOF
.separator \t
PRAGMA page_size = 8192; -- Make sure the database can grow big enough (must be a power of two)
CREATE TABLE data(col1, col2, col3, PRIMARY KEY(col1, col2)) WITHOUT ROWID;
.import "$input" data
SELECT col1, col2, group_concat(col3, ' ')
FROM (
SELECT col1, col2, col3 FROM data WHERE col1 < col2
UNION ALL
SELECT col2, col1, col3 FROM data WHERE col2 < col1
)
GROUP BY col1, col2
ORDER BY col1, col2;
EOF
rm -f temp.db
If your first two columns will only ever have 3 characters, you will have at most 26^6 possible combinations for the first two columns. This is very easy to handle with awk.
{ key1=$1$2; key2=$2$1 }
(key1 in a) { next } # duplicate :> skip
(key2 in a) { print $2,$1,a[key2],$3 } # permutation :> print
{ a[key1]=$3 } # store value
This, however, will only print the permutations and, as requested, a maximum of 2 elements. As a consequence, when a permutation is found the array a will contain both key1 and the permuted key key2; otherwise it will only contain key1.
This can be cleaned up with a second array that keeps track of whether a permutation has already been printed. Call it b. This way you can eliminate 2 elements from a while keeping track of one element in b:
{ key1=$1$2; key2=$2$1 }
(key1 in b) || (key2 in b) { next }   # permutation already printed, is duplicate
(key1 in a) { next }                  # only duplicate, no permutation found
(key2 in a) {                         # permutation found
    print $2,$1,a[key2],$3            # - print
    delete a[key1]                    # - delete keys from a
    delete a[key2]
    b[key1]                           # - store key in b
    next                              # - skip the rest
}
{ a[key1]=$3 }
END { for (k in a) { print substr(k,1,3),substr(k,4,3),a[k] } }
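Saved to a file, say permute.awk (the name is just an example), it runs like any awk script:
$ awk -f permute.awk file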

Grep specific columns, count and write output file

I'm trying to summarize my data and count specific items.
These are human sequencing data and hence very large.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL PRIMARY
1 12867 . C A 5 q40;bldp;blq SS=1;VT=SNP; GT:DP:AD:BQ:MQ:SB:FA:SS:SSC:MQA 1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0 1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
To simplify, the data looks something like this
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11
x x x x x x x SS=1 x 1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0 1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
x x x x x x x SS=2 x 1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0 1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
First, I need to count how many of each of the different SS values there are in column8. There are 5 different types of SS, i.e. SS=1 ... SS=5.
This could be done with a grep command, and I tried:
grep SS=1 file1.vcf | wc -l
grep SS=2 file1.vcf | wc -l
Then I want to count how many "0", "1", and "2" values there are in columns 10 and 11 at the position after the 7th colon (:).
This is the part that I'm not sure how to do. I was thinking about using awk, but I'm not sure how to specify looking at a specific position (after the 7th colon).
awk -F ':' '$11==1' # this command only selects a column, not a specific position within it.
I have 246 files on which I want to do exactly the same. How can I apply this to all my files and write the counts to a txt file? I only know how to do it one by one, and then I could probably cat the count files at the end.
for f in *.vcf; do grep SS=1 "$f" | wc -l > ${f}SS1.txt; done
To count how many different values you have in column 8 you can use the typical approach:
$ awk -F"\t" 'NR>1{a[$8]++} END{for (i in a) print i,a[i]}' file
SS=1 1
SS=2 1
To count how many different values you have in the 8th position of a :-separated string from the 10th and 11th fields, you can use split() to slice the string into pieces. Then use the same approach as above.
$ awk -F"\t" 'NR>1{split($10,a,":"); split($11,b,":"); count10[a[8]]++; count11[b[8]]++} END {for (i in count10) print i, count10[i]; for (i in count11) print i, count11[i]}' a
0 2
1 2
You can put all together to get something like:
$ awk -F"\t" 'NR>1{count8[$8]++; split($10,a,":"); split($11,b,":"); count10[a[8]]++; count11[b[8]]++} END {for (i in count8) print i, count8[i]; for (i in count10) print i, count10[i]; for (i in count11) print i, count11[i]}' file
SS=1 1
SS=2 1
0 2
1 2
If you want to do this for many files, you can either use the loop or, better, work with FILENAME and ENDFILE to flush the stored information. Try it out and let us know if you face any problems there.
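A rough sketch of that second route, assuming GNU awk (ENDFILE and whole-array delete are gawk extensions) and a made-up per-file output name modelled on your loop:
gawk -F"\t" '
!/^#/   { count[$8]++ }                  # tally the SS=... values of the current file, skipping header lines
ENDFILE {                                # gawk: runs once after each input file
    for (i in count)
        print i, count[i] > (FILENAME "_SS_counts.txt")
    delete count                         # flush the stored counts before the next file
}' *.vcf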

awk to Count Sum and Unique improve command

Would like to print, based on the 2nd column: the count of line items, the sum of the 3rd column, and the count of unique values of the first column. Having around 100 InputTest files, and they are not sorted.
Am using the below 3 commands to achieve the desired output; would like to know the simplest way ...
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions !!!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first column, incrementing diff[$2] whenever a unique combination is found. The ++ after seen[$2,$1] means that the condition will only be true the first time the combination is found, as the value of seen[$2,$1] will be increased to 1 and !seen[$2,$1] will be false.
count keeps a total of the third column
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separator to , in the BEGIN block. We use the keys array to identify and count the keys. The sum array keeps the sum for each key. count allows us to keep track of unique column1 values for each column2 value.
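Since there are around 100 unsorted InputTest files, the same command can simply be given all of them at once; awk keeps accumulating across its input files, so no pre-sorting or concatenation is needed:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' InputTest*.txt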