Delete row if same value in all columns - awk

I have a large space-delimited file with thousands of rows and columns. I would like to remove all lines that have the same value in every column except the first.
Input:
CHROM 108 139 159 265 350 351
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP3 0 0 0 -1 -1 -1
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0
Desired output:
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1
There is a similar question asked for the Pandas framework (Delete duplicate rows with the same value in all columns in pandas), and I found a somewhat partial solution that removes lines containing only zeros:
awk 'NR > 1{s=0; for (i=3;i<=NF;i++) s+=$i; if (s!=0)print}' input > outfile
but I want to handle the values -1, 0, 1 and 2 in one go, keeping the header and treating the 1st column as the identifier.
Any help will be highly appreciated.

I believe you can do something like this:
awk '{s=$0; gsub(FS $2,FS)} (NF > 1) {print s}' file
Which outputs:
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1
How does this work?
{s=$0; gsub(FS $2,FS)}: This action consists of two parts:
Store the current line in the variable s.
Substitute in the current line $0 every occurrence of the second field preceded by a field separator (FS $2) with a single field separator FS. As a side effect, $0 is rebuilt, so all field variables and the total number of fields NF are recomputed. The leading FS is needed to avoid matching xx when $2=x.
(NF > 1) {print s}: If more than one field is left, the line contained differing values, so print the saved copy s.
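For readability, here is the same logic spread over multiple lines with comments (a sketch; it behaves exactly like the one-liner above):
awk '{
  s = $0              # keep an untouched copy of the current line
  gsub(FS $2, FS)     # strip every "<FS><2nd field>" occurrence; $0 and NF get rebuilt
}
NF > 1 { print s }    # something besides the identifier survived, so values differ
' file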

You can try this:
awk 'NR==1;NR>1{for(i=2;i<NF;i++)if($(i+1)!=$i) {print;next}}' file
It prints the header line.
Then it loops over the fields until a value differs from the next one; at the first difference it prints the line and moves on to the next record (see the expanded version below).
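Expanded with comments, the same approach looks like this (a sketch equivalent to the one-liner):
awk '
NR == 1 { print; next }       # always keep the header line
{
  for (i = 2; i < NF; i++)    # compare each value with the one after it
    if ($(i+1) != $i) {       # first difference found:
      print                   # keep the line
      next                    # and move on to the next record
    }
}' file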

Could you please try the following.
awk '{val=$2;count=1;for(i=3;i<=NF;i++){if(val==$i){count++}};if(count!=(NF-1)){print}}' Input_file
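Spelled out with comments, the idea is to count how many fields match the second one (a sketch, equivalent to the one-liner above):
awk '{
  val = $2; count = 1           # take the 2nd field as the reference value
  for (i = 3; i <= NF; i++)
    if (val == $i) count++      # count fields equal to the reference
  if (count != NF-1) print      # fewer matches than data fields, so values differ
}' Input_file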

Portable Perl solution:
$ cat all_row
CHROM 108 139 159 265 350 351
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP3 0 0 0 -1 -1 -1
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0
$ perl -F"\s+" -ane ' { print "$_" if @F[1 .. $#F-1] != $F[1] } ' all_row
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1
$
If instead the ask is to keep only the lines that have the same value in all columns, then:
$ perl -F"\s+" -ane ' { print "$_" if @F[1 .. $#F-1] == $F[1] } ' all_row
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0
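For comparison, a minimal awk equivalent of that inverted filter (a sketch: it skips any line whose data fields are not all equal to $2):
$ awk '{for (i=3; i<=NF; i++) if ($i != $2) next} {print}' all_row
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0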

Related

Count occurrences of only positive number

I am trying to count occurrences of positive one (1), but my lines also contain negative one (-1), which is why I am getting a cumulative count.
For example:
Script:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/1/,"") "\t" NR}' input_file
Input:
1 1 1 -1 -1 -1 0 0
-1 0 0 -1 -1 -1 0 0
1 1 0 -1 -1 -1 0 0
0 1 1 -1 -1 -1 0 0
Counts I get:
6
4
5
5
I am able to find the count of only negative one (-1) using this command:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/\-1/,"") "\t" NR}' input_file
Count for negative one (-1):
3
4
3
3
But I am unable to get the desired count of only positive ones (1).
Desired count:
3
0
2
2
Any help will be highly appreciated.
$ awk '{print gsub(/(^|[^-])1/,"")}' file
3
0
2
2
With GNU awk, you can use word-boundary assertions to reliably distinguish -1 from -11 (if such entries are possible). Then use gsub to count the positive 1s remaining in the line:
echo "1 1 1 -1 -1 -1 0 0
-1 0 0 -1 -1 -1 0 0
1 1 0 -1 -1 -1 0 0
0 1 1 -1 -1 -1 0 0" >file
$ gawk '{gsub(/-1\>/,""); print gsub(/\<1\>/,"1")}' file
3
0
2
2
With POSIX awk, you can just loop over the fields and check the values, counting the ones equal to 1:
$ awk '{cnt=0; for (i=1;i<=NF;i++) if ($i+0==1) cnt++; print cnt}' file
3
0
2
2
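If you also want both counts side by side, with a header as in your original script, here is a minimal sketch combining the same field loop (assuming space-separated input as shown):
$ awk 'BEGIN{print "pos", "neg", "lineNum"}
       {pos=neg=0
        for (i=1; i<=NF; i++) if ($i==1) pos++; else if ($i==-1) neg++
        print pos, neg, NR}' file
pos neg lineNum
3 3 1
0 4 2
2 3 3
2 3 4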

How to use an interval expression and grouping together in awk?

I am trying to get the measure of CPU time spent on user tasks, system tasks, interrupt handling, IO wait etc. by parsing the below output of /proc/stat.
My intent is to retrieve the numerical values in the first line (the one that starts with "cpu ") into separate array elements indexed from 1 through N.
[kcube#myPc ~]$ cat /proc/stat
cpu 70508209 48325 12341967 18807644987 671141 0 11736 0 0 0
cpu0 4350458 319 868828 1175271469 23047 0 2397 0 0 0
cpu1 3944197 277 857728 1175822236 16462 0 1025 0 0 0
cpu2 3919468 538 924717 1175628294 136617 0 2270 0 0 0
cpu3 3763268 441 855219 1175968114 43631 0 733 0 0 0
cpu4 3551196 147 856029 1176198902 18392 0 851 0 0 0
cpu5 5394823 1806 997806 1174089493 120122 0 2056 0 0 0
cpu6 3425023 656 839042 1176324091 58718 0 3 0 0 0
cpu7 3167959 189 811389 1176654383 19218 0 2 0 0 0
cpu8 4454976 5046 625657 1175714502 10447 0 26 0 0 0
cpu9 5049813 5365 655732 1175082394 10511 0 30 0 0 0
cpu10 4746872 4727 630042 1175408141 10959 0 28 0 0 0
cpu11 5367186 4684 659408 1174759103 9992 0 23 0 0 0
cpu12 4744405 5940 704282 1175177246 149934 0 714 0 0 0
cpu13 4689816 5954 650193 1175439255 13494 0 5 0 0 0
cpu14 4872185 5479 699429 1175126266 16945 0 898 0 0 0
cpu15 5066558 6748 706459 1174981089 12643 0 669 0 0 0
I have below awk script.
[kcube#myPc ~]$ cat test.awk
{
if ( match($0,/cpu\s(\s[[:digit:]]+){10}$/, ary) ) {
print ary[0]
print ary[1]
}
}
This always gives me only the last numeric value of the first line in ary[1].
What I am looking for is to have something like:
ary[1] = 70508209
ary[2] = 48325
.
.
so on
I have never used interval expressions and grouping together. I tried searching for answers but couldn't find one. Can someone help me out?
I'm using GNU Awk 4.0.2
$ cat tst.awk
match($0,/^cpu\s(\s[[:digit:]]+){10}$/,ary) {
print "bad match:", ary[1]
print "bad match:", ary[2]
}
match($0,/^cpu\s+([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)\s([[:digit:]]+)$/,ary) {
print "good match:", ary[1]
print "good match:", ary[2]
}
/^cpu\s/ && split($0,tmp,/[[:digit:]]+/,ary) {
print "good split:", ary[1]
print "good split:", ary[2]
}
$ awk -f tst.awk file
bad match: 0
bad match:
good match: 70508209
good match: 48325
good split: 70508209
good split: 48325
An interval expression defines how many repetitions of the previous expression must exist for the regexp to match, that is all. It is not part of populating capture groups - that is entirely up to use of round brackets enclosing regexp segments. To do what you want you need to either define explicit capture groups for each number, or use split() or similar to create the array based on a regexp that describes each entity you want to be captured.
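A quick way to see this behavior: under an interval expression, the capture group only retains its last repetition (a minimal demo, assuming gawk's three-argument match()):
$ echo 'cpu 1 2 3' | gawk '{ match($0, /cpu( [0-9]+){3}/, m); print "m[1]=[" m[1] "]" }'
m[1]=[ 3]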
All of the above uses GNU awk - for the 3rd arg to match() and the 4th arg to split(). Note that you can also just do this with GNU awk for FPAT:
$ awk -v FPAT='[0-9]+' '/^cpu /{for (i=1; i<=NF; i++) print i, $i}' file
1 70508209
2 48325
3 12341967
4 18807644987
5 671141
6 0
7 11736
8 0
9 0
10 0
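And if you specifically want the numbers in an array named ary indexed from 1 through N, as in the question, a sketch along the same lines:
$ gawk -v FPAT='[0-9]+' '/^cpu /{ for (i=1; i<=NF; i++) ary[i] = $i; print ary[1], ary[2]; exit }' file
70508209 48325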

awk: take entries from file and add values in between

I have the following file:
2 some
5 some
8 some
10 thing
15 thing
19 thing
Now I want to end up with output where, for "some", rows 2, 5 and 8 contain a 1 and every other row contains a 0. It doesn't matter how many rows there are in total. This means for "some":
0
1
0
0
1
0
0
1
0
0
and for "thing"
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
Is this possible in a quick way with awk? I mean with something like:
awk '{for(i=1;i<=10;i++) entries[$i]=0 for(f=0;<=NF;f++) entries[$f]=1' testfile.txt
Another awk; the output terminates at the last index:
awk -v key='thing' '$2==key{while(++c<$1) print 0; print 1}' file
To add some extra 0's after the last 1, append END{while(i++<3) print 0}.
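Putting the two together, this reproduces the 20-row "thing" output shown in the question (the 1 in the END loop is simply the number of trailing zeros wanted):
$ awk -v key='thing' '$2==key{while(++c<$1) print 0; print 1} END{while(i++<1) print 0}' file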
Something like this seems to work for producing the "some" data:
$ cat file1
2 some
5 some
8 some
10 thing
15 thing
19 thing
$ awk 'max<$1 && $2=="some"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
0
1
0
0
1
0
0
1
Similarly, this one works for the "thing" data:
$ awk 'max<$1 && $2=="thing"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
Alternatively, as mentioned by glennjackman in the comments, we could use an external variable to select between some and thing:
$ awk -v word="some" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
# for thing just apply awk -v word="thing"
You can parameterize this from the shell using an awk variable like this:
$ w="some"  # selectable: set by shell, by script, etc.
$ awk -v word="$w" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
perl:
perl -lanE '
push @{$idx{$F[1]}}, $F[0] - 1; # subtract 1 because we are working with
# (zero-based) array indices
$max = $F[0]; # I assume the input is sorted by column 1
} END {
$, = "\n";
for $word (keys %idx) {
# create a $max-sized array filled with zeroes
@a = (0) x $max;
# then, populate the entries which should be 1
@a[ @{$idx{$word}} ] = (1) x @{$idx{$word}};
say $word, @a;
}
' file | pr -2T -s | nl -v0
0 thing some
1 0 0
2 0 1
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 1 0
11 0 0
12 0 0
13 0 0
14 0 0
15 1 0
16 0 0
17 0 0
18 0 0
19 1 0

Eliminate lines based on values in multiple columns

I'm trying to remove rows from a big table on the condition that one column has a certain value and another column has certain other values.
So far I've been trying this, but I guess I'm not combining the awk commands properly:
awk '$11 !="1"'| awk '$20==2 || $20==3' infile.txt >out.txt
The code is probably too simple but should work anyway... or not?
Thanks
edit:
This is what the table looks like
5306083 TGATCAATCTCATAAC[A/C]AAAAAAAAA consensus_24 211 1 species 0 0 0 T 0 7 T recommended 0.708 F 0 -100 T recommended
5193751 AGTAGCTTGCGCGGA[C/T]GGGGGGGGG consensus_32 227 1 species 0 0 0 T 1 1 T not_recommended 0.75 F 0 -100 T not_recommended
5193254 TAAAAAAAAAAAAAA[G/T]ATTCATCC consensus_26 192 1 species 0 0 0 T 1 0 T not_recommended 0.726 F 0 -100 T neutral
So if I filter on $11 being 1 and $20 being "neutral" or "not_recommended", I would get this:
5306083 TGATCAATCTCATAAC[A/C]AAAAAAAAA consensus_24 211 1 species 0 0 0 T 0 7 T recommended 0.708 F 0 -100 T recommended
awk '$11!=1 && ($20==2 || $20==3)' infile.txt > out.txt
should do.
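For reference, the original pipeline fails mainly because the first awk is given no input: infile.txt is attached to the second command, so the first one sits reading stdin. If you really wanted two processes, the file would have to feed the first one:
$ awk '$11 != "1"' infile.txt | awk '$20==2 || $20==3' > out.txt
But a single awk with a combined condition, as above, is simpler.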
UPDATE: based on the input given, you should get two lines in the output for this condition:
$ awk '$11==1 && ($20=="not_recommended" || $20=="neutral")' file
5193751 AGTAGCTTGCGCGGA[C/T]GGGGGGGGG consensus_32 227 1 species 0 0 0 T 1 1 T not_recommended 0.75 F 0 -100 T not_recommended
5193254 TAAAAAAAAAAAAAA[G/T]ATTCATCC consensus_26 192 1 species 0 0 0 T 1 0 T not_recommended 0.726 F 0 -100 T neutral
But I guess what you mean is that you want the negation of this, which is different from your original post:
$ awk '$11!=1 || ($20!="not_recommended" && $20!="neutral")' file
5306083 TGATCAATCTCATAAC[A/C]AAAAAAAAA consensus_24 211 1 species 0 0 0 T 0 7 T recommended 0.708 F 0 -100 T recommended

Delete many empty spaces between columns and leave only one whitespace between columns

I have a file with more than 2500 columns. Each column is separated by a tab or several whitespace characters.
The data format in the file is as below:
1   1    0
1  1     0
0    1   0
1   0    1
1  0     0
1    1   1
1   0    1
I want to collapse the tabs or runs of whitespace between the columns and leave only one space between the columns, as below:
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
How do I delete the extra spaces?
This should do:
awk '{$1=$1}1' file
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
Assigning $1=$1 forces awk to rebuild the record, which squeezes all the spaces and tabs into single spaces (the default OFS). The trailing 1 is an always-true condition whose default action is to print the line.
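Because the record is rebuilt with OFS, the same idiom can emit any delimiter you like (a quick sketch using a comma to make the effect visible):
$ awk -v OFS=',' '{$1=$1}1' file
1,1,0
1,1,0
0,1,0
1,0,1
1,0,0
1,1,1
1,0,1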
With sed:
sed 's/[[:space:]]\+/ /g' filename
Alternatively with tr, which reads standard input, so the file must be redirected into it:
tr -s '[:blank:]' ' ' < filename