I am trying to count occurrences of positive one (1), but my lines also contain negative one (-1), so the count I get is cumulative.
For example:
Script:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/1/,"") "\t" NR}' input_file
Input:
1 1 1 -1 -1 -1 0 0
-1 0 0 -1 -1 -1 0 0
1 1 0 -1 -1 -1 0 0
0 1 1 -1 -1 -1 0 0
Counts (cumulative, since the 1 inside each -1 is also counted):
6
4
5
5
I am able to count only the negative ones (-1) using this command:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/\-1/,"") "\t" NR}' input_file
Count for negative one (-1):
3
4
3
3
But I am unable to get the desired count of only the positive ones (1).
Desired count:
3
0
2
2
Any help will be highly appreciated.
You can count every 1 that is not preceded by a minus sign (this is fine for single-digit fields like yours):
$ awk '{print gsub(/(^|[^-])1/,"")}' file
3
0
2
2
With GNU awk, you can use word-boundary assertions to reliably tell -1 apart from entries such as -11 (if those are possible). Remove every -1 first, then use gsub's return value to count the positive 1s remaining in the line:
echo "1 1 1 -1 -1 -1 0 0
-1 0 0 -1 -1 -1 0 0
1 1 0 -1 -1 -1 0 0
0 1 1 -1 -1 -1 0 0" >file
$ gawk '{gsub(/-1\>/,""); print gsub(/\<1\>/,"1")}' file
3
0
2
2
With POSIX awk, you can simply loop over the fields, compare the values numerically, and count the ones equal to 1:
$ awk '{cnt=0; for (i=1;i<=NF;i++) if ($i+0==1) cnt++; print cnt}' file
3
0
2
2
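If you also want the header and line numbers from your original command, the same field loop drops straight in (a sketch reusing your BEGIN block; $i+0 forces a numeric comparison, so -1 never matches):
$ awk 'BEGIN{print "count", "lineNum"}{c=0; for (i=1;i<=NF;i++) if ($i+0==1) c++; print c "\t" NR}' file
count lineNum
3	1
0	2
2	3
2	4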
I have a dataframe like this
0 1 0 1 0 0....
1 1 1 1 0 0
0 0 1 1 0 1
.
.
.
And I want to multiply each of them by the terms of a geometric sequence
1, 10, 100, 1000, 10000 ... 10^(n-1)
so the result will be
0 10 0 1000 0 0....
1 10 100 1000 0 0
0 0 100 1000 0 100000
.
.
.
I have tried with
awk '{n=0 ; x=0 ; for (i = 1; i <= NF; i++) if ($i == 1) {n=10**i ; x = x+n } print x }' test.txt
But the results were not what I expected
With GNU awk:
awk '{for (i=1; i<=NF; i++){if($i==1){n=10**(i-1); $i=$i*n}} print}' test.txt
Output:
0 10 0 1000 0 0
1 10 100 1000 0 0
0 0 100 1000 0 100000
Note: In this answer, we always assume single digits per column
There are a couple of things you have to take into account:
If you have a sequence given by:
a b c d e
Then the final number will be edcba
awk is not aware of integers; it only knows floating-point numbers, so there is a maximum value it can represent exactly as an integer, namely 2^53 (see biggest integer that can be stored in a double). This means that multiplication is not the way forward. Even without awk the concern remains for integer arithmetic, where the maximum is 2^64 - 1 for unsigned 64-bit values.
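A quick way to see that ceiling (a minimal check; printf "%.0f" is used so the full integer is displayed portably):
$ awk 'BEGIN{printf "%.0f\n%.0f\n", 2^53, 2^53 + 1}'
9007199254740992
9007199254740992
Adding 1 beyond 2^53 is lost to rounding, which is exactly the risk described above.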
That said, it is better to just write the number with n places, using 0 as filler. For example, if you want to compute 3 × 10^4, you can do:
awk 'BEGIN{printf "%0.*d",4+1,3}' | rev
Here we make use of rev to reverse the strings (00003 → 30000)
Solution 1: In the OP, the code alludes to the fact that the final sum is requested (a b c d e → edcba). So we can just do either of the following:
sed 's/ //g' file | rev
awk -v OFS="" '{$1=$1}1' file | rev
If you want to get rid of possible leading zeros in the result, strip the string's trailing zeros before reversing:
sed 's/ //g;s/0*$//' file | rev
Solution 2: If the OP only wants the multiplied columns as output, we can do:
awk '{for(i=NF;i>0;--i) printf("%0.*d"(i==1?ORS:OFS),i,$i)}' file | rev
Solution 3: If the OP wants both the multiplied columns and the sum as output, we can do:
awk '{ s=$0; gsub(/ /,"",s); printf s OFS }
     { for(i=NF;i>0;--i) printf("%0.*d"(i==1?ORS:OFS),i,$i) }' file | rev
What you wrote is absolutely not what you want. Your awk program parses each line of the input and computes only one number per line which happens to be 10 times the integer you would see if you were writing the 0's and 1's in reverse order. So, for a line like:
1 0 0 1 0 1
your awk program computes 10+0+0+10000+0+1000000=1010010. As you can see, this is the same as 10 times 101001 (100101 reversed).
To do what you want, you can loop over all fields and modify them on the fly by multiplying them by the corresponding power of 10, as shown in another answer.
Note: another awk solution, a bit more compact, but strictly equivalent for your inputs, could be:
awk '{for(i=1;i<=NF;i++) $i*=10**(i-1)} {print}' test.txt
The first block loops over the fields and modifies them on the fly by multiplying them by the corresponding power of 10. The second block prints the modified line.
As noted in another answer, there is a potential overflow issue with the pure arithmetic approach: if your lines have many fields, you can exceed the range of exact integer representation in floating-point format. That may explain the unexpected values in the output you observed.
If there is a risk of overflow, you could, as suggested in the other answer, take an approach where the trailing zeroes are added not by multiplying by a power of 10 but by concatenating the value 0 rendered with the desired number of digits, something printf and sprintf can do:
$ awk 'BEGIN {printf("%s%0.*d\n",1,4,0)}' /dev/null
10000
So, a GNU awk solution based on this could be:
awk '{for(i=1;i<=NF;i++) $i = $i ? sprintf("%s%0.*d",$i,i-1,0) : $i} 1' test.txt
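With the three sample rows from the question, this produces the expected output:
$ awk '{for(i=1;i<=NF;i++) $i = $i ? sprintf("%s%0.*d",$i,i-1,0) : $i} 1' test.txt
0 10 0 1000 0 0
1 10 100 1000 0 0
0 0 100 1000 0 100000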
How about not doing any math at all:
{m,n,g}awk '{ for(_+=_^=___=+(__="");_<=NF;_++) {
$_=$_ ( \
__=__""___) } } gsub((_=" "(___))"+",_)^+_'
=
1 0 0 0 10000 0 0 0 0 1000000000 10000000000
1 0 0 0 10000 0 0 10000000 0 0 10000000000
1 0 0 0 10000 100000 0 0 0 0 10000000000
1 0 0 1000 0 0 1000000 0 100000000 1000000000
1 0 0 1000 10000 0 0 0 0 1000000000 10000000000
1 0 100 0 0 0 1000000 10000000 0 0 10000000000
1 0 100 0 0 100000 1000000 10000000 100000000 1000000000
1 0 100 0 10000 0 1000000 0 100000000
1 0 100 1000 0 100000 0 0 0 0 10000000000
1 0 100 1000 10000 0 0 10000000
1 10 0 0 0 0 1000000 10000000 0 1000000000
1 10 0 1000 0 100000 0 0 100000000
1 10 0 1000 0 100000 0 0 100000000 1000000000 10000000000
1 10 0 1000 0 100000 0 10000000 100000000 1000000000
1 10 100 1000 10000 100000 0 0 0 1000000000
I have a space-delimited large file with thousands of rows and columns. I would like to remove all lines which have the same value across all columns but the first.
Input:
CHROM 108 139 159 265 350 351
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP3 0 0 0 -1 -1 -1
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0
Desired
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1
There is a similar question for pandas (Delete duplicate rows with the same value in all columns in pandas), and I found a partial solution that removes lines containing only zeros:
awk 'NR > 1{s=0; for (i=3;i<=NF;i++) s+=$i; if (s!=0)print}' input > outfile
but I want to do this for the values -1, 0, 1 and 2 in one go, keeping the header and treating the first column as the identifier.
Any help will be highly appreciated.
I believe you can do something like this:
awk '{s=$0; gsub(FS $2,FS)} (NF > 1) {print s}' file
Which outputs:
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1
How does this work?
{s=$0; gsub(FS $2,FS)}: This action contains 2 parts:
Store the current line in variable s.
Substitute in the current line $0 all occurrences of the second field, including its leading field separator (FS $2), with a single field separator FS. As a side effect, $0 is rebuilt, so all field variables and the total number of fields NF are recomputed. The leading FS is needed to avoid matching xx when $2=x.
(NF > 1) {print s}: If more than 1 field is left, the line contained differing values, so print the saved original line s.
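You can watch that side effect on NF with a single line (a minimal trace of the idea):
$ echo 'SNP1 -1 -1 -1 -1 -1 -1' | awk '{s=$0; gsub(FS $2,FS); print NF}'
1
All six -1 values collapse into separators, only the identifier remains, so the line is filtered out.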
You can try this:
awk 'NR==1;NR>1{for(i=2;i<NF;i++)if($(i+1)!=$i) {print;next}}' file
It prints the header line (NR==1).
For the remaining lines it compares each field with the next one; as soon as a difference is found, it prints the line and goes on to the next record.
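With the sample input, a quick check gives:
$ awk 'NR==1;NR>1{for(i=2;i<NF;i++)if($(i+1)!=$i) {print;next}}' file
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1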
Could you please try the following. It takes $2 as the reference value and counts how many fields from $2 onward equal it; if that count is not NF-1, the values are not all the same, so the line is printed:
awk '{val=$2;count=1;for(i=3;i<=NF;i++){if(val==$i){count++}};if(count!=(NF-1)){print}}' Input_file
Portable Perl solution:
$ cat all_row
CHROM 108 139 159 265 350 351
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP3 0 0 0 -1 -1 -1
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0
$ perl -F"\s+" -ane ' { print "$_" if @F[1 .. $#F-1] != $F[1] } ' all_row
CHROM 108 139 159 265 350 351
SNP3 0 0 0 -1 -1 -1
$
If instead the requirement is to keep only the lines with the same value in all columns:
$ perl -F"\s+" -ane ' { print "$_" if @F[1 .. $#F-1] == $F[1] } ' all_row
SNP1 -1 -1 -1 -1 -1 -1
SNP2 2 2 2 2 2 2
SNP4 1 1 1 1 1 1
SNP5 0 0 0 0 0 0
I have the following file:
2 some
5 some
8 some
10 thing
15 thing
19 thing
Now I want to end up with one column per key, where for "some" the rows 2, 5 and 8 contain a 1 and every other row contains a 0. It doesn't matter how many rows there are. This means for "some":
0
1
0
0
1
0
0
1
0
0
and for "thing"
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
Is this possible in a quick way with awk? I mean with something like:
awk '{for(i=1;i<=10;i++) entries[$i]=0 for(f=0;<=NF;f++) entries[$f]=1' testfile.txt
Another awk; the output ends at the last index:
awk -v key='thing' '$2==key{while(++c<$1) print 0; print 1}' file
To pad some extra 0's after the last 1, add END{while(i++<3) print 0}.
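For example, with the "some" key this prints (stopping at the last 1, per the note above):
$ awk -v key='some' '$2==key{while(++c<$1) print 0; print 1}' file
0
1
0
0
1
0
0
1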
Something like this seems to work to produce the "some" data:
$ cat file1
2 some
5 some
8 some
10 thing
15 thing
19 thing
$ awk 'max<$1 && $2=="some"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
0
1
0
0
1
0
0
1
Similarly, this one works for the "thing" data:
$ awk 'max<$1 && $2=="thing"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
Alternatively, as mentioned by glennjackman in the comments, we can use an external variable to select between "some" and "thing":
$ awk -v word="some" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
# for thing just apply awk -v word="thing"
You can parameterize this from the shell by passing a variable into awk, like this:
$ w="some"   # selectable: set by the shell, by a script, etc.
$ awk -v word="$w" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
perl:
perl -lanE '
push @{$idx{$F[1]}}, $F[0] - 1; # subtract 1 because we are working with
# (zero-based) array indices
$max = $F[0]; # I assume the input is sorted by column 1
} END {
$, = "\n";
for $word (keys %idx) {
# create a $max-sized array filled with zeroes
@a = (0) x $max;
# then, populate the entries which should be 1
@a[ @{$idx{$word}} ] = (1) x @{$idx{$word}};
say $word, @a;
}
' file | pr -2T -s | nl -v0
0 thing some
1 0 0
2 0 1
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 1 0
11 0 0
12 0 0
13 0 0
14 0 0
15 1 0
16 0 0
17 0 0
18 0 0
19 1 0
I have a file with more than 2500 columns. The columns are separated by a tab or by several spaces.
The data format in the file is as below:
1    1    0
1    1    0
0    1    0
1    0    1
1    0    0
1    1    1
1    0    1
I want to replace the tabs and runs of whitespace between the columns with a single space, as below:
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
How do I delete the extra spaces?
This should do:
awk '{$1=$1}1' file
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
Assigning $1=$1 forces awk to rebuild the record, collapsing all tabs and spaces into the default output field separator (a single space). The trailing 1 is a true pattern that prints the rebuilt line.
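You can verify the rebuild with a line containing a tab and a run of spaces (a minimal check):
$ printf '1\t1   0\n' | awk '{$1=$1}1'
1 1 0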
With sed:
sed 's/[[:space:]]\+/ /g' filename
Alternatively with tr, which reads only standard input, so the file has to be redirected:
tr -s '[:blank:]' ' ' < filename
I want to calculate a sum and a ratio from the data below. (The actual data contains more than 200,000 columns and 45,000 rows.)
For clarity I give only a simple sample of the data format:
#Frame BMR_42#O22 BMR_49#O13 BMR_59#O13 BMR_23#O26 BMR_10#O13 BMR_61#O26 BMR_23#O25
1 1 1 0 1 1 1 1
2 0 1 0 0 1 1 0
3 1 1 1 0 0 1 1
4 1 1 0 0 1 0 1
5 0 0 0 0 0 0 0
6 1 0 1 1 0 1 0
7 1 1 1 1 0 0 0
8 1 1 1 0 0 0 0
9 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0
The columns have to be selected by a certain criterion: I only consider the columns whose header contains "#O13". Below are the columns selected from the example above:
BMR_49#O13 BMR_59#O13 BMR_10#O13
1 0 1
1 0 1
1 1 0
1 0 1
0 0 0
0 1 0
1 1 0
1 1 0
1 1 1
0 0 0
From the selected columns, I want to calculate:
1) the sum of all the "1"s. In this example we get the value 16.
2) the number of rows containing at least one occurrence of "1". In the example above there are 8 such rows.
And lastly,
3) the ratio of the total of all "1"s to the number of rows with at least one "1".
That is: (total of all "1"s)/(rows with at least one "1").
Example: 16/8 = 2
As a start, I tried this command to select only the columns with "#O13":
awk '{for (i=1;i<=NF;i++) if (i~/#O13/); print ""}' $file2
It runs, but doesn't print the values.
This should do:
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/#O13/) a[i];next} {f=0;for (i in a) if ($i) {s++;f++};if (f) r++} END {print "number of 1="s"\nrows with 1="r"\nratio="s/r}' file
number of 1=16
rows with 1=8
ratio=2
A more readable version:
awk '
NR==1{
for (i=1;i<=NF;i++)
if ($i~/#O13/)
a[i]
next
}
{
f=0
for (i in a)
if ($i=="1") {
s++
f++
}
if (f) r++
}
END {
print "number of 1="s \
"\nrows with 1="r \
"\nratio="s/r
}
' file