Sort associative array with AWK - awk

Here's my array (gawk script) :
myArray["peter"] = 32
myArray["bob"] = 5
myArray["john"] = 463
myArray["jack"] = 11
After sort, I need the following result :
bob 5
jack 11
peter 32
john 463
When i use "asort", indices are lost. How to sort by array value without losing indices ? (I need ordered indices based on their values)
(I need to obtain this result with awk/gawk only, not shell script, perl, etc)
If my post isn't clear enough, here is an other post explaining the same issue : http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Q_26626841.html )
Thanks in advance
Update :
Thanks to you both, but i need to sort by values, not indices (i want ordered indices according to their values).
In other terms, i need this result :
bob 5
jack 11
peter 32
john 463
not :
bob 5
jack 11
john 463
peter 32
(I agree, my example is confusing, the chosen values are pretty bad)
From the code of Catcall, I wrote a quick implementation that works, but it's rather ugly (I concatenate keys & values before sort and split during comparison). Here's what it looks like :
function qsort(A, left, right, i, last) {
if (left >= right)
return
swap(A, left, left+int((right-left+1)*rand()))
last = left
for (i = left+1; i <= right; i++)
if (getPart(A[i], "value") < getPart(A[left], "value"))
swap(A, ++last, i)
swap(A, left, last)
qsort(A, left, last-1)
qsort(A, last+1, right)
}
function swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}
function getPart(str, part) {
if (part == "key")
return substr(str, 1, index(str, "#")-1)
if (part == "value")
return substr(str, index(str, "#")+1, length(str))+0
return
}
BEGIN { }
{ }
END {
myArray["peter"] = 32
myArray["bob"] = 5
myArray["john"] = 463
myArray["jack"] = 11
for (key in myArray)
sortvalues[j++] = key "#" myArray[key]
qsort(sortvalues, 0, length(myArray));
for (i = 1; i <= length(myArray); i++)
print getPart(sortvalues[i], "key"), getPart(sortvalues[i], "value")
}
Of course I'm interested if you have something more clean...
Thanks for your time

Edit:
Sort by values
Oh! To sort the values, it's a bit of a kludge, but you can create a temporary array using a concatenation of the values and the indices of the original array as indices in the new array. Then you can asorti() the temporary array and split the concatenated values back into indices and values. If you can't follow that convoluted description, the code is much easier to understand. It's also very short.
# right justify the integers into space-padded strings and cat the index
# to create the new index
for (i in myArray) tmpidx[sprintf("%12s", myArray[i]),i] = i
num = asorti(tmpidx)
j = 0
for (i=1; i<=num; i++) {
split(tmpidx[i], tmp, SUBSEP)
indices[++j] = tmp[2] # tmp[2] is the name
}
for (i=1; i<=num; i++) print indices[i], myArray[indices[i]]
Edit 2:
If you have GAWK 4, you can traverse the array by order of values without performing an explicit sort:
#!/usr/bin/awk -f
BEGIN {
myArray["peter"] = 32
myArray["bob"] = 5
myArray["john"] = 463
myArray["jack"] = 11
PROCINFO["sorted_in"] = "#val_num_asc"
for (i in myArray) {
{print i, myArray[i]}}
}
}
There are settings for traversing by index or value, ascending or descending and other options. You can also specify a custom function.
Previous answer:
Sort by indices
If you have an AWK, such as gawk 3.1.2 or greater, which supports asorti():
#!/usr/bin/awk -f
BEGIN {
myArray["peter"] = 32
myArray["bob"] = 5
myArray["john"] = 463
myArray["jack"] = 11
num = asorti(myArray, indices)
for (i=1; i<=num; i++) print indices[i], myArray[indices[i]]
}
If you don't have asorti():
#!/usr/bin/awk -f
BEGIN {
myArray["peter"] = 32
myArray["bob"] = 5
myArray["john"] = 463
myArray["jack"] = 11
for (i in myArray) indices[++j] = i
num = asort(indices)
for (i=1; i<=num; i++) print i, indices[i], myArray[indices[i]]
}

Use the Unix sort command with the pipe, keeps Awk code simple and follow Unix philosophy
Create a input file with values seperated by comma
peter,32
jack,11
john,463
bob,5
Create a sort.awk file with the code
BEGIN { FS=","; }
{
myArray[$1]=$2;
}
END {
for (name in myArray)
printf ("%s,%d\n", name, myArray[name]) | "sort -t, -k2 -n"
}
Run the program, should give you the output
$ awk -f sort.awk data
bob,5
jack,11
peter,32
john,463

PROCINFO["sorted_in"] = "#val_num_desc";
Before iterating an array, use the above statement. But, it works in awk version 4.0.1. It does not work in awk version 3.1.7.
I am not sure in which intermediate version, it got introduced.

And the simple answer...
function sort_by_myArray(i1, v1, i2, v2) {
return myArray[i2] < myArray[i1];
}
BEGIN {
myArray["peter"] = 32;
myArray["bob"] = 5;
myArray["john"] = 463;
myArray["jack"] = 11;
len = length(myArray);
asorti(myArray, k, "sort_by_myArray");
# Print result.
for(n = 1; n <= len; ++n) {
print k[n], myArray[k[n]]
}
}

Use asorti:
#!/usr/bin/env -S gawk -f
{
score[$1] = $0;
array[sprintf("%3s",$2) $1] = $1;
}
END {
asorti(array, b)
for(i in b)
{
name = array[b[i]]
print score[name]
}
}

The following function works in Gawk 3.1.7 and doesn't resort to any of the workarounds described above (no external sort command, no array subscripts, etc.) It's just a basic implementation of the insertion sort algorithm adapted for associative arrays.
You pass it an associative array to sort on the values and an empty array to populate with the corresponding keys.
myArray["peter"] = 32;
myArray["bob"] = 5;
myArray["john"] = 463;
myArray["jack"] = 11;
len = resort( myArray, result );
for( i = 1; i <= len; i++ ) {
key = result[ i ];
print i ": " key " = " myArray[ key ];
}
Here is the implementation:
function resort( data, list,
key, len, i, j, v )
{
len = 0;
for( key in data ) {
list[ ++len ] = key;
}
# insertion sort algorithm adapted for
# one-based associative arrays in gawk
for( i = 2; i <= len; i++ )
{
v = list[ i ];
for( j = i - 1; j >= 1; j-- )
{
if( data[ list[ j ] ] <= data[ v ] )
break;
list[ j + 1 ] = list[ j ];
}
list[ j + 1 ] = v;
}
return len;
}
You can "reverse sort" by simply iterating the resulting array in descending order.

Related

Faster algorithm/language for string replacement [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
EDIT:
Method 3 provided below is way faster by testing, reduce the estimated runtime from 2-3 days to < 1 day.
I had a sample file with a long string >50M like this.
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtgtgGCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatataGCCCTAGGtcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCtggtgtgtggggttagggAtagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGgtgtCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
For everything substring with length k = 50 (which means there are
length(file)-k+1 substring)
if the A||T||a||t (Upper & Lower case) is >40%,
replace every character in that substring with N or n (preserving
case).
Sample output:
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnggttaNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGNNnnnnnnnnnnnnnnnnnnNnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGnnnnNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnngNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
I was using AWK in command line for ease, but it just runs extremely slow with string replacement... and consume only <5% CPU somehow
Code: https://repl.it/#hey0wing/DutifulHandsomeMisrac-2
# Method 1
cat chr22.fa | head -n1 > chr22.masked.fa
cat chr22.fa | tail -n+2 | awk -v k=100 -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
while (i <= length($0)-k+1) {
x = substr($0, i, k)
if (i == 1) {
rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
} else {
prevx = substr($0,i-1,1)
if (prevx == "A" || prevx == "a" || prevx == "T" || prevx == "t")
rate -= 1
nextx = substr(x,k,1)
if (nextx == "A" || nextx == "a" || nextx == "T" || nextx == "t")
rate += 1
}
if (rate>r*k/100) {
h++
highGC[i] = i
}
printf("index-r:%f%% high-AT:%d \r",i/(length($0)-k+1)*100,h)
i += 1
}
printf("index-r:%f%% high-AT:%d\n\n",i/(length($0)-k+1)*100,h)
for (j in highGC) {
y = highGC[j]
SUB++
printf("sub-r:%f%% \r",SUB/h*100)
x = substr($0, y, k)
gsub (/[AGCT]/,"N",x)
gsub (/[agct]/,"n",x)
$0 = substr($0,1,y-1) x substr($0,y+k)
}
printf("sub-r:%f%%\nsubstituted:%d\n\n",SUB/h*100,SUB)
printf("%s",$0) >> "chr22.masked.fa"
}'
# Method 2
cat chr22.fa | head -n1 > chr22.masked2.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
h = 0;
while (i<=length($0)-k+1) {
x = substr($0, i, k)
rate = gsub(/[ATX]/,"X",x) + gsub(/[atx]/,"x",x)
if (rate>r/k*100) {
h++
gsub (/[GC]/,"N",x)
gsub (/[gc]/,"n",x)
$0 = substr($0,1,i-1) x substr($0,i+k)
}
printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
i += 1
}
gsub (/X/,"N",$0)
gsub (/x/,"n",$0)
printf("index-r:%f%% sub-r:%f%% \n",i/(length($0)-k+1)*100,h/544*100)
printf("%s",$0) >> "chr22.masked2.fa"
}'
# Method 3
cat chr22.fa | head -n1 > chr22.masked3.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
h = 0;
while (i <= length($0)-k+1) {
x = substr($0, i, k)
rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
if (rate>r/k*100) {
h++
gsub(/[ACGT]/,"N",x)
gsub(/[acgt]/,"n",x)
if (i == 1) {
s = x
} else {
s = substr(s,1,length(s)-k+1) x
}
} else {
if (i == 1) {
s = x
} else {
s = s substr(x,k,1)
}
}
printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
i += 1
}
printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
printf("%s",s) >> "chr22.masked3.fa"
}'
The estimated runtime is around 2-3 days ...
Are there any faster algorithm for this problem? If no, are there any language can perform string replacement faster?
More info:
the AWK command consume ~30% CPU at WSL & GitBash, but only ~5% on windows cmd with an OpenSSH client, where the progress rate is similar
Okay, there's an O(n) solution that involves a sliding window on to your data set. The following algorithm should suffice:
set window to ""
while true:
if window is "":
read k characters into window, exit while if less available
set atCount to number of characters in window matching "AaTt".
if atCount > 40% of k:
for each char in window:
if char uppercase:
output "N"
else:
output "n"
window = ""
else:
if first character of window matches "AaTt":
decrease atCount
remove first character of window
read next character into end of window, exit while if none available
if last character of window matches "AaTt":
increase atCount
What this does is to run a sliding window through your data, at each point testing if the proportion of AaTt characters in that window is more than 40%.
If so, it outputs the desired Nn characters and reloads the next k-sized window.
If it's not over 40%, it removes the first character in the windows and adds the next one to the end, adjusting the count of AaTt characters correctly.
If, at any point, there aren't enough characters left to satisfy a check (k when loading a full window, or 1 when sliding), it exits the loop.
Try some perl:
perl -slpe '
my $len = length;
for (my $i = 0; $i < $len; $i += $k) {
my $substring = substr($_, $i, $k);
my $count = $substring =~ tr/aAtT/aAtT/;
if ($count >= $k * $threshold) {
$substring =~ s/[[:lower:]]/n/g;
$substring =~ s/[[:upper:]]/N/g;
substr($_, $i, $k) = $substring;
}
}
' -- -k=50 -threshold=0.4 file

Passing arrays in awk function

I want to write a function which accepts two arguments one is a constant values and another is an array. The function finds the index of the element in the arrays and returns it.I want to call this function with multiple arrays just as below what I have tried.
BEGIN{
a[1]=2;
a[2]=4;
a[3]=3;
b[1]=4;
b[2]=2;
b[3]=6;
c[1]=5;
c[2]=1;
c[3]=6;
arr[1]=a;
arr[2]=b;
arr[3]=c
}
function pos(val,ar[]) {
for (m=1;m<=length(ar);m++) { if (val == ar[m] )
return m;
else continue }
}
{for( k=1;k<=NF;k++) { for(l=1;l<=length(arr);l++) { print "pos=" pos($i,arr[l])} } }
but I am getting errors :
fatal: attempt to use array `a' in a scalar context
Looking at the code can anyone tell me how can I achieve what I am trying to achieve using awk. The challenge I have here is to assign and array as an element to another array as in arr[1]=a and passing the the array as a parameter by referencing it with its index as in pos($i,arr[l] . I dont know how to make these statements syntactically and functionally correct in awk .
the input is :
2 4 6
3 5 6
1 2 5
and in the out put the code should return the position of the value read from the file if it is present in any of the arrays defined
output:
1 1 3
6
2 1
in first line of output indexed of corresponding elements in the array a b and c have been returned respectively . 1 is index of 2 in a , 1 is index of 4 in b and 3 is index of 6 in c and so on for the upcoming lines in the input file.
I truly don't understand what it is you're trying to do (especially why an input of 2 produces the index from a but not the index from b while an input of 4 does the reverse) but to create a multi-dimensional array arr[][] from a[], b[], and c[] with GNU awk (the only awk that supports true multi-dimensional arrays) would be:
for (i in a) arr[1][i] = a[i]
for (i in b) arr[2][i] = b[i]
for (i in c) arr[3][i] = c[i]
not just arr[1] = a, etc. Note that you're storing a copy of the contents of a[] in arr[1][], not a reference to a[], so if a[] changes then arr[1][] won't. What you might want to do instead (again GNU awk only) is store the sub-array names in arr[] and then access them through the builtin variable SYMTAB (see the man page), e.g.:
$ cat tst.awk
BEGIN{
split("2 4 3",a)
split("4 2 6",b)
split("5 1 6",c)
arr[1] = "a"
arr[2] = "b"
arr[3] = "c"
prtArr(arr)
}
function prtArr(arr, i,subArrName) {
for (i=1; i in arr; i++) {
subArrName = arr[i]
printf "arr[%d] -> %s[] =\n", i, subArrName
prtSubArr(SYMTAB[subArrName])
}
}
function prtSubArr(subArr, j) {
for (j=1; j in subArr; j++) {
print "\t" subArr[j]
}
}
.
$ awk -f tst.awk
arr[1] -> a[] =
2
4
3
arr[2] -> b[] =
4
2
6
arr[3] -> c[] =
5
1
6
Now arr[] is no longer a multi-dimensional array, it's just an array of array name strings, and the contents of a[] are only stored in 1 place (in a[]) and just referenced from SYMTAB[] indexed by the contents of arr[N] rather than copied into arr[N][].

awk count selective combinations only:

Would like to read and count the field value == "TRUE" only from 3rd field to 5th field.
Input.txt
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,FALSE,TRUE,ab1234
ab123,Name2,FALSE,FALSE,TRUE,ab1234
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,ab1234
ab123,Name3,FALSE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,FALSE,ab1234
While reading the headers from 3rd field to 5th field , i,e A, B, C want to generate unique combinations and permutations like A,B,C,AB,AC,AB,ABC only.
Note: AA, BB, CC, BA etc excluded
If the "TRUE" is considered for "AB" combination count then it should not be considered for "A" conut & "B" count again to avoid duplicate ..
Example#1
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,TRUE,ab1234
Op#1
Desc,A,B,C,AB,AC,BC,ABC
Name1,,,,,,,1
Example#2
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,FALSE,ab1234
Op#2
Desc,A,B,C,AB,AC,BC,ABC
Name1,,,,1,,,
Example#3
Locationx,Desc,A,B,C,Locationy
ab123,Name1,FALSE,TRUE,FALSE,ab1234
Op#3
Desc,A,B,C,AB,AC,BC,ABC
Name1,,1,,,,,
Desired Output:
Desc,A,B,C,AB,AC,BC,ABC
Name1,,,,1,,,2
Name2,,,1,,1,,1
Name3,1,,,2,,,
Actual File is like below :
Input.txt
Locationx,Desc,INCOMING,OUTGOING,SMS,RECHARGE,DEBIT,DATA,Locationy
ab123,Name1,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,ab1234
ab123,Name2,TRUE,TRUE,FALSE,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,ab1234
ab123,Name3,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,ab1234
Have tried lot , nothing is materialised , any suggestions please !!!
Edit: Desired Output from Actual Input:
Desc,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DEBIT-DATA,INCOMING-SMS-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT,SMS-RECHARGE-DEBIT-DATA,OUTGOING-RECHARGE-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DATA,OUTGOING-SMS-RECHARGE-DEBIT,INCOMING-RECHARGE-DEBIT-DATA,INCOMING-SMS-DEBIT-DATA,INCOMING-SMS-RECHARGE-DATA,INCOMING-SMS-RECHARGE-DEBIT,INCOMING-OUTGOING-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT,INCOMING-OUTGOING-SMS-DATA,INCOMING-OUTGOING-SMS-DEBIT,INCOMING-OUTGOING-SMS-RECHARGE,RECHARGE-DEBIT-DATA,SMS-DEBIT-DATA,SMS-RECHARGE-DATA,SMS-RECHARGE-DEBIT,OUTGOING-RECHARGE-DATA,OUTGOING-RECHARGE-DEBIT,OUTGOING-SMS-DATA,OUTGOING-SMS-DEBIT,OUTGOING-SMS-RECHARGE,INCOMING-DEBIT-DATA,INCOMING-RECHARGE-DATA,INCOMING-RECHARGE-DEBIT,INCOMING-SMS-DATA,INCOMING-SMS-DEBIT,INCOMING-SMS-RECHARGE,INCOMING-OUTGOING-DATA,INCOMING-OUTGOING-DEBIT,INCOMING-OUTGOING-RECHARGE,INCOMING-OUTGOING-SMS,DEBIT-DATA,RECHARGE-DATA,RECHARGE-DEBIT,SMS-DATA,SMS-DEBIT,SMS-RECHARGE,OUTGOING-DATA,OUTGOING-DEBIT,OUTGOING-RECHARGE,OUTGOING-SMS,INCOMING-DATA,INCOMING-DEBIT,INCOMING-RECHARGE,INCOMING-SMS,INCOMING-OUTGOING,DATA,DEBIT,RECHARGE,SMS,OUTGOING,INCOMING
Name1,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,,,,,,,,,,
Name2,,,,1,1,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Name3,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,,,,,,,,,,,,,,,,,,,1,,,
Don't have Perl and Python access !!!
I have written a perl script that does this for you. As you can see from the size and comments, it is really simple to get this done.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Algorithm::Combinatorics qw(combinations);
## change the file to the path where your file exists
open my $fh, '<', 'file';
my (%data, #new_labels);
## capture the header line in an array
my #header = split /,/, <$fh>;
## backup the header
my #fields = #header;
## remove first, second and last columns
#header = splice #header, 2, -1;
## generate unique combinations
for my $iter (1 .. +#header) {
my $combination = combinations(\#header, $iter);
while (my $pair = $combination->next) {
push #new_labels, "#$pair";
}
}
## iterate through rest of the file
while(my $line = <$fh>) {
my #line = split /,/, $line;
## identify combined labels that are true
my #is_true = map { $fields[$_] } grep { $line[$_] eq "TRUE" } 0 .. $#line;
## increment counter in hash map keyed at description and then new labels
++$data{$line[1]}{$_} for map { s/ /-/g; $_ } "#is_true";
}
## print the new header
print join ( ",", "Desc", map {s/ /-/g; $_} reverse #new_labels ) . "\n";
## print the description and counter values
for my $desc (sort keys %data){
print join ( ",", $desc, ( map { $data{$desc}{$_} //= "" } reverse #new_labels ) ) . "\n";
}
Output:
Desc,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DEBIT-DATA,INCOMING-SMS-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-SMS-DEBIT-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT,SMS-RECHARGE-DEBIT-DATA,OUTGOING-RECHARGE-DEBIT-DATA,OUTGOING-SMS-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DATA,OUTGOING-SMS-RECHARGE-DEBIT,INCOMING-RECHARGE-DEBIT-DATA,INCOMING-SMS-DEBIT-DATA,INCOMING-SMS-RECHARGE-DATA,INCOMING-SMS-RECHARGE-DEBIT,INCOMING-OUTGOING-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT,INCOMING-OUTGOING-SMS-DATA,INCOMING-OUTGOING-SMS-DEBIT,INCOMING-OUTGOING-SMS-RECHARGE,RECHARGE-DEBIT-DATA,SMS-DEBIT-DATA,SMS-RECHARGE-DATA,SMS-RECHARGE-DEBIT,OUTGOING-DEBIT-DATA,OUTGOING-RECHARGE-DATA,OUTGOING-RECHARGE-DEBIT,OUTGOING-SMS-DATA,OUTGOING-SMS-DEBIT,OUTGOING-SMS-RECHARGE,INCOMING-DEBIT-DATA,INCOMING-RECHARGE-DATA,INCOMING-RECHARGE-DEBIT,INCOMING-SMS-DATA,INCOMING-SMS-DEBIT,INCOMING-SMS-RECHARGE,INCOMING-OUTGOING-DATA,INCOMING-OUTGOING-DEBIT,INCOMING-OUTGOING-RECHARGE,INCOMING-OUTGOING-SMS,DEBIT-DATA,RECHARGE-DATA,RECHARGE-DEBIT,SMS-DATA,SMS-DEBIT,SMS-RECHARGE,OUTGOING-DATA,OUTGOING-DEBIT,OUTGOING-RECHARGE,OUTGOING-SMS,INCOMING-DATA,INCOMING-DEBIT,INCOMING-RECHARGE,INCOMING-SMS,INCOMING-OUTGOING,DATA,DEBIT,RECHARGE,SMS,OUTGOING,INCOMING
Name1,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,,,,,,,,,,
Name2,,,,1,,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Name3,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,,,,,,,,,,,,,,,,,,,1,,,
Note: Please revisit your expected output. It has few mistakes in it as you can see from the output generated from the script above.
Here is an attempt at solving this using awk:
Content of script.awk
BEGIN { FS = OFS = "," }
function combinations(flds, itr, i, pre) {
for (i=++cnt; i<=numRecs; i++) {
++n
sep = ""
for (pre=1; pre<=itr; pre++) {
newRecs[n] = newRecs[n] sep (sprintf ("%s", flds[pre]));
sep = "-"
}
newRecs[n] = newRecs[n] sep (sprintf ("%s", flds[i])) ;
}
}
NR==1 {
for (fld=3; fld<NF; fld++) {
recs[++numRecs] = $fld
}
for (iter=0; iter<numRecs; iter++) {
combinations(recs, iter)
}
next
}
!seen[$2]++ { desc[++d] = $2 }
{
y = 0;
var = sep = ""
for (idx=3; idx<NF; idx++) {
if ($idx == "TRUE") {
is_true[++y] = recs[idx-2]
}
}
for (z=1; z<=y; z++) {
var = var sep sprintf ("%s", is_true[z])
sep = "-"
}
data[$2,var]++;
}
END{
printf "%s," , "Desc"
for (k=1; k<=n; k++) {
printf "%s%s", newRecs[k],(k==n?RS:FS)
}
for (name=1; name<=d; name++) {
printf "%s,", desc[name];
for (nR=1; nR<=n; nR++) {
printf "%s%s", (data[desc[name],newRecs[nR]]?data[desc[name],newRecs[nR]]:""), (nR==n?RS:FS)
}
}
}
Sample file
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,FALSE,TRUE,ab1234
ab123,Name2,FALSE,FALSE,TRUE,ab1234
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,ab1234
ab123,Name3,FALSE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,FALSE,ab1234
Execution:
$ awk -f script.awk file
Desc,A,B,C,A-B,A-C,A-B-C
Name1,,,,1,,2
Name2,,,1,,1,1
Name3,1,,,2,,
Now, there is pretty evident bug in the combination function. It does not recurse to print all combinations. For eg: for A B C D it will print
A
B
C
AB
AC
ABC
but not BC

calculating the mean of columns in text files

I have two folders named f1 and f2. These folders contain 300 text files with 2 columns.The content of files are shown below.I would like to calculate the mean of second column.file names are same in both folders.
file1 in f1 folder
54 6
55 10
57 5
file2 in f1 folder
24 8
28 12
file1 in f2 folder
34 3
22 8
file2 in f2 folder
24 8
28 13
output
folder1 folder2
file1 21/3= 7 11/2=5.5
file2 20/2=10 21/2=10.5
-- -- --
-- -- --
file300 -- --
total mean of folder1 = sum of the means/3oo
total mean of folder2 = sum of the means/3oo
I'd do it with two awk scripts. (Originally, I had a sort phase in the middle, but that isn't actually necessary. However, I think that two scripts is probably easier than trying to combine them into one. If someone else does it 'all in one' and it is comprehensible, then choose their solution instead.)
Sample run and output
This is based on the 4 files shown in the question. The names of the files are listed on the command line, but the order doesn't matter. The code assumes that there is only one slash in the file names, and no spaces and the like in the file names.
$ awk -f summary1.awk f?/* | awk -f summary2.awk
file1 21/3 = 7.000 11/2 = 5.500
file2 20/2 = 10.000 21/2 = 10.500
total mean of f1 = 17/2 = 8.500
total mean of f2 = 16/2 = 8.000
summary1.awk
function print_data(file, sum, count) {
sub("/", " ", file);
print file, sum, count;
}
oldfile != FILENAME { if (count > 0) { print_data(oldfile, sum, count); }
count = 0; sum = 0; oldfile = FILENAME
}
{ count++; sum += $2 }
END { print_data(oldfile, sum, count) }
This processes each file in turn, summing the values in column 2 and counting the number of lines. It prints out the folder name, the file name, the sum and the count.
summary2.awk
{
sum[$2,$1] = $3
cnt[$2,$1] = $4
if (file[$2]++ == 0) file_list[n1++] = $2
if (fold[$1]++ == 0) fold_list[n2++] = $1
}
END { for (i = 0; i < n1; i++)
{
printf("%-20s", file_list[i])
name = file_list[i]
for (j = 0; j < n2; j++)
{
folder = fold_list[j]
s = sum[name,folder]
n = cnt[name,folder]
a = (s + 0.0) / n
printf(" %6d/%-3d = %10.3f", s, n, a)
gsum[folder] += a
}
printf("\n")
}
for (i = 0; i < n2; i++)
{
folder = fold_list[i]
s = gsum[folder]
n = n1;
a = (s + 0.0) / n
printf("total mean of %-6s = %6d/%-3d = %10.3f\n", folder, s, n, a)
}
}
The file associative array tracks references to file names. The file_list array keeps the file names in the order that they're read. Similarly, the fold associative array tracks the folder names, and the fold_list array keeps track of the folder names in the order that they appear. If you do something weird enough with the order that you supply the names to the first command, you may need to insert a sort command between the two awk commands, such as sort -k2,2 -k1,1.
The sum associative array contains the sum for a given file name and folder name. The cnt associative array contains the count for a given file name and folder name.
The END section of the report has two main loops (though the first loop contains a nested loop). The first main loop processes the files in the order presented, generating one line containing one entry for each folder. It also accumulates the averages for the folder name.
The second main loop generates the 'total mean` data for each folder. I'm not sure whether the statistics makes sense (shouldn't the overall mean for folder1 be the sum of the values in folder1 divided by the number of entries, or 41/5 = 8.2 rather than 17/2 or 8.5?), but the calculation does what I think the question asks for (sum of means / number of files, written as 300 in the question).
With some help from grep:
grep '[0-9]' folder[12]/* | awk '
{
split($0,b,":");
f=b[1]; split(f,c,"/"); d=c[1]; f=c[2];
s[f][d]+=$2; n[f][d]++; nn[d]++;}
END{
for (f in s) {
printf("%-10s", f);
for (d in s[f]) {
a=s[f][d] / n[f][d];
printf(" %6.2f ", a);
p[d] += a;
}
printf("\n");
}
for (d in p) {
printf("total mean %-8s = %8.2f\n", d, p[d]/nn[d]);
}
}'

How to get total column from awk

I am testing awk and got this thought .So we know that
raja#badfox:~/Desktop/trails$ cat num.txt
1 2 3 4
1 2 3 4
4 1 2 31
raja#badfox:~/Desktop/trails$ awk '{ if ($1 == '4') print $0}' num.txt
4 1 2 31
raja#badfox:~/Desktop/trails$
so the command going to check for 4 at 1st column in filename num.txt .
So now i want output as there is a 4 at column 4 also and for example if i have 100 column of information and i want get the output as how many columns i have with the term i am searching .
I mean from the above example i want column 4 and column 1 as the output and i am searching for 4 .
If you are trying to find the rows which contain your search item (in this case, the value 4), and you want a count of how many such values appear in the row (as well as the row's data), then you need something like:
awk '{ count=0
for (i = 1; i <= NF; i++) if ($i == 4) count++
if (count) print $i ": " $0
}'
That doesn't quite fit onto one SO line.
If you merely want to identify which columns contain the search value, then you can use:
awk '{ for (i = 1; i <= NF; i++) if ($i == 4) column[i] = 1 }
END { for (i in column) print i }'
This sets the (associative) array element column[i] to 1 for each column that contains the search value, 4. The loop at the end prints the column numbers that contain 4 in an indeterminate (not sorted) order. GNU awk includes sort functions (asort, asorti); POSIX awk does not. If sorted order is crucial, then consider:
awk 'BEGIN { max = 0 }
{ for (i = 1; i <= NF; i++) if ($i == 4) { column[i] = 1; if (i > max) max = i } }
END { for (i = 1; i <= max; i++) if (column[i] == 1) print i }'
You are looking for the NF variable. It's the number of fields in the line.
Here's an example of how to use it:
{
if (NF == 8) {
print $3, $8;
} else if (NF == 9) {
print $3, $9;
}
}
Or inside a loop:
# This will print the line if any field has the value 4
for (i=1; i<=NF; i++) {
if ($i == 4)
print $0
}