awk: count selective field combinations only
I would like to read and count fields whose value == "TRUE", but only from the 3rd field to the 5th field.
Input.txt
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,FALSE,TRUE,ab1234
ab123,Name2,FALSE,FALSE,TRUE,ab1234
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,ab1234
ab123,Name3,FALSE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,FALSE,ab1234
While reading the headers from the 3rd field to the 5th field, i.e. A, B, C, I want to generate only the unique in-order combinations A, B, C, AB, AC, BC, ABC.
Note: AA, BB, CC, BA etc. are excluded.
If a row's TRUE values are counted towards the "AB" combination, they should not be counted again towards the "A" count and the "B" count, to avoid duplicates.
Example#1
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,TRUE,ab1234
Op#1
Desc,A,B,C,AB,AC,BC,ABC
Name1,,,,,,,1
Example#2
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,FALSE,ab1234
Op#2
Desc,A,B,C,AB,AC,BC,ABC
Name1,,,,1,,,
Example#3
Locationx,Desc,A,B,C,Locationy
ab123,Name1,FALSE,TRUE,FALSE,ab1234
Op#3
Desc,A,B,C,AB,AC,BC,ABC
Name1,,1,,,,,
Desired Output:
Desc,A,B,C,AB,AC,BC,ABC
Name1,,,,1,,,2
Name2,,,1,,1,,1
Name3,1,,,2,,,
Actual File is like below :
Input.txt
Locationx,Desc,INCOMING,OUTGOING,SMS,RECHARGE,DEBIT,DATA,Locationy
ab123,Name1,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,ab1234
ab123,Name2,TRUE,TRUE,FALSE,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,ab1234
ab123,Name3,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,ab1234
I have tried a lot, but nothing has materialised. Any suggestions, please!
Edit: Desired Output from Actual Input:
Desc,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DEBIT-DATA,INCOMING-SMS-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT,SMS-RECHARGE-DEBIT-DATA,OUTGOING-RECHARGE-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DATA,OUTGOING-SMS-RECHARGE-DEBIT,INCOMING-RECHARGE-DEBIT-DATA,INCOMING-SMS-DEBIT-DATA,INCOMING-SMS-RECHARGE-DATA,INCOMING-SMS-RECHARGE-DEBIT,INCOMING-OUTGOING-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT,INCOMING-OUTGOING-SMS-DATA,INCOMING-OUTGOING-SMS-DEBIT,INCOMING-OUTGOING-SMS-RECHARGE,RECHARGE-DEBIT-DATA,SMS-DEBIT-DATA,SMS-RECHARGE-DATA,SMS-RECHARGE-DEBIT,OUTGOING-RECHARGE-DATA,OUTGOING-RECHARGE-DEBIT,OUTGOING-SMS-DATA,OUTGOING-SMS-DEBIT,OUTGOING-SMS-RECHARGE,INCOMING-DEBIT-DATA,INCOMING-RECHARGE-DATA,INCOMING-RECHARGE-DEBIT,INCOMING-SMS-DATA,INCOMING-SMS-DEBIT,INCOMING-SMS-RECHARGE,INCOMING-OUTGOING-DATA,INCOMING-OUTGOING-DEBIT,INCOMING-OUTGOING-RECHARGE,INCOMING-OUTGOING-SMS,DEBIT-DATA,RECHARGE-DATA,RECHARGE-DEBIT,SMS-DATA,SMS-DEBIT,SMS-RECHARGE,OUTGOING-DATA,OUTGOING-DEBIT,OUTGOING-RECHARGE,OUTGOING-SMS,INCOMING-DATA,INCOMING-DEBIT,INCOMING-RECHARGE,INCOMING-SMS,INCOMING-OUTGOING,DATA,DEBIT,RECHARGE,SMS,OUTGOING,INCOMING
Name1,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,,,,,,,,,,
Name2,,,,1,1,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Name3,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,,,,,,,,,,,,,,,,,,,1,,,
I don't have Perl or Python access!
I have written a Perl script that does this for you. As you can see from its size and comments, it is really simple to get this done.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Algorithm::Combinatorics qw(combinations);
## change the file to the path where your file exists
open my $fh, '<', 'file';
my (%data, @new_labels);
## capture the header line in an array
my @header = split /,/, <$fh>;
## back up the header
my @fields = @header;
## remove first, second and last columns
@header = splice @header, 2, -1;
## generate unique combinations
for my $iter (1 .. @header) {
    my $combination = combinations(\@header, $iter);
    while (my $pair = $combination->next) {
        push @new_labels, "@$pair";
    }
}
## iterate through rest of the file
while (my $line = <$fh>) {
    my @line = split /,/, $line;
    ## identify combined labels that are true
    my @is_true = map { $fields[$_] } grep { $line[$_] eq "TRUE" } 0 .. $#line;
    ## increment counter in hash map keyed at description and then new labels
    ++$data{$line[1]}{$_} for map { s/ /-/g; $_ } "@is_true";
}
## print the new header
print join( ",", "Desc", map { s/ /-/g; $_ } reverse @new_labels ) . "\n";
## print the description and counter values
for my $desc (sort keys %data) {
    print join( ",", $desc, ( map { $data{$desc}{$_} //= "" } reverse @new_labels ) ) . "\n";
}
Output:
Desc,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DEBIT-DATA,INCOMING-SMS-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT-DATA,INCOMING-OUTGOING-SMS-DEBIT-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DATA,INCOMING-OUTGOING-SMS-RECHARGE-DEBIT,SMS-RECHARGE-DEBIT-DATA,OUTGOING-RECHARGE-DEBIT-DATA,OUTGOING-SMS-DEBIT-DATA,OUTGOING-SMS-RECHARGE-DATA,OUTGOING-SMS-RECHARGE-DEBIT,INCOMING-RECHARGE-DEBIT-DATA,INCOMING-SMS-DEBIT-DATA,INCOMING-SMS-RECHARGE-DATA,INCOMING-SMS-RECHARGE-DEBIT,INCOMING-OUTGOING-DEBIT-DATA,INCOMING-OUTGOING-RECHARGE-DATA,INCOMING-OUTGOING-RECHARGE-DEBIT,INCOMING-OUTGOING-SMS-DATA,INCOMING-OUTGOING-SMS-DEBIT,INCOMING-OUTGOING-SMS-RECHARGE,RECHARGE-DEBIT-DATA,SMS-DEBIT-DATA,SMS-RECHARGE-DATA,SMS-RECHARGE-DEBIT,OUTGOING-DEBIT-DATA,OUTGOING-RECHARGE-DATA,OUTGOING-RECHARGE-DEBIT,OUTGOING-SMS-DATA,OUTGOING-SMS-DEBIT,OUTGOING-SMS-RECHARGE,INCOMING-DEBIT-DATA,INCOMING-RECHARGE-DATA,INCOMING-RECHARGE-DEBIT,INCOMING-SMS-DATA,INCOMING-SMS-DEBIT,INCOMING-SMS-RECHARGE,INCOMING-OUTGOING-DATA,INCOMING-OUTGOING-DEBIT,INCOMING-OUTGOING-RECHARGE,INCOMING-OUTGOING-SMS,DEBIT-DATA,RECHARGE-DATA,RECHARGE-DEBIT,SMS-DATA,SMS-DEBIT,SMS-RECHARGE,OUTGOING-DATA,OUTGOING-DEBIT,OUTGOING-RECHARGE,OUTGOING-SMS,INCOMING-DATA,INCOMING-DEBIT,INCOMING-RECHARGE,INCOMING-SMS,INCOMING-OUTGOING,DATA,DEBIT,RECHARGE,SMS,OUTGOING,INCOMING
Name1,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,,,,,,,,,,
Name2,,,,1,,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Name3,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,,,,,,,,,,,,,,,,,,,1,,,
Note: please revisit your expected output. It has a few mistakes in it, as you can see from the output generated by the script above.
Here is an attempt at solving this using awk:
Content of script.awk
BEGIN { FS = OFS = "," }
function combinations(flds, itr,    i, pre) {
    for (i = ++cnt; i <= numRecs; i++) {
        ++n
        sep = ""
        for (pre = 1; pre <= itr; pre++) {
            newRecs[n] = newRecs[n] sep sprintf("%s", flds[pre])
            sep = "-"
        }
        newRecs[n] = newRecs[n] sep sprintf("%s", flds[i])
    }
}
NR==1 {
    for (fld = 3; fld < NF; fld++) {
        recs[++numRecs] = $fld
    }
    for (iter = 0; iter < numRecs; iter++) {
        combinations(recs, iter)
    }
    next
}
!seen[$2]++ { desc[++d] = $2 }
{
    y = 0
    var = sep = ""
    for (idx = 3; idx < NF; idx++) {
        if ($idx == "TRUE") {
            is_true[++y] = recs[idx-2]
        }
    }
    for (z = 1; z <= y; z++) {
        var = var sep sprintf("%s", is_true[z])
        sep = "-"
    }
    data[$2,var]++
}
END {
    printf "%s,", "Desc"
    for (k = 1; k <= n; k++) {
        printf "%s%s", newRecs[k], (k==n ? RS : FS)
    }
    for (name = 1; name <= d; name++) {
        printf "%s,", desc[name]
        for (nR = 1; nR <= n; nR++) {
            printf "%s%s", (data[desc[name],newRecs[nR]] ? data[desc[name],newRecs[nR]] : ""), (nR==n ? RS : FS)
        }
    }
}
Sample file
Locationx,Desc,A,B,C,Locationy
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,FALSE,TRUE,ab1234
ab123,Name2,FALSE,FALSE,TRUE,ab1234
ab123,Name1,TRUE,TRUE,TRUE,ab1234
ab123,Name2,TRUE,TRUE,TRUE,ab1234
ab123,Name3,FALSE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,FALSE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name3,TRUE,TRUE,FALSE,ab1234
ab123,Name1,TRUE,TRUE,FALSE,ab1234
Execution:
$ awk -f script.awk file
Desc,A,B,C,A-B,A-C,A-B-C
Name1,,,,1,,2
Name2,,,1,,1,1
Name3,1,,,2,,
Now, there is a pretty evident bug in the combinations function: it does not recurse to produce all combinations. For example, for A B C D it will print
A
B
C
AB
AC
ABC
but not BC
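For reference, a minimal recursive generator that does emit every in-order combination, including BC, can be sketched like this (a standalone sketch, not drop-in code for the script above; the file name combos.awk and the variable names are my own):

```shell
cat > combos.awk <<'EOF'
# Emit every non-empty, order-preserving combination of the labels
# in lbl[1..n], joined with "-".
function gen(start, prefix,    i, out) {
    for (i = start; i <= n; i++) {
        out = (prefix == "" ? lbl[i] : prefix "-" lbl[i])
        print out
        gen(i + 1, out)    # recurse to extend the current prefix
    }
}
BEGIN {
    n = split(labels, lbl, ",")
    gen(1, "")
}
EOF
awk -v labels="A,B,C,D" -f combos.awk
```

For four labels this prints all 15 non-empty combinations, and B-C is among them.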
Related
Swapping / rearranging of columns and its values based on inputs using Unix scripts
Team, I have a requirement of changing/ordering the columns of CSV files based on inputs.

Example: the data file (source file) will always have standard columns and values:

PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis

These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) are standard for source files, and I am looking for a solution that reformats this and generates a new file with only the columns we prefer.

Note: the source/data file will always be comma delimited.

Example: if I pass PRODUCTCODE,BATCHID as input, then I would like only those columns and their data extracted from the source file into a new file. Something like:

script_name <output_column> <Source_File_name> <target_file_name>

Target file example:

PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1

If I pass output_column as "LV1P_DESCRIPTION,PRODUCTCODE" then the output file should be like below:

LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envios,MK3
flatsylio,MK3
bioCovis,MK3

It would be great if anyone could help with this. I have tried using some awk scripts (got them from some site), but they were not working as expected; since I don't have Unix knowledge I am finding it difficult to modify them.
awk code (saved as Remove.awk):

BEGIN {
    FS = ","
}
NR==1 {
    split(c, ca, ",")
    for (i = 1; i <= length(ca); i++) {
        gsub(/ /, "", ca[i])
        cm[ca[i]] = 1
    }
    for (i = 1; i <= NF; i++) {
        if (cm[$i] == 1) {
            cc[i] = 1
        }
    }
    if (length(cc) == 0) {
        exit 1
    }
}
{
    ci = ""
    for (i = 1; i <= NF; i++) {
        if (cc[i] == 1) {
            if (ci == "") {
                ci = $i
            } else {
                ci = ci "," $i
            }
        }
    }
    print ci
}

This is called by another script as below:

var1="BATCHID,LV2P_DESCRIPTION"   ## input field values used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:

awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file

Explanation:

awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" ' # Pass the fields to print as the variable flds
BEGIN {
    split(flds,map,",")     # Split flds into an array map using , as the delimiter
}
NR==1 {
    for (i=1;i<=NF;i++) {
        map1[$i]=i          # Loop through the header and create an array map1 with the column header as the index and the column number as the value
    }
}
{
    printf "%s",$map1[map[1]]        # Print the first field specified (index of map)
    for(i=2;i<=length(map);i++) {
        printf ",%s",$map1[map[i]]   # Loop through the other field numbers specified, printing the contents
    }
    printf "\n"
}' file
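To see the approach in action, here is a self-contained run (sample file contents taken from the question; I use the count returned by split() instead of gawk's length(array) so the sketch also works in non-GNU awks):

```shell
cat > RESULT.csv <<'EOF'
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
EOF

awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" '
BEGIN { nf = split(flds, map, ",") }             # requested columns, in order
NR==1 { for (i = 1; i <= NF; i++) col[$i] = i }  # header name -> position
{
    printf "%s", $col[map[1]]
    for (i = 2; i <= nf; i++) printf ",%s", $col[map[i]]
    printf "\n"
}' RESULT.csv
```

The NR==1 rule both records the column positions and falls through to the print block, so the reordered header row comes out first, followed by the reordered data rows.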
MAWK: Store match() in variable
I am trying to use mawk, where the match() built-in function doesn't take a third argument for a variable:

match($1, /9f7fde/) { substr($1, RSTART, RLENGTH); }

See the doc. How can I store this output into a variable named var, when later I want to construct my output like this?

EDIT2 - Complete example:

Input file structure:

<iframe src="https://vimeo.com/191081157" frameborder="0" height="481" width="608" scrolling="no"></iframe>|Random title|Uploader|fun|tag1,tag2,tag3
<iframe src="https://vimeo.com/212192268" frameborder="0" height="481" width="608" scrolling="no"></iframe>|Random title|Uploader|fun|tag1,tag2,tag3

parser.awk:

{
    Embed = $1; Title = $2; User = $3; Categories = $4; Tags = $5;
}
BEGIN { FS="|"; }
# Regexp without pattern matching for testing purposes
match(Embed, /191081157/) { Id = substr(Embed, RSTART, RLENGTH); }
{
    print Id"\t"Title"\t"User"\t"Categories"\t"Tags;
}

Expected output:

191081157|Random title|Uploader|fun|tag1,tag2,tag3

I want to use the Id variable outside the match() block.

MAWK version:

mawk 1.3.4 20160930
Copyright 2008-2015,2016, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
The obvious answer would seem to be:

match($1, /9f7fde/) { var = "9f7fde" }

But more general would be:

match($1, /9f7fde/) { var = substr($1, RSTART, RLENGTH) }
UPDATE: the solution above mine could be simplified from

match($1, /9f7fde/) { var = substr($1, RSTART, RLENGTH) }

to

{ __ = substr($!_, match($!_, "9f7fde"), RLENGTH) }

A failed match will have set RLENGTH to -1, so nothing gets substring'ed out. But even that is too verbose: since the matching criterion is a constant string, simply

mawk '$(_~_)~_{__=_}' \_='9f7fde'

============================================

Let's say the line is

.....vimeo.com/191081157" frameborder="0" height="481" width="608" scrolling="no">Random title|Uploader|fun|tag1,tag2,tag3

{mawk/mawk2/gawk} 'BEGIN { OFS = "";
    FS = "(^.+vimeo[\056]com[\057]|[\042] frameborder.+[\057]iframe[>])";
} (NF < 4) || ($2 !~ /191081157/) { next } ( $1 = $1 )'

\056 is the dot (.), \057 is the forward slash (/), and \042 is the double straight quote (").

If a line can't match at all, move on to the next row. Otherwise, use the power of the field separator to gobble away all the unneeded parts of the line. The $1 = $1 assignment collects the prefix and the rest of the HTML tags you don't need; it also returns true, providing the boolean evaluation that makes the line print. This way, you don't need either match() or substr() at all.
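Putting the accepted idea into a complete, runnable form (the file name vids.txt and the digit-run regex are my own choices; mawk, gawk and BSD awk all set RSTART/RLENGTH on match()):

```shell
cat > vids.txt <<'EOF'
<iframe src="https://vimeo.com/191081157" frameborder="0" height="481" width="608" scrolling="no"></iframe>|Random title|Uploader|fun|tag1,tag2,tag3
EOF

awk -F'|' '
match($1, /[0-9][0-9][0-9]+/) {          # leftmost run of 3+ digits: the video id
    Id = substr($1, RSTART, RLENGTH)     # store the matched text in a variable
}
{ print Id "|" $2 }                      # Id is visible in later blocks
' vids.txt
```

With the sample line this prints 191081157|Random title; the variable assigned inside the match() action remains available in every subsequent block.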
awk | Rearrange fields of CSV file on the basis of column value
I need your help in writing awk for the below problem. I have one source file and its required output.

Source file:

a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3

Output file:

session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3

Notes:

Fields are not organised in the source file.
In the output file, fields are organised in a specific format; for example, all a values are in the 2nd column, then b, then c.
The value c in the second line occurs n times, so in the output its values are merged with a PIPE symbol.

Please help.
Will work in any modern awk:

$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3

$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
    split("",val)   # or delete(val) in gawk
    for (i=1;i<NF;i+=2) {
        val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
    }
    for (i=1;i in order;i++) {
        name = order[i]
        printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
    }
    print ""
}

$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3

If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output. Note that when b was missing from the input on line 2 above, it output a null value, as you said you wanted.
Try with:

awk '
BEGIN {
    FS = "[,:]"
    OFS = ","
}
{
    for ( i = 1; i <= NF; i += 2 ) {
        if ( $i == "session" ) {
            printf "%s:%s", $i, $(i+1)
            continue
        }
        hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
    }
    asorti( hash, hash_orig )
    for ( i = 1; i <= length(hash); i++ ) {
        printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
    }
    printf "\n"
    delete hash
    delete hash_orig
}
' infile

This splits each line on any comma or colon and traverses all odd fields, saving them and their values in a hash to print at the end. It yields:

session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9
awk system not setting variables properly
I am having an issue getting the output of grep (used in system() in nawk) assigned to a variable.

nawk '{
    CITIZEN_COUNTRY_NAME = "INDIA"
    CITIZENSHIP_CODE = system("grep " CITIZEN_COUNTRY_NAME " /tmp/OFAC/country_codes.config | cut -d # -f1")
}' /tmp/*****

The value IND is displayed in the console, but when I printf the value of CITIZENSHIP_CODE it is 0. Can you please help me here?

printf("Country Tags|%s|%s\n", CITIZEN_COUNTRY_NAME, CITIZENSHIP_CODE)

Contents of the country_codes.config file:

IND#INDIA
IND#INDIB
CAN#CANADA
system returns the exit value of the called command, but the output of the command is not returned to awk (or nawk). To get the output, you want to use getline directly. For example, you might re-write your script:

awk '
{
    file = "/tmp/OFAC/country_codes.config";
    CITIZEN_COUNTRY_NAME = "INDIA";
    FS = "#";
    while( getline < file ) {
        if( $0 ~ CITIZEN_COUNTRY_NAME ) {
            CITIZENSHIP_CODE = $1;
        }
    }
    close( file );
}'
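A runnable sketch of the same getline approach (config contents from the question; I use the (getline line < file) > 0 form so a missing file cannot loop forever, and I shorten the path to the working directory):

```shell
cat > country_codes.config <<'EOF'
IND#INDIA
IND#INDIB
CAN#CANADA
EOF

awk 'BEGIN {
    file = "country_codes.config"
    name = "INDIA"
    code = ""
    # read the config line by line; getline returns 1 while data remains
    while ((getline line < file) > 0) {
        split(line, f, "#")              # f[1] = code, f[2] = country name
        if (f[2] == name) code = f[1]
    }
    close(file)
    printf "Country Tags|%s|%s\n", name, code
}'
```

This prints Country Tags|INDIA|IND: the command's output is read field by field inside awk instead of being thrown away by system().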
Pre-load the config file with awk:

nawk '
NR == FNR {
    split($0, x, "#")
    country_code[x[2]] = x[1]
    next
}
{
    CITIZEN_COUNTRY_NAME = "INDIA"
    if (CITIZEN_COUNTRY_NAME in country_code) {
        value = country_code[CITIZEN_COUNTRY_NAME]
    } else {
        value = "null"
    }
    print "found " value " for country name " CITIZEN_COUNTRY_NAME
}
' country_codes.config filename
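The NR == FNR pre-load pattern above can be exercised end to end like this (both file names and the one-line data file are made up for the demo):

```shell
cat > country_codes.config <<'EOF'
IND#INDIA
CAN#CANADA
EOF
printf 'dummy record\n' > datafile

awk '
NR == FNR {                       # first file only: build the lookup table
    split($0, x, "#")
    country_code[x[2]] = x[1]
    next
}
{                                 # second file: use the table per record
    name = "INDIA"
    value = (name in country_code) ? country_code[name] : "null"
    print "found " value " for country name " name
}' country_codes.config datafile
```

This prints found IND for country name INDIA; NR == FNR is true only while the first file is being read, so the lookup table is complete before any data record is processed.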
How can I embed arguments in an awk script?
This is the evolution of these two questions, here, and here. For my own learning, I'm trying to accomplish two (more) things with the code below:

Instead of invoking my script with # myscript -F "," file_to_process, how can I fold the '-F ","' part into the script itself?

How can I initialize a variable so that I only assign a value once (ignoring subsequent matches)? You can see from the script that I parse seconds and microseconds in each rule; I'd like to keep the first assignment of sec around so I could subtract it from subsequent matches in the printf() statement.

#!/usr/bin/awk -f
/DIAG:/ {
    lbl = $3; sec = $5; usec = $6;
    /Test-S/ { stgt = $7; s1 = $30; s2 = $31; }
    /Test-A/ { atgt = $7; a = $8; }
    /Test-H/ { htgt = $7; h = $8; }
    /Test-C/ { ctgt = $7; c = $8; }
}
/WARN:/ { sec = $4; usec = $5; m1 = $2; m2 = $3 }
{
    printf("%16s,%17d.%06d,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%5d,%5d\n", lbl, sec, usec, stgt, s1, s2, atgt, a, htgt, h, ctgt, c, m1, m2)
}
Use a BEGIN clause:

BEGIN {
    FS = ","
    var1 = "text"
    var2 = 3
    etc.
}

This is run before the line-processing statements. FS is the field separator.

If you want to parse a value and keep it, it depends on whether you want only the first one or each previous one. To keep the first one:

FNR==1 { keep = $1 }

To keep the previous one:

BEGIN { prevone = "initial value" }
/regex/ {
    do stuff with $1
    do stuff with prevone
    prevone = $1
}
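A tiny runnable illustration of the keep-the-first-value idea (the input file times.txt and its numbers are made up): a flag guard assigns the base value only once, and later records print their offset from it.

```shell
cat > times.txt <<'EOF'
5 100
7 250
9 400
EOF

awk '
!seen { base = $1; seen = 1 }   # runs only for the first record
{ print $1 - base }             # offset of every record from the first
' times.txt
```

This prints 0, 2 and 4, one per line. FNR==1 works the same way when the value is guaranteed to be on the first line; the flag form also copes with inputs where the first interesting record matches a pattern rather than being line one.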