Scripting in awk

I have a text file with contents as below:
1,A,100
2,A,200
3,B,150
4,B,100
5,B,250
I need the output as:
A,300
B,500
The logic here is: sum all the 3rd fields whose 2nd field is A, and the same for B.
How could we do it using awk?

You can do it using an associative array (a hash) as:
awk -F"," '{cnt[$2]+=$3}END{for (x in cnt){printf "%s,%d\n",x,cnt[x]}}' file

Well, I'm not up for writing and debugging the code for you. However, the elements you need are:
You can use FS="," to change the field separator to a comma.
The fields you care about are obviously the second ($2) and third ($3) fields.
You can create your own variables to accumulate the values into.
I'd suggest an associative array variable, indexed by field two; a sketch combining these elements follows.
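A commented sketch combining those elements (functionally the same as the one-liners in this thread; the multi-line layout is just for readability):
awk 'BEGIN { FS = "," }              # split input fields on commas
     { total[$2] += $3 }             # accumulate field 3, keyed by field 2
     END {
         for (key in total)          # note: for-in iteration order is unspecified
             printf "%s,%d\n", key, total[key]
     }' file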

$ awk -F"," '{_[$2]+=$3}END{for(i in _)print i,_[i]}' OFS="," file
A,300
B,500

Related

awk - store first occurrence based on cell

I have a file (around 10k entries) with following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
containing duplicates, some without coordinates (n/a), others with mildly contradicting LAT;LONG coordinates.
I only want to store the first unique value of [$1;$2;$3;$4;$5] as output, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly your example: it simply checks whether the combined index of the first three fields has been seen yet, and relies on awk's default print action to output the first record having each unique combination:
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
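In an array subscript, the comma-separated list $1,$2,$3 is joined with the builtin SUBSEP character, so a hypothetical long form with an explicit key variable behaves identically:
$ awk -F";" '{ key = $1 SUBSEP $2 SUBSEP $3 } !seen[key]++' file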
If you say yours works, then it is certainly shorter. Let me know if you have further questions.
Found it by giving up on creating arrays: I created a new $1 from $1, $2 and $3 combined. The other solution is indeed more elegant. Here is the command I came up with after merging the fields in the file (and setting them as a new $1), which it turns out I didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv

Delete repetitions in text file by using awk

I have a fragment of text file (this text file is huge):
114303 SOL1443
114311 SOL679
114316 SOL679
114432 SOL1156
114561 SOL122
114574 SOL2000
114952 SOL3018
115597 SOL609
115864 SOL2385
115993 SOL3448
SOL2 61571
SOL3 87990
SOL4 96242
SOL5 6329
SOL5 16550
SOL9 84894
SOL9 84911
SOL12 91985
SOL15 85816
I need to write a script which will delete lines that have a duplicate SOL number. It doesn't matter if the SOL is in the first or the second column.
For example, in the text I have
115993 SOL269
SOL269 84911
12373 SOL269
So my script should delete the second and third lines:
SOL269 84911
12373 SOL269
I know that in awk I can use
awk '!seen[$0]++' data.txt
to delete duplicate lines, but it deletes lines which have the same words in every column.
Please help me!
You need to extract the value of SOL and group the contents of the file based on it. The command below uses the match() function to find the pattern SOL followed by digits in the current line, and substr() with RSTART/RLENGTH to store the matched text in the variable sol.
Now, with that value in the variable, use the logic !unique[sol]++ to print only the first line containing each SOL number.
awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
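To see what match() reports, you can test it on a single line (illustration only); on "114311 SOL679" the match starts at character 8 and is 6 characters long:
$ echo "114311 SOL679" | awk 'match($0, /SOL[[:digit:]]+/){ print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
8 6 SOL679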
Not saying perl is any better than the above, but you can do
perl -ne '/(SOL\d+)/; print unless $unique{$1}++' file
As your SOL field is not always at the same place, you first have to find it.
awk '{
  end = substr($0, index($0, "SOL"))        # from "SOL" to the end of the line
  sp = index(end, " ")                      # position of the space after the token, if any
  sol = sp ? substr(end, 1, sp - 1) : end   # the SOL token itself
}
!seen[sol]++
' data.txt
You can do this with the same idea as your awk command (just do some preprocessing to select the column to use in the seen array):
awk '{if($1 ~ /^SOL/){sol_kw=$1}else{sol_kw=$2}}!seen[sol_kw]++' <file>
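A quick check with the three example lines from the question (using printf to feed the sample) keeps only the first occurrence:
$ printf '115993 SOL269\nSOL269 84911\n12373 SOL269\n' | awk '{if($1 ~ /^SOL/){sol_kw=$1}else{sol_kw=$2}}!seen[sol_kw]++'
115993 SOL269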

Using awk, conditionally print default value based on key,value not existing in second file

This is a variant on
Using awk how do I combine data in two files and substitute values from the second file to the first file?
The reason this question is not the same as the one linked above is that this one handles a couple of additional cases.
data.txt contains some data:
A;1
B;2
C;3
A;4
keys.txt contains "key,value" pairs, however, the key,value pair for C is missing:
A;60
B;50
D;30
Desired output
A;1;60
B;2;50
C;3;1
A;4;60
For all missing "key,value" pairs, append the default value of 1 to those rows. It should also be able to handle "key,value" pairs in keys.txt which don't have a corresponding key in data.txt (such as D;30 in this example).
awk to the rescue!
Compared to the related answer, remove the "in" membership filter and replace the last field with a conditional.
$ awk 'BEGIN {FS=OFS=";"}
NR==FNR {a[$1]=$2; next}
{print $0,($1 in a)?a[$1]:1}' file2 file1
A;1;60
B;2;50
C;3;1
A;4;60
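For reference, the same script expanded with comments (file2 is keys.txt and file1 is data.txt in the question's terms):
awk 'BEGIN { FS = OFS = ";" }
     NR==FNR { a[$1] = $2; next }           # first file only: remember value per key
     { print $0, ($1 in a) ? a[$1] : 1 }    # second file: append stored value, or default 1
' keys.txt data.txt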

Output field separators in awk after substitution in fields

Is it always the case, after modifying a specific field in awk, that information about the original field separators is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script that takes account of the input field separator and prints each line in my file, such as running
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line; rather, I see the fields separated by the default whitespace separator, i.e.:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
In the case, however, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That is the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and gawk providers before all agreeing that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
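For readability, the same one-liner expanded with comments (gawk only; the split() call is done purely to capture the separators, behavior is unchanged):
awk -F'[:;]' '{
    split($0, f, FS, s)          # gawk 4-arg split: s[i] = separator found after field i
    $2 = $2 "!"                  # modify the field as before
    r = s[0]                     # any separator before the first field (empty here)
    for (i = 1; i <= NF; i++)
        r = r $i s[i]            # reassemble the fields with their original separators
    $0 = r
} 1' example2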

awk: calculating sum from values in single field with multiple delimiters

Related to another post I had...
parsing a sql string for integer values with multiple delimiters,
in which I say I can easily accomplish the same with UNIX tools (ahem). I found it a bit messier than expected. I'm looking for an awk solution. Any suggestions on the following?
Here is my original post, paraphrased:
#
I want to use awk to parse data sourced from a flat file that is pipe delimited. One of the fields is sub-formatted as follows. My end state is to sum the integers within the field, but my question here is about ways to use awk to extract and sum the numeric values in the field. The pattern of the sub-formatting will always be that the desired numbers are preceded by a tilde (~) and followed by an asterisk (*), except for the last one in the field. The number of subfields may vary too (my example has 5, but there could be more or fewer). The 4-char TAG name is of no importance.
So here is a sample:
|GADS~55.0*BILK~0.0*BOBB~81.0*HETT~32.0*IGGR~51.0|
From this example, all I would want for processing is the final number of 219. Again, I can work on the sum part as a further step; just interested in getting the numbers.
#
My solution currently entails two awk statements. The first uses gsub to replace the '~' with a '*' delimiter in my target field, 77:
awk -F'|' 'BEGIN {OFS="|"} { gsub("~", "*", $77) ; print }' file_1 > file_2
My second awk statement calculates the numeric sum of the target field, 77, which is the last field, and replaces it with the calculated value. It is built on the assumption that there will be no other asterisks (*) anywhere else in the file. I'm okay with that. It works for most examples, but not all, and my gut tells me this isn't that robust an answer. Any ideas? The suggestions on my other post for SQL were great, but I couldn't implement them for unrelated silly reasons.
awk -F'*' '{if (NF>=2) {s=0; for (i=1; i<=NF; i++) s=s+$i; print substr($1, 1, length($1)-4) s;} else print}' file_2 > file_3
To get the sum (219) from your example, you can use this (note the per-line reset of s, so each input line gets its own sum):
awk -F'[^0-9.]+' '{s=0; for(i=1;i<=NF;i++)s+=$i; print s}' file
or the following for 219.00:
awk -F'[^0-9.]+' '{s=0; for(i=1;i<=NF;i++)s+=$i; printf "%.2f\n", s}' file
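A quick check with the sample line (using echo to stand in for the file):
$ echo '|GADS~55.0*BILK~0.0*BOBB~81.0*HETT~32.0*IGGR~51.0|' | awk -F'[^0-9.]+' '{s=0; for(i=1;i<=NF;i++)s+=$i; print s}'
219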