This is for my own learning, but let's say I have the input file below, where $5 needs to be split on the - before I run an awk command. Basically, I am summing all matching $5 strings using $3-$2, outputting the lines and the total, but without a split they are all counted as different keys. I can split the file before, but I am curious whether I can do everything in one awk. The command works on a file if it is split before the awk is run. Thank you :).
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
awk
awk '{split($5,a,"-"); a[1]} {c1[$a1]++; c2[$a1]+=($3-$2)}
END{for (e in c1) print e, c1[e], c2[e]}' input > out
current output (without the split)
AGRN-6 220
AGRN-7 281
desired output
AGRN 2 501
The only problem I see with your script is the references to c1[$a1] and c2[$a1]. Remember that the dollar sign is NOT a string indicator; think of it more as a field selector, or as an array whose indices are the positions of the fields on the line.
That means $a1 is not the value of the variable a1, but rather the value of the field whose number is stored in the variable a1. To demonstrate:
$ echo "one two three" | awk '{ n=2; print $n }'
two
Simply remove the extra dollar signs and index with the split result, i.e. c1[a[1]] and c2[a[1]], and you should be good to go.
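Putting the fix together, the full corrected command looks like this (a sketch, piping in the two sample lines from the question so it is self-contained):

```shell
# Key both arrays on the text before the "-" (a[1]); $a1 would be a field reference.
printf '%s\n' \
  'chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75' \
  'chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2' |
awk '{split($5, a, "-"); c1[a[1]]++; c2[a[1]] += $3 - $2}
     END {for (e in c1) print e, c1[e], c2[e]}'
```

On the sample input this prints AGRN 2 501: two matching lines, with interval lengths 220 and 281 summing to 501.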
Incidentally, I don't get the same output as you when I run the incorrect script. Instead, I get an error:
awk: illegal field $(), name "a1"
input record number 1, file inp1
source line number 1
I'm using BSD awk. I don't get the error when I run your script with GNU awk (gawk). If you'll be doing a lot of awk programming, I suggest you pick up another awk or two just to see how different implementations parse your code, when things don't run as expected.
Related
I have a directory with multiple csv text files, each with a single line in the format:
field1,field2,field3,560
I need to output the sum of the fourth field across all files in a directory (can be hundreds or thousands of files). So for an example of:
file1.txt
field1,field2,field3,560
file2.txt
field1,field2,field3,415
file3.txt
field1,field2,field3,672
The output would simply be:
1647
I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt
This correctly outputs 975.
However if I try pass it a 3rd file, rather than add field 4 from all 3 files, it adds file1 to file2, then file1 to file3:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232
Can anyone show me how I can modify this awk statement to accept more than two files or, ideally because there are thousands of files to sum up, an * to output the sum of the fourth field of all files in the directory?
Thank you for your time and assistance.
A couple issues with the current code:
NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array
Since you're only looking for one global sum, a bit simpler code should suffice:
$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647
NOTES:
in the (unlikely?) case all files are empty, sum will be undefined so print sum will display a blank line; sum+0 ensures we print 0 if sum remains undefined (ie, all files are empty)
for a variable number of files file{1..3}.txt can be replaced with whatever pattern will match on the desired set of files, eg, file*.txt, *.txt, etc
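To see the difference sum+0 makes, a quick check against empty input (here /dev/null stands in for a directory of empty files):

```shell
awk -F',' 'END {print sum}'   /dev/null   # prints a blank line
awk -F',' 'END {print sum+0}' /dev/null   # prints 0
```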
Here we go (no need to test NR==FNR in a concatenation):
$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647
Or the same (without wasting a pipe):
$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647
$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647
$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647
$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647
So I'm trying to write an awk script that generates passwords given random names inputted from a .csv file. I'm aiming to do first 3 letters of last name, number of characters in the fourth field, then a random number between 1-200 after a space. So far I've got the letters and num of characters fine, but am having a hard time getting the syntax in my for loop to work for the random numbers. Here is an example of the input:
Danette,Suche,Female,"Kingfisher, malachite"
Corny,Chitty,Male,"Seal, southern elephant"
And desired output:
Suc21 80
Chi23 101
For 10 rows total. My code looks like this:
BEGIN{
FS=",";OFS=","
}
{print substr($2,0,3)length($4)
for(i=0;i<10;i++){
echo $(( $RANDOM % 200 ))
}}
Then I've been running it like
awk -F"," -f script.awk file.csv
But it only shows the 3 characters and the length of the fourth field, no random numbers. If anyone's able to point out where I'm screwing up it would be much appreciated, thanks guys
You can use rand() to generate a random number between 0 and 1 (note also that substr positions start at 1, so substr($2,1,3) reliably takes the first three letters):
awk -F, '{print substr($2,1,3)length($4),int(rand()*200)+1}' file.csv
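One caveat worth hedging: without srand(), most awks seed the generator identically on every run, so the "random" numbers repeat from run to run. A sketch that seeds once per run:

```shell
# srand() in BEGIN seeds from the clock, so each run produces different numbers.
awk -F, 'BEGIN{srand()} {print substr($2,1,3) length($4), int(rand()*200)+1}' file.csv
```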
BEGIN{
FS=",";OFS=","
}
{print substr($2,0,3)length($4)
for(i=0;i<10;i++){
echo $(( $RANDOM % 200 ))
}}
There is no echo function defined in GNU AWK. If you wish to use a shell command you might use the system function; however, keep in mind that it returns the command's exit status and prints the command's output directly, without the ability to alter it, so you need to design the command so that you get the desired output from it.
Let file.txt content be
A
B
C
then
awk '{printf "%s ",$0;system("echo ${RANDOM}%200 | bc")}' file.txt
might give output
A 95
B 139
C 1
Explanation: firstly I use printf so no newline is appended automatically, and I output the whole line followed by a space; then I execute a command which outputs a random value in range:
echo ${RANDOM}%200 | bc
It simply feeds the value of RANDOM followed by %200 into the calculator, which outputs the result of that expression.
If you are not dead set on using the RANDOM variable, then the rand function might be used without this hassle.
(tested with gawk 4.2.1 and bc 1.07.1)
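For comparison, the rand() route mentioned above does the same thing without spawning a shell and bc for every line (a sketch; seeded so runs differ):

```shell
# Entirely inside awk: append a random 0-199 value to each input line.
awk 'BEGIN{srand()} {printf "%s %d\n", $0, int(rand()*200)}' file.txt
```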
I have some file.txt with a lot of information. The input in the file looks like:
<ss>283838<ss>
.
.
<ss>111 from 4444<ss>
.
<ss>255<ss>
The numbers can have any number of digits.
I need to find and compare these 2 numbers.
If they are equal, print the name of the file and that they are equal; if not, the opposite. Only one line in the file has digits with the word "from" between them.
I tried to do like
awk '/[0-9]+ from./ {print $0}' file.txt | egrep -o '[0-9]+'
With this command I get those two numbers, but I am stuck now and do not know how to compare them.
With your shown samples, could you please try the following. Simple explanation would be: getting the respective digit values by regex and then comparing them to cover the 3 cases: greater, lesser, or equal to each other.
awk '
/ from / && match($0,/<[a-zA-Z]+>[0-9]+/){
  val1=substr($0,RSTART,RLENGTH)
  gsub(/[^0-9]*/,"",val1)
  match($0,/[0-9]+<[a-zA-Z]+>/)
  val2=substr($0,RSTART,RLENGTH)
  gsub(/[^0-9]*/,"",val2)
  if(val1+0>val2+0){
    print "val1(" val1 ") is greater than val2(" val2 ")"
  }
  if(val2+0>val1+0){
    print "val2(" val2 ") is greater than val1(" val1 ")"
  }
  if(val1+0==val2+0){
    print "val1(" val1 ") is equal to val2(" val2 ")"
  }
}' Input_file
For your currently shown sample the output will be as follows:
val2(4444) is greater than val1(111)
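To finish the original ask (print the file name and whether the two numbers are equal), here is a minimal sketch along the same lines, keying only on the one line that contains "from":

```shell
# Pull out the "N from M" span, split it on " from ", and compare numerically.
awk 'match($0, /[0-9]+ from [0-9]+/) {
       split(substr($0, RSTART, RLENGTH), n, / from /)
       print FILENAME, (n[1]+0 == n[2]+0 ? "numbers are equal" : "numbers are not equal")
     }' file.txt
```

FILENAME is awk's built-in holding the current input file's name, so this prints, e.g., "file.txt numbers are not equal" for the sample input.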
Editor's note:
This question has a troubled edit history in that a well-meaning, but misguided edit (which introduced unrelated, "pretty" formatting relying on spaces and | chars. to separate columns) temporarily confused the issue (since reverted).
The OP's premise is that the input is tab-delimited, even though that's not directly reflected in the sample input displayed here.
I have a input file having 6 columns and they are tab-separated. I want to replace all values in column 5 with value '81115', while keeping the formatting intact.
Input File :
203 ADD 24 IAC 81216 IT
204 ATT 24 IAC 81216 IT
Desired Output File :
203 ADD 24 IAC 81115 IT
204 ATT 24 IAC 81115 IT
My solution #1
I am using the following command:
awk '{$5 = v} 1' v="81115" file > file.NEW
With the above command, column 5 is getting replaced, but the columns are no longer tab-separated.
Output File :
203 ADD 24 IAC 81115 IT
204 ATT 24 IAC 81115 IT
My solution #2
To maintain the formatting I have tried using the following commands:
awk -v replace="81115" -F '\t' -v OFS='\t' '{$5=replace}1' file > file.NEW
OR
awk -F"\t" -v OFS="\t" '{$5=81115}1' file > file.NEW
OR
awk -F '\t' '{$5="81115";}1' OFS='\t' file > file.NEW
All of the above commands are keeping the formatting intact, but are adding a new column with value 81115 at the end; i.e., column 7 is getting appended.
Output File:
203 ADD 24 IAC 81216 IT 81115
204 ATT 24 IAC 81216 IT 81115
Can anyone suggest an alternate solution or changes to above commands?
For an in-column update that retains the formatting, you need to use the split function's fourth argument, which captures the separators. Note that split with a fourth argument is supported only by GNU awk.
Try this:
awk '{split($0, a, FS, seps) # split based on FS, capturing the separators
a[5]="81115"; # Update the 5th column
for (i=1;i<=NF;i++) # print the data back
printf("%s%s", a[i], seps[i]) # keeping the separators
print ""}' file # print a new line
One-liner:
awk '{split($0, a, FS, seps); a[5]="81115"; for (i=1;i<=NF;i++) printf("%s%s", a[i], seps[i]); print ""}' file
Credit goes to https://stackoverflow.com/a/39326264/2032943
The simplest solution based on the given sample input is a plain search and replace using sed; it assumes the 5th column always holds the same value 81216 and that this value doesn't occur anywhere in columns 1-4:
$ sed 's/81216/81115/' file
203 ADD 24 IAC 81115 IT
204 ATT 24 IAC 81115 IT
If any value in 5th column has to be replaced,
sed -E 's/^((\S+\s+){4})\S+/\181115/' file
If \s and \S are not recognized, use
sed -E 's/^(([^[:space:]]+[[:space:]]+){4})[^[:space:]]+/\181115/' file
A similar solution can be used with GNU awk, which has the gensub function:
awk '{$0 = gensub(/^((\S+\s+){4})\S+/, "\\181115", "1", $0)}1' file
Or with variable,
awk -v replace='81115' '{$0 = gensub(/^((\S+\s+){4})\S+/, "\\1"replace, "1", $0)}1' file
All of the above solutions preserve the input file's space formatting.
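As a quick check that the separators survive, here is the POSIX-class sed variant run on a line with mixed spacing (two spaces and a tab, chosen arbitrarily for the demo):

```shell
# Columns 1-4 keep their original separators; only the 5th value changes.
printf '203  ADD\t24 IAC 81216 IT\n' |
sed -E 's/^(([^[:space:]]+[[:space:]]+){4})[^[:space:]]+/\181115/'
```

The output keeps the double space and the tab exactly where they were, changing only 81216 to 81115.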
Note:
- If you must preserve the exact separator strings from the input and you have GNU awk, see @Sundeep's helpful answer, or, for a solution covering all fields, see Jay Rajput's helpful answer.
- This answer tries to diagnose the OP's problem and contains a solution that converts the input to consistently tab-delimited output.
Your 1st attempt doesn't preserve tabs in the output, because, in the absence of setting OFS, the output-field separator, Awk separates output fields by a space each.
(By assigning to a field, as you do with $5 = ..., the input line is implicitly rebuilt, using the value of OFS (a space by default) as the separator to piece the (modified) fields back together to form the output line.)
Your other attempts all look reasonable, which suggests that your input file may not be structured the way you think it is.
Use cat -et to verify that all columns in your input file are indeed separated by a tab each: ^I represents a tab in the output of cat -et.
If your input file contains a mix of tab- and space(s)-separated columns and/or if some fields have multiple tabs between them, you need to rely on awk's default parsing to split your input into fields as expected, namely on any run of whitespace.
You then use tab as the separator only on output, by setting OFS only:
awk -v replace='81115' -v OFS='\t' '{$5=replace}1' file
Note the absence of the -F option so as to rely on Awk's default field-splitting behavior.
While this will not necessarily keep the exact input formatting, you will get consistently tab-separated output.
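For example, given a line whose columns are separated by an inconsistent mix of spaces and a tab, the OFS-only version normalizes everything to tabs (a sketch; pipe through cat -et, which renders each tab as ^I, if you want to inspect the result):

```shell
# Default splitting handles any whitespace run; assigning $5 rebuilds the line with OFS tabs.
printf '203 ADD  24\tIAC 81216 IT\n' |
awk -v replace='81115' -v OFS='\t' '{$5=replace}1'
```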
I am trying to combine the text in $2 that is the same and output the matching lines to separate files, with the match being the name of the new file. Since the actual files are quite large, I open each file and then close it to save on speed and memory; my attempt is below. Thank you :).
awk '{printf "%s\n", $2==$2".txt"; close($2".txt")}' input.txt '{ print $2 > "$2.txt" }'
input.txt
chr19:41848059-41848167 TGFB1:exon.2;TGFB1:exon.3;TGFB1:exon.4 284.611 108 bases
chr15:89850833-89850913 FANCI:exon.20;FANCI:exon.27;FANCI:exon.32;FANCI:exon.33;FANCI:exon.34 402.012 80 bases
chr15:31210356-31210508 FANC1:exon.6;FANC1:exon.7 340.914 152 bases
chr19:41850636-41850784 TGFB1:exon.1;TGFB1:exon.2;TGFB1:exon.3 621.527 148 bases
Desired output for TGFB1.txt
chr19:41848059-41848167 TGFB1:exon.2;TGFB1:exon.3;TGFB1:exon.4 284.611 108 bases
chr19:41850636-41850784 TGFB1:exon.1;TGFB1:exon.2;TGFB1:exon.3 621.527 148 bases
Desired output for FANC1.txt
chr15:89850833-89850913 FANCI:exon.20;FANCI:exon.27;FANCI:exon.32;FANCI:exon.33;FANCI:exon.34 402.012 80 bases
chr15:31210356-31210508 FANC1:exon.6;FANC1:exon.7 340.914 152 bases
EDIT:
awk -F '[ :]' '{f = $3 ".txt"; close($3 ".txt")} print > f}' BMF_unix_loop_genes_IonXpress_008_150902_loop_genes_average_IonXpress_008_150902.bed > /home/cmccabe/Desktop/panels/BMF /"$f".txt;
bash: /home/cmccabe/Desktop/panels/BMF: Is a directory
You can just redefine your field separator to include a colon and then the file name would be in $3
awk -F '[ :]' '{f = $3 ".txt"; print > f}' input.txt
I've encountered problems with some awks where constructing the filename to the right of the redirection is problematic, which is why I'm using a variable. However the Friday afternoon beer cart has been around and I can't recall specific details :/
I wouldn't bother closing the files unless you're expecting hundreds or thousands of new files to be produced.
You need to split the second field to get the desired file name. This should do:
$ awk '{split($2,f,":"); p = f[1] ".txt"; print > p}' file
Note that close(p) in a BEGIN block is a no-op; if you really must close files as you go (thousands of output files), print with >> instead, since reopening a closed file with > truncates it.
Note that it won't produce your output exactly, since you have a typo in one of the fields (FANC1 vs FANCI):
$ ls *.txt
FANC1.txt FANCI.txt TGFB1.txt
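An end-to-end check of the split approach, using shortened versions of the sample rows (field 2 truncated for brevity; the keying logic is the same):

```shell
# Each line goes to <gene>.txt, keyed on the text before the first ":" in $2.
printf '%s\n' \
  'chr19:41848059-41848167 TGFB1:exon.2 284.611 108 bases' \
  'chr15:31210356-31210508 FANC1:exon.6 340.914 152 bases' \
  'chr19:41850636-41850784 TGFB1:exon.1 621.527 148 bases' |
awk '{split($2, f, ":"); p = f[1] ".txt"; print > p}'
```

Afterwards TGFB1.txt holds the two TGFB1 rows and FANC1.txt the single FANC1 row.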