awk or sed to delete a pattern with string and number - awk

I have a file with the following content:
string1_204
string2_408
string35_592
I need to get rid of string1_, string2_, string35_ and so on, and add 204, 408, 592 to get a value,
so the output should be 1204.
I can take out string1_ and string2_, but for string35_592 I end up with 5_592.
I can't seem to get the command right to do what I want to do. Any help is appreciated :)

With awk:
awk -F_ '{s+=$2}END{print s}' your.txt
Output:
1204
Explanation:
-F_ sets the field separator to _, which makes it easy to access
the numbers later on
{
  # runs on every line of the input file
  # adds the value of the second field (the number) to s;
  # awk auto-initializes s to 0 on its first use
  s += $2
}
END {
  # runs after all input has been processed
  # prints the sum
  print s
}
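If you prefer to keep this annotated version around, you can put it in its own file and run it with awk's -f option (sum.awk here is just an example name):
awk -F_ -f sum.awk your.txt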

In case you are interested in a coreutils/bc alternative:
<infile cut -d_ -f2 | paste -sd+ - | bc
Output:
1204
Explanation:
cut splits each line at underscore characters (-d_) and outputs only the second field (-f2). The column of numbers is passed on to paste which joins them on a line (-s) delimited by plus characters (-d+). This is passed on to bc which calculates and outputs the sum.
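For illustration, assuming the three lines from the question are in infile, the intermediate stages of this pipeline look like this:
$ cut -d_ -f2 infile
204
408
592
$ cut -d_ -f2 infile | paste -sd+ -
204+408+592
$ cut -d_ -f2 infile | paste -sd+ - | bc
1204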

Related

Generating 10 random numbers in a range in an awk script

So I'm trying to write an awk script that generates passwords given random names input from a .csv file. I'm aiming for the first 3 letters of the last name, the number of characters in the fourth field, then a random number between 1 and 200 after a space. So far I've got the letters and number of characters fine, but am having a hard time getting the syntax in my for loop to work for the random numbers. Here is an example of the input:
Danette,Suche,Female,"Kingfisher, malachite"
Corny,Chitty,Male,"Seal, southern elephant"
And desired output:
Suc21 80
Chi23 101
For 10 rows total. My code looks like this:
BEGIN{
FS=",";OFS=","
}
{print substr($2,0,3)length($4)
for(i=0;i<10;i++){
echo $(( $RANDOM % 200 ))
}}
Then I've been running it like
awk -F"," -f script.awk file.csv
But it only shows the 3 characters and length of the fourth field, no random numbers. If anyone's able to point out where I'm screwing up it would be much appreciated, thanks guys
You can use rand() to generate a random number between 0 and 1:
awk -F, '{print substr($2,1,3)length($4),int(rand()*200)+1}' file.csv
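Note that awk's rand() is seeded the same way on every run unless you call srand(), so the command above prints the same "random" numbers each time. A small variation that seeds from the current time first might look like:
awk -F, 'BEGIN{srand()} {print substr($2,1,3) length($4), int(rand()*200)+1}' file.csv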
BEGIN{
FS=",";OFS=","
}
{print substr($2,0,3)length($4)
for(i=0;i<10;i++){
echo $(( $RANDOM % 200 ))
}}
There is no echo function defined in GNU AWK. If you wish to run a shell command you might use the system function; however, keep in mind that it returns the command's exit status and the command's output goes straight to the terminal, without any ability to alter it, so you need to design the command so that it prints exactly the output you want.
Let file.txt content be
A
B
C
then
awk '{printf "%s ",$0;system("echo ${RANDOM}%200 | bc")}' file.txt
might give output
A 95
B 139
C 1
Explanation: first I use printf so that no newline is appended automatically, and I output the whole line followed by a space; then I execute a command which outputs a random value in range:
echo ${RANDOM}%200 | bc
It simply feeds the value of RANDOM followed by %200 into the calculator, which outputs the result of that expression.
If you are not dead set on using the RANDOM variable, the rand function might be used without hassle, as sketched below.
(tested with gawk 4.2.1 and bc 1.07.1)
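Keep in mind that RANDOM is a bash/ksh feature and system() hands its command to /bin/sh, so whether this works depends on what /bin/sh actually is on your system. If all you need is one random number per line, a gawk-only sketch using rand() would be:
awk 'BEGIN{srand()} {print $0, int(rand()*200)+1}' file.txt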

How to use awk to count the occurrence of a word beginning with something?

I have a file that looks like this:
FID IID
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurrences of IDs beginning with 'RQ' in column 2.
So for the little snapshot, it should be 2. After the RQ the numbers differ, so I want to count anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
When Field 2 starts with RQ, cnt is incremented, and once the whole file has been processed, cnt is printed.
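With the sample from the question, two IDs start with RQ, so running the command without the redirection should show:
$ awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt
2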
You did
{if(match("^RQ$",$2))print}
but the compulsory arguments to the match function are string, regexp, in that order. Also, do not use $ if you are interested in finding strings that start with something, as $ anchors the end of the string. After fixing those issues the code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer solely describes fixing the problems with your current code; it does not contain any ways to improve it further.
Also, apart from the reversed parameters for match, the input file ID.txt should come right after awk's closing single quote, not after wc -l.
As you want to print the whole line, you can omit the if statement and the print statement because match returns the index at which that substring begins, or 0 if there is no match.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt
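For reference, with the sample data the match filter alone (before wc -l) should select these two lines:
$ awk 'match($2,"^RQ")' ID.txt
1 RQ50131-0
5 RQ50135-2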

Combine grep output when piping

I use the following command sipcalc to display information about an IP:
sipcalc 192.16.12.1/16 | grep -E 'Network address|Network mask \(bits\)'
The output is:
Network address - 192.16.0.0
Network mask (bits) - 16
Is there a way to combine the above output (only the right part), so the output would be:
192.16.0.0/16
I have my own way to do this by calling grep separately and then concatenating the results, but I don't think it is a good solution. Is there a way with grep, or any other command the output can be piped to (like awk), to obtain the output above?
grep is not really an ideal tool for doing operations beyond just searching for your expected text. Use awk alone!
awk '/Network address/{ ip = $NF } /Network mask \(bits\)/{ print ip "/" $NF}'
Awk processes records with /pattern/ { action } syntax. So when the first pattern is matched, the last space-delimited field is extracted: $NF is a special variable awk uses to store the value of the last column when delimited by space (see 7.5.1 Built-in Variables That Control awk).
When the second pattern is matched, that line's last field is joined with the value stored in the ip variable. Placing the pieces next to each other (ip "/" $NF) simply concatenates the individual strings to produce the desired result.
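Putting it together with the sipcalc output shown in the question, the whole pipeline should produce:
$ sipcalc 192.16.12.1/16 | awk '/Network address/{ ip = $NF } /Network mask \(bits\)/{ print ip "/" $NF }'
192.16.0.0/16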

How to compare digits after find with awk, egrep

I have some file.txt containing a lot of information. The input in the file looks like:
<ss>283838<ss>
.
.
<ss>111 from 4444<ss>
.
<ss>255<ss>
The numbers can have any number of digits.
I need to find and compare those two numbers.
If they are equal, print the name of the file and that they are equal; if not, the reverse. Only one line in the file has two numbers with the word "from" between them.
I tried something like
awk '/[0-9]+ from./ {print $0}' file.txt | egrep -o '[0-9]+'
With this command I get those two numbers, but I'm stuck now and do not know how to compare them.
With your shown samples, could you please try the following. A simple explanation would be: get the respective numeric values with regex matches, then compare them to cover the three cases: one greater than the other, the other greater, or both equal.
awk '
match($0,/<[a-zA-Z]+[0-9]+/){
  val1=substr($0,RSTART,RLENGTH)
  gsub(/[^0-9]*/,"",val1)
  match($0,/[0-9]+[a-zA-Z]+>/)
  val2=substr($0,RSTART,RLENGTH)
  gsub(/[^0-9]*/,"",val2)
  if(val1>val2){
    print "val1("val1 ")is Greater than val2("val2")"
  }
  if(val2>val1){
    print "val2("val2 ")is Greater than val1("val1")"
  }
  if(val1==val2){
    print "val1("val1 ")is equals to val2("val2")"
  }
}' Input_file
For your currently shown sample, the output will be as follows:
val2(333)is Greater than val1(222)
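If the two numbers always sit on the single line containing the word "from", as in the posted sample, a shorter GNU awk sketch could also work; it relies on gawk's three-argument match() to capture the numbers, forces numeric comparison with +0, and prints the file name as requested:
awk 'match($0, /([0-9]+) from ([0-9]+)/, m){
  if (m[1]+0 == m[2]+0)
    print FILENAME ": numbers are equal (" m[1] ")"
  else
    print FILENAME ": numbers differ (" m[1] " vs " m[2] ")"
}' file.txt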

Using awk to print index of a pattern in a file

I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
  s=$0
  o=0
  m="AAAA"
  l=length(m)
  i=index(s,m)
  while (i>0) {
    o+=i
    print o
    s=substr(s,i+l)
    o+=l-1
    i=index(s,m)
  }
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched, the second runs for every line after the first, incrementing the counter with the length of the line.
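With the sample posted above (header line included), this should print:
$ awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
6
51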
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing it so the next iteration returns the next instance.
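As a quick check with a made-up two-line file that has two AAAA runs on one sequence line (the file name and contents here are purely illustrative):
$ printf '>sample\nAAAATTAAAAGG\n' > multi.fa
$ awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep)} l+=length}' multi.fa
1
7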
It's worth mentioning that both these methods will match AAAAAA.
With AWK's index(), of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
If you're fine with zero-based indices, this may be simpler.
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
This assumes you have the header row as posted; if not, remove the sed command. Note that it also assumes single-byte characters as shown; for extended charsets it won't be the character position but the byte offset.
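If you would rather have 1-based positions from this approach, you can add 1 to each offset, for example:
$ sed 1d file | tr -d '\n' | grep -ob AAAA | awk -F: '{print $1+1}'
6
51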