Addition of particular numbers in a file using awk or grep - awk

I am looking for something like this:
FILE NAME : fruites.txt
Apple a day keeps doctor away
but people dont like it............... 23 peoples found.
Banana_A.1 keeps u fit
and its very tasty.................... 12 peoples found.
Banana_B.2 juices is very good to taste
and most people like them
as well as consumed the most.......... 15 peoples found.
Anar is difficult to eat
as well as its very costly............ 35 peoples found.
grapes are easy to eat
and people like it the most........... 10 peoples found.
fruites are very healthy and improves vitamins.
Apple : The apple tree is a deciduous tree in the rose family best known for its sweet, pomaceous
fruit, the apple.
Banana_A.1: A banana is an edible fruit, botanically a berry, produced by several kinds of large
herbaceous flowering plants in the genus Musa.
Banana_B.2: A banana is an fruit, botanically a kerry, produced by several kinds of large
herbaceous flowering plants in the genus Musa.
Anar : The pomegranate, botanical name Punica granatum, is a fruit-bearing deciduous shrub or
small tree growing between 5 and 8 m tall.
I want the addition of all peoples found except banana
ANS : 68 ( 23+35+10 )
I am able to find the count separately, but unable to subtract them
I tried like this
grep -E ".found" fruites.txt | awk ' { sum+=$3 } END {print sum }'
ANS : 95 (68+27)
grep -E "Banana|.found" fruites.txt | grep -A1 "Banana" | grep -E ".found" | awk ' { sum+=$3 } END {print sum }'
AND : 27 ( only bananas)
Can anyone please help

awk '$1 != "Banana" {s+=$(NF-2)} END { print s}' RS= fruites.txt
The key here is the RS= assignment which makes awk treat each section of text delimited by blank lines as a separate record. Note that you may prefer to write RS="" fruites.txt for clarity, but that is not necessary. Be sure not to omit the space after the =, though, as the key is to have a blank string as the value of RS.
-- Edit --
Given the comments and the modified question, perhaps you want:
awk '! match($1,"Banana") && match($NF, "found") {
s += $(NF-2)} END { print s }' RS= fruites.txt

You could use the below awk command.
$ awk -v RS="\n\n" '!/Banana/ && /peoples found\.$/{s+=$(NF-2)} END { print s}' file
68
The above awk command sets a blank line \n\n as the Record seperator value and check for the non-existence of Banana string and the existence of peoples found. string at the last. If both conditions are satisfied, then only the sum of third column from the last would be calculated. So s+=$(NF-2) also written as s = s + $(NF-2) contains the sum. Printing the value of s at the last will give you the total sum.

Related

Printing out a particular row based on condition in another row

apologies if this really basic stuff but i just started with awk
so i have an input file im piping into awk like below. format never changes (like below)
name: Jim
gender: male
age: 40
name: Joe
gender: female
age: 36
name: frank
gender: Male
age: 40
I'm trying to list all names where age is 40
I can find them like so
awk '$2 == "40" {print $2 }'
but cant figure out how to print the name
Could you please try following(I am driving as of now so couldn't test it).
awk '/^age/{if($NF==40){print val};val="";next} /^name/{val=$0}' Input_file
Explanation: 1st condition checking ^name if a line starts from it then store that line value in variable val. Then in other condition checking if a line starts from age; then checking uf that line's 2nd field is greater than 40 then print value if variable val and nullify it too.
Using gnu awk and set Record Selector to nothing makes it works with blocks.
awk -v RS="" '/age: 40/ {print $2}' file
Jim
frank
Some shorter awk versions of suspectus and RavinderSingh13 post
awk '/^name/{n=$2} /^age/ && $NF==40 {print n}' file
awk '/^name/{n=$2} /^age: 40/ {print n}' file
Jim
frank
If line starts with name, store the name in n
IF line starts with age and age is 40 print n
Awk knows the concept records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
So based on this, we can immediately list all records who have the line age: 40
$ awk 'BEGIN{RS="";ORS="\n\n"}/age: 40/
There are a couple of problems with the above line:
What if we have a person that is 400 yr old, he will be listed because the line /age: 400/ contains that the requested line.
What if we have a record with a typo stating age:40 or age : 40
What if our record has a line stating wage: 40 USD/min
To solve most of these problems, it is easier to work with well-defined fields in the record and build the key-value-pairs per record:
key value
---------------
name => Jim
gender => male
age => 40
and then, we can use this to select the requested information:
$ awk 'BEGIN{RS="";FS="\n"}
# build the record
{ delete rec;
for(i=1;i<=NF;++i) {
# find the first ":" and select key and value as substrings
j=index($i,":"); key=substr($i,1,j-1); value=substr($i,j+1)
# remove potential spaces from front and back
gsub(/(^[[:blank:]]*|[[:blank:]]$)/,key)
gsub(/(^[[:blank:]]*|[[:blank:]]$)/,value)
# store key-value pair
rec[key] = value
}
}
# select requested information and print
(rec["age"] == 40) { print rec["name"] }' file
This is not a one-liner, but it is robust. Furthermore, this method is fairly flexible and adaptable to make selections based on a more complex logic.
If you are not averse to using grep and the format is always the same:
cat filename | grep -B2 "age: 40" | grep -oP "(?<=name: ).*"
Jim
frank
awk -F':' '/^name/{name=$2} \
/^age/{if ($NF==40)print name}' input_file

Print rows that has numbers in it

this is my data - i've more than 1000rows . how to get only the the rec's with numbers in it.
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to linux programming, any help would be highly useful to me. thanks all
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/'
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line sed '$d'.
This seems for me a bit of duplicate work, surely it is possible to do the above only with awk - without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of contents of the line to p if the first field is greater than 3.

grep: Keeping lines that has specific string in certain column

I am trying to pick out the lines that have certain value in certain column and save it to an output. I am trying to do this with grep. Is it possible?
My data is looks like this:
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
I want to pick out lines that have value 5 on its 2nd column and save it to new outputfile.
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
I would appreciate for help!
It is probably possible with grep but the adequate tool to perform this operation is definitely awk. You can filter every line having 5 on the second column with
awk '$2 == 5'
Explanation
awk splits it inputs in records (usually a line) and fields (usually a column) and perform actions on records matching certain conditions. Here
awk '$2 == 5'
is a short form for
awk '$2 == 5 {print($0)}'
which translates to
For each record, if the second field ($2) is 5, print the full record ($0).
Variations
If you need to choose dynamically the key value used to filter your values, use the -v option of awk:
awk -v "key=5" '$2 == key {print($0)}'
If you need to keep the first line of the file because it contains a header to the table, use the NR variable that keeps track of the ordinal number of the current record:
awk 'NR == 1 || $2 == 5'
The field separator is a regular expression defining which text separates columns, it can be modified with the -F field. For instance, if your data were in a basic CSV file, the filter would be
awk -F", *" '$2 == 5'
Visit the awk tag wiki to find a few useful information to get started learning awk.
To print when the second field is 5 use: awk '$2==5' file
Give this a try:
grep '^[^\s]\+\s5.*$' file.txt
the pattern looks for start of line, followed by more than one non-space character, followed by space, followed by 5, follwed by any number of chars, followed by eol.
You can get following command.
$ cat data.txt
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
grape 55 kkkkkkk aaaaaa
$ grep -E '[^ ]+ +5 .*' data.txt > output.txt
$ cat output.txt
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
You can get the answer only with grep command.
But I strongly recommend you use awk command.
The simple way to do it is:
grep '5' MyDataFile
The result:
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
To capture that in a new file:
grep '5' MyDataFile > newfile
Note: that will find a 5 anywhere in MyDataFile. To restrict to the second column, a short script is what would suit your needs. If you want to limit it to the second column only, then a quick script like the following will do. Usage: script number datafile:
#!/bin/bash
while read -r fruit num stuff || [ -n "$stuff" ]; do
[ "$num" -eq "$1" ] && printf "%s %s %s\n" "$fruit" "$num" "$stuff"
done <"$2"
output:
$ ./fruit.sh 5 dat/mydata.dat
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf

Calculating 95th percentile with awk

I'm new in awk scripting and would like to have some help in calculating 95th percentile value for a file that consist of this data:
0.0001357
0.000112
0.000062
0.000054
0.000127
0.000114
0.000136
I tried:
cat filename.txt | sort -n |
awk 'BEGIN{c=0} {total[c]=$1; c++;} END{print total[int(NR*0.95-0.5)]}'
but I dont seem to get the correct value when I compare it to excel.
I am not sure if Excel does some kind of weighted percentile, but if you actually want one of the numbers that was in your original set, then your method should work correctly for rounding.
You can simplify a little bit like this, but it's the same thing.
sort -n input.txt | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
Following the calculation suggested here, you can do this:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
Output for given input:
sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
0.0001357
Explanation:
Sort the file numerically
drop the top 5%
pick the next value
PS. The statement p5=p5%1?int(p5)+1:p5 is doing a ceil operation available in many languages.
Just for the record, there is also solution, inspired by merlin2011 answer, that prints several desired percentiles:
# get amount of values
num="$(wc -l input.txt | cut -f1 -d' ')";
# sort values
sort -n input.txt > temp && mv temp input.txt
# print the desired percentiles
for p in 50 70 80 90 92 95 99 100; do
printf "%3s%%: %-5.5sms\n" "$p" "$(head input.txt -n "$((num / 100 * $p))" | tail -n1)";
done
Update: I messed it up. Bash math can't handle floating numbers, even not if used during a "single expression". That only works for files with 100*(N>0) values. So either bc or awk is required to do the math.
In case you have an "odd" amount of values, you should replace "$((num / 100 * $p))" with "$(awk "BEGIN {print int($num/100*$p)}")" in the code above.
Finally awk is part of that answer. ;)