grep: Keeping lines that have a specific string in a certain column - awk

I am trying to pick out the lines that have a certain value in a certain column and save them to an output file. I am trying to do this with grep. Is it possible?
My data looks like this:
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
I want to pick out the lines that have the value 5 in the 2nd column and save them to a new output file.
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
I would appreciate any help!

It is probably possible with grep, but the adequate tool for this operation is definitely awk. You can select every line having 5 in the second column with
awk '$2 == 5'
Explanation
awk splits its input into records (usually lines) and fields (usually columns) and performs actions on records matching certain conditions. Here
awk '$2 == 5'
is a short form for
awk '$2 == 5 {print($0)}'
which translates to
For each record, if the second field ($2) is 5, print the full record ($0).
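One detail the explanation leaves implicit is that `$2 == 5` is a numeric comparison whenever the field looks like a number. The sketch below uses made-up rows (not from the question) to contrast it with a string comparison against `"5"`:

```shell
# Made-up sample rows: "05" is numerically 5 but is a different string.
rows='a 05 x
b 5 y
c 15 z'

numeric=$(printf '%s\n' "$rows" | awk '$2 == 5')     # numeric comparison
textual=$(printf '%s\n' "$rows" | awk '$2 == "5"')   # string comparison

printf 'numeric match:\n%s\n' "$numeric"
printf 'string match:\n%s\n' "$textual"
```

The numeric form keeps both the `05` and the `5` rows, while the string form keeps only the row whose second field is exactly `5`. Which one is wanted depends on the data.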
Variations
If you need to choose dynamically the key value used to filter your values, use the -v option of awk:
awk -v "key=5" '$2 == key {print($0)}'
If you need to keep the first line of the file because it contains a header to the table, use the NR variable that keeps track of the ordinal number of the current record:
awk 'NR == 1 || $2 == 5'
The field separator is a regular expression defining which text separates columns; it can be changed with the -F option. For instance, if your data were in a basic CSV file, the filter would be
awk -F", *" '$2 == 5'
Visit the awk tag wiki to find some useful information to get started learning awk.
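Since the question also asks for the result to be saved, here is a minimal end-to-end sketch. The temporary directory and the file names data.txt and output.txt are assumptions chosen for illustration:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Recreate the sample data from the question.
cat > "$workdir/data.txt" <<'EOF'
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
EOF

# Keep rows whose 2nd field equals the key and redirect them to a new file.
awk -v key=5 '$2 == key' "$workdir/data.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

The shell redirection `>` does the saving; awk itself only prints the matching records.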

To print when the second field is 5 use: awk '$2==5' file

Give this a try:
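grep '^[^[:space:]]\+[[:space:]]\+5[[:space:]]' file.txt
The pattern looks for the start of the line, followed by one or more non-space characters, followed by one or more spaces, followed by 5, followed by a space, so that values such as 55 are not matched. The classes are written as [[:space:]] because \s is not treated as whitespace inside a bracket expression.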

You can use the following command.
$ cat data.txt
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
grape 55 kkkkkkk aaaaaa
$ grep -E '^[^ ]+ +5 ' data.txt > output.txt
$ cat output.txt
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
You can get the answer with grep alone.
But I strongly recommend that you use awk.
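To see why the grep pattern needs care, the sketch below runs an anchored grep and the awk filter over made-up rows that contain a 55 in column 2 and a 5 in column 3. The rows and the temporary file are assumptions for illustration only:

```shell
tmp=$(mktemp -d)
# Only the first row has exactly 5 in the second column.
printf '%s\n' 'apple 5 abc' 'grape 55 kkk' 'melon 1 5' > "$tmp/data.txt"

# Anchored at the start of the line; 5 must be followed by a space or end of line.
grep_out=$(grep -E '^[^ ]+ +5( |$)' "$tmp/data.txt")

# The awk equivalent simply names the column.
awk_out=$(awk '$2 == 5' "$tmp/data.txt")

printf 'grep: %s\nawk:  %s\n' "$grep_out" "$awk_out"
```

Both commands keep only the first row here, but the awk version states the intent directly and does not depend on how many spaces separate the columns.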

The simple way to do it is:
grep '5' MyDataFile
The result:
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
To capture that in a new file:
grep '5' MyDataFile > newfile
Note: that will find a 5 anywhere in MyDataFile. If you want to limit the match to the second column only, then a quick script like the following will do. Usage: script number datafile:
#!/bin/bash
while read -r fruit num stuff || [ -n "$stuff" ]; do
[ "$num" -eq "$1" ] && printf "%s %s %s\n" "$fruit" "$num" "$stuff"
done <"$2"
output:
$ ./fruit.sh 5 dat/mydata.dat
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You can use awk, checking that the row number is greater than 3 and then checking for an even row number with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
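The same idea can be generalised to "every nth entry of the mth column, starting at a given line". The following is a sketch only; the variable names start, step and col are introduced here for illustration and are not part of the answer above:

```shell
tmp=$(mktemp -d)
# Sample data from the question (without the trailing "." lines).
cat > "$tmp/file.txt" <<'EOF'
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
EOF

# Print column "col" on line "start" and on every "step"-th line after it.
picked=$(awk -v start=4 -v step=2 -v col=3 \
    'NR >= start && (NR - start) % step == 0 { print $col }' "$tmp/file.txt")

printf '%s\n' "$picked"
```

Using `(NR - start) % step == 0` rather than `NR % 2 == 0` means the starting line does not have to be even for the selection to line up.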
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors
generated either.
You do not need cat when using sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should consider whether you need that part at all: your sample input does not contain a pipe character, so there is nothing for it to replace. Note also that GNU sed treats \| as alternation rather than a literal pipe, so a literal pipe would be written as a plain |. You might also drop that part if the lines holding the values you want to print do not contain that character.
The output is empty because NR%2==4 never holds: the remainder of division by x is always smaller than x (in the particular case of %2 only two values are possible: 0 and 1).
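A quick way to convince yourself of the point about remainders is to print NR next to NR % 2 for a few lines. The four input lines below are arbitrary placeholders:

```shell
# Print the record number and its remainder modulo 2 for four dummy lines.
remainders=$(printf '%s\n' a b c d | awk '{ print NR, NR % 2 }')

printf '%s\n' "$remainders"
```

The second column only ever shows 1 or 0, so a test such as `NR % 2 == 4` can never succeed.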
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce back slashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. \2 holds the last match of the repeated inner group, which in conjunction with the {3} means it holds the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

awk find out how many times columns two and three equal specific word

Let's say I have a names.txt file with the following content:
Bob Billy Billy
Bob Billy Joe
Bob Billy Billy
Joe Billy Billy
and using awk I want to find out how many times $2 = Billy while $3 = Billy. In this case my desired output would be 3 times.
Also, I'm testing this on a mac if that matters.
You first need to test $2==$3 then test that one of those equals "Billy". Increment a counter and then print the result at the end:
$ awk '$2==$3 && $2=="Billy"{cnt++} END{print cnt+0}' names.txt
3
Or, you could almost write just what you said:
$ awk '$2=="Billy" && $3=="Billy" {cnt++} END{print cnt+0}' names.txt
3
And if you want to use a variable so you don't need to type it several times:
$ awk -v name='Billy' '$2==name && $3==name {cnt++}
END{printf "Found \"%s\" %d times\n", name, cnt+0}' names.txt
Found "Billy" 3 times
Or, you could collect them all up and report what was found:
$ awk '{cnts[$2 "," $3]++}
END{for (e in cnts) print e ": " cnts[e]}' names.txt
Billy,Billy: 3
Billy,Joe: 1
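The `cnt+0` in the commands above matters when nothing matches: an unset awk variable prints as an empty string, and adding 0 forces it to print as a number. A small sketch, using a name ("Zed") that is deliberately absent from the sample rows:

```shell
names='Bob Billy Billy
Bob Billy Joe'

# No row has "Zed" in column 2, so cnt is never set.
without=$(printf '%s\n' "$names" | awk '$2 == "Zed" { cnt++ } END { print cnt }')
with=$(printf '%s\n' "$names" | awk '$2 == "Zed" { cnt++ } END { print cnt + 0 }')

printf 'without +0: [%s]\nwith +0:    [%s]\n' "$without" "$with"
```

The first form prints an empty line, the second prints 0.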
You may also consider using grep to do that:
$ grep -c "\sBilly\sBilly" names.txt
3
-c: print a count of matching lines

awk: Search missing value in file

awk newbie here! I am asking for help to solve a simple specific task.
Here is file.txt
1
2
3
5
6
7
8
9
As you can see a single number (the number 4) is missing. I would like to print on the console the number 4 that is missing. My idea was to compare the current line number with the entry and whenever they don't match I would print the line number and exit. I tried
cat file.txt | awk '{ if ($NR != $1) {print $NR; exit 1} }'
But it prints only a newline.
I am trying to learn awk via this small exercise. I am therefore mainly interested in solutions using awk. I also welcome an explanation for why my code does not do what I would expect.
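Try this. NR is the current line number, whereas $NR is the content of field number NR. On line 2 your code compares $2 (which is empty) with $1, finds them different and prints the empty $2, hence the newline.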
awk '{ if (NR != $1) {print NR; exit 1} }' file.txt
4
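The difference between NR and $NR is easy to see on input that has several fields per line. The three identical rows below are made up for illustration:

```shell
# NR is the line number; $NR is the field whose index equals the line number.
shown=$(printf '%s\n' 'x y z' 'x y z' 'x y z' | awk '{ print NR, $NR }')

printf '%s\n' "$shown"
```

On line 1 `$NR` is the first field, on line 2 the second, and so on. With a single-column file, every `$NR` beyond line 1 is therefore empty.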
Since you already have a solution, here is another approach that compares each value with the previous one.
awk '$1!=p+1{print p+1} {p=$1}' file
Your positional comparison won't work if more than one value is missing.
Maybe this will help:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
4
Since I am learning awk as well
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
`awk` explanation: `i` is a counter that is incremented on every line with `i++`; if it differs from the number in `$1`, the sequence has a gap, so `i` is printed and then reset to `$1`.
cat file
1
2
3
5
6
7
8
9
11
12
13
15
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
10
14
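The commands above report one number per gap. If a gap can span several consecutive numbers, a loop over the range between the previous and the current value lists all of them. This is a sketch that assumes the input is sorted and is meant to start at 1; the variable name prev is introduced here for illustration:

```shell
# Sample input with two gaps of two numbers each (3-4 and 7-8).
missing=$(printf '%s\n' 1 2 5 6 9 | awk '
    # prev starts out as 0, so the sequence is assumed to begin at 1.
    { for (i = prev + 1; i < $1; i++) print i; prev = $1 }
')

printf '%s\n' "$missing"
```

For this input it lists 3, 4, 7 and 8, one per line.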

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields after line 6. This can be done using NR. The tricky part is the following: every second field should go in the same column, and an E should be added before the sign of the exponent, so that the output file looks like this
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output you can see that I want to keep only the first 10 characters of $6.
How is it possible to do it in awk?
You can do it all in awk, but it is perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps shorter version,
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed or embed something similar in the awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
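As a sketch of the "embed it in awk" suggestion, the substitution can be done with sub() on each field before printing. The input below is line 6 of the question's file as it appears here; the regular expression is an assumption of this sketch and only touches a sign that is followed by digits at the end of the field, so a leading minus sign would be left alone:

```shell
line='8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4'

converted=$(printf '%s\n' "$line" | awk '{
    $6 = substr($6, 1, 10)              # drop the trailing record id glued to field 6
    for (i = 1; i <= 6; i += 2) {
        a = $i; b = $(i + 1)
        sub(/[+-][0-9]+$/, "E&", a)     # insert E before the exponent sign
        sub(/[+-][0-9]+$/, "E&", b)
        print a, b
    }
}')

printf '%s\n' "$converted"
```

In the replacement string, `&` stands for the matched text, so `+6` becomes `E+6`.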
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
    for (i=1;i<NF;i+=4) {
        print $i "E" $(i+1), $(i+2) "E" $(i+3)
    }
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there's various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields) but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer with substr, and the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

Addition of particular numbers in a file using awk or grep

I am looking for something like this:
FILE NAME : fruites.txt
Apple a day keeps doctor away
but people dont like it............... 23 peoples found.
Banana_A.1 keeps u fit
and its very tasty.................... 12 peoples found.
Banana_B.2 juices is very good to taste
and most people like them
as well as consumed the most.......... 15 peoples found.
Anar is difficult to eat
as well as its very costly............ 35 peoples found.
grapes are easy to eat
and people like it the most........... 10 peoples found.
fruites are very healthy and improves vitamins.
Apple : The apple tree is a deciduous tree in the rose family best known for its sweet, pomaceous
fruit, the apple.
Banana_A.1: A banana is an edible fruit, botanically a berry, produced by several kinds of large
herbaceous flowering plants in the genus Musa.
Banana_B.2: A banana is an fruit, botanically a kerry, produced by several kinds of large
herbaceous flowering plants in the genus Musa.
Anar : The pomegranate, botanical name Punica granatum, is a fruit-bearing deciduous shrub or
small tree growing between 5 and 8 m tall.
I want the sum of all the "peoples found" counts except those for banana
ANS : 68 ( 23+35+10 )
I am able to find the counts separately, but unable to subtract them.
I tried this:
grep -E ".found" fruites.txt | awk ' { sum+=$3 } END {print sum }'
ANS : 95 (68+27)
grep -E "Banana|.found" fruites.txt | grep -A1 "Banana" | grep -E ".found" | awk ' { sum+=$3 } END {print sum }'
ANS : 27 ( only bananas)
Can anyone please help?
awk '$1 != "Banana" {s+=$(NF-2)} END { print s}' RS= fruites.txt
The key here is the RS= assignment which makes awk treat each section of text delimited by blank lines as a separate record. Note that you may prefer to write RS="" fruites.txt for clarity, but that is not necessary. Be sure not to omit the space after the =, though, as the key is to have a blank string as the value of RS.
-- Edit --
Given the comments and the modified question, perhaps you want:
awk '! match($1,"Banana") && match($NF, "found") {
s += $(NF-2)} END { print s }' RS= fruites.txt
You could use the below awk command.
$ awk -v RS="\n\n" '!/Banana/ && /peoples found\.$/{s+=$(NF-2)} END { print s}' file
68
The above awk command sets a blank line \n\n as the Record seperator value and check for the non-existence of Banana string and the existence of peoples found. string at the last. If both conditions are satisfied, then only the sum of third column from the last would be calculated. So s+=$(NF-2) also written as s = s + $(NF-2) contains the sum. Printing the value of s at the last will give you the total sum.