Grep specific part of string from another file - awk

I want to match the first three digits of the numbers in 1.txt against the three digits that follow the leading zeros in 2.txt.
cat 1.txt
23456
12345
6789
cat 2.txt
20000023485 xxx888
20000012356 xxx888
20000067234 xxx234
Expected output
20000023485 xxx888
20000012356 xxx888

awk 'FNR==NR {a[substr($1,1,3)];next}
{match($1, /0+/);
if(substr($1, RSTART+RLENGTH,3) in a)print}' 1.txt 2.txt
{a[substr($1,1,3)];next} - stores the first 3 characters in an associative array. (Note that awk's substr indexes from 1; starting at 0 silently drops a character in most awks.)
match($1, /0+/);if(substr($1, RSTART+RLENGTH,3) in a)
Matches the run of zeros, then checks whether the 3 characters following it are present in the associative array created earlier, and prints the whole line if a match is found.
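A quick way to see what match() sets, using the first sample line (RSTART is where the run of zeros begins, RLENGTH is its length, so RSTART+RLENGTH is the first position after it):
$ echo '20000023485' | awk '{match($1, /0+/); print RSTART, RLENGTH, substr($1, RSTART+RLENGTH, 3)}'
2 5 234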

Try this with grep:
grep -f <(sed 's/^\(...\).*/00\1/' 1.txt) 2.txt
Output:
20000023485 xxx888
20000012356 xxx888

grep -f will match a series of patterns from the given file, one per line. But first you need to turn 1.txt into the patterns you want. In your case, you want the first three characters of each line of 1.txt, after zeros: 00*234, 00*123, etc. (I'm assuming you want at least one zero.)
sed -e 's/^\(...\).*$/00*\1/' 1.txt > 1f.txt
grep -f 1f.txt 2.txt
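You can check the patterns that sed generates before feeding them to grep:
$ sed -e 's/^\(...\).*$/00*\1/' 1.txt
00*234
00*123
00*678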

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this does not print anything, and no errors are generated either.
You might use awk, checking that the record number is greater than 3 and then checking for an even record number with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this does not print anything, and no errors are generated either.
You do not need cat when using GNU sed, as it can read a file on its own; in this case that would be sed 's/\|/ /' file.txt.
Consider whether you need that part at all: your sample input contains no pipe character, so the substitution does nothing. You can drop it as long as the lines holding the values you want never contain that character.
The output is empty because NR%2==4 never holds: the remainder of a division by x is always smaller than x (in the particular case of %2, only two values are possible, 0 and 1).
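A quick illustration that a remainder mod 2 can only ever be 0 or 1:
$ awk 'BEGIN{for (i=1; i<=6; i++) print i, i%2}'
1 1
2 0
3 1
4 0
5 1
6 0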
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce backslashes in the regexps by turning on -E.
From the fourth line, and then every second line thereafter, capture the third column and print it.
N.B. The \2 holds the last occupant of that back reference; because the group is repeated {3} times, it ends up containing the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

Count b or B in even lines

I need to count the number of times the letter 'b' or 'B' appears in the even lines of file.txt, e.g. for a file.txt like:
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
I got the first part, but I do not know how to count the b and B letters together:
awk '!(NR%2)' file.txt
$ awk '!(NR%2){print gsub(/[bB]/,"")}' file
4
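If you want one grand total rather than a per-line count, a small variation sums the gsub() return values, since gsub() returns the number of substitutions it made (a sketch):
$ awk '!(NR%2){n+=gsub(/[bB]/,"")} END{print n+0}' file
4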
Could you please try the following, one more approach with awk. Written on mobile so not tested yet, but it should work:
awk -F'[bB]' 'NR%2 == 0{print (NF ? NF - 1 : 0)}' Input_file
Splitting on b and B means NF-1 is the number of those characters in the line; the ternary guards empty lines, where NF is 0 and NF-1 would print -1. Thanks to Ed for solving that zero-match line problem in the comments.
In a single awk:
awk '!(NR%2){gsub(/[^Bb]/,"");print length}' file.txt
gsub(/[^Bb]/,"") deletes every character in the line except for B and b.
print length prints the length of the resulting string.
awk '!(NR%2)' file.txt | tr -cd 'Bb' | wc -c
Explanation:
awk '!(NR%2)' file.txt : keep only even lines from file.txt
tr -cd 'Bb' : keep only B and b characters
wc -c : count characters
Example:
With the file below, the result is 4.
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
Here is another way; the tr -d '\n' stops wc from also counting the newline that sed prints after each selected line:
$ sed -n '2~2s/[^bB]//gp' file | tr -d '\n' | wc -c
4

Comparing two files based on 1st column, printing the unique part of one file

I have two files looking like this:
file1:
RYR2 29 70 0.376583106063 4.77084855376
MUC16 51 94 0.481067457376 3.9233164551
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
file2:
A26B2
RYR2
MUC16
ACTL9
I need to compare them based on first column and print only those lines of first file that are not in second, so the output should be:
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
I tried with grep:
grep -vFxf file2 file1
with awk:
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file2 file1
comm:
comm -23 <(sort file1) <(sort file2)
None of these work.
You can use
grep -vFf file2 file1
Also, grep -vf file2 file1 will work, too, but if the file2 strings contain characters like * or [ that should be read as literal characters, you might get into trouble, since they would need to be escaped. -F makes grep treat those strings as fixed strings.
NOTES
-v: Invert match.
-f file: Take regexes from a file.
-F: Interpret the pattern as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
So, it reads the patterns from file2 and applies them to file1, and once it finds a match, that line is not output due to the inverted match. This is enough here because only the first column contains alphanumerics; the rest of the columns hold numeric data only.
Why your command did not work
The -x (short for --line-regexp) option means "Select only those matches that exactly match the whole line." Since file2 contains only the first-column values, no whole line of file1 can ever match, so with -v every line of file1 is printed.
Also, see more about grep options in grep documentation.
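If the numeric columns could ever contain one of the strings from file2, you can anchor the comparison to the first field with awk; this is also the one-character fix for the awk attempt above, comparing $1 instead of $0 (a sketch):
awk 'NR==FNR {exclude[$1]; next} !($1 in exclude)' file2 file1
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752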

Only output line if value in specific column is unique

Input:
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
Desired output:
line3 c dd
line5 b ef
That is, I want to output a line only if no other line has the same value in column 2. I thought I could do this with a combination of sort (e.g. sort -k2,2 input) and uniq, but it appears that uniq can only skip fields from the left (-f avoids comparing the first N fields). Surely there's some straightforward way to do this with awk or something.
You can do this as a two-pass awk script:
awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file
This runs through the file once incrementing a counter in an array whose key is the second field of each line, then runs through a second time printing only those lines whose counter is less than 2.
You'd need multiple reads of the file because at any point during the first read, you can't possibly know whether there will be another instance of the second field of that line later in the file.
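With the sample input saved as file, this produces the desired output:
$ awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file
line3 c dd
line5 b ef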
Here is a one pass awk solution:
awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' file
However, the original order of the file will be lost.
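If keeping the original order matters, a one-pass variant can defer printing to the END block, at the cost of holding the whole file in memory (a sketch):
awk '{cnt[$2]++; line[NR]=$0; key[NR]=$2}
END{for (i=1; i<=NR; i++) if (cnt[key[i]]==1) print line[i]}' file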
You can combine awk, grep, sort and uniq for a quick one-liner:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt
Edit: to keep the extracted values from being interpreted as regexes (metacharacters, \+ and backreferences), escape them with a sed step:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt
(Both versions assume uniq -d emits a single duplicated value; several values would be spliced into one pattern.)
An alternative to awk, to demonstrate that it can still be done with sort and uniq (there is the -u option for this); however, setting up the right format requires some juggling (the decorate/do stuff/undecorate pattern).
$ paste file <(cut -d' ' -f2 file) | sort -k2 | uniq -uf3 | cut -f1
line5 b ef
line3 c dd
As a side effect you lose the original order, which can be recovered as well if you add line numbers as part of the decoration...
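For example, decorating with line numbers via cat -n so that the original order can be restored with sort -n at the end (a sketch building on the same pipeline):
$ paste <(cat -n file) <(cut -d' ' -f2 file) | sort -k5 | uniq -uf4 | sort -n | cut -f2
line3 c dd
line5 b ef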

Count files that have unique prefixes

I have a set of files that look like the following. I'm looking for a good way to count all files that have unique prefixes, where "prefix" is defined as all the characters before the second hyphen.
0406-0357-9.jpg 0591-0349-9.jpg 0603-3887-27.jpg 59762-1540-40.jpg 68180-517-6.jpg
0406-0357-90.jpg 0591-0349-90.jpg 0603-3887-28.jpg 59762-1540-41.jpg 68180-517-7.jpg
0406-0357-91.jpg 0591-0349-91.jpg 0603-3887-29.jpg 59762-1540-42.jpg 68180-517-8.jpg
0406-0357-92.jpg 0591-0349-92.jpg 0603-3887-3.jpg 59762-1540-5.jpg 68180-517-9.jpg
0406-0357-93.jpg 0591-0349-93.jpg 0603-3887-30.jpg 59762-1540-6.jpg
Depending on what you actually want output, either of these might be what you want:
ls | awk -F'-' '{c[$1"-"$2]++} END{for (p in c) print p, c[p]}'
or
ls | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
If it's something else, update your question to show the output you're looking for.
This should do it:
ls *.jpg | cut -d- -s -f1,2 | uniq | wc -l
Or if your prefixes are always 4 digits, one dash, 4 digits, you don't need cut:
ls *.jpg | uniq -w9 | wc -l
Parses ls (bad, but it doesn't look like it will cause a problem with these filenames),
uses awk to set the field separator to -.
!seen[$1,$2]++ uses an associative array with $1,$2 as the key and post-increments it; the expression is true only the first time a key is seen (while its count is still 0), so each prefix is counted exactly once.
print prints on screen :)
ls | awk 'BEGIN{FS="-" ; printf("%-20s%-10s\n","Prefix","Count")} {seen[$1"-"$2]++} END{ for (k in seen){printf("%-20s%-10i\n",k,seen[k])}}'
Will now count based on prefix with headers :)
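Since parsing ls is fragile in general, here is a variant that avoids it, assuming only that the filenames contain no newlines (a sketch):
$ printf '%s\n' *.jpg | awk -F'-' '!seen[$1 FS $2]++ {n++} END {print n+0}'
5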