Edit binary data in PDF with SED / BBE (change colors in a PDF)

I want to change some background colors in a batch of PDFs.
I found out that the color information is stored in the first stream - endstream block,
in a format like this: 1 1 1 sc, which in this example represents white (#FFFFFF).
Here is an example after I decode the binary stream with
qpdf --qdf --object-streams=disable IN.pdf OUT.pdf
stream
q Q q /Cs1 cs 0.9686275 0.9725490 0.9764706 sc 0 12777 m 600 12777 l 600 0
l 0 0 l h f 0 12777 m 600 12777 l 600 0 l 0 0 l h f ➡️1 1 1 sc⬅️ 0 12575 m 600
12575 l 600 12308 l 0 12308 l h f 0.1254902 0.2666667 0.3921569 sc 0 872 m
600 872 l 600 462 l 0 462 l h f 0 462 m 600 462 l 600 0 l 0 0 l h f ➡️1 1 1
sc⬅️ 0 12297 m 600 12297 l 600 5122 l 0 5122 l h f 0.7411765 0.8980392 0.9725490
sc 23 7249 m 577 7249 l 577 6007 l 23 6007 l h f 1 0.9215686 0.9333333 sc
23 5848 m 577 5848 l 577 5533 l 23 5533 l h f 0.9686275 0.9725490 0.9764706
sc 23 5510 m 577 5510 l 577 5156 l 23 5156 l h f ➡️1 1 1 sc⬅️ 0 5110 m 600 5110
...
endstream
If I open the PDF in TextEdit and manually replace 1 1 1 sc with 0 1 0 sc, my white background immediately changes to green after saving the PDF file.
How can I do this in an automated way with a text tool?
sed 's/1 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
gives me the error: sed: RE error: illegal byte sequence
bbe -e 's/0 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
no errors; OUT.pdf is written, but no colors have changed
echo 'hello 1 1 1 sc world' | bbe -e 's/1 1 1 sc/0 1 0 sc/'
seems to work fine...
In the above stream (the first stream block of the 1-page PDF file) I need to replace only the second and third match. The second one spans a line break?

It is not completely clear what you are doing.
You mention commands:
qpdf --qdf --object-streams=disable IN.pdf OUT.pdf
sed 's/1 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
bbe -e 's/0 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
It is not obvious if IN.pdf in the sed or bbe commands is the same IN.pdf file as in the qpdf command.
If all three commands use the same file as input, that can explain why bbe fails: before qpdf decodes it, the stream is still compressed, so the literal text 1 1 1 sc does not occur in the original file.
Another possibility is that the bbe command shown is the command you are actually using and not a typo: it searches for 0 1 1 sc, not 1 1 1 sc, so it never matches.
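If that is the typo, correcting the search string and running it on the qpdf-decoded file should work (a guess based on your own commands; FIXED.pdf is just a placeholder name):
bbe -e 's/1 1 1 sc/0 1 0 sc/' OUT.pdf > FIXED.pdf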
sed is not designed to work with binary data.
Although the GNU implementation has a non-standard -z option to help read binary files, it still works on a form of "lines". Perl can be used as an improved sed here.
To change only the first three instances of the string 1 1 1 sc in the file, you could try:
qpdf --qdf --object-streams=disable IN.pdf - |\
perl -0777 -pe 'for $i (1..3) { s/1 1 1 sc/0 1 0 sc/ }' |\
qpdf - OUT.pdf
In this Perl command:
-0777 - treat the entire input as a single record
-pe - run the command on each record, then print (like sed)
for $i (1..3) { ... } - run the substitution three times
s/.../.../ - similar to sed's s/// command

I think I will tackle this task with PikePDF, a Python library which seems to be able to work with content streams: https://pikepdf.readthedocs.io/en/latest/topics/content_streams.html
I was just able to pretty-print the content streams by using:
#!/usr/bin/env python
import pikepdf

with pikepdf.open('IN.pdf') as pdf:
    page = pdf.pages[0]
    instructions = pikepdf.parse_content_stream(page)
    data = pikepdf.unparse_content_stream(instructions)
    print(data.decode('ascii'))
Now working my way to actually editing the content stream .......... (a first sketch of what I have in mind follows after the listing below).
Here is the stream fragment from my question, pretty-printed:
q
Q
q
/Cs1 cs
0.9686275 0.9725490 0.9764706 sc
0 12777 m
600 12777 l
600 0 l
0 0 l
h
f
0 12777 m
600 12777 l
600 0 l
0 0 l
h
f
➡️1 1 1 sc⬅️
0 12575 m
600 12575 l
600 12308 l
0 12308 l
h
f
0.1254902 0.2666667 0.3921569 sc
0 872 m
600 872 l
600 462 l
0 462 l
h
f
0 462 m
600 462 l
600 0 l
0 0 l
h
f
➡️1 1 1 sc⬅️
0 12297 m
600 12297 l
600 5122 l
0 5122 l
h
f
0.7411765 0.8980392 0.9725490 sc
23 7249 m
577 7249 l
577 6007 l
23 6007 l
h
f
1 0.9215686 0.9333333 sc
23 5848 m
577 5848 l
577 5533 l
23 5533 l
h
f
0.9686275 0.9725490 0.9764706 sc
23 5510 m
577 5510 l
577 5156 l
23 5156 l
h
f
➡️1 1 1 sc⬅️
0 5110 m
600 5110
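A minimal, untested sketch of the edit step I have in mind. It assumes that pikepdf.unparse_content_stream accepts plain (operands, operator) tuples, that numeric operands can be compared via float(), and that the 2nd and 3rd sc hits in the full stream are exactly the occurrences highlighted above:

#!/usr/bin/env python
import pikepdf

with pikepdf.open('IN.pdf') as pdf:
    page = pdf.pages[0]
    new_instructions = []
    hits = 0
    for operands, operator in pikepdf.parse_content_stream(page):
        # Count every "1 1 1 sc" and rewrite only the 2nd and 3rd hit.
        if operator == pikepdf.Operator('sc') and [float(x) for x in operands] == [1, 1, 1]:
            hits += 1
            if hits in (2, 3):
                new_instructions.append(([0, 1, 0], operator))  # green
                continue
        new_instructions.append((operands, operator))
    # Replace the page's content stream with the edited instruction list.
    page.Contents = pdf.make_stream(pikepdf.unparse_content_stream(new_instructions))
    pdf.save('OUT.pdf')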
Some more info about the color value:
Just divide the RGB values by 255
for example:
DeepSkyBlue = #00bfff = RGB(0, 191, 255)
0/255 = 0
191/255 = 0.7490196
255/255 = 1
0 0.7490196 1 sc
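The same conversion as a tiny helper (rgb_to_sc is a hypothetical name, not part of any library):

def rgb_to_sc(r, g, b):
    # Convert 0-255 RGB components to a PDF 'sc' operand string,
    # keeping 7 significant digits as in the streams above.
    return ' '.join('%.7g' % (c / 255) for c in (r, g, b)) + ' sc'

print(rgb_to_sc(0, 191, 255))  # -> 0 0.7490196 1 sc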

Related

Remove dots but not decimals in tab delimited file

I have these data:
#chr pos ref alt af_alt filter an
22 10510033 T C . AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
(non-dot lines snipped)
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598
I want to remove lone dots in every column (e.g. in the af_alt column) but leave decimals alone (e.g. the last row).
I tried this solution, but it does not seem to change the file in any way:
awk 'BEGIN {OFS=FS=" "} {gsub(/^\.$/,"",$1)}1'
In awk you can do:
awk -v OFS="\t" '/\./{for (i=1;i<=NF;i++) if ($i==".") $i=""} 1' file
This works on any field (where a regex relying on a leading or trailing space does not) and allows awk to handle the surrounding field separators. That allows simple string equality to be used to test for ".".
With your example (with the runs of spaces replaced with tabs) this prints:
#chr pos ref alt af_alt filter an
22 10510033 T C AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
22 10510077 C A 0 AC0 18
22 10510103 A T 0 AC0;AS_VQSR 64
22 10510105 T A 0 AC0;AS_VQSR 70
22 10510113 C T 0 AC0;AS_VQSR 94
22 10510119 A G 0 AC0;AS_VQSR 120
22 10510130 A G 0 AC0;AS_VQSR 164
22 10510138 CATA C 0 AC0;AS_VQSR 218
22 10510143 T A 0 AC0;AS_VQSR 264
22 10510161 T A 0 AC0;AS_VQSR 430
22 10510164 A T 0 AC0;AS_VQSR 468
22 10510169 G A 0 AS_VQSR 502
22 10510171 C T 0 AC0;AS_VQSR 530
22 10510183 A G 0 AS_VQSR 718
22 10510193 G C 0 AC0;AS_VQSR 804
22 10510200 C T 0 AC0;AS_VQSR 936
22 10510212 A T 0 AS_VQSR 1070
22 10510228 G T 0 AC0;AS_VQSR 1318
22 10510232 A G 0 AS_VQSR 1364
22 10510233 G A 0 AC0 1370
22 10510235 C A 0 AC0;AS_VQSR 1376
22 10510236 G A 0 AC0 1394
22 10510250 C T 0 AC0;AS_VQSR 1434
22 10510258 C T 0 AS_VQSR 1442
22 10510263 A T 0 AC0;AS_VQSR 1486
22 10510276 G A 0 AC0;AS_VQSR 1550
22 10510277 A G 0 AC0;AS_VQSR 1570
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598
You might harness GNU sed for this task in the following way. Let file.txt be tab-separated with the following content:
1.1 2.2 3.3
1.1 2.2 .
1.1 . 3.3
1.1 . .
. 2.2 3.3
. 2.2 .
. . 3.3
. . .
then
sed -e 's/^\.\t/\t/' -e 's/\t\.$/\t/' -e 's/\t\.\t/\t\t/g' file.txt
gives output
1.1 2.2 3.3
1.1 2.2
1.1 3.3
1.1
2.2 3.3
2.2
3.3
Explanation: there are 3 cases: . might be at the beginning of the line, at the end of the line, or in the middle of the line. The 1st and 2nd are handled using ^ (start of line) and $ (end of line), and it is sufficient to do each once; the 3rd might require global (g) replacement. Each match is replaced with the number of TAB characters it contained (1, 1 and 2 respectively). Observe that . needs to be escaped to mean a literal dot, not any character.
(tested in GNU sed 4.7)
You can use sed to replace any dot between spaces by a space:
sed 's/ \. / /'
For the last column, you might need a $ instead of the final space.
For a regex solution, I'd use perl
perl -pe 's/(^|\t)\.(\t|$)/$1$2/g'
Demo:
Some tab-separated text:
$ printf '%s\t%s\t%s\n' 1 2 3 4 . 5 . 7.5 .
1 2 3
4 . 5
. 7.5 .
with the perl filter
$ printf '%s\t%s\t%s\n' 1 2 3 4 . 5 . 7.5 . | perl -pe 's/(^|\t)\.(\t|$)/$1$2/g'
1 2 3
4 5
7.5
If you don't want to deal with loops but also don't mind an extra downstream pipe:
{m,g,n}awk 'gsub("\23\\456", "\23_", $!(NF = NF))^_' OFS='\23' |
column -s$'\23' -t
#chr pos ref alt af_alt filter an
22 10510033 T C _ AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
(non-dot lines snipped)
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598

Map tab separated file with a list using awk/sed/join

I have multiple large files with two columns (tab-delimited).
The content of these files are like this:
working_file-1
K00001 0.188
K00005 15.97
K00008 188.09
mapping file
K00001
K00002
K00003
K00004
K00005
K00006
K00007
K00008
mapping file range is K00001 - K25804
I want to map my working_file-1 to mapping file so that the output looks like this:
K00001 0.188
K00002
K00003
K00004
K00005 15.97
K00006
K00007
K00008 188.09
The blanks (K numbers not present in working_file) can be filled with zero (if possible) or left blank.
So far, I have tried these commands by following other similar posts (but without success):
awk 'NR==FNR {a[$1]++; next} $1 in a' mapping file working_file-1 > output.file
grep -Fw -f mapping file working_file-1 > output.file
edit: output of od -c work1; od -c map
0000000 K 0 0 0 0 1 \r \n K 0 0 0 0 2 \r \n
0000020 K 0 0 0 0 3 \r \n K 0 0 0 0 4 \r \n
0000040 K 0 0 0 0 5 \r \n K 0 0 0 0 6 \r \n
0000060 K 0 0 0 0 7 \r \n K 0 0 0 0 8 \r \n
Given a "map" file like:
a
b
c
d
e
f
g
h
and a "work1" file like:
a A
c C
g G
and desired "merged1" output like:
a A
b
c C
d
e
f
g G
h
then join can do the combining:
join -1 1 -2 1 -a 1 -o 0,2.2 map work1 > merged1
-1 1 -2 1 joins on first (space-delimited) field of each file
-a 1 prints lines from first file even if unpaired
-o 0,2.2 formats output to be join field (first column), separator (space), then second field of second file
To produce instead "merged2" like:
a A
b 0
c C
d 0
e 0
f 0
g G
h 0
add the -e option:
join -1 1 -2 1 -a 1 -e 0 -o 0,2.2 map work1 > merged2
If "work2" has fields separated by a character other than single space, use the -t option.
For example, with "work2" using single tab delimiter like:
a A
c C
g G
(note: Stack Overflow displays this with spaces instead of a single tab) then with a POSIX shell, use -t ' ' (i.e. QUOTE TAB QUOTE; the tab may need to be typed as CTRL-V TAB if the shell performs completion on TAB):
join -t ' ' -1 1 -2 1 -a 1 -e 0 -o 0,2.2 map work2 >merged3
or with bash, using -t $'\t' is possible instead:
join -t $'\t' -1 1 -2 1 -a 1 -e 0 -o 0,2.2 map work2 >merged3
to produce "merged3" like:
a A
b 0
c C
d 0
e 0
f 0
g G
h 0
(note: again Stack Overflow displays the tab as spaces)
Using GNU sort for -s (stable sort) this might be what you want:
$ sort -k1,1 -u -s working_file-1 mapping_file
K00001 0.188
K00002
K00003
K00004
K00005 15.97
K00006
K00007
K00008 188.09
or if you want to add 0s:
$ sort -k1,1 -u -s working_file-1 mapping_file |
awk -v OFS='\t' '{print $1, $2+=0}'
K00001 0.188
K00002 0
K00003 0
K00004 0
K00005 15.97
K00006 0
K00007 0
K00008 188.09
If you don't have GNU sort then you could do:
$ sort -k1,1 -k2,2rn working_file-1 mapping_file |
awk -v OFS='\t' '$1 != p{print $1, $2+0; p=$1}'
K00001 0.188
K00002 0
K00003 0
K00004 0
K00005 15.97
K00006 0
K00007 0
K00008 188.09

PDF m l operators

I am using a PDF parser to extract lines from a PDF document. It fails on a specific PDF generated from a Word doc. The smallest PDF it fails for has a 1-row, 1-cell table, but the stream shows a 1-row, 2-cell table. I have these questions:
Why does the stream show 2 cells instead of just 1?
What are those re operators for, as there are no rectangles?
Who generates these instructions, is it MS Word? Or the PDF Printing application (Cute PDF Writer)?
Here is the PDF:
Here is the relevant stream:
stream
q 0.12 0 0 0.12 0 0 cm
/R7 gs
q
647 5996 m
700 5996 l
700 5885 l
647 5885 l
h
W n
0 0 0 rg
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 11.04 Tf
0.998087 0 0 1 77.64 709.2 Tm
()Tj
ET
Q
Q
q
700 5996 m
746 5996 l
746 5885 l
700 5885 l
h
W n
0 0 0 rg
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 11.04 Tf
0.998087 0 0 1 84 709.2 Tm
()Tj
ET
Q
Q
0 0 0 rg
600 5996 4 4 re
f
600 5996 4 4 re
f
604 5996 3892 4 re
f
4496 5996 4 4 re
f
4496 5996 4 4 re
f
600 5884 4 112 re
f
600 5880 4 4 re
f
600 5880 4 4 re
f
604 5880 3892 4 re
f
4496 5884 4 112 re
f
4496 5880 4 4 re
f
4496 5880 4 4 re
f
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 11.04 Tf
0.998087 0 0 1 72 695.28 Tm
()Tj
ET
Q
Q
endstream
And here is the image drawn using the m and l instructions above:
Why does the stream show 2 cells instead of just 1?
The stream does not show any cells at all. Only tagged PDFs may have a certain awareness of tables and table cells but your PDF does not look tagged.
What you (considering your question title) appear to mean are the sequences
647 5996 m
700 5996 l
700 5885 l
647 5885 l
h
W n
and
700 5996 m
746 5996 l
746 5885 l
700 5885 l
h
W n
But all they do is intersect the current clip path with a rectangle. Thus, subsequent drawing operations are restricted to the respective rectangle. Such restrictions can be found in PDFs in many situations; table cells are only one of them, and such clip path changes are not even necessary for table cells...
Furthermore, considering the preceding transformation matrix change
0.12 0 0 0.12 0 0 cm
the rectangles above are fairly small, each probably large enough for a single character.
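To put numbers on that (simple arithmetic from the coordinates above): the first clip rectangle spans 700 - 647 = 53 units horizontally and 5996 - 5885 = 111 units vertically; scaled by 0.12 that is roughly 6.4 x 13.3 points, which indeed fits about one character of the 11.04 pt font selected by /R8 11.04 Tf.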
What are those re operators for, as there are no rectangles?
Well, they are rectangles.
Very small in height and/or width, but rectangles nonetheless.
And they are filled rectangles, cf. the f operator.
To make a long story short, the "lines" around the area we perceive as a table cell are actually filled rectangles:
604 5996 3892 4 re
600 5884 4 112 re
604 5880 3892 4 re
4496 5884 4 112 re
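(For reference, re takes the operands x y width height, so 604 5996 3892 4 re is a bar 3892 units wide and 4 units high; after the 0.12 scaling that is about 467 points long and roughly half a point thick, i.e. a hairline.)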
Furthermore the corners of the cell are drawn as tiny squares (and each corner twice):
600 5996 4 4 re
600 5996 4 4 re
4496 5996 4 4 re
4496 5996 4 4 re
600 5880 4 4 re
600 5880 4 4 re
4496 5880 4 4 re
4496 5880 4 4 re
Thus, these re instructions give you the border edges and corners of what we perceive as table cell.
Who generates these instructions, is it MS Word? Or the PDF Printing application (Cute PDF Writer)?
The concrete instructions you see are PDF instructions. Thus, your printing application creates them.
Of course, though, your printing application creates them because that is how it interprets the MS Word output...
Cute PDF Writer apparently (from a quick glance on their web page) uses the Windows printing system. In general, in cases like this, you print from MS Word, and MS Word will try to use Windows methods to draw the lines and other items, which the printer driver (Cute PDF Writer in this case) will then translate to PDF commands. An intermediate stage with first rendering to PostScript and then translating to PDF is also possible.
So, that would mean that MS Word is responsible for the fact that two cells are drawn.
I only see one rectangle in the image of the PDF that you posted, so I'm not sure what is happening here. Also, I can't explain the other re commands. The rectangles in the second image look like they might be a frame around a two-on-one printed page, but the coordinates look strange, so it could also be something else.

Merge files print 0 in empty field

I have 5 tab delim files
file 0 is basically a key
A
C
F
AA
BC
CC
D
KKK
S
file1
A 2
C 3
F 5
AA 5
BC 4
D 7
file2
A 2
C 3
F 7
D 10
file3
A 2
C 2
F 5
CC 4
D 7
file4
A 1
C 3
F 5
CC 4
D 7
KKK 10
I would like to merge all files based on the 1st column and print 0 in missing fields.
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
Columns must keep the order of input file0, file1, file2, file3, file4
I was going to wait until you included your own attempt in your question, but since you have 2 answers already anyway....
$ cat tst.awk
NR==FNR {
    key2rowNr[$1] = ++numRows
    rowNr2key[numRows] = $1
    next
}
FNR==1 { ++numCols }
{
    rowNr = key2rowNr[$1]
    vals[rowNr,numCols] = $2
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", rowNr2key[rowNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%d", OFS, vals[rowNr,colNr]
        }
        print ""
    }
}
$ awk -f tst.awk file0 file1 file2 file3 file4
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
awk solution
awk '
FNR==1 { f++ }
{
    a[f""$1] = $2
    b[$1]++
}
END {
    for (i in b) {
        printf i" "
        for (j=1; j<=f; j++) {
            tmp = j""i
            if (tmp in a) {
                printf a[tmp]" "
            } else {
                printf 0" "
            }
        }
        print ""
    }
}
' file*
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
First I store every value, keyed by file number and key, in array a,
then store all unique keys in array b.
In the END block I check whether each key exists in a: if it exists, print its value; if not, print 0.
We can drop file0 from the argument list; if we do, awk shows only the keys that exist in file1, file2, file3, file4, ...
Not awk, but this sort of joining of files on a common field is exactly what join is meant for. Complicated a bit by it only working with two files at a time; you have to pipe the results of each one into the next as the first file.
$ join -o 0,2.2 -e0 -a1 <(sort file0) <(sort file1) \
| join -o 0,1.2,2.2 -e0 -a1 - <(sort file2) \
| join -o 0,1.2,1.3,2.2 -e0 -a1 - <(sort file3) \
| join -o 0,1.2,1.3,1.4,2.2 -e0 -a1 - <(sort file4) \
| tr ' ' '\t'
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
Caveats: This requires a shell like bash or zsh that understands <(command) redirection. Sorting all the files in advance is an alternative. Or, as pointed out, even though join normally requires its input files to be sorted on the column being joined on, it works anyway without the sorts for this particular input.
With GNU awk you can use the ENDFILE clause to make sure you have enough elements in all rows, e.g.:
parse.awk
BEGIN { OFS = "\t" }
# Collect all information into the `h` hash
{ h[$1] = (ARGIND == 1 ? $1 : h[$1] OFS $2) }
# At the end of each file do the necessary padding
ENDFILE {
    for (k in h) {
        elems = split(h[k], a, OFS)
        if (elems != ARGIND)
            h[k] = h[k] OFS 0
    }
}
# Print the content of `h`
END {
    for (k in h)
        print h[k]
}
Run it like this:
awk -f parse.awk file[0-4]
Output:
AA 5 0 0 0
A 2 2 2 1
C 3 3 2 3
D 7 10 7 7
BC 4 0 0 0
CC 0 0 4 4
S 0 0 0 0
KKK 0 0 0 10
F 5 7 5 5
NB: This solution assumes you only have two columns per file (except the first one).
You could use coreutils join to determine missing fields and add them to each file:
sort file0 > file0.sorted
for file in file[1-4]; do
    {
        cat $file
        join -j 1 -v 1 file0.sorted <(sort $file) | sed 's/$/ 0/'
    } | sort > $file.sorted
done
Now you just need to paste them together:
paste file0.sorted \
<(cut -d' ' -f2 file1.sorted) \
<(cut -d' ' -f2 file2.sorted) \
<(cut -d' ' -f2 file3.sorted) \
<(cut -d' ' -f2 file4.sorted)
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0

script to remove redundant lines from two different files

I will explain my problem with an example
I have the following files in Solaris
file1:
1 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U386.A0 I have some text here
1 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U386.A1 I have some text here
2 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U246.A0 I have some text here
2 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U246.A1 I have some text here
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
6 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U375.A0 I have some text here
6 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U375.A1 I have some text here
7 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U98.A I have some text here
8 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U392.A0 I have some text here
8 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U392.A1 I have some text here
9 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U372.A0 I have some text here
10 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U372.A1 I have some text here
11 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U184.A I have some text here
12 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U97.B I have some text here
file2:
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1
INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1
Now I want to use file2 as a reference and print all the lines in file1 that match.
Expected output is:
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
I have tried grep:
grep -F -x -f file1 -v file2 > file3
and fgrep:
fgrep -x -f file1 -v file2 > file3
based on several posts from Stack Overflow, but didn't find what I need. Since I am a beginner, I am really confused about how to solve this. Your help is most appreciated.
This works for you:
grep -Ff file2 file1 > file3
(Your attempts had the file arguments swapped: -f names the pattern file, which should be file2 here. Also, -x requires whole-line matches, which can never succeed because the lines in file1 carry extra fields, and -v inverts the match, printing only the non-matching lines.)
Test with your files:
kent$ grep -Ff f2 f1
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here