Trying to join two text files based on the first column in both files and keep all the columns of the matches from the second file - awk

I'm trying to join two text files based on their first columns, and where those columns match I want to keep all the columns from the second file.
List1.txt
action
adan
adap
adapka
adat
yen
List2.txt
action e KK SS # n
adham a d h a m
adidas a d i d a s
administration e d m i n i s t r e SS # n
administrative e d m i n i s t r e t i v
admiral e d m aj r # l
adnan a d n a n
ado a d o
adan a d # n
adap a d a p
adapka a d a p k a
adrenalin # d r e n # l i n
adrian a d r j a n
adat a d a t
adtec e d t e k
adult # d a l t
yen j e n
I'd like to get everything from List1.txt that matches List2.txt, plus all the other columns in List2.txt. List3.txt should look like this:
List3.txt
action e KK SS # n
adan a d # n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
I've tried the following command from here:
$awk -F: 'FNR==NR{a[$1]=$0;next}{if($1 in a){print a[$1];} else {print;}}' List1.txt List2.txt > List3.txt
I've also tried this:
$comm <(sort List2.txt) <(sort List1.txt)

I'm sure there are ways to do this in awk, but join is also relatively simple.
join -1 1 -2 1 List1.txt <(sort -k 1,1 List2.txt) > List3.txt
-1 1 joins List1 on its first column and -2 1 joins List2 on its first column. Both files must be sorted on the join field for join to work; List1.txt is already sorted here, so only List2.txt needs sorting.
This produces the columns you want, separated by a single space.
List3.txt
action e KK SS # n
adan a d # n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
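For reference, an awk version of the same lookup might look like this (a minimal sketch, assuming whitespace-separated columns: it first reads List1.txt into an array, then prints only the List2.txt lines whose first column is in that array):
awk 'NR==FNR{want[$1]; next} $1 in want' List1.txt List2.txt > List3.txt
Unlike join, this approach does not require either file to be sorted.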

Another simple way to accomplish what you are attempting is with grep, using the values in List1.txt as fixed-string patterns to match against the content of List2.txt and redirecting the result to List3.txt, e.g.
grep -Ff List1.txt List2.txt > List3.txt
If you are using GNU grep, or any grep where -w (--word-regexp) is available, adding -w ensures only whole-word matches, e.g.
grep -Fwf List1.txt List2.txt > List3.txt
Resulting List3.txt
$ cat List3.txt
action e KK SS # n
adan a d # n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
(note: all whitespace is preserved in List3.txt)
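One thing to keep in mind is that grep -F matches its fixed strings anywhere on the line, not only in the first column, so a short entry could also hit inside a longer word. A small illustration with GNU grep and the hypothetical pattern ada (not part of List1.txt):
$ printf 'ada\n' | grep -Ff - List2.txt      # substring hits: the adan, adap, adapka and adat lines
$ printf 'ada\n' | grep -Fwf - List2.txt     # nothing, since "ada" never appears as a whole word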

Related

oc rsh + awk prints extra indentation at the beginning of each line; it seems only a line break is done, with no carriage return

I want to filter lines of oc rsh du -shc output like this:
oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '$2=="total" {print $1}'
But I get no results. For a local du -shc /some/dir 2>&1 I get the desired output.
# locally
$ du -shc ~ 2>&1
du: cannot read directory '/home/xxxx/.cache/doc': Permission denied
du: cannot read directory '/home/xxxx/.cache/dconf': Permission denied
du: cannot read directory '/home/xxxx/.gvfs': Permission denied
52G /home/xxxx/
52G total
And filtering:
# filtering works; search the 2nd arg "total" and print arg 1
$ du -shc ~ 2>&1 | awk '$2=="total" {print $1}'
52G
And printing:
$ du -shc ~ 2>&1 | awk '{print $1}'
du:
du:
du:
52G
52G
And $2:
$ du -shc ~ 2>&1 | awk '{print $2}'
cannot
cannot
cannot
/home/xxx
total
But remotely:
oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '$2=="total" {print $1}'
# no output
And if I don't use awk:
$ oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1
du: cannot read directory '/proc/tty/driver': Permission denied
du: cannot read directory '/proc/acpi': Permission denied
du: cannot read directory '/proc/scsi': Permission denied
du: cannot access '/proc/130224/task/130224/fd/3': No such file or directory
du: cannot access '/proc/130224/task/130224/fdinfo/3': No such file or directory
du: cannot access '/proc/130224/fd/4': No such file or directory
du: cannot access '/proc/130224/fdinfo/4': No such file or directory
du: cannot read directory '/run/cryptsetup': Permission denied
du: cannot read directory '/run/secrets/rhsm': Permission denied
du: cannot read directory '/sys/firmware': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/1': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/2': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/4': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/3': Permission denied
du: cannot read directory '/var/lib/machines': Permission denied
du: cannot read directory '/var/cache/ldconfig': Permission denied
1.8G /
1.8G total
command terminated with exit code 1
And, if I only print $1:
$ oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '{print $1}'
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
1.8G
1.8G
command
Why are there extra indentations? It seems only a line break is done, but no carriage return to the beginning of the line?
If I print $2, we can see the 2 lines at the end are aligned; what is wrong here?
$ oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '{print $2}'
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
/
total
terminated
Local awk version is:
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.
and the remote (OpenShift) awk version is:
GNU Awk 4.0.2
Copyright (C) 1989, 1991-2012 Free Software Foundation.
and local du version is
du (GNU coreutils) 8.28
Copyright (C) 2017 Free Software Foundation, Inc.
while the remote (OpenShift pod) du version is:
du (GNU coreutils) 8.22
Copyright (C) 2013 Free Software Foundation, Inc.
The remote versions seem to be far behind the local ones; notice the copyright years.
As requested by @Ed Morton:
$ oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | od -c
0000000 d u : c a n n o t r e a d
0000020 d i r e c t o r y ' / p r o c
0000040 / t t y / d r i v e r ' : P e
0000060 r m i s s i o n d e n i e d \r
0000100 \n d u : c a n n o t r e a d
0000120 d i r e c t o r y ' / p r o
0000140 c / a c p i ' : P e r m i s s
0000160 i o n d e n i e d \r \n d u :
0000200 c a n n o t r e a d d i r e
0000220 c t o r y ' / p r o c / s c s
0000240 i ' : P e r m i s s i o n d
0000260 e n i e d \r \n d u : c a n n o
0000300 t a c c e s s ' / p r o c /
0000320 2 8 7 2 7 / t a s k / 2 8 7 2 7
0000340 / f d / 3 ' : N o s u c h
0000360 f i l e o r d i r e c t o r
0000400 y \r \n d u : c a n n o t a c
0000420 c e s s ' / p r o c / 2 8 7 2
0000440 7 / t a s k / 2 8 7 2 7 / f d i
0000460 n f o / 3 ' : N o s u c h
0000500 f i l e o r d i r e c t o r
0000520 y \r \n d u : c a n n o t a c
0000540 c e s s ' / p r o c / 2 8 7 2
0000560 7 / f d / 4 ' : N o s u c h
0000600 f i l e o r d i r e c t o
0000620 r y \r \n d u : c a n n o t a
0000640 c c e s s ' / p r o c / 2 8 7
0000660 2 7 / f d i n f o / 4 ' : N o
0000700 s u c h f i l e o r d i
0000720 r e c t o r y \r \n d u : c a n
0000740 n o t r e a d d i r e c t o
0000760 r y ' / r u n / c r y p t s e
0001000 t u p ' : P e r m i s s i o n
0001020 d e n i e d \r \n d u : c a n
0001040 n o t r e a d d i r e c t o
0001060 r y ' / r u n / s e c r e t s
0001100 / r h s m ' : P e r m i s s i
0001120 o n d e n i e d \r \n d u : c
0001140 a n n o t r e a d d i r e c
0001160 t o r y ' / s y s / f i r m w
0001200 a r e ' : P e r m i s s i o n
0001220 d e n i e d \r \n d u : c a n
0001240 n o t r e a d d i r e c t o
0001260 r y ' / v a r / l i b / y u m
0001300 / h i s t o r y / 2 0 2 1 - 0 1
0001320 - 2 6 / 1 ' : P e r m i s s i
0001340 o n d e n i e d \r \n d u : c
0001360 a n n o t r e a d d i r e c
0001400 t o r y ' / v a r / l i b / y
0001420 u m / h i s t o r y / 2 0 2 1 -
0001440 0 1 - 2 6 / 2 ' : P e r m i s
0001460 s i o n d e n i e d \r \n d u :
0001500 c a n n o t r e a d d i r
0001520 e c t o r y ' / v a r / l i b
0001540 / y u m / h i s t o r y / 2 0 2
0001560 1 - 0 1 - 2 6 / 4 ' : P e r m
0001600 i s s i o n d e n i e d \r \n d
0001620 u : c a n n o t r e a d d
0001640 i r e c t o r y ' / v a r / l
0001660 i b / y u m / h i s t o r y / 2
0001700 0 2 1 - 0 1 - 2 6 / 3 ' : P e
0001720 r m i s s i o n d e n i e d \r
0001740 \n d u : c a n n o t r e a d
0001760 d i r e c t o r y ' / v a r
0002000 / l i b / m a c h i n e s ' :
0002020 P e r m i s s i o n d e n i e
0002040 d \r \n d u : c a n n o t r e
0002060 a d d i r e c t o r y ' / v
0002100 a r / c a c h e / l d c o n f i
0002120 g ' : P e r m i s s i o n d
0002140 e n i e d \r \n 3 . 7 G \t / \r \n 3
0002160 . 7 G \t t o t a l \r \n
0002173
xxxxxxx#elxag5zs8d3:~
and:
$ oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | awk '{print $1}' | od -c
0000000 d u : \n d u : \n d u : \n d u : \n
*
0000100 3 . 7 G \n 3 . 7 G \n
0000112
If I change RS, I get stranger results.
$ oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | awk 'BEGIN {RS="\r\n";} {print $1}'
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
3.7G
3.7G
xxxxxx#elxag5zs8d3:~
It's very odd that your oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | od -c output shows no blanks or tabs, e.g. between cannot and read in:
0000000 d u : c a n n o t r e a d
while without od -c, i.e. when you run oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1, there clearly are blanks:
du: cannot read directory '/proc/tty/driver': Permission denied
Anyway, as shown in the oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | od -c output your output lines end in \r\n, e.g. (emphasis mine):
0000000 d u : c a n n o t r e a d
0000020 d i r e c t o r y ' / p r o c
0000040 / t t y / d r i v e r ' : P e
0000060 r m i s s i o n d e n i e d **\r
0000100 \n** d u : c a n n o t r e a d
0000120 d i r e c t o r y ' / p r o
0000140 c / a c p i ' : P e r m i s s
0000160 i o n d e n i e d **\r \n**
You mentioned that locally du outputs \n-terminated lines, so it's probably rsh or oc that's changing that to \r\n - try it with some other command, like oc rsh broker-amq-1-15-snd64 date, to verify.
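For example (a sketch, reusing the pod name from the question; piping through od -c makes any stray \r visible):
$ oc rsh broker-amq-1-15-snd64 date | od -c
If the line ends in \r \n rather than just \n, the CR is being added by the rsh/tty layer rather than by the command itself.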
In any case, to handle that simply and portably, and assuming you don't actually want \r\n line endings in your final output, change your awk script from this:
awk '$2=="total" {print $1}'
to this which removes the \r at the end of each line before doing anything else with the input:
awk '{sub(/\r$/,"")} $2=="total" {print $1}'
The reason you weren't getting output with $2=="total" is that given input like:
foo total\r\n
$2 isn't "total", it's "total\r", and so the comparison fails.
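You can reproduce this locally without oc at all (a minimal sketch, using printf to fake one CRLF-terminated line):
$ printf 'foo total\r\n' | awk '$2=="total"{print $1}'
$ printf 'foo total\r\n' | awk '{sub(/\r$/,"")} $2=="total"{print $1}'
foo
The first command prints nothing; once the \r is stripped, the comparison matches.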
You mentioned setting ORS="\r\n" to get the desired output - no, that just propagates the problem to the next command.
You could set RS="\r\n" but that would only work in an awk that accepts multi-char RS, e.g. GNU awk, in other awks that'd be treated the same as RS="\r" per the POSIX standard that says RS is a single char.
There are some platforms out there where when \r\n input is detected either:
The \r automatically gets stripped by the underlying C primitives that awk calls to read input, or
The RS automatically gets set to \r\n.
so just be aware of that and check what input awk is actually getting and what RS and ORS are set to if the sub() above doesn't do what you want (awk 'NR==1{ print "$0=<"$0">" ORS "RS=<"RS">" ORS "ORS=<"ORS">"; exit }' | od -c should tell you all you need to know).
For more information see Why does my tool output overwrite itself and how do I fix it?.
For completeness, as a side note:
oc rsh seems to always append CRLF to the output, and there are some bug reports about that; see https://bugzilla.redhat.com/show_bug.cgi?id=1668763 and https://bugzilla.redhat.com/show_bug.cgi?id=1638447. They say oc rsh requires a pseudo-terminal and that with the option --no-tty=true it will not append CRLF, while oc exec does not append CRLF at all. The behaviour seems to come from the docker term package, which both commands depend on.
This would most likely have to be a patch made against the docker term[1] package, which the rsh and exec commands depend on.
[1]. https://github.com/moby/moby/tree/master/pkg/term
So the following two forms require no stripping of \r afterwards, and at the same time introduce no extra indentation after execution (before the xxxxxx#localhost:~ prompt):
oc exec broker-amq-1-15-snd64 -- du -shc / 2>/dev/null | awk '$2=="total" {print $1}'
3.6G
xxxxxx#localhost:~
or:
oc rsh --no-tty=true broker-amq-1-15-snd64 du -shc / 2>/dev/null | awk '$2=="total" {print $1}'
3.6G
xxxxxx#localhost:~
Note that for oc exec you need the -- before the command.
Perhaps a solution that neither involves forcing oc (or any other application) to alter its line-ending behavior, nor requires fudging with either FS or RS?
gawk -te 'NF=/\11total\15?$/'
19G
Tested and confirmed working on mawk 1.3.4, mawk 1.9.9.6, macos nawk, and gawk 5.1.1, including invocation flags -c/-P/-t/-S
I would've used \r instead of \15 but gawk -t traditional mode was complaining about \r not being available, so \15 would be the most portable solution
But if you don't need gawk -t compatibility and don't mind fudging FS, then
nawk NF=NF==2 FS='\ttotal\r?'
or
mawk -- --NF FS='\ttotal\r?'
or even
mawk 'NF=!+FS<NF' FS='total$' RS='\r?\n'
-- The 4Chan Teller

Remove column taking last column as reference

I am looking to remove the third-last and second-last columns and print the rest using bash, e.g.
Line 1 ------ A B C D E F G H I J K
Line 2 ------ A B C D E F E F I G H I J M
Line 3 ------ A B C D E I J Y
Line 4 ------ A B C D A B C D F G J E F G H I J C
Now, taking the last column as a reference ($NF), I need to remove the third-last and second-last columns.
The desired output should look like the below, where I J has been removed from each line.
Line 1 ------ A B C D E F G H K
Line 2 ------ A B C D E F E F I G H M
Line 3 ------ A B C D E Y
Line 4 ------ A B C D A B C D F G J E F G H C
Thanks
Depending on whether you want to keep or collapse the separators around the removed fields:
$ awk '{$(NF-2)=$(NF-1)=""}1' file
Line 1 ------ A B C D E F G H   K
Line 2 ------ A B C D E F E F I G H   M
Line 3 ------ A B C D E   Y
Line 4 ------ A B C D A B C D F G J E F G H   C
$ awk '{$(NF-2)=$(NF-1)=""; $0=$0; $1=$1}1' file
Line 1 ------ A B C D E F G H K
Line 2 ------ A B C D E F E F I G H M
Line 3 ------ A B C D E Y
Line 4 ------ A B C D A B C D F G J E F G H C
You said in a comment "...retains my tab delimiter". If your fields are tab-separated then state that in your question and add BEGIN{FS=OFS="\t"} at the start of the script.
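If the fields really are tab-separated, note that the collapsing trick above relies on default whitespace splitting, so a different way to drop the two fields entirely is to move the last field left and shrink NF. A hedged sketch (decreasing NF works in GNU awk and most modern awks, but is not strictly guaranteed by POSIX):
awk 'BEGIN{FS=OFS="\t"} {$(NF-2)=$NF; NF-=2} 1' file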
You can do this with a for loop inside of awk:
awk '{for(i=1;i<=NF;++i){if (i<NF-2||i==NF){printf i==NF?"%s\n":"%s ", $i}}}'
That just loops through all of the columns; if a column isn't the second- or third-from-last, it prints it, appending a line feed after the last column and a space after every other printed column.
There may be a prettier way to do it in awk, but it works.
This might work for you (GNU sed):
sed -E 's/(\s+\S+){3}$/\1/' file
Replace the last 3 fields with the last field on each line.
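Because the group (\s+\S+) is repeated three times, \1 keeps only its final repetition, i.e. the last field, which is why only that field survives. A quick check with GNU sed:
$ echo 'A B C D' | sed -E 's/(\s+\S+){3}$/\1/'
A D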

Use Sed/Awk to extract first three unique instances of the line

I have a list with 20000 probes. Is there a way to extract the first three lines/occurrences for each probe using sed/awk?
Example of dataset:
Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe1 D GTTGGCGAAGTCACATCTAG
Probe1 E CATGTCGCCGACTCCGTCGA
Probe1 F GTGATGTTCTGAGTACATAG
Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
Probe3 Y GGAGATGTAGGCCTTAAAAA
Probe3 D GATTGTAGGGGTCCTGCCAG
Desired output:
Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
awk to the rescue!
$ awk '++a[$1]<4' file
The array a counts how many times the probe name in $1 has been seen; a line is printed while that count is still below 4. To also remove any empty lines:
$ awk '++a[$1]<4 && NF' file
No need to use sed or awk here if you'd like to use Python. Unless I've misread your question, this should do it:
probes = [
"""Probe1 A GTTAGAGGAGGTGGAAGAGC
Probe1 B CTGAGGTCGGGACGGAGCAC
Probe1 C GATGTAGGCGGTTGGCGTGG
Probe1 D GTTGGCGAAGTCACATCTAG
Probe1 E CATGTCGCCGACTCCGTCGA
Probe1 F GTGATGTTCTGAGTACATAG""",
"""Probe3 A GATTGTAGGTTTCCTGCCAG
Probe3 L ACCCAGCCAGGGGAAAACCA
Probe3 Z GGAGATGTAGGCGGTTGGCG
Probe3 Y GGAGATGTAGGCCTTAAAAA
Probe3 D GATTGTAGGGGTCCTGCCAG"""]
for probe in probes:
    for i, line in enumerate(probe.split("\n")):
        print(line)
        if i >= 2:
            break

matching rows and fields from two files

I would like to match the record number in one file with the same field number in another file:
file1:
1
3
5
4
3
1
5
file2:
A B C D E F G
H I J J K L M
N O P Q R S T
I would like to use the record numbers corresponding to 5 in the first file to obtain the corresponding fields in the second file. Desired output:
C G
J M
P T
So far, I've done:
awk '{ if ($1=="5") print NR }' file1 > temp
for i in $(cat temp); do
awk '{ print $"'${i}'" }' file2
done
But get the output:
C
J
P
G
M
T
I would like to have this in the format of the desired output above, but can't get it to work. Perhaps using printf or an awk for loop might work, but I have had no success.
Thank you all.
awk 'NR==FNR{if($1==5)a[NR];next}{for(i in a){printf $i" "}print ""}' file1 file2
C G
J M
P T
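One caveat: for (i in a) visits the array in an unspecified order, so with more matching records the columns are not guaranteed to come out in file1 order. A hedged variant that stores the record numbers in an indexed array so the order stays stable:
awk 'NR==FNR{if($1==5) cols[++n]=FNR; next} {for(j=1;j<=n;j++) printf "%s%s", $(cols[j]), (j<n ? " " : "\n")}' file1 file2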

strange html file returned by web server

While working on a web crawler, I ran across this strange occurrence; the following is a snippet of the page content returned by the web server for http://nexgen.ae :
< ! D O C T Y P E H T M L P U B L I C " - / / W 3 C / / D T D H T M L 4 . 0 T r a n s i t i o n a l / / E N " >
< H T M L > < H E A D > < T I T L E > N e x G e n T e c h n o l o g i e s L L C | F i n g e r p r i n t T i m e A t t e n d a n c e M a n a g e m e n t S y s t e m | A c c e s s C o n t r o l M a n a g e m e n t S y s t e m | F a c e R e c o g n i t i o n | D o o r A c c e s s C o n t r o l | E m p l o y e e s A t t e n d a n c e | S o l u t i o n P r o v i d e r | N e t w o r k S t r u c t u e d C a b l i n g | D u b a i | U A E ) < / T I T L E >
As you can see, the web server seems to have inserted a space character after every other character in the original HTML source. I checked the HTML source with "Page Source" in Firefox and there were no extra spaces there. I also checked other web pages from the same website, and I am obtaining the correct HTML file for those pages. So far the problem seems to only be happening with this website's default page when accessed through a web crawler.
I noticed the HTML file contains a "google optimizer tracking script" at the very end. I wonder if the problem has anything to do with that...
Or could this just be the website manager's way of keeping web crawlers away? If that's the case, a robots.txt file would do!
Those probably aren't spaces, they are null bytes. The page is encoded in UTF-16 (multiples of 2 bytes per character, minimum 2), and because the website has not properly specified its encoding in its HTTP headers, you are trying to read it as ASCII (1 byte per character) or possibly UTF-8 (1 byte or more per character).
To see what I mean, open it in your browser and change the encoding (somewhere in the browser's menus, might have to right-click on the page) and choose the UTF-16LE option.
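You can see the same effect from the command line (a small sketch, assuming iconv and od are available):
$ printf 'NexGen' | iconv -f ASCII -t UTF-16LE | od -c
0000000 N \0 e \0 x \0 G \0 e \0 n \0
0000014
Every second byte is a NUL, which shows up as a gap (or a stray blank) when the bytes are read back as a single-byte encoding.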