awk using space in character class - awk

Would anybody be able to tell me how to use awk to split fields on a character class that has a space in it? I have some arbitrary input and I'm trying to parse it using:
somecommand | awk -F'[:space:]' '{print $2}'
I have also tried
somecommand | awk -F'[ ]' '{print $2}'
and
somecommand | awk -F'[[:space:]]' '{print $2}'
I know for sure that the command below works, but I can't use it because I need to split on multiple delimiters, space being one of them (colon being the other), which is why I'm trying to use a character class.
somecommand | awk -F' ' '{print $2}'
The problem is that I am getting spaces in the output of the first three examples, which I obviously would not expect. I am certain that I am actually getting spaces in the output because I am piping the output to "od -a" and can see the character is actually a space. I am using GNU Awk 3.1.7. I appreciate your help :-)
]$ cat hi.txt | awk -F'[[:blank:]]' '{print $2,$3}'
*
*
*
*
*
]$ cat hi.txt
* PID: 4240 Sessions: 0 Processed: 1 Uptime: 3m 28, URL : http://127.0.0.1:56784, Password: F3N04418IlVe9230Vs2bnJ7lPk9KE2PIzhlSOrv173X
* PID: 4247 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:59918, Password: g0jOTdEawcbFluPMGawbbeSo3u7mQvYlT5P136ymELh
* PID: 4254 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:60870, Password: C9QHDrxfCPZjZd1AiIvbSdnaG3rI0QuvfUf4stvdqww
* PID: 4261 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:56414, Password: cnPMFUXq6f6AtkcqOCqba7Bx76mC3K7uX7XxCfIDYYq
* PID: 4268 Sessions: 0 Processed: 0 Uptime: 3m 28, URL : http://127.0.0.1:53253, Password: yAff1NmrTz4B6kUKCsZ0HPvA75bWAvmOl8YHUi9aOgo
]$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
~]$ cat hi.txt | od -a
0000000 sp sp * sp P I D : sp 4 2 4 0 sp sp sp
0000020 sp S e s s i o n s : sp 0 sp sp sp sp
0000040 sp sp sp P r o c e s s e d : sp 1 sp
0000060 sp sp sp sp sp sp U p t i m e : sp 3 m
0000100 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0000120 : sp h t t p : / / 1 2 7 . 0 . 0
0000140 . 1 : 5 6 7 8 4 , sp P a s s w o
0000160 r d : sp F 3 N 0 4 4 1 8 I l V e
0000200 9 2 3 0 V s 2 b n J 7 l P k 9 K
0000220 E 2 P I z h l S O r v 1 7 3 X nl
0000240 sp sp * sp P I D : sp 4 2 4 7 sp sp sp
0000260 sp S e s s i o n s : sp 0 sp sp sp sp
0000300 sp sp sp P r o c e s s e d : sp 0 sp
0000320 sp sp sp sp sp sp U p t i m e : sp 3 m
0000340 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0000360 : sp h t t p : / / 1 2 7 . 0 . 0
0000400 . 1 : 5 9 9 1 8 , sp P a s s w o
0000420 r d : sp g 0 j O T d E a w c b F
0000440 l u P M G a w b b e S o 3 u 7 m
0000460 Q v Y l T 5 P 1 3 6 y m E L h nl
0000500 sp sp * sp P I D : sp 4 2 5 4 sp sp sp
0000520 sp S e s s i o n s : sp 0 sp sp sp sp
0000540 sp sp sp P r o c e s s e d : sp 0 sp
0000560 sp sp sp sp sp sp U p t i m e : sp 3 m
0000600 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0000620 : sp h t t p : / / 1 2 7 . 0 . 0
0000640 . 1 : 6 0 8 7 0 , sp P a s s w o
0000660 r d : sp C 9 Q H D r x f C P Z j
0000700 Z d 1 A i I v b S d n a G 3 r I
0000720 0 Q u v f U f 4 s t v d q w w nl
0000740 sp sp * sp P I D : sp 4 2 6 1 sp sp sp
0000760 sp S e s s i o n s : sp 0 sp sp sp sp
0001000 sp sp sp P r o c e s s e d : sp 0 sp
0001020 sp sp sp sp sp sp U p t i m e : sp 3 m
0001040 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0001060 : sp h t t p : / / 1 2 7 . 0 . 0
0001100 . 1 : 5 6 4 1 4 , sp P a s s w o
0001120 r d : sp c n P M F U X q 6 f 6 A
0001140 t k c q O C q b a 7 B x 7 6 m C
0001160 3 K 7 u X 7 X x C f I D Y Y q nl
0001200 sp sp * sp P I D : sp 4 2 6 8 sp sp sp
0001220 sp S e s s i o n s : sp 0 sp sp sp sp
0001240 sp sp sp P r o c e s s e d : sp 0 sp
0001260 sp sp sp sp sp sp U p t i m e : sp 3 m
0001300 sp 2 8 , sp sp sp sp U R L sp sp sp sp sp
0001320 : sp h t t p : / / 1 2 7 . 0 . 0
0001340 . 1 : 5 3 2 5 3 , sp P a s s w o
0001360 r d : sp y A f f 1 N m r T z 4 B
0001400 6 k U K C s Z 0 H P v A 7 5 b W
0001420 A v m O l 8 Y H U i 9 a O g o nl
0001440
[a.cri.dsullivan#aaa-arnor04 ~]$ cat hi.txt | awk -F'[[:blank:]]' '{print $4,$6}'
PID:
PID:
PID:
PID:
PID:
[a.cri.dsullivan#aaa-arnor04 ~]$ cat hi.txt | awk -F'[[:blank:]]' '{print $4,$7}'
PID:
PID:
PID:
PID:
PID:
[a.cri.dsullivan#aaa-arnor04 ~]$ cat hi.txt | awk -F'[[:space:]]' '{print $4,$7}'
PID:
PID:
PID:
PID:
PID:

Here's a space class with semicolons, commas and colons:
[[:space:];:,]
And some other variations:
[[:blank:][:cntrl:]]
[ \t,:;]
Update
Try this one:
somecommand | awk -F'[[:space:]]+' '{print $2}'
The + makes a whole run of whitespace count as a single separator. With the single-character class [[:space:]] (or [ ]), every space in a run is its own delimiter, so consecutive spaces produce empty fields, which is why the first three attempts print blanks. (Don't use * instead of +: a separator that can match the empty string is not what you want.)
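As a quick check against the hi.txt sample above (a rough sketch, not run here; the two leading blanks on each line create an empty $1, so the interesting fields shift by one):
awk -F'[[:space:]]+' '{print $3, $4}' hi.txt
# PID: 4240
# PID: 4247
# ...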

There are a couple of different Field Separator values that might do what you want:
FS="[[:space:]:]+"
or
FS="[[:space:]]+|:"
or
FS="[[:space:]]*:[[:space:]]+"
or something else. Given the input data you posted at https://gist.github.com/anonymous/a007032c9facd9e71dff, that last one actually looks like what you really want.
If you go with FS="[[:space:]:]+" it will break your URL value (e.g. http://127.0.0.1:56784) into 3 separate pieces around the colons, and the other alternatives will have similar negative impacts on various fields.
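For instance, a rough illustration against the hi.txt sample above (not run here; the field numbers assume that exact data, and the leading blanks create an empty $1):
awk -F'[[:space:]:]+' '{print $4}' hi.txt            # PID values: 4240, 4247, ...
awk -F'[[:space:]:]+' '{print $13, $14, $15}' hi.txt # the URL, broken into pieces: http //127.0.0.1 56784,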

Related

oc rsh + awk prints extra indentation at the beginning of each line; it seems only a line feed is done but no carriage return

I want to filter lines of oc rsh du -shc output like this:
oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '$2=="total" {print $1}'
But I get no results. For a local du -shc /some/dir 2>&1 I get the desired output.
# locally
$ du -shc ~ 2>&1
du: cannot read directory '/home/xxxx/.cache/doc': Permission denied
du: cannot read directory '/home/xxxx/.cache/dconf': Permission denied
du: cannot read directory '/home/xxxx/.gvfs': Permission denied
52G /home/xxxx/
52G total
And filtering:
# filtering works: match "total" in the 2nd field and print the 1st field
$ du -shc ~ 2>&1 | awk '$2=="total" {print $1}'
52G
And printing:
$ du -shc ~ 2>&1 | awk '{print $1}'
du:
du:
du:
52G
52G
And $2:
$ du -shc ~ 2>&1 | awk '{print $2}'
cannot
cannot
cannot
/home/xxx
total
But remotely:
oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '$2=="total" {print $1}'
# no output
And if I don't use awk:
$ oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1
du: cannot read directory '/proc/tty/driver': Permission denied
du: cannot read directory '/proc/acpi': Permission denied
du: cannot read directory '/proc/scsi': Permission denied
du: cannot access '/proc/130224/task/130224/fd/3': No such file or directory
du: cannot access '/proc/130224/task/130224/fdinfo/3': No such file or directory
du: cannot access '/proc/130224/fd/4': No such file or directory
du: cannot access '/proc/130224/fdinfo/4': No such file or directory
du: cannot read directory '/run/cryptsetup': Permission denied
du: cannot read directory '/run/secrets/rhsm': Permission denied
du: cannot read directory '/sys/firmware': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/1': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/2': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/4': Permission denied
du: cannot read directory '/var/lib/yum/history/2021-12-02/3': Permission denied
du: cannot read directory '/var/lib/machines': Permission denied
du: cannot read directory '/var/cache/ldconfig': Permission denied
1.8G /
1.8G total
command terminated with exit code 1
And, if I only print $1:
$ oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '{print $1}'
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
1.8G
1.8G
command
Why are there extra indentations? It seems only a line feed is done, but no carriage return to the beginning of the line.
If I print $2, we can see the 2 lines at the end are aligned; what is wrong here?
$ oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '{print $2}'
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
cannot
/
total
terminated
Local awk version is:
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.
and the remote (OpenShift) awk version is:
GNU Awk 4.0.2
Copyright (C) 1989, 1991-2012 Free Software Foundation.
and the local du version is:
du (GNU coreutils) 8.28
Copyright (C) 2017 Free Software Foundation, Inc.
while the remote (OpenShift pod) du version is:
du (GNU coreutils) 8.22
Copyright (C) 2013 Free Software Foundation, Inc.
The remote versions seem to be well behind the local ones; notice the copyright years.
As per the request of @Ed Morton:
$ oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | od -c
0000000 d u : c a n n o t r e a d
0000020 d i r e c t o r y ' / p r o c
0000040 / t t y / d r i v e r ' : P e
0000060 r m i s s i o n d e n i e d \r
0000100 \n d u : c a n n o t r e a d
0000120 d i r e c t o r y ' / p r o
0000140 c / a c p i ' : P e r m i s s
0000160 i o n d e n i e d \r \n d u :
0000200 c a n n o t r e a d d i r e
0000220 c t o r y ' / p r o c / s c s
0000240 i ' : P e r m i s s i o n d
0000260 e n i e d \r \n d u : c a n n o
0000300 t a c c e s s ' / p r o c /
0000320 2 8 7 2 7 / t a s k / 2 8 7 2 7
0000340 / f d / 3 ' : N o s u c h
0000360 f i l e o r d i r e c t o r
0000400 y \r \n d u : c a n n o t a c
0000420 c e s s ' / p r o c / 2 8 7 2
0000440 7 / t a s k / 2 8 7 2 7 / f d i
0000460 n f o / 3 ' : N o s u c h
0000500 f i l e o r d i r e c t o r
0000520 y \r \n d u : c a n n o t a c
0000540 c e s s ' / p r o c / 2 8 7 2
0000560 7 / f d / 4 ' : N o s u c h
0000600 f i l e o r d i r e c t o
0000620 r y \r \n d u : c a n n o t a
0000640 c c e s s ' / p r o c / 2 8 7
0000660 2 7 / f d i n f o / 4 ' : N o
0000700 s u c h f i l e o r d i
0000720 r e c t o r y \r \n d u : c a n
0000740 n o t r e a d d i r e c t o
0000760 r y ' / r u n / c r y p t s e
0001000 t u p ' : P e r m i s s i o n
0001020 d e n i e d \r \n d u : c a n
0001040 n o t r e a d d i r e c t o
0001060 r y ' / r u n / s e c r e t s
0001100 / r h s m ' : P e r m i s s i
0001120 o n d e n i e d \r \n d u : c
0001140 a n n o t r e a d d i r e c
0001160 t o r y ' / s y s / f i r m w
0001200 a r e ' : P e r m i s s i o n
0001220 d e n i e d \r \n d u : c a n
0001240 n o t r e a d d i r e c t o
0001260 r y ' / v a r / l i b / y u m
0001300 / h i s t o r y / 2 0 2 1 - 0 1
0001320 - 2 6 / 1 ' : P e r m i s s i
0001340 o n d e n i e d \r \n d u : c
0001360 a n n o t r e a d d i r e c
0001400 t o r y ' / v a r / l i b / y
0001420 u m / h i s t o r y / 2 0 2 1 -
0001440 0 1 - 2 6 / 2 ' : P e r m i s
0001460 s i o n d e n i e d \r \n d u :
0001500 c a n n o t r e a d d i r
0001520 e c t o r y ' / v a r / l i b
0001540 / y u m / h i s t o r y / 2 0 2
0001560 1 - 0 1 - 2 6 / 4 ' : P e r m
0001600 i s s i o n d e n i e d \r \n d
0001620 u : c a n n o t r e a d d
0001640 i r e c t o r y ' / v a r / l
0001660 i b / y u m / h i s t o r y / 2
0001700 0 2 1 - 0 1 - 2 6 / 3 ' : P e
0001720 r m i s s i o n d e n i e d \r
0001740 \n d u : c a n n o t r e a d
0001760 d i r e c t o r y ' / v a r
0002000 / l i b / m a c h i n e s ' :
0002020 P e r m i s s i o n d e n i e
0002040 d \r \n d u : c a n n o t r e
0002060 a d d i r e c t o r y ' / v
0002100 a r / c a c h e / l d c o n f i
0002120 g ' : P e r m i s s i o n d
0002140 e n i e d \r \n 3 . 7 G \t / \r \n 3
0002160 . 7 G \t t o t a l \r \n
0002173
xxxxxxx#elxag5zs8d3:~
and:
$ oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | awk '{print $1}' | od -c
0000000 d u : \n d u : \n d u : \n d u : \n
*
0000100 3 . 7 G \n 3 . 7 G \n
0000112
If I change RS, I get stranger results.
$ oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | awk 'BEGIN {RS="\r\n";} {print $1}'
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
du:
3.7G
3.7G
xxxxxx#elxag5zs8d3:~
It's very odd that your oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | od -c output shows no blanks or tabs, e.g. between cannot and read in:
0000000 d u : c a n n o t r e a d
while without od -c, i.e. when you run oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1, there clearly are blanks:
du: cannot read directory '/proc/tty/driver': Permission denied
Anyway, as shown in the oc rsh broker-amq-1-15-snd64 du -shc / 2>/dev/null | od -c output your output lines end in \r\n, e.g. (emphasis mine):
0000000 d u : c a n n o t r e a d
0000020 d i r e c t o r y ' / p r o c
0000040 / t t y / d r i v e r ' : P e
0000060 r m i s s i o n d e n i e d **\r
0000100 \n** d u : c a n n o t r e a d
0000120 d i r e c t o r y ' / p r o
0000140 c / a c p i ' : P e r m i s s
0000160 i o n d e n i e d **\r \n**
You mentioned that locally du outputs \n-terminated lines, so that just means it's probably rsh or oc that's changing that to \r\n - try it with some other command like oc rsh broker-amq-1-15-snd64 date to verify.
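For example, a minimal check along those lines (just a sketch, using the pod name from above):
oc rsh broker-amq-1-15-snd64 date | od -c    # a trailing \r \n here would confirm that oc/rsh is adding the CR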
In any case, to handle that simply and portably, and assuming you don't actually want \r\n line endings in your final output, change your awk script from this:
awk '$2=="total" {print $1}'
to this which removes the \r at the end of each line before doing anything else with the input:
awk '{sub(/\r$/,"")} $2=="total" {print $1}'
The reason you weren't getting output with $2=="total" is that given input like:
foo total\r\n
$2 isn't "total", it's "total\r", and so the comparison fails.
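A quick way to see that locally (a sketch; on most Unix awks \r is neither stripped nor part of the default field separator):
printf 'foo total\r\n' | awk '$2=="total"{print "matched"}'                    # prints nothing: $2 is "total\r"
printf 'foo total\r\n' | awk '{sub(/\r$/,"")} $2=="total"{print "matched"}'    # prints: matched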
You mentioned setting ORS="\r\n" to get the desired output - no, that just propagates the problem to the next command.
You could set RS="\r\n" but that would only work in an awk that accepts multi-char RS, e.g. GNU awk, in other awks that'd be treated the same as RS="\r" per the POSIX standard that says RS is a single char.
There are some platforms out there where when \r\n input is detected either:
The \r automatically gets stripped by the underlying C primitives that awk calls to read input, or
The RS automatically gets set to \r\n.
so just be aware of that, and if the sub() above doesn't do what you want, check what input awk is actually getting and what RS and ORS are set to (piping your command into awk 'NR==1{ print "$0=<"$0">" ORS "RS=<"RS">" ORS "ORS=<"ORS">"; exit }' | od -c should tell you all you need to know).
For more information see Why does my tool output overwrite itself and how do I fix it?.
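Putting that together for the original command (a sketch based on the fix above):
oc rsh broker-amq-1-2-fsbnd du -shc / 2>&1 | awk '{sub(/\r$/,"")} $2=="total" {print $1}'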
For completeness, as a side note:
oc rsh seems to always append CRLF to the output, and there are some bug reports about that; see https://bugzilla.redhat.com/show_bug.cgi?id=1668763 and https://bugzilla.redhat.com/show_bug.cgi?id=1638447. They say oc rsh requires a pseudo-terminal, and that with the option --no-tty=true it will not append CRLF. Meanwhile oc exec will not append CRLF. It seems to come from the docker term package, which both depend on.
This would most likely have to be a patch made against the docker term[1] package, which the rsh and exec commands depend on.
[1]. https://github.com/moby/moby/tree/master/pkg/term
So, the following two forms require no stripping of \r afterwards, and at the same time introduce no extra indentation after execution (before the xxxxxx#localhost:~ prompt):
oc exec broker-amq-1-15-snd64 -- du -shc / 2>/dev/null | awk '$2=="total" {print $1}'
3.6G
xxxxxx#localhost:~
or:
oc rsh --no-tty=true broker-amq-1-15-snd64 du -shc / 2>/dev/null | awk '$2=="total" {print $1}'
3.6G
xxxxxx#localhost:~
Note that for exec you need the -- separator.
Perhaps a solution that neither involves forcing oc (or any other application) to alter its line-ending behavior nor requires fudging with either FS or RS?
gawk -te 'NF=/\11total\15?$/'
19G
(Here \11 is the tab and \15 the carriage return in octal; the assignment sets NF to 1 on the line that ends with tab, "total", and an optional CR, truncating the record to its first field and making the pattern true, while every other line gets NF=0 and is not printed.)
Tested and confirmed working on mawk 1.3.4, mawk 1.9.9.6, macos nawk, and gawk 5.1.1, including invocation flags -c/-P/-t/-S
I would've used \r instead of \15, but gawk -t traditional mode was complaining about \r not being available, so \15 is the most portable choice.
But if you don't need gawk -t compatibility and don't mind fudging FS, then
nawk NF=NF==2 FS='\ttotal\r?'
or
mawk -- --NF FS='\ttotal\r?'
or even
mawk 'NF=!+FS<NF' FS='total$' RS='\r?\n'
-- The 4Chan Teller

The following lines take a lot of time to update since nearly 2.5 lakh (250,000) records are present

In one dataframe I took the group counts that are greater than one, and I need to update specific columns at those indices. Since there are about 2.5 lakh (250,000) records, it is failing with a memory error. Is there any fast solution for it?
gl_no=primary.groupby('GL Account').filter(lambda x:len(x)>1)
primary_index=primary[primary['GL Account'].isin(gl_no['GL Account'])].index
primary.loc[primary_index]['Cost Element']='01'
primary.loc[primary_index]['GL Acc Type']='P'
You can use GroupBy.transform with size, compare the result to get a boolean mask, and set the new values by boolean indexing with DataFrame.loc:
import pandas as pd

primary = pd.DataFrame({
    'Cost Element': list('abcdef'),
    'GL Acc Type': list('abcdef'),
    'GL Account': list('aadbbc')
})
print (primary)
Cost Element GL Acc Type GL Account
0 a a a
1 b b a
2 c c d
3 d d b
4 e e b
5 f f c
mask=primary.groupby('GL Account')['GL Account'].transform('size') > 1
primary.loc[mask, ['Cost Element','GL Acc Type']] = ['01', 'P']
print (primary)
Cost Element GL Acc Type GL Account
0 01 P a
1 01 P a
2 c c d
3 01 P b
4 01 P b
5 f f c

GROUP BY clause with WHERE clause should also return a count of zero in SQL Server

SELECT * FROM A
1 A ACCEPT
2 A ACCEPT
3 C ACCEPT
4 C ACCEPT
5 B HOLD
6 G HOLD
7 G HOLD
8 B REJECT
9 G REJECT
10 H REJECT
11 H REJECT
12 A NEW
13 H REJECT
14 H NEW
15 C NEW
16 D NEW
17 E NEW
18 D ACCEPT
19 D ACCEPT
20 F ACCEPT
21 I NULL
This is my table.
SELECT DISTINCT(PROD) FROM A
A
B
C
D
E
F
G
H
I
These are the products I have.
SELECT PROD,ISNULL(COUNT(*),0) FROM A WHERE STATUS='ACCEPT' GROUP BY PROD
A 2
C 2
D 2
F 1
When I execute this I am getting the above result.
But my requirement is:
A 2
B 0
C 2
D 2
E 0
F 1
G 0
H 0
I 0
How do I achieve this? Please help me. Thanks in advance.
This should work (note the OR STATUS IS NULL in the second branch, so products like I, whose only rows have a NULL status, still appear):
SELECT T.PROD,SUM(T.FLAG) AS [COUNT]
FROM
(
SELECT PROD, 1 AS FLAG
FROM A
WHERE STATUS='ACCEPT'
UNION ALL
SELECT PROD, 0 AS FLAG
FROM A
WHERE STATUS <> 'ACCEPT' OR STATUS IS NULL
) AS T
GROUP BY T.PROD
ORDER BY T.PROD

combine 2 files with AWK based on last columns [closed]

I have two files:
file1
-------------------------------
1 a t p b
2 b c f a
3 d y u b
2 b c f a
2 u g t c
2 b j h c
file2
--------------------------------
1 a b
2 p c
3 n a
4 4 a
I want to combine these 2 files based on their last columns (column 5 of file1 and column 3 of file2) using awk.
result
----------------------------------------------
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
At the very beginning I didn't see the duplicated "a" in file2 and thought it could be solved with normal array matching. ... Now it works.
An awk one-liner:
awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt
test
kent$ head *.txt
==> f1.txt <==
1 a t p b
2 b c f a
3 d y u b
2 b c f a
2 u g t c
2 b j h c
==> f2.txt <==
1 a b
2 p c
3 n a
4 4 a
kent$ awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
Note: the output format is not pretty, but it is acceptable if you pipe it to column -t.
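For instance, just append column -t to the same one-liner:
awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt | column -t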
Another way, assuming the files have no headers:
awk '
FNR == NR {
    f2[ $NF ] = f2[ $NF ] ? f2[ $NF ] SUBSEP $0 : $0;
    next;
}
FNR < NR {
    if ( $NF in f2 ) {
        split( f2[ $NF ], a, SUBSEP );
        len = length( a );
        for ( i = 1; i <= len; i++ ) {
            $NF = a[ i ];
            printf "%s\n", $0;
        }
    }
}
' file2 file1 | column -t
It yields:
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
This is a bit easier in a language that supports arbitrary data structures (lists of lists). Here's Ruby:
# read "file2" and group by the last field
file2 = File .foreach('file2') .map(&:split) .group_by {|fields| fields[-1]}
# process file1
File .foreach('file1') .map(&:split) .each do |fields|
  file2[fields[-1]] .each do |fields2|
    puts (fields[0..-2] + fields2).join(" ")
  end
end
outputs
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c

strange html file returned by web server

While working on a web crawler, I ran across this strange occurrence; the following is a snippet of the page content returned by the web server for http://nexgen.ae :
< ! D O C T Y P E H T M L P U B L I C " - / / W 3 C / / D T D H T M L 4 . 0 T r a n s i t i o n a l / / E N " >
< H T M L > < H E A D > < T I T L E > N e x G e n T e c h n o l o g i e s L L C | F i n g e r p r i n t T i m e A t t e n d a n c e M a n a g e m e n t S y s t e m | A c c e s s C o n t r o l M a n a g e m e n t S y s t e m | F a c e R e c o g n i t i o n | D o o r A c c e s s C o n t r o l | E m p l o y e e s A t t e n d a n c e | S o l u t i o n P r o v i d e r | N e t w o r k S t r u c t u e d C a b l i n g | D u b a i | U A E ) < / T I T L E >
As you can see, the web server seems to have inserted a space character after every other character in the original HTML source. I checked the HTML source with "Page Source" in Firefox and there were no extra spaces there. I also checked other web pages from the same website, and I am obtaining the correct HTML file for those pages. So far the problem seems to only be happening with this website's default page when accessed through a web crawler.
I noticed the html file contains "google optimizer tracking script" at the very end. I wonder if the problem has anything to do with that...
Or could this just be the Website manager's way of keeping web crawlers away? If that's the case, a robots.txt file would do!
Those probably aren't spaces, they are null bytes. The page is encoded in UTF-16 (multiples of 2 bytes per character, minimum 2), and because the website has not properly specified its encoding in its HTTP headers, you are trying to read it as ASCII (1 byte per character) or possibly UTF-8 (1 byte or more per character).
To see what I mean, open it in your browser and change the encoding (somewhere in the browser's menus, might have to right-click on the page) and choose the UTF-16LE option.
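From the crawler side, a rough way to check and convert this (a sketch, assuming the page really is UTF-16LE and that curl and iconv are available):
# peek at the raw bytes; UTF-16LE ASCII text shows up as each character followed by a \0 byte
curl -s http://nexgen.ae | od -c | head
# convert to UTF-8 before parsing (use -f UTF-16 if there is a BOM)
curl -s http://nexgen.ae | iconv -f UTF-16LE -t UTF-8 | head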