grep and tail -f for a UTF-16 binary file - trying to use simple awk

How can I achieve the equivalent of:
tail -f file.txt | grep 'regexp'
to only output the buffered lines that match a regular expression such as 'Result' from the file type:
$ file file.txt
file.txt:Little-endian UTF-16 Unicode text, with CRLF line terminators
Example of the tail -f stream content below converted to utf-8:
Package end.
Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.
Result: Success
Awk?
The problems with piping to grep led me to awk as a one-stop-shop solution for stripping the offending characters and printing only the lines that match the regex.
awk seems to give the most promising results; however, I am finding that it returns the whole stream rather than only the matching lines:
tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.
Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.
Result: Success
What I have tried
converting the stream and piping to grep
tail -f file.txt | iconv -t UTF-8 | grep 'regexp'
using luit to change terminal encoding as per this post
luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'
delete non ASCII characters, described here, then piping to grep
tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'
various combinations of the above using grep flags --line-buffered, -a as well as sed -u
using luit -encoding UTF-8 -- pre-pended to the above
using a file with the same encoding containing the regular expression for grep -f
Why they failed
For most attempts, simply nothing is printed to the screen, because grep searches for 'regexp' when in fact the text is something like '\x00r\x00e\x00g\x00e\x00x\x00p' - for example, 'R' will return the line 'Result: Success' but 'Result' won't
If a full regular expression does get a match, such as in the case of grep -f, it returns the whole stream rather than just the matched lines
Piping through sed, tr, or iconv seems to break the pipe to grep, and grep still only manages to match individual characters
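The failure can be reproduced without the original log (a minimal sketch; sample.txt is a hypothetical stand-in built with iconv): in a UTF-16LE stream every ASCII letter is followed by a NUL byte, so a multi-character pattern never matches even though a single character still does.
```shell
# Build a one-line UTF-16LE file and probe it with grep -a (treat binary as text).
printf 'Result: Success\r\n' | iconv -f UTF-8 -t UTF-16LE > sample.txt
grep -ac 'Result' sample.txt   # prints 0: NUL bytes sit between the letters
grep -ac 'R' sample.txt        # prints 1: a single byte is unaffected
```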
Edit
I looked at the raw file in its UTF-16 format using xxd, with the aim of using regex to match the encoding, which gave the following output:
$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020 .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061 .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020 .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061 .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073 .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061 .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032 .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065 .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020 .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064 .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073 ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063 .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a .c.e.s.s........
00000100: 00

The sloppiest solution that should work on Cygwin is fixing your awk statement:
tail -f file.txt | \
LC_CTYPE=C awk '{ gsub("[^[:print:]]", ""); if($0 ~ /Result/) print; }'
This has a few bugs that cancel each other out, like tail cutting a UTF-16LE file in awkward places but awk stripping what we hope is garbage.
A robust solution might be:
tail -c +1 -f file.txt | \
script -qc 'iconv -f UTF-16LE -t UTF-8' /dev/null | grep Result
but it reads the entire file and I don't know how well Cygwin works with using script to convince iconv not to buffer (it would work on GNU/Linux).
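An alternative to script for the buffering problem, assuming GNU coreutils' stdbuf is available (untested on Cygwin, so treat it as a sketch): stdbuf can make iconv's output unbuffered, and grep --line-buffered keeps the final stage prompt.
```shell
# Convert the whole stream as it arrives, without stdio block buffering.
tail -c +1 -f file.txt |
  stdbuf -o0 iconv -f UTF-16LE -t UTF-8 |
  grep --line-buffered Result
```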

I realised a simple regex to ignore any characters between letters in the search string might work...
This matches 'Result' whilst allowing any one character between each letter...
$ tail -f file.txt | grep -a 'R.e.s.u.l.t'
Result: Success
$ tail -f file.txt | awk '/R.e.s.u.l.t./'
Result: Success
Or, as per this answer, to avoid typing all the tedious dots:
search="Result"
tail -f file.txt | grep -a -e "$(echo "$search" | sed 's/./&./g')"
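The sed 's/./&./g' step simply appends a dot after every character of the search string, so the generated pattern even tolerates the NUL that follows the last letter:
```shell
search="Result"
pattern=$(echo "$search" | sed 's/./&./g')   # replace each char with itself plus a dot
echo "$pattern"                              # prints R.e.s.u.l.t.
```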

You can use ripgrep instead, which handles UTF-16 nicely without your having to convert the input:
tail -f file.txt | rg regexp

Related

Simple Pattern match with a field and a variable does not seem to work in GAWK/AWK

I am trying to extract all lines where a field matches a pattern which is defined as a variable.
I tried the following
head input.dat |
awk -F '|' -v CODE="39905|19043" '{print $13; if($13~CODE){print "Matched"} else {print "Nomatch"} }'
I am printing the value of the field before attempting the pattern match. (This way I don't have to show the entire line, which contains many fields.)
This is the output I got.
PLAN_ID
Nomatch
39905
Nomatch
39905
Nomatch
39883
Nomatch
19043
Nomatch
2215
Nomatch
19043
Nomatch
9149
Nomatch
42718
Nomatch
24
Nomatch
I expected to see at least 3 instances of Matched in the output. What am I doing wrong?
edit by #Fravadona
xxd input.dat | head -n 6
00000000: fffe 4d00 4f00 4e00 5400 4800 5f00 4900 ..M.O.N.T.H._.I.
00000010: 4400 7c00 5300 5600 4300 5f00 4400 5400 D.|.S.V.C._.D.T.
00000020: 7c00 5000 4100 5400 4900 4500 4e00 5400 |.P.A.T.I.E.N.T.
00000030: 5f00 4900 4400 7c00 5000 4100 5400 5f00 .I.D.|.P.A.T..
00000040: 5a00 4900 5000 3300 7c00 4300 4c00 4100 Z.I.P.3.|.C.L.A.
00000050: 4900 4d00 5f00 4900 4400 7c00 5300 5600 I.M._.I.D.|.S.V.
Turns out that the input file uses the UTF-16LE encoding (as shown by the hexdump of the content). Thanks to all who suggested looking at the hexdump: once I converted the input file from UTF-16LE to UTF-8 with iconv, the AWK script worked as expected.
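The fix can be sketched as a conversion stage in front of the original program (input.dat, CODE, and field 13 are from the question; -f UTF-16 makes iconv consume the fffe BOM automatically):
```shell
# Convert UTF-16 (BOM-aware) to UTF-8 before awk sees the data.
iconv -f UTF-16 -t UTF-8 input.dat |
  awk -F '|' -v CODE="39905|19043" '$13 ~ CODE { print "Matched" }'
```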

How can I solve a problem with a date filter in awk [duplicate]

This question already has answers here:
Finding directories older than N days in HDFS
(5 answers)
Closed 4 years ago.
I want to filter some files by date (I can't use find, because the files are in HDFS). The solution I found is to use awk.
This is an example of data that I want to process
drwxrwx--x+ - hive hive 0 2019-01-01 20:02 /dat1
drwxrwx--x+ - hive hive 0 2019-01-02 16:38 /dat2
drwxrwx--x+ - hive hive 0 2019-01-03 16:59 /dat3
If I use this command:
$ ls -l |awk '$6 > "2019-01-02"'
drwxrwx--x+ - hive hive 0 2019-01-03 16:59 /dat3
I don't have any problems, but if I want a script that filters relative to 2 days ago, I need to put the output of this expression into the awk command:
$ date +%Y-%m-%d --date='-2 day'
2019-01-02
I tried something like this, but it isn't working:
ls -l |awk '$6 >" date +%Y-%m-%d --date=\'-2 day\'"'
>
It's like something is missing, but I don't know what it is.
First of all, never try to parse the output of ls.
If you want to get your hands on the files/directories that are at most n days old inside a directory /path/to/dir/:
$ find /path/to/dir -type f -mtime -2 -print
$ find /path/to/dir -type d -mtime -2 -print
The first one is for files, the second for directories.
If you still want to parse ls with awk, you might try something like this:
$ ls -l | awk -v d=$(date -d "2 days ago" "+%F") '$6 > d'
The problem you are having is that you are trying to nest (escaped) quotes inside a single-quoted string, which the shell does not support.
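The nesting can be avoided entirely by computing the date first and handing it to awk with -v (a sketch assuming GNU date); ISO YYYY-MM-DD dates compare correctly as plain strings:
```shell
# Compute the cutoff date in the shell, then pass it in as an awk variable.
cutoff=$(date -d '2 days ago' '+%Y-%m-%d')
ls -l | awk -v d="$cutoff" '$6 > d'
```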
Parsing the output of ls and manipulating the mod-time of files is generally not recommended. But if you stick to the yyyymmdd format, the workaround below will help you. I use this hack for my daily chores because it allows plain number comparisons.
$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt
-rw-r--r-- 1 user1234 unixgrp 34 20181231 delete_5lines.txt
-rw-r--r-- 1 user1234 unixgrp 226 20190101 jobinfo.txt
-rw-r--r-- 1 user1234 unixgrp 7120 20190104 report.txt
-rw-r--r-- 1 user1234 unixgrp 70555 20190104 sample.dat
-rw-r--r-- 1 user1234 unixgrp 58 20190103 stan.in
Get files after Jan-3rd
$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt | awk ' $6>20190103'
-rw-r--r-- 1 user1234 unixgrp 7120 20190104 report.txt
-rw-r--r-- 1 user1234 unixgrp 70555 20190104 sample.dat
Get files on/after Jan-3rd..
$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt | awk ' $6>=20190103'
-rw-r--r-- 1 user1234 unixgrp 7120 20190104 report.txt
-rw-r--r-- 1 user1234 unixgrp 70555 20190104 sample.dat
-rw-r--r-- 1 user1234 unixgrp 58 20190103 stan.in
Exactly Jan-3rd
$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt | awk ' $6==20190103'
-rw-r--r-- 1 user1234 unixgrp 58 20190103 stan.in
You can alias it like
$ alias lsdt=" ls -l --time-style '+%Y%m%d' "
and use it like
$ lsdt jobinfo.txt stan.in sample.dat report.txt
Note: Again, you should avoid it if you are going to use it for scripts... just use it for day-to-day chores

Why does the awk command only process once after I use a sed command

The first time, I used this command:
svn log -l1000 | grep '#xxxx' -B3 | awk 'BEGIN {FS="\n"; RS=""; OFS=";"} {print $1, $2}'
The output has many lines, but it's not exactly what I want.
There are some blank lines and lines of the form '----', so I use sed to remove them:
svn log -l1000 | grep '#xxxx' -B3 | sed '/^$/d' | sed '/^--/d' | awk 'BEGIN {FS="\n"; RS=""; OFS=";"} {print $1, $2}'
I checked the output of the command:
svn log -l1000 | grep '#xxxx' -B3 | sed '/^$/d' | sed '/^--/d'
It looks good. But when awk processes it as input text, I only see one line of output.
Ah, my input looks like this:
------------------------------------------------------------------------
rxxxx | abc.xyz | 2016-02-01 13:42:21 +0700 (Mon, 01 Feb 2016) | 1 line
refs #kkkk [GolFeature] Fix UI 69
--
------------------------------------------------------------------------
rxxxjy | mnt.abc| 2016-02-01 11:33:45 +0700 (Mon, 01 Feb 2016) | 1 line
refs #kkkk [GoFeature] remove redundant function
--
------------------------------------------------------------------------
rxxyyxx | asdfadf.xy | 2016-02-01 11:02:06 +0700 (Mon, 01 Feb 2016) | 1 line
refs #kkkk Updated ini file
My expected output is:
2016-02-01 11:02:06 +0700 (Mon, 01 Feb 2016), rxxxx, mnt.abc, refs #kkkk Updated ini file ...
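A likely cause, though not spelled out in the question: RS="" puts awk into paragraph mode, where records are separated by blank lines, so deleting the blank lines with sed collapses the whole input into a single record. A minimal sketch of the effect:
```shell
# Paragraph mode (RS="") sees two blank-line-separated records here...
printf 'a\nb\n\nc\nd\n' | awk 'BEGIN{RS=""} END{print NR}'                 # prints 2
# ...but only one record once sed has removed the blank line.
printf 'a\nb\n\nc\nd\n' | sed '/^$/d' | awk 'BEGIN{RS=""} END{print NR}'   # prints 1
```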

Using awk to format an output for nstats

I would like to get the complete hostnames together with their server uptimes using the "nstats" command. The script appears to be working OK; I need help printing the 1st column with the "complete" hostname along with the 7th column (server uptime).
This following command only give me their partial hostnames:
for host in $(cat servers.txt); do nstats $host -H | awk 'BEGIN {IFS="\t"} {$2=$3=$4=$5=$6=$9="" ; print}' ; done
BAD Output: (hostnames get cut off after the 12th character)
linux_server 223 days
linux_server 123 days
windows_serv 23 days
windows_serv 23 days
EXPECTED Output:
linux_server1 223 days
linux_server2 123 days
windows_server1 23 days
windows_server2 123 days
The contents of servers.txt file are as follows:
linux_server1
linux_server2
windows_server1
windows_server2
Output without awk
LINXSERVE10% for host in $(cat servers.txt); do nstats $host -H ; done
linux_server 0.01 47% 22% 56 05:08 20 days 17:21:00
linux_server 0.00 23% 8% 45 05:08 24 days 04:16:46
windows_serv 0.04 72% 30% 58 05:09 318 days 23:32:17
windows_serv 0.00 20% 8% 40 05:09 864 days 12:23:10
windows_serv 0.00 51% 17% 41 05:09 442 days 05:30:14
Note: for host in $(cat servers.txt); do nstats $host -H | awk -v server=$host 'BEGIN {IFS="\t"} {$2=$3=$4=$5=$6=$9="" ; print server }' ; done *** this works OK but it lists only the complete hostname, with no server uptime.
Any help you can give would be greatly appreciated.
Did you know you can choose which fields to print in awk?
for host in $(cat servers.txt); do
nstats $host -H |
awk '{print $1,$7,$8}';
done
This will print only the three fields you are interested in.
The awk code labeled as a "note" is totally useless -- It is equivalent to
for host in $(cat servers.txt); do
echo "$host"
done
UPDATE: after realizing the problem was the nstats command, the awk command line would be
awk -v server="$host" '{print server,$7,$8}';
then the output looked like this (the server uptime overwrote the hostnames):
20 daysrver
24 daysrver
318 daysver
864 dayserv
442 dayserv
So I put that server variable at the end; it looked much better, and I can extract that and play with it in Excel. THANKS SO Much Jdamian!
for host in $(cat servers.txt); do nstats $host -H | awk -v server="$host" 'BEGIN {IFS="\t"} {print $7,$8,server}'; done
20 days linux_server1
24 days linux_server2
318 days windows_server1
864 days windows_server2
442 days windows_server3
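The overwriting seen above ("20 daysrver") is the classic symptom of a carriage return riding along in $host, e.g. when servers.txt has CRLF line endings: the CR rewinds the terminal cursor to column 0 before the next field prints. A hedged sketch of a loop that strips it (the nstats call is from the question):
```shell
cr=$(printf '\r')
while IFS= read -r host; do
  host=${host%"$cr"}   # drop a trailing CR if servers.txt is a CRLF file
  nstats "$host" -H | awk -v server="$host" '{print server, $7, $8}'
done < servers.txt
```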

Can't cut column in Linux

I have a file like this
ATOM 3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM 3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM 3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM 3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
I want to cut out the second column. However, when I use
cut -f1,3,4,5,6,7,8,9,10 filename
it doesn't work. Am I doing something wrong?
This is because there are multiple spaces, and cut treats each one as a separate delimiter, producing empty fields between consecutive spaces.
You can start from the 5th position:
$ cut -d' ' -f 1,5- file
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Or squeeze spaces with tr -s like below (multiple spaces will be lost, though):
$ tr -s ' ' < file | cut -d' ' -f1,3,4,5,6,7,8,9,10
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Note you can indicate from 3 to the end with 3-:
tr -s ' ' < file | cut -d' ' -f1,3-
In fact I would use awk for this:
awk '{$2=""; print}' file
or just
awk '{$2=""} 1' file
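One caveat with this approach: assigning to $2 makes awk rebuild the record with single spaces (OFS) and leaves an empty slot where the field was. To avoid the doubled space, shift the remaining fields left instead (decrementing NF works in GNU awk and most modern awks, though POSIX leaves its effect undefined):
```shell
echo 'ATOM 3197 HD13 ILE' | awk '{$2=""} 1'
# prints "ATOM  HD13 ILE" (note the doubled space where $2 was)
echo 'ATOM 3197 HD13 ILE' | awk '{for(i=2;i<NF;i++) $i=$(i+1); NF--} 1'
# prints "ATOM HD13 ILE"
```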
There are many spaces in your file, so you have to account for the number of spaces when counting fields.
The new.txt contains
ATOM 3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM 3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM 3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM 3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
and this is the command to print the second column:
root#52:/home/ubuntu# cut -d' ' -f4 new.txt
3197
3198
3199
3200
where -d stands for the delimiter, i.e. 'space' in this case, denoted by ' '.
However, awk comes in pretty handy in such cases:
awk '{print $2}' new.txt
You can find the position of that column's content in the first row (3197), then extract the string at the same position from every row with awk:
cat filename | awk -v field="3197" 'NR==1 {c = index($0,field)} {print substr($0,c,length(field))}'
Source: https://unix.stackexchange.com/a/491770/20661