How to preserve the original whitespace between fields in awk? - awk

When processing input with awk, sometimes I want to edit one of the fields, without touching anything else. Consider this:
$ ls -l | awk 1
total 88
-rw-r--r-- 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 Dec 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 Dec  7 08:09 test1.js
If I don't edit any of the fields ($1, $2, ...), everything is preserved as it was. But let's say I want to keep only the first 3 characters of the first field:
$ ls -l | awk '{$1 = substr($1, 1, 3) } 1'
tot 88
-rw 1 jack jack 8 Jun 19 2013 qunit-1.11.0.css
-rw 1 jack jack 56908 Jun 19 2013 qunit-1.11.0.js
-rw 1 jack jack 4306 Dec 29 09:16 test1.html
-rw 1 jack jack 5476 Dec 7 08:09 test1.js
The original whitespace between all fields is replaced with a single space.
Is there a way to preserve the original whitespace between the fields?
UPDATE
In this sample, it's relatively easy to edit the first 4 fields. But what if I want to keep only the 1st letter of $6 (the month) in order to get this output:
-rw-r--r-- 1 jack jack     8 J 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 J 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 D 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 D  7 08:09 test1.js

If you want to preserve the whitespace, you could also try the split function.
In GNU Awk version 4, split accepts a fourth argument: an array that receives the separators between the fields. For instance,
echo "a 2 4 6" | gawk ' {
n=split($0,a," ",b)
a[3]=7
line=b[0]
for (i=1;i<=n; i++)
line=(line a[i] b[i])
print line
}'
gives output
a 2 7 6

I know this is an old question, but I thought there had to be something better. This answer is for those who stumbled onto this question while searching. While looking around on the web, I have to say @Håkon Hægland has the best answer, and that is what I used at first.
But here is my solution: use FPAT. It lets you give a regular expression describing what a field looks like: FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)"; In this case, I am saying a field starts with zero or more whitespace characters and then runs through any characters that are not whitespace. See the documentation on POSIX bracket expressions if you have trouble reading the character classes.
Also, set the output field separator to OFS = "", because once the line has been manipulated, the output would otherwise get an extra blank space between fields; the leading whitespace is already part of each field, so no separator is needed.
I used the same example to test.
$ cat example-output.txt
-rw-r--r-- 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 Dec 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 Dec  7 08:09 test1.js
$ awk 'BEGIN { FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)"; OFS = ""; } { $6 = substr( $6, 1, 2); print $0; }' example-output.txt
-rw-r--r-- 1 jack jack     8 J 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 J 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 D 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 D  7 08:09 test1.js
Keep in mind that the fields now carry their leading spaces. So if a field needs to be replaced by something else entirely, you can pad the replacement to the original field width:
len = length($1);
$1 = sprintf("%"(len)"s", "-42-");
$ awk 'BEGIN { FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)"; OFS = ""; } { if(NR==1){ len = length($1); $1 = sprintf("%"(len)"s", "-42-"); } print $0; }' example-output.txt
      -42- 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 Dec 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 Dec  7 08:09 test1.js

It's possible to preserve the original whitespace by editing $0 instead of individual fields ($1, $2, ...), for example:
$ ls -l | awk '{$0 = substr($1, 1, 3) substr($0, length($1) + 1)} 1'
tot 88
-rw 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw 1 jack jack  4306 Dec 29 09:16 test1.html
-rw 1 jack jack  5476 Dec  7 08:09 test1.js
This is relatively easy to do when editing the first column, but it gets troublesome when editing the others ($2, ..., $4), and it breaks down entirely for fields preceded by whitespace of varying width ($5 and beyond in this example).
UPDATE
Based on @Håkon Hægland's answer, here's a way to keep the first 2 characters of the 6th field (the month):
{
    n = split($0, f, " ", sep)
    f[6] = substr(f[6], 1, 2)
    line = sep[0]
    for (i = 1; i <= n; ++i) line = line f[i] sep[i]
    print line
}

The simplest solution is to make sure that the field splitting is done on every single space. That is done by making the field separator [ ]:
$ awk -F '[ ]' '{$1=substr($1,1,3)}1' infile
-rw 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw 1 jack jack  4306 Dec 29 09:16 test1.html
-rw 1 jack jack  5476 Dec  7 08:09 test1.js
By default, awk splits on any run of whitespace (tabs and spaces, roughly like the regex [ \t]+). The manual states:
In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines.
That will collapse each run of spaces, tabs and newlines to a single OFS in the output. Since OFS is also a space by default, the result is that only one space is printed for each run of whitespace.
But awk can be told to use exactly one space as the field delimiter, with a regular expression that matches only one character: [ ].
Note that this changes the field numbering: each individual space starts a new field. So, note this result with the data you presented:
$ awk -F '[ ]' '{print($4,$5,$6)}' infile
jack
jack 56908 Jun
jack  4306
jack  5476
In this specific case there are no spaces before the first field and only one space after it, which is why it works correctly.
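If you do need to edit one of the later visual columns with this -F '[ ]' approach, a rough sketch (not tested against every possible ls layout) is to walk the fields and count only the non-empty ones, since each extra space produces an empty field; rebuilding the record with the default OFS of one space then reproduces the original spacing:
$ awk -F '[ ]' '{
    n = 0
    for (i = 1; i <= NF; i++) {
        if ($i != "") n++          # count only the non-empty fields
        if (n == 6) {              # the 6th non-empty field is the month
            $i = substr($i, 1, 1)  # keep just its first letter
            break
        }
    }
} 1' infile
On the sample data this should give the output asked for in the question's update, with the alignment intact.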

Related

First data file in Chronicle Queue is always touched

We're wondering why ChronicleQueue always seems to touch the first data file. Is there a reason for it?
The other files don't seem to be affected, even if data is read from them. Are we doing something wrong?
Currently we're using version 5.19.2.
[root@node-000341 dataLog]# ls -al
total 16520
drwxrwxr-x. 2 tn tn 4096 Jan 13 15:17 .
drwxrwxr-x. 7 tn tn 94 Apr 14 2020 ..
-rw-r--r--. 1 tn tn 83886080 Jan 12 11:46 20200424.cq4
-rw-r--r--. 1 tn tn 83886080 May 5 2020 20200427.cq4
-rw-r--r--. 1 tn tn 131782 May 12 2020 20200505.cq4
-rw-r--r--. 1 tn tn 131574 May 13 2020 20200512.cq4
.....
-rw-r--r--. 1 tn tn 389465 Dec 16 09:26 20201210.cq4
-rw-r--r--. 1 tn tn 184090 Jan 12 12:07 20201216.cq4
-rw-r--r--. 1 tn tn 361994 Jan 13 15:17 20210112.cq4
-rw-r--r--. 1 tn tn 83886080 Jan 13 15:22 20210113.cq4
-rw-r--r--. 1 tn tn 65536 Jan 13 15:21 metadata.cq4t
When you create a tailer, by default it goes to toStart(), which is the first file; that is what causes the first file to be touched (note that ls shows access time, not modification time).
BTW, unless you really need that access time for some reason, we suggest using the noatime mount option to speed up file access.
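For reference, a rough sketch of how to compare the two timestamps and enable noatime; the file name comes from the listing above and the mount point is only a placeholder:
# access time (atime) vs modification time (mtime) for one queue file (GNU coreutils)
stat -c 'atime: %x   mtime: %y' 20200424.cq4
ls -lu 20200424.cq4        # -u makes ls display the access time instead of mtime

# remount the filesystem that holds the queue directory with noatime (placeholder mount point)
mount -o remount,noatime /data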

find the difference in substring of timestamp in awk

I am trying to append some text to the /var/log/messages output whenever the timestamps of two consecutive log lines differ, such as:
previous log: 00:01:59 and current log 00:02:00
or
previous log:00:01:49 and current log 00:01:50
If the above substring of the timestamp differs between consecutive log lines, append some message to $0.
You may run the command below; it works at a 1-minute granularity, but I need it for 10 seconds.
tail -f /var/log/messages |awk '{split($3,a,":");split($3,b,"");new_time=a[1]":"a[2]":"b[1]; if(prev_time==new_time) print $0; else print "10 Second group is over, starting new: "$0" "prev_time " "new_time } {split($3,a,":");split($3,b,"");prev_time=a[1]":"a[2]":"b[1]}'
The required result is a modification of the above command so that it prints the message for each 10-second group of logs; currently it does so per minute. I have used split() to capture "HH:MM:S" rather than "HH:MM:SS", so that whenever the previous "HH:MM:S" and the current "HH:MM:S" differ, the message "10 Second group is over, starting new: "$0 is printed. Not sure what the mistake is here.
In short, it currently works when the minute changes; I need it to trigger when the seconds change from 39 to 40 or from 09 to 10, but NOT from 11 to 12. In HH:MM:SS it is the tens digit of the seconds (the first S) that needs to change.
Sample lines:
Jan 23 15:09:54 foo bar
Jan 23 15:10:04 bla bla
this is the general idea:
$ for((i=35;i<45;i++)); do echo Jan 23 00:01:$i; done |
awk '{split($3,a,":"); print $0, (p!=(c=int(a[3]/10))?"<<<":""); p=c}'
Jan 23 00:01:35 <<<
Jan 23 00:01:36
Jan 23 00:01:37
Jan 23 00:01:38
Jan 23 00:01:39
Jan 23 00:01:40 <<<
Jan 23 00:01:41
Jan 23 00:01:42
Jan 23 00:01:43
Jan 23 00:01:44
The first part generates test data for the script, since you didn't provide enough. There is a spurious match on the first line, which can be eliminated with an NR>1 condition, but I don't think that's critical.
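For completeness, a sketch of the same one-liner with the NR>1 guard mentioned above, so the very first line is not flagged:
$ for((i=35;i<45;i++)); do echo Jan 23 00:01:$i; done |
awk '{split($3,a,":"); c=int(a[3]/10); print $0, (NR>1 && p!=c ? "<<<" : ""); p=c}'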

Awk looping within condition

I need to create a condition which separates the data by decade. The first column is the year value (going back to year 0). How do I change the condition within the awk query?
0 Jan 10 2:04:40 Tot D
0 Jul 05 11:33:06 Tot A
3 May 04 22:22:05 Tot A
3 Oct 29 1:32:40 Tot D
7 Feb 20 23:03:27 Tot A
7 Aug 17 5:58:18 Tot D
10 Dec 10 6:28:52 Tot A
11 Jun 04 15:36:12 Tot D
14 Apr 04 4:41:23 Tot D
14 Sep 27 7:18:39 Tot A
18 Jan 20 10:38:27 Tot D
18 Jul 16 18:04:17 Tot A
21 May 15 5:47:44 Tot A
21 Nov 08 9:27:47 Tot D
22 May 04 23:00:32 Tot A
25 Mar 03 6:19:48 Tot A
25 Aug 27 13:47:51 Tot D
28 Dec 20 15:07:37 Tot A
29 Jun 14 22:37:10 Tot D
32 Apr 14 11:56:36 Tot D
32 Oct 07 15:38:15 Tot A
36 Jan 31 19:07:10 Tot D
36 Jul 27 0:39:47 Tot A
39 May 26 13:13:25 Tot A
39 Nov 19 17:26:37 Tot D
40 May 15 6:26:43 Tot A
I need to present the data as follows:
awk '{if ($1 >= 0 && $1 < 10) print }' All_Lunar_Eclipse.txt
0 Jan 10 2:04:40 Tot D
0 Jul 05 11:33:06 Tot A
3 May 04 22:22:05 Tot A
3 Oct 29 1:32:40 Tot D
7 Feb 20 23:03:27 Tot A
7 Aug 17 5:58:18 Tot D
But I would have to do that manually for every 10-year range.
awk '{if ($1 >= 10 && $1 < 20) print }' All_Lunar_Eclipse.txt
10 Dec 10 6:28:52 Tot A
11 Jun 04 15:36:12 Tot D
14 Apr 04 4:41:23 Tot D
14 Sep 27 7:18:39 Tot A
18 Jan 20 10:38:27 Tot D
18 Jul 16 18:04:17 Tot A
I have tried something similar to the following with no joy.
awk 'BEGIN { for (i = 0; i <= 2019; +=10) print i }'
$ awk '
int(p/10)!=int($1/10) {
print "New decade begins:"
}
{ p=$1 }
1' file
0 Jan 10 2:04:40 Tot D
0 Jul 05 11:33:06 Tot A
3 May 04 22:22:05 Tot A
3 Oct 29 1:32:40 Tot D
7 Feb 20 23:03:27 Tot A
7 Aug 17 5:58:18 Tot D
New decade begins:
10 Dec 10 6:28:52 Tot A
11 Jun 04 15:36:12 Tot D
...
This depends on your definition of a decade (if ($1 >= 10 && $1 < 20)). I would have assumed that years 1-10 are the first decade, 11-20 the second, etc. I did not check, though; it would have made it one arithmetic step harder, too.
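For what it's worth, a sketch of that alternative definition (years 1-10 as the first decade) would just shift the year by one before dividing; note that with this version year 0 falls into the first bucket as well:
$ awk '
int((p-1)/10) != int(($1-1)/10) {
    print "New decade begins:"
}
{ p = $1 }
1' file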
Depending on what you want, you can use the first column as the key, dividing it by 10 and keeping the integer part:
awk '
    # compute the decade from the year in column 1
    { Decade = int( $1 / 10 ) }
    # accumulate each sample under its decade (unsorted, just stored by decade)
    { Data[ Decade] = Data[Decade] "\n" $0 }
    END { for ( Dec in Data ) printf "--- Decade: %d ----\n%s\n", Dec, Data[ Dec] }
' YourFile
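Since the decades come out unsorted above, here is a small gawk-only sketch (PROCINFO["sorted_in"] is a gawk extension) that makes the END loop walk the decade indices in numeric order:
awk '
    { Decade = int( $1 / 10 ) }
    { Data[ Decade] = Data[Decade] "\n" $0 }
    END {
        # gawk extension: iterate array indices in ascending numeric order
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for ( Dec in Data ) printf "--- Decade: %d ----\n%s\n", Dec, Data[ Dec]
    }
' YourFile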

Read serial input with awk, insert date

I'm trying to reformat serial input, which consists of two integers separated by a comma (sent from an Arduino):
1,2
3,4
0,0
0,1
etc. I would like to append the date after each line, separating everything with a tab character. Here's my code so far:
cat /dev/cu.usbmodem3d11 | awk 'BEGIN {RS=","};{printf "%i\t%i\t%s",$1,$2,system("date")}'
Here's the result I get (with date in my locale):
1 2 0Mer 26 fév 2014 22:09:20 EST
3 4 0Mer 26 fév 2014 22:09:20 EST
0 0 0Mer 26 fév 2014 22:09:20 EST
0 1 0Mer 26 fév 2014 22:09:20 EST
Why is there an extra '0' in front of my date field? Sorry for the newbie question :(
EDIT This code solved my problem. Thanks to all who helped.
awk 'BEGIN {FS=","};{system("date")|getline myDate;printf "%i\t%i\t%s",$1, $2, myDate}' /dev/cu.usbmodem3d11
I'm not clear why, but in order for the date to keep updating and recording at what time the data was received, I have to use system("date") instead of just "date" in the code above.
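For what it's worth, the likely reason "date" alone stopped updating is that awk keeps the pipe to the command open, so after the first line "date" | getline just hits end-of-file and the variable keeps its old value; closing the pipe after each read forces date to run again. A sketch of that variant (same device path as above):
awk 'BEGIN { FS = "," }
{
    "date" | getline myDate   # read one line of output from the date command
    close("date")             # close the pipe so date is re-run for the next record
    printf "%i\t%i\t%s\n", $1, $2, myDate
}' /dev/cu.usbmodem3d11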
2 things
It will be easier to see your problem if you add a \n at the end of your printf string
Then the output is
>echo '1,2' | awk 'BEGIN {RS=","};{printf "%i\t%i\t%s\n",$1,$2,system("date")}'
Wed Feb 26 21:30:17 CST 2014
1 0 0
Wed Feb 26 21:30:17 CST 2014
2 0 0
I'm guessing that system("date") sends its output directly to stdout, "outside" the scope of awk's printf and of the $0 it builds for each line of input processed. Others may be able to offer a better explanation.
To get the output you want, I'm using the getline function to capture the output of the date command to a variable (myDt). Now the output is
> echo '1,2' | awk 'BEGIN {RS=","};{"date" | getline myDt ; printf "%i\t%i\t%s\n",$1,$2,myDt}'
1 0 Wed Feb 26 21:31:15 CST 2014
2 0 Wed Feb 26 21:31:15 CST 2014
Finally, we remove the "debugging" \n char, and get the output you specify:
> echo '1,2' | awk 'BEGIN {RS=","};{"date" | getline myDt ; printf "%i\t%i\t%s",$1,$2,myDt}'
1 0 Wed Feb 26 21:34:56 CST 2014
2 0 Wed Feb 26 21:34:56 CST 2014
And, per Jaypal's post, I see now that using RS="," instead of FS="," is another issue, so when we make that change AND bring back the `\n' char, we have
echo '1,2' | awk 'BEGIN {FS=","};{"date" | getline myDt ; printf "%i\t%i\t%s\n",$1,$2,myDt}'
1 2 Wed Feb 26 21:44:42 CST 2014
Two issues:
First - RS is the record separator. You need FS, the field separator, to split the two columns so that $1 is 1 and $2 is 2 (as per your first row).
Second - The extra 0 you see in the output is the return value of the system() command; it means the command ran successfully. You can simply put the shell command in quotes and pipe it to getline. Putting a variable after getline lets you capture the command's output.
Try this:
awk 'BEGIN {FS=","};{"date"|getline var;printf "%i\t%i\t%s\n",$1,$2,var}'
This is a simpler solution:
awk -F, '{print $1,$2,dt}' dt="$(date)" OFS="\t" /dev/cu.usbmodem3d11
1 2 Thu Feb 27 06:23:41 CET 2014
3 4 Thu Feb 27 06:23:41 CET 2014
0 0 Thu Feb 27 06:23:41 CET 2014
0 1 Thu Feb 27 06:23:41 CET 2014
If you'd like to show the date in another format, just read the manual for date.
E.g. dt="$(date +%D)" gives 02/27/14
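As another option, if gawk is available, the built-in strftime() gives a fresh per-line timestamp without spawning date at all; the format string below is just an example that mimics date's default output, adjust as needed:
gawk -F, '{ printf "%i\t%i\t%s\n", $1, $2, strftime("%a %b %e %H:%M:%S %Z %Y") }' /dev/cu.usbmodem3d11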

Linux red-hat5.4 + awk file manipulation

How can I match a PARAM word (PARAM=name) in file.txt and print the lines between
NAMESx and NAMESy via awk, in the following way:
if PARAM is matched in file.txt, then awk should print only the lines between the closest NAMES markers, i.e. the block in which PARAM is one of the names.
Remark 1: PARAM can be any name, such as Pitter, Bob, etc.
Remark 2: awk will receive PARAM=(any name).
Remark 3: we don't know how many spaces there are between the # and NAMES.
more file.txt
# NAMES1
Pitter 23
Bob 75
# NAMES2
Donald 54
Josef 85
Patrick 21
# NAMES3
Tom 32
Jennifer 85
Svetlana 25
# NAMES4
Examples (regarding the file.txt contents):
In case PARAM=Pitter, awk should print these names to the out.txt file:
Pitter 23
Bob 75
In case PARAM=Josef, awk should print these names to the out.txt file:
Donald 54
Josef 85
Patrick 21
In case PARAM=Jennifer, awk should print these names to the out.txt file:
Tom 32
Jennifer 85
Svetlana 25
Using awk's RS would be helpful in this case. See the test below.
Testing with your example:
kent$ cat file
# NAMES1
Pitter 23
Bob 75
# NAMES2
Donald 54
Josef 85
Patrick 21
# NAMES3
Tom 32
Jennifer 85
Svetlana 25
# NAMES4
kent$ awk -vPARAM="Pitter" 'BEGIN{RS="# NAMES."} {if($0~PARAM)print}' file
Pitter 23
Bob 75
kent$ awk -vPARAM="Josef" 'BEGIN{RS="# NAMES."} {if($0~PARAM)print}' file
Donald 54
Josef 85
Patrick 21
kent$ awk -vPARAM="Jennifer" 'BEGIN{RS="# NAMES."} {if($0~PARAM)print}' file
Tom 32
Jennifer 85
Svetlana 25
Note: there are some empty lines in the output because they existed in your input; however, it would be easy to remove them from the output.
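For example, piping through a second awk that prints only non-empty lines (awk 'NF') is one way to drop them:
kent$ awk -vPARAM="Pitter" 'BEGIN{RS="# NAMES."} {if($0~PARAM)print}' file | awk 'NF'
Pitter 23
Bob 75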
UPDATE
If you have spaces between # and NAMES, you can try:
awk -vPARAM="Pitter" 'BEGIN{RS="# *NAMES."} {if($0~PARAM)print}' file