Modification of date format within a text file - awk

I have some text files containing lines as follows:
07JAN01, -0.247297942769082E+07, -0.467133797284279E+07, 0.355810777473149E+07
07JAN02, -0.247297942405032E+07, -0.467133797586388E+07, 0.355810777517715E+07
07JAN03, -0.247297942584851E+07, -0.467133797727224E+07, 0.355810777627353E+07
. . . .
. . . .
I need to produce a script which will amend the format of the date to :
01/01/07, -0.247297942769082E+07, -0.467133797284279E+07, 0.355810777473149E+07
02/01/07, -0.247297942405032E+07, -0.467133797586388E+07, 0.355810777517715E+07
03/01/07, -0.247297942584851E+07, -0.467133797727224E+07, 0.355810777627353E+07
. . . .
. . . .
I was looking for an appropriate sed or grep command to extract only some characters of each line, to define them as variables in my script. As I would like to "reorganize" the date, I was thinking about defining three variables, where, for the first line, they would be:
a=07
b=JAN (need to implement a "case" in the script to deal with this I think?)
c=01
I looked at some grep examples and tons of docs, but nothing really clear appeared. I found something about the cut command, but I'm not too sure it's appropriate here.
The other question I have is about the output: since sed doesn't modify the input data, how can I modify the files directly? Is there a way?
Any help would really be appreciated :)

I don't think grep is the right tool for the job myself. You need something a little more expressive like Perl or awk:
echo '07JAN01, -0.24729E+07, -0.46713E+07, 0.35581E+07
07JAN02, -0.24729E+07, -0.46713E+07, 0.35581E+07
07AUG03, -0.24729E+07, -0.46713E+07, 0.35581E+07' | awk -F, '
{
yy=substr($1,1,2);
mm=substr($1,3,3);
mm=(index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",mm)+2)/4;
dd=substr($1,6,2);
printf "%02d/%02d/%02d,%s,%s,%s\n",dd,mm,yy,$2,$3,$4
}'
which generates:
01/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
02/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
03/08/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
Obviously, that's just pumping some test data through a command line awk script. You'd be better off putting that into an actual awk script file and running your input through it.
If datechg.awk contains:
{
yy=substr($1,1,2);
mm=substr($1,3,3);
mm=(index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",mm)+2)/4;
dd=substr($1,6,2);
printf "%02d/%02d/%02d,%s,%s,%s\n",dd,mm,yy,$2,$3,$4
}
then:
echo '07JAN01, -0.24729E+07, -0.46713E+07, 0.35581E+07
07JAN02, -0.24729E+07, -0.46713E+07, 0.35581E+07
07AUG03, -0.24729E+07, -0.46713E+07, 0.35581E+07' | awk -F, -f datechg.awk
also produces:
01/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
02/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
03/08/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
The way this works is as follows. Each line is split into fields (-F, sets the field separator to a comma) and we extract and process the relevant parts of field 1 (the date). By this I mean the year and day are reversed and the textual month is turned into a numeric month by searching a string for it and manipulating the index where it was found, so that it falls in the range 1 through 12.
This is the only (relatively) tricky bit and is done with some basic mathematics: the index function simply finds the position of your month within the string (where the first char is 1). So JAN is at position 2, FEB at 6, MAR at 10, ..., DEC at 46 (the set {2, 6, 10, ..., 46}). They're 4 apart, so we're going to need to divide by 4 eventually to get consecutive month numbers, but first we add 2 so the division comes out to exact whole numbers. Adding that 2 gives you the set {4, 8, 12, ..., 48}. Then you divide by 4 to get {1, 2, 3, ..., 12} and there's your month number:
Text  Pos  +2  /4
----  ---  --  --
JAN     2   4   1
FEB     6   8   2
MAR    10  12   3
APR    14  16   4
MAY    18  20   5
JUN    22  24   6
JUL    26  28   7
AUG    30  32   8
SEP    34  36   9
OCT    38  40  10
NOV    42  44  11
DEC    46  48  12
Then we just output the new information. Obviously, this is likely to barf if you provide bad data but I'm assuming either:
the data is good; or
you'll add your own error checks (a minimal sketch follows).
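A minimal sketch of such a check, as an extra rule that could sit before the main rule in datechg.awk (just an illustration, not part of the original answer):
# report and skip any line whose month abbreviation is not in the lookup string
# ("/dev/stderr" is understood by gawk and most modern awks)
index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", substr($1,3,3)) == 0 {
    printf "skipping line %d: unrecognized month %s\n", NR, substr($1,3,3) > "/dev/stderr"
    next
}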
Regarding modifying the files directly, the time-honored UNIX tradition is to use a shell script to save the current file elsewhere, process it to create a new file, then overwrite the old file with the new file (but not touching the saved file, in case something goes horribly wrong).
I won't make my answer any longer by detailing that; you've probably fallen asleep already :-)
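For reference, a minimal sketch of that save-process-overwrite tradition might look like this (the file names are illustrative assumptions):
cp datafile datafile.bak                        # keep an untouched safety copy
awk -F, -f datechg.awk datafile > datafile.new  # process into a new file
mv datafile.new datafile                        # replace the original; the .bak stays put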

A bit clunky, but you could do:
sed -e 's/^\(..\)JAN\(..\)/\2\/01\/\1/'
sed -e 's/^\(..\)FEB\(..\)/\2\/02\/\1/'
...
In order to run sed in-place, see the -i commandline option:
sed -i -e ...
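If you do go the sed route, the twelve per-month substitutions can be chained into a single in-place invocation; a sketch showing only the first three months (the file name is an assumption):
sed -i -e 's/^\(..\)JAN\(..\)/\2\/01\/\1/' \
       -e 's/^\(..\)FEB\(..\)/\2\/02\/\1/' \
       -e 's/^\(..\)MAR\(..\)/\2\/03\/\1/' file.txt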
Edit
Just to point out that this answers a previous version of the question where AWK was not specified.

awk 'BEGIN{
  OFS=FS=","
  # create a table mapping month names to numbers
  s=split("JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",d,":")
  for(o=1;o<=s;o++){
    m=sprintf("%02d",o)   # zero-pad single-digit months
    date[d[o]]=m
  }
}
{
  yr=substr($1,1,2)
  mth=substr($1,3,3)
  day=substr($1,6,2)
  $1=day"/"date[mth]"/"yr
}1' file
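Since the question also asked how to modify the files directly: assuming you have GNU awk 4.1 or later, its inplace extension can do that; otherwise write to a temporary file and move it back (a sketch, with fixdate.awk as a hypothetical file holding the program above):
gawk -i inplace -f fixdate.awk file
# portable alternative
awk -f fixdate.awk file > file.tmp && mv file.tmp file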

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk, checking that the row number is > 3 and then checking for an even row number with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You do not need cat when using GNU sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should consider whether you need that part at all: your sample input does not contain a pipe character, so the substitution would do nothing. You might also drop that part if the lines holding the values you want to print do not contain that character.
The output is empty because NR%2==4 never holds: the remainder of division by x is always smaller than x (in the particular case of %2, only two values are possible: 0 and 1).
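For example, a quick (hypothetical) demonstration of the only values NR%2 can take:
$ seq 1 6 | awk '{print NR, NR%2}'
1 1
2 0
3 1
4 0
5 1
6 0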
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce back slashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. The \2 refers to the last occurrence captured by that back reference, which in conjunction with the {3} means the third column.
Alternative (from line 4 on, print the third column of every other line; the n command reads and discards the intervening line):
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields on lines 6 and 7. This can be done using NR. The tricky part is the following: the fields should be paired up, with every second field going into the second column, and an E should be added before the sign of the exponent, so that the output file will look like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you can see that I want to keep only the first 10 characters of $6 (i.e. length($6)=10).
How is it possible to do it in awk?
You can do it all in awk, but it's perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Or perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed, or embed something similar in the awk script (using gsub):
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks, there are various ways to do it (the simplest being gsub(/ /,"",$<field number>) on the relevant fields), but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
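For instance, a hedged variant of tst.awk that strips those leading blanks with gsub might look like this (same FIELDWIDTHS assumptions as above):
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
    for (i=1;i<NF;i+=4) {
        gsub(/ /,"",$i); gsub(/ /,"",$(i+2))   # drop the leading blanks from the mantissa fields
        print $i "E" $(i+1), $(i+2) "E" $(i+3)
    }
}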
I tried combining @karafka's answer with substr, and the following does the trick:
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

How to match "field 5 through the end of the line" (for example, in awk)

I want to pretty-print the output of a find-like script that would take input like this:
- 2015-10-02 19:45 102 /My Directory/some file.txt
and produce something like this:
- 102 /My Directory/some file.txt
In other words: "f" (for "file"), file size (right-justified), then pathname (with an arbitrary number of spaces).
This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".
I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".
Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.
On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:
-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later
Gawk (and awk) get the following results:
$ gawk '{ print index($0, $8) }' test.txt
49
15
49
In other words, the value of $8 ('b') matches at index 15 instead of at 49, where the field actually starts (as it does for the other filenames).
My issue, then, is how to specify "everything from field X to the end of the string".
I have re-written this question in order to make this clear.
Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:
stat -c "f%15s %n" *
But you should double-check how your "stat" operates; it can be system-specific.
The built-in awk function index() is sometimes recommended as a way
to print "from field 5 through the end of the string" [1, 2, 3].
In awk, index($0, $8) does not mean "the index of the first character of
field 8 in string $0". Rather, it means "the index of the first occurrence in
string $0 of the string value of field 8". In many cases, that first
occurrence will indeed be the first character in field 8 but this is not the
case in the example above.
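One common alternative (a sketch assuming whitespace-separated fields; input.txt is a placeholder, and this is not an excerpt from the cited threads) is to work on a copy of the record and strip the leading fields off it:
awk '{
    rest = $0
    # remove fields 1-4 and the whitespace after each of them
    for (i = 1; i <= 4; i++) {
        sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+/, "", rest)
    }
    printf "%s %15s %s\n", $1, $4, rest    # type flag, size, then the full pathname
}' input.txt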
It has been pointed out that parsing the output of ls is generally a bad
idea [4], in part because implementations of ls significantly differ in output.
Since the author of that note recommends find as a replacement for ls for some uses,
here is a script using find:
find "$@" -ls |
sed -e 's/^ *//' -e 's/ */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
{ printf("%s %15s %s\n", $2, $4, $6) }'
...which yields the required output:
f            4639 /Users/foobar/uu/a
f            3024 /Users/foobar/uu/calendar
f            2374 /Users/foobar/uu/xpect
This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.
[1] http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
[2] How to print third column to last column?
[3] Print Field 'N' to End of Line
[4] http://mywiki.wooledge.org/ParsingLs
Maybe some variation of find -printf | awk is what you're looking for?
$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct 2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct 2 14:35 foo
-rw-r--r-- 1 Ed None 0 May 3 09:55 foo bar
$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]/,sprintf("%s %10d",$1,$2))}1'
f 7 tmp/bar
f 2 tmp/foo
f 0 tmp/foo bar
or
$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
It won't work with file names that contain newlines.

convert month from Aaa to xx in little script with awk

I am trying to report on the number of files created on each date. I can do that with this little one liner:
ls -la foo*.bar|awk '{print $7, $6}'|sort|uniq -c
and I get a list of how many fooxxx.bar files were created on each date, but the month is in the form Aaa (i.e. Apr) and I want xx (i.e. 04).
I have a feeling the answer is in here:
awk '
BEGIN{
    m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
    for(o=1;o<=m;o++){
        months[d[o]]=sprintf("%02d",o)
    }
    format = "%m/%d/%Y %H:%M"
}
{
    split($4,time,":")
    date = (strftime("%Y") " " months[$2] " " $3 " " time[1] " " time[2] " 0")
    print strftime(format, mktime(date))
}'
But I have little to no idea what I need to strip out, and no idea how to pass $7 to whatever I carve out of this to convert Apr to 04.
Thanks!
Here's the idiomatic way to convert an abbreviated month name to a number in awk:
$ echo "Feb" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
02
$ echo "May" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
05
Let us know if you need more info to solve your problem.
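To plug that into your pipeline, assuming the ls -la layout in your one-liner puts the month in $6 and the day in $7 (and with the usual caveats about parsing ls), a sketch might be:
ls -la foo*.bar | awk '{
    m = sprintf("%02d", (index("JanFebMarAprMayJunJulAugSepOctNovDec", $6) + 2) / 3)
    print $7, m
}' | sort | uniq -c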
Assuming the names of the months only appear in the month column, you could do this:
ls -la foo*.bar|awk '{sub(/Jan/,"01");sub(/Feb/,"02");print $7, $6}'|sort|uniq -c
Just use the field number of your month as an index into the months array.
print months[$6]
Since ls output differs from system to system, and sometimes on the same system depending on file age, and you didn't give any examples, I have no way of knowing how to guide you further.
Oh, and don't parse ls.
To parse AIX istat, I use:
istat .profile | grep "^Last modified" | read dummy dummy dummy mon day time dummy yr dummy
echo "M: $mon D: $day T: $time Y: $yr"
-> M: Mar D: 12 T: 12:05:36 Y: 2012
To parse the AIX istat month, I use this two-liner (AIX 6.1, ksh88):
monstr="???JanFebMarAprMayJunJulAugSepOctNovDec???"
mon="Oct" ; hugo=${monstr%${mon}*} ; hugolen=${#hugo} ; let hugol=hugolen/3 ; echo "Month: $hugol"
-> Month: 10
A result of 1..12 means the month name is OK; less than 1 or greater than 12 means the month name is not OK.
Instead of "hugo", use meaningful names ;-)
Adding a version for AIX that shows how to retrieve all the date elements (in whatever timezone you need them in) and display ISO 8601 output:
tempTZ="UTC" ; TZ="$tempTZ" istat /path/to/somefile \
| grep modified \
| awk -v tmpTZ="$tempTZ" '
BEGIN {Mmms="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec";
n=split(Mmms,Mmm," ") ;
for(i=1;i<=n;i++){ mm[Mmm[i]]=sprintf("%02d",i) }
}
{ printf("%s-%s-%sT%s %s",$NF, mm[$4], $5, $6, tmpTZ ) }
' ## this will output an iso8601 date of the modification date of that file,
## for ex: 2019-04-18T14:16:05 UTC
## you can set tempTZ to anything, for ex: tempTZ="UTC+2" to see that date in the UTC+2 timezone, or tempTZ="EST", etc.
I show the ISO 8601 version to make it more widely known and used, but of course you may only need the "mm" portion, which is easily done: mm[$4].

how to parse a number from sql output

Hi, I need to parse a number from SQL output:
COUNT(*)
----------
924
140
173
583
940
77
6 rows selected.
If the first line is less than 10, I want to create an empty file.
The problem is I don't know how to parse it; the numbers keep changing (from 0 to ca. 10,000).
The question is very unclear, so I'll make some assumptions: you get the output above from SQL, either to a file or to stdout, and you would like to test whether the first line containing digits is less than 10. Correct?
This is one way to do it.
sed -n '3p' log | awk '{ print ($1 < 10) ? "true" : "false" }'
sed is used to print the 3rd line from your example.
This is then piped into awk, which makes the comparison.
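To actually create the empty file based on that comparison, a small sketch (assuming the query output is in a file named log, with empty.txt as a placeholder name):
count=$(sed -n '3p' log | awk '{print $1}')
if [ "$count" -lt 10 ]; then
    touch empty.txt
fi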
...or, putting it together in bash:
#!/bin/bash
while read variable
do
    if [[ "$variable" =~ ^[0-9]+$ ]]
    then
        break
    fi
done < input
if [ "$variable" -lt 10 ]
then
    echo 'less than 10'
    # add your code here, eg
    # touch /path/to/file/to/be/created
fi