Convert month from Aaa to xx in a little script with awk

I am trying to report on the number of files created on each date. I can do that with this little one-liner:
ls -la foo*.bar|awk '{print $7, $6}'|sort|uniq -c
and I get a list of how many fooxxx.bar files were created on each date, but the month is in the form Aaa (e.g. Apr) and I want xx (e.g. 04).
I have a feeling the answer is in here:
awk '
BEGIN {
    # build a lookup table: months["Jan"]="01", months["Feb"]="02", ...
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
    for (o=1; o<=m; o++) {
        months[d[o]] = sprintf("%02d", o)
    }
    format = "%m/%d/%Y %H:%M"
}
{
    # rebuild a timestamp from the month name ($2), day ($3) and HH:MM ($4)
    split($4, time, ":")
    date = (strftime("%Y") " " months[$2] " " $3 " " time[1] " " time[2] " 0")
    print strftime(format, mktime(date))
}'
But I have little to no idea what I need to strip out, and no idea how to pass $7 to whatever I carve out of this to convert Apr to 04.
Thanks!

Here's the idiomatic way to convert an abbreviated month name to a number in awk:
$ echo "Feb" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
02
$ echo "May" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
05
Here index() returns the 1-based position of the abbreviation within the string (1 for Jan, 4 for Feb, 7 for Mar, ...), so (position+2)/3 maps it to 1 through 12. Let us know if you need more info to solve your problem.
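In the meantime, a sketch of how that drops into your original pipeline, assuming (as in typical ls -la output) that the month name is in $6 and the day in $7; adjust the field numbers for your system:
ls -la foo*.bar | awk '{printf "%02d %s\n", (index("JanFebMarAprMayJunJulAugSepOctNovDec",$6)+2)/3, $7}' | sort | uniq -c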

Assuming the month names only appear in the month column, you could do this (extend the sub() calls through Dec):
ls -la foo*.bar|awk '{sub(/Jan/,"01");sub(/Feb/,"02");print $7, $6}'|sort|uniq -c

Just use the field number of your month as an index into the months array.
print months[$6]
Since ls output differs from system to system (and sometimes on the same system, depending on file age), and you didn't give any examples, I have no way of knowing how to guide you further.
Oh, and don't parse ls.
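With those caveats, here is a sketch combining the BEGIN block from the question with the pipeline (again assuming the month name is in $6 and the day in $7):
ls -la foo*.bar | awk '
BEGIN {
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
    for (o=1; o<=m; o++) months[d[o]] = sprintf("%02d", o)    # months["Apr"] = "04"
}
{ print $7, months[$6] }' | sort | uniq -c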

To parse AIX istat, I use:
istat .profile | grep "^Last modified" | read dummy dummy dummy mon day time dummy yr dummy
echo "M: $mon D: $day T: $time Y: $yr"
-> M: Mar D: 12 T: 12:05:36 Y: 2012
To get the month number from AIX istat, I use this two-liner (AIX 6.1, ksh88):
monstr="???JanFebMarAprMayJunJulAugSepOctNovDec???"
mon="Oct" ; hugo=${monstr%${mon}*} ; hugolen=${#hugo} ; let hugol=hugolen/3 ; echo "Month: $hugol"
-> Month: 10
A result of 1..12 means the month name was valid; a result less than 1 or greater than 12 means it was not.
Instead of "hugo", use descriptive names ;-))
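A sketch wrapping the same trick in a ksh function (the function name is mine):
month_num() {
    typeset monstr="???JanFebMarAprMayJunJulAugSepOctNovDec???"
    typeset rest=${monstr%${1}*}       # strip from the month name to the end
    echo $(( ${#rest} / 3 ))           # offset/3 = month number
}
month_num Oct    # -> 10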

Adding a version for AIX that shows how to retrieve all the date elements (in whatever timezone you need them in) and display ISO 8601 output:
tempTZ="UTC" ; TZ="$tempTZ" istat /path/to/somefile \
| grep modified \
| awk -v tmpTZ="$tempTZ" '
BEGIN {Mmms="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec";
n=split(Mmms,Mmm," ") ;
for(i=1;i<=n;i++){ mm[Mmm[i]]=sprintf("%02d",i) }
}
{ printf("%s-%s-%sT%s %s",$NF, mm[$4], $5, $6, tmpTZ ) }
' ## this will output an iso8601 date of the modification date of that file,
## for ex: 2019-04-18T14:16:05 UTC
## you can set tempTZ to anything, e.g. tempTZ="UTC+2" to see that date in the UTC+2 timezone, or tempTZ="EST", etc.
I show the ISO 8601 version to make it better known and used, but of course you may only need the "mm" portion, which is easily done: mm[$4]

Related

awk next month last date logic issue

I searched SO, but could only find answers using programming languages like Java, Python, JS, etc.
I'm trying to find the next month's last date when a relative date is given as input to awk. I'm looking for an awk implementation.
Here is my logic, but it breaks for different inputs.
$ export DT="2020-12-31"
$ awk -v dt=$DT ' BEGIN { gsub(/-/," ",dt); print dt;dt2=mktime(dt " 0 0 0");
while(c<2) { dd=strftime("%d",dt2+(i++)*86400);if(dd*1==1)c++;} ; print strftime("%F",dt2+(i-2)*86400) } '
2020 12 31
2021-01-31 # Correct
Below one is giving wrong answer
$ export DT="2020-01-01"
$ awk -v dt=$DT ' BEGIN { gsub(/-/," ",dt); print dt;dt2=mktime(dt " 0 0 0");
while(c<2) { dd=strftime("%d",dt2+(i++)*86400);if(dd*1==1)c++;} ; print strftime("%F",dt2+(i-2)*86400) } '
2020 01 01
2020-01-31 # Wrong - Should be 2020-02-29
What is wrong with the logic? I welcome other awk solutions.
Similarly, I need to calculate the prior month's start and end dates, a relative month's start and end dates, and the next month's start and end dates, in a robust way, using awk.
Below is the code for the previous month's end date:
$ awk -v dt="2021-01-30" ' BEGIN { gsub(/-/," ",dt); print dt;dt2=mktime(dt " 0 0 0");
dy=strftime("%d",dt2); print strftime("%F",dt2-dy*86400) } '
2021 01 30
2020-12-31
Abusing GNU awk's mktime(): split the date into its components, add 2 to the month (for example 12+2=14) and subtract a day from the first day (1-1=0), so 2020-12-31 -> 2020-14-00 == 2021-01-31:
$ awk -v dt="2020-12-31" '
BEGIN {
split(dt,a,/-/) # split to get the date components
e=mktime(a[1] " " a[2]+2 " 0 0 0 0") # add 2 months and subtract a day from 1st day
print strftime("%F",e) # output
}'
Output:
2021-01-31
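The same trick, as a sketch, also gives the previous month's end date: leave the month alone and use day 0, i.e. the day before the 1st of the given month:
$ awk -v dt="2021-01-30" '
BEGIN {
    split(dt,a,/-/)
    e = mktime(a[1] " " a[2] " 0 0 0 0")   # day 0 of this month = last day of the previous month
    print strftime("%F",e)
}'
2020-12-31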
$ cat tst.awk
BEGIN {
    split(dt,curYMD,/-/)
    if ( curYMD[2] < 12 ) {
        tgtYear = curYMD[1]
        tgtMth  = curYMD[2] + 1
    }
    else {
        tgtYear = curYMD[1] + 1
        tgtMth  = 1
    }
    begSecs = mktime(tgtYear" "tgtMth" 1 12 0 0",1)
    begDate = strftime("%Y-%m-%d",begSecs)
    for (day=31; mth!=tgtMth; day--) {
        endSecs = mktime(tgtYear" "tgtMth" "day" 12 0 0",1)
        mth = strftime("%m",endSecs)+0
    }
    endDate = strftime("%Y-%m-%d",endSecs)
    print dt":\t" begDate, "->", endDate
}
$ awk -v dt='2020-12-31' -f tst.awk
2020-12-31: 2021-01-01 -> 2021-01-31
$ awk -v dt='2020-01-01' -f tst.awk
2020-01-01: 2020-02-01 -> 2020-02-29
$ awk -v dt='2021-01-30' -f tst.awk
2021-01-30: 2021-02-01 -> 2021-02-28
Best I can tell, the problems in your code are:
Using 86400 (the average number of seconds in a day) as if it were the exact duration of every day in order to count in days, instead of just counting in days, and
Using midnight as the time of day for mktime(), since that causes problems with leap seconds and DST changes too; always use noon instead when you're just doing day calculations. It wouldn't hurt to add the UTC flag too, to avoid DST concerns.
I used a loop to find the actual end day of the month because mktime() will try to guess what you meant if you give it a date whose day number is too large for the given month, so just check whether the current month really supports 31 days and, if not, decrement until it does (obviously it will only ever go down to 28).
Hopefully the equivalent logic for the prior month start/end date is obvious but in case it's not, just change the first if statement to:
if ( curYMD[2] > 1 ) {
    tgtYear = curYMD[1]
    tgtMth  = curYMD[2] - 1
}
else {
    tgtYear = curYMD[1] - 1
    tgtMth  = 12
}
I recommend going with James's answer, but here is the fix to my code, in case someone is interested in this crude way of manipulating dates via epoch calculations.
If I pre-increment "i" and then subtract 1 when printing, it gets the right answer.
$ export DT="2020-12-31"
$ awk -v dt=$DT ' BEGIN { gsub(/-/," ",dt); print dt;dt2=mktime(dt " 0 0 0");
while(c<2) { dd=strftime("%d",dt2+(++i)*86400);if(dd*1==1)c++;} ; print strftime("%F",dt2+(i-1)*86400) } '
2020 12 31
2021-01-31
Case-2:
$ export DT="2020-01-01"
$ awk -v dt=$DT ' BEGIN { gsub(/-/," ",dt); print dt;dt2=mktime(dt " 0 0 0");
while(c<2) { dd=strftime("%d",dt2+(++i)*86400);if(dd*1==1)c++;} ; print strftime("%F",dt2+(i-1)*86400) } '
2020 01 01
2020-02-29

Convert timestamps when viewing logfile output to UTC offset using awk

I have a file with timestamps in format:
2020-06-10 04:51:34.572 INFO: [17] Log message
I'd like to be able to view the output in UTC+{n}
I tried:
awk '{
#for(timefield = 2;timefield<=NF;timefield++)
timefield = 2
#cmd = "date --date=\x27"TZ=UTC+7\" \"" $timefield "\"x27"
#cmd = "date --date=\"TZ=UTC+7\" "" $timefield"
cmd = "date --date=\"TZ=Europe/London\" "" $timefield"
# if($timefield ~ /[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3}/) {
while(cmd | getline line) {
sub($timefield,line,$0)
}
print
}'
But I am getting: awk: line 12: runaway regular expression /} ...
UPDATE:
I'm having a lot of difficulty working out the correct quoting.
The correct GNU date syntax to get the timezone conversion is:
~# date --date='TZ="Asia/Bangkok" 2020-06-10 03:19:16.222'
Tue Jun 9 20:19:16 UTC 2020
Notice the single quote (') in the --date=' section. How do I correctly quote this within the awk cmd command?
Implementing the correct quoting is annoying, given the need to quote the awk script inside the shell. From the OP's comment, the goal is to convert the GMT time in the log file to HK time (UTC+7).
Following the OP's structure, it is possible to do the following.
#! /bin/sh
awk '
{
t_in = $1 " " $2
cmd = "TZ=UTC+7 date -d \"" t_in "Z\""
cmd | getline t_out
$1 = t_out
$2 = ""
print
}
'
Worth mentioning that this solution invokes the date utility on every line, which can be expensive for large files (10,000+ lines). If that is a concern, an alternative solution using awk's built-in time functions can be used.
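For example, a sketch using GNU awk's built-in mktime()/strftime(); it assumes the log timestamps are UTC, requires gawk 4.2+ for the UTC flags, and the +7 offset is only illustrative:
awk '{
    split($1, d, "-")                      # 2020-06-10 -> d[1..3]
    split($2, t, ":")                      # 04:51:34.572 -> t[1..3]
    frac = t[3]; sub(/^[0-9]+/, "", frac)  # keep the ".572" fraction
    sub(/\..*/, "", t[3])                  # whole seconds only for mktime()
    secs = mktime(d[1] " " d[2] " " d[3] " " t[1] " " t[2] " " t[3], 1) + 7*3600
    $1 = strftime("%Y-%m-%d", secs, 1)
    $2 = strftime("%H:%M:%S", secs, 1) frac
    print
}' logfile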

Text processing to create a list of unique ID's

I have a file with IDs and the names applicable to them, as below:
1234|abc|cde|fgh
5678|ijk|abc|lmn
9101|cde|fgh|klm
1213|klm|abc|cde
I need a file with only the unique names, as a list.
Output File:
abc|sysdate
cde|sysdate
fgh|sysdate
ijk|sysdate
lmn|sysdate
klm|sysdate
where sysdate is the current timestamp at the time of processing.
Requesting your help on this, and also an explanation of the suggested code.
What this code does:
awk -F\| '{ for(i=2; i <= NF; i++) a[$i] = a[$i] FS $1 }' input.csv
-F sets the delimiter to |. awk processes your file line by line, creating a map named 'a': for each field from the 2nd to the last, it uses that field's value as the key and appends the field separator plus the first column's value to the map entry.
When awk finishes processing the first line, 'a' is:
a['abc'] = '|1234'
a['cde'] = '|1234'
a['fgh'] = '|1234'
This script does not print anything.
What you want is something like this:
awk -F'|' '{for(i=2;i<=NF;i++){if(seen[$i] != 1){print $i, strftime(); seen[$i]=1}}}' OFS='|' input.csv
-F sets the input delimiter to |; OFS does the same for the output delimiter.
For each value from column 2 to the end of the line, we check whether it has been seen before. If not, we print the value and the processing time, then record the value in a map so we don't process it again.
Output :
abc|Thu Oct 18 10:40:13 CEST 2018
cde|Thu Oct 18 10:40:13 CEST 2018
fgh|Thu Oct 18 10:40:13 CEST 2018
ijk|Thu Oct 18 10:40:13 CEST 2018
lmn|Thu Oct 18 10:40:13 CEST 2018
klm|Thu Oct 18 10:40:13 CEST 2018
You can change the format of sysdate; see the documentation of gawk's strftime.
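For example, a sketch that prints an ISO-style timestamp instead of the default ctime format, using the common !seen[$i]++ shorthand for the same seen-map logic:
awk -F'|' '{for(i=2;i<=NF;i++) if(!seen[$i]++) print $i, strftime("%Y-%m-%d %H:%M:%S")}' OFS='|' input.csv
which yields lines like:
abc|2018-10-18 10:40:13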

Search logs within date/time range

I, Newbie, have searched this forum high and low, and have tried several awks, seds, & greps.
I am trying to search log files and output all logs within a date & time range.
Unfortunately, the logs that I am searching all have different date formats.
I did get this one to work:
awk '$0 >= "2018-08-23.11:00:00" && $0 <= "2018-08-23.14:00:00"' catalina.out
for that specific date format.
I can't get these date formats to work, maybe an issue with the spacing?
2018-08-23 11:00:00, or Aug 23, 2018 11:00:00
Some examples of what I have tried:
sed -n '/2018-08-23 16:00/,/2018-08-23 18:00/p' testfile.txt
sed -n '/Feb 23 13:55/,/Feb 23 14:00/p' testfile.txt
awk '$0 >= "2018-08-23 17:00:00" && $0 <= "2018-08-23 19:00:00"' testfile.txt
I have also tried setting variables:
FROM="Aug 23, 2018 17:00:00" , TO="Aug 23, 2018 19:00:00"
awk '$0 >= "$FROM" && $0 <= "$TO"' testfile.txt
Can anyone help me with this?
UPDATE: I got THIS to work for the 2018-08-23 11:00:00 format
grep -n '2018-08-23 11:[0-9][0-9]' testfile.txt | head -1
grep -n '2018-08-23 12:[0-9][0-9]' testfile.txt | tail -1
awk 'NR>=2 && NR<=4' testfile.txt > rangeoftext
But I could not get it to work with the Aug 23, 2018 11:00:00 format -- again, I think this may be a space issue? Not sure how to resolve....
This is a difficult problem. grep and sed have no concept of a date, and even GNU awk has only limited support for dates and times.
The problem becomes somewhat more tractable if you use a sane date format, i.e. a date format that can be compared as a string, such as 2018-08-15 17:00:00. This works regardless of whether the string contains whitespace. However, beware of tools that automatically split on whitespace, such as the shell and awk.
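A quick demonstration that plain string comparison orders timestamps in that format correctly:
$ awk 'BEGIN { print ("2018-08-23 12:30:00" >= "2018-08-23 11:00:00") }'
1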
Now, to your examples:
sed -n '/2018-08-23 16:00/,/2018-08-23 18:00/p' testfile.txt
sed -n '/Feb 23 13:55/,/Feb 23 14:00/p' testfile.txt
awk '$0 >= "2018-08-23 17:00:00" && $0 <= "2018-08-23 19:00:00"' testfile.txt
The first two should work, but only if the file really contains both timestamps, since you are only checking for the presence of certain literal strings. The third should also work, provided that every record starts with a timestamp.
This might be what you're looking for (making some assumptions about what your input file might look like):
$ cat file
Aug 22, 2018 11:00:00 bad
2018-08-23 11:00:00 good
Aug 23, 2018 11:00:00 good
2018-08-24 11:00:00 bad
$ cat tst.awk
BEGIN {
    min = raw2dt(min)
    max = raw2dt(max)
}
{ cur = raw2dt($0) }
(cur >= min) && (cur <= max)
function raw2dt(raw, tmp, mthNr, dt, fmt) {
    fmt = "%04d%02d%02d%02d%02d%02d"
    if ( match(raw,/[0-9]{4}(-[0-9]{2}){2}( [0-9:]+)?/) ) {
        split(substr(raw,RSTART,RLENGTH),tmp,/[^[:alnum:]]+/)
        dt = sprintf(fmt, tmp[1], tmp[2], tmp[3], tmp[4], tmp[5], tmp[6])
    }
    else if ( match(raw,/[[:alpha:]]{3} [0-9]{2}, [0-9]{4}( [0-9:]+)?/) ) {
        split(substr(raw,RSTART,RLENGTH),tmp,/[^[:alnum:]]+/)
        mthNr = (index("JanFebMarAprMayJunJulAugSepOctNovDec",tmp[1])+2)/3
        dt = sprintf(fmt, tmp[3], mthNr, tmp[2], tmp[4], tmp[5], tmp[6])
    }
    return dt
}
$ awk -v min='Aug 23, 2018 11:00' -v max='2018-08-23 11:00' -f tst.awk file
2018-08-23 11:00:00 good
Aug 23, 2018 11:00:00 good
The above will work using any POSIX awk in any shell on any UNIX box.
When trying to obtain the set of log entries that appear between two dates, you should never use sed. Yes, it is true that sed has a cool and very useful address-range feature (so does awk, by the way), but
sed -n '/date1/,/date2/p' file
will not always work: it only works if date1 and date2 actually appear in the file. If either of them is missing, it will fail. Per the POSIX definition of sed addresses:
An editing command with two addresses shall select the inclusive range from the first pattern space that matches the first address through the next pattern space that matches the second.
[address[,address]]
On top of that, when comparing dates, you should never use string comparisons unless the dates are in a sane format. Sane formats include YYYY-MM-DD and YYYY-MM-DD hh:mm:ss. Bad formats include "Aug 1 2018" (as a string it sorts before "Jan 1 2018"), "99-01-31" (it sorts after "01-01-31") and "2018-2-1" (it sorts after "2018-11-1").
So if you can, convert the dates you obtain into a sane format. The sanest of all is a date's offset from an epoch: Unix has various tools for computing the number of seconds since the Unix epoch of 1970-01-01 00:00:00 UTC, and this is what you are really after.
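For example, with GNU date (the -u flag pins the interpretation to UTC; without it the result depends on your local timezone):
$ date -ud "2018-08-23 11:00:00" +%s
1535022000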
As you mention, your log file has various date formats, and this does not make things easy. Even though GNU awk has various time functions, they require that you know the format beforehand.
Since we do not know which formats exist in your log file, we will make use of the Unix date utility, which has a very elaborate parser that understands a lot of formats.
Also, I will assume that in awk you are able to uniquely identify the date somehow and store it in a string. Maybe there is a special character that always appears after the date which allows you to do this:
Example input file:
2018-08-23 16:00 | some entry
Aug 23 2018 16:01:01 | some other entry
So, in this case, we can say:
awk -F'|' -v t1="$(date -d "START_DATE" +%s)" \
          -v t2="$(date -d "END_DATE" +%s)" \
    '{ cmd = "date -d \"" $1 "\" +%s"
       cmd | getline epoch
       close(cmd) }
     (t1 <= epoch && epoch <= t2)' testfile
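A usage sketch against the example file above (the boundary dates are illustrative):
$ awk -F'|' -v t1="$(date -d '2018-08-23 15:59' +%s)" \
            -v t2="$(date -d '2018-08-23 16:01' +%s)" \
    '{ cmd = "date -d \"" $1 "\" +%s"; cmd | getline epoch; close(cmd) }
     (t1 <= epoch && epoch <= t2)' testfile
2018-08-23 16:00 | some entry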

Remove commas and format dates

I have a large file with entries such as:
<VAL>17,451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2,405.00</VAL>
<DATE>31 Jul 2013</DATE>
<DATE>31 Jul 2013</DATE>
<DATE>31 Dec 2014</DATE>
<DATE>21 Jun 2002</DATE>
<DATE>10 Jul 2002</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>13 Jun 2013</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>21 Aug 2007</DATE>
<DATE>10 Jul 2002</DATE>
<VAL>-4,122,322.03</VAL>
I need to remove the commas in the VAL fields and change the dates to YYYY-MM-DD (e.g. 2013-07-31) in the DATE fields.
Looking for a quick (efficient) way of doing this.
Thanks
This should get you started:
awk -F"[<>]" 'BEGIN {split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",month," ");for (i=1;i<=12;i++) mdigit[month[i]]=i} /<VAL>/ {gsub(/\,/,"")} /<DATE>/ {split($3,a," ");$0=sprintf("<DATE>%s-%02d-%02d</DATE>",a[3],mdigit[a[2]],a[1])}1' file
<VAL>17451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2405.00</VAL>
<DATE>2013-07-31</DATE>
<DATE>2013-07-31</DATE>
<DATE>2014-12-31</DATE>
<DATE>2002-06-21</DATE>
<DATE>2002-07-10</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>2013-06-13</DATE>
<DATE>2002-07-10</DATE>
<DATE>2002-07-10</DATE>
<DATE>2007-08-21</DATE>
<DATE>2002-07-10</DATE>
<VAL>-4122322.03</VAL>
sed '# init month converter in the hold space
1{h;s/.*/Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;x;}
# change Val
/^<VAL>/ s/,//g
# Change Date
/^<DATE>/ {
# change month
G
s/[[:space:]]\{1,\}\([A-Z][a-z][a-z]\)[[:space:]]\{1,\}\(.*\)\n.*\1\([0-9][0-9]\).*/-\3-\2/
# reformat order
s/>\(.*\)-\(.*\)-\(.*\)</>\3-\2-\1</
}' YourFile
POSIX sed, with no extra subshell for the date conversion.
Reformatting the date takes two s/// commands here; they could be merged into a single s///, but that would make an already dense regex even harder to read.
Some safety checks on the source dates (e.g. rejecting a bad date format) could easily be added.
Your input seems like XML. I'd use a proper XML handling tool, e.g. XML::XSH2, a wrapper around Perl's XML::LibXML:
open file.xml ;
for //VAL set . xsh:subst(., ',', '','g') ;
perl { use Time::Piece } ;
for my $d in //DATE {
$t = $d/text() ;
set $d/text() { Time::Piece->strptime($t, '%d %b %Y')->ymd } ;
}
save :b ;
This might work for you (GNU sed & bash):
sed -r '/^<VAL>/s/,//g;/^(<DATE>)(.*)(<\/DATE>)$/s//echo "\1"$(date -d "\2" +%F)"\3"/e' file
This removes all commas on lines starting with <VAL>; for lines containing date tags, it uses the date utility and the substitution command's evaluate flag to rearrange the date into YYYY-MM-DD.
An alternative solution, using only sed commands:
sed -r '/^<VAL>/s/,//g;/^<DATE>/!b;s/$/\nJan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;s/^(<DATE>)(..) (...) (....)(<\/DATE>\n).*\3(..)/\1\4-\6-\2\5/;P;d' file
This appends a lookup table to the end of each date line and uses a regexp to rearrange the output.