I have a CSV file with a heading row and multiple data rows each with 11 data columns like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Mildred#email.com,205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Misty#email.com,205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to remove the duplicates in that file and sum the values in the Quantity data column. I want the result to be like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to sum only the values in the fifth data column, Quantity, while leaving the rest as is. I have tried the solution in "Sum duplicate row values with awk", but the answer there only works when the file has exactly two data columns. My CSV file has 11 data columns, so it doesn't work.
How can I do this with awk?
awk to the rescue!
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5; $5="~"; a[$0]+=q}
END {for(k in a) {sub("~",a[k],k); print k}}' file
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
Note that the order of the records is not guaranteed, though the input doesn't need to be sorted either. To preserve the order there are multiple solutions; see the update below.
Also, I use ~ as a placeholder for the Quantity field. If your data can include this character, replace it with one that never appears in the data.
UPDATE
To preserve the order (based on first appearance of a row)
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5;$5="~"; if(!($0 in a)) b[++c]=$0; a[$0]+=q}
END {for(k=1;k<=c;k++) {sub("~",a[b[k]],b[k]); print b[k]}}' file
Keep a separate structure to record the order in which rows first appear, and iterate over that data structure in the END block.
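If you'd rather not depend on a placeholder character at all, one alternative (a sketch of my own, not from the answer above) is to keep the fields on either side of Quantity as separate prefix/suffix strings and join them with awk's built-in SUBSEP character (\034), which is vanishingly unlikely to occur in CSV data:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{
  pre = $1 OFS $2 OFS $3 OFS $4             # fields before Quantity
  suf = ""
  for (i=6; i<=NF; i++) suf = suf OFS $i    # fields after Quantity
  key = pre SUBSEP suf
  if (!(key in qty)) order[++n] = key       # remember first appearance
  qty[key] += $5
}
END{
  for (i=1; i<=n; i++) {
    split(order[i], part, SUBSEP)
    print part[1] OFS qty[order[i]] part[2] # suf already begins with OFS
  }
}' file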
This takes Karafka's solution as a direct base and adds a bit of code to print the lines in the proper order (the order in which they are present in Input_file), as the OP requested.
awk -F, '
FNR==1{
  print
  next
}
{
  val=$5
  $5="~"
  a[$0]+=val
}
!b[$0]++{
  c[++count]=$0
}
END{
  for(i=1;i<=count;i++){
    sub("~",a[c[i]],c[i])
    print c[i]
  }
}' OFS=, Input_file
Explanation: here is the same code with a comment on each line.
awk -F, '                 ##Set the field separator to a comma.
FNR==1{                   ##If this is the first line (the header), do the following:
  print                   ##Print the current line.
  next                    ##Skip all further statements for this line.
}
{
  val=$5                  ##Save the 5th field (Quantity) in variable val.
  $5="~"                  ##Replace the 5th field with ~ so duplicate lines compare equal (this creates the index for array a).
  a[$0]+=val              ##Accumulate the saved quantity in array a, indexed by the masked line.
}
!b[$0]++{                 ##If this masked line has not been seen before, do the following:
  c[++count]=$0           ##Record it in array c at the next position, preserving first-appearance order.
}
END{                      ##Start the END block.
  for(i=1;i<=count;i++){  ##Loop from 1 to count over the recorded lines.
    sub("~",a[c[i]],c[i]) ##Substitute the ~ placeholder with the summed Quantity.
    print c[i]            ##Print the reconstructed line.
  }
}' OFS=, Input_file       ##Set OFS to a comma and pass the input file name.
This solution uses an extra array to guarantee the output is de-duplicated and printed in the original input order:
the 1st array tracks input row order,
the 2nd one both de-dupes and sums up $5.
% cat testfile.txt
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Mildred#email.com,205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Misty#email.com,205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
% < testfile.txt mawk 'BEGIN {
print $( (FS=OFS=",")*(getline))\
($(!(__=(_+=(_=_~_)+_)+--_-(_="")))=_)
} (______=($+__)($__=_=""))==(___[_=$_]+=______) {
____[++_____]=_
} END {
for(_^=!__;_<=_____;_++) {
print $(($__=___[$!_=____[_]])<-_) } }'
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
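For the curious: under the obfuscation, this program follows the same logic as the order-preserving solutions above. Roughly, in plain awk (my reading of the code, not the author's own rendering):
awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{
  q = $5; $5 = ""               # blank out Quantity so duplicate rows compare equal
  if ((sum[$0] += q) == q)      # true only on first sight of this masked row (for positive quantities)
    order[++n] = $0
}
END{
  for (i=1; i<=n; i++) {
    row = order[i]
    $0 = row; $5 = sum[row]     # re-split the masked row, restore the summed Quantity
    print
  }
}' testfile.txt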
I have a large file with entries such as:
<VAL>17,451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2,405.00</VAL>
<DATE>31 Jul 2013</DATE>
<DATE>31 Jul 2013</DATE>
<DATE>31 Dec 2014</DATE>
<DATE>21 Jun 2002</DATE>
<DATE>10 Jul 2002</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>13 Jun 2013</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>21 Aug 2007</DATE>
<DATE>10 Jul 2002</DATE>
<VAL>-4,122,322.03</VAL>
I need to remove the commas in the VAL fields and change the dates to YYYY-MM-DD (e.g. 2013-07-31) in the DATE fields.
Looking for a quick (efficient) way of doing this.
Thanks
This should get you started:
awk -F"[<>]" 'BEGIN {split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",month," ");for (i=1;i<=12;i++) mdigit[month[i]]=i} /<VAL>/ {gsub(/\,/,"")} /<DATE>/ {split($3,a," ");$0=sprintf("<DATE>%s-%02d-%02d</DATE>",a[3],mdigit[a[2]],a[1])}1' file
<VAL>17451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2405.00</VAL>
<DATE>2013-07-31</DATE>
<DATE>2013-07-31</DATE>
<DATE>2014-12-31</DATE>
<DATE>2002-06-21</DATE>
<DATE>2002-07-10</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>2013-06-13</DATE>
<DATE>2002-07-10</DATE>
<DATE>2002-07-10</DATE>
<DATE>2007-08-21</DATE>
<DATE>2002-07-10</DATE>
<VAL>-4122322.03</VAL>
sed '# init month converter in the hold buffer
1{h;s/.*/Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;x;}
# change Val
/^<VAL>/ s/,//g
# Change Date
/^<DATE>/ {
# change month
G
s/[[:space:]]\{1,\}\([A-Z][a-z][a-z]\)[[:space:]]\{1,\}\(.*\)\n.*\1\([0-9][0-9]\).*/-\3-\2/
# reformat order
s/>\(.*\)-\(.*\)-\(.*\)</>\3-\2-\1</
}' YourFile
This is POSIX sed, with no extra subshell for the date conversion.
Reformatting the date takes two s/// commands here; they could be merged into one, at the cost of making an already very attractive regex even less readable.
Some safety checks for malformed source dates could easily be added.
Your input seems like XML. I'd use a proper XML handling tool, e.g. XML::XSH2, a wrapper around Perl's XML::LibXML:
open file.xml ;
for //VAL set . xsh:subst(., ',', '','g') ;
perl { use Time::Piece } ;
for my $d in //DATE {
$t = $d/text() ;
set $d/text() { Time::Piece->strptime($t, '%d %b %Y')->ymd } ;
}
save :b ;
This might work for you (GNU sed & bash):
sed -r '/^<VAL>/s/,//g;/^(<DATE>)(.*)(<\/DATE>)$/s//echo "\1"$(date -d "\2" +%F)"\3"/e' file
This removes all commas on lines starting with <VAL>, and for lines that contain date tags it uses the date utility and the evaluate flag of the substitution command to rearrange the date to YYYY-MM-DD. (Note that the e flag executes a shell command built from the file's contents, so only use this on trusted input.)
An alternative solution, using only sed commands:
sed -r '/^<VAL>/s/,//g;/^<DATE>/!b;s/$/\nJan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;s/^(<DATE>)(..) (...) (....)(<\/DATE>\n).*\3(..)/\1\4-\6-\2\5/;P;d' file
This appends a lookup table to the end of each date line and uses a regexp to rearrange the date.
My file has timestamps in its 6th field which looks like this: Mon Jul 7 14:53:16 PDT 2014
I want to get all those lines from this file whose 6th field values are within the last 24 hours.
Sample Input:
abc -> /aa/bbb, hello, /home/user/blah.pl, 516, usc, Mon Jul 4 10:06:33 PDT 2014
abc -> /aa/bbb, hello, /home/user/blah.pl, 516, usc, Mon Jul 5 10:06:33 PDT 2014
abc -> /aa/bbb, hello, /home/user/blah.pl, 516, usc, Mon Jul 7 07:06:33 PDT 2014
abc -> /aa/bbb, hello, /home/user/blah.pl, 516, usc, Mon Jul 7 08:06:33 PDT 2014
abc -> /aa/bbb, hello, /home/user/blah.pl, 516, usc, Mon Jul 7 09:06:33 PDT 2014
abc -> /aa/bbb, hello, /home/user/blah.pl, 516, usc, Mon Jul 7 10:06:33 PDT 2014
The field delimiter is a comma.
Sample Code
But it's not working as expected:
awk 'BEGIN {FS = ","};
{ a=$6;
aint=a +"%y%m%d%H%M%S";
yestint=$(date --date='1 day ago' +"%y%m%d%H%M%S");
if (aint>yestint)
print aint;
}' /location/canzee/textfile.txt
Sample Output
I get an output like this:
awk: cmd. line:4: yestint=$(date --date=1
awk: cmd. line:4: ^ syntax error
awk: cmd. line:5: (END OF FILE)
awk: cmd. line:5: syntax error
Desired Output
Mon Jul 7 07:06:33 PDT 2014
Mon Jul 7 08:06:33 PDT 2014
Mon Jul 7 09:06:33 PDT 2014
Mon Jul 7 10:06:33 PDT 2014
I would like to know how to go about this, given that I can't call shell commands like date from within awk. I hope it's clear enough.
Here's a sketch of an idea. Beware that it is gawk-specific.
# An array to convert abbreviated month names to numbers.
BEGIN {m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6
m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;}
# later in your script
{
# systime() gives the number of seconds since the "epoch".
# Subtract 24-hours-worth of seconds from it to get "yesterday".
# (Note that this is yesterday at a specific time, which may not
# really be what you want.)
yest = systime() - 24 * 60 * 60;
a = "Mon Jul 7 14:27:56 PDT 2014" # or however a gets its value
# Split the fields of a into the array f (splitting on spaces).
split(a, f, " ");
# Split the fields of f[4] (the time) into the array t (splitting on colons).
split(f[4], t, ":")
# mktime() converts a date specification into seconds since the epoch.
# The datespec format is: 2014 7 7 14 27 56 [optional dst flag]
# If the daylight savings time flag is left out the system tries to determine
# whether or not dst is in effect.
tm = mktime(f[6] " " m[f[2]] " " f[3] " " t[1] " " t[2] " " t[3])
#Compare the seconds since epochs.
if (tm > yest)
...
}
In the context of your program, it might be done like this:
awk '
BEGIN {
m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6
m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
FS = "[[:space:]]*,[[:space:]]*"
yest = systime() - 24 * 60 * 60;
}
{
split($6, f, " ")
split(f[4], t, ":")
tm = mktime(f[6] " " m[f[2]] " " f[3] " " t[1] " " t[2] " " t[3])
if (tm > yest)
print $6;
}
' /location/canzee/textfile.txt
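As for why the original attempt blew up: $(date ...) is shell command substitution, which awk knows nothing about, so awk chokes on the stray $(. If you do want the date utility involved, compute the cutoff in the shell and hand it to awk with -v (a sketch; it assumes GNU date for --date, and mktime is still a gawk extension):
yest=$(date --date='1 day ago' +%s)   # epoch seconds, 24 hours ago
gawk -v yest="$yest" '
BEGIN {
  m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4;  m["May"]=5;  m["Jun"]=6
  m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12
  FS = "[[:space:]]*,[[:space:]]*"
}
{
  split($6, f, " "); split(f[4], t, ":")
  tm = mktime(f[6] " " m[f[2]] " " f[3] " " t[1] " " t[2] " " t[3])
  if (tm > yest) print $6
}' /location/canzee/textfile.txt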
I am trying to report on the number of files created on each date. I can do that with this little one-liner:
ls -la foo*.bar|awk '{print $7, $6}'|sort|uniq -c
and I get a list of how many fooxxx.bar files were created by date, but the month is in the form Aaa (e.g. Apr) and I want xx (e.g. 04).
I have feeling the answer is in here:
awk '
BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
months[d[o]]=sprintf("%02d",o)
}
format = "%m/%d/%Y %H:%M"
}
{
split($4,time,":")
date = (strftime("%Y") " " months[$2] " " $3 " " time[1] " " time[2] " 0")
print strftime(format, mktime(date))
}'
But I have little to no idea what I need to strip out of this, and no idea how to pass $7 to whatever I carve out of it to convert Apr to 04.
Thanks!
Here's the idiomatic way to convert an abbreviated month name to a number in awk:
$ echo "Feb" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
02
$ echo "May" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
05
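Dropped into your pipeline, it would look something like this (a sketch; it assumes the common ls -l layout where $6 is the month and $7 the day, and parsing ls is fragile, as noted further down):
ls -la foo*.bar |
awk '{ printf "%s %02d\n", $7, (index("JanFebMarAprMayJunJulAugSepOctNovDec",$6)+2)/3 }' |
sort | uniq -c
The trick works because the month names sit at positions 1, 4, 7, ... in the lookup string, so (index+2)/3 maps them to 1 through 12.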
Let us know if you need more info to solve your problem.
Assuming the month names only appear in the month column, you could do this (extend the chain of sub() calls to all twelve months in the same way):
ls -la foo*.bar|awk '{sub(/Jan/,"01");sub(/Feb/,"02");print $7, $6}'|sort|uniq -c
Just use the field number of your month as an index into the months array.
print months[$6]
Since ls output differs from system to system, and sometimes on the same system depending on file age, and you didn't give any examples, I have no way of knowing how to guide you further.
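Put together with the BEGIN block from the code in the question, that would look something like this (a sketch; the same ls caveats apply):
ls -la foo*.bar |
awk 'BEGIN {
       m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
       for (o=1; o<=m; o++) months[d[o]] = sprintf("%02d", o)   # Jan -> 01, etc.
     }
     { print $7, months[$6] }' |
sort | uniq -c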
Oh, and don't parse ls.
To parse AIX istat, I use:
istat .profile | grep "^Last modified" | read dummy dummy dummy mon day time dummy yr dummy
echo "M: $mon D: $day T: $time Y: $yr"
-> Month: Mar Day: 12 Time: 12:05:36 Year: 2012
To parse the AIX istat month, I use this two-liner (AIX 6.1, ksh88):
monstr="???JanFebMarAprMayJunJulAugSepOctNovDec???"
mon="Oct" ; hugo=${monstr%${mon}*} ; hugolen=${#hugo} ; let hugol=hugolen/3 ; echo "Month: $hugol"
-> Month: 10
A result of 1..12 means the month name was valid; anything less than 1 or greater than 12 means the month name was not recognized.
Instead of "hugo", use meaningful variable names ;-))
Adding a version for AIX that shows how to retrieve all the date elements (in whatever timezone you need them in) and display ISO 8601 output:
tempTZ="UTC" ; TZ="$tempTZ" istat /path/to/somefile \
| grep modified \
| awk -v tmpTZ="$tempTZ" '
BEGIN {Mmms="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec";
n=split(Mmms,Mmm," ") ;
for(i=1;i<=n;i++){ mm[Mmm[i]]=sprintf("%02d",i) }
}
{ printf("%s-%s-%sT%s %s",$NF, mm[$4], $5, $6, tmpTZ ) }
' ## This outputs an ISO 8601 modification date for that file,
  ## e.g.: 2019-04-18T14:16:05 UTC
  ## You can set tempTZ to anything, e.g. tempTZ="UTC+2" to see that date in the UTC+2 timezone, or tempTZ="EST", etc.
I show the ISO 8601 version to make it better known and more widely used, but of course you may need only the "mm" portion, which is easily done: mm[$4]