Text processing to create a list of unique IDs - awk

I have a file with IDs and the Names applicable to them, as below:
1234|abc|cde|fgh
5678|ijk|abc|lmn
9101|cde|fgh|klm
1213|klm|abc|cde
I need a file with only unique Names as a list.
Output File:
abc|sysdate
cde|sysdate
fgh|sysdate
ijk|sysdate
lmn|sysdate
klm|sysdate
Where sysdate is the current timestamp of processing.
Could you help me with this? An explanation of the suggested code would also be appreciated.

What this code does:
awk -F\| '{ for(i=2; i <= NF; i++) a[$i] = a[$i] FS $1 }' input.csv
-F sets the delimiter to |. awk processes your file line by line, creating a map named 'a': for each line it reads from column 2 to the last column, using the cell being processed as the key and appending FS plus the first column's value to whatever is already stored under that key.
When awk finishes processing the first line, 'a' is:
a['abc'] = '|1234'
a['cde'] = '|1234'
a['fgh'] = '|1234'
(each value starts with a leading | because a[$i] is still empty on the first append)
This script does not print anything.
What you want is something like this:
awk -F'|' '{for(i=2;i<=NF;i++){if(seen[$i] != 1){print $i, strftime(); seen[$i]=1}}}' OFS='|' input.csv
-F sets the input delimiter to |; OFS does the same for the output delimiter.
For each value from column 2 to the end of the line, we check whether it has already been seen. If not, we print the value and the time of processing, then record the value in a map so we don't process it again.
Output:
abc|Thu Oct 18 10:40:13 CEST 2018
cde|Thu Oct 18 10:40:13 CEST 2018
fgh|Thu Oct 18 10:40:13 CEST 2018
ijk|Thu Oct 18 10:40:13 CEST 2018
lmn|Thu Oct 18 10:40:13 CEST 2018
klm|Thu Oct 18 10:40:13 CEST 2018
You can change the format of sysdate; see the GNU awk documentation for strftime.
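For example, to emit an ISO-style timestamp instead of the default format, a minimal sketch (strftime is a gawk extension, and !seen[$i]++ is the usual shorthand for the seen-map test used above):
awk -F'|' '{for(i=2;i<=NF;i++) if(!seen[$i]++) print $i, strftime("%Y-%m-%d %H:%M:%S")}' OFS='|' input.csv
This prints lines like abc|2018-10-18 10:40:13.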

How to print matching line, 3 lines after, and matching URL

I am trying to extract information from SMTP mails in text form, i.e.:
the date (ex: Wed, 9 Oct 2019 01:55:58 -0700 (PDT))
the sender (ex: from xxx.yyy.com (zzz:com. [111.222.333.444]))
the URLs present in the mail (ex: http://some.thing)
Here's an example of an input:
Delivered-To: SOME#ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME#ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
by SOME.THIRD.URL.COM with ESMTP id SOMEID
for <SOME#ADDRESS.COM>;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
The example is deliberately truncated (the real header is longer), but it is enough for the purpose of the example.
I've tried sed and awk and managed to do something, but not quite what I want.
SED:
sed -e 's/http/\nhttp/g' -n -e '/Received: from/{h;n;n;n;H;x;s/\n \+/;/;p}' a.txt
The first expression gets each URL onto its own line, but I didn't manage to make use of it afterwards.
And in any case, the output is not in order.
AWK:
BEGIN{
RS = "\n";
FS = "";
}
/Received: from/{
from = $0;
getline;
getline;
getline;
date = $0
}
/"\"https?://[^\"]+"/
{
FS="\"";
print $0;
}
END{
print date";"from;
};
This one works except for the URLs: the regexp doesn't work there, whereas as a one-liner it does.
I also tried to find a more elegant way to get the date by using the value of NR+3, but it didn't work.
And I want to display this in CSV format:
date;sender;URL1;URL2;...
I would prefer pure sed or pure awk: I think I could do it with a mix of grep, tail, sed and awk, but as I want to learn, I'd rather stick to one or the other of them :)
Well, here is a longish sed script with comments inside:
sed -nE '
/Received: from /{
# hold my line!
h
# ach, spaghetti code, here we go again
: notdate
${
s/.*/ERROR: INVALID INPUT: DATE NOT FOUND/
p
q1
}
# the next line after the line ending with ; should be the date
/;$/!{
# so search for a line ending with ;
n
b notdate
}
# the next line is the date
n
# remove leading spaces
s/^[[:space:]]*//
# grab the Received: from line
G
# and save it for later
h
}
# headers end with an empty line
/^$/{
# loop over lines
: read_next_line
n
# flag all occurrences of URLs as \x01<URL>\x02
s/"(http[^"]+)"/\x01\1\x02/g
# we found at least one URL if there is \x01 in the pattern space
/\x01/{
# extract each occurrence to the end of pattern space with a newline
: again
s/^([^\x01]*)\x01([^\x02]*)\x02(.*)$/\1\3\n\2/
t again
# remove everything in front of separator - the unparsed part of line
s/^[^\n]*\n//
# add URLs to hold space
H
}
# if this is the last line, we should finally print something, and exit
${
# grab the hold space
x
# replace the newline separators with ;
s/\n/;/g
# print and exit successfully
p
q 0
}
# here we go again!
b read_next_line
}
'
for the following input:
Delivered-To: SOME#ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME#ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
by SOME.THIRD.URL.COM with ESMTP id SOMEID
for <SOME#ADDRESS.COM>;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
SOMEHTML src="http://URL3" SOMEHTML src="http://URL4"
outputs:
Wed, 09 Oct 2019 01:55:58 -0700 (PDT);Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X]);http://URL1;http://URL2;http://URL3;http://URL4
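Since the question asks for pure sed or pure awk, here is a hedged POSIX-awk sketch of the same extraction (it assumes the sender line starts with "Received: from", that the first date-shaped line after it is the one wanted, and that every URL is enclosed in double quotes):
awk '
/Received: from/ { from = $0; infrom = 1 }                 # remember the sender line
infrom && /^[[:space:]]*[A-Z][a-z][a-z], [0-9]/ {          # first date-shaped line after it
    date = $0; sub(/^[[:space:]]*/, "", date); infrom = 0
}
{
    line = $0                                              # collect every double-quoted URL, in order
    while (match(line, /"https?:\/\/[^"]+"/)) {
        urls = urls ";" substr(line, RSTART + 1, RLENGTH - 2)
        line = substr(line, RSTART + RLENGTH)
    }
}
END { print date ";" from urls }
' a.txt
On the input above this should print the same date;sender;URL1;URL2;... line as the sed script.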

Difference in AWK when using NR with and without print

I am an AWK beginner, and after playing around with the built-in variable NR, I do not understand the following.
Text:
CREDITS,EXPDATE,USER,GROUPS
99,01 jun 2018,sylvain,team:::admin
52,01 dec 2018,sonia,team
52,01 dec 2018,sonia,team
using
awk 'NR' file
CREDITS,EXPDATE,USER,GROUPS
99,01 jun 2018,sylvain,team:::admin
52,01 dec 2018,sonia,team
52,01 dec 2018,sonia,team
awk '{print NR}' file
1
2
3
4
Thus, I was expecting the two to behave the same way when using NR>2&&NR<5. Here is what I got:
awk 'NR>2&&NR<5' file
52,01 dec 2018,sonia,team
52,01 dec 2018,sonia,team
awk '{print NR>2&&NR<5}' file
Nothing shows up
Could you tell me why the last command isn't showing a count from 3 to 4, as it is displayed when using awk '{print NR}' file? Is it not possible to mix a range of NR with the print command?
Thanks in advance!
awk 'NR>2&&NR<5' Input_file checks the condition: if the line number is greater than 2 AND less than 5, then run the default action, which is printing the current line.
In your code awk '{print NR>2&&NR<5}' Input_file, you write print first and the condition after it, which is NOT how awk reads it: the > after print is parsed as output redirection, so awk writes NR to a file (named by the value of the expression to the right of >) instead of the terminal, which is why nothing shows up.
awk works on the pattern:
condition_check/regexp{action}
If NO action is given, the default action is to print the current line, which is what happens in your first command.
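So awk 'NR' file is just shorthand for spelling the action out (NR is non-zero, hence true, for every line):
awk 'NR{print $0}' file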
More analysis:
To prove the point that {print (NR>2&&NR<5)} will NOT behave like awk's default regexp/condition_check{action} method, run this:
awk '{print (NR>2&&NR<5)}' Input_file
See the output it provides:
0
0
1
1
Lines 3 and 4 show 1, meaning the condition is TRUE for those lines, while 0 means the condition is FALSE. So we can see that when a condition is wrapped in (..) inside a print statement, awk prints the condition's STATE rather than using it to select lines.
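And to answer the original question directly: you can mix a range of NR with print; the condition just has to stay in the pattern position (or inside an if), for example:
awk 'NR>2 && NR<5{print NR}' file
3
4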

How to sum the values in one data column of duplicate rows with awk and remove the duplicate rows?

I have a CSV file with a heading row and multiple data rows each with 11 data columns like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Mildred#email.com,205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Misty#email.com,205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to remove the duplicates in that file and sum the values in the Quantity data column. I want the result to be like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to sum only the values in the fifth data column Quantity while leaving the rest as it is. I have tried the solution in Sum duplicate row values with awk, but the answer there works only if the file has only two data columns. My CSV file has 11 data columns and so it doesn't work.
How to do it with awk?
awk to the rescue!
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5; $5="~"; a[$0]+=q}
END {for(k in a) {sub("~",a[k],k); print k}}' file
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
note that the order of records is not guaranteed, but the input doesn't need to be pre-sorted either. To preserve the order there are multiple solutions...
Also, I use ~ as a placeholder. If your data may include this char, you can replace it with an unused one.
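For instance, a control character is very unlikely to appear in CSV text; here is a minimal variation of the same script using \034 (the default value of awk's SUBSEP) as the placeholder, assuming that byte never occurs in your data:
awk 'BEGIN{FS=OFS=","; P="\034"}   # P: placeholder assumed absent from the data
     NR==1{print; next}
     {q=$5; $5=P; a[$0]+=q}
     END{for(k in a){sub(P,a[k],k); print k}}' file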
UPDATE
To preserve the order (based on first appearance of a row)
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5;$5="~"; if(!($0 in a)) b[++c]=$0; a[$0]+=q}
END {for(k=1;k<=c;k++) {sub("~",a[b[k]],b[k]); print b[k]}}' file
We keep a separate structure to record the order of the rows, and iterate over that data structure in the END block...
Taking a direct adaptation of Karafka's solution and adding a bit of code to it to print the lines in the proper order (the one in which they are present in Input_file), as per the OP's request.
awk -F, '
FNR==1{
print;
next}
{
val=$5;
$5="~";
a[$0]+=val
}
!b[$0]++{
c[++count]=$0}
END{
for(i=1;i<=count;i++){
sub("~",a[c[i]],c[i]);
print c[i]}
}' OFS=, Input_file
Explanation: now also adding an explanation for the above code.
awk -F, ' ##Setting the field separator as comma here.
FNR==1{ ##Checking whether this is line number 1; if so, do the following.
print; ##Print the current line.
next} ##next skips all further statements for this line.
{
val=$5; ##Create a variable named val whose value is the 5th field of the current line.
$5="~"; ##Set the 5th field to ~ so duplicate rows become identical (this forms the index for array a).
a[$0]+=val ##Accumulate val in array a, indexed by the current (masked) line.
}
!b[$0]++{ ##If this (masked) line has not been seen before, do the following.
c[++count]=$0} ##Store it in array c, indexed by an increasing count, to remember the input order.
END{ ##Starting the END block of the awk code here.
for(i=1;i<=count;i++){ ##Loop from 1 to the value of the count variable.
sub("~",a[c[i]],c[i]); ##Substitute the ~ in each stored line with the SUMMED value of $5.
print c[i]} ##Print the stored line, with $5 now replaced by its actual total.
}' OFS=, Input_file ##Set OFS to comma and name the Input_file here too.
This solution uses an extra array to guarantee the output is deduped and pre-sorted in original input order:
1st array tracks input row order,
2nd one both de-dupes and sums up $5
% cat testfile.txt
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Mildred#email.com,205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Martha#email.com,205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,Misty#email.com,205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
% < testfile.txt mawk 'BEGIN {
print $( (FS=OFS=",")*(getline))\
($(!(__=(_+=(_=_~_)+_)+--_-(_="")))=_)
} (______=($+__)($__=_=""))==(___[_=$_]+=______) {
____[++_____]=_
} END {
for(_^=!__;_<=_____;_++) {
print $(($__=___[$!_=____[_]])<-_) } }'
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,Mildred#email.com,205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,Martha#email.com,205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,Misty#email.com,205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

Remove commas and format dates

I have a large file with entries such as:
<VAL>17,451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2,405.00</VAL>
<DATE>31 Jul 2013</DATE>
<DATE>31 Jul 2013</DATE>
<DATE>31 Dec 2014</DATE>
<DATE>21 Jun 2002</DATE>
<DATE>10 Jul 2002</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>13 Jun 2013</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>21 Aug 2007</DATE>
<DATE>10 Jul 2002</DATE>
<VAL>-4,122,322.03</VAL>
I need to remove the commas in the VAL fields and change the dates to YYYY-MM-DD (e.g. 2013-07-31) in the DATE fields.
Looking for a quick (efficient) way of doing this.
Thanks
This should get you started:
awk -F"[<>]" 'BEGIN {split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",month," ");for (i=1;i<=12;i++) mdigit[month[i]]=i} /<VAL>/ {gsub(/\,/,"")} /<DATE>/ {split($3,a," ");$0=sprintf("<DATE>%s-%02d-%02d</DATE>",a[3],mdigit[a[2]],a[1])}1' file
<VAL>17451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2405.00</VAL>
<DATE>2013-07-31</DATE>
<DATE>2013-07-31</DATE>
<DATE>2014-12-31</DATE>
<DATE>2002-06-21</DATE>
<DATE>2002-07-10</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>2013-06-13</DATE>
<DATE>2002-07-10</DATE>
<DATE>2002-07-10</DATE>
<DATE>2007-08-21</DATE>
<DATE>2002-07-10</DATE>
<VAL>-4122322.03</VAL>
sed '# init month converter in the hold buffer
1{h;s/.*/Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;x;}
# change Val
/^<VAL>/ s/,//g
# Change Date
/^<DATE>/ {
# change month
G
s/[[:space:]]\{1,\}\([A-Z][a-z][a-z]\)[[:space:]]\{1,\}\(.*\)\n.*\1\([0-9][0-9]\).*/-\3-\2/
# reformat order
s/>\(.*\)-\(.*\)-\(.*\)</>\3-\2-\1</
}' YourFile
POSIX sed, with no extra subshell for the date conversion.
Reformatting the date takes two s/// commands here; they could be merged into a single s///, at the cost of an even less readable regex (it is already quite attractive as it is).
Some safety checks on the source date (e.g. rejecting a badly formatted date) could easily be added.
Your input seems like XML. I'd use a proper XML handling tool, e.g. XML::XSH2, a wrapper around Perl's XML::LibXML:
open file.xml ;
for //VAL set . xsh:subst(., ',', '','g') ;
perl { use Time::Piece } ;
for my $d in //DATE {
$t = $d/text() ;
set $d/text() { Time::Piece->strptime($t, '%d %b %Y')->ymd } ;
}
save :b ;
This might work for you (GNU sed & bash):
sed -r '/^<VAL>/s/,//g;/^(<DATE>)(.*)(<\/DATE>)$/s//echo "\1"$(date -d "\2" +%F)"\3"/e' file
This removes all commas on a line starting <VAL> and for those lines that contain date tags, uses the date utility and the evaluate flag in the substitution command to rearrange the date to YYYY-MM-DD.
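To see what the e flag actually runs, take the first date line: after the substitution the pattern space holds the shell command
echo "<DATE>"$(date -d "31 Jul 2013" +%F)"</DATE>"
and its output, <DATE>2013-07-31</DATE>, replaces the line (GNU sed and GNU date assumed).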
An alternative solution, using only sed commands:
sed -r '/^<VAL>/s/,//g;/^<DATE>/!b;s/$/\nJan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;s/^(<DATE>)(..) (...) (....)(<\/DATE>\n).*\3(..)/\1\4-\6-\2\5/;P;d' file
This appends a lookup table to the end of each date line and uses a regexp with backreferences to rearrange the output.

convert month from Aaa to xx in a little script with awk

I am trying to report on the number of files created on each date. I can do that with this little one-liner:
ls -la foo*.bar|awk '{print $7, $6}'|sort|uniq -c
and I get a list of how many fooxxx.bar files were created on each date, but the month is in the form Aaa (e.g. Apr) and I want xx (e.g. 04).
I have a feeling the answer is in here:
awk '
BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
months[d[o]]=sprintf("%02d",o)
}
format = "%m/%d/%Y %H:%M"
}
{
split($4,time,":")
date = (strftime("%Y") " " months[$2] " " $3 " " time[1] " " time[2] " 0")
print strftime(format, mktime(date))
}'
But I have little to no idea what I need to strip out, and no idea how to pass $7 to whatever I carve out of this in order to convert Apr to 04.
Thanks!
Here's the idiomatic way to convert an abbreviated month name to a number in awk:
$ echo "Feb" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
02
$ echo "May" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
05
Each month name occupies three characters, so index() returns 1 for Jan, 4 for Feb, 7 for Mar, and so on; adding 2 and dividing by 3 maps those positions to the month numbers 1 through 12. Let us know if you need more info to solve your problem.
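Applied to the ls-based one-liner from the question (hedged: ls output columns vary between systems and with file age; this assumes the month name is in $6 and the day in $7, as in the question):
ls -la foo*.bar | awk '{printf "%s %02d\n",$7,(index("JanFebMarAprMayJunJulAugSepOctNovDec",$6)+2)/3}' | sort | uniq -c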
Assuming the month names only appear in the month column, you could do this:
ls -la foo*.bar|awk '{sub(/Jan/,"01");sub(/Feb/,"02");print $7, $6}'|sort|uniq -c
Just use the field number of your month as an index into the months array:
print months[$6]
Since ls output differs from system to system, and sometimes on the same system depending on file age, and you didn't give any examples, I have no way of knowing how to guide you further.
Oh, and don't parse ls.
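Putting that together with the BEGIN block from the script quoted in the question, a sketch (again assuming the month name is in $6 and the day in $7 of your ls output):
ls -la foo*.bar | awk '
BEGIN {
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
    for (o = 1; o <= m; o++) months[d[o]] = sprintf("%02d", o)   # Jan -> 01, ..., Dec -> 12
}
{ print $7, months[$6] }' | sort | uniq -c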
To parse AIX istat, I use:
istat .profile | grep "^Last modified" | read dummy dummy dummy mon day time dummy yr dummy
echo "M: $mon D: $day T: $time Y: $yr"
-> M: Mar D: 12 T: 12:05:36 Y: 2012
To parse the AIX istat month, I use this two-liner on AIX 6.1 with ksh88:
monstr="???JanFebMarAprMayJunJulAugSepOctNovDec???"
mon="Oct" ; hugo=${monstr%${mon}*} ; hugolen=${#hugo} ; let hugol=hugolen/3 ; echo "Month: $hugol"
-> Month: 10
A result of 1..12 means the month name was OK.
A result less than 1 or greater than 12 means the month name was not OK.
Instead of "hugo", use meaningful names ;-))
Adding a version for AIX that shows how to retrieve all the date elements (in whatever timezone you need them in) and display an iso8601 output:
tempTZ="UTC" ; TZ="$tempTZ" istat /path/to/somefile \
| grep modified \
| awk -v tmpTZ="$tempTZ" '
BEGIN {Mmms="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec";
n=split(Mmms,Mmm," ") ;
for(i=1;i<=n;i++){ mm[Mmm[i]]=sprintf("%02d",i) }
}
{ printf("%s-%s-%sT%s %s",$NF, mm[$4], $5, $6, tmpTZ ) }
' ## this will output an iso8601 date of the modification date of that file,
## for ex: 2019-04-18T14:16:05 UTC
## you can set tempTZ to anything, e.g. tempTZ="UTC+2" to see that date in the UTC+2 timezone, or tempTZ="EST", etc.
I show the iso8601 version to make it better known and more widely used, but of course you may only need the "mm" portion, which is easily done: mm[$4].