awk to store string format in variable - awk

In the below awk, when I echo f it is empty, but if I remove the $f I get the desired results; however, the new formatting is not stored in the $d variable. Basically I am trying to convert the string in the $d variable into a newly formatted variable $f. Thank you :).
file
ID,1A
DATE,220102
awk
d=$(awk -F, '/Date/ {print $2}' file) | f=$(date -d "$d" +'%Y-%m-%d')
--- desired value of f ---
2022-01-02

You need to use it this way to return a value from awk and set a shell variable:
f=$(date -d "$(awk -F, '/DATE/ {print $2}' file)" +'%Y-%m-%d')
echo "$f"
2022-01-02

With awk:
awk 'BEGIN{FS=","; OFS="-"} $1=="DATE"{ print "20" substr($2,1,2), substr($2,3,2), substr($2,5,2) }' file
Output:
2022-01-02
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
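The commas between the print arguments are what emit OFS between the pieces. A minimal check of that behavior (a sketch, not part of the answer above):
$ echo 'a b c' | awk -v OFS='-' '{ print $1, $2, $3 }'
a-b-c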

With your shown samples, please try the following awk code (written and tested with GNU awk). It uses awk's match() function with the regex ^DATE,([0-9]{2})([0-9]{2})([0-9]{2})$ to create 3 capturing groups and store the matched values in an array named arr; once the match succeeds, it prints "20" followed by all 3 array values, separated by - as per the required output.
awk -v OFS="-" '
match($0,/^DATE,([0-9]{2})([0-9]{2})([0-9]{2})$/,arr){
    print "20" arr[1], arr[2], arr[3]
}
' Input_file

While the other answers provide a more efficient method of reformatting the date (and assuming the OP has no need for d in follow-on code), I'm going to focus solely on a couple of issues with the OP's current code:
- the awk script needs to match DATE (all caps) instead of Date
- the current code attempts to pipe the output from d=$(...) into the f=$(...) portion; while this does 'work' in that f will be assigned 2022-01-02, the assignment happens in a subprocess, and upon exiting that subprocess f is effectively 'unassigned'. What the OP really needs is to separate the d=$(...) and f=$(...) commands so that both assignments occur in the current shell, which can be done by replacing the pipe with a semicolon.
If we make these two simple edits:
# old code: matches 'Date' and joins the two assignments with a pipe
d=$(awk -F, '/Date/ {print $2}' file) | f=$(date -d "$d" +'%Y-%m-%d')

# new code: matches 'DATE' and joins the two assignments with a semicolon
d=$(awk -F, '/DATE/ {print $2}' file) ; f=$(date -d "$d" +'%Y-%m-%d')
OP's code will now generate the desired result:
$ echo "${f}"
2022-01-02
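A minimal demonstration of why the pipe loses the assignment (a sketch, assuming a POSIX-like shell such as bash with default settings):
$ x=old
$ echo ignored | x=new     # the assignment runs in a subshell
$ echo "$x"
old
$ x=old ; x=new            # both assignments run in the current shell
$ echo "$x"
new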

The string approaches:
{n,g}awk -F'^[^,]*,' 'gsub("^....|..", "-&", $(_=!(NF*=NF==NR)))\
($+_ = substr($+_,++_+_--))^_' OFS=20
mawk -F'^[^,]*,' '$(gsub("^....|..", "-&",
$!(NF*=NF==NR))*(_=!NF)) = substr($_,++_+_)' OFS=20
mawk2 'gsub("^....|..", "-&",
$!(NF*=NF==NR)) + sub(".",_)^_' FS='^.+,' OFS=20
The numeric approach:
mawk -F',' 'NF==NR && ($!NF = sprintf("20%.*s-%.*s-%0*.f", _+=_^=_<_,
__ = $NF, _++, substr(__,_), --_, __%(_+_*_*_)^_))'
2022-01-02

Related

How to put a comma in between awk when filtering columns in bash shell script?

I want to put a comma between outputs from awk in a bash shell script (Linux).
This is a snippet of my original command:
awk '{print $13, $10}' >> test.csv
If I put a comma between $13 and $10, I get a space between the two columns, but what I want is a comma between them.
I'm very new to this and I can't find any resources about this online, so bear with me if this is a simple mistake. Thank you
suggestion 1
awk '{print $13 ";" $10}' >> CPU2.csv
suggestion 2
awk '{print $13, $10}' OFS=";" >> CPU2.csv
suggestion 3
awk '{printf("%s;%s\n", $13, $10)}' >> CPU2.csv
echo "1 2 3"|awk '{a=";";print $1a$2","$3}'
Not as elegant as I hoped, but this should work:
mawk 'NF=($+_=$13(_)$10)~_' \_=\;
It first overwrites the entire row with just $13 and $10, with the semicolon ; in between. _ holds a semicolon, so the numeric evaluation of $+_ is identical to $0, and since the delimiter has been forced in between them, the regex test for the presence of a semicolon will always yield true (1), making NF = 1 and printing just that.
Assigning 1 to NF takes the place of the usual $1 = $1.
NF isn't 2 because the new separator sits between the fields instead of a space or tab, so even though $0 was overwritten, awk doesn't find the separator it needs to split out a 2nd field.
Tested on mawk 1.3.4, mawk 1.9.9.6, macOS nawk, and gawk 5.1.1, including gawk -t traditional flag and gawk -P posix mode.
-- The 4Chan Teller
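To see the NF-assignment trick from the answer above in isolation: assigning to NF makes awk rebuild $0 from the remaining fields joined by OFS, the same rebuild the idiomatic $1 = $1 triggers (a hedged sketch):
$ echo 'a b c d' | awk -v OFS=';' '{ NF = 2 } 1'
a;b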

AWK, Comma delimited fields enclosed in quotes [duplicate]

The intent of this question is to provide a canonical answer.
Given a CSV as might be generated by Excel or other tools with embedded newlines and/or double quotes and/or commas in fields, and empty fields like:
$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
What's the most robust and efficient way, using awk, to identify the separate records and fields:
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
so it can be used as those records and fields internally by the rest of the awk script.
A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.
The solution must tolerate the end of record just being LF (\n) as is typical for UNIX files rather than CRLF (\r\n) as that standard requires and Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It will specifically not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):
$ echo 'foo,"field,""with"",commas",bar' |
awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
or the equivalent using any awk:
$ echo 'foo,"field,""with"",commas",bar' |
awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{
    rec = $0
    $0 = ""
    i = 0
    while ( (rec != "") && match(rec, fpat) ) {
        $(++i) = substr(rec, RSTART, RLENGTH)
        rec = substr(rec, RSTART+RLENGTH+1)
    }
    for (i=1; i<=NF; i++) print i " <" $i ">"
}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the specific FPAT setting I use above.
If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:
$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1 fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",
Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:
$ cat decsv.awk
function buildRec( fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( gsub(/^"|"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}
# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1; i<=NF; i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf " $%d=<%s>\n", i, $i
    }
    print "----"
}
$ awk -f decsv.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
It works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but doesn't have to be) is mid-field and so we keep building the current record but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now complete record.
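Here is the quote-counting idea in isolation, stripped of the field splitting (a minimal sketch with a hypothetical countquotes.awk, not the full decsv.awk above):
$ cat countquotes.awk
{
    rec = (rec == "" ? $0 : rec RS $0)
    # gsub(/"/,"&") replaces every " with itself, leaving rec unchanged,
    # but returns the count of "s in the record built so far
    if ( gsub(/"/,"&",rec) % 2 ) {
        next    # odd count: this newline was inside a quoted field
    }
    print "complete record: <" rec ">"
    rec = ""
}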
*I say "modern awk" above because there's apparently extremely old (i.e. circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too but if that's not an option then I expect you can work around that particular bug by changing this:
if ( gsub(/^"|"$/,"",fldStr) ) {
to this:
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:
@mosvy for escaped double quotes within fields.
@datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.
Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
An improvement upon @EdMorton's FPAT solution, which should be able to handle double quotes (") escaped by doubling ("", as allowed by the CSV standard).
gawk -v FPAT='[^,]*|("[^"]*")+' ...
This STILL:
- isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files;
- assumes GNU awk (gawk); a standard awk won't do.
Example:
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
    for (i=1; i<=NF; i++) {
        if ($i ~ /"/) { $i = substr($i, 2, length($i)-2); gsub(/""/, "\"", $i) }
        print "<" $i ">"
    }
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>
This is exactly what csvquote is for - it makes things simple for awk and other command line data processing tools.
Some things are difficult to express in awk. Instead of running a single awk command and trying to get awk to handle the quoted fields with embedded commas and newlines, the data gets prepared for awk by csvquote, so that awk can always interpret the commas and newlines it finds as field separators and record separators. This makes the awk part of the pipeline simpler. Once awk is done with the data, it goes back through csvquote -u to restore the embedded commas and newlines inside quoted fields.
csvquote file.csv | awk -f my_awk_script | csvquote -u
EDIT:
For a complete description of csvquote, see: How it works. This also explains the `` characters shown in places where there was a carriage return.
csvquote file.csv | awk -f decsv.awk | csvquote -u
(for the source of decsv.awk, see the answer from Ed Morton)
Output:
Record 1:
$1=<rec1 fld1>
$2=<>
$3=<rec1","fld3.1",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3fld2">
$3=<>
----
I have found csvkit to be a really useful toolkit for handling csv files on the command line.
line='test,t2,t3,"t5,"'
echo $line | csvcut -c 4
"t5,"
echo 'foo,"field,""with"",commas",bar' | csvcut -c 3
bar
It also contains tools such as csvstat and csvstack, which are very handy.
cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
csvcut -c 1 file.csv
"rec1, fld1"
"rec2, fld1.1
fld1.2"
""""""
csvcut -c 3 file.csv
"rec1"",""fld3.1
"",
fld3.2"
""
""
Awk (gawk) actually provides extensions, one of which is CSV processing; in my opinion this is the most robust way to handle CSV with gawk. The extension takes care of many gotchas and parses the CSV for you.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print a[1]}'
The output is:
James T. Kirk
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it runs { print a[1] }, i.e. it prints the first csv field of the line.
If you're using one of the common AWK interpreters (Gawk, onetrueawk, mawk), the other solutions are your best bet. However, if you're able to use a different interpreter, frawk and GoAWK have proper CSV support built-in.
frawk is a very fast AWK implementation written in Rust. Use -i csv to process input in CSV mode. Note that frawk is not quite POSIX compatible (see differences).
GoAWK is a POSIX-compatible AWK implementation written in Go. Also supports -i csv mode, as well as -H (parse header row) with @"named_field" syntax (read more). Disclaimer: I'm the author of GoAWK.
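For instance, with GoAWK's header mode on the test.csv from the gawk-extension example above, fields can be addressed by name (a sketch; the @"name" syntax is as I understand it from the GoAWK docs):
$ goawk -i csv -H '@"Phone" == "123" { print @"Name" }' test.csv
James T. Kirk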
With file.csv as per the question, you can simply use an AWK script with a regular for loop over the fields as follows:
$ cat records.awk
{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++)
        printf " $%d=<%s>\n", i, $i
    print "----"
}
Then use either frawk -i csv or goawk -i csv to get the expected output. For example:
$ frawk -i csv -f records.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
$ goawk -i csv -f records.awk file.csv
Record 1:
... same as above ...
----

Unix Shell Script - AWK delimiter issue

I have following lines in a file. Please note I have intentionally kept the extra hash between 2 and 0 in the 2nd line.
File name : test.txt
Name#|#Age#|#Dept
AC#|#2#0#|#Science
BC#|#22#|#Commerce
I am using awk to get the data in Dept column
awk -F "#|#" -v c="Dept" 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' "test.txt" >> result.txt
The result.txt shows me the following
|
Commerce
The first line of the output comes out as a pipe because of the extra # in the AC line.
Can anyone help with this?
Currently the delimiter means: match # or #. The pipe | character acts as an OR operator in the regex, so the separator is effectively just a single #; instead try using:
awk -F '#[|]#' ...
By putting | into a bracket expression [ ... ], awk matches it literally.
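The difference is easy to see on the problem line itself (a quick sketch):
$ echo 'AC#|#2#0#|#Science' | awk -F '#|#' '{ print NF }'
6
$ echo 'AC#|#2#0#|#Science' | awk -F '#[|]#' '{ print NF }'
3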
If you just want to extract Dept from your content, here's a good alternative:
awk -F'#' 'NR>1{print $NF}' test.txt
output:
Science
Commerce

AWK that reads up to the /

I have the following lines of text :
170311 005201 0433 DE(N) itemhandling itemAddBarCodeData: Barcode(1/1) <0157357069/OK> ##[ti=7672,
170311 005323 0433 DE(N) itemhandling itemAddBarCodeData: Barcode(1/1) </NOREAD> ##[ti=7672,
I have the following script :
grep "itemAddBarCodeData" %myItemHandling% | gawk -F "[<>]+" -v OFS=, "{for(i=1;i<=NF;++i){if($i~/Barcode/){print substr($1,5,2)substr($1,3,2)substr($1,1,2),substr($1,8,6),$(i+1)}}}" > %myOutputPath%%myFilename%
What I need is a script that reads only the /NOREAD and the /OK so the output is like :
11-03-17,00:52:01,NOREAD
11-03-17,00:53:23,OK
any help would be greatly appreciated
Thanks
Complex gawk approach:
awk -F"[ />]" '{patsplit($1, a, /[0-9]{2}/); patsplit($2, b, /[0-9]{2}/);
printf("%s-%s-%s,%s:%s:%s,%s\n",a[3],a[2],a[1],b[1],b[2],b[3],$10)}' inpufile
The output:
11-03-17,00:52:01,OK
11-03-17,00:53:23,NOREAD
-F"[ />]" - "composite" field separator
patsplit(string, array [, fieldpat [, seps ] ])
Divide string into pieces defined by fieldpat and store the pieces in array and
the separator strings in the seps array.
You can use the following script:
script.awk
/\/[A-Z]+>/ {
    match($1"-"$2, /(..)(..)(..)-(..)(..)(..)/, ts)
    dt = mktime( sprintf("20%s %s %s %s %s %s",
                         ts[1], ts[2], ts[3],
                         ts[4], ts[5], ts[6]) )
    dtd = strftime( "%d-%m-%y", dt )
    dts = strftime( "%H:%M:%S", dt )
    match( $0, /\/[A-Z]+>/ )   # set RSTART and RLENGTH
    print dtd, dts, substr( $0, RSTART+1, RLENGTH-2 )
}
Run it like this: awk -v OFS=, -f script.awk yourfile
The important part is the second match function call, which matches:
- a string of capital letters [A-Z]
- preceded by a /
- followed by a >.
It should match the OK and NOREAD case and not the Barcode(1/1).
The variables RSTART and RLENGTH are set by the match function; we have to correct them by +1 and -2 because the matched text includes the / and the >.
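A tiny illustration of match() setting RSTART and RLENGTH, with the same +1/-2 correction (a sketch):
$ echo 'x </OK> y' | awk '{ match($0, /\/[A-Z]+>/); print RSTART, RLENGTH, substr($0, RSTART+1, RLENGTH-2) }'
4 4 OK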
The first match call, together with mktime, strftime and the sprintf call, is another way to format the date and time. The time functions are GNU AWK extensions.
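A self-contained illustration of those GNU AWK time functions (a sketch):
$ gawk 'BEGIN {
    dt = mktime("2017 03 11 00 52 01")   # "YYYY MM DD HH MM SS" -> epoch seconds
    print strftime("%d-%m-%y", dt), strftime("%H:%M:%S", dt)
}'
11-03-17 00:52:01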
Regular awk version:
awk '
{
    d = $1 $2
    gsub(/../, "& ", d)
    split(d, T)
    split($8, R, "[/>]")
    printf "%s-%s-%s,%s:%s:%s,%s\n", T[3], T[2], T[1], T[4], T[5], T[6], R[2]
}
' file
With script in file:
script.awk:
{
    d = $1 $2
    gsub(/../, "& ", d)
    split(d, T)
    split($8, R, "[/>]")
    printf "%s-%s-%s,%s:%s:%s,%s\n", T[3], T[2], T[1], T[4], T[5], T[6], R[2]
}
awk -f script.awk file
Crammed onto one line:
awk '{d=$1$2; gsub(/../,"& ",d); split(d,T); split($8,R,"[/>]"); printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]}' file
You don't need grep when you're using awk. With GNU awk for gensub():
$ awk '/itemAddBarCodeData/{print gensub(/(..)(..)(..) (..)(..)(..).*\/([^>]+).*/,"\\3-\\2-\\1,\\4:\\5:\\6,\\7",1)}' file
11-03-17,00:52:01,OK
11-03-17,00:53:23,NOREAD
Here's a pragmatic combination of awk and sed that is conceptually relatively simple:
On Linux and BSD/macOS:
awk -F'[ />]' -v OFS=, '/itemAddBarCodeData/ {print $1, $2, $10}' file |
sed -E 's/^(..)(..)(..),(..)(..)(..)/\3-\2-\1,\4:\5:\6/'
On a Windows system, invoked from cmd.exe, different quoting and line continuation rules apply (assumes the presence of ported GNU utilities):
awk -F"[ />]" -v OFS=, "/itemAddBarCodeData/ {print $1, $2, $10}" file ^
| sed -E "s/^(..)(..)(..),(..)(..)(..)/\3-\2-\1,\4:\5:\6/"
Note how:
- "..." strings rather than '...' strings must be used to protect the embedded content from interpretation by the shell;
- unlike with "..." on Unix, $ has no special meaning to cmd.exe, so it can be used as-is;
- ^ as the very last character on a line serves as the explicit line-continuation character, and the line must be broken before the | (whereas on Unix a line ending in | is implicitly continued). This is only used for readability here; of course, you can place your command on a single line.

Run command inside awk and store result inplace

I have a script that I need to run on every value. It basically returns a number for a given argument, like below:
>>./myscript 4832
>>1100
my.csv contains the following:
123,4832
456,4833
789,4834
My command
cat my.csv | awk -F',' '{$3=system("../myscript $2");print $1,$2,$3'}
myscript is unable to understand that I'm passing the second input field $2 as an argument. I need the output from the script to be added as my 3rd output column.
The expected output is
123,4832,1100
456,4833,17
789,4834,42
where the third field is the output from myscript with the second field as the argument.
If you are attempting to add a third field with the output from myscript $2 where $2 is the value of the second field, try
awk -F , '{ printf ("%s,%s,", $1, $2); system("../myscript " $2) }' my.csv
where we exploit the convenient fact that the output from myscript (the calculated value followed by a newline) completes the output line.
This isn't really a good use of Awk; you might as well do
while IFS=, read -r first second; do
    printf "%s,%s," "$first" "$second"
    ../myscript "$second"
done <my.csv
I'm assuming you require comma-separated output; changing this to space-separated is obviously a trivial modification.
The syntax you want is:
awk 'BEGIN{FS=OFS=","}
{
    cmd = "./myscript \047" $2 "\047"
    val = ( (cmd | getline line) > 0 ? line : "NaN" )
    close(cmd)
    print $0, val
}
' file
Tweak the getline part to do different error handling if you like and make sure you read and fully understand http://awk.freeshell.org/AllAboutGetline before using getline.
In gnu-awk we can use Two-Way Communications with Another Process:
awk -F',' '{"../myscript "$2 |& getline v; print $1,$2,v}' my.csv
you get,
123 4832 1100
456 4833 17
789 4834 42
awk -F',' 'BEGIN { OFS=FS }{"../myscript "$2 |& getline v; print $1,$2,v}' my.csv
you get,
123,4832,1100
456,4833,17
789,4834,42
From the GNU awk online documentation:
system: Execute the operating system command command and then return to the awk program. Return command's exit status (see further on).
So system() gives you the command's exit status, not its output; to capture the output you need to use getline (see the "getline piped" section of the documentation).
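A minimal demonstration of the difference (a sketch):
$ awk 'BEGIN { v = system("echo 42"); print "system returned: " v }'
42
system returned: 0
$ awk 'BEGIN { cmd = "echo 42"; cmd | getline v; close(cmd); print "getline read: " v }'
getline read: 42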
You need to specify the $2 separately in the string concatenation, that is
awk -F',' '{ system("echo \"echo " $1 "$(../myexecutable " $2 ") " $3 "\" | bash"); }' my.csv