I have an xml file with the following data.
<record record_no = "2" error_code="100">"18383531";"22677833";"21459732";"41001";"394034";"0208";"Prime Lending - ;Corporate - 2201";"";"Prime Lending - Lacey - 2508";"Prime Lending - Lacey - 2508";"1";"rrvc";"Tiffany Poe";"HEIDI";"BUNDY";"000002274";"2.0";"18.0";"2";"362661";"Rejected by IRS";"A1AAA";"20160720";"1021";"HEDI & Bundy";"4985045838";"PPASSESS";"Web";"3683000826";"823";"IC W2";"";"";"";"";"Rapid_20160801_Monthly.txt";"20160720102100";"";"20160803095309";"286023";"RGT";"1";"14702324400223";"14702324400223";"0";"OMCProcessed"
I'm using the following code:
cat RR_00404.fin.bc_lerr.xml.bc | awk 'BEGIN { FS=OFS=";" } /<record/ { gsub(/&quot;/,"\""); gsub(/.*="|">.*/,"",$1); print $1,$40,$43,$46,"'base_err_xml'","0",$7; }'
The idea is to do the following:
Replace "e; with "
Extract the error_code
Print " and ; seperated values.
Use sqlldr to load ( not to worry about this).
Problem to solve:
There is ; within the text. e.g Prime Lending -;Corporate - 2201
There's &
Output:
100;"20160803095309";"1";"1";"base_err_xml";"0";"Prime Lending
100;"286023";"14702324400223";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate - 2201"
100;"286024-1";"";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate - 2201"
awk is the wrong tool for this job without some preprocessing. Here, we use XMLStarlet for the first pass (decoding all XML entities and splitting the attributes off into separate fields), and GNU awk for the second (reading those fields and performing whatever transforms or logic you actually need):
#!/bin/sh
# reads XML on stdin; puts record_no in first field, error code in second,
# ...record content for remainder of output line.
xmlstarlet sel -t -m '//record' \
-v ./@record_no -o ';' \
-v ./@error_code -o ';' \
-v . -n
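For the sample record above, this first pass should emit a line roughly like the following (entities decoded, attributes pulled out front):
2;100;"18383531";"22677833";"21459732";"41001";"394034";"0208";"Prime Lending - ;Corporate - 2201";...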
...and, cribbed from the GNU awk documentation...
#!/usr/bin/gawk -f
# must be GNU awk for the FPAT feature
BEGIN {
FPAT = "([^;]*)|(\"[^\"]*\")"
}
{
print "NF = ", NF
for (i = 1; i <= NF; i++) {
printf("$%d = <%s>\n", i, $i)
}
}
Here, what we're doing with gawk is just showing how the fields get split, but obviously, you can modify the script for whatever needs you have.
A subset of output from the above for your given input file (when extended to actually be valid XML) is quoted below:
$1 = <2>
$2 = <100>
$9 = <"Prime Lending - ;Corporate - 2201">
Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.
Obviously, you can encapsulate both these components in shell functions to avoid the need for separate files.
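For instance, a minimal sketch of that encapsulation (fields.awk here is a hypothetical file holding the gawk script above; the names are illustrative):
#!/bin/sh
# hypothetical wrapper: first pass flattens the XML, second pass splits fields
extract_records() {
    xmlstarlet sel -t -m '//record' \
        -v ./@record_no -o ';' \
        -v ./@error_code -o ';' \
        -v . -n
}
extract_records < RR_00404.fin.bc_lerr.xml.bc | gawk -f fields.awk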
I was trying to do masking of a file with the commands tr and awk but am failing with the error fatal: cannot open pipe (Too many open pipes). The FILE has approx 1,000,000 records, quite a huge number.
Below is the code I am trying:
awk - F "|" - v OFS="|" '{ "echo \""$1"\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\"" | get line $1}1' FILE.CSV > test.CSV
It is showing this error:
awk: (FILENAME=- FNR=1019) fatal: cannot open pipe `echo ""TTP_123"" | tr "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"' (Too many open pipes)
Please let me know what I am doing wrong here.
Also note: any number of columns could be used for masking, and they can be at any positions. In this example I have taken column positions 1 and 2, but it could be 3 and 10, or 5, 7, and 25.
Thanks,
AJ
First things first, you can't have a space between - and F or v.
I was going to suggest sed, but as you only want to translate the first column, that's not as easy.
Unfortunately, awk doesn't have built-in tr functionality, so you'd have to use the shell like you are and just close the pipe:
awk -F "|" -v OFS="|" '{
command="echo \"\\"$1"\\\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\""
command | getline $1
close(command)
}1' FILE.CSV > test.CSV
However, I suggest using perl, which can do field splitting and character translation:
perl -F'\|' -lane '$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/; print join("|", @F)' FILE.CSV > test.CSV
Or, for a shorter command line, just put the program into a file, drop the e in -lane and use the file name instead of the '...' command.
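For instance (mask.pl being a hypothetical file holding the body of the one-liner above):
$ cat mask.pl
$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/;
print join("|", @F);
$ perl -F'\|' -lan mask.pl FILE.CSV > test.CSV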
You can do the mapping in awk instead of making a system call for each line, or perhaps simply:
paste -d'|' <(cut -d'|' -f1 file | tr '0-9' 'a-z') <(cut -d'|' -f2- file)
replace the tr arguments with yours.
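With the question's translation sets filled in, that would be something like this (bash is needed for the <() process substitution):
paste -d'|' \
  <(cut -d'|' -f1 FILE.CSV | tr ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' ' QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq') \
  <(cut -d'|' -f2- FILE.CSV) > test.CSV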
This does not answer your question, but you can implement tr as an awk function that would save having to spawn lots of external processes
$ cat tr.awk
function tr(str, from, to, s,i,c,idx) {
s = ""
for (i=1; i<=length(str); i++) {
c = substr(str, i, 1)
idx = index(from, c)
s = s (idx == 0 ? c : substr(to, idx, 1))
}
return s
}
{
print $1, tr($1,
" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
}
Example:
$ printf "%s\n" hello wor-ld | awk -f tr.awk
hello KGCCN
wor-ld 3N8-CF
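Tied back to the masking question at the top of this thread, the main rule might look like this (a sketch assuming the tr() function above sits in the same file):
# hypothetical main rule: mask column 1 in place, pass the rest through
BEGIN { FS = OFS = "|" }
{
    $1 = tr($1,
        " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
        " QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
    print
}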
I currently have this ORIGINAL string:
mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks $.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks
#.......................................................^
....and need to replace as follows:
mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks ..\..\..\..\.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks
#.......^......^...........^..........^.................^^^^^^^^^^^^
i.e. replace $ with "..\" 4 times based on the number of slashes in the 2nd column of the ORIGINAL string (".\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks")
I can for example do the following individually:
awk -F '\\' '{print NF-1}' --> to print the number of occurrences of the backward slash
sed -e "s,\\\,$(printf '..\\\%.0s' {1..4}),g" --> to replace and repeat the string pattern
...but I'm unsure how to string them together in one command line.
An awk command that works with an arbitrary number of input lines (with the same field structure):
awk -v prefixUnit="..\\" '{
count = gsub("\\\\", "\\", $2) # count the number of "\"s
# Build the prefix composed of count prefixUnit instances
prefix = ""; for (i=1; i<=count; ++i) prefix = prefix prefixUnit
$3 = prefix substr($3, 2) # prepend the prefix to the 3rd field, replacing the "$"
print # print result
}' file
As a condensed one-liner:
awk -v pu="..\\" '{n=gsub("\\\\","\\",$2);p="";for(i=1;i<=n;++i)p=p pu;$3=p substr($3,2);print}' file
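Run against the ORIGINAL line from the question, both forms should print:
mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks ..\..\..\..\.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks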
I'd use perl, although this is a bit "clever"
perl -ane '
$n = ($F[1] =~ tr/\\/\\/);
$F[2] =~ s{\$}{"..\\" x $n}e;
print join(" ", #F);
' file
where
-a - autosplit the line into the array @F
-n - execute the body for each line of "file"
$F[1] =~ tr/\\/\\/ - count the number of backslashes in the 2nd field
$F[2] =~ s{\$}{"..\\" x $n}e - replace the dollar with the right number of "..\"
awk one-liner, assuming that there is just a single line to process
awk -F\\ -v RS='$' -v ORS='' '{print} NR==1{for(i=1; i< NF; i++) printf("..\\")}'
Explanation
awk -F\\ -v RS='$' -v ORS='' '   # Specify separators
{print} # print the line
NR==1{ # for first record
for (i=1; i<NF; i++) # print multiple ..\
printf("..\\") # need to escape \\
}
'
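A quick sanity check against the question's input (this should reproduce the desired line):
$ echo 'mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks $.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks' | \
  awk -F\\ -v RS='$' -v ORS='' '{print} NR==1{for(i=1; i<NF; i++) printf("..\\")}'
mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks ..\..\..\..\.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks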
Experts,
I have the following text in an xml file (there will be 20,000 rows in the file).
<record record_no = "1" error_code="101">"21006041";"28006041";"34006211";"43";"101210-0001"
Here is how I need the result for each row to look, appended to a new file.
"21006041";"28006041";"34006211";"43";"101210-0001";101
Here is what I need to do to get the above result.
Replace &quot; with "
remove <record record_no = "1" error_code="
Get the text 101 (it can have any value in this position)
Append it to the end.
Here is what I have been trying.
BEGIN { FS=OFS=";" }
/<record/ {
gsub(/"/,"\"")
gsub(/'/,"")
gsub(/.*="|">.*/,"",$1)
$(NF+1)=$1;
$1="";
print $0;
}
This should do the trick.
awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/"/,"\""); print $2,$1}'
The strategy is to:
split the string at the closing chars of the xml element (">).
remove the first bit of the xml element including the attribute names leaving only the error code.
replace all " xml entities with ".
print the two FS sections in reverse order.
Test it out with the following data generation script. The script will generate 500 files of 20,000 lines each, with records of random length, some with dashes in the values.
#!/bin/bash
recCount=0
for h in {1..500};
do
for i in {1..20000};
do
((recCount++))
error=$(( RANDOM % 998 + 1 ))
record="<record record_no = "'"'"${recCount}"'"'" error_code="'"'"${error}"'"'">"
upperBound=$(( RANDOM % 4 + 5 ))
for (( k=0; k<${upperBound}; k++ ));
do
randomVal=$(( RANDOM % 99999999 + 1))
record+=""${randomVal}"
if [[ $((RANDOM % 4)) == 0 ]];
then
randomVal=$(( RANDOM % 99999999 + 1))
record+="-${randomVal}"
fi
record+="""
if [[ $k != $(( ${upperBound} - 1 )) ]];
then
record+=";"
fi
done;
echo "${record}" >> "file-${h}.txt"
done;
done;
On my laptop I get the following performance.
$ time cat file-*.txt | awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/&quot;/,"\""); print $2,$1}' > result
real 0m18.985s
user 0m17.673s
sys 0m2.697s
As an added bonus, here is the "equivalent" command in sed:
sed -e 's|\(&quot;\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g'
Much slower, although the strategy is the same. Two expressions are used. First, replace all &quot; xml entities with ". Then group all characters (.+) after > and display the remembered patterns in reverse order: \2;\1.
Timing statistics:
$ time cat file-* | sed -e 's|\(&quot;\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g' > result.sed
real 5m59.576s
user 5m56.136s
sys 0m9.850s
Is this too thick:
$ awk -F""+" -v OFS='";"' -v dq='"' '{gsub(/^.*="|">$/,"",$1);print dq""$2,$4,$6,$8,$10dq";"$1}' test.in
"21006041";"28006041";"34006211";"43";"101210-0001";101
I have a command that gets the next ID of a table from a pool of sql files. Now I am trying to put this command as an alias in ~/.bashrc using a shell function, but I did not figure out how to escape $ so it gets to awk and is not replaced by bash. Here's the code in .bashrc:
function nextval () {
grep 'INSERT INTO \""$1"\"' *.sql | \
awk '{print $6}' | \
cut -c 2- | \
awk -F "," '{print $1}' | \
sort -n | \
tail -n 1 | \
awk '{print $0+1}'
}
alias nextval=nextval
Usage: # nextval tablename
Escaping with \$ I get the error: awk: backslash not last character on line.
The $ is not inside double quotes, so why is bash replacing it?
Perhaps the part you really need to change is this
'INSERT INTO \""$1"\"'
to
"INSERT INTO \"$1\""
@konsolebox answered your question, but you could also write the function without so many tools and pipes, e.g.:
function nextval () {
awk -v tbl="$1" '
$0 ~ "INSERT INTO \"" tbl "\"" {
split( substr($6,2), a, /,/ )
val = ( ((val == "") || (a[1] > val)) ? a[1] : val)
}
END { print val+1 }
' *.sql
}
It's hard to tell if the above is 100% correct without any sample input or expected output to test it against but it should be close.
I have a file (user.csv) like this:
ip,hostname,user,group,encryption,aduser,adattr
I want to print all columns sorted by user.
I tried awk -F ":" '{print|"$3 sort -n"}' user.csv, but it doesn't work.
How about just sort?
sort -t, -nk3 user.csv
where
-t, - defines your delimiter as ,.
-n - gives you numerical sort. Added since you used it in your
attempt. If your user field is text only then you don't need it.
-k3 - defines the field (key). user is the third field.
Use awk to put the user ID in front.
Sort
Use sed to remove the duplicate user ID, assuming user IDs do not contain any spaces.
awk -F, '{ print $3, $0 }' user.csv | sort | sed 's/^.* //'
Seeing as the original question was about how to use awk, every single one of the first 7 answers uses sort instead, and this is the top hit on Google, here is how to use awk.
Sample net.csv file with headers:
ip,hostname,user,group,encryption,aduser,adattr
192.168.0.1,gw,router,router,-,-,-
192.168.0.2,server,admin,admin,-,-,-
192.168.0.3,ws-03,user,user,-,-,-
192.168.0.4,ws-04,user,user,-,-,-
And sort.awk:
#!/usr/bin/awk -f
# usage: ./sort.awk -v f=FIELD FILE
BEGIN {
FS=","
}
# each line
{
a[NR]=$0 ""
s[NR]=$f ""
}
END {
isort(s,a,NR);
for(i=1; i<=NR; i++) print a[i]
}
#insertion sort of A[1..n]
function isort(S, A, n,    i, j, hs, ha) {
for( i=2; i<=n; i++) {
hs = S[j=i]
ha = A[j=i]
while (S[j-1] > hs) {
j--;
S[j+1] = S[j]
A[j+1] = A[j]
}
S[j] = hs
A[j] = ha
}
}
To use it:
awk -f sort.awk f=3 < net.csv # OR
chmod +x sort.awk
./sort.awk f=3 net.csv
You can choose a delimiter; in this case I chose a colon, printed column number one, and sorted in alphabetical order:
awk -F\: '{print $1|"sort -u"}' /etc/passwd
awk -F, '{ print $3, $0 }' user.csv | sort -nk2
and for reverse order
awk -F, '{ print $3, $0 }' user.csv | sort -nrk2
try this -
awk '{print $0|"sort -t',' -nk3 "}' user.csv
OR
sort -t',' -nk3 user.csv
awk -F "," '{print $0}' user.csv | sort -nk3 -t ','
This should work.
To exclude the first line (header) from sorting, I split it out into two buffers.
df | awk 'BEGIN{header=""; body=""} { if(NR==1){header=$0}else{body=body"\n"$0}} END{print header; print body|"sort -nk3"}'
With GNU awk:
awk -F ',' '{ a[$3]=$0 } END{ PROCINFO["sorted_in"]="@ind_str_asc"; for(i in a) print a[i] }' file
See 8.1.6 Using Predefined Array Scanning Orders with gawk for more sorting algorithms.
I'm running Linux (Ubuntu) with mawk:
tmp$ awk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
mawk (and gawk) has an option to redirect the output of print to a command. From man awk, chapter 9 (Input and output):
The output of print and printf can be redirected to a file or command by appending > file, >> file or | command to the end of the print statement. Redirection opens file or command only once, subsequent redirections append to the already open stream.
Below you'll find a simplified example of how | can be used to pass the wanted records to an external program that does the hard work. This also nicely encapsulates everything in a single awk file and reduces the command line clutter:
tmp$ cat input.csv
alpha,num
D,4
B,2
A,1
E,5
F,10
C,3
tmp$ cat sort.awk
# print header line
/^alpha,num/ {
print
}
# all other lines are data lines that should be sorted
!/^alpha,num/ {
print | "sort --field-separator=, --key=2 --numeric-sort"
}
tmp$ awk -f sort.awk input.csv
alpha,num
A,1
B,2
C,3
D,4
E,5
F,10
See man sort for the details of the sort options:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-k, --key=KEYDEF
sort via a key; KEYDEF gives location and type
-n, --numeric-sort
compare according to string numerical value