How to get the SUM without an exponent in xmlstarlet?

When using sum() in xmlstarlet, I get the result in exponent notation.
xmlstarlet sel -t -v "sum(VM_POOL/VM/MONITORING/MEMORY)" /tmp/file.xml
3.058237512e+09
If I compute the sum with awk instead, I get the number I want.
xmlstarlet sel -t -v VM_POOL/VM/MONITORING/MEMORY /tmp/file.xml | awk '{u = u+$1}; END { print u }'
3058237512
How can I get this number using only xmlstarlet?
Sample input data
<VM_POOL>
.....
<VM>
<ID>1111</ID>
<MONITORING>
<MEMORY><![CDATA[2153128]]></MEMORY>
</MONITORING>
</VM>
<VM>
<ID>1112</ID>
<MONITORING>
<MEMORY><![CDATA[2153128]]></MEMORY>
</MONITORING>
</VM>
.....
</VM_POOL>
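One possible way to keep this inside xmlstarlet, offered as an untested sketch: the -v template is evaluated as XSLT 1.0, so the XSLT format-number() function should be available to print the sum without an exponent (the '#' pattern here is an assumption).
# Untested sketch: format-number() is an XSLT 1.0 function, and xmlstarlet's
# sel templates are evaluated as XSLT, so it should be usable inside -v.
xmlstarlet sel -t -v "format-number(sum(VM_POOL/VM/MONITORING/MEMORY), '#')" /tmp/file.xml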

Related

Filtering using awk returns empty files

I have a similar problem to this question: How to do filtering of multiple files in a directory using awk?
The solution in the answers to that question does not work for me.
I have tab-delimited txt files (all in the folder Observation_by_pracid). For each file, I want to create a new file that only contains rows with a specific value in column $9 (medcodeid). The specific values are to be found in medicalcode_list.txt.
There is no error; however, it only produces empty files.
Codelist
medcodeid
2576
3199
Format of input files
patid consid ... medcodeid
500470520002 3062539302 ... 2576
951924020002 3062538414 ... 310803013
503478020002 3061587464 ... 257619018
951924020002 3062537807 ... 55627011
503576720002 3062537720 ... 3199
Desired output
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
My code
mkdir HBA1C_observation_bypracid
awk '
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Solution
mkdir HBA1C_observation_bypracid
awk '
BEGIN{ FS=OFS="\t" }
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Adding "BEGIN..." solved my problem.
You can join two files on a column using join.
Files must be sorted on the join column. To perform a numerical sort on a column, use sort this way, where N is the column number:
sort -kN -n FILE
You also need to get rid of the first line (the column names) of each file. You can use the tail command as below, where N is the line number from which you want to start output (so the 2nd line):
tail -n +N
...but you still need to display the column names:
head -n 1 FILE
To join two files f1 and f2 on field c1 of f1 and field c2 of f2, and output field y of file x:
join -1 c1 -2 c2 f1 f2 -o "x.y, x.y"
Working sample:
for input_file in *.txt ; do
head -n 1 "$input_file"
join -1 1 -2 9 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9" \
<(tail -n +2 PATH/medicalcode_list.txt | sort -k1 -n) \
<(tail -n +2 "$input_file" | sort -k9 -n)
done
Result (for the input file you gave):
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
Note: the column names aren't aligned with the values; I don't know whether that matters for you. You can format the display with the printf command.
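For example, one rough way to do that (a sketch: joined.txt is a hypothetical file holding the loop's output, and the column widths are arbitrary):
# Hypothetical alignment pass: pad the first, second and last printed columns to
# fixed widths so the header lines up with the values.
awk '{ printf "%-14s %-12s %-12s\n", $1, $2, $NF }' joined.txt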
Personally I think it would be simpler to loop over the files in the shell (understanding that this will reread the code list more than once), with a simpler awk program that you should be able to test and debug. Something like:
for file in *.txt; do
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
You should be able to start without the redirection, to make sure that for a single file you get the results you expected printed to the terminal. If you don't, there might be some incorrect assumption about the files.
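For example, a quick single-file test (the file name is made up):
# Single-file test, printing matches to the terminal; Observation_001.txt is a
# hypothetical file from the Observation_by_pracid folder.
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt Observation_001.txt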
Another option would be to write a separate awk command that generates a second awk script with the list hard-coded in it. This also gives you the chance to inspect the contents of the mlist variable.
printf 'BEGIN {\n%s\n}\n $9 in mlist { print }' \
"$(awk '{ print "mlist[" $1 "]" }' PATH/medicalcode_list.txt)" > filter.awk
for file in *.txt; do
awk -f filter.awk "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
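With the sample code list above, the generated filter.awk should look roughly like this (the header line medcodeid becomes a harmless entry keyed on the empty string, since medcodeid is an unset awk variable):
BEGIN {
mlist[medcodeid]
mlist[2576]
mlist[3199]
}
 $9 in mlist { print }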

Replace the string with either sed or awk, where it identifies a pattern as mentioned below

Could you please let me know how I can convert the INPUT below into the mentioned OUTPUT using AWK:
INPUT
CREATE TABLE ${hf:XX_DB_XX}.test_${hf:XX_YYYYMMDD_XX}
AS
SELECT id
FROM ${hf:XX_R_DB_XX}.usr_${hf:XX_YYYYMMDD_XX}
WHERE year = ${hf:XX_YYYY_XX}
AND month = ${hf:XX_MM_XX}
AND day = ${hf:XX_DD_XX};
OUTPUT
CREATE TABLE XX_DB_XX.test_XX_YYYYMMDD_XX
AS
SELECT id
FROM XX_R_DB_XX.usr_XX_YYYYMMDD_XX
WHERE year = XX_YYYY_XX
AND month = XX_MM_XX
AND day = XX_DD_XX;
Below is what I have used to convert from the given OUTPUT to INPUT.
awk '{gsub(/XX_[a-zA-Z]+_XX/,"${hf:&}")} 1' <filename>
And to reverse that I tried the following, but it did not work:
awk '{gsub(/${hf:XX_[a-zA-Z]+_XX}/,"&")} 1' <filename>
sed will do here
$ sed -E 's/\$\{hf:([^}]+)\}/\1/g' file
similarly with GNU awk
$ awk '{print gensub(/\${hf:([^}]+)}/,"\\1","g")}' file

awk field separator within the xml

I have an xml file with the following data.
<record record_no = "2" error_code="100">&quot;18383531&quot;;&quot;22677833&quot;;&quot;21459732&quot;;&quot;41001&quot;;&quot;394034&quot;;&quot;0208&quot;;&quot;Prime Lending - ;Corporate - 2201&quot;;&quot;&quot;;&quot;Prime Lending - Lacey - 2508&quot;;&quot;Prime Lending - Lacey - 2508&quot;;&quot;1&quot;;&quot;rrvc&quot;;&quot;Tiffany Poe&quot;;&quot;HEIDI&quot;;&quot;BUNDY&quot;;&quot;000002274&quot;;&quot;2.0&quot;;&quot;18.0&quot;;&quot;2&quot;;&quot;362661&quot;;&quot;Rejected by IRS&quot;;&quot;A1AAA&quot;;&quot;20160720&quot;;&quot;1021&quot;;&quot;HEDI & Bundy&quot;;&quot;4985045838&quot;;&quot;PPASSESS&quot;;&quot;Web&quot;;&quot;3683000826&quot;;&quot;823&quot;;&quot;IC W2&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;Rapid_20160801_Monthly.txt&quot;;&quot;20160720102100&quot;;&quot;&quot;;&quot;20160803095309&quot;;&quot;286023&quot;;&quot;RGT&quot;;&quot;1&quot;;&quot;14702324400223&quot;;&quot;14702324400223&quot;;&quot;0&quot;;&quot;OMCProcessed&quot;
I'm using the following code:
cat RR_00404.fin.bc_lerr.xml.bc | awk 'BEGIN { FS=OFS=";" } /<record/ { gsub(/&quot;/,"\""); gsub(/.*="|">.*/,"",$1); print $1,$40,$43,$46,"'base_err_xml'","0",$7; }'
The idea is to do the following:
Replace &quot; with "
Extract the error_code
Print " and ; separated values.
Use sqlldr to load (not to worry about this).
Problems to solve:
There is a ; within the text, e.g. Prime Lending - ;Corporate - 2201
There is an & in the text, e.g. HEDI & Bundy
Output:
100;"20160803095309";"1";"1";"base_err_xml";"0";"Prime Lending
100;"286023";"14702324400223";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate - 2201"
100;"286024-1";"";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate - 2201"
awk is the wrong tool for this job, without some preprocessing. Here, we use XMLStarlet for the first pass (decoding all XML entities and splitting attributes off into separate fields), and GNU awk for the second (reading those fields and performing whatever transforms or logic you actually need):
#!/bin/sh
# reads XML on stdin; puts record_no in first field, error code in second,
# ...record content for remainder of output line.
xmlstarlet sel -t -m '//record' \
-v ./@record_no -o ';' \
-v ./@error_code -o ';' \
-v . -n
...and, cribbed from the GNU awk documentation...
#!/usr/bin/env gawk -f
# must be GNU awk for the FPAT feature
BEGIN {
FPAT = "([^;]*)|(\"[^\"]*\")"
}
{
print "NF = ", NF
for (i = 1; i <= NF; i++) {
printf("$%d = <%s>\n", i, $i)
}
}
Here, what we're doing with gawk is just showing how the fields get split, but obviously, you can modify the script for whatever needs you have.
A subset of output from the above for your given input file (when extended to actually be valid XML) is quoted below:
$1 = <2>
$2 = <100>
$9 = <"Prime Lending - ;Corporate - 2201">
Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.
Obviously, you can encapsulate both these components in shell functions to avoid the need for separate files.
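For instance, a minimal sketch of that idea (the function names are made up; the gawk body just repeats the field-dump shown above):
#!/bin/sh
# Sketch only: both steps wrapped as shell functions and piped together, so no
# separate script files are needed. Function names are hypothetical.
flatten_records() {
    xmlstarlet sel -t -m '//record' \
        -v ./@record_no -o ';' \
        -v ./@error_code -o ';' \
        -v . -n
}
split_fields() {
    gawk 'BEGIN { FPAT = "([^;]*)|(\"[^\"]*\")" }
          {
              print "NF = ", NF
              for (i = 1; i <= NF; i++)
                  printf("$%d = <%s>\n", i, $i)
          }'
}
flatten_records < RR_00404.fin.bc_lerr.xml.bc | split_fields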

How to Field Separate and append text using awk

Experts,
I have the following text in an xml file (there will be 20,000 rows in the file).
<record record_no = "1" error_code="101">&quot;21006041&quot;;&quot;28006041&quot;;&quot;34006211&quot;;&quot;43&quot;;&quot;101210-0001&quot;
Here is how I need the result for each row to look, appended to a new file.
"21006041";"28006041";"34006211";"43";"101210-0001";101
Here is what I need to do to get the above result:
Replace &quot; with "
Remove <record record_no = "1" error_code="
Get the text 101 (it can have any value in this position)
Append it to the end.
Here is what I have been trying.
BEGIN { FS=OFS=";" }
/<record/ {
gsub(/"/,"\"")
gsub(/&apos;/,"")
gsub(/.*="|">.*/,"",$1)
$(NF+1)=$1;
$1="";
print $0;
}
This should do the trick.
awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/"/,"\""); print $2,$1}'
The strategy is to:
split the string at the closing chars of the xml element ">
remove the first bit of the xml element, including the attribute names, leaving only the error code.
replace all &quot; xml entities with ".
print the two FS sections in reverse order.
Test it out with the following data generation script. The script will generate 500x20000 line files with records of random length, some with dashes in the values.
#!/bin/bash
recCount=0
for h in {1..500};
do
for i in {1..20000};
do
((recCount++))
error=$(( RANDOM % 998 + 1 ))
record="<record record_no = "'"'"${recCount}"'"'" error_code="'"'"${error}"'"'">"
upperBound=$(( RANDOM % 4 + 5 ))
for (( k=0; k<${upperBound}; k++ ));
do
randomVal=$(( RANDOM % 99999999 + 1))
record+=""${randomVal}"
if [[ $((RANDOM % 4)) == 0 ]];
then
randomVal=$(( RANDOM % 99999999 + 1))
record+="-${randomVal}"
fi
record+="""
if [[ $k != $(( ${upperBound} - 1 )) ]];
then
record+=";"
fi
done;
echo "${record}" >> "file-${h}.txt"
done;
done;
On my laptop I get the following performance.
$ time cat file-*.txt | awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/&quot;/,"\""); print $2,$1}' > result
real 0m18.985s
user 0m17.673s
sys 0m2.697s
As an added bonus, here is the "equivalent" command in sed:
sed -e 's|\("\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g'
Much slower although the strategy is the same. Two expressions are used. First replace all " xml entities with ". Lastly group all characters (.+) after >. Display the remembered patterns in reverse order \2;\1
Timing statistics:
$ time cat file-* | sed -e 's|\(&quot;\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g' > result.sed
real 5m59.576s
user 5m56.136s
sys 0m9.850s
Is this too thick:
$ awk -F""+" -v OFS='";"' -v dq='"' '{gsub(/^.*="|">$/,"",$1);print dq""$2,$4,$6,$8,$10dq";"$1}' test.in
"21006041";"28006041";"34006211";"43";"101210-0001";101

Divide floats in awk

I have written code to calculate the z-score; it computes the mean and standard deviation from one file and uses values from the rows of another file, as follows:
mean=$(awk '{total += $2; count++} END {print total/count}' ABC_avg.txt)
#calculating mean of the second column of the file
std=$(awk '{x[NR]=$2; s+=$2; n++} END{a=s/n; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/n); print sd}' ABC_avg.txt)
#calculating standard deviation from the second column of the same file
awk '{if (std) print $2-$mean/$std}' ABC_splicedavg.txt > ABC.tmp
#calculate the zscore for each row and store it in a temporary file
zscore=$(awk '{total += $0; count++} END {if (count) print total/count}' ABC.tmp)
#calculate an average of all the zscores in the rows and store it in a variable
echo $motif" "$zscore
rm ABC.tmp
However, when I execute this code, at the step where the temp file is created I get the error fatal: division by zero attempted. What is the right way to implement this code? TIA. I used the bc -l option, but it gives a very long floating-point number.
Here is a script to compute the mean and std in one pass; you may lose some resolution, and if that is not acceptable there are alternatives...
$ awk '{print rand()}' <(seq 100) |
  awk '{sum+=$1; sqsum+=$1^2}
       END{print mean=sum/NR, std=sqrt(sqsum/NR-mean^2), z=mean/std}'
0.486904 0.321789 1.51312
Your script for z-score for each sample is wrong! You need to do ($2-mean)/std.
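A minimal sketch of that step, assuming mean and std are the shell variables computed earlier and passing them into awk with -v:
# Corrected z-score step: parenthesise the subtraction so it happens before the
# division, and guard against a zero std.
awk -v mean="$mean" -v std="$std" 'std != 0 { print ($2 - mean) / std }' ABC_splicedavg.txt > ABC.tmp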
You can control the precision of your output with bc by using the scale variable:
$ echo "4/7" | bc -l
.57142857142857142857
$ echo "scale=3; 4/7" | bc -l
.571