I have this XML file (it's called "LogReg.xml") that contains some information about a logistic regression. I am interested in the names of the features and their coefficients (I'll explain in more detail below):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
<Header>
<Application name="JPMML-SkLearn" version="1.6.35"/>
<Timestamp>2022-02-15T09:44:54Z</Timestamp>
</Header>
<MiningBuildTask>
<Extension name="repr">PMMLPipeline(steps=[('classifier', LogisticRegression())])</Extension>
</MiningBuildTask>
<DataDictionary>
<DataField name="Target" optype="categorical" dataType="integer">
<Value value="0"/>
<Value value="1"/>
</DataField>
<DataField name="const" optype="continuous" dataType="double"/>
<DataField name="grade" optype="continuous" dataType="double"/>
<DataField name="emp_length" optype="continuous" dataType="double"/>
<DataField name="dti" optype="continuous" dataType="double"/>
<DataField name="Orig_FicoScore" optype="continuous" dataType="double"/>
<DataField name="inq_last_6mths" optype="continuous" dataType="double"/>
<DataField name="acc_open_past_24mths" optype="continuous" dataType="double"/>
<DataField name="mort_acc" optype="continuous" dataType="double"/>
<DataField name="mths_since_recent_bc" optype="continuous" dataType="double"/>
<DataField name="num_rev_tl_bal_gt_0" optype="continuous" dataType="double"/>
<DataField name="percent_bc_gt_75" optype="continuous" dataType="double"/>
</DataDictionary>
<RegressionModel functionName="classification" algorithmName="sklearn.linear_model._logistic.LogisticRegression" normalizationMethod="logit">
<MiningSchema>
<MiningField name="Target" usageType="target"/>
<MiningField name="const"/>
<MiningField name="grade"/>
<MiningField name="emp_length"/>
<MiningField name="dti"/>
<MiningField name="Orig_FicoScore"/>
<MiningField name="inq_last_6mths"/>
<MiningField name="acc_open_past_24mths"/>
<MiningField name="mort_acc"/>
<MiningField name="mths_since_recent_bc"/>
<MiningField name="num_rev_tl_bal_gt_0"/>
<MiningField name="percent_bc_gt_75"/>
</MiningSchema>
<Output>
<OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
<OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
</Output>
<RegressionTable intercept="0.8064694059338298" targetCategory="1">
<NumericPredictor name="const" coefficient="0.8013433785974717"/>
<NumericPredictor name="grade" coefficient="0.9010481046582982"/>
<NumericPredictor name="emp_length" coefficient="0.9460686056314133"/>
<NumericPredictor name="dti" coefficient="0.5117062988491518"/>
<NumericPredictor name="Orig_FicoScore" coefficient="0.07944303372859234"/>
<NumericPredictor name="inq_last_6mths" coefficient="0.20516234445402765"/>
<NumericPredictor name="acc_open_past_24mths" coefficient="0.4852503249658917"/>
<NumericPredictor name="mort_acc" coefficient="0.6673203078463711"/>
<NumericPredictor name="mths_since_recent_bc" coefficient="0.1962158305958366"/>
<NumericPredictor name="num_rev_tl_bal_gt_0" coefficient="0.12964661294856686"/>
<NumericPredictor name="percent_bc_gt_75" coefficient="0.04534570018290847"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="0"/>
</RegressionModel>
</PMML>
I have parsed it using this code:
import pandas as pd
from lxml import objectify
path = 'LogReg.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
data = []
for elt in root.RegressionModel.RegressionTable:
    el_data = {}
    for child in elt.getchildren():
        el_data[child.tag] = child.text
    data.append(el_data)
perf = pd.DataFrame(data)
I am interested in parsing this bit:
<RegressionTable intercept="0.8064694059338298" targetCategory="1">
<NumericPredictor name="const" coefficient="0.8013433785974717"/>
<NumericPredictor name="grade" coefficient="0.9010481046582982"/>
<NumericPredictor name="emp_length" coefficient="0.9460686056314133"/>
<NumericPredictor name="dti" coefficient="0.5117062988491518"/>
<NumericPredictor name="Orig_FicoScore" coefficient="0.07944303372859234"/>
<NumericPredictor name="inq_last_6mths" coefficient="0.20516234445402765"/>
<NumericPredictor name="acc_open_past_24mths" coefficient="0.4852503249658917"/>
<NumericPredictor name="mort_acc" coefficient="0.6673203078463711"/>
<NumericPredictor name="mths_since_recent_bc" coefficient="0.1962158305958366"/>
<NumericPredictor name="num_rev_tl_bal_gt_0" coefficient="0.12964661294856686"/>
<NumericPredictor name="percent_bc_gt_75" coefficient="0.04534570018290847"/>
</RegressionTable>
so that I can build the following dictionary:
myDict = {
"const" : 0.8013433785974717,
"grade" : 0.9010481046582982,
"emp_length" : 0.9460686056314133,
"dti" : 0.5117062988491518,
"Orig_FicoScore" : 0.07944303372859234,
"inq_last_6mths" : 0.20516234445402765,
"acc_open_past_24mths" : 0.4852503249658917,
"mort_acc" : 0.6673203078463711,
"mths_since_recent_bc" : 0.1962158305958366,
"num_rev_tl_bal_gt_0" : 0.12964661294856686,
"percent_bc_gt_75" : 0.04534570018290847
}
Basically, in the dictionary the Key is the name of the feature and the value is the coefficient of the logistic regression.
Please can anyone help me with the code?
I'm not sure you need pandas for this, but you do need to handle the namespaces in your xml.
Try something along these lines:
myDict = {}
#register the namespace
ns = {'xx': 'http://www.dmg.org/PMML-4_4'}
#you could collapse the next two into one line, but I believe it's clearer this way
rt = root.xpath('//xx:RegressionTable[.//xx:NumericPredictor]',namespaces=ns)[0]
nps = rt.xpath('./xx:NumericPredictor',namespaces=ns)
for pred in nps:
    # attribute values are strings, so convert the coefficient to a float
    myDict[pred.attrib['name']] = float(pred.attrib['coefficient'])
myDict
The output should be your expected output.
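For comparison, the same extraction can be sketched with only the standard library. This is a minimal sketch, assuming the file has the structure shown in the question; the helper name `read_coefficients`, the `pmml` prefix, and the trimmed inline sample are illustrative and not part of the original code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the PMML from the question, used only for illustration.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <RegressionTable intercept="0.8064694059338298" targetCategory="1">
      <NumericPredictor name="const" coefficient="0.8013433785974717"/>
      <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="0"/>
  </RegressionModel>
</PMML>"""

# The prefix is arbitrary; only the namespace URI has to match the document.
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def read_coefficients(root):
    """Map each NumericPredictor name to its coefficient as a float."""
    return {
        pred.get("name"): float(pred.get("coefficient"))
        for pred in root.iterfind(".//pmml:NumericPredictor", NS)
    }


root = ET.fromstring(SAMPLE)
coefficients = read_coefficients(root)
print(coefficients)
```

For the real file, `ET.parse('LogReg.xml').getroot()` would take the place of `ET.fromstring(SAMPLE)`.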
I'm using the XML below to fetch the ClauseCode values, but the query below returns no data.
<Policy>
<Plans>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>3</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>1</ClauseCode>
<ClauseCode>2</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>5</ClauseCode>
<ClauseCode>4</ClauseCode>
</ProductEndorsementClauses>
</Plan>
</Plans>
</Policy>
Here is my query :
select proposaid,
Col1.value('(/*/ProductEndorsementClauses/ClauseCode)[1]','nvarchar(max)')
from Policy p
CROSS APPLY data.nodes('/*/Plans/Plan') AS Tbl(Col)
CROSS APPLY Tbl.Col.nodes('/ProductEndorsementClauses/ClauseCode') AS TblPec(Col1)
where Col1.value('(ProductEndorsementClauses/ClauseCode)[1]', 'nvarchar(max)') in ('1','3')
For inspiration:
declare @x xml = N'
<Policy>
<Plans>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>3</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>1</ClauseCode>
<ClauseCode>2</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>5</ClauseCode>
<ClauseCode>4</ClauseCode>
</ProductEndorsementClauses>
</Plan>
</Plans>
</Policy>'
select @x
--all clause code values
select t.col.value('.[1]', 'nvarchar(10)')
from @x.nodes('Policy/Plans/Plan/ProductEndorsementClauses/ClauseCode') as t(col);
--only clause code values that have at least a 1, 3 clause code in their parent ProductEndorsementClauses
select t.col.value('.[1]', 'nvarchar(10)')
from @x.nodes('Policy/Plans/Plan/ProductEndorsementClauses[(./ClauseCode/text() = "1") or (./ClauseCode/text() = "3")]/ClauseCode') as t(col);
--result only when exists at least a 1 or 3
select t.col.value('.[1]', 'nvarchar(max)')
from @x.nodes('Policy/Plans/Plan/ProductEndorsementClauses/ClauseCode') as t(col)
where @x.exist('Policy/Plans/Plan/ProductEndorsementClauses/ClauseCode[text() = "1" or text() = "3"]') = 1;
declare @Policy table
(
proposaid int,
data xml
);
insert into @Policy(proposaid, data)
values(1, N'
<Policy>
<Plans>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>3</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>1</ClauseCode>
<ClauseCode>2</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>5</ClauseCode>
<ClauseCode>4</ClauseCode>
</ProductEndorsementClauses>
</Plan>
</Plans>
</Policy>'),
(2, N'
<Policy>
<Plans>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>13</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>11</ClauseCode>
<ClauseCode>12</ClauseCode>
</ProductEndorsementClauses>
</Plan>
<Plan>
<ProductEndorsementClauses>
<ClauseCode>15</ClauseCode>
<ClauseCode>14</ClauseCode>
</ProductEndorsementClauses>
</Plan>
</Plans>
</Policy>');
select *
from @Policy;
select p.proposaid,
t.col.value('.[1]', 'nvarchar(max)')
from @Policy as p
cross apply p.data.nodes('Policy/Plans/Plan/ProductEndorsementClauses/ClauseCode') as t(col)
where p.data.exist('Policy/Plans/Plan/ProductEndorsementClauses/ClauseCode[text() = "1" or text() = "3"]') = 1;
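The difference between these filters may be easier to see outside SQL Server. The following is only a rough Python illustration of the selection logic (standard library, variable names are mine), not a translation of the T-SQL: it collects every clause code, then the codes whose parent ProductEndorsementClauses contains a 1 or a 3, then only the codes that are themselves 1 or 3, which is what the `in ('1','3')` filter in the question seems to be after.

```python
import xml.etree.ElementTree as ET

# Compact copy of the Policy document from the question.
POLICY = """<Policy><Plans>
<Plan><ProductEndorsementClauses><ClauseCode>3</ClauseCode></ProductEndorsementClauses></Plan>
<Plan><ProductEndorsementClauses><ClauseCode>1</ClauseCode><ClauseCode>2</ClauseCode></ProductEndorsementClauses></Plan>
<Plan><ProductEndorsementClauses><ClauseCode>5</ClauseCode><ClauseCode>4</ClauseCode></ProductEndorsementClauses></Plan>
</Plans></Policy>"""

WANTED = {"1", "3"}
root = ET.fromstring(POLICY)  # root is the <Policy> element

# Every ClauseCode value, in document order.
all_codes = [
    code.text
    for code in root.iterfind("Plans/Plan/ProductEndorsementClauses/ClauseCode")
]

# All codes of each ProductEndorsementClauses that holds at least one wanted code.
sibling_codes = [
    code.text
    for clauses in root.iterfind("Plans/Plan/ProductEndorsementClauses")
    if any(c.text in WANTED for c in clauses.iterfind("ClauseCode"))
    for code in clauses.iterfind("ClauseCode")
]

# Only the codes that are themselves wanted.
matching_codes = [code for code in all_codes if code in WANTED]

print(all_codes, sibling_codes, matching_codes)
```

Note that all paths here are relative to the current element, with no leading slash.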
I found several questions about how to convert a table (or query) into XML, but none that showed how to start with one main table and join several one:many satellite tables, and from that generate XML that represents the hierarchical structure of the data. So I thought I'd share this solution now that I've figured it out. If someone else has another way of doing this, please post another answer.
Given this contrived data:
create table #recipe (id int, name varchar(10))
create table #ingredient (recipe_id int, name varchar(30), quantity varchar(20), sort int)
create table #instruction (recipe_id int, task varchar(32), sort int)
insert into #recipe values (1, 'pizza'), (2, 'omelet')
insert into #ingredient values (1, 'pizza dough', '1 package', 1),
(1, 'tomato sauce', '1 can', 2),
(1, 'favorite toppings', 'you choose', 3),
(2, 'eggs', 'three', 1),
(2, 'a bunch of other ingredients', 'you choose', 2)
insert into #instruction values (1, 'pre-bake pizza dough', 1),
(1, 'add tomato sauce', 2),
(1, 'add toppings', 3),
(1, 'bake a little longer', 4),
(2, 'break eggs into mixing bowl', 1),
(2, 'beat yolks and whites together', 2),
(2, 'pour into large sauce pan', 3),
(2, 'add other ingredients', 4),
(2, 'fold in half', 5),
(2, 'cook until done', 6)
Which looks like this in tabular form:
#recipe
id          name
----------- ----------
1           pizza
2           omelet

#ingredient
recipe_id   name                           quantity             sort
----------- ------------------------------ -------------------- -----------
1           pizza dough                    1 package            1
1           tomato sauce                   1 can                2
1           favorite toppings              you choose           3
2           eggs                           three                1
2           a bunch of other ingredients   you choose           2

#instruction
recipe_id   task                             sort
----------- -------------------------------- -----------
1           pre-bake pizza dough             1
1           add tomato sauce                 2
1           add toppings                     3
1           bake a little longer             4
2           break eggs into mixing bowl      1
2           beat yolks and whites together   2
2           pour into large sauce pan        3
2           add other ingredients            4
2           fold in half                     5
2           cook until done                  6
I want to create an XML document that has one record for each recipe, and within each recipe element, I want a group of ingredients and another group of instructions, like this:
<recipes>
<recipe id="2" name="omelet">
<ingredients>
<ingredient name="eggs" quantity="three" />
<ingredient name="a bunch of other ingredients" quantity="you choose" />
</ingredients>
<instructions>
<instruction task="break eggs into mixing bowl" />
<instruction task="beat yolks and whites together" />
<instruction task="pour into large sauce pan" />
<instruction task="add other ingredients" />
<instruction task="fold in half" />
<instruction task="cook until done" />
</instructions>
</recipe>
<recipe id="1" name="pizza">
<ingredients>
<ingredient name="pizza dough" quantity="1 package" />
<ingredient name="tomato sauce" quantity="1 can" />
<ingredient name="favorite toppings" quantity="you choose" />
</ingredients>
<instructions>
<instruction task="pre-bake pizza dough" />
<instruction task="add tomato sauce" />
<instruction task="add toppings" />
<instruction task="bake a little longer" />
</instructions>
</recipe>
</recipes>
This SQL creates the desired XML verbatim:
select recipe.*,
(
select ingredient.name, ingredient.quantity
from #ingredient ingredient
where recipe.id = ingredient.recipe_id
order by ingredient.sort
for xml auto, root('ingredients'), type
),
(
select instruction.task
from #instruction instruction
where recipe.id = instruction.recipe_id
order by instruction.sort
for xml auto, root('instructions'), type
)
from #recipe as recipe
order by recipe.name
for xml auto, root('recipes'), type
I aliased the temp table names because using for xml auto on temp tables creates poorly named XML elements. This is how it looks:
<recipes>
<recipe id="2" name="omelet">
<ingredients>
<ingredient name="eggs" quantity="three" />
<ingredient name="a bunch of other ingredients" quantity="you choose" />
</ingredients>
<instructions>
<instruction task="break eggs into mixing bowl" />
<instruction task="beat yolks and whites together" />
<instruction task="pour into large sauce pan" />
<instruction task="add other ingredients" />
<instruction task="fold in half" />
<instruction task="cook until done" />
</instructions>
</recipe>
<recipe id="1" name="pizza">
<ingredients>
<ingredient name="pizza dough" quantity="1 package" />
<ingredient name="tomato sauce" quantity="1 can" />
<ingredient name="favorite toppings" quantity="you choose" />
</ingredients>
<instructions>
<instruction task="pre-bake pizza dough" />
<instruction task="add tomato sauce" />
<instruction task="add toppings" />
<instruction task="bake a little longer" />
</instructions>
</recipe>
</recipes>
This SQL creates another version of the XML with all data as values instead of attributes, but in the same basic hierarchical structure:
select recipe.*,
(
select ingredient.name, ingredient.quantity
from #ingredient ingredient
where recipe.id = ingredient.recipe_id
order by ingredient.sort
for xml path('ingredient'), root('ingredients'), type
),
(
select instruction.task
from #instruction instruction
where recipe.id = instruction.recipe_id
order by instruction.sort
for xml path('instruction'), root('instructions'), type
)
from #recipe as recipe
order by recipe.name
for xml path('recipe'), root('recipes'), type
This is how it looks:
<recipes>
<recipe>
<id>2</id>
<name>omelet</name>
<ingredients>
<ingredient>
<name>eggs</name>
<quantity>three</quantity>
</ingredient>
<ingredient>
<name>a bunch of other ingredients</name>
<quantity>you choose</quantity>
</ingredient>
</ingredients>
<instructions>
<instruction>
<task>break eggs into mixing bowl</task>
</instruction>
<instruction>
<task>beat yolks and whites together</task>
</instruction>
<instruction>
<task>pour into large sauce pan</task>
</instruction>
<instruction>
<task>add other ingredients</task>
</instruction>
<instruction>
<task>fold in half</task>
</instruction>
<instruction>
<task>cook until done</task>
</instruction>
</instructions>
</recipe>
<recipe>
<id>1</id>
<name>pizza</name>
<ingredients>
<ingredient>
<name>pizza dough</name>
<quantity>1 package</quantity>
</ingredient>
<ingredient>
<name>tomato sauce</name>
<quantity>1 can</quantity>
</ingredient>
<ingredient>
<name>favorite toppings</name>
<quantity>you choose</quantity>
</ingredient>
</ingredients>
<instructions>
<instruction>
<task>pre-bake pizza dough</task>
</instruction>
<instruction>
<task>add tomato sauce</task>
</instruction>
<instruction>
<task>add toppings</task>
</instruction>
<instruction>
<task>bake a little longer</task>
</instruction>
</instructions>
</recipe>
</recipes>
Originally I tried placing the ingredients and instructions in the main query's from clause with an inner join to the recipe table. But the instructions were all nested within the ingredients, which were nested within the recipe. When I moved them up to the select part of the query it straightened out the XML.
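The same shape, one element per recipe with each child group built separately for that recipe, can be sketched outside SQL as well. The following Python version is only an illustration of the nesting idea, assuming the rows are already in memory as tuples; the function name `build_recipes_xml` is made up for this sketch.

```python
import xml.etree.ElementTree as ET

recipes = [(1, "pizza"), (2, "omelet")]  # (id, name)
ingredients = [  # (recipe_id, name, quantity, sort)
    (1, "pizza dough", "1 package", 1),
    (1, "tomato sauce", "1 can", 2),
    (1, "favorite toppings", "you choose", 3),
    (2, "eggs", "three", 1),
    (2, "a bunch of other ingredients", "you choose", 2),
]
instructions = [  # (recipe_id, task, sort)
    (1, "pre-bake pizza dough", 1),
    (1, "add tomato sauce", 2),
    (1, "add toppings", 3),
    (1, "bake a little longer", 4),
    (2, "break eggs into mixing bowl", 1),
    (2, "beat yolks and whites together", 2),
    (2, "pour into large sauce pan", 3),
    (2, "add other ingredients", 4),
    (2, "fold in half", 5),
    (2, "cook until done", 6),
]


def build_recipes_xml(recipes, ingredients, instructions):
    """Build <recipes> with one <recipe> per row and two child groups each."""
    root = ET.Element("recipes")
    for recipe_id, name in sorted(recipes, key=lambda r: r[1]):
        recipe = ET.SubElement(root, "recipe", {"id": str(recipe_id), "name": name})
        # Each group is filtered by recipe id, much like a correlated subquery.
        group = ET.SubElement(recipe, "ingredients")
        rows = sorted((r for r in ingredients if r[0] == recipe_id), key=lambda r: r[3])
        for _, ing_name, quantity, _ in rows:
            ET.SubElement(group, "ingredient", {"name": ing_name, "quantity": quantity})
        group = ET.SubElement(recipe, "instructions")
        rows = sorted((r for r in instructions if r[0] == recipe_id), key=lambda r: r[2])
        for _, task, _ in rows:
            ET.SubElement(group, "instruction", {"task": task})
    return root


root = build_recipes_xml(recipes, ingredients, instructions)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the two child groups are built independently under the recipe element, neither ends up nested inside the other, which mirrors the effect of moving the subqueries into the select list.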
<passengergroup>
<passengerList>
<passDetails>
<route>LONDON</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>1A</SeatNo>
</seatDetails>
<customervalue>AB</customervalue>
</passengerList>
<passengerList>
<passDetails>
<route>LONDON</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>1B</SeatNo>
</seatDetails>
<customervalue>good</customervalue>
</passengerList>
<passengerList>
<passDetails>
<route>DELHI</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>2C</SeatNo>
</seatDetails>
<customervalue>BC</customervalue>
</passengerList>
<passengerList>
<passDetails>
<route>DELHI</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>2D</SeatNo>
</seatDetails>
<customervalue>okey</customervalue>
</passengerList>
</passengergroup>
<xsl:for-each select="passengergroup/passengerList">
<xsl:if test="customervalue='good'">
<xsl:value-of select="passDetails/route"/><xsl:text> </xsl:text>
<xsl:value-of select="customervalue"/><xsl:text> </xsl:text>
<xsl:value-of select="seatDetails/SeatNo"/>
</xsl:if>
</xsl:for-each>
<xsl:for-each select="passengergroup/passengerList">
<xsl:if test="customervalue='okey'">
<xsl:value-of select="passDetails/route"/><xsl:text> </xsl:text>
<xsl:value-of select="customervalue"/><xsl:text> </xsl:text>
<xsl:value-of select="seatDetails/SeatNo"/>
</xsl:if>
</xsl:for-each>
Output
It will produce output like this
LONDON good 1A
LONDON good 1B
DELHI okey 2C
DELHI okey 2D
But I need the output like this
LONDON good 1A 1B
DELHI okey 2C 2D
If 'LONDON good' repeats many times, it should be printed only once, followed by all of its seat numbers, like '1A 1B 1C 1D 1F 2G' and so on. I am using XSLT 2.0 and my output type is text. In short, the same items should not be displayed many times.
I have tried a lot but have not been able to figure out a solution. Please help me out.
IMO your input XML doesn't correspond to the desired output (e.g. there is only one LONDON with customervalue = good), but maybe I don't fully understand what you need. In any case, the following XSLT could do the job.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<xsl:output method="text" />
<xsl:template match="/">
<xsl:apply-templates select="passengergroup" />
</xsl:template>
<xsl:template match="passengergroup">
<xsl:for-each-group select="passengerList" group-by="concat(passDetails/route, ' ', customervalue)">
<xsl:value-of select="current-grouping-key()" />
<xsl:text> </xsl:text>
<xsl:value-of select="current-group()/seatDetails/SeatNo" separator=" " />
<xsl:value-of select="'&#xA;'" />
</xsl:for-each-group>
</xsl:template>
</xsl:stylesheet>
For input
<?xml version="1.0" encoding="UTF-8"?>
<passengergroup>
<passengerList>
<passDetails>
<route>LONDON</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>1A</SeatNo>
</seatDetails>
<customervalue>good</customervalue>
</passengerList>
<passengerList>
<passDetails>
<route>LONDON</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>1B</SeatNo>
</seatDetails>
<customervalue>good</customervalue>
</passengerList>
<passengerList>
<passDetails>
<route>DELHI</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>2C</SeatNo>
</seatDetails>
<customervalue>BC</customervalue>
</passengerList>
<passengerList>
<passDetails>
<route>DELHI</route>
<lastname>RAY</lastname>
</passDetails>
<seatDetails>
<SeatNo>2D</SeatNo>
</seatDetails>
<customervalue>okey</customervalue>
</passengerList>
</passengergroup>
it produces the following output
LONDON good 1A 1B
DELHI BC 2C
DELHI okey 2D
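What xsl:for-each-group does here is collect the seat numbers under a composite key of route and customervalue, keeping the groups in first-seen order. As a rough cross-check of that logic, here is the same grouping sketched in Python with the standard library; the function name `group_seats` is illustrative, and the sample is a compact copy of the input above.

```python
import xml.etree.ElementTree as ET

# Compact copy of the input document above.
SAMPLE = """<passengergroup>
<passengerList><passDetails><route>LONDON</route><lastname>RAY</lastname></passDetails><seatDetails><SeatNo>1A</SeatNo></seatDetails><customervalue>good</customervalue></passengerList>
<passengerList><passDetails><route>LONDON</route><lastname>RAY</lastname></passDetails><seatDetails><SeatNo>1B</SeatNo></seatDetails><customervalue>good</customervalue></passengerList>
<passengerList><passDetails><route>DELHI</route><lastname>RAY</lastname></passDetails><seatDetails><SeatNo>2C</SeatNo></seatDetails><customervalue>BC</customervalue></passengerList>
<passengerList><passDetails><route>DELHI</route><lastname>RAY</lastname></passDetails><seatDetails><SeatNo>2D</SeatNo></seatDetails><customervalue>okey</customervalue></passengerList>
</passengergroup>"""


def group_seats(root):
    """Return one 'route customervalue seat seat ...' line per distinct group."""
    groups = {}  # dicts keep insertion order, so groups stay in first-seen order
    for passenger in root.iterfind("passengerList"):
        key = (
            passenger.findtext("passDetails/route"),
            passenger.findtext("customervalue"),
        )
        groups.setdefault(key, []).append(passenger.findtext("seatDetails/SeatNo"))
    return [" ".join([route, value, *seats]) for (route, value), seats in groups.items()]


lines = group_seats(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

Each distinct route and customervalue pair is printed once, followed by every seat number that belongs to it.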
I have the following lines showing up in files that have been converted to XML from an Excel worksheet:
<Worksheet ss:Name="Sheet1">
<Names>
<NamedRange ss:Name="Print_Area" ss:RefersTo="=Sheet1!R30C1:R8642C15"/>
</Names>
<Table ss:ExpandedColumnCount="14" ss:ExpandedRowCount="8655" x:FullColumns="1"
x:FullRows="1" ss:StyleID="s16">
<Column ss:Index="2" ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="41.25"/>
<Column ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="36"/>
<Column ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="35.25"/>
<Column ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="38.25" ss:Span="1"/>
<Column ss:Index="8" ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="31.5"/>
<Column ss:Index="11" ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="30"/>
<Column ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="33.75"/>
<Column ss:StyleID="s16" ss:AutoFitWidth="0" ss:Width="28.5"/>
<Row ss:StyleID="s18">
<Cell ss:StyleID="s17"><Data ss:Type="String">UNITED STATES</Data></Cell>
<Cell ss:StyleID="s17"/>
<Cell ss:StyleID="s17"/>
<Cell ss:StyleID="s17"/>
<Cell ss:StyleID="s17"/>
<Cell ss:StyleID="s17"/>
<Cell ss:StyleID="s17"/>
</Row>
I am attempting to only remove the <Column .. /> lines. I "thought" I had a pretty good handle on Regular Expressions in VB.NET, but I cannot seem to match these lines. I have tried the following match strings:
'Using (RegexOptions.Multiline)
Private Const Column_MatchExpression As String = "^[\s]*<Column[\s\S]+$"
Private Const Column_MatchExpression As String = " <Column[\s\S]+$"
Private Const Column_MatchExpression As String = "^ <Column[\s\S]+$"
Private Const Column_MatchExpression As String = "^[\s]+<Column[\s\w\W]+$"
Any thoughts on the matter would be appreciated.
What about
"^\s*<Column.*/>\s*$"
?
\<Column[^>]*\>
Should work
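One likely reason the original attempts misbehave is that `[\s\S]+` is greedy and also matches line breaks, so a single match can run from the first Column line to the end of the text. The sketch below shows this with Python's `re` module rather than VB.NET (the two engines agree on the constructs used here, as far as I know, but treat that as an assumption); the sample text is a trimmed version of the worksheet above, and `COLUMN_LINE` is a made-up name.

```python
import re

# Trimmed sample with fewer attributes than the real worksheet.
sample = (
    '<Table ss:ExpandedColumnCount="14">\n'
    '   <Column ss:Index="2" ss:StyleID="s16" ss:Width="41.25"/>\n'
    '   <Column ss:StyleID="s16" ss:Width="36"/>\n'
    '   <Row ss:StyleID="s18">\n'
    '   </Row>\n'
    '</Table>\n'
)

# The pattern from the question: [\s\S]+ also consumes newlines, so the match
# runs on to the end of the whole text instead of stopping at the end of the line.
greedy = re.search(r"^[\s]*<Column[\s\S]+$", sample, re.MULTILINE)

# Optional leading blanks, one self-closing <Column .../> tag, optional trailing
# blanks, then the line break if there is one. [^>]* cannot leave the tag.
COLUMN_LINE = re.compile(r"^[ \t]*<Column\b[^>]*/>[ \t]*\r?\n?", re.MULTILINE)

cleaned = COLUMN_LINE.sub("", sample)
print(cleaned)
```

Matching the trailing line break as part of the pattern removes the whole line, so no blank lines are left behind.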