Making ID attributes unique in XML - sql

I have a merged XML file which (for the sake of simplicity) has the following form:
<bookstore>
<books>
<book id="1"/>
<book id="2"/>
<book id="2"/>
<book id="3"/>
<book id="10"/>
</books>
</bookstore>
Because the XML is merged, there are books with the same ID attribute. I need to make the IDs unique following this rule: if an ID is encountered (reading top to bottom) that is already taken, change it to MAX(ID)+1. For the example above, the result should be:
<bookstore>
<books>
<book id="1"/>
<book id="2"/>
<book id="11"/>
<book id="3"/>
<book id="10"/>
</books>
</bookstore>
A straightforward way to do this would be to extract the IDs, count their occurrences, and, for any ID that occurs more than once, find the second occurrence (top to bottom) and replace it. But this isn't very elegant…
As I am reading about XML processing now, I was hoping for a (simple) XQuery which could do this.
If anyone has some pointers or pseudo code: they are all welcome.
My environment is Oracle (PL)SQL database, supporting XMLTYPE and XQuery.

In your environment you can use XSLT 1.0 to transform the document and generate IDs during the process. See: DBMS_XSLPROCESSOR.
With an XSLT stylesheet you can copy the nodes from your XML source to a result tree, creating unique IDs in the process. The IDs will not be sequential numbers, but unique strings generated by the generate-id() function. You can't control what they look like, but they are guaranteed to be unique. (XSLT also lets you remove duplicate nodes using a key, if that's your intention, but from your example I understood that a duplicate ID doesn't mean the node itself is a duplicate, since you want to generate a new ID for it.)
The stylesheet below has two templates. The first one creates an attribute named id containing a unique ID on every book element. The second one is an identity transform: it simply copies elements and attributes to the result tree.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes"/>
<xsl:template match="book">
<xsl:copy>
<xsl:attribute name="id">
<xsl:value-of select="generate-id(.)"/>
</xsl:attribute>
<xsl:apply-templates select="node()|@*[name() != 'id']"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The <xsl:apply-templates ...> call in the book template passes every child node and every attribute except id to the other templates (in this case only the identity template). The result is a copy of your original XML file with generated unique IDs for the book elements.
If you had an XML document such as this one:
<bookstore>
<books>
<book id="1" other="123"/>
<book id="2"/>
<book id="2"/>
<book id="3">
<chapter number="123" id="ch1">Text</chapter>
</book>
<book id="10"/>
</books>
<magazines>
<mag id="non-book-id"></mag>
</magazines>
</bookstore>
the XSLT above would transform it into this XML:
<bookstore>
<books>
<book id="d2e3" other="123"/>
<book id="d2e4"/>
<book id="d2e5"/>
<book id="d2e6">
<chapter number="123" id="ch1">Text</chapter>
</book>
<book id="d2e9"/>
</books>
<magazines>
<mag id="non-book-id"/>
</magazines>
</bookstore>
(the string sequences are arbitrary, and might be different in your implementation).
For creating ID/IDREF links the generated strings are better than numbers, since you can use them anywhere (numbers and identifiers that start with numbers can't always be used as IDs). But if strings are not acceptable and you need sequential numbers, you can use the XPath position() function in XQuery or XSLT to generate a number from the element's position in the node list currently being processed (which is unique among siblings). If all books are siblings in the same context, you can simply replace generate-id(.) in the stylesheet above with position():
<xsl:template match="book">
<xsl:copy>
<xsl:attribute name="id">
<xsl:value-of select="position()"/>
</xsl:attribute>
<xsl:apply-templates select="node()|@*[name() != 'id']"/>
</xsl:copy>
</xsl:template>
(if the books are not siblings, you will need to do it in a slightly different way, using a variable).
If you want to retain the existing IDs and only generate sequential ones for the duplicates, it will be a bit more complicated but you can achieve that with keys (or XQuery instead of XSLT). The maximum id can be obtained in XPath 2.0 using the max() function:
max(//book/@id)
That function does not exist in XPath 1.0, but you can obtain the maximum ID by using:
//book[not(@id < //book/@id)]/@id
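The renumbering rule from the question can also be sketched outside XSLT. The following is a minimal Python illustration, not an Oracle or XQuery solution; it assumes every id is an integer, that "MAX(ID)" means the highest id anywhere in the document, and the function name make_ids_unique is made up for this example.

```python
import xml.etree.ElementTree as ET

def make_ids_unique(root):
    """Walk the book elements in document order; give each repeated id max(id) + 1."""
    books = list(root.iter("book"))
    # Start from the highest id in the whole document so that a new id
    # can never collide with a book that appears later on.
    next_id = max(int(book.get("id")) for book in books) + 1
    seen = set()
    for book in books:
        book_id = int(book.get("id"))
        if book_id in seen:
            # Duplicate: assign the next free number and advance the counter.
            book.set("id", str(next_id))
            seen.add(next_id)
            next_id += 1
        else:
            seen.add(book_id)
    return root

SAMPLE = """<bookstore><books>
<book id="1"/><book id="2"/><book id="2"/><book id="3"/><book id="10"/>
</books></bookstore>"""

root = make_ids_unique(ET.fromstring(SAMPLE))
new_ids = [book.get("id") for book in root.iter("book")]
print(new_ids)  # ['1', '2', '11', '3', '10']
```

The first occurrence of each id is left untouched; only later repeats are renumbered, which matches the expected output in the question.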

XSLT to put one particular XML element before all others

XSLT 1.0 solution required. My question is similar to XSLT Change element order and I'll take this answer if I have to, but I hope I can do something like 'put this_element first, and retain the original order of all the rest of them'. The input is something like this, where ... can be any set of simple elements or text nodes, but no processing instructions nor comments. See below also.
<someXML>
<recordList>
<record priref="1" created="2009-06-04T16:54:35" modification="2014-12-16T14:56:51" selected="False">
...
<collection_type>3D</collection_type>
...
<object_category>headgear</object_category>
<object_name>hat</object_name>
<object_number>060998</object_number>
...
</record>
<record priref="3" created="2009-06-04T11:54:35" modification="2020-08-05T18:24:33" selected="False">
...
<collection_type>3D</collection_type>
<description>a very elaborate coat</description>
<object_category>clothing</object_category>
<object_name>coat</object_name>
<object_number>060998</object_number>
</record>
</recordList>
</someXML>
This would be the desired output.
<someXML>
<recordList>
<record priref="1" created="2009-06-04T16:54:35" modification="2014-12-16T14:56:51" selected="False">
<object_category>headgear</object_category>
...
<collection_type>3D</collection_type>
...
<object_name>hat</object_name>
<object_number>060998</object_number>
...
</record>
<record priref="3" created="2009-06-04T11:54:35" modification="2020-08-05T18:24:33" selected="False">
<object_category>clothing</object_category>
...
<collection_type>3D</collection_type>
<description>a very elaborate coat</description>
<object_name>coat</object_name>
<object_number>060998</object_number>
</record>
</recordList>
</someXML>
It's probably OK if object_category is put first, and then occurs again later on in the record, i.e. in the tags in their original order.
I'll add some background. There's this API producing about 900,000 XML records with different tags (element names) in alphabetical order, per record. There are about 170 different element names (that's why I don't want to have to list them all individually, unless there's no other way). The XML is ingested into a graph database. That takes time, but it could be sped up if we see the object_category as the first element in the record.
Edit: We can configure the API, but not the C# code behind the API. We step through the database, step by step ingesting chunks of ~100 records. If we specify nothing else, we get the XML as exemplified above. We can also specify an XSL sheet to transform the XML. That's what we want to do here.
The example is ambiguous, because we don't know what all those ... placeholders stand for. I suppose this should work for you:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="record">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:apply-templates select="object_category"/>
<xsl:apply-templates select="node()[not(self::object_category)]"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

XSLT for-each, trying to create header-line structure

This is not the browser type XSLT, this is for processing data (SAP B1 Integration Framework). Suppose we have two SQL Tables, HEADER and LINE and we want to avoid the kind of work where we first SELECT from the HEADER and then launch a separate select for the lines for each, because that requires "visual programming", and we like writing code more than connecting arrows. So we are sending the server a query like SELECT * FROM HEADER, SELECT * FROM LINES and we get an XML roughly like this:
<ResultSets>
<ResultSet>
<Row><MHeaderNum>1</MHeaderNum></Row>
<Row><MHeaderNum>2</MHeaderNum></Row>
<Row><MHeaderNum>3</MHeaderNum></Row>
</ResultSet>
<ResultSet>
<Row><LineNum>1</LineNum> <HeaderNum>1</HeaderNum></Row>
<Row><LineNum>2</LineNum> <HeaderNum>2</HeaderNum></Row>
<Row><LineNum>1</LineNum> <HeaderNum>3</HeaderNum></Row>
<Row><LineNum>2</LineNum> <HeaderNum>1</HeaderNum></Row>
<Row><LineNum>1</LineNum> <HeaderNum>2</HeaderNum></Row>
<Row><LineNum>2</LineNum> <HeaderNum>3</HeaderNum></Row>
</ResultSet>
</ResultSets>
so we think we are imperative, procedural programmers and pull a
<xsl:for-each select="//ResultSets/ResultSet[1]/Row">
do stuff with header data
<xsl:for-each select="//ResultSets/ResultSet[2]/Row[HeaderNum=MHeaderNum]">
do stuff with the line data belonging to this particular header
</xsl:for-each>
</xsl:for-each>
And of course this does not blinkin' work because MHeaderNum went out of context like grunge went out of fashion, and we cannot save it into a variable either because we will not be able to update that variable, as XSLT is sort of an immutable functional programming language.
But fear not, says an inner voice, because XSLT gurus can solve things like that with templates. Templates, if I understand it, are sort of XSLT's take on functions. They can be recursive and stuff like that. So can they be used to solve problems like this?
And of course we are talking about XSLT 1.0 because I don't know whether Java ever bothered to implement the later versions, but SAP certainly did not bother to use said hypothetical implementation.
Or should I really forget about this and just connect my visual arrows? The thing is, SQL is not supposed to be used in such an iterate-through-headers-then-iterate-through-lines way. What I am trying to do is what makes an SQL database happy: get a big ol' chunk of data out of it and then process it somewhere else, not bother it with seventy zillion tiny queries. And in our case the somewhere else is sadly XSLT, although technically I could try JavaScript as well, since SAP added Nashorn to this pile of mess, but maybe it is solvable in "pure" XSL?
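Whether you use XSLT 1.0 or later, and whether you use templates or for-each, the current() function is available. Inside a predicate it still refers to the node being processed by the enclosing for-each or template: //ResultSets/ResultSet[2]/Row[HeaderNum=current()/MHeaderNum].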
The best way to resolve cross-references is by using a key.
For example, the following stylesheet:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:key name="line-by-header" match="ResultSet[2]/Row" use="HeaderNum" />
<xsl:template match="/ResultSets">
<output>
<xsl:for-each select="ResultSet[1]/Row">
<header num="{MHeaderNum}">
<xsl:for-each select="key('line-by-header', MHeaderNum)">
<line>
<xsl:value-of select="LineNum"/>
</line>
</xsl:for-each>
</header>
</xsl:for-each>
</output>
</xsl:template>
</xsl:stylesheet>
when applied to the following input:
XML
<ResultSets>
<ResultSet>
<Row><MHeaderNum>1</MHeaderNum></Row>
<Row><MHeaderNum>2</MHeaderNum></Row>
<Row><MHeaderNum>3</MHeaderNum></Row>
</ResultSet>
<ResultSet>
<Row><LineNum>1</LineNum> <HeaderNum>1</HeaderNum></Row>
<Row><LineNum>2</LineNum> <HeaderNum>2</HeaderNum></Row>
<Row><LineNum>3</LineNum> <HeaderNum>3</HeaderNum></Row>
<Row><LineNum>4</LineNum> <HeaderNum>1</HeaderNum></Row>
<Row><LineNum>5</LineNum> <HeaderNum>2</HeaderNum></Row>
<Row><LineNum>6</LineNum> <HeaderNum>3</HeaderNum></Row>
</ResultSet>
</ResultSets>
will return:
Result
<?xml version="1.0" encoding="UTF-8"?>
<output>
<header num="1">
<line>1</line>
<line>4</line>
</header>
<header num="2">
<line>2</line>
<line>5</line>
</header>
<header num="3">
<line>3</line>
<line>6</line>
</header>
</output>
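As a rough mental model, an xsl:key behaves like a lookup table built once over the document, and key() is a lookup into it. The Python sketch below mimics that with a dictionary over the same input; it is only an analogy (real processors may build their index differently), and the variable names are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<ResultSets>
<ResultSet>
<Row><MHeaderNum>1</MHeaderNum></Row>
<Row><MHeaderNum>2</MHeaderNum></Row>
<Row><MHeaderNum>3</MHeaderNum></Row>
</ResultSet>
<ResultSet>
<Row><LineNum>1</LineNum> <HeaderNum>1</HeaderNum></Row>
<Row><LineNum>2</LineNum> <HeaderNum>2</HeaderNum></Row>
<Row><LineNum>3</LineNum> <HeaderNum>3</HeaderNum></Row>
<Row><LineNum>4</LineNum> <HeaderNum>1</HeaderNum></Row>
<Row><LineNum>5</LineNum> <HeaderNum>2</HeaderNum></Row>
<Row><LineNum>6</LineNum> <HeaderNum>3</HeaderNum></Row>
</ResultSet>
</ResultSets>"""

header_set, line_set = ET.fromstring(XML).findall("ResultSet")

# Counterpart of <xsl:key name="line-by-header" match="ResultSet[2]/Row" use="HeaderNum"/>:
# index every line row by its HeaderNum value, in document order.
line_by_header = defaultdict(list)
for row in line_set.findall("Row"):
    line_by_header[row.findtext("HeaderNum")].append(row.findtext("LineNum"))

# Counterpart of the nested for-each with key('line-by-header', MHeaderNum).
grouped = {}
for row in header_set.findall("Row"):
    header_num = row.findtext("MHeaderNum")
    grouped[header_num] = line_by_header[header_num]

print(grouped)  # {'1': ['1', '4'], '2': ['2', '5'], '3': ['3', '6']}
```

The point of the index is that each header does one lookup instead of rescanning every line row, which is why a key tends to scale better than a predicate such as Row[HeaderNum=current()/MHeaderNum].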
You can try the following XSLT. It is using three XSLT templates.
Because desired output is unknown, I placed some arbitrary processing for each of the header and line item templates.
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<root>
<xsl:apply-templates/>
</root>
</xsl:template>
<xsl:template match="/ResultSets/ResultSet[1]">
<ResultSet1>
<xsl:for-each select="Row">
<r>
<xsl:value-of select="MHeaderNum"/>
</r>
</xsl:for-each>
</ResultSet1>
</xsl:template>
<xsl:template match="/ResultSets/ResultSet[2]">
<ResultSet2>
<xsl:for-each select="Row">
<xsl:copy-of select="."/>
</xsl:for-each>
</ResultSet2>
</xsl:template>
</xsl:stylesheet>

SQL query for Dynamic nodes in XML [duplicate]

Our table is like
StudentNo  Name  Subject  Mark  Grade
1          John  English  41    A
1          John  Hindi    42    B
We want an XML format from this table as follows.
<Student>
<Name>John</Name>
<Subject>
<English>
<Mark>41</Mark>
<Grade>A</Grade>
</English>
<Hindi>
<Mark>42</Mark>
<Grade>B</Grade>
</Hindi>
</Subject>
</Student>
Here the subject name nodes should be generated dynamically.
This is very similar to SQL Data as XML Element - so much so that I think it might be a duplicate - but I want to explain a bit more, for your context, why this isn't the best idea. In my answer to that question I show a really hacky way to do this, but I wouldn't recommend it.
Your XML will be nearly impossible to create a schema for. Any consumer of that XML will never be able to be sure what values might appear as elements. Rather than try to create dynamic elements, you should probably use attributes of some sort. You could even use xsi:type to create an abstract type of sorts in your XML (although in my example I'm just using a plain old attribute - you could pick whatever attribute makes the most sense for your consumers). The query for that XML would be:
declare @subjects TABLE(studentno int, name varchar(10), subject varchar(10), mark int, grade char(1))
INSERT @subjects
VALUES
(1, 'John','English', 41,'A'),
(1, 'John','Hindi', 42,'B')
select
s.Name
,(SELECT
s2.Subject as '@type'
,s2.Mark
,s2.Grade
FROM @subjects s2
WHERE s2.studentno = s.studentno
FOR XML PATH('Subject'), ROOT('Subjects'), TYPE)
from @subjects s
GROUP BY s.name, s.studentno
FOR XML PATH('Student')
produces:
<Student>
<Name>John</Name>
<Subjects>
<Subject type="English">
<Mark>41</Mark>
<Grade>A</Grade>
</Subject>
<Subject type="Hindi">
<Mark>42</Mark>
<Grade>B</Grade>
</Subject>
</Subjects>
</Student>
This XML will be possible to make sense of by consumers, where they can, for example, iterate the subjects without knowing what subjects might be there (and without needing to resort to assuming that every direct child of Subjects is in fact a subject and not some other type of node that got added in a new version of the schema).
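To make the consumer's side concrete, here is a small Python sketch that reads the attribute-based XML without knowing the subject names in advance. It is only an illustration of the argument above; the marks dictionary is a made-up structure for the example.

```python
import xml.etree.ElementTree as ET

STUDENT = """<Student>
<Name>John</Name>
<Subjects>
<Subject type="English"><Mark>41</Mark><Grade>A</Grade></Subject>
<Subject type="Hindi"><Mark>42</Mark><Grade>B</Grade></Subject>
</Subjects>
</Student>"""

student = ET.fromstring(STUDENT)

# The element name is fixed ("Subject"); only the attribute value varies,
# so one path expression covers every subject, present or future.
marks = {
    subject.get("type"): (int(subject.findtext("Mark")), subject.findtext("Grade"))
    for subject in student.findall("Subjects/Subject")
}

print(marks)  # {'English': (41, 'A'), 'Hindi': (42, 'B')}
```

With dynamic element names, the same consumer would have to treat every child of the container as a subject and read the tag name as data.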
If you really need that output, I'd prefer to use XSLT to transform the output above to your format, e.g.:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" encoding="UTF-8" indent="yes" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
<xsl:template match="Subject">
<xsl:element name="{@type}">
<xsl:apply-templates />
</xsl:element>
</xsl:template>
<xsl:template match="Subjects">
<xsl:element name="Subject">
<xsl:apply-templates />
</xsl:element>
</xsl:template>
</xsl:transform>
gets you
<?xml version="1.0" encoding="UTF-8"?>
<Student>
<Name>John</Name>
<Subject>
<English>
<Mark>41</Mark>
<Grade>A</Grade>
</English>
<Hindi>
<Mark>42</Mark>
<Grade>B</Grade>
</Hindi>
</Subject>
</Student>
Note you can't do this completely with SQL Server though - you'd have to resort to building the XML string and casting it as XML, as in my other answer.

Unable to understand some XSLT grouping code through <xsl:key and generate-id()

I have data in the below XML format
<item>
<title>Body Cleaner</title>
<vendor>Wipro</vendor>
<location>EMEA</location>
<manufacture_date>12/08/2010</manufacture_date>
<item_type>House Hold</item_type>
<item_type>Health</item_type>
</item>
<item>
<title>Sweet Catch up</title>
<vendor>Unilever</vendor>
<location>APAC</location>
<manufacture_date>21/07/2013</manufacture_date>
<item_type>House Hold</item_type>
<item_type>Kitchen</item_type>
</item>
(1) Below is code in a xsl file
<xsl:key name="groups" match="item_type" use="."/>
and
<xsl:apply-templates select="item/item_type[generate-id() = generate-id(key('groups', .)[1])]"/>
Here I am unable to understand the use of generate-id() in this particular case. What is the purpose of the code below?
[generate-id() = generate-id(key('groups', .)[1])]
(2) Below is code in another xsl file
<xsl:key name="vendors" match="item" use="vendor"/>
and
<xsl:apply-templates select="item[count(.|key('vendors',vendor)[1])=1]">
Here I am unable to understand the expression
item[count(.|key('vendors',vendor)[1])=1]
especially the purpose of .| in count.
Could somebody help me with the groundwork here so that I can further understand the XSLT code?
Thanks
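This is called Muenchian grouping. XSLT 1.0 has no built-in grouping.
Keys like this one are used by the key() function:
<xsl:key name="vendors" match="item" use="vendor"/>
The key() function returns a node-set from the document, using the index defined by an xsl:key element.
item[count(.|key('vendors',vendor)[1])=1]
This tests whether the node-set formed by the union (|) of the current node (.) and the first node returned by the key contains one node or two. A node-set never holds the same node twice, so the count is 1 only when the current item is itself the first item for its vendor. The predicate therefore picks exactly one item per group. The generate-id() comparison in your first example is the same identity test written differently. Once you have identified the groups this way, you can visit every node of a group through the key.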
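The idea behind the predicate can be mimicked in Python, where the identity test is simply the is operator. This is a sketch of the logic only, not of how an XSLT processor works internally; the items wrapper element and the third item ("Hand Wash") are invented so that one vendor has two items.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ITEMS = """<items>
<item><title>Body Cleaner</title><vendor>Wipro</vendor></item>
<item><title>Sweet Catch up</title><vendor>Unilever</vendor></item>
<item><title>Hand Wash</title><vendor>Wipro</vendor></item>
</items>"""

items = ET.fromstring(ITEMS).findall("item")

# Plays the role of <xsl:key name="vendors" match="item" use="vendor"/>.
vendors = defaultdict(list)
for item in items:
    vendors[item.findtext("vendor")].append(item)

# Plays the role of item[count(. | key('vendors', vendor)[1]) = 1]:
# keep an item only if it is the very same node as the first one in its group.
group_heads = [item for item in items if item is vendors[item.findtext("vendor")][0]]

# For each group head, walk the whole group through the "key".
summary = [
    (head.findtext("vendor"),
     [member.findtext("title") for member in vendors[head.findtext("vendor")]])
    for head in group_heads
]

print(summary)  # [('Wipro', ['Body Cleaner', 'Hand Wash']), ('Unilever', ['Sweet Catch up'])]
```

XPath 1.0 has no identity operator, which is why the stylesheet has to express "is the same node" indirectly, either by counting a union or by comparing generate-id() values.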

How costly is usage of unnecessary variables in XSLT?

This is more of a request for clarification.
As per this answer on a question, XSLT variables are cheap! My question is: is this statement valid in all scenarios? Short-lived variables that are created and destroyed within four lines of code aren't bothersome, but loading a root node or child entities into a variable is, in my opinion, bad practice.
I have two XSLT files, designed for same input and output requirement:
XSLT1 (without unnecessary variable):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<Collection>
<xsl:for-each select="CATALOG/CD">
<DVD>
<Cover>
<xsl:value-of select="string(TITLE)"/>
</Cover>
<Author>
<xsl:value-of select="string(ARTIST)"/>
</Author>
<BelongsTo>
<xsl:value-of select="concat(concat(string(COUNTRY), ' '), string(COMPANY))"/>
</BelongsTo>
<SponsoredBy>
<xsl:value-of select="string(COMPANY)"/>
</SponsoredBy>
<Price>
<xsl:value-of select="string(number(string(PRICE)))"/>
</Price>
<Year>
<xsl:value-of select="string(floor(number(string(YEAR))))"/>
</Year>
</DVD>
</xsl:for-each>
</Collection>
</xsl:template>
</xsl:stylesheet>
XSLT2 (with unnecessary variable "root" in which whole XML is loaded):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="root" select="."/>
<Collection>
<xsl:for-each select="$root/CATALOG/CD">
<DVD>
<Cover>
<xsl:value-of select="string(TITLE)"/>
</Cover>
<Author>
<xsl:value-of select="string(ARTIST)"/>
</Author>
<BelongsTo>
<xsl:value-of select="concat(concat(string(COUNTRY), ' '), string(COMPANY))"/>
</BelongsTo>
<SponsoredBy>
<xsl:value-of select="string(COMPANY)"/>
</SponsoredBy>
<Price>
<xsl:value-of select="string(number(string(PRICE)))"/>
</Price>
<Year>
<xsl:value-of select="string(floor(number(string(YEAR))))"/>
</Year>
</DVD>
</xsl:for-each>
</Collection>
</xsl:template>
</xsl:stylesheet>
Approach-2 is what exists in the live system, and in fact the XML ranges from several KB to a few MB. In the XSLT, the use of variables extends to child entities as well.
To put forth my proposal to change the approach, I need to verify the theory behind it.
As per my understanding, in the case of approach-2 the system reloads the XML data into memory over and over (when multiple variables load child entities, the situation gets worse), thereby slowing down the transformation.
Before posting this question I timed the two XSLTs. The first approach takes a few milliseconds less than approach-2. (I used copies of the XML file to test the two XSL files, to avoid interference from the system cache.) But the system cache might still play a confusing role here.
Despite this analysis, I still have a question: do we really need to avoid variables? And as far as my system is concerned, is it worth modifying the live XSLT files to use approach-1?
Or are XSLT variables different from variables in other programming languages (in case I'm not aware)? For example, do XSLT variables not actually store the data when you do select="." but rather point to it? If so, I could continue using XSLT variables without hesitation.
What is your suggestion on this?
Quick Info on current system:
Host Programming Language or System: Siebel (C++ is the backend code)
XSLT Processor: Xalan (unless Saxon is used explicitly)
I agree with the comments made that you need to measure performance with your particular XSLT processor.
But your expectation that with approach-2 the "system is reloading the XML data over and over in memory" seems wrong to me. The XSLT processor builds an input tree of the primary input XML document anyway, and I can't imagine that any implementation would load the document again for <xsl:variable name="root" select="."/>. That would even be wrong, because node identity and generate-id() would no longer work. The variable simply keeps a reference to the document node of the existing input tree.
Of course, in your sample, where you have a single input document and a single template whose current node is the document node anyway, the variable is superfluous. But there are cases where you need to store the document node of the primary input document, in particular when you deal with multiple documents.