I have a table with 25,000 rows. Table Audit (Id int identity(1,1), AdditionalInfo xml)
The sample data in the AdditionalInfo column for a row looks like this:
<Audit version="1">
<Context name="Event">
<Action name="OrganizationEventReceived">
<Input>
<Source type="SourceOrganizationId">77d2678b-ea4a-43ad-816b-c63edf206b08</Source>
<Target type="TargetOrganizationId">b98fd3ae-dbcb-4826-9d92-7e445ad61273,b98fd3ae-dbcb-4826-9d92-7e445ad61273,b98fd3ae-dbcb-4826-9d92-7e445ad61273</Target>
</Input>
</Action>
</Context>
</Audit>
I would like to shred the XML and collect the data into an output dataset with the following query:
SELECT Id,
       p.value('(@name)[1]', 'nvarchar(100)') AS TargetAction,
       p.value('(Input/Source/text())[1]', 'nvarchar(500)') AS Source,
       p.value('(Input/Target/text())[1]', 'nvarchar(max)') AS Target
FROM dbo.Audit
CROSS APPLY AdditionalInfo.nodes('/Audit/Context/Action') AS AdditionalInfo(p)
The performance of the query is bad: it takes 15 seconds to return the result set for just 25,000 rows. Is there a better way of doing it? I even tried putting primary and secondary XML indexes on the AdditionalInfo column. Please help and let me know how to use better SQL Server XQuery techniques.
Thanks,
Great question.
My recent task required parsing about 35,000 XML documents, a valid document being ~20 kB.
More and larger XML files tend to fill memory exponentially:
100 documents: 0:33
1000 documents: 25:00 😵💫
Try to distribute your work:
The Target variable stores unstructured data, which eats most of the computing power due to the data type and the varying length of its values.
The depth of the path in nodes() under CROSS APPLY matters: avoid three levels in nodes(); consider two levels and recursion (see below on split).
Batch mode: process several documents at a time, e.g. WHERE Id IN (1, 2, 3) (see the sketch below).
Loop over a list of documents instead of processing the whole table in one statement.
Parse using local variables, e.g. DECLARE @xml_doc XML; SET @xml_doc = (SELECT xmldata FROM xmlsource WHERE id = 1);
Avoid exporting XML node content; write out only the result values.
Parse all elements separately: preserve the order of elements with ROW_NUMBER(), then LEFT JOIN all the parts back to the list of XML documents using some identifier, such as xml_id.
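For instance, here is a minimal sketch combining the batch filter with a shallower nodes() path against the dbo.Audit table from the question (the batch boundaries and variable names are illustrative assumptions):
-- Shred only a slice of rows per run, and stop nodes() one level higher,
-- reading the Action attribute/children through relative paths in value().
DECLARE @from INT = 1, @to INT = 1000;   -- assumed batch boundaries

SELECT a.Id,
       p.value('(Action/@name)[1]',               'nvarchar(100)') AS TargetAction,
       p.value('(Action/Input/Source/text())[1]', 'nvarchar(500)') AS Source,
       p.value('(Action/Input/Target/text())[1]', 'nvarchar(max)') AS Target
FROM dbo.Audit AS a
CROSS APPLY a.AdditionalInfo.nodes('/Audit/Context') AS x(p)
WHERE a.Id BETWEEN @from AND @to;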
I have an XML file with many records as nodes in it. I need to save each record, in XML format, to a SQL Server table in a column of XML datatype.
I can perform this task in SSIS by using the "XML Task Editor" to count all the nodes, then using a "For Loop Container", reading each node value with the "XML Task Editor" and saving it to the database.
Another option is using a Script Task, reading the XML file and saving each node in a loop.
Please suggest a better approach that is efficient with big files.
Below is a sample of the input XML file. I need to save each full "RECORD" node (3 records in the example below) in XML form in a SQL Server database table which has a column with the xml datatype.
I would suggest a 2-step approach.
Use the SSIS Import Column Transformation in a Data Flow Task to load each entire XML file into a single-row, single-column staging table.
Use a stored procedure to produce individual RECORD XML fragments as separate rows and INSERT them into the final destination table.
SQL
DECLARE @staging_tbl TABLE (id INT IDENTITY PRIMARY KEY, xmldata XML);
INSERT INTO @staging_tbl (xmldata) VALUES
(N'<root>
<RECORD UI="F298AF1F"></RECORD>
<RECORD UI="4C6AAA65"></RECORD>
</root>');
-- INSERT INTO destination_table (ID, xml_record)
SELECT id
, c.query('.') AS xml_record
FROM @staging_tbl
CROSS APPLY xmldata.nodes('/root/RECORD') AS t(c);
Output
id | xml_record
---+--------------------------
 1 | <RECORD UI="F298AF1F" />
 1 | <RECORD UI="4C6AAA65" />
You can use the nodes() method to return a rowset of nodes in the xml document. This is the simplest example:
select node_table.xml_node_column.query('.') node
from xmldocument
cross apply xmldocument.nodes('/root/RECORD') node_table(xml_node_column)
https://learn.microsoft.com/en-us/sql/t-sql/xml/nodes-method-xml-data-type?view=sql-server-ver16
I have XML data like this
DECLARE @input XML =
'<LicensingReportProcessResult>
<LicensingReport>
<Address key="3845HoopaLnLasVegasNV89169-3350U.S.A.">
<LineOne>3845 Hoopa Ln</LineOne>
<CityName>Las Vegas</CityName>
<StateOrProvinceCode>NV</StateOrProvinceCode>
<PostalCode>89169-3350</PostalCode>
<CountryCode>U.S.A.</CountryCode>
</Address>
<Person key="PersonPRI711284842">
<ExternalIdentifier>
<TypeCode>NAICProducerCode</TypeCode>
<Id>8001585</Id>
</ExternalIdentifier>
<BirthDate>1961-07-29</BirthDate>
</Person>
</LicensingReport>
</LicensingReportProcessResult>'
My T-SQL code to extract one specific set of elements:
-- extract into temp table
INSERT INTO #Address
SELECT
    Tbl.Col.value('(LineOne/text())[1]', 'NVARCHAR(100)'),
    Tbl.Col.value('(CityName/text())[1]', 'NVARCHAR(100)'),
    Tbl.Col.value('(StateOrProvinceCode/text())[1]', 'NVARCHAR(100)'),
    Tbl.Col.value('(PostalCode/text())[1]', 'NVARCHAR(100)'),
    Tbl.Col.value('(CountryCode/text())[1]', 'NVARCHAR(100)')
FROM
    @input.nodes('/LicensingReportProcessResult/LicensingReport/Address') Tbl(Col)
-- verify results
SELECT * FROM #Address
I want to insert different elements' data into separate tables: Address data into an Address table and Person data into a Person table. As new elements are added, I want to save their data into separate tables as well.
Can someone help?
Are you asking how to dynamically define new tables for the top-level XML elements in a document? You can do that with any XML serialization library that reads a document and returns the elements and attributes as a tree; from that metadata you create a table definition that you then execute in SQL (a rough T-SQL sketch of the idea follows).
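If you would rather stay inside T-SQL instead of a client-side library, a sketch of the same idea might look like this (my illustration, not part of the original suggestion: the dbo.Address table name and the blanket NVARCHAR(100) column type are assumptions):
DECLARE @input XML = N'
<LicensingReportProcessResult>
  <LicensingReport>
    <Address key="3845HoopaLnLasVegasNV89169-3350U.S.A.">
      <LineOne>3845 Hoopa Ln</LineOne>
      <CityName>Las Vegas</CityName>
      <StateOrProvinceCode>NV</StateOrProvinceCode>
      <PostalCode>89169-3350</PostalCode>
      <CountryCode>U.S.A.</CountryCode>
    </Address>
  </LicensingReport>
</LicensingReportProcessResult>';

DECLARE @ddl NVARCHAR(MAX);

-- one column per child element of the first <Address> node
SELECT @ddl = N'CREATE TABLE dbo.Address (' + STUFF(
    (SELECT N', ' + QUOTENAME(c.value('local-name(.)', 'NVARCHAR(128)')) + N' NVARCHAR(100)'
     FROM @input.nodes('/LicensingReportProcessResult/LicensingReport/Address[1]/*') AS t(c)
     FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'),
    1, 2, N'') + N');';

PRINT @ddl;                      -- inspect the generated definition first
-- EXEC sys.sp_executesql @ddl;  -- then execute it to create the table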
Also consider simply storing your data as XML, perhaps with a defined schema, and then writing queries or views that extract the various elements using XPath or the XML data type methods, as you already do, instead of extracting into physical tables.
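As a sketch of that alternative, assuming the raw documents are kept in a table such as dbo.LicensingReportXml (the table and view names here are my own placeholders), a view can shred the <Person> elements on demand instead of persisting them:
CREATE TABLE dbo.LicensingReportXml (report_id INT PRIMARY KEY, doc XML);
GO
CREATE VIEW dbo.PersonView
AS
SELECT r.report_id,
       p.value('@key',                                    'NVARCHAR(100)') AS PersonKey,
       p.value('(ExternalIdentifier/TypeCode/text())[1]', 'NVARCHAR(100)') AS ExternalIdType,
       p.value('(ExternalIdentifier/Id/text())[1]',       'NVARCHAR(100)') AS ExternalId,
       p.value('(BirthDate/text())[1]',                   'DATE')          AS BirthDate
FROM dbo.LicensingReportXml AS r
CROSS APPLY r.doc.nodes('/LicensingReportProcessResult/LicensingReport/Person') AS t(p);
GO
A similar view over the /Address path would replace the #Address temp table shown above.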
PostgreSQL v12.5
There is a table with a single column containing strings formatted as XML.
create table XMLDATA (value text);
--------
text
--------
<something> <a>uyt</a> <b>xyz</b> </something>
<something> <a>ryu</a> <b>sdg</b> </something>
For simplicity, let's assume that there is no nesting: all tags inside <something> contain primitive values (strings).
Assuming that there are many more elements than just <a> and <b> inside, it would be great to have a way to convert these values into relational form without enumerating all of the nested tags manually.
I tried to find something in the documentation on XPATH, XMLTABLE and XPATH_TABLE, but the small number of examples did not help me discover the full power of these functions.
What I am looking for is a function special_function with results like
select * from special_function(XMLDATA);
a | b
-----------
uyt | xyz
ryu | sdg
Could you help me find functionality in PostgreSQL that automatically recognizes XML tags and converts their content into columns?
without enumerating all of the nested tags manually.
That's not possible.
One fundamental restriction of SQL is that the number, data types and names of all columns need to be known to the database before the query starts executing. SQL can't do this "at runtime" and change the structure of the query based on the data that is retrieved.
You can extract the content using xmltable() - but as explained, there is no way around specifying each output column.
select x.*
from xmldata d
cross join xmltable('/something' passing d.value
columns a text path 'a',
b text path 'b') as x
This assumes value is declared with the data type xml (which it should be). If that's not the case, you need to cast it: passing d.value::xml
I have a database column containing a string that might look something like this for a particular row: u/1u/3u/19/g1/g4
Is there a performant way to get all rows that have at least one of the following elements ['u/3', 'g4'] in that column?
I know I can use AND clauses, but the number of elements to verify against varies and could become large.
I am using RoR/ActiveRecord in my project.
In SQL Server, you can use XML to convert your list of search params into a record set, then cross apply that with the base table and use charindex() to see if the column contains the substring.
Since I don't know your table or column names, I used a table (persons) that I already had data in, which has a column phone_home. To search for any phone number that contains '202' or '785', I would use this query:
select person_id,phone_home,Split.data.value('.', 'VARCHAR(10)')
from (select *, cast('<n>202</n><n>785</n>' as XML) as myXML
from persons) as data cross apply myXML.nodes('/n') as Split(data)
where charindex(Split.data.value('.', 'VARCHAR(10)'),data.phone_Home) > 0
You will get duplicate records if a row matches more than one value, so throw a DISTINCT in there and remove the Split value from the select statement if that is not desired.
Using XML in SQL is voodoo magic to me... I got the idea from this post: http://www.sqljason.com/2010/05/converting-single-comma-separated-row.html
No idea what the performance is like... but at least there aren't any cursors or dynamic SQL.
EDIT: Casting the XML is pretty slow, so I made it a variable so it only gets cast once.
declare @xml XML
set @xml = cast('<n>202</n><n>785</n>' as XML)
select person_id, phone_home, Split.persons.value('.', 'VARCHAR(10)')
from persons cross apply @xml.nodes('/n') as Split(persons)
where charindex(Split.persons.value('.', 'VARCHAR(10)'), phone_Home) > 0
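For example, the de-duplicated form mentioned above (DISTINCT, with the Split value dropped from the select list) would look roughly like this:
declare @xml XML
set @xml = cast('<n>202</n><n>785</n>' as XML)

-- DISTINCT collapses rows that match more than one of the search values
select distinct person_id, phone_home
from persons cross apply @xml.nodes('/n') as Split(persons)
where charindex(Split.persons.value('.', 'VARCHAR(10)'), phone_Home) > 0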
I have written an SQL query like:
select field1, field2 from table_name;
The problem is that this query will return 1 million records, or more than 100k records.
I have a directory containing input files (around 20,000 to 50,000 records) that contain field1. This is the main data I am concerned with.
Using a Perl script, I am extracting field1 from the directory.
But if I write a query like:
select field1, field2 from table_name
where field1 in (need to write a query to take field1 from the directory);
If I use an IN clause, it has a limitation of processing 1000 entries, so how should I overcome the limitation of the IN clause?
In any DBMS, I would insert them into a temporary table and perform a JOIN to work around the IN clause limitation on the size of the list.
E.g.
CREATE TABLE #idList
(
ID INT
)
INSERT INTO #idList VALUES(1)
INSERT INTO #idList VALUES(2)
INSERT INTO #idList VALUES(3)
SELECT *
FROM
MyTable m
JOIN #idList AS t
ON m.id = t.id
In SQL Server 2005, in one of our previous projects, we used to convert the list of values resulting from querying another data store (a Lucene index) into XML, pass it as an XML variable to the SQL query, convert it into a table using the nodes() function on the XML data type, and perform a JOIN with that.
DECLARE @IdList XML
SELECT @IdList = '
<Requests>
<Request id="1" />
<Request id="2" />
<Request id="3" />
</Requests>'
SELECT *
FROM
MyTable m
JOIN (
SELECT id.value('(@id)[1]', 'INT') as 'id'
FROM @IdList.nodes('/Requests/Request') as T(id)
) AS t
ON m.id = t.id
Vikdor is right: you shouldn't be querying this with an IN() clause; it's faster and more memory-efficient to JOIN against a table.
Expanding on his answer, I would recommend the following approach:
1. Get a list of all input files via Perl.
2. Think of some clever way to compute a hash value for your list that is unique and based on all input files (I'd recommend hashing the filenames or similar).
3. This hash will serve as the name of the table that stores the input filenames (think of it as a quasi-temporary table that gets discarded once the hash changes).
4. JOIN that table to return the correct records.
For step 2 you could either use a cronjob or compute the hash whenever the query is actually needed (which would delay the response, though). To get this right you need to consider how likely it is that files are added/removed.
For step 3 you would need some logic that drops the previously generated tables once the current hash value differs from the last execution, then recreates the table named after the current hash.
For the quasi-temporary table names I'd recommend something along the lines of
input_files_XXX (i.e. prefix_<hashvalue>),
which makes it easier to know which stale tables to drop. A rough sketch of steps 3 and 4 follows.
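A T-SQL-flavoured sketch of steps 3 and 4 (the hash value, table and column names are illustrative assumptions; the Perl side is expected to compute the hash and bulk-insert the field1 values into the freshly created table):
DECLARE @hash SYSNAME = N'a1b2c3';                 -- computed from the input file names
DECLARE @tbl  SYSNAME = N'input_files_' + @hash;
DECLARE @sql  NVARCHAR(MAX) = N'';

-- drop stale quasi-temporary tables left over from earlier hashes
SELECT @sql = @sql + N'DROP TABLE ' + QUOTENAME(name) + N'; '
FROM sys.tables
WHERE name LIKE N'input[_]files[_]%' AND name <> @tbl;
EXEC sys.sp_executesql @sql;

-- create the table for the current hash if it is not there yet
SET @sql = N'IF OBJECT_ID(N''' + @tbl + N''') IS NULL
    CREATE TABLE ' + QUOTENAME(@tbl) + N' (field1 VARCHAR(100) PRIMARY KEY);';
EXEC sys.sp_executesql @sql;

-- once the Perl script has loaded the field1 values, the lookup becomes a JOIN
SET @sql = N'SELECT m.field1, m.field2
    FROM table_name AS m
    JOIN ' + QUOTENAME(@tbl) + N' AS f ON f.field1 = m.field1;';
EXEC sys.sp_executesql @sql;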
You could split your 50,000 ids into 50 lists of 1,000 ids, run a query for each such list, and collect the result sets in Perl.
Oracle-wise, the best solution (better than using a temporary table, which without indexing won't give you much performance) is to use a nested table type.
CREATE TYPE my_ntt IS TABLE OF directory_rec;
Then create a function f1 that returns a variable of the my_ntt type and use it in the query:
select field1, field2 from table_name where field1 in (select * from table(cast(f1() as my_ntt)));