How to download a webpage and parse it in SQL

I am simply trying to download a webpage and store it in an accessible format in SQL Server 2012. I have resorted to using dynamic SQL, but perhaps there is a cleaner, easier way to do this. I have been able to successfully download the htm files to my local drive using the below code, but I am having difficulty working with the html itself. I am trying to convert the webpage to XML and parse from there, but I think I am not addressing the HTML to XML conversion properly.
I get the following error, "Parsing XML with internal subset DTDs not allowed. Use CONVERT with style option 2 to enable limited internal subset DTD support"
DECLARE @URL NVARCHAR(500);
DECLARE @Ticker NVARCHAR(10)
DECLARE @DynamicTickerNumber INT
SET @DynamicTickerNumber = 1
CREATE TABLE Parsed_HTML(
[Date] DATETIME
,[Ticker] VarChar (8)
,[NodeName] VarChar (50)
,[Value] NVARCHAR (50));
WHILE @DynamicTickerNumber <= 2
BEGIN
SET @Ticker = (SELECT [Ticker] FROM [Unique Tickers Yahoo] WHERE [Unique Tickers Yahoo].[Ticker Number]= @DynamicTickerNumber)
SET @URL ='http://finance.yahoo.com/q/ks?s=' + @Ticker + '+Key+Statistics'
DECLARE @cmd NVARCHAR(250);
DECLARE @tOutput TABLE(data NVARCHAR(100));
DECLARE @file NVARCHAR(MAX);
SET @file='D:\Ressources\Execution Model\Execution Model for SQL\DB Temp\quoteYahooHTML.htm'
SET @cmd ='powershell "(new-object System.Net.WebClient).DownloadFile('''+@URL+''','''+@file+''')"'
EXEC master.dbo.xp_cmdshell @cmd, no_output
CREATE TABLE XmlImportTest
(
xmlFileName VARCHAR(300),
xml_data xml
);
DECLARE @xmlFileName VARCHAR(300)
SELECT @xmlFileName = 'D:\Ressources\Execution Model\Execution Model for SQL\DB Temp\quoteYahooHTML.htm'
EXEC('
INSERT INTO XmlImportTest(xmlFileName, xml_data)
SELECT ''' + @xmlFileName + ''', xmlData
FROM
(
SELECT *
FROM OPENROWSET (BULK ''' + @xmlFileName + ''' , SINGLE_BLOB) AS XMLDATA
) AS FileImport (XMLDATA)
')
DECLARE @x XML;
DECLARE @string VARCHAR(MAX);
SET @x = (SELECT xml_data FROM XmlImportTest)
SET @string = CONVERT(VARCHAR(MAX), @x, 1);
INSERT INTO [Parsed_HTML] ([NodeName], [Value])
SELECT [NodeName], [Value] FROM dbo.XMLTable(@string)
--above references XMLTable Parsing function that works consistently
END
Unfortunately this needs to be run within the confines of SQL Server, and my understanding is that the HTML Agility Pack is not immediately compatible. I also notice that the intermediate table, XMLimportTest, never gets populated, so this is likely not a function of malformed HTML.

Short answer: don't.
SQL is very good for some things but for downloading and parsing HTML it's a terrible choice. In your example you're using PowerShell to download the file, why not parse the HTML in PowerShell too? Then you could write the parsed data into something like a CSV file and load that in using OPENROWSET.
Another option, still not using SQL but a bit more within SQL Server might be to use a .Net SP via SQL CLR.
As a few of the comments point out, if you could guarantee the HTML was well formed XML then you could use SQL XML functionality to parse it, but web pages are rarely well formed XML so this would be a risky choice.
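To make the well-formedness point concrete, here is a minimal sketch of how one might test whether a downloaded string parses as XML before trying to shred it. It assumes SQL Server 2012 or later (TRY_CONVERT needs compatibility level 110 or higher), and the two sample strings are made up for illustration. Style 2 is the CONVERT option that the error message in the question refers to.

```sql
-- Hypothetical sample inputs; neither comes from the page in the question.
DECLARE @withDtd   NVARCHAR(MAX) = N'<!DOCTYPE doc [<!ENTITY co "Contoso">]><doc>&co;</doc>';
DECLARE @malformed NVARCHAR(MAX) = N'<html><body><br></body></html>';  -- <br> is never closed

SELECT
    -- Style 2 enables limited internal DTD subset support, so the DOCTYPE is accepted.
    TRY_CONVERT(XML, @withDtd, 2)   AS ParsedWithStyle2,
    -- Not well-formed XML, so this should come back as NULL instead of raising an error.
    TRY_CONVERT(XML, @malformed, 2) AS ParsedMalformed;
```

Real-world HTML will often end up in the NULL case, which is why parsing it outside of SQL tends to be the safer route.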

Related

Bulk import of a huge XML into an SQL Cell

I'm trying to import an XML file into a SQL cell to process it. My first idea is to use OPENROWSET to load the XML and then split it up with nodes(). One of the XML files is too large to fit in a cell, so OPENROWSET cuts the XML off, which makes it impossible to work with afterwards. This is the code:
DECLARE @XMLwithOpenXML TABLE
(
Id INT IDENTITY PRIMARY KEY,
XMLData XML,
LoadedDateTime DATETIME
)
INSERT INTO @XMLwithOpenXML(XMLData, LoadedDateTime)
SELECT CONVERT(XML, BulkColumn) AS BulkColumn
,GETDATE()
FROM OPENROWSET(BULK 'C:\temp\PP015.xml', SINGLE_CLOB) AS x;
SELECT * FROM @XMLwithOpenXML
The second option is to use BCP to do the same, but I'm getting an error.
DECLARE @sql NVARCHAR(500)
SET @sql = 'bcp [ExternaDB].[dbo].[xmltab] IN "C:\temp\PP015.xml" -T -c'
EXEC xp_cmdshell @sql
select * from xmltab
I want to know whether I'm on the right track (I already know how to work with XML once it is in a SQL cell) and how I can bulk import the full XML into a cell without a length constraint.
What is the size of the XML file on the file system?
Please try the following solution. It is very similar to yours with three differences:
SINGLE_BLOB instead of SINGLE_CLOB
No need for CONVERT(XML, BulkColumn)
DEFAULT clause is used for the LoadedDateTime column
Additionally, you can use SSIS for the task. SSIS has a streaming XML Source Adapter with no XML file size limitation.
DECLARE @tbl TABLE(
ID INT IDENTITY PRIMARY KEY,
XmlData XML,
LoadedDateTime DATETIME DEFAULT (GETDATE())
);
INSERT INTO @tbl(XmlData)
SELECT BulkColumn
FROM OPENROWSET(BULK N'C:\temp\PP015.xml', SINGLE_BLOB) AS x;
SELECT * FROM @tbl;
Thanks for the help, but I found the solution. SQL Server has a configured maximum number of characters retrieved for XML data. To solve this issue, we just have to reconfigure this parameter.
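Since the limit turned out to apply to how many characters are retrieved rather than to what is stored, one way to check whether a large document really arrived intact is to measure it instead of reading it in the results pane. The following is only a sketch: it builds a large XML value in memory (the element names are made up) so that it runs without any file on disk.

```sql
-- Build roughly 6 MB of XML in memory; <root> and <row> are hypothetical names.
DECLARE @big XML =
    CAST('<root>'
         + REPLICATE(CAST('<row>x</row>' AS VARCHAR(MAX)), 500000)
         + '</root>' AS XML);

SELECT
    DATALENGTH(@big)                      AS StoredBytes,  -- size of the stored XML value
    @big.value('count(/root/row)', 'INT') AS RowNodes;     -- 500000 if nothing was cut off
```

The same two expressions can be pointed at the XmlData column of the table variable above to confirm that the imported file is complete.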

Want to get the email information from XML, but getting an error

CREATE TABLE XMLTABLE(id int IDENTITY PRIMARY KEY,XML_DATA XML,DATE DATETIME);
go
INSERT INTO XMLTABLE(XML_DATA,DATE)
SELECT CONVERT(XML,BULKCOLUMN)AS DATA,getdate()
FROM OPENROWSET(BULK 'c:\Demo.xml',SINGLE_BLOB)AS x
go
DECLARE @XML AS XML
DECLARE @OUPT AS INT
DECLARE @SQL NVARCHAR (MAX)
SELECT @XML= XML_DATA FROM XMLTABLE
EXEC sp_xml_preparedocument @OUPT OUTPUT,@XML,'<root xmlns:d="http://abc" xmlns:ns2="http://def" />'
SELECT EMAILR
FROM OPENXML(@OUPT,'d:ns2:FORM/ns2:Form1/ns2:Part/ns2:Part1/ns2:Ba')
WITH
(EMAILR [VARCHAR](100) 'ns2:EmailAddress')
EXEC sp_xml_removedocument @OUPT
go
i.e Demo.xml contains>>
<ns2:FORM xmlns="http://abc" xmlns:ns2="http://def">
<ns2:Form1>
<ns2:Part>
<ns2:Part1>
<ns2:Ba>
<ns2:EmailA>Hello@YAHOO.COM</ns2:EmailA> ...
Error:Msg 6603, Level 16, State 2, Line 6 XML parsing error: Expected
token 'eof' found ':'.
d:ns2-->:<--FORM/ns2:Form1/ns2:Part/ns2:Part1/ns2:Ba
The approach with the sp_xml_... procedures and FROM OPENXML is outdated!
You should rather use the current XML methods .nodes(), .value(), .query() and .modify().
Your XML example is neither complete nor valid, so I had to change it a bit to make it work. You'll probably have to adapt the XPath (at least Part1 is missing).
DECLARE @xml XML=
'<ns2:FORM xmlns="http://abc" xmlns:ns2="http://def">
<ns2:Form1>
<ns2:Part>
<ns2:Ba>
<ns2:EmailA>Hello@YAHOO.COM</ns2:EmailA>
</ns2:Ba>
</ns2:Part>
</ns2:Form1>
</ns2:FORM> ';
This is the secure way with namespaces and full path
WITH XMLNAMESPACES(DEFAULT 'http://abc'
,'http://def' AS ns2)
SELECT @xml.value('(/ns2:FORM/ns2:Form1/ns2:Part/ns2:Ba/ns2:EmailA)[1]','nvarchar(max)');
And this is the lazy approach
SELECT @xml.value('(//*:EmailA)[1]','nvarchar(max)')
You should - however - prefer the full approach. The more detail you give, the better and faster the result...
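If the document contains more than one address, .nodes() can be combined with .value() to return one row per element. This is a small sketch along the lines of the sample above; the two addresses and the repeated Ba element are made up for illustration.

```sql
-- Hypothetical sample with two <ns2:Ba> elements.
DECLARE @xml XML = N'<ns2:FORM xmlns="http://abc" xmlns:ns2="http://def">
  <ns2:Form1>
    <ns2:Part>
      <ns2:Ba><ns2:EmailA>first@example.com</ns2:EmailA></ns2:Ba>
      <ns2:Ba><ns2:EmailA>second@example.com</ns2:EmailA></ns2:Ba>
    </ns2:Part>
  </ns2:Form1>
</ns2:FORM>';

-- .nodes() yields one row per <ns2:Ba>; .value() reads the email relative to that row.
WITH XMLNAMESPACES ('http://def' AS ns2)
SELECT ba.value('(ns2:EmailA/text())[1]', 'NVARCHAR(100)') AS Email
FROM @xml.nodes('/ns2:FORM/ns2:Form1/ns2:Part/ns2:Ba') AS t(ba);
```

This should return two rows, one per Ba element.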

How to keep client databases consistent whenever a coder changes stored proc/table definition/views/triggers etc

I have asked this question before (here), but it never solved my problems.
Here is the scenario:
1. A coder modifies a stored proc/table definition/views etc on his "development server"
2. The modified T-SQL code is tested and passed by another team
3. Now the tested T-SQL code needs to be updated in 20 client databases. (Which is an extremely tough task).
4. Currently, we copy and paste the T-SQL code into every db individually. This also results in errors, which are resolved only when the client complains.
We are using SQL Server 2012, and I guess using schemas may resolve this issue. But I don't know how to do it.
You can probably use the query below. The only requirements are that you have access to all the databases and that all those DBs are on the same server.
-- Provide DB names as CSV
DECLARE @DBNames VARCHAR(MAX) = 'ExpDB,ExpDB_DUP'
-- Provide Your Update Script here
DECLARE @Script VARCHAR(MAX) = 'CREATE TABLE TestTab (Id int IDENTITY(1,1) NOT NULL,
Value nvarchar(50) NULL)'
DECLARE @DBNamesTab TABLE (DBName VARCHAR(128))
-- Split the CSV list into rows by turning it into XML
INSERT INTO @DBNamesTab
SELECT LTRIM(RTRIM(m.n.value('.[1]','varchar(128)'))) AS DBName
FROM
(
SELECT CAST( '<XMLRoot><RowData>'
+ REPLACE(@DBNames,',','</RowData><RowData>')
+ '</RowData></XMLRoot>' AS XML) AS x
)t
CROSS APPLY x.nodes('/XMLRoot/RowData')m(n)
DECLARE @DBName VARCHAR(128)
DECLARE @ScriptExe VARCHAR(MAX)
DECLARE dbNameCursor CURSOR FOR SELECT DBName FROM @DBNamesTab
OPEN dbNameCursor
FETCH NEXT FROM dbNameCursor INTO @DBName
WHILE @@FETCH_STATUS = 0
BEGIN
-- QUOTENAME guards against database names that contain spaces or brackets
SET @ScriptExe = 'USE ' + QUOTENAME(@DBName) + ' ' + @Script
EXEC(@ScriptExe)
FETCH NEXT FROM dbNameCursor INTO @DBName
END
CLOSE dbNameCursor;
DEALLOCATE dbNameCursor;
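One limitation of prefixing the script with USE is that statements such as CREATE PROCEDURE must be the first statement in their batch. A possible variation, sketched below, is to call sp_executesql through a three-part name so that the script runs as its own batch inside each target database. The procedure dbo.usp_Demo is a made-up example and the database names are the ones from the script above; the TRY/CATCH block is an addition so that one failing database is reported instead of going unnoticed.

```sql
-- Hypothetical script to deploy; dbo.usp_Demo is an illustrative name only.
DECLARE @Script NVARCHAR(MAX) =
    N'CREATE PROCEDURE dbo.usp_Demo AS SELECT DB_NAME() AS RunsIn;';
DECLARE @DBName SYSNAME;
DECLARE @Proc   NVARCHAR(300);

DECLARE dbCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE name IN (N'ExpDB', N'ExpDB_DUP')   -- target databases
      AND state_desc = N'ONLINE';

OPEN dbCursor;
FETCH NEXT FROM dbCursor INTO @DBName;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- [db].sys.sp_executesql runs the statement in the context of that database
    SET @Proc = QUOTENAME(@DBName) + N'.sys.sp_executesql';
    BEGIN TRY
        EXEC @Proc @Script;
        PRINT N'OK: ' + @DBName;
    END TRY
    BEGIN CATCH
        PRINT N'FAILED: ' + @DBName + N' - ' + ERROR_MESSAGE();
    END CATCH;
    FETCH NEXT FROM dbCursor INTO @DBName;
END
CLOSE dbCursor;
DEALLOCATE dbCursor;
```

Like the original script, this assumes all databases live on the same server and that the caller has the necessary permissions in each of them.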

How to run a more than 8000 characters SQL statement from a variable?

I can use the following code for tiny little queries:
DECLARE @sql VARCHAR(8000)
SET @sql = 'SELECT * FROM myTable'
EXEC (@sql)
The above method is very useful in order to maintain large amounts of code, especially when we need to make changes once and have them reflected everywhere.
My problem is my query (it's only one single query) that I want to feed into the #sql variable uses more than 25 table joins, some of them on temporary table variables, incorporates complex operations and it is hence much more than 8000 characters long.
I wished to use TEXT data type to store this query, but MSDN shows a warning message that Microsoft is planning to remove Text, NText and Image data types from their next versions. I wish my code to run in future too.
I thought of storing this query in a separate file, but as it uses joins on table variables and other procedure-specific parameters, I doubt if this is possible.
Kindly tell me a method to store a large query into a variable and execute it multiple times in a procedure.
The problem is with implicit conversion.
If you concatenate only non-MAX strings, SQL Server types the result as VarChar(8000), or as nVarChar(4000) when any Unicode/nChar/nVarChar values are involved, and it is unfortunately too dumb to realize it will truncate your string or even give you a warning that data has been truncated for that matter!
When concatenating long strings (or strings that you feel could be long) always pre-concatenate your string building with CAST('' as nVarChar(MAX)) like so:
SET @Query = CAST('' as nVarChar(MAX))--Force implicit conversion to nVarChar(MAX)
+ 'SELECT...'-- some of the query gets set here
+ '...'-- more query gets added on, etc.
What a pain and scary to think this is just how SQL Server works. :(
I know other workarounds on the web say to break up your code into multiple SET/SELECT assignments using multiple variables, but this is unnecessary given the solution above.
For those who hit a 4000 character max, it was probably because you had Unicode so it was implicitly converted to nVarChar(4000).
Warning:
You still Cannot have a Single Unbroken Literal String Larger than 8000 (or 4000 for nVarChar).
Literal Strings are those you hard-code and wrap in apostrophes.
You must Break those Strings up or SQL Server will Truncate each one BEFORE concatenating.
I add ' + ' every 20 lines (or so) to make sure I do not go over.
That's an average of at most 200 characters per line - but remember, spaces still count!
Explanation:
What's happening behind the scenes is that even though the variable you are assigning to uses (MAX), SQL Server will evaluate the right-hand side of the value you are assigning first and default to nVarChar(4000) or VarChar(8000) (depending on what you're concatenating). After it is done figuring out the value (and after truncating it for you) it then converts it to (MAX) when assigning it to your variable, but by then it is too late.
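The effect is easy to reproduce. In the sketch below (variable names are illustrative), both variables are declared as nVarChar(MAX), but only the second expression starts with a MAX-typed operand, so only the second one keeps all 8000 characters.

```sql
DECLARE @truncated NVARCHAR(MAX);
DECLARE @full      NVARCHAR(MAX);

-- Both operands are NVARCHAR(4000), so the result is NVARCHAR(4000) and gets cut off.
SET @truncated = REPLICATE(N'a', 4000) + REPLICATE(N'b', 4000);

-- Starting with an NVARCHAR(MAX) operand makes the whole concatenation NVARCHAR(MAX).
SET @full = CAST(N'' AS NVARCHAR(MAX)) + REPLICATE(N'a', 4000) + REPLICATE(N'b', 4000);

SELECT LEN(@truncated) AS WithoutCast,  -- 4000
       LEN(@full)      AS WithCast;     -- 8000
```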
If you are on SQL Server 2008 or newer you can use VARCHAR(MAX):
DECLARE @sql VARCHAR(MAX)
SET @sql = 'SELECT * FROM myTable'
EXEC (@sql)
Note: PRINT(@sql) only shows the first 8000 characters! Use:
EXEC
(
'
--your sql script here
'
)
The problem is that your string is limited to 8000 characters by default. To prevent this, convert it to (N)VARCHAR(MAX):
DECLARE @sql VARCHAR(MAX)
SET @sql = CAST('SELECT * FROM myTable' AS VARCHAR(MAX))
--Check length of variable
PRINT 'Length is: '+CAST(LEN(@sql) AS VARCHAR(20))+ ' symbols'
EXEC (@sql)
You should read the answer of this post which explains extremely well the situation :
SQL NVARCHAR and VARCHAR Limits
An N-prefixed literal of length x up to 4000 characters is typed as nvarchar(x).
A literal without the N prefix of length y up to 8000 characters is typed as varchar(y).
A longer literal is typed as nvarchar(max) or varchar(max) respectively, which can store up to 2GB.
The problem is that nvarchar(max) + varchar(y) = nvarchar(max) + nvarchar(4000): SQL Server converts your varchar(y) into nvarchar(y), or into nvarchar(4000) if y is greater than 4000 and less than 8000, truncating your string!
Well, I ran into this before (in SQL 2005) and I can tell you that you have two options:
1 - Use the sys.sp_sqlexec stored procedure that can take a param of type text (IMO this is the way to go). Don't mind the warning. In SQL 2008 ntext is still supported, and if you do the varchar(max) thingy there, it will work. So basically, if you have 2008, both the text solution and the varchar(max) will work, so you will have time to change it =-). In 2012 though, only the varchar(max) will work, therefore you'll have to change it before upgrading.
2- (This is what I did at first) Check THIS post: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=52274 and do what user "Kristen" says. Worked like a charm for me. Don't forget to pre-set them to an empty string. If you understood my post you know by now that in SQL 2008 or newer is silly to do this.
I had the same issue. I have a SQL which was more than 21,000 characters. For some reason,
Declare @SQL VARCHAR(MAX)
EXEC(@SQL)
would come up with several issues
I had to finally split it up in multiple variables equally and then it worked.
Declare @SQL1 VARCHAR(MAX) = 'First Part'
Declare @SQL2 VARCHAR(MAX) = 'Second Part'
Declare @SQL3 VARCHAR(MAX) = 'Third Part'
Declare @SQL4 VARCHAR(MAX) = 'Fourth Part'
Set @SQL= @SQL1 + @SQL2 + @SQL3 + @SQL4
EXEC(@SQL)
There is no solution for this along the way that you are doing it. MsSql as of 2012 supports Ntext for example that allows you to go beyond 8000 characters in a variable. The way to solve this is to make multiple variables or multiple rows in a table that you can iterate through.
At best with a MsSql version the max size of a variable is 8000 characters on the latest version as of when this was typed. So if you are dealing with a string of say 80,000 characters. You can parse the data into ten variables of 8000 characters each (8000 x 10 = 80,000) or you can chop the variable into pieces and put it into a table say LongTable (Bigstring Varchar(8000)) insert 10 rows into this and use an Identity value so you can retrieve the data in the same order.
The method you are trying will not work with MsSql currently.
Another obscure option that will work but is not advisable is to store the variable in a text file by using command shell commands to read/write the file. Then you have space available to you beyond 8000 characters. This is slow and less secure than the other methods described above.
ALTER PROCEDURE [dbo].[spGetEmails]
AS
BEGIN
SET NOCOUNT ON;
-- Insert statements for procedure here
declare @p varbinary(max)
set @p = 0x
declare @local table (col text)
SELECT @p = @p + 0x3B + CONVERT(varbinary(100), Email)
FROM tbCarsList
where email <> ''
group by email
order by email
set @p = substring(@p, 2, 10000000)
insert @local values(cast(@p as varchar(max)))
select col from @local
END
I have been having the same problem, with the strings being truncated. I learned that you can execute the sp_executesql statement multiple times.
Since my block of code was well over the 4k/Max limit, I break it out into little chunks like this:
set @statement = '
update pd
set pd.mismatchtype = 4
FROM [E].[dbo].[' + @monthName + '_P_Data] pd
WHERE pd.mismatchtype is null '
exec sp_executesql @statement
set @statement = 'Select * from xxxxxxx'
exec sp_executesql @statement
set @statement = 'Select * from yyyyyyy '
exec sp_executesql @statement
end
So each set @statement can use the varchar(max) variable as long as each chunk itself is within the size limit (I cut out the actual code in my example to save space).
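For comparison, here is a small sketch suggesting that on versions with NVARCHAR(MAX) support, sp_executesql can also take a single statement well beyond 4000 characters, provided the string is built without hitting the implicit truncation described earlier in this thread. The padding and the parameter @n are made up purely to make the statement long and verifiable.

```sql
-- 10000 spaces of padding, built as NVARCHAR(MAX) so nothing is truncated.
DECLARE @padding NVARCHAR(MAX) = REPLICATE(CAST(N' ' AS NVARCHAR(MAX)), 10000);
DECLARE @statement NVARCHAR(MAX) =
    N'SELECT @n AS Answer' + @padding + N'-- end of a long statement';

SELECT LEN(@statement) AS StatementLength;   -- comfortably more than 10000 characters

-- The whole statement is passed in one call, with @n supplied as a parameter.
EXEC sys.sp_executesql @statement, N'@n INT', @n = 42;
```

Splitting the work into several sp_executesql calls as shown above still works, of course; it just may not be required by the size limit alone.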
Before printing, cast the variable to a different data type:
PRINT CAST(@sql AS NTEXT)
Now, try it.
If what you are trying to accomplish is to do this in Management Studio, the script below might help.
DECLARE @Len INT = 5
DECLARE @Str VARCHAR(MAX) = '1111122222333334444455555'
DECLARE @TmpStr VARCHAR(MAX)
DECLARE @Return TABLE (RetStr VARCHAR(MAX))
WHILE(LEN(@Str) > 0)
BEGIN
SET @TmpStr = LEFT(@Str, @Len)
IF(LEN(@Str) > @Len)
SET @Str = RIGHT(@Str, LEN(@Str) - @Len)
ELSE
SET @Str = ''
-- store the chunk that was just cut off, not the remainder
INSERT INTO @Return SELECT @TmpStr
END
SELECT * FROM @Return
Here @Len should be 8000, as this is the maximum length Management Studio shows. @Str is the text that is longer than 8000 characters.

Inserting XML documents into SQL Server 2008 database

I need help inserting xml files into SQL Server 2008.
I have the following SQL statement:
insert into dbo.articles(id, title, contents)
SELECT X.article.query('id').value('.', 'INT'),
X.article.query('article').value('.', 'VARCHAR(50)'),
X.article.query('/doc/text()').value('.', 'VARCHAR(MAX)')
FROM (
SELECT CAST(x AS XML)
FROM OPENROWSET(
BULK 'E:\test\test_files\1000006.xml',
SINGLE_BLOB) AS T(x)
) AS T(x)
CROSS APPLY x.nodes('doc') AS X(article);
which basically shreds an XML doc into columns. However, I want to be able to insert all the files in a folder rather than manually specifying each file, as in this case E:\test\test_files\1000006.xml
Ok, first crack at answering a question in stackoverflow...
You have two issues:- firstly getting the filenames from the folder into a SQL table or table variable, and then reading the XML from each.
The first is easy, if you don't mind using xp_cmdshell
DECLARE @Folder VARCHAR(255) = 'C:\temp\*.xml'
DECLARE @Command VARCHAR(255)
DECLARE @FilesInAFolder TABLE (XMLFileName VARCHAR(500))
--
SET @Command = 'DIR ' + @Folder + ' /TC /b'
--
INSERT INTO @FilesInAFolder
EXEC MASTER..xp_cmdshell @Command
--
SELECT * FROM @FilesInAFolder
WHERE XMLFileName IS NOT NULL
The second part, converting the XML files to SQL rows is a little trickier because BULK INSERT won't take a parameter and you can't BULK INSERT into an XML table type. Here's code that works for ONE file...
DECLARE @x xml
DECLARE @Results TABLE (result xml)
DECLARE @xmlFileName NVARCHAR(300) = 'C:\temp\YourXMLFile.xml'
DECLARE @TempTable TABLE
(
ID INT,
Article NVARCHAR(50),
doctext NVARCHAR(MAX)
)
/* ---- HAVE TO USE DYNAMIC sql BECAUSE BULK INSERT WON'T TAKE A PARAMETER---------*/
DECLARE @sql NVARCHAR(4000) =
'SELECT * FROM OPENROWSET ( BULK ''' + @xmlFileName + ''', SINGLE_BLOB )AS xmlData'
/* ---- have to use a normal table variable because we can't directly bulk insert
into an XML type table variable ------------------------------------------*/
INSERT INTO @Results EXEC(@sql)
SELECT @x = result FROM @Results
/* ---- this is MUCH faster than using a cross-apply ------------------------------*/
INSERT INTO @TempTable(ID,Article,doctext)
SELECT
x.value('ID[1]', 'INT' ),
x.value('Article[1]', 'NVARCHAR(50)' ),
x.value('doctext[1]', 'NVARCHAR(MAX)' )
FROM @x.nodes(N'/doc') t(x)
SELECT * FROM @TempTable
Now the hard bit is putting these two together. I tried several ways to get this code into a function, but you can't use dynamic SQL or EXEC in a function, and you can't call an SP from a function. You also can't put the code into two separate SPs, because you can't have cascading EXEC statements, i.e. you can't EXEC an SP containing the above code when that code also has an EXEC in it. So you either use a cursor to put the two code blocks above together, i.e. cursor through @FilesInAFolder passing each XMLFileName value into the second code block as the variable @xmlFileName, or you use SSIS or CLR.
Sorry I ran out of time to build a complete SP with a directory name as a parameter and a cursor but that is pretty straightforward. Phew!
Are you using a stored procedure? You can specify the file name as a parameter.
Something like...
CREATE PROCEDURE sp_XMLLoad
@FileName NVARCHAR(300)
AS SET NOCOUNT ON
SELECT X.article.query('id').value('.', 'INT'),
X.article.query('article').value('.', 'VARCHAR(50)'),
X.article.query('/doc/text()').value('.', 'VARCHAR(MAX)')
FROM (
SELECT CAST(x AS XML)
FROM OPENROWSET(
BULK @FileName, -- not accepted as written: BULK expects a string literal here
SINGLE_BLOB) AS T(x)
) AS T(x)
CROSS APPLY x.nodes('doc') AS X(article);
Not exactly like that ... you'll need to add quotes around the @FileName I bet. Maybe assemble it with quotes and then use that variable.
If you're using SSIS, you can then pump all the files from a directory to the stored procedure, or to the SSIS code used.
I think you can do it with a cursor and xp_cmdshell. I would not recommend ever using xp_cmdshell, though.
DECLARE @FilesInAFolder TABLE (FileNames VARCHAR(500))
DECLARE @File VARCHAR(500)
DECLARE @sql NVARCHAR(MAX)
INSERT INTO @FilesInAFolder
EXEC MASTER..xp_cmdshell 'dir /b c:\'
DECLARE CU CURSOR FOR
SELECT 'c:\' + FileNames
FROM @FilesInAFolder
WHERE RIGHT(FileNames,4) = '.xml'
OPEN CU
FETCH NEXT FROM CU INTO @File
WHILE @@FETCH_STATUS = 0
BEGIN
-- OPENROWSET(BULK ...) only accepts a literal path, so the statement is built dynamically
SET @sql = N'
INSERT INTO dbo.articles(id, title, contents)
SELECT X.article.query(''id'').value(''.'', ''INT''),
X.article.query(''article'').value(''.'', ''VARCHAR(50)''),
X.article.query(''/doc/text()'').value(''.'', ''VARCHAR(MAX)'')
FROM (
SELECT CAST(x AS XML)
FROM OPENROWSET(
BULK ''' + REPLACE(@File, '''', '''''') + N''',
SINGLE_BLOB) AS T(x)
) AS T(x)
CROSS APPLY x.nodes(''doc'') AS X(article);'
EXEC (@sql)
FETCH NEXT FROM CU INTO @File
END
CLOSE CU
DEALLOCATE CU