So I am generating a JSON file from SQL Server 2016 using FOR JSON.
I have used JSON_QUERY to wrap queries to prevent the escape characters from appearing before the generated double quotes ("). This works correctly, except that the escapes still show up before the forward slashes (/) in the formatted dates.
One thing to note is that I am converting the datetime values in SQL using CONVERT(VARCHAR, [dateEntity], 101).
An example (this is a subquery):
JSON_QUERY((
SELECT [LegacyContactID]
,[NameType]
,[LastName]
,[FirstName]
,[Active]
,[Primary]
,CONVERT(VARCHAR,[StartDate],101) AS [StartDate]
,CONVERT(VARCHAR,[EndDate],101) AS [EndDate]
FROM [LTSS].[ConsumerFile_02_ContactName]
WHERE [LegacyContactID] = ContactList.[LegacyContactID]
FOR JSON AUTO, WITHOUT_ARRAY_WRAPPER
)) AS ContactName
And the result will be
"ContactName": {
"LegacyContactID": "123456789",
"NameType": "Name",
"LastName": "Jack",
"FirstName": "Apple",
"Active": true,
"Primary": true,
"StartDate": "04\/01\/2016",
"EndDate": "04\/30\/2016"
}
I have the whole query wrapped in JSON_QUERY to eliminate the escaping, but it still escapes the forward slashes in the dates.
I have also passed the dates as strings, without the conversion, and still get the same results.
Any insight?
One solution is to avoid the "/" in dates altogether, by using the "right" JSON date format:
SELECT JSON_QUERY((
SELECT TOP 1 object_id, create_date
FROM sys.tables
FOR JSON AUTO, WITHOUT_ARRAY_WRAPPER
))
Result
{"object_id":18099105,"create_date":"2017-08-14T11:19:22.670"}
UPDATED:
Ah, yes, escape and CRLF characters.
Unless your environment shows the offending characters, you will be forced to manually copy and paste from the result sets and replace the strings from there.
Now, what you mention in your recent update got me considering why you feel the need to transform your data in the first place. DATE values do not have formatting by default, so unless JSON is incompatible with handling SQL dates, there is really no need to transform this data inside JSON if your target tables enforce the correct format.
So unless there is still a concern for the truncation of data, from an ETL perspective there are two ways you can accomplish this:
1 - USE STAGING TABLES
Staging tables can be temporary tables, CTEs, or actual empty tables that you use to extract, cleanse, and transform your data (a minimal sketch follows below).
Advantages: You are only dealing with the rows being inserted, you do not have to be concerned with constraints, and you can easily fix, outside of JSON, any corrupt or unstructured aspects of your data.
Disadvantages: Staging tables mean more objects in your database, depending on how often you need them, so obtaining better, consistently structured source data is preferable.
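A minimal sketch of that staging pattern, assuming a hypothetical staging table and showing only a subset of the real columns: land the raw text first, then convert while moving it into the persistent table.
CREATE TABLE [LTSS].[ContactName_Staging]      -- hypothetical staging table
(
    [LegacyContactID] VARCHAR(20) NULL,
    [StartDate]       VARCHAR(10) NULL,        -- raw mm/dd/yyyy text
    [EndDate]         VARCHAR(10) NULL
);

INSERT INTO [LTSS].[ConsumerFile_02_ContactName] ([LegacyContactID], [StartDate], [EndDate])
SELECT [LegacyContactID]
      ,TRY_CONVERT(DATE, [StartDate], 101)     -- yields NULL instead of an error on bad rows
      ,TRY_CONVERT(DATE, [EndDate], 101)
FROM [LTSS].[ContactName_Staging];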
2 - ALTER YOUR TABLE TO USE STRINGS
Here you enforce the business rules by cleansing the data AFTER insertion into the persistent table.
Advantages: You save on space, simplify the cleansing process, and can still use indexes. SQL Server is pretty efficient at parsing date strings, and you can still take advantage of EXISTS() and possibly SARGable predicates to check for non-dates when running your insert.
Disadvantages: You lose a primary integrity check on your table while the dates are stored as strings, opening up the possibility of dirty data being exposed. Your UPDATE statements will be forced to scan the entire table, which can drag on performance.
JSON_QUERY((
SELECT [LegacyContactID]
,[NameType]
,[LastName]
,[FirstName]
,[Active]
,[Primary]
,[StartDate] -- it is already in a date format
,[EndDate]
FROM [LTSS].[ConsumerFile_02_ContactName]
WHERE [LegacyContactID] = ContactList.[LegacyContactID]
FOR JSON AUTO, WITHOUT_ARRAY_WRAPPER
)) AS ContactName
I have run into some similar issues. Without going into a ton of detail, I believe this is part of the reason the new JSON functionality isn't getting a ton of adoption yet, from what I can see.
I've added a couple of comments on MSDN about this, and a tweet:
"Why can't the auto-escaping of ALL strings be turned off with a flag???" - https://msdn.microsoft.com/en-us/library/dn921889.aspx
"Almost there, but not quite yet..." - https://msdn.microsoft.com/en-us/library/dn921882.aspx
"Anyone else frustrated with forced auto-escaping of all JSON in #SQLServer / #AzureSQLDB? (see link for my comments) msdn.microso…" - https://twitter.com/brian_jorden/status/844621512711831552
If you come across a method or way to deal with this, would love to hear in this or any of those threads, and good luck...
Related
We're trying to implement change detection in our ETL process.
So we decided to get the cryptographic hash using:
UPDATE a
SET a.[HASH] = (SELECT master.dbo.fn_varbintohexsubstring(
                    0,
                    HashBytes('md5', (SELECT TOP 1 *
                                      FROM customer_demographics_staging b
                                      WHERE b.customer_no = a.customer_no
                                      FOR XML RAW)),
                    1, 0))
FROM customer_demographics_staging a
For a table with 700k records and about 140 columns (we have yet to determine the changing columns), the query ran for about half an hour before we canceled it.
Is there any way, apart from reducing the number of queries, that we can improve this?
A couple of things. If the data type of the HASH column is varbinary(20), you don't need to concern yourself with converting the MD5 hash to a string; just store the hash bytes. To that end though, if you want to use a cryptographic hash for change detection, I'd use an inline table-valued function to get it. Here's an example that I cobbled together using AdventureWorks:
ALTER TABLE [HumanResources].[Employee] ADD [Hash] VARBINARY(20) NULL;
GO
CREATE FUNCTION dbo.CalculateHash(@EmployeeID AS INT)
RETURNS TABLE
AS
RETURN
SELECT e.[BusinessEntityID], HASHBYTES('md5', (
SELECT *
FROM [HumanResources].[Employee] AS [e2]
WHERE [e2].[BusinessEntityID] = e.[BusinessEntityID]
FOR XML RAW
)) AS [Hash]
FROM [HumanResources].[Employee] AS [e]
WHERE [e].[BusinessEntityID] = @EmployeeID
GO
SELECT TOP 10 [e].*, ch.[Hash]
FROM [HumanResources].[Employee] AS [e]
CROSS APPLY dbo.[CalculateHash]([e].[BusinessEntityID]) AS [ch]
GO
That said, if it were me, I wouldn't bother with MD5 at all and just use the CHECKSUM() function (perhaps as a persisted computed column in the table). It supports taking multiple columns natively (so you don't incur the overhead of serializing the row to XML).
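A minimal sketch of that persisted computed column idea, using a few illustrative AdventureWorks columns (list whichever columns you actually want change-tracked):
ALTER TABLE [HumanResources].[Employee]
    ADD [RowChecksum] AS CHECKSUM([LoginID], [JobTitle], [MaritalStatus]) PERSISTED;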
In line with what [Ben Thul] already said, I too tend to rely on BINARY_CHECKSUM(), simply because of its ease of use.
I'll agree that this function returns "but an int" (4 bytes), while e.g. MD5 returns a varbinary(16), four times as many bytes, so the result space is enormously larger (2^128 versus 2^32) and the chance of collisions is vastly smaller. But paranoid me would like to add that even so, an exact match of MD5 values does not mean you also have the same (input) values!
In all honesty, I use the function only to eliminate differences. If the result of the checksum (or hash) is different then you can be 100% certain that the values are different too. If they are identical then you should still check the source-values in their entirety to see if there are no 'false matches'.
Your use case seems to be the other way around: you want to find the ones that are different by eliminating the ones that are identical, short-cutting the latter by looking at the hash code only. To be honest, I'm not a fan of that approach, simply because you risk running into a collision: a 'changed' record in your staging table gets the exact same hash value as the old one and is thus ignored when you want to copy the changes. Again, the chances are incredibly small, but like I said, I'm paranoid when it comes to this =)
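To make the 'eliminate, then verify' idea concrete, here is a minimal sketch under assumed table and column names (a customer_demographics target alongside the staging table, with placeholder columns): rows whose checksums differ are definitely changed, and rows whose checksums match are still compared in full to rule out false matches.
SELECT src.customer_no
FROM customer_demographics_staging AS src
INNER JOIN customer_demographics AS tgt
    ON tgt.customer_no = src.customer_no
WHERE BINARY_CHECKSUM(src.first_name, src.last_name)
      <> BINARY_CHECKSUM(tgt.first_name, tgt.last_name)    -- definitely changed
   OR EXISTS (SELECT src.first_name, src.last_name         -- checksums matched: verify in full
              EXCEPT
              SELECT tgt.first_name, tgt.last_name);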
If you'd wish to continue down this track nevertheless, some remarks:
HashBytes only supports an input of 8,000 bytes (prior to SQL Server 2016). Given the overhead added by the XML syntax, you might run into trouble with those 140 columns.
I don't see any (good) reason to convert the result of HashBytes to something else before writing it to the table
Although FOR XML is pretty fast, wouldn't CONCAT be just as fast while producing a 'smaller' result (cf. point 1)? I'll agree that it brings its own set of issues: fields containing "hello", "world", "" would concatenate to the same thing as "hello", "", "world" =/ You could get around this by CONCAT-ing the LEN() of each field too (see the sketch after these remarks)... not sure how much gain we'd have left, though =)
I'm guessing you already have it, but is there an index, preferably unique and clustered, on the customer_no field in the staging table?
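A minimal sketch of that CONCAT idea, with placeholder column names: prefixing each value with its length keeps ("hello", "world", "") and ("hello", "", "world") from hashing to the same thing. NULL handling would still need attention, since CONCAT treats NULL as an empty string.
SELECT customer_no,
       HASHBYTES('md5',
           CONCAT(LEN(first_name), '|', first_name,
                  LEN(last_name),  '|', last_name,
                  LEN(city),       '|', city)) AS row_hash
FROM customer_demographics_staging;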
I have problems using the Oracle XML functions like
xmlelement, xmlagg, xmlattributes
For instance:
select
XMLELEMENT(
"OrdrList",
XMLAGG(
XMLELEMENT(
"IDs",
XMLATTRIBUTES(
USERCODE AS "usrCode",
VALDATE AS "validityDate"
)
)
)
) from TMP
/
The code seems to be correct, as it does work when returning a small number of messages.
And yes, I did try to set "long", "pagesize", "linesize", etc., but I have never been able to retrieve the full set of approx. 500,000 XML messages (i.e. table rows).
Reading some background literature (e.g. "Oracle SQL" by Jürgen Sieben), it seems that the functions are not designed for large data sets. Mr. Sieben explains that he uses them only for small queries (max. 1 MB output size); above that he recommends using "object-oriented functions" but does not explain which.
Does somebody have experience with this and has the above XML functions working, or know of alternatives?
As per the advice below: converting to CLOB through [...].getclobval(0, 2) from TMP now iterates through the whole table. Slow, but complete.
I have to make a correction: getclobval delivers a longer but still not complete list.
As my confidence in the implementation/documentation quality of the above Oracle XML functions is weak, I will create a standard file-output from the database and implement the XML-conversion myself.
New update: I found the culprit: XMLAGG! If I take it out, the table is read quickly, properly, step by step, and completely. Strange, since XMLAGG does not really have a complicated job: creating an opening and a closing XML tag.
I think showing this data in sqlplus + spool completely is going to be a struggle.
I have used these functions for > 100 MB of data without problems, but I have written the returned XMLType out to files after converting to CLOB, using either UTL_FILE on the server side or client apps in Java/C#.
If you are stuck with sqlplus, have you tried it with "SET TERM OFF" and spool? It might give better results, and would certainly be quicker. Note that to use SET TERM OFF you have to be careful how you invoke sqlplus; sqlplus @script will work, but "cat <
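For reference, a minimal sqlplus sketch of that SET TERM OFF + spool approach, assuming the TMP table from the question (the LONG/LINESIZE values are only starting points):
SET LONG 2000000000
SET LONGCHUNKSIZE 32767
SET PAGESIZE 0
SET LINESIZE 32767
SET TRIMSPOOL ON
SET TERMOUT OFF
SET FEEDBACK OFF
SPOOL ordrlist.xml
SELECT XMLELEMENT("OrdrList",
         XMLAGG(XMLELEMENT("IDs",
                  XMLATTRIBUTES(USERCODE AS "usrCode",
                                VALDATE  AS "validityDate")))).getclobval()
FROM TMP;
SPOOL OFF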
We're currently investigating the load against our SQL server and looking at ways to alleviate it. During my post-secondary education, I was always told that, from a performance standpoint, it was cheaper to make SQL Server do the work. But is this true?
Here's an example:
SELECT ord_no FROM oelinhst_sql
This returns 783119 records in 14 seconds. The field is a char(8), but all of our order numbers are six digits long, so each has two leading blank characters. We typically trim this field, so I ran the following test:
SELECT LTRIM(ord_no) FROM oelinhst_sql
This returned the 783119 records in 13 seconds. I also tried one more test:
SELECT LTRIM(RTRIM(ord_no)) FROM oelinhst_sql
There is nothing to trim on the right; I was trying to see if there was any overhead in the mere act of calling the function, but it still returned in 13 seconds.
My manager was talking about moving things like string trimming out of the SQL and into the source code, but the test results suggest otherwise. My manager also says he heard somewhere that using SQL functions meant that indexes would not be used. Is there any truth to this either?
Only optimize code that you have proven to be the slowest part of your system. Your data so far indicates that SQL string manipulation functions are not affecting performance at all. Take this data to your manager.
If you use a function or type cast in the WHERE clause it can often prevent the SQL server from using indexes. This does not apply to transforming returned columns with functions.
It's typically user defined functions (UDFs) that get a bad rap with regards to SQL performance and might be the source of the advice you're getting.
The reason for this is you can build some pretty hairy functions that cause massive overhead with exponential effect.
As you've found with RTRIM and LTRIM, this isn't a blanket reason to stop using all functions on the SQL side.
It somewhat depends on what all is encompassed by "things like string trimming", but for string trimming at least, I'd definitely let the database do that (there will be less network traffic as well). As for the indexes, they will still be used if your WHERE clause is just using the column itself (as opposed to a function of the column). Use of the indexes won't be affected whatsoever by using functions on the columns you're retrieving (just on how you're selecting the rows).
You may want to have a look at this for performance improvement suggestions: http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/
As I said in my comment, reduce the data read per query and you will get a speed increase.
You said:
our order numbers are six-digits long so each has two blank characters leading
That makes me think you are storing numbers in a string; if so, why are you not using a numeric data type? The smallest numeric type that will take 6 digits is an INT (I'm assuming SQL Server), and that already saves you 4 bytes per order number; over the number of rows you mention, that's quite a lot less data to read off disk and send over the network.
Fully optimise your database before looking to deal with the data outside of it; it's what a database server is designed to do, serve data.
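A hedged sketch of that change, assuming the table and column from the question, that every ord_no really is numeric, and that TRY_CONVERT is available (SQL Server 2012+):
-- Should return no rows before you alter the column
SELECT ord_no
FROM oelinhst_sql
WHERE TRY_CONVERT(INT, ord_no) IS NULL;

-- Then switch the column to a 4-byte INT (match the existing NULL/NOT NULL setting)
ALTER TABLE oelinhst_sql ALTER COLUMN ord_no INT NOT NULL;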
As you found, it often pays to measure, but what I think your manager may have been referring to is something like this.
This is typically much faster
SELECT SomeFields FROM oelinhst_sql
WHERE
datetimeField > '1/1/2011'
and
datetimeField < '2/1/2011'
than this
SELECT SomeFields FROM oelinhst_sql
WHERE
Month(datetimeField) = 1
and
year(datetimeField) = 2011
even though the rows that are returned are the same
We are attempting to concatenate possibly thousands of rows of text in SQL with a single query. The query that we currently have looks like this:
DECLARE @concatText NVARCHAR(MAX)
SET @concatText = ''
UPDATE TOP (SELECT MAX(PageNumber) + 1 FROM #OrderedPages) [#OrderedPages]
SET @concatText = @concatText + [ColumnText] + '
'
WHERE (RTRIM(LTRIM([ColumnText])) != '')
This is working perfectly fine from a functional standpoint. The only issue we're having is that sometimes the ColumnText can be a few kilobytes in length. As a result, we're filling up tempDB when we have thousands of these rows.
The best reason we have come up with is that, as we're doing these updates to @concatText, SQL is using implicit transactions, so the strings are effectively immutable.
We are trying to figure out a good way of solving this problem and so far we have two possible solutions:
1) Do the concatenation in .NET. This is an OK option, but that's a lot of data that may go back across the wire.
2) Use .WRITE which operates in a similar fashion to .NET's String.Join method. I can't figure out the syntax for this as BoL doesn't cover this level of SQL shenanigans.
This leads me to the question: Will .WRITE work? If so, what's the syntax? If not, are there any other ways to do this without sending data to .NET? We can't use FOR XML because our text may contain illegal XML characters.
Thanks in advance.
I'd look at using CLR integration, as suggested in @Martin's comment. A CLR aggregate function might be just the ticket.
What exactly is filling up tempdb? It cannot be @concatText = @concatText + [ColumnText]; there is no immutability involved, and the @concatText variable will be at worst 2 GB in size (I expect your tempdb is much larger than that; if not, increase it). It seems more like your query plan creates a spool for Halloween protection, and that spool is the culprit.
As a generic answer, using the UPDATE ... SET #var = #var + ... for concatenation is known to have correctness issues and is not supported. Alternative approaches that work more reliably are discussed in Concatenating Row Values in Transact-SQL.
First, from your post, it isn't clear whether or why you need temp tables. Concatenation can be done inline in a query. If you show us more about the query that is filling up tempdb, we might be able to help you rewrite it. Second, an option that hasn't been mentioned is to do the string manipulation outside of T-SQL entirely, i.e., in your middle tier: query for the raw data, do the manipulation, and push it back to the database. Lastly, you can use XML such that the results handle escapes and entities properly. Again, we'd need to know more about what you are trying to accomplish and how.
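As a sketch of the 'inline concatenation' and XML-entity points above, assuming the #OrderedPages table with its PageNumber and ColumnText columns from the question: the TYPE directive plus .value() round-trips characters such as & and < correctly, though control characters that are outright invalid in XML can still be a problem.
DECLARE @concatText NVARCHAR(MAX);

SELECT @concatText =
    (SELECT [ColumnText] + NCHAR(13) + NCHAR(10)
     FROM #OrderedPages
     WHERE RTRIM(LTRIM([ColumnText])) <> ''
     ORDER BY [PageNumber]
     FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)');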
Agreed. A CLR user-defined function would be the best approach for what you guys are doing. You could actually read the text values into an object, join them all together (inside the CLR), and have the function spit out an NVARCHAR(MAX) result. If you need details on how to do this, let me know.
This is hopefully just a simple question involving performance optimizations when it comes to queries in SQL Server 2008.
I've worked for companies that use Stored Procs a lot for their ETL processes as well as some of their websites. I've seen the scenario where they need to retrieve specific records based on a finite set of key values. I've seen it handled in 3 different ways, illustrated via pseudo-code below.
Dynamic SQL that concatenates a string and executes it:
EXEC('SELECT * FROM TableX WHERE xId IN (' + @Parameter + ')')
Using a user-defined function to split a delimited string into a table:
SELECT * FROM TableY INNER JOIN SPLIT(@Parameter) ON yID = splitId
Using XML as the parameter instead of a delimited varchar value:
SELECT * FROM TableZ JOIN @Parameter.nodes(xpath) AS x (y) ON ...
While I know creating the dynamic SQL in the first snippet is a bad idea for a large number of reasons, my curiosity comes from the last 2 examples. Is it more efficient to do the due diligence in my code to pass such lists via XML as in snippet 3, or is it better to just delimit the values and use a UDF to take care of it?
There is now a 4th option: table-valued parameters (TVPs), whereby you can actually pass a table of values in to a sproc as a parameter and then use it as you would normally use a table variable. I'd prefer this approach over the XML (or CSV parsing) approach.
I can't quote performance figures between all the different approaches, but that's one I'd be trying - I'd recommend doing some real performance tests on them.
Edit:
A little more on TVPs. In order to pass the values in to your sproc, you just define a SqlParameter (SqlDbType.Structured) - the value of this can be set to any IEnumerable, DataTable or DbDataReader source. So presumably, you already have the list of values in a list/array of some sort - you don't need to do anything to transform it into XML or CSV.
I think this also makes the sproc clearer, simpler and more maintainable, providing a more natural way to achieve the end result. One of the main points is that SQL performs best at set-based activities, not looping or string manipulation.
That's not to say it will perform great with a large set of values passed in. But with smaller sets (up to ~1000) it should be fine.
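A minimal T-SQL sketch of the TVP approach, with illustrative type, procedure, and column names (the client side just binds a SqlDbType.Structured parameter, as described above):
CREATE TYPE dbo.IdList AS TABLE (Id INT PRIMARY KEY);
GO
CREATE PROCEDURE dbo.GetTableXRows
    @Ids dbo.IdList READONLY
AS
BEGIN
    SELECT t.*
    FROM TableX AS t
    INNER JOIN @Ids AS i
        ON i.Id = t.xId;
END
GO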
UDF invocation is a little bit more costly than splitting the XML using the built-in function.
However, this only needs to be done once per query, so the performance difference will be negligible.
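For comparison, a sketch of shredding an XML parameter with the built-in nodes()/value() methods (TableZ is from the question; the zId column and the element names are assumptions):
DECLARE @Parameter XML = N'<ids><id>1</id><id>2</id><id>3</id></ids>';

SELECT t.*
FROM TableZ AS t
INNER JOIN @Parameter.nodes('/ids/id') AS x(y)
    ON t.zId = x.y.value('.', 'INT');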