U-SQL Error Extracting from BCP CSV File - azure-data-lake

I have data that I extracted from SQL Server using BCP; the file is ASCII CSV.
Dates are in the 2016-03-03T23:00:00 format.
When running the extract I get:
Additional information:
{"diagnosticCode":195887127,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_INVALID_ERROR","message":"Invalid
character when attempting to convert column data.","description":"HEX:
\"223022\" Invalid character when converting input record.\nPosition:
line 1, column 21.","resolution":"Check the input for errors or use
\"silent\" switch to ignore over(under)-sized rows in the
input.\nConsider that ignoring \"invalid\" rows may influence job
results and that types have to be nullable for conversion errors to be
ignored.","helpLink":"","details":"============================================================================================\nHEX:5432333B35313B34362D323031362E30332E30335432333B30303B30302D302D352D323031362E30332E30335432333B35313B34392F3536372D302D323031362E30332E3033\n
^\nTEXT:T23:51:46,2016-03-03T23:00:00,0,5,2016-03-03T23:51:49.567,0,2016-03-03\n
How do you handle dates properly on extraction? It's unclear to me why it is splitting in the middle of a date-time column.
A sample row looks like
50CA2FBB-95C3-4216-A729-999BE2DB491A,2016-03-03T23:51:49.567,1001464881,1001464795,1001464795,00000000-0000-0000-0000-000000000000,00000000-0000-0000-0000-000000000000,100 ,100 , ,12643,bCAwvRnNVwrKDXKxZkVed2Z1zHY=,o2lsnhueDApmvSbm31mh3aetYnc=,2016-03-03T23:50:46,2016-03-03T23:00:00,2016-03-03T23:51:46,2016-03-03T23:00:00,0,5,2016-03-03T23:51:49.567,0,2016-03-03T00:00:00,2016-03-03T23:59:59,00000000-0000-0000-0000-000000000000
Extract Statement is
@res =
    EXTRACT
        LicenseId Guid,
        EntryDate DateTime,
        UltimateId long,
        SiteId string,
        VirtualId string,
        ProjectId Guid,
        DocumentId Guid,
        MasterId string,
        ProductId string,
        FeatureString string,
        VersionId long,
        ComputerSid string,
        UserSid string,
        AppStartTime DateTime,
        StartHour DateTime,
        AppStopTime DateTime,
        StopHour DateTime,
        GmtDelta int,
        RecordedGmtDelta int,
        LastUpdated DateTime,
        Processed bool,
        StartDate DateTime,
        EndDate DateTime,
        ImsId Guid
    FROM @dataFile
    USING Extractors.Csv();

The default encoding of the built-in extractors is Encoding.UTF8. So most likely, the three byte sequence you see is being interpreted as UTF-8 and not ASCII.
If your BCP output really only contains code points in the ASCII range (0-127) (and not ANSI 8 bit characters), you can specify Extractors.Csv(encoding:Encoding.[ASCII]) (note the [] around ASCII, which escape it from the reserved keyword rule).
If your data does contain ANSI range characters, however, you have to either BCP out as UTF-16 (I don't think BCP supports UTF-8) or convert the result of BCP into UTF-8.
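As a rough illustration of the distinction above, here is a small Python sketch (not U-SQL) that checks whether a BCP output contains only bytes in the ASCII range and, if not, re-encodes it to UTF-8 before upload. The helper names and the choice of Windows-1252 as the source code page are assumptions made for the example; substitute the code page your export actually used.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is in the ASCII range 0-127."""
    return all(b < 128 for b in data)


def reencode_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from an ANSI code page (assumed cp1252) and re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


ascii_row = b"2016-03-03T23:50:46,2016-03-03T23:00:00,0,5"
ansi_row = "Zoë,2016-03-03T23:00:00".encode("cp1252")  # contains the byte 0xEB

print(is_pure_ascii(ascii_row))  # only ASCII bytes: Encoding.[ASCII] is an option
print(is_pure_ascii(ansi_row))   # has a byte above 127: convert before uploading
utf8_row = reencode_to_utf8(ansi_row)
print(utf8_row.decode("utf-8"))
```

Running a check like this over the whole file before uploading tells you which of the two cases described above applies.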
Note that if the file is larger than 250MB, we currently have a bug around the record boundary detection when uploading the file if it is in UTF-16 encoding. Until we have this bug fixed, I suggest you upload the file with UTF-8 encoding in that case.
Also, if you need the full ANSI codepage supported, please vote your support for the user voice item at https://feedback.azure.com/forums/327234-data-lake/suggestions/13077555-add-ansi-code-page-support-for-built-in-extractors and provide the code page that you need to have supported (e.g., Windows-1254 or ISO-Latin-1).

Related

How to cast hex data string to a string db2 sql

How would you decode a hex string to get the value in text format by using a select statement?
For example my data in hex is:
4f004e004c005900200046004f00520020004200410043004b002d005500500020004f004e0020004c004500560045004c0020004f004e004500200046004f00520020004300520041004e
I want to decode it to get the string value using a select statement.
The value of the above is "ONLY FOR BACK-UP ON LEVEL ONE FOR CRANES"
what I have tried is :
SELECT CAST('4f004e004c005900200046004f00520020004200410043004b002d005500500020004f004e0020004c004500560045004c0020004f004e004500200046004f00520020004300520041004e0045005300200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200020002000200
02000200020000000'
AS VARCHAR(30000) CCSID 37) from myschema.atable
The above sql returns the exact same hex string and not the decoded text string of "ONLY FOR BACK-UP ON LEVEL ONE FOR CRANES" what I expected.
Is it possible to do this with a cast? If it is what will the syntax be?
My problem that I have is a system stores text data in a blob field and I want to use a select statement to see what the text data is in the blob field.
Db : Db2 on Ibm
Edit:
I have managed to convert the string to the hex value by using:
select hex(cast('ONLY FOR BACK-UP ON LEVEL ONE FOR CRANES' as varchar(100) ccsid 1208))
FROM myschema.atable
This gives me the string in hex :
4F4E4C5920464F52204241434B2D5550204F4E204C4556454C204F4E4520464F52204352414E4553
Now somehow I need to do the inverse and get the value.
Thanks.
Edit
Using the answer from Daniel Lema, I tried using the unhex function but my result that I got was :
|+<ßã|êâ ä.í&|+<áîá<|+áã|êäê +áë
Is this something to do with a CCSID? Or how should I convert the above to a readable string?
In case the table definition helps: the field holding my data is GDTXFT, a BLOB.
I was able to take your shortened hex string and convert it to a valid EBCDIC string.
The problem I ran into is that the original hex code you receive comes in UTF-16LE (Thanks Tom Blodget). IBM's CCSID system does not have a distinction between UTF-16BE and UTF-16LE so I am at a loss there on how to convert it properly.
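To see what the original bytes look like outside Db2, here is a short Python sketch (an illustration only, not a Db2 solution) that decodes a shortened version of the hex string as UTF-16LE. The variable names are made up for the example, and the sample covers only the first two words of the value.

```python
# First 8 characters of the original value, two bytes per character (UTF-16LE).
hex_utf16le = "4f004e004c005900200046004f005200"

raw = bytes.fromhex(hex_utf16le)
text = raw.decode("utf-16-le")
print(text)

# The same text encoded as UTF-8 needs only one byte per character here.
utf8_hex = text.encode("utf-8").hex().upper()
print(utf8_hex)
```

The second hex string is the start of the UTF-8 form generated in the question's edit, which is why the functions that follow can work on that form byte by byte.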
If it is in UTF-8 as you generated later, the following would work for you. It's not the prettiest but throw it in a couple functions and it will work.
Create or replace function unpivothex (in_ varchar(30000))
returns table (Hex_ char(2), Position_ int)
return
with returnstring (ST , POS )
as
(Select substring(STR,1,2), 1
from table(values in_) as A(STR)
union all
Select nullif(substring(STR,POS+2,2),'00'), POS+2
from returnstring, table(values in_) as A(STR)
where POS+2 <= length(in_)
)
Select ST, POS
from returnstring
;
Create or replace function converthextostring
(in_string char(30000))
returns varchar(30000)
return
(select listagg(char(varbinary_format(B.Hex_),1)) within group(order by In_table.Position_)
from table(unpivothex(upper(in_string))) in_table
join table(unpivothex(hex(cast('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ' as char(53) CCSID 1208)))) A on In_table.Hex_ = A.Hex_
join table(unpivothex(hex(cast('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ' as char(53) CCSID 37)))) B on A.Position_ = B.Position_
);
Here is a version if you're not on at least V7R2 TR6 or V7R3 TR2.
Create or replace function converthextostring
(in_string char(30000))
returns varchar(30000)
return
(select xmlserialize(
xmlagg(
xmltext(cast(char(varbinary_format(B.Hex_),1) as char(1) CCSID 37))
order by In_table.Position_)
as varchar(30000))
from table(unpivothex(upper(in_string))) in_table
join table(unpivothex(hex(cast('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ' as char(53) CCSID 1208)))) A on In_table.Hex_ = A.Hex_
join table(unpivothex(hex(cast('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ' as char(53) CCSID 37)))) B on A.Position_ = B.Position_
);
I tried the following solution I found published by Marcin Rudzki at Convert HEX value to CHAR on DB2, tested in my own Db2 for LUW v11 with a small modification.
The solution consists of creating a function just as Marcin suggested:
CREATE FUNCTION unhex(in VARCHAR(32000) FOR BIT DATA)
RETURNS VARCHAR(32000)
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC NO EXTERNAL ACTION
BEGIN ATOMIC
RETURN in;
END
To test the solution, let's create a HEXSAMPLE table with a HEXSTRING column loaded with the string representation of a hex sequence:
INSERT INTO HEXSAMPLE (HEXSTRING) VALUES ('4F4E4C5920464F52204241434B2D5550204F4E204C4556454C204F4E4520464F52204352414E4553')
Then execute the following query (this is where it differs from the original proposal):
SELECT UNHEX(CAST(HEXTORAW(HEXSTRING) AS VARCHAR(2000) FOR BIT DATA)) as TEXT, HEXSTRING FROM HEXSAMPLE
With result:
TEXT HEXSTRING
---------------------------------------- --------------------------------------------------------------------------------
ONLY FOR BACK-UP ON LEVEL ONE FOR CRANES 4F4E4C5920464F52204241434B2D5550204F4E204C4556454C204F4E4520464F52204352414E4553
I hope someone else can find a more direct solution. Also, if someone can explain why it works, it will be very interesting.
I question why you need to do this...
There are valid reasons to convert a hex string back to its character equivalent... for instance, somebody sends you a UUID as a 32 byte string and you want it back in its 16 byte binary form.
But there's no reason ONLY FOR BACK-UP ON LEVEL ONE FOR CRANES should have been transformed to hex.
I suspect you need to post a new question asking why you're not getting readable strings in the first place.
However, in answer to this question... IBM i has an MI function Convert Character to Hex (CVTCH) that is easily called from any ILE language. You could wrap that function call up into a user defined function in order to use it from SQL.
Note that you'll need to know what the hex string represents (EBCDIC, ASCII or Unicode), because you'll need to be able to tell the system what you've started with. From there, there are ways to convert between encodings.
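As a small illustration of why the encoding matters, the Python sketch below (not IBM i code) takes the UTF-8 hex string from the question and decodes the same bytes two ways. Python's `cp037` codec is used here as a stand-in for CCSID 37; the variable names are assumptions for the example.

```python
hex_string = (
    "4F4E4C5920464F52204241434B2D5550204F4E204C4556454C"
    "204F4E4520464F52204352414E4553"
)
raw = bytes.fromhex(hex_string)

as_utf8 = raw.decode("utf-8")    # the bytes read as what they really are
as_ebcdic = raw.decode("cp037")  # the same bytes read as EBCDIC (CCSID 37)

print(as_utf8)
print(repr(as_ebcdic))

# Going the other way: the EBCDIC bytes for the same text are different.
ebcdic_hex = as_utf8.encode("cp037").hex().upper()
print(ebcdic_hex[:8])
```

Reading UTF-8 bytes as EBCDIC gives the same kind of `|+<` garbage that the question reports from the unhex attempt.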
Here's an article that shows how to call the MI function from RPG.
Utilizing MI Functions in RPG Programs
A more modern free form version of the prototype that takes advantage of enhancements to the CCSID keyword might look like
dcl-pr FromHex extproc('cvtch');
charString char(32767) ccsid(*UTF8) options(*varsize);
hexString char(65534) ccsid(*HEX) const options(*varsize);
hexStringLen int(10) value;
end-pr;
With the above prototype, the system will treat the character string that comes back as UTF8 (ccsid 1208). But all I'm doing is telling the system how to interpret the bytes that come back. If the string was actually EBCDIC, I'm going to get garbage.
I think you could even define the cvtch function directly as an external UDF without needing an ILE wrapper. I'd have to play around with that...
Disregard that idea...cvtch only has parameters, not a return value. Using an ILE wrapper is the best way to move the output parameter to a return value for use as a UDF.
The problem is that your original string is in ASCII format (actually with x'00' byte after each letter), and you have to convert it to EBCDIC.
Below is the solution for latin capital letters only:
select cast(translate(replace(mycol, x'00', x'')
, x'C1C2C3C4C5C6C7C8C9D1D2D3D4D5D6D7D8D9E2E3E4E5E6E7E8E940'
, x'4142434445464748494A4B4C4D4E4F505152535455565758595A20'
) as varchar(500) ccsid 37)
from mytab;
Every ASCII character is translated to the corresponding EBCDIC one.
x'00' symbols are removed.
cast (col_name as varchar(2000) ccsid ascii for sbcs data)

Additional 0 in varbinary insert in SSMS

I have a problem when I am trying to move a varbinary(max) field from one DB to another.
If I insert like this:
0xD0CF11E0A1B11AE10000000
It results the beginning with an additional '0':
0x0D0CF11E0A1B11AE10000000
And I cannot get rid of this. I've tried many tools, like the SSMS export tool or BCP, but without any success. It would be better for me to solve it in a script anyway.
I don't have much knowledge about varbinary (a program generates it); my only goal is to copy it :)
0xD0CF11E0A1B11AE10000000
This value contains an odd number of characters. Varbinary stores bytes. Each byte is represented by exactly two hexadecimal characters. You're either missing a character, or you're not storing bytes.
Here, SQL Server is guessing that the most significant digit is a zero, which would not change the numeric value of the string. For example:
select 0xD0C "value"
,cast(0xD0C as int) "as_integer"
,cast(0x0D0C as int) "leading_zero"
,cast(0xD0C0 as int) "trailing_zero"
value      as_integer   leading_zero   trailing_zero
---------- ------------ -------------- ---------------
0d0c       3340         3340           53440
Or:
select 1 "test"
where 0xD0C = 0x0D0C
test
-------
1
It's just a difference of SQL Server assuming that varbinary always represents bytes.
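The same arithmetic can be checked outside SQL Server. The Python sketch below is only an illustration; `pad_hex` is a hypothetical helper written for this example. It shows that left-padding an odd-length hex string with a zero keeps its numeric value, while padding on the right changes it.

```python
def pad_hex(hex_digits: str) -> str:
    """Left-pad an odd-length hex string with '0' so it maps to whole bytes."""
    return hex_digits if len(hex_digits) % 2 == 0 else "0" + hex_digits


odd = "D0C"
padded = pad_hex(odd)

print(padded)                         # 0D0C
print(int(odd, 16), int(padded, 16))  # 3340 3340
print(int("D0C0", 16))                # 53440
print(bytes.fromhex(padded).hex())    # 0d0c

try:
    bytes.fromhex(odd)  # Python rejects odd-length input instead of guessing
except ValueError as exc:
    print("rejected:", exc)
```

If the source program really emits an odd number of hex digits, it is worth finding out which digit is missing before copying the value, since the padded result may not be the bytes that were intended.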

Conversion failed when converting the nvarchar value 'AAAR78509883' to data type int

I have an nvarchar column in one of my tables that I imported from Access. I am trying to change it to an int so that I can move it to a new table.
The original query:
insert into members_exams_answer
select
ua.members_exams_id, ua.exams_questions_id,
ua.members_exams_answers_value, ua.members_exams_answers_timestamp
from
members_exams as me
full join
UserAnswers1 as ua on me.members_exams_username = ua.members_exams_id
full join
exams_questions as eq on eq.exams_questions_id = ua.exams_questions_id
This throws an error:
Conversion failed when converting the nvarchar value 'AAAR78509883' to data type int.
I have tried:
select convert (int, UserAnswers1.members_exams_id)
from UserAnswers1
and
select cast(members_exams_id as integer) int_members_exams_id
from UserAnswers1
and
select cast (members_exams_id as int)
from UserAnswers1
All result in the same error
Conversion failed when converting the nvarchar value 'AAAR78509883' to data type int.
Clearly you are trying to convert data that is alphanumeric to an int and that cannot be done.
Looking at your data why are you insisting on converting it to an int when it cannot be an int? Why not just process it as an nvarchar?
Your problem could be systemic, where all the data has leading alpha characters that you need to strip out (hopefully the same number of alpha characters in every record).
In that case use a substring to strip off the alphas (this assumes the same number of alphabetic characters in each record). Or use a varchar or nvarchar field instead of an int. If the number of leading characters varies, or if they can be leading or trailing or some other combination, it will be much more complex to fix than we can probably describe on the Internet.
The other possibility is that you simply have some bad data. In that case, identify the records that are not numeric and fix them, or null the value out if they cannot be fixed. This happens frequently when data has been stored in an incorrect datatype.
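The "identify the records which are not numeric" step can be sketched as follows. This is a Python illustration of the logic only, run on made-up sample values; `split_numeric` is a hypothetical helper, not something from the question.

```python
def split_numeric(values):
    """Return (convertible, rejected): values that parse as int, and those that do not."""
    convertible, rejected = [], []
    for value in values:
        try:
            convertible.append(int(value))
        except (TypeError, ValueError):
            rejected.append(value)
    return convertible, rejected


sample_ids = ["1001", "AAAR78509883", "2002", None, " 3003 "]
good, bad = split_numeric(sample_ids)
print(good)
print(bad)
```

In SQL Server 2012 and later, TRY_CONVERT(int, column) returns NULL instead of raising an error, so a similar split can be done directly in a query by filtering on whether the result is NULL.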

Replace very long HTML text in SQL Server database

I'm trying to find and replace very long HTML text in a SQL Server temp table (I'm using SQL Server 2012).
The query structure is:
UPDATE #Descriptions
SET Desc1 = replace(Desc1, 'VERY LONG HTML 1','VERY LONG HTML 2')
WHERE Desc1 like 'VERY LONG HTML 1'
I'm getting
Text or binary data will be truncated
error. The HTML parts are long, but should fit easily - whole thing fits in varchar(max) and these are only parts of it.
Can anyone help please?
Thank you!
M.
Try using this:
field = replace(cast(field as varchar(max)),'string' ,'replacement')
As Sean Lange pointed out it's because of output string being truncated. This can be found in official documentation:
Return Types
Returns nvarchar if one of the input arguments is of the nvarchar data type; otherwise, REPLACE returns varchar. Returns
NULL if any one of the arguments is NULL. If string_expression is not
of type varchar(max) or nvarchar(max), REPLACE truncates the return
value at 8,000 bytes. To return values greater than 8,000 bytes,
string_expression must be explicitly cast to a large-value data type.
This can be worked around by splitting your string into multiple strings of fixed length, replacing the data in each, and then concatenating them back together.

Varbinary and image conversion

I have an Access front end that links to a SQL Server backend.
There are 3 fields in a table that I am trying to convert to text from the backend:
o_name varbinary(2000)
O_PropertyBinary1 varbinary(2000)
O_PropertyBinary2 image
I can convert the o_name field using:
convert(varchar([max]),[O_Name])
and that works fine.
e.g. 4153534554 = ASSET
However, what can I use for the other two fields, as it seems I can't convert an image field and converting the O_PropertyBinary1 comes out with garbage characters.
The output depends on the stored data and on using the appropriate conversion.
If the stored data is binary, e.g. bitmaps, converting to text will never give a usable result.
If the stored data is text, it could be varchar or nvarchar, and the right conversion depends on which one it is.
In the example below, VC_VB2NVarchar and VC_IMG2NVarchar would display the garbage characters you describe.
Declare @tab Table(nvc NVarchar(100), vc Varchar(100)
    ,img image, vb VarBinary(200), img2 image, vb2 VarBinary(200))
Insert into @tab (nvc, vc) Values ('123456789', '123456789')
Update @tab set vb=Convert(VarBinary(200),nvc), img=Convert(Image,Convert(Varbinary(max),nvc))
    ,vb2=Convert(VarBinary(200),vc), img2=Convert(Image,Convert(Varbinary(max),vc))
Select nvc, vc
    ,CONVERT(Nvarchar(100),vb) as NVC_VB2NVarchar
    ,CONVERT(Varchar(200),vb) as NVC_VB2Varchar
    ,CONVERT(Nvarchar(100),Convert(VarBinary(max),img)) as NVC_IMG2NVarchar
    ,CONVERT(Varchar(200),Convert(VarBinary(max),img)) as NVC_IMG2Varchar
    ,CONVERT(Nvarchar(100),vb2) as VC_VB2NVarchar
    ,CONVERT(Varchar(200),vb2) as VC_VB2Varchar
    ,CONVERT(Nvarchar(100),Convert(VarBinary(max),img2)) as VC_IMG2NVarchar
    ,CONVERT(Varchar(200),Convert(VarBinary(max),img2)) as VC_IMG2Varchar
from @tab
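The byte-level reason for the garbage can be reproduced outside SQL Server. The Python sketch below is an illustration under two assumptions: nvarchar data is treated as UTF-16LE, and varchar data as a single-byte code page (Windows-1252 is picked here only as an example).

```python
text = "123456789"

nvarchar_bytes = text.encode("utf-16-le")  # nvarchar layout: 2 bytes per character
varchar_bytes = text.encode("cp1252")      # assumed single-byte code page for varchar

# Matching conversions round-trip cleanly.
print(nvarchar_bytes.decode("utf-16-le"))
print(varchar_bytes.decode("cp1252"))

# Mismatched conversions do not.
wide_as_narrow = nvarchar_bytes.decode("cp1252")                       # NUL after every digit
narrow_as_wide = varchar_bytes.decode("utf-16-le", errors="replace")   # digit pairs fused together
print(repr(wide_as_narrow))
print(repr(narrow_as_wide))

# The O_Name example from the question is plain single-byte text.
print(bytes.fromhex("4153534554").decode("ascii"))
```

So if O_PropertyBinary1 comes out as garbage with a varchar conversion, trying the nvarchar conversion is a reasonable next step; if neither gives readable text, the column probably holds non-text binary data.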