How to extract a file having a varbinary column in a U-SQL script using the default extractor?

I have to extract a varbinary column from a file. When I tried to extract it as byte[], it showed the error "Conversion Error. Column having invalid characters".
U-SQL Script:
EXTRACT Id int?,
        createddate DateTime?,
        Photo byte[]
FROM #input
USING Extractors.Csv(quoting: true, nullEscape: "\N");

The built-in Csv/Tsv/Text extractors assume that they operate on textual data, where binary content is hex-encoded. This is necessary, since the binary data could otherwise contain any of the delimiter characters. See https://msdn.microsoft.com/en-us/library/azure/mt621366.aspx under byte[].
So if your photo is not hex-encoded, you would have to write your own custom extractor.
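If writing a custom extractor is not an option, another route is to hex-encode the binary column before the file lands in the store, so the built-in Csv extractor can read it into byte[]. A minimal Python sketch of that idea; the filenames and column layout are assumptions, not from the question:

# Minimal sketch: hex-encode a binary "Photo" value so Extractors.Csv can
# read the column as byte[]. Filenames and columns are hypothetical.
import csv

with open("photo.jpg", "rb") as img:
    photo_hex = img.read().hex()              # e.g. "ffd8ffe000104a46..."

with open("input.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([1, "2017-01-01T00:00:00", photo_hex])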

Related

How to get text bytes used by a string in Hive?

I have some data in a Hive 1.2.1 table. I have to get the raw bytes of a specific column. The column contains raw HTML in multiple languages. To get the character length, I can use a simple query like the one below:
select baseurl, LENGTH(content) from clss limit 30;
The above query is fine for character length, but for text other than English the value is incorrect: an Arabic character, for example, is stored as Unicode, so its character count differs from its byte count. Some characters take two bytes and some take a single byte.
Is there any built-in function that returns the number of bytes of a text value instead of the number of characters?
The function character_length(string str) was added in Jira HIVE-15979, with fix version 2.3.0. If you cannot upgrade your Hive (and upgrading is quite risky), then try downloading the UDF source code and building it yourself, then add the jar and create a temporary function.
Download code: GenericUDFCharacterLength.java
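For context, the gap the question describes (character count vs. UTF-8 byte count) is easy to reproduce outside Hive; a quick illustrative Python check:

# Character count vs. UTF-8 byte count for a mixed-language string.
s = "abc \u0645\u0631\u062d\u0628\u0627"   # "abc " plus the Arabic word "marhaba"
print(len(s))                   # 9 characters
print(len(s.encode("utf-8")))   # 14 bytes - each Arabic letter takes 2 bytes here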

how to render a dicom file's header unreadable

Kind of a strange question, but I'm doing some testing to handle errors when a dicom file's tags can't be read.
Unfortunately I don't have a damaged dicom available.
Specifically, can anyone advise how to apply some sort of incorrectly encoded text tag or some invalid numeric data tag onto the file, such that it can't be read by python's pydicom package?
You could have a look at the dcmodify tool from DCMTK. It can be used to insert, modify and delete attributes. I doubt that it is possible to specify invalid attribute values through the command line, but you could surely modify the source code to accomplish that (one exception: you can definitely write attribute values that exceed the maximum length allowed by the Value Representation).
My approach would be to create a buffer of characters and write binary data to it. Then pass it to the method that writes the value to the attribute.
Examples:
write UTF-8 byte sequences which do not form a valid Unicode character
write ASCII characters which are not covered by the character set specified by (0008,0005) - not sure whether pydicom would run into problems, but it would be wrong from the DICOM perspective
write non-numeric characters to attributes with Value Representation "Decimal String" or "Integer String".
formats other than YYYYMMDD for VR "Date"
formats other than HHMMSS.FFFFFF for VR "Time"
characters other than '0'-'9' and '.' for VR "Unique Identifier"
[edit]: DCMTK, dcmodify: http://dicom.offis.de/dcmtk.php.en
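If you would rather not touch DCMTK at all, a blunter alternative (not part of the answer above) is to take any valid DICOM file and overwrite a stretch of bytes after the preamble, so that pydicom can no longer parse the data elements. A rough Python sketch with hypothetical filenames:

# Rough sketch: damage a copy of a valid DICOM file so its data elements
# no longer parse. Filenames are hypothetical.
with open("input.dcm", "rb") as f:
    data = bytearray(f.read())

# Bytes 0-127 are the preamble and bytes 128-131 are the "DICM" magic;
# overwriting data after that keeps the file recognisable as DICOM while
# turning element headers/values into garbage.
data[200:232] = b"\xff" * 32

with open("damaged.dcm", "wb") as f:
    f.write(data)

# Reading the damaged copy is then expected to fail somewhere while parsing:
#   from pydicom import dcmread
#   ds = dcmread("damaged.dcm")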

DB2 SQL: interpret a field as another CCSID

So I have a file on my AS400 as a result of DSPJRN, and I want to look at some data in the JOESD field, which is the after-image from the journal of a file. This field is defined as char with CCSID = 65535. I guess this is because it holds the whole record, with a mixture of CCSIDs and numeric fields.
I can use substr() to get the actual field from the original file. In the original file the column is defined graphic(10) ccsid 13488. That's UCS-2. If I do hex(substr(joesd,522,20)) I get a result of 004100530044... and so on, so I know it's the correct data, but I can't get it to display as 'ASD...'.
I tried graphic(substr(joesd,522,20),10,13488) but it gives an error that the conversion from CCSID 65535 to 13488 isn't valid. I don't want to convert it but to interpret it as the other CCSID.
GRAPHIC() doesn't take CCSID as a parm. The third parm is length according to my 7.1 reference.
What version are you using?
I thought CAST() might be a solution, but it doesn't appear to work.
As I see it, one option would be to build a user defined function (UDF) that does the conversion you need; possibly with the iconv() API.
The other option would be to dump the data into a properly formatted file. I use the DBUJRN utility from DBU. There are other similar options, including an open source one (sorry that the description is in German, but Google Translate does a good enough job of figuring out the source to download).
The utilities basically work the same way; you can in fact run through the same process manually. Try the following:
Step 1 (the DSPJRN you've been doing)
DSPJRN <...> OUTFILE(MYLIB/MYJRNOUT)
Step 2 - Create a new file with the journal header fields followed by all the fields from your journaled file (MYFILE)
CREATE TABLE mylib/mytbl as
( select JOENTL, JOSEQN, JOCODE, JOENTT, JODATE,
JOTIME, JOJOB, JOUSER, JONBR, JOPGM, JOOBJ,
JOLIB, JOMBR, JOCTRR, JOFLAG, JOCCID,
JOINCDAT, JOMINESD, JORES,
m.*
from MYLIB/MYJRNOUT , MYLIB/MYFILE m
) with no data
Step 3 - Copy the data without regard to the format differences:
CPYF FROMFILE(MYLIB/MYJRNOUT) TOFILE(MYLIB/MYTBL) MBROPT(*ADD) FMTOPT(*NOCHK)
You should end up with the data originally in JOESD split into its appropriate fields.
Note of course that this technique only works for one file at a time. Also, make sure you're only dumping *RCD entries and you'll probably want to skip the DELETE entries.
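As a side note on the hex dump in the question: CCSID 13488 is UCS-2, so once the bytes are interpreted (rather than converted) they decode straightforwardly as UTF-16BE; a quick, purely illustrative Python check:

# The hex string from the question, interpreted as UCS-2 / UTF-16BE.
raw = bytes.fromhex("004100530044")
print(raw.decode("utf-16-be"))   # -> "ASD"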

Inserting a string as a regular string in MongoDB

The pymongo documentation says that BSON strings are UTF-8 encoded so PyMongo must ensure that any strings it stores contain only valid UTF-8 data. Unicode strings (<type 'unicode'>) are encoded UTF-8 first. The reason our example string is represented in the Python shell as u'Mike' instead of 'Mike' is that PyMongo decodes each BSON string to a Python unicode string, not a regular str.
So I understand that to get rid of the Unicode literal 'u', I will have to call json.dumps() on the document returned by the query.
The documentation also says that regular strings (<type 'str'>) are validated and stored unaltered, and I was assuming that the query result also gives it back as a regular string and not a Unicode string.
I created a dictionary with regular string values, inserted it into the DB, and when I retrieve it, I get the strings back as Unicode. Any idea how I can get regular strings back? The purpose is to avoid calling json.dumps() on the query result. I need to fetch a large number of documents from the DB and json.dumps() is taking quite some time. The strings that I am storing contain ASCII data, so I don't need Unicode strings.
The assumption that a regular string is returned back as a regular string was not correct. It is stored unaltered and not re-encoded to UTF-8 because it is already valid UTF-8; when the query result is decoded, everything is converted back to Unicode.
Source:
Automatic string to unicode object conversion
How can I get pymongo to always return str and not unicode?
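To make the answer concrete, a small Python 2 sketch of the round trip described above (database and collection names are hypothetical); note that re-encoding individual fields is usually much cheaper than json.dumps() over whole documents:

# Round trip: a regular str goes in, a unicode string comes back out.
from pymongo import MongoClient

coll = MongoClient()["testdb"]["testcoll"]
coll.insert_one({"name": "Mike"})             # <type 'str'> on Python 2

doc = coll.find_one({"name": "Mike"})
print(type(doc["name"]))                      # <type 'unicode'> - decoded from BSON

# If plain byte strings are really needed, re-encode the fields you use:
name = doc["name"].encode("ascii")            # back to <type 'str'>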

how to import flat file source to database using sql

I currently want to import my data from a flat file to the database.
The flat file is a txt file in which I save a list of URLs. Example:
http://www.mimi.com/Hotels-g303188-Rurrenabaque-Hotels.html
I'm using the SQL Server Import and Export Wizard to do it, but at execution time it throws an error saying:
Error 0xc02020a1:
Data Flow Task 1: Data conversion failed. The data conversion for column
"Column 0" returned status value 4 and status text "Text was truncated or one
or more characters had no match in the target code page.".
Can anyone help?
You get this error because the text is too long for the column you've chosen to put it in.
Text was truncated or
You might want to check the size of the database column vis-a-vis your input data. Is the longest URL shorter than the column width?
one or more characters had no match in the target code page.".
Check if your input file has any special characters. An easy way to check this would be to save your file in ANSI (Notepad > Save As > Encoding = ANSI). Note - you'd still have to select the right code page so that the import interprets your input text correctly.
Here's a very nice link that has some background on what code pages are - http://www.joelonsoftware.com/articles/Unicode.html
Note that you can also change the target column data type (to text stream, for example) in the Data Source -> Advanced section.
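Both checks above can be automated before re-running the wizard; a small illustrative Python script (the filename is an assumption) that reports the longest line and any rows containing non-ASCII bytes:

# Scan the flat file for the two likely causes: lines longer than the target
# column, and bytes with no match in an ANSI/ASCII code page.
max_len = 0
non_ascii_rows = []

with open("urls.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        line = raw.rstrip(b"\r\n")
        max_len = max(max_len, len(line))
        try:
            line.decode("ascii")
        except UnicodeDecodeError:
            non_ascii_rows.append(lineno)

print("longest line:", max_len)                 # compare with the column width
print("rows with non-ASCII bytes:", non_ascii_rows[:10])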