Parquet Binary Data type - impala

I have a question regarding the Binary data type. I am trying to write a Parquet schema for my MR job to create the Parquet file myself, rather than having Hive or Impala create one. I see some references to a Binary type which I do not see in Parquet.
Is binary an alias to BYTE_ARRAY?
Also is UTF-8 a default encoding on Binary data types?

Raw bytes are stored in Parquet either as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY) or as a variable-length byte array (BYTE_ARRAY, also called binary). Fixed is used when you have values with a constant size, like a SHA1 hash value. Most of the time, the variable-length version is used.
Strings are encoded as variable-length binary with the UTF8 type annotation to indicate how to interpret the raw bytes back into a String. UTF8 is the only encoding supported in the format, but not every binary uses UTF8 because not all binary fields are storing string data.
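The byte/string distinction described above can be sketched in plain Python (the annotation and type names come from the Parquet format spec; the example values are made up):

```python
# Parquet always stores raw bytes; the UTF8 annotation only tells readers
# to decode those bytes back into a string.

sha1_digest = bytes(range(20))        # constant-size value -> FIXED_LEN_BYTE_ARRAY(20)
name_bytes = "café".encode("utf-8")   # string -> BYTE_ARRAY annotated as UTF8

# Without the annotation a reader can only hand back the raw bytes;
# with it, the bytes are interpreted as a UTF-8 string.
name = name_bytes.decode("utf-8")
assert name == "café"
assert len(sha1_digest) == 20         # fixed width, so fixed-length storage fits
```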

There is no data type called BYTE_ARRAY in parquet-column.
I looked at PrimitiveType in the latest package but could not find it.
I could not write a byte[] as binary either.

Related

SSIS error "UTF8" has no equivalent in encoding "WIN1252"

I'm using an SSIS package to extract data from a Postgres database, but I'm getting the following error for one of the tables:
Character with byte sequence 0xef 0xbf 0xbd in encoding "UTF8" has no
equivalent in encoding "WIN1252"
I have no idea how to resolve it. I changed all the columns in the SQL table to NVARCHAR(MAX), but that did not help. Please provide a solution.
The full Unicode character set (as encoded in UTF8) contains well over a hundred thousand characters. WIN1252 contains at most 256. Your data contains characters that cannot be represented in WIN1252.
You either need to export to a more useful character encoding, remove the "awkward" characters from the source database or do some (lossy) translation with SSIS itself (I believe "character map translation" is what you want to search for).
I would recommend first spending an hour or so reading up on Unicode, its UTF encodings, and their relationship to the ISO and WIN character sets. That way you will understand which of the above to choose.
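As a side note, the specific byte sequence in the error is itself telling: 0xef 0xbf 0xbd is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which usually means the source data was already mangled by an earlier conversion. A short stdlib-only Python check illustrates that, and why the WIN1252 export fails:

```python
# The byte sequence from the error message decodes to U+FFFD.
bad_bytes = b"\xef\xbf\xbd"
ch = bad_bytes.decode("utf-8")
assert ch == "\ufffd"

# WIN1252 (cp1252) has no slot for this character, hence the export error.
try:
    ch.encode("cp1252")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# A lossy translation, as suggested above, substitutes instead of failing:
assert ch.encode("cp1252", errors="replace") == b"?"
```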

What is the limit of BINARY data types in Hive 1.2?

I did not find much about BINARY data types in apache docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
I created a table with a BINARY column using:
create table table1(col1 binary);
After fetching metadata via JDBC I found,
columnSize:2147483647
Is there any official document for this?
From the Binary DataType Proposal:
How is 'binary' represented internally in Hive
Binary type in Hive will map to 'binary' data type in thrift.
Primitive java object for 'binary' type is ByteArrayRef
PrimitiveWritableObject for 'binary' type is BytesWritable
And since ByteArrayRef holds a reference to a byte array, the limit should be Integer.MAX_VALUE - 5, the practical maximum length of a Java array; see here.
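A quick arithmetic sanity check of the columnSize reported by JDBC (plain Python, no Hive involved):

```python
# The JDBC metadata value is simply the largest 32-bit signed integer,
# i.e. Java's Integer.MAX_VALUE -- the nominal upper bound for the length
# of the Java byte[] backing the column.
INTEGER_MAX_VALUE = 2**31 - 1
assert INTEGER_MAX_VALUE == 2147483647

# In practice JVM array headers shave a few elements off the top, which
# is where the Integer.MAX_VALUE - 5 figure above comes from.
practical_limit = INTEGER_MAX_VALUE - 5
assert practical_limit == 2147483642
```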

Convert sql binary (16) to utf-8 byte in .net

I have values stored in a sql database with datatype of binary(16), they come over into the .NET application (using Entity Framework) as type System.Data.Linq.Binary. I'd like to convert this binary representation of my data to data type byte[] without losing any data and preferably using UTF-8 encoding. Is this not built into the .NET framework? Must I convert it to some intermediary data type first before being able to get my byte array?
Binary has nothing to do with UTF-8.
binary(16) means 16 bytes of binary data. There is a Binary.ToArray() method to get a byte array.
https://msdn.microsoft.com/en-us/library/system.data.linq.binary.toarray
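To illustrate the same point in Python rather than C# (the variable names here are made up; Binary.ToArray() simply copies out the underlying bytes):

```python
import uuid

# binary(16) is just 16 raw bytes (a GUID is a common occupant). Getting
# them out as a byte array is a copy, not a character-encoding conversion,
# so no encoding (UTF-8 or otherwise) is involved.
value = uuid.uuid4().bytes        # stand-in for a binary(16) column value
assert isinstance(value, bytes) and len(value) == 16

# An encoding only enters the picture if the 16 bytes are known to BE
# encoded text, for example:
text_bytes = "16-byte string!!".encode("utf-8")
assert len(text_bytes) == 16
assert text_bytes.decode("utf-8") == "16-byte string!!"
```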

Postgres' text column doesn't like my zlib compressed data

Is there a better data type to be using to store a zlib compressed string in Postgresql?
Use bytea "The bytea data type allows storage of binary strings"
Use a bytea. Zlib-compressed data is not text.
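A small stdlib-only Python sketch of why: zlib output is arbitrary binary that round-trips losslessly as bytes, so it belongs in a bytea column rather than text:

```python
import zlib

# zlib output routinely contains NUL bytes and sequences that are not
# valid in any text encoding, which is why a Postgres text column rejects
# it. bytea stores it byte-for-byte.
original = ("a blog post " * 100).encode("utf-8")
compressed = zlib.compress(original)

assert compressed != original                    # actually transformed
assert zlib.decompress(compressed) == original   # lossless round trip
```

Pass the compressed bytes as a binary parameter from your client library and Postgres will store them in the bytea column unchanged.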

How much difference does BLOB or TEXT make in comparison with VARCHAR()?

If I don't know the length of a text entry (e.g. a blog post, description or other long text), what's the best way to store it in MYSQL?
TEXT would be the most appropriate for text of unknown size. VARCHAR is limited to 65,535 characters from MySQL 5.0.3 (and 255 characters in earlier versions), so if you can safely assume your text will fit, it is a better choice.
BLOB is for binary data, so unless you expect your text to be in binary format it is the least suitable column type.
For more information refer to the MySQL documentation on string column types.
use TEXT if you want it treated as a character string, with a character set.
use BLOB if you want it treated as a binary string, without a character set.
I recommend using TEXT.
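The character-set distinction above maps neatly onto Python's str vs bytes, which also shows why character and byte lengths differ (the example string is made up):

```python
# TEXT is a sequence of characters in a declared character set;
# BLOB is a sequence of opaque bytes. Length is measured differently.
post = "naïve café"
encoded = post.encode("utf-8")

assert len(post) == 10       # characters, what a TEXT length measures
assert len(encoded) == 12    # bytes: the two accented letters take 2 each
assert encoded.decode("utf-8") == post
```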