In Hive, what is the difference between FIELDS TERMINATED BY '\u0004' and FIELDS TERMINATED BY '\u001C'?

In my project I saw two Hive tables. In the create table statement, one table has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0004' and the other has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u001C'. I want to know what '\u0004' and '\u001C' mean and when to use each. Kindly answer.

In many text formats, \u introduces a Unicode escape sequence. This is a way of storing or sending a character that can't be easily displayed or represented in the format you're using. The four characters after the \u are the Unicode "code point" in hexadecimal. A Unicode code point is a number denoting a specific Unicode character.
All characters have a code point, even the printable ones. For example, a is U+0061.
U+0004 (End of Transmission) and U+001C (File Separator) are both unprintable control characters, meaning there's no standard glyph you can use to display them on the screen. That's why an escape sequence is used here.
If you use a simple, printable character like , as your field delimiter, it will make the stored data easier for a human to read. The field values will be stored with a , between each one. For example, you might see the values one, two and three stored as:
one,two,three
But if you expect your field values to actually contain a ,, it would be a poor choice of field delimiter (because then you'd need a special way to tell the difference between a single field with a value of one,two or two different fields with the values one and two). The choice of delimiter depends both on whether you want to be able to read it easily, and what characters you expect the field to contain.

Related

Characterizing the data format of the inputs that an AWK tool processes?

AWK newbie here.
I am trying to characterize (for myself) the data format that an AWK tool expects of the input it processes. (Terminology question: Would such a "data format characterization" be called "AWK's data format model"?) Below is my attempt at a characterization. Is it correct? Is it complete? Is it easy to read and understand? What changes/additions are needed to make it correct, complete, and easy to read/understand?
As an aside: One of the things that I really like about AWK is that the data format of its input is readily described in a few short sentences. That's powerful! Contrast with other common data formats (e.g., XML, JSON, CSV) which require many pages of dense prose.
The data format consists of lines (lines are strings that are
typically separated by newlines, although the user may use a symbol
other than newline, if desired). Each line contains fields. Fields are
ASCII strings. Fields are separated by a delimiter (common delimiters
include the tab, space, or comma symbol, although the user is free to
use another symbol if desired). Fields may contain the field delimiter
symbol provided the symbol is preceded by a backslash symbol (this is
called "escaping the symbol"). Fields may be empty. Each line has zero
or more fields. Lines do not need to have the same number of fields.
CSV (...) which require many pages of dense prose.
I must protest: CSV is defined by RFC 4180, and the prose in "Definition of the CSV Format" is seven points spanning at most two pages, so I cannot call that "many".
Is it complete?
I would say not, because you are using terms without defining them. For example, what is an "ASCII string", and what is a "symbol"?
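The characterization above can be sketched as a tiny parser. Note this follows the asker's description (backslash escapes the delimiter), which is not how awk itself handles FS; the example input is hypothetical:

```python
def split_record(line, delim=","):
    """Split one line into fields, treating a backslash before any
    character as an escape (per the characterization above)."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            # Escaped character: keep it literally, even a delimiter.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == delim:
            # Unescaped delimiter ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

print(split_record(r"one,two\,half,three"))  # ['one', 'two,half', 'three']
print(split_record("a,,c"))                  # ['a', '', 'c'] -- fields may be empty
```

This also makes the underspecified corners visible: for instance, an empty line yields a single empty field here, which the prose ("each line has zero or more fields") leaves ambiguous.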

Can't migrate to BigQuery because BigQuery column names allow only English characters

BigQuery column names (fields) can only contain English letters, numbers, and underscores.
I am using Python and I want to create a script to migrate my data from Postgres to BigQuery, but the Postgres tables have many non-English column names.
I will probably need to encode the column names into some format that BigQuery accepts, but I will also need the ability to later decode them back to the originals.
What is the best way to do this?
You can encode the column names to something like base64 and replace the +, =, and / characters with some kind of placeholder.
If you don't care about field-name length, you can encode to base32 instead (it's about 20% longer than base64, but it doesn't use '+' or '/', and '=' is used only for padding, so you can discard it without affecting the string).
Alternatively, you can build a small conversion table mapping each non-English character in your language to some combination of English characters; this will only work if you have a small number of non-English characters.
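The base32 approach above can be sketched as follows. The helper names and the "col_" prefix are my own choices (not a BigQuery API); the prefix guarantees the result starts with a letter:

```python
import base64

def encode_column(name: str) -> str:
    # base32 output uses only A-Z, 2-7 and '=' padding; strip the padding
    # (its length can be recomputed on decode) and add a fixed prefix so
    # the result always starts with a letter rather than a digit.
    b32 = base64.b32encode(name.encode("utf-8")).decode("ascii")
    return "col_" + b32.rstrip("=")

def decode_column(encoded: str) -> str:
    raw = encoded[len("col_"):]
    # Restore the '=' padding: base32 input length must be a multiple of 8.
    return base64.b32decode(raw + "=" * (-len(raw) % 8)).decode("utf-8")

original = "prénom"            # hypothetical non-English column name
encoded = encode_column(original)
print(encoded)                 # contains only letters, digits and '_'
assert decode_column(encoded) == original
```

The round trip is lossless for any Unicode name, at the cost of unreadable column names in BigQuery, so keeping a mapping table of encoded-to-original names alongside the data may be worthwhile.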

Can we select data that contains spaces from the DB without the spaces?

I have a textbox to search in my table. My table name is ADDRESSBOOK and this table holds personal records like name, surname, phone numbers, etc. The phone numbers are stored like "0 123 456789". If I write "0 123 456789" in my textbox, this code works in the background:
SELECT * FROM ADDRESSBOOK WHERE phonenumber LIKE "0 123 456789"
My problem is: how can I select the same row by writing "0123456789" in the textbox? Sorry for my English.
You can use replace():
WHERE REPLACE(phonenumber, ' ', '') LIKE REPLACE('0 123 456789', ' ', '')
If performance is an issue, you can do the following in SQL Server:
alter table t add phonenumber_nospace as (replace(phonenumber, ' ', ''));
create index idx_t_phonenumber_nospace on t(phonenumber_nospace);
Then, remove the spaces in the parameter value before constructing the query, and use:
WHERE phonenumber_nospace = #phonenumber_nospace
This assumes an equality comparison, as in your example.
If the phone number is stored in a specific format, then you can insert spaces at the specific locations and then pass that to the database query.
Take, for example, the number from the question, 0 123 456789.
There is a space after the first digit and another after the fourth digit, so you could take the text from the textbox, insert a space at the second position and another at the sixth position (after inserting the first space, the next three digits shift, so the second space lands at the sixth position), and pass that text to the database query.
An important part of Db design is ensuring data consistency. The more consistently it's stored, the easier it is to query. That's why you should make a point of ensuring your columns use the correct data types:
Dates/time columns should use an appropriate date/time type.
Number columns should use a numeric type of the appropriate size. (None of this numeric varchar rubbish.)
String columns should be of the appropriate length (whether char or varchar).
Columns with referential relationships should never store invalid references to the referenced table.
And similarly, you need to determine the exact format you wish to use when storing telephone numbers; and ensure that any time you store a number it's done so consistently.
Some queries will be complex enough as is. As soon as you're unable to rely on a consistent format, your queries to find data need to cater for all the possible variations. They'll be less likely to leverage indexes effectively.
I have seen arguments in favour of storing telephone numbers as numeric data. (It is, after all, a "number".) I'm not really convinced, though, because that approach cannot represent leading zeroes (which you may well need to keep).
Conclusion
Whenever you insert/update a telephone number, ensure it's stored in a consistent format. (NOTE: You can be flexible about how the number appears to your users. It's only the stored value that needs to be consistent.)
Whenever you search for a telephone number, convert the search value into the compatible format before searching.
It's up to you exactly where/how you do these conversions. But you might wish to consider CHECK constraints to ensure that if you fail to convert a number appropriately at some point, it isn't accidentally stored in the incorrect format. E.g.
CONSTRAINT CK_NoSpacesInTelno CHECK (Telephone NOT LIKE '% %')
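The normalize-on-write, normalize-on-search idea can be sketched in Python (the application layer in front of the SQL above); the function name is my own:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip everything except digits, so '0 123 456789',
    '0123 456789' and '(0123) 456-789' all map to one canonical
    stored value. Leading zeroes survive because the result stays
    a string, not a number."""
    return re.sub(r"\D", "", raw)

# Normalize once before INSERT/UPDATE, and again on the search term,
# so both sides of the comparison use the same format.
stored = normalize_phone("0 123 456789")
search = normalize_phone("0123456789")
print(stored)            # 0123456789
print(stored == search)  # True
```

How the number is *displayed* to users can then be reformatted freely at the presentation layer; only the stored value needs to be canonical.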

How to get rid of special character in Netezza columns

I am transferring data from one Netezza database to another using Talend, an ETL tool. When I pull data from a varchar(30) field and try to put it in the new database's varchar(30) field, it gives an error saying it's too long. Logs show the field has whitespace at the end followed by a square, representing some character I can't figure out. I attached a screenshot of the logs below. I have tried writing SQL to pull this field and replace what I thought was a CRLF, but no luck. When I do a select on the field and get the length, it has a few extra characters than what you see, so something is there and I want to get rid of it. Trimming does not do anything.
This SQL does not return a length shorter than simply doing length() on the column itself. Does anyone know what else it could be?
SELECT LENGTH(trim(translate(TRANSLATE(<column>, chr(13), ''), chr(10), ''))) as len_modified
Note that the last column in the logs, where you see a square in brackets, is supposed to show the last character examined.
Save the data to a larger target column size that works: if the data is 30 characters, put it in a 500-character column and get the load working first. Then go through the longest values character by character to determine what is being added. Use functions like ascii() to get the code of the individual characters at the beginning and end of each value; most likely you are picking up an extra character in one of those places. Once you know what the extra character is, write code to remove it (or never load it) so the value fits the 30-character column. Alternatively, just leave the target column longer and keep the extra characters; for example, varchar(30) becomes varchar(32) (wasting a little space, but not altering the data as it comes in to you).
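The character-by-character inspection can be done outside the database too. A hypothetical sketch in Python, with a made-up value containing a trailing carriage return and a non-breaking space, two common culprits that TRIM() typically ignores:

```python
def show_hidden(value: str):
    """Return each character with its code point, so invisible
    characters (CR, LF, NBSP, ...) become visible."""
    return [(repr(ch), ord(ch)) for ch in value]

# Hypothetical suspect value pulled from the source table:
suspect = "ACME Corp \r\xa0"
for ch, code in show_hidden(suspect):
    print(ch, code)
# The last entries reveal code points 13 (CR) and 160 (NBSP), which
# explain why LENGTH() is larger than the visible text and why a
# plain TRIM() (which only strips ordinary spaces) changes nothing.
```

This mirrors the SQL approach of looping ascii(substr(col, n, 1)) over positions, but is often quicker for one-off investigation of a dumped sample.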

How do I escape an enclosure character in a SQL Loader data file?

I have a SQL*Loader control file that has a line something like this:
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '#'
Normally, I'd use a quotation mark, but that seems to destroy emacs's python syntax highlighting if used inside a multi-line string. The problem is that we are loading an ADDRESS_LINE_2 column where only 7,000 out of a million records are loading because they have lines like this:
...(other columns),Apt #2,(other columns)...
Which is of course causing errors. Is there any way to escape the enclosing character so this doesn't happen? Or do I just need to choose a better enclosing character?
I've looked through the documentation, but don't seem to have found an answer to this.
I found it...
If two delimiter characters are encountered next to each other, a single occurrence of the delimiter character is used in the data value. For example, 'DON''T' is stored as DON'T. However, if the field consists of just two delimiter characters, its value is null.
Field List Reference
Unfortunately, SQL*Loader counts both occurrences of the delimiter when checking the maximum length of the field. For instance, DON''T will be rejected from a CHAR(5) field with ORA-12899: value too large for column blah.blah2 (actual: 6, maximum: 5).
At least in my 11gR2; I haven't tried other versions.
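If you generate the data file yourself, the doubling rule quoted above suggests pre-escaping the enclosure character when writing each field. A hypothetical sketch (the helper name and sample row are mine, not SQL*Loader's):

```python
def enclose_field(value: str, enclosure: str = "#") -> str:
    """Wrap a field for a control file using OPTIONALLY ENCLOSED BY '#',
    doubling any embedded '#' per the Field List Reference rule quoted
    above, so 'Apt #2' survives the load."""
    return enclosure + value.replace(enclosure, enclosure * 2) + enclosure

row = ["123 Main St", "Apt #2", "Springfield"]
line = ",".join(enclose_field(f) for f in row)
print(line)  # #123 Main St#,#Apt ##2#,#Springfield#
```

Per the caveat about ORA-12899, the doubled character still counts toward the column's maximum length check in (at least) 11gR2, so columns whose data may contain the enclosure character need to be sized with that overhead in mind.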