Can't migrate to BigQuery because BigQuery column names allow only English characters - sql

BigQuery column names (fields) can only contain English letters, numbers, and underscores.
I am using Python and want to create a script to migrate my data from Postgres to BigQuery, but the Postgres tables have many non-English column names.
I will probably need to encode the column names into some format that BigQuery accepts, but I need the ability to later decode them back to the originals.
What is the best way to do this?

You can encode the column names with something like base64 and replace the +, =, and / characters with some kind of placeholder.
If you don't care about field-name length, you can encode to base32 instead (it's about 20% longer than base64, but it doesn't use '+' or '/', and '=' is used only for padding, so you can discard it and it won't affect the string).
Alternatively, you can build a small conversion table mapping each non-English character in your language to some combination of English characters; this will only work if you have a small number of non-English characters.
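
A minimal sketch of the base32 idea in Python (the column name below is made up, and the col_ prefix is my own addition so the encoded name starts with a letter rather than a digit):

import base64

def encode_column(name):
    # Base32 output uses only A-Z, 2-7 and '=' padding; strip the padding
    # and add a prefix so the result starts with a letter.
    encoded = base64.b32encode(name.encode("utf-8")).decode("ascii")
    return "col_" + encoded.rstrip("=").lower()

def decode_column(encoded):
    # Reverse the steps: drop the prefix, restore the padding, decode.
    raw = encoded[len("col_"):].upper()
    raw += "=" * (-len(raw) % 8)  # base32 pads to a multiple of 8 characters
    return base64.b32decode(raw).decode("utf-8")

original = "שם_עמודה"   # hypothetical non-English column name
bq_name = encode_column(original)
assert decode_column(bq_name) == original

The same pair of functions can be applied to every column name when building the BigQuery schema and again when mapping query results back to the original Postgres names.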

Related

In Hive what is the difference between FIELDS TERMINATED BY '\u0004' and FIELDS TERMINATED BY '\u001C'

In my project I saw two Hive tables: in the create table statements, one table has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0004' and the other has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u001C'. I want to know what these '\u0004' and '\u001C' mean and when to use them.
In many text formats, \u introduces a Unicode escape sequence. This is a way of storing or sending a character that can't be easily displayed or represented in the format you're using. The four characters after the \u are the Unicode "code point" in hexadecimal. A Unicode code point is a number denoting a specific Unicode character.
All characters have a code point, even the printable ones. For example, a is U+0061.
U+0004 and U+001C are both unprintable characters, meaning there's no standard character you can use to display them on the screen. That's why an escape sequence is used here.
If you use a simple, printable character like ',' as your field delimiter, it will make the stored data easier for a human to read. The field values will be stored with a ',' between each one. For example, you might see the values one, two and three stored as:
one,two,three
But if you expect your field values to actually contain a ',', it would be a poor choice of field delimiter (because then you'd need a special way to tell the difference between a single field with the value one,two and two different fields with the values one and two). The choice of delimiter depends both on whether you want to be able to read it easily and on what characters you expect the fields to contain.
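
For illustration, a short Python sketch showing that U+0004 is just an ordinary (if unprintable) character and works as a delimiter exactly like a comma would (the field values are made up):

delimiter = "\u0004"   # the character at code point U+0004

fields = ["one", "two, with a comma", "three"]
line = delimiter.join(fields)

# The delimiter is invisible in most terminals, but splitting on it
# recovers the fields, embedded commas and all.
assert line.split(delimiter) == fields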

How to find Bad characters in the column

I am trying to pull the 'COURSE_TITLE' column value from the 'PS_TRAINING' table in PeopleSoft and write it into a UTF-8 text file to be loaded into the Workday system. The file errors out while loading because of bad characters (Ã, â, and many more) present in the column. I have used a procedure that converts non-ASCII values into spaces, but because of this procedure, 'Course_Title' values written in non-English languages like Chinese, Korean, and Spanish are also replaced with spaces.
I even tried using regular expressions (regexp_like(course_title, 'Ã')) to find only the bad characters, but since the table has hundreds of thousands of rows, it would be difficult to find them all this way. Please suggest a way to solve this.
If you change your approach, this may work.
Define what you want, and retrieve it.
select *
from PS_TRAINING
where not regexp_like(course_title, '[0-9A-Za-z]')
If this pulls back too much data, just add the characters you want to allow to the regex.
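
If you can also pull the rows into Python, a complementary sketch for flagging suspect characters (the sample rows are made up, and the pattern is only a starting point if legitimate Chinese, Korean, or Spanish titles should pass through):

import re

# Flag anything outside printable ASCII; widen the class if genuine
# non-English titles should be allowed through.
bad_chars = re.compile(r"[^\x20-\x7E]")

def find_bad(rows):
    # rows: iterable of (row_id, course_title) pairs fetched from the table
    for row_id, title in rows:
        hits = set(bad_chars.findall(title or ""))
        if hits:
            yield row_id, sorted(hits)

sample = [(1, "Data Basics"), (2, "Curso de AnÃ¡lisis")]   # made-up rows
print(list(find_bad(sample)))   # only row 2 is flagged, with ['¡', 'Ã']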

How to translate and what could cause characters such as å¿è€

Goal:
I can only run SELECT statements against the databases I have access to.
One of the columns is supposed to store legible English sentences, but there are values with strange characters. I would like to find a way to translate those special characters into legible characters.
My question is twofold:
Can I translate the following strings into a legible format? As stored, the data is basically lost in translation.
How can I ensure that the data is stored correctly?
Column Collation: SQL_Latin1_General_CP1_CI_AS
Column Data Type: NVARCHAR(300)
Data Examples:
å¿è€
ÐžÐ±Ð°Ð¶Ð´Ð°Ð½Ð¸Ñ Ð·Ð°
Use the N prefix when you insert into the table, so the string literal is treated as NVARCHAR instead of being converted through the column collation's code page:
INSERT INTO TownMessage_Tbl (Elanat) VALUES (N'...')
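
The examples in the question look like classic mojibake: UTF-8 bytes that were decoded with a single-byte code page such as Windows-1252. When the bytes survive intact, the mis-decoding can sometimes be reversed; here is a Python sketch with a made-up sample string (this only works if no bytes were dropped along the way, which may not be the case for å¿è€):

original = "Привет"                              # what was meant to be stored
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)                                  # ÐŸÑ€Ð¸Ð²ÐµÑ‚

# Reverse the mis-decoding: re-encode with the wrong code page to recover
# the raw UTF-8 bytes, then decode them correctly.
repaired = mojibake.encode("cp1252").decode("utf-8")
assert repaired == original

Fixing it at the source, with the N prefix (or a parameterized query with an NVARCHAR parameter), avoids the problem entirely.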

updating label of a bigquery table/view

I am trying to add a label to my bigquery table/view using the following bq command.
bq update --set_label primary_keys:a,b project-id:dataset.tablename
The command works perfectly fine if I have only one key (a) as the primary key. However, when I try to insert multiple keys (a,b) separated by a comma, it throws an invalid characters error. Is there a way to add multiple keys within the same label separated by a comma?
I don't think this is feasible, since the comma character is not accepted there. According to the documentation:
Keys and values can contain only lowercase letters, numeric
characters, underscores, and dashes. All characters must use UTF-8
encoding, and international characters are allowed.
According to the documentation, labels are key-value pairs that help you organize your Google Cloud BigQuery resources.
Being a single key-value pair is a requirement per the documentation, and that is not compatible with your intention of giving two different values to the same key.
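
If you control the label format, a workaround is to use an allowed separator such as a dash or underscore instead of the comma. A sketch with the google-cloud-bigquery Python client (the project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("project-id.dataset.tablename")   # placeholder reference

# Commas are not allowed in label values, but dashes and underscores are,
# so encode the list of key columns with a dash.
table.labels = {"primary_keys": "a-b"}
client.update_table(table, ["labels"])

Whatever reads the label back can split on the dash to recover the individual column names.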

Redshift table column name auto convert to lowercase issue

I am facing an issue while fetching data via a query from a Redshift table. For example:
table name: test_users
column names: user_id, userName, userLastName
When the test_users table is created, Redshift converts the capital letters in the userName column so it becomes username, and similarly userLastName is converted to userlastname.
I have found ways to return all column names in uppercase or in lowercase, but not a way to get them back exactly as they were defined.
Unfortunately, AWS Redshift does not support case-sensitive identifiers at the time of writing (Feb 2020). And, while Redshift is based on PostgreSQL, AWS has heavily modified it to the point where many assumptions that would be correct for PostgreSQL 8 are not correct for Redshift.
The documentation at https://docs.aws.amazon.com/redshift/latest/dg/r_names.html explicitly states that it downcases identifiers. The relevant paragraph is below; the critical sentence is the one about ASCII letters being folded to lowercase:
Names identify database objects, including tables and columns, as well as users and passwords. The terms name and identifier can be used interchangeably. There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. To return column names in uppercase, set the describe_field_name_in_uppercase configuration parameter to true.
To preserve case:
SET enable_case_sensitive_identifier TO true;
https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html
To force returned uppercase fields (for anyone else curious):
SET describe_field_name_in_uppercase TO on;
https://docs.aws.amazon.com/redshift/latest/dg/r_describe_field_name_in_uppercase.html
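
As an illustration, a sketch of how the session parameter might be applied from Python with psycopg2 (the connection details are placeholders, and the parameter has to be enabled both when the table is created and when it is queried):

import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="***",
)

with conn, conn.cursor() as cur:
    cur.execute("SET enable_case_sensitive_identifier TO true;")
    # With the parameter on, quoted identifiers keep their case.
    cur.execute('CREATE TABLE test_users '
                '(user_id int, "userName" varchar(64), "userLastName" varchar(64))')
    cur.execute('SELECT "userName", "userLastName" FROM test_users')
    print([col.name for col in cur.description])   # expected: ['userName', 'userLastName']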