How can I force BigQuery to autodetect schema with all strings?

Is there a way to use --autodetect in BigQuery while forcing all new fields to be treated as strings?
The issue is the following: I have a CSV file, tab-separated (\t), where every field is quoted with single quotes, like '67.4'. If I simply provide a schema, the bq load breaks for reasons I cannot understand. If I run bq load --autodetect it works fine, but the values stay quoted. So I tried:
bq load --autodetect --quote="'" --max_bad_records=10000
--field_delimiter="\t" --source_format=CSV
repo:abc.2017 gs://abc/abc_2017-*.csv.gz
But now it breaks with:
- gs://abc/abc_2017-04-16.csv.gz: Error while reading data,
error message: Could not parse '67.4' as int for field
int64_field_35 (position 35) starting at location 2138722
Here's one row, fields again are separated by tabs:
'333933353332333633383339333033333337' '31373335434633' 'pre' 'E' '1' '333933383335333833393333333333383338' '2017-02-01
05:13:59' '29' '333733333330333033323339333933313335333333303333333433393336' '333333353331333933363338333033373333333833323338333733323330' '3333343234313434' 'R' 'LC' '100' '-70.2' '-31.34' 'HSFC310' 'WOMT24I' '146' '1' '05'
Ideas?

Schema auto-detection samples up to the first 100 rows, so if a column contains only integers within those rows, the detected data type will be INTEGER. The purpose of the --quote flag is to declare the character that encloses column values.
Example:
Sample csv data:
col1, col2
1, "2"
If you don't specify --quote, it defaults to ". The data type for col2 will then be INTEGER and the value will be 2.
If you specify a --quote character other than the default ", that character is used as the enclosing character instead. For example, with --quote="'", col2 is detected as STRING and its value is "2" (the double quotes themselves become part of the value).
As of now you can't force schema auto-detection to make all your columns a certain data type; otherwise it wouldn't really be auto-detection. You may want to file a feature request for an additional bq load flag (and a UI option) that forces particular columns, or all columns, to a given data type (e.g. make columns 1, 2, 15, 100 STRING, or make all columns STRING/INTEGER/NUMERIC, and so on).

Related

How to properly determine data types when I have numbers in quotes or mixed data?

I know this is a very simple question, but I can't get any further. I want to import data from a CSV file into PostgreSQL. I have made a table with columns named as they are in the file, and the first problem is that I don't know the data types. When I open the CSV file I see something like this:
"COLUMN1";"COLUMN2";"COLUMN3";"COLUMN4"
"009910";NA;NA;"FALSE"
"953308";0;41;"TRUE"
"936540";NA;NA;"FALSE"
"902346";1;5;"TRUE"
"747665";NA;NA;"FALSE"
"074554";NA;NA;"FALSE"
"154572";NA;NA;"FALSE"
When I import this data via pgAdmin 4, it returns a data type error. I set column2 as integer, but its contents are somewhat 'mixed'. I also set column1 as integer, but the numbers are quoted, so I wonder if PostgreSQL sees them as strings. The same applies to column4. How should I properly determine the data type of each column?
During import it will cast the value to the column's type, if possible.
For example, if you do SELECT 'FALSE'::boolean it will cast and return false. SELECT '074554'::int works as well and returns 74554.
But the bare characters NA will give you problems. If those are intended to be null, try to do a find/replace on the file and just take them out, so that the first row of data has "009910";;;"FALSE" and see if that works.
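Alternatively, instead of editing the file, COPY's NULL option can do that substitution at load time: in CSV mode the NULL marker only matches unquoted values, so the bare NA becomes NULL while the quoted values are left alone. A rough sketch of that variant, where mytable and the file path are just placeholders:
-- Load directly into the typed table, turning unquoted NA into NULL (CSV mode only)
COPY mytable (column1, column2, column3, column4)
FROM '/path/to/file.csv'
WITH (FORMAT csv, HEADER, DELIMITER ';', NULL 'NA');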
You could also have all columns as text, quote the NA values, and import.
Then create a new table, and use INSERT INTO ... SELECT from the all-text table and manually cast or use CASE as needed to convert types.
For example, if you imported into a table called raw_data and have a nicer table called imports:
INSERT INTO imports
SELECT
  column1::int,
  CASE WHEN column2 = 'NA' THEN null ELSE column2::int END,
  CASE WHEN column3 = 'NA' THEN null ELSE column3::int END,
  column4::boolean
FROM raw_data;
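For reference, the all-text staging side might look like this (raw_data and the file path are placeholders; the NA strings come in as plain text and the CASE expressions above turn them into nulls):
-- Staging table: every column as text, so nothing fails to parse on import
CREATE TABLE raw_data (
  column1 text,
  column2 text,
  column3 text,
  column4 text
);

COPY raw_data
FROM '/path/to/file.csv'
WITH (FORMAT csv, HEADER, DELIMITER ';');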

Amazon Spectrum incremental load directly from string

I have a field like 'filename Pro_180913_171842' in Spectrum.
I tried the following in SQL:
select fields
from spectrum.ex
where cast(SPLIT_PART('filename Pro_180913_171842', 'Pro_', 2) as timestamp)
      > cast('2018-09-12 15:13:54.0' as timestamp)
but it only returned empty rows!
The value you extract there is just a bare string with no date component, so unless you add date information it makes no sense to cast it and compare it against a full timestamp. If you only intend to compare the time portion, try this:
SELECT fields
FROM spectrum.ex
WHERE SPLIT_PART('filename Pro_180913_171842', '_', 3) > '151354';
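If the 180913 portion is actually a YYMMDD date and 171842 an HHMMSS time, you could instead rebuild a full timestamp and keep the original comparison. A rough sketch, assuming Redshift's TO_TIMESTAMP format strings behave as expected for this pattern:
-- Rebuild 'YYMMDD HHMMSS' from parts 2 and 3 and compare as a real timestamp
SELECT fields
FROM spectrum.ex
WHERE TO_TIMESTAMP(SPLIT_PART('filename Pro_180913_171842', '_', 2) || ' ' ||
                   SPLIT_PART('filename Pro_180913_171842', '_', 3),
                   'YYMMDD HH24MISS')
      > CAST('2018-09-12 15:13:54.0' AS timestamp);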

Invalid digits on Redshift

I'm trying to load some data from a staging area into a relational environment, and something is happening that I can't figure out.
I'm trying to run the following query:
SELECT
CAST(SPLIT_PART(some_field,'_',2) AS BIGINT) cmt_par
FROM
public.some_table;
some_field is a column that contains two numbers joined by an underscore, like this:
some_field -> 38972691802309_48937927428392
And I'm trying to get the second part.
That said, here is the error I'm getting:
[Amazon](500310) Invalid operation: Invalid digit, Value '1', Pos 0,
Type: Long
Details:
-----------------------------------------------
error: Invalid digit, Value '1', Pos 0, Type: Long
code: 1207
context:
query: 1097254
location: :0
process: query0_99 [pid=0]
-----------------------------------------------;
Execution time: 2.61s
Statement 1 of 1 finished
1 statement failed.
It's literally saying some numbers are not valid digits. I've already tried to track down the exact data that throws the error, and it looks like a normal field, just as I expected. It happens even if I filter out NULL fields.
I thought it might be an encoding error, but I haven't found any references to support that.
Does anyone have any ideas?
Thanks everybody.
I just ran into this problem and did some digging. Seems like the error Value '1' is the misleading part, and the problem is actually that these fields are just not valid as numeric.
In my case they were empty strings. I found the solution to my problem in this blogpost, which is essentially to find any fields that aren't numeric, and fill them with null before casting.
select cast(colname as integer)
from (
  select
    case when colname ~ '^[0-9]+$' then colname
    else null
    end as colname
  from tablename
) t;  -- Redshift requires an alias for the subquery
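If you prefer, the same guard can be written without the subquery; a minimal equivalent sketch:
-- CASE without an ELSE yields NULL for non-numeric values, so the cast never sees them
select cast(case when colname ~ '^[0-9]+$' then colname end as integer)
from tablename;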
Bottom line: this Redshift error is completely confusing and really needs to be fixed.
When you are using a Glue job to upsert data from a data source to Redshift, Glue may rearrange the columns before copying, which can cause this issue. This happened to me even after using ApplyMapping.
In my case, the data types were not an issue at all; in the source they were cast to exactly match the fields in Redshift.
Glue was reordering the columns alphabetically by column name and then copying the data into the Redshift table, which obviously throws an error because my first column is an ID key, unlike the other string columns.
To fix the issue, I added a SQL query transform within Glue that runs a SELECT with the columns in the correct order for the table.
It's strange that Glue did this even after ApplyMapping, but the workaround helped.
For example: the source table has fields ID|EMAIL|NAME with values 1|abcd#gmail.com|abcd, and the target table also has fields ID|EMAIL|NAME. But when Glue upserts the data, it reorders the columns by name before writing, so it tries to write abcd#gmail.com|1|abcd into ID|EMAIL|NAME. This throws an error because ID expects an int value and EMAIL expects a string. I used a SQL query transform with the query "SELECT ID, EMAIL, NAME FROM data" to rearrange the columns before writing the data.
Hmmm. I would start by investigating the problem. Are there any non-digit characters?
SELECT some_field
FROM public.some_table
WHERE SPLIT_PART(some_field, '_', 2) ~ '[^0-9]';
Is the value too long for a bigint?
SELECT some_field
FROM public.some_table
WHERE LEN(SPLIT_PART(some_field, '_', 2)) > 18;
A bigint holds at most 19 digits (up to 9223372036854775807), so if you need more precision than that, consider a decimal rather than a bigint.
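If such long values do turn up, here is a rough sketch of the decimal variant (38 digits is Redshift's maximum numeric precision):
-- Cast to a wide DECIMAL instead of BIGINT when the second part can exceed 18 digits
SELECT CAST(SPLIT_PART(some_field, '_', 2) AS DECIMAL(38, 0)) AS cmt_par
FROM public.some_table;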
If you get an error message like "Invalid digit, Value 'O', Pos 0, Type: Integer", try executing your COPY command without the header row: use the IGNOREHEADER parameter to skip the first line of the data file.
So the COPY command will look like this:
COPY orders
FROM 's3://sourcedatainorig/order.txt'
CREDENTIALS 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>'
DELIMITER '\t'
IGNOREHEADER 1;
For my Redshift SQL, I had to wrap my columns with Cast(col As Datatype) to make this error go away.
For example, casting my columns to Char with a specific length worked:
Cast(COLUMN1 As Char(xx)) = Cast(COLUMN2 As Char(xxx))

ColdFusion Query of Queries with Empty Strings

The query I start out with has 40,000 empty rows, which stems from a problem with the original spreadsheet it was taken from.
Using CF16 server
I would like to do a Query of Queries on a variably named 'key column'.
In my query:
var keyColumn = "Permit No.";
var newQuery = "select * from source where (cast('#keyColumn#' as varchar) <> '')";
Note: the casting comes from this suggestion
I still get all those empty fields in there.
But when I use "City" as the keyColumn, it works. How do the values in both those columns differ when they both say [empty string] on the query dump?
Is it a problem with the column names? What kind of data is in those cells?
where ( cast('Permit No.' as varchar) <> '' )
The problem is the SQL, not the values. By enclosing the column name in quotes, you are actually comparing the literal string "P-e-r-m-i-t N-o-.", not the values inside that column. Since the string "Permit No." can never equal an empty string, the comparison always returns true. That is why the resulting query still includes all rows.
Unless it was fixed in ColdFusion 2016, QoQs do not support column names containing invalid characters like spaces. One workaround is to use the columnNames attribute to specify valid column names when reading the spreadsheet. Failing that, another option is to take advantage of the fact that query columns are arrays and duplicate the data under a valid column name: queryAddColumn(yourQuery, "PermitNo", yourQuery["Permit No."]). The latter option is less ideal because it may require copying the underlying data internally. Once the column exists under a valid name, the QoQ comparison works as expected; see the sketch below.
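A minimal sketch of the follow-up QoQ, assuming the data has been duplicated under the hypothetical name PermitNo and that CAST is available in your CF16 QoQ (your original query already uses it):
select *
from source
where cast(PermitNo as varchar) <> ''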

Replace null value with NA using Pentaho Kettle

I have an input CSV file in which one column's field value is empty. I want to replace that field value with NA in my destination table, where the column is specified as NOT NULL.
I tried the 'If field value is null' and 'Value mapper' steps, but it doesn't work out. Can anyone suggest how to proceed?
Nulls cannot be replaced with the 'If field value is null' step if Lazy conversion is enabled in the 'CSV file input' step.
So untick the 'Lazy conversion?' checkbox in the 'CSV file input' step. Then, in the 'If field value is null' step, tick the 'Select fields' checkbox, select the field you want to check for nulls, and type NA in the 'Replace by value' column.
There is a specific step that does just that: it replaces null values. In the step you can either a) pick one or more field types (String, Integer, etc.) or b) identify specific fields, and then provide the replacement string you want.