I am trying to create a table from a CSV using the schema autodetect option. It fails because some rows/columns have values that do not conform to the auto-detected type. I would like to change the type of those columns to STRING.
Is there a way to export the auto-detected schema so I can update it and use it for the load? The CSV has 30+ columns, and I would like to avoid manually writing a schema file for all of them.
Update
This question is not a duplicate of this. That question covers the case where the table already exists; here there is no existing table whose schema can be exported.
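One possible way to get an editable copy of the auto-detected schema (not from the original thread; the project, dataset, bucket, and file names below are placeholders) is to load a small sample of the CSV into a throwaway table with autodetect, dump that table's schema to JSON, change the offending columns to STRING, and then use the edited file for the real load. A rough sketch with the google-cloud-bigquery Python client:

import json
from google.cloud import bigquery

client = bigquery.Client()
temp_table = "my_project.my_dataset.autodetect_sample"   # placeholder table ID

# 1. Load a sample of the CSV with schema autodetection into a throwaway table.
sample_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)
client.load_table_from_uri(
    "gs://my-bucket/sample.csv", temp_table, job_config=sample_config   # placeholder URI
).result()

# 2. Dump the auto-detected schema to a JSON file you can edit by hand.
schema = client.get_table(temp_table).schema
with open("schema.json", "w") as f:
    json.dump([field.to_api_repr() for field in schema], f, indent=2)

# 3. After changing the offending types to STRING, load the full file with the edited schema.
load_config = bigquery.LoadJobConfig(
    schema=client.schema_from_json("schema.json"),
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)
client.load_table_from_uri(
    "gs://my-bucket/full.csv", "my_project.my_dataset.final_table", job_config=load_config
).result()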
Related
I have a requirement to validate a CSV file before loading it into a staging folder, and later load it into a SQL table.
I need to validate the metadata (the structure of the file must be the same as the target SQL table):
The number of columns should be equal to the target SQL table
The order of columns should be the same as the target SQL table
The data types of the columns (no text values should exist in a numeric field of the CSV file)
I am looking for an easy and efficient way to achieve this.
Thanks for the help.
A Python program and module that does most of what you're looking for is chkcsv.py: https://pypi.org/project/chkcsv/. It can be used to verify that a CSV file contains a specified set of columns and that the data type of each column conforms to the specification. It does not, however, verify that the order of columns in the CSV file is the same as the order in the database table. Instead of loading the CSV file directly into the target table, you can load it into a staging table and then move it from there into the target table; this two-step process eliminates the dependence on column order.
Disclaimer: I wrote chkcsv.py
Edit 2020-01-26: I just added an option that allows you to specify that the column order should be checked also.
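If you prefer not to add a dependency, here is a minimal plain-Python sketch of the three checks from the question (column count, column order, no text in numeric fields). The expected column list and the numeric column set are placeholders; adjust them to match the real target table:

import csv

# Placeholder expectation: column names in target-table order, plus which ones must be numeric.
EXPECTED_COLUMNS = ["id", "name", "amount", "created_at"]
NUMERIC_COLUMNS = {"id", "amount"}

def validate_csv(path):
    errors = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)

        # 1. Number of columns must match the target table.
        if len(header) != len(EXPECTED_COLUMNS):
            errors.append(f"expected {len(EXPECTED_COLUMNS)} columns, found {len(header)}")
        # 2. Column order must match the target table.
        elif header != EXPECTED_COLUMNS:
            errors.append(f"column order mismatch: {header}")
        else:
            # 3. No text values in numeric fields.
            numeric_idx = [i for i, c in enumerate(header) if c in NUMERIC_COLUMNS]
            for line_no, row in enumerate(reader, start=2):
                for i in numeric_idx:
                    value = row[i].strip()
                    if value:  # allow empty cells
                        try:
                            float(value)
                        except ValueError:
                            errors.append(f"line {line_no}, column {header[i]}: non-numeric value {value!r}")
    return errors

print(validate_csv("input.csv") or "CSV is valid")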
I'm trying to load a file from GCS into BigQuery with the schema auto-detected from the file in GCS. I'm using Apache Airflow to do this. The problem is that when I use schema auto-detection, BigQuery builds the schema from only the first ~100 or so values.
For example, in my case there is a column, say X, whose values are mostly integers, but some values are strings, so bq load fails with a schema mismatch. In such a scenario I need to change the column's data type to STRING.
What I could do is manually create a new table with a schema I generate myself, or set max_bad_records to something like 50, but neither seems like a good solution. An ideal solution would look like this:
Try to load the file from GCS into BigQuery; if the table is created successfully in BQ without any data mismatch, I don't need to do anything.
Otherwise, I need to be able to update the schema dynamically and complete the table creation.
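One way to approximate that flow (not from the original post; the bucket, table ID, and local header copy below are placeholders, and the fallback simply declares every column as STRING rather than relaxing only the offending one) is to attempt an autodetect load and retry on failure. A rough sketch with the google-cloud-bigquery client:

import csv
from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()
uri = "gs://my-bucket/data.csv"                # placeholder
table_id = "my_project.my_dataset.my_table"    # placeholder

def load(job_config):
    return client.load_table_from_uri(uri, table_id, job_config=job_config).result()

try:
    # First attempt: let BigQuery autodetect the schema.
    load(bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    ))
except BadRequest:
    # Fallback: declare every column as STRING, using the CSV header for the names.
    # (Assumes a local copy of the header row is available.)
    with open("local_copy_of_header.csv", newline="") as f:
        header = next(csv.reader(f))
    schema = [bigquery.SchemaField(name, "STRING") for name in header]
    load(bigquery.LoadJobConfig(
        schema=schema,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ))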
Since you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script
Use a single line that best suits your case, i.e. one whose values produce the optimized field types.
This creates the table with the correct schema and one row, and from there you can load the rest of the data.
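A rough sketch of that seed-row workaround in Python (not part of the original answer; the table ID, seed values, and GCS URI are placeholders): load one hand-picked row first so autodetect produces the types you want, then append the rest of the data against the now-fixed schema.

import io
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"    # placeholder

# 1. Seed the table with one row whose values force the desired types
#    (here the second column becomes STRING because of the non-numeric value).
seed = io.BytesIO(b"id,x\n1,not_a_number\n")
client.load_table_from_file(
    seed,
    table_id,
    job_config=bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    ),
).result()

# 2. Append the rest of the data; the existing table's schema now applies.
client.load_table_from_uri(
    "gs://my-bucket/data.csv",                 # placeholder
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ),
).result()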
For my previous homework, we were asked to import a CSV file with no column names into Impala, where we explicitly give the name and type of each column while creating the table. Now we have a CSV file that does include column names. In this case, do we still need to write out the name and type of each column even though they are provided in the data?
Yes, you still have to create an external table and define the column names and types. But you have to add the following table property right at the end of the CREATE TABLE statement:
tblproperties ("skip.header.line.count"="1");
-- Once the table property is set, queries skip the specified number of lines
-- at the beginning of each text data file. Therefore, all the files in the table
-- should follow the same convention for header lines.
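To show where that clause sits, here is a sketch of a full statement executed from Python with the impyla driver. The database, table, column names, host, and HDFS location are all placeholders; only the overall shape of the DDL is the point.

from impala.dbapi import connect   # impyla package

ddl = """
CREATE EXTERNAL TABLE my_db.my_csv_table (   -- placeholder table and columns
  id INT,
  name STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/my_csv_dir'
TBLPROPERTIES ("skip.header.line.count"="1")
"""

conn = connect(host="impala-host", port=21050)   # placeholder connection details
cur = conn.cursor()
cur.execute(ddl)   # the TBLPROPERTIES clause goes at the very end of the statement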
I have a CSV file with more than 100 columns, and I want to create a table in Oracle with a similar structure and then populate it.
Do you have any idea how to do this? (SQL*Loader, external tables, ...)
I don't want to use the classic CREATE TABLE and specify the name and type of each column.
It's a workaround, but it does work:
If you don't want to specify column names, you can import your CSV into Access.
Then, after you have defined a DSN, you can export the table to Oracle.
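If Access isn't available, a scripted alternative (not part of the original answer) is to generate the DDL from the CSV header and bulk-insert the rows. The sketch below uses the cx_Oracle driver; the file name, table name, connection string, and the all-VARCHAR2 typing are assumptions, and it assumes the header values are valid Oracle identifiers.

import csv
import cx_Oracle

CSV_PATH = "data.csv"          # placeholder
TABLE = "MY_CSV_TABLE"         # placeholder

with open(CSV_PATH, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

# Build the CREATE TABLE statement from the header, typing every column as VARCHAR2.
cols = ", ".join(f'"{name.upper()}" VARCHAR2(4000)' for name in header)
ddl = f"CREATE TABLE {TABLE} ({cols})"

conn = cx_Oracle.connect("user", "password", "host/service")   # placeholder credentials
cur = conn.cursor()
cur.execute(ddl)

# Bulk-insert all rows with positional binds.
placeholders = ", ".join(f":{i + 1}" for i in range(len(header)))
cur.executemany(f"INSERT INTO {TABLE} VALUES ({placeholders})", rows)
conn.commit()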
When creating a table before the bulk insert, is there a way to NOT specify the column names and instead use whatever column names are in the CSV file? Some columns in my CSV file are quarters, like 2012Q2, 2012Q3, etc. In the future these will change with the period, which is why I don't want to hard-code the column names. If this is possible, any help would be appreciated.
Thanks!
One way to do this is (sketched in code after the steps):
Drop the table on BigQuery
Get the column names from your CSV
Create the table using the column names from your CSV
A PHP example can be found here: https://github.com/GoogleCloudPlatform/php-docs-samples/blob/master/bigquery/api/src/functions/create_table.php
(the fields = your column names)
Import the CSV
Done!
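The same steps in Python with the google-cloud-bigquery client, as a rough sketch: the table ID, file name, and the all-STRING typing are assumptions, since the header alone does not tell you the types.

import csv
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.quarters"    # placeholder
csv_path = "quarters.csv"                      # placeholder

# 1. Drop the existing table (if any).
client.delete_table(table_id, not_found_ok=True)

# 2. Get the column names from the CSV header.
with open(csv_path, newline="") as f:
    header = next(csv.reader(f))

# 3. Create the table using the header as field names (all STRING here).
#    Note: BigQuery column names must start with a letter or underscore,
#    so headers like 2012Q2 may need a prefix.
schema = [bigquery.SchemaField(name, "STRING") for name in header]
client.create_table(bigquery.Table(table_id, schema=schema))

# 4. Import the CSV, skipping the header row.
with open(csv_path, "rb") as f:
    client.load_table_from_file(
        f,
        table_id,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
        ),
    ).result()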