BigQuery loading CSV file with 'null' text in the columns - google-bigquery

I am trying to upload a CSV file into BigQuery using the Google Cloud Client Library. One of the CSV files has the literal text 'null' in some columns, and when I upload the file BigQuery returns an error saying "Too Few Columns".
Sample File Data:
column1, column2, column3, column4
1, null, 3, null,
2, null, null, null
I have verified the configuration JSON that is sent; it has four table fields for the 4 columns. The error message says 'Expected 4 column(s) but got 2 column(s)'.
Is there any specific configuration required to handle this scenario?

If the columns are numeric, then you specify a null with an empty value.
For example, this works.
$ echo 2,,, > rows.csv
$ bq load lotsOdata.lfdhjv2 rows.csv c1:integer,c2:integer,c3:integer,c4:float
Waiting on bqjob_r4f71e9aebbf9cb57_00000144acfa7622_1 ... (23s) Current status: DONE
Note that in your example above, the line 1, null, 3, null, would have an extra (fifth) value because of the trailing comma. Also note that if your .csv file has a header row, you should pass --skip_leading_rows=1 so that the header doesn't get interpreted as data.
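Since the question loads through the Google Cloud Client Library rather than the bq tool, here is a minimal Python sketch of the equivalent load, assuming a local rows.csv with empty fields for nulls; the project, dataset, and table names are placeholders. (If the file literally contains the text null, see the null_marker option discussed further below.)
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table; replace with your own.
table_id = "my_project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # don't treat the header row as data
    schema=[
        bigquery.SchemaField("c1", "INTEGER"),
        bigquery.SchemaField("c2", "INTEGER"),
        bigquery.SchemaField("c3", "INTEGER"),
        bigquery.SchemaField("c4", "FLOAT"),
    ],
)

with open("rows.csv", "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()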

Related

Ignore Last row in CSV file as part of BigQuery External table command

I have about 40-odd comma-delimited CSV files in GCS; however, the last line of every file contains only quotes and a dot:
”.
So these files are not exactly conformant CSV and have a data quality issue I have to work around.
My aim is to create an external table referencing the GCS files and then be able to select the data.
example:
create or replace external table dataset.tableName
options (
  uris = ['gs://bucket_path/allCSVFILES_*.csv'],
  format = 'CSV',
  skip_leading_rows = 1,
  ignore_unknown_values = true
)
The external table gets created without any error. However, when I select the data, I run into this error:
"error message: CSV table references column position 16, but line starting at position:18628631 contains only 1 columns"
This is due to the quotes and dot ”. at the end of each file.
My question is: is there any way in BigQuery to consume the data without the LAST LINE? Among the options we have skip_leading_rows to skip the header, but is there any way to skip the last row?
Currently my best option seems to be cleaning the files with a sed/tail command.
I have checked the CREATE OR REPLACE EXTERNAL TABLE options list below and tried ignore_unknown_values, but other than that I don't see any option that will work.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
You can try the workaround below:
I used pandas to drop the last record from the CSV file before loading.
from google.cloud import bigquery
import pandas as pd

# Read the CSV straight from GCS (pandas needs the gcsfs package for gs:// paths).
df = pd.read_csv('gs://samplecsv.csv')

client = bigquery.Client()
dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('new_table')

# Drop the last (malformed) row before loading.
df.drop(df.tail(1).index, inplace=True)

client.load_table_from_dataframe(df, table_ref).result()
For more information you can refer to this link, which mentions the limitations of loading CSV files into BigQuery.
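If you would rather keep the external table and clean the files in place (the sed/tail route mentioned in the question), a rough sketch with the google-cloud-storage client follows; the bucket name and prefix are placeholders, and each file is assumed to have more than one line.
from google.cloud import storage

client = storage.Client()
BUCKET = "my-bucket"      # placeholder
PREFIX = "bucket_path/"   # placeholder

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    data = blob.download_as_bytes()
    # Strip trailing newlines, then drop everything after the last
    # remaining newline, i.e. the final junk line with the quotes and dot.
    body = data.rstrip(b"\r\n")
    cleaned = body.rsplit(b"\n", 1)[0] + b"\n"
    blob.upload_from_string(cleaned, content_type="text/csv")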

"Error while reading data" error received when uploading CSV file into BigQuery via console UI

I need to upload a CSV file to BigQuery via the UI. After I select the file from my local drive, I tell BigQuery to automatically detect the schema and run the job. It fails with the following message:
"Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2; errors: 1. Please look into the
errors[] collection for more details."
I have tried removing the comma in the last column and changing options in the advanced section, but it always results in the same error.
The error log is not helping me understand where the problem is; this is an example of the error log entry:
2019-04-03 23:03:50.261 CLST Bigquery jobcompleted
bquxjob_6b9eae1_169e6166db0 frank#xxxxxxxxx.nn INVALID_ARGUMENT
and:
"Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2; errors: 1. Please look into the
errors[] collection for more details."
and:
"Error while reading data, error message: Error detected while parsing
row starting at position: 46. Error: Data between close double quote
(") and field separator."
The strange thing is that the sample CSV data contains NO double quotes at all!?
2019-01-02 00:00:00,326,1,,292,0,,294,0,,-28,0,,262,0,,109,0,,372,0,,453,0,,536,0,,136,0,,2609,0,,1450,0,,352,0,,-123,0,,17852,0,,8528,0
2019-01-02 00:02:29,289,1,,402,0,,165,0,,-218,0,,150,0,,90,0,,263,0,,327,0,,275,0,,67,0,,4863,0,,2808,0,,124,0,,454,0,,21880,0,,6410,0
2019-01-02 00:07:29,622,1,,135,0,,228,0,,-147,0,,130,0,,51,0,,381,0,,428,0,,276,0,,67,0,,2672,0,,1623,0,,346,0,,-140,0,,23962,0,,10759,0
2019-01-02 00:12:29,206,1,,118,0,,431,0,,106,0,,133,0,,50,0,,380,0,,426,0,,272,0,,63,0,,1224,0,,740,0,,371,0,,-127,0,,27758,0,,12187,0
2019-01-02 00:17:29,174,1,,119,0,,363,0,,59,0,,157,0,,67,0,,381,0,,426,0,,344,0,,161,0,,923,0,,595,0,,372,0,,-128,0,,22249,0,,9278,0
2019-01-02 00:22:29,175,1,,119,0,,301,0,,7,0,,124,0,,46,0,,382,0,,425,0,,431,0,,339,0,,1622,0,,1344,0,,379,0,,-126,0,,23888,0,,8963,0
I shared an example of a few lines of CSV data. I expect BigQuery to be able to detect the schema and load the data into a new table.
Using the new BigQuery Web UI and your input data, I did the following:
Selected a dataset
Clicked "Create table"
Filled in the create table form
The table was created and I was able to SELECT the 6 rows as expected:
SELECT * FROM projectId.datasetId.SO LIMIT 1000
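If the UI keeps failing with the close-double-quote error, one workaround (my own sketch, not part of the answer above) is to load through the Python client with quote parsing disabled, since the sample rows contain no quoted fields; the table ID and file name are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "projectId.datasetId.SO"  # placeholder, matching the SELECT above

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,      # let BigQuery infer the schema
    quote_character="",   # treat double quotes as ordinary data, not field quoting
)

with open("data.csv", "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()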

BigQuery Could not parse 'null' as int for field

I tried to load CSV files into a BigQuery table. Some columns have type INTEGER, but some missing values are recorded as the text 'null'. So when I use the bq load command, I get the following error:
Could not parse 'null' as int for field
So I am wondering what the best way to deal with this is. Do I have to reprocess the data first for bq to load it?
You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to use long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
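The same two-step flow can be scripted with the Python client; this is only a sketch, with YourTempTable taken from the answer above and the destination table name hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Write the cast result to the table you actually want to keep.
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.final_table",  # hypothetical
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM my_dataset.YourTempTable
"""
client.query(sql, job_config=job_config).result()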
Please try setting the null marker in the job config. Note that the marker must match the text in the file exactly, which here is lowercase null:
job_config.null_marker = 'null'
configuration.load.nullMarker
string
[Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
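Putting that together, a minimal Python sketch follows; the GCS URI and table ID are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    null_marker="null",  # the literal text in the file that should become NULL
    autodetect=True,
)

client.load_table_from_uri(
    "gs://my-bucket/data.csv",         # placeholder
    "my_project.my_dataset.my_table",  # placeholder
    job_config=job_config,
).result()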
The BigQuery console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, it can easily be done using the BigQuery command-line tool's bq load command. We can use the --null_marker flag to specify the marker, which is simply null in this case.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker to null does the trick here. You can omit the schema.json argument if the table already exists with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command, however, lets you create and load a table in a single go. The schema needs to be specified in a JSON file in the format below:
[
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  },
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  }
]
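For instance, a filled-in schema.json for a simple two-column table might look like this (the names, types, and descriptions are purely illustrative):
[
  {
    "description": "Unique row identifier",
    "name": "id",
    "type": "INTEGER",
    "mode": "REQUIRED"
  },
  {
    "description": "Measured value; 'null' in the CSV becomes NULL",
    "name": "value",
    "type": "FLOAT",
    "mode": "NULLABLE"
  }
]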

unable to specify db2 import parameters on bluemix?

I subscribed to a free SQLDB service on Bluemix and tried to import data from a CSV file into this database instance.
For certain columns I have pure "space" as data, and some columns should be filled with default values. I can import this data with the following command on my local DB2:
db2 'import from MY_DATA.csv of del modified by usedefaults keepblanks timestampformat="MM/DD/YYYY HH:MM:SS" skipcount 1 insert into MY_TABLE'
On Bluemix, I can only assign the date / time / timestamp format and skip the first row. How can I add the "modified by usedefaults keepblanks" part on Bluemix to complete the import?
Also, when the import fails, I only receive the following message:
BaseException message: [Routine "SYSPROC.ADMIN_CMD" execution has completed, but at least one error, "SQL0911", was encountered during the execution. More information is available. SQLCODE=20397, SQLSTATE=01H52, DRIVER=3.66.46]
Where can I get the detailed error log that I can see on my local DB, such as:
SQL3125W The character data in row "2" and column "32" was truncated because
the data is longer than the target database column.
SQL3148W A row from the input file was not inserted into the table. SQLCODE
"-181" was returned.
SQL0181N The string representation of a datetime value is out of range.
SQLSTATE=22007
SQL3185W The previous error occurred while processing data from row "2" of
the input file.
SQL3110N The utility has completed processing. "2" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "2".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "2" rows were processed from the input file. "0" rows were
successfully inserted into the table. "1" rows were rejected.
Number of rows read = 2
Number of rows skipped = 1
Number of rows inserted = 0
Number of rows updated = 0
Number of rows rejected = 1
Number of rows committed = 2
In the same quick load page (load complete in step 4), there should be a link to view the logs for this load. Hopefully it'll reveal more details about the error message.
Also note that keepblanks is applicable to delimited ASCII (DEL) files only; it is not applicable to non-delimited ASCII (ASC) files.
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0023577.html?cp=SSEPGG_10.5.0%2F3-6-1-3-0-0-12&lang=en

Issue with bulk insert

I am trying to insert the data from this link into my SQL Server:
https://www.ian.com/affiliatecenter/include/V2/CityCoordinatesList.zip
I created the table
CREATE TABLE [dbo].[tblCityCoordinatesList](
[RegionID] [int] NOT NULL,
[RegionName] [nvarchar](255) NULL,
[Coordinates] [nvarchar](4000) NULL
) ON [PRIMARY]
And I am running the following script to do the bulk insert
BULK INSERT tblCityCoordinatesList
FROM 'C:\data\CityCoordinatesList.txt'
WITH
(
FIRSTROW = 2,
MAXERRORS = 0,
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n'
)
But the bulk insert fails with the following error:
Cannot obtain the required interface ("IID_IColumnsInfo") from OLE DB provider "BULK" for linked server "(null)".
When I googled, I found several articles saying the issue may be with the row terminator, but I tried everything like \n\r, \n, etc., and nothing is working.
Could anyone please help me to insert this data into my database?
Try ROWTERMINATOR = '0x0a' (the hexadecimal code for the line-feed character); it should work if the file uses Unix-style LF line endings.
This can also happen if the number of columns doesn't match between the table and the imported file.
I got the same error message, and as you mentioned, it was related to an unexpected line ending.
In my case the line ending was specified in a fmt file as a Windows line ending (CRLF), written as \r\n, while the data file had a classic Mac one (CR).
I solved it with an editor that can show the current line ending and change it. I used EditPad Lite, which shows the opened file's line ending in the bottom bar; clicking it lets you replace it with the expected one.
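If you would rather fix the line endings in a script than in an editor, here is a small Python sketch (file names are placeholders) that normalizes any mix of CR, LF, and CRLF endings to the CRLF the fmt file expects:
# Read the raw bytes so the original line endings are preserved.
with open("data.csv", "rb") as f:
    raw = f.read()

# Collapse CRLF and lone CR to LF first, then expand every LF to CRLF.
normalized = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n").replace(b"\n", b"\r\n")

with open("data_crlf.csv", "wb") as f:
    f.write(normalized)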
I had this on SQL Server 2019 when the FORMAT='CSV' option was used and there was a comma at the end of each line in the source file. The table you're BULK inserting into needs an extra dummy field to cater for the fact that each record essentially has a blank trailing field in the source file.
I got the same error, probably from a file encoding problem. I fixed it by opening the problem CSV file in Notepad++, selecting everything, and copying it to the clipboard. Next, create a new text file (making sure it has the .csv extension), open it in Notepad++, and paste the text into the new file. Save and close all files. You should then be able to load the new CSV file into SQL Server.
You need to run the BULK INSERT command from a Windows login (not from a SQL login). I don't have any examples at hand.