Google BigQuery - loading a CSV file - error while reading table

I'm trying to upload a report in CSV format to Google BigQuery.
The report contains the following column names:
Adjustment Type; Day; Country; Asset ID; Asset Title; Asset Labels; Asset Channel ID; Asset Type; Custom ID; TMS; EIDR; UPC; Season; Episode Title; Episode Number; Director; Studio; Owned Views; YouTube Revenue Split : Auction; YouTube Revenue Split : Reserved; YouTube Revenue Split : Partner Sold YouTube Served; YouTube Revenue Split : Partner Sold Partner Served; YouTube Revenue Split; Partner Revenue : Auction; Partner Revenue : Reserved; Partner Revenue : Partner Sold YouTube Served; Partner Revenue : Partner Sold Partner Served; Partner Revenue
After creating the table for this report, the column names and types look as follows:
[
{
"name": "Adjustment_Type",
"type": "STRING"
},
{
"name": "Day",
"type": "STRING"
},
{
"name": "Country",
"type": "STRING"
},
{
"name": "Asset_ID",
"type": "STRING"
},
{
"name": "Asset_Title",
"type": "STRING"
},
{
"name": "Asset_Labels",
"type": "STRING"
},
{
"name": "Asset_Channel_ID",
"type": "STRING"
},
{
"name": "Asset_Type",
"type": "STRING"
},
{
"name": "Custom_ID",
"type": "STRING"
},
{
"name": "TMS",
"type": "STRING"
},
{
"name": "EIDR",
"type": "STRING"
},
{
"name": "UPC",
"type": "STRING"
},
{
"name": "Season",
"type": "STRING"
},
{
"name": "Episode_Title",
"type": "STRING"
},
{
"name": "Episode_Number",
"type": "STRING"
},
{
"name": "Director",
"type": "STRING"
},
{
"name": "Studio",
"type": "STRING"
},
{
"name": "Owned_Views",
"type": "STRING"
},
{
"name": "YouTube_Revenue_Split___Auction",
"type": "FLOAT"
},
{
"name": "YouTube_Revenue_Split___Reserved",
"type": "FLOAT"
},
{
"name": "YouTube_Revenue_Split___Partner_Sold_YouTube_Served",
"type": "FLOAT"
},
{
"name": "YouTube_Revenue_Split___Partner_Sold_Partner_Served",
"type": "FLOAT"
},
{
"name": "YouTube_Revenue_Split",
"type": "FLOAT"
},
{
"name": "Partner_Revenue___Auction",
"type": "FLOAT"
},
{
"name": "Partner_Revenue___Reserved",
"type": "FLOAT"
},
{
"name": "Partner_Revenue___Partner_Sold_YouTube_Served",
"type": "FLOAT"
},
{
"name": "Partner_Revenue___Partner_Sold_Partner_Served",
"type": "FLOAT"
},
{
"name": "Partner_Revenue",
"type": "FLOAT"
}
]
While trying to query the table, I'm getting the following error message:
Could not parse 'YouTube Revenue Split : Auction' as double for field
YouTube_Revenue_Split___Auction (position 18) starting at location 0
(error code: invalid)
Any idea what could be the reason for this error?

I've been able to replicate the error. In my case it appears when loading the CSV into BigQuery: the CSV contains the string YouTube Revenue Split : Auction where a float is expected.
What I suspect is happening is that your CSV file still has the column headers in it and you are not skipping them when loading the file to BigQuery. As a result, when the import process reaches the YouTube_Revenue_Split___Auction field (position 18), it expects a float but instead tries to insert the column header, YouTube Revenue Split : Auction, which is a string that cannot be parsed as a number.
Try re-loading the CSV after removing the headers (or skip them using the Header rows to skip option).
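For example, with the Python client library (google-cloud-bigquery) the header row can be skipped at load time. A minimal sketch, assuming the file is called report.csv and the destination is my_dataset.my_table (both placeholder names):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row so it is not parsed as data
)
with open("report.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, "my_dataset.my_table", job_config=job_config
    )
load_job.result()  # wait for the load job to finish

This is the same thing as setting Header rows to skip to 1 in the web UI.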
If my supposition is wrong and this doesn't apply, update your question with the query that produces the error.

Adding on to what Guillermo said, you can also have BigQuery automatically detect the headers and field types when you upload your files in CSV (schema auto-detection).
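With the Python client library that is the autodetect flag; a sketch of just the load configuration, under the same placeholder assumptions as above:

from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,      # infer column names and types from the file
    skip_leading_rows=1,  # treat the first row as the header
)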

I had a similar error and solved it by replacing the semicolons with commas.
You can use a regex for this, or this handy online text replacement tool - https://onlinetexttools.com/replace-text
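A small Python sketch of the same conversion using the csv module (file names are placeholders); going through the csv reader/writer rather than a bare regex means quoted fields that contain delimiters are handled correctly:

import csv

# Read the semicolon-delimited report and rewrite it comma-delimited.
with open("report_semicolon.csv", newline="") as src, \
     open("report_comma.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=",")
    for row in reader:
        writer.writerow(row)

Alternatively, BigQuery can load semicolon-delimited files directly if you set the field delimiter option for the load job, which avoids rewriting the file at all.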

Related

Load Avro file with nested records into BigQuery using customized column names

I was trying to load an Avro file with nested records. One of the records has a union schema. When loaded into BigQuery, it creates a very long name like com_mycompany_data_nestedClassname_value for each union element. That name is long. Is there a way to specify the name without having the full package name prefixed?
For example, take the following Avro schema:
{
  "type": "record",
  "name": "EventRecording",
  "namespace": "com.something.event",
  "fields": [
    {
      "name": "eventName",
      "type": "string"
    },
    {
      "name": "eventTime",
      "type": "long"
    },
    {
      "name": "userId",
      "type": "string"
    },
    {
      "name": "eventDetail",
      "type": [
        {
          "type": "record",
          "name": "Network",
          "namespace": "com.something.event",
          "fields": [
            {
              "name": "hostName",
              "type": "string"
            },
            {
              "name": "ipAddress",
              "type": "string"
            }
          ]
        },
        {
          "type": "record",
          "name": "DiskIO",
          "namespace": "com.something.event",
          "fields": [
            {
              "name": "path",
              "type": "string"
            },
            {
              "name": "bytesRead",
              "type": "long"
            }
          ]
        }
      ]
    }
  ]
}
BigQuery came up with field names like eventDetail.com_something_event_Network_value. Is it possible to make this long field name something like eventDetail.Network instead?
Avro loading is not as flexible as it should be in BigQuery (a basic example is that it does not support loading a subset of the fields, i.e. a reader schema). Renaming columns is also not supported in BigQuery today (refer here). The only option is to recreate the table with the proper names, i.e. create a new table from your existing table with a query that aliases the columns, as sketched below.
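A sketch of that rename-by-query approach with the Python client library and standard SQL DDL. The project, dataset, and table names are placeholders, and the DiskIO column name is assumed to follow the same generated pattern as the Network one from the question:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE TABLE `my_project.my_dataset.event_recording_renamed` AS
SELECT
  eventName,
  eventTime,
  userId,
  STRUCT(
    eventDetail.com_something_event_Network_value AS Network,
    eventDetail.com_something_event_DiskIO_value AS DiskIO  -- assumed generated name
  ) AS eventDetail
FROM `my_project.my_dataset.event_recording`
"""
client.query(sql).result()  # runs the DDL statement and waits for completion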

Error when extracting data from Azure Table Storage using Azure Data Factory

I want to copy data from Azure Table Storage to Azure SQL Server using Azure Data Factory, but I get a strange error.
In my Azure Table Storage I have a column which contains multiple data types (this is how Table Storage works), e.g. DateTime and String.
In my Data Factory project I specified that the entire column is a string, but for some reason Data Factory infers the data type from the first cell it encounters during the extraction process.
In my Azure SQL Server database all columns are strings.
Example
I have this table in Azure Table Storage: Flights
RowKey PartitionKey ArrivalTime
--------------------------------------------------
1332-2 2213dcsa-213 04/11/2017 04:53:21.707 PM - this cell is DateTime
1332-2 2213dcsa-214 DateTime.Null - this cell is String
If my table is like the one below, the copy process will work, because the first row is string and it will convert the entire column to string.
RowKey PartitionKey ArrivalTime
--------------------------------------------------
1332-2 2213dcsa-214 DateTime.Null - this cell is String
1332-2 2213dcsa-213 04/11/2017 04:53:21.707 PM - this cell is DateTime
Note: I am not allowed to change the data types in Azure Table Storage, move the rows, or add new ones.
Below are the input and output data sets from Azure Data Factory:
"datasets": [
  {
    "name": "InputDataset",
    "properties": {
      "structure": [
        {
          "name": "PartitionKey",
          "type": "String"
        },
        {
          "name": "RowKey",
          "type": "String"
        },
        {
          "name": "ArrivalTime",
          "type": "String"
        }
      ],
      "published": false,
      "type": "AzureTable",
      "linkedServiceName": "Source-AzureTable",
      "typeProperties": {
        "tableName": "flights"
      },
      "availability": {
        "frequency": "Day",
        "interval": 1
      },
      "external": true,
      "policy": {}
    }
  },
  {
    "name": "OutputDataset",
    "properties": {
      "structure": [
        {
          "name": "PartitionKey",
          "type": "String"
        },
        {
          "name": "RowKey",
          "type": "String"
        },
        {
          "name": "ArrivalTime",
          "type": "String"
        }
      ],
      "published": false,
      "type": "AzureSqlTable",
      "linkedServiceName": "Destination-SQLAzure",
      "typeProperties": {
        "tableName": "[dbo].[flights]"
      },
      "availability": {
        "frequency": "Day",
        "interval": 1
      },
      "external": false,
      "policy": {}
    }
  }
]
Does anyone know a solution to this issue?
I've just been playing around with this, and I think you have two options to deal with it.
Option 1
Simply remove the data type attribute from your input dataset. In the 'structure' block of the input JSON table dataset you don't have to specify the type attribute. Remove or comment it out.
For example:
{
  "name": "InputDataset-ghm",
  "properties": {
    "structure": [
      {
        "name": "PartitionKey",
        "type": "String"
      },
      {
        "name": "RowKey",
        "type": "String"
      },
      {
        "name": "ArrivalTime"
        /* "type": "String" --<<<<<< Optional! */
      },
This should mean the data type is not validated on read.
Option 2
Use a custom activity upstream of the SQL DB table load to cleanse and transform the table data. This means breaking out the C# and requires a lot more dev time, but you may want to reuse the cleansing code for other datasets.
Hope this helps.

MSON specifying a date-time

How can the following MSON output a schema with "format": "date-time", similar to the one stated below?
MSON:
FORMAT: 1A
# Some API
## series [/api/v1/series]
Returns a list of series.
### View all series [GET]
+ Response 200 (application/json; charset=utf-8)
    + Attributes
        + CreatedAt (required, string)
Obtained JSON schema:
{
  "type": "object",
  "properties": {
    "CreatedAt": {
      "type": "string"
    }
  },
  "required": [
    "CreatedAt"
  ],
  "$schema": "http://json-schema.org/draft-04/schema#"
}
Desired schema (note the "format" field):
{
  "type": "object",
  "properties": {
    "CreatedAt": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": [
    "CreatedAt"
  ],
  "$schema": "http://json-schema.org/draft-04/schema#"
}
AFAIK, unfortunately there is no way to specify a format on string fields yet.

Schema to load JSON to Google BigQuery

Suppose I have the following JSON, which is the result of parsing URL parameters from a log file.
{
  "title": "History of Alphabet",
  "author": [
    {
      "name": "Larry"
    }
  ]
}
{
  "title": "History of ABC"
}
{
  "number_pages": "321",
  "year": "1999"
}
{
  "title": "History of XYZ",
  "author": [
    {
      "name": "Steve",
      "age": "63"
    },
    {
      "nickname": "Bill",
      "dob": "1955-03-29"
    }
  ]
}
All the top-level fields ("title", "author", "number_pages", "year") are optional, and so are the second-level fields inside "author", for example.
How should I make a schema for this JSON when loading it to BQ?
A related question:
For example, suppose there is another similar table, but the data is from a different date, so it may have a different schema. Is it possible to query across these two tables?
How should I make a schema for this JSON when loading it to BQ?
The following schema should work. You may want to change some of the types (e.g. maybe you want the dob field to be a TIMESTAMP instead of a STRING), but the general structure should be similar. Fields are NULLABLE by default, and the repeated author record can simply be empty, so all of these fields handle not being present for a given row. A load sketch using the Python client library follows the schema.
[
  {
    "name": "title",
    "type": "STRING"
  },
  {
    "name": "author",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "name",
        "type": "STRING"
      },
      {
        "name": "age",
        "type": "STRING"
      },
      {
        "name": "nickname",
        "type": "STRING"
      },
      {
        "name": "dob",
        "type": "STRING"
      }
    ]
  },
  {
    "name": "number_pages",
    "type": "INTEGER"
  },
  {
    "name": "year",
    "type": "INTEGER"
  }
]
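A load sketch with the Python client library using that schema, assuming the input is newline-delimited JSON in a file named books.json and the destination table is my_dataset.books (placeholder names):

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField(
        "author",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("age", "STRING"),
            bigquery.SchemaField("nickname", "STRING"),
            bigquery.SchemaField("dob", "STRING"),
        ],
    ),
    bigquery.SchemaField("number_pages", "INTEGER"),
    bigquery.SchemaField("year", "INTEGER"),
]
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)
with open("books.json", "rb") as source_file:
    client.load_table_from_file(
        source_file, "my_dataset.books", job_config=job_config
    ).result()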
A related question: For example, suppose there is another similar table, but the data is from a different date, so it may have a different schema. Is it possible to query across these two tables?
It should be possible to union two tables with differing schemas without too much difficulty.
Here's a quick example of how it works over public data (kind of a silly example, since the tables contain zero fields in common, but shows the concept):
SELECT * FROM
(SELECT * FROM publicdata:samples.natality),
(SELECT * FROM publicdata:samples.shakespeare)
LIMIT 100;
Note that you need the SELECT * around each table or the query will complain about the differing schemas.
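If you want to run that query from the Python client library, note that the publicdata:samples table syntax is legacy SQL, so the job config has to say so explicitly; a minimal sketch:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)  # the query uses legacy SQL syntax
sql = """
SELECT * FROM
  (SELECT * FROM publicdata:samples.natality),
  (SELECT * FROM publicdata:samples.shakespeare)
LIMIT 100
"""
for row in client.query(sql, job_config=job_config).result():
    print(row)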

How to create new table with nested schema entirely in BigQuery

I've got a nested table A in BigQuery with a schema as follows:
{
  "name": "page_event",
  "mode": "repeated",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "type": "STRING"
    }
  ]
}
I would like to enrich table A with data from another table and save the result as a new nested table. Let's say I would like to add a "description" field to table A (creating table B), so my schema would be as follows:
{
  "name": "page_event",
  "mode": "repeated",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "type": "STRING"
    },
    {
      "name": "description",
      "type": "STRING"
    }
  ]
}
How do I do this in BigQuery? It seems that there are no functions for creating nested structures in BigQuery SQL (except the NEST function, which produces a list, but it doesn't seem to work, failing with an Unexpected error).
The only way of doing this I can think of is to:
use string concatenation functions to produce table B with a single field called "json", whose content is the enriched data from A converted to a JSON string
export B to GCS as a set of files F
load F as table C
Is there an easier way to do it?
To enrich the schema of an existing table, you can use the tables patch API:
https://cloud.google.com/bigquery/docs/reference/v2/tables/patch
The request will look like the one below (a sketch with the Python client library follows at the end of this answer).
PATCH https://www.googleapis.com/bigquery/v2/projects/{project_id}/datasets/{dataset_id}/tables/{table_id}?key={YOUR_API_KEY}
{
  "schema": {
    "fields": [
      {
        "name": "page_event",
        "mode": "repeated",
        "type": "RECORD",
        "fields": [
          {
            "name": "id",
            "type": "STRING"
          },
          {
            "name": "description",
            "type": "STRING"
          }
        ]
      }
    ]
  }
}
(Screenshots: table schema before and after the patch.)
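The same patch can be made with the Python client library by appending the new sub-field to the existing schema and updating the table; a sketch assuming the table ID is my_dataset.my_table (placeholder):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_dataset.my_table")  # placeholder table ID

page_event = table.schema[0]  # the existing "page_event" RECORD field
table.schema = [
    bigquery.SchemaField(
        page_event.name,
        page_event.field_type,
        mode=page_event.mode,
        fields=list(page_event.fields)
        + [bigquery.SchemaField("description", "STRING")],
    )
]
client.update_table(table, ["schema"])  # adding NULLABLE fields to a schema is allowed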