BigQuery add columns to table schema - google-bigquery

I am trying to add a new column to an existing BigQuery table. I have tried both the bq command-line tool and the API approach, and I get the following error when calling Tables.update(). I have also tried providing the full schema with the additional field, and that gives the same error shown below.
With the API I get the following error (the request body is shown first, then the response):
{
  "schema": {
    "fields": [{
      "name": "added_column",
      "type": "integer",
      "mode": "nullable"
    }]
  }
}
{
  "error": {
    "errors": [{
      "domain": "global",
      "reason": "invalid",
      "message": "Provided Schema does not match Table [blah]"
    }],
    "code": 400,
    "message": "Provided Schema does not match Table [blah]"
  }
}
With the bq tool I get the following error:
./bq update -t blah added_column:integer
BigQuery error in update operation: Provided Schema does not match Table [blah]

Try this:
bq --format=prettyjson show yourdataset.yourtable > table.json
Edit table.json and remove everything except the inside of "fields" (i.e. keep the [ { "name": "x" ... }, ... ]), then add your new field to the schema. Note that the update replaces the whole schema, so the file must contain all of the existing fields plus the new one; sending only the new column is what produces the "Provided Schema does not match Table" error.
Or pipe it through jq:
bq --format=prettyjson show yourdataset.yourtable | jq .schema.fields > table.json
Then run:
bq update yourdataset.yourtable table.json
You can add --apilog=apilog.txt to the beginning of the command line, which will show exactly what is sent to / returned from the BigQuery server.
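For reference, here is a minimal sketch of the same fix at the API level in Python (an illustration, not the poster's code; it assumes application-default credentials and the placeholder dataset/table names from above). The key point is that the schema sent back must contain all existing fields plus the new one:
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"])
session = AuthorizedSession(credentials)
url = ("https://bigquery.googleapis.com/bigquery/v2/projects/%s"
       "/datasets/yourdataset/tables/yourtable" % project)
table = session.get(url).json()
# Append the new column to the EXISTING fields instead of replacing them.
table["schema"]["fields"].append(
    {"name": "added_column", "type": "INTEGER", "mode": "NULLABLE"})
session.patch(url, json={"schema": table["schema"]})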

In my case I was trying to add a REQUIRED field to a template table and was running into this error. Changing the field to NULLABLE let me update the table.
Also, a more recent version of the update flow, for anybody stumbling in from Google:
# To create the table
bq mk --schema domain:string,pageType:string,source:string -t Project:Dataset.table
# Or using a schema file
bq mk --schema SchemaFile.json -t Project:Dataset.table
# SchemaFile.json format:
[
  {
    "mode": "REQUIRED",
    "name": "utcTime",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "domain",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "testBucket",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "isMobile",
    "type": "BOOLEAN"
  },
  {
    "mode": "REQUIRED",
    "name": "Category",
    "type": "RECORD",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "Type",
        "type": "STRING"
      },
      {
        "mode": "REQUIRED",
        "name": "Published",
        "type": "BOOLEAN"
      }
    ]
  }
]
# To update
bq update --schema UpdatedSchema.json -t Project:Dataset.table
# UpdatedSchema.json contains the old columns plus any newly added ones
Some docs for template tables

Example using the BigQuery Node.js API:
const fieldDefinition = {
  name: 'nestedColumn',
  type: 'RECORD',
  mode: 'REPEATED',
  fields: [
    {name: 'id', type: 'INTEGER', mode: 'NULLABLE'},
    {name: 'amount', type: 'INTEGER', mode: 'NULLABLE'},
  ],
};
const table = bigQuery.dataset('dataset1').table('source_table_name');
const metaDataResult = await table.getMetadata();
const metaData = metaDataResult[0];
const fields = metaData.schema.fields;
fields.push(fieldDefinition);
await table.setMetadata({schema: {fields}});

I was stuck trying to add columns to an existing table in BigQuery using the Python client and found this post several times. I'll leave the piece of code that solved it for me here, in case someone's having the same problem:
# update table schema
from google.cloud import bigquery

bigquery_client = bigquery.Client()
dataset_ref = bigquery_client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
table = bigquery_client.get_table(table_ref)
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField('LOLWTFMAN', 'STRING'))
table.schema = new_schema
table = bigquery_client.update_table(table, ['schema'])  # API request
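On newer versions of the google-cloud-bigquery client (where Client.dataset is deprecated), the same idea is a bit shorter; a sketch assuming a fully-qualified table ID (the ID below is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project.your_dataset.your_table")  # hypothetical ID
table.schema = list(table.schema) + [bigquery.SchemaField("LOLWTFMAN", "STRING")]
client.update_table(table, ["schema"])  # API request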

You can also add columns to your table through the GCP console, which is easier and clearer.

Here's a quick snippet I wrote that will dynamically add schema columns if the data coming in (from a server, etc.) doesn't match what currently exists in a BigQuery table:
def verify_schema(client, table, data_dict):
    schema = list(table.schema)
    existing_field_names = [field.name for field in schema]
    # Any incoming keys missing from the current schema get appended
    # as NULLABLE STRING columns.
    missing_fields = [key for key in data_dict if key not in existing_field_names]
    for field_name in missing_fields:
        schema.append(bigquery.SchemaField(name=field_name, field_type='STRING', mode='NULLABLE'))
    if missing_fields:
        table.schema = schema
        client.update_table(table, ['schema'])  # API request
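For example (a hypothetical call, assuming client and table were obtained with bigquery.Client() and get_table as in the earlier snippets):
record = {'existing_column': 'a', 'brand_new_column': 'b'}  # hypothetical incoming row
verify_schema(client, table, record)  # appends brand_new_column as a NULLABLE STRING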

Related

Executing a BigQuery query using Google Workflows to get the last modified time of a table: getting wrong results in the workflow but the same works fine in the BigQuery UI

This is in continuation of my other post, where I am getting issues with my workflow. After further debugging I realized that the "last_modified_time" in the BigQuery workflow is not showing correct results, but the same query works fine when I execute it in the BigQuery UI. Please see the details below.
Google Cloud Workflow error as "Syntax error: Unclosed string literal
Below is my workflow code, just to see what the value of "last_modified_time" is:
main:
  steps:
    - getupdatedatetime:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          body:
            useLegacySql: false
            query: >
              SELECT
              TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
              DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
              FROM `project-dataset.__TABLES__` where table_id='table_name'
        result: queryResult
    - documentFound:
        return: ${queryResult}
The output of the above query in the workflow is in JSON format, as below:
{
  "cacheHit": false,
  "jobComplete": true,
  "jobReference": {
    "jobId": "job__EmSzEzXNUAKBTebWTieYIQVNKf7",
    "location": "EU",
    "projectId": "project_id"
  },
  "kind": "bigquery#queryResponse",
  "rows": [
    {
      "f": [
        {
          "v": "1.625481329263E9"
        },
        {
          "v": "2021-07-05"
        }
      ]
    }
  ],
  "schema": {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "last_modified_time",
        "type": "TIMESTAMP"
      },
      {
        "mode": "NULLABLE",
        "name": "creation_date",
        "type": "DATE"
      }
    ]
  },
  "totalBytesProcessed": "0",
  "totalRows": "1"
}
The last_modified_time value ("1.625481329263E9") is not correct; creation_date is fine. Because last_modified_time is wrong here, the other subworkflows in my workflow are not working either.
When I execute the same query in the BigQuery UI, I get the expected results:
SELECT
TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
FROM `project.dataset.__TABLES__` where table_id='table_name'
Can anyone provide some help and guidance on what I am doing wrong?
The result in your workflow is right. The raw query result returns the timestamp as a number of seconds since the Unix epoch; the BigQuery UI converts it to a human-readable datetime format for display. You can convert the last_modified_time timestamp value to whatever datetime format you want.
I used https://www.epochconverter.com/ to convert your timestamp result to a datetime, and here is the result: GMT: Monday, July 5, 2021 10:35:29.263.
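If you want to do the conversion programmatically rather than in the UI, a minimal Python sketch (using the raw string from the workflow response above):
from datetime import datetime, timezone

raw = "1.625481329263E9"          # value from the workflow response above
seconds = float(raw)               # seconds since the Unix epoch
dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(dt.isoformat())              # 2021-07-05T10:35:29.263000+00:00
Alternatively, you could format it in the query itself, e.g. FORMAT_TIMESTAMP('%F %T', TIMESTAMP_MILLIS(last_modified_time)).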

Error loading multiple files to BigQuery: too many positional args

~edited: I'm running the bq command line on my VM instance in Google Compute Engine.
I've been trying to load multiple CSV files to BigQuery using the bq command line, and I keep getting this error:
Too many positional args, still have ['/home/username/csvschema.json']
All my files contain the same schema, since I copied and pasted one file and renamed the copies for testing purposes: [testFiles_1.csv, testFiles_2.csv, testFiles_3.csv]. So I'm not sure why I keep getting this error.
These are the steps I took:
1. Created my BigQuery table and manually loaded 1 file into it, so I don't need to add the schema manually but can rather auto-detect it.
2. Then, I typed this command:
bq load --skip_leading_rows=1 gcstransfer.testFile /home/username/testfile_*.csv /home/username/csvschema.json
My schema, obtained by running bq show --format=prettyjson dataset.table, is:
[
  {
    "mode": "NULLABLE",
    "name": "Channel",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "Date",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "ID",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "Referral",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "Browser",
    "type": "STRING"
  }
]
I tried omitting the JSON part, but then I get this error instead:
BigQuery error in load operation: Error decoding JSON schema from file /home/username/testfile_2.csv: No JSON object could be decoded
To specify a one-column schema, use "name:string".
Looks like you cannot use wildcards when loading from a local data source: your shell expands /home/username/testfile_*.csv into several space-separated filenames, so bq sees extra positional arguments, which is why the schema file is reported as left over. You can instead upload the files to a GCS bucket and load them from there. See the Limitations paragraph in the docs: https://cloud.google.com/bigquery/docs/loading-data-local
Wildcards and comma separated lists are not supported when you load
files from a local data source. Files must be loaded individually.
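As a sketch of the GCS route (assuming the three files were first copied to a hypothetical bucket my-bucket, e.g. with gsutil cp), the Python client can load them all with a wildcard URI:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # same as --skip_leading_rows=1 above
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/testfile_*.csv",  # wildcards ARE supported for GCS URIs
    "gcstransfer.testFile",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete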

Apache Druid segment merge task submission failure

I am using Druid 0.9.1.1 and trying to merge all the segments of a datasource for each day into a single segment. The merge task initiation fails with the error:
{"error":"Instantiation of [simple type, class io.druid.timeline.DataSegment] value failed: null (through reference chain: java.util.ArrayList[0])"}
I got the segment details from a segment metadata query. The Druid documentation is no help, as it only specifies the raw structure of the overall query, not the required structure of the segment details (below is what the Druid documentation suggests):
{
  "type": "merge",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "aggregations": <list of aggregators>,
  "segments": <JSON list of DataSegment objects to merge>
}
Example query:
{
  "type": "merge",
  "id": "envoy_merge_task",
  "dataSource": "dcap.envoy.diskmounts.kafka",
  "segments": [
    {"id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z", "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"], "columns": {}, "size": 5460959, "numRows": 41577, "aggregators": null, "queryGranularity": null},
    {"id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_1", "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"], "columns": {}, "size": 5448881, "numRows": 41577, "aggregators": null, "queryGranularity": null},
    {"id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_2", "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"], "columns": {}, "size": 5454452, "numRows": 41571, "aggregators": null, "queryGranularity": null},
    {"id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_3", "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"], "columns": {}, "size": 5456267, "numRows": 41569, "aggregators": null, "queryGranularity": null}
  ]
}
I have tried different structures for the "segments" key, with the same error each time.
Example:
"segments": [
  {"id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z"},
  {"id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_1"},
  {"id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_2"},
  {"id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_3"}
]
What is the right structure for segment merge tasks?
The format I used for segments is:
"segments":[
{
"dataSource": "wikiticker88",
"interval": "2015-09-12T02:00:00.000Z/2015-09-12T03:00:00.000Z",
"version": "2018-01-16T07:23:16.425Z",
"loadSpec": {
"type": "local",
"path": "/home/linux/druid-0.11.0/var/druid/segments/wikiticker88/2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z/2018-01-16T07:23:16.425Z/0/index.zip"
},
"dimensions": "channel,cityName,comment,countryIsoCode,countryName,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,user",
"metrics": "count,added,deleted,delta,user_unique",
"shardSpec": {
"type": "none"
},
"binaryVersion": 9,
"size": 198267,
"identifier": "wikiticker88_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2018-01-16T07:23:16.425Z"
},
]
Use this endpoint to get the metadata of your segments:
/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full
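A minimal sketch of fetching that metadata in Python (the coordinator host/port is an assumption; adjust for your cluster):
import json
import requests

coordinator = "http://localhost:8081"  # hypothetical coordinator address
datasource = "dcap.envoy.diskmounts.kafka"
url = ("%s/druid/coordinator/v1/metadata/datasources/%s/segments?full"
       % (coordinator, datasource))
segments = requests.get(url).json()
# Each entry is a full DataSegment object that can go straight into the
# "segments" list of the merge task spec.
print(json.dumps(segments[0], indent=2))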

Using $ref for jsonschema in Abao

Can someone help with schema $refs in Abao? How do I use the --schemas option? Here is a simple gist: https://gist.github.com/SeanSilke/e5a2f7673ad4aa2aa43ba800c9aec31b
I try to run "abao api.raml --schemas fref.json" but get the error "Missing/unresolved JSON schema $refs (fref.json) in schema".
By the way, the server is mocked by osprey-mock-service.
You need to add an id field to your JSON schemas.
To run, use: abao api.raml --server http://localhost:3000 --schemas=./*.json
Example files:
api.raml
#%RAML 0.8
title: simple API
baseUri: http://localhost:3000
/song:
  get:
    responses:
      200:
        body:
          application/json:
            schema: !include schema.json
            example: |
              {
                "songId": "e29b",
                "songTitle": "The song",
                "albumId": "18310"
              }
fref.json
{
  "id": "fref.json",
  "type": "string"
}
schema.json
{
  "$schema": "http://json-schema.org/draft-03/schema",
  "id": "schema.json",
  "type": "object",
  "properties": {
    "songId": {"$ref": "fref.json"}
  },
  "required": ["songId", "albumId", "songTitle"]
}

AWS Data pipeline CSV data from S3 to DynamoDB

I am trying to transfer CSV data from an S3 bucket to DynamoDB using AWS Data Pipeline. Following is my pipeline script; it is not working properly.
CSV file structure
Name, Designation,Company
A,TL,C1
B,Prog, C2
DynamoDB: N_Table, with Name as the hash key
{
  "objects": [
    {
      "id": "Default",
      "scheduleType": "cron",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "DynamoDBDataNodeId635",
      "schedule": {
        "ref": "ScheduleId639"
      },
      "tableName": "N_Table",
      "name": "MyDynamoDBData",
      "type": "DynamoDBDataNode"
    },
    {
      "emrLogUri": "s3://onlycsv/error",
      "id": "EmrClusterId636",
      "schedule": {
        "ref": "ScheduleId639"
      },
      "masterInstanceType": "m1.small",
      "coreInstanceType": "m1.xlarge",
      "enableDebugging": "true",
      "installHive": "latest",
      "name": "ImportCluster",
      "coreInstanceCount": "1",
      "logUri": "s3://onlycsv/error1",
      "type": "EmrCluster"
    },
    {
      "id": "S3DataNodeId643",
      "schedule": {
        "ref": "ScheduleId639"
      },
      "directoryPath": "s3://onlycsv/data.csv",
      "name": "MyS3Data",
      "dataFormat": {
        "ref": "DataFormatId1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "ScheduleId639",
      "startDateTime": "2013-08-03T00:00:00",
      "name": "ImportSchedule",
      "period": "1 Hours",
      "type": "Schedule",
      "endDateTime": "2013-08-04T00:00:00"
    },
    {
      "id": "EmrActivityId637",
      "input": {
        "ref": "S3DataNodeId643"
      },
      "schedule": {
        "ref": "ScheduleId639"
      },
      "name": "MyImportJob",
      "runsOn": {
        "ref": "EmrClusterId636"
      },
      "maximumRetries": "0",
      "myDynamoDBWriteThroughputRatio": "0.25",
      "attemptTimeout": "24 hours",
      "type": "EmrActivity",
      "output": {
        "ref": "DynamoDBDataNodeId635"
      },
      "step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{output.tableName},-d,S3_INPUT_BUCKET=#{input.directoryPath},-d,DYNAMODB_WRITE_PERCENT=#{myDynamoDBWriteThroughputRatio},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
    },
    {
      "id": "DataFormatId1",
      "name": "DefaultDataFormat1",
      "column": [
        "Name",
        "Designation",
        "Company"
      ],
      "columnSeparator": ",",
      "recordSeparator": "\n",
      "type": "Custom"
    }
  ]
}
While executing the pipeline, two of the four steps finish, but it does not execute completely.
Currently (2015-04) the default import pipeline template does not support importing CSV files.
If your CSV file is not too big (under 1 GB or so) you can create a ShellCommandActivity to convert the CSV to the DynamoDB JSON format first, and then feed that to an EmrActivity that imports the resulting JSON file into your table.
As a first step you can create a sample DynamoDB table including all the field types you need, populate it with dummy values, and then export the records using a pipeline (the Export/Import button in the DynamoDB console). This will give you an idea of the format that is expected by the Import pipeline. The type names are not obvious, and the Import activity is very sensitive about the correct case (e.g. you should use bOOL for a boolean field).
Afterwards it should be easy to create an awk script (or any other text converter; at least with awk you can use the default AMI image for your shell activity) which you can run from your ShellCommandActivity, along the lines of the sketch below. Don't forget to enable the "staging" flag, so your output is uploaded back to S3 for the Import activity to pick up.
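As a rough illustration only (the exact attribute-type tags, such as "s" for strings, must be taken from your own sample export as described above; the file names are hypothetical), a converter along these lines could run in the shell activity:
import csv
import json

# Convert each CSV row into one JSON object per line, using DynamoDB-style
# type tags; every column is treated as a string ("s") for simplicity.
with open("data.csv") as src, open("data.json", "w") as dst:
    for row in csv.DictReader(src):
        item = {name: {"s": value} for name, value in row.items()}
        dst.write(json.dumps(item) + "\n")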
If you are using the template data pipeline for importing data from S3 to DynamoDB, these data formats won't work. Instead, use the format in the link below for the input S3 data file: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html
This is the format of the output file generated by the template data pipeline that exports data from DynamoDB to S3.
Hope that helps.
I would recommend using the CSV data format provided by Data Pipeline instead of a Custom one.
For debugging errors on the cluster, you can look up the job flow in the EMR console and inspect the log files for the tasks that failed.
See the link below for a solution that works (in the question section), albeit for EMR 3.x; just change the delimiter to "columnSeparator": ",". Personally, I wouldn't use CSV unless you are certain the data is sanitized correctly.
How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?