How to generate CREATE TABLE script from an existing table - google-bigquery

I created a table with the BigQuery interface. A large table. And I would like to export the schema of this table in Standard SQL (or Legacy SQL) syntax.
Is it possible?
Thanks!

You can get the DDL for a table with this query:
SELECT t.ddl
FROM `your_project.dataset.INFORMATION_SCHEMA.TABLES` t
WHERE t.table_name = 'your_table_name'
;
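If you want the statement in a file rather than in the query results pane, a minimal sketch using the bq CLI and jq (both preinstalled in Cloud Shell; create_table.sql is an arbitrary output name) could look like this:
# run the INFORMATION_SCHEMA query from the shell and save the DDL to a file
bq --format=json query --use_legacy_sql=false \
  "SELECT t.ddl FROM \`your_project.dataset.INFORMATION_SCHEMA.TABLES\` t WHERE t.table_name = 'your_table_name'" \
  | jq -r '.[0].ddl' > create_table.sql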

As discussed in this question, it was not possible to do this directly, and there was a feature request to obtain the output schema of a Standard SQL query that apparently was never implemented. Depending on your use case, apart from using bq, another workaround is to run the query with LIMIT 0. Results are returned immediately (tested with a 100B-row table) together with the schema field names and types.
Knowing this, you can automate the procedure in your favorite scripting language. As an example I used Cloud Shell and the REST API. The function below makes three successive calls: the first executes the query and returns a jobId (unneeded fields are excluded via the request URL), the second retrieves the dataset and table IDs of the destination table for that job, and the third retrieves the schema.
I used the jq tool (manual), which comes preinstalled in Cloud Shell, to parse the responses, and wrapped everything in a shell function:
result_schema()
{
  QUERY=$1
  authToken="$(gcloud auth print-access-token)"
  projectId=$(gcloud config get-value project 2>/dev/null)
  # run the query (LIMIT 0 is appended) and get the jobId
  jobId=$(curl -H "Authorization: Bearer $authToken" \
    -H "Content-Type: application/json" \
    "https://www.googleapis.com/bigquery/v2/projects/$projectId/queries?fields=jobReference%2FjobId" \
    -d "{\"query\": \"$QUERY LIMIT 0\", \"useLegacySql\": false}" 2>/dev/null | jq -j .jobReference.jobId)
  # get the destination (anonymous) table for that job
  read -r datasetId tableId <<< $(curl -H "Authorization: Bearer $authToken" \
    "https://www.googleapis.com/bigquery/v2/projects/$projectId/jobs/$jobId?fields=configuration(query(destinationTable(datasetId%2CtableId)))" 2>/dev/null \
    | jq -j '.configuration.query.destinationTable.datasetId, " ", .configuration.query.destinationTable.tableId')
  # get the resulting schema
  curl -H "Authorization: Bearer $authToken" \
    "https://www.googleapis.com/bigquery/v2/projects/$projectId/datasets/$datasetId/tables/$tableId?fields=schema" 2>/dev/null | jq .schema.fields
}
We can then invoke the function by querying a 100B-row public dataset (don't add LIMIT 0 yourself, as the function appends it automatically):
result_schema 'SELECT year, month, CAST(wikimedia_project as bytes) AS project_bytes, language AS lang FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY year, month, wikimedia_project, language'
which returns the following schema (note how the casts and aliases in the SELECT list are reflected in the returned fields):
[
  {
    "name": "year",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "month",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "project_bytes",
    "type": "BYTES",
    "mode": "NULLABLE"
  },
  {
    "name": "lang",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
This field array can then be copied/pasted (or the process further automated) into the fields editor when creating a new table in the UI.
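If you want to skip the UI step entirely, the same array is accepted as a schema file by bq mk; a small sketch (your_project, your_dataset, new_table and schema.json are placeholder names):
# save the fields array produced by the function and create a table from it
result_schema 'SELECT year, month FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY year, month' > schema.json
bq mk --table your_project:your_dataset.new_table schema.json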

I am not sure how to do it with Standard SQL or Legacy SQL syntax, but you can get the schema in JSON format using the command line.
From this link, the command would be:
bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE] > [PATH_TO_FILE]
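A concrete invocation might look like this (the project, dataset, table, and output path are made-up placeholders):
bq show --schema --format=prettyjson my-project:my_dataset.my_table > /tmp/my_table_schema.json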

Related

Problem changing a column's data type in BigQuery

I am trying to change a column's data type from STRING to DATETIME (for example '04/12/2016 02:47:30') with the format 'YY/MM/DD HH24:MI:SS', but it shows an error like:
Failed to parse input timestamp string at 8 with format element ' '
The initial file was a CSV which I uploaded from my Drive. I tried to convert the column's data type in Google Sheets and then re-upload it, but the column type still remains STRING.
I think that when you loaded your CSV file into the BigQuery table, you used autodetect mode.
Unfortunately, with this mode BigQuery will keep treating your date as a STRING, even if you changed it in Google Sheets.
Instead of using autodetect, I suggest using a JSON schema for your BigQuery table.
In the schema you indicate that the column type for your date field is TIMESTAMP.
The format you indicated, 04/12/2016 02:47:30, is compatible with a timestamp and BigQuery will convert it for you.
To load the file into BigQuery, you can use the console directly or the bq command-line tool:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
In the BigQuery JSON schema (which is an array of field definitions), the timestamp field is declared like this:
[
  {
    "name": "yourDate",
    "type": "TIMESTAMP",
    "mode": "NULLABLE",
    "description": "Your date"
  }
]
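Once the load has run, you can verify that the field really came out as TIMESTAMP; a quick check, assuming the table and field names used above and that jq is available:
bq show --schema --format=prettyjson mydataset.mytable \
  | jq '.[] | select(.name == "yourDate")'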

Wikimedia pageview compression not working

I am trying to analyze monthly wikimedia pageview statistics. Their daily dumps are OK but monthly reports like the one from June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2) seem broken:
[radim@sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
[radim@sandbox2 pageviews]$ file pageviews-202106-user.bz2
pageviews-202106-user.bz2: Par archive data
Any idea how to extract the data? What encoding is used here? Can it be Parquet file from their Hive analytics cluster?
These files are not bzip2 archives. They are Parquet files. Parquet-tools can be used to inspect them.
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main schema /tmp/pageviews-202106-user.bz2 2>/dev/null
{
  "type" : "record",
  "name" : "hive_schema",
  "fields" : [ {
    "name" : "line",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
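If you also want to look at the first records rather than just the schema, the same parquet-cli build has a head command (assuming it is present in your version, same classpath as above):
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main head /tmp/pageviews-202106-user.bz2 2>/dev/null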

BigQuery Could not parse 'null' as int for field

I tried to load CSV files into a BigQuery table. There are columns whose type is INTEGER, but some missing values are 'null'. So when I use the bq load command, I get the following error:
Could not parse 'null' as int for field
So I am wondering what the best way to deal with this is; do I have to reprocess the data first for bq to load it?
You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to use long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
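If you prefer the command line over the UI for this step, a rough bq equivalent (the dataset and destination table names are placeholders, and x stands for your INTEGER column) would be:
bq query --use_legacy_sql=false \
  --destination_table=mydataset.YourFinalTable \
  'SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x) FROM mydataset.YourTempTable'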
Please try setting the null marker in the load job configuration, e.g. with the Python client:
job_config.null_marker = 'NULL'
configuration.load.nullMarker (string): [Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
The BigQuery console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, it can easily be done with the BigQuery command-line tool's bq load command. We can use the --null_marker flag to specify the marker, which in this case is simply null.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker to null does the trick here. You can omit the schema.json part if the table already exists with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command, however, also lets you create and load a table in a single go. The schema then needs to be specified in a JSON file in the following format:
[
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  },
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  }
]
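Purely as an illustration, a filled-in schema.json for a two-column table with a nullable INTEGER (the column names are invented) could be written from the shell like this:
cat > schema.json <<'EOF'
[
  {
    "description": "Row identifier",
    "name": "id",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "description": "Free-form label",
    "name": "label",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
EOF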

wit.ai HTTP POST /entities API does not show expressions in "Keyword" search strategy

I am trying to post an intent and an entity via the wit.ai HTTP API.
My JSON format:
{"entities"=>[{"id"=>"intent", "lookups"=>["trait"], "values"=> [{"value"=>"ask_info", "expressions"=>["How old are you ?"]}]}, {"id"=>"age", "values"=>[{"value"=>"old", "expressions"=>["How old are you ?"]}]}]}
Input sentence is "How old are you ?"
Intent is ask_info
Entity is age for value 'old'
I call the POST /entities API twice, once for the intent and once for the entity:
$ curl -XPOST 'https://api.wit.ai/entities?v=20160526' \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"id"=>"intent", "lookups"=>["trait"], "values"=> [{"value"=>"ask_info", "expressions"=>["How old are you ?"]}]}'
$ curl -XPOST 'https://api.wit.ai/entities?v=20160526' \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"id"=>"age", "values"=>[{"value"=>"old", "expressions"=>["How old are you ?"]}]}'
On the wit.ai page, the age entity is not mapped to the expression "How old are you ?"; the value is only displayed under Synonyms.
The downloaded dataset only shows the intent, with no entity:
{
"text" : "How old are you ?",
"entities" : [
{
"entity" : "intent",
"value" : "\"ask_info\""
}
]
}
When the same thing is set up through the wit.ai GUI it works nicely:
{
"text" : "How old are you ?",
"entities" : [
{
"entity" : "intent",
"value" : "\"ask_info\""
},
{
"entity" : "age",
"value" : "\"old\"",
"start" : 2,
"end" : 3
}
]
}
Is there any method that could solve this problem?
I've bypassed this bug by manually crafting an archive which I later used for deployment.
There are multiple issues with this approach:
Apps cannot be updated from an archive; as far as I know, the importer only supports creating new apps. This means the history will be lost between deploys and the server token will need to be changed.
The importer seems to be a little flaky. Tampering with the archive is possible, but zipping it back up is complicated: all files should be in a directory, no directory entries should be present, and the file order also seems to matter (see the zip sketch below).
The upside is that several features that are not available in the API are exposed, such as setting the 'free-text' lookups option.
I downloaded a sample application and proceeded from there.
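For the zipping step in particular, here is a sketch of how the archive could be rebuilt, assuming the export was unpacked into a my-app/ directory (the directory and file names are hypothetical):
# -D suppresses directory entries, -X strips extra file attributes;
# listing the files explicitly controls their order in the archive.
# my-app/ and the file names are hypothetical -- use whatever the export contains.
zip -X -D my-app.zip my-app/app.json my-app/entities/intent.json my-app/entities/age.json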

How to edit a github issue using the API (curl)? (especially: close)

I'm planning to migrate a couple hundred bugs tracked in another (home-rolled) system into GitHub's issue system. Most of these bugs were closed in the past. I can use github's API to create an issue, e.g.
curl -u $GITHUB_TOKEN:x-oauth-basic https://api.github.com/repos/my_organization/my_repo/issues -d '{
"title": "test",
"body": "the body"
}'
... however, this will leave me with a bunch of open issues. How do I close them? I've tried closing them at creation time, e.g.:
curl -u $GITHUB_TOKEN:x-oauth-basic https://api.github.com/repos/my_organization/my_repo/issues -d '{
"title": "test",
"body": "the body",
"state": "closed"
}'
... but the result is to create an open issue (i.e. the "state" is ignored).
It looks to me like I should be able to "edit" an issue to close it (https://developer.github.com/v3/issues/#edit-an-issue) ... but I'm unable to figure out what the corresponding curl command is supposed to look like. Any guidance?
Extra credit: I'd really like to be able to assign a "closed" date, to agree with the actual closed date captured in our current system. It's not clear that this is possible.
Thanks!
Migrating a bunch of issues to GitHub with the command line? Are you crazy?
Anyway, using PHP and hhb_curl from https://github.com/divinity76/hhb_.inc.php/blob/master/hhb_.inc.php, this worked for me. Unfortunately I couldn't set the "closed_at" date (it was ignored by the API), but I could emulate it with labels. The code should give you something to work from when porting it to the command line:
<?php
declare(strict_types = 1);
require_once ('hhb_.inc.php');
$hc=new hhb_curl();
define('BASE_URL','https://api.github.com');
$hc->_setComfortableOptions();
$data=array(
'state'=>'closed',
'closed_at'=> '2011-04-22T13:33:48Z',// << unfortunately, ignored
'labels'=>array(
'closed at 2011-04-22T13:33:48Z' // << we can fake it using labels...
)
);
$data=json_encode($data);
$hc->setopt_array(array(
CURLOPT_CUSTOMREQUEST=>'PATCH',
// /repos/:owner/:repo/issues/:number
// https://github.com/divinity76/GitHubCrashTest/issues/1
CURLOPT_URL=>BASE_URL.'/repos/divinity76/GitHubCrashTest/issues/1',
CURLOPT_USERAGENT=>'test',
CURLOPT_HTTPHEADER=>array(
'Accept: application/vnd.github.v3+json',
'Content-Type: application/json',
'Authorization: token <removed>'
),
CURLOPT_POSTFIELDS=>$data,
));
$hc->exec();
hhb_var_dump($hc->getStdErr(),$hc->getResponseBody());
(I modified the "Authorization: token" line before posting it on Stack Overflow, of course.)
As suggested by hanshenrik, the correct altered curl command is:
curl -u $GITHUB_TOKEN:x-oauth-basic https://api.github.com/repos/my_organization/my_repo/issues/5 -d '{
"state": "closed"
}'
I'd failed to understand the documentation referenced in his answer:
/repos/:owner/:repo/issues/:number
translates to
https://api.github.com/repos/my_organization/my_repo/issues/5
(I now understand that fields starting with ":" are variables)
For the record, I'm planning to script the calls to curl. :)
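In case it helps anyone doing a similar migration, here is a minimal, untested sketch of such a script. It assumes a tab-separated export file (issues.tsv with title, body, and state columns, an invented format), uses jq, and sends the close as an explicit PATCH, which is the verb the edit-an-issue endpoint documents (curl sends a plain -d request as POST):
#!/usr/bin/env bash
# issues.tsv columns (hypothetical export format): title <TAB> body <TAB> state
while IFS=$'\t' read -r title body state; do
  # create the issue and capture its API URL
  # NOTE: this naive interpolation breaks if title/body contain quotes or
  # backslashes; build the JSON with jq --arg in a real script.
  issue_url=$(curl -s -u "$GITHUB_TOKEN:x-oauth-basic" \
    https://api.github.com/repos/my_organization/my_repo/issues \
    -d "{\"title\": \"$title\", \"body\": \"$body\"}" | jq -r .url)
  # close it if the old tracker had it closed
  if [ "$state" = "closed" ]; then
    curl -s -X PATCH -u "$GITHUB_TOKEN:x-oauth-basic" "$issue_url" \
      -d '{"state": "closed"}' > /dev/null
  fi
done < issues.tsv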