Copy tables from query in Bigquery - google-bigquery

I am attempting to fix the schema of a BigQuery table in which the type of a field is wrong (but the field contains no data). I would like to copy the data from the old schema to the new one using the UI ( select * except(bad_column) from ... ).
The problem is that:
if I select into a table, BigQuery drops the REQUIRED mode from the columns and therefore rejects the insert;
exporting via JSON loses information on dates.
Is there a better solution than creating a new table with all columns nullable/repeated, or manually transforming all of the data?

Update (2018-06-20): BigQuery now supports required fields on query output in standard SQL, and has done so since mid-2017.
Specifically, if you append your query results to a table with a schema that has required fields, that schema will be preserved, and BigQuery will check as results are written that it contains no null values. If you want to write your results to a brand-new table, you can create an empty table with the desired schema and append to that table.
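For example, a minimal sketch of that approach (dataset, table, and column names here are placeholders, not anything from the question):
-- Create an empty destination table whose schema marks the required fields as NOT NULL.
CREATE TABLE mydataset.new_table (
  id INT64 NOT NULL,
  created DATE NOT NULL,
  payload STRING
);
-- Append the query results; BigQuery keeps the destination schema and rejects
-- the write if any required (NOT NULL) field would be null.
INSERT INTO mydataset.new_table
SELECT * EXCEPT(bad_column)
FROM mydataset.old_table;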
Outdated:
You have several options:
Change your field types to nullable. Standard SQL returns only nullable fields, and this is intended behavior, so going forward it may be less useful to mark fields as required.
You can use legacy SQL, which will preserve required fields. You can't use except, but you can explicitly select all other fields.
You can export and re-import with the desired schema.
You mention that export via JSON loses date information. Can you clarify? If you're referring to the partition date, then unfortunately I think any of the above solutions will collapse all data into today's partition, unless you explicitly insert into a named partition using the table$yyyymmdd syntax. (Which will work, but may require lots of operations if you have data spread across many dates.)
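For example, something like the following with the bq CLI, assuming the source table is ingestion-time partitioned (dataset/table names and the date are placeholders):
# One bq invocation per date; the $YYYYMMDD decorator targets a single partition.
bq query \
  --use_legacy_sql=false \
  --append_table \
  --destination_table='mydataset.new_table$20180620' \
  "SELECT * EXCEPT(bad_column)
   FROM mydataset.old_table
   WHERE _PARTITIONDATE = DATE '2018-06-20'"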

BigQuery now supports table clones. A table clone is a lightweight, writeable copy of another table.
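For example (dataset and table names are placeholders):
-- Create a lightweight, writeable clone of an existing table.
CREATE TABLE mydataset.table_clone
CLONE mydataset.original_table;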

Related

How to copy data from one table to another with nested required fields in repeatable objects

I'm trying to copy data from one table to another. The schemas are identical except that the source table has fields as nullable when they were meant to be required. BigQuery is complaining that the fields are null. I'm 99% certain the issue is that in many entries the repeated fields are absent, which causes no issues when inserting into the table using our normal process.
The table I'm copying from used to have the exact same schema, but accidentally lost the required fields when recreating the table with a different partitioning scheme.
From what I can tell, there is no way to change the fields from nullable to required in an existing table. It looks to me like you must create a new table then use a select query to copy data.
I tried enabling "Allow large results" and unchecking "Flatten results", but I'm seeing the same issue. The write preference is "Append to table".
(Note: see edit below as I am incorrect here - it is a data issue)
I tried building a query to better confirm my theory (and rule out that the records exist but are null), but I'm struggling to build one. I can definitely see in the preview that having some of the repeated fields be null is a real use case, so I would presume that translates to the nested required fields also being null. We have a backup of the table from before it was converted to the new partitioning, and it has the same required schema as the table I'm trying to copy into. A simple select count(*) where this.nested.required.field is null in legacy SQL on the backup indicates that there are quite a few rows that fit this criterion.
SQL used to select for insert:
select * from my_table
Edit:
The partition change on the table was also setting certain fields to a null value. It appears that somehow the select query created objects with all fields null rather than simply a null object. I used a conditional to set a nested object to either null or its existing value. Still investigating, but at this point I think what I'm attempting to do is normally supported, based on playing with some toy tables/queries.
When trying to copy from one table to another, and using SELECT AS STRUCT, run a null check like this:
IF(foo.bar IS NULL, NULL, (SELECT AS STRUCT foo.bar.* REPLACE(...)))
This prevents null nested structures from turning into structures full of null values.
To repair it via a select statement, use a conditional check against a value that is required, like this:
IF(bar.req IS NULL, NULL, bar)
Of course a real query is more complicated than that. The good news is that the repair query should look similar to the original query that messed up the format.
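Putting the two together, the repair query ends up looking roughly like this (my_table, bar, and req are just the placeholder names used above):
-- Rebuild each row, replacing any struct whose required leaf is null with a
-- plain NULL instead of a struct full of nulls.
SELECT
  * REPLACE (
    IF(bar.req IS NULL, NULL, bar) AS bar
  )
FROM my_table;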

Bigquery fails to return proper data from table when queried using wildcard query

We are using Looker (a dashboard/reporting solution) to create persistent derived tables in BigQuery. These are normal tables as far as BigQuery is concerned, but the naming follows Looker's convention (it creates a hash based on DB + SQL etc.) and names the table accordingly. These tables are regenerated through a view on a daily schedule. The table names in BigQuery look like below.
table_id
LR_Z504ZN0UK2AQH8N2DOJDC_AGG__table1
LR_Z5321I8L284XXY1KII4TH_MART__table2
LR_Z53WLHYCZO32VK3FWRS2D_JND__table3
If I query the resulting table in BQ by explicit name then the result is returned as expected.
select * from `looker_scratch.LR_Z53WLHYCZO32VK3FWRS2D_JND__table3`
Looker changes the hash value in the table name when the table is regenerated after a query/job change. Hence I wanted to create a view with a wildcard table query to make the changes in the table name transparent to the outside world.
But the below query always fails.
SELECT *
FROM `looker_scratch.LR_*`
WHERE _table_suffix LIKE '%JND__table3'
I either get a completely random schema with null values or errors such as:
Error: Cannot read field 'reportDate' of type DATE as TIMESTAMP_MICROS
There are no clashing table suffixes, and I have used all sorts of regular expression checks (lower, contains, etc.).
Is this happening because the table names have hash values in them? I have run multiple tests on other datasets and there were absolutely no problems; we have been running wildcard table queries for a long time and have faced no issues whatsoever.
Please let me know your thoughts.
Please let me know your thoughts.
When you use a wildcard like the one below,
`looker_scratch.LR_*`
you are actually looking for ALL tables with this prefix, and then, when you apply the clause below,
LIKE '%JND__table3'
you further filter to the tables with that suffix.
The trick here is that the most recently created table matching the wildcard defines the schema of your output.
To address your issue, verify whether more tables match your wildcard than you expect, and then look into the one that defines the schema (the most recently created one).
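One quick way to check is to list everything the wildcard matches along with creation times, for example via the dataset's __TABLES__ metadata (dataset name as in the question):
-- List the tables the wildcard would match, newest first; the newest one is
-- the one whose schema the wildcard query uses.
SELECT table_id, TIMESTAMP_MILLIS(creation_time) AS created
FROM `looker_scratch.__TABLES__`
WHERE table_id LIKE 'LR_%'
ORDER BY creation_time DESC;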

Storing query result as table in BigQuery retains its original table's schema structure?

My goal is to update all the rows of a Google BigQuery table. But to do so I have to recreate the table from the older data while adding a new column. So I run a select query with all the fields plus some hashing and encoding/decoding functions, then store the output as a new table with the same name as the old one, dropping the old table. My question is: when I create the new table, will it retain the original schema structure, especially when the original has some nested structures?
When you run the job, make sure you do not flatten results and the nesting of the schema will be retained. You can compare the schemas of the original and new tables within the web UI.
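In today's standard SQL you can also do this in one statement with CREATE OR REPLACE TABLE ... AS SELECT, which likewise keeps nested fields intact; a rough sketch with placeholder names (my_table, user_id, and the hashing stand in for the real query):
-- Rewrites the table in place while preserving nested/repeated columns.
CREATE OR REPLACE TABLE mydataset.my_table AS
SELECT
  t.*,
  TO_HEX(SHA256(CAST(t.user_id AS STRING))) AS hashed_user_id
FROM mydataset.my_table AS t;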

Export data with phpmyadmin, ordered on row-level

I'm working with a huge database with more than 800 tables and over 50,000 rows in total. All these tables have different structures, with the exception of a timestamp field which is present in all tables.
My challenge: export all data but be able to use the timestamp field in a meaningful way.
For statistical purposes I want to create an overview of all the entries in this database in which I can work with the timestamp field. The problem with a "normal" export is that the data is ordered by table, then ID. This means that the timestamp fields end up in different columns (I'm using Excel here), and I can't effectively use them to sort the entries.
TL;DR version: Is it possible to export all data from a database managed with PHPMyAdmin ordered by a field that is present in all tables, while all the other fields are table-specific?
It seems to me that what you want to do is first get the information in the format you would like and then export it. First though, you need to figure out exactly what you are trying to accomplish. You might rather create the SQL to do the statistical (counting, summing, averaging, etc.) work and then just use Excel for the final product. Views and alternate indexes provide logical ways of looking at the data.
As I understand what you are attempting, you need to recreate your database with the timestamp field as the major key for each table. Without physically rewriting the database I don't think you can use phpmyadmin's export to export in the format you want.
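That said, if the shared timestamp column is all you really need across tables, a UNION of just that column (plus the table name) is one logical view that exports and sorts cleanly; a sketch with hypothetical table names:
-- Combine only the column every table shares, tagging each row with its
-- source table, then sort the whole result by that column before exporting.
SELECT 'table_a' AS source_table, `timestamp` FROM table_a
UNION ALL
SELECT 'table_b' AS source_table, `timestamp` FROM table_b
ORDER BY `timestamp`;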

create table from query while keeping the original schema

I'm using the following workflow to append data to an existing BigQuery table from an external source:
1. Query the table for the most recently updated record: (select max(lastModifiedData) from test.table). Save this value as 'lastMigrationTime'.
2. Query the external source for the ids of records that changed after 'lastMigrationTime'.
3. Query the BigQuery table for all records except the updated ones; save the result to test.tempTable.
4. Move tempTable to table (delete table, copy tempTable to table, delete tempTable).
5. Query the external source for the updated records and load them into test.table.
The problem I'm facing is that the original schema of the table contains nested elements. Any query I run will flatten the schema, forcing me to flatten the original schema as well. Another side effect I saw is that column names are turned to lower case.
Is there any way to keep the original schema (mainly the nesting, but also maintaining the case would be nice)?
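For reference, a rough sketch of steps 1 and 3 (test.table, lastModifiedData, and test.tempTable are the names from my description above; the id column and id list are placeholders for whatever step 2 returns):
-- Step 1: find the last-modified timestamp already present in BigQuery.
SELECT MAX(lastModifiedData) AS lastMigrationTime
FROM test.table;
-- Step 3: keep every record except the changed ones (written to test.tempTable
-- via the query's destination table setting).
SELECT *
FROM test.table
WHERE id NOT IN ('id1', 'id2' /* ids returned by step 2 */);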
The column name casing issue is a known bug and should be fixed in our next release (hopefully in the next few days).
Preserving column nesting is a high-priority feature request. We're very interested in supporting this, but I don't have any time frame for when it will get done, unfortunately.