I'm using the following workflow to append data to an existing BigQuery table from an external source:
query the table for the most updated record: (select max(lastModifiedData) from test.table). Save this data as 'lastMigrationTime';
query the external source for ids for records that changed since after 'lastMigrationTime'
query big Query table for all records except the updated ones: save result to test.tempTable.
move tempTable to table (using delete table,copy tempTable to table,delete tempTable).
Query external source for updated records and load them to test.table
The problem I'm facing is that the original schema of the table contains nested elements. Any query I run will flatten the schema, forcing me to flatten the original schema as well. Another side effect I saw is that column names are turned to lower case.
Is there any way to keep the original schema (mainly the nesting, but also maintaining the case would be nice)?
The column name casing issue is a known bug and should be fixed in our next release (hopefully in the next few days).
Preserving column nesting is a high-priority feature request. We're very interested in supporting this, but I don't have any time frame for when it will get done, unfortunately.
Related
I am attempting to fix the schema of a Bigquery table in which the type of a field is wrong (but contains no data). I would like to copy the data from the old schema to the new using the UI ( select * except(bad_column) from ... ).
The problem is that:
if I select into a table, then Bigquery is removing the required of the columns and therefore rejecting the insert.
Exporting via json loses information on dates.
Is there a better solution than creating a new table with all columns being nullable/repeated or manually transforming all of the data?
Update (2018-06-20): BigQuery now supports required fields on query output in standard SQL, and has done so since mid-2017.
Specifically, if you append your query results to a table with a schema that has required fields, that schema will be preserved, and BigQuery will check as results are written that it contains no null values. If you want to write your results to a brand-new table, you can create an empty table with the desired schema and append to that table.
Outdated:
You have several options:
Change your field types to nullable. Standard SQL returns only nullable fields, and this is intended behavior, so going forward it may be less useful to mark fields as required.
You can use legacy SQL, which will preserve required fields. You can't use except, but you can explicitly select all other fields.
You can export and re-import with the desired schema.
You mention that export via JSON loses date information. Can you clarify? If you're referring to the partition date, then unfortunately I think any of the above solutions will collapse all data into today's partition, unless you explicitly insert into a named partition using the table$yyyymmdd syntax. (Which will work, but may require lots of operations if you have data spread across many dates.)
BigQuery now supports table clone features. A table clone is a lightweight, writeable copy of another table
Copy tables from query in Bigquery
Our use case for BigQuery is a little unique. I want to start using Date-Partitioned Tables but our data is very much eventual. It doesn't get inserted when it occurs, but eventually when it's provided to the server. At times this can be days or even months before any data is inserted. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still have the benefits of a Date-Partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I can start using Date-Partitioned tables.
Anyone have a good solution here?
You don't need create your own column.
_PARTITIONTIME pseudo column still will work for you!
The only what you will need to do is insert/load respective data batch into respective partition by referencing not just table name but rather table with partition decorator - like yourtable$20160718
This way you can load data into partition that it belong to
My goal is to update all the rows of google BigQuery table. But to do so I have to recreate tables from older data with adding new column. So I run a select query with all the fields and some hashing and encoding/decoding function. and then storing output as new table and same name as older one with dropping old table. But my question is when I create a new table will it retain its original schema structure specially when original has some nested structures.
When you run the job make sure you do not flatten results and the nesting of the schema will be retained. You can compare the schemas of the original and new table within the web ui.
We receive a data feed from our customers and we get roughly the same schema each time, though it can change on the customer end as they are using a 3rd party application. When we receive the data files we import the data into a staging database with a table for each data file (students, attendance, etc). We then want to compare that data to the data that we already have existing in the database for that customer and see what data has changed (either the column has changed or the whole row was possibly deleted) from the previous run. We then want to write the updated values or deleted rows to an audit table so we can then go back to see what data changed from the previous data import. We don't want to update the data itself, we only want to record what's different between the two datasets. We will then delete all the data from the customer database and import the data exactly as is from the new data files without changing it(this directive has been handed down and cannot change). The big problem is that I need to do this dynamically since I don't know exactly what schema I'm going to be getting from our customers since they can make customization to their tables. I need to be able to dynamically determine what tables there are in the destination, and their structure, and then look at the source and compare the values to see what has changed in the data.
Additional info:
There are no ID columns on source, though there are several columns that can be used as a surrogate key that would make up a distinct row.
I'd like to be able to do this generically for each table without having to hard-code values in, though I might have to do that for the surrogate keys for each table in a separate reference table.
I can use either SSIS, SPs, triggers, etc., whichever would make more sense. I've looked at all, including tablediff, and none seem to have everything I need or the logic starts to get extremely complex once I get into them.
Of course any specific examples anyone has of something like this they have already done would be greatly appreciated.
Let me know if there's any other information that would be helpful.
Thanks
I've worked on a similar problem and used a series of meta data tables to dynamically compare datasets. These meta data tables described which datasets need to be staged and which combination of columns (and their data types) serve as business key for each table.
This way you can dynamically construct a SQL query (e.g., with a SSIS script component) that performs a full outer join to find the differences between the two.
You can join your own meta data with SQL Server's meta data (using sys.* or INFORMATION_SCHEMA.*) to detect if the columns still exist in the source and the data types are as you anticipated.
Redirect unmatched meta data to an error flow for evaluation.
This way of working is very risky, but can be done if you maintain your meta data well.
If you want to compare two tables to see what is different the keyword is 'except'
select col1,col2,... from table1
except
select col1,col2,... from table2
this gives you everything in table1 that is not in table2.
select col1,col2,... from table2
except
select col1,col2,... from table1
this gives you everything in table2 that is not in table1.
Assuming you have some kind of useful durable primary key on the two tables, everything in both sets, is a change. Everything in the first set is an insert; Everything in the second set is a delete.
I would like to know if there's a way to add a column to an SQL Server table after it's created and in a specific position??
Thanks.
You can do that in Management-Studio. You can examine the way this is accomplished by generating the SQL-script BEFORE saving the change. Basically it's achieved by:
removing all foreign keys
creating a new table with the added column
copying all data from the old into the new table
dropping the old table
renaming the new table to the old name
recreating all the foreign keys
In addition to all the other responses, remember that you can reorder and rename columns in VIEWs. So, if you find it necessary to store the data in one format but present it in another, you can simply add the column on to the end of the table and create a single table view that reorders and renames the columns you want to show. In almost every circumstance, this view will behave exactly like the original table.
The safest way to do this is.
Create your new table with the correct column order
Copy the data from the old table.
Drop the Old Table.
The only safe way of doing that is creating a new table (with the column where you want it), migrating the data, dropping the original table, and renaming the new table to the original name.
This is what Management Studio does for you when you insert columns.
As others have pointed out you can do this by creating a temp table moving the data and droping the orginal table and then renaming the other table. This is stupid thing to do though. If your table is large, it could be very time-consuming to do this and users will be locked out during the process. This issomething you NEVER want to do to any table in production.
There is absolutely no reason to ever care what order the columns are in a table since you should not be relying on column order anyway (what if someone else did this same stupid thing?). No queries should use select * or ordinal positions to get columns. If you are doing this now, this is broken code and needs to be fixed immediately as the results are not always going to be as expected. For instance if you do insert a column where you want it and someone else is using select * for a report, suddenly the partnumber is showing up in the spot that used to hold the Price.
By doing what you want to do, you may break much more than you fix by putting the column where you personally want it. Column order in tables should always be irrelevant. You should not be doing this every time you want columns to appear in a differnt order.
With Sql Server Management Studio you can open the table in design and drag and drop the column wherever you want
As Kane says, it's not possible in a direct way. You can see how Management Studio does it by adding a column in the design mode and checking out the change script.
If the column is not in the last position, the script basically drops the table and recreates it, with the new column in the desired position.
In databases table columns don't have order.
Write proper select statement and create a view
No.
Basically, SSMS behind the scenes will copy the table, constraints, etc, drop the old table and rename the new.
The reason is simple - columns are not meant to be ordered (nor are rows), so you're always meant to list which columns you want in a result set (select * is a bit of a hack)