I accidentally added a wrong column to my BigQuery table schema.
Instead of reloading the complete table (millions of rows), I would like to know if the following is possible:
remove the bad rows (rows that have values in the wrong column) by running a "select *" query on the table with some kind of filter, and saving the result to the same table.
remove the (now) unused column.
Is this functionality (or similar) supported?
Possibly the "save result to table" functionality can have a "compact schema" option.
According to the documentation, the quickest way to remove a column from a BigQuery table is:
ALTER TABLE [table_name] DROP COLUMN IF EXISTS [column_name]
If your table does not contain record/repeated type fields, your simplest option is:
1. Select the valid columns, filtering out the bad records, and write the result to a new temp table, YourTable_Temp:
SELECT < list of original columns >
FROM YourTable
WHERE < filter to remove bad entries here >
2. Make a backup copy of the "broken" table, YourTable_Backup
3. Delete YourTable
4. Copy YourTable_Temp to YourTable
5. Check that everything looks as expected and, if so, get rid of the temp and backup tables
Please note: the cost of #1 above is exactly the same as that of the first action in your question. The remaining actions (the copies) are free.
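The plan above can be sketched end to end as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a local stand-in for BigQuery, and the helper rebuild_without_column, the column names id, name and wrong_col, and the filter are all assumptions made up for the example. In BigQuery itself the temp, backup and copy steps would be query-to-table and table copy jobs instead.

```python
import sqlite3

def rebuild_without_column(conn, table, keep_columns, bad_row_filter):
    """Rebuild `table` with only `keep_columns` and the rows passing the filter.

    Hypothetical helper: identifiers are interpolated directly, so pass
    trusted names only.
    """
    cols = ", ".join(keep_columns)
    cur = conn.cursor()
    # Step 1: valid columns only, bad rows filtered out, into a temp table.
    cur.execute(
        f"CREATE TABLE {table}_Temp AS "
        f"SELECT {cols} FROM {table} WHERE {bad_row_filter}"
    )
    # Step 2: keep a backup copy of the "broken" table.
    cur.execute(f"CREATE TABLE {table}_Backup AS SELECT * FROM {table}")
    # Step 3: delete the original table.
    cur.execute(f"DROP TABLE {table}")
    # Step 4: copy the temp table back under the original name.
    cur.execute(f"CREATE TABLE {table} AS SELECT * FROM {table}_Temp")
    conn.commit()
    # Step 5 (check, then drop _Temp and _Backup) is left to the caller.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, name TEXT, wrong_col TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, "a", None), (2, "b", "oops"), (3, "c", None)],
)

# Assumption: a row is "bad" when it has a value in the wrong column.
rebuild_without_column(conn, "YourTable", ["id", "name"], "wrong_col IS NULL")

cursor = conn.execute("SELECT * FROM YourTable ORDER BY id")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM YourTable_Backup").fetchone()[0]
```

The backup table still holds every original row, so the check in step 5 can compare against it before anything is dropped.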
If you do have repeated/record fields, you can still execute the above plan, but in #1 you will need to use some BigQuery User-Defined Functions to get the proper schema in the output.
See the examples below. Of course this will require some extra development, but if you are in a critical situation, this should work for you.
Create a table with Record type column
create a table with a column type RECORD
I hope that at some point the Google BigQuery team will add better support for cases like yours, where you need to manipulate and output repeated/record data, but for now this is the best workaround I have found, at least for myself.
Below is the code to do it. Let's say c is the column that you want to delete.
CREATE OR REPLACE TABLE transactions.test_table AS
SELECT * EXCEPT (c) FROM transactions.test_table;
Or, the second method and my favorite, follow the steps below.
Write a Select query that leaves out the columns you want to exclude.
Go to Query Settings.
In the Destination settings, choose Set destination table for query results and enter the project name, dataset name and table name, exactly the same as the table you queried in Step 1.
Under Destination table write preference, select Overwrite table.
Save the Query Setting and run the query.
Saving results to a table is the way to go. Run the query on the big table, selecting only the columns you are interested in; you can apply a limit first to try it on a small result.
I have a database, which contains information that I can't share images of due to compliance reasons.
I have a table I need to copy data from, so I was using the following SQL:
INSERT INTO completedtrainingstestfinal (MALicenseNum)
SELECT MALicenseNum
FROM CompletedTrainings
WHERE (CompletedTrainings.MALicenseNum IS NOT NULL)
AND (CompletedTrainings.Employee = completedtrainingstestfinal.Employee);
It keeps popping up the Enter Parameter Value prompt for the Employee column of the new table (named completedtrainingstestfinal).
Background: The original table is a mess, and this is to be the replacement table. I've had to pivot the table in order to clean it up, and am now trying to remove an ungodly amount of nulls. The goal is to clean up the query process for the end users, who need to enter training and certification/recertification through the forms.
When you look in the old table, it has been designed to reference another table and display the actual names, but as seen in the image below it is storing the data as the integer number Employee.
The Employee column in the new table was a direct copy but only displays the integer. My instincts tell me that the problem is here, but I have been unable to find a solution. Does anyone have any suggestions to point me in the right direction?
Edited to add: It might be an issue where the tables have different numbers of rows?
This is the design view of the two relevant tables :
Table 1
Table 2
I currently have a transformation set up with 2 table inputs and one Merge Rows (Diff). The SQL select statement in both table inputs is constant; nothing changes except the table name. So I have:
select * from THIS_WILL_CHANGE
I have around 100 tables and I don't want to manually enter the table names every iteration, especially because this is automation...
What is the best way to achieve this? Is there any way to read something like a CSV file with all the table names and loop over it? Any help is appreciated.
This is something I've had to do before too!
You can do this with a variable and a job which executes once for each row of the previous step.
Create a parent job to host these steps
Create a transformation which gets the table names from 'somewhere', e.g. a CSV file or a database query. A select on all_tables for tables with the same column names might be a nice way to do this for all time...
In this same transformation, use a 'Copy rows to result' step to push the data back to the job
Create a new 'sub job', which executes once for each row, and has a hop from the 'get data' step in the main job
In the sub job, create two transformations, one to set the variable from the results field, and one to do your select
In your select query, check the box 'substitute variables' and place your variable with the same name as your set variables step into your SQL as ${yourVariableHere}
I've put this in an image below, which hopefully helps you.
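For readers who want to see the control flow outside the tool, here is a rough Python sketch of the same loop: read the table names from CSV, then substitute each one into the constant query. It is not Pentaho code. The CSV content, the table names and the variable name tableName are invented for the example, and sqlite3 stands in for the real database.

```python
import csv
import io
import sqlite3
from string import Template

# Same idea as ${yourVariableHere} with 'substitute variables' checked.
QUERY = Template("SELECT * FROM ${tableName}")

def read_table_names(csv_text):
    """Return the first column of each non-empty CSV row."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]

def select_from_each(conn, table_names):
    """Run the constant query once per table name, like the per-row sub job."""
    results = {}
    for name in table_names:
        # Table names cannot be bound as parameters, so validate them first.
        if not name.isidentifier():
            raise ValueError(f"suspicious table name: {name!r}")
        results[name] = conn.execute(QUERY.substitute(tableName=name)).fetchall()
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("INSERT INTO orders VALUES (1), (2)")
conn.execute("INSERT INTO customers VALUES (7)")

names = read_table_names("orders\ncustomers\n")
results = select_from_each(conn, names)
```

The identifier check matters because the table name is pasted into the SQL text rather than bound as a parameter.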
The dynamic sql row step is a good option for this, provided the tables all have the same layout/metadata.
I have a database. I created it with HeidiSQL. It looks like this.
I enter value-1 and value-2.
Is there a way to enter a formula into the Result column, like " =Value-1 * Value-2 "? I want my database to calculate the Result when I enter my values into the other cells.
A trigger is one way to achieve automated column content.
A second one is a view, which you can create in addition to the table. That view could contain SQL which generates the result:
SELECT value1, value2, value1*value2 AS result
A third (more modern) alternative is adding a virtual column in your existing table. You can do that with HeidiSQL's table editor, like shown in the screenshot. Just add a new column with INT data type, and set its Virtuality to "VIRTUAL", and Expression to "value-1 * value-2". That's it.
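As a small runnable illustration of the view option, the sketch below uses Python's built-in sqlite3 module rather than the MySQL/MariaDB server behind HeidiSQL. The table name measurements, the view name and the id column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    "id INTEGER PRIMARY KEY, value1 INTEGER, value2 INTEGER)"
)
# The view computes result on every read, so it can never go stale.
conn.execute(
    "CREATE VIEW measurements_with_result AS "
    "SELECT id, value1, value2, value1 * value2 AS result FROM measurements"
)

conn.execute("INSERT INTO measurements (value1, value2) VALUES (3, 4)")
before = conn.execute(
    "SELECT value1, value2, result FROM measurements_with_result"
).fetchall()

# Changing an input changes the computed result without any extra work.
conn.execute("UPDATE measurements SET value1 = 5 WHERE id = 1")
after = conn.execute(
    "SELECT value1, value2, result FROM measurements_with_result"
).fetchall()
```

Nothing is stored for result here; it is recomputed each time the view is queried, which is the main difference from the trigger approach.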
I'm not familiar with HeidiSQL, but it appears to be a front end. Which RDBMS are you using? SQL Server, for example, allows a computed column.
SQL
ALTER TABLE YourTable
ADD Result AS ([Value-1] * [Value-2])
Right click your database name in the folder structure, go to --> Create new, then --> Trigger
Then you can create a trigger that, when data is entered, is activated for the entire column like this:
But you will need to know how to write the actual query and function. This requires basic knowledge that is fairly generic and consistent across most SQL dialects.
I need to set up a new company for automated data import. The utility has provided the data in a spreadsheet. (Image 1)
Based on this data, I need to create a stored procedure that will identify the correct meter, if it exists, and perform either an insert or update to the monthly data table. For automated utility data import, I want to make sure I restrict everything to a particular utility company.
The steps are the following (I am having a hard time converting this to SQL):
1- I just want a script that identifies the correct meter and checks whether it exists, basically comparing the Meter# column in the Excel file with the MeterNumber column in the Meters table.
2- The next step is perform either an insert or update to the MonthlyData table. This is a screen shot of all its columns.
3- Then I just want to make sure that I am restricting everything to the particular company, which in this case is Site1, since 2 different companies might have the same meter#. The UtilityCompany table contains 3 columns: ID, Name, UtilityType
I honestly do not know where to start. Would anybody help me with the script? Thank you
You will want to:
Perform a Bulk Insert operation to load your data from the Excel file into a staging table.
Write a query to select ALL rows for the corresponding utility company (notice I didn't say iterate over each row...). This select could be an update, where you set an additional column to mark each row as an INSERT or an UPDATE.
Then the last step (2 parts): retrieve all of the rows that were marked as INSERT and insert those into your table. Then grab all rows that were marked as UPDATE and update their corresponding values based on your matching criteria.
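The staging approach above might look roughly like this. The sketch runs against SQLite through Python's sqlite3 module so that it is self-contained; the columns UtilityCompanyID, BillingMonth, Kwh, MeterID and Action, the SKIP marker, and all sample data are assumptions, since the real table layouts are only visible in the screenshots.

```python
import sqlite3

def import_staged_rows(conn, company_id):
    """Mark every staged row INSERT/UPDATE/SKIP, then apply it set-based."""
    cur = conn.cursor()
    # Resolve the meter, restricted to one utility company.
    cur.execute(
        """UPDATE Staging SET MeterID = (
               SELECT ID FROM Meters
               WHERE Meters.MeterNumber = Staging.MeterNumber
                 AND Meters.UtilityCompanyID = ?)""",
        (company_id,),
    )
    # Mark each row: unknown meter, existing month, or new month.
    cur.execute(
        """UPDATE Staging SET Action = CASE
               WHEN MeterID IS NULL THEN 'SKIP'
               WHEN EXISTS (SELECT 1 FROM MonthlyData m
                            WHERE m.MeterID = Staging.MeterID
                              AND m.BillingMonth = Staging.BillingMonth)
                   THEN 'UPDATE'
               ELSE 'INSERT' END"""
    )
    # Part 1: insert the rows marked INSERT.
    cur.execute(
        """INSERT INTO MonthlyData (MeterID, BillingMonth, Kwh)
           SELECT MeterID, BillingMonth, Kwh
           FROM Staging WHERE Action = 'INSERT'"""
    )
    # Part 2: update the rows marked UPDATE.
    cur.execute(
        """UPDATE MonthlyData SET Kwh = (
               SELECT s.Kwh FROM Staging s
               WHERE s.MeterID = MonthlyData.MeterID
                 AND s.BillingMonth = MonthlyData.BillingMonth
                 AND s.Action = 'UPDATE')
           WHERE EXISTS (
               SELECT 1 FROM Staging s
               WHERE s.MeterID = MonthlyData.MeterID
                 AND s.BillingMonth = MonthlyData.BillingMonth
                 AND s.Action = 'UPDATE')"""
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UtilityCompany (ID INTEGER PRIMARY KEY, Name TEXT, UtilityType TEXT);
CREATE TABLE Meters (ID INTEGER PRIMARY KEY, MeterNumber TEXT, UtilityCompanyID INTEGER);
CREATE TABLE MonthlyData (MeterID INTEGER, BillingMonth TEXT, Kwh REAL);
CREATE TABLE Staging (MeterNumber TEXT, BillingMonth TEXT, Kwh REAL,
                      MeterID INTEGER, Action TEXT);
INSERT INTO UtilityCompany VALUES (1, 'Site1', 'Electric'), (2, 'Site2', 'Electric');
-- Two companies share meter number M-100.
INSERT INTO Meters VALUES (1, 'M-100', 1), (2, 'M-100', 2), (3, 'M-200', 1);
INSERT INTO MonthlyData VALUES (1, '2024-01', 50.0), (2, '2024-01', 999.0);
INSERT INTO Staging (MeterNumber, BillingMonth, Kwh) VALUES
    ('M-100', '2024-01', 75.0), ('M-200', '2024-01', 30.0), ('M-999', '2024-01', 1.0);
""")

import_staged_rows(conn, company_id=1)
actions = conn.execute(
    "SELECT MeterNumber, Action FROM Staging ORDER BY MeterNumber"
).fetchall()
monthly = conn.execute(
    "SELECT MeterID, BillingMonth, Kwh FROM MonthlyData ORDER BY MeterID"
).fetchall()
```

In SQL Server the same marking and applying could live inside the stored procedure, with BULK INSERT filling the staging table first. The other company's M-100 reading is untouched because the meter lookup is restricted by company.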
Hi, I have a table which was designed by a lazy developer who did not create it in 3rd normal form. He saved arrays in the table instead of using an MM relation. The application is running, so I cannot change the database schema.
I need to query the table like this:
SELECT * FROM myTable
WHERE usergroup = 20
where usergroup field contains data like this : 17,19,20 or it could be also only 20 or only 19.
I could search with like:
SELECT * FROM myTable
WHERE usergroup LIKE 20
but in this case it would also match a field which contains 200, for example.
Anybody any idea?
thanx
Fix the bad database design.
A short-term fix is to add a related table with the correct structure. Add a trigger to parse the info in the old field into the related table on insert and update. Then write a script to parse out the existing data. Now you can properly query, but you haven't broken any of the old code. Then you can search for the old code and fix it. Once you have done that, just change the code that inserts into or updates the original table so that it writes to the new table, and drop the old column.
Write a table-valued user-defined function (UDF in SQL Server, I am sure it will have a different name in other RDBMS) to parse the values of the column containing the list which is stored as a string. For each item in the comma-delimited list, your function should return a row in the table result. When you are using a query like this, query against the results returned from the UDF.
Write a function to convert a comma delimited list to a table. Should be pretty simple. Then you can use IN().
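A minimal sketch of the parsing idea, assuming the table has an id primary key (the question does not show one) and using SQLite through Python's sqlite3 module: the comma-delimited field is split into a related table, myTable_usergroup, a name invented here, which can then be queried with an exact match.

```python
import sqlite3

def split_groups(value):
    """Turn a string like '17,19,20' into [17, 19, 20]."""
    return [int(part) for part in (value or "").split(",") if part.strip()]

def build_related_table(conn):
    """Parse the existing usergroup strings into one row per group."""
    conn.execute(
        "CREATE TABLE myTable_usergroup (row_id INTEGER, usergroup INTEGER)"
    )
    rows = conn.execute("SELECT id, usergroup FROM myTable").fetchall()
    for row_id, value in rows:
        conn.executemany(
            "INSERT INTO myTable_usergroup VALUES (?, ?)",
            [(row_id, group) for group in split_groups(value)],
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, usergroup TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [(1, "17,19,20"), (2, "20"), (3, "19"), (4, "200,21")],
)
build_related_table(conn)

# Exact match on the parsed values: row 4 (group 200) is not returned.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT t.id FROM myTable t "
        "JOIN myTable_usergroup g ON g.row_id = t.id "
        "WHERE g.usergroup = 20 ORDER BY t.id"
    )
]
```

Keeping the related table in sync on later inserts and updates would still need the trigger described above; this only covers the one-off parse of existing data.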