I have a database from which I am pulling rows, manipulating the data a bit, and then putting it into another table. Every time I run the package, it doesn't remove any data from the destination, which therefore grows by X number of rows each time.
Is there any way that I can clear the destination before adding the new rows?
You can run a TRUNCATE statement to clear all records from the table.
TRUNCATE TABLE YourTableName
Context
I have an ETL process that keeps overwriting all rows of a table in BigQuery by deleting them all first and then inserting new ones. I'm looking for a data backup design that can be triggered regularly on that table.
Issue
I'm concerned about the cost implications of using snapshots for this kind of table.
What exactly am I worried about?
On each drop and recreation of the base table, the new data has many rows that are identical to the previous rows, some new rows, and some updated rows. However, the data gets inserted in a different sort order each time.
So when BigQuery is creating a snapshot by looking for rows that have changed, will it know that some previous rows are still in the base table and have only changed position, in order to avoid increased storage costs for the snapshot?
Have you thought about using merge statements?
These can deal with inserts, updates and even deletes in one query.
There is an example here: https://querystash.com/query/62cf51097d57d7579954c0d418afc063
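As a rough, hedged sketch of what such a MERGE could look like in BigQuery, with hypothetical tables my_dataset.target_table and my_dataset.source_table keyed on an id column (none of these names come from the question):

-- Upsert changed/new rows and delete rows no longer present in the source
MERGE INTO my_dataset.target_table AS T
USING my_dataset.source_table AS S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET T.value = S.value, T.updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (S.id, S.value, S.updated_at)
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;

Because the whole MERGE runs as a single statement, readers only ever see the table as it was before or after the statement, never a half-applied mix.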
This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients which read the data should use either only the old version of the data or only the completely new version.
The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data have to know about the partitions and get data only from certain partitions. Every time I want to make a query, I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like the solution to be easy and transparent for the clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including a single DML statement (INSERT/UPDATE/DELETE/MERGE) and a single load job, are atomic. Your reader reads either the old data or the new data.
Lacking multi-statement transactions right now, if you do have updates which don't fit into a single load job, the solution is:
Load the updates into a staging table; wait until all loads have finished
Use a single INSERT or MERGE to apply the updates from the staging table to the primary data table
The drawback: scanning the staging table is not free. A sketch of this flow follows below.
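Here is a minimal sketch of that two-step flow; the table names my_dataset.staging_table and my_dataset.data_table and the columns are hypothetical, and the staging load itself can be any number of load jobs:

-- 1) Load all updates into my_dataset.staging_table (as many load jobs as needed).

-- 2) Apply them to the primary table in one atomic statement, e.g. an append:
INSERT INTO my_dataset.data_table (id, value, updated_at)
SELECT id, value, updated_at
FROM my_dataset.staging_table;
-- (use a MERGE like the one shown earlier instead if existing rows can change)

-- 3) Clear the staging table for the next cycle
TRUNCATE TABLE my_dataset.staging_table;

Readers querying data_table during step 2 see either none or all of the staged rows, which gives the "atomic" behaviour asked about.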
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming each table that you need to update has an ActivePartition column as its partition key, you can have a control table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date, and then your users use a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
Our app has a few very large tables in SQL Server: 500 million rows and 1 billion rows in two tables that we'd like to clean up to reclaim some disk space.
In our testing environment, I tried running chunked deletes in a loop but I don't think this is a feasible solution in prod.
So the other alternative is to select/insert the data we want to keep into a temp table, truncate/drop the old table, and then:
recreate indexes
recreate foreign key constraints
restore table permissions
rename the temp table back to the original table name
My question is, am I missing anything from my list? Are there any other objects / structures that we will lose which we need to re-create or restore? It would be a disastrous situation if something went wrong. So I am playing this extremely safe.
Resizing the db/adding more space is not a possible solution. Our SQL Server app is near end of life and is being decom'd, so we are just keeping the lights on until then.
While you are doing this operation, will there be new records added to the original table? I mean, will the app that writes to this table be live? If that is the case, maybe it would be better to change the order of the steps, like:
First, rename the original table to the temp name
Create a new table with the original name so that new records can be added from the writing app.
In parallel, you can move the data you want to keep from the temp table to the new original table (see the sketch below).
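A rough T-SQL sketch of that reordering, using a hypothetical dbo.BigTable and an illustrative retention filter (none of these names or columns come from the question):

-- 1) Move the full table out of the way
EXEC sp_rename 'dbo.BigTable', 'BigTable_Old';

-- 2) Recreate an empty table under the original name so the app can keep writing;
--    re-apply indexes, foreign keys, and permissions here
CREATE TABLE dbo.BigTable (Id INT NOT NULL PRIMARY KEY, CreatedAt DATETIME NOT NULL, Payload VARCHAR(100) NULL);

-- 3) Copy back only the rows you want to keep while the app keeps writing new ones
INSERT INTO dbo.BigTable (Id, CreatedAt, Payload)
SELECT Id, CreatedAt, Payload
FROM dbo.BigTable_Old
WHERE CreatedAt >= DATEADD(YEAR, -1, GETDATE());  -- example retention filter

-- 4) Once everything is verified, drop the old table to reclaim the space
-- DROP TABLE dbo.BigTable_Old;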
I have a table that has around 13 billion records. The size of this table is around 800 GB. I want to add a column of type tinyint to the table, but it takes a lot of time to run the ADD COLUMN command. Another option would be to create another table with the additional column and copy the data from the source table to the new table using BCP (data export and import), or to copy the data directly into the new table.
Is there a better way to achieve this?
My preference for tables of this size is to create a new table and then batch the records into it (BCP, Bulk Insert, SSIS, whatever you like). This may take longer, but it keeps your log from blowing out. You can also do the most relevant data (say, the last 30 days) first, swap out the table, then batch in the remaining history so that you can take advantage of the new column immediately... if your application lines up with that strategy.
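A minimal sketch of such a batched copy in T-SQL, assuming a hypothetical dbo.Source/dbo.Target pair with an increasing Id key (the names and batch size are illustrative only):

DECLARE @BatchSize INT = 100000;
DECLARE @LastId BIGINT = 0;
DECLARE @Rows INT = 1;

WHILE @Rows > 0
BEGIN
    -- Copy the next slice of rows in key order
    INSERT INTO dbo.Target (Id, CreatedAt, Payload)
    SELECT TOP (@BatchSize) Id, CreatedAt, Payload
    FROM dbo.Source
    WHERE Id > @LastId
    ORDER BY Id;

    SET @Rows = @@ROWCOUNT;

    -- Advance the watermark to the highest Id copied so far
    SELECT @LastId = MAX(Id) FROM dbo.Target;

    -- Under FULL recovery, take log backups between batches to keep the log small
END

Changing the WHERE clause (or reversing the order) lets you bring over the most recent data first, as suggested above.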
I have a table with ~100 columns, about ~30M rows, on MSSQL server 2005.
I need to alter 2 columns - change their types from VARCHAR(1024) to VARCHAR(MAX). These columns do not have indexes on them.
I'm worried that doing so will fill up the log and cause the operation to fail. How can I estimate the free disk space, both for the data and for the log, needed for such an operation to ensure it will not fail?
You are right: increasing the column size (including to MAX) will generate a huge log for a large table, because every row will be updated (behind the scenes the old column gets dropped, a new column gets added, and the data is copied). Instead:
Add a new column of type VARCHAR(MAX) NULL. As a nullable column, it will be added as a metadata-only change (no data update).
Copy the data from the old column to the new column. This can be done in batches to alleviate the log pressure.
Drop the old column. This will be a metadata-only operation.
Use sp_rename to rename the new column to the old column name.
Later, at your convenience, rebuild the clustered index (online if needed) to get rid of the space occupied by the old column.
This way you get control over the log by controlling the batch size in step 2. You also minimize the disruption to permissions, constraints, and relations by not copying the entire table into a new one (as SSMS so poorly does...).
You can do this sequence for both columns at once; a sketch of it follows below.
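A minimal T-SQL sketch of the sequence above, for a hypothetical table dbo.MyTable with a single OldCol VARCHAR(1024) column (the names and batch size are illustrative, not from the question):

-- 1) Metadata-only: add the new nullable column
ALTER TABLE dbo.MyTable ADD NewCol VARCHAR(MAX) NULL;

-- 2) Copy the data over in batches to keep the log growth under control
DECLARE @Rows INT;
SET @Rows = 1;
WHILE @Rows > 0
BEGIN
    UPDATE TOP (50000) dbo.MyTable
    SET NewCol = OldCol
    WHERE NewCol IS NULL AND OldCol IS NOT NULL;
    SET @Rows = @@ROWCOUNT;
END;

-- 3) Metadata-only: drop the old column
ALTER TABLE dbo.MyTable DROP COLUMN OldCol;

-- 4) Rename the new column to the old name
EXEC sp_rename 'dbo.MyTable.NewCol', 'OldCol', 'COLUMN';

-- 5) Later, rebuild the clustered index to reclaim the space left by the dropped column
-- ALTER INDEX ALL ON dbo.MyTable REBUILD;

Each UPDATE batch is its own transaction, so the log only has to absorb about 50,000 row updates at a time (take log backups between batches under FULL recovery).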
I would recommend that you consider, instead:
Create a new table with the new schema
Copy data from old table to new table
Drop old table
Rename new table to name of old table
This might be a far less costly operation and could possibly be done with minimal logging using INSERT/SELECT (if this were SQL Server 2008 or higher).
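For illustration only, here is a sketch of what that could look like on SQL Server 2008 or higher, with hypothetical names; minimal logging for the INSERT/SELECT additionally requires the TABLOCK hint, an empty target, and the SIMPLE or BULK_LOGGED recovery model:

-- New table with the target schema (VARCHAR(MAX) columns already in place)
CREATE TABLE dbo.MyTable_New (
    Id INT NOT NULL PRIMARY KEY CLUSTERED,
    Col1 VARCHAR(MAX) NULL,
    Col2 VARCHAR(MAX) NULL
);

-- TABLOCK on the empty target allows minimal logging under the right recovery model
INSERT INTO dbo.MyTable_New WITH (TABLOCK) (Id, Col1, Col2)
SELECT Id, Col1, Col2
FROM dbo.MyTable;

DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.MyTable_New', 'MyTable';

Note that, unlike the in-place column swap above, this approach requires re-creating indexes, constraints, and permissions on the new table.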
Why would increasing the VARCHAR limit fill up the log?
Try to do some tests on smaller pieces. I mean, you could create the same structure locally with a few thousand rows and see the difference before and after. I think the change will be linear. The real question is about the redo log (the transaction log in MSSQL): whether it will fit if you do the change all at once.
Must you do it online, or can you stop production for a while? If you can stop, maybe there is a way to stop the redo log in MSSQL like in Oracle; it could make it a lot faster. If you need to do it online, you could try to make a new column and copy the values into it in a loop of, for example, 100,000 rows at a time, committing as you go. After completing that, dropping the original column and renaming the new one may be faster than altering.