I have three nodes running that contain some data we need to purge. Since we cannot identify the specific records, we would like to truncate the table and fill it from scratch.
Now I was wondering: when I issue a truncate statement, will this only truncate the table on the current node, or will it also clear the other nodes?
So basically, do I have to issue the truncate statement on each node and fill each one there, or is it sufficient to do it on one node and have it propagate to the others? We would load the data from a CSV file via the COPY command.
The table will be truncated in the entire cluster.
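For completeness, the reload then only needs to be run once, from any single node. A minimal sketch (the table name, file path, and COPY options are placeholders; adjust them to the SQL dialect you are actually using):

-- Run from a single node; per the answer above, the truncate applies cluster-wide.
TRUNCATE TABLE mytable;
-- Reload from the CSV file; exact COPY options depend on the database in use.
COPY mytable FROM '/path/to/data.csv';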
I am trying to write data to an Azure SQL DB with Azure Data Factory. I'm using a Copy Data task within a ForEach that loops through all the rows in an ETL table in the DB. In the pre-copy script, I have
TRUNCATE TABLE [master].[dbo].[#{item().DestinationObjectName}]
DestinationObjectName is the name of the table being loaded, taken from the ETL table. The problem I'm having is that for some of the tables (not all, some work perfectly fine) I get the error 'Cannot find the object % because it does not exist or you do not have permissions'. The account I'm using has all the necessary privileges. I can see the script that ADF sends, and when I copy it into the DB I can confirm it works for some tables but not others. If I do a SELECT TOP 1000 from the table in question, using the same object name as in the TRUNCATE TABLE script, it works. I'm really at a loss here. Like I said, the truncate works for the majority of the tables, but not all. I have also double-checked that the object names are exactly the same.
Any help is appreciated.
This issue has been solved. I had to drop the affected tables, remove the brackets surrounding each name in the CREATE TABLE statements, and recreate the tables without the brackets. Very strange issue.
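For anyone hitting the same thing, one way to spot the culprits (this diagnostic is my own suggestion, not part of the original fix) is to list user tables whose actual names contain bracket characters, since those are the ones the bracketed TRUNCATE statement cannot resolve:

-- Hypothetical diagnostic: find user tables whose real names contain '[' or ']'.
SELECT s.name AS schema_name, o.name AS object_name
FROM sys.objects AS o
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
WHERE o.type = 'U'
  AND (o.name LIKE '%[[]%' OR o.name LIKE '%]%');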
I have a table that I need to drop, delete the transaction log for, and recreate, but when I try to drop it I get the following error.
I have run a REPAIR TABLE statement on this table, which could be responsible for the error, but I'm not sure.
IllegalStateException: The transaction log has failed integrity checks. We recommend you contact Databricks support for assistance. To disable this check, set spark.databricks.delta.state.corruptionIsFatal to false. Failed verification of:
Table size (bytes) - Expected: 0 Computed: 63233
Number of files - Expected: 0 Computed: 1
We think this may just be related to S3 eventual consistency. Please try waiting a few extra minutes after deleting the Delta directory before writing new data to it. Also, a normal MSCK REPAIR TABLE doesn't do anything for Delta, as Delta doesn't use the Hive Metastore to store the partitions. There is an FSCK REPAIR TABLE, but that is for removing file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system.
We don't recommend overwriting a Delta table in place, like you might with a normal Spark table. Delta is not like a normal table - it's a table, plus a transaction log, and many versions of your data (unless fully vacuumed). If you want to overwrite parts of the table, or even the whole table, you should use Delta's delete functionality. If you want to completely change the table, consider writing to an entirely new directory, such as /table/v2/... and separately deleting the other table.
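For the delete route mentioned above, a minimal sketch in Databricks SQL (the table path and the predicate are placeholders to adapt):

-- Illustrative only: row-level delete against a Delta table referenced by path.
DELETE FROM delta.`/mnt/data/my_table`
WHERE event_date < '2019-01-01';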
To prevent the issue from occurring, you can use the command below (in a PySpark notebook):
spark.conf.set("spark.databricks.delta.state.corruptionIsFatal", False)
I have an instance of Crate 1.0.2, and I dropped a table from it. Then I re-created the table with the same name and a slightly modified schema, and imported data using the COPY FROM command. The file passed to COPY FROM contains 10,000 records, and the command runs OK. When I check the table tab in the Crate web console, it shows many partitions, each holding a few records. If I add up the record counts on that tab, the total comes close to 10k, but when I run "select count(*) from mytable", it returns only around 8,000 records. On further investigation I found that there are certain partitions whose data cannot be queried at all. Has anyone seen this problem? Does it have anything to do with dropping the table and re-creating it with the same name? I also observed that when a table is dropped, not all files related to that table are deleted from path.data. Could these leftover directories be the reason those partitions are not queryable? While importing, I saw a "Document already exists" exception, although I know my data does not have any duplicate values for the primary key column.
Some questions to clarify the issue:
Have you run refresh table mytable after your COPY command finished (see the sketch after these questions)?
Are you sure that with the new schema of the table, there are no duplicate records?
Since 1.x versions are not supported anymore, could you try with CrateDB 2.1.6, which is the current stable version, to see if the problem persists?
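For reference, a minimal import-then-verify sequence (the table and file names are placeholders; note that older CrateDB versions expect JSON input for COPY FROM):

COPY mytable FROM 'file:///tmp/import/data.json';
-- Without an explicit refresh, count(*) can lag behind the import.
REFRESH TABLE mytable;
SELECT count(*) FROM mytable;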
When I load a CSV file into a Hive table I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS
(e.g. /user/warehouse/dbname/tablename/datafile1.csv), and probably some metadata is updated.
After a few loads I want to remove the contents of a specific file from the table.
I am sure I cannot simply delete the file, because of the metadata that needs to be adjusted as well. There must be some kind of built-in function for this.
How do I do that?
Why do you need that? I mean, Hive was developed to serve as a warehouse where you put lots and lots of data, not to delete data every now and then. Such a need seems to point to a poorly thought-out schema or a poor use of Hive, at least to me.
And if you really have this kind of need, why don't you create partitioned tables? If you need to delete some specific data, just delete that particular partition using either TRUNCATE or ALTER.
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
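As a concrete illustration (the table name, partition column, and partition value are made up for the example):

-- Illustrative only: a table partitioned by load date, and removal of one
-- day's data by dropping its partition.
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (load_date STRING);

ALTER TABLE events DROP IF EXISTS PARTITION (load_date='2017-03-01');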
If this feature is needed more than just once in a while, you can use MapR's distribution, which allows this kind of operation with no problem (even via NFS). Otherwise, if you don't have partitions, I think you'll have to create a new table using CTAS, filtering out the data from the bad file, or copy the good files back to the OS with "hadoop fs -copyToLocal" and move them back into HDFS under a new table.
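A rough sketch of that CTAS route, using Hive's INPUT__FILE__NAME virtual column to drop the rows that came from the bad file (the table and file names are placeholders):

-- Illustrative only: rebuild the table without the rows loaded from datafile3.csv,
-- then swap the tables once the result has been verified.
CREATE TABLE tablename_clean AS
SELECT *
FROM tablename
WHERE INPUT__FILE__NAME NOT LIKE '%datafile3.csv';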
I'm using SQL*Loader to load my data into the database.
Before I insert the data I need to remove existing data in the table:
options(skip=1,load=250000,errors=0,ROWS=30000,BINDSIZE=10485760)
load data
infile 'G:1.csv' "str '^_^'"
replace
into table IMPORT_ABC
fields terminated by "," OPTIONALLY ENCLOSED BY '"'
trailing nullcols(
.
.
.
.)
But I got error like:
SQL*LOADER-926: OCI error while executing delete/truncate for table IMPORT_ABC
ORA-30036: unable to extend segment by 8 in undo tablespace 'undo1'
How can I delete the data in batches of, for example, 10000 rows?
I know that I have some limits on my DB.
Deleting records in batches can be done in a PL/SQL loop, but it is generally considered bad practice, as the entire delete should normally be treated as a single transaction; and in any case it can't be done from within the SQL*Loader control file. Your DBA should size the UNDO space to accommodate the work you need to do.
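If you do go down the batched-delete route despite that caveat, the loop would look something like this, run from SQL*Plus or another client before the load (the table name and batch size are placeholders):

-- Illustrative only: delete in chunks of 10,000 rows, committing after each chunk
-- so each transaction's undo stays small.
BEGIN
  LOOP
    DELETE FROM import_abc WHERE ROWNUM <= 10000;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
  COMMIT;
END;
/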
If you're deleting the entire table you'll almost certainly be better off truncating anyway, either in the control file:
options(skip=1,load=250000,errors=0,ROWS=30000,BINDSIZE=10485760)
load data
infile 'G:1.csv' "str '^_^'"
truncate
into table IMPORT_ABC
...
Or as a separate truncate statement in SQL*Plus/SQL Developer/some other client before you start the load:
truncate table import_abc;
The disadvantage is that your table will appear empty to other users while the new rows are being loaded, but if it's a dedicated import area (guessing from the name) that may not matter anyway.
If your UNDO is really that small then you may have to run multiple loads, in which case - probably obviously - you need to make sure you only have the truncate in the control file for the first one (or use the separate truncate statement), and have append instead in subsequent control files as you noted in comments.
You might also want to consider external tables if you're using this data as a base to populate something else, as there is no UNDO overhead on replacing the external data source. You'll probably need to talk to your DBA about setting that up and giving you the necessary directory permissions.
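A sketch of what such an external table might look like (the directory object, column list, and delimiters are assumptions to be adapted to the real file):

-- Illustrative only: an external table over the same CSV, read on the fly with
-- no UNDO generated; it needs an Oracle directory object and the matching grants.
CREATE TABLE import_abc_ext (
  col1 VARCHAR2(100),
  col2 NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY import_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('1.csv')
)
REJECT LIMIT UNLIMITED;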
Your undo tablespace is too small to hold all the undo information, and it seems it cannot be extended.
You can split the import into smaller batches and issue a commit after each batch, or get your DBA to increase the undo1 tablespace.
And use truncate instead of replace before you start the imports.