Is there a way to clear out the contents of the stl_load_errors table in Amazon's Redshift?
I am running batch processes to COPY into Redshift and it would be convenient if I could view the entire stl_load_errors in one go without having to filter by a time range.
When I attempt to DELETE FROM stl_load_errors, I get "ERROR: cannot delete from a system table"
When I attempt to TRUNCATE stl_load_errors, I get "ERROR: permission denied: "stl_load_errors" is a system catalog"
Nope, you can't delete from that table.
It's worth noting that Redshift will automatically clear down that table over time, i.e., it doesn't hold all load errors forever.
You can't delete from stl_load_errors, but if your COPY loads from S3 you can filter your SELECT on stl_load_errors by filename.
For example:
select * from stl_load_errors where filename like 's3://BUCKET/PREFIX_OF_PATH%'
The stl_load_errors table deletes old data on its own (usually data around a week old), so you don't need to worry about disk space.
You can use the query below to get all errors for your COPY command.
SELECT err.userid,
err.process,
err.recordtime,
err.pid,
err.errcode,
err.file,
err.linenum,
err.context,
err.error
FROM stl_error err,
stv_recents rec
WHERE rec.pid=err.pid
AND rec.status='running'
AND rec.query LIKE 'COPY%';
Please edit the 'COPY%' pattern in the query above to match your own command.
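For example, if the running load were a COPY into a hypothetical table called my_table (the table name is just a placeholder for illustration), the pattern could be narrowed like this:
-- Sketch only: match a single running COPY by the start of its statement.
SELECT err.userid, err.process, err.recordtime, err.pid,
       err.errcode, err.file, err.linenum, err.context, err.error
FROM stl_error err
JOIN stv_recents rec ON rec.pid = err.pid
WHERE rec.status = 'running'
  AND rec.query LIKE 'COPY my_table%';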
Related
I am trying to write data to an Azure SQL DB with Azure Data Factory. I'm using a Copy Data Task within a For Each that looks through all the rows in an ETL table in the DB. In the pre-copy script, I have
TRUNCATE TABLE [master].[dbo].[#{item().DestinationObjectName}]
DestinationObjectName being the name of the table that is being loaded, taken from the ETL table. The problem I'm having is that for some of the tables (not all, some work perfectly fine) I am getting the error 'Cannot find the object % because it does not exist or you do not have permissions'. The account I'm using has all the necessary privileges as well. I am able to see the script that is sent to ADF, which I have copied into the DB and confirmed it works sometimes but not every time. If I run a SELECT TOP 1000 against the table in question and substitute that object name into the TRUNCATE TABLE script, it works. I'm really at a loss here. As I said, the truncate works for the majority of the tables, but not all. I have also double-checked that the object names are exactly the same.
Any help is appreciated.
This issue has been solved. I had to drop the affected tables, remove the brackets surrounding each name in the CREATE TABLE statements, and recreate the tables without the brackets. Very strange issue.
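A minimal sketch of that kind of change, with a hypothetical table and columns standing in for the real ones:
-- Hypothetical example only: drop the affected table and recreate it
-- without brackets around the names in the CREATE TABLE statement.
DROP TABLE [dbo].[MyDestinationTable];

CREATE TABLE dbo.MyDestinationTable (
    Id INT NOT NULL,
    LoadDate DATETIME NULL
);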
I have a table that I need to drop, delete the transaction log for, and recreate, but when I try to drop it I get the following error.
I have run a REPAIR TABLE statement on this table, which could be responsible for the error, but I am not sure.
IllegalStateException: The transaction log has failed integrity checks. We recommend you contact Databricks support for assistance. To disable this check, set spark.databricks.delta.state.corruptionIsFatal to false. Failed verification of:
Table size (bytes) - Expected: 0 Computed: 63233
Number of files - Expected: 0 Computed: 1
We think this may just be related to s3 eventual consistency. Please try waiting a few extra minutes after deleting the Delta directory before writing new data to it. Also, normal MSCK REPAIR TABLE doesn't do anything for Delta, as Delta doesn't use the Hive Metastore to store the partitions. There is an FSCK REPAIR TABLE, but that is for removing the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system.
We don't recommend overwriting a Delta table in place, like you might with a normal Spark table. Delta is not like a normal table - it's a table, plus a transaction log, and many versions of your data (unless fully vacuumed). If you want to overwrite parts of the table, or even the whole table, you should use Delta's delete functionality. If you want to completely change the table, consider writing to an entirely new directory, such as /table/v2/... and separately deleting the other table.
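The delete functionality mentioned above is available through SQL on Databricks Delta tables; a minimal sketch, with a made-up table name and predicate:
-- Sketch only: remove part of a Delta table instead of overwriting it in place.
-- 'events' and the date filter are placeholders, not from the original question.
DELETE FROM events WHERE event_date < '2019-01-01';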
To keep the issue from occurring, you can use the command below (PySpark notebook):
spark.conf.set("spark.databricks.delta.state.corruptionIsFatal", False)
I am moving data around within Impala (not my design) and I have lost some data. I need to copy the data from the parquet tables back to their original non-parquet tables. Originally, the developers had done this with a simple one-liner in a script. Since I don't know anything about databases, and especially about Impala, I was hoping you could help me out. This is the one-liner used to convert to a parquet table, which I need reversed.
impalaShell -i <ipaddr> use db INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET TABLE;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.
Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your <text_table> does not exist, or use a table name that does not already exist, so that you do not accidentally overwrite other data.
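Alternatively, since the original <text_table> apparently still exists in your case, reversing the last line of your own script may be enough, assuming the two table definitions still match:
-- Sketch only: copy the parquet data back into the existing text table.
INSERT OVERWRITE <text_table> SELECT * FROM <parquet_table>;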
Is it possible in hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track possible mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I usually do is run hive -e "select * from db.table" > filename.tsv to get the data locally; however, when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly, doing it the way you are is the better of the two possible ways, but it is worth noting that you can perform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store the result somewhere in the local directory (as long as there is enough space and you have the correct privileges).
A disadvantage of this is that with a pipe you get the data stored nicely as '|'-delimited and newline-separated, but this method stores the values with Hive's default delimiter (Ctrl-A, '\001').
A workaround is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this works only in Hive 0.11 or higher.
When I load a (csv) file into a Hive table, I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS
(e.g. user/warehouse/dbname/tablName/datafile1.csv). And probably some metadata is updated.
After a few loads I want to remove the contents of a specific file from the table.
I am sure I cannot simply delete the file, because of the metadata that needs to be adjusted as well. There must be some kind of built-in function for this.
How do I do that?
Why do you need that? I mean, Hive was developed to serve as a warehouse where you put lots and lots of data, not to delete data every now and then. Such a need suggests a poorly thought-out schema or a poor use of Hive, at least to me.
And if you really have these kinds of needs, why don't you create partitioned tables? If you need to delete some specific data, just delete that particular partition using either TRUNCATE or ALTER; the syntax and a short sketch follow.
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
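For example, if each load went into its own partition, a sketch like this (hypothetical table and partition column) would let you remove just one load:
-- Hypothetical partitioned table: one partition per load.
CREATE TABLE my_table (col1 STRING, col2 INT)
PARTITIONED BY (load_date STRING);

-- Later, drop only the bad load.
ALTER TABLE my_table DROP IF EXISTS PARTITION (load_date='2014-05-01');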
If this feature is needed more than just once in a while, you can use MapR's distribution, which allows this kind of operation with no problem (even via NFS). Otherwise, if you don't have partitions, I think you'll have to create a new table using CTAS, filtering out the data from the bad file, or just copy the good files back to the OS with "hadoop fs -copyToLocal" and move them back into HDFS as a new table.
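As a sketch of the CTAS route (the filtering technique below, Hive's INPUT__FILE__NAME virtual column, is not mentioned in the answer above; the table and file names come from the question):
-- Sketch only: rebuild the table without the rows that came from one specific file.
-- INPUT__FILE__NAME is a Hive virtual column holding each row's source file path.
CREATE TABLE tablName_clean AS
SELECT * FROM tablName
WHERE INPUT__FILE__NAME NOT LIKE '%datafile1.csv';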