databricks error IllegalStateException: The transaction log has failed integrity checks - apache-spark-sql

I have a table that I need to drop, delete the transaction log for, and recreate, but while I am trying to drop it I get the following error.
I have run a REPAIR TABLE statement on this table, which could be responsible for the error, but I am not sure.
IllegalStateException: The transaction log has failed integrity checks. We recommend you contact Databricks support for assistance. To disable this check, set spark.databricks.delta.state.corruptionIsFatal to false. Failed verification of:
Table size (bytes) - Expected: 0 Computed: 63233
Number of files - Expected: 0 Computed: 1

We think this may just be related to S3 eventual consistency. Please try waiting a few extra minutes after deleting the Delta directory before writing new data to it. Also, a normal MSCK REPAIR TABLE doesn't do anything for Delta, as Delta doesn't use the Hive Metastore to store the partitions. There is an FSCK REPAIR TABLE, but that is for removing the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system.
We don't recommend overwriting a Delta table in place, like you might with a normal Spark table. Delta is not like a normal table - it's a table, plus a transaction log, and many versions of your data (unless fully vacuumed). If you want to overwrite parts of the table, or even the whole table, you should use Delta's delete functionality. If you want to completely change the table, consider writing to an entirely new directory, such as /table/v2/... and separately deleting the other table.
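For example, here is a minimal PySpark sketch of that approach; the DataFrame df, the paths /table/v1/ and /table/v2/, and the table name my_table_v2 are placeholders rather than values from the question:
# Write the corrected data as a brand-new Delta table in a fresh directory.
df.write.format("delta").save("/table/v2/")
# Optionally register the new location in the metastore under a new name.
spark.sql("CREATE TABLE my_table_v2 USING DELTA LOCATION '/table/v2/'")
# Separately remove the old directory once nothing reads from it any more.
dbutils.fs.rm("/table/v1/", True)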

To prevent this issue from occurring, you can use the command below (PySpark notebook):
spark.conf.set("spark.databricks.delta.state.corruptionIsFatal", False)

Related

BigQuery Scheduled Data Transfer throws "Incompatible table partitioning specification." Error - but error message is truncated

I'm using the new BQ Data Transfer UI and upon scheduling a Data Transfer, the transfer fails.
The error message in Run History isn't terribly helpful as the error message seems truncated.
Incompatible table partitioning specification. Expects partitioning specification interval(type:hour), but input partitioning specification is ; JobID: xxxxxxxxxxxx
Note the part of the error that says "...but input partitioning specification is..." with nothing between "is" and the semicolon. It seems this error message is truncated.
Some details about the run:
The run imports data from a CSV file located in a GCS bucket on a nightly basis. Once the data is successfully ingested, the process deletes the file. The target table in BQ is a partitioned table using the default partition pseudo-column (_PARTITIONTIME).
What I have done so far:
Reran the scheduled Data Transfer -- which failed and threw the same error
Deleted the target table in BQ and recreated it with different partition specifications (day, hour, month) -- then Reran Scheduled Transfer -- failed and threw same error.
Imported the data manually (I downloaded the file from GCS and uploaded it locally from my machine) using the BQ UI (Create Table, append the specific table) - Worked perfectly.
Checked to see if this was a known issue here on Stack Overflow and only found this (which is now closed) -- close, but not exactly the issue. (BigQuery Data Transfer Service with BigQuery partitioned table)
What I'm holding off on doing, since it would take a bit more work:
Change schema of the target BQ table to include a specified column specific for partitioning
Include a system-generated timestamp in the original file inside of GCS and ensure the process recognizes this as the partitioning field.
Am I doing something wrong? Or is this a known issue?
Alright, I believe I have solved this. It looks like you need to include runtime parameters in your target table name if the destination table is partitioned.
https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters
Specifically this section called "Runtime Parameter Examples" here: https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters#loading_a_snapshot_of_all_data_into_an_ingestion-time_partitioned_table
They also advise that minutes cannot be specified in these parameters.
You will need to append the runtime parameters to your destination table name in the transfer configuration.
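For example (an illustrative sketch rather than the exact configuration from this transfer): for a daily ingestion-time partitioned table, the destination table name can be written as mytable${run_date}, which loads each run into the partition for the run date; for an hour-partitioned table like the one in the error message above, the run_time parameter is formatted down to the hour instead, e.g. mytable${run_time|"%Y%m%d%H"}. The "Runtime Parameter Examples" section linked above shows the exact variants.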

Crate db cannot query data in a shard

I have an instance of Crate 1.0.2 and I dropped a table from it. Then I re-created the table with the same name and a slightly modified schema, and imported data using the COPY FROM command. The file passed to COPY FROM contains 10,000 records, and the command runs OK. When I check the table tab in the Crate web console, it shows many partitions added, each holding a few records. If I add up the number-of-records column on this tab, it comes close to 10k, but when I run "select count(*) from mytable", it returns only around 8,000 records. On further investigation I found that there are certain partitions whose data cannot be queried at all. Has anyone seen this problem? Does it have anything to do with dropping the table and re-creating it with the same name? I also observed that when a table is dropped, not all files related to that table are deleted from path.data. Are these leftover directories the reason those partitions have become non-queryable? While importing, I saw a "Document already exists" exception, although I know my data does not have any duplicate values for the primary key column.
Some questions to clarify the issue:
Have you run refresh table mytable after your copy command has finished?
Are you sure that with the new schema of the table, there are no duplicate records?
Since 1.x versions are not supported anymore, could you try with CrateDB 2.1.6 which is the current stable version to see if the problem persists?
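Regarding the first question above: CrateDB only makes newly written rows visible to queries after a table refresh, so a count taken right after a bulk copy from can legitimately be lower than the number of imported records. Running refresh table mytable and then repeating select count(*) from mytable is a quick way to rule that out before digging into the leftover directories under path.data.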

How to stop a Pentaho ETL process if some row fails?

I have a transformation which has the following stream:
The error handling hops are set to max error = 0, so when the transformation detects one error it stops.
The problem is that if the first row is correct, the ETL inserts this row into the final table and then stops the process.
Is it possible to check all the rows before running the inserts? That way, if some row fails, the data in the final table is not deleted (the truncate option is enabled).
Use a temporary/staging table in this transformation
If your storage space allows it, a staging table provides for the most reliable solution with the minimal downtime for your final table.
The staging table should be identical to the final table in structure. You can then run the transformation inside of a job and have the job proceed only on success to a SQL job step that renames final to old, staging to final and then old to staging.
This way your final table is never empty and only unavailable for a fraction of a second during the rename operation.
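For example, assuming the target database supports atomic multi-table renames the way MySQL does (the question does not name the database, so this is only a sketch), the SQL job step can perform the whole swap in a single statement such as RENAME TABLE final TO old, staging TO final, old TO staging; so the final table is never observed empty or half-loaded.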
Alternatively, you can achieve this by making the transformation database transactional in the transformation settings, so that all inserted rows are rolled back if any row fails.

Difference between invalidate metadata and refresh commands in Impala?

I saw the following at this link, which applies to Impala version 1.1:
Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.
Does this still hold true for later versions of Impala?
According to Cloudera's Impala guide (for Cloudera Enterprise 5.8, unchanged in 5.9):
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.
and related to working on existing tables:
The table name is a required parameter [for REFRESH]. To flush the metadata for all tables, use the INVALIDATE METADATA command. Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.
So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.
As per the Impala documentation on INVALIDATE METADATA and REFRESH:
INVALIDATE METADATA Statement
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive or another Hive client such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
REFRESH Statement
The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.
Usage notes:
The table name is a required parameter, and the table must already exist and be known to Impala.
Only the metadata for the specified table is reloaded.
Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
Deleting, adding, or modifying files.
For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, inserting data from Hive via INSERT or LOAD DATA.
Deleting, adding, or modifying partitions.
For example, after issuing ALTER TABLE or another table-modifying SQL statement in Hive.
INVALIDATE METADATA is used to refresh the metastore and the data (structure and data) = complete flush.
REFRESH is used to update only the data = lightweight flush.
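As a concrete illustration of the workflow described above, here is a minimal sketch using the impyla Python client (host, port, and table name are placeholders; impyla is simply one convenient way to issue these statements):
from impala.dbapi import connect

# Connect to an Impala daemon (placeholder host and port).
conn = connect(host='impalad.example.com', port=21050)
cur = conn.cursor()

# A table created outside Impala (e.g. in the Hive shell or SparkSQL) is unknown
# to Impala until its metadata is loaded with INVALIDATE METADATA.
cur.execute('INVALIDATE METADATA new_table')

# After new data files are added to an already-known table, a lightweight
# REFRESH picks up the new file and block locations.
cur.execute('REFRESH new_table')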

HSQLDB clear table data after restart

I want to save some temporary data in memory, which should be removed after the server shuts down.
There are temporary tables in HSQLDB, but their data is removed immediately after the transaction is committed, which is too short-lived for me. On the other hand, a MEMORY table keeps a script log file and restores the data when the server starts again. It takes time and space to maintain such script logs, which are useless for my situation.
What I need is a type of table where only the table structure is persisted on disk, while the data and the data operations stay in memory only. Otherwise, why would I need an in-memory DB instead of MySQL?
Is there such type of table in HSQLDB?
thanks
Create a file: database, then create the tables. Perform SHUTDOWN. Edit the .properties file of the database, add the setting below, and save.
files_readonly=true
When you perform your tests with this database, no data is written to disk.
Alternatively, with the latest versions of HSQLDB 2.2.x, you can specify this property on the connection URL during the tests. For example:
jdbc:hsqldb:file:myfilepath;files_readonly=true