Delta table not available from SQL endpoint, but it works well from notebook - azure-storage

I have mounted Azure Data Lake Storage in databricks with:
dbutils.fs.mount(
  source = "abfss://{0}@{1}.dfs.core.windows.net/".format(container_name, storage_account_name),
  mount_point = mount_path,
  extra_configs = cfg
)
After this I created a new delta table:
create table {my_database}.{my_tablename} (t1 int, t2 int) location 'dbfs:/mnt/{my_path}'
insert into {my_database}.{my_tablename} values (99, 99)
The table works fine in the notebooks.
I can query the data. Also, I see that a proper folder with the delta table is created in the container.
The problem is that when I try to query the data from a SQL Endpoint,
I get one of these errors:
dbfs:/mnt/{my_path} doesn't exist
OR
{my_database}.{my_tablename} is not a Delta table.

It appears to be working correctly today.
I waited around 12 hours.
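For anyone who hits the same errors, a quick way to check what the endpoint actually sees is to inspect the table definition from the SQL Endpoint itself; a minimal sketch, reusing the placeholder names from above:
-- confirms the table is registered as Delta and where its location points
DESCRIBE EXTENDED {my_database}.{my_tablename};
SHOW CREATE TABLE {my_database}.{my_tablename};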

Related

Schema change in Delta table - How to remove a partition from the table schema without overwriting?

Given a Delta table:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY (part_a, part_b, part_c)
LOCATION '/some/path/'
This table already has tons of data. However, the desired schema is:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY (part_a, part_b) -- <<-- ONLY part_a and part_b for partitions, i.e., removing part_c
LOCATION '/some/path/'
How can this schema change be achieved?
I eventually took the following approach:
1. Back up the original table to be on the safe side
2. spark.read.table into memory
3. df.write.option("overwriteSchema", "true") to the original location
I chose this approach so I don't need to change the original data location.
In more detail:
1. Back up the original table to be on the safe side
Since this was on Databricks, I could use its proprietary deep clone feature:
create table mydb.mytable_backup_before_schema_migration_v1
deep clone mydb.mytable
location 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'
If you are not on Databricks and don't have access to its deep clone, you can still back up the table by reading it and writing a copy to another location.
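For example, a plain copy into a separate location can be written as a CTAS; a minimal sketch, where the backup path is an assumption for illustration:
CREATE TABLE mydb.mytable_backup_before_schema_migration_v1
USING DELTA
LOCATION '/some/backup/path/mytable_backup_before_schema_migration_v1'  -- assumed path
AS SELECT * FROM mydb.mytable;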
2. read and 3. overwrite with new schema
val df = spark.read.format("delta").table("mydb.mytable")
df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes
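After the overwrite, it may be worth confirming that the table metadata now lists only the two partition columns; a minimal check, assuming the same table name:
-- partitionColumns should now contain only part_a and part_b
DESCRIBE DETAIL mydb.mytable;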

Files are not created when running CREATE TABLE in open-source Delta Lake

I am using AWS EMR with open-source Delta Lake.
In Python, dataframe.write.format('delta').save() works fine.
But I want to use it in SQL. I tried to create a Delta table in SQL as below.
spark.sql('''
CREATE OR REPLACE TABLE test.foo
(name string)
USING delta
LOCATION 's3://<bucket_name>/test/foo'
''');
But when I try to INSERT, an error is raised.
spark.sql('INSERT INTO test.foo (name) VALUES ("bar")');
ERROR: org.apache.spark.sql.AnalysisException: Path does not exist
The tables were created in the Glue metastore, but nothing was created at s3://<bucket_name>/test/foo in S3.
Is there any way to create a table in SQL? :)
To me it looks like you are using the wrong name: test.sql_delta. If you use SQL and created a SQL table, it will reference your physical metastore through the SQL table name you just created.
The code should be:
spark.sql('INSERT INTO test.foo (name) VALUES ("bar")')
SQL version:
%sql
INSERT INTO test.foo (name) VALUES ("bar")
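If the INSERT still cannot find the path, it may also help to confirm how the table was registered; a small check, assuming the table name from the question:
-- Provider should be delta and Location should point at s3://<bucket_name>/test/foo
DESCRIBE EXTENDED test.foo;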

ALTER TABLE table ADD IF NOT EXISTS PARTITION (state = '34' , city = '123') is not adding the partition in the temp/local folder

I am working on Hive table partitioning and using the Spark client to trigger the request.
I have created the table and inserted data with a partition. When I execute a select statement I can see the data, but after I add a partition with
spark.sql("ALTER TABLE temp_table6 ADD IF NOT EXISTS PARTITION (state = '34' , city = '123')")
I am not able to get the data from the second time onwards.
Since the Spark client looks for the partition folder in the temp location /tmp/temp_table6, Spark throws an exception like the one below:
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/temp_table6/state=34/city=123
From the second time onwards, the partition data is not created under the temp folder.
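For reference, a minimal sketch of the sequence described above; the non-partition column and the first partition's values are assumptions for illustration:
CREATE TABLE temp_table6 (id INT) PARTITIONED BY (state STRING, city STRING);
INSERT INTO temp_table6 PARTITION (state = '01', city = '001') VALUES (1);
SELECT * FROM temp_table6;  -- returns the inserted row
-- registers state=34/city=123 in the metastore, but no matching directory is written on disk
ALTER TABLE temp_table6 ADD IF NOT EXISTS PARTITION (state = '34', city = '123');
SELECT * FROM temp_table6;  -- fails with the InvalidInputException shown above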

Select count(*) from table, Select * from table doesn't yield any output

I am trying to build a managed table (ORC formatted, bucketed, and with the table property transactional set to true) on which I can run UPDATE/INSERT statements in Hive.
I am running this whole setup on AWS EMR; the Hive version is 2.4.3 and the default directory to store the data is S3.
I am able to populate the table from another external table.
However, SELECT COUNT(*) returns zero and SELECT * returns no output.
I dropped the table, recreated it, and repopulated the data.
ANALYZE TABLE TABLE-NAME COMPUTE STATISTICS gives the proper output.
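For context, a transactional table of the kind described is usually declared roughly like this; the table name, columns, and bucket count are assumptions for illustration:
CREATE TABLE my_acid_table (id INT, val STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');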

How to delete data from a table storage cache using Data Factory?

I have a data factory pipeline that is taking data from SQL Server and copying it to table storage:
Azure Sql Server View --> Table Storage Cache
This step is doing a REPLACE of the specific row based on a guid from the SQL Server view.
If a record is deleted from the source view, how do we delete that same record from table storage?
During the Copy Activity, we can't do operations on the table data. Data Factory does not support this operation.
I created a data pipeline to copy data from my SQL Server to table storage as a test. Before your Copy Activity begins, you can delete or insert records as you want, and you can preview the data in Data Factory. Once the Copy Activity is published and finished, your data has been copied to the table storage.
If you really need to delete the same record in Table Storage, one way is to log in to the Azure portal and use "Storage Explorer (preview)" to delete the record:
Hope this helps.
Here is a hint -- some code I have run in production for years... the idea is to have the source use a timestamp (or a simpler flag) on deleted records and then to have your MERGE respect it.
CREATE PROCEDURE [DataFactory].[MergeCompany]
(
@Company [DataFactory].[CompanyType] READONLY
)
AS
BEGIN
MERGE INTO [Company] AS TGT
USING @Company As SRC
ON TGT.pk = SRC.pk
--
WHEN MATCHED AND SRC.aud_delete_date IS NOT NULL
THEN DELETE
--
WHEN MATCHED AND SRC.aud_delete_date IS NULL
THEN UPDATE SET
TGT.comp_number = SRC.comp_number
,TGT.comp_name = SRC.comp_name
,TGT.aud_update_date = SRC.aud_update_date
,TGT.aud_create_date = SRC.aud_create_date
--
WHEN NOT MATCHED BY TARGET
THEN INSERT(
pk
,comp_number
,comp_name
,aud_update_date
,aud_create_date
)
VALUES(
SRC.pk
,SRC.comp_number
,SRC.comp_name
,SRC.aud_update_date
,SRC.aud_create_date
)
; -- Required semicolon
END
GO
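The procedure assumes a table-valued parameter type that is not shown in the answer; a possible shape for it, with column types as assumptions, would be:
CREATE TYPE [DataFactory].[CompanyType] AS TABLE
(
    pk              INT NOT NULL
   ,comp_number     NVARCHAR(50)     -- assumed type
   ,comp_name       NVARCHAR(200)    -- assumed type
   ,aud_update_date DATETIME2
   ,aud_create_date DATETIME2
   ,aud_delete_date DATETIME2        -- NULL unless the source row was deleted
);
GO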