How to used a cached table from a different notebook in databricks? - apache-spark-sql

I cached a table in Databricks (SQL Notebook) using
CACHE TABLE work_details AS SELECT (....)
The problem is that I can only access the cached table if I am in the same notebook. I want to use the table in a different notebook (same cluster) but it throws the error table or view not found
Is there any workaround for this?
EDIT:
Note, I cannot use views here because the cached table has a lot of rows and is further used to join different tables to create the final required table. If I use VIEWS instead of the cached table, the time taken to create the final table is increased which I do not want.
Ques: Why can't I cache the table again in the new notebook?
Ans: This is the solution I am using right now, but I need a workaround where I can use this table across multiple notebooks without having to cache it again and again and still have the same performance.

You typically create Temp view when creating cached table
• A Temp View is available across the context of a Notebook and is a common way of sharing data
• A Global Temp View is available to all Notebooks running on that Databricks Cluster
Workaround:
Create Global Temp View which will be accessible on all Notebooks running on that Cluster.
%sql
CREATE GLOBAL TEMP VIEW <global-view-name>
To access Global Temp View use below query
%sql
select * from global_temp.<global-view-name>;

Related

Bigquery - Updating GCS path of Bigquery external tables

I have a requirement where I might have to update the Bigquery External tables on a periodic basis.
The GCS location has timestamp for every incremental run, I would like to update to the latest timestamp folder as the path of External table.
One way i see is only dropping the table and creating again by pointing it to latest folder. But, is there any other way to update it without dropping the table
As suggested by #Samuel , you can use the SQL statement CREATE or REPLACE EXTERNAL TABLES for your requirement. Scheduled queries support DML and DDL statements which can be used to create the new tables. You can use the below mentioned query parameter to create the table according to your schedule :
My_database_name.my_table_name.my_results_{run_date}
For more information you can refer to this documentation.

How to use CETAS (Synapse Serverless Pool) in dbt?

In Synapse Serverless Pool, I can use CETAS to create external table and export the results to the Azure Data Lake Storage.
CREATE EXTERNAL TABLE external_table
WITH (
LOCATION = 'location/',
DATA_SOURCE = staging_zone,
FILE_FORMAT = SynapseParquetFormat
)
AS
SELECT * FROM table
It will create an external table name external_table in Synapse and write a parquet file to my staging zone in Azure Data Lake.
How can I do this in dbt?
I was trying to do something very similar and run my dbt project with Synapse Serverless Pool, but ran into several issues. Ultimately I was mislead by CETAS. When you create the external table it creates a folder hierarchy, in which it places the parquet file. If you were to run the same script like the one you have as an example it fails because you cannot overwrite with CETAS. So dbt would be able to run it like any other model, but it wouldn't be easy to overwrite. Maybe if you dynamically made a new parquet every time the script is run and deleted the old one, but that seems like putting a small bandage on the hemorrhaging wound that is the synapse and severless pool interaction. I had to switch up my architecture for this reason.
I was trying to export as a parquet to maintain the column datatypes and descriptions so I didn't have to re-schematize. Also so I could create tables based of incremental points in my pipeline. I ended up finding a way to pull from a database that already had the datatype schemas, using the dbt-synapse adapter. Then if I needed an incremental table, I could materialize it as a table via dbt and dbt-synapse and access it that way.
What is your goal with the exported parquet file?
Maybe we can find another solution?
Here's the dbt-synapse-serverless adapter github where it lists caveats for serverless pools.
I wrote a materialization for CETAS (Synapse Serverless Pool) here: https://github.com/intheroom/dbt-synapse-serverless
It's a forked from dbt-synapse-serverless here: https://github.com/dbt-msft/dbt-synapse-serverless
Also you can use hooks in dbt to use CETAS.

Unable to load managed table with maptype column (complex datatype) from external table in hive

I have external table with complex datatype,(map(string,array(struct))) and I'm able to select and query this external table without any issue.
However if I am trying to load this data to a managed table, it runs forever. Is there any best approach to load this data to managed table in hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert table to managed (though this may not work on cloudera distribution due warehouse dir restriction):
use DB;
alter table TBLSET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy files into it's location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(id string,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>> ) ;
--Check table location
use db;
desc formatted tbl_managed;
This will print location along with other info, use it to copy files.
Copy all files from external table location into managed table location, this will work most efficiently, much faster than insert..select:
hadoop fs -cp external/location/path/* managed/location/path
After copying files, table will be selectable. You may want to analyze table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]

Azure SQL: Is it possible to use the view from another database?

Being new to Azure SQL, I am already successfully JOINing my table with the table from the other database (by defining the CREATE EXTERNAL TABLE and the related things). However, some queries contain views from the other database, not the tables. How can I join my table with that external view? Is it possible at all? Or do I have to copy the code from the remote view and work directly with the external tables?
with create external table you can access a table or view :
take a look at
https://medium.com/#fbeltrao/access-another-database-in-azure-sql-1afc526b7ad4

Spark HiveContext does not retrieve newly inserted records from Hive Table

I am using Spark 1.4. HiveContext is used to connect Hive. I did the following
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
then, I inserted a few records into tab from beeline console
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why the HiveContext didn't retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this will refresh only metadata of the table (not the actual data)
Notes from official documentaition : (credits: https://spark.apache.org)
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached the metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache
To retrive newly inserted records:- uncache first and cache again using , uncacheTable(String tableName) and cacheTable(String tableName)
If the target table is partitioned, You need to insert with 'partition' option. If you miss out the partition, data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
On a differently slight case, I have an RDD coming from a Spark SQL statement via HiveContext. The solution which worked for me after some experiments was to actually regenerate the RDD itself.
It does not matter whether you are using the DDL by Spark SQL or sending SQL statements directly via hiveContext.sql.
I have seen around people using a "count trick" in order to force the recomputation of a dataset but at least in my attempts I couldn't get to see the new data this way.
Anyway trying caching, refreshing and friends did not work for me, if somebody has some proper pattern here please share.