cannot create a view in redshift spectrum external schema - sql

I am facing an issue in creating a view in an external schema on a spectrum external table. Below is the script I am using to create the view
create or replace view external_schema.test_view as
select id, name from external_schema.external_table with no schema binding;
I'm getting the error below:
ERROR: Operations on local objects in external schema are not enabled.
Please help me create a view on top of a Spectrum external table.

External tables are created in an external schema. An Amazon Redshift external schema references a database in an external data catalog, such as AWS Glue or Amazon Athena, or a database in a Hive metastore, such as on Amazon EMR.
External schemas are not stored in the Redshift cluster; they are looked up from their sources. External tables are read-only for the same reason.
As a result, you cannot bind a view you are creating to a schema that is not stored in the cluster. You can create a view on top of external tables (using the WITH NO SCHEMA BINDING clause), but the view must reside in a schema local to Redshift.
TL;DR Redshift doesn’t support creating views in external schemas yet, so the view can only reside in a schema local to Redshift.
Replace external_schema with internal_schema as follows:
create or replace view internal_schema.test_view as
select id, name from external_schema.external_table with no schema binding;
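If the local schema does not exist yet, you can create it first. A minimal sketch (internal_schema and test_view are just the placeholder names used above):
create schema if not exists internal_schema;
-- the view can then be created as shown above and queried like any local view:
select * from internal_schema.test_view;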

Related

Azure SQL: Is it possible to use the view from another database?

Being new to Azure SQL, I have already successfully JOINed my table with a table from the other database (by defining CREATE EXTERNAL TABLE and the related objects). However, some queries reference views in the other database, not tables. How can I join my table with such an external view? Is it possible at all? Or do I have to copy the code from the remote view and work directly with the external tables?
With CREATE EXTERNAL TABLE you can access a table or a view.
Take a look at:
https://medium.com/#fbeltrao/access-another-database-in-azure-sql-1afc526b7ad4
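For reference, an elastic query external table in Azure SQL can point at a view in the remote database the same way it points at a table. A rough sketch with made-up names (RemoteDb, RemoteCred, vw_remote, and the column list are all assumptions), assuming a database scoped credential already exists:
CREATE EXTERNAL DATA SOURCE RemoteDb WITH (
    TYPE = RDBMS,
    LOCATION = 'remoteserver.database.windows.net',
    DATABASE_NAME = 'RemoteDatabase',
    CREDENTIAL = RemoteCred   -- existing database scoped credential
);
CREATE EXTERNAL TABLE dbo.remote_report (
    id INT,
    name NVARCHAR(100)
) WITH (
    DATA_SOURCE = RemoteDb,
    SCHEMA_NAME = 'dbo',
    OBJECT_NAME = 'vw_remote'   -- name of the view in the remote database
);
-- the external table can then be joined like a local one:
SELECT l.*, r.name FROM dbo.local_table l JOIN dbo.remote_report r ON r.id = l.id;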

Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR?

Are two Hive tables (native, external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table (via CTAS, create table as select) from a Hive external table that was mapped to a DynamoDB table. My read queries against the external table are slow and consume the table's read throughput, whereas queries against the native table are fast and consume no read throughput.
My questions:
Is it a standard/best practice to create an external table mapped to a DynamoDB table, then create a CTAS copy and run all read queries against the CTAS table?
Where or how do GSIs on DynamoDB come into the picture on the Hive side of things? Out of curiosity I tried to map an external Hive table column to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, back to question #2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no.
However, from my observation, if a Hive native table is populated (via CTAS) from a Hive external table that references a DynamoDB table, reads against the native table from EMR are not counted against DynamoDB's read throughput. You do, however, have to account for periodically refreshing the native table's data.
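For illustration, the pattern looks roughly like this in HiveQL (table names, columns, and the column mapping are hypothetical):
-- external table mapped to a DynamoDB table; queries here consume DynamoDB read throughput
CREATE EXTERNAL TABLE ddb_orders (order_id string, customer string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,customer:Customer,total:Total"
);
-- native copy; read queries against it do not touch DynamoDB
CREATE TABLE orders_local AS SELECT * FROM ddb_orders;
-- periodic refresh (this step does read from DynamoDB again)
INSERT OVERWRITE TABLE orders_local SELECT * FROM ddb_orders;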

unioning tables from ec2 with aws glue

I have two MySQL databases, each on its own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog in a database called db1. Then I'm using AWS Glue to copy the tables from the EC2 instances into an S3 bucket. Then I'm querying the tables with Redshift. I get the external schema into Redshift from the AWS crawler using the script below in the query editor. I would like to union the two tables together into one table and add a column 'source' with a flag to indicate the original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is to create an ETL pipeline that does that before it gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';
You can create a view that unions the two tables using Athena; that view will then be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3 FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3 FROM db1.mysql_table_2
;
Run the above using Athena (not Redshift).
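If you also want the 'source' flag column mentioned in the question, you can add a literal to each branch of the union. A sketch using the same hypothetical column names:
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3, 'mysql_table_1' AS source FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3, 'mysql_table_2' AS source FROM db1.mysql_table_2
;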

Using Azure HDInsight and Hive

I have created an HDInsight cluster but want to upload a database through the portal and use Hive on it. What are the steps I need to take?
I know how to use Hive, but I don't know how to connect the data uploaded to the blob container with Hive. By the way, I am using PowerShell.
You need to link the storage account of the container with the HDInsight cluster.
To do that, add the following property to core-site.xml:
<property>
<name>fs.azure.account.key.[STORAGE ACCOUNT NAME].blob.core.windows.net</name>
<value>[STORAGE ACCOUNT KEY]</value>
</property>
Once it's linked, you will be able to access that storage account.
To create a Hive table on data residing in the blob, use an external Hive table with its LOCATION pointing to the blob directory of your data.
Example:
CREATE EXTERNAL TABLE table_name (col1 datatype, ....)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://[CONTAINER NAME]@[STORAGE ACCOUNT NAME].blob.core.windows.net/PATH/OF/DATA/';

Where does Hive store its tables?

I am new to Hadoop and I just started working on Hive. In my understanding, it provides a query language to process data in HDFS. With HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in our HDFS and we want to make a Hive table out of that data, what will be the size of that table and where is it stored?
If my understanding of this concept is wrong, please correct me.
If the table is 100 GB you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table the data itself will still be stored on HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a map of it in the metastore, whereas a managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data, as opposed to dropping a Hive external table, which only drops the metadata from the metastore referencing that data.
Either way you are using only 100 GB as viewed by the user, and you are taking advantage of HDFS's robustness through replication of the data.
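To make the difference concrete, a minimal HiveQL sketch (paths and column names are hypothetical):
-- external table: data stays at the given HDFS path; DROP TABLE removes only the metadata
CREATE EXTERNAL TABLE logs_ext (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/';
-- managed table: Hive stores the data under its warehouse directory; DROP TABLE deletes the data too
CREATE TABLE logs_managed AS SELECT * FROM logs_ext;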
Hive will create a directory on HDFS. If you didn't specify any location, it will create a directory under /user/hive/warehouse on HDFS. After the LOAD command, the files are moved into the warehouse directory under /user/hive/warehouse/tablename. You can also point the table to an existing HDFS directory if it contains partitions (if the files are partitioned), or use the external table concept.