Azure Data Lake - how to insert into external table in AzureSQL DB? - azure-data-lake

From Azure Data Lake, inserting records into an external table in Azure SQL DB produces the following error:
Error E_CSC_USER_CANNOTMODIFYEXTERNALTABLE: Modifying external table 'credDB.dbo.BuildInfosClone' is not supported.
External tables are read-only tables.
How can I insert records into the external database? My credential has read-write access. I am using a regular Azure SQL DB, not Azure SQL Data Warehouse.
Complete U-SQL code
CREATE DATA SOURCE myDataSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=MedicusMT2",
    CREDENTIAL = credDB.rnddref_admin,
    REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);

CREATE EXTERNAL TABLE IF NOT EXISTS dbo.BuildInfosClone
(
    [Key] string,
    [Value] string
)
FROM myDataSource LOCATION "dbo.BuildInfosClone";

INSERT INTO dbo.BuildInfosClone
    ([Key], [Value])
VALUES
    ("SampleKey", "SampleValue");

You cannot currently write directly to Azure SQL Data Warehouse tables using U-SQL. You could either write your data out to a flat file and then import it using PolyBase, or use Data Factory to orchestrate the copy.
Alternatively, you can use Azure Databricks to write directly to SQL Data Warehouse, as per this tutorial.
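For the flat-file route, a minimal U-SQL sketch might look like the following (the rowset name and output path are illustrative assumptions, not from the original script); the resulting CSV can then be loaded into the Azure SQL table with PolyBase, BCP, or Data Factory:
// Hypothetical sketch: build the rows that were meant for the external table
// and write them to a CSV file in the Data Lake store instead.
@rows =
    SELECT * FROM (VALUES ("SampleKey", "SampleValue")) AS T([Key], [Value]);

OUTPUT @rows
TO "/output/BuildInfosClone.csv"
USING Outputters.Csv(outputHeader : true);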

Related

Best way to transfer data from source table in one db to destination table in another db daily

What would be the best way to transfer a certain number of records daily from the source to the destination and then remove them from the source?
DB: SQL Server on the cloud.
As the databases are on the same server, you can create a job that transfers the data to the other database.
Because the databases are on the same server, you can easily access both just by prefixing the table name with the database name in the query. Look at the test that I did:
CREATE DATABASE [_Source]
CREATE DATABASE [_Destination]
CREATE TABLE [_Source].dbo.FromTable
(
some_data varchar(10)
)
CREATE TABLE [_Destination].dbo.ToTable
(
some_data varchar(10)
)
INSERT INTO [_Source].dbo.FromTable VALUES ('PAULO')
--THE JOB WOULD BE SOMETHING LIKE THIS:
-- INSERT INTO DESTINATION GETTING THE DATA FROM THE SOURCE
INSERT INTO [_Destination].dbo.ToTable
SELECT some_data
FROM [_Source].dbo.FromTable
-- DELETE FROM SOURCE
DELETE [_Source].dbo.FromTable
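To make the daily move safer, the copy and the delete can be wrapped in one transaction so rows are never removed from the source unless they were inserted into the destination. This is only a sketch using the test tables above; the schedule itself would typically be a SQL Server Agent job that runs once a day.
-- Sketch: move rows atomically between the two databases on the same server.
BEGIN TRANSACTION;

    INSERT INTO [_Destination].dbo.ToTable (some_data)
    SELECT some_data
    FROM [_Source].dbo.FromTable;

    DELETE FROM [_Source].dbo.FromTable;

COMMIT TRANSACTION;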

Are Databricks SQL tables & views duplicates of the source data, or do you update the same data source?

Let's say you create a table in DBFS as follows.
%sql
DROP TABLE IF EXISTS silver_loan_stats;
-- Explicitly define our table, providing schema for schema enforcement.
CREATE TABLE silver_loan_stats (
loan_status STRING,
int_rate FLOAT,
revol_util FLOAT,
issue_d STRING,
earliest_cr_line STRING,
emp_length FLOAT,
verification_status STRING,
total_pymnt DOUBLE,
loan_amnt FLOAT,
grade STRING,
annual_inc FLOAT,
dti FLOAT,
addr_state STRING,
term STRING,
home_ownership STRING,
purpose STRING,
application_type STRING,
delinq_2yrs FLOAT,
total_acc FLOAT,
bad_loan STRING,
issue_year DOUBLE,
earliest_year DOUBLE,
credit_length_in_years DOUBLE)
USING DELTA
LOCATION "/tmp/${username}/silver_loan_stats";
Later, you save data (a DataFrame named loan_stats) to this source LOCATION.
# Configure destination path
DELTALAKE_SILVER_PATH = f"/tmp/{username}/silver_loan_stats"
# Write out the table
loan_stats.write.format('delta').mode('overwrite').save(DELTALAKE_SILVER_PATH)
# Read the table
loan_stats = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)
display(loan_stats)
My questions are:
Are the table and the source data linked? So e.g. removing or joining data on the table updates it on the source as well, and removing or joining data on the source updates it in the table as well?
Does the above hold when you create a view instead of a table as well ('createOrReplaceTempView' instead of CREATE TABLE)?
I am trying to see the point of using Spark SQL when Spark DataFrames already offer a lot of functionality. I guess it makes sense to me if the two are effectively the same data, but if CREATE TABLE (or createOrReplaceTempView) means you create a duplicate, then I find it difficult to understand why you would put so much effort (and compute resources) into doing so.
The table and source data are linked in that the metastore contains the table information (silver_loan_stats) and that table points to the location as defined in DELTALAKE_SILVER_PATH.
The CREATE TABLE is really a CREATE EXTERNAL TABLE, as the table and its metadata are defined in DELTALAKE_SILVER_PATH - specifically in DELTALAKE_SILVER_PATH/_delta_log.
To clarify, you are not duplicating the data when you do this - it's just an intermixing of the SQL and DataFrame APIs over the same data. HTH!
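One way to see this (a quick check, assuming the table and path defined above) is to compare the table's registered location with the path the DataFrame was written to:
%sql
-- The 'location' column returned here should match DELTALAKE_SILVER_PATH,
-- i.e. the SQL table and the DataFrame written to that path are the same Delta files.
DESCRIBE DETAIL silver_loan_stats;
Likewise, createOrReplaceTempView only registers an existing DataFrame under a name for SQL access; it does not copy the underlying data.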

data appears as null on redshift external table while working right on athena

So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table are NULL). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but both Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is ORC.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
  `vehicle_id` bigint,
  `parent_id` bigint,
  `client_id` bigint,
  `assets_group` int,
  `drivers_group` int)
PARTITIONED BY (
  `dt` string,
  `datacenter` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'classification'='orc',
  'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set the parameters that specify what should be treated as NULL.
Add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns containing '' will be treated as NULL in your external table in Spectrum:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set up the SERDEPROPERTIES while creating the external table, which will automatically recognize NULL values.
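A minimal sketch of that alternative (the schema, table, column, and S3 path names are placeholders, not from the original question):
CREATE EXTERNAL TABLE external_schema.your_table_name (
  col1 varchar(100)   -- placeholder column; use your real column list
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.null.format' = '')
STORED AS TEXTFILE
LOCATION 's3://your-bucket/your-prefix/';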
Eventually it turned out to be a bug in Redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I ran SET json_serialization_enable TO true; before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit; see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
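For reference, a sketch of what the Spectrum approach looks like (the table and column names follow the linked AWS tutorial, not my own tables): nested structs and arrays are queried directly, so nothing has to be serialized into a VARCHAR.
-- Unnest the nested 'orders' array of each customer; c and o range over the
-- external table and its nested array, as shown in the AWS tutorial.
SELECT c.id, o.shipdate
FROM spectrum.customers c, c.orders o;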

Hive insert overwrite and Insert into are very slow with S3 external table

I am using AWS EMR. I have created external tables pointing to S3 location.
The "INSERT INTO TABLE" and "INSERT OVERWRITE" statements are very slow when using destination table as external table pointing to S3. The main issue is that Hive first writes data to a staging directory and then moves the data to the original location.
Does anyone have a better solution for this? Using S3 is really slowing down our jobs.
Cloudera recommends using the setting hive.mv.files.threads, but it looks like the setting is not available in the Hive provided in EMR or in Apache Hive.
OK, I am trying to provide more details.
Below is my source table structure:
CREATE EXTERNAL TABLE ORDERS (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://raw-tpch/orders/';
Below is the structure of the destination table:
CREATE EXTERNAL TABLE ORDERS_PARQ (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE decimal(12,2),
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
partitioned by (O_ORDERDATE string)
STORED AS PARQUET
LOCATION 's3://parquet-tpch/orders/';
The source table contains orders data for 2400 days and is 100 GB in size, so the destination table is expected to have 2400 partitions. I have executed the insert statement below.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.load.dynamic.partitions.thread=20;
set hive.mv.files.thread=25;
set hive.blobstore.optimizations.enabled=false;
set parquet.compression=snappy;
INSERT into TABLE orders_parq partition(O_ORDERDATE)
SELECT O_ORDERKEY, O_CUSTKEY,
O_ORDERSTATUS, O_TOTALPRICE,
O_ORDERPRIORITY, O_CLERK,
O_SHIPPRIORITY, O_COMMENT,
O_ORDERDATE from orders;
The query completes its map and reduce phases in 10 minutes but takes a lot of time to move data from /tmp/hive/hadoop/b0eac2bb-7151-4e29-9640-3e7c15115b60/hive_2018-02-15_15-02-32_051_5904274475440081364-1/-mr-10001 to the destination S3 path.
If I set the parameter hive.blobstore.optimizations.enabled=false, it takes time to move the data from the Hive staging directory to the destination table directory.
Surprisingly, I found one more issue: even though I set the compression to Snappy, the output table size is 108 GB, more than the raw input text file, which is 100 GB.

Create External Hive Table Pointing to HBase Table

I have a table named "HISTORY" in HBase with the column family "VDS" and the columns ROWKEY, ID, START_TIME, END_TIME, and VALUE. I am using the Cloudera Hadoop distribution and want to provide a SQL interface to the HBase table using Impala. To do this, do I have to create a corresponding external table in Hive? If so, how do I create an external Hive table pointing to this HBase table?
Run the following code in the Hive Query Editor:
CREATE EXTERNAL TABLE IF NOT EXISTS HISTORY
(
ROWKEY STRING,
ID STRING,
START_TIME STRING,
END_TIME STRING,
VALUE DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
(
"hbase.columns.mapping" = ":key,VDS:ID,VDS:START_TIME,VDS:END_TIME,VDS:VALUE"
)
TBLPROPERTIES("hbase.table.name" = "HISTORY");
Don't forget to refresh the Impala metadata after creating the external table, using the following bash command:
echo "INVALIDATE METADATA" | impala-shell;