Are Databricks SQL tables & views duplicates of the source data, or do you update the same data source?

Let's say you create a table in DBFS as follows.
%sql
DROP TABLE IF EXISTS silver_loan_stats;
-- Explicitly define our table, providing schema for schema enforcement.
CREATE TABLE silver_loan_stats (
loan_status STRING,
int_rate FLOAT,
revol_util FLOAT,
issue_d STRING,
earliest_cr_line STRING,
emp_length FLOAT,
verification_status STRING,
total_pymnt DOUBLE,
loan_amnt FLOAT,
grade STRING,
annual_inc FLOAT,
dti FLOAT,
addr_state STRING,
term STRING,
home_ownership STRING,
purpose STRING,
application_type STRING,
delinq_2yrs FLOAT,
total_acc FLOAT,
bad_loan STRING,
issue_year DOUBLE,
earliest_year DOUBLE,
credit_length_in_years DOUBLE)
USING DELTA
LOCATION "/tmp/${username}/silver_loan_stats";
Later, you save data (a DataFrame named 'loan_stats') to this source LOCATION.
# Configure destination path
DELTALAKE_SILVER_PATH = f"/tmp/{username}/silver_loan_stats"
# Write out the table
loan_stats.write.format('delta').mode('overwrite').save(DELTALAKE_SILVER_PATH)
# Read the table
loan_stats = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)
display(loan_stats)
My questions are:
Are the table and the source data linked? E.g., does removing or joining data in the table update the source as well, and does removing or joining data in the source update the table?
Does the above also hold when you create a view instead of a table ('createOrReplaceTempView' instead of CREATE TABLE)?
I am trying to see the point of using Spark SQL when Spark DataFrames already offer a lot of functionality. It makes sense to me if the two are effectively the same data, but if CREATE TABLE (or createOrReplaceTempView) means you create a duplicate, then I find it difficult to understand why you would put so much effort (and compute resources) into doing so.

The table and the source data are linked in that the metastore contains the table information (silver_loan_stats), and that table points to the location defined in DELTALAKE_SILVER_PATH.
The CREATE TABLE is really a CREATE EXTERNAL TABLE, as the table and its metadata are defined in DELTALAKE_SILVER_PATH - specifically in `DELTALAKE_SILVER_PATH/_delta_log`.
To clarify, you are not duplicating the data when you do this - it's just an intermixing of SQL vs. the DataFrame API. HTH!
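The `_delta_log` directory is what ties the table name and the path together: both resolve to the same transaction log, which lists the data files that make up the table. As a minimal illustration (with a made-up commit entry and file names, not real Databricks output), a Delta commit file is just JSON lines naming those files:

```python
import json

# Hypothetical contents of one Delta commit file, e.g.
# /tmp/<username>/silver_loan_stats/_delta_log/00000000000000000000.json
# (file names made up for illustration; real entries carry more fields)
commit = "\n".join([
    json.dumps({"add": {"path": "part-00000-abc.snappy.parquet", "size": 1024}}),
    json.dumps({"add": {"path": "part-00001-def.snappy.parquet", "size": 2048}}),
])

# Both spark.read.format("delta").load(path) and SELECT * FROM silver_loan_stats
# resolve to the data files listed in this log -- there is only one copy of the data.
data_files = [json.loads(line)["add"]["path"] for line in commit.splitlines()]
print(data_files)
```

So whether you reach the data through the SQL table name or through the path-based DataFrame API, you are reading and writing the same files.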

Related

Hive insert overwrite and Insert into are very slow with S3 external table

I am using AWS EMR. I have created external tables pointing to S3 location.
The "INSERT INTO TABLE" and "INSERT OVERWRITE" statements are very slow when using destination table as external table pointing to S3. The main issue is that Hive first writes data to a staging directory and then moves the data to the original location.
Does anyone have a better solution for this? Using S3 is really slowing down our jobs.
Cloudera recommends using the setting hive.mv.files.threads, but it looks like the setting is not available in the Hive provided by EMR or in Apache Hive.
OK, let me provide more details.
Below is my source table structure:
CREATE EXTERNAL TABLE ORDERS (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://raw-tpch/orders/';
Below is the structure of destination table.
CREATE EXTERNAL TABLE ORDERS (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE decimal(12,2),
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
partitioned by (O_ORDERDATE string)
STORED AS PARQUET
LOCATION 's3://parquet-tpch/orders/';
The source table contains orders data for 2400 days. The size of the table is 100 GB, so the destination table is expected to have 2400 partitions. I executed the insert statement below.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.load.dynamic.partitions.thread=20;
set hive.mv.files.thread=25;
set hive.blobstore.optimizations.enabled=false;
set parquet.compression=snappy;
INSERT into TABLE orders_parq partition(O_ORDERDATE)
SELECT O_ORDERKEY, O_CUSTKEY,
O_ORDERSTATUS, O_TOTALPRICE,
O_ORDERPRIORITY, O_CLERK,
O_SHIPPRIORITY, O_COMMENT,
O_ORDERDATE from orders;
The query completes its map and reduce parts in 10 minutes but takes a lot of time to move data from /tmp/hive/hadoop/b0eac2bb-7151-4e29-9640-3e7c15115b60/hive_2018-02-15_15-02-32_051_5904274475440081364-1/-mr-10001 to the destination S3 path.
If I set the parameter hive.blobstore.optimizations.enabled=false, it instead spends the time moving data from the Hive staging directory to the destination table directory.
Surprisingly, I found one more issue: even though I set the compression to Snappy, the output table size is 108 GB, more than the raw input text file, which is 100 GB.
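The slow "move" step comes down to the fact that object stores have no rename: on HDFS, renaming the staging directory is a single metadata operation, while on S3 each staged object must be copied to its destination key and then deleted. A back-of-the-envelope sketch of that cost difference (my own illustration, not Hive code; file counts are assumptions):

```python
# Rough cost model for the final "move" of staged files (illustrative only)
def hdfs_move_ops(num_files: int) -> int:
    # HDFS: renaming the staging directory is one metadata operation,
    # regardless of how many files it contains
    return 1

def s3_move_ops(num_files: int) -> int:
    # S3 has no rename: each object needs a server-side COPY plus a DELETE
    return 2 * num_files

# Say the 2400 partitions produce ~10 files each:
print(hdfs_move_ops(24000))  # 1
print(s3_move_ops(24000))    # 48000
```

This is why the same commit protocol that is cheap on HDFS becomes the dominant cost when the destination is an S3 external table.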

Mapping hbase table with counter column to external hive table?

I am trying to map a Hive table onto an HBase table so I can view it. I did this without a problem for several columns, but I am unsure how to manage a counter column. Is this possible?
When I scan the HBase table, an example value of the counter column is \x00\x00\x00\x00\x00\x00\x00\x01.
I suspect I am setting the column type incorrectly in the Hive table. I have tried int and string (both show only nulls in the Hive view). Is there a better way of getting the number of increments from this value? Ideally, I assume, I want a column in Hive that is the sum of all the increments.
It is entirely possible I am misunderstanding what is possible when viewing the counter (or how the counter was originally set up).
I ended up finding the answer through a link on the Cloudera community.
The answer is to define the counter column in the Hive table as bigint, and in the SERDEPROPERTIES to add '#b' to the end of the column mapping to indicate that the HBase column type is binary.
For example:
create external table md_extract_file_status ( table_key string, fl_counter bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,colfam:FL_Counter#b')
TBLPROPERTIES('hbase.table.name' ='HBTABLE');
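HBase stores a counter as an 8-byte big-endian signed long, which is why the raw cell shows up as \x00\x00\x00\x00\x00\x00\x00\x01 and why bigint plus the '#b' (binary) suffix is the right mapping. A quick sketch of what that decoding does, in plain Python:

```python
import struct

# Raw cell value as seen in the hbase shell scan
raw = b"\x00\x00\x00\x00\x00\x00\x00\x01"

# HBase counters are 8-byte big-endian signed longs (">q"); this is
# the decoding Hive's '#b' binary mapping performs into a bigint
value = struct.unpack(">q", raw)[0]
print(value)  # 1
```

Without the '#b' suffix, the serde tries to parse the bytes as a decimal string, which is why int and string mappings both produced nulls.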

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file's columns should be in this order:
int,string
But your file's contents are currently in this order:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work then.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a CSV file that matches the structure of the table, the data will be loaded automatically and you can use SELECT queries to see it.
Solved!
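The key detail above is the directory layout: with ADD PARTITION and no explicit LOCATION, Hive expects the data under a date_string=2016-09-06 subfolder of the table's location. A small Python sketch of laying files out that way (paths and file names are illustrative, using a temp directory as a stand-in for HDFS):

```python
import csv
import os
import tempfile

# Stand-in for the table LOCATION '/flumania/google_analytics/'
base = tempfile.mkdtemp()

# Hive's convention: one subdirectory per partition, named <column>=<value>
part_dir = os.path.join(base, "date_string=2016-09-06")
os.makedirs(part_dir)

# Comma-delimited rows matching the table schema (`session` INT)
with open(os.path.join(part_dir, "data.csv"), "w", newline="") as f:
    csv.writer(f).writerows([[86], [78]])

print(sorted(os.listdir(base)))  # ['date_string=2016-09-06']
```

Once the file sits under that partition directory, the ADD PARTITION statement (or MSCK REPAIR TABLE) makes the rows visible to queries.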

Hive: Creating an external table: Show partitions does nothing

I am creating an external table in my own database:
create external table test1 (
event_uid string,
event_type_id int,
event_category_id int,
event_date string,
log_date string,
server string,
server_type string,
process_id int,
device_id string,
account_id string,
ip_address string,
category_id string,
content_id string,
entitlement_id string,
product_id string,
sku string,
title_id string,
service_id string,
order_id bigint,
transaction_id bigint,
company_code string,
product_code string,
key_value_pairs map<string,string>,
process_run_id string)
partitioned by (A string, B string, C string)
location '/data/a1/pnt/lte/formatted/evt'
When I try SHOW PARTITIONS test1, I just get OK as output.
However, a table with the same DDL and the same location in another database does return results when I run SHOW PARTITIONS on it. I have also tried MSCK REPAIR TABLE test1, which does display partitions.
Please suggest what might be going wrong.
With a partitioned table, no actual partitions are created when you execute the DDL. Partitions are created when you load data into the table, so you need to load data first; only then will the SHOW PARTITIONS statement list them.
When we create an EXTERNAL TABLE with a PARTITION clause, we have to ALTER the EXTERNAL TABLE with the data location for each given partition. That location need not be under the same path we specified while creating the EXTERNAL TABLE.
hive> ALTER TABLE test1 ADD PARTITION (A=2016, B=07, C=19)
    > LOCATION '/data/a1/pnt/lte/formatted/evt/somedatafor_20160719';
When we specify LOCATION '/data/a1/pnt/lte/formatted/evt' while creating an EXTERNAL TABLE (though it is optional), we can take advantage of repair operations on that table. So when files are copied into that directory by some process such as ETL, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure of the partition that Hive would create for the next data set (say, C=20 here), we can simply place the data file in that location, e.g. '/data/a1/pnt/lte/formatted/evt/A=2016/B=07/C=20/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE test1;
The above statement will sync up the partitions with the Hive metastore for the table "test1".

Hive Partitioning Ineffective while using Hive Functions in Queries

I've been working on Hive partitioning for the past few days. Here is an example I created:
Table - Transactions (Non - Partitioned Managed Table):
CREATE TABLE TRANSACTIONS (
txnno INT,
txndate STRING,
custid INT,
amount DOUBLE,
category STRING,
product STRING,
city STRING,
state STRING,
spendby STRING)
row format delimited fields terminated by ',';
I loaded the data into this table using the LOAD command.
Then I created another table as follows:
Table - Txnrecordsbycat (Partitioned Managed Table):
CREATE TABLE TXNRECORDSBYCAT(txnno INT, txndate STRING, custno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING)
partitioned by (category STRING)
clustered by (state) INTO 10 buckets
row format delimited fields terminated by ',';
I used the following query to load the data from the Transactions table into the Txnrecordsbycat table.
FROM TRANSACTIONS txn INSERT OVERWRITE TABLE TXNRECORDSBYCAT PARTITION(category) select txn.txnno,txn.txndate,txn.custid, txn.amount, txn.product,txn.city, txn.state, txn.spendby,txn.category DISTRIBUTE BY CATEGORY;
Now, as long as I'm firing simple queries like SELECT * FROM transactions and SELECT * FROM txnrecordsbycat, I can see my queries being more efficient (i.e., taking less execution time) on the partitioned table than on the non-partitioned table.
However, as soon as my queries become a little more complex, something like SELECT COUNT(*) FROM the table, the query becomes less efficient (i.e., takes more time) on the partitioned table.
Any idea why this might be happening?
Many thanks for the help.