Azure Databricks INFORMATION_SCHEMA - apache-spark-sql

I am using Azure Databricks and need a way to find out which columns are allowed to be NULL in several tables. For MySQL there is the well-known INFORMATION_SCHEMA, which does not exist in Databricks.
My idea was to use Spark SQL to read the schema from there. I am now wondering whether this is an equivalent way to generate the information schema. My approach looks like this:
df = spark.sql("Select * from mytable")
df.schema
Any comment would be much appreciated!

By default, any column of a Spark DataFrame can be null. If you need to enforce that some data must not be null, you can either add your own checks before writing the data, or use the constraints supported by Delta tables, such as NOT NULL or CHECK (for arbitrary conditions). With these constraints in place, Spark checks the data before writing and fails the write if it violates a constraint, like this:
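Here is a minimal sketch (the table and column names are made up; NOT NULL and CHECK constraints require a Delta table and a sufficiently recent Databricks runtime):

CREATE TABLE events (
    id     BIGINT NOT NULL,
    amount DOUBLE
) USING DELTA;

-- CHECK constraints are added after the table is created
ALTER TABLE events ADD CONSTRAINT amount_is_positive CHECK (amount > 0);

-- Either of these writes now fails with a constraint-violation error:
-- INSERT INTO events VALUES (NULL, 10.0);   -- violates the NOT NULL constraint
-- INSERT INTO events VALUES (1, -5.0);      -- violates the CHECK constraint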
P.S. You can get more information about a table's schema and these constraints with SQL commands such as DESCRIBE TABLE or DESCRIBE TABLE EXTENDED.
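Continuing with the hypothetical events table from above, those commands look like this (exactly where the NOT NULL flags and CHECK constraints appear in the output varies between Databricks runtime versions, so treat this as a starting point):

DESCRIBE TABLE events;            -- column names, data types, comments
DESCRIBE TABLE EXTENDED events;   -- adds detailed table information, including table properties
SHOW CREATE TABLE events;         -- full DDL, usually the easiest place to spot constraints

And, going back to your original approach, each field of df.schema also carries a nullable flag that you can inspect per column.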

Related

How to create a bucketed ORC transactional table in Hive that is modeled after a non-transactional table

Suppose I have a non-transactional table in Hive named 'ccm'. It has hundreds of columns and one partition field.
I know how to create a copy with "create table abc like ccm", but I would like abc to be bucketed, stored as ORC, and have transaction support turned on via TBLPROPERTIES.
I do not want to mention all the columns in ccm when I compose the HQL.
Can I do this?
This answer may show the correct way to proceed in your case, and it also explains some limitations of the method used:
Create hive table using "as select" or "like" and also specify delimiter
So, from the example, you should add the missing parts:
CLUSTERED BY (...) INTO n BUCKETS
TBLPROPERTIES ("transactional"="true")
I have some doubts that you can achieve exactly the result you expect, but I would consider it a step forward.
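As a rough, hedged sketch of the target DDL (bucket column, bucket count, and partition spec are placeholders, and as noted above you may still have to spell out the column list, e.g. by copying it from the output of SHOW CREATE TABLE ccm):

CREATE TABLE abc (
    ... columns copied from ccm ...
)
PARTITIONED BY (... the same partition column as ccm ...)
CLUSTERED BY (some_column) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

-- then copy the data over (dynamic-partition settings may be required)
INSERT INTO TABLE abc SELECT * FROM ccm;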

getting Clustering/Bucketing columns programmatically

For reference, I am connecting to amazon-athena via sqlalchemy using essentially:
create_engine(
f'awsathena+rest://:#athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
echo=True)
In most relational databases that adhere to the ANSI-SQL standard, I can programmatically get the partition columns of a table by running something like the following:
select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
and extra_info = 'partition key'
However, the bucketing or clustering columns do not seem to be similarly flagged. I know I can access this information via:
show create table mySchema.myTable
but I am interested in a clean programmatic solution, if one exists. I am trying not to reinvent the wheel. Please show me how to do this or point me to the relevant documentation.
Thank you in advance.
PS: It would also be great if other information about the table, like the location of the files and the storage format, were accessible programmatically as well.
Athena uses the Glue Data Catalog to store metadata about databases and tables. I don't know how much of this is exposed in information_schema, and there is very little documentation about it.
However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case, if you call GetTable (e.g. aws glue get-table …), you will find the bucketing information in Table.StorageDescriptor.BucketColumns.
The GetTable call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions to retrieve the location of each partition's data).

Optimize Temporary Table on Presto/Hive SQL

I would like to optimize the computation time of queries run on Presto/Hive SQL. One of the techniques I used on Redshift was to improve the efficiency of temporary tables, as in the following:
BEGIN;
CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;
INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;
COMMIT;
I have tried that on Presto/Hive SQL, but it is not supported. Do you know the equivalent of this technique on Presto/Hive SQL?
Many thanks!
Redshift is a relational database, while Presto is a distributed SQL query engine. Presto currently supports neither the creation of temporary tables nor the creation of indexes. But you can create tables based on a SQL statement via CREATE TABLE AS - see the Presto documentation.
You can optimize the performance of Presto in two ways:
Optimizing the query itself
Optimizing how the underlying data is stored
One of the best articles around is Top 10 Performance Tuning Tips for Amazon Athena - Athena is an AWS service based on Presto 0.172, and therefore the tips should also work for Presto.
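As a minimal sketch (table and column names are made up), a Redshift-style temporary staging table becomes a plain CREATE TABLE AS in Presto:

CREATE TABLE my_staging_table AS
SELECT column_a, column_b
FROM my_table;

-- Presto has no temporary tables, so drop the table explicitly when you are done
DROP TABLE my_staging_table;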
I am not a Redshift expert, but it seems you want to precompute a data set, distributing it and sorting it by selected columns so that it is faster to query.
This corresponds to the Presto Hive connector's ability to:
partition data -- data with the same value in the partitioning column(s) forms a single partition, which is a folder on storage; do not use partitioning on high-cardinality columns. This is defined using the partitioned_by table property.
bucket data -- data is grouped into files using a hash of the bucketing column(s); this is similar to partitioning to a certain extent. This is defined using the bucketed_by and bucket_count table properties.
sort data -- within each data file, data is sorted by the given column(s). This is defined using the sorted_by table property.
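Putting the three properties together, a hedged sketch of a CREATE TABLE AS for the Hive connector could look like this (table, column, and partition names are made up, and exact property support varies by Presto/Trino version):

CREATE TABLE my_table_optimized
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['ds'],      -- partition columns must come last in the column list
    bucketed_by = ARRAY['column_a'],
    bucket_count = 32,
    sorted_by = ARRAY['column_b']
)
AS
SELECT column_a, column_b, ds
FROM my_table;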
See the examples in the Trino (formerly Presto SQL) Hive connector documentation.
Note: while I realize the documentation is scarce at the moment, I filed an issue to improve it. In the meantime, you can get additional information on the Trino (formerly Presto SQL) community Slack.

Hive date functions within create table

My Hive version is 1.2.0.
I am doing Hive-HBase integration where my HBase table is already present.
While creating the Hive table, I was checking whether I could use a few of Hive's built-in date functions as candidates for virtual/derived columns, something like this:
create external table `Hive_Test`(
*existing hbase columns*,
*new_column* AS to_date(from_unixtime(unix_timestamp(*existing_column*,'yyyy/MM/dd HH:mm:ss')...
)CLUSTERED BY (..) SORTED BY (new_column) INTO n BUCKETS
..
WITH SERDEPROPERTIES(
hbase.columns.mapping=':key,cf:*,:timestamp',
..
)
If there is any other way I can use built-in functions in CREATE TABLE, please let me know.
Thanks.
With reference to Hive Computed Column, I think you are defining logic at table-creation time, which is not possible with Hive.
You can refer to this article: Apache Hive Derived Column Support and Alternative.
A better way is to create a view on top of the non-native table created for the Hive-HBase integration; in the view you can do almost any kind of mapping that your use case requires.
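As a hedged sketch reusing the expression from your question (the view name and column names are placeholders for your real schema):

CREATE VIEW hive_test_view AS
SELECT
    t.*,
    to_date(from_unixtime(unix_timestamp(t.existing_column, 'yyyy/MM/dd HH:mm:ss'))) AS new_column
FROM Hive_Test t;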

What is the default schema in Hive?

For Pig, the default schema is ByteArray. Is there a default schema for Hive if we don't specify one? I tried to look at some Hive documentation but couldn't find any.
Hive is schema-on-read -- I am not sure this is the answer... If someone could give some insight on this, that would be great.
Hive does the best that it can to read the data. You will get lots of null values if there aren't enough fields in each record to match the schema. If some fields are numbers and Hive encounters nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to recover from all errors as best it can.
There is no default schema in Hive. In order to query data in Hive, you first have to create a table describing the content of your data (by using CREATE EXTERNAL TABLE ... LOCATION).
So you basically have to tell Hive the "schema" before querying the data.
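As a minimal, hedged sketch (the file location, delimiter, and column names are assumptions for illustration), declaring the schema at read time looks like this:

CREATE EXTERNAL TABLE my_logs (
    event_time STRING,
    user_id    INT,
    message    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/my_logs';

If a record in the underlying files has too few fields, or a non-numeric value in user_id, Hive returns NULL for those columns instead of failing, as described in the quote above.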