Bloom Filter Index on Delta Table - hive

Is it possible to create a Bloom filter index in Databricks on a Delta table using the file path, and not on the Hive table referencing that file location?
I tried the following:
CREATE BLOOMFILTER INDEX
ON TABLE delta.'gs://GCS_Bucket/Delta_Folder_Path'
FOR COLUMNS(colname OPTIONS(fpp=0.1, numItems=100))
But it doesn't work. I get the following error:
ParseException:
no viable alternative at input 'CREATE BLOOMFILTER'(line 1, pos 7)
== SQL ==
CREATE BLOOMFILTER INDEX
-------^^^
ON TABLE delta.'gs://GCS_Bucket/Delta_Folder_Path'
FOR COLUMNS(LOT_W OPTIONS(fpp=0.1, numItems=100))
Replacing delta.'gs://GCS_Bucket/Delta_Folder_Path' with a Hive external table that references the same location works as expected.
All the examples I found first create a table and then create the Bloom filter index on it, but this is not what we want.
We only want the tables in the gold layer, and some in silver, to be available in Hive. The table that I want to add a Bloom filter index to should not be in Hive.
Edit: This is on Databricks runtime 10.4 LTS

The error most probably arises because you're using ordinary quotes instead of backquotes for the path (doc). Try:
CREATE BLOOMFILTER INDEX
ON TABLE delta.`gs://GCS_Bucket/Delta_Folder_Path`
FOR COLUMNS(colname OPTIONS(fpp=0.1, numItems=100))
P.S. The error message points to the wrong position; I think that's a known issue.
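As a side note, if you later need to remove the index, the matching DROP statement should accept the same backquoted path form; a minimal sketch, assuming the path and column from the question:
DROP BLOOMFILTER INDEX
ON TABLE delta.`gs://GCS_Bucket/Delta_Folder_Path`
FOR COLUMNS(colname)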

Related

How to rename a table in Athena?

Probably a very trivial question, but I'm not sure about this and I also don't want to lose the table. How do I rename a table in Athena?
Database name - friends
Table name - centralPark
Desired table name - centralPerk
You can't!
See the list of unsupported DDL in Athena.
What you can do is make a new table using CTAS (the new name goes in CREATE TABLE and the old name in the SELECT):
CREATE TABLE centralPerk
AS SELECT * FROM centralPark
WITH DATA
and drop the old table:
DROP TABLE IF EXISTS centralPark
Using a CTAS query is effective, but I found it to be quite slow. It needs to copy all the files.
But you don't need to copy the files. You can create a new table directly in the Glue catalog and point it at the existing files. This works in seconds or less.
If you're using Python, I highly recommend the awswrangler library for this kind of work.
import awswrangler as wr

def wrangler_copy(db, old_name, new_name):
    # Register a new Glue table that points at the old table's existing
    # files and reuses its column types, so no data is copied.
    wr.catalog.create_parquet_table(
        db,
        new_name,
        path=wr.catalog.get_table_location(db, old_name),
        columns_types=wr.catalog.get_table_types(db, old_name),
        # TODO: partitions, etc
    )
And then drop the old table if you like.
DROP TABLE IF EXISTS <old_name>
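If you'd rather stay in SQL, the same no-copy idea can be sketched with SHOW CREATE TABLE plus a hand-edited DDL that reuses the old table's LOCATION; the columns and S3 path below are hypothetical placeholders:
-- Inspect the existing definition to copy its columns and location.
SHOW CREATE TABLE centralPark;
-- Re-issue the DDL under the new name, keeping the same LOCATION
-- so no data files are moved or copied.
CREATE EXTERNAL TABLE centralPerk (
  id int,
  name string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/path/to/centralPark/';
DROP TABLE IF EXISTS centralPark;
For a partitioned table you would also need to re-register the partitions (for example with MSCK REPAIR TABLE) before querying.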

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants exist for SQL Server, or for adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads data onto its own disks or into memory, Athena scans data in S3; this is how you enjoy the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
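A minimal sketch of such a CTAS, using the question's table1 and ourDateStringCol; the other column names and the bucket path are hypothetical:
CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://your-bucket/table1_partitioned/',
  partitioned_by = ARRAY['ourDateStringCol']
)
AS
SELECT col1, col2, ourDateStringCol  -- partition columns must come last
FROM table1;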

How to find when record was last updated?

How to find when table rows were last updated/inserted? Presto is ANSI-SQL compliant so even if you don't know Presto, maybe there's a generic SQL way that would point me in the right direction.
I'm using Hadoop. Presto queries are quicker than Hive. "Describe" just gives column names.
https://prestosql.io/docs/current/
Presto 309 added a hidden $properties table in the Hive connector for each table that exposes the Hive table properties. You can use it to find the last update time (replace example with your table name):
SELECT transient_lastddltime FROM "example$properties"
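If you're not sure of the exact property name, you can list everything the connector exposes for the table (again replacing example with your table name):
SELECT * FROM "example$properties"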

How to add a column in the middle of an ORC partitioned Hive table and still be able to query old partitioned files with the new structure

Currently I have a partitioned ORC "managed" (wrongly created as internal first) Hive table in prod with at least 100 days' worth of data partitioned by year, month, and day (~16 GB of data).
This table has roughly 160 columns. Now my requirement is to add a column in the middle of this table and still be able to query the older data (partitioned files). It's fine if the newly added column shows null for the old data.
What I did so far:
1) First, converted the table to external using the statement below, to preserve the data files before dropping:
alter table <table_name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
2) Dropped and recreated the table with the new column in the middle, and then altered the table to add the partitions.
However, I am unable to read the table after recreation. I get this error message:
[Simba][HiveJDBCDriver](500312) Error in fetching data rows: *org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: ORC does not support type conversion from file type array<string> (87) to reader type int (87):33:32;
Any other way to accomplish this?
No need to drop and recreate the table. Simply use the following statement.
ALTER TABLE default.test_table ADD columns (column1 string,column2 string) CASCADE;
ALTER TABLE ... ADD COLUMNS with the CASCADE clause changes the columns of the table's metadata and cascades the same change to all the partition metadata.
PS - This will add new columns to the end of the existing columns but before the partition columns. Unfortunately, ORC does not support adding columns in the middle as of now.
Hope that helps!
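If downstream consumers really do need to see the new column in the middle, one workaround is to add it at the end and expose the preferred ordering through a view; a sketch with hypothetical table, column, and view names:
-- Add the new column at the end (before the partition columns).
ALTER TABLE my_db.my_table ADD COLUMNS (new_col string) CASCADE;
-- Present the desired column order without touching the ORC files.
CREATE VIEW my_db.my_table_v AS
SELECT col_a, new_col, col_b, year, month, day
FROM my_db.my_table;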

Google BigQuery create table with no code? syntax

I was hoping to use basic SQL Create Table syntax within Google BigQuery to create a table based on columns in 2 existing tables already in BQ. The Google SQL dialect reference does not show a CREATE. All of the documentation seems to imply that I need to know how to code.
Is there any syntax or way to do a
CREATE TABLE XYZ AS
SELECT ABC.123, DFG.234
from ABC, DFG
?
You cannot do it entirely through a SQL statement.
However, the UI does allow you to save query results to a table (max result size is 64 MB compressed). The API and command-line clients have the same capabilities.