See managed tables in Databricks AWS - sql

I need to identify and list all managed tables in a Databricks AWS workspace. I can see this manually in the table details, but I need to do it for several thousand tables across different databases, and I cannot find a way to automate it.
The only way I found to tell programmatically whether a table is managed or external is the DESCRIBE TABLE EXTENDED command, but that returns the type as a value in a column, and its output cannot be filtered with SELECT or WHERE, even if I try running it as a subquery.
What is the easiest way to filter the managed tables?

spark.sql('use my_database')
df = spark.sql('show tables in my_database')
for t in df.collect():
    print('table {}'.format(t.tableName))
    # keep only the row of the description that marks the table as managed
    display(spark.sql('describe table extended {}'.format(t.tableName)).where("col_name='Type' and data_type='MANAGED'"))
    # use an if condition to keep only the MANAGED data_type and collect the database and table names
# loop over all databases using "show databases" in an outer loop
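The outer loop described in the comments above could also be written against the PySpark catalog API instead of parsing DESCRIBE output. The following is only a sketch of that alternative, not the method used in the answer: it assumes `spark.catalog.listDatabases()` and `spark.catalog.listTables()` are usable in your workspace, and the function name `list_managed_tables` is made up for illustration.

```python
def list_managed_tables(spark):
    """Return (database, table) pairs for every managed table visible to `spark`.

    `spark` is the SparkSession that a Databricks notebook already provides.
    """
    managed = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            # tableType is a string such as 'MANAGED', 'EXTERNAL', 'VIEW' or 'TEMPORARY'
            if tbl.tableType == 'MANAGED':
                managed.append((db.name, tbl.name))
    return managed

# In a notebook (assumption: `spark` is already defined there):
# for db_name, table_name in list_managed_tables(spark):
#     print('{}.{}'.format(db_name, table_name))
```

With several thousand tables the catalog calls may take a while, since each database is listed separately; the result is a plain Python list that can be turned into a DataFrame if needed.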

Related

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data from Table input (sql_db2, sql_source, table_name) and write it to Copy rows to result. Next I read a single record and put it into a loop.
But here I ran into a problem: I don't know how to compare the data dynamically, because each table has different columns.
Is this possible at all?
You can inject metadata (in this case, the column and table names) into many steps in Pentaho. You create one transformation that collects the metadata and injects it into a second, template transformation. The template contains only the steps and some basic settings; most of the column information used by its steps comes from the transformation that injects the metadata.
Check Pentaho official documentation about Metadata Injection (MDI) and the sample with a basic example of metadata injection available in your PDI installation.

getting Clustering/Bucketing columns programmatically

For reference, I am connecting to amazon-athena via sqlalchemy using essentially:
create_engine(
    f'awsathena+rest://:@athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
    echo=True)
In most relational databases that adhere to the ANSI-SQL standard, I can programmatically get the partition columns of a table by running something like the following:
select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
and extra_info = 'partition key'
However the bucketing or clustering columns seem to not be similarly flagged. I know I can access this information via:
show create table mySchema.myTable
but I am interested in a clean programmatic solution, if one exists. I am trying not to reinvent the wheel. Please show me how to do this or point me to the relevant documentation.
Thank you in advance.
PS: It would also be great if other information about the table, like location of files and storage format were also accessible programmatically.
Athena uses Glue Data Catalog to store metadata about databases and tables. I don't know how much of this is exposed in information_schema, and there is very little documentation about it.
However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case if you call GetTable (e.g. aws glue get-table …) you will find the bucketing information in Table.StorageDescriptor.BucketColumns.
The GetTable call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions to retrieve the location of each partition's data).
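As a rough illustration of the GetTable approach, the sketch below reads the bucketing columns, partition keys, location and format from a Glue client. It assumes a boto3 Glue client is passed in (created with `boto3.client('glue')`); the function name `describe_glue_table` and the keys of the returned dictionary are chosen here for illustration only.

```python
def describe_glue_table(glue, database, table):
    """Collect table metadata from the Glue Data Catalog via GetTable.

    `glue` is assumed to be a boto3 Glue client, e.g. boto3.client('glue').
    """
    response = glue.get_table(DatabaseName=database, Name=table)
    tbl = response['Table']
    sd = tbl.get('StorageDescriptor', {})
    return {
        # bucketing/clustering columns, as mentioned in the answer
        'bucket_columns': sd.get('BucketColumns', []),
        'partition_keys': [key['Name'] for key in tbl.get('PartitionKeys', [])],
        # base location of the table; per-partition locations need GetPartitions
        'location': sd.get('Location'),
        'input_format': sd.get('InputFormat'),
        'serde': sd.get('SerdeInfo', {}).get('SerializationLibrary'),
    }

# Usage (assumption: AWS credentials and region are configured):
# import boto3
# info = describe_glue_table(boto3.client('glue'), 'mySchema', 'myTable')
```

The function takes the client as a parameter so that it can be pointed at any region or profile without changing the code.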

Google BigQuery list tables

I need to list all tables in my BigQuery project, but I don't know how to do it. I tried searching but didn't find anything about it.
I need to know whether a table exists: if it does, I search it for a record; if not, I create the table and insert the record.
Depending on where and how you want to do this, you can use the CLI, API calls, or client libraries. Here you have all the info about listing tables.
As an example, if you want to list them using Command Line Interface, you can do it like:
bq ls <project>:<dataset>
If you want to use normal SQL queries, you can use the INFORMATION_SCHEMA Beta feature:
SELECT table_name from `<project>.<dataset>.INFORMATION_SCHEMA.TABLES`
(project is optional)
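For the client-library route, a minimal sketch of the existence check from the question might look like this. It assumes the `google-cloud-bigquery` Python client, whose `Client.list_tables()` yields items with a `table_id` attribute; the helper names `list_table_names` and `table_exists` are hypothetical.

```python
def list_table_names(client, dataset_id):
    """Return the table names in a dataset.

    `client` is assumed to be a google.cloud.bigquery.Client and
    `dataset_id` a string such as 'my_project.my_dataset'.
    """
    return [item.table_id for item in client.list_tables(dataset_id)]

def table_exists(client, dataset_id, table_name):
    """True if `table_name` is among the tables listed for the dataset."""
    return table_name in list_table_names(client, dataset_id)

# Usage (assumption: the library is installed and credentials are configured):
# from google.cloud import bigquery
# client = bigquery.Client()
# if not table_exists(client, 'my_project.my_dataset', 'my_table'):
#     ...  # create the table, then insert the record
```

Listing the whole dataset is simple but does more work than needed for a single lookup; for large datasets, a direct lookup of the one table may be preferable.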

Deleting records in a table with billion records using spark or scala

We have a table in Azure Data Warehouse with 17 billion records. We now have to delete records from this table based on a WHERE condition. We are writing Spark code in Scala in Azure Databricks notebooks.
We searched for different ways to do this in Spark, but every suggestion was to read the entire table first, delete the records from it, and then overwrite the entire table in Data Warehouse. This approach will not work in our case because of the huge number of records in the table.
Can you please suggest how we can achieve this using Spark/Scala?
1) We checked whether we can call a stored procedure from Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) We tried reading the entire table first in order to delete the records, but the job never finishes.
It is possible to create a view with a SELECT clause that matches your requirement, and then use that view.

Delete all rows containing a specific number across multiple tables in a postgresql database?

I'm fairly new to SQL and I have a large database that needs some cleanup. In most of the tables, there is a column called "ID number" and I want to go through all of these tables, check each ID number to see if it is on a list that I have of bad IDs, and if it is delete the entire table row containing the ID. Problem is, the list of bad IDs alone is over 3 million long and the total number of table entries is in the hundreds of millions. I don't really know where to start with this and was wondering if anyone could help me out?
You can do this with PL/PgSQL, using a query against the system catalogs to build DELETE queries with format(...) that you then run using EXECUTE.
There are lots of existing examples of such dynamic SQL on Stack Overflow, and of how to query the catalogs to build table lists. Use pg_catalog.pg_class and pg_catalog.pg_attribute or use the information_schema for schema info.
Remember to use format with the %I format-specifier for identifiers, don't just concatenate SQL text with ||.
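The answer builds the statements inside PL/PgSQL; the same idea can be sketched from a client script. The example below is a hand-rolled Python illustration, not the PL/PgSQL approach itself: `quote_ident` imitates identifier quoting by always wrapping the name in double quotes, and the staging table `bad_ids` with column `id` is an assumption about where the list of bad IDs has been loaded.

```python
# Lists every table that has the ID column. The %s placeholder follows the
# psycopg2 parameter style (an assumption about the driver in use).
FIND_TABLES_SQL = """
SELECT table_schema, table_name
FROM information_schema.columns
WHERE column_name = %s
  AND table_schema NOT IN ('pg_catalog', 'information_schema')
"""

def quote_ident(name):
    """Quote a PostgreSQL identifier: wrap it in double quotes and double any embedded quote."""
    return '"' + name.replace('"', '""') + '"'

def build_delete(schema, table, id_column='ID number',
                 bad_ids_table='bad_ids', bad_id_column='id'):
    """Build one DELETE that removes rows whose ID appears in the bad-ID table."""
    return 'DELETE FROM {s}.{t} AS t USING {b} AS b WHERE t.{c} = b.{bc}'.format(
        s=quote_ident(schema),
        t=quote_ident(table),
        b=quote_ident(bad_ids_table),
        c=quote_ident(id_column),
        bc=quote_ident(bad_id_column),
    )
```

Joining against a table of bad IDs avoids sending a 3-million-item IN list with every statement. Each generated statement would be run once per row returned by `FIND_TABLES_SQL`, ideally inside a transaction that can be rolled back.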
See:
Dynamic SQL with PL/PgSQL EXECUTE
format function
information_schema
System catalogs
this Stack Exchange search or this one.
Remember: Make sure you have good backups before attempting dynamic DML! A mistake can easily destroy all data in the database. Of course, you should have good backups - preferably PITR WAL archiving with PgBarman plus nightly dumps - anyway...