Apache Iceberg table format on ADLS / Azure Data Lake

I am trying to find an integration that lets me use the Iceberg table format on ADLS / Azure Data Lake to perform CRUD operations. Is it possible to use it on Azure without another computation engine like Spark? I think AWS S3 supports this use case. Any thoughts on it?

Spark can use Iceberg with the ABFS connector, HDFS, even local files. You just need to get the classpath and authentication right; a sketch of the configuration is below.
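A minimal PySpark sketch of what "classpath and authentication right" can look like. The storage account, container, catalog, and table names are placeholders, and the exact Iceberg / hadoop-azure versions must match your Spark build:

```python
from pyspark.sql import SparkSession

# Placeholder account/container names; package versions must match your Spark/Scala build.
spark = (
    SparkSession.builder
    .appName("iceberg-on-adls")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
            "org.apache.hadoop:hadoop-azure:3.3.6")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A Hadoop-type Iceberg catalog whose warehouse lives on ADLS via abfss://
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse",
            "abfss://mycontainer@myaccount.dfs.core.windows.net/warehouse")
    # ADLS authentication with an account key (a service principal / OAuth also works)
    .config("spark.hadoop.fs.azure.account.key.myaccount.dfs.core.windows.net",
            "<storage-account-key>")
    .getOrCreate()
)

# Basic CRUD through Spark SQL against the Iceberg catalog
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO lake.db.events VALUES (1, 'hello')")
spark.sql("UPDATE lake.db.events SET payload = 'updated' WHERE id = 1")
spark.sql("DELETE FROM lake.db.events WHERE id = 1")
```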

A bit late to the party, but Starburst Galaxy deploys Trino in any Azure region and has a Great Lakes connector that supports Hive (Parquet, ORC, CSV, etc.), Delta Lake and Iceberg. https://blog.starburst.io/introducing-great-lakes-connectivity-for-starburst-galaxy

Related

Export data table from Databricks dbfs to azure sql database

I am quite new to Databricks and looking for a smart way to export a data table from the Databricks gold schema to an Azure SQL database.
I am using Databricks as part of an Azure resource group; however, I do not find the Databricks data in any of the storage accounts within the same resource group. Does that mean it is physically stored in an implicit Databricks storage account / data lake?
Thanks in advance :-)
The tables you see in Databricks could have their data stored within that Databricks workspace file system (DBFS) or somewhere external (e.g. a Data Lake, which could be in a different Azure resource group) - see here: Databricks databases and tables.
For writing data from Databricks to Azure SQL, I would suggest the Apache Spark connector for SQL Server; a rough example follows.
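A rough sketch of the write path, assuming the Apache Spark connector for SQL Server / Azure SQL is installed on the cluster; the server, database, table, and secret names are placeholders:

```python
# Assumes the Apache Spark connector for SQL Server/Azure SQL is installed on the cluster
# (format "com.microsoft.sqlserver.jdbc.spark"); all names below are placeholders.
df = spark.table("gold.my_table")  # the gold-schema table to export

(
    df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")  # or "overwrite"
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb")
    .option("dbtable", "dbo.my_table")
    .option("user", "sqladmin")
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .save()
)
```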

Oracle Cloud to Azure Cloud storage

We have a requirement to move data from Oracle Cloud storage to Azure Cloud storage.
The requirement is basically to move data from an Oracle ADW database (hosted on Oracle Cloud) to a Snowflake database (hosted on Azure).
Since the data volume in the tables is huge (some with 60M+ records), we do not wish to use any ETL tool and instead want to set up a pipeline as below.
Oracle ADW database -> Store data in Oracle storage -> Move data to Azure Cloud storage -> Load into Snowflake using Snowpipe or similar Snowflake utilities.
How should I go about this implementation?
Please also share your views on whether we can use Oracle FastConnect and Azure ExpressRoute to pull data directly from Oracle Cloud into Snowflake (or into Azure storage).
I am looking for the same thing: the simplest method to get from Oracle (on prem, but could be cloud) into Snowflake. It looks like the data must be exported or dropped to external tables, shifted to Azure Blob Storage (Azure's equivalent of AWS S3), then pushed into Snowflake using COPY INTO - basically copying on-disk external tables. This is what Snowpipe does:
"Snowpipe copies the files into a queue, from which they are loaded into the target table in a continuous, serverless fashion based on parameters defined in a specified pipe object. The following table indicates the cloud storage service support for automated Snowpipe from Snowflake accounts hosted on each cloud platform:"
It's been a while since I have worked with this. The other option is GoldenGate, which was not expensive the last time I looked into it:
https://www.snowflake.com/blog/continuous-data-replication-into-snowflake-with-oracle-goldengate/
Easy, simple, fast. Any better ideas would be appreciated.
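For reference, a hedged sketch of the "load from Azure Blob into Snowflake with COPY INTO" step described above, using the Snowflake Python connector; the account, stage, table, and SAS token values are placeholders, and Snowpipe automates the same COPY logic via a PIPE object:

```python
import snowflake.connector

# Placeholder credentials/identifiers; in practice pull these from a key vault or env vars.
conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="LOADER",
    password="<password>",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)
cur = conn.cursor()

# External stage pointing at the Azure Blob container holding the exported files
cur.execute("""
    CREATE STAGE IF NOT EXISTS oracle_export_stage
      URL = 'azure://myaccount.blob.core.windows.net/oracle-export'
      CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
      FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")

# Bulk-load the staged files; Snowpipe runs this same COPY continuously.
cur.execute("COPY INTO STAGING.MY_TABLE FROM @oracle_export_stage PATTERN = '.*[.]csv'")
```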

Migrate on Premise Oracle DB of size 8TB to AWS RDS Oracle

The customer has Oracle 11g running on a 2-node RAC on premises. The size of the DB is 8 TB. I need to migrate the Oracle DB from on premises to AWS RDS for Oracle.
I plan to use Data Pump and AWS DMS with CDC. The customer requires zero or near-zero downtime during the migration.
But how can I take an 8 TB backup to S3 from on premises and download it from S3 into AWS? S3 has a limit of 5 TB per object.
Please help.
This can be achieved by using AWS Snowball Edge devices.
Larger data migrations can include many terabytes of information. This process can be cumbersome due to network bandwidth limits or just the sheer amount of data. AWS Database Migration Service (AWS DMS) can use Snowball Edge and Amazon S3 to migrate large databases more quickly than by other methods.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_LargeDBs.Process.html
AWS already has documentation on how to migrate an on-premises Oracle DB to AWS RDS for Oracle.
It mentions:
Database size limit: 64 TB
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-an-on-premises-oracle-database-to-amazon-rds-for-oracle.html
All the strategies for doing the migration are described in the PDF below:
https://d1.awsstatic.com/whitepapers/strategies-for-migrating-oracle-database-to-aws.pdf
For migrating large amounts of data to AWS, you can refer to the following docs:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_LargeDBs.html
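As a side note on the 5 TB-per-object question above: a common workaround (not taken from the docs linked here, so treat it as an assumption) is to let Data Pump split the export into several dump files below the limit and upload each piece separately, roughly like this:

```python
import glob
import subprocess

import boto3

# Assumption: expdp is available on the source host and DUMP_DIR is an Oracle
# directory object; FILESIZE keeps every dump file well under S3's 5 TB object limit.
subprocess.run(
    [
        "expdp", "system/<password>",
        "FULL=Y",
        "DIRECTORY=DUMP_DIR",
        "DUMPFILE=full_%U.dmp",   # %U makes Data Pump number the pieces
        "FILESIZE=500G",
        "PARALLEL=8",
    ],
    check=True,
)

# upload_file uses multipart upload under the hood, so large pieces are fine.
s3 = boto3.client("s3")
for path in glob.glob("/u01/dump/full_*.dmp"):   # placeholder dump directory
    s3.upload_file(path, "my-migration-bucket", f"oracle/{path.split('/')[-1]}")
```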

Hive ORC ACID table on AZURE Blob Storage possible for MERGE

On HDFS, Hive ORC ACID tables for Hive MERGE are no issue.
On S3 it is not possible.
For Azure HDInsight I am not clear from the docs whether such a table on Azure Blob Storage is possible. Seeking confirmation or otherwise.
I am pretty sure it's a no-go. See the update I added to the answer, however.
According to the official Azure HDInsight documentation (Azure HDInsight 4.0 overview): as far as I know, Hive MERGE requires MapReduce, but HDInsight does not support MapReduce for Hive, so it's also not possible.
UPDATE by question poster
HDInsight 4.0 doesn't support MapReduce for Apache Hive; use Apache Tez instead. So with Tez it will still work, and per https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-version-release, Spark with Hive 3 and the Hive Warehouse Connector are also options.
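For completeness, a rough sketch of what a Hive MERGE on Tez can look like from Python via PyHive; the HiveServer2 host, tables, and columns are made up, and the target must be a transactional ORC table:

```python
from pyhive import hive

# Placeholder HiveServer2 endpoint on the HDInsight cluster
conn = hive.connect(host="hn0-mycluster.internal", port=10000, username="hiveuser")
cur = conn.cursor()

# MERGE needs ACID + Tez; HDInsight 4.0 ships Hive 3 on Tez by default.
cur.execute("SET hive.execution.engine=tez")
cur.execute("SET hive.support.concurrency=true")
cur.execute("SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager")

# Target must be a transactional ORC table (stored on wasb:// / abfs:// in HDInsight)
cur.execute("""
    MERGE INTO target_tbl AS t
    USING updates_tbl AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET payload = u.payload
    WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.payload)
""")
```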

amazon athena and confluent schema registry

We are planning to offload events from Kafka to S3 (e.g. using Kafka Connect). The goal is to spin up a service (e.g. Amazon Athena) and provide a query interface on top of the exported Avro events. The obstacle is that the Amazon Athena Avro SerDe (which uses org.apache.hadoop.hive.serde2.avro.AvroSerDe) does not support the magic bytes that Schema Registry uses to store the schema ID. Do you know of any alternative that plays nice with the Confluent Schema Registry?
Thanks!
Using S3 Connect's AvroConverter does not put any schema ID in the file. In fact, after the message is written, you lose the schema ID entirely.
We have lots of Hive tables that are working fine with these files, and users are querying them using Athena, Presto, Spark SQL, etc.
Note: If you wanted to use AWS Glue, S3 Connect doesn't (currently, as of 5.x) offer automatic Hive partition creation like the HDFS Connector, so you might want to look for alternatives if you wanted to use it that way.
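To make that concrete, a hedged sketch of an S3 sink configuration that writes plain Avro container files (schema embedded in the file header, no Confluent magic bytes) which Athena's AvroSerDe can read; the connector plugin must be installed on the Connect cluster, and the hosts, bucket, and topic names are placeholders:

```python
import json

import requests

# Placeholder Connect REST endpoint; the Confluent S3 sink plugin must already be installed.
CONNECT_URL = "http://connect:8083/connectors"

config = {
    "name": "s3-avro-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "events",
        "s3.bucket.name": "my-events-bucket",
        "s3.region": "us-east-1",
        "flush.size": "10000",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        # AvroFormat writes Avro container files with the schema in the header,
        # so the Schema Registry schema ID never reaches S3.
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

# Register the connector with the Connect REST API and show the response
resp = requests.post(CONNECT_URL, json=config)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```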