Spark streaming writing as delta and checkpoint location - hive

I am trying to stream from a delta table as a source and then also writing as delta after performing some transformations. so, this all worked. I recently looked at some videos and posts about best practices and found that I needed to do an additional thing and a modification.
The addition was adding queryName
Changing the checkpoint location, so that it resides alongside the data and not in a separate directory , like I was doing.
So, I have one question and a problem
Question is- can I add the queryName now, after my stream has been running for sometime , without any consequences?
and the problem, is: Now, that I have put my checkpoint location as the same directory as my delta table would be , I can't seem to create an external hive table anymore , it seems. It fails with
pyspark.sql.utils.AnalysisException: Cannot create table ('`spark_catalog`.`schemaname`.`tablename`'). The associated location ('abfss://refined#datalake.dfs.core.windows.net/curated/schemaname/tablename') is not empty but it's not a Delta table
So, this was my original code, which worked
def upsert(microbatchdf, batchId):
.....some transformations on microbatchdf
..........................
..........................
# Create Delta table beforehand as otherwise generated columns can't be created
# after having written the data into the data lake with the usual partionBy
deltaTable = (
DeltaTable.createIfNotExists(spark)
.tableName(f"{target_schema_name}.{target_table_name}")
.addColumns(microbatchdf_deduplicated.schema)
.addColumn(
"trade_date_year",
"INT",
generatedAlwaysAs="Year(trade_date) ",
)
.addColumn(
"trade_date_month",
"INT",
generatedAlwaysAs="MONTH(trade_date)",
)
.addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
.partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
.location(
f"abfss://{target_table_location_filesystem}#{datalakename}.dfs.core.windows.net/{target_table_location_directory}"
)
.execute()
)
.....some transformations and writing to the delta table
#end
#this is how the stream is run
streamjob = (
spark.readStream.format("delta")
.table(f"{source_schema_name}.{source_table_name}")
.writeStream.format("delta")
.outputMode("append")
.foreachBatch(upsert)
.trigger(availableNow=True)
.option(
"checkpointLocation",
f"abfss://{target_table_location_filesystem}#{datalakename}.dfs.core.windows.net/curated/checkpoints/",
)
.start()
)
streamjob.awaitTermination()
Now, to this working piece , I only tried adding the queryName and modifying the checkpoint location (see comment for the modification and addition)
streamjob = (
spark.readStream.format("delta")
.table(f"{source_schema_name}.{source_table_name}")
.writeStream.format("delta")
.queryName(f"{source_schema_name}.{source_table_name}") # this added
.outputMode("append")
.foreachBatch(upsert)
.trigger(availableNow=True)
.option(
"checkpointLocation",
f"abfss://{target_table_location_filesystem}#{datalakename}.dfs.core.windows.net/{target_table_location_directory}/_checkpoint", # this changed
)
.start()
)
streamjob.awaitTermination()
In my datalake the _checkpoint did get created and apparently for this folder, the external table creation complains of non empty folder, whereas the documentation here, mentions that
So, why is the external hive table creation fails then? Also, please note my question about the queryName addition to an already running stream.
Point to note is- I have tried dropping the external table and also removed the contents of that directory, so there is nothing in that directory except the _checkpoint folder Which got created when I ran the streaming job , just before it got to creating the table inside the upsert method.
Any questions and I can help clarify.

The problem is that checkpoint files are put before you call the ``DeltaTable.createIfNotExists` function that checks if you have any data in that location or not, and fails because additional files are there, but they don't belong to the Delta Lake table.
If you want to keep checkpoint with your data, you need to put DeltaTable.createIfNotExists(spark)... outside of the upsert function - in this case, table will be created before any checkpoint files are created.

Related

AsterixDB ERROR: Code: 1 "HYR0010: Node asterix_nc2 does not exist" M1 Mac

I'm trying to set up a sample cluster with asterixDB on my M1 mac. I have my environment up and running and I am able to successfully make SQL queries with the following code:
drop dataverse csv if exists;
create dataverse csv;
use csv;
create type csv_type as {
lat: int32,
long: int32
};
create dataset csv_set (csv_type)
primary key lat;
However, when I try to load the dataset with a CSV file it seems to brick my sample cluster and throws the error: Error Code: 1 "HYR0010: Node asterix_nc2 does not exist". The code which causes this is below.
use csv;
load dataset csv_set using localfs
(("path"="127.0.0.1:///Users/nicholassantini/Downloads/test.csv"),
("format"="delimited-text"));
Thus far I have tried both java's newest release of version 18 and 17.0.3 as well as a variety of ports for the queries. I'm not sure what else to try. Some logs that I think are relevant say that it is failing to connect to the node. Not sure if that's an issue with the port or the node itself. Here is a snippet of those logs.
image.png
Also in case it matters, my CSV is a simple 2 column 2 row file with all single-digit integer values.
I appreciate any and all help.
After consulting the developer help email thread, I was able to find that the issue stems from the release of asterixDB that I was using (0.9.7.1). Upgrading to the newest release(0.9.8) fixed this issue.
The link can be found here:
https://ci-builds.apache.org/job/AsterixDB/job/asterixdb-snapshot-integration/lastSuccessfulBuild/artifact/asterixdb/asterix-server/target/asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip

AWS Glue - getSink() is throwing "No such file or directory" right after glue_context.purge_s3_path

I am trying to purge a partition of a glue catalog table and then recreate the partition using getSink option (similar to truncate/load partition in database)
For purging the partition , I am using glueContext.purge_s3_path option with retention period = 0 . The partition is getting purged successfully .
self._s3_path=s3://server1/main/transform/Account/local_segment/source_system=SAP/
self._glue_context.purge_s3_path(
self._s3_path,
{"retentionPeriod": 0, "excludeStorageClasses": ()}
)
Here Catalog database = Account , Table = local_segment , Partition_key = source_system
However when I am trying to recreate the partition right after the purge step , I am getting "An error occurred while calling o180.pyWriteDynamicFrame. No such file or directory" from getSink writeFrame .
If I remove the purge part then getSink is working fine and is able to create the partition and write the files .
I even tried "MSCK REPAIR TABLE" in between purge and getSink but no luck .
Shouldn't getSink create a partition if it does not exist i.e. purged from previous step ?
target = self._glue_context.getSink(
connection_type="s3",
path=self._s3_path_prefix,
enableUpdateCatalog=True,
updateBehavior="UPDATE_IN_DATABASE",
partitionKeys=["source_system"]
)
target.setFormat("glueparquet")
target.setCatalogInfo(
catalogDatabase=f"{self._target_database}",
catalogTableName=f"{self._target_table_name}"
)
target.writeFrame(self._dyn_frame)
Where -
self._s3_path_prefix = s3://server1/main/transform/Account/local_segment/
self._target_database = Account
self._target_table_name = local_segment
Error Message :
An error occurred while calling o180.pyWriteDynamicFrame. No such file or directory 's3://server1/main/transform/Account/local_segment/source_system=SAP/run-1620405230597-part-block-0-0-r-00000-snappy.parquet'
Try to check if you have permission for this object on s3. I got the same error and once I configured the object to be public (just for test), it worked. So maybe it’s a new object and your process might not have access.

How should I get snowpipe auto-ingest working?

Following is my snowpipe definition
create or replace pipe protection_job_runs_dms_test auto_ingest = true as
copy into protection_job_runs_dms_test_events from (select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10, t.$11, t.$12, t.$13, t.$14, t.$15, t.$16,
t.$17, t.$18, t.$19, t.$20, t.$21, t.$22, t.$23, t.$24, current_timestamp from #S3DMSTESTSTAGE t)
FILE_FORMAT = (
FIELD_OPTIONALLY_ENCLOSED_BY='"'
)
pattern='dmstest/(?!LOAD).*[.]csv';
When I am executing the copy command manually, it is working correctly.
Anyone knows what might be the issue ?
Regarding to the comments to your questions you tested your COPY-command by loading the same files before without Snowpipe. This means your files have been loaded once and thus you cannot load them afterwards with Snowpipe. Reason: Snowflake prevents loading files twice by default.
You can add the FORCE=true parameter to your COPY-command to prevent this behaviour and load all files - regardless of whether they have been loaded or not.
More infos about the FORCE-parameter here: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
create or replace pipe protection_job_runs_dms_test auto_ingest = true as
copy into protection_job_runs_dms_test_events from (select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10, t.$11, t.$12, t.$13, t.$14, t.$15, t.$16,
t.$17, t.$18, t.$19, t.$20, t.$21, t.$22, t.$23, t.$24, current_timestamp from #S3DMSTESTSTAGE t)
FILE_FORMAT = (
FIELD_OPTIONALLY_ENCLOSED_BY='"'
)
pattern='dmstest/(?!LOAD).*[.]csv'
force=true;

Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load JSON file from Staging area (S3) into Stage table using COPY INTO command.
Table:
create or replace TABLE stage_tableA (
RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from #stgS3/filename_45.gz file_format = (format_name = 'file_json')
Got the below error when executing the above (sample provided)
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes If you would like to continue loading
when an error is encountered, use other values such as 'SKIP_FILE' or
'CONTINUE' for the ON_ERROR option. For more information on loading
options, please run 'info loading_data' in a SQL client.
When I had put "ON_ERROR=CONTINUE" , records got partially loaded, i.e until the record with more than max size. But no records after the Error record was loaded.
Was "ON_ERROR=CONTINUE" supposed to skip only the record that has max size and load records before and after it ?
Yes, the ON_ERROR=CONTINUE skips the offending line and continues to load the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function
Try setting the option strip_outer_array = true for file format and attempt the loading again.
The considerations for loading large size semi-structured data are documented in the below article:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if the there are in fact more than 1 JSON objects in the file. If it's 1 massive object then you would simply not get an error or the record loaded when using ON_ERROR=CONTINUE.
If you know your JSON payload is smaller than 16mb then definitely try the strip_outer_array = true. Also, if your JSON has a lot of nulls ("NULL") as values use the STRIP_NULL_VALUES = TRUE as this will slim your payload as well. Hope that helps.

Can I save a trace file/extended events file to another partition other than the C drive on the server? Or another server altogether?

I've recently set some traces and extended events up and running in SQL on our new virtual server to show the access that users have to each database and whether they have logged in recently, and have set the file to save as a physical file on the server rather than writing to a SQL table to save resource. I've set the traces as jobs running at 8am each morning with a 12-hour delay so we can record as much information as possible.
Our IT department ideally don't want anything other than the OS on the C drive of the virtual server, so I'd like to be able to write the trace from my SQL script either to a different partition or to another server altogether.
I have attempted to insert a direct path to a different server within my code and have entered a different partition to C, however unless I write the trace/extended event files to the C drive I get an error message.
CREATE EVENT SESSION [LoginTraceTest] ON SERVER
ADD EVENT sqlserver.existing_connection(SET collect_database_name=
(1),collect_options_text=(1)
ACTION(package0.event_sequence,sqlos.task_time,sqlserver.client_pid,
sqlserver.database_id,sqlserver.
database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.request_id,sqlserver.server_principal_sid,sqlserver.session_id,sqlserver.session_nt_username,
sqlserver.sql_text,sqlserver.username)),
ADD EVENT sqlserver.login(SET collect_database_name=
(1),collect_options_text=(1)
ACTION(package0.event_sequence,sqlos.task_time,sqlserver.client_pid,sqlserver.database_id,sqlserver.database_name,sqlserver.is_system,sqlserver.nt_username,sqlserver.request_id,sqlserver.server_principal_sid,sqlserver.session_id,sqlserver.
session_nt_username,sqlserver.sql_text,sqlserver.username) )
ADD TARGET package0.asynchronous_file_target (
SET FILENAME = N'\\SERVER1\testFolder\LoginTrace.xel',
METADATAFILE = N'\\SERVER1\testFolder\LoginTrace.xem' );
The error I receive is this:
Msg 25641, Level 16, State 0, Line 6
For target, "package0.asynchronous_file_target", the parameter "filename" passed is invalid. Target parameter at index 0 is invalid
If I change it to another partition rather than a different server:
SET FILENAME = N'D:\Traces\LoginTrace\LoginTrace.xel',
METADATAFILE = N'D:\Traces\LoginTrace\LoginTrace.xem' );
SQL server states that the command completed successfully, but the file isn't written to the partition.
Any ideas please as to what I can do to write the files to another partition or server?