AWS Glue - getSink() is throwing "No such file or directory" right after glue_context.purge_s3_path - amazon-s3

I am trying to purge a partition of a Glue catalog table and then recreate the partition using the getSink option (similar to a truncate/load of a partition in a database).
For purging the partition I am using the glueContext.purge_s3_path option with retention period = 0. The partition is getting purged successfully.
self._s3_path = "s3://server1/main/transform/Account/local_segment/source_system=SAP/"
self._glue_context.purge_s3_path(
    self._s3_path,
    {"retentionPeriod": 0, "excludeStorageClasses": ()}
)
Here, catalog database = Account, table = local_segment, partition key = source_system.
However, when I try to recreate the partition right after the purge step, I get "An error occurred while calling o180.pyWriteDynamicFrame. No such file or directory" from getSink writeFrame.
If I remove the purge part, getSink works fine and is able to create the partition and write the files.
I even tried MSCK REPAIR TABLE between the purge and getSink, but no luck.
Shouldn't getSink create the partition if it does not exist, i.e. because it was purged in the previous step?
target = self._glue_context.getSink(
    connection_type="s3",
    path=self._s3_path_prefix,
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["source_system"]
)
target.setFormat("glueparquet")
target.setCatalogInfo(
    catalogDatabase=f"{self._target_database}",
    catalogTableName=f"{self._target_table_name}"
)
target.writeFrame(self._dyn_frame)
Where:
self._s3_path_prefix = s3://server1/main/transform/Account/local_segment/
self._target_database = Account
self._target_table_name = local_segment
Error message:
An error occurred while calling o180.pyWriteDynamicFrame. No such file or directory 's3://server1/main/transform/Account/local_segment/source_system=SAP/run-1620405230597-part-block-0-0-r-00000-snappy.parquet'

Try checking whether you have permission for this object on S3. I got the same error, and once I configured the object to be public (just as a test), it worked. So maybe it's a new object and your process doesn't have access to it.
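One quick way to verify that from the job's side (a minimal sketch, assuming boto3 is available to the role running the job; the bucket and key are taken from the error message) is a head_object call: a 403 points to a permissions problem, while a 404 means the object really is gone, e.g. removed by the purge.
import boto3
from botocore.exceptions import ClientError

# Check whether the role running this code can read the object from the error message.
s3 = boto3.client("s3")
bucket = "server1"
key = "main/transform/Account/local_segment/source_system=SAP/run-1620405230597-part-block-0-0-r-00000-snappy.parquet"

try:
    s3.head_object(Bucket=bucket, Key=key)
    print("object is readable by this role")
except ClientError as err:
    # "403" -> permission problem, "404" -> the object does not exist (e.g. purged)
    print("head_object failed:", err.response["Error"]["Code"])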

Spark streaming writing as delta and checkpoint location

I am trying to stream from a Delta table as a source and then also write as Delta after performing some transformations. This all worked. I recently looked at some videos and posts about best practices and found that I needed to make an addition and a modification:
The addition was adding queryName.
The modification was changing the checkpoint location so that it resides alongside the data and not in a separate directory, as I was doing before.
So I have one question and a problem.
The question is: can I add the queryName now, after my stream has been running for some time, without any consequences?
And the problem is: now that I have put my checkpoint location in the same directory as my Delta table, I can't seem to create an external Hive table anymore. It fails with
pyspark.sql.utils.AnalysisException: Cannot create table ('`spark_catalog`.`schemaname`.`tablename`'). The associated location ('abfss://refined@datalake.dfs.core.windows.net/curated/schemaname/tablename') is not empty but it's not a Delta table
So, this was my original code, which worked:
def upsert(microbatchdf, batchId):
    # ...some transformations on microbatchdf...

    # Create the Delta table beforehand, as otherwise generated columns can't be
    # created after having written the data into the data lake with the usual partitionBy
    deltaTable = (
        DeltaTable.createIfNotExists(spark)
        .tableName(f"{target_schema_name}.{target_table_name}")
        .addColumns(microbatchdf_deduplicated.schema)
        .addColumn(
            "trade_date_year",
            "INT",
            generatedAlwaysAs="Year(trade_date)",
        )
        .addColumn(
            "trade_date_month",
            "INT",
            generatedAlwaysAs="MONTH(trade_date)",
        )
        .addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
        .partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
        .location(
            f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/{target_table_location_directory}"
        )
        .execute()
    )

    # ...some transformations and writing to the Delta table...
# end
# this is how the stream is run
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option(
        "checkpointLocation",
        f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/curated/checkpoints/",
    )
    .start()
)
streamjob.awaitTermination()
Now, to this working piece, I only tried adding the queryName and modifying the checkpoint location (see the comments for the addition and the modification):
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .queryName(f"{source_schema_name}.{source_table_name}")  # this was added
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option(
        "checkpointLocation",
        f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/{target_table_location_directory}/_checkpoint",  # this was changed
    )
    .start()
)
streamjob.awaitTermination()
In my data lake the _checkpoint folder did get created, and it is apparently this folder that makes the external table creation complain about a non-empty location, whereas the documentation I was following suggests keeping the checkpoint alongside the data.
So why does the external Hive table creation fail then? Also, please note my question about adding the queryName to an already running stream.
A point to note: I have tried dropping the external table and also removing the contents of that directory, so there is nothing in that directory except the _checkpoint folder, which got created when I ran the streaming job, just before it got to creating the table inside the upsert method.
If anything is unclear, I am happy to clarify.
The problem is that the checkpoint files are written before you call the `DeltaTable.createIfNotExists` function, which checks whether there is already any data in that location and fails because additional files are there that don't belong to the Delta Lake table.
If you want to keep the checkpoint with your data, you need to move DeltaTable.createIfNotExists(spark)... outside of the upsert function; that way the table is created before any checkpoint files are written.
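A minimal sketch of that restructuring, reusing the names from the question (the schema source, the shortened generated-column list, and the elided transformations are assumptions here, so adapt as needed):
from delta.tables import DeltaTable

# `spark` and the *_name / *_location_* variables are assumed to be defined as in the question.
target_table_path = (
    f"abfss://{target_table_location_filesystem}@{datalakename}"
    f".dfs.core.windows.net/{target_table_location_directory}"
)

# 1) Create the Delta table once, before the stream (and hence its checkpoint) exists.
#    The schema now has to be known up front, e.g. taken from the source table,
#    instead of from the micro-batch inside upsert.
(
    DeltaTable.createIfNotExists(spark)
    .tableName(f"{target_schema_name}.{target_table_name}")
    .addColumns(spark.table(f"{source_schema_name}.{source_table_name}").schema)
    .addColumn("trade_date_year", "INT", generatedAlwaysAs="YEAR(trade_date)")
    .addColumn("trade_date_month", "INT", generatedAlwaysAs="MONTH(trade_date)")
    .addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
    .partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
    .location(target_table_path)
    .execute()
)

def upsert(microbatchdf, batchId):
    # ...transformations and the actual write/merge into the already existing table...
    pass

# 2) Only then start the stream, with the checkpoint under the table directory.
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .queryName(f"{source_schema_name}.{source_table_name}")
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{target_table_path}/_checkpoint")
    .start()
)
streamjob.awaitTermination()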

I am unable to copy sample data to populate a table in Coginity Pro from Redshift

I have been trying to copy data into a table in Coginity Pro, but I get the error message below.
I copied my ARN from Redshift and pasted it into the relevant place, but I still could not populate the sample data into the tables already created in Coginity Pro.
Below is the error message:
Status: ERROR
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
credentials 'aws_iam_role='
delimiter '|' region 'us-west-2'
36ms 2022-11-28T02:23:51.059Z
(SQLSTATE: 08006, SQLCODE: 0): An I/O error occurred while sending to the backend.
@udemeribe: please check the STL_LOAD_ERRORS table (ordered by its starttime date field).
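For reference, a minimal sketch of that check from Python (psycopg2 is assumed and the connection details are placeholders; the same SELECT can be run directly from Coginity Pro), listing the most recent load errors so the failing file, line, and column become visible:
import psycopg2

# Connection details are placeholders; use your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="<your-cluster>.redshift.amazonaws.com",
    port=5439,
    dbname="<database>",
    user="<user>",
    password="<password>",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT starttime, filename, line_number, colname, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)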

How should I get snowpipe auto-ingest working?

The following is my Snowpipe definition:
create or replace pipe protection_job_runs_dms_test auto_ingest = true as
  copy into protection_job_runs_dms_test_events from (
    select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10, t.$11, t.$12,
           t.$13, t.$14, t.$15, t.$16, t.$17, t.$18, t.$19, t.$20, t.$21, t.$22, t.$23, t.$24,
           current_timestamp
    from @S3DMSTESTSTAGE t
  )
  FILE_FORMAT = (FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  pattern = 'dmstest/(?!LOAD).*[.]csv';
When I execute the COPY command manually, it works correctly.
Does anyone know what the issue might be?
According to the comments on your question, you tested your COPY command by loading the same files before, without Snowpipe. This means your files have already been loaded once, and thus you cannot load them again afterwards with Snowpipe. Reason: Snowflake prevents loading files twice by default.
You can add the FORCE = TRUE parameter to your COPY command to override this behaviour and load all files, regardless of whether they have been loaded before or not.
More info about the FORCE parameter here: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
create or replace pipe protection_job_runs_dms_test auto_ingest = true as
  copy into protection_job_runs_dms_test_events from (
    select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10, t.$11, t.$12,
           t.$13, t.$14, t.$15, t.$16, t.$17, t.$18, t.$19, t.$20, t.$21, t.$22, t.$23, t.$24,
           current_timestamp
    from @S3DMSTESTSTAGE t
  )
  FILE_FORMAT = (FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  pattern = 'dmstest/(?!LOAD).*[.]csv'
  force = true;

How to write tables into Panoply using RPostgreSQL?

I am trying to write a table into my data warehouse using the RPostgreSQL package:
library(DBI)
library(RPostgreSQL)

pano = dbConnect(dbDriver("PostgreSQL"),
                 host = 'db.panoply.io',
                 port = '5439',
                 user = panoply_user,
                 password = panoply_pw,
                 dbname = mydb)

RPostgreSQL::dbWriteTable(pano, "mtcars", mtcars[1:5, ])
I am getting this error:
Error in postgresqlpqExec(new.con, sql4) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "STDIN"
LINE 1: ..."hp","drat","wt","qsec","vs","am","gear","carb" ) FROM STDIN
^
)
The above code creates a 0-row, 0-byte table in Panoply. The columns seem to be created properly, but no data appears.
First and most important: Redshift <> PostgreSQL (Panoply runs on Redshift).
Redshift does not use the Postgres bulk loader, so STDIN is NOT allowed.
There are many options available, which you should choose depending on your needs, especially considering the volume of data.
For a high volume of data you should write to S3 first and then use the Redshift COPY command, as sketched below.
There are many options; take a look at
https://github.com/sicarul/redshiftTools
For low volume, see
inserting multiple records at once into Redshift with R
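Purely to illustrate the shape of that high-volume path (a sketch in Python using boto3 and psycopg2; the redshiftTools package linked above wraps the same idea for R, and every bucket, table, and credential below is a placeholder):
import io

import boto3
import pandas as pd
import psycopg2

# A tiny frame standing in for the real data.
df = pd.DataFrame({"mpg": [21.0, 22.8], "cyl": [6, 4]})

# 1) Upload the data as CSV to S3 (bucket and key are placeholders).
csv_buf = io.StringIO()
df.to_csv(csv_buf, index=False, header=False)
boto3.client("s3").put_object(
    Bucket="my-staging-bucket",
    Key="staging/mtcars.csv",
    Body=csv_buf.getvalue(),
)

# 2) COPY it into the warehouse; the IAM role must allow the cluster to read the bucket.
conn = psycopg2.connect(host="<cluster-endpoint>", port=5439, dbname="<db>",
                        user="<user>", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY mtcars
        FROM 's3://my-staging-bucket/staging/mtcars.csv'
        IAM_ROLE '<redshift-role-arn>'
        DELIMITER ',';
    """)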

Can I save a trace file/extended events file to a partition other than the C drive on the server, or to another server altogether?

I've recently set up some traces and extended events in SQL Server on our new virtual server to show the access that users have to each database and whether they have logged in recently, and I have set the output to be saved as a physical file on the server rather than written to a SQL table, to save resources. I've set the traces up as jobs running at 8 am each morning with a 12-hour delay so that we can record as much information as possible.
Our IT department ideally doesn't want anything other than the OS on the C drive of the virtual server, so I'd like to be able to write the trace from my SQL script either to a different partition or to another server altogether.
I have attempted to point the code directly at a different server and have also tried a partition other than C; however, unless I write the trace/extended event files to the C drive, I get an error message.
CREATE EVENT SESSION [LoginTraceTest] ON SERVER
ADD EVENT sqlserver.existing_connection(
    SET collect_database_name=(1), collect_options_text=(1)
    ACTION(package0.event_sequence, sqlos.task_time, sqlserver.client_pid,
           sqlserver.database_id, sqlserver.database_name, sqlserver.is_system,
           sqlserver.nt_username, sqlserver.request_id, sqlserver.server_principal_sid,
           sqlserver.session_id, sqlserver.session_nt_username,
           sqlserver.sql_text, sqlserver.username)),
ADD EVENT sqlserver.login(
    SET collect_database_name=(1), collect_options_text=(1)
    ACTION(package0.event_sequence, sqlos.task_time, sqlserver.client_pid,
           sqlserver.database_id, sqlserver.database_name, sqlserver.is_system,
           sqlserver.nt_username, sqlserver.request_id, sqlserver.server_principal_sid,
           sqlserver.session_id, sqlserver.session_nt_username,
           sqlserver.sql_text, sqlserver.username))
ADD TARGET package0.asynchronous_file_target(
    SET FILENAME = N'\\SERVER1\testFolder\LoginTrace.xel',
        METADATAFILE = N'\\SERVER1\testFolder\LoginTrace.xem');
The error I receive is this:
Msg 25641, Level 16, State 0, Line 6
For target, "package0.asynchronous_file_target", the parameter "filename" passed is invalid. Target parameter at index 0 is invalid
If I change it to another partition rather than a different server:
SET FILENAME = N'D:\Traces\LoginTrace\LoginTrace.xel',
METADATAFILE = N'D:\Traces\LoginTrace\LoginTrace.xem' );
SQL Server states that the command completed successfully, but the file isn't written to the partition.
Does anyone have any ideas as to how I can write the files to another partition or to another server?