How should I get Snowpipe auto-ingest working?

Following is my Snowpipe definition:
create or replace pipe protection_job_runs_dms_test auto_ingest = true as
copy into protection_job_runs_dms_test_events from (select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10, t.$11, t.$12, t.$13, t.$14, t.$15, t.$16,
t.$17, t.$18, t.$19, t.$20, t.$21, t.$22, t.$23, t.$24, current_timestamp from @S3DMSTESTSTAGE t)
FILE_FORMAT = (
FIELD_OPTIONALLY_ENCLOSED_BY='"'
)
pattern='dmstest/(?!LOAD).*[.]csv';
When I execute the COPY command manually, it works correctly.
Does anyone know what the issue might be?

According to the comments on your question, you tested your COPY command by loading the same files before, without Snowpipe. This means your files have already been loaded once, so you cannot load them again with Snowpipe. Reason: by default, Snowflake prevents loading the same files twice.
You can add the FORCE = TRUE parameter to your COPY command to override this behaviour and load all files, regardless of whether they have been loaded before.
More info about the FORCE parameter here: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
create or replace pipe protection_job_runs_dms_test auto_ingest = true as
copy into protection_job_runs_dms_test_events from (select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7, t.$8, t.$9, t.$10, t.$11, t.$12, t.$13, t.$14, t.$15, t.$16,
t.$17, t.$18, t.$19, t.$20, t.$21, t.$22, t.$23, t.$24, current_timestamp from @S3DMSTESTSTAGE t)
FILE_FORMAT = (
FIELD_OPTIONALLY_ENCLOSED_BY='"'
)
pattern='dmstest/(?!LOAD).*[.]csv'
force=true;

Spark streaming writing as delta and checkpoint location

I am streaming from a Delta table as a source and writing as Delta after performing some transformations, and this all worked. I recently looked at some videos and posts about best practices and found that I needed to make one addition and one modification:
The addition was adding queryName.
The modification was changing the checkpoint location so that it resides alongside the data rather than in a separate directory, like I was doing.
So, I have one question and one problem.
The question: can I add the queryName now, after my stream has been running for some time, without any consequences?
The problem: now that I have put my checkpoint location in the same directory as my Delta table, I can no longer create an external Hive table. It fails with
pyspark.sql.utils.AnalysisException: Cannot create table ('`spark_catalog`.`schemaname`.`tablename`'). The associated location ('abfss://refined@datalake.dfs.core.windows.net/curated/schemaname/tablename') is not empty but it's not a Delta table
So, this was my original code, which worked
def upsert(microbatchdf, batchId):
    # ... some transformations on microbatchdf ...
    # ..........................
    # Create the Delta table beforehand, as otherwise generated columns can't be created
    # after having written the data into the data lake with the usual partitionBy
    deltaTable = (
        DeltaTable.createIfNotExists(spark)
        .tableName(f"{target_schema_name}.{target_table_name}")
        .addColumns(microbatchdf_deduplicated.schema)
        .addColumn(
            "trade_date_year",
            "INT",
            generatedAlwaysAs="Year(trade_date)",
        )
        .addColumn(
            "trade_date_month",
            "INT",
            generatedAlwaysAs="MONTH(trade_date)",
        )
        .addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
        .partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
        .location(
            f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/{target_table_location_directory}"
        )
        .execute()
    )
    # ... some transformations and writing to the Delta table ...
# end
# this is how the stream is run
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option(
        "checkpointLocation",
        f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/curated/checkpoints/",
    )
    .start()
)
streamjob.awaitTermination()
Now, to this working piece, I only tried adding the queryName and modifying the checkpoint location (see the comments for the addition and the modification):
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .queryName(f"{source_schema_name}.{source_table_name}")  # this added
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option(
        "checkpointLocation",
        f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/{target_table_location_directory}/_checkpoint",  # this changed
    )
    .start()
)
streamjob.awaitTermination()
In my data lake the _checkpoint folder did get created, and apparently because of this folder the external table creation complains about a non-empty directory, whereas the documentation I followed recommends placing the checkpoint alongside the data.
So why does the external Hive table creation fail then? Also, please note my question about adding the queryName to an already running stream.
Point to note: I have tried dropping the external table and also removed the contents of that directory, so there is nothing in that directory except the _checkpoint folder, which got created when I ran the streaming job, just before it got to creating the table inside the upsert method.
If you have any questions, I can help clarify.
The problem is that checkpoint files are written before you call the `DeltaTable.createIfNotExists` function, which checks whether there is already data in that location and fails because additional files are present that don't belong to the Delta Lake table.
If you want to keep the checkpoint with your data, you need to move the DeltaTable.createIfNotExists(spark)... call outside of the upsert function; in that case the table will be created before any checkpoint files are written.
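A minimal sketch of that reordering, reusing the variables from the question. The target_schema name below is hypothetical; it stands in for microbatchdf_deduplicated.schema, which only exists inside upsert, so the schema of the transformed data has to be known up front:
from delta.tables import DeltaTable

target_path = (
    f"abfss://{target_table_location_filesystem}@{datalakename}"
    f".dfs.core.windows.net/{target_table_location_directory}"
)

# 1. Create the table exactly once, before the stream starts and before any
#    checkpoint files can appear in the target directory.
(
    DeltaTable.createIfNotExists(spark)
    .tableName(f"{target_schema_name}.{target_table_name}")
    .addColumns(target_schema)  # hypothetical: schema of the transformed data, known up front
    .addColumn("trade_date_year", "INT", generatedAlwaysAs="YEAR(trade_date)")
    .addColumn("trade_date_month", "INT", generatedAlwaysAs="MONTH(trade_date)")
    .addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
    .partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
    .location(target_path)
    .execute()
)

def upsert(microbatchdf, batchId):
    # ... only transformations and the write/merge into the existing Delta table;
    # no table creation in here anymore ...
    pass

# 2. The stream can now checkpoint under the table location, because the table's
#    _delta_log already exists there when the first micro-batch runs.
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .queryName(f"{source_schema_name}.{source_table_name}")
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option("checkpointLocation", f"{target_path}/_checkpoint")
    .start()
)
streamjob.awaitTermination()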

AWS Glue - getSink() is throwing "No such file or directory" right after glue_context.purge_s3_path

I am trying to purge a partition of a Glue catalog table and then recreate the partition using the getSink option (similar to a truncate/load of a partition in a database).
For purging the partition I am using the glueContext.purge_s3_path option with retention period = 0. The partition is getting purged successfully.
self._s3_path = "s3://server1/main/transform/Account/local_segment/source_system=SAP/"

self._glue_context.purge_s3_path(
    self._s3_path,
    {"retentionPeriod": 0, "excludeStorageClasses": ()}
)
Here the catalog database = Account, table = local_segment, and partition key = source_system.
However, when I try to recreate the partition right after the purge step, I get "An error occurred while calling o180.pyWriteDynamicFrame. No such file or directory" from getSink's writeFrame.
If I remove the purge part, getSink works fine and is able to create the partition and write the files.
I even tried "MSCK REPAIR TABLE" between the purge and getSink, but no luck.
Shouldn't getSink create the partition if it does not exist, i.e. after it was purged in the previous step?
target = self._glue_context.getSink(
    connection_type="s3",
    path=self._s3_path_prefix,
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["source_system"]
)
target.setFormat("glueparquet")
target.setCatalogInfo(
    catalogDatabase=f"{self._target_database}",
    catalogTableName=f"{self._target_table_name}"
)
target.writeFrame(self._dyn_frame)
Where:
self._s3_path_prefix = s3://server1/main/transform/Account/local_segment/
self._target_database = Account
self._target_table_name = local_segment
Error message:
An error occurred while calling o180.pyWriteDynamicFrame. No such file or directory 's3://server1/main/transform/Account/local_segment/source_system=SAP/run-1620405230597-part-block-0-0-r-00000-snappy.parquet'
Try checking whether you have permission for this object on S3. I got the same error, and once I configured the object to be public (just for a test), it worked. So maybe it's a new object and your process might not have access to it.
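A quick way to check that, sketched with boto3 (the bucket and key below are just taken from the error message above; adjust them to your environment):
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "server1"
key = ("main/transform/Account/local_segment/source_system=SAP/"
       "run-1620405230597-part-block-0-0-r-00000-snappy.parquet")

try:
    # head_object typically fails with 403 for a permission problem
    # and 404 if the object really is gone (e.g. purged)
    s3.head_object(Bucket=bucket, Key=key)
    print("Object is readable by this role.")
except ClientError as err:
    print("Cannot access object:", err.response["Error"]["Code"])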

Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load a JSON file from a staging area (S3) into a stage table using the COPY INTO command.
Table:
create or replace TABLE stage_tableA (
    RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from @stgS3/filename_45.gz file_format = (format_name = 'file_json')
I got the below error when executing the above (sample provided):
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes. If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
When I put ON_ERROR=CONTINUE, records got partially loaded, i.e. up to the record that exceeds the max size, but no records after the error record were loaded.
Was ON_ERROR=CONTINUE supposed to skip only the record that exceeds the max size and load the records before and after it?
Yes, the ON_ERROR=CONTINUE skips the offending line and continues to load the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function.
Try setting the option strip_outer_array = true on the file format and attempt the load again.
The considerations for loading large size semi-structured data are documented in the below article:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if there is in fact more than one JSON object in the file. If it's one massive object, then with ON_ERROR=CONTINUE you would simply get neither an error nor the record loaded.
If you know your JSON payload is smaller than 16 MB, then definitely try strip_outer_array = true. Also, if your JSON has a lot of nulls ("NULL") as values, use STRIP_NULL_VALUES = TRUE, as this will slim down your payload as well. Hope that helps.

Unable to query using file in Data Proc Hive Operator

I am unable to run a query from a .sql file with DataProcHiveOperator.
The documentation says that we can query using a file. Link to the documentation: Here
It works fine when I give the query directly.
Here is my sample code, which works fine with the query written directly:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query='CREATE TABLE TABLE_NAME(NAME STRING);',
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)
Querying with a file:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query='gs://us-central1-bucket/data/sample_hql.sql',
                                          query_uri="gs://us-central1-bucket/data/sample_hql.sql",
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)
There is no error in the sample_hql.sql script.
It is reading the file location as a query and throwing the error:
Query: 'gs://bucketpath/filename.q'
Error occurring: cannot recognize input near 'gs' ':' '/'
A similar issue has also been raised Here.
The issue is that you have passed query='gs://us-central1-bucket/data/sample_hql.sql' as well.
You should pass exactly one of query or query_uri.
The code in your question has both of them, so remove query or use the following code:
HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
                                          gcp_conn_id='google_cloud_default',
                                          query_uri="gs://us-central1-bucket/data/sample_hql.sql",
                                          cluster_name='cluster-name',
                                          region='us-central1',
                                          dag=dag)

Drupal is looking for a field that no longer exists

Warning: Cannot modify header information - headers already sent by (output started at /home/sites/superallan.com/public_html/includes/common.inc:2561) in drupal_send_headers() (line 1040 of /home/sites/superallan.com/public_html/includes/bootstrap.inc).
PDOException: SQLSTATE[42S02]: Base table or view not found: 1146 Table 'web247-sa_admin.field_data_field_embedcode' doesn't exist: SELECT field_data_field_embedcode0.entity_type AS entity_type, field_data_field_embedcode0.entity_id AS entity_id, field_data_field_embedcode0.revision_id AS revision_id, field_data_field_embedcode0.bundle AS bundle FROM {field_data_field_embedcode} field_data_field_embedcode0 WHERE (field_data_field_embedcode0.deleted = :db_condition_placeholder_0) AND (field_data_field_embedcode0.bundle = :db_condition_placeholder_1) LIMIT 10 OFFSET 0; Array ( [:db_condition_placeholder_0] => 1 [:db_condition_placeholder_1] => blog ) in field_sql_storage_field_storage_query() (line 569 of /home/sites/superallan.com/public_html/modules/field/modules/field_sql_storage/field_sql_storage.module).
As I understand it, Drupal is looking for a data field that I deleted. I thought maybe it had got corrupted and Drupal couldn't find it to delete it properly. In phpMyAdmin it doesn't exist, so how can I get Drupal to recognize it's no longer there and stop showing this error at the bottom of every page?
You can see it on this page: http://superallan.com/404
Have you tried clearing the site cache, or uninstalling the module that provides the field? It looks like a reference to the field's SQL data is sticking around in your database, which is of course causing the error you posted.
This worked for me:
DELETE FROM field_config WHERE deleted = 1;
DELETE FROM field_config_instance WHERE deleted = 1;
I saw no adverse effects from removing all the things that were already marked as deleted.
Source:
http://digcms.com/remove-field_deleted_data-drupal-database/