My S3 bucket is organised with this hierarchy, storing parquet files: <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
Manual Fix
For a particular date (i.e. a single parquet file), I did the following manual fix (sketched in code below):
Downloaded the parquet file and read it as a pandas DataFrame
Updated some values, while the columns remained unchanged
Saved the pandas DataFrame back to a parquet file with the same filename
Uploaded it back to the same S3 sub-folder
PS: I seem to have deleted the parquet file on S3 once, leaving an empty sub-folder.
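Roughly, the fix looked like this (a minimal sketch; the bucket, key, column names and values are placeholders, and pandas needs pyarrow or fastparquet installed to read/write parquet):
import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "my-bucket"                                        # placeholder bucket name
key = "folder-name/year=2022/month=05/day=01/data.parquet"  # placeholder object key

# Download the single parquet file for that date and load it into pandas
s3.download_file(bucket, key, "data.parquet")
df = pd.read_parquet("data.parquet")

# Update some values; the set of columns stays the same
df.loc[df["status"] == "wrong", "status"] = "fixed"         # placeholder column/values

# Save back with the same filename and upload to the same sub-folder
df.to_parquet("data.parquet", index=False)
s3.upload_file("data.parquet", bucket, key)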
Then I re-ran the Glue crawler, pointing at <folder-name>/. Unfortunately, the data for this particular date is missing from the Athena table.
After the crawler finished running, the notification was as follows:
Crawler <my-table-name> completed and made the following changes: 0 tables created, 0 tables updated. See the tables created in database <my-database-name>.
Is there anything I have misconfigured in my Glue crawler? Thanks
Glue Crawler Config
Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.
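For reference, these console settings correspond to the crawler's SchemaChangePolicy and Configuration fields in the Glue API; a minimal boto3 sketch (the crawler name is a placeholder) matching the values echoed in the CloudWatch log below:
import json
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="my-crawler",  # placeholder crawler name
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",    # update the table definition
        "DeleteBehavior": "DELETE_FROM_DATABASE",  # delete tables and partitions
    },
    Configuration=json.dumps({
        "Version": 1,
        "CrawlerOutput": {
            # inherit schema from table for all new and existing partitions
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }),
)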
Crawler Log in CloudWatch
BENCHMARK : Running Start Crawl for Crawler <my-table-name>
BENCHMARK : Classification complete, writing results to database <my-database-name>
INFO : Crawler configured with Configuration
{
"Version": 1,
"CrawlerOutput": {
"Partitions": {
"AddOrUpdateBehavior": "InheritFromTable"
}
},
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas"
}
}
and SchemaChangePolicy
{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "DELETE_FROM_DATABASE"
}
. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.
BENCHMARK : Finished writing to Catalog
BENCHMARK : Crawler has finished running and is in state READY
If you are reading from or writing to S3 buckets, the bucket name should have the aws-glue* prefix for Glue to access the buckets, assuming you are using the preconfigured "AWSGlueServiceRole" IAM role. You can try adding the aws-glue prefix to the names of the folders.
I had the same problem. Check the inline policy of your IAM role. You should have something like this when you specify the bucket:
"Resource": [
"arn:aws:s3:::bucket/object*"
]
When the crawler didn't work, I instead had the following:
"Resource": [
"arn:aws:s3:::bucket/object"
]
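If the role's inline policy is managed from code rather than the console, a minimal boto3 sketch of setting the wildcard resource might look like this (role name, policy name and actions are placeholders; grant whatever your crawler actually needs):
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            # The trailing * is what makes all objects under the prefix match
            "Resource": ["arn:aws:s3:::bucket/object*"],
        }
    ],
}

iam.put_role_policy(
    RoleName="my-glue-crawler-role",  # placeholder role name
    PolicyName="s3-crawl-access",     # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)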
I am trying to save my Glue job output to S3 using the following code snippet:
output_table = glueContext.write_dynamic_frame.from_options(
frame=table,
connection_type="s3",
format="json",
connection_options={"path": "s3://brand-code-mappings", "partitionKeys": []},
transformation_ctx="S3bucket_node3",
)
I want to overwrite all the objects that are already present in the S3 bucket instead of appending to them.
I have tried making the following changes, but nothing seems to work.
table.toDF() \
    .write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy() \
    .save('s3://brand-code-mappings')

table.toDF() \
    .write \
    .mode("overwrite") \
    .parquet("s3://brand-code-mappings")
Please help with how I can overwrite the already existing objects in the S3 bucket with the Glue output.
I am using Glue 3.0- which supports Spark 3.1, Scala 2 and Python 3.
Thanks,
Anamika
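One approach that is sometimes used for this, sketched below under the assumption that the job otherwise matches the snippet above (not verified against this exact job), is to purge the output prefix with GlueContext.purge_s3_path before writing the dynamic frame:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Delete everything currently under the output path; retentionPeriod=0 purges
# objects regardless of their age.
glueContext.purge_s3_path(
    "s3://brand-code-mappings",
    options={"retentionPeriod": 0},
)

# Then write the DynamicFrame exactly as before
# (`table` is the DynamicFrame from the question).
output_table = glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://brand-code-mappings", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)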
I have a new error using Azure ML, possibly due to the Ubuntu upgrade to 22.04 which I did yesterday.
I have an Azure ML workspace created through the portal and I can access it without any issue with the Python SDK:
from azureml.core import Workspace
ws = Workspace.from_config("config/config.json")
ws.get_details()
output
{'id': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.MachineLearningServices/workspaces/azml_lk',
'name': 'azml_lk',
'identity': {'principal_id': 'XXXXX',
'tenant_id': 'XXXXX',
'type': 'SystemAssigned'},
'location': 'westeurope',
'type': 'Microsoft.MachineLearningServices/workspaces',
'tags': {},
'sku': 'Basic',
'workspaceid': 'XXXXX',
'sdkTelemetryAppInsightsKey': 'XXXXX',
'description': '',
'friendlyName': 'azml_lk',
'keyVault': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.Keyvault/vaults/azmllkXXXXX',
'applicationInsights': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.insights/components/azmllkXXXXX',
'storageAccount': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.Storage/storageAccounts/azmllkXXXXX',
'hbiWorkspace': False,
'provisioningState': 'Succeeded',
'discoveryUrl': 'https://westeurope.api.azureml.ms/discovery',
'notebookInfo': {'fqdn': 'ml-azmllk-westeurope-XXXXX.westeurope.notebooks.azure.net',
'resource_id': 'XXXXX'},
'v1LegacyMode': False}
I then use this workspace ws to upload a file (or a directory) to Azure Blob Storage like so
from azureml.core import Dataset
ds = ws.get_default_datastore()
Dataset.File.upload_directory(
src_dir="./data",
target=ds,
pattern="*dataset1.csv",
overwrite=True,
show_progress=True
)
which again works fine and outputs
Validating arguments.
Arguments validated.
Uploading file to /
Filtering files with pattern matching *dataset1.csv
Uploading an estimated of 1 files
Uploading ./data/dataset1.csv
Uploaded ./data/dataset1.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Creating new dataset
{
"source": [
"('workspaceblobstore', '//')"
],
"definition": [
"GetDatastoreFiles"
]
}
My file is indeed uploaded to Blob Storage and I can see it either on the Azure portal or in Azure ML studio (ml.azure.com).
The error comes up when I try to create a Tabular dataset from the uploaded file. The following code doesn't work:
from azureml.core import Dataset
data1 = Dataset.Tabular.from_delimited_files(
path=[(ds, "dataset1.csv")]
)
and it gives me the error:
ExecutionError:
Error Code: ScriptExecution.DatastoreResolution.Unexpected
Failed Step: XXXXXX
Error Message: ScriptExecutionException was caused by DatastoreResolutionException.
DatastoreResolutionException was caused by UnexpectedException.
Unexpected failure making request to fetching info for Datastore 'workspaceblobstore' in subscription: 'XXXXXX', resource group: 'gr_louis', workspace: 'azml_lk'. Using base service url: https://westeurope.experiments.azureml.net. HResult: 0x80131501.
The SSL connection could not be established, see inner exception.
| session_id=XXXXXX
After some research, I assumed it might be due to the OpenSSL version (which is now 1.1.1), but I am not sure and I surely don't know how to fix it... any ideas?
According to the documentation, there is no direct procedure to convert a file dataset into a tabular dataset. Instead, we can create a workspace, which provisions two storage options (blob storage, which is the default, and file storage). SSL is handled by the workspace.
We can create a datastore in the workspace and connect it to the blob storage.
Follow this procedure to do the same.
Create a workspace.
If we want, we can create a dataset.
We can create it from local files or from a datastore.
To choose a datastore, we first need to have a file in that datastore.
Go to Datastores and click on Create dataset. Observe that the name is workspaceblobstore (default).
Fill in the details and make sure the dataset type is Tabular.
In the path we will have the local file path, and under "Select or create a datastore" we can check that the default storage is blob.
After uploading, we can see the name in this section, which is a datastore-backed tabular dataset.
In the workspace you created, check whether public network access is Disabled or Enabled. If it is disabled, access is blocked and the SSL connection cannot be established. After enabling it, use the same procedure as above.
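Once public network access is enabled, the code from the question should work; a small sketch (the registered dataset name is a placeholder) that also registers the tabular dataset so it appears under Datasets in the studio:
from azureml.core import Dataset, Workspace

ws = Workspace.from_config("config/config.json")
ds = ws.get_default_datastore()  # workspaceblobstore

# Build a tabular dataset from the uploaded CSV and register it in the workspace
data1 = Dataset.Tabular.from_delimited_files(path=[(ds, "dataset1.csv")])
data1 = data1.register(workspace=ws, name="dataset1", create_new_version=True)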
I'm trying to import an existing S3 bucket into a newly created CloudFormation stack. As a reference, I'm using this site. I use a Github workflow runner to execute this, like so:
- name: Add existing S3 bucket and object to Stack
run: aws cloudformation create-change-set
--stack-name ${{ env.STACK_NAME }} --change-set-name ImportChangeSet
--change-set-type IMPORT
--resources-to-import file://ResourcesToImport.txt
--template-url https://cf-templates.s3.eu-central-1.amazonaws.com/ResourcesToImport.yaml
I'm a little confused as to what exactly the ResourcesToImport.txt and ResourcesToImport.yaml should contain. I currently have:
ResourcesToImport.txt
[
{
"ResourceType":"AWS::S3::Bucket",
"LogicalResourceId":"myBucket",
"ResourceIdentifier": {
"resourceName":"myBucket",
"resourceType":"AWS::S3::Bucket"
}
}
]
NB: I have just used the bucket name here, but actually I only want a specific folder within that bucket.
ResourcesToImport.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Import existing resources
Resources:
  S3SourceBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: Retain
    Properties:
      BucketName: myBucket
I'm quite sure that duplicating this information in both of these files is redundant and incorrect. The ResourcesToImport.yaml file is uploaded in advance to the bucket at cf-templates/ResourcesToImport.yaml.
What should these two files actually contain, if I am to import only an existing S3 bucket and folder?
EDIT
In addition to the template route, I also tried adding the S3 bucket via the console. However, when the S3 URL is added (s3://myBucket/folder1/folder2/), I get:
S3 error: Domain name specified in myBucket is not a valid S3 domain
Here's what the two file inputs to create-change-set should contain when importing:
--resources-to-import The resources to import into your stack. This identifies the to-be-imported resources; it is not a template. Make sure the LogicalResourceId matches the logical resource ID in your template. In your case: "LogicalResourceId": "S3SourceBucket".
--template-url The [S3] location of the file that contains the revised template. This is a CloudFormation template that includes (a) the to-be-imported resources AND (b) the existing stack resources. This is what CloudFormation will deploy when you execute the change set. Note: alternatively, use --template-body with a local template file instead.
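For reference, the same import expressed with boto3 might look roughly like this (the stack name is a placeholder; for an AWS::S3::Bucket the resource identifier is the bucket name, and the LogicalResourceId must match the logical ID in the revised template):
import boto3

cfn = boto3.client("cloudformation")

cfn.create_change_set(
    StackName="my-stack",             # placeholder stack name
    ChangeSetName="ImportChangeSet",
    ChangeSetType="IMPORT",
    ResourcesToImport=[
        {
            "ResourceType": "AWS::S3::Bucket",
            "LogicalResourceId": "S3SourceBucket",            # matches the template
            "ResourceIdentifier": {"BucketName": "myBucket"},
        }
    ],
    TemplateURL="https://cf-templates.s3.eu-central-1.amazonaws.com/ResourcesToImport.yaml",
)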
Regarding your EDIT:
Bucket names cannot contain slashes; object keys can. S3 does not have folders per se, although object keys containing a / have some folder-like properties. A path like path/to/my.json is, in its entirety, the S3 object key name:
Amazon S3 supports buckets and objects, and there is no hierarchy. However, by using prefixes and delimiters in an object key name, the Amazon S3 console and the AWS SDKs can infer hierarchy and introduce the concept of folders
I am continuously adding parquet data sets to an S3 folder with a structure like this:
s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3
At the beginning I only have set1 and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2) I don't want it to become another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3:::my-bucket/public/data/ but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).
Any ideas how to solve this?
You can use the TableLevelConfiguration to specify in which folder level the crawler should look for tables.
More information on that here.
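A sketch of what that could look like via boto3 (the crawler name and the level value are placeholders; check the Glue documentation for how folder levels are counted for your bucket layout):
import json
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="my-crawler",  # placeholder crawler name
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {
            # Create tables at a fixed folder depth instead of grouping
            # everything into one partitioned table.
            "TableLevelConfiguration": 4
        },
    }),
)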
My solution was to manually add the specific paths to the Glue crawler. The big picture is that I am using a Glue job to transform data from one S3 bucket and write it to another one. I ended up initially configuring the Glue crawler to crawl the whole bucket, but every time the Glue transformation job runs it also updates the Glue crawler: it removes the initial full-bucket location (if it still exists) and then adds the new path to the S3 targets.
In Python it looks something like this:
def update_target_paths(crawler):
"""
Remove initial include path (whole bucket) from paths and
add folder for current files to include paths.
"""
def path_is(c, p):
return c["Path"] == p
# get S3 targets and remove initial bucket target
s3_targets = list(
filter(
lambda c: not path_is(c, f"s3://{bucket_name}"),
crawler["Targets"]["S3Targets"],
)
)
# add new target path if not in targets yet
if not any(filter(lambda c: path_is(c, output_loc), s3_targets)):
s3_targets.append({"Path": output_loc})
logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
crawler["Targets"]["S3Targets"] = s3_targets
return crawler
def remove_excessive_keys(crawler):
"""Remove keys from Glue crawler dict that are not needed/allowed to update the crawler"""
for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
try:
del crawler[k]
except KeyError:
logging.warning(f"Key '{k}' not in crawler result dictionary.")
return crawler
if __name__ == "__main__":
logging.info(f"Transforming from {input_loc} to {output_loc}.")
if prefix_exists(curated_zone_bucket_name, curated_zone_key):
logging.info("Target object already exists, appending.")
else:
logging.info("Target object doesn't exist, writing to new one.")
transform() # do data transformation and write to output bucket
while True:
try:
crawler = get_crawler(CRAWLER_NAME)
crawler = update_target_paths(crawler)
crawler = remove_excessive_keys(crawler)
# Update Glue crawler with updated include paths
glue_client.update_crawler(**crawler)
glue_client.start_crawler(Name=CRAWLER_NAME)
logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
break
except (
glue_client.exceptions.CrawlerRunningException,
glue_client.exceptions.InvalidInputException,
):
logging.warning("Crawler still running...")
time.sleep(10)
Variables defined globally: input_loc, output_loc, CRAWLER_NAME, bucket_name.
For every new data set a new path is added to the Glue crawler. No partitions will be created.
I have JSON coming in like this:
{
"app" : "hw",
"content" : "hello world",
"time" : "2018-05-06 12:53:04"
}
I wish to push to S3 in the following file format:
/upper-directory/$jsonfield1/$jsonfield2/$date/$HH
I know I can achieve:
/upper-directory/$date/$HH
with TimeBasedPartitioner and Topic.dir, but how do I put in the two JSON fields as well?
You need to write your own Partitioner to achieve a combination of the TimeBased and Field partitioners.
That means making a new Java project, looking at the existing partitioner source code as a reference point, building a JAR out of the project, and then copying the JAR into the kafka-connect-storage-common directory on all servers running Kafka Connect, where it is picked up by the S3 connector. After you've copied the JAR, you will need to restart the Connect process.
Note: there's already a PR that is trying to add this - https://github.com/confluentinc/kafka-connect-storage-common/pull/73/files