I am continuously adding parquet data sets to an S3 folder with a structure like this:
s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3
At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2), I don't want it to become just another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3:::my-bucket/public/data/, but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).
Any ideas how to solve this?
You can use TableLevelConfiguration to specify at which folder level the crawler should look for tables.
More information on that here.
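A minimal sketch of applying that setting with boto3, assuming a crawler already exists (the name my-crawler is a placeholder); with the layout above, the bucket counts as level 1, so the set folders would sit at level 4:

import json

import boto3

glue = boto3.client("glue")

# Table level is counted from the bucket (bucket = level 1), so with
# s3://my-bucket/public/data/set1 the set folders sit at level 4.
configuration = {
    "Version": 1.0,
    "Grouping": {"TableLevelConfiguration": 4},
}

glue.update_crawler(
    Name="my-crawler",  # placeholder crawler name
    Configuration=json.dumps(configuration),
)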
My solution was to manually add the specific paths to the Glue crawler. The big picture is that I am using a Glue job to transform data from one S3 bucket and write it to another one. I ended up initially configuring the Glue crawler to crawl the whole bucket. But every time the Glue transformation job runs, it also updates the Glue crawler: it removes the initial full-bucket location (if it still exists) and then adds the new path to the S3 targets.
In Python it looks something like this:
import logging
import time

import boto3

# assumed setup: the Glue client used below
glue_client = boto3.client("glue")


def update_target_paths(crawler):
    """
    Remove initial include path (whole bucket) from paths and
    add folder for current files to include paths.
    """

    def path_is(c, p):
        return c["Path"] == p

    # get S3 targets and remove initial bucket target
    s3_targets = list(
        filter(
            lambda c: not path_is(c, f"s3://{bucket_name}"),
            crawler["Targets"]["S3Targets"],
        )
    )
    # add new target path if not in targets yet
    if not any(filter(lambda c: path_is(c, output_loc), s3_targets)):
        s3_targets.append({"Path": output_loc})
        logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
    crawler["Targets"]["S3Targets"] = s3_targets
    return crawler


def remove_excessive_keys(crawler):
    """Remove keys from the Glue crawler dict that are not needed/allowed when updating the crawler."""
    for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
        try:
            del crawler[k]
        except KeyError:
            logging.warning(f"Key '{k}' not in crawler result dictionary.")
    return crawler


if __name__ == "__main__":
    logging.info(f"Transforming from {input_loc} to {output_loc}.")
    if prefix_exists(curated_zone_bucket_name, curated_zone_key):
        logging.info("Target object already exists, appending.")
    else:
        logging.info("Target object doesn't exist, writing to new one.")
    transform()  # do data transformation and write to output bucket

    # update and start the crawler, retrying while a previous crawl is still running
    while True:
        try:
            crawler = get_crawler(CRAWLER_NAME)
            crawler = update_target_paths(crawler)
            crawler = remove_excessive_keys(crawler)
            # Update Glue crawler with updated include paths
            glue_client.update_crawler(**crawler)
            glue_client.start_crawler(Name=CRAWLER_NAME)
            logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
            break
        except (
            glue_client.exceptions.CrawlerRunningException,
            glue_client.exceptions.InvalidInputException,
        ):
            logging.warning("Crawler still running...")
            time.sleep(10)
Variables defined globally: input_loc, output_loc, CRAWLER_NAME, bucket_name.
For every new data set a new path is added to the Glue crawler. No partitions will be created.
I've been experimenting with Scrapy for some time now and recently have been trying to upload files (data and images) to an S3 bucket. If the directory is static, it is pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field from the extracted data and place the data & media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (a somewhat crude way to segregate media types, as of now) with the following custom FilesPipeline:
import os
from urllib.parse import urlparse

from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline


class CustomFilepathPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        file_name = os.path.basename(urlparse(request.url).path)
        if ".mp4" in file_name:
            media_type = "video"
        else:
            media_type = "image"
        file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
        return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
    'FILES_STORE': 's3://<my_s3_bucket_name>/',
    'FILES_RESULT_FIELD': 's3_media_url',
    'DOWNLOAD_WARNSIZE': 0,
    'AWS_ACCESS_KEY_ID': <my_access_key>,
    'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So, the media part works flawlessly and I have been able to download the images and videos into their separate directories in the S3 bucket, based on the account_id. My question is:
Is there a way to achieve the same results with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the first example on the Item Exporters page but couldn't make any headway. One thing that I thought might help is to use boto3 to establish a connection and then upload the files, but that might require me to segregate files locally and upload them together, by using a combination of pipelines (to split the data) and signals (once the spider is closed, to upload the files to S3).
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.
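For reference, a minimal, untested sketch of the boto3-in-a-pipeline idea mentioned above; the bucket name and the file_name item field are assumptions:

import json

import boto3
from itemadapter import ItemAdapter


class S3DataExportPipeline:
    """Hypothetical item pipeline: serialize each item to JSON and upload it
    under the account-specific data/ prefix, mirroring the media layout."""

    def open_spider(self, spider):
        self.s3 = boto3.client("s3")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        # file_name is an assumed item field; derive it however fits your items
        key = f"crawl_data/{account_id}/data/{adapter['file_name']}.json"
        self.s3.put_object(
            Bucket="<my_s3_bucket_name>",  # placeholder bucket name
            Key=key,
            Body=json.dumps(adapter.asdict()).encode("utf-8"),
        )
        return item

Such a pipeline would then be enabled via ITEM_PIPELINES in the same custom_settings.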
My S3 bucket is organised with this hierarchy, storing parquet files: <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
Manual Fix
For a particular date (i.e. a single parquet file), I did some manual fixes:
Downloaded the parquet file and read it as a pandas DataFrame
Updated some values, while the columns remained unchanged
Saved the pandas DataFrame back to a parquet file with the same filename
Uploaded it back to the same S3 bucket sub-folder
PS: I seem to have deleted the parquet file on S3 once, leading to an empty sub-folder.
Then I re-ran the Glue crawler, pointing at <folder-name>/. Unfortunately, the data for this particular date is missing from the Athena table.
After the crawler finished running, the notification was as follows:
Crawler <my-table-name> completed and made the following changes: 0 tables created, 0 tables updated. See the tables created in database <my-database-name>.
Is there anything I have misconfigured in my Glue crawler? Thanks.
Glue Crawler Config
Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.
Crawler Log in CloudWatch
BENCHMARK : Running Start Crawl for Crawler <my-table-name>
BENCHMARK : Classification complete, writing results to database <my-database-name>
INFO : Crawler configured with Configuration
{
    "Version": 1,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        }
    },
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
and SchemaChangePolicy
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"
}
Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.
BENCHMARK : Finished writing to Catalog
BENCHMARK : Crawler has finished running and is in state READY
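One way to check what the crawler actually left in the Data Catalog (independent of what Athena shows) is a small listing like this sketch; the database and table names are placeholders:

import boto3

glue = boto3.client("glue")

# List the partitions the Data Catalog currently holds for the table, to see
# whether the re-crawled date is still registered (names are placeholders).
paginator = glue.get_paginator("get_partitions")
pages = paginator.paginate(DatabaseName="<my-database-name>", TableName="<my-table-name>")
for page in pages:
    for partition in page["Partitions"]:
        print(partition["Values"], partition["StorageDescriptor"]["Location"])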
If you are reading from or writing to S3 buckets, the bucket name should have the aws-glue* prefix for Glue to access the buckets, assuming you are using the preconfigured "AWSGlueServiceRole" IAM role. You can try adding the prefix aws-glue to the names of the folders.
I had the same problem. Check the inline policy of your IAM role. You should have something like this when you specify the bucket:
"Resource": [
"arn:aws:s3:::bucket/object*"
]
When the crawler didn't work, I instead had the following:
"Resource": [
"arn:aws:s3:::bucket/object"
]
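For illustration, a hedged sketch of attaching such an inline policy with boto3; the role, policy, bucket and prefix names are placeholders, and the important part is the trailing * on the object ARN:

import json

import boto3

iam = boto3.client("iam")

# Placeholder role/policy names; the object ARN ends with * so the crawler
# can read every object under the prefix, not just a single key.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::bucket",
                "arn:aws:s3:::bucket/object*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="MyGlueCrawlerRole",
    PolicyName="GlueCrawlerS3Access",
    PolicyDocument=json.dumps(policy),
)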
I exported a dataset from Google BigQuery to Google Cloud Storage; given the size of the dataset, BigQuery exported it as 99 CSV files.
However, now I want to connect to my GCP bucket and perform some analysis with Spark, and I need to join all 99 files into a single large CSV file to run my analysis.
How can this be achieved?
BigQuery splits the exported data into several files if it is larger than 1 GB. But you can merge these files with the gsutil tool; check this official documentation to see how to perform object composition with gsutil.
As BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:
gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
The downside of this option is that the header row of each .csv file will be added to the composite object. But you can avoid this by modifying the jobConfig to set the print_header parameter to False.
Here is a Python code sample, but you can use any other BigQuery client library:
from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'yourBucket'
project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'
destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.job.ExtractJobConfig(print_header=False)
extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
Finally, remember to also compose a .csv containing just the header row as the first component, so the composite object still has headers.
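A rough sketch of that last step with the google-cloud-storage client, assuming the export file names produced by the snippet above and a made-up header line:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('example-bucket')

# Upload a one-line CSV containing only the header row, then compose it
# together with the header-less exports (at most 32 components per call).
header_blob = bucket.blob('header.csv')
header_blob.upload_from_string('col_a,col_b,col_c\n')  # made-up header

sources = [header_blob] + [bucket.blob(name) for name in ['file-000000000000.csv', 'file-000000000001.csv']]
composite = bucket.blob('composite-object.csv')
composite.compose(sources)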
I got kind of tired of doing multiple recursive compose operations, stripping headers, etc., especially when dealing with 3500 split gzipped CSV files.
Therefore I wrote a CSV merge tool (sorry, Windows only though) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download latest release, unzip and use.
Also wrote an article with a use case and usage example for it:
https://medium.com/#TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV as it tends to have fewer data issues.
I have about 1000 objects in S3 which are named like
abcyearmonthday1
abcyearmonthday2
abcyearmonthday3
...
and I want to rename them to
abc/year/month/day/1
abc/year/month/day/2
abc/year/month/day/3
How could I do this through boto3? Is there an easier way of doing this?
As explained in Boto3/S3: Renaming an object using copy_object,
you cannot rename an object in S3; you have to copy the object with a new name and then delete the old object:
s3 = boto3.resource('s3')
s3.Object('my_bucket','my_file_new').copy_from(CopySource='my_bucket/my_file_old')
s3.Object('my_bucket','my_file_old').delete()
There is no direct way to rename an S3 object. Two steps need to be performed:
Copy the S3 object to the same location with the new name.
Then delete the older object, as shown below.
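A rough sketch of those two steps with the low-level boto3 client, using the example keys from the question (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')

# Copy the object under the new key, then delete the old key.
s3.copy_object(
    Bucket='my-bucket',
    CopySource={'Bucket': 'my-bucket', 'Key': 'abcyearmonthday1'},
    Key='abc/year/month/day/1',
)
s3.delete_object(Bucket='my-bucket', Key='abcyearmonthday1')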
I had the same problem (in my case I wanted to rename files generated in S3 using the Redshift UNLOAD command). I solved it by creating a boto3 session and then copying and deleting file by file.
Like this:
import boto3

s3_session = boto3.session.Session(
    aws_access_key_id=my_access_key_id,
    aws_secret_access_key=my_secret_access_key,
).resource('s3')

# Save in a list the tuples of filenames (with prefix):
# [(old_s3_file_path, new_s3_file_path), ...]
# e.g. of tuple: ('prefix/old_filename.csv000', 'prefix/new_filename.csv')
s3_files_to_rename = []
s3_files_to_rename.append((old_file, new_file))

for old_file, new_file in s3_files_to_rename:
    s3_session.Object(s3_bucket_name, new_file).copy_from(
        CopySource=s3_bucket_name + '/' + old_file
    )
    s3_session.Object(s3_bucket_name, old_file).delete()
I'm trying to organize a large number of CloudWatch alarms for maintainability, and the web console grays out the name field on an edit. Is there another method (preferably something scriptable) for updating the name of CloudWatch alarms? I would prefer a solution that does not require any programming beyond simple executable scripts.
Here's a script we use to do this for the time being:
import sys

import boto


def rename_alarm(alarm_name, new_alarm_name):
    conn = boto.connect_cloudwatch()

    def get_alarm():
        alarms = conn.describe_alarms(alarm_names=[alarm_name])
        if not alarms:
            raise Exception("Alarm '%s' not found" % alarm_name)
        return alarms[0]

    alarm = get_alarm()
    # work around boto comparison serialization issue
    # https://github.com/boto/boto/issues/1311
    alarm.comparison = alarm._cmp_map.get(alarm.comparison)
    alarm.name = new_alarm_name
    conn.update_alarm(alarm)
    # update actually creates a new alarm because the name has changed, so
    # we have to manually delete the old one
    get_alarm().delete()


if __name__ == '__main__':
    alarm_name, new_alarm_name = sys.argv[1:3]
    rename_alarm(alarm_name, new_alarm_name)
It assumes you're either on an ec2 instance with a role that allows this, or you've got a ~/.boto file with your credentials. It's easy enough to manually add yours.
Unfortunately, it looks like this is not currently possible.
I looked around for the same solution, but it seems neither the console nor the CloudWatch API provides that feature.
Note:
But we can copy the existing alarm with the same parameters and save it under a new name.
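A hedged sketch of that copy-then-delete approach with current boto3, assuming a plain metric alarm that uses a standard statistic (composite alarms and metric-math alarms would need extra handling):

import boto3

cloudwatch = boto3.client('cloudwatch')


def rename_metric_alarm(old_name, new_name):
    """Copy a metric alarm under a new name, then delete the original."""
    alarm = cloudwatch.describe_alarms(AlarmNames=[old_name])['MetricAlarms'][0]
    # Keep only the keys that put_metric_alarm accepts; this assumes a simple
    # Statistic-based alarm (no ExtendedStatistic or Metrics list).
    params = {
        'AlarmName': new_name,
        'MetricName': alarm['MetricName'],
        'Namespace': alarm['Namespace'],
        'Statistic': alarm['Statistic'],
        'Dimensions': alarm.get('Dimensions', []),
        'Period': alarm['Period'],
        'EvaluationPeriods': alarm['EvaluationPeriods'],
        'Threshold': alarm['Threshold'],
        'ComparisonOperator': alarm['ComparisonOperator'],
        'AlarmActions': alarm.get('AlarmActions', []),
        'OKActions': alarm.get('OKActions', []),
        'InsufficientDataActions': alarm.get('InsufficientDataActions', []),
        'ActionsEnabled': alarm['ActionsEnabled'],
    }
    cloudwatch.put_metric_alarm(**params)
    cloudwatch.delete_alarms(AlarmNames=[old_name])


# usage: rename_metric_alarm('old-alarm-name', 'new-alarm-name')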