AWS: How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?

If you run queries against files that are splittable and large enough, Athena will spin up workers that read parts of the files. This parallelization improves performance. Parquet files, for example, are splittable.
A 100x amplification sounds very high, though. I don't know what split size Athena aims for, and I don't know the sizes of your files. There could also be other explanations for the additional GET operations, both inside Athena and from other sources. How sure are you that these requests are from Athena?
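For illustration, this is what reading parts of a single object looks like from the client side: each ranged read is its own, separately billed GET request. A minimal boto3 sketch (bucket, key, and byte ranges are placeholders):
import boto3

s3 = boto3.client('s3')

# Read only the tail of the object (a Parquet reader starts with the footer).
footer = s3.get_object(
    Bucket='my-bucket',                      # placeholder
    Key='data/part-00000.parquet',           # placeholder
    Range='bytes=-65536',
)['Body'].read()

# Read one specific byte range, e.g. a single row group; this is another GET.
row_group = s3.get_object(
    Bucket='my-bucket',
    Key='data/part-00000.parquet',
    Range='bytes=4-1048575',
)['Body'].read()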
One way you could investigate further is to turn on object-level logging in CloudTrail for the bucket. You should be able to see all the request parameters, like which byte ranges are read. If you assume a role with a unique session name and make only a single query with the credentials you get back, you should be able to isolate all the S3 operations made by Athena for that query.
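A rough sketch of that isolation approach, assuming you have a role you can assume for Athena queries; the role ARN, query, and output location below are placeholders:
import uuid

import boto3

# Assume the role with a unique session name so the CloudTrail data events
# (and S3 access log entries) for this query can be filtered on it afterwards.
session_name = f'athena-audit-{uuid.uuid4()}'
creds = boto3.client('sts').assume_role(
    RoleArn='arn:aws:iam::123456789012:role/athena-query-role',  # placeholder
    RoleSessionName=session_name,
)['Credentials']

athena = boto3.client(
    'athena',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
)

# Run exactly one query with these credentials, then check the logs.
athena.start_query_execution(
    QueryString='SELECT count(*) FROM my_database.my_table',          # placeholder
    ResultConfiguration={'OutputLocation': 's3://my-results-bucket/audit/'},
)
print('Filter the CloudTrail/S3 access logs on session name:', session_name)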

Related

Writing Pyspark DF to S3 faster

I am pulling data from a MySQL DB using PySpark and trying to upload the same data to S3 using PySpark.
While doing so, it takes around 5-7 minutes to upload a chunk of 100K records.
This process will take months for the data pull, as there are around 3,108,700,000 records in the source.
Is there any better way by which the S3 upload process can be improved?
NOTE: the data pull for a single fetch of 100K records takes only 20-30 seconds; it's just the S3 upload causing the issue.
Here is how I am writing the DF to S3.
df = (spark.read.format("jdbc")
      .option('url', jdbcURL)
      .option('driver', driver)
      .option('user', user_name)
      .option('password', password)
      .option('query', data_query)
      .load())
output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)
Repartition is a good move, as writing large files to S3 is better than writing many small files.
Persist will slow you down, as you're writing all the data out to storage with that as well, so you end up writing the data twice.
S3 is made for large, slow, inexpensive storage. It's not made to move data quickly. If you want to migrate the database, AWS has tools for that and it's worth looking into them, even if it's only so you can then move the files into S3.
S3 writes to internal partitions, and it determines those partitions by file path. It uses tail variation to assign and auto-split partitions (/heres/some/variation/at/the/tail1, /heres/some/variation/at/the/tail2). Those partitions are your bottleneck here. To get assigned multiple partitions, vary the file path at the head instead (/head1/variation/isfaster/, /head2/variation/isfaster/).
Try to remove the persist (or at least consider cache() as a cheaper alternative); see the sketch below.
Keep the repartition.
Vary the head of the file path to get assigned more partitions.
Consider a redesign that pushes the data into S3 with the REST API multipart upload.
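A rough sketch of the first three points, reusing the df from the question; the bucket name and chunk_id below are placeholders:
# persist() removed (swap in cache() only if you truly need the DataFrame twice);
# keep the repartition, and vary the head of the key (here the chunk= prefix)
# per 100K-record fetch.
chunk_id = 17  # placeholder: whatever identifies the current fetch
target_directory = f"s3://my-bucket/chunk={chunk_id:05d}/data/"  # placeholder bucket

df.repartition(1).write.mode("overwrite").parquet(target_directory)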

Millions of GET requests (Amazon S3 USE2-Requests-Tier2) every day?

I was looking at our bill and apparently we are charged more than $600 for Amazon Simple Storage Service USE2-Requests-Tier2, meaning that we have more than 1 billion GET requests a month, so about 3 million every day? We made sure that none of our S3 buckets are public, so attacks should not be possible. I have no idea how we are getting so many requests, as we only have about 20 active users of our app every day. Assuming that each of them were to make about 10 GET requests to our API, which uses Lambda and boto3 to download 10 files from an S3 bucket to the Lambda's tmp folder and then returns a value, it still wouldn't make sense for us to have about 3 million GET requests a day.
We also have another EventBridge-triggered Lambda, which uses Athena to query our database (S3) and runs every 2 hours. I don't know if this is a potential cause? Can anyone shed some light on this? And how can we take a better look into where and why we are getting so many GET requests? Thank you.
When you execute a query in Athena, during the initial query planning phase it will list the location of the table, or the locations of all the partitions of the table involved in the query. In the next phase it will make a GET request for each and every one of the objects that it found during query planning.
If your tables consist of many small files, it is not uncommon to see S3 charges that are comparable to or higher than the Athena charge. If those small files are Parquet files, the problem can be bigger, because Athena will also do GET requests for them during query planning to figure out splits.
One way to figure out if this is the case is to enable S3 access logging on the bucket, create a new IAM session, and run a query. Wait a few minutes and then look for all S3 operations that were issued with that session; that's an estimate of the S3 operations per query.
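As a rougher sanity check, you can also just count the objects under the table's S3 location; given the query-planning behaviour described above, that count is roughly the floor for GET requests per query against an unpartitioned table. A sketch with boto3 (bucket and prefix are placeholders):
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Placeholder bucket/prefix: point these at the table's LOCATION.
count = 0
for page in paginator.paginate(Bucket='my-data-bucket', Prefix='tables/events/'):
    count += page.get('KeyCount', 0)

print(f'{count} objects under the prefix, so roughly {count}+ GETs per full scan')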

Which is better, multiple requests for multiple files or single request for a single file S3?

I have a 10 GB CSV file. I can put the file in S3 in 2 ways.
1) Upload the entire file as a single CSV object.
2) Divide the file into multiple chunks (say 200 MB) and upload them.
Now I need to get all the data in the object into a pandas data frame running on an EC2 instance.
1) One way is to make a single request for the one big file and put the data into a dataframe.
2) The other way is to make a request per object and keep appending the data to the dataframe.
Which is the better way of doing it?
With multiple files, you will have the possibility to download them simultaneously in parallel threads. But this has 2 drawbacks:
These operations are IO heavy (mostly network), so depending on your instance type you might have worse performance overall.
Multithreaded apps include some overhead for handling errors, aggregating results and such.
Depending on what you do, you might also want to look at AWS Athena, which can query data in S3 for you and produce results in seconds, so you don't have to download it at all.
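For the multi-file option, a minimal sketch of the parallel-thread download described above, using boto3 and pandas; the bucket, chunk keys, and thread count are placeholders:
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client('s3')
BUCKET = 'my-bucket'                                     # placeholder
keys = [f'chunks/part-{i:03d}.csv' for i in range(50)]   # placeholder chunk keys

def fetch(key):
    # One GET per chunk; the chunks are fetched concurrently by the thread pool.
    body = s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()
    return pd.read_csv(BytesIO(body))

with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(fetch, keys))

# Concatenate once at the end instead of appending inside the loop.
df = pd.concat(frames, ignore_index=True)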

AWS Glue - how to crawl a Kinesis Firehose output folder from S3

I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.
I have a Kinesis Firehose job dumping streaming data into a S3 bucket. These files consist of a series of discrete web browsing events represented as JSON documents with varying structures (so, say, one document might have field 'date' but not field 'name', whereas another might have 'name' but not 'date').
I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.
I'm trying to run an S3 data catalog crawler, and the problem I'm running into is that the Kinesis output format is not, itself, valid JSON, which is just baffling to me. Instead it's a bunch of JSON documents separated by line breaks. The crawler can automatically identify and parse JSON files, but it cannot parse this.
I thought of writing a lambda function to 'fix' the Firehose file, triggered by its creation on the bucket, but it sounds like a cheap workaround for two pieces that should fit neatly together.
Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, a way to bypass the need for a crawler?
cheers and thanks in advance
It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to some buffering (time and size) settings and then write the records to S3 as a single object (see Firehose Data Delivery).
The batching of multiple records into a single file is important if the workload will contain a large number of records, as performance (and S3 costs) for processing many small files from S3 can be less than optimal.
AWS Glue Crawlers and ETL jobs do support processing of 'JSON line' (newline delimited JSON) format.
If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).
I have seen a crawler fail in an instance where differences in the files being crawled forced it into a table-per-file mode, and it hit a limit on the number of tables (see AWS Glue Limits).
I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.
I wrote a lambda script as part of the Kinesis process that forced every document into the same structure, by adding NULL fields where necessary. The crawlers were then able to correctly parse the resulting files and map them to a single table.
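The asker doesn't show their Lambda, but one place such normalization can live is a Firehose data-transformation function. A minimal sketch of the idea; the field list is made up, and the record/response shape is the one Firehose expects from transformation Lambdas:
import base64
import json

# Assumed superset of fields seen across the documents; missing ones become null.
ALL_FIELDS = ['date', 'name', 'url', 'user_id']

def handler(event, context):
    output = []
    for record in event['records']:
        doc = json.loads(base64.b64decode(record['data']))
        normalized = {field: doc.get(field) for field in ALL_FIELDS}
        payload = (json.dumps(normalized) + '\n').encode('utf-8')  # keep JSON Lines
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload).decode('utf-8'),
        })
    return {'records': output}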
Can you please paste a few lines from the JSON file Firehose is creating? I ran the crawler on a JSON file generated by Kinesis Streams and it was able to parse it successfully.
Did you also try "convert record format" when you create the Firehose job? There you can specify the JSONSerDe or Glue catalog to parse your data.
What solved this for me was to add a newline character '\n' to the end of each payload sent to Firehose.
msg_pkg = (str(json_response) + '\n').encode('utf-8')
record = {'Data': msg_pkg}
put_firehose('agg2-na-firehose', record)
Apparently the Hive JSON SerDe is the default used to process JSON data. After doing this I was able to crawl the JSON data and read it in Athena as well.
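put_firehose above is the answerer's own helper, not a boto3 API; assuming it just wraps the Firehose client, it might look roughly like this:
import boto3

firehose = boto3.client('firehose')

def put_firehose(stream_name, record):
    # record is the {'Data': b'...json...\n'} dict built above; the trailing
    # newline is what keeps the delivered S3 objects in JSON Lines form.
    firehose.put_record(DeliveryStreamName=stream_name, Record=record)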

AWS Spectrum giving blank result for parquet files generated by AWS Glue

We are building an ETL with AWS Glue, and to optimise query performance we are storing the data in Apache Parquet. Once the data is saved on S3 in Parquet format, we use AWS Spectrum to query it.
We successfully tested the entire stack on our development AWS account, but when we moved to our production AWS account we got stuck with a weird problem: when we query, rows are returned, but the data is blank.
The count query, though, returns a good number.
On further investigation we found that the Apache Parquet files in the development AWS account are RLE encoded and the files in the production AWS account are BITPACKED encoded. To make this case stronger, I want to convert BITPACKED to RLE and see whether I am able to query the data.
I am pretty new to Parquet files and couldn't find much help on converting the encodings. Can anybody show me a way of doing it?
Currently our prime suspect is the different encoding, but if you can guess any other issue, I will be happy to explore the possibilities.
We found our configuration mistake: the column names of our external tables and those specified in AWS Glue were inconsistent. We fixed that and are now able to view the data. A bit of a shortfall on the AWS Spectrum side is that it doesn't give an appropriate error message.
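If you run into the same symptom, one way to catch the mismatch early is to pull the column names straight out of the Glue Data Catalog and compare them with what the external table declares. A rough sketch with boto3; the database, table, and expected column list are placeholders:
import boto3

glue = boto3.client('glue')

# Placeholder names: the Glue database/table backing the Spectrum external table.
table = glue.get_table(DatabaseName='prod_db', Name='events')['Table']
glue_columns = {c['Name'] for c in table['StorageDescriptor']['Columns']}

expected = ['event_id', 'event_time', 'user_id']  # what the external table declares
missing = [c for c in expected if c not in glue_columns]
if missing:
    print('Column-name mismatch; these will come back blank in Spectrum:', missing)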