AWS Spectrum giving blank result for parquet files generated by AWS Glue - amazon-s3

We are building an ETL pipeline with AWS Glue. To optimise query performance we store the data in Apache Parquet, and once the data is saved on S3 in Parquet format we use AWS Spectrum to query it.
We successfully tested the entire stack on our development AWS account, but when we moved to our production AWS account we ran into a weird problem: when we query, rows are returned but the data is blank.
The count query, though, returns a sensible number.
On further investigation we found that the Parquet files in the development AWS account are RLE encoded while the files in the production AWS account are BITPACKED encoded. To make this case stronger, I want to convert BITPACKED to RLE and see whether I can then query the data.
I am pretty new to Parquet files and couldn't find much help on converting the encodings. Can anybody point me to a way of doing it?
Currently our prime suspect is the different encoding, but if you can think of any other issue, I will be happy to explore the possibilities.

We found our configuration mistake: the column names of our external tables and the ones specified in AWS Glue were inconsistent. We fixed that and are now able to view the data. A bit of a shortfall on the AWS Spectrum side is that it doesn't give an appropriate error message for this.
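For anyone hitting the same symptom, a quick way to check for that kind of mismatch up front is to compare the column names registered in the Glue Data Catalog against the column names in the Parquet file footer. The sketch below uses boto3 and pyarrow; the database, table, bucket and object key are placeholder names for illustration only.

import boto3
import pyarrow.parquet as pq

# Placeholder names - replace with your own catalog database, table, bucket and key.
DATABASE = "my_glue_db"
TABLE = "my_table"
BUCKET = "my-data-bucket"
KEY = "parquet/part-00000.parquet"

# Column names as registered in the Glue Data Catalog.
glue = boto3.client("glue")
columns = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]["StorageDescriptor"]["Columns"]
glue_cols = {c["Name"].lower() for c in columns}

# Column names as written into the Parquet file footer by the Glue job.
boto3.client("s3").download_file(BUCKET, KEY, "/tmp/sample.parquet")
parquet_cols = {name.lower() for name in pq.read_schema("/tmp/sample.parquet").names}

print("In catalog but not in Parquet:", glue_cols - parquet_cols)
print("In Parquet but not in catalog:", parquet_cols - glue_cols)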

Related

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the two following points, please?
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, just slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage and not have to worry about defining it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to the stage, making that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for Dev to Prod migrations, as well.
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.
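To illustrate the difference, here is a rough sketch of both approaches using the Python connector (snowflake-connector-python). The connection parameters, credentials, object names and the CSV format are all placeholders, not a definitive setup.

import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# A named file format and an external stage pointing at the S3 bucket;
# the stage carries both the credentials and the format.
cur.execute("CREATE OR REPLACE FILE FORMAT my_csv_format TYPE = CSV SKIP_HEADER = 1")
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-bucket/exports/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = my_csv_format
""")

# Loading via the stage: no credentials or format options repeated here.
cur.execute("COPY INTO my_table FROM @my_s3_stage")

# The alternative without a named stage: everything inline on every COPY.
cur.execute("""
    COPY INTO my_table
    FROM 's3://my-bucket/exports/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")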

AWS : How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and large enough, Athena will spin up workers that read partial files. This improves performance through parallelization. Parquet files, for example, are splittable.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object-level logging in CloudTrail for the bucket. You should be able to see all the request parameters, like which byte ranges are read. If you assume a role with a unique session name and make only a single query with the credentials you get, you should be able to isolate all the S3 operations made by Athena for that query.
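As a sketch of that isolation trick with boto3 (the role ARN, database, query and result bucket are placeholders): assume the role with a one-off session name, run a single query with the temporary credentials, and then filter the CloudTrail object-level events on that session name.

import uuid
import boto3

# A unique session name makes the resulting S3 operations easy to spot in CloudTrail.
session_name = "athena-debug-" + uuid.uuid4().hex[:8]
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/athena-query-role",  # placeholder role
    RoleSessionName=session_name,
)["Credentials"]

athena = boto3.client(
    "athena",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Run exactly one query with the isolated credentials.
execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM my_table",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/debug/"},
)
print("Query execution id:", execution["QueryExecutionId"], "session:", session_name)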

Writing data to S3 through Spark DataFrames at scale - S3 connectivity issue caused by S3 503 Slow Down error

We are trying to read and write data to S3 from Spark using AWS EMR clusters, and while scaling up the execution we ran into some issues. When we process a single quarter of data we do not notice the problem, but when we scale up to run multiple quarters of data in parallel, the Spark jobs randomly start failing for one or more quarters while writing the data to S3. Digging deeper, we realized that Spark throws the error while it is writing the data to S3, and that it is caused by an S3 503 Slow Down error.
The Slow Down error only occurs when we exceed the S3 request rate (TPS) for a given path, and the suggestion from S3 is to add random hash values to the S3 path while writing. We tried this using partitionBy, but we learned that only certain hash values (short, two-character hash prefixes) perform better. Has anyone come across a similar issue, and if so, how did you overcome it?
Looking forward!
Krish
The S3 client should be throttling back on those errors; I'm surprised the EMR one wasn't, as the AWS SDK everyone uses does recognise a 503 and will back off and retry.
There'll probably be a config option to set it (for s3a://, it's fs.s3a.attempts.maximum).
If you are using the EMR s3:// connector, you'll have to look for their option.
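As a rough illustration of both ideas (the S3A retry setting from this answer and the two-character hash prefix mentioned in the question), here is a PySpark sketch; the paths, the record_id column and the retry count are placeholders, not recommended values.

from pyspark.sql import SparkSession, functions as F

# fs.s3a.attempts.maximum is the S3A retry knob mentioned above; raising it
# makes the connector back off and retry more times before surfacing the 503.
spark = (
    SparkSession.builder
    .appName("s3a-retry-tuning")
    .config("spark.hadoop.fs.s3a.attempts.maximum", "30")  # placeholder value
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")

# The key-salting idea from the question: derive a short, evenly distributed
# two-character prefix and partition the output by it, so writes spread
# across many S3 key prefixes instead of hammering a single one.
salted = df.withColumn(
    "salt", F.substring(F.sha1(F.col("record_id").cast("string")), 1, 2)
)
salted.write.mode("overwrite").partitionBy("salt").parquet("s3a://my-bucket/output/")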

AWS Glue - how to crawl a Kinesis Firehose output folder from S3

I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.
I have a Kinesis Firehose job dumping streaming data into a S3 bucket. These files consist of a series of discrete web browsing events represented as JSON documents with varying structures (so, say, one document might have field 'date' but not field 'name', whereas another might have 'name' but not 'date').
I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.
I'm trying to run an S3 data catalog crawler, and the problem I'm running into is that the Kinesis output format is not, in itself, valid JSON, which is just baffling to me. Instead it's a bunch of JSON documents separated by a line break. The crawler can automatically identify and parse JSON files, but it cannot parse this.
I thought of writing a lambda function to 'fix' the Firehose file, triggered by its creation on the bucket, but it sounds like a cheap workaround for two pieces that should fit neatly together.
Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?
cheers and thanks in advance
It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to some buffering (time and size) settings, and then write the records to S3 as a single object (see Firehose Data Delivery).
The batching of multiple records into a single file is important if the workload will contain a large number of records, as performance (and S3 costs) for processing many small files from S3 can be less than optimal.
AWS Glue Crawlers and ETL jobs do support processing of 'JSON line' (newline delimited JSON) format.
If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).
I have seen a crawler fail in an instance where differences in the files being crawled forced it into a table-per-file mode, and it hit a limit on the number of tables (see AWS Glue Limits).
I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.
I wrote a lambda script as part of the Kinesis process that forced every document into the same structure, by adding NULL fields where necessary. The crawlers were then able to correctly parse the resulting files and map them to a single table.
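For anyone wanting to do something similar, here is a rough sketch of that kind of normalizing Lambda. It assumes the Firehose objects are newline-delimited JSON and that the function is triggered by the S3 put notification; REQUIRED_FIELDS and the output prefix are placeholders.

import json
import boto3

s3 = boto3.client("s3")

# Placeholder for the full field set; every output record gets all of these keys.
REQUIRED_FIELDS = ["date", "name", "url", "user_agent"]

def handler(event, context):
    # The S3 notification tells us which object Firehose just wrote.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    normalized = []
    for line in body.splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        # Add any missing fields as nulls so every document shares one structure.
        for field in REQUIRED_FIELDS:
            doc.setdefault(field, None)
        normalized.append(json.dumps(doc))

    # Write the normalized JSON-lines object where the crawler will pick it up.
    s3.put_object(Bucket=bucket, Key="normalized/" + key,
                  Body="\n".join(normalized) + "\n")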
Can you please paste a few lines from the JSON file Firehose is creating? I ran the crawler on a JSON file generated by Kinesis Streams and it was able to parse it successfully.
Did you also try "convert record format" when creating the Firehose job? There you can specify the JSONSerDe or the Glue catalog to parse your data.
What solved this for me was to add a newline character '\n' to the end of each payload sent to Firehose.
msg_pkg = (str(json_response) + '\n').encode('utf-8')
record = {'Data': msg_pkg}
put_firehose('agg2-na-firehose', record)
Apparently the Hive JSON SerDe is the default used to process JSON data. After doing this I was able to crawl the JSON data and read it in Athena as well.

Can we run ETL jobs on AWS EFS?

I would like to know if we can run ETL jobs on files in an EFS mount.
If so, how? Is it done using Hive or another service?
Our goal is to reduce all the files in one mount point to one file and store that one file in S3 for better processing.
EFS in itself does not inherently have a particular data warehouse product included. For data warehousing and ETL you can choose what you want to use that operates in the AWS environment.
On to your problem:
You want to concatenate or in some way combine all of the files currently in your EFS mount into a single file and store that in S3, if I understand it correctly.
You do not mention what type of data you have or what type of files you want to combine. That makes a huge difference in how you would do this, so I will have to give general suggestions. If you have different types of data (SQL tables from different databases, documents, non-SQL data), then you need to determine how to combine that data. For that you would be looking at a data integration solution that can accommodate raw data.
AWS has a few products that may assist the process, such as Redshift, Athena and its ETL service Glue, and third-party warehouses such as Snowflake also run on AWS. Which products you add depends on your company's needs and budget.
A more flexible data integration approach would be to use ELT (extract, load, transform) instead of ETL. Basically you would create an appropriate file in your S3 bucket, then extract each file on EFS one at a time and load it into your S3 file. When you query the data in your S3 file you would perform any transformations needed before seeing the query results. Here's an article that explains the differences in more detail: https://blog.panoply.io/etl-vs-elt-the-difference-is-in-the-how.
There are some vendors supporting the ELT process, such as Talend, Hadoop/Hive/Spark, Teradata and Informatica, should you want to investigate options.
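If the files really can be combined by simple concatenation (for example JSON lines or headerless CSV), the combine-and-upload step itself can be a small Python script run on a host with the EFS mount; the mount path, bucket and key below are placeholders.

import os
import boto3

# Placeholder locations; assumes the EFS files are line-oriented text that
# can simply be concatenated.
EFS_MOUNT = "/mnt/efs/exports"
COMBINED = "/tmp/combined.jsonl"
BUCKET = "my-etl-bucket"
KEY = "combined/combined.jsonl"

with open(COMBINED, "wb") as out:
    for root, _dirs, files in os.walk(EFS_MOUNT):
        for name in sorted(files):
            with open(os.path.join(root, name), "rb") as src:
                # Stream each EFS file into the single combined file.
                for chunk in iter(lambda: src.read(1024 * 1024), b""):
                    out.write(chunk)

# upload_file handles multipart uploads for large files automatically.
boto3.client("s3").upload_file(COMBINED, BUCKET, KEY)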