AWS Glue - how to crawl a Kinesis Firehose output folder from S3 - amazon-s3

I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.
I have a Kinesis Firehose job dumping streaming data into an S3 bucket. These files consist of a series of discrete web-browsing events represented as JSON documents with varying structures (so, say, one document might have the field 'date' but not 'name', whereas another might have 'name' but not 'date').
I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.
I'm trying to run an S3 data catalog crawler, and the problem I'm running into is that the Kinesis output format is not, itself, valid JSON, which is just baffling to me. Instead, it's a bunch of JSON documents separated by line breaks. The crawler can automatically identify and parse JSON files, but it cannot parse this.
I thought of writing a Lambda function to 'fix' the Firehose file, triggered by its creation in the bucket, but that sounds like a cheap workaround for two pieces that should fit together neatly.
Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?
cheers and thanks in advance

It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to buffering settings (time and size), and then write the records to S3 as a single object (see the Firehose Data Delivery documentation).
The batching of multiple records into a single file is important if the workload will contain a large number of records, as performance (and S3 costs) for processing many small files from S3 can be less than optimal.
AWS Glue Crawlers and ETL jobs do support processing of 'JSON line' (newline delimited JSON) format.
If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).
I have seen a crawler fail in a case where differences in the files being crawled forced it into a table-per-file mode, and it hit a limit on the number of tables (see AWS Glue Limits).

I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.
I wrote a lambda script as part of the Kinesis process that forced every document into the same structure, by adding NULL fields where necessary. The crawlers were then able to correctly parse the resulting files and map them to a single table.
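A minimal sketch of that normalization step (the field names here are made up; the real list would be the union of the fields your events can contain):

```python
import json

# Hypothetical superset of event fields -- replace with your own.
EXPECTED_FIELDS = ["date", "name", "url", "user_agent"]

def normalize(doc, fields=EXPECTED_FIELDS):
    """Force a document into a fixed structure, filling missing fields
    with None so the crawler infers a single table schema."""
    return {field: doc.get(field) for field in fields}

def normalize_batch(raw):
    """Apply the fix to a Firehose output blob of newline-delimited
    JSON documents, returning newline-delimited JSON again."""
    docs = (json.loads(line) for line in raw.splitlines() if line.strip())
    return "\n".join(json.dumps(normalize(d)) for d in docs)
```

The Lambda itself would read the new S3 object, run it through something like `normalize_batch`, and write the result back.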

Can you please paste a few lines from the JSON file that Firehose is creating? I ran the crawler on a JSON file generated by Kinesis Streams and it was able to parse it successfully.
Did you also try "convert record format" when creating the Firehose job? There you can specify the JSONSerDe or a Glue catalog to parse your data.

What solved this for me was to add a newline character '\n' to the end of each payload sent to Firehose.
msg_pkg = (str(json_response) + '\n').encode('utf-8')
record = {'Data': msg_pkg}
put_firehose('agg2-na-firehose', record)
Because apparently the Hive JSON SerDe is the default used to process JSON data. After doing this I was able to crawl the JSON data and read it in Athena as well.
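The same pattern, sketched with boto3 (the stream name and payload are illustrative):

```python
import json

def firehose_record(payload):
    """Serialize a payload and append a newline so the Hive JSON SerDe
    (and Glue crawlers) see one JSON document per line rather than
    concatenated documents."""
    return {"Data": (json.dumps(payload) + "\n").encode("utf-8")}

def put_to_firehose(stream_name, payload):
    """Sketch: send one newline-terminated record to a delivery stream.
    Requires AWS credentials when actually called."""
    import boto3
    boto3.client("firehose").put_record(
        DeliveryStreamName=stream_name, Record=firehose_record(payload))
```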

Related

AWS : How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and are large enough Athena will spin up workers that will read partial files. This improves performance because of parallelization. Splittable files are for example Parquet files.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object level logging in CloudTrail for the bucket. You should be able to see all the request parameters like what byte ranges are read. If you assume a role and pass a unique session name and make only a single query with the credentials you get you should be able to isolate all the S3 operations made by Athena for that query.
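The assume-role trick can be sketched with boto3 (the role ARN and session-name prefix are placeholders):

```python
import uuid

def unique_session_name(prefix="athena-trace"):
    """A unique STS session name makes this query's S3 calls easy to
    pick out of CloudTrail object-level logs."""
    return f"{prefix}-{uuid.uuid4().hex[:12]}"

def assume_role_session(role_arn):
    """Sketch: assume a role with Athena/S3 access and return a boto3
    Session holding only those credentials, so any query run through it
    is isolated in the logs. Needs AWS credentials when called."""
    import boto3
    name = unique_session_name()
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=name)["Credentials"]
    return name, boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"])
```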

How to save millions of files in S3 so that arbitrary future searches on key/path values are fast

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is, given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more commonplace than others?
Also, out of curiosity, why hasn't S3 supported such a basic use case in a service (S3) that is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have, then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression, but PostgreSQL, MySQL, Aurora, and Elasticsearch all can. So start by listing every file and putting the file name and S3 location into a database-like structure. Then, create a Lambda that is notified of any changes (see this link for more info) and does the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs, Elasticsearch is very flexible and possibly better suited for these types of queries. But a traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server. It could be as simple as API Gateway in front of a Lambda, or something far more complex.
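The change-notification Lambda's core logic could look something like this (a sketch; the real handler would then upsert/delete these keys in your PostgreSQL/Elasticsearch backing store):

```python
def keys_from_s3_event(event):
    """Extract (action, key) pairs from an S3 event notification
    payload, mapping ObjectCreated/ObjectRemoved events to the
    corresponding backing-store operation."""
    changes = []
    for rec in event.get("Records", []):
        name = rec["eventName"]  # e.g. "ObjectCreated:Put"
        action = "delete" if name.startswith("ObjectRemoved") else "upsert"
        changes.append((action, rec["s3"]["object"]["key"]))
    return changes
```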
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.
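A script over the inventory report can be as small as this (a sketch assuming the default CSV inventory format, where the bucket name is the first column and the object key the second):

```python
import csv
import io
import re

def search_inventory(csv_text, pattern):
    """Scan an S3 Inventory CSV and return the object keys whose
    name matches the given regular expression."""
    regex = re.compile(pattern)
    matches = []
    for row in csv.reader(io.StringIO(csv_text)):
        key = row[1]  # second column: object key
        if regex.search(key):
            matches.append(key)
    return matches
```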

Send Bigquery Data to rest endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and hits the max execution time).
I was looking for Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send data via POST.
Do you know another option?
As mentioned in the comments, 500k rows for a POST method is far too much data to be considered as an option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations during its jobs. You can use BigQueryIO (with Python sample code), but if you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data into multiple files using a wildcard URI in your bucket destination, as indicated ('gs://my-bucket/file-name-*.json').
And finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION], you can download the data from your bucket.
Note: you have more available ways to do that in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
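If you'd rather drive the export from code than from the CLI, the client library offers the same operation; a sketch (project/dataset/table names are placeholders):

```python
def export_uri(bucket, prefix, sharded):
    """Build the GCS destination URI for a BigQuery extract job.
    Exports over 1 GB must use a wildcard URI so BigQuery can split
    the output across multiple files."""
    suffix = "-*.csv.gz" if sharded else ".csv.gz"
    return f"gs://{bucket}/{prefix}{suffix}"

def extract_table(project, dataset, table, destination_uri):
    """Sketch of the export via google-cloud-bigquery, equivalent to
    `bq extract --compression GZIP ...`. Needs GCP credentials when
    actually called."""
    from google.cloud import bigquery
    client = bigquery.Client(project=project)
    job = client.extract_table(
        f"{project}.{dataset}.{table}", destination_uri,
        job_config=bigquery.ExtractJobConfig(compression="GZIP"))
    job.result()  # block until the export job finishes
```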

Event Hub, Stream Analytics and Data Lake pipe questions

After reading this article I decided to take a shot on building a pipe of data ingestion. Everything works well. I was able to send data to Event Hub, that is ingested by Stream Analytics and sent to Data Lake. But, I have a few questions regarding some things that seem odd to me. I would appreciate if someone more experienced than me is able to answer.
Here is the SQL inside my Stream Analytics
SELECT
*
INTO
[my-data-lake]
FROM
[my-event-hub]
Now, for the questions:
Should I store 100% of my data in a single file, try to split it into multiple files, or try to achieve one file per object? Stream Analytics is storing all the data inside a single file, as a huge JSON array. I tried setting {date} and {time} as variables, but it is still a huge single file every day.
Is there a way to enforce Stream Analytics to write every entry from Event Hub on its own file? Or maybe limit the size of the file?
Is there a way to set the name of the file from Stream Analytics? If so, is there a way to override a file if a name already exists?
I also noticed the file is available as soon as it is created, and it is written in real time, such that I can see truncated data when I download/display the file. Also, before it finishes, it is not valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is being written? Is it smart enough to ignore the last entry, or will it treat the file as an incomplete array of objects?
Is it better to store the JSON data as an array or each object in a new line?
Maybe I am taking a bad approach to my issue, but I have a huge dataset in Google Datastore (Google's NoSQL solution). I only have access to the Datastore, with an account with limited permissions. I need to store this data in a Data Lake. So I made an application that streams the data from Datastore to Event Hub, which is ingested by Stream Analytics, which writes the files into the Data Lake. It is my first time using these three technologies, but it seems to be the best solution. It is my go-to alternative to ETL chaos.
I am sorry for asking so many questions. I hope someone can help me out.
Thanks in advance.
I am only going to answer the file aspect:
It is normally better to produce larger files for later processing than many very small files. Given you are using JSON, I would suggest limiting the files to a size that your JSON extractor will be able to manage without running out of memory (if you decide to use a DOM-based parser).
I will leave that to an ASA expert.
ditto.
The answer depends here on how ASA writes the JSON. Clients can append to files and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should be only seeing a valid JSON document. If it does not, then you may fail.
That depends on how you plan on processing the data. Note that if you write it as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization and to be more "flexible", I would probably go with one JSON document per line.
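To illustrate the trade-off between the two layouts in a few lines of Python:

```python
import json

events = [{"id": 1}, {"id": 2}]

# As one array: the file is not valid JSON until the closing bracket
# has been written, so a reader mid-stream sees a broken document.
as_array = json.dumps(events)

# As JSON lines: each line is independently parseable, so a reader can
# consume the file even while it is still being appended to.
as_lines = "\n".join(json.dumps(e) for e in events)

def read_lines(text):
    """Parse newline-delimited JSON back into a list of documents."""
    return [json.loads(line) for line in text.splitlines() if line]
```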

AWS Spectrum giving blank result for parquet files generated by AWS Glue

We are building an ETL pipeline with AWS Glue, and to optimise query performance we are storing the data in Apache Parquet on S3. Once the data is saved, we use AWS Spectrum to query it.
We successfully tested the entire stack on our development AWS account, but when we moved to our production AWS account we hit a weird problem: when we query, rows are returned but the data is blank.
A count query, though, returns the expected number of rows.
On further investigation we found that the Apache Parquet files in the development AWS account are RLE encoded, while the files in the production AWS account are BITPACKED encoded. To make this case stronger, I want to convert BITPACKED to RLE and see whether I can then query the data.
I am pretty new to Parquet files and couldn't find much help on converting the encodings. Can anybody point me to ways of doing it?
Currently our prime suspect is the different encoding, but if you can think of any other issue, I will be happy to explore the possibilities.
We found our configuration mistake: the column names of our external tables and the ones specified in AWS Glue were inconsistent. We fixed that and are now able to view the data. One shortfall on the AWS Spectrum side is that it doesn't give an appropriate error message for this.
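A quick way to spot this kind of mismatch is to diff the two column lists side by side; a sketch (the function name is made up, and it assumes Spectrum resolves Parquet columns by name, case-insensitively):

```python
def column_mismatches(spectrum_cols, glue_cols):
    """Compare external-table column names against the Glue catalog,
    position by position. A name mismatch can make a column come back
    blank (NULL) while row counts still look right."""
    pairs = zip((c.lower() for c in spectrum_cols),
                (c.lower() for c in glue_cols))
    return [(s, g) for s, g in pairs if s != g]
```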