How can I search JSON files stored in my S3 bucket from my local computer?

I have thousands of JSON files stored in my S3 bucket and need to perform a grep-style search for the string "name".
I have configured the AWS CLI correctly, as I can print out all the files in the bucket (via the ls command).
I have tried the following two commands:
1)
aws s3 ls s3://training | grep 'name'
This resulted in nothing
2)
aws s3 cp s3://training/*json - | grep 'name'
This gave the error:
download failed: s3://training/*json to - An error occurred (404) when calling the HeadObject operation: Not Found
I know the string "name" definitely exists, as it is a field name that appears multiple times in each JSON file.
Any ideas what I'm doing wrong?

Your first example fails because you are only listing the object names, rather than printing out the contents of the objects.
Your second example fails because you cannot use wildcards in S3 object paths; the S3 API has no wildcard support (the aws s3 commands offer --recursive with --include/--exclude filters instead).
One way to do this would be to sync the files locally, then grep the local files, then delete the local files (or just leave them in place to optimize future syncs). You can use aws s3 sync to do this.
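If you would rather stay in Python than use the CLI, a rough boto3 sketch of the same idea (scanning each object for the string instead of syncing first) could look like this, assuming the bucket is named training:
import boto3

s3 = boto3.client("s3")
bucket = "training"  # assumed bucket name from the question

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".json"):
            continue
        # Stream each object and check for the string, like a remote grep
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8", errors="replace")
        if "name" in body:
            print(key)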
Another option would be to use Amazon Athena to query the JSON content in place, using standard SQL queries.
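For example, once a table has been defined over the JSON files (via a Glue crawler or a CREATE EXTERNAL TABLE statement), a query could be submitted from Python along these lines; the table, database, and results-bucket names here are made up for illustration:
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT name FROM training_json WHERE name IS NOT NULL LIMIT 10",
    QueryExecutionContext={"Database": "training_db"},          # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://training-athena-results/"},  # hypothetical results bucket
)
print(response["QueryExecutionId"])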
Another option would be to create a search index when documents are uploaded to S3. You could trigger a Lambda function which reads the object content and indexes that into another S3 object, or a DynamoDB table, or even Elasticsearch if this is a significant system.
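A minimal sketch of that Lambda idea, assuming a hypothetical DynamoDB table named json-search-index that maps each "name" value to the object that contains it:
import json
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("json-search-index")  # hypothetical index table

def handler(event, context):
    # Triggered by an S3 ObjectCreated event
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        doc = json.loads(body)
        # Index the object key under the value of its "name" field
        table.put_item(Item={"name": str(doc.get("name")), "s3_key": key})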

Related

AWS Glue PySpark job deletes S3 folder unexpectedly

My Glue workflow is DDB -> Glue table (via a crawler) -> S3 (via a Glue job).
I create the S3 folder manually before the workflow runs.
For a DDB table of around 500 MB it always works fine (it takes 7-10 minutes to finish), and the S3 path holds the correct result, e.g. s3://glue_example/ddb_500MB/ (I know the data is correct from checking it in Athena after connecting to S3).
For a DDB table of 50 GB, the folder is deleted by the Glue job (it runs for 2 hours and finishes with no error), e.g. s3://glue_example/ddb_50GB is removed. (I enabled S3 access logging, and in the log GlueJobRunnerSession issued DeleteObject on this folder path.)
This folder-deletion behaviour is not consistent; it happens most of the time, but when I find the folder deleted and recreate it manually, the next run writes correct data into that S3 folder.
The code of the Glue job (Glue 3.0 - supports Spark 3.1, Scala 2, Python 3) is super simple. The only line that writes to S3 is: ApplyMapping_node2.toDF().write.mode("overwrite").format("parquet").save('s3://glue_example/ddb_50GB')
The concurrency of the workflow/job is 1, so the problem is not caused by competing runs.
I use overwrite to keep only the latest data in the folder, but I don't know why it keeps deleting the folder when a large DDB table is the data source. Any ideas?
The issue was due to the whole table being read into a single partition, which is the default behaviour. Increasing dynamodb.splits when reading from the DDB table should help, as it then reads the data in parallel into multiple partitions. Below is an example in PySpark.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        # Read the table in parallel into 100 partitions instead of a single one
        "dynamodb.splits": "100"
    }
)
Refer to the link below for more information:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb

AWS Athena giving an error when trying to query files in S3 that have already been catalogued in the Glue Data Catalog

I am trying to build a data lake using S3 for files that are in .csv.gz format, and then further cleanse/process the data within the AWS environment itself.
I first used AWS Glue to create a data catalog (the crawler was able to identify all tables).
The tables from the catalog are also available in AWS Athena, but when I try to run a SELECT * from a table it gives me the following error:
Error opening Hive split s3://BUCKET_NAME/HEADER FOLDER/FILENAME.csv.gz (offset=0, length=44354) using org.apache.hadoop.mapred.TextInputFormat: Permission denied on S3 path: s3://BUCKET_NAME/HEADER FOLDER/FILENAME.csv.gz.
Could it be that the file cannot be accessed as is because it is in .csv.gz format, or do I need to give the user or role specific access to these files?
You need to fix your permissions. The error says the principal (user/role) that ran the query does not have permission to read an object on S3.

Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 event notifications to put a message in AWS SQS for each file created. A Lambda will then be invoked with a batch of x SQS messages, read the data in those files, combine it, and save the result to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Will this be the best approach or am I missing something?
Instead of receiving a 5 KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup (a Lambda sketch follows the steps below):
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
Kinesis Firehose then batches the data by size or time
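A minimal sketch of that Lambda step, assuming a hypothetical Firehose delivery stream called json-merge-stream:
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Firehose batches these records by size or time before writing to S3
        firehose.put_record(
            DeliveryStreamName="json-merge-stream",  # hypothetical stream name
            Record={"Data": data + b"\n"},
        )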
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second, so it is difficult to query the incoming files in batches (that is, to know which files have been loaded and which have not). A kludge could be a script (sketched after these steps) that does the following:
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
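A rough sketch of that batching script as a scheduled Python Lambda; the bucket, database, and prefix names are assumptions for illustration, and clearing batch/ should only happen once the INSERT INTO query has finished:
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")
BUCKET = "my-data-bucket"  # assumed bucket holding incoming/, batch/ and final/

def handler(event, context):
    # Move whatever is currently in incoming/ to batch/
    moved = False
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="incoming/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=key.replace("incoming/", "batch/", 1),
            )
            s3.delete_object(Bucket=BUCKET, Key=key)
            moved = True
    if moved:
        # Append the batch into the final table; batch/ can be cleared afterwards
        athena.start_query_execution(
            QueryString="INSERT INTO final SELECT * FROM batch",
            QueryExecutionContext={"Database": "my_database"},  # assumed database
            ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
        )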
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
Here is what I think would be simpler:
Have an input folder input/ where the 1-5 KB files land; a data/ folder will hold the merged JSON files with a maximum size of 200 MB.
Have a Lambda that runs every minute, reads a set of files from input/, and appends them to the last file in data/ (using Go/Java, for example; a Python sketch follows below).
The Lambda (with max concurrency of 1) copies a set of 5 KB files from input/ and the X MB file from data/ into its /tmp directory, merges them, uploads the merged file back to data/, and then deletes the processed files from input/.
Whenever the file size crosses 200 MB, start a new file in data/.
The advantage here is that at any instant, if somebody wants the data, it is simply the union of the input/ and data/ folders.
With little tweaks here and there, you can expose a view on top of the input and data folders that presents a final de-duplicated snapshot of the data.
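The answer suggests Go or Java; as a rough illustration of the same merge step in Python (bucket name and the 200 MB threshold are assumptions, and the merge is done in memory rather than via /tmp for brevity):
import boto3

s3 = boto3.client("s3")
BUCKET = "my-json-bucket"       # assumed bucket name
MAX_SIZE = 200 * 1024 * 1024    # 200 MB cap per merged file

def handler(event, context):
    # Collect the small incoming files waiting in input/
    incoming = s3.list_objects_v2(Bucket=BUCKET, Prefix="input/").get("Contents", [])
    if not incoming:
        return

    # Find the newest merged file in data/ that is still under the size cap
    data_objs = s3.list_objects_v2(Bucket=BUCKET, Prefix="data/").get("Contents", [])
    data_objs.sort(key=lambda o: o["LastModified"])
    current = data_objs[-1] if data_objs and data_objs[-1]["Size"] < MAX_SIZE else None

    merged = b""
    target_key = "data/merged-%04d.json" % (len(data_objs) + 1)  # start a new file when needed
    if current:
        merged = s3.get_object(Bucket=BUCKET, Key=current["Key"])["Body"].read()
        target_key = current["Key"]

    # Append each small file, upload the result, then clean up input/
    for obj in incoming:
        merged += s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read() + b"\n"
    s3.put_object(Bucket=BUCKET, Key=target_key, Body=merged)
    for obj in incoming:
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])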

How do I read Athena-created Parquet tables into Python

I created a table using Athena CTAS statements. Per Glue, I see that the table is stored on my s3 bucket. I further confirmed that there are files in the expected place in my s3 bucket.
These files, however, are not Parquet files (they are extension-less). When I try to read them into Python using pd.read_parquet, I get the error "Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.". A similar error occurs when I try to query the table and read the CSV output using pd.read_csv; there, the error is "'utf-8' codec can't decode byte 0xee in position 0: invalid continuation byte". I tried using awswrangler and got the same errors.
I'm pretty sure these errors are related to the SSE_S3 encryption I put on the bucket. However, I'm at a loss as to how I can actually interact with these files outside of Athena.
The resolution is that the default Athena workgroup had CSE_KMS encryption turned on. I couldn't quickly figure out how to pass these options via awswrangler, so I took the shortcut of recreating the table using another workgroup that didn't have encryption.
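For reference, a sketch of what querying via that other workgroup could look like with awswrangler (recent versions accept a workgroup argument; the database, table, and workgroup names here are placeholders):
import awswrangler as wr

# Placeholder names: a workgroup without client-side encryption, plus database/table
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_ctas_table LIMIT 100",
    database="my_database",
    workgroup="unencrypted_workgroup",
)
print(df.head())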

How to see output in Amazon EMR/S3?

I am new to Amazon Web Services and tried to run an application in Amazon EMR.
For that I followed these steps:
1) Created the Hive script, which contains a CREATE TABLE statement, a LOAD DATA statement that loads a file into the table, and a SELECT * query.
2) Created the S3 bucket and uploaded the objects into it: the Hive script and the file to load into the table.
3) Then created the job flow (using the sample Hive program), giving the input, output, and script paths (like s3n://bucketname/script.q, s3n://bucketname/input.txt, s3n://bucketname/out/). I didn't create the out directory; I think it will get created automatically.
4) Then the job flow started to run and after some time I saw the states STARTING, BOOTSTRAPPING, RUNNING, and SHUT DOWN.
5) During the SHUT DOWN state, it terminated automatically, showing a FAILED status for SHUT DOWN.
Then on S3 I didn't see the out directory. How can I see the output? I only saw directories like daemons, nodes, etc.
And also, how can I see the data from HDFS in Amazon EMR?
The output path that you specified in step 3 should contain your results (from your description, that is s3n://bucketname/out/).
If it doesn't, something went wrong with your Hive script. If your Hive job failed, you will find information about the failure/exception in the jobtracker log. The jobtracker log lives under <s3 log location>/daemons/<master instance name>/hadoop-hadoop-jobtracker-<some Amazon internal IP>.log
Only one file in your logs directory will have its S3 key in the above format. This file will contain any exceptions that may have happened. You probably want to concentrate on the bottom end of the file.
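As an illustration, assuming a hypothetical log location of s3://bucketname/logs/, you could locate that jobtracker log and print its tail with boto3 along these lines:
import boto3

s3 = boto3.client("s3")
BUCKET = "bucketname"         # assumed bucket from the question
LOG_PREFIX = "logs/daemons/"  # assumed EMR log location

resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=LOG_PREFIX)
for obj in resp.get("Contents", []):
    if "jobtracker" in obj["Key"]:
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        text = body.decode("utf-8", errors="replace")
        # Exceptions usually show up near the bottom of the log
        print("\n".join(text.splitlines()[-50:]))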