How to see output in Amazon EMR/S3?

I am new to Amazon Web Services and tried to run an application in Amazon EMR.
For that I followed these steps:
1) Created the Hive script, which contains a create table statement, a load data statement that loads a file into the table, and a select * from query.
2) Created the S3 bucket and uploaded the objects into it: the Hive script and the file to load into the table.
3) Created the Job Flow (using the Sample Hive Program), giving the input, output, and script paths (like s3n://bucketname/script.q, s3n://bucketname/input.txt, s3n://bucketname/out/). I didn't create the out directory; I assumed it would be created automatically.
4) The Job Flow started to run, and after some time I saw the states STARTING, BOOTSTRAPPING, RUNNING, and SHUT DOWN.
5) During the SHUT DOWN state it terminated automatically, showing a FAILED status for SHUT DOWN.
On S3 I don't see the out directory. How do I see the output? I only see directories like daemons, nodes, etc.
Also, how can I see the data in HDFS on Amazon EMR?

The output path that you specified in step 3 should contain your results (From your description, it is s3n://bucketname/out/)
If it doesn't, something went wrong with your Hive script. If your Hive job failed, you will find information about the failure/exception in the jobtracker log. The jobtracker log exists under <s3 log location>/daemons/<master instance name>/hadoop-hadoop-jobtracker-<some Amazon internal IP>.log
Only one file in your logs directory will have its S3 key in the above format. This file will contain any exceptions that may have happened; you probably want to concentrate on the bottom of the file.
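If you want to check from outside the console, here is a minimal sketch using boto3 (the library choice is an assumption; only the bucket name and the out/ and daemons/ prefixes come from the question and answer above):
import boto3

s3 = boto3.client("s3")
bucket = "bucketname"  # the bucket from the question

# List whatever the job wrote under the output path s3n://bucketname/out/
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="out/").get("Contents", []):
    print(obj["Key"], obj["Size"])

# Look for the jobtracker log under the daemons/ part of the S3 log location
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="daemons/").get("Contents", []):
    if "jobtracker" in obj["Key"]:
        print("jobtracker log:", obj["Key"])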

Related

AWS Glue PySpark job deletes S3 folder unexpectedly

My Glue workflow is DDB -> Glue table (using a Crawler) -> S3 (using a Glue job).
I create the S3 folder manually before the workflow runs.
For a DDB table of about 500 MB it always works fine (takes 7-10 min to finish), and the S3 path has the correct result, e.g. s3://glue_example/ddb_500MB/ (I know the data is correct by checking it in Athena after connecting to S3).
For a DDB table of 50 GB the folder is deleted by the Glue job (which runs 2 hours to finish, with no error), e.g. s3://glue_example/ddb_50GB is deleted. (I enabled S3 access logging, and in the log GlueJobRunnerSession used DeleteObject on this folder path.)
This folder-deleting behavior is not consistent; it happens most of the time, but if I find the folder deleted and recreate it manually, the next run puts correct data in that S3 folder.
The code of the Glue job (Glue 3.0, which supports Spark 3.1, Scala 2, Python 3) is very simple. The only line that writes to S3 is: ApplyMapping_node2.toDF().write.mode("overwrite").format("parquet").save('s3://glue_example/ddb_50GB')
The concurrency of the workflow/job is 1, so it's not a problem caused by competing runs.
I use overwrite so the folder contains only the latest data, but I don't know why this keeps deleting the folder when a large DDB table is the data source. Any idea?
The issue was due to the whole table being read into a single partition, which is the default behaviour. Increasing dynamodb.splits while reading from the DDB table should help, as it reads the data in parallel into multiple partitions. Below is an example in PySpark.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        # Read the table in parallel into 100 partitions instead of one.
        "dynamodb.splits": "100",
    },
)
Refer to the link below for more information:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
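For completeness, a hedged end-to-end sketch that combines the dynamodb.splits setting above with the overwrite write from the question; the GlueContext setup and the mapping step are assumed boilerplate from a standard Glue 3.0 job, not something stated in the answer:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the DynamoDB table in parallel into many partitions instead of one.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100",
    },
)

# ... ApplyMapping / other transforms as in the original job ...

# Overwrite the target prefix with only the latest data, as in the question.
dyf.toDF().write.mode("overwrite").format("parquet").save("s3://glue_example/ddb_50GB")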

Reading data from GCS with BigQuery fails with "Not Found", but the data (files) exist

I have a service that is constantly updating files in a GCS bucket laid out in Hive format:
bucket
  device_id=aaaa
    month=01
      part-0.parquet
    month=02
      part-0.parquet
    ...
  device_id=bbbb
    month=01
      part-0.parquet
    month=02
      part-0.parquet
    ...
If today we are in month=02 and I run the following in BigQuery:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '02';
I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet
I checked, and the file was there when the query ran.
If I run
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '01';
I get results without any errors. I guess the error is related to the fact that I'm modifying the data while querying it, but as I understand it, this should not be the case with GCS; this is from their docs:
Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.
I saw some posts suggesting this could be related to my bucket being multi-region.
Any other insights?
There are a few reasons why you could get this error.
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
Regarding consistency: while metadata updates on buckets are strongly consistent for read-after-metadata-update operations, the process can take time to finish the changes.
Using a multi-region bucket is not recommended.
In this case, it is likely a consistency issue, because you are updating the files in GCS at the same time you are executing the query. When the parquet file was available to read at query time you didn't get the error, but the next time the parquet file wasn't available because the service was updating it, and you got the error.
Unfortunately, there is no simple way to solve this problem, but here are some options:
Add a Pub/Sub notification on the bucket and/or files and kick off your query after the service has finished updating the files.
Build a workflow that blocks updates to the files in the bucket until the query finishes.
If the query fails with "not found" for file ABCD and you have verified that ABCD exists in GCS, retry the query X times (a sketch of this option follows the list).
Back up your data to another location where you won't update the files constantly, e.g. just once a day.
Move the data into managed storage, where you won't have this problem because you can take snapshots.

How to fix corrupted delta lake table on AWS S3

I ended up manually deleting some Delta Lake entries (hosted on S3).
Now my spark job is failing because the delta transaction logs point to files that do not exist in the file system.
I came across this https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-fsck.html
but I am not sure how should I run this utility in my case.
You can do that easily by following the document you attached.
I have done it as below, assuming you have a Hive table on top of your S3 data:
%sql
FSCK REPAIR TABLE schema.testtable DRY RUN
Using DRY RUN will list the files that need to be deleted. You can run the above command first and verify the files that actually need to be deleted.
Once you have verified them, you can run the command without DRY RUN, and it should do what you need:
%sql
FSCK REPAIR TABLE schema.testtable
Now, if you have not created a Hive table and just have a path (Delta table) containing the files, you can do it as below:
%sql
FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN
I am doing this from Databricks and have mounted my S3 bucket path in Databricks.
You need to make sure you have the backtick (`) after delta. and before the actual path, otherwise it won't work.
Here, too, to perform the actual repair operation, you can remove DRY RUN from the above command, and it should do what you want.
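If you prefer a Python notebook cell rather than %sql, the same commands can be issued through spark.sql (a sketch; the table path is the placeholder used above, and spark is the Databricks notebook session):
# Dry run first: list the files the transaction log references that no longer exist.
dry_run = spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN")
dry_run.show(truncate=False)

# After verifying the list, run the actual repair to remove the dangling file entries.
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable`")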

Snowflake COPY INTO Command return

I have a question about the Snowflake COPY INTO command; I searched but did not find my answer.
Suppose I want to push data from Snowflake to an S3 bucket using the COPY INTO command in my code. How will I know when the file is ready or the command has completed, so that I can read the file from the S3 location?
You can do the following things to check whether your COPY INTO was successful or at least to retrieve some useful information about your command:
Set DETAILED_OUTPUT = TRUE and check the result (this means you get information about every single unloaded file in the output; if it is set to FALSE you only receive information about the whole unload process)
Query your stage by using the syntax that can be found here https://docs.snowflake.com/en/user-guide/querying-stage.html
Query the metadata of your staged data by using metadata$filename and metadata$file_row_number: https://docs.snowflake.com/en/user-guide/querying-metadata.html
Keep in mind that even a failed COPY-command can result in some unloaded files on your stage.
More information can also be found at https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html#validating-data-to-be-unloaded-from-a-query
It depends on how you're actually running this.
Any Snowflake interface will run the command synchronously, so the query will just spin until it's complete.
Any async call would need extra checks; the easiest one is the web interface (it will show the status of the query, and when it completes, the unload is complete).
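If you are driving the unload from code, a hedged sketch of the synchronous approach using the snowflake-connector-python package; the connection parameters, stage, and table names are placeholders, not anything from the question:
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()
# execute() blocks until the unload finishes, so once it returns the files exist.
cur.execute("""
    COPY INTO @my_s3_stage/exports/
    FROM my_table
    FILE_FORMAT = (TYPE = CSV)
    DETAILED_OUTPUT = TRUE
""")
# With DETAILED_OUTPUT = TRUE, each result row describes one unloaded file.
for file_name, file_size, row_count in cur.fetchall():
    print(file_name, file_size, row_count)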

Loading multiple files

The following is working as expected.
./bq --nosync load -F '^' --max_bad_record=30000 myvserv.xa one.txt ip:string,cb:string,country:string,telco_name:string, ...
1) But how do I send two CSV files, one.txt and two.txt, in the same command?
2) Can I not cat a file and then pipe it to the bq command?
3) What does --nosync mean?
Unfortunately, you can't (yet) upload two files with the same command; you'll have to run bq twice. (If you're loading data from Google Cloud Storage, though, you can specify multiple gs:// URLs separated by commas.)
Nope, bq doesn't (yet) support reading upload data from stdin, though that's a great idea for a future version.
If you just run "bq load", bq will create a load job on the server and then poll for completion. If you specify the --nosync flag, it will just create the load job and then exit without polling. (If desired, you can poll for completion separately using "bq wait".)
For 1), as Jeremy mentioned, you can't import two local files at once in the same command. However, you can start two parallel loads to the same table -- loads are atomic, and append by default, so this should do what you want and may be faster than importing both in a single job since the uploads will happen in parallel.
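A hedged sketch of that parallel-load approach using the google-cloud-bigquery Python client instead of the bq CLI (the client choice and the explicit WRITE_APPEND setting are assumptions; the schema is truncated where the original command elides it):
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="^",
    max_bad_records=30000,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # loads append by default
    schema=[
        bigquery.SchemaField("ip", "STRING"),
        bigquery.SchemaField("cb", "STRING"),
        bigquery.SchemaField("country", "STRING"),
        bigquery.SchemaField("telco_name", "STRING"),
        # ... remaining fields from the original schema ...
    ],
)

def load(path):
    # Upload one local file as its own load job and wait for it to finish.
    with open(path, "rb") as f:
        job = client.load_table_from_file(f, "myvserv.xa", job_config=job_config)
    return job.result()

# Run the two loads in parallel; loads to the same table are atomic.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(load, ["one.txt", "two.txt"]))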