Hive on EMR not reading all files at S3 location - amazon-s3

I created a hive table using the following syntax, pointed to an S3 folder:
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder';
When I execute a query, such as:
SELECT * FROM daily_input_file WHERE log_day IN ('20160508', '20160507');
I expect that records will be returned.
I have verified that this data is contained in the files in that folder. In fact, if I copy the file that contains this particular data into a new folder, create a table for that new folder and run the query, I get the results. I also get results from other files (in fact from most files) within the original folder.
The contents of s3://my-bucket/my-folder are simple. There are no subdirectories within my folder. There are two varieties of file names (a and b), all are prefixed with the date they were created (YYYYMMDD_), all have an extension of .txt000.gz. Here are some examples:
20160508_a.txt000.gz
20160508_b.txt000.gz
20160509_a.txt000.gz
20160509_b.txt000.gz
So what might be going on? Is there a limit to the number of files within a single folder that can be processed from S3? Or is something else the culprit?
Here are the versions used:
Release label: emr-4.7.0
Hadoop distribution: Amazon 2.7.2
Applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1

The behavior you are seeing with the S3 files is a known issue in EMR release 4.7.0, not a general limitation of EMR or of reading from S3.
Use EMR release 4.7.1 or later.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
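If you launch clusters programmatically, picking up the fix is just a matter of passing a newer release label. Below is a minimal, hedged sketch using boto3; the cluster name, region, instance types, and roles are illustrative placeholders, not taken from the question.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Launch a cluster on emr-4.7.1 (or later) instead of emr-4.7.0.
response = emr.run_job_flow(
    Name="hive-s3-cluster",           # placeholder name
    ReleaseLabel="emr-4.7.1",         # the release line with the fix
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}, {"Name": "Hue"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])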

Related

AWS Glue PySpark job deletes S3 folder unexpectedly

My glue workflow is DDB -> GLUE table (by using Crawler) -> S3 (by using GLUE job)
I create the S3 folder manually before the workflow runs.
For a DDB table of around 500 MB it always works fine (the job takes 7-10 minutes to finish) and the S3 path ends up with the correct result, e.g. s3://glue_example/ddb_500MB/ (I know the data is correct by checking it in Athena after connecting it to S3).
For a DDB table of around 50 GB the folder is deleted by the Glue job (which runs 2 hours to finish with no error), e.g. s3://glue_example/ddb_50GB is deleted. (I enabled S3 access logging, and in the log GlueJobRunnerSession used DeleteObject on this folder path.)
This folder-deleting behavior is not consistent; it happens most of the time, but if I find the folder deleted and recreate it manually, the next run writes correct data into that S3 folder.
The code of the Glue job (Glue 3.0 - supports Spark 3.1, Scala 2, Python 3) is super simple. The only line that writes to S3 is: ApplyMapping_node2.toDF().write.mode("overwrite").format("parquet").save('s3://glue_example/ddb_50GB')
The concurrency of the workflow/job is 1, so the problem is not caused by competing runs.
I use overwrite so that the folder only holds the latest data, but I don't know why the job keeps deleting the folder when a large DDB table is the data source. Any idea?
The issue was due to the whole table being read into a single partition, which is the default behaviour. Increasing dynamodb.splits while reading from the DDB table should help, as it reads the data in parallel into multiple partitions. Below is an example in PySpark.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the DynamoDB table into 100 parallel partitions instead of one.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100",
    },
)
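Writing the frame back out then mirrors the overwrite line from the question, now spread across those partitions (the path below is the one from the question):

dyf.toDF().write.mode("overwrite").format("parquet").save("s3://glue_example/ddb_50GB")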
Refer to the link below for more information:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb

Is it possible to merge two parquet directories on HDFS?

I have two parquet directories on my HDFS with the same schema. I want to merge these two directories into one parquet directory, to be able to create an external Hive table from it.
I have googled my problem, but almost all of the results are about merging small parquet files into larger parquet files.
As long as the parquet files have the same schema, you can simply put them in the same directory. Hive will process all files that it finds in an external table's directory (except a few special files with specific names), so you can simply put your data there and Hive will find it. (In older Hive versions this was true for non-external tables as well. In newer Hive versions, however, it is only true for external tables thus you should not tamper with the contents of so-called managed tables.)
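If you prefer to script the move, a minimal sketch (with hypothetical paths, and assuming the hdfs CLI is on the PATH) is just a file-level copy; no parquet-level merge step is needed. Afterwards you can point the external table's LOCATION at the combined directory.

import subprocess

# Copy every file from the second parquet directory into the first; Hive will
# then pick up all files it finds under the first directory.
src = "/data/parquet_dir_2"   # hypothetical source directory
dst = "/data/parquet_dir_1"   # hypothetical destination directory
subprocess.run(["hdfs", "dfs", "-cp", f"{src}/*", f"{dst}/"], check=True)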

Get most recent bucket data out of 10 buckets in Apache Hive

In Apache Hive, I have a table with 10 buckets. Out of those 10 buckets I would like to get the most recent bucket's data. Is there any way to identify which bucket was created most recently?
A Hive table bucket is a file. You can get creation times along with file names using the hadoop fs -ls command, and Hive has the INPUT__FILE__NAME virtual column. So you can get the file name in the shell, then pass it as a parameter to the Hive script and filter by it. But bear in mind that the files are created in parallel, and which one is later or earlier may be connected with neither the data nor the command start time.
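A rough sketch of that approach (the table path, table name, and temp handling are hypothetical, and it assumes the hadoop and hive CLIs are available) could look like this:

import subprocess

table_dir = "/user/hive/warehouse/mydb.db/my_bucketed_table"  # hypothetical path

# `hadoop fs -ls` prints permissions, replication, owner, group, size, date,
# time, path for each file; sort on date+time to find the newest bucket file.
out = subprocess.run(["hadoop", "fs", "-ls", table_dir],
                     capture_output=True, text=True, check=True).stdout
files = [line.split() for line in out.splitlines() if line.startswith("-")]
latest = max(files, key=lambda f: (f[5], f[6]))[7].rsplit("/", 1)[-1]

# INPUT__FILE__NAME is a full URI, so match on the file-name suffix.
query = ("SELECT * FROM my_bucketed_table "
         "WHERE INPUT__FILE__NAME LIKE '%${hiveconf:latest_file}'")
subprocess.run(["hive", "--hiveconf", f"latest_file={latest}", "-e", query],
               check=True)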

Google Cloud Dataprep - Scan for multiple input csv and create corresponding bigquery tables

I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should be run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function then moves one file from the raw folder into the queue folder if the queue folder is empty, and does nothing otherwise (a sketch of such a function follows this list of steps).
Dataprep scans the queue folder on its schedule. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and proceeds to move one file from the raw folder to the queue folder, if there is any.
Repeat until all tables are created.
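Here is a minimal sketch of the Cloud Function in step 3 (the bucket and folder names are hypothetical, and it assumes the google-cloud-storage client library): it moves one file from raw into queue only when queue is currently empty.

from google.cloud import storage

def promote_next_file(event, context):
    # The trigger (schedule or object-change event) is not prescribed here.
    client = storage.Client()
    bucket = client.bucket("my-dataprep-bucket")   # hypothetical bucket

    # Do nothing if queue/ already holds a file.
    queued = [b for b in client.list_blobs(bucket, prefix="queue/")
              if not b.name.endswith("/")]
    if queued:
        return

    raw_files = [b for b in client.list_blobs(bucket, prefix="raw/")
                 if b.name.endswith(".csv")]
    if not raw_files:
        return

    blob = raw_files[0]
    bucket.copy_blob(blob, bucket, blob.name.replace("raw/", "queue/", 1))
    blob.delete()  # copy + delete = move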
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestions for my use case are appreciated.

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt),
but the way it currently works is that the above script creates a folder named 'MyData.txt'
and then generates a file with a random name under that folder.
Is it at all possible to specify a file name in S3 using HIVE?
Thank you!
A few things:
There are 2 different ways hadoop can write data to s3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in hadoop handle directories pretty seamlessly so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external hive table is always treated as a directory for the reasons above.
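If you really need a single object with a specific name, one hedged workaround (the paths, temporary local file, and final key are illustrative, and it assumes the cluster's hadoop fs can read the s3:// scheme) is to merge the output directory afterwards and re-upload it under the name you want:

import subprocess
import boto3

# Concatenate all of the part files Hive wrote under the LOCATION directory
# into a single local file.
subprocess.run(["hadoop", "fs", "-getmerge",
                "s3://myBucket/DataFiles/MyData.txt/", "/tmp/MyData.txt"],
               check=True)

# Upload the merged file back to S3 under a single, explicit key.
boto3.client("s3").upload_file("/tmp/MyData.txt", "myBucket",
                               "DataFiles/merged/MyData.txt")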