I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt), but the way the script currently works, it creates a folder named 'MyData.txt' and then generates a file with a random name inside that folder.
Is it at all possible to specify a file name in S3 using HIVE?
Thank you!
A few things:
There are two different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory, for the reasons above.
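If a predictable layout is all you need, one workaround (a sketch based on your table; the directory name MyDataOutput/ is just an example, and the files inside it will still get generated names) is to point LOCATION at a directory rather than a file:

CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyDataOutput/';  -- a directory, not a file name

INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;

You can then merge or rename the part files outside of Hive if a single MyData.txt is required.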
I'm trying to bulk load 28 parquet files into Snowflake from an S3 bucket using the COPY command and regex pattern matching. But each time I run the command in my worksheet, I'm getting the following bad response:
Copy executed with 0 files processed.
Inside a folder in my S3 bucket, the files I need to load into Snowflake are named as follows:
S3://bucket/foldername/filename0000_part_00.parquet
S3://bucket/foldername/filename0001_part_00.parquet
S3://bucket/foldername/filename0002_part_00.parquet
...
S3://bucket/foldername/filename0026_part_00.parquet
S3://bucket/foldername/filename0027_part_00.parquet
Using the Snowflake worksheet, I'm trying to load data into a pre-existing table, using the following commands:
CREATE or REPLACE file format myparquetformat type = 'parquet';
COPY INTO [Database].[Schema].[Table] FROM (
SELECT $1:field1::VARCHAR(512), $1:field2::INTEGER, $1:field3::VARCHAR(512),
$1:field4::DOUBLE, $1:field5::VARCHAR(512), $1:field6::DOUBLE
FROM @AWS_Snowflake_Stage/foldername/
(FILE_FORMAT => 'myparquetformat', PATTERN =>
'filename00[0-9]+_part_00.parquet')
)
on_error = 'continue';
I'm not sure why these commands fail to run.
In every example I've seen in the Snowflake documentation, "PATTERN" is only used within the COPY command outside of a SELECT query. I'm not sure if it's possible to use PATTERN inside a SELECT query.
In this case, I think it's necessary to use the SELECT query within the COPY command, since I'm loading in parquet data that would first need to be cast from a single column ($1) into multiple columns with appropriate data types for the table (varchar, integer, double). The SELECT query is what enables the importing of the parquet file into the existing table -- is it possible to find a way around this using a separate staging table?
It's a huge pain to load the parquet files one at a time. Is there any way to bulk load these 28 parquet files using the Snowflake worksheet? Or is it better to try to do this using a Python script and the Snowflake API?
The below worked for me. I agree my pattern is quite simple since it just selects every parquet file in the location, but you can verify whether your regex pattern is valid.
COPY INTO <TABLE_NAME> FROM (
SELECT
$1:col_name_1,
$1:col_name_2
FROM @STAGE_NAME/<PATH_TO_FILES>
)
PATTERN = '.*.parquet'
FORCE = TRUE
FILE_FORMAT = (
TYPE = 'parquet'
);
Side note: keep in mind that Snowflake has a safety check that skips files which have already been staged and loaded once successfully (which is why FORCE = TRUE appears above).
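Applying that shape to the files in the question, a sketch (assuming the stage is @AWS_Snowflake_Stage and reusing the casts from the question; adjust names as needed) would keep PATTERN outside the SELECT:

COPY INTO [Database].[Schema].[Table] FROM (
  SELECT $1:field1::VARCHAR(512), $1:field2::INTEGER, $1:field3::VARCHAR(512),
         $1:field4::DOUBLE, $1:field5::VARCHAR(512), $1:field6::DOUBLE
  FROM @AWS_Snowflake_Stage/foldername/
)
PATTERN = '.*filename00[0-9]+_part_00[.]parquet'
FILE_FORMAT = (FORMAT_NAME = 'myparquetformat')
ON_ERROR = 'continue';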
My question is somewhat similar to the post below. I want to download some data from a Hive table using a select query, but because the data is large, I want to write it as an external table at a given path so that I can create a CSV file. I use the below code:
create external table output(col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output';

INSERT OVERWRITE TABLE output
Select col1, col2 from atable limit 1000;
This works fine and creates a file named in the 000000_0 format, which can be copied as a CSV file.
But my question is: how do I ensure that the output will always be a single file? If there is no partition defined, will it always be a single file? What is the rule Hive uses to split the output into files?
I saw a few similar questions, like the one below, but they discuss HDFS file access.
How to point to a single file with external table
I know the below alternative, but I use a hive connection object to execute queries from a remote node.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
You can set the below property before doing the overwrite:
set mapreduce.job.reduces=1;
Note: if the Hive engine doesn't allow this parameter to be modified at runtime, whitelist it by setting the below property in hive-site.xml:
hive.security.authorization.sqlstd.confwhitelist.append=mapreduce.job.*|mapreduce.map.*|mapreduce.reduce.*
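Putting it together with the table from the question, a sketch (reusing the output and atable names from above; note this forces a single reducer, so it helps when the query actually has a reduce stage) would look like:

set mapreduce.job.reduces=1;

INSERT OVERWRITE TABLE output
Select col1, col2 from atable limit 1000;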
I need to get specific data from a .gz file.
How do I write the SQL?
Can I just query the file as if it were a database table, like this:
Select * from gz_File_Name where key = 'keyname' limit 10;
But it always comes back with an error.
You need to create a Hive external table over this file's location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
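Once the table exists, the query from the question becomes an ordinary Hive query, for example (assuming the key you are filtering on maps to col_one):

select * from hive_schema.your_table where col_one = 'keyname' limit 10;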
See the manual on Hive tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 under the hood does not store folders; filenames containing slashes ('/') are presented by different tools, such as Hive, as a folder structure. See here: https://stackoverflow.com/a/42877381/2700344
I created a hive table using the following syntax, pointed to an S3 folder:
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file (
log_day STRING,
resource STRING,
request_type STRING,
format STRING,
mode STRING,
count INT
) row format delimited fields terminated by '\t' LOCATION 's3://my-bucket/my-folder';
When I execute a query, such as:
SELECT * FROM daily_input_file WHERE log_day IN ('20160508', '20160507');
I expect that records will be returned.
I have verified that this data is contained in the files in that folder. In fact, if I copy the file that contains this particular data into a new folder, create a table for that new folder and run the query, I get the results. I also get results from other files (in fact from most files) within the original folder.
The contents of s3://my-bucket/my-folder are simple. There are no subdirectories within my folder. There are two varieties of file names (a and b); all are prefixed with the date they were created (YYYYMMDD_), and all have an extension of .txt000.gz. Here are some examples:
20160508_a.txt000.gz
20160508_b.txt000.gz
20160509_a.txt000.gz
20160509_b.txt000.gz
So what might be going on? Is there a limit to the number of files within a single folder that can be processed from S3? Or is something else the culprit?
Here are the versions used:
Release label: emr-4.7.0
Hadoop distribution: Amazon 2.7.2
Applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1
The behavior you are experiencing with the S3 files is an issue specific to EMR release 4.7.0, not a general limitation.
Use EMR release 4.7.1 or later.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files, e.g. for tab-delimited files:
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids'
Now you can use Hive to access this data.
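For the WHERE IN part of the question, a sketch (the larger table name my_big_table is hypothetical; substitute your own) could use either an IN subquery or a LEFT SEMI JOIN against the external table:

SELECT t.*
FROM my_big_table t
WHERE t.id IN (SELECT id FROM my_ids);

-- or, equivalently, a semi join, which older Hive versions handle well:
SELECT t.*
FROM my_big_table t
LEFT SEMI JOIN my_ids i ON (t.id = i.id);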