Spark SQL INSERT OVERWRITE appends data instead of overwriting - apache-spark-sql

Using an external table.
The process doesn't have write permissions to /home/user/.Trash, so calling INSERT OVERWRITE generates the following warning:
2018-08-29 13:52:00 WARN TrashPolicyDefault:141 - Can't create trash directory: hdfs://nameservice1/user/XXXXX/.Trash/Current/data/table_1/key1=2
org.apache.hadoop.security.AccessControlException: Permission denied: user=XXXXX, access=EXECUTE, inode="/user/XXXXX/.Trash/Current/data/table_1/key1=2":hdfs:hdfs:drwx
Questions:
Could we avoid the move to .Trash? Using TBLPROPERTIES ('auto.purge'='true') on external tables doesn't work (see the sketch just below).
INSERT OVERWRITE should rewrite the partition data; instead, the new data is appended to the partition.
Code sample
Creating the table:
spark.sql("CREATE EXTERNAL TABLE table_1 (id string, name string) PARTITIONED BY (key1 int) stored as parquet location 'hdfs://nameservice1/data/table_1'")
spark.sql("insert into table_1 values('a','a1', 1)").collect()
spark.sql("insert into table_1 values ('b','b2', 2)").collect()
spark.sql("select * from table_1").collect()
Overwriting the partition:
spark.sql("insert OVERWRITE table table_1 values ('b','b3', 2)").collect()
results in:
[Row(id=u'a', name=u'a1', key1=1),
Row(id=u'b', name=u'b2', key1=2),
Row(id=u'b', name=u'b3', key1=2)]

Add PARTITION(column) to your INSERT OVERWRITE statement.
val spark = SparkSession.builder
  .appName("test")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("drop table table_1")
spark.sql("CREATE EXTERNAL TABLE table_1 (id string, name string) PARTITIONED BY (key1 int) stored as parquet location '/directory/your location/'")
spark.sql("insert into table_1 values('a','a1', 1)")
spark.sql("insert into table_1 values ('b','b2', 2)")
spark.sql("select * from table_1").show()
spark.sql("insert OVERWRITE table table_1 PARTITION(key1) values ('b','b3', 2)")
spark.sql("select * from table_1").show()

Related

Copying the structure of temp hive table to new table and adding additional table properties

I want to copy the structure of a full-load temp table and add additional table properties like partitioned by (partition_col), Format='ORC'.
Temp table:
Create table if not exists tmp.temp_table (
  id int,
  name string,
  datestr string
)
The temp table got created.
Final table :
CREATE TABLE IF NOT EXISTS tmp.{final_table_name} (
LIKE tmp.temp_table
)
WITH (
FORMAT = 'ORC'
partitioned by('datestr')
)
But I am getting the error "Error: Error while compiling statement: FAILED: ParseException line 1:63 missing EOF at 'WITH' near 'temp_table' (state=42000,code=40000)"
Is there any solution to achieve this functionality?
You should not use LIKE; instead, use CREATE TABLE AS (CTAS) with select * from mytab where 1=2.
CREATE TABLE IF NOT EXISTS tmp.{final_table_name}
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['datestr']
)
AS select * from tmp.temp_table where 1=2
LIKE will create an empty table with the exact same definition. CTAS will use the same column sequence and data types/lengths to build the new table, and because of the WHERE 1=2 predicate the SELECT returns no rows, so the new table is created empty with your extra properties applied.
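If the target is plain Hive rather than Presto (the WITH (...) property list above is Presto syntax, not HiveQL), a sketch of the equivalent, with tmp.final_table standing in for the real table name: explicit DDL plus an INSERT, since Hive's CTAS does not support PARTITIONED BY directly.
-- hedged HiveQL sketch: create the partitioned ORC table explicitly...
CREATE TABLE IF NOT EXISTS tmp.final_table (
  id int,
  name string
)
PARTITIONED BY (datestr string)
STORED AS ORC;
-- ...then load it with dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE tmp.final_table PARTITION (datestr)
SELECT id, name, datestr FROM tmp.temp_table;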

How to create partitioned table from other tables in Amazon Athena?

I am looking to create a table from an existing table in Amazon Athena. The existing table is partitioned on partition_0, partition_1, and partition_2 (all strings) and I would like this partition to carry over. Here is my code:
CREATE TABLE IF NOT EXISTS newTable
AS
Select x, partition_0, partition_1, partition_2
FROM existingTable T
PARTITIONED BY (partition_0 string, partition_1 string, partition_2 string)
Trying to run this gives me an error at the FROM line, saying "mismatched input 'by'. expecting: '(', ','," ... Status code: 400; error code: InvalidRequestException.
Not sure what syntax I am missing here.
This is the syntax for creating new tables:
CREATE TABLE new_table
WITH (
  format = 'parquet',
  external_location = 's3://example-bucket/output/',
  partitioned_by = ARRAY['partition_0', 'partition_1', 'partition_2']
)
AS SELECT * FROM existing_table
See the documentation for more examples: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html#ctas-example-partitioned
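One caveat worth knowing: Athena requires the partition columns to be the last entries in the SELECT list, in the same order as in partitioned_by. A sketch applied to the question's table (the external_location value is an assumption):
-- hedged sketch: regular column x first, partition columns last
CREATE TABLE IF NOT EXISTS newTable
WITH (
  format = 'parquet',
  external_location = 's3://example-bucket/newTable/',
  partitioned_by = ARRAY['partition_0', 'partition_1', 'partition_2']
)
AS SELECT x, partition_0, partition_1, partition_2
FROM existingTable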

How to output a table as a parquet file in spark-sql, not spark-shell?

It is easy to read a table from a CSV file using spark-sql:
CREATE TABLE MyTable (
X STRING,
Y STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'input.csv' INTO TABLE MyTable;
But how can I output this result as Parquet file?
PS: I know how to do that in spark-shell, but it is not what I'm looking for.
You have to create a table with the schema of your results in Hive, stored as Parquet. After getting the results you can export them into the Parquet-format table like this:
set hive.insert.into.external.tables = true;
create external table mytable_parq (/* use your source table's column DDL */) stored as parquet location '/hadoop/mytable';
insert into mytable_parq select * from mytable;
or
insert overwrite directory '/hadoop/mytable' STORED AS PARQUET select * from MyTable;
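In spark-sql itself, a shorter sketch (MyTableParquet is a hypothetical table name) is a plain CTAS, which writes the result directly as Parquet:
-- hedged spark-sql sketch: native CTAS with the Parquet data source
CREATE TABLE MyTableParquet USING parquet AS
SELECT * FROM MyTable;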

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing hive metastore?
Here's an example of my hive metastore create statement with the regex serde bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input, and I looked in both the Hive and SerDe documentation but was unable to find a reference to adding data to an existing store.
A possible solution is to use partitions:
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
...
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
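If you can instead rename the S3 prefixes into Hive's key=value layout (e.g. .../mypartcol=folder1/), a hedged alternative is to let Hive discover every partition in one pass rather than adding them one by one:
-- sketch, assuming the data sits under .../mypartcol=<value>/ prefixes
MSCK REPAIR TABLE test;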
For @TheProletariat (the OP):
It seems there is no need for the RegexSerDe, since the columns are delimited by a space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true"), which makes the last column absorb the remainder of each line.
create external table test
(
  field1 bigint,
  field2 string,
  field3 string,
  field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true");
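To cover the original multi-folder requirement, a sketch combining this delimited layout with the partition approach above (test_delim and the folder paths are placeholders, not real names):
-- hedged sketch: same delimited parsing, one partition per source folder
create external table test_delim
(
  field1 bigint,
  field2 string,
  field3 string,
  field4 string
)
partitioned by (mypartcol string)
row format delimited
fields terminated by ' '
location 's3://whatever/as/long/as/it/is/empty'
tblproperties ("serialization.last.column.takes.rest"="true");
alter table test_delim add partition (mypartcol='folder1') location 's3://path/to/1st/of/10/folders/';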

Hive Json SerDE for ORC or RC Format

Is it possible to use a JSON SerDe with the RC or ORC file formats? I am trying to insert into a Hive table with the ORC file format and store it on Azure blob storage as serialized JSON.
Apparently not
insert overwrite local directory '/home/cloudera/local/mytable'
stored as orc
select '{"mycol":123,"mystring":"Hello"}';
create external table verify_data (rec string)
stored as orc
location 'file:///home/cloudera/local/mytable';
select * from verify_data;
rec
{"mycol":123,"mystring","Hello"}
create external table mytable (myint int, mystring string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as orc
location 'file:///home/cloudera/local/mytable';
myint mystring
Failed with exception java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text
JsonSerDe.java:
...
import org.apache.hadoop.io.Text;
...
@Override
public Object deserialize(Writable blob) throws SerDeException {
Text t = (Text) blob;
...
You can do this with a conversion step, for example a bucketing job that produces ORC files in a target directory, and then mounting a Hive table with the same schema on top of it, like below.
CREATE EXTERNAL TABLE my_fact_orc
(
  mycol INT,
  mystring STRING
)
PARTITIONED BY (dt string)
CLUSTERED BY (mycol) INTO 64 BUCKETS -- the bucketing column must be one of the table's columns
STORED AS ORC
LOCATION 's3://dev/my_fact_orc'
TBLPROPERTIES ('orc.compress'='SNAPPY');
ALTER TABLE my_fact_orc ADD IF NOT EXISTS PARTITION (dt='2017-09-07') LOCATION 's3://dev/my_fact_orc/dt=2017-09-07';
ALTER TABLE my_fact_orc PARTITION (dt='2017-09-07') SET FILEFORMAT ORC;
SELECT * FROM my_fact_orc WHERE dt='2017-09-07' LIMIT 5;
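For the conversion step itself, a hedged sketch (the staging table name and its location are assumptions, not part of the original answer): land the JSON as plain text, mount the JsonSerDe on it, and let an INSERT rewrite the rows as ORC:
-- staging table: JSON text files parsed by the JsonSerDe
CREATE EXTERNAL TABLE my_fact_json (mycol INT, mystring STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://dev/my_fact_json';
-- conversion: Hive deserializes the JSON and serializes ORC on the way in
SET hive.enforce.bucketing=true; -- needed on Hive 1.x for bucketed inserts
INSERT OVERWRITE TABLE my_fact_orc PARTITION (dt='2017-09-07')
SELECT mycol, mystring FROM my_fact_json;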