Pyspark data frame to Hive Table - hive

How do I store a PySpark DataFrame object into a Hive table, where "primary12345" is the Hive table?
I am using the code below; masterDataDf is a DataFrame object:
masterDataDf.write.saveAsTable("default.primary12345")
I am getting the error below:
: java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

You can create a temporary table:
masterDataDf.createOrReplaceTempView("mytempTable")
Then you can use a simple Hive statement to create the table and dump the data from your temp table:
sqlContext.sql("create table primary12345 as select * from mytempTable")
OR
If you want to use HiveContext, you need to have/create one first:
import org.apache.spark.sql.hive.HiveContext;
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
Then directly save the DataFrame, or select the columns to store, as a Hive table:
masterDataDf.write().mode("overwrite").saveAsTable("default.primary12345");
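The snippet above is the Java/Scala form; for PySpark itself, a minimal sketch of the same idea on Spark 1.x would look roughly like this (sc is assumed to be an existing SparkContext, and masterDataDf should be created through this HiveContext rather than a plain SQLContext):
from pyspark.sql import HiveContext

# A HiveContext (instead of a plain SQLContext) lets saveAsTable create
# permanent Hive tables rather than only temporary ones.
sqlContext = HiveContext(sc)

# Write the DataFrame straight into the Hive metastore.
masterDataDf.write.mode("overwrite").saveAsTable("default.primary12345")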

Related

Import JSON Data to table with schema

I store JSON objects in a column (string). I want to convert it to a table with schema.
JSON_DATA
{"id":"ksah2132","connections":{"structure":["123","456","789"]},"options":[{"id":"AA123","type":"optionA"},{"id":"BB123","type":"optionB"},{"id":"CC123","type":"optionC"}]}
{"id":"ksah3321","connections":{"structure":["567","332","435"]},"options":[{"id":"AA133","type":"optionA"},{"id":"BB156","type":"optionB"},{"id":"CC445","type":"optionC"}]}
Table with Schema:
CREATE TABLE `sandboxabc.raw_data`(`options` array<struct<id:string,type:string>>, `connections` struct<structure:array<string>>, `id` string)
How can I use Spark SQL to insert overwrite into the new table?
My code:
INSERT OVERWRITE TABLE sandboxabc.structured_data
SELECT
from_json (JSON_DATA,'$.options') AS options
,from_json (JSON_DATA,'$.connections') AS connections
,from_json (JSON_DATA,'$.id') AS id
FROM
sandboxabc.raw_data
Sample of output:
id       | connection                        | option
ksah2132 | {"structure":["123","456","789"]} | [{"id":"AA123","type":"optionA"},{"id":"BB123","type":"optionB"},{"id":"CC123","type":"optionC"}]
The Spark SQL below should work for you. Please note that Hive support must be enabled and the Hive-related jars must be on the classpath.
INSERT OVERWRITE TABLE sandboxabc.structured_data
SELECT
  id,
  from_json(connections, "struct<structure:array<string>>") AS connections,
  from_json(options, "array<struct<id:string,type:string>>") AS options
FROM (
  SELECT
    get_json_object(JSON_DATA, '$.id') AS id,
    get_json_object(JSON_DATA, '$.connections') AS connections,
    get_json_object(JSON_DATA, '$.options') AS options
  FROM sandboxabc.raw_data
) t
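If you prefer the PySpark DataFrame API over pure SQL, a sketch along these lines should give the same result (Spark 2.1+ assumed; spark is an existing Hive-enabled SparkSession, and the table and column names are taken from the question):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Schema matching the target table:
#   id string, connections struct<structure:array<string>>,
#   options array<struct<id:string,type:string>>
schema = StructType([
    StructField("id", StringType()),
    StructField("connections", StructType([
        StructField("structure", ArrayType(StringType()))
    ])),
    StructField("options", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("type", StringType())
    ])))
])

raw = spark.table("sandboxabc.raw_data")

parsed = (raw
          .select(F.from_json(F.col("JSON_DATA"), schema).alias("j"))
          .select("j.id", "j.connections", "j.options"))

# Column order must line up with the columns of sandboxabc.structured_data.
parsed.write.insertInto("sandboxabc.structured_data", overwrite=True)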

Reading a hive table in pyspark after altering the schema

I added a column to a hive table:
ALTER TABLE table_name ADD COLUMNS (new_col string);
But when I read the table using pyspark (2.1), I still see the old schema. How do I load the table with the updated schema?
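One likely cause, offered here as an assumption rather than something from the thread, is that Spark has cached the table's metadata; refreshing the catalog entry before reading usually helps. A minimal sketch, with spark an existing SparkSession and db_name.table_name a placeholder for the real table:
# Invalidate Spark's cached metadata so the next read sees the
# column added by ALTER TABLE.
spark.catalog.refreshTable("db_name.table_name")

df = spark.table("db_name.table_name")
df.printSchema()  # new_col should now appear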

Load into Hive table imported entire data into first column only

I am trying to copy Hive data from one server to another. To do this, I am exporting the Hive data to CSV on server1 and trying to import that CSV file into Hive on server2.
My table contains following datatypes:
bigint
string
array
Here are my commands:
export:
hive -e 'select * from sample' > /home/hadoop/sample.csv
import:
load data local inpath '/home/hadoop/sample.csv' into table sample;
After importing into the Hive table, the entire row of data is inserted into the first column only.
How can I overcome this? Or is there a better way to copy data from one server to another?
While creating the table, add the line below at the end of the CREATE statement:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Like below:
hive> CREATE TABLE sample(id int,
      name String)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Then load the data:
hive> load data local inpath '/home/hadoop/sample.csv' into table sample;
For your example:
sample.csv
123,Raju,Hello|How Are You
154,Nishant,Hi|How Are You
In the above sample data, the first column is a bigint, the second is a string, and the third is an array with items separated by '|':
hive> CREATE TABLE sample(id BIGINT,
name STRING,
messages ARRAY<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|';
hive> LOAD DATA LOCAL INPATH '/home/hadoop/sample.csv' INTO TABLE sample;
The most important point: define a delimiter for the collection items, and don't impose the array structure you would use in normal programming. Also, make the field delimiter different from the collection-items delimiter to avoid confusion and unexpected results.
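As a quick check that the '|'-delimited values really come back as an array, reading the table from PySpark (assuming a Hive-enabled SparkSession named spark) shows the third field as a Python list:
# Read the sample table defined above and inspect the collection column.
for r in spark.sql("select id, name, messages from sample").collect():
    # e.g. (123, u'Raju', [u'Hello', u'How Are You'])
    print(r.id, r.name, r.messages)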
You really should not be using CSV as your data transfer format:
DistCp copies data between Hadoop clusters as-is.
Hive supports EXPORT and IMPORT.
Circus Train allows Hive table replication.
Why not use the hadoop command to transfer data from one cluster to another, such as:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
          hdfs://nn2:8020/bar/foo
Then load the data into your new table:
load data inpath '/bar/foo/*' into table wyp;
Your problem may be caused by the delimiter. The default delimiter is '\001' if you haven't set one when creating the Hive table, so loading the output of hive -e 'select * from sample' > /home/hadoop/sample.csv puts all the columns into the first column.
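If PySpark is available on both clusters, one delimiter-safe variation of the CSV route is to let Spark write and read the export with an explicit separator. This is only a sketch; the export path and the '|' handling for the array column are assumptions, not from the thread:
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

# On server1: read the Hive table and write a comma-delimited export,
# flattening the array column into a '|'-separated string so it matches
# the COLLECTION ITEMS TERMINATED BY '|' layout shown above.
src = spark.table("sample")
(src.withColumn("messages", F.concat_ws("|", "messages"))
    .write.option("sep", ",").mode("overwrite")
    .csv("/tmp/sample_export"))  # hypothetical HDFS path

# After copying /tmp/sample_export to server2 (e.g. with distcp),
# read it back with the same separator, rebuild the array column,
# and write it into the target table.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("messages", StringType()),
])
dst = (spark.read.option("sep", ",").schema(schema)
       .csv("/tmp/sample_export")
       .withColumn("messages", F.split("messages", "\\|")))
dst.write.mode("overwrite").saveAsTable("sample")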

Inserting a pyspark dataframe to an existing partitioned hive table

I have a hive table which is partitioned by column inserttime.
I have a pyspark dataframe which has the same columns as the table except for the partitioned column.
The following works well when the table is not partitioned:
df.insertInto('tablename',overwrite=True)
But I am not able to figure out how to insert into a particular partition from pyspark.
I tried the following:
df.insertInto('tablename',overwrite=True,partition(inserttime='20170818-0831'))
but it did not work and failed with
SyntaxError: non-keyword arg after keyword arg
I am using pyspark 1.6.
One option is:
df.registerTempTable('tab_name')
hiveContext.sql("insert overwrite table target_tab partition(insert_time=value) select * from tab_name ")
Another option is to add the static value as the last column of the dataframe and use insertInto() in dynamic partition mode.
You can use df.write.mode("overwrite").partitionBy("inserttime").saveAsTable("TableName")
or you can overwrite the value in the partition itself:
df.write.mode("overwrite").save("location/inserttime='20170818-0831'")
Hope this helps.
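A sketch of the "static value as the last column" option for PySpark 1.6 (hiveContext, the table name, and the partition value are taken from the thread; enabling nonstrict dynamic partition mode is an assumption about the required Hive settings):
from pyspark.sql import functions as F

# Allow dynamic partition inserts.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

# Add the partition value as the last column; insertInto() maps the
# trailing column(s) onto the table's partition column(s) by position.
df_with_part = df.withColumn("inserttime", F.lit("20170818-0831"))
df_with_part.write.insertInto("tablename", overwrite=True)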

Is there a way to alter column in a hive table that is stored as ORC?

There is already a question about Hive in general (Is there a way to alter column type in hive table?). The answer to that question states that it is possible to change the schema with the ALTER TABLE ... CHANGE command.
However, is this also possible if the table is stored as ORC?
You can load the ORC file into pyspark:
Load the data into a dataframe:
df = spark.read.format("orc").load("<path-of-file-in-hdfs>")
Create a view over the dataframe:
df.createOrReplaceTempView('Table')
Create a new dataframe with the manipulated column, e.g. casting it to float:
from pyspark.sql.functions import col
df3 = spark.sql("select * from Table").withColumn("third_column", col("third_column").cast("float"))
Save the dataframe back to HDFS:
df3.write.format("orc").save("<hdfs-path-where-file-needs-to-be-saved>")
I ran tests on an ORC table. It is possible to convert a string column to a float column.
ALTER TABLE test_orc CHANGE third_column third_column float;
would convert a column called third_column that is marked as a string column into a float column. It is also possible to change the name of a column.
Side note: I was curious whether other alterations on ORC might create problems. I ran into an exception when I tried to reorder columns:
ALTER TABLE test_orc CHANGE third_column third_column float AFTER first_column;
The exception is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Reordering columns is not supported for table default.test_orc. SerDe may be incompatible.
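To confirm from PySpark that such an ALTER took effect, a small check like the one below should be enough (the refreshTable call is an assumption, in case Spark still has the old schema cached):
# Drop any cached metadata for the table, then inspect the new schema.
spark.catalog.refreshTable("default.test_orc")
spark.table("default.test_orc").printSchema()  # third_column should now be float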