File is not created when running CREATE TABLE in open-source Delta Lake - sql

I am using AWS EMR with open-source Delta Lake.
In Python, dataframe.write.format('delta').save() works fine.
But I want to do it in SQL. I tried to create a Delta table in SQL as below:
spark.sql('''
CREATE OR REPLACE TABLE test.foo
(name string)
USING delta
LOCATION 's3://<bucket_name>/test/foo'
''');
But when I try to INSERT, an error is raised.
spark.sql('INSERT INTO test.foo (name) VALUES ("bar")');
ERROR: org.apache.spark.sql.AnalysisException: Path does not exist
Tables were created in Glue Metastore, but nothing was created in s3://<bucket_name>/test/foo in S3.
Is there any way to create a table in SQL? :)

To me it looks like you are using the wrong name: test.sql_delta. If you use SQL and created a SQL table, it will reference your physical metastore through the SQL table name you just created.
The code should be:
spark.sql('INSERT INTO test.foo (name) VALUES ("bar")')
SQL version:
%sql
INSERT INTO test.foo (name) VALUES ("bar")
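For reference, a minimal sketch of how the full SQL path might look, assuming the Spark session has the Delta Lake extensions and the Glue metastore configured as in the question (the bucket placeholder is kept as-is): the table is addressed through its metastore name, and Delta lays out its files under the given LOCATION once data is written.
spark.sql('''
CREATE TABLE IF NOT EXISTS test.foo (name STRING)
USING delta
LOCATION 's3://<bucket_name>/test/foo'
''')
spark.sql("INSERT INTO test.foo VALUES ('bar')")
spark.sql('SELECT * FROM test.foo').show()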

Related

Drop and overwrite external table in hive

I need to create an external table in HiveQL from the output of a SELECT clause. Every time the HiveQL is run, the table should be dropped and recreated. When we drop an external table, only the table structure is dropped, not the data files from the HDFS location. How can I achieve this?
Create Table As Select (CTAS) has restrictions. One of them is that the target table cannot be external.
You have these options:
Create the external table once, then use INSERT OVERWRITE on each run (see the sketch after these options):
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Use a managed table; then you can DROP TABLE, then CREATE TABLE ... AS SELECT.
See also the answer about skipTrash and the auto.purge property.
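A minimal sketch of the first option (the table names, columns, and location below are placeholders, not from the question):
CREATE EXTERNAL TABLE IF NOT EXISTS target_tbl (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/target_tbl';
-- Re-run only this part each time; it rewrites the data files in place
-- without dropping the external table definition.
INSERT OVERWRITE TABLE target_tbl
SELECT id, name FROM source_tbl;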

Inserting BLOB data in DB2 using SQL Query

I am stuck in a situation where I need to insert data into a BLOB column by reading a file from the filesystem in DB2 (DB2 Express-C on Windows 7).
Somewhere on the internet I found INSERT INTO ... VALUES ( ..., readfile('filename'), ...); but readfile is not a built-in function here; I would have to create it as a UDF (using C language libraries), which might not be a useful solution.
Can somebody explain how to insert BLOB values using the INSERT command?
Also, you can insert BLOB values by casting character data; the value is stored as the corresponding hex bytes:
CREATE TABLE BLOB_TEST (COL1 BLOB(50));
INSERT INTO BLOB_TEST VALUES (CAST('test' AS BLOB));
SELECT COL1 FROM BLOB_TEST;
DROP TABLE BLOB_TEST;
This gives the following result:
COL1
-------------------------------------------------------------------------------------------------------
x'74657374'
1 record(s) selected.
1) You could use LOAD or IMPORT via ADMIN_CMD. In this way, you can use SQL to call the administrative stored procedure that runs the tool. IMPORT or LOAD can read files and put their contents into a row.
You can also wrap this process with a temporary table that reads the binary data from the file, inserts it into the temporary table, and then returns it from the table.
2) You can create an external stored procedure or UDF implemented in Java or C that reads the data and then inserts it into the row.
I have not tried it, but you can also use the built-in modules that handle LOBs: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.apdv.sqlpl.doc/doc/r0055115.html
This is only available in DB2 LUW since version 9.7.
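As a rough sketch of the ADMIN_CMD route (the table, column, and file names below are made up for illustration; see the IMPORT documentation for the exact format used to reference LOB files from the delimited input):
-- blob_list.del is a delimited file whose BLOB column references files under /tmp/lobs/
CALL SYSPROC.ADMIN_CMD(
  'IMPORT FROM /tmp/blob_list.del OF DEL
   LOBS FROM /tmp/lobs/
   MODIFIED BY LOBSINFILE
   INSERT INTO MY_TABLE (ID, BLOB_COLUMN)'
);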
I succeeded in doing this by using IBM Data Studio with the following query:
INSERT INTO MY_TABLE (BLOB_COLUMN) values (?);
And selecting a file from pop-up dialog.
Somehow, the same method in RAD 8 does not offer an option to load a BLOB column this way.
First and foremost, per the IBM docs, all LOB data in DB2 must have the following corresponding items in addition to a LOB column defined in a table. See the docs for example CREATE statements.
A LOB tablespace (one for every LOB column in each partition)
An auxiliary table on the above tablespace that points to the BLOB column in the base table (also one for every LOB column in each partition)
A unique index on the auxiliary table
Once this schema is prepared, you can run a LOAD command that imports the BLOB content along with the other data fields, with the BLOB content referenced by file paths. Below is a demo with an Employees table:
DB Table (example table)
CREATE TABLE EMPLOYEES (
ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
EMPLOYEE_NUMBER INTEGER,
EMPLOYEE_NAME VARCHAR(255),
EMPLOYEE_PIC BLOB(500K)
);
CSV FILE (comma being default delimiter in LOAD with no headers)
1234, "John Doe", johndoe.jpg
5678, "Jane Doe", janedoe.jpg
...
DB2 LOAD (simple version using defaults for many other LOAD parameters)
LOAD FROM "/path/to/file.csv"
OF DEL
LOBS FROM /path/to/picture/folder/ --PATH OF BLOB FILES WITH BASENAME IN CSV
--MUST END IN FORWARD SLASH
MODIFIED BY LOBSINFILE CHARDEL""
DUMPFILE="/path/to/dump.txt" --FOR FAILED IMPORTS
METHOD P (1,2,3) --NUMBER REFERENCE OF COLS, OR USE N FOR FIELD NAMES
MESSAGES "/path/to/messages.txt" --FOR LOAD COMMAND MESSAGES
REPLACE INTO "EMPLOYEES" --REMOVES EXISTING FOR IMPORT, OR USE INSERT TO ADD
(EMPLOYEE_NUMBER,
EMPLOYEE_NAME,
EMPLOYEE_PIC);
Command lines
> db2 -tvf "/path/to/load_command.sql"
> db2 "SELECT LENGTH(EMPLOYEE_PIC) FROM EMPLOYEES"
DB2 SQL query to insert a JPG file into a table:
create table table_name(column_name BLOB); /* BLOB is the data type */
insert into table_name(column_name) values(blob('c:\data\winter.jpg'));
Here c:\data\ is the path and winter.jpg is the image name.

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table, it returns nothing. The table is structured in a way that fits the data structure.
SELECT * FROM tb LIMIT 3;
Is there some kind of permission issue with Hive tables: do specific users need permissions to query some tables?
Do you know of any solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically; you need to run an ALTER statement to update it.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load the data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
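Putting those steps together, a small sketch (reusing the table and paths from the question) might look like this:
-- After placing the data files under /user/cloudera/data/datehour=0909201401/
-- (for example with hadoop fs -put), register the partition in the metastore:
ALTER TABLE tb ADD PARTITION (datehour=0909201401);
-- Then verify the partition is visible and the data comes back:
SHOW PARTITIONS tb;
SELECT * FROM tb LIMIT 3;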
When we create an EXTERNAL TABLE with a PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path we specified while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it is optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when we copy files into that directory through some process like ETL, we can sync the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure of the partition that Hive would create, we can simply place the data file in that location, for example '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions to the Hive metastore for the table "tb".

Alter the schema of Hive table

I want to alter a table created in Hive that is mapped to HBase fields. Recently I added a few more columns to HBase and would therefore like to add those fields to Hive as well.
For the creation I used:
CREATE EXTERNAL TABLE test1(rowKey STRING,a STRING,b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
('hbase.columns.mapping' = ':key,cf:address,cf:name')
TBLPROPERTIES ('hbase.table.name' = 'test');
Now I want to add one more column to the Hive table test1, which should be mapped to HBase, but I can't find any way to do this. Please help. Thanks.
Because you use an external table, the easiest way is to drop it and create it again.
drop table test1;
and
create external table test1 (...);
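For example, a sketch of the recreated definition with one extra column (the new Hive column c and the HBase qualifier cf:city are placeholders for whatever you actually added in HBase):
DROP TABLE test1;
CREATE EXTERNAL TABLE test1(rowKey STRING, a STRING, b STRING, c STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
('hbase.columns.mapping' = ':key,cf:address,cf:name,cf:city')
TBLPROPERTIES ('hbase.table.name' = 'test');
Since the table is external, dropping it only removes the Hive metadata; the underlying HBase table 'test' is untouched.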

Inserting Data into Hive Table

I am new to Hive. I have successfully set up a single-node Hadoop cluster for development purposes and, on top of it, I have installed Hive and Pig.
I created a dummy table in hive:
create table foo (id int, name string);
Now, I want to insert data into this table. Can I add data one record at a time, like in SQL? Kindly help me with a command analogous to:
insert into foo (id, name) VALUES (12, "xyz");
Also, I have a csv file which contains data in the format:
1,name1
2,name2
..
..
..
1000,name1000
How can I load this data into the dummy table?
I think the best way is:
a) Copy data into HDFS (if it is not already there)
b) Create an external table over your CSV like this:
CREATE EXTERNAL TABLE TableName (id int, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'place in HDFS';
c) You can start using TableName right away by issuing queries against it.
d) If you want to insert data into another Hive table:
insert overwrite table finalTable select * from TableName;
There's no direct way to insert one record at a time from the terminal; however, here's an easy, straightforward workaround which I usually use when I want to test something:
Assume that t is a table with at least one record. It doesn't matter what the type or number of columns is.
INSERT INTO TABLE foo
SELECT '12', 'xyz'
FROM t
LIMIT 1;
Hive apparently supports INSERT...VALUES starting in Hive 0.14.
Please see the section 'Inserting into tables from SQL' at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
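With Hive 0.14 or later, the question's example can then be written directly; a minimal sketch against the foo table defined above:
INSERT INTO TABLE foo VALUES (12, 'xyz');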
Whatever data you have in a text file or log file can be put at a path in HDFS, and then you can write a query as follows in Hive:
hive> load data inpath '<input path>' into table <table name>;
EXAMPLE:
hive> create table foo (id int, name string)
row format delimited
fields terminated by '\t'   -- or '|' or ','
stored as textfile;
Table created.
DATA INSERTION:
hive>load data inpath '/home/hive/foodata.log' into table foo;
To insert an ad-hoc value like (12, "xyz"), do this:
insert into table foo select * from (select 12, "xyz") a;
This is supported from Hive 0.14 onward.
INSERT INTO TABLE pd_temp(dept,make,cost,id,asmb_city,asmb_ct,retail) VALUES('production','thailand',10,99202,'northcarolina','usa',20)
These are limitations of Hive:
1. You cannot update data after it is inserted.
2. There is no "insert into table values ..." statement.
3. You can only load data using bulk load.
4. There is no "delete from" command.
5. You can only do bulk delete.
But if you still want to insert a record from the Hive console, you can do a select-based insert, as shown above; refer to this.
You may try this: I have developed a tool to generate Hive scripts from a CSV file. Following are a few examples of how the files are generated.
Tool -- https://sourceforge.net/projects/csvtohive/?source=directory
Select a CSV file using Browse and set the Hadoop root directory, e.g.: /user/bigdataproject/
The tool generates a Hadoop script for all the CSV files; the following is a sample of the generated Hadoop script to insert a CSV into Hadoop:
#!/bin/bash -v
hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
hive -f ./AllstarFull.hive
hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
hive -f ./Appearances.hive
hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
hive -f ./AwardsManagers.hive
Sample of generated Hive scripts
CREATE DATABASE IF NOT EXISTS lahman;
USE lahman;
CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
SELECT * FROM AllstarFull;
Thanks
Vijay
You can use the following lines of code to insert values into an already existing table. Here the table is db_name.table_name, which has two columns, and I am inserting 'ALL', 'Done' as a row in the table.
insert into table db_name.table_name
select 'ALL','Done';
Hope this was helpful.
The Hadoop file system does not support appending data to existing files. However, you can load your CSV file into HDFS and tell Hive to treat it as an external table.
Use this -
create table dummy_table_name as select * from source_table_name;
This will create a new table with the existing data from source_table_name.
LOAD DATA [LOCAL] INPATH '<file_path>' [OVERWRITE] INTO TABLE <table_name>;
Use this command; it will load the data at once. Just specify the file path.
If the file is on the local filesystem, use LOCAL; if the file is in HDFS, there is no need to use LOCAL.
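For example, a quick sketch using the question's foo table (the file paths here are placeholders):
-- File on the local filesystem of the machine running the Hive client:
LOAD DATA LOCAL INPATH '/home/user/foodata.csv' INTO TABLE foo;
-- File already in HDFS (note: LOAD DATA moves the file into the table's directory):
LOAD DATA INPATH '/user/hive/staging/foodata.csv' INTO TABLE foo;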