How to copy an existing Glue table to an Iceberg format table with Athena? - amazon-s3

I have a lot of JSON files in S3 which are updated frequently; basically I am doing CRUD operations in a data lake. Because Apache Iceberg can handle row-level manipulations, I would like to migrate my data to Apache Iceberg as the table format.
My data is in JSON files, but I have crawled the data and created a Glue table.
The Glue table was created automatically by the crawler, and the schema with all the data types was detected automatically.
I want to migrate this table to a table in Iceberg format. Therefore I created an Iceberg table with the same schema I read from the existing crawled Glue table:
CREATE TABLE
icebergtest
(
`name` string,
`address` string,
`customer` boolean,
`features` array<struct<featurename:string,featureId:string,featureAge:double,featureVersion:string>>
)
LOCATION 's3://<s3-bucket-name>/<blablapath>/'
TBLPROPERTIES ( 'table_type' ='ICEBERG' );
As you can see, I have some attributes in my JSON files, and features is an array of JSON objects. I just copy-pasted the data types from my existing Glue table.
Creating the table was successful, but filling the Iceberg table with the data from the Glue table fails:
INSERT INTO "icebergtest"
SELECT * FROM "customer_json_table";
ERROR: SYNTAX_ERROR: Insert query has mismatched column types:
Table: [varchar, varchar, boolean, array(row(featurename varchar,
featureId varchar, featureAge double, featureVersion varchar)), ...
To me it seems like I am trying to insert varchar into a string field. But my Glue table also has string configured as the data type. I don't understand where varchar is suddenly coming from and how I can fix the problem.
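For what it's worth, Athena's engine (Presto/Trino) reports the Glue string type as varchar, so varchar and string refer to the same type here; the mismatch usually comes from the column count or order not lining up when SELECT * is used. A minimal sketch of a workaround, assuming the column names above and that the source table may contain extra columns: select exactly the Iceberg table's columns, in its declared order, and cast the nested array only if the struct definitions differ.
INSERT INTO icebergtest
SELECT
  name,
  address,
  customer,
  -- the CAST is only needed if the row/struct definitions differ between the two tables
  CAST(features AS array(row(featurename varchar, featureId varchar,
                             featureAge double, featureVersion varchar)))
FROM customer_json_table;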

Related

Redshift Spectrum query returns 0 rows from S3 file

I tried Redshift Spectrum. Both of the queries below completed successfully without any error message, but I can't get the right count from the uploaded file in S3; it just returns a row count of 0, even though the file has over 3 million records.
-- Create External Schema
CREATE EXTERNAL SCHEMA spectrum_schema FROM data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
create external database if not exists;
-- Create External Table
create EXTERNAL TABLE spectrum_schema.principals(
tconst VARCHAR (20),
ordering BIGINT,
nconst VARCHAR (20),
category VARCHAR (500),
job VARCHAR (500),
characters VARCHAR(5000)
)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://xxxxx/xxxxx/'
I also tried the option, 'stored as parquet', the result was same.
My IAM role has "s3:*", "athena:*", "glue:*" permissions, and the Glue table was created successfully.
And just in case, I confirmed that the same S3 file could be copied into a table in the Redshift cluster successfully, so I concluded the file/data has no issue in itself.
Is there something wrong with my procedure or query? Any advice would be appreciated.
As your DDL does not scan any data, the issue seems to be that the table definition does not match the actual data in S3. To figure this out you can simply generate a table using an AWS Glue crawler.
Once that table is created, you can compare its properties with those of the table you created via DDL in the Glue Data Catalog. That will show you the difference and what is missing from the table you created manually.
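One way to do that comparison from Redshift itself is to look at the external table metadata, a sketch assuming the spectrum_schema name used above (svv_external_tables is a standard Redshift system view):
-- compare the serde/format properties of the hand-written table
-- with those of the crawler-generated one
SELECT tablename, location, input_format, output_format,
       serialization_lib, serde_parameters
FROM svv_external_tables
WHERE schemaname = 'spectrum_schema';
A mismatch in serialization_lib or serde_parameters (for example, a tab-delimited definition over files that are not actually tab-delimited) would explain a zero row count without an error.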

Support for creating a table out of a limited number of columns in Presto

I was playing around with Presto. I uploaded a Parquet file with 10 columns. I want to create a table (external location S3) in the metastore with 5 columns using presto-cli. It looks like Presto doesn't support this?
Is there any other way to get this working?
That should be easily possible if you are using the Parquet or ORC file formats. This is another advantage of keeping metadata separate from the actual data. As mentioned in the comments, you should use column names to access the fields instead of indexes.
One example:
CREATE TABLE hive.web.request_logs (
request_time timestamp,
url varchar,
ip varchar,
user_agent varchar
)
WITH (
format = 'parquet',
external_location = 's3://my-bucket/data/logs/'
)
Reference:
https://prestodb.github.io/docs/current/connector/hive.html#examples
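Applied to the question above, a sketch with hypothetical column names in which only 5 of the 10 Parquet columns are declared; depending on the Presto version, the Hive connector property hive.parquet.use-column-names=true may be required so that columns are matched by name rather than by position:
-- hypothetical table and column names; only a subset of the file's columns is declared
CREATE TABLE hive.web.request_logs_subset (
  request_time timestamp,
  url varchar,
  ip varchar,
  status bigint,
  user_agent varchar
)
WITH (
  format = 'parquet',
  external_location = 's3://my-bucket/data/logs/'
);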

Migrating data from Hive PARQUET table to BigQuery, Hive String data type is getting converted to BYTES datatype in BQ

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in the PARQUET file format, and the data type of one column is STRING. I am uploading the files behind the Hive table to Google Cloud Storage and from them creating a BigQuery internal table with the GUI. The data type of the column in the imported table gets converted to BYTES.
But when I imported a CHAR or VARCHAR datatype, the resulting datatype was STRING.
Could someone please explain why this is happening?
This does not fully answer the original question, as I do not know exactly what happened, but I have had experience with similar odd behavior.
I was facing a similar issue when trying to move a table between Cloudera and BigQuery.
First I created the table as external in Impala like this:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with the STRING datatype.
Then I transferred that to GCS and imported it into BigQuery from the console GUI; there are not many options, just select the Parquet format and point to GCS.
And to my surprise the columns are now of type BYTES; the names of the columns were preserved fine, but the content was scrambled.
Trying different codecs, as well as pre-creating the table and inserting, still in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating the external table in Hive like this:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
Then I repeated the same dance of copying from S3 to GCS and importing into BQ, this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.
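For the import step, a minimal sketch using BigQuery's LOAD DATA SQL statement instead of the console GUI, with a hypothetical dataset name and GCS path:
-- dataset name and URI are placeholders
LOAD DATA INTO my_dataset.test2
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/table_migration/test2/*.parquet']
);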

Is it possible to load only selected columns from Avro file to Hive?

I have a requirement to load an Avro file into Hive. I am using the following to create the table:
create external table tblName stored as avro location 'hdfs://host/pathToData' tblproperties ('avro.schema.url'='/hdfsPathTo/schema.avsc');
I am getting the error FOUND NULL, EXPECTED STRING while doing a SELECT on the table. Is it possible to load only a few columns and find out which column's data is causing this error?
Actually, you first need to create a Hive external table pointing to the location of your Avro files, using the AvroSerDe format.
At this stage, nothing is loaded; the external table is just a mask over the files.
Then you can create an internal Hive table and load the data (only the expected columns) from the external one.
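A sketch of such an external table using the AvroSerDe explicitly, reusing the location and avro.schema.url from the question (the table name is a placeholder):
CREATE EXTERNAL TABLE avro_external
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://host/pathToData'
TBLPROPERTIES ('avro.schema.url'='/hdfsPathTo/schema.avsc');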
If you already have the Avro file, load it into HDFS in a directory of your choice. Next, create an external table on top of the directory:
CREATE EXTERNAL TABLE external_table_name(col1 string, col2 string, col3 string ) STORED AS AVRO LOCATION '<HDFS location>';
Next, create an internal Hive table from the external table to load the data:
CREATE TABLE internal_table_name AS SELECT col2, col3 FROM external_table_name;
You can schedule the internal table load using a batch script in any scripting language or tool.
Hope this helps :)

Copy tables in HIVE, from one database to another database

In a database I have 50+ tables. I was wondering, is there any way to copy these tables into a second database in one shot?
I have used this, but running this 50+ times isn't efficient.
create table database2.table1 as select * from database1.table1;
Thanks!
Copying data from a table in one database to a table in another database in Hive is like copying a data file from its existing location in HDFS to a new location in HDFS.
The best way to do this is to create an external Hive table in the new database, with a location such as LOCATION '/user/hive/external/', and copy the old table's data files from the old HDFS location to the new one using distcp.
Example: existing table in the old database:
CREATE TABLE stations( number STRING, latitude INT, longitude INT, elevation INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "/home/cloudera/Desktop/Stations.csv" INTO TABLE stations;
Now you create the external table in the new database:
CREATE EXTERNAL TABLE external_stations( number STRING, latitude INT, longitude INT, elevation INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/external/';
Now you just copy the data files from /user/hive/warehouse/training.db/stations/ to /user/hive/external/ using the distcp command. These two paths are specific to my Hive locations; yours will be similar.
In this way you can copy the data of any number of tables.
One approach would be to create your table structures in the new database and use distcp to copy the data from the old HDFS location to the new one.
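To avoid retyping each DDL when recreating the structures, a sketch using Hive's CREATE TABLE ... LIKE, which copies only the table definition (the data still has to be moved separately, e.g. with distcp as described above); database and table names are the ones from the question:
-- copies only the schema, not the data
CREATE EXTERNAL TABLE database2.table1
LIKE database1.table1
LOCATION '/user/hive/external/table1';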