I'm using Trino. I have a bucket in S3 where I store multiple Parquet files with the same schema, which looks like this:
const { ParquetSchema } = require('parquetjs');

// parquetjs schema: a repeated group of (userName, password) records
const MySchema = new ParquetSchema({
  users: {
    repeated: true,
    fields: {
      userName: { type: 'UTF8' },
      password: { type: 'UTF8' }
    }
  }
});
I've created a schema over the Hive connector, and then I created a table that looks like this:
CREATE TABLE s3.default.myschema (
    users array(row(userName varchar(100), password varchar(100)))
)
WITH (external_location = 's3a://users/', format = 'PARQUET')
When I create tables from different Parquet files that contain columns of primitive types or objects, it works. But when I use an array as shown here, the table is created, yet the users column shows 'null' instead of the array from the Parquet file.
I'll add that when I read the same file with parquetjs, for example, everything works fine; it's only Trino that behaves this way and shows the null value.
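A minimal sanity check in Trino, assuming only the table definition above: confirm the type Trino has registered for the column, then try to expand the array explicitly. If users comes back NULL for every row, the UNNEST returns no rows, which suggests the problem is in how the nested list inside the Parquet file is mapped, not in the column definition.
-- what Trino thinks the table looks like
SHOW CREATE TABLE s3.default.myschema;

-- expand the array of rows into one row per user
SELECT t.userName, t.password
FROM s3.default.myschema
CROSS JOIN UNNEST(users) AS t(userName, password);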
Related
Trino is not able to create a table from JSON in S3.
I use
create table trino_test.json_test (id VARCHAR) with (external_location = 's3a://trino_test/jsons/', format='JSON');
but I get Query 20230203_154914_00018_d3erw failed: java.lang.IllegalArgumentException: bucket is null/empty, even though I really do have a file at trino_test/jsons/file.json.
A different "folder" containing PARQUET files works fine.
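For comparison, a hedged sketch of a layout that avoids ambiguity about the bucket; the connector name (hive) and the bucket name (trino-test) are assumptions, not taken from the question. The idea is only that external_location must begin with a real, existing bucket followed by the key prefix, and the bucket name itself is worth double-checking, since S3 bucket names normally cannot contain underscores.
-- connector and bucket names are assumptions
CREATE SCHEMA hive.trino_test
WITH (location = 's3a://trino-test/');

CREATE TABLE hive.trino_test.json_test (id varchar)
WITH (
    external_location = 's3a://trino-test/jsons/',
    format = 'JSON'
);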
I have a lot of JSON files in S3 which are updated frequently. Basically I am doing CRUD operations in a data lake. Because Apache Iceberg can handle item-level manipulations, I would like to migrate my data to Apache Iceberg as the table format.
My data is in JSON files, but I have crawled the data and created a Glue table.
The Glue table was created automatically by the crawler, and the schema with all the data types was detected automatically.
I want to migrate this table to a table in Iceberg format. Therefore I created an Iceberg table with the same schema I read from the existing crawled Glue table:
CREATE TABLE icebergtest (
  `name` string,
  `address` string,
  `customer` boolean,
  `features` array<struct<featurename:string,featureId:string,featureAge:double,featureVersion:string>>
)
LOCATION 's3://<s3-bucket-name>/<blablapath>/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
As you can see, I have some attributes in my JSON files, and features is an array of JSON objects. I just copy-pasted the data types from my existing Glue table.
Creating the table was successful, but filling the Iceberg table with the data from the Glue table fails:
INSERT INTO "icebergtest"
SELECT * FROM "customer_json_table";
ERROR: SYNTAX_ERROR: Insert query has mismatched column types:
Table: [varchar, varchar, boolean, array(row(featurename varchar, featureId varchar, featureAge double, featureVersion varchar)), ...
To me it seems like I am trying to insert varchar into a string data field, but my Glue table also has string configured as the data type. I don't understand where varchar is suddenly coming from and how I can fix the problem.
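A hedged note and sketch, assuming the four columns shown above are the only ones (the truncated error suggests there may be more): in the Athena/Trino engine, Hive's string type is reported as varchar, so the varchar in the message is just the engine's name for string rather than a different type. What sometimes narrows the mismatch down is listing the columns explicitly instead of SELECT *, so the column order and count of the source table line up with the Iceberg table.
-- column names are taken from the DDL above; adjust the list to match the real Glue table
INSERT INTO icebergtest
SELECT
  name,
  address,
  customer,
  features
FROM customer_json_table;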
Hi, I am working in the AWS Athena editor. I am trying to create an external table for a pool of Kafka events saved in S3. The Kafka events have different payloads, for example as shown below:
{
  "type": "job_started",
  "jobId": "someId"
}
{
  "type": "equipment_taken",
  "equipmentId": "equipmentId"
}
and I was wondering if there is a way to do something like
CREATE EXTERNAL TABLE _table (...) ... WHERE type = 'job_started' ...
LOCATION 'some-s3-bucket'
TBLPROPERTIES ('has_encrypted_data' = 'false')
I know that I can create a single table whose schema has all the attributes of both event types (job_started and equipment_taken); however, there will be a lot of nulls, and each time I add a new "type" of event I have to expand the schema, so it keeps growing. Instead I want two tables (table_for_job_started and table_for_equipment_taken), each mapping to the data in the S3 bucket for one type, so that only relevant data is populated. Can you help with this?
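A hedged sketch of the closest pattern I know of, assuming the OpenX JSON SerDe and a single shared bucket path (the table, view, and bucket names here are made up): several external tables can point at the same LOCATION, each declaring only the columns it cares about, because the SerDe ignores JSON keys that are not in the schema. Filtering down to one event type then happens in a WHERE clause or a view, not in the CREATE EXTERNAL TABLE itself.
CREATE EXTERNAL TABLE table_for_job_started (
  type string,
  jobId string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://some-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data' = 'false');

CREATE EXTERNAL TABLE table_for_equipment_taken (
  type string,
  equipmentId string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://some-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data' = 'false');

-- rows of the other event type still appear (with NULL columns), so a
-- per-type view keeps only the relevant events
CREATE VIEW job_started_events AS
SELECT jobId FROM table_for_job_started WHERE type = 'job_started';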
I have successfully loaded a great deal of externally Hive-partitioned data (in Parquet files) into BigQuery without issue.
All my tables share the same schema and are in the same dataset. However, some tables don't work when I call the bq mk command with the external table definition files I've defined.
The full output of the error after bq mk is as follows:
BigQuery error in mk operation: Error while reading table: my_example,
error message: Failed to add partition key my_common_key (type:
TYPE_INT64) to schema, because another column with the same name was
already present. This is not allowed. Full partition schema:
[my_common_key:TYPE_INT64, my_secondary_common_key:TYPE_INT64].
The external table definition files look like this:
{ "hivePartitioningOptions": {
"mode": "AUTO",
"sourceUriPrefix": "gs://common_bucket/my_example" },
"parquetOptions": {
"enableListInference": false,
"enumAsString": false }, "sourceFormat": "PARQUET", "sourceUris": [
"gs://common_bucket/my_example/*" ]
You will see that I am relying on schema auto-inference with a source URI prefix, as there are numerous Parquet files nested within two subdirectories which Hive uses as partition keys. The full path is gs://common_bucket/my_example/my_common_key=2022/my_secondary_common_key=1, and within that there are several Parquet files.
Please check your data files under the bucket: has the Hive table evolved, so that older files contain the partition column you are using (my_common_key) as a regular data column, while the newer layout partitions by it instead? I recently migrated a large set of Hive tables and had a similar issue; the current Hive table structure and the underlying data had evolved over time.
One way to solve this is to read the data using Dataproc Hive, export it back to a GCS location, and then load it into BigQuery. You can also use Spark SQL to do this, as in the sketch below.
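A hedged Spark SQL sketch of that rewrite, assuming the offending files embed my_common_key and my_secondary_common_key as data columns; the view name, the output path, and the columns col_a and col_b are hypothetical. The point is only to rewrite the files so that the partition keys live in the directory layout rather than inside the Parquet schema.
-- read one leaf partition's files directly; repeat or script this per partition
CREATE OR REPLACE TEMPORARY VIEW my_example_raw
USING PARQUET
OPTIONS (path 'gs://common_bucket/my_example/my_common_key=2022/my_secondary_common_key=1/');

-- write a cleaned copy that drops the embedded partition columns;
-- col_a and col_b stand in for the real data columns
INSERT OVERWRITE DIRECTORY 'gs://common_bucket/my_example_clean/my_common_key=2022/my_secondary_common_key=1/'
USING PARQUET
SELECT col_a, col_b
FROM my_example_raw;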
I am using Databricks Community Edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not getting populated with the data from the file specified in the query.
Any help would be appreciated.
From the official docs ... make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct:
-- deletes the metadata (SQL)
DROP TABLE IF EXISTS <example-table>;
// deletes the data (Scala/Python notebook cell, not SQL)
dbutils.fs.rm("<your-s3-path>", true)
-- recreate the table from a query over the files at that path
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>;
-- alternative: declare the schema and point the table at the storage location
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>";