Query a JSON file in S3 from Presto

I have a file in S3 and Presto running on EMR. I see that I can use json_extract to read the JSON.
I am running the following query, but I keep getting null instead of the correct value:
select json_extract('s3a://random-s3-bucket/analytics/20210221/myjsonfile.json', '$.dateAvailability')
Is my syntax wrong? Thoughts?

json_extract() operates on JSON scalar values kept in memory; it does not load data from an external location. See the documentation page for usage examples.
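For example, json_extract works on a JSON string you already have in hand (the literal below is made up for illustration; only the '$.dateAvailability' path comes from the question):
-- Presto/Trino: extract a field from an in-memory JSON string
SELECT json_extract('{"dateAvailability": "2021-02-21"}', '$.dateAvailability');
-- returns the JSON value "2021-02-21"; the first argument is JSON text, not a file path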
In order to query a JSON file using Trino (formerly known as Presto SQL), you need to map it as a table with JSON format like this:
CREATE TABLE my_table ( .... )
WITH (
format = 'JSON',
external_location = 's3a://random-s3-bucket/analytics/20210221'
);
See the Hive connector documentation for more information.
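For instance, a minimal sketch for the file from the question (the dateAvailability column comes from the JSON path in the question; the hive.default catalog/schema and the varchar type are assumptions, and the Hive JSON serde matches column names case-insensitively):
-- map the S3 folder as an external JSON table in Trino's Hive connector
CREATE TABLE hive.default.myjsonfile_table (
    dateAvailability varchar
)
WITH (
    format = 'JSON',
    external_location = 's3a://random-s3-bucket/analytics/20210221'
);

SELECT dateAvailability
FROM hive.default.myjsonfile_table;
Note that external_location points at the folder containing the file, not at the file itself.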

If you need a tool to help you create the table statement, try this one: https://www.hivetablegenerator.com
From the page:
Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log
sample file to an Apache HiveQL DDL create table statement.

Related

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery, with the schema auto-generated from the file in GCS. I'm using Apache Airflow to do this. The problem is that with schema auto-detection, BigQuery builds the schema from roughly the first 100 values.
For example, in my case there is a column, say X, whose values are mostly integers, but some values are strings, so bq load fails with a schema mismatch; in that scenario I need to change the data type to STRING.
What I could do is manually create the table, generating the schema on my own, or set --max_bad_records to something like 50, but neither seems like a good solution. An ideal solution would be:
Try to load the file from GCS to BigQuery; if the table is created successfully in BQ without any data mismatch, I don't need to do anything.
Otherwise, I need to be able to update the schema dynamically and complete the table creation.
Since you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds.
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script.
Use one line that best fits your case, with the optimized field types.
This will create the table with the correct schema and a single row, and from there you can load the rest of the data (an explicit-schema alternative is sketched below).
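If you would rather create the table with an explicit schema up front (the "manually create a new table" option from the question), a BigQuery Standard SQL sketch could look like this; the dataset name and the second column are hypothetical, and X is the column from the question:
-- Declare X as STRING so both integer-like and string values load cleanly
CREATE TABLE my_dataset.my_table (
  X STRING,
  other_column INT64
);
After that, point the load job at my_dataset.my_table without schema auto-detection, so the existing schema is used.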

Using elemMatch in Hive with json field

I am using Hive for JSON storage. I have created a table with only one string column containing the whole JSON document. I have tested the get_json_object function that Hive offers, but I am not able to write a query that iterates over all the subdocuments in a list and finds a value in a specific field.
In MongoDB, this problem can be solved by using $elemMatch as the documentation says.
Is there any way to do something like this in Hive?
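There is no direct $elemMatch equivalent in Hive, but one common workaround is to explode the JSON array into one row per element and filter on the extracted field. A rough sketch, where the table json_table, its column json_col, the $.items array, the status field, and the 'ACTIVE' value are all hypothetical names:
SELECT t.json_col
FROM json_table t
LATERAL VIEW explode(
    -- '$.items[*].status' returns a JSON array string such as ["ACTIVE","INACTIVE"];
    -- strip brackets/quotes and split it into one row per element
    -- (this simple split breaks if the values themselves contain commas)
    split(
        regexp_replace(get_json_object(t.json_col, '$.items[*].status'), '[\\[\\]"]', ''),
        ','
    )
) s AS status_value
WHERE status_value = 'ACTIVE';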

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning schemes to test table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when an Avro schema type is converted to a BigQuery schema type. For example, both the double and float Avro types are converted to FLOAT in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's table copy command instead.
Q2: That's expected; BigQuery does not have a DOUBLE type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
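As an aside, if the goal is just to compare partitioning schemes rather than to exercise the Beam pipeline, the copy can also be expressed directly in BigQuery Standard SQL; a minimal sketch, where the dataset, table names, and the event_ts column are all hypothetical:
-- Copy an existing table into a new one partitioned by day on event_ts
CREATE TABLE my_dataset.copy_partitioned_by_day
PARTITION BY DATE(event_ts) AS
SELECT *
FROM my_dataset.original_table;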

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need to get specific data from a gz file in S3. How do I write the SQL? Can I just query it like a database table, e.g.:
Select * from gz_File_Name where key = 'keyname' limit 10
But it always comes back with an error.
You need to create a Hive external table over this file location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
    col_one string,
    col_two string
)
stored as textfile -- specify your file format, or use a serde
location 's3://your_s3_path_to_the_folder_where_the_file_is_located';
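Once the external table exists, the query from the question can be run against it; for example (the column and key value below are illustrative, matching the table sketch above):
select *
from hive_schema.your_table
where col_one = 'keyname'
limit 10;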
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 does not store folders under the hood; filenames containing slashes are presented by tools such as Hive as if they were a folder structure. See here: https://stackoverflow.com/a/42877381/2700344

Pig Script Create Table

I've been trying to store CSV data into a table in a database using a Pig script.
But instead of inserting the data into a database table, I created a new file in the metastore.
Can someone please let me know if it is possible to insert data into a database table with a Pig script, and if so, what that script might look like?
You can take a look at DBStorage, but be sure to include the JDBC jar in your Pig script and declare the UDF.
The documentation for the storage UDF is here:
http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/DBStorage.html
Alternatively, you can use HCatalog's HCatStorer (the alias and table name below are placeholders):
STORE your_alias INTO 'your_database.your_table' USING org.apache.hcatalog.pig.HCatStorer();