I am using Hive for JSON storage and have created a table with a single string column that holds the entire JSON document. I have tested the get_json_object function that Hive offers, but I cannot come up with a query that iterates over all the subdocuments in a list and finds a value in a specific field.
In MongoDB, this problem can be solved by using $elemMatch as the documentation says.
Is there any way to do something like this in Hive?
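To illustrate, here is a minimal sketch of my setup and the kind of query I have been attempting; the table, column, and field names are just examples:

-- A single string column holding the whole JSON document
CREATE TABLE json_docs (json_col STRING);

-- Example document: {"items":[{"type":"A","val":1},{"type":"B","val":2}]}

-- Indexing returns only one element, not a match across all of them:
SELECT get_json_object(json_col, '$.items[0].type') FROM json_docs;

-- The wildcard form (if the Hive version supports it) returns a JSON array string
-- of every "type" value, which then has to be matched as plain text rather than per element:
SELECT get_json_object(json_col, '$.items[*].type') FROM json_docs;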
I am trying to create a table from JSON files in BigQuery and want just one column, holding only the first key, 'id'.
Creating a schema with only one column causes errors because all of the JSON keys in the input files are considered.
Is there a way to create a table that corresponds to only specific JSON keys?
Unfortunately, you can’t create a table from a JSON file in BigQuery with just one column from the JSON file. You can file a feature request at this link.
You have these options:
Option 1
Don't import as JSON, but as CSV instead (define the null character as the separator).
Each line then has only one column: the full JSON string.
Parse inside BigQuery with maximum flexibility (JSON parsing functions and even JS).
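As a rough sketch of Option 1's parsing step, assuming the file was loaded as a single-column CSV into a STRING column called json_line (the dataset, table, and column names here are only placeholders):

-- Pull just the "id" key out of the raw JSON string
SELECT JSON_EXTRACT_SCALAR(json_line, '$.id') AS id
FROM mydataset.raw_json;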
Option 2
Do a 2-step import:
Import as a new table with all the columns.
Append "SELECT column1 FROM [newtable]" into the existing table.
My current (not properly working) setup has two pipelines:
Get API data to lake: for each row in a metadata table in SQL, call the REST API and copy the reply (JSON files) to the Blob data lake.
Copy data from the lake to SQL: for each file, auto-create a table in SQL.
The result is the correct number of tables in SQL, but the content of the tables is not what I hoped for: they all contain one column named odata.metadata and one entry, the link to the metadata.
If I manually remove the metadata from the JSON in the data lake and then run the second pipeline, the SQL table is what I want.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[
  {
    "Key": "12345",
    "Title": "Name",
    "Status": "Test"
  }
]
I tried adding $.['value'] to the API call. The result then had no odata.metadata line, but the array started with {value:, which caused an error when copying to SQL.
I also tried using mapping (in the sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but it only works for datasets with the same number of columns in the array. I don't want to do the mapping manually for 170 calls...
Does anyone know how to handle this in ADF? For now I feel like the only solution is to add a Python step to the pipeline, but I hope there is a somewhat standard ADF way to do this!
You can add another pipeline with a data flow to remove that content from the JSON file before copying the data to SQL, using the Flatten formatter.
Before flattening the JSON file:
This is what I see when the JSON data is copied to the SQL database without flattening:
After flattening the JSON file:
I added a pipeline with a data flow to flatten the JSON file and remove the 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the input array.
After selecting the value object from the input array, you can see only the values under value in the Flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as Input to SQL.
Note: if your input file schema is not constant, you can enable Allow schema drift to allow schema changes.
Reference: Schema drift in mapping data flow
I have a file in S3, and Presto running on EMR. I see that I can use json_extract to read the JSON.
I am running the following query, however, I keep seeing null instead of the correct value.
select json_extract('s3a://random-s3-bucket/analytics/20210221/myjsonfile.json', '$.dateAvailability')
Am I getting the syntax wrong? Thoughts?
json_extract() operates on JSON scalar values kept in memory. It does not load data from an external location. See the documentation page for usage examples.
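For example, the function works when you hand it the JSON text itself rather than a file location (the literal below is just an illustration):

-- The first argument is the JSON value, not a path to a file
SELECT json_extract('{"dateAvailability": "2021-02-21"}', '$.dateAvailability');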
In order to query a JSON file using Trino (formerly known as Presto SQL), you need to map it as a table with JSON format like this:
CREATE TABLE my_table ( .... )
WITH (
format = 'JSON',
external_location = 's3a://random-s3-bucket/analytics/20210221'
);
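A filled-in sketch, assuming the file holds one JSON object per line with a top-level dateAvailability field (the column name and type are guesses based on the question):

CREATE TABLE my_table (
    dateavailability varchar
)
WITH (
    format = 'JSON',
    external_location = 's3a://random-s3-bucket/analytics/20210221'
);

-- After the mapping, the field can be queried as an ordinary column
SELECT dateavailability FROM my_table;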
See more information in Hive connector documentation.
If you need a tool to help you create the table statement, try this one: https://www.hivetablegenerator.com
From the page:
Easily convert any JSON (even complex nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement.
Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning schemes in order to test table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when the Avro schema type is converted to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's TableCopy command instead.
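If the goal is simply to materialize the same data into a table with a different partitioning scheme, a plain BigQuery DDL statement (rather than a Beam pipeline) may be enough; the dataset, table, and column names below are placeholders:

-- Copy an existing table into a new table partitioned on a date column
CREATE TABLE mydataset.copy_partitioned
PARTITION BY DATE(event_timestamp) AS
SELECT *
FROM mydataset.original_table;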
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
I want to add a unique value to my Hive table whenever I insert a record, and that value should not be repeated anywhere in the table. I am not able to find any solution or function for this. In my case I want to insert the records into Hive using Pig Latin. Please help.
Hive does not provide RDBMS-like constraints.
The suggested approach using a Pig script is as follows:
1. Load the data.
2. Apply DISTINCT to the data.
3. Store the data at a location.
4. Create an external Hive table at the same location.
Steps 3 and 4 can be combined if you can use HCatalog, which allows you to store data directly in a Hive table.
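For step 4, the Hive side might look roughly like this; the location, table name, columns, and delimiter are placeholders that would need to match whatever the Pig STORE step wrote out:

-- External table over the directory produced by the Pig script
CREATE EXTERNAL TABLE deduped_records (
    id    STRING,
    name  STRING,
    value STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/output/deduped_records';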
Official documentation: Link 1, Link 2
Did you take a look at this? https://github.com/manojkumarvohra/hive-hilo It seems to provide a way to generate sequence numbers in Hive using the hi/lo algorithm.