How to use Trino/Presto to query Redis

I have a simple string and hash stored in redis
get test
"1"
hget htest first
"first hash"
I'm able to see the "table" test, but there are no columns
trino> show columns from redis.default.test;
Column | Type | Extra | Comment
--------+------+-------+---------
(0 rows)
and obviously I can't get result from select
trino> select * from redis.default.test;
Query 20210918_174414_00006_dmp3x failed: line 1:8: SELECT * not allowed from relation
that has no columns
I see in the documentation that I might need to create a table definition file, but I wasn't able to create one that works.
I tried a few variations; here is one example:
{
"tableName": "test",
"schemaName": "default",
"value": {
"dataFormat": "json",
"fields": [
{
"name": "number",
"mapping": 0,
"type": "INT"
}
]
}
}
Any idea what I am doing wrong?
I focused on the string since it's simpler, but I also need to query the hash

Related

SQL for json array in column

I have a SQL table with one of the column as jsonb datatype. Below is a json entry:
{
"size": -1,
"regions": [
{
"shape_attributes": {
"name": "polygon",
"X": [
2703,
2801,
2884
]
},
"region_attributes": {
"Material Type": "wood",
"Color": "red"
}
},
{
"shape_attributes": {
"name": "polygon",
"X": [
2397,
2504,
2767
]
},
"region_attributes": {
"Material Type": "metal",
"Color": "blue"
}
}
],
"filename": "filenam_1"
}
I am using PostgreSQL.
Given a search_string, how can I use SQL to select rows for the two cases-
Key is known
Key is not known, i.e. string anywhere in json
I have tried this
select *
from TABLE_Name
WHERE 'wood' IN ( SELECT value FROM OPENJSON(COL_NAME,'$.Material Type'))
---
Error occurred during SQL query execution
Reason:
SQL Error [42883]: ERROR: function openjson(jsonb, unknown) does not exist
Hint: No function matches the given name and argument types. You might need to add explicit type casts.
SELECT *
FROM TABLE_Name
CROSS APPLY OPENJSON(COL_NAME,'$.Material Type')
WHERE value ='wood'
---
Error occurred during SQL query execution
Reason:
SQL Error [42601]: ERROR: syntax error at or near "APPLY"
To find a key/value pair with a known key, you can use several different methods; the contains operator (@>) is one of them:
select *
from table_name
where the_jsonb_column @> '{"regions": [{"region_attributes": {"Material Type": "wood"}}]}'
The equivalent of the mentioned openjson function (from SQL Server) would be jsonb_each() but that (just like openjson) will only expand the top-level key/value pairs. It doesn't do this recursively.
If you at least know the key is somewhere in the regions array, you can use a JSON path expression (available since PostgreSQL 12) that iterates over all elements recursively:
select *
from table_name t
where (t.the_jsonb_column -> 'regions') @@ '$[*].** == "wood"'
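To make the difference between the two searches concrete, here is a rough Python sketch that imitates them on a shortened copy of the sample document. The helper names contains and any_value_equals are made up for illustration, and the containment rules are simplified, so treat this as an analogy rather than a description of how PostgreSQL evaluates the operators.

```python
import json


def contains(doc, pattern):
    """Simplified imitation of jsonb containment (doc @> pattern)."""
    if isinstance(pattern, dict):
        # Every key of the pattern must exist in doc with a contained value.
        return isinstance(doc, dict) and all(
            key in doc and contains(doc[key], value)
            for key, value in pattern.items()
        )
    if isinstance(pattern, list):
        # Every pattern element must be contained in some element of doc.
        return isinstance(doc, list) and all(
            any(contains(item, wanted) for item in doc) for wanted in pattern
        )
    return doc == pattern


def any_value_equals(doc, target):
    """Recursive search for a scalar anywhere below doc (key not known)."""
    if isinstance(doc, dict):
        return any(any_value_equals(value, target) for value in doc.values())
    if isinstance(doc, list):
        return any(any_value_equals(item, target) for item in doc)
    return doc == target


# Shortened version of the JSON entry from the question.
row = json.loads("""
{
  "size": -1,
  "regions": [
    {"shape_attributes": {"name": "polygon"},
     "region_attributes": {"Material Type": "wood", "Color": "red"}},
    {"shape_attributes": {"name": "polygon"},
     "region_attributes": {"Material Type": "metal", "Color": "blue"}}
  ],
  "filename": "filenam_1"
}
""")

# Case 1: the key is known, so the pattern spells out the full path.
known_key = contains(
    row, {"regions": [{"region_attributes": {"Material Type": "wood"}}]}
)

# Case 2: the key is not known, so every value under "regions" is inspected.
unknown_key = any_value_equals(row["regions"], "wood")

no_match = contains(
    row, {"regions": [{"region_attributes": {"Material Type": "glass"}}]}
)

print(known_key, unknown_key, no_match)
```

The first helper needs the whole path to the value, which is why the containment query spells out regions and region_attributes, while the second only needs the value itself.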
I don't think what you are trying to do is possible in SQL at all, unless there is a way I don't know of. You could instead use a programming language such as Python or C# and execute the SQL queries from the program. It is much easier.

How do I INSERT columns with nested name syntax (ie. "item.description")?

I'm trying to merge two databases with the same schema on Google BigQuery.
I'm following the merge samples here: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
However, my tables have nested columns, i.e. "service.id" or "service.description".
My code so far is:
MERGE combined_table
USING table1
ON table1.id = combined_table.id
WHEN NOT MATCHED THEN
INSERT(id, service.id, service.description)
VALUES(id, service.id, service.description)
However, I get the error message: Syntax error: Expected ")" or "," but got ".", and a red squiggly underline under .id on the INSERT(...) line.
Here is a view of part of my table's schema:
[
{
"name": "id",
"type": "STRING"
},
{
"name": "service",
"type": "RECORD",
"fields": [
{
"name": "id",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
}
]
},
{
"name": "cost",
"type": "FLOAT"
}
...
]
How do I properly structure this INSERT(...) statement so that I can include the nested columns?
Syntax error: Expected ")" or "," but got "."
Looks like you are heading in the right direction. Note in the documentation how you need to insert a value into a REPEATED column:
you need to define the structure to tell BigQuery what to expect. For example:
STRUCT<created DATE, comment STRING>
This is the full example from the documentation
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
INSERT(product, quantity, supply_constrained, comments)
-- insert values like this
VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
I've found the answer.
It turns out that when referencing the top level of a STRUCT, BigQuery references all of the nested columns as well. So if I wanted to INSERT service and all of its sub-columns (service.id and service.description), I only have to include service in the INSERT(...) statement.
The following code worked:
...
WHEN NOT MATCHED THEN
INSERT(id, service)
VALUES(id, service)
This would merge all sub columns, including service.id and service.description.
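As a plain-Python analogy (not BigQuery client code), the sketch below treats each row as a dict and the service RECORD as a nested dict. The function merge_not_matched and the sample rows are invented for illustration; the point is only that copying the top-level service value carries service.id and service.description along with it.

```python
import copy

# Hypothetical sample rows shaped like the schema in the question.
combined_table = [
    {"id": "a", "service": {"id": "s1", "description": "Compute"}, "cost": 1.5},
]
table1 = [
    {"id": "a", "service": {"id": "s1", "description": "Compute"}, "cost": 1.5},
    {"id": "b", "service": {"id": "s2", "description": "Storage"}, "cost": 0.25},
]


def merge_not_matched(target, source, columns):
    """Append source rows whose id is missing from target, copying only `columns`."""
    existing_ids = {row["id"] for row in target}
    for row in source:
        if row["id"] not in existing_ids:
            # Copying the top-level "service" value brings its nested fields along.
            target.append({col: copy.deepcopy(row[col]) for col in columns})
            existing_ids.add(row["id"])
    return target


merge_not_matched(combined_table, table1, ["id", "service"])
print(combined_table[-1])
```

Columns that are not listed, such as cost here, are simply absent from the new row in this sketch.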

How to generate JSON array from multiple rows, then return with values of another table

I am trying to build a query which combines rows of one table into a JSON array, I then want that array to be part of the return.
I know how to do a simple query like
SELECT *
FROM public.template
WHERE id=1
And I have worked out how to produce the JSON array that I want
SELECT array_to_json(array_agg(to_json(fields)))
FROM (
SELECT id, name, format, data
FROM public.field
WHERE template_id = 1
) fields
However, I cannot work out how to combine the two, so that the result is a number of fields from public.template with the output of the second query being one of the returned fields.
I am using PostgreSQL 9.6.6.
Edit: as requested, here is more information: a definition of the field and template tables and a sample of each query's output.
Currently, I have a JSONB column on the template table which I am using to store an array of fields, but I want to move fields to their own table so that I can more easily enforce a schema on them.
Template table contains:
id
name
data
organisation_id
But I would like to remove data and replace it with the field table which contains:
id
name
format
data
template_id
At the moment the output of the first query is:
{
"id": 1,
"name": "Test Template",
"data": [
{
"id": "1",
"data": null,
"name": "Assigned User",
"format": "String"
},
{
"id": "2",
"data": null,
"name": "Office",
"format": "String"
},
{
"id": "3",
"data": null,
"name": "Department",
"format": "String"
}
],
"id_organisation": 1
}
This output is what I would like to recreate using one query and both tables. The second query outputs this, but I do not know how to merge it into a single query:
[{
"id": 1,
"name": "Assigned User",
"format": "String",
"data": null
},{
"id": 2,
"name": "Office",
"format": "String",
"data": null
},{
"id": 3,
"name": "Department",
"format": "String",
"data": null
}]
The feature you're looking for is JSON concatenation. You can do that with the || operator, which has been available for jsonb since PostgreSQL 9.5:
SELECT to_jsonb(template.*) || jsonb_build_object('data', (SELECT jsonb_agg(to_jsonb(field)) FROM field WHERE field.template_id = template.id)) FROM template
Sorry for poorly phrasing what I was trying to achieve. After hours of Googling I have worked it out, and it was a lot simpler than I thought in my ignorance.
SELECT id, name, data
FROM public.template, (
SELECT array_to_json(array_agg(to_json(fields)))
FROM (
SELECT id, name, format, data
FROM public.field
WHERE template_id = 1
) fields
) as data
WHERE id = 1
I wanted the result of the subquery to be a column in the output rather than compiling the entire output table as JSON.
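For readers who want to see the target shape outside SQL, here is a small Python sketch of the same idea: pick one template, collect its fields into a list, and return that list as a single "data" entry next to the template's own columns. The in-memory lists and the function template_with_fields are stand-ins invented for this example, not part of the original schema or any library.

```python
import json

# Stand-ins for public.template and public.field, using the sample values above.
templates = [{"id": 1, "name": "Test Template", "organisation_id": 1}]
fields = [
    {"id": 1, "name": "Assigned User", "format": "String", "data": None, "template_id": 1},
    {"id": 2, "name": "Office", "format": "String", "data": None, "template_id": 1},
    {"id": 3, "name": "Department", "format": "String", "data": None, "template_id": 1},
    {"id": 4, "name": "Unrelated", "format": "String", "data": None, "template_id": 2},
]


def template_with_fields(template_id):
    """Return the template's columns plus a 'data' list built from its fields."""
    template = next(t for t in templates if t["id"] == template_id)
    data = [
        {key: field[key] for key in ("id", "name", "format", "data")}
        for field in fields
        if field["template_id"] == template_id
    ]
    return {"id": template["id"], "name": template["name"], "data": data}


result = template_with_fields(1)
print(json.dumps(result, indent=2))
```

Only the aggregated fields become JSON here; id and name stay ordinary values, which mirrors the difference between the two answers.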

AWS: Other function than COPY by transferring data from S3 to Redshift with amazon-data-pipeline

I'm trying to transfer data from Amazon S3 to Amazon Redshift with the AWS Data Pipeline tool.
Is it possible to change the data during the transfer, e.g. with an SQL statement, so that only the results of the SQL statement are loaded into Redshift?
I only found the copy command, like:
{
"id": "S3Input",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Yes, it is possible. There are two approaches to it:
Use transformSql of RedshiftCopyActivity
transformSql is useful if the transformations are performed within the scope of the records that are loaded on a regular basis, e.g. every day or hour. That way changes are applied only to the batch and not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. So transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
Please find an example of transformSql usage below. Notice that the select is from the staging table. It will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift database.
{
"id": "LoadUsersRedshiftCopyActivity",
"name": "Load Users",
"insertMode": "OVERWRITE_EXISTING",
"transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', u.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', u.updated_at_pst) AS updated_at_pst FROM staging u;",
"type": "RedshiftCopyActivity",
"runsOn": {
"ref": "OregonEc2Resource"
},
"schedule": {
"ref": "HourlySchedule"
},
"input": {
"ref": "OregonUsersS3DataNode"
},
"output": {
"ref": "OregonUsersDashboardRedshiftDatabase"
},
"onSuccess": {
"ref": "LoadUsersSuccessSnsAlarm"
},
"onFail": {
"ref": "LoadUsersFailureSnsAlarm"
},
"dependsOn": {
"ref": "BewteenRegionsCopyActivity"
}
}
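The staging flow quoted from the documentation can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift, so the table users, its columns, and the sample rows are all assumptions made for the example, and INSERT OR REPLACE only approximates what OVERWRITE_EXISTING does in the real activity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical target table; in the real pipeline this lives in Redshift.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, first_name TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'old@example.com', 'Ann')")

# Step 1: the raw input rows are first loaded into a table called staging.
cur.execute("CREATE TEMPORARY TABLE staging (id INTEGER, email TEXT, first_name TEXT)")
cur.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [(1, "ANN@EXAMPLE.COM", "Ann"), (2, "Bob@Example.com", "Bob")],
)

# Step 2: the transform SELECT runs on staging and its result becomes staging2.
# Every target column is listed, even the ones that are not transformed.
transform_sql = "SELECT id, lower(email) AS email, first_name FROM staging"
cur.execute(f"CREATE TEMPORARY TABLE staging2 AS {transform_sql}")

# Step 3: staging2 updates the target table (overwrite on matching primary key).
cur.execute("INSERT OR REPLACE INTO users SELECT id, email, first_name FROM staging2")
conn.commit()

rows = cur.execute("SELECT id, email, first_name FROM users ORDER BY id").fetchall()
print(rows)
```

Only the rows of the current batch pass through the transform, which is the property that makes transformSql a good fit for incremental loads.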
Use script of SqlActivity
SqlActivity allows operations on the whole dataset, and can be scheduled to run after particular events through the dependsOn mechanism.
{
"name": "Add location ID",
"id": "AddCardpoolLocationSqlActivity",
"type": "SqlActivity",
"script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
"database": {
"ref": "DashboardRedshiftDatabase"
},
"schedule": {
"ref": "HourlySchedule"
},
"output": {
"ref": "LocationsDashboardRedshiftDatabase"
},
"runsOn": {
"ref": "OregonEc2Resource"
},
"dependsOn": {
"ref": "LoadLocationsRedshiftCopyActivity"
}
}
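The script in this example is written so that it can run on every schedule tick without creating duplicates. A quick way to convince yourself is to run the same statement several times against a scratch database; the sketch below does that with Python's sqlite3 module as a stand-in for Redshift, with a minimal locations table assumed for the purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the locations table; the real one has more columns.
conn.execute("CREATE TABLE locations (id INTEGER)")

# Same statement as the "script" value above, without the trailing semicolon.
script = (
    "INSERT INTO locations (id) "
    "SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000)"
)

# Simulate three scheduled runs of the activity.
for _ in range(3):
    conn.execute(script)

row_count = conn.execute(
    "SELECT COUNT(*) FROM locations WHERE id = 100000"
).fetchone()[0]
print(row_count)
```

The NOT EXISTS guard makes the second and third runs insert nothing, so the row appears once no matter how often the activity fires.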
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not personally used this, but from the looks of it, you treat your S3 data as if it were in a temp table, and this SQL statement returns the transformed data for Redshift to insert.
So you will need to list all fields in the select, whether or not you are transforming that field.
AWS Datapipeline SqlActivity
{
"id" : "MySqlActivity",
"type" : "SqlActivity",
"database" : { "ref": "MyDatabase" },
"script" : "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
"schedule" : { "ref": "Hour" },
"queue" : "priority"
}
So basically, "script" can hold any SQL script, transformation, or command (see Amazon Redshift SQL Commands).
transformSql is fine, but it supports only the SQL SELECT expression used to transform the input data (ref: RedshiftCopyActivity).

elasticsearch splits by space in facets

I am trying to do a simple facet request over a field containing more than one word (e.g. 'Name1 Name2', sometimes with dots and commas inside) but what I get is...
"terms" : [{
"term" : "Name1",
"count" : 15
},
{
"term" : "Name2",
"count" : 15
}]
so my field value is split on spaces before the facet is computed...
Query example:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
"query": {
"query_string": {
"fields": [
"dataset"
],
"query": "2",
"default_operator": "AND"
}
},
"facets": {
"test": {
"terms": {
"field": [
"speciesName"
],
"size": 50000
}
}
}
}'
Your field shouldn't be analyzed, or at least not tokenized. You need to update your mapping and then reindex if you want to index the field without tokenizing it.
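A rough Python illustration of why this matters: a terms facet counts whatever terms were indexed, so a tokenized field yields one bucket per word. The whitespace split below is only a crude stand-in for the real analyzer, and the sample values are invented. The mapping dict at the end shows how a non-analyzed string field was typically declared on the older Elasticsearch versions that still have facets; treat it as a sketch to check against the documentation for your version.

```python
from collections import Counter
import json

# Invented sample: fifteen documents with the same two-word species name.
species_names = ["Name1 Name2"] * 15

# Analyzed field: each value is split into tokens before being indexed,
# so the facet counts the individual words.
analyzed_terms = Counter(
    token for value in species_names for token in value.split()
)

# Not-analyzed field: the whole value is indexed as a single term.
not_analyzed_terms = Counter(species_names)

print(dict(analyzed_terms))
print(dict(not_analyzed_terms))

# Assumed mapping for the type and field from the question.
mapping = {
    "Occurrence": {
        "properties": {
            "speciesName": {"type": "string", "index": "not_analyzed"}
        }
    }
}
print(json.dumps(mapping, indent=2))
```

Changing the mapping does not rewrite terms that are already in the index, which is why a reindex is needed afterwards.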
First of all, javanna provided a very good answer from a practical perspective. However, for the sake of completeness, I want to mention that in some cases there is a way to do it without reindexing the data.
If the speciesName field is stored and your queries produce a relatively small number of results, you can use script_field to retrieve stored field values:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
"query": {
"query_string": {
"fields": ["dataset"],
"query": "2",
"default_operator": "AND"
}
},
"facets": {
"test": {
"terms": {
"script_field": "_fields['\''speciesName'\''].value",
"size": 50000
}
}
}
}
'
As a result of this query, elasticsearch will retrieve the speciesName field for every record in your result set and it will construct facets from these values. Needless to say, if your result set contains millions of records, performance of this query might be sluggish.
Similarly, if the field is not stored but the record source is stored, you can use a script_field facet to retrieve field values from the source:
......
"script_field": "_source['\''speciesName'\'']",
......
Again, source for each record in the result list will be retrieved and parsed, so you might need some patience to run this query on a large set of records.