Tweak schema of a CTAS parquet table in Apache Drill: make element required instead of optional

I'd like to produce a parquet file with a very specific schema using Apache Drill. I join two tables with CTAS like:
CREATE TABLE synthetic1 AS (
SELECT e1.returneddocids AS returneddocids, e1.pathinfo AS pathinfo, c1.counters AS counters
FROM dfs.`/tmp/tier1.parquet` e1 LEFT JOIN dfs.tmp.shadow3 c1 ON TRUE LIMIT 100
);
The resulting file schema looks like this:
message root {
  optional group returneddocids {
    repeated group list {
      optional binary element (UTF8); // need this one as required, not optional
    }
  }
  optional binary pathinfo (UTF8);
  optional group counters {
    repeated group list {
      optional group element { // need this as required
        optional binary name (UTF8); // need this as required
        optional int32 value; // need this as required
      }
    }
  }
}
I wonder how I can tweak the CTAS query so the optional elements above are changed to required?

It's quite convoluted, but you can use CREATE OR REPLACE SCHEMA to apply constraints. In my case this kind of works (not exactly, though it could be helpful for others struggling with a similar problem):
ALTER SESSION SET `store.table.use_schema_file` = true;
ALTER SESSION SET `exec.storage.enable_v3_text_reader` = true;
CREATE OR REPLACE SCHEMA (
    returneddocids STRUCT<`list` STRUCT<`element` ARRAY<VARCHAR>>> NOT NULL,
    pathinfo VARCHAR NOT NULL,
    counters STRUCT<`list` STRUCT<`element` ARRAY<VARCHAR>>>
) FOR TABLE synthetic1;
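If you go this route, it is worth verifying what Drill actually stored for the table; a minimal check, assuming Drill 1.16+ and that synthetic1 sits under dfs.tmp:
-- prints the provisioned schema (kept in a hidden .drill.schema file in the table directory)
DESCRIBE SCHEMA FOR TABLE dfs.tmp.`synthetic1`;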

Related

SQL OpenJSON obtain label in select output

I'm using the OPENJSON function in SQL Server to import various JSON files and can usually handle the variations in format from different sources; however, I've got an example where I can't reach a certain value.
Example JSON format file:
{
  "bob": {
    "user_type": "standard",
    "user_enabled": "true",
    "last_login": "2021-07-25"
  },
  "claire": {
    "user_type": "administrator",
    "user_enabled": "true",
    "last_login": "2021-09-17"
  }
}
One of the values I want to return as one of my columns is the user's name.
I believe it's called the key, but I'm not entirely sure, because when I execute the following (having loaded the JSON string into the @json variable):
select *
from openjson(@json)
I get two columns, one labelled key containing the username, the other containing my nested json string within {} braces.
Usually, to run my select statement, I would do something like:
select username, user_type, user_enabled, last_login
from openjson(@thisjson)
with (
    username nvarchar(100),
    user_type nvarchar(100),
    user_enabled nvarchar(100),
    last_login nvarchar(100)
)
I get that sometimes I have to put the base path in the brackets after openjson, and sometimes I have to follow the input column definitions with something like '$.last_login' to help traverse the structure, but I can't work out how to identify or select the placeholder for the username.
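For what it's worth, one pattern that surfaces the key is to keep the default OPENJSON output (which exposes key and value columns) for the outer object and apply the WITH clause only to the nested object via CROSS APPLY; a sketch, assuming the JSON above is loaded into @json:
select j.[key] as username, d.user_type, d.user_enabled, d.last_login
from openjson(@json) as j
cross apply openjson(j.[value])
with (
    user_type nvarchar(100),
    user_enabled nvarchar(100),
    last_login nvarchar(100)
) as d
Here j.[key] carries the username ('bob', 'claire'), and the CROSS APPLY parses each user's nested object into typed columns.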

Does Spark SQL provide an API to parse a SQL statement and corresponding DDL and infer data types of the select list?

I'm reviewing Spark SQL for a project and I see all the pieces of the API I need (SQL parser, Dataset, Encoder, LogicalPlan, etc.), but I'm having difficulty tying them together the way I'd like.
Essentially I want the following functionality:
var ddl = parseDDL(RAW_DDL);
var query = parseQuery("SELECT c1, c2 FROM t1 WHERE c2='value'", ddl);
var selectFields = query.getSelectFields();
for (var field : selectFields) {
    var name = field.getName();
    var type = field.getType(); // <~~~ want this in terms of `t1` from ddl
    ...
}
The type information for the select list in terms of the DDL is what I'm after.
Ideally I'd like a soup-to-nuts example with Spark SQL, if possible.
UPDATE
To clarify let's say I have an SQL schema file with several CREATE TABLE statements:
File: com/abc/MovieDb.sql
CREATE TABLE Movie (
    Title varchar(255),
    Year integer,
    etc.
);
CREATE TABLE Actor (
    FirstName varchar(255),
    LastName varchar(255),
    etc.
);
etc.
I want to use Spark SQL to parse a number of arbitrary SQL SELECT statements against this schema. Importantly, I want to get type information about the select list of each query in terms of the Movie, Actor, etc. tables and columns in the schema. For example:
SELECT Title, Year FROM Movie WHERE Year > 1990
I want to parse this query against the schema and get the type information for the select list. Again the queries are arbitrary, however the schema is stable so something like:
var parser = createParser(schema);
var query = parser.parseQuery(arbitraryQuery);
var selectedFields = query.getSelectedFields();
for (var field : selectedFields) {
    var name = field.getName();
    var type = field.getType();
}
Most important is the field.getType() call.
I assumed this would be an easy Yes or No type question, but perhaps my use-case is off the beaten path. Time to dive into it myself...
To get column information, here is what can be done. Suppose you have a dataframe with columns a, b, c, d in it:
import spark.implicits._ // needed for toDF

val inputDf = Seq(("foo", "Bar", 0, 0.0)).toDF("a", "b", "c", "d")
val newDf = inputDf.select("a", "c")
val columnInfo = newDf.dtypes // gives something like Array(("a","StringType"), ("c","IntegerType"))
Again, this is not tested code, but generally this is how you can get the column names and their types.
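Building on that, a sketch of the schema-file idea (untested; the DDL string and view name below are placeholders): register each table from the schema as an empty temp view, then read Dataset.schema of the parsed query. Spark's analyzer resolves the select-list types against the registered schema without executing anything.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[*]").appName("ddl-types").getOrCreate()

// Stand-in for parsing com/abc/MovieDb.sql: build a StructType from the column list
// and expose it as an empty view named after the table.
val movieSchema = StructType.fromDDL("Title STRING, Year INT")
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], movieSchema)
  .createOrReplaceTempView("Movie")

// Analysis (not execution) resolves the select list against that schema.
val query = spark.sql("SELECT Title, Year FROM Movie WHERE Year > 1990")
query.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
// prints: Title: StringType, Year: IntegerType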

Postgresql insert into name and value using a variable

I have 2 variables
var anycolumn = 'fruit';
var anyvalue = 'apple';
I was able to insert the value using variable 'anyvalue' like...
"INSERT INTO mytable (fruit) VALUES ($1)", [anyvalue], function(err, result){}
Now how do I insert the column name using the variable 'anycolumn'? Something like the following does not substitute the column name:
"INSERT INTO mytable (anycolumn) VALUES ($1)", [anyvalue], function(err, result){}
I'm using a Node server.
"INSERT INTO mytable ("+anycolumn+") VALUES($1)", [anyvalue]...
But make sure you know for certain what anycolumn contains. Don't use anything received directly or indirectly from the outside (e.g. an HTTP request, in the form of parameters, body, cookies, headers...) without validating it; otherwise you'll open yourself up to SQL injection.
Also note that this probably means your database schema is incorrect, and you should have a table with columns name and value.
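In that design both pieces of data travel as bind parameters and the SQL stays static; a rough sketch, with hypothetical table and column names:
CREATE TABLE mytable_attributes (
    name  text NOT NULL, -- e.g. 'fruit'
    value text           -- e.g. 'apple'
);
The insert then becomes "INSERT INTO mytable_attributes (name, value) VALUES ($1, $2)", [anycolumn, anyvalue], function(err, result){}, with no string concatenation at all. If you do keep the dynamic column instead, the pg-promise approach below escapes the identifier for you.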
The following example is based on pg-promise:
db.none('INSERT INTO mytable($1~) VALUES($2)', [anycolumn, anyvalue])
    .then(() => {
        // success;
    })
    .catch(error => {
        // error;
    });
Using the same SQL Names syntax you can inject the table name as well.

I need to format a column value in my return but that column may not exist in which case I need to grab the rest of the columns data

I have an app that uses a local SQLite database which has a particular table called 'Theme' for each project. Sometimes that table has a column 'startDate' and sometimes it does not. I need to have that 'startDate' returned in a particular format if it does exist. My problem is that when I query this table specifying the necessary format, if the column does not exist the query returns the error "NO SUCH COLUMN".
How do I check for column existence? If it does exist, return the 'startDate' properly formatted along with the rest of the data; if it does not exist, return the rest of the data without the 'startDate'.
This must be done in 1 query!
Something like this...
SELECT * (if exists STRFTIME('%Y/%m/%d %H:%M:%S', startDate) AS sDate FROM Theme
Only one query:
Cursor cursor = database.query(TABLE_NAME, null, null, null, null, null, null);
if (cursor.moveToFirst()) {
    do {
        for (String columnName : cursor.getColumnNames()) {
            // do something
        }
    } while (cursor.moveToNext());
} else {
    // your actions for empty table
}
cursor.close();
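If you still want the formatted date whenever the column happens to exist, you can branch on the column index inside that loop; a sketch, assuming startDate is stored as a parseable date string:
int startDateIdx = cursor.getColumnIndex("startDate"); // -1 when the Theme table has no startDate column
if (startDateIdx != -1) {
    String rawDate = cursor.getString(startDateIdx);
    // reformat rawDate in Java (e.g. with SimpleDateFormat) instead of relying on STRFTIME in the SQL
}
// read the remaining columns as usual via cursor.getColumnNames()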

How can I cause Grails/GORM to use default sequence values in postgres?

When I define a domain object like:
class MusicPlayed {
    String user
    Date date = new Date()
    String mood

    static mapping = {
        id name: 'played_id'
        version false
    }
}
I get a postgres sequence automatically defined like:
CREATE SEQUENCE seq_music_played
    INCREMENT 1
    MINVALUE 1
    MAXVALUE 9223372036854775807
    START 1
    CACHE 1;
That's great -- but I'd love to have this become the default value for my id field. In other words, I'd like to have the table defined with:
played_id bigint DEFAULT nextval('seq_music_played'::regclass) NOT NULL,
... but this doesn't happen. So when my client code requires manual SQL invocation, I'm stuck pulling new values from the sequence instead of just relying on auto-population.
Is there any way to cause this table to be created "the way I want," or do I need to forgo gorm's table-creation magic and just create the tables myself with a db-creation script that runs at install-time?
Note My question is similar to How to set up an insert to a grails created file with next sequence number?, but I'm specifically looking for a solution that doesn't pollute my client code.
This works for me :-)
static mapping = {
    id generator: 'native', params: [sequence: 'my_seq'], defaultValue: "nextval('my_seq')"
}
Generating something like:
create table author (
    id int8 default nextval('nsl_global_seq') not null, ...
for PostgreSQL.
I would use:
static mapping = {
    id generator: 'native', params: [sequence: 'your_seq']
}
Additionally, I would update the DEFAULT value of the id column via
ALTER TABLE your_table ALTER COLUMN id SET DEFAULT nextval('your_seq');
This is extremely useful for manual INSERTs
UPDATE - use liquibase for the default-column-problem:
changeSet(author: 'Bosh', id: 'your_table_seq_defaults', failOnError: true) {
    sql("ALTER TABLE your_table ALTER COLUMN id SET DEFAULT nextval('your_seq')")
}
I tend to create my tables directly in PostgreSQL and then map them in grails.
I took the best idea to the sequences-generated-IDs from here:
http://blog.wolfman.com/articles/2009/11/11/using-postgresql-with-grails
Give it a try and then smile at your former problems :)
rawi
You can define it in Config.groovy
grails.gorm.default.mapping = {
    id generator: 'sequence'
}
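Note that this Config.groovy block applies to every domain class. If only MusicPlayed should use the generated sequence, roughly the same thing can be declared per class; a sketch combining the mappings shown above with the sequence name from the question:
static mapping = {
    id name: 'played_id', generator: 'native', params: [sequence: 'seq_music_played']
    version false
}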