What is an efficient way to parse a query result into a struct that has two fields, a string and an array of structs, using pkg sqlx?

I have written the following query and I am trying to find an efficient way to parse its result:
SELECT modes.mode_name, confs.config_name, confs.field1, confs.field2, confs.field3
FROM modes
JOIN confs ON modes.config_name = confs.name
WHERE modes.mode_name = $1 ORDER BY confs.config_name ASC;
For each mode there are multiple corresponding configs: table modes has a two-column primary key, mode_name and config_name.
Here are the structs I have to use:
type Mode struct {
    Name    string `db:"mode_name"`
    Configs []Config
}

type Config struct {
    Name   string  `db:"name" json:"-"`
    Mode   string  `db:"mode_name" json:"mode_name,omitempty"`
    Field1 float32 `db:"field1" json:"field1,omitempty"`
    Field2 float32 `db:"field2" json:"field2,omitempty"`
    Field3 float32 `db:"field3" json:"field3,omitempty"`
}
I expect to find a way to populate the Mode struct with the data from the query above:
Name from mode_name
Then parse each corresponding config into a Config struct and append it to the Configs slice.
I have studied the docs for pkg sqlx and picked and tried several options that looked promising:
sqlx.QueryxContext, attempting to iterate over the returned Rows with StructScan
sqlx.NamedExec, parsing directly into the struct (which fails yet again, as mine has a nested struct slice inside).
Both of them failed, and I am beginning to think there might be no elegant way to solve the task in these circumstances with the aforementioned tools.
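For concreteness, the QueryxContext/StructScan route I had in mind looks roughly like the sketch below. Note the assumption: the selected columns must carry names that match the Config db tags (my query selects confs.config_name, which has no matching tag, so I select confs.name instead).

import (
    "context"

    "github.com/jmoiron/sqlx"
)

func loadMode(ctx context.Context, db *sqlx.DB, modeName string) (*Mode, error) {
    const q = `
        SELECT modes.mode_name, confs.name, confs.field1, confs.field2, confs.field3
        FROM modes
        JOIN confs ON modes.config_name = confs.name
        WHERE modes.mode_name = $1
        ORDER BY confs.name ASC`

    rows, err := db.QueryxContext(ctx, q, modeName)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    mode := &Mode{Name: modeName}
    for rows.Next() {
        var c Config
        // StructScan maps each column onto the matching db-tagged field of Config.
        if err := rows.StructScan(&c); err != nil {
            return nil, err
        }
        mode.Configs = append(mode.Configs, c)
    }
    return mode, rows.Err()
}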

Related

Efficient Way In SQL Server To Remove All "InvalidXMLCharacters" From an NVARCHAR

As part of this answer, I determined that one of the things that can break an OLAP cube is feeding it values (in the dimension names/values/etc.) that contain characters considered "InvalidXMLCharacters". Now I would like to filter out these values so that they never end up in the OLAP cubes I'm building in SQL. Often I find myself importing this input data from one table into another, with something like the following:
INSERT INTO [dbo].[DestinationTableThatWillBeReferencedInMyOLAPCube]
SELECT TextDataColumn1, TextDataColumn2, etc...
FROM [dbo].[SourceTableContainingColumnsWithValuesWithInvalidXMLCharacters]
WHERE XYZ...
Is there an efficient way to remove all "InvalidXMLCharacters" from my Columns in this query?
The obvious solution that comes to mind would be some sort of Regex, though from the previously stated linked posts, that might be quite complex, and I'm not sure of the performance implications around this.
Another idea I've had is to Convert the Columns to "XML" data type, but that will error if they contain invalid characters, not very helpful for removing them...
I've looked around and don't see many other cases where developers are trying to do exactly this; has this been tackled some other way in another post that I haven't found?
.NET CLR integration in SQL Server could be helpful.
Here is a small C# example you can use as a starting point. Its most important line is the XmlConvert.IsXmlChar(ch) call, which filters out invalid XML characters.
using System;
using System.Linq;
using System.Xml;

class Program
{
    static void Main()
    {
        // https://www.w3.org/TR/xml/#charsets
        // ===================================
        // Valid chars per the XML spec:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        string content = "fafa\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False
        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(content);                   // Clean string
        Console.WriteLine(IsValidXmlString(content)); // True
    }

    // Keep only the characters that are valid in XML.
    static string RemoveInvalidXmlChars(string text)
    {
        return new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
    }

    static bool IsValidXmlString(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch
        {
            return false;
        }
    }
}
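To call this from the INSERT ... SELECT itself, the same logic can be wrapped in a SQL CLR scalar function. This is only a sketch: the class and function names are illustrative, and deployment (CREATE ASSEMBLY / CREATE FUNCTION) is not shown.

using System.Data.SqlTypes;
using System.Linq;
using System.Xml;

public class XmlCleanup
{
    // Hypothetical CLR scalar function callable from T-SQL after deployment.
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString RemoveInvalidXmlChars(SqlString text)
    {
        if (text.IsNull)
            return SqlString.Null;
        return new SqlString(new string(text.Value.Where(XmlConvert.IsXmlChar).ToArray()));
    }
}

Once deployed, it can be used inline in the question's INSERT, e.g. SELECT dbo.RemoveInvalidXmlChars(TextDataColumn1), dbo.RemoveInvalidXmlChars(TextDataColumn2), ...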

Write null value to Parquet file

I'm using the Parquet CPP library to write data from a MySQL database to a Parquet file. I have two questions:
1) What does the REPETITION in schema mean? Is it related to table constraints when we define a column as NULL or NOT NULL?
2) How to insert NULL value into a column? Do I just pass a null pointer to the value parameter?
WriteBatch(int64_t num_levels, const int16_t* def_levels,
const int16_t* rep_levels,
const typename ParquetType::c_type* values)
Thanks in advance!
@Ivy.W I have been using parquet-cpp recently at work, and this is what I understand:
The Parquet schema needs to describe each column of the table that you are going to read from and write to. If the column is nullable, its repetition type is OPTIONAL; if it is not nullable, it is REQUIRED; and for nested structures (such as maps and lists) it is REPEATED. Let me give a quick intro to definition and repetition levels:
The definition level in Parquet identifies, for a nullable value, the depth at which the path to that value becomes NULL. With the definition and repetition levels together, the record structure can be reconstructed.
A field can be optional, required, or repeated. If the field is required, it can never be null, so no definition level needs to be stored for it. If it is optional, the level is 0 for null and 1 for non-null. If the schema is nested, additional values are used accordingly.
e.g
message ExampleDefinitionLevel {
    optional group a {
        optional group b {
            optional string c;
        }
    }
}
Here the maximum definition level of c is 3: a definition level of 0 means a is null, 1 means a is defined but b is null, 2 means b is defined but c is null, and 3 means c actually has a value.
Repetition level:
Repetition level is only applicable to nested structures such as lists and maps.
For example, when a user can have multiple phone numbers, the field will be "repeated".
e.g.
message list {
    repeated string list;
}
The data ["a","b","c"] would be stored as:
{
    list: "a",
    list: "b",
    list: "c"
}
with repetition levels 0, 1, 1: the 0 marks the start of a new record, and each subsequent 1 means "repeat at the list level".
To write a null, make sure the schema declares the column as nullable (OPTIONAL), pass a definition level of 0 for that slot, and WriteBatch will take care of the rest.
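A minimal sketch of what that looks like, assuming int32_writer is the TypedColumnWriter for an OPTIONAL INT32 column (so its maximum definition level is 1):

// Writing the logical values [7, NULL, 9]: three levels, two physical values.
int64_t num_levels = 3;
int16_t def_levels[] = {1, 0, 1};  // 0 in the middle slot marks the NULL
int32_t values[] = {7, 9};         // only the non-null values are passed
// rep_levels may be null because the column is not repeated
int32_writer->WriteBatch(num_levels, def_levels, nullptr, values);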
Please refer to https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html

Checking existence of value inside JSON array

I've found solutions for cases more complex than this one, but I'm not able to make this simple case work...
I have a column called data of type json. The JSON structure is as follows:
{"actions": ["action1","action2","action3", ..., "actionN"]}
So, let's say I've 3 rows with the following data:
{"actions": ["work","run"]}
{"actions": ["run","eat","sleep", 'walk']}
{"actions": ["eat","run","work"]}
I want to retrieve the rows where work is included in actions array.
I've tried something similar to what is posted here:
Query for element of array in JSON column, but since each element inside the array is just a JSON string, I got stuck there.
Then I tried something like:
SELECT * from table t WHERE 'work' in ...
but this also failed, as I could not get the values as a string array to put there.
Using PostgreSQL 9.3.
Since you are using the json type, which is stored as plain text internally, the simplest solution is:
SELECT * FROM table WHERE strpos(data::text, 'work') > 0;
It defeats the purpose of using json in the first place, and it will also match substrings (such as 'workout') or occurrences outside the actions array, but it is (very likely) faster than the json parser in this simple case.
A json solution that works on 9.3 would be:
SELECT * FROM (
    SELECT *, json_array_elements(data->'actions')::text AS action FROM table) AS foo
WHERE action = '"work"';
Note the inner double quotes around work: json_array_elements returns json values, and casting a json string to text keeps its JSON quoting.
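As an aside for anyone on 9.4 or later (it does not help on the asker's 9.3): with jsonb, the ? existence operator tests array membership directly:

SELECT * FROM table WHERE (data::jsonb -> 'actions') ? 'work';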

Find a suitable vocabulary database to build a C structure

Let's begin with the question's final purpose: my aim is to build a word-based neural network which should take a basic sentence and select, for each individual word, the meaning it is supposed to carry in the sentence itself. It is then going to learn something about the language (for example, the possible correlation between two given words, or the probability of finding both in a single sentence) and at the final stage (after the learning phase) try to build some very simple sentences of its own according to some input.
In order to do this I need some kind of database representing a vocabulary of a given language from which I could extract some information such as word list, definitions, synonyms et cetera. The database should be structured in a way such that I can build C data structures containing the needed information such as
typedef struct _dictEntry DictionaryEntry;
typedef struct _dict Dictionary;
struct _dictEntry {
    const char *word;           // Word string
    const char **definitions;   // Array of definition strings
    DictionaryEntry **synonyms; // Array of pointers to synonym words
    Dictionary *dictionary;     // Pointer to parent dictionary
};

struct _dict {
    const char *language;       // Language identification string
    int count;                  // Number of elements in the dictionary
    float **correlations;       // Correlation matrix between i-th and j-th entries
    DictionaryEntry *entries;   // Array of dictionary entries
};
or equivalent Obj-C objects.
I know (from Searching the Mac OSX system dictionaries?) that the Apple-provided dictionaries are licensed, so I cannot use them to create my data structures.
Basically what I want to do is the following: given an arbitrary word A I want to fetch all the dictionary entries which have a definition containing A and select such definition only. I will then implement some kind of intersection procedure to select the most appropriate definition and synonyms based on the rest of the sentence and build a correlation matrix.
Let me give a little example: let us suppose I type a sentence containing "play"; I want to fetch all the entries (such as "game", "instrument", "actor", etc.) the word "play" can be correlated to and for each of them select the corresponding definition (I don't want for example to extract the "instrument" definition which corresponds to the "tool" meaning since you cannot "play a tool"). I will then select the most appropriate of these definitions looking at the rest of the sentence: if it contains also the word "actor" then I will assign to "play" the meaning "drama" or another suitable definition.
The most basic way to do this is to scan every definition in the dictionary searching for the word "play", so I will need to access all definitions without restriction, and as I understand it this cannot be done using the dictionaries located under /Library/Dictionaries. Sadly this work MUST be done offline.
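To make the intent concrete, the naive scan I have in mind would look like the sketch below, built on the structures above; it assumes the dictionary is already populated and that definitions is a NULL-terminated array (both are assumptions on my part):

#include <string.h>

/* Collect up to max_out entries whose definitions mention `word`. */
static int findEntriesMentioning(const Dictionary *dict, const char *word,
                                 const DictionaryEntry **out, int max_out)
{
    int found = 0;
    for (int i = 0; i < dict->count && found < max_out; i++) {
        const DictionaryEntry *e = &dict->entries[i];
        for (const char **def = e->definitions; def != NULL && *def != NULL; def++) {
            if (strstr(*def, word) != NULL) {  /* substring match on the definition */
                out[found++] = e;
                break;
            }
        }
    }
    return found;
}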
Is there any available resource I can download which allows me to get my hands on all the definitions and fetch my info? Currently I'm not interested in any particular file format (it could be a database or an XML file or anything else), but it must be something I can decompose and put in a data structure. I tried to google it but, whatever keywords I use, if I include the word "vocabulary" or "dictionary" I (pretty obviously) only get pages of word definitions on some online dictionary site! I guess these are not the best terms to search for...
I hope the question is clear... If it is not I'll try to explain it in a different way! Anyway, thanks in advance to all of you for any helpful information.
A free ontology, such as the one at http://www.eat.rl.ac.uk, would probably help you. Several are available from the university sector.

How to read data from data bag within a PIG script

I have a databag which is in the following format:
{([ChannelName#{ (bigXML,[])} ])}
The DataBag consists of only one item, which is a tuple.
The tuple consists of only one item, which is a map.
The map maps channel names to values.
Here the value is of type DataBag, which consists of only one tuple.
The tuple consists of two items: one is a chararray (a very big string) and the other is a map.
I have a UDF that emits the above bag.
Now I need to invoke another UDF by passing the only tuple within the DataBag against a given channel from the map.
If there were no data bag, just a tuple such as
([ChannelName#{ (bigXML,[])} ])
I can access the data using $0.$0#'StdOutChannel'
Now with the tuple inside a bag
{([ChannelName#{ (bigXML,[])} ])}
If I do $0.$0.$0#'StdOutChannel' (prepending $0), I get the following error:
ERROR 1052: Cannot cast bag with schema bag({bytearray}) to map
How can I access data within a data bag?
Try to break this problem down a little.
Let's say you get your inner bag:
MYBAG = $0.$0#'StdOutChannel';
First, can you ILLUSTRATE or DUMP this?
What can you do with this bag? Usually FOREACH over the tuples inside:
A = FOREACH MYBAG {
    GENERATE $0 AS MyCharArray, $1 AS MyMap;
};
ILLUSTRATE A; -- or if this doesn't work
DUMP A;
Try this interactively, and then edit your question with some more details as a result of trying these things.
Some editing hints for StackOverflow:
put backticks around your code (`ILLUSTRATE`)
indent code blocks by 4 spaces on each line