How to get the base type of an array type in portable JDBC

If you have a table with a column whose type is SQL ARRAY, how do you find the base type of the array type, aka the type of the individual elements of the array type?
How do you do this in vendor-agnostic pure JDBC?
How do you do this without fetching and inspecting actual row data? Equivalently: what if the table is empty?
Similar questions were asked here:
How to get array base type in postgres via jdbc
JDBC : get the type of an array from the metadata
However, I am asking for a vendor-agnostic way through the JDBC API itself: how is one supposed to solve this problem with pure JDBC? This seems like a core use case of JDBC, yet I can't find a solution in the API.
I've spent hours reading and re-reading the JDBC javadocs, and several more hours scouring the internet, and I'm greatly surprised that there doesn't seem to be a proper way of doing this via the JDBC API. It should be right there in DatabaseMetaData or ResultSetMetaData, but it's apparently not.
Here are the insufficient workarounds and alternatives that I've found.
Fetch some rows until you get a row with an actual value for that column, get the column value, cast to java.sql.Array, and call getBaseType.
For postgres, assume that SQL ARRAY type names are encoded as ("_" + baseTypeName).
For Oracle, use Oracle-specific extensions that allow getting the answer.
Some databases have a special "element_types" view which contains one row for each SQL ARRAY type used by the current tables and so on, and each row contains the base type and base type name.
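For databases whose information_schema follows the SQL standard here (PostgreSQL and HSQLDB, for example, expose such a view), that last workaround looks roughly like the sketch below; the table name is hypothetical and the exact join columns can differ per vendor.
SELECT c.column_name,
       e.data_type AS element_data_type
FROM information_schema.columns AS c
JOIN information_schema.element_types AS e
  ON c.table_catalog   = e.object_catalog
 AND c.table_schema    = e.object_schema
 AND c.table_name      = e.object_name
 AND e.object_type     = 'TABLE'
 AND c.dtd_identifier  = e.collection_type_identifier
WHERE c.table_name = 'my_table'   -- hypothetical table name
  AND c.data_type  = 'ARRAY';
This works on an empty table because it reads catalog metadata, not row data, but it is of course not "pure JDBC" either.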
My context is that I would like to use vendor-supplied JDBC connectors with Spark, in the cloud, in my company's product, and metadata discovery becomes an important thing. I'm also investigating the feasibility of writing JDBC connectors myself for other data sources that don't have a JDBC driver or Spark connector yet. Metadata discovery is important so that one can define the Spark InternalRow and the Spark-JDBC data getters correctly. Currently, Spark-JDBC has very limited support for SQL ARRAY and SQL STRUCT; I managed to provide the missing bits with a day or two of coding, but during that process I hit this problem, which is blocking me. If I have control over the JDBC driver implementation, then I could use a kludge (i.e. encode the type information in the type name, and in the Spark JdbcDialect, take the type name and decode it to create the Catalyst type). However, I want to do it the proper JDBC way, and ideally in a way that other vendor-supplied JDBC drivers will support.
PS: It took me a surprising amount of time to locate DatabaseMetaData.getAttributes(). If I'm reading this right, this can give me the names and types of the fields/attributes of a SQL STRUCT. Again, I'm very surprised that I can get the names and types of the fields/attributes of a SQL STRUCT in vendor-agnostic pure JDBC but not get the base-type of a SQL ARRAY in vendor-agnostic pure JDBC.

Related

Is the column of type JSON deprecated?

In the BigQuery console, when creating a table, there used to be a JSON option among the column types, but weirdly enough it was never present in their docs. We used this column type in our production tables, and discovered later on that you can't select it in queries, otherwise BigQuery throws an error, and the JSON functions also didn't work with it. So we simply stopped using this column in queries, but it still exists in our tables.
However, in the past couple of days, all queries against this table have been failing with the error "400 Json is not enabled for current project.", and this column type is no longer present in the BigQuery console. It seems it was removed or deprecated? I checked the release notes, but the latest release was well before the error occurred. This broke our production environment, and we couldn't even export the data, because exporting gave the same error. Instead we had to use a new table without this column, which meant we lost all our history.
Did anyone face the same problem with any other column types before? Is it normal that a type is deprecated without users being notified beforehand? This is making me question the reliability of BigQuery.
Please reach out to Google Cloud support and we will help you fix your issue with that problematic table. You may also want to try fixing it yourself using the ALTER TABLE DROP COLUMN statement that is currently in public preview [1]. This will drop the erroneous column (the data in that column only will be lost). The rest of the data will remain usable.
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_table_drop_column_statement
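For reference, assuming the problematic column is named json_col and the table is mydataset.mytable (both hypothetical names), the statement from [1] looks like:
ALTER TABLE mydataset.mytable DROP COLUMN json_col;
Only that column's data is dropped; the rest of the table remains usable.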
I ran into the same error message a few days ago and was surprised to read about this policy change that isn't backed up by a mitigation process. My attempt to follow Vlad Grachev's suggestion to drop this column did not succeed, as the console does not allow querying this table (same "Json is not enabled for current project." error).
My only remediation at this point is:
build a new table where the json column is switched to type string
create a pipeline that transforms the objects to strings
migrate the data through the pipeline to the new table
In BigQuery, JSON data can be stored in a column of type "RECORD". Are you referring to that by JSON column type?
BigQuery uses the RECORD (or STRUCT) type to represent nested structure. A column of RECORD type is in fact a large column containing multiple child columns. For more information, refer to the link below:
Json Data in BigQuery
If you are not referring to the RECORD data type, the JSON column type might have been a test feature that is not covered by the usual deprecation scheme.

RediSql (for redis): Get column names as well as data type?

I am using the excellent RediSql, a module for Redis, to get a powerful caching solution.
When sending a command to Redis that interacts with the SQLite db in the background, like this:
REDISQL.EXEC db "SELECT * FROM jobcache"
I get a result where the integer column carries a type tag, but the string column does not, and no column names are provided.
Is there a way to always get the column names and the declared data types? I need this to convert the results back to a more standard SQL result format.
Unfortunately, at the moment this is not possible with the EXEC command.
You can use the QUERY.INTO command (see the command reference).
QUERY.INTO adds the result of your query to a stream, including the column names and the values for each row. Then you can consume the stream in whichever way you prefer.
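For example, something along these lines, using the table from the question (the stream name results_stream is arbitrary, and XRANGE is just one way to read the stream back):
REDISQL.QUERY.INTO results_stream db "SELECT * FROM jobcache"
XRANGE results_stream - +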
When doing queries (reads) against RediSQL, it is good practice to use the .QUERY family of commands; this avoids useless replication of data if you are in a cluster setup.
Moreover, the .QUERY commands can also be used against replicas of the main Redis instance, while the .EXEC commands can be used only against the primary instance.

SQL UDF - Struct Diff

We have a table with 2 top-level columns of type 'struct' - one is a 'before' image, the other an 'after' image. The struct schemas are non-trivial - nested, with arrays to a variable depth. They are sent to us from replication, so the two schemas are always the same (the schemas can of course be updated at some point, but always together).
The objective is, given the two input structs, to return 2 struct 'diffs' of the before and after containing only the fields that have changed - essentially the delta produced by the replication source. We know something has changed, but not what, since we get the full before and after images. This raw data lands in BQ and is then processed from there, but we need to determine the more granular changes for higher-order BQ processing.
The table schema is very wide (thousands of leaf fields), and the data is populated fairly sparsely (so a lot of nulls will be present on both sides of the snapshot) - so it would need to be as performant as possible when executing over tens of millions of rows.
All things are nullable for maximum flexibility.
So change could look like:
null -> value
value -> null
valueA -> valueB
Arrays:
recursive use of above for arrays of structs, ordering could be relaxed if that makes it easier?
It might not be possible.
I've not attempted this yet as it seems really difficult, so I'm looking to the community boffins for some support. I feel the arrays could be the difficult part. There is probably an easier way, perhaps in Python, or even doing some JSON conversion and comparison using JSON tools? It feels like it would be a super cool feature to have built into BQ as well, so if I can get this to work, I will add a feature request for it.
I'd like to have a SQL UDF for reuse (we have SQL skills, not Python, although if it's easier in Python then that's OK), and now with the new feature of persistent SQL UDFs, this seems the right time to ask and to test the feature out!
def struct_diff(before Struct, after Struct)
returning (beforeChange, afterChange) - that type of signature, but open to suggestions?
It appears to be really difficult to get a piece of reusable code for this. Since there is currently no support for recursive SQL UDFs, you cannot use a recursive approach for the nested structs.
However, you might be able to write some specific SQL UDFs depending on your array and struct structures. You can use an approach like this one to compare the structs:
CREATE TEMP FUNCTION final_compare(s1 ANY TYPE, s2 ANY TYPE) AS (
STRUCT(s1 as prev, s2 as cur)
);
CREATE TEMP FUNCTION compare(s1 ANY TYPE, s2 ANY TYPE) AS (
STRUCT(final_compare(s1.structA, s2.structA))
);
You can use UNNEST to work with arrays, and the final SQL UDF would really depend on your data.
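As a minimal sketch of that idea (all names here are hypothetical), the UDF below pairs up two arrays position by position and keeps only the positions where the serialized elements differ:
CREATE TEMP FUNCTION compare_arrays(a1 ANY TYPE, a2 ANY TYPE) AS ((
  -- Walk the positions of the longer array; SAFE_OFFSET yields NULL past the end.
  SELECT ARRAY_AGG(STRUCT(a1[SAFE_OFFSET(i)] AS prev, a2[SAFE_OFFSET(i)] AS cur))
  FROM UNNEST(GENERATE_ARRAY(
         0,
         GREATEST(IFNULL(ARRAY_LENGTH(a1), 0), IFNULL(ARRAY_LENGTH(a2), 0)) - 1)) AS i
  -- Keep only positions whose serialized values differ.
  WHERE TO_JSON_STRING(a1[SAFE_OFFSET(i)]) != TO_JSON_STRING(a2[SAFE_OFFSET(i)])
));
Note this comparison is positional; if element ordering can be relaxed, you would need some other matching key instead of the array offset.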
As @rtenha suggested, Python could handle this problem a lot more easily.
Finally, I did some tests using a JavaScript UDF, and it gave basically the same result, if not worse than the SQL UDF.
The console allows a recursive definition of the function, however it fails during execution. Also, JavaScript UDFs don't allow the ANY TYPE data type in the signature, so you would have to spell out the whole STRUCT definition, or use a workaround like applying TO_JSON_STRING to your struct in order to pass it in as a string.
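A rough sketch of that TO_JSON_STRING workaround (top-level fields only; the column and table names are hypothetical):
CREATE TEMP FUNCTION js_diff(before_json STRING, after_json STRING)
RETURNS STRING
LANGUAGE js AS """
  // Compare only the top-level keys; nested objects are compared by their JSON text.
  var b = JSON.parse(before_json) || {};
  var a = JSON.parse(after_json) || {};
  var diff = {};
  Object.keys(a).forEach(function (k) {
    if (JSON.stringify(a[k]) !== JSON.stringify(b[k])) {
      diff[k] = {prev: b[k] === undefined ? null : b[k], cur: a[k]};
    }
  });
  // Keys present only in the "before" image are not reported in this sketch.
  return JSON.stringify(diff);
""";
SELECT js_diff(TO_JSON_STRING(before_struct), TO_JSON_STRING(after_struct)) AS delta
FROM my_table;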

Does Tableau support complex data type of hive columns and flatten it?

I am trying to create a dashboard from the data present in Hive. The catch is that the column I want to visualize is a nested JSON type. So will Tableau be able to parse and flatten the JSON column and list out all possible attributes? Thanks!
Unfortunately Tableau will not automatically flatten the JSON structure of the field for you, but you can manually do so.
Here is an article that explains the use of Regex within Tableau to extract pertinent information from your JSON field.
I realize this may not be the answer you were looking for, but hopefully it gets you started down the right path.
(In case it helps, Tableau does have a JSON connector in the event you are able to connect directly to your JSON as a datasource instead of embedded in your Hive connection as a complex field type.)

Difference between JSON and SQL

I'm a newbie at web development, so here's a simple question. I've been doing a few tutorials in Django, setting up an SQL database, which is all good. I have now come across the JSON format, which I am not fully understanding. The definition on Wikipedia is: It is used primarily to transmit data between a server and web application, as an alternative to XML. Does this mean that JSON is a database like SQL? If not, what is the difference between SQL and JSON?
Thank you!
JSON is a data format. You use it to define what the data is and what it means, e.g.: this car is blue, it has 4 seats.
{
"colour": "blue",
"seats": 4
}
SQL is a data manipulation language. You use it to define the operations you want to perform on the data, e.g.: find me all the green cars; change all the red cars to blue cars.
select * from cars where colour = 'green'
update cars set colour='blue' where colour='red'
A SQL database is a database that uses SQL to query the data stored within, in whatever format that might be. Other types of databases are available.
They are 2 completely different things.
SQL is used to communicate with databases, usually to create, read, update, and delete data entries.
JSON provides a standardized object notation/structure to talk to web services.
Why standardized?
Because JSON is relatively easy to process both on the front end (with JavaScript) and on the back end. With NoSQL databases becoming common, JSON/JSON-like documents are used inside the database as well.
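For instance, many relational databases can also store JSON documents in a column; a small sketch, assuming PostgreSQL and its jsonb type (the table and values are made up):
create table cars (id int primary key, attrs jsonb);
insert into cars values (1, '{"colour": "blue", "seats": 4}');
select attrs->>'colour' from cars where id = 1;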
Absolutely not. JSON is a data format used to pass data from a sender to a receiver, while SQL is the language used by relational databases to define data structures and query information from them. JSON itself is not tied to any particular way of storing or retrieving data.
JSON isn't a database, but there isn't anything stopping you from using JSON in a database. MongoDB is a database that uses JSON (it's actually BSON behind the scenes) to communicate with the database. If you enjoy using JSON and you understand it, I recommend looking into Mongo!