Is there any plan for Google BigQuery to implement native JSON support?
I am considering migrating Hive data (~20 TB) to Google BigQuery,
but the table definitions in Hive contain the map type, which is not supported in BigQuery.
For example, the HiveQL below:
select gid, payload['src'] from data_repository;
It can be worked around by using regular expressions, though.
As of 1 Oct 2012, BigQuery supports newline separated JSON for import and export.
Blog post: http://googledevelopers.blogspot.com/2012/10/got-big-json-bigquery-expands-data.html
Documentation on data formats: https://developers.google.com/bigquery/docs/import#dataformats
Your best bet is to coerce all of your types into CSV before importing, and if you have complex fields, decompose them via a regular expression in the query (as you suggested).
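For example (a minimal sketch, assuming the Hive map was flattened into a single string column named payload holding "key=value" pairs during the CSV export), one key can be pulled out at query time with REGEXP_EXTRACT:
-- payload is a hypothetical column holding the map rendered as "src=a,dst=b,..."
SELECT gid, REGEXP_EXTRACT(payload, 'src=([^,]+)') AS src
FROM [mydataset.data_repository];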
That said, we are actively investigating support for new input formats, and are interested in feedback as to which formats would be the most useful. The underlying query engine (Dremel) supports types similar to the Hive map type, but BigQuery does not currently expose a mechanism for ingesting nested records.
Related
My goal is to write a dbt macro that will allow me to flatten a table column with arbitrarily nested JSON content.
I have already found a wonderful tutorial for this for Snowflake; however, I would like to implement this for Databricks (Delta Lake) using SQL.
Ultimately, I am looking for the Databricks equivalent of the LATERAL FLATTEN function in Snowflake.
The following is an example of a source...
Source
The goal is ultimately to use SQL to transform it to the following target state:
Target
I have already looked at several projects, for example json-denormalize. However, I would like to implement this completely in SQL.
I have also seen the Databricks functions json_object_keys, lateral view, and explode, but I can't work out how exactly I should approach the problem with them.
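For context, the single-level pattern that lateral view and explode seem to provide in Databricks SQL looks roughly like this (the table and column names are made up), but it only unnests one level rather than arbitrarily nested JSON:
-- Hypothetical source: table raw with a JSON string column payload.
-- from_json parses the string into a map; explode emits one row per key/value pair,
-- which is the closest Spark SQL building block to Snowflake's LATERAL FLATTEN.
SELECT id, key, value
FROM raw
LATERAL VIEW explode(from_json(payload, 'map<string,string>')) kv AS key, value;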
Can someone steer me in the right direction?
In this great post, we see a powerful technique to utilize the ELT paradigm of data transformation on newline delimited json files.
However, this post relies on a hack in the crucial step that creates a 'schemaless' federated table: tell BigQuery it is ingesting CSV data with an exotic character as the delimiter and hope the data never contains that delimiter.
I'd like to use this approach in a production system without the terrible hack that could lead to bugs (what if our data contains the delimiter?). Really, all we want to do here is tell BigQuery to create a single-column federated table where the single column is a JSON-formatted string. Is there a better way of doing this?
I think the external table technique is a great way to separate your compute and storage.
It's already handling the decompression, so I don't see any great advantage in not asking the BigQuery engine to process the newline delimited JSON format at the same time.
So I would go for something like this (in Bash) and let it autodetect the fields:
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON "gs://your-bucket/your-folder/someprefix*.jsonl" > /tmp/schem.json
bq mk --external_table_definition /tmp/schem.json some_dataset.ext_tab
You end up with a table named ext_tab, with field names taken from the JSON attributes, that you can query using SQL, continuing the ELT paradigm.
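For example (the field names here are hypothetical; they are whatever attributes autodetect found in the JSON):
-- Query the external table like any other BigQuery table.
SELECT user_id, event_type, COUNT(*) AS events
FROM some_dataset.ext_tab
GROUP BY user_id, event_type
ORDER BY events DESC;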
If you have a table with a column whose type is SQL ARRAY, how do you find the base type of the array type, aka the type of the individual elements of the array type?
How do you do this in vendor-agnostic pure JDBC?
How do you do this without fetching and inspecting actual row data? Equivalently: what if the table is empty?
Similar questions were asked here:
How to get array base type in postgres via jdbc
JDBC : get the type of an array from the metadata
However, I am asking for a vendor-agnostic way through the JDBC API itself. I'm asking: How is one supposed to solve this problem with vendor-agnostic pure JDBC? This use case seems like a core use case of JDBC, and I'm really surprised that I can't find a solution in JDBC.
I've spent hours reading and re-reading the JDBC API javadocs, and several more hours scouring the internet, and I'm greatly surprised that there doesn't seem to be a proper way of doing this via the JDBC API. It should be right there via DatabaseMetaData or ResultSetMetaData, but it's apparently not.
Here are the insufficient workarounds and alternatives that I've found.
Fetch some rows until you get a row with an actual value for that column, get the column value, cast to java.sql.Array, and call getBaseType.
For postgres, assume that SQL ARRAY type names are encoded as ("_" + baseTypeName).
For Oracle, use Oracle specific extensions that allow getting the answer.
Some databases have a special "element_types" view which contains one row for each SQL ARRAY type that is used by current tables et al, and the row contains the base type and base type name.
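For example, on databases that ship the standard INFORMATION_SCHEMA (PostgreSQL is one), a query along these lines works, though it is plain SQL against catalog views rather than the JDBC API, and the schema/table names below are placeholders:
-- Look up the element type of every ARRAY column of one table.
SELECT c.column_name, c.data_type, e.data_type AS element_type
FROM information_schema.columns c
LEFT JOIN information_schema.element_types e
  ON c.table_catalog  = e.object_catalog
 AND c.table_schema   = e.object_schema
 AND c.table_name     = e.object_name
 AND e.object_type    = 'TABLE'
 AND c.dtd_identifier = e.collection_type_identifier
WHERE c.table_schema = 'public'
  AND c.table_name   = 'my_table'
ORDER BY c.ordinal_position;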
My context is that I would like to use vendor-supplied JDBC connectors with Spark in the cloud in my company's product, and metadata discovery becomes an important thing. I'm also investigating the feasibility of writing JDBC connectors myself for other data sources that don't have a JDBC driver or Spark connector yet. Metadata discovery is important so that one can define the Spark InternalRow and Spark-JDBC data getters correctly. Currently, Spark-JDBC has very limited support for SQL ARRAY and SQL STRUCT; I managed to provide the missing bits with a day or two of coding, but during that process I hit this problem, which is blocking me. If I have control over the JDBC driver implementation, then I could use a kludge (i.e. encode the type information in the type name, and in the Spark JdbcDialect, take the type name and decode it to create the Catalyst type). However, I want to do it in the proper JDBC way, and ideally in a way that other vendor-supplied JDBC drivers will support.
PS: It took me a surprising amount of time to locate DatabaseMetaData.getAttributes(). If I'm reading this right, this can give me the names and types of the fields/attributes of a SQL STRUCT. Again, I'm very surprised that I can get the names and types of the fields/attributes of a SQL STRUCT in vendor-agnostic pure JDBC but not get the base-type of a SQL ARRAY in vendor-agnostic pure JDBC.
I am looking for a way to visualize the stats of a table in Snowflake.
The long way is to pull a meaningful sample of the data with Python and apply Pandas, but it is somewhat inefficient and unsafe to pull the data out of Snowflake.
Snowflake's new interface shows these stats graphically, and I would like to know if there is a way to obtain this data with a query or by consulting metadata.
I need something like pandas-profiling but without an external server. Maybe Snowflake stores metadata/statistics about its columns (numeric, categorical).
https://github.com/pandas-profiling/pandas-profiling
Thank you for your advice.
You can find a lot of meta information in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html
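For example (the database, schema, table, and column names below are placeholders), you can read the column metadata and compute a basic profile entirely inside Snowflake, without pulling rows out:
-- Column names, types, and nullability for one table.
SELECT column_name, data_type, is_nullable
FROM my_db.INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'PUBLIC'
  AND table_name   = 'MY_TABLE'
ORDER BY ordinal_position;

-- Simple profile of one numeric column, computed in-warehouse.
SELECT COUNT(*)                       AS row_count,
       COUNT(DISTINCT my_numeric_col) AS distinct_values,
       MIN(my_numeric_col)            AS min_value,
       MAX(my_numeric_col)            AS max_value,
       AVG(my_numeric_col)            AS avg_value
FROM my_db.PUBLIC.MY_TABLE;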
Not sure if you're talking about viewing the information schema as mentioned, but if you need documentation on this whole new interface, it's called Snowsight.
You can learn more here:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
cheers!
The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.
I am trying to create a dashboard from the data present in Hive. The catch is that the column I want to visualize is a nested JSON type. So will Tableau be able to parse and flatten the JSON column and list out all possible attributes? Thanks!
Unfortunately Tableau will not automatically flatten the JSON structure of the field for you, but you can manually do so.
Here is an article that explains the use of Regex within Tableau to extract pertinent information from your JSON field.
I realize this may not be the answer you were looking for, but hopefully it gets you started down the right path.
(In case it helps, Tableau does have a JSON connector in the event you are able to connect directly to your JSON as a datasource instead of embedded in your Hive connection as a complex field type.)