In this great post, we see a powerful technique for applying the ELT paradigm of data transformation to newline-delimited JSON files.
However, that post relies on a hack in the crucial step that creates a
'schemaless' federated table: tell BigQuery it's ingesting CSV data with an exotic character as the delimiter and hope the data never contains that delimiter.
I'd like to use this approach in a production system without the terrible hack that could lead to bugs (what if our data contains the delimiter?). Really, all we want to do here is tell BigQuery to create a single-column federated table where the single column is a JSON-formatted string. Is there a better way of doing this?
I think the external table technique is a great way to separate your compute and storage.
It's already handling the decompression, so I don't see any great advantage in not asking the BigQuery engine to process the newline delimited JSON format at the same time.
So I would go for something like this (in Bash) and let it autodetect the fields:
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON "gs://your-bucket/your-folder/someprefix*.jsonl" > /tmp/schem.json
bq mk --external_table_definition /tmp/schem.json some_dataset.ext_tab
You end up with a table named ext_tab, with field names taken from the JSON attributes, that you can query using SQL, continuing the ELT paradigm.
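For example (a sketch; the attribute names are assumptions, since the real field names come from whatever autodetect finds in your JSONL records):

-- Hypothetical query against the external table; event_type and user_id
-- are assumed attributes of the underlying JSONL records.
SELECT event_type, COUNT(DISTINCT user_id) AS users
FROM some_dataset.ext_tab
GROUP BY event_type
ORDER BY users DESC;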
I am looking for a way to visualize the stats of a table in Snowflake.
The slow step is pulling a meaningful sample of the data with Python and profiling it with Pandas, but it is somewhat inefficient and unsafe to pull the data out of Snowflake.
Snowflake's new interface shows these stats graphically, and I would like to know if there is a way to obtain this data with a query or by consulting metadata.
I need something like pandas-profiling but without an external server. Maybe Snowflake stores metadata/statistics about its columns (numeric, categorical)?
https://github.com/pandas-profiling/pandas-profiling
Thank you for your advice.
You can find a lot of meta information in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html
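For example (a sketch; the database, schema, and table names are placeholders), you can list row counts and column types without pulling any table data out of Snowflake:

-- Per-table size metadata
SELECT table_name, row_count, bytes
FROM my_database.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PUBLIC';

-- Per-column type metadata for one table
SELECT column_name, data_type, is_nullable
FROM my_database.INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'PUBLIC'
  AND table_name = 'MY_TABLE'
ORDER BY ordinal_position;

Note that this is metadata (names, types, row counts), not the value distributions that pandas-profiling computes.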
Not sure if you're talking about viewing the information schema as mentioned, but if you need documentation on this whole new interface, it's called Snowsight.
You can learn more here:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
Cheers!
The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
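In other words (a sketch; the table name is a placeholder), the panel is summarizing the rows returned by something like:

-- The result-pane "stats" describe the rows this query returns
-- (one row per column of the table), not the data stored in the table.
DESCRIBE TABLE my_db.my_schema.my_table;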
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.
I have only found answers about how to import CSV files into a database, for example as a blob or as a 1:1 representation of the table you are importing into.
What I need is a little different: My team and I are tracking everything we do in a database. A lot of these tasks produce logfiles, benchmark results, etc., which are stored in CSV format. The number of columns is far from consistent, and the data can be completely different from file to file; e.g., it could be a log from Fraps with frametimes in it, a log of CPU temperatures over a period of time, or something completely different.
Long story short, I came up with an idea, but - being far from a sql pro - I am not sure if it makes sense or if there is a more elegant solution.
Does this make sense to you:
We also need to deal with a lot of data, so please also give me your opinion on whether this is feasible with around 200 files per day, each of which can easily have a couple of thousand rows.
The purpose of all this is to generate reports from the stored data and perform analysis on it, e.g. view it in a graph on a webpage or do calculations with it.
I'm limited to MS-SQL in this case, because that's what the current (quite complex) database is and I'm just adding a new schema with that functionality to it.
Currently we just archive the files on a RAID and store a link to them in the database, so everyone who wants to do magic with the data needs to download every file they need and then use R or Excel to create a visualization.
Have you considered a column of the XML data type for the file data, as an alternative to the ColumnId -> Data structure? SQL Server provides a special dedicated XML index (over the entire XML structure), so your data can be fully indexed no matter what CSV columns you have. You will have far fewer records in the database to handle (as an entire CSV file will be a single XML field value). There are good XML query options to search by values and attributes of the XML type.
For that you will need to translate CSV to XML, but you will have to parse it either way ...
Not that your plan won't work, I am just giving an idea :)
=========================================================
Update with some online info:
An article from Simple Talk: The XML Methods in SQL Server
Microsoft documentation for nodes() with various use case samples: nodes() Method (xml Data Type)
Microsoft documentation for value() with various use case samples: value() Method (xml Data Type)
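A minimal sketch of that idea (the table, index, and element/attribute names are all assumptions, and the CSV-to-XML translation is assumed to produce one row element per CSV line):

-- One row per imported CSV file; the whole file lives in a single XML value.
CREATE TABLE dbo.ImportedFiles (
    FileId     INT IDENTITY(1,1) PRIMARY KEY,
    FileName   NVARCHAR(260) NOT NULL,
    ImportedAt DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(),
    FileData   XML           NOT NULL
);

-- Dedicated XML index over the entire XML structure.
CREATE PRIMARY XML INDEX PXML_ImportedFiles_FileData
    ON dbo.ImportedFiles (FileData);

-- Example query: pull frametime values out of a Fraps-style log stored as
-- <rows><row frametime="16.7" ... /> ... </rows>.
SELECT f.FileId,
       r.value('@frametime', 'decimal(10,3)') AS frametime
FROM dbo.ImportedFiles AS f
CROSS APPLY f.FileData.nodes('/rows/row') AS t(r)
WHERE f.FileName LIKE '%fraps%';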
Is there any plan for Google BigQuery to implement native JSON support?
I am considering migrating Hive data (~20 TB) to Google BigQuery,
but the table definitions in Hive contain a map type, which is not supported in BigQuery.
For example, the HiveQL below:
select gid, payload['src'] from data_repository;
It can be worked around by using a regular expression, though.
As of 1 Oct 2012, BigQuery supports newline separated JSON for import and export.
Blog post: http://googledevelopers.blogspot.com/2012/10/got-big-json-bigquery-expands-data.html
Documentation on data formats: https://developers.google.com/bigquery/docs/import#dataformats
Your best bet is to coerce all of your types into CSV before importing, and if you have complex fields, decompose them via a regular expression in the query (as you suggested).
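For example (a sketch; it assumes the Hive map was flattened into a plain string such as src=foo;dst=bar when the data was exported to CSV, so the column name and pattern are illustrative only):

-- Pull the 'src' entry back out of the flattened map string.
SELECT gid,
       REGEXP_EXTRACT(payload, 'src=([^;]+)') AS src
FROM my_dataset.data_repository;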
That said, we are actively investigating support for new input formats, and are interested in feedback on which formats would be most useful. There is support in the underlying query engine (Dremel) for types similar to the Hive map type, but BigQuery does not currently expose a mechanism for ingesting nested records.
As the title says, does Hive support cursors, or something like them? I have a large query that I'd like to fetch in chunks, but I can't seem to find a Hive friendly solution.
Thanks,
Kyle
As far as I know, there is no such thing in Hive. You cannot really have pagination using Hive. The only thing you can do is execute your Hive query, get the ResultSet, and then iterate through it.
The interesting part is how you want to handle the large results. You don't normally want to load all the results into memory; instead, you can stream back your query results. For example, if you are writing the results to CSV, instead of building one big object containing all the query results before you start writing (which can potentially use up your memory), you can handle them iteratively on the wire and write to your CSV file in chunks.
I have an XML feed of a resume. Each part of the resume is broken down into its constituent parts. For example <employment_history>, <education>, <skills>.
I am aware that I could save each section of the XML file into a database. For example, columnID = employment_history | education | skills, and then conduct a free-text search just on those individual columns. However, I would prefer not to do this because it would duplicate data that is already contained within the XML file and may put extra strain on indexing.
Therefore I wondered if it is possible to conduct a free text search of an XML file within the <employment_history></employment_history> using SQL Server.
If so an example would be appreciated.
Are you aware that SQL Server supports columns with the data type of "XML"? These can contain an entire XML document.
You can also index these columns and you can use XQuery to perform query and data manipulation tasks on those columns.
See Designing and Implementing Semistructured Storage (Database Engine)
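For example (a sketch; the table, column, and element names are assumptions about how the resume XML is stored), a simple substring match scoped to the employment history section:

-- Find resumes whose <employment_history> section mentions "project manager".
SELECT CandidateId
FROM dbo.Resumes
WHERE ResumeXml.exist(
          '/resume/employment_history[contains(string(.), "project manager")]'
      ) = 1;

This is a plain substring match rather than linguistic full-text search; for stemming and ranking you would still need a full-text index on the column.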
Querying XML by doing string searching with SQL is probably going to run into a lot of trouble.
Instead, I would parse it in whatever language you're using to interact with your database and use XPath (most languages/environments have some kind of built-in or popular third-party library) to query it.
I think you can create a function (UDF) that takes the XML text as a parameter, fetches the data inside the tag, and then applies the filter you want.
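A minimal sketch of that idea (the function, table, and element names are all assumptions):

-- Return the text of the first resume section whose element name matches @section.
CREATE FUNCTION dbo.GetResumeSection
(
    @resume  XML,
    @section NVARCHAR(128)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    RETURN @resume.value(
        '(/resume/*[local-name() = sql:variable("@section")])[1]',
        'NVARCHAR(MAX)');
END;

You could then filter with something like WHERE dbo.GetResumeSection(ResumeXml, 'employment_history') LIKE '%project manager%', though a scalar UDF in the WHERE clause will not make use of an index.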