Cursor in Hive? - sql

As the title says, does Hive support cursors, or something like them? I have a large query that I'd like to fetch in chunks, but I can't seem to find a Hive friendly solution.
Thanks,
Kyle

As far as I know, there is no such thing in Hive. You can not really have pagination using Hive. Only thing you can do is to execute your hive query and get the ResultSet, then iterate through it.
The interesting part is how you want to handle the large results. You don't normally want to load all the results in memory, instead, you can stream back your query results. For example, if you are write the results into csv, instead of having a big object containing all the query results before you start writing to csv which potentially can use up your memory, you can handle them iteratively on the wire and write in chunk to your csv file.

Related

BigQuery ELT: Reading json lines into federated table as single string column

In this great post, we see a powerful technique to utilize the ELT paradigm of data transformation on newline delimited json files.
However, this post utilizes a hack in the crucial step that creates a
'schemaless' federated table: tell bigquery it's ingesting CSV data with an exotic character as the delimiter and hope the data never contains that delimiter.
I'd like to utilize this approach in a production system without the terrible hack that could lead to bugs (what if our data contains the delimiter?). Really, all we want to do here is tell bigquery to create a single-column federated table where the single column is a json-formatted string. Is there some better way of doing this?
I think the external table technique is a great way to separate your compute and storage.
It's already handling the decompression, so I don't see any great advantage in not asking the BigQuery engine to process the newline delimited JSON format at the same time.
So I would go for something like this (in Bash) and let it autodetect the fields:
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON "gs://your-bucket/your-folder/someprefix*.jsonl" > /tmp/schem.json
bq mk --external_table_definition /tmp/schem.json some_dataset.ext_tab
You end up with a table named ext_tab, with fieldnames taken from the JSON attributes, that you can query using SQL .. continuing the ELT paradigm.

how to view stats on snowflake?

I am looking for a way to visualize the stats of a table in Snowflake.
The long step is to pull a meaningful sample of the data with python and apply Pandas, but it is somewhat inefficient and unsafe to pull the data out of snowflake.
Snowflake's new interface shows these stats graphically and I would like to know if there is a way to obtain this data with query or by consulting metadata.
I need something like Pandas-profiling but without a external server. maybe snowflake store metadata/statistic about its colums. numeric, categoric
https://github.com/pandas-profiling/pandas-profiling
thank you for your advices.
You can find a lot meta information in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html
not sure if you're talking about viewing the information schema as mentioned, but if you need documentation on this whole new interface, it's called SnowSight
you can learn more there:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
cheers!
The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.

Does redshift store failed queries?

I wish to analyze the queries executed on certain redshift warehouse (not mine).
In order to do so I'm using a query with a join on stl_querytext and stl_query.
My question is how come I'm also getting illegal queries (I.E queries with wrong sql syntax)?
When I've tried it in my local redshift I haven't seen those. Also, couldn't find relevant documentation.
Is this a configuration issue? And in case I'm supposed to those queries is there a way to know those are illegal ones?
Thanks,
Nir.
So stl_querytext breaks long queries into parts identified by sequence number. I hope you are reconstructing the parts into the original query as a first step. This can be done with listagg() function as long as the resulting query doesn't over the max tex field (about 320 parts).
Now this is not enough to get valid SQL back in all cases because you need to treat combining the parts differently depending if the section of the query is inside or outside a text string in the query. (Is white space needed between parts or not)
I've done this exact process a bunch so this is doable. I don't have a perfect process on the whitespace question, I get close. Maybe others know the expression to get an exact recreation of the query from stl_querytext.

.Net Parsing Fixed Width Data... From a Concatenated, Single, Fixed-Width Column

I was bored and looking at old code that runs like molasses on a cold day. I found that a group of tables in our accounting system - each with 500,000 records of ~20 datapoints - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .net ETL project is grabbing all records, doing a bunch of substrings on each record to set an object's corresponding attributes, then sending the object to merge with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there was a way to use the TextFieldParser to parse a static string instead of a file/stream that might be a valid idea. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the SProc. http://www.dotnetperls.com/textfieldparser does show that TextFieldParser is quite a bit faster than split, which I would assume is tantamount to the string manipulation our project is currently doing with substring. So there may be something to that idea.
Or perhaps the whole, old project should be dumped for a shiny new SSIS project. Would it also have to write the records to a flat file before importing into SQL? Or can it import directly from the table?
Thank you in advance!

Is there a way to parser a SQL query to pull out the column names and table names?

I have 150+ SQL queries in separate text files that I need to analyze (just the actual SQL code, not the data results) in order to identify all column names and table names used. Preferably with the number of times each column and table makes an appearance. Writing a brand new SQL parsing program is trickier than is seems, with nested SELECT statements and the like.
There has to be a program, or code out there that does this (or something close to this), but I have not found it.
I actually ended up using a tool called
SQL Pretty Printer. You can purchase a desktop version, but I just used the free online application. Just copy the query into the text box, set the Output to "List DB Object" and click the Format SQL button.
It work great using around 150 different (and complex) SQL queries.
How about using the Execution Plan report in MS SQLServer? You can save this to an xml file which can then be parsed.
You may want to looking to something like this:
JSqlParser
which uses JavaCC to parse and return the query string as an object graph. I've never used it, so I can't vouch for its quality.
If you're application needs to do it, and has access to a database that has the tables etc, you could run something like:
SELECT TOP 0 * FROM MY_TABLE
Using ADO.NET. This would give you a DataTable instance for which you could query the columns and their attributes.
Please go with antlr... Write a grammar n follow the steps..which is given in antlr site..eventually you will get AST(abstract syntax tree). For the given query... we can traverse through this and bring all table ,column which is present in the query..
In DB2 you can append your query with something such as the following, but 1 is the minimum you can specify; it will throw an error if you try to specify 0:
FETCH FIRST 1 ROW ONLY