Dealing with enormous datasets with Impala


I have a general question about Impala versus traditional SQL database systems. I've heard that Impala can take certain SQL statements quite literally and produce tables with billions of rows (as might happen with a join on duplicate rows). As a narrower example, suppose I run something like "SELECT * FROM database". As far as immediate console output is concerned, I understand that most traditional SQL databases stop once a limit of, say, 1000 rows is reached. Is the same true of Impala? In other words, if I run "SELECT * FROM database" in Impala, is it in theory doing more work, even though it will ultimately display a limited number of rows?

I think it depends on what you use to run the query. If you run it from the command line (in Bash or the Impala shell), it will fetch all the results; if you use Hue, it will page through the results as you describe. The same is true for any database: if you're using a GUI to access it, you can run something like an export-to-CSV command to get the full result set, and if you're accessing it programmatically, you would use fetchall().
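The difference between paging and fetching everything can be sketched with Python's DB-API cursor methods. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for an Impala connection, and the table name `events`, the helper names, and the page size of 1000 are hypothetical.

```python
import sqlite3

def build_demo_db(n_rows=5000):
    """Create an in-memory table standing in for a large table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)",
        ((f"row-{i}",) for i in range(n_rows)),
    )
    conn.commit()
    return conn

def first_page(conn, page_size=1000):
    """Fetch only the first page of results, the way a paging GUI would."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM events")
    return cur.fetchmany(page_size)

def everything(conn):
    """Fetch the complete result set into memory, like a shell or fetchall() caller."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM events")
    return cur.fetchall()

conn = build_demo_db()
print(len(first_page(conn)), len(everything(conn)))  # prints: 1000 5000
```

The query text is identical in both helpers; only the client decides how many rows it pulls from the cursor.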

Related

How I can find sql query for execution plan?

A program generates and sends queries to SQL Server (on a high-load production system). I want to get the plan for one specific query against one specific table. I start Profiler with the "Showplan XML" event and set filters on TextData (LIKE %MyTable%) and DatabaseName. It shows rows whose TextData contains XML describing the execution plans (for all queries against my table). But I know that there are 5 different SQL queries for this table.
How can I match a specific query with its corresponding plan without using statistics?

Is there a reason this has to be done on the production environment? Most really bad execution plans (missing indexes causing table scans etc.) will be obvious enough on a dev environment where you can use all the diagnostics you want.
Otherwise running the SQL on the query cache (as in the linked question someone else mentioned) will probably have the lowest impact as it just queries a system table rather than adding diagnostics to every query.

How does SSMS show partial results of a large resultset whilst the query is still running, and can equivalent behaviour be achieved in .NET?

In SQL Server Management Studio, when running a query that produces a very large resultset, it appears to sometimes display the results of the resultset as it's loading them, rather than them all appearing at once.
My normal assumption would be that SSMS is simply populating the grid(s) with the results of the finished query, and that the SQL query itself has finished.
However, if I run the following query:
SELECT 1
SELECT * FROM EnormousTable
INSERT INTO SomeOtherTable([Column1]) SELECT 'Test3'
That last INSERT does not occur until after the results from the larger resultset have been fully returned.
I have two main questions:
1. What is happening here? Is SSMS breaking down the query into separate batches even without GO statements? Please note that I'm not a DBA, so if there's some fundamental reason for this behaviour that 'any DBA would know', there's a good chance I don't know it.
2. Is there a way to attain similar functionality in .NET? What I mean by this is, when running a set of queries that will produce multiple resultsets, whether or not it's possible to have a DataSet get populated with the results of each successive query as it finishes (without waiting for all the queries to finish), without me having to manually break down the query (unless that's what SSMS is actually doing under the hood).

"Select data source" popping-up when executing a SQL writed query

I'm trying to create a fairly complex database in MS Access 2013, so I wanted to write it directly in SQL. The script has no errors, as other DBMSs can build the whole database from it (for example, phpMyAdmin imports it with no difficulty).
This tutorial shows how to write a SQL query in order to build tables. I thought this approach matched my goal well, as I could copy and paste my script into the query and run it to create the whole thing.
But when I try to open (double-click) the query, a pop-up appears saying "Select data source", waiting for me to select an ODBC source, either from a file or a host, before continuing and executing the query.
I tried other types of queries (creating only one table at a time, trying on a blank file, or even SELECT * FROM *), but this message keeps showing up and I really don't know how to deal with it, as I don't want to connect to anything but the database in the file itself.
Does anyone have a hint about what to do in this case?
Or, even better, how could Access import my SQL script in order to create the database?

You should configure the database connection in the ODBC and check whether the connection is established or not. Once the connection is established, you can run the query to fetch the data or create tables as per your requirement.

Azure Get Live Queries

I'm looking for a query to get the currently running queries in Azure SQL. None of the T-SQL I've found shows the running queries when I test it (for instance, by running a query in one window, then looking in another window at the running queries). Also, I'm not looking for anything related to time, CPU, etc., only the actual text of the running query.
When I run ...
SELECT * FROM Table --(takes 2 minutes to load)
... and run a standard information query (like from Pinal Dave or this), I don't see the above query (I assume there's another way).

select * from sys.dm_exec_requests should show you what other sessions are doing. You can join this with sys.dm_exec_sql_text to get the query text if needed. sys.dm_tran_locks shows the locks held or waited on. If this is a V12 server, you can also use DBCC INPUTBUFFER. Make sure the connection you are running under is dbo / server admin.

System.OutOfMemoryException when querying large SQL table

I've written a SQL query that looks like this:
SELECT * FROM MY_TABLE WHERE ID=123456789;
When I run it in the Query Analyzer in SQL Server Management Studio, the query never returns; instead, after about ten minutes, I get the following error: System.OutOfMemoryException
My server is Microsoft SQL Server (not sure what version).
SELECT * FROM MY_TABLE; -- return 44258086
SELECT * FROM MY_TABLE WHERE ID=123456789; -- return 5
The table has over forty million rows! However, I need to fetch five specific rows!
How can I work around this frustrating error?
Edit: The server suddenly started working fine for no discernable reason, but I'll leave this question open for anyone who wants to suggest troubleshooting steps for anyone else with this problem.

According to http://support.microsoft.com/kb/2874903:
This issue occurs because SSMS has insufficient memory to allocate for large results.
Note SSMS is a 32-bit process. Therefore, it is limited to 2 GB of memory.
The article suggests trying one of the following:
Output the results as text
Output the results to a file
Use sqlcmd
You may also want to check the server to see if it's in need of a service restart--perhaps it has gobbled up all the available memory?
Another suggestion would be to select a smaller subset of columns (if the table has many columns or includes large blob columns).

If you need specific data, use an appropriate WHERE clause. Add more details if you are stuck with this.
Alternatively, write a small application that reads the results through a cursor and does not try to load the whole result set into memory.
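A cursor-based reader of that kind might look like the following minimal sketch. It assumes Python and uses the standard-library sqlite3 module as a stand-in for a SQL Server driver; the helper `stream_rows`, the `VALUE` column, the sample data, and the batch size are hypothetical, while `MY_TABLE` and `ID` come from the question.

```python
import sqlite3

def stream_rows(conn, query, params=(), batch_size=500):
    """Yield rows in small batches so the full result set is never held in memory."""
    cur = conn.cursor()
    cur.execute(query, params)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield from batch

# Hypothetical demo data: 5000 rows, with each ID value repeated 5 times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_TABLE (ID INTEGER, VALUE TEXT)")
conn.executemany(
    "INSERT INTO MY_TABLE VALUES (?, ?)",
    ((i % 1000, f"v{i}") for i in range(5000)),
)

# A parameterized WHERE clause keeps the result small; the generator keeps memory flat.
matches = list(stream_rows(conn, "SELECT ID, VALUE FROM MY_TABLE WHERE ID = ?", (123,)))
print(len(matches))  # prints: 5
```

Because `stream_rows` is a generator, a caller can process each row and discard it, which is the point of the cursor approach when the result set is too large for the client.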
