Which is better: cached or non-cached queries in a C# Google BigQuery application? - google-bigquery

I have developed a C# application that reads data from Google BigQuery using the .NET client library.
Query:
SELECT SUM(Salary), COUNT(Employee_ID) FROM Employee_Details
If I run a non-cached query (JobConfig.UseCacheQuery = false) in the job configuration, I get the result in ~6 seconds.
If I run a cached query (JobConfig.UseCacheQuery = true), I get the same result in ~2 seconds.
Which is the better way to use Google BigQuery: cached or non-cached? (Cached query execution is faster than non-cached.)
Are there any drawbacks to cached queries? Kindly clarify.

If you run a BigQuery query twice in a row, the query cache will allow the second query invocation to simply return the same results that the first query already computed, without actually running the query again. You get your results faster, and you don't get charged for it.
The query cache is a simple way to prevent customers from overspending by repeating the same query, which sometimes happens in automated environments.
Query caching is on by default, and I would recommend leaving it enabled unless you have a specific reason to disable it. One reason you might disable caching is if you are doing performance testing and want to actually run the query to see how long it takes. But those scenarios are rare.
Read more here:
https://cloud.google.com/bigquery/querying-data#querycaching
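The mechanics of a result cache can be illustrated with a minimal, hypothetical sketch (this is not the BigQuery client; `CachingQueryRunner` and `slow_execute` are stand-ins for the service's cache and the actual query execution):

```python
class CachingQueryRunner:
    """Toy model of a result cache keyed by query text."""

    def __init__(self, execute):
        self._execute = execute      # the "real" (slow) query executor
        self._cache = {}             # query text -> cached result

    def run(self, sql, use_cache=True):
        if use_cache and sql in self._cache:
            return self._cache[sql]  # cache hit: no re-execution, no charge
        result = self._execute(sql)  # cache miss: actually run the query
        self._cache[sql] = result
        return result

calls = []

def slow_execute(sql):
    calls.append(sql)                # record each real execution
    return 42

runner = CachingQueryRunner(slow_execute)
runner.run("SELECT SUM(Salary) FROM Employee_Details")  # executes the query
runner.run("SELECT SUM(Salary) FROM Employee_Details")  # served from cache
print(len(calls))  # the query was actually executed only once
```

Passing `use_cache=False` on the second call would force a re-execution, which mirrors the ~6 s vs ~2 s difference described in the question.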

Related

Measuring the average elapsed time for SQL code running in Google BigQuery

Because BigQuery is a shared resource, the same code can return different timings on different runs. One option I always use is to turn off caching under Query Settings → Cache preference, so queries are not cached. The problem with this setting is that if you refresh the browser or leave it idle, that Cache preference box gets ticked again.
Anyhow, I had a discussion with some developers who optimize code. In a nutshell, they take slow-running code, run it 5 times, and take the average; then, after optimizing, they run the code another 5 times to get an average for the optimized SQL. The details are not clear to me. My preference, however, would be (all in the BQ console):
create a user session
turn off SQL caching
paste the slow-running code into the BQ console
paste the optimized code in the same session
run both statements (separated by ";")
This ensures that any systematic effects (BQ being busy/overloaded, a slow connection, etc.) affect both SQL snippets equally, so they cancel out. In my opinion, one only needs to run each version once, since caching is turned off as well. Running 5 times to get an average looks excessive and superfluous?
Appreciate any suggestions/feedback
Thanks
Measuring the time is one way; the other way to see whether a query has been optimized is to understand the query plan and whether slots are used effectively.
I've been working with BigQuery for more than 6 years, and I have never used the approach you describe. What actually matters in BigQuery is reducing cost, and that can be done by iteratively rewriting the query and by using partitioning, clustering, materialized views, caching, and temporary tables.
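The back-to-back comparison described in the question can be sketched generically: time the baseline and the optimized variant in the same run, so shared conditions (load, connection speed) hit both. The two lambdas below are placeholder workloads, not real BigQuery calls:

```python
import time

def time_once(fn):
    """Wall-clock a single execution of fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def compare(baseline, optimized):
    """Run both variants back to back so systematic effects cancel out."""
    t_base = time_once(baseline)
    t_opt = time_once(optimized)
    return t_base, t_opt

# Placeholder workloads: a loop-based sum vs the closed-form formula.
t_base, t_opt = compare(lambda: sum(range(200_000)),
                        lambda: (199_999 * 200_000) // 2)
print(f"baseline={t_base:.4f}s optimized={t_opt:.4f}s")
```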

Get ALL sql queries executed in the last several hours in Oracle

I need a way to collect all queries executed in an Oracle database (Oracle Database 11g) in the last several hours, regardless of how fast the queries are (this will be used to calculate SQL coverage after running tests against my application, which has several points where queries are executed).
I cannot use views like V$SQL, since there's no guarantee that a query will remain there long enough. It seems I could use DBA_HIST_SQLTEXT, but I didn't find a way to filter out queries executed before the current test run.
So my question is: which table could I use to get absolutely all queries executed in a given period of time (up to 2 hours), and what DB configuration should I adjust to reach my goal?
"I need the query itself since I need to learn which queries out of all queries coded in the application are executed during the test"
The simplest way of capturing all the queries executed would be to enable Fine-Grained Auditing (FGA) on all your database tables. You could use the data dictionary to generate policies for every application table.
Note that even when writing to an OS file, such a number of policies would have a high impact on the database and will increase how long the tests take to run. You should therefore enable these policies only to assess your test coverage, and disable them for other test runs.
Find out more.
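Generating the policies from a table list can be sketched as below. The schema and table names are illustrative; in practice you would drive the list from the data dictionary (e.g. a query on DBA_TABLES), and `DBMS_FGA.ADD_POLICY` takes further parameters (audit_condition, audit_trail, etc.) not shown here:

```python
def fga_policy_sql(owner, tables):
    """Generate one DBMS_FGA.ADD_POLICY call per table (sketch only)."""
    stmts = []
    for t in tables:
        stmts.append(
            "BEGIN DBMS_FGA.ADD_POLICY("
            f"object_schema => '{owner}', "
            f"object_name => '{t}', "
            f"policy_name => 'COV_{t}', "
            "statement_types => 'SELECT,INSERT,UPDATE,DELETE'); END;"
        )
    return stmts

# Hypothetical application tables:
for stmt in fga_policy_sql("APP", ["EMPLOYEES", "DEPARTMENTS"]):
    print(stmt)
```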
In the end I went with what Justin Cave suggested and instrumented the Oracle JDBC driver to collect every executed query in a Set, then dumped them all into a file after running the tests.
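The instrumentation idea transposes to any driver: wrap the cursor, record each statement before delegating, and dump the set afterwards. A runnable sketch using Python's sqlite3 in place of the JDBC driver:

```python
import sqlite3

class RecordingCursor:
    """Cursor wrapper that records every executed statement into a shared
    set -- a sketch of instrumenting the driver for SQL coverage."""

    def __init__(self, cursor, log):
        self._cursor = cursor
        self._log = log                      # shared set of executed SQL texts

    def execute(self, sql, params=()):
        self._log.add(sql)                   # record before delegating
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):
        return getattr(self._cursor, name)   # delegate everything else

executed = set()
conn = sqlite3.connect(":memory:")
cur = RecordingCursor(conn.cursor(), executed)
cur.execute("CREATE TABLE t (x INTEGER)")
cur.execute("INSERT INTO t VALUES (?)", (1,))
cur.execute("SELECT x FROM t")
print(sorted(executed))  # in the real setup, write this set to a file
```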

Why does my SELECT query take so much longer to run on the web server than on the database itself?

I'm running the following setup:
Physical Server
Windows 2003 Standard Edition R2 SP2
IIS 6
ColdFusion 8
JDBC connection to iSeries AS400 using JT400 driver
I am running a simple SQL query against a file in the database:
SELECT
column1,
column2,
column3,
....
FROM LIB/MYFILE
No conditions.
The file has 81 columns - alphanumeric and numeric - and about 16,000 records.
When I run the query in the emulator using the STRSQL command, the query comes back immediately.
When I run the query on my Web Server, it takes about 30 seconds.
Why is this happening, and is there any way to reduce this time?
While I cannot address whatever overhead might be involved in your web server, I can say there are several other factors to consider:
This most likely has to do with the differences between the way the two system interfaces work.
Your interactive STRSQL session will start displaying results as quickly as it receives the first few pages of data. You are able to page down through that initial data, but generally at some point you will see a status message at the bottom of the screen indicating that it is now getting more data.
I assume your web server is waiting until it receives the entire result set. It wants to get all the data as it is building the HTML page, before it sends the page. Thus you will naturally wait longer.
If this is not how your web server application works, then it is likely to be a JT400 JDBC Properties issue.
If you have overridden any default settings, make sure that those are appropriate.
In some situations the OPTIMIZATION_GOAL settings might be a factor. But if you are reading the table (aka physical file or PF) directly, in its physical sequence, without any index or key, then that might not apply here.
Your interactive STRSQL session will default to a setting of *FIRSTIO, meaning that the query is optimized for returning the first pages of data quickly, which corresponds to the way it works.
Your JDBC connection will default to a "query optimize goal" of "0", which will translate to an OPTIMIZATION_GOAL setting of *ALLIO, unless you are using extended dynamic packages. *ALLIO means the optimizer will try to minimize the time needed to return the entire result set, not just the first pages.
Or, perhaps first try simply adding FOR READ ONLY onto the end of your SELECT statement.
Update: a more advanced solution
You may be able to bypass the delay caused by waiting for the entire result set as part of constructing the web page to be sent.
Send a web page out to the browser with no records, or only a limited number, and use AJAX code to load the remainder of the data behind the scenes.
Use large block fetches whenever feasible, to grab plenty of rows in one clip.
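Block fetching can be sketched generically with any driver's `fetchmany`; here sqlite3 stands in for the JT400 driver (which exposes its own block-size tuning through JDBC connection properties). The table and sizes are illustrative:

```python
import sqlite3

# Build a small stand-in table to fetch from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myfile (id INTEGER)")
conn.executemany("INSERT INTO myfile VALUES (?)",
                 [(i,) for i in range(16000)])

cur = conn.execute("SELECT id FROM myfile")
total = 0
while True:
    block = cur.fetchmany(2000)   # grab rows in large blocks, not one by one
    if not block:
        break
    total += len(block)
print(total)
```

Fewer round trips per row is the whole point: with a 2000-row block, the 16,000-row result set is pulled in 8 fetches instead of 16,000.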
One thing you need to remember: the i saves the access paths it creates in the job in case they are needed again. That means if you log out, log back in, and run your query, it should take longer the first time; the second time you run the query it'll be faster. When running queries in a web application, you may or may not be reusing a job, meaning the access paths may have to be rebuilt.
If speed is important, I would:
Look into optimizing the query. I know there are better sources, but I can't find them right now.
Create a stored procedure. A stored procedure saves the access paths created.
With only 16000 rows and no WHERE or ORDER BY this thing should scream. Break the problem down to help diagnose where the bottleneck is. Go back to the IBM i, run your query in the SQL command line and then use the B, BOT or BOTTOM command to tell the database to show the last row. THAT will force the database to cough up the entire 16k result set, and give you a better idea of the raw performance on the IBM side. If that's poor, have the IBM administrators run Navigator and monitor the performance for you. It might be something unexpected, like the 'table' is really a view and the columns you are selecting might be user defined functions.
If the performance on the IBM side is OK, then look to what Cold Fusion is doing with the result set. Not being a CF programmer, I'm no help there. But generally, when I am tasked with solving multi-platform performance issues, the client side tends to consume the entire result set and then use program logic to choose what rows to display/work with. The server is MUCH faster than the client, and given the right hints, the database optimiser can make some very good decisions about how to get at those rows.

Django PostgreSQL Database [Data Race?]

I have a Django application hosted on Heroku using the PostgreSQL database addon. Upon performing a GET request for the front page, my applications performs a SQL query to extract some necessary display information. I also create a subprocess with Popen on each GET request.
However, when the number of GET requests increases to around one per second, I start getting errors at the statement model.objects.get(id="----"). I get an OperationalError; I'm assuming that either my free plan on Heroku isn't keeping up or my database isn't keeping up.
In this case, I don't want to leave Heroku's free plan, but if I did, would I need to create more workers? Upgrade my database? What are ways to diagnose the issue? And why would a simple SQL query cause issues as the request rate increases to around once per second? Does this seem reasonable?
My hack solution was just to sleep the view whenever I catch an OperationalError. Any other approaches recommended?
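One alternative to sleeping the whole view is a bounded retry with exponential backoff around the failing call. A generic sketch (any exception type can stand in for Django's OperationalError; the `flaky` function simulates a database that recovers):

```python
import time

def retry(fn, attempts=3, base_delay=0.01, exc=Exception):
    """Call fn, retrying with exponential backoff on the given exception."""
    for i in range(attempts):
        try:
            return fn()
        except exc:
            if i == attempts - 1:
                raise                        # out of retries: propagate
            time.sleep(base_delay * (2 ** i))  # back off before retrying

# Simulate a call that fails twice before succeeding.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("database temporarily unavailable")
    return "ok"

result = retry(flaky, attempts=5, exc=RuntimeError)
print(result, state["calls"])  # succeeds on the third call
```

Unlike a flat sleep, the backoff only delays requests that actually hit the error, and gives the database progressively more time to recover.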

Postgres Paginating a FTS Query

What is the best way to paginate an FTS query? LIMIT and OFFSET spring to mind. However, I am concerned that by using LIMIT and OFFSET I'd be running the same query over and over (i.e., once for page 1, again for page 2, etc.).
Will PostgreSQL be smart enough to transparently cache the query result, and thus satisfy subsequent pagination queries from a cache? If not, how do I paginate efficiently?
edit
The database is for single-user desktop analytics, but I still want to know what the best way would be if this were a live OLTP application. I have addressed the problem in the past with SQL Server by creating an ordered set of document IDs and caching the query parameters against the IDs in a separate table, clearing the cache every few hours (to allow new documents to enter the result set).
Perhaps this approach is viable for Postgres, but I still want to know the mechanics present in the database and how best to leverage them. If I were a DB developer, I'd make the query-response cache work with the FTS system.
A server-side SQL cursor can be effectively used for this if a client session can be tied to a specific db connection that stays open during the entire session. This is because cursors cannot be shared between different connections. But if it's a desktop app with a unique connection per running instance, that's fine.
The doc for DECLARE CURSOR explains how the result set is going to be materialized when the cursor is declared WITH HOLD in a committed transaction.
Locking shouldn't be a concern at all. Should the data be modified while the cursor is already materialized, it would neither affect the reader nor block the writer.
Other than that, there is no implicit query cache in PostgreSQL. The LIMIT/OFFSET technique implies a new execution of the query for each page, which may be as slow as the initial query depending on the complexity of the execution plan and the effectiveness of the buffer cache and disk cache.
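Since there is no implicit query cache, one common way to keep repeated page queries cheap is keyset pagination: remember the last key of the previous page and filter on it, instead of an ever-growing OFFSET. A runnable sketch (sqlite3 used here for self-containment; the SQL shape carries over to PostgreSQL, and the `docs` table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(i, f"doc {i}") for i in range(1, 101)])

def page(after_id, size=20):
    """Fetch the next page after the given key instead of using OFFSET."""
    return conn.execute(
        "SELECT id, title FROM docs WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size)).fetchall()

first = page(after_id=0)
second = page(after_id=first[-1][0])  # continue from the last seen id
print(first[0][0], second[0][0])      # 1 21
```

With an index on the key, each page costs roughly the same regardless of how deep the user has paged, whereas OFFSET must skip over all earlier rows on every request.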
Well, to be honest, what you may want is for your query to return a live cursor that you can then reuse to fetch portions of the results it represents. I don't know whether Postgres supports this; MongoDB does, and I've tried going down that road, but it's not great. For example: do you know how much time will pass between when a query is run and when a second page of its results is demanded? Can the cursor stay open for that amount of time? And if it can, what does that mean exactly; will it hold resources, such that if you have many lazy users who start queries but take a long time to navigate through pages, your server might be bogged down by open cursors?
Honestly, I think re-running a paginated query each time someone asks for a certain page is OK. First, you'll be returning a small number of entries (no need to display more than 10-20 at a time), so it's going to be pretty fast; and second, you should tune your server so that it executes frequent requests fast (add indexes, put it behind a Solr server if necessary, etc.) rather than running those queries slowly and caching the results.
Finally, if you really want to speed up full text searches, and have fancy indexes like case insensitive, prefix and suffix enabled, etc, you should take a look at Lucene or better yet Solr (which is Lucene on steroids) as an in-between search and indexing solution between your users and your persistence tier.