Django PostgreSQL Database [Data Race?]

I have a Django application hosted on Heroku using the PostgreSQL database add-on. On each GET request for the front page, my application performs a SQL query to extract some necessary display information. I also create a subprocess with Popen on each GET request.
However, when the rate of GET requests rises to around one per second, I begin getting errors at the statement model.objects.get(id="----"). I get an OperationalError; I'm assuming that either my free plan on Heroku isn't keeping up or my database isn't keeping up.
In this case, I don't want to leave Heroku's free plan, but if I did, would I need to create more workers? Upgrade my database? What are ways to diagnose the issue? And why would a simple SQL query cause issues once requests arrive about once every second? Does this seem reasonable?
My hack solution was just to sleep the view whenever I catch an OperationalError. Are there any other recommended approaches?
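
For reference, the sleep-on-error workaround can be written as a bounded retry with exponential backoff rather than a bare sleep. This is only a sketch: call_with_retry and TransientDbError are hypothetical names, and TransientDbError stands in for django.db.OperationalError so the snippet runs without Django. It does not diagnose why the error occurs.

```python
import random
import time


class TransientDbError(Exception):
    """Hypothetical stand-in for django.db.OperationalError."""


def call_with_retry(func, retry_on=TransientDbError, attempts=4,
                    base_delay=0.1, sleep=time.sleep):
    """Call func(); on retry_on, wait with exponential backoff and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Waits of about 0.1s, 0.2s, 0.4s, ... plus some jitter so that
            # concurrent requests do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

In a view this might be used as call_with_retry(lambda: model.objects.get(id=pk), retry_on=OperationalError), with OperationalError imported from django.db. Capping the number of attempts keeps a persistent failure from blocking the worker indefinitely.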

Related

Bigquery Cache Not Working

I noticed that BigQuery no longer caches the same query even though I have chosen to use the cache in the GUI (both Alpha and Classic). I didn't edit the query at all; I just kept clicking the run query button, and every time the GUI executed the query without using cached results.
It happens with my PHP script as well. Before, it was able to use the cache and came back with results very quickly; now it executes the query every time, even if the same query was executed minutes ago. I can confirm the behaviour in the logs.
I am wondering whether anything has changed in the last few weeks. Or is there some kind of account-level setting that controls this? It was working fine for me before.

As per the official docs, the cache is disabled when:
...any of the tables referenced by the query have recently received
streaming inserts...
Even if you are streaming to one partition and querying another, this will invalidate caching for the whole table. There is an open feature request asking for cache hits to be possible when streaming inserts go to one partition but the query reads a different partition.
EDIT:
After some investigation I found out that some months ago there was an issue that allowed cache hits even while streaming inserts were being made. This was not the expected behavior, and it was fixed in May. I guess this is the change you have experienced and what you are talking about.
The docs have not changed in this respect, and they aren't (and weren't) incorrect. It was the previous behavior that was incorrect.

Which is better for Google BigQuery in a C# application: cached or non-cached queries?

I have developed a C# application that reads data from Google BigQuery using the .NET client library.
Query:
Select SUM(Salary), Count(Employee_ID) From Employee_Details
If I use a non-cached query (JobConfig.UseCacheQuery=false) in the job configuration, I get the result in ~6 seconds.
If I use a cached query (JobConfig.UseCacheQuery=true) in the job configuration, I get the same result in ~2 seconds.
Which is the better way to use Google BigQuery, cached or non-cached? (Cached query execution is faster than non-cached.)
Are there any drawbacks to cached queries? Kindly clarify.

If you run a BigQuery query twice in a row, the query cache will allow the second query invocation to simply return the same results that the first query already computed, without actually running the query again. You get your results faster, and you don't get charged for it.
The query cache is a simple way to prevent customers from overspending by repeating the same query, which sometimes happens in automated environments.
Query caching is on by default, and I would recommend leaving it enabled unless you have a specific reason to disable it. One reason you might disable caching is if you are doing performance testing and want to actually run the query to see how long it takes. But those scenarios are rare.
Read more here:
https://cloud.google.com/bigquery/querying-data#querycaching

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?

Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try to 'cache' it in some way. (Apart from the fact that CSV might not be the best way to do this.)
There is nothing stopping you from making these requests from the web server to the database server. With as little data as this, performance will not be an issue, and even if it becomes one as everything grows, there is a lot to be gained on the database side (indexes etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment, so again, a direct query would be best.
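
As a rough illustration of the direct-query approach, the sketch below runs one parameterized query per user request. It uses an in-memory sqlite3 database as a stand-in for the warehouse; the table name prices, its column names, the product names, and the sample rows are all made up for the example.

```python
import sqlite3


def fetch_prices(conn, start_date, end_date, product_names):
    """Return (date, product, price) rows for the given range and products."""
    if not product_names:
        return []
    # One "?" placeholder per product; the values themselves are always
    # bound by the driver, never formatted into the SQL string.
    placeholders = ", ".join("?" for _ in product_names)
    sql = (
        "SELECT price_date, product_name, price FROM prices "
        "WHERE price_date BETWEEN ? AND ? "
        f"AND product_name IN ({placeholders}) "
        "ORDER BY price_date, product_name"
    )
    return conn.execute(sql, [start_date, end_date, *product_names]).fetchall()


# Demo against an in-memory database with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (price_date TEXT, product_name VARCHAR(25), price REAL)"
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("2024-01-02", "Alpha", 10.0),
        ("2024-01-02", "Beta", 20.0),
        ("2024-01-03", "Alpha", 10.5),
        ("2024-01-03", "Beta", 19.5),
        ("2024-01-04", "Alpha", 11.0),
    ],
)
rows = fetch_prices(conn, "2024-01-02", "2024-01-03", ["Alpha"])
```

Because the dates and product names are passed as bound parameters, user input is treated as data rather than as SQL text.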
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Some examples, and why you need not worry about them:
The connection with the database server is down.
This is an issue for both methods. With only one connection per day, a 1-in-10,000 chance of failure might seem to favor the once-a-day method. But these failures should not come up very often, and when they do, you should be able to handle them (retry the request, show a message to the user). This is what enormous numbers of websites do, so trust me when I say that this will not be an issue. Also, think of what it would mean if your daily update failed. That would be a bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to solve at a different level: use a caching system (not CSV) on the database server, use a caching system on the web server, or fix your indexes.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
your data warehouse is the master database: it sends all changes to the slave but is otherwise inaccessible
your second database (even on your web server) gets all updates from the master and is read-only; you can only query it for data
your web server cannot connect to the data warehouse, but can connect to the slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now there is no single moment at which you update the queried database (the master-slave replication keeps it up to date continuously), and no chance that queries from the web server put your warehouse in danger. Profit!

I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only form, just ensure that the field accepts nothing but a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file containing just the connection function should do fine in most cases, so that a user can't, say, open your web page in an HTML viewer and see your database connection string.
As for the CSV, I would say that querying the database per user request, especially when the feature is only used ~10 times a week, would be much more efficient than the CSV. I see the CSV as overkill: with only ~10 requests for this information, exporting an updated CSV every day is too much work for so little payoff.
EDIT:
Also if an attack is a big concern, which that really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is a possibility that even with the best security an attack could happen. That mainly just depends on if the attackers want the information you have.

Why does my SELECT query take so much longer to run on the web server than on the database itself?

I'm running the following setup:
Physical Server
Windows 2003 Standard Edition R2 SP2
IIS 6
ColdFusion 8
JDBC connection to iSeries AS400 using JT400 driver
I am running a simple SQL query against a file in the database:
SELECT
column1,
column2,
column3,
....
FROM LIB/MYFILE
No conditions.
The file has 81 columns (alphanumeric and numeric) and about 16,000 records.
When I run the query in the emulator using the STRSQL command, the query comes back immediately.
When I run the query on my Web Server, it takes about 30 seconds.
Why is this happening, and is there any way to reduce this time?

While I cannot address whatever overhead might be involved in your web server, I can say there are several other factors to consider:
This likely has to do primarily with the differences between the way the two system interfaces work.
Your interactive STRSQL session will start displaying results as quickly as it receives the first few pages of data. You are able to page down through that initial data, but generally at some point you will see a status message at the bottom of the screen indicating that it is now getting more data.
I assume your web server is waiting until it receives the entire result set. It wants to get all the data as it is building the HTML page, before it sends the page. Thus you will naturally wait longer.
If this is not how your web server application works, then it is likely to be a JT400 JDBC Properties issue.
If you have overridden any default settings, make sure that those are appropriate.
In some situations the OPTIMIZATION_GOAL settings might be a factor. But if you are reading the table (aka physical file or PF) directly, in its physical sequence, without any index or key, then that might not apply here.
Your interactive STRSQL session will default to a setting of *FIRSTIO, meaning that the query is optimized for returning the first pages of data quickly, which corresponds to the way it works.
Your JDBC connection will default to a "query optimize goal" of "0", which will translate to an OPTIMIZATION_GOAL setting of *ALLIO, unless you are using extended dynamic packages. *ALLIO means the optimizer will try to minimize the time needed to return the entire result set, not just the first pages.
Or, perhaps first try simply adding FOR READ ONLY onto the end of your SELECT statement.
Update: a more advanced solution
You may be able to bypass the delay caused by waiting for the entire result set as part of constructing the web page to be sent.
Send a web page out to the browser without any records, or limited records, but use AJAX code to load the remainder of the data behind the scenes.
Use large block fetches whenever feasible, to grab plenty of rows in one clip.
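
The difference between waiting for the whole result set and consuming it in blocks can be sketched with the Python DB-API as follows. This is an illustration only: sqlite3 stands in for the JT400 JDBC connection, iter_rows is a hypothetical helper, and the table myfile with two columns and generated rows is invented for the demo.

```python
import sqlite3
from itertools import islice


def iter_rows(cursor, block_size=1000):
    """Yield rows one at a time while fetching them from the driver in blocks."""
    while True:
        block = cursor.fetchmany(block_size)
        if not block:
            return
        yield from block


# Demo: 16,000 generated rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myfile (column1 INTEGER, column2 TEXT)")
conn.executemany(
    "INSERT INTO myfile VALUES (?, ?)",
    ((i, f"row {i}") for i in range(16000)),
)

cursor = conn.execute("SELECT column1, column2 FROM myfile")
# Only the first block is fetched here; the rest stays unread until needed.
first_page = list(islice(iter_rows(cursor, block_size=500), 20))
```

A page that only renders first_page can be sent immediately, with the remaining rows loaded later, which is roughly what the interactive STRSQL session does.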

One thing you need to remember: the i saves the access paths it creates in the job in case they are needed again. This means that if you log out, log back in, and then run your query, it should take longer the first time and be faster the second time. When running queries in a web application, you may or may not be reusing a job, which means the access paths may have to be rebuilt.
If speed is important, I would:
Look into optimizing the query. I know there are better sources, but I can't find them right now.
Create a stored procedure. A stored procedure saves the access paths created.

With only 16000 rows and no WHERE or ORDER BY this thing should scream. Break the problem down to help diagnose where the bottleneck is. Go back to the IBM i, run your query in the SQL command line and then use the B, BOT or BOTTOM command to tell the database to show the last row. THAT will force the database to cough up the entire 16k result set, and give you a better idea of the raw performance on the IBM side. If that's poor, have the IBM administrators run Navigator and monitor the performance for you. It might be something unexpected, like the 'table' is really a view and the columns you are selecting might be user defined functions.
If the performance on the IBM side is OK, then look at what ColdFusion is doing with the result set. Not being a CF programmer, I'm no help there. But generally, when I am tasked with solving multi-platform performance issues, I find that the client side tends to consume the entire result set and then use program logic to choose which rows to display or work with. The server is MUCH faster than the client, and given the right hints, the database optimiser can make some very good decisions about how to get at those rows.

Postgres Paginating a FTS Query

What is the best way to paginate an FTS query? LIMIT and OFFSET spring to mind. However, I am concerned that by using LIMIT and OFFSET I'd be running the same query over and over (i.e., once for page 1, another time for page 2, etc.).
Will PostgreSQL be smart enough to transparently cache the query result, and then satisfy the subsequent pagination queries from that cache? If not, how do I paginate efficiently?
edit
The database is for single-user desktop analytics. But I still want to know what the best way would be if this were a live OLTP application. I have addressed the problem in the past with SQL Server by creating an ordered set of document IDs and caching the query parameters against the IDs in a separate table, clearing the cache every few hours (so as to allow new documents to enter the result set).
Perhaps this approach is viable for Postgres. But I still want to know what mechanisms are present in the database and how best to leverage them. If I were a DB developer, I'd enable the query-response cache to work with the FTS system.

A server-side SQL cursor can be effectively used for this if a client session can be tied to a specific db connection that stays open during the entire session. This is because cursors cannot be shared between different connections. But if it's a desktop app with a unique connection per running instance, that's fine.
The doc for DECLARE CURSOR explains how the resultset is going to be materialized when the cursor is declared WITH HOLD in a committed transaction.
Locking shouldn't be a concern at all. Should the data be modified while the cursor is already materialized, it wouldn't affect the reader nor block the writer.
Other than that, there is no implicit query cache in PostgreSQL. The LIMIT/OFFSET technique implies a new execution of the query for each page, which may be as slow as the initial query depending on the complexity of the execution plan and the effectiveness of the buffer cache and disk cache.
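
To make the LIMIT/OFFSET behaviour concrete, here is a small sketch in which every page request runs the search again. It is not PostgreSQL: sqlite3 is used so the snippet is self-contained, a LIKE filter stands in for a real full-text match, and the documents table, its contents, and the fetch_page helper are hypothetical.

```python
import sqlite3


def fetch_page(conn, term, page, page_size=10):
    """Run the search again and return only the rows for the requested page."""
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT id, title FROM documents WHERE title LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",  # ORDER BY keeps pages stable
        (f"%{term}%", page_size, offset),
    ).fetchall()


# Demo: 25 matching documents and 5 non-matching ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(i, f"report {i}") for i in range(1, 26)]
    + [(i, f"memo {i}") for i in range(26, 31)],
)

page_one = fetch_page(conn, "report", 1)    # ids 1..10
page_three = fetch_page(conn, "report", 3)  # ids 21..25
```

Each call to fetch_page is a separate execution, so the cost of the underlying search is paid once per page; only the number of rows returned is reduced.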

Well, to be honest, what you may want is for your query to return a live cursor that you can then reuse to fetch certain portions of the results that it represents. Now, I don't know if PostgreSQL supports this. MongoDB does, and I've tried going down that road, but it's not cool. For example: do you know how much time will pass between when a query is run and when a second page of results from that query is requested? Can the cursor stay open for that amount of time? And if it can, what does that mean exactly? Will it block resources, such that if you have many lazy users who start queries but take a long time to navigate through pages, your server might be bogged down by locked cursors?
Honestly, I think redoing a paginated query each time someone asks for a certain page is OK. First of all, you'll be returning a small number of entries (no need to display more than 10-20 entries at a time), and that's going to be pretty fast. Second, you should rather tune your server so that it executes frequent requests fast (add indexes, put it behind a Solr server if necessary, etc.) than let those queries run slowly and cache their results.
Finally, if you really want to speed up full text searches, and have fancy indexes like case insensitive, prefix and suffix enabled, etc, you should take a look at Lucene or better yet Solr (which is Lucene on steroids) as an in-between search and indexing solution between your users and your persistence tier.
