Who is responsible for modifying timestamp fields in the app: backend or db? - sql

For example: we have 2 fields in a certain entity: created_at and updated_at. We can update those fields manually on the backend after create or update operations, or create a trigger on the DB side that will fill/update these fields for us automatically.
There are some cases to consider:
Usually, after a create or update, the backend returns the JSON of the object. It would be nice for that response to include the timestamp fields. However, if a trigger sets them, the backend has to run another SELECT just to read the new timestamps before returning the object to the client.
Sometimes backend engineers can forget to update these fields manually leading to null records.
Not a DBA specialist myself, but what do you think of the cost of triggers, especially at high RPS? Should I not worry about the performance cost of triggers for such simple updates in high-load systems?

Who is responsible for modifying timestamp fields in the app: backend or db?
It depends, it can be either.
There is no one "right" answer for which times to use or exclude. Depending on your system and which actors perform time-based actions (users, devices, servers, triggers), you might have one (or more) of the following:
time A – when a user performs an action
this is most likely local device time (whatever the phone or computer thinks is current time)
but: anything is possible, a client could get a time from who-knows-where and report that to you
could be when a user did something (tap a button) and not when the message was sent to the backend
could be 10-20 seconds (or more) after a user did something (tap a button), and gets assigned by the device when it sends out batched data
time B – when the backend gets involved
this is server time, and could be when the server receives the data, or after the server has received and processed the data, and is about to hand it off to the next player (database, another server, etc)
note: this is probably different from "time A" due to transit time between user and backend
also, there's no guarantee that different servers in the mix all agree on time. They can and should, but that agreement should not be relied upon as truth
time C – when a value is stored in the database
this is different from server time (B)
a server might receive inbound data at B, then do some processing which takes time, then finally submits an insert to the database (which then assigns time C)
Another highly relevant consideration in capturing time is the accuracy (or rather, the likely inaccuracy) of client-reported time. For example, a mobile device can claim to have sent a message at time X, when in fact the clock is just set incorrectly and actual time is minutes, hours - even days - away from reported time X (in the future or in the past). I've seen this kind of thing occur where data arrives in a system, claiming to be from months ago, but we can prove from other telemetry that it did in fact arrive recently (today or yesterday). Never trust device-reported times. This applies to any device (mobile, tablet, laptop, desktop): all of them often have internal clocks that are not accurate.
Remote servers and your database are probably closer to real, though they can be wrong in various ways. However, even if wrong, when the database auto-assigns datetimes to two different rows, you can trust that one of them really did arrive after or before the other – the time might be inaccurate relative to actual time, but they're accurate relative to each other.
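As a rough sketch of letting the database assign time C, here is a small Python example using the standard sqlite3 module. The item table, the item_touch trigger, and the helper functions are made up for illustration; other databases have their own syntax for defaults and triggers. The fetch_item call is the extra read mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- SQLite reports UTC here
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Refresh updated_at on every update so application code cannot forget it.
CREATE TRIGGER item_touch AFTER UPDATE ON item
BEGIN
    UPDATE item SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
""")

def fetch_item(item_id):
    # The extra read: needed to see the values the database assigned.
    row = conn.execute(
        "SELECT id, name, created_at, updated_at FROM item WHERE id = ?",
        (item_id,),
    ).fetchone()
    return dict(zip(("id", "name", "created_at", "updated_at"), row))

def create_item(name):
    cur = conn.execute("INSERT INTO item (name) VALUES (?)", (name,))
    return fetch_item(cur.lastrowid)

def rename_item(item_id, name):
    conn.execute("UPDATE item SET name = ? WHERE id = ?", (name, item_id))
    return fetch_item(item_id)
```

Both timestamps come from the same clock (the database's), so rows can be compared with each other even if that clock is off. Some databases (PostgreSQL, for example) offer a RETURNING clause on INSERT and UPDATE, which may make the second read unnecessary.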
All of this becomes further complicated if you intend to piece together history by using timestamps from multiple origins (A, B and C). It's tempting to do, and sometimes it works out fine, but it can easily be nonsense data. For example, it might seem safe to piece together history using a user time A, then a server time B, and database time C. Surely they're all in order – A happened first, then B, then C; so clearly all of the times should be ascending in value. But these are often out of order. So if you need to piece together history for something important, it's a good idea to look for secondary confirmations of order of events, and don't rely on timestamps.
Also on the subject of timestamps: store everything in UTC – database values, server times, client/device times where possible. Timezones are the worst.
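A minimal sketch of normalizing to UTC before storing, using Python's standard datetime module. The helper name to_utc_iso and the sample reading are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

def to_utc_iso(moment):
    """Normalize a timezone-aware datetime to a UTC ISO 8601 string."""
    if moment.tzinfo is None:
        # Without an offset we cannot know which instant is meant.
        raise ValueError("naive datetime: offset unknown, refusing to guess")
    return moment.astimezone(timezone.utc).isoformat()

# A hypothetical client reading taken at noon in a UTC-05:00 zone.
client_time = datetime(2024, 1, 15, 12, 0, tzinfo=timezone(timedelta(hours=-5)))
stored = to_utc_iso(client_time)  # "2024-01-15T17:00:00+00:00"
```

Rejecting naive datetimes is a design choice in this sketch: guessing the zone tends to produce exactly the kind of silently wrong data described above.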

Related

How can I deal with the webserver UI of one machine being out of sync with backend/API of another?

The system my company sells is software for a multi-machine solution. In some cases, there is a UI on one of the machines and a backend/API on another. These systems communicate and both use their own clocks for various operations and storage values.
When the UI's system clock gets ahead of the backend by 30 seconds or more, the queries start to misbehave due to the UI's timestamp being sent over as key information to the REST request. There is a "what has been updated by me" query that happens every 30 seconds and the desync will cause the updated data to be missed since they are outside the timing window.
Since I do not have any control over the systems that my software is installed on, I need a solution on my code's side. I can't force customers to keep their clocks in sync.
Possible solutions I have considered:
The UI can query the backend for its system time and cache that.
The backend/API can reach back further in time when looking for updates. This will give the clocks some room to slip around, but will cause a much heavier query load on systems with large sets of data.
Any ideas?
Your best bet is to restructure your API somewhat.
First, even though NTP is a good idea, you can't actually guarantee it's in use. Additionally, even when it is enabled, OSs (Windows at least) may reject packets that are too far out of sync, to prevent certain attacks (on the order of minutes, though).
When dealing with distributed services like this, the mantra is "do not trust the client". This applies even when you actually control the client, too, and doesn't necessarily mean the client is attempting anything malicious - it just means that the client isn't the authoritative source.
This should include timestamps.
Consider: the timestamps are a problem here because you're trying to use the client's time to query the server, but we shouldn't trust the client. Instead, have the server return a timestamp of when the request was processed, or the update stamp of the latest entry in the database, which the client can use in subsequent queries to retrieve new updates (how far back you go on the initial query is up to you).
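To make the idea concrete, here is a small in-memory sketch in Python. UpdateLog, changes_since, and next_token are hypothetical names, not part of any real API, and a simple server-side counter stands in for the server-assigned update stamp (a server timestamp would play the same role). The point is only that the client echoes back a value the server issued and never sends its own clock.

```python
import itertools

class UpdateLog:
    """Hypothetical in-memory stand-in for the backend's data store."""

    def __init__(self):
        self._stamps = itertools.count(1)  # assigned by the server only
        self._rows = []                    # list of (stamp, payload)

    def write(self, payload):
        stamp = next(self._stamps)
        self._rows.append((stamp, payload))
        return stamp

    def changes_since(self, token):
        """Return everything newer than the token plus the token to send next."""
        rows = [(s, p) for s, p in self._rows if s > token]
        next_token = rows[-1][0] if rows else token
        return {"changes": [p for _, p in rows], "next_token": next_token}

log = UpdateLog()
log.write("order 1 created")
log.write("order 1 shipped")

first = log.changes_since(0)                     # initial poll
log.write("order 2 created")
second = log.changes_since(first["next_token"])  # only what is new
```

Because the token comes from the server, the 30 second polling window no longer depends on the two machines agreeing on the time.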
Dealing with concurrent updates safely is a little harder, and depends on what is supposed to happen on collision. There's nothing really different here from most of the questions and answers dealing with database-centric versions of the problem, I'm just mentioning it to note you may need to add extra fields to your API to correctly handle or detect the situation, if you haven't already.

Shared Data Storage Strategy for 'Live' Dashboards in Excel VBA

I'm building a UI in Excel whose goal is to show "live" information on Orders and Order Status to three users; I'll name them DataUser, DashboardOne, and DashboardTwo for example's sake.
The process is that the DataUser will fill in the Orders data, and that data is going to be used to populate information on two dashboards. The dashboards are going to be updated live with changes from the DataUser (Orders Increases/Decreases), and changes on order status between DashboardOne and DashboardTwo. For the live updates I'm thinking of using an Application.OnTime call to refresh the View/Dashboards. The two dashboards will be active about 8 hours a day.
Where I'm struggling is how/where to store the data. I've thought about a couple of options, but I don't know the implications of one over the other, especially considering that I intend the dashboards to refresh every 30 sec. with Application.OnTime, which could prove expensive.
The options I thought about were:
A Master Workbook that would create separate Workbooks for DashboardOne and DashboardTwo and act as the database and main UI for DataUser.
Three separate workbooks that would all refer to the one DataWorkbook or another flat data file (perhaps an XML or JSON).
Using an actual database for the data, although this would bring other implications (don't currently have one).
I'm not considering a shared workbook as I've tried something similar in the past (and this time ^^, early steps) and it went rather poorly, nightmare to sync and poor data integrity.
In short:
Which would be the best data storage strategy for Excel that wouldn't jeopardise the integrity of the data nor be so expensive as to interfere with the uptime of the rest of the code? Are there better options that I should be considering?
There are quite a number of alternatives, depending on the time you want to invest and the tools at hand. I'll give you a couple of options here.
But first, the basic assumptions:
The number of data items that you need to share (this being a dashboard) is a few tens (let's say, fewer than 100),
You have at least basic programming skills,
From your description, you have one client with READ-WRITE capabilities while there are two clients with READ-ONLY capability.
OPTION 1:
You can have Excel save the data in CSV format (the amount of data is very small, so saving and reading it takes a small fraction of a second).
The two clients would then open the file in read-only mode, load the data and update the display. You would need to include exception handling at both types of client:
At the one writing, handle the condition of error when it attempts to write at the same time one of the clients attempts to read,
At the two reading, handle the condition of error when attempting to open the file (for read only) while the other process is writing.
Since the write and read operations are going to take a very, VERY short time (as stated, a small fraction of a second), these conditions will be very rare. Additionally, since both dashboard clients would open the file read-only, they will not disturb each other if they make their attempt at the same moment.
If you wish to drastically reduce the chances of collision, you may set the timers (of the update process on one hand and of the reading processes on the other) to different prime numbers of seconds. For instance, the timer of the updating process would fire every 11 seconds while that of the reading processes would fire every 7 seconds, so the two schedules coincide at most once every 77 seconds.
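The exception handling described for Option 1 could look roughly like the sketch below. It is written in Python rather than VBA purely to show the shape of the logic; with_retry, write_orders, and read_orders are made-up names, and writing to a temporary file before swapping it into place is an extra precaution that is not part of the option as described above.

```python
import csv
import os
import tempfile
import time

def with_retry(action, attempts=5, delay=0.2):
    """Run action(), retrying briefly when the file is busy (OSError)."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

def write_orders(path, rows):
    """Writer side: write a temp file, then swap it into place in one step."""
    def action():
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        try:
            os.replace(tmp, path)  # readers see the old or the new file
        except OSError:
            os.remove(tmp)         # a reader may hold the target open
            raise
    with_retry(action)

def read_orders(path):
    """Reader side: open read-only and load every row."""
    def action():
        with open(path, newline="") as handle:
            return list(csv.reader(handle))
    return with_retry(action)
```

In VBA the same pattern would be an error handler around the file open, followed by a short wait and another attempt.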
OPTION 2:
Establish a TCP/IP channel between the processes, where the main process (meaning the one that would have WRITE privilege) would send a triggering message to the other two requesting to start an update whenever a new version of the data had been saved. Upon reception of the trigger, both READ-ONLY processes would approach the file and fetch the data.
In this case, the chances of collision would be close to zero.

Refactoring my database schema

I'm refactoring my current schema and it's too abstract for me.
I monitor my servers with a homemade monitoring software. This software sends HTTP requests to a Rails web server with about ten different fields worth of information so I can get a quick overview of everything.
My current implementation:
server [id, name, created_date, edited_date, ..., etc ]
status_update [id, server_id, field1, field2, field3, created_date, edited_date, ..., etc]
I treat the servers as Users and status updates as Tweets. I delete any status_update on a server_id older than the tenth one just to keep from growing to infinity.
Though I'm starting to run into a few complications. I need to display information from the most recent status_update on the index page, I need to sort the servers based on status_update info, I need to store info from certain status_updates that may be way older than 10 status_updates old. It also seems like I'm going to start needing to store information from status_updates in both the server and status_update, which would cause hitting the DB multiple times on an insert. Thus, I am looking to refactor.
My requirements:
I only need to display information from the most recent update.
Having the next 9 status_updates helps debug if the system goes offline.
I need to be able to sort based on some info from most recent status_update.
I need the database to remain small (Heroku free).
Ideal performance, IE not hitting database more than once unless necessary.
Non-Complicated DB structure so I can pass it along.
Edit: Additional Info => I am looking to ultimately monitor about 150-200 servers (a lot for a hobby dev, but I'm cheap). Each monitoring service posts every five minutes or so unless something goes wrong. So, worst case scenario has me reaching max capacity every four hours.
I was thinking it would be nice to track when the last time X event happened, and what the result was. Thus, tracking that information would have to be moved to the server model itself since I'm wiping out old records and would lose the information after an hour or so. Though in retrospect, I could just save that info in memory using the monitoring service and send it up every five minutes or only once each time it changes. I could also simply edit that information only when it changes, so as to process less information on each request. Hm!
Efficiency
All ORMs, including ActiveRecord, are designed and built around certain tradeoffs. It's commonplace for ORMs to use several simple SELECT statements to do what a SQL developer would do with a single SELECT statement. You're probably not overwhelming Heroku with your queries.
There's no reasonable structural solution to this problem.
Size
Your "status_update" table should be able to hold an enormous number of rows. Heroku's hobby-dev plan allows 10,000 rows. How many servers do you seriously expect to monitor on a free plan? If I were you, I would delete old rows from it no more than once a day, or when I got a permission error. (On Heroku, certain permission errors mean you're over the row limit.)
It also seems like I'm going to start needing to store information from status_updates in both the server and status_update, which would cause hitting the DB multiple times on an insert.
This really makes little sense. Tweets don't require updates to the user account; status updates don't require updates about the server. This might suggest refactoring is in order, but I'd want to see either your models or your CREATE TABLE statements to be sure. (You can paste those into your question, and leave a comment here.)
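For what it's worth, here is a sketch of how the index page could be served without copying anything into the server table. It uses Python's sqlite3 module with a cut-down, assumed schema: cpu_load stands in for field1, field2 and so on, and "most recent" is taken to mean the highest id, which may not match your real models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE server (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE status_update (
    id        INTEGER PRIMARY KEY,
    server_id INTEGER NOT NULL REFERENCES server(id),
    cpu_load  REAL NOT NULL          -- stands in for field1, field2, ...
);
CREATE INDEX status_update_server ON status_update (server_id, id);
""")

# One query: every server with its latest update, sorted by a status field.
LATEST_PER_SERVER = """
SELECT s.name, u.cpu_load
FROM server AS s
JOIN status_update AS u ON u.server_id = s.id
WHERE u.id = (SELECT MAX(id) FROM status_update WHERE server_id = s.id)
ORDER BY u.cpu_load DESC
"""

# Occasional cleanup: keep only the ten newest updates for each server.
PRUNE = """
DELETE FROM status_update
WHERE id NOT IN (
    SELECT keep.id FROM status_update AS keep
    WHERE keep.server_id = status_update.server_id
    ORDER BY keep.id DESC LIMIT 10
)
"""
```

An insert then touches only status_update, and the cleanup can run on whatever schedule suits the row limit.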
Alternatives
I'd seriously consider running this Rails app on a local machine, writing data to a database on the local machine, especially if you intend to target 200 web servers. This would eliminate all Heroku row limits, and you don't really need to run it 24 hours a day if this is just a hobby. If you're doing this professionally, your income from it should easily cover the cost of a hobby-basic plan on Heroku. (Currently $9.00/month.) But even then I'd think hard about hosting this locally.

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from making these requests from the web server to the database server. With as little data as this, performance will not be an issue, and even if it becomes one as everything grows, there is a lot to be gained on the database side (indexes, etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment, so again, a direct query would be best.
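A direct query of this kind might look like the sketch below, using Python's sqlite3 module as a stand-in for whatever database the warehouse really runs on. The price table layout, the column names, and the prices_between helper are assumptions based on the three fields in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE price (
    product_name VARCHAR(25) NOT NULL,
    price_date   TEXT NOT NULL,      -- ISO 8601 date, so text order is date order
    price        REAL NOT NULL,
    PRIMARY KEY (product_name, price_date)  -- doubles as the lookup index
);
""")

def prices_between(products, start, end):
    """Fetch prices for the chosen products within an inclusive date range."""
    marks = ", ".join("?" for _ in products)  # one placeholder per product
    sql = (
        "SELECT product_name, price_date, price FROM price "
        f"WHERE product_name IN ({marks}) AND price_date BETWEEN ? AND ? "
        "ORDER BY price_date, product_name"
    )
    # Values are bound as parameters, never pasted into the SQL text.
    return conn.execute(sql, [*products, start, end]).fetchall()
```

With a few thousand rows and an index on (product_name, price_date), a query like this should return almost immediately.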
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Some examples, and why you need not worry about them:
The connection with the database server is down.
This is an issue for both methods. With only one connection per day, a 1-in-10000 chance of failure might seem to favor the once-a-day method. But these failures should not come up very often, and if they do, you should be able to handle them (retry the request, show a message to the user). This is what an enormous number of websites do, so trust me when I say that this will not be an issue. Also, think of what it would mean if your daily update failed. That would present a bigger problem!
Performance issues.
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level. Use a caching system (non-CSV) on the database server. Use a caching system on the web server. Fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise.
Your second database (on your web server, even) gets all updates from the master and is read-only. You can only query it for data.
Your web server cannot connect to the data warehouse, but can connect to your slave to read information. Even if there was an injection hack, it doesn't matter, as it is read-only.
Now there is no single moment at which you update the queried database (the master-slave replication keeps it up to date continuously), and no chance that the queries from the web server put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only form, just ensure that the only thing accepted in that field is a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, keeping the connection function in a separate file should do fine in most cases, so that a user can't, say, open your webpage in an HTML viewer and see your database connection string.
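Checking that the field really holds a date can be as small as the sketch below, written in Python with a hypothetical parse_date_field helper. Passing the parsed value to the query as a bound parameter, rather than building the SQL string by hand, is the usual second layer on top of this.

```python
from datetime import date

def parse_date_field(raw):
    """Return a date for ISO 8601 date input, or None for anything else."""
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        return None

good = parse_date_field("2024-01-15")
bad = parse_date_field("2024-01-15'; DROP TABLE price; --")
```

Anything that does not parse as a date is rejected before it ever gets near the database.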
As for the CSV, I would say that querying the database per user request, especially if the feature is only used ~10 times weekly, would be much more efficient than the CSV. I consider the CSV overkill because, again, you only have ~10 users attempting to get some information; exporting an updated CSV every day would be too much for such little payoff.
EDIT:
Also, if an attack is a big concern, which really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is a possibility that even with the best security an attack could happen. That mainly just depends on whether the attackers want the information you have.

Disadvantages of Sql Cursor

I was studying cursors and I read somewhere that each time you fetch a row from a cursor, it results in a network round trip, whereas a normal SELECT query makes only one round trip however large the result set is.
Can anyone explain what that means? What do "network round trip" and "one round trip" mean in detail, with some example? And when do we use a cursor and when do we use a while loop?
Unfortunately, that reference is incorrect.
A "normal SELECT" creates a cursor that the client fetches from. The mechanics are exactly the same as if you open and return a SYS_REFCURSOR (or use any other mechanism for opening a cursor). In both cases, the client will fetch a number of rows over the network every time it requests data from the database. The client can control the number of rows that are fetched each time; it would be exceptionally rare for the client to fetch 1 row at a time or to fetch all the rows from a cursor in a single network round trip.
What actually happens when a client application fetches from a cursor (no matter how the cursor is opened) is that the client application sends a request over the network for N rows (again, the client controls the value of N). The database sends the next N rows back to the client (generally, Oracle has to continue executing the query in order to determine the next N rows because Oracle does not generally materialize an entire result set). The client application does something with those N rows and then sends another request over the network for the next N rows, and the pattern repeats.
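The fetch-N-rows loop can be illustrated with Python's DB-API, here using the standard sqlite3 module. This is only an illustration of the loop's shape: sqlite3 runs in-process, so no network is involved, and the post table and fetch_in_batches helper are made up. With a networked driver that streams results, each batch would correspond roughly to one request for N rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO post (title) VALUES (?)",
    [(f"post {i}",) for i in range(10)],
)

def fetch_in_batches(sql, batch_size):
    """Pull the result set from the cursor batch_size rows at a time."""
    cur = conn.execute(sql)
    batches = []
    while True:
        rows = cur.fetchmany(batch_size)  # the "give me the next N rows" step
        if not rows:
            break
        batches.append(rows)
    return batches

batches = fetch_in_batches("SELECT id, title FROM post ORDER BY id", 4)
sizes = [len(batch) for batch in batches]  # [4, 4, 2]
```

Ten rows with N = 4 take three fetches; with N = 1 the same result would take ten, which is the case the quoted claim describes.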
In virtually all database systems, the application that uses the data and the DBMS that is responsible for storing and searching the data live on separate machines. They talk to each other over a network. Even when they are on the same machine, there is still effectively a network connection.
This matters because there is some time between when an application decides that it's ready to read data, when that request arrives over the network at the database server, when the database server actually produces the response, and when the response finally arrives over the network on the application side.
When you do a query for a whole set of data, you only pay this cost once. Although it may seem wasteful, it is in fact much more efficient to put the burden of holding on to all of the data on the application, because it's usually easier to give more resources to an application than to do the same on a database server.
When your application only fetches data one row at a time, the cost of the round trip between application and database is paid once per row. If you want to show the titles of 100 blog posts, then you're paying the cost of 100 round trips to the database for that one report. What's worse is that the database server has to somehow keep track of the partially completed result set. That usually means that the resources that could be used for querying data for another request are instead being retained by an application that hasn't yet asked for all of the data it originally requested.
The basic rule is to talk to the database only when you absolutely have to, and to make the interaction as short as possible. This means you only ask for the data you really need (have the database do as much filtering as possible, instead of doing it in the application) and accept all of the data as quickly as possible, so that the database can move on to another task.