I need to synchronize between two data sources:
I have a web service running on the net. It continuously gathers data from the net and stores it in the database. It also provides the data to clients on request. I want to keep a repository of the data as objects for faster service.
On the client side, there is a Windows service that calls the web service mentioned above and synchronizes its local database with the server.
A few of my restrictions:
The web service has a very small buffer limit and can only transfer fewer than 200 records per call, which is not enough for the data collected in a day.
I also can't copy the database files, since the database structures are very different (one is SQL and the other is Access).
The data is updated on an hourly basis and there will be a large amount of data that needs to be transferred.
Syncing by date or other group is not possible with the size limitation. Paging can be done, but the remote repository keeps changing (and I don't know how to take a chunk of data from the middle of a table in a SQL database).
How do I use the repository to sync recent updates, or the full database, within this limitation?
A better approach to the problem, or an improvement of the current approach, will be accepted as the right answer.
You mentioned that syncing by date or by group wouldn't work because the number of records would be too big, but what about syncing by date (or group or whatever) and then paging by that? The benefit is that you will have a defined batch of records and you can now page over that because that group won't change.
For example, if you need to pull data off hourly, as each hour elapses (so, when it goes from 8:59am to 9:00 am), you begin pulling down the data that was added between 8am and 9am in chunks of 200 or whatever size the service can handle.
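As a rough illustration of paging inside a closed time window, here is a minimal Python sketch. It uses an in-memory SQLite table as a stand-in for the server database; the table name `records`, its columns, and the helper names are all made up for the example, and it assumes every row has an increasing integer `id` and a `created_at` timestamp. Each page asks for rows after the last `id` already received, so there is no need to count an offset into the middle of the table.

```python
import sqlite3
from datetime import datetime, timedelta

PAGE_SIZE = 200  # assumed per-call limit of the web service


def make_demo_db():
    """Build a hypothetical in-memory table: one row every 8 seconds from 08:00."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
    )
    base = datetime(2024, 1, 1, 8, 0, 0)
    rows = [
        ((base + timedelta(seconds=8 * i)).isoformat(), f"row-{i}")
        for i in range(500)
    ]
    conn.executemany("INSERT INTO records (created_at, payload) VALUES (?, ?)", rows)
    return conn


def fetch_window(conn, start, end, page_size=PAGE_SIZE):
    """Yield pages of rows created in [start, end), ordered by id.

    The window is already closed when we read it, so rows added to the
    table later cannot shift the pages around.
    """
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, created_at, payload FROM records "
            "WHERE created_at >= ? AND created_at < ? AND id > ? "
            "ORDER BY id LIMIT ?",
            (start.isoformat(), end.isoformat(), last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]  # resume after the last id we received


if __name__ == "__main__":
    conn = make_demo_db()
    window = (datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 9))
    for page in fetch_window(conn, *window):
        print(len(page), "rows, last id", page[-1][0])
```

In a real setup each page would be one web service call, with the window bounds and the last received `id` passed as parameters.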
This is more of a general discussion than a code question.
I have a website monitoring platform where users can enter their website URL and we check it every X minutes, based on the customer's interval. At each interval, an entry is stored as an UptimeCheck model in the Laravel 8 project, with the status being either down or up.
If a customer has 20 monitors, and each checks every minute, then over a 30 day period for the one customer they'd accumulate over 1 million rows.
My question is: do I really need to keep this number of rows?
The reason this number of rows is kept is so that we can present a graph showing the average website uptime.
My thinking is that if I created some kind of SVG programmatically for each day and stored it in the table, then I wouldn't need to store as many entries. My concern is how I would merge the SVG models into one to present a daily graph.
What kind of libraries could I use and how else might I approach this?
Unlike performance, the trick for storing uptime data is simple. You don't store it. ;)
You need to store DOWNTIME data instead. Register only unavailability events and extrapolate uptime when displaying reports.
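To make the extrapolation concrete, here is a small Python sketch (Python only for illustration; the same arithmetic ports directly to PHP/Laravel). It assumes each downtime event is stored as a `(start, end)` pair, that events for one monitor do not overlap, and that an outage still in progress has `end` set to `None`. The function name and data layout are hypothetical.

```python
from datetime import datetime


def uptime_percent(downtimes, period_start, period_end):
    """Derive uptime for a reporting period from downtime intervals only.

    downtimes: iterable of (start, end) datetimes; end may be None for an
    outage that is still ongoing. Intervals are assumed not to overlap.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in downtimes:
        # Clip each outage to the reporting period.
        clipped_start = max(start, period_start)
        clipped_end = min(end if end is not None else period_end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


if __name__ == "__main__":
    day_start = datetime(2024, 1, 1)
    day_end = datetime(2024, 1, 2)
    outages = [
        (datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 30)),   # 30 minutes
        (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 6)),  # 6 minutes
    ]
    print(uptime_percent(outages, day_start, day_end))
```

A monitor that never went down stores no rows at all for the period, and the report still shows 100%.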
I'm looking for a cloud service that can do advanced statistics calculations on a large amount of votes submitted by users, in "real time".
In our app, users can submit different kinds of votes, like picking a favorite, rating 1-5, saying yes/no etc. on various topics.
We also want to show "live" statistics to the user, showing the popularity of a person etc. This will be generated by a rather complex SQL where we are calculating the average number of times a person was picked as favorite, divided by total number of votes and the number of games in which the person has been participating etc. And the score for the latest X games should count higher than the overall score for all games. This is just an example, there are several other SQL queries with similar complexity.
All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.
Ideally, the Firebase-backend (functions, firestore etc) should not need to know about the query logic.
What I wish for is a pay as you go cloud service that does the following:
I define some schemas and set up the queries we need for the statistics we have (15-20 different SQL queries), like setting up views in MySQL.
On every vote, we push the vote data to this service, which will store it in a row.
The service should then, based on its knowledge of the defined queries and the content of the pushed vote data, determine which statistics are affected by the newly added row, and recalculate these. A specific vote type can affect one or more statistics.
Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) - so we can update the relevant Firestore documents.
The service should be able to throttle the calculations, like only regenerating new statistics every 1 minute despite having several votes per second on the same topic.
Is there any product like this on the market? Or can it be built by combining available cloud services? And what is the official term for such a product, if I should search for it myself?
I know that I can probably build a solution like this myself and run it on a cloud-hosted database server, which can scale as our needs grow, but I believe that I'm not the first developer with a need for this, so I hope that someone has solved it before me :)
You can leverage the existing cloud services available on the Google Cloud Platform.
Google BigQuery, Google Cloud Firestore, Google App Engine (CRON Jobs), Google Cloud Tasks
The services can be used to solve the problems mentioned above:
1) Google BigQuery : Here you can define schema for the data on which you're going to run the SQL queries. BigQuery supports Standard and legacy SQL queries.
2) Every vote can be pushed to the defined BigQuery tables using its streaming insert service.
3) Every vote pushed can trigger the recalculation service which calculates the statistics by executing the defined SQL queries and the query results can be stored as documents in collections in Google Cloud Firestore.
4) Google Cloud Firestore: Here you can store the live statistics of the user. This is a real time database, so you'll be able to configure listeners for the modifications to the statistics and show the modifications as soon as the statistics are recalculated.
5) In the same service which inserts every vote, create a new record with a "syncId" in another table. The idea is to group the votes cast in a particular interval under their corresponding syncId. The syncId can be suffixed with a timestamp. Depending on your requirements, a time interval can be set so that a CRON job invokes the recalculation service once per interval. Once the recalculation for a particular syncId is completed, the record corresponding to that syncId should be marked as completed.
We are leveraging the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded on Google Firestore and then stream-inserted to Google BigQuery. The data stored in BigQuery is queried after 30 sec of each update using SQL queries and the query results are stored in Google Cloud Firestore to serve dashboards which are automatically updated using listeners configured for the collection in which the dashboard information is stored.
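The syncId batching from step 5 can be sketched without any cloud service at all. The Python below is a plain in-memory model, not Google Cloud API code: the class name, the `sync-<timestamp>` id format and the 60-second interval are assumptions for illustration. In a real setup the vote storage would be the BigQuery tables, `run_recalculation` would be invoked by the CRON job, and `recalculate` would run the SQL queries and write the results to Firestore.

```python
from collections import defaultdict

INTERVAL_SECONDS = 60  # assumed throttling interval


class VoteBatcher:
    """Groups votes into fixed time buckets, each identified by a syncId."""

    def __init__(self, interval=INTERVAL_SECONDS):
        self.interval = interval
        self.votes = defaultdict(list)  # bucket start (epoch seconds) -> votes
        self.completed = set()          # bucket starts already recalculated

    def _bucket(self, timestamp):
        return int(timestamp // self.interval) * self.interval

    def record_vote(self, vote, timestamp):
        """Store a vote and return the syncId of the bucket it landed in."""
        bucket = self._bucket(timestamp)
        self.votes[bucket].append(vote)
        return f"sync-{bucket}"

    def run_recalculation(self, now, recalculate):
        """Process every closed bucket that has not been completed yet.

        The bucket containing `now` is skipped because it can still
        receive votes. Returns the syncIds that were processed.
        """
        current = self._bucket(now)
        processed = []
        for bucket in sorted(self.votes):
            if bucket >= current or bucket in self.completed:
                continue
            sync_id = f"sync-{bucket}"
            recalculate(sync_id, self.votes[bucket])
            self.completed.add(bucket)  # mark the syncId as completed
            processed.append(sync_id)
        return processed


if __name__ == "__main__":
    batcher = VoteBatcher()
    batcher.record_vote({"topic": "a", "rating": 4}, timestamp=5)
    batcher.record_vote({"topic": "a", "rating": 2}, timestamp=30)
    batcher.record_vote({"topic": "b", "rating": 5}, timestamp=65)
    batcher.run_recalculation(
        now=70, recalculate=lambda sid, votes: print(sid, len(votes), "votes")
    )
```

However many votes arrive within one interval, each bucket triggers at most one recalculation, which is the throttling asked for in the question.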
I need to take a snapshot of a table at a given time, and determine the difference between the snapshot and the current data. What is the most effective way to do that? Can it be done in pure SQL (MS SQL), or should my app server do that in Delphi code?
I'm using an app server that keeps track of these changes and transmits them over a Telnet protocol to any number of clients, on the same machine or not.
Because of the text protocol, I have to send only the differences between the tables, because it is impractical to send all the data (~10k records) every time something changes.
The apps involved are Swordfish (an Automatic Trading System/ATS), not written by me, and the app server (Chef) and the client (Diner), both written by me. The ATS uses MS SQL as a layer for its API, so Chef sends and receives data to and from the MS SQL server, essentially controlling the ATS. The client communicates what it wants done to Chef, and then Chef talks to Swordfish through the DBMS, and to the other Diners through Telnet.
Code is the most efficient way to do this, according to all the info that I could find on the web.
It may be possible to know with pure SQL what rows were added, but I could find nothing (in SQL) to detect changes to already existent rows or row deletes, both of which I need knowledge of to keep my app server (that is aware and synced with the SQL table) and my app clients synchronized.
Keeping an in-memory table of 10-15k records isn't that serious. A different error in my code (to do with TFDQuery) made me think that my "offline" or "in-memory" snapshot of the tables needed A LOT of memory: every SQL add command created its own instance of TFDQuery, requiring 30 MB per record that leaked when destroying the TFDQuery. Now I create the instance of TFDQuery once and reuse it for every record added, and my total memory usage stays at ~50 MB, which I have no problem with.
So, every time Service Broker detects a change in the dataset of the SQL table, I save the old dataset to an in-memory table and do three compares between the two datasets (the saved/old dataset and the current one, i.e. the newest version of the SQL table): 1. Scan for additions. 2. Scan for changes. 3. Scan for deletions. DONE :-)
Then it's a simple task of encoding the text for the Telnet protocol, and all my clients, my SQL server and my app server are happily synced!
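For anyone who wants the gist of the three scans, here is a compact sketch. It is written in Python rather than Delphi purely for brevity and is not my actual Chef code. It assumes each snapshot is held as a dictionary keyed by the table's primary key, with the rest of the row as the value; the function name and the sample rows are made up.

```python
def diff_snapshots(old, new):
    """Compare two snapshots of a table, both keyed by primary key.

    Returns (added, changed, deleted):
      added   - rows whose key exists only in `new`
      changed - rows present in both whose values differ (new values)
      deleted - keys that exist only in `old`
    """
    added = {key: new[key] for key in new.keys() - old.keys()}
    changed = {
        key: new[key]
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    deleted = sorted(old.keys() - new.keys())
    return added, changed, deleted


if __name__ == "__main__":
    old = {1: ("AAPL", 100), 2: ("MSFT", 50), 3: ("GOOG", 10)}
    new = {1: ("AAPL", 120), 3: ("GOOG", 10), 4: ("AMZN", 5)}
    added, changed, deleted = diff_snapshots(old, new)
    print("added:", added)
    print("changed:", changed)
    print("deleted:", deleted)
```

Only the three result sets need to be encoded and sent to the clients, instead of all ~10k records.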
I have an SQL database which has a main Orders table taking 2-5 new rows per day.
The other table which has daily records is the Log table. It receives new data every time a user accesses the login page of the web site, including the time and the IP address of the user. It gets 10-15 new rows per day for now.
As I monitor the daily SQL backup, I realized that it is growing by about 2-3 MB per day. I have enough storage, but it makes me worried. Is it the Log table causing this growth? I deleted about 150 rows, but that didn't reduce the .bak file size. It increased! I didn't shrink the database and I don't want to do it.
I'm not sure what to do about it. Is there any other decent way of logging user accesses?
I typically export the rows from the production server, and import into a database on a non-production server (like my local machine), then delete the existing rows from the production server. Also run an optimize on the production server table so the size is recalculated. This is somewhat manual but it keeps the production server table size down, and the export/import process is rather quick.
We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from doing these requests from the webserver to the database server. With as little information as this you will not find performance an issue, and even if it becomes one as everything grows, there is a lot to be gained on the database side (indexes etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment, so again, a direct query would be best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Some examples, and why you need not worry about them:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day, a chance of 1-in-10000 failures might seem to favour the once-a-day method. Still, these issues should not come up very often, and if they do, you should be able to handle them (retry the request, give a message to the user). This is what enormous numbers of websites do, so trust me when I say that this will not be an issue. Also, think of what it would mean if your daily update failed. That would present a bigger problem!
Performance issues.
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level. Use a caching system (non-CSV) on the database server. Use a caching system on the webserver. Fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise.
Your second database (on your webserver, even) gets all updates from the master and is read-only. You can only query it for data.
Your webserver cannot connect to the data warehouse, but can connect to your slave to read information. Even if there was an injection hack, it doesn't matter, as it is read-only.
Now there is no single moment at which you update the queried database (the master-slave replication keeps it updated at all times), and no chance that queries from the webserver put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only form, just ensure that the only thing the field accepts is a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file with just the connection function should do fine in most cases, so that a user can't, say, open your webpage in an HTML viewer and see your database connection string.
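To show what "only accept a date" can look like in practice, here is a hedged Python sketch using the standard `sqlite3` module as a stand-in for the real database driver. The table name `prices`, its columns and the function name are assumptions based on the fields listed in the question. The dates are parsed before they get anywhere near the query, and all values, including the product names, are passed as bound parameters instead of being pasted into the SQL string.

```python
import sqlite3
from datetime import date


def query_prices(conn, start_text, end_text, products):
    """Return (price_date, product_name, price) rows for a date range.

    start_text / end_text must be ISO dates (YYYY-MM-DD); anything else
    raises ValueError before a query is built.
    """
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if not products:
        return []
    # One "?" per product; the values themselves are never interpolated.
    placeholders = ",".join("?" for _ in products)
    sql = (
        "SELECT price_date, product_name, price FROM prices "
        "WHERE price_date BETWEEN ? AND ? "
        f"AND product_name IN ({placeholders}) "
        "ORDER BY price_date, product_name"
    )
    params = [start.isoformat(), end.isoformat(), *products]
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prices (price_date TEXT, product_name TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?)",
        [
            ("2024-01-02", "Widget", 10.5),
            ("2024-01-02", "Gadget", 7.25),
            ("2024-01-03", "Widget", 10.75),
        ],
    )
    print(query_prices(conn, "2024-01-01", "2024-01-31", ["Widget"]))
```

A value such as `2024-01-01'; DROP TABLE prices; --` fails the date parse, and a malicious product name is simply compared as a literal string that matches nothing.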
As for the CSV, I would have to say that querying the database per user request, especially if the feature is only used ~10 times weekly, would be much more efficient than the CSV. I see the CSV as overkill because, again, you only have ~10 requests for this information each week; exporting an updated CSV every day would be too much work for so little payoff.
EDIT:
Also, if an attack is a big concern, which really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is possible that an attack could happen even with the best security. That mainly depends on whether the attackers want the information you have.