Database design, question about implementation - SQL

Question regarding my SQL database design for a project I am working on.
I will be receiving data every few seconds and I am going to need to store that data in a database. I am using MySQL for my DBMS. The data needs to be stored with a userid attached to each piece of data. Each instance of the application will only be handling one user's data, but the remote database will be storing all users' data, so I need userids to know whose data is whose.
My idea was to wait until I receive around 50 data packets and build a delimited string of all 50 packets (maybe separated by commas), then push that string to the database along with the userid and store the data like that. My question is: is that a good way to do it? Is there a better way? Is this bad practice? Tips please! =)
I will be receiving a lot of this data: one data packet roughly every second, sometimes faster. Just let me know what you think.
The DBMS will be running on a remote machine, and the application will be running on an Android phone.
Thanks in advance!

I would not suggest concatenating a bunch of values together to send a delimited string to the database. That just creates additional work on the database to parse the string.
Any reasonable framework for interacting with the database will let you create and send batches of SQL statements with different values for the bind variables to the database. That keeps the nice, friendly syntax of the stored procedure or INSERT statement, it keeps the database properly normalized, and it accomplishes the performance goal of minimizing the number of round-trips.
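As a rough sketch of what that can look like on the MySQL side (the table and column names here are assumptions, not something from the question), a single multi-row INSERT sends a whole batch in one round trip while keeping every value in its own column:

    -- Hypothetical table for the incoming packets.
    CREATE TABLE readings (
        reading_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id     INT UNSIGNED    NOT NULL,
        recorded_at DATETIME        NOT NULL,
        value       DECIMAL(10,4)   NOT NULL
    );

    -- One round trip for a whole batch of packets. In application code each
    -- value would be a bind variable; literals are shown only for readability.
    INSERT INTO readings (user_id, recorded_at, value)
    VALUES (42, '2013-05-01 10:00:01', 3.1400),
           (42, '2013-05-01 10:00:02', 3.1500),
           (42, '2013-05-01 10:00:03', 3.1600);

Each packet stays individually queryable, and the database never has to parse a delimited string.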

If the DBMS is running on a good server, and all you do with the data is a simple insert into a reasonably simple table, one insert per second should not be a strain at all. I'd expect it to be hardly measurable.

The question you really have to answer is how much tolerance you have for losing data. A request per second transferring under 1 KB of data isn't much, especially using JSON vs. XML. Then again, battery life is something to keep in mind on mobile devices, so making a request every 5-60 seconds is also doable.
There's no reason you cannot batch your updates to the server.
If you have no tolerance for data loss, you could collect your batch of 50 updates in local storage and upload them; if a failure occurs in transmission, you can resend. In this case, however, I would want to have some record ID that is reasonably guaranteed to be unique, such as a UUID. That way the server can see which records it has already processed and exclude them from reprocessing.
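As a rough sketch of that idea (again, table and column names are assumptions), a client-generated UUID as the primary key makes re-sending a batch after a transmission failure safe:

    CREATE TABLE readings (
        record_uuid CHAR(36)      NOT NULL PRIMARY KEY,  -- generated on the phone
        user_id     INT UNSIGNED  NOT NULL,
        recorded_at DATETIME      NOT NULL,
        value       DECIMAL(10,4) NOT NULL
    );

    -- Rows the server has already processed are skipped instead of duplicated.
    INSERT IGNORE INTO readings (record_uuid, user_id, recorded_at, value)
    VALUES ('0b9f3c2a-6d2e-4e2a-9f00-000000000001', 42, '2013-05-01 10:00:01', 3.1400),
           ('0b9f3c2a-6d2e-4e2a-9f00-000000000002', 42, '2013-05-01 10:00:02', 3.1500);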

I'm going to address the issue of storing it as a delimited string. How do you intend to query this data after it is stored? If you will need to find the data for one value, or even a small group of values, but not the entire string, do not consider storing the data this way: it will give you horrible query performance and will be very painful to write queries for. In general, storing more than one piece of data in a field is a bad thing; it means you need a related table.
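To make the contrast concrete, here is a rough sketch (table and column names are assumptions) of what querying looks like in each design:

    -- Delimited-string design: finding individual values means string surgery,
    -- and no index can help.
    SELECT user_id
    FROM   packet_blobs
    WHERE  FIND_IN_SET('3.14', packet_csv) > 0;

    -- Related-table design: one row per packet, easy to filter and index.
    SELECT user_id, recorded_at, value
    FROM   readings
    WHERE  user_id = 42
      AND  recorded_at >= '2013-05-01'
      AND  recorded_at <  '2013-05-02';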
Also, for what you are doing, if you don't need to do analytical querying of the data, perhaps a NoSQL database would be a better choice than a relational database.

Related

What is the best way to structure this database?

So I am in the process of building a database from my client's data. Each month they create roughly 25 CSVs, which are unique by their topic and attributes, but they all have one thing in common: a registration number.
The registration number is the only common variable across all of these CSVs.
My task is to move all of this into a database, for which I am leaning towards Postgres (if anyone believes NoSQL would be best for this then please shout out!).
The big problem is structuring this within a database. Should I create one table per month that houses all the data, with column 1 being the registration number and columns 2-200 being the attributes? Or should I put all the CSVs into Postgres as they are, and then join them later?
I'm struggling to get my head around how to structure this when there will be monthly updates to every registration, and we don't want to destroy historical data - we want to keep it for future benchmarks.
I hope this makes sense - I welcome all suggestions!
Thank you.
In some ways your question is too broad and asks for an opinion (SQL vs. NoSQL).
However, the gist of the question is whether you should load your data one month at a time or into a well-developed data model. Definitely the latter.
My recommendation is the following.
First, design the data model around how the data needs to be stored in the database, rather than how it is being provided. There may end up being one table per CSV file, but I would be a bit surprised; data often wants to be restructured.
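As one possible shape, and only a sketch (table and column names are assumptions, and your real attributes may deserve properly typed columns of their own), a narrow history table keyed by registration number and month avoids a 200-column table per month:

    CREATE TABLE registrations (
        registration_number text PRIMARY KEY
    );

    CREATE TABLE monthly_attributes (
        registration_number text NOT NULL REFERENCES registrations,
        report_month        date NOT NULL,   -- e.g. the first day of the month
        attribute_name      text NOT NULL,
        attribute_value     text,
        PRIMARY KEY (registration_number, report_month, attribute_name)
    );

Because every new month is just more rows, historical data is kept automatically and benchmarking becomes a date-range query.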
Second, design the archive framework for the CSV files.
You should archive all the incoming files in a nice directory structure with files from each month. This structure should be able to accommodate multiple uploads per month, either for all the files or some of them. Mistakes happen and you want to be sure the input data is available.
Third, COPY (this is the Postgres command) the data into staging tables. This is the beginning of the monthly process.
Fourth, process the data, including doing validation checks, to load it into your data model.
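A rough sketch of steps three and four, assuming the model sketched above (the staging-table name and file path are placeholders):

    -- Load the raw file into a staging table with no constraints.
    CREATE TABLE staging_products (LIKE monthly_attributes);

    COPY staging_products FROM '/archive/2017-03/products.csv' WITH (FORMAT csv, HEADER true);
    -- (or \copy from psql if the file lives on the client machine)

    -- Move only the rows that pass validation into the real model.
    INSERT INTO monthly_attributes (registration_number, report_month, attribute_name, attribute_value)
    SELECT s.registration_number, s.report_month, s.attribute_name, s.attribute_value
    FROM   staging_products s
    WHERE  EXISTS (SELECT 1
                   FROM   registrations r
                   WHERE  r.registration_number = s.registration_number);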
There may be tweaks to this process, based on questions such as:
Does the data need to be available 24/7 even during the upload process?
Does a validation failure in one part of the data prevent uploading any data?
Are SQL checks (referential integrity and check constraints) sufficient for validating the data?
Do you need to be able to "rollback" the system to any particular update?
These are just questions that can guide your implementation. They are not intended to be answered here.

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try and 'cache' it in some way (apart from the fact that CSV might not be the best way to do this anyway).
There is nothing stopping you from making these requests from the web server to the database server. With as little information as this you will not find performance an issue, and even if it did become one as everything grows, there is a lot to be gained on the database side (indexes etc.) that will help you survive the next 100 years in this fashion.
The amount of requests from your users (also extremely small) does not need any special treatment either, so again, a direct query would be best.
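A minimal sketch of the direct-query approach, with table, column, and index names assumed from the three fields you listed:

    CREATE INDEX idx_prices_product_date
        ON daily_prices (product_name, price_date);

    SELECT product_name, price_date, price
    FROM   daily_prices
    WHERE  product_name IN ('Product A', 'Product B')
      AND  price_date BETWEEN '2015-01-01' AND '2015-03-31';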
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify the CSV method. Examples, and why you need not worry:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day, the chance of a 1-in-10,000 failure might seem better for a once-a-day method. These issues should not come up very often, though, and if they do, you should be able to handle them (retry the request, show a message to the user). This is what enormous numbers of websites do, so trust me when I say this will not be an issue. Also, think of what it would mean if your daily update failed - that would be a bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level: use a caching system (not CSV) on the database server, use a caching system on the web server, or fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise.
Your second database (even on your web server) gets all updates from the master and is read-only; you can only query it for data.
Your web server cannot connect to the data warehouse, but can connect to the slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now you don't have a single moment where you have to update the queried database yourself (the master-slave replication will keep it up to date at all times), and there is no chance that queries from the web server put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only input, just ensure that the only field in it is a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue; however, a separate file with just the connection function should do fine in most cases, so that a user can't, say, open your web page in an HTML viewer and see your database connection string.
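If you want belt-and-braces protection anyway, binding the dates as parameters instead of splicing them into the SQL text closes that door completely. A rough sketch using Postgres-style PREPARE syntax and assumed table and column names (the exact mechanism depends on your DBMS and driver):

    PREPARE price_range (date, date) AS
        SELECT product_name, price_date, price
        FROM   daily_prices
        WHERE  price_date BETWEEN $1 AND $2;

    EXECUTE price_range('2015-01-01', '2015-03-31');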
As for the CSV, I would have to say querying the database per user request, especially when the feature is only used ~10 times weekly, would be much more efficient than the CSV. The CSV is overkill: you only have ~10 users attempting to get some information, and exporting an updated CSV every day would be too much work for such little payoff.
EDIT:
Also, if an attack is a big concern (which really depends on the nature of the business, the data being stored, and the visitors you receive), you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is possible that even with the best security an attack could happen. That mainly depends on whether attackers want the information you have.

Updating database with multiple records

We use SQL Server and have a Winforms application. In our product, the records sometimes exceed 50,000 in a single transaction, and we face a performance issue there.
When we have a huge amount of data, we generally split the work into multiple database calls. So in one of our import functions we update the server in batches of 1,000 rows: if we have 5,000 records, then while processing them (in a for loop) we update the first 1,000 rows and then continue processing until we have the next 1,000 rows to update. This performs better, but honestly I feel it is not the best in terms of performance.
But we have seen in other import/export functions that updating the database every 5,000 rows gives better results than every 1,000. So there is a lot of confusion on our side, and the code also does not look the same across our applications.
Can anyone give me an idea what makes this happen? I know you don't have sample data, the database schema, etc., but are there any scenarios that should be taken care of or considered while working with a database? And why do different batch sizes give us good results - is there something we are ignoring? I am not a champ in databases and am more of a programming guy in .NET. I will be happy to hear your suggestions.
Not sure if this is helpful: our data generally contains employee details like payroll information, personal details, accrual benefits, compensation, etc. Data is fed from an Excel file, and we also generate a lot of data in our internal processes. Let me know if you need more information. Thanks!!
The more database callouts you have, the more connection management you will need (open connection, use connection, clean up and close, are we using connection pooling, etc.). You're sending the same amount of data over the wire, but you are opening and closing the taps more often, which brings overhead.
The downside of fewer, larger calls is that the amount of data held in a single transaction is greater.
However, if I may make a suggestion, you might want to consider achieving this in a different way: load all the data into the database as fast as possible (into interim tables where the constraints are deactivated and with transactional management turned off, if possible) and then allow the database to carry out the task of checking and validating the data.
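A rough T-SQL sketch of that approach; the table names, columns, and file path are assumptions rather than anything from your system:

    -- Interim table: no constraints or indexes, so the load is as fast as possible.
    CREATE TABLE dbo.staging_employee_import (
        employee_id  INT           NULL,
        payroll_code VARCHAR(20)   NULL,
        amount       DECIMAL(18,2) NULL
    );

    BULK INSERT dbo.staging_employee_import
    FROM 'C:\imports\employees.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- Let the database validate and apply the changes in one set-based statement.
    UPDATE e
    SET    e.amount = s.amount
    FROM   dbo.employee_payroll AS e
    JOIN   dbo.staging_employee_import AS s
           ON s.employee_id = e.employee_id
    WHERE  s.amount IS NOT NULL;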
Since you are using SQL Server, you can just turn on SQL Profiler, define an appropriate event filter, and watch what happens under different loads.

PostgreSQL: Best way to check for new data in a database I don't control

For an application I am writing, I need to be able to identify when new data is inserted into several tables of a database.
The problem is twofold: this data will be inserted many times per minute into sometimes very large databases (and I need to be sensitive to demand/database-polling issues), and I have no control over the application creating this data (so, as far as I know, I can't use the notify/listen functionality available within Postgres for exactly this kind of task*).
Any suggestion regarding a good strategy would be much appreciated.
*I believe the application controlling this data is using the notify/listen functionality itself, but I haven't a clue how to find out what "channel" it uses, or whether it is even possible to latch on to it externally.
Generally, you need something in the table that you can use to determine newness, and there are a few approaches.
A timestamp column would let you use the date, but you'd still have the application-side issue of storing a date outside of your database, and data that isn't in the database means another realm of data to manage. Yuck.
A tracking table that stores last update/insert timestamps on a per-table basis could give you what you want. You'd want to use a trigger to maintain the last-DML timestamp.
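A rough sketch of that tracking approach (names are assumptions; ON CONFLICT needs Postgres 9.5 or later):

    CREATE TABLE table_activity (
        table_name  text        PRIMARY KEY,
        last_dml_at timestamptz NOT NULL
    );

    CREATE OR REPLACE FUNCTION record_last_dml() RETURNS trigger AS $$
    BEGIN
        INSERT INTO table_activity (table_name, last_dml_at)
        VALUES (TG_TABLE_NAME, now())
        ON CONFLICT (table_name) DO UPDATE SET last_dml_at = now();
        RETURN NULL;   -- AFTER trigger, so the return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    -- One statement-level trigger per table you need to watch
    -- ('orders' is a placeholder table name).
    CREATE TRIGGER trg_orders_last_dml
    AFTER INSERT OR UPDATE ON orders
    FOR EACH STATEMENT EXECUTE PROCEDURE record_last_dml();

Your poller then only has to check the tiny table_activity table instead of scanning the large source tables.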
A solution you don't want to use, for any purpose other than uniqueness, is a serial (integer) id that comes from nextval. The standard/common mistake is to presume serial keys will be contiguous (they're not) or monotonic (they're not).

Is having a copy of SQL data in an application a good idea to save SQL SELECTS?

I am working on a multithreaded .NET 4 application which acquires data continuously and writes it into a SQL database (MySQL or SQL Server - not yet sure).
Every time an INSERT is executed, at least one prior SELECT is necessary in order to synchronize with the database. This means the application gets a block which contains new and old data and then has to check which data sets are new and which are already in the database.
This means a lot of SELECTs which return more or less the same data every time.
Would it be a good idea to have a copy of the last x entries per table within the application?
This way the synchronization could be done on the copy instead of the database.
Pro:
Faster
Contra:
Uses a lot of memory
Risk of becoming unsynchronized with the database
What do you think? What is the best practice for such a use case?
Any other pros and cons?
Unless you have an external program writing to your database at the same time, you can use buffering.
But instead of buffering SELECT results, just add to the insert method a buffer of the last X (a reasonable number) insertion requests, and only insert the new one if it isn't on that list.
You might also want to lock the insertion method, to make sure the inclusion check is always correct.
If you have multiple processes writing to the database, it is non-trivial to maintain perfect synchronization between in-memory data and the database. In fact, the only way to confirm you are synchronized is by making a SELECT query against the database. So you have a trade-off between perfect synchronization and synchronization with some tolerance, which is very efficient.
My suggestions, which may help in both cases, would be:
Tune your SELECT queries. Add indexes if necessary.
Create metadata, like version numbers, so that you only have to check something very trivial to determine whether you need to synchronize.
Write a stored proc which implements your SELECT and INSERT logic. Then you do not have to worry about making multiple calls to the database.
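As a rough MySQL-flavored sketch (the question leaves the DBMS open, and the table, column, and procedure names are assumptions), folding the existence check and the insert into a single call looks like this:

    DELIMITER //
    CREATE PROCEDURE insert_if_new (
        IN p_device_id   INT,
        IN p_recorded_at DATETIME,
        IN p_value       DECIMAL(10,4)
    )
    BEGIN
        -- The prior SELECT and the INSERT happen in one round trip.
        IF NOT EXISTS (SELECT 1 FROM measurements
                       WHERE device_id = p_device_id
                         AND recorded_at = p_recorded_at) THEN
            INSERT INTO measurements (device_id, recorded_at, value)
            VALUES (p_device_id, p_recorded_at, p_value);
        END IF;
    END //
    DELIMITER ;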