How many rows should we retrieve from SQL Server?

Right now, I need a query to send mail to all the registered users for some task. For this I need their user details and some data about which I will mail them.
My question is: do I use a join to retrieve all the data from SQL Server, structure it in Node.js and send the mail, or do I first retrieve the data that needs to be mailed today and then, for each post, retrieve the relevant people to mail to and send the mail?
PS: The database server is across the network.

Joining the tables is clearly more efficient and easier, since you need only one query.
If you first get the data, how can you find the corresponding people later? Querying each person one by one will create a lot of queries. You can try to optimize with an IN clause like WHERE person_id IN (45, 77, 12, 23, 124, ...), but building this list is tedious and the length of IN clauses is limited, so you would have to split the list and combine the parts with OR, or issue several queries. Why do this manually, if the JOIN does it for you?
Each round trip to the server is time-consuming. On the other hand, the DB server can handle joins in a very optimized way.
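As a rough sketch, assuming hypothetical table and column names (posts, subscriptions, users are not from your schema), one SQL Server query can return everything the mailer needs in a single round trip:

SELECT u.email, u.display_name, p.post_id, p.subject, p.body
FROM posts p
JOIN subscriptions s ON s.post_id = p.post_id      -- whatever links a post to its recipients
JOIN users u ON u.user_id = s.user_id
WHERE p.send_date = CAST(GETDATE() AS date);       -- only today's mailings

Node.js then only has to group the rows by post_id and hand each group to the mail sender.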

It depends on the data.
If you are using the same post/content for all users, then the first approach is good.
But scenarios can turn complex as your system grows. Think about tomorrow: you might need a scheduler that reads user details and their preferences (weekly digest, daily mail, etc.) and categories per post. In such cases, the second option is good.
You are probably not going to send a million mails in one shot. You can use a scheduler that runs at a predetermined time and sends the mail; maybe another scheduler re-runs only the failed jobs, and so on.

Related

Extending a set of existing tables into a dynamic client-defined structure?

We have an old repair database that has a lot of relational tables, and it works as it should, but I need to update it to handle different clients (areas); currently it is single-client only.
So I need to extend the tables and the SQL statements so that, for example, I can log in as user A and he will see his own system only, and user B will have his own system too.
Is it correctly understood that you wouldn't create new tables for each client, but just add a ClientID to every record in every (base) table and then filter on that ClientID in all SQL statements to support multiple clients?
Is this also something that would work (and how is it done) on hosted solutions? I am worried about performance; is that an issue if, say, I had 500 clients (I won't, but from a theoretical viewpoint)?
The normal situation is to add a client key to each table where appropriate. Many tables don't need them -- such as reference tables.
This is preferred for many reasons:
You have the data for all clients in one place, so you can readily answer a question such as "what is the average X for each client".
If you change the data structure, then it affects all clients at the same time.
Your backup and restore strategy is only implemented once.
Your optimization is only implemented once.
This is not always the best solution. You might have requirements that specify that data must be separated -- in which case, each client should be in a separate database. However, indexes on the additional keys are probably a minor consideration and you shouldn't worry about it.
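A minimal sketch of the shared-database approach, with made-up table and column names:

ALTER TABLE Repairs ADD ClientID int NOT NULL DEFAULT 0;
CREATE INDEX IX_Repairs_ClientID ON Repairs (ClientID);

SELECT RepairID, Status, CreatedAt
FROM Repairs
WHERE ClientID = @ClientID;   -- @ClientID taken from the logged-in user's session

With ClientID indexed (or added as the leading column of existing indexes), 500 clients is not a performance problem in itself.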
This question has been asked before. The problem with adding the key to every table is that you say you have a working system, and this means every query needs to be updated.
Probably the easiest option is to create a new database for each client, so that the only thing you need to change is the connection string. This also means you can get automated query tools, for example, to work without worrying about cross-client data leakage.
It also allows you to back up, transfer, or delete a single client easily.
There are of course pros and cons to this approach, but it will simplify the development effort. Also remember that if you plan to spin it up in a cloud environment then spinning up databases like this is also very easy.

Google BigQuery: Stop running query

I have run a query on Google BigQuery several hours ago, and the query is still running. I clicked "abandon", but it appears there is no way to stop a query. What can I do? Can I contact Google somehow, so they stop the query?
I've been working on a project for a company which analyzes Google Analytics data with BigQuery, so I don't want to run up a big bill for them or anything.
(Maybe StackOverflow is not the right place to ask this question, but I've tried to find another place, and I couldn't. On the BigQuery support page, it is said that questions should be asked here, with the google-bigquery tag, so I'm doing that).
I've written a query (which I don't want to paste or describe here, as someone might abuse it to block BigQuery or something, I don't know). Let's just say it includes inner joins. After I wrote it, and before running it, the console message was something like "This will analyze 674KB of data", which looked OK, given that the table only has 10,000 rows. I got the same message after clicking "abandon query", something like "You can abandon this, but you will still be billed for 674KB of data".
I try very hard to make sure what I do doesn't cause problems for anyone, so I've actually run that query on a local PostgreSQL database (with the exact same 10,000 rows as in BigQuery), and the query there finishes in a second or two.
How can I cancel this query, and can I (the company I've worked for) be billed for something more than 674KB of data?
For the time being, there is no way to stop a BigQuery job once it has started, either via the web interface or via API calls.
According to this, this feature may be added in the future.
As BigQuery shards the query across multiple machines, even a large query (terabyte level) will not have a large impact on an individual machine, let alone a query over 674KB. However, according to this, that is the amount you will be charged.
Here are some tips to save money in BigQuery.
The first thing to know is that, unlike a traditional RDBMS, BigQuery is column-based, and you are charged by the amount of data in the columns you reference rather than by rows.
That means: don't include columns that you do not need in the query. This may sound trivial, but sometimes people coming from an RDBMS write queries like this:
SELECT
COUNT(*), user_id
FROM
[Dataset.Table]
GROUP BY
user_id
The query is absolutely correct, but instead of being charged only for the size of the user_id column, Google would actually bill the whole table for this query. Therefore it's a good idea to explicitly specify the column names.
Break the tables into smaller chunks. Instead of having a single table that contains all the data, it's a good idea to split the table by date and use table wildcard functions to stitch the shards together at query time. In this case, you won't be billed for rows that you don't need.
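In current BigQuery standard SQL this is done with wildcard tables rather than the legacy wildcard functions; a sketch with made-up project, dataset, and table names:

SELECT user_id, event_type
FROM `my_project.my_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131';   -- only the January shards are scanned

Only the matched shards, and only the two referenced columns, count toward the bytes billed.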
BigQuery supports canceling query jobs.
You can do this via the bq command line utility:
bq cancel <job_id>
or from the API via the jobs.cancel method (documented here)

Calculate inbox based on dynamic groups

I am facing this problem when calculating the inbox for a user:
On one hand I have a bunch of documents that can potentially have many readers (DOCS table).
Each reader belongs to one or more defined groups of users.
I have a table DOC_ACCESS_BY_GROUP with (DOC_ID, GROUP_ID).
I need to know if a user has read a document or not. So, I have a table DOC_UNREAD with (DOC_ID, USER_ID) so that if a document is in that table, the user has not read the document yet.
Each group's membership can change at any time, so I need to calculate my "inbox" for a given user in real time.
The first guess is: calculate all the groups the user is involved in, then join all the DOCS with the DOC_ACCESS_BY_GROUP table to get all the documents for that user (with the associated data), and then do another join to see whether each document has been read by the user or not.
The problem is that when my DOCS table grows considerably and I have many users and many groups, the performance is really poor.
I'm trying to abstract the problem, which is actually a bit more complex. The possibility of storing document permissions per user has been discarded. I also imagine this is not a problem that can be solved by optimizing the SQL query alone, but rather something to be handled in software. We also support several databases, such as MySQL, Postgres, and MSSQL, so the solution cannot be tied to a specific vendor (I guess).
So, the question is: Does anyone know any mechanism or framework or algorithm to do things differently and solve this problem, in an optimal and performant way?
Memcached? Infinispan? Hadoop?
You probably want to "materialize" the inbox and update it every time the user reads something, the membership of a group changes etc. The materialized inbox could be stored either in a DB table or in a separate system like Infinispan/memcached.
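A minimal sketch of such a materialized inbox, reusing the table names from the question plus an assumed USER_GROUP membership table (values written as :name stand for bound parameters):

CREATE TABLE INBOX (
    USER_ID  INT NOT NULL,
    DOC_ID   INT NOT NULL,
    IS_READ  SMALLINT NOT NULL DEFAULT 0,
    PRIMARY KEY (USER_ID, DOC_ID)
);

-- when a document is published (or a user joins a group), fan the rows out once:
INSERT INTO INBOX (USER_ID, DOC_ID, IS_READ)
SELECT DISTINCT ug.USER_ID, ag.DOC_ID, 0
FROM DOC_ACCESS_BY_GROUP ag
JOIN USER_GROUP ug ON ug.GROUP_ID = ag.GROUP_ID
WHERE ag.DOC_ID = :new_doc_id;

-- reading the inbox then becomes a cheap single-table lookup:
SELECT DOC_ID FROM INBOX WHERE USER_ID = :user_id AND IS_READ = 0;

The cost moves from read time to write time (publish, membership change, mark-as-read), which is usually the better trade-off for an inbox.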

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse push (via SFTP) a CSV file containing all the data (currently 6,718 rows, growing by four each day) to the web server once a day. The web server would then read the data from the file and display it whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from doing these requests from the web server to the database server. With as little data as this, performance will not be an issue, and even if it were to become one as everything grows, there is a lot to be gained on the database side (indexes etc.) that will keep this approach viable for the next 100 years.
The number of requests from your users (also extremely small) does not need any special treatment either, so again, a direct query would be best.
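For scale, the per-request query is about as simple as queries get (table and column names assumed from your description, parameters bound by the web server):

SELECT ProductName, PriceDate, Price
FROM DailyPrices
WHERE PriceDate BETWEEN @FromDate AND @ToDate
  AND ProductName IN (@Product1, @Product2);

With an index on (ProductName, PriceDate) this stays fast long after 6,718 rows.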
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify the CSV method. Examples, and why you need not worry:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day the chance of a 1-in-10,000 failure might seem to favor the once-a-day method. However, these issues should not come up very often, and if they do, you should be able to handle them (retry the request, show a message to the user). This is what enormous numbers of websites do, so trust me when I say this will not be an issue. Also, think about what it would mean if your daily update failed: that would be the bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level: use a caching system (not CSV) on the database server, use a caching system on the web server, and fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise
your second database (even on your web server) gets all updates from the master and is read-only; you can only query it for data
your web server cannot connect to the data warehouse, but can connect to the slave to read information. Even if there were an injection attack, it wouldn't matter, as the slave is read-only.
Now there is no single moment where you have to update the queried database (the master-slave replication will keep it up to date at all times), and there is no chance that queries from the web server put your warehouse in danger. Profit!
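If you want belt and braces on top of the read-only slave, you can also restrict the account the web server uses; a MySQL-flavored sketch with made-up names:

CREATE USER 'web_reader'@'%' IDENTIFIED BY '...';
GRANT SELECT ON warehouse.* TO 'web_reader'@'%';   -- no INSERT/UPDATE/DELETE/DDL

Even a successful injection through this account could then only read data it was already allowed to read.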
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only form, just ensure that the only field in it is a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue, but a separate file containing just the connection function should do fine in most cases, so that a user can't, say, open your web page in an HTML viewer and see your database connection string.
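As a concrete illustration of "only a date can go in", the usual technique is a parameterized query (a dialect-neutral sketch; the ? markers and the DailyPrices name are assumptions, and the exact placeholder syntax depends on your driver):

-- the two values arrive as bound DATE parameters, never spliced into the SQL text
SELECT ProductName, PriceDate, Price
FROM DailyPrices
WHERE PriceDate BETWEEN ? AND ?;

Because the placeholders are typed as dates by the driver, input like '; DROP TABLE DailyPrices; fails to bind instead of executing.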
As for the CSV, I would say that querying the database per user request, especially if the feature is only used about 10 times a week, is much more sensible than the CSV. The CSV is overkill: with only ~10 users attempting to get some information, exporting an updated CSV every day is too much work for such a small payoff.
EDIT:
Also, if an attack is a big concern, which really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but even with the best security an attack could happen; that mainly depends on whether the attackers want the information you have.

Database design: question about implementation

A question regarding my SQL database design for a project I am working on.
I will be receiving data every few seconds and I need to store that data in a database. I am using MySQL as my DBMS. The data needs to be stored with a user ID attached to each piece of data. I will only be handling one user per application, so each instance of the application handles only one user's data, but the remote database stores all users' data; that is why I need user IDs to know whose data is whose.
My idea was to wait until I receive around 50 data packets and build a delimited string of all 50 packets (maybe separated by commas), then push that string to the database along with the user ID and store the data like that. My question is: is that a good way to do it? Is there a better way? Is this bad practice? Tips please! =)
I will be receiving a lot of this data: one data packet about every second, sometimes faster. Just let me know what you think.
The DBMS will be running on a remote machine; the application will be running on an Android phone.
Thanks in advance!
I would not suggest concatenating a bunch of values together to send a delimited string to the database. That just creates additional work on the database to parse the string.
Any reasonable framework for interacting with the database will let you create and send batches of SQL statements with different values for the bind variables to the database. That keeps the nice, friendly syntax of the stored procedure or INSERT statement, it keeps the database properly normalized, and it accomplishes the performance goal of minimizing the number of round-trips.
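For example, in MySQL a single round trip can carry many rows as one multi-row INSERT (table and column names are assumptions; ? marks a bind variable):

INSERT INTO sensor_readings (user_id, recorded_at, payload)
VALUES
    (?, ?, ?),
    (?, ?, ?),
    (?, ?, ?);   -- one tuple of bind variables per packet, e.g. 50 tuples per batch

Each row stays individually queryable, yet the client pays for only one network round trip per batch.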
If the DBMS is running on a decent server, and all you do with the data is a simple insert into a reasonably simple table, one insert per second should not be a strain at all. I'd expect it to be hardly measurable.
The question you really have to answer is how much tolerance you have for losing data. A request per second transferring under 1 KB of data isn't much, especially using JSON vs. XML. Then again, battery life is something to keep in mind on mobile devices, so making a request every 5-60 seconds is also doable.
There's no reason you cannot batch your updates to the server.
If you have no tolerance for data loss, you could collect your batch of 50 updates on local storage, and upload them. If a failure occurs in transmission you can resend. In this case, however, I would want to have some record ID that's reasonably guaranteed to be unique, such as a UUID. This way the server can see which records it's already processed and exclude them from reprocessing.
I'm going to address the issue of storing the data as a delimited string. How do you intend to query this data after it is stored? If you will need to find the data for one value, or even a small group of values, but not the entire string, do not store the data this way: it will give you horrible query performance and will be very painful to write queries for. In general, storing more than one piece of data in a field is a bad thing; it means you need a related table.
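A minimal sketch of that related table, with illustrative names:

CREATE TABLE readings (
    reading_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id     INT NOT NULL,
    recorded_at DATETIME NOT NULL,
    value       DOUBLE NOT NULL,
    KEY idx_user_time (user_id, recorded_at)
);

One row per packet, keyed by user and time, makes "give me user 7's readings for yesterday" a trivial indexed query; a comma-delimited blob makes it a string-parsing exercise.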
Also, for what you are doing, if you don't need to do analytical querying of the data, perhaps a NoSQL database would be a better choice than a relational database.