As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I need to test a number-crunching, analytics-intensive product, so ideally I would use a copy of the production database and test it left and right for numbers not "adding up".
The problem is, I don't have a production database available. And I can't just populate tables with random numbers and strings, because of the inherent business logic of the data itself.
Any ideas? Any tools that could be at least partially useful? Some secret plugins for Excel combined with clear head and functioning brain?
Note that I'm talking about millions of records. I would settle for thousands to be frank, but I don't think I can realistically test the thing with less than that.
Testing with millions of rows doesn't work. If a test fails, a human brain needs to be able to see why it fails. You should settle for dozens or hundreds of records - don't just feed random data to test cases.
When you want to test a method that sums numbers, try feeding it 1-5 numbers. Feeding it millions will probably not give you any useful information: the sum of 2+2+0+0+0+0+0+0+...+0 is exactly the same as the sum of 2+2.
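The point above can be sketched as a handful of tiny, hand-checkable test cases (the `total`/`average` function names here are hypothetical stand-ins for whatever your product computes):

```python
# Hypothetical functions under test -- stand-ins for your analytics code.
def total(numbers):
    return sum(numbers)

def average(numbers):
    return sum(numbers) / len(numbers)

# Tiny, hand-verifiable cases instead of millions of random rows.
assert total([]) == 0                  # empty input
assert total([2, 2]) == 4              # two values
assert average([2, 2]) == 2.0          # average of two values
assert average([2, 2, 0, 0]) == 1.0    # zeros DO change the average

# A million zeros adds nothing to the sum that the 4-element case
# didn't already tell you -- and if it fails, good luck debugging it.
assert total([2, 2] + [0] * 1_000_000) == total([2, 2])
```

When one of these fails, the failing input fits on one line of the test report, so a human brain can see immediately why.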
So you need to look at your code, determine each feature you want to test, and then write a test for each individual feature. Avoid "I'll just dump tons of data through the code; maybe something will happen."
Along the same lines, you should have user stories that explain in detail what you should test and how it can be tested (i.e. the full input and the expected results). If someone tells you "this can't be tested", they are omitting a vital piece of information: "... without a lot of effort" (well, unless your product behaves perfectly randomly).
You could try something like Datanamic's Data Generator. It sells itself on being able to 'Generate meaningful test data'. It's not cheap but the trial period may help you achieve what you want.
Closed 9 years ago.
Generally speaking, does faster code use fewer system resources?
If yes, should I assume that reading a file from the file system in 0.02 sec is lighter (better?) than a database query that takes 0.03 sec?
Please note:
The speed is not a concern here, I'm just talking about the system resources like memory, etc.
Also it's a general question, I'm not comparing file system vs. database. That's just an example.
I'm aware that I need to do different benchmarks or profiling in my code to find the precise answer, but as I said above, I'm curious to see if it's generally true or not.
I used to do speed benchmarks in my project to decide on the better solution, but I never thought I might need to benchmark memory usage, for example. I did it a few times, but never seriously. That's why I'm asking this question; I hope it makes sense.
That depends on why the code is faster.
One common way to optimise for speed, is to use a lot of some other resource.
With your example, both the database and the file system use RAM to cache data. It's quite likely that the database would actually be faster, because it uses a lot more RAM to cache data.
So, often faster code uses more resources.
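A minimal illustration of that trade-off (using Python's standard `functools.lru_cache`): memoization makes the code dramatically faster precisely by spending another resource, memory.

```python
from functools import lru_cache

# Uncached: recomputes every subproblem. Slow, but uses almost no
# extra memory beyond the call stack.
def fib_slow(n):
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

# Cached: far faster, but keeps every computed result in memory --
# a textbook case of buying speed with another resource.
@lru_cache(maxsize=None)
def fib_fast(n):
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)

assert fib_slow(20) == fib_fast(20) == 6765
# The cache now holds one entry for each n from 0 to 20.
assert fib_fast.cache_info().currsize == 21
```

The same pattern appears at every scale: a database's page cache, a CDN, a lookup table; all trade RAM or disk for time.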
This is a very broad topic for discussion.
What does "faster code" mean? If the code is all statically bound at compile time, you naturally get faster code; that is the baseline for a structured language like C. But once you enter object-oriented programming, static binding alone doesn't give you object-oriented behavior, so you need classes and objects, which naturally use more system resources (more CPU cycles and memory) for run-time binding. Comparing C and Java: yes, C is definitely faster than Java to some extent. If you run a single "hello world" program in C and in Java, you can see that C takes fewer resources, i.e. fewer CPU cycles and less memory. But the cost may be reusability, maintainability, and extensibility.
Closed 10 years ago.
I'm creating a forum (for fun) in PHP and I want to show the number of posts and topics per forum. Is it better to add a column to each forum row in the database holding the number of threads/topics, updated whenever someone creates or deletes a topic, OR to count the topics/threads every time the forum page is loaded? What is common practice in this case?
Counting records is the only reliable way to do this; if you store counts then you will have a concurrency issue to address; give the database a chance and only fix it if it becomes a real problem. My experience is that COUNT(*) can be surprisingly quick.
I've got a table with 1.2M records (with the right indexes); I just tried select count(*) from table_name where field=11; it takes 0.02 seconds and returns 104, and counting 500k records takes 0.15 seconds. This is using MySQL on a fairly low-spec VPS.
The key thing is to do some performance tests and to only optimize away from the easiest most reliable solution when there is a genuine performance issue.
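The kind of throwaway performance test the last paragraph recommends can be done in a few lines; here is a sketch using Python's built-in sqlite3 (table and column names are made up, and the timing will of course vary by machine and engine):

```python
import sqlite3
import time

# Build a disposable table: 100k topics spread over 50 forums,
# with an index on the column we filter by.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (id INTEGER PRIMARY KEY, forum_id INTEGER)")
conn.executemany("INSERT INTO topics (forum_id) VALUES (?)",
                 [(i % 50,) for i in range(100_000)])
conn.execute("CREATE INDEX idx_forum ON topics (forum_id)")

# Time the COUNT(*) that the forum page would run on every load.
start = time.perf_counter()
(count,) = conn.execute(
    "SELECT COUNT(*) FROM topics WHERE forum_id = 11").fetchone()
elapsed = time.perf_counter() - start

assert count == 2000   # 100,000 rows over 50 forums = 2,000 each
print(f"Indexed COUNT(*) over 100k rows took {elapsed:.4f}s")
```

If the measured time is negligible at realistic data volumes, there is no performance issue to optimize away from.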
Database normalisation rules say that you shouldn't store any value in your database that can be programmatically derived from other values in your database. So if you're all about "proper engineering", you should count topics every time they need to be shown.
Under normal circumstances, keeping a counter value will not increase performance by much. It should be possible to write your queries in such a way that the performance hit of recalculating those values whenever they need to be displayed would be negligible, if at all noticeable. As indicated in other posts, COUNT, even with a condition specified, can be very fast. Remember to use indices where necessary.
In the end, you'll have to decide. Depending on the required usage, keeping your database normalised may impede performance enough to warrant paying the price of denormalising and adding a counter somewhere. However, denormalising your database should always be a last resort.
I don't think these two options are mutually exclusive. If you don't update the count whenever someone creates or deletes a thread, which count will you display on your forum? So on every update (create/delete), increment or decrement the counter; then, when displaying, you can always show the stored count.
I would suggest updating the count as new posts are added or deleted. This update should be accurate, so it will need to do some locking. Locking will create some contention on the parent records (threads updating forums, posts updating threads, etc.) Make sure you have indexes defined so this update is fast.
When users browse the forum, don't lock the records at all. I wouldn't worry about dirty reads, because being 100% accurate is not as critical as in, say, accounting software. A forum is a living thing, so it is OK if it looks slightly different between page loads.
Also, you might want to run some queries occasionally to verify the counts are correct.
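A sketch of both halves of that advice (transactional counter updates, plus an occasional verification query) using sqlite3 and hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forums (id INTEGER PRIMARY KEY,
                         topic_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE topics (id INTEGER PRIMARY KEY, forum_id INTEGER);
    INSERT INTO forums (id) VALUES (1);
""")

def add_topic(forum_id):
    # Insert and increment inside ONE transaction, so the stored
    # counter cannot drift from the real row count mid-operation.
    with conn:
        conn.execute("INSERT INTO topics (forum_id) VALUES (?)",
                     (forum_id,))
        conn.execute("UPDATE forums SET topic_count = topic_count + 1 "
                     "WHERE id = ?", (forum_id,))

for _ in range(3):
    add_topic(1)

# Occasional verification: stored counter vs. actual COUNT(*).
stored, actual = conn.execute("""
    SELECT f.topic_count,
           (SELECT COUNT(*) FROM topics t WHERE t.forum_id = f.id)
    FROM forums f WHERE f.id = 1
""").fetchone()
assert stored == actual == 3
```

Running the verification query from a cron job and alerting on any mismatch catches counter drift before users notice it.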
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
The problem is the following. We gather some data in real time, say 100 entries per second. We want to have real-time reports. The reports should present data by hours. All we want to do is to create some sums of incoming data and have some smart indexing so that we can easily serve queries like "give me value2 for featureA = x, and featureB = y, for 2012-01-01 09:00 - 10:00".
To avoid too many I/O operations we aggregate data in memory (which means we sum them up), then flush them to database. Let us say it happens every 10 seconds or so, which is an acceptable latency for our real-time reports.
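That buffer-then-flush scheme can be sketched as follows (class and field names are made up for illustration; the `flushed` list stands in for the actual database writes):

```python
import time
from collections import defaultdict

class Aggregator:
    """Sum incoming entries in memory, keyed by (hour, featureA,
    featureB), and flush the sums to storage every N seconds."""

    def __init__(self, flush_interval=10.0):
        self.buffer = defaultdict(int)
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        self.flushed = []   # stand-in for database inserts

    def record(self, hour, feature_a, feature_b, value):
        self.buffer[(hour, feature_a, feature_b)] += value
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One write per aggregated key instead of one per raw entry.
        self.flushed.extend(self.buffer.items())
        self.buffer.clear()
        self.last_flush = time.monotonic()

agg = Aggregator()
for _ in range(100):
    agg.record("2012-01-01 09:00", "x", "y", 2)
agg.flush()
# 100 raw entries collapsed into a single aggregated write.
assert agg.flushed == [(("2012-01-01 09:00", "x", "y"), 200)]
```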
So basically, in SQL terms, we end up with 20 (or more) tables like this (we could have slightly fewer of them by combining sums, but it doesn't make much difference):
Time, FeatureA, FeatureB, FeatureC, value1, value2, valu3
Time, FeatureA, FeatureD, value4, value5
Time, FeatureC, FeatureE, value6, value7
etc.
(I'm not saying the solution has to be SQL; I only present this to explain the issue at hand.) The Time column is a timestamp (with hour precision), the Feature columns are ids of system entities, and the values are integer counts.
So now the problem arises. Because of the very nature of the data, even after aggregation there are still (too) many inserts into these aggregating tables. This is because some of the data is sparse, which means that for every 100 entries, we have, say, 50 inserts into some of the aggregating tables. I understand that we could go forward by upgrading the hardware, but I feel we could do better with a smarter storage mechanism. For example, we could use an SQL database, but we do not need most of its features (transactions, joins, etc.).
So given this scenario my question is the following. How do you guys deal with real-time reporting of high volume traffic? Google somehow does this for web analytics, so it is possible after all. Any secret weapon here? We are open to any solutions - be it Hadoop & Co, NoSQL, clustering or whatever else.
Aside from splitting the storage requirements for collection and for reporting/analysis, one of the things we used to do is look at how often significant changes to a value occurred, and at how the data would be used.
No idea what your data looks like, but reporting and analysis is usually looking for significant patterns: values going from in tolerance to out of tolerance and vice versa, and particularly oscillation.
Now while it might be laudable to collect an "infinite" amount of data just in case you want to analyse it, when you bump into the finite limits of implementation, choices have to be made.
I did this sort of thing in a manufacturing environment. We had two levels of analysis: one for control, where the granularity was as fine as we could afford, and then, as the data receded into the past, we summarised it for reporting.
I ran into the issues you appear to be having more than a few times, and while the loss of data was bemoaned, the complaints about how much it would cost to keep were much louder.
So I wouldn't look at this issue from simply a technical point of view, but from a practical business one. Start from how much the business believes it can afford, and see how much you can give them for it.
Closed 10 years ago.
Apart from their graphical features, online games should have a fairly simple relational database structure. I am curious: what databases do online games like FarmVille and Mafia Wars use?
Is it practical to use SQL based databases for such programs with such frequent writes ?
If not, how could one store the relational dependence of users in these games?
EDIT: As pointed out, they use NoSQL databases like Couchbase. NoSQL is fast with good concurrency (which is really needed here), but the storage size is much larger (due to the key/value structure).
1. Doesn't it slow down the system (as we need to read large database files from disk)?
2. Aren't we very limited without SQL's JOIN to connect different sets of data?
These databases scale to about 500,000 operations per second, and they're massively distributed. Zynga still uses SQL for logs, but for game data, they presently use code that is substantially the same as Couchbase.
“Zynga’s objective was simple: we needed a database that could keep up with the challenging demands of our games while minimizing our average, fully-loaded cost per database operation – including capital equipment, management costs and developer productivity. We evaluated many NoSQL database technologies but all fell short of our stringent requirements. Our membase development efforts dovetailed with work being done at NorthScale and NHN and we’re delighted to contribute our code to the open source community and to sponsor continuing efforts to maintain and enhance the software.” - Cadir Lee, Chief Technology Officer, Zynga
To answer your edit:
You can decrease storage size by using a non-key/value store like MongoDB. That still has some overhead, but less than trying to maintain a key/value store.
It does not slow down the system terribly, since quite a few NoSQL products are memory-mapped, which means that, unlike a typical SQL database, writes don't go directly to disk but instead into an fsync queue that writes to disk when convenient. NoSQL solutions that are not memory-mapped still have extremely fast read/write speeds; it is normally a trade-off between the two.
As for JOINs, it is a case of arranging your schema in such a manner that you can avoid huge joins. Small joins, say joining a user with his score record, are fine, but aggregated joins will be a problem, and you will need to find other ways around this. There are numerous solutions proposed by the user communities of the various NoSQL products.
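One common arrangement (a sketch, not any particular product's API) is to embed the frequently joined data inside a single document, so that one key lookup replaces the join:

```python
import json

# Relational shape would need a JOIN:
#   users(id, name)  +  scores(user_id, level, points)
# Document shape embeds the score record inside the user document,
# so fetching the key "user:42" returns everything at once.
user_doc = {
    "_id": "user:42",              # hypothetical key naming scheme
    "name": "alice",
    "score": {"level": 7, "points": 13500},   # embedded, not joined
}

# A key/value store would hold this serialized blob under "user:42".
stored = json.dumps(user_doc)
loaded = json.loads(stored)
assert loaded["score"]["points"] == 13500
```

The trade-off is duplication: data embedded in many documents must be updated in many places, which is why this works best for data that is read far more often than it is written.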
The database they use has been reported to be Membase. It's open source, and one of the many nosql databases.
In January 2012, Membase became Couchbase, and you can download it here if you want to give it a try.
Closed 10 years ago.
I would be interested in looking at a list of projects that did and did not do unit testing, and other forms of regression testing, to see how those companies turned out.
All test-infected developers know it saves them time, but it would be interesting to see what correlation there is between code quality/test coverage and business success. Something objective like:
xyz corp, makes operating systems, didn't test, makes $50M
123 corp, makes operating systems, does test, makes $100M
Does anyone know of any studies done?
Microsoft commissioned this internal study not so long ago. It compared teams that did and didn't use TDD. To quote the summary:
Based on the findings of the existing studies, it can be concluded that TDD seems to improve software quality, especially when employed in an industrial context. The findings were not so obvious in the semi-industrial or academic context, but none of those studies reported decreased quality either. The productivity effects of TDD were not very obvious, and the results vary regardless of the context of the study. However, there were indications that TDD does not necessarily decrease developer productivity or extend project lead times: in some cases, significant productivity improvements were achieved with TDD, while only two out of thirteen studies reported decreased productivity. However, in both of those studies the quality was improved.
Yes, pick up a copy of Code Complete or even Rapid Development by Steve McConnell. He cites a number of studies.
Any realistic study would have to include thousands of companies. There are far too many factors other than does/doesn't unit test that affect the bottom line. I doubt Microsoft's profit changes all that much whether they release an amazing OS every year or one that's buggy as hell. Just listing a few companies is anecdotal evidence.
Perl is big on testing and regression testing.
I always associate Unit testing with Agile development (XP in particular); you might find that any link between project success and unit testing is influenced by use of agile as well.
I don't know of any surveys specifically, but I did find this just now:
http://people.engr.ncsu.edu/txie/testingresearchsurvey.htm which has around 30 links to material such as: "Qualitative Methods in Empirical Studies of Software Engineering", C.B. Seaman, IEEE Transactions on Software Engineering, Volume 25, Issue 4, July-Aug. 1999.
Not wanting to sound rude - I assume you've already done a bit of a search online?
I seem to remember that Code Complete might have references to research into unit testing and project success - but I'm not sure.
Another option would be to approach some software testing companies and see if they had any useful data.