I'm an Electrical Test Engineer. My programming experience is in C, mostly for devices with 256B of RAM or less, and I don't have much experience with SQL databases.
We have a database with production data, serial numbers and test results.
When the database was created, no tools were created to retrieve the data.
If we can't retrieve the data, the database may as well not exist.
We have the data and the database exists. I want to create tools to retrieve and interpret the data, and in the future to do statistical analysis on it.
The database has over 500k unique devices, with over 10 million measurements.
My question is: what's the most sensible way to retrieve and display the data?
For instance: a program that loops through every entry and records the data will be complicated to write and will take days to complete.
The program and queries get complicated very fast.
We have Device types, Batch numbers, Serial numbers.
For every DISTINCT (DeviceType)
For Every DISTINCT (Batch number)
COUNT DISTINCT (Serial number) where...
NOT IN User <> 'development'...
AND Testing result <> 'FAIL'...
AND Date between ... and ...
Not to mention the measurement data, as each device may be tested multiple times. It seemed a trivial task, but I'm now overwhelmed by the complexity.
I will create the code and queries myself. What I ask is help finding a strategy.
Ask yourself what questions you want the data to answer. If the data is recorded at the most granular level possible, consider the common ways to group or aggregate it. These might include grouping by device, or location, or something else; each of these dimensions will have a business interpretation.
Writing down the top three business requests should give you a starting point for constructing your extract/analysis strategy.
Next, try and draw together a data model, find out what tables exist, what references and relations they have to one another.
Together, between the questions you want to answer and the table-design, you should then be in a position to start constructing sensible general use queries.
Sometimes you may find that different business questions can be answered using a common view of the data. When you're happy with an extract path, you can write it out in SQL, the common query language, and, if appropriate, create a VIEW from it. This abstracts the problem and makes it more convenient for users to actually get the answers they're looking for.
Your database will provide tools to write and run SQL statements, and you will need to refer to the documentation for your database to figure out how that happens - it's usually similar, but implementations differ across databases.
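The strategy above can be sketched in a few lines. This is only an illustration, assuming Python's built-in sqlite3 module and a made-up single-table schema (test_results and all of its column names are hypothetical; the real database will look different). The point is that the database does the looping: one GROUP BY replaces the nested "for every device type, for every batch" loops, and a view holds the filters that every report shares.

```python
import sqlite3

# Hypothetical schema: the real table and column names will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_results (
    device_type TEXT,
    batch_no    TEXT,
    serial_no   TEXT,
    tested_by   TEXT,
    result      TEXT,
    tested_on   TEXT   -- ISO date string, YYYY-MM-DD
);

-- The view encodes the shared filters once, so every report reuses them.
CREATE VIEW production_passes AS
SELECT device_type, batch_no, serial_no, tested_on
FROM test_results
WHERE tested_by <> 'development'
  AND result <> 'FAIL';
""")

conn.executemany(
    "INSERT INTO test_results VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "B1", "S1", "operator", "PASS", "2020-01-10"),
        ("A", "B1", "S1", "operator", "PASS", "2020-01-11"),  # retest of S1
        ("A", "B1", "S2", "operator", "FAIL", "2020-01-10"),
        ("A", "B2", "S3", "development", "PASS", "2020-01-12"),
        ("B", "B9", "S4", "operator", "PASS", "2020-01-15"),
    ],
)


def passed_devices_per_batch(conn, start, end):
    """Count distinct passing serial numbers per device type and batch."""
    return conn.execute(
        """
        SELECT device_type, batch_no, COUNT(DISTINCT serial_no)
        FROM production_passes
        WHERE tested_on BETWEEN ? AND ?
        GROUP BY device_type, batch_no
        ORDER BY device_type, batch_no
        """,
        (start, end),
    ).fetchall()


rows = passed_devices_per_batch(conn, "2020-01-01", "2020-01-31")
for device_type, batch_no, device_count in rows:
    print(device_type, batch_no, device_count)
```

Because the retest of S1 is counted with COUNT(DISTINCT ...), a device that was tested several times still counts once.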
I have designed my database tables where multiple tables store a value, all of which could be achieved via a query to one table.
My question is would it be considered better practice to never store duplicate data and always query, or to store small values multiple times to reduce the number of queries required?
For context, I am building a Python app that quizzes Korean language questions using SQLAlchemy and SQLite.
I have User, Quiz and Question classes.
The values in question are num_correct, num_wrong with regard to quiz questions.
Basically I have a question table that stores all questions related to quiz by quiz_id. Each question has a column "correct" that stores a boolean telling whether or not that question was answered correctly.
In my "quiz" table, I have columns for num_correct / num_wrong regarding questions answered for that quiz.
In my "user" table, I also have columns for num_correct / num_wrong regarding their total answers correct and wrong for all time.
I realize that to get the values in "quiz" I could query the "questions" table, and to get the values in "user" I could do the same.
In this case (and in general) which would be the preferred strategy considering best practices?
I've tried googling quite a bit, but wording the question is a bit tricky.
The issue of duplicated data is a complicated one in relational databases. If your application is doing data modifications, then duplicated data incurs synchronization issues -- the data needs to be updated in multiple places.
That is bad for a variety of reasons:
Updating a single item of information requires multiple changes.
The multiple changes can get out-of-sync, meaning that queries will not see consistent data.
Changes to the database structure (such as adding new tables) can be rather cumbersome.
Databases do support this capability, via ACID properties, transactions, and triggers. However, they add overhead. In general, such duplication is added out of necessity (i.e. performance) rather than up-front. Hence, there is a strong preference for normalized data models where information is stored only once when updates frequently occur.
On the other hand, some databases are used primarily for querying purposes. These databases are often denormalized, sometimes heavily. For instance, a customer table might contain summaries along many different dimensions, gathering information from dozens of underlying tables.
This not only simplifies queries but it encodes business logic. One major issue with using data is that different people have slightly different definitions of things -- is a one-year customer someone who started 365 days ago? Someone who started on the same day of the year last year? Someone who has been around for 12 months? Standardized analysis tables provide the answer.
Your case seems to fall more into the first situation. You are doing updates and thinking about storing summaries up front. I would discourage you from doing this. Just write the queries you need to summarize the data. In all likelihood, indexes and partitioning will provide all the performance you need.
If you know up front that you will have millions of users taking hundreds of quizzes with dozens of questions, then you might want to think about performance optimizations up front. But for thousands of users taking a handful of quizzes with a few dozen questions, start with a simple data model and make it more complicated after you have demonstrated that it works.
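To make "just write the queries" concrete, here is a minimal sketch. It uses the standard-library sqlite3 module rather than SQLAlchemy to stay self-contained, and the table and column names are guesses based on the question, not the asker's actual models. The totals are computed from the question rows on demand, so there is nothing to keep in sync.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quiz (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE question (
    id      INTEGER PRIMARY KEY,
    quiz_id INTEGER NOT NULL REFERENCES quiz(id),
    correct INTEGER NOT NULL   -- 1 means answered correctly, 0 means wrong
);
-- An index on the lookup column keeps the summary queries cheap.
CREATE INDEX idx_question_quiz ON question(quiz_id);
""")
conn.executemany("INSERT INTO quiz VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
conn.executemany(
    "INSERT INTO question (quiz_id, correct) VALUES (?, ?)",
    [(1, 1), (1, 0), (1, 1), (2, 0), (3, 1)],
)


def quiz_totals(conn, quiz_id):
    """Return (num_correct, num_wrong) for one quiz."""
    return conn.execute(
        """
        SELECT COALESCE(SUM(correct), 0), COALESCE(SUM(1 - correct), 0)
        FROM question
        WHERE quiz_id = ?
        """,
        (quiz_id,),
    ).fetchone()


def user_totals(conn, user_id):
    """Return (num_correct, num_wrong) across all quizzes of one user."""
    return conn.execute(
        """
        SELECT COALESCE(SUM(q.correct), 0), COALESCE(SUM(1 - q.correct), 0)
        FROM question AS q
        JOIN quiz AS z ON z.id = q.quiz_id
        WHERE z.user_id = ?
        """,
        (user_id,),
    ).fetchone()


print(quiz_totals(conn, 1))
print(user_totals(conn, 10))
```

If these queries ever become a measured bottleneck, the stored counters can still be added later; going the other way (removing out-of-sync duplicates) is much harder.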
My question is would it be considered better practice to never store duplicate data and always query, or to store small values multiple times to reduce the number of queries required?
I don't see how this reduces the number of queries.
It may affect the complexity of a query, i.e. you'll need to join a few tables together instead of a simple query on one table, but these operations are very fast. I would not worry about speed.
If you duplicate your data it will eventually get out of sync, and then you're in big trouble.
In short, don't duplicate.
Also, this question doesn't really have anything to do with Python.
For testing purposes I require a large number of queries.
Creating them manually is not an option, so I am searching for a tool which will do this automatically.
Sadly, the only solution I found (sqlsmith) is limited to PostgreSQL and SQLite.
Are there any similar tools for SQL Server?
"I do not know from what random place people will want to travel to a random other place, so instead, let's create roads for every possible combination of origin and destination".
That sounds kind of insane, doesn't it? The same applies to what you seem to be wanting to achieve. You basically are hoping to find a tool that generates random queries against your database so you can feed them to the tuning advisor, which will then suggest query optimization indexes for hypothetical queries.
If you want to performance tune your database, you should have a pretty good idea of the type of questions your users will be throwing at it, as well as the structure of your data. Typical questions that will help you get started would be things like:
What is the most common search my users would do against this table?
What criteria are they most likely to use?
Which columns are guaranteed or likely to contain unique data in every row?
Which columns will most likely have a low selectivity of data? (E.g. Male/Female)
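The last two questions can be answered by measuring rather than guessing. The sketch below is an assumption-laden illustration: it uses Python's sqlite3 module as a stand-in for SQL Server, and the person table and the selectivity helper are invented for the example. The same COUNT(DISTINCT ...) / COUNT(*) ratio can be computed on any SQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, email TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO person (email, gender) VALUES (?, ?)",
    [(f"user{i}@example.com", "M" if i % 2 else "F") for i in range(100)],
)


def selectivity(conn, table, column):
    """Distinct values divided by total rows; close to 1.0 means unique."""
    # Identifiers cannot be bound as parameters, so only pass trusted names.
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    ).fetchone()
    return distinct / total if total else 0.0


for column in ("email", "gender"):
    print(column, selectivity(conn, "person", column))
```

A column with a ratio near 1.0 is a natural index candidate for lookups; a column with a ratio near 0 rarely helps on its own.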
Are you looking to generate random data for multiple tables? We generally use the Redgate data generator tool for that.
For SQL tuning purposes I would suggest:
https://www.brentozar.com/blitzindex/
http://www.nguyenlamminhdieu.com/zone/213/news/vi-VN/zone/213/news/351-database-engine-tuning-advisor-in-sql-server.aspx
I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL (about 1500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient; it's mostly the width of the query. Managing 200 columns is a challenge in itself.
I deal with queries this length daily and here is some of what helps me out in maintaining them:
First, alias every single one of those columns. When you are building it you may know where each one came from, but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put each column on a separate line and do the same for where clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require push back on the requirements): should you have 3 lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the latter, what are the criteria for choosing the record you want to keep?
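The "why did that go from 1000 records to 20000" check and the one-to-many choices can be shown on a toy example. This is a sketch only, using Python's sqlite3 module instead of Oracle or SQL Server; the orders and order_notes tables and the count_rows helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE order_notes (order_id INTEGER, note TEXT);
INSERT INTO orders VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO order_notes VALUES (1, 'rush'), (1, 'gift'), (1, 'fragile');
""")


def count_rows(conn, sql):
    """Row count of an arbitrary SELECT, for before/after comparisons."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]


# Step 1: the base query, one row per order.
base = "SELECT o.order_id FROM orders AS o"

# Step 2: adding a one-to-many join multiplies the rows.
joined = base + " LEFT JOIN order_notes AS n ON n.order_id = o.order_id"

# An inner join would additionally drop orders that have no notes.
inner = base + " JOIN order_notes AS n ON n.order_id = o.order_id"

# Step 3: aggregate the child rows back to one line per order.
collapsed = """
SELECT o.order_id, o.customer, GROUP_CONCAT(n.note, ', ') AS notes
FROM orders AS o
LEFT JOIN order_notes AS n ON n.order_id = o.order_id
GROUP BY o.order_id, o.customer
"""

for label, sql in [("base", base), ("left join", joined),
                   ("inner join", inner), ("collapsed", collapsed)]:
    print(label, count_rows(conn, sql))
```

Comparing the counts after each added join is a cheap way to notice an unintended fan-out before the wrong totals reach a report.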
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (views whose results are stored and refreshed rather than recomputed on every query) to break up the heavier parts of your process.
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools but this problem certainly seems like one that might make sense by organizing your data into "cubes" or Business Object "universes". It might be worthwhile to at least entertain the cost of bringing on a BI tool vs. the programming hours to support the current setup.
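The first suggestion, organizing subqueries as views, might look like the following sketch. It is written against SQLite through Python's sqlite3 module purely so it can be run anywhere; the account, payment and ticket tables and the v_* view names are hypothetical, and Oracle syntax will differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (account_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE payment (account_id INTEGER, amount REAL);
CREATE TABLE ticket  (account_id INTEGER, status TEXT);
INSERT INTO account VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO payment VALUES (1, 100.0), (1, 50.0), (2, 20.0);
INSERT INTO ticket  VALUES (1, 'open'), (1, 'closed');

-- Each view owns one block of report columns and can be tested alone.
CREATE VIEW v_payment_cols AS
SELECT account_id, SUM(amount) AS total_paid, COUNT(*) AS payment_count
FROM payment
GROUP BY account_id;

CREATE VIEW v_ticket_cols AS
SELECT account_id,
       SUM(CASE WHEN status = 'open' THEN 1 ELSE 0 END) AS open_tickets
FROM ticket
GROUP BY account_id;

-- The final report only stitches the blocks together.
CREATE VIEW v_report AS
SELECT a.account_id, a.name, p.total_paid, p.payment_count,
       COALESCE(t.open_tickets, 0) AS open_tickets
FROM account AS a
LEFT JOIN v_payment_cols AS p ON p.account_id = a.account_id
LEFT JOIN v_ticket_cols  AS t ON t.account_id = a.account_id;
""")

report = conn.execute("SELECT * FROM v_report ORDER BY account_id").fetchall()
for row in report:
    print(row)
```

With 200 columns, the final query becomes a list of joins to a dozen or so such views, and a bug in one column group can be chased inside a single small view.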
Pardon me if this has already been asked (I know very little about Data Warehouse/BI and have yet to master the keywords).
I have a table that grows by more than 100,000 rows per day, each row having a timestamp and several pieces of information about an item (dimensions, weight, color, etc.). Individual rows are useful for roughly a month; after that period we are only interested in aggregations. I have dedicated software that allows a more detailed visualisation of individual rows, and I mainly use PowerPivot for my reporting needs.
I could come up with an SQL query that would fill a new table daily,
in which I would have a row for each hour/item/batch summarizing the information (sum/average/stddev/etc.).
Within a day my script would be up and running and I could use PowerPivot against this new table. All this while staying where I'm comfortable: plain old SQL.
From the little information I have gathered reading about data warehouses and BI, what I'm about to do sounds a lot like creating dimensions and facts. My question therefore: is it worthwhile to investigate further in that direction (BI), or, since my problem is relatively simple, would I do better staying in a relational database?
N.B. The reports that are produced are usually linked against another database to produce more meaningful information, a task that PowerPivot handles very well.
Data warehouses are normally implemented in relational databases, so your existing skills will still be usable.
Given that you have expressed an interest in the dimension/fact table approach to data warehousing, the canonical books on this approach are usually considered to be:
The Data Warehouse Toolkit (Kimball, Ross)
The Data Warehouse Lifecycle Toolkit (Kimball, Ross, Thornthwaite, Mundy, Becker)
(The former has more of a technical focus, while the latter approaches the subject from a wider lifecycle management viewpoint.)
Implementing DWHs can be time-consuming, so it may be worth continuing with your existing approach even if you decide to build a DWH.
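For reference, the existing approach (a daily job that rolls raw rows up into one row per hour/item/batch) could be as small as the sketch below. It assumes Python's sqlite3 module and invented names (measurement, hourly_summary, summarize_day); it leaves out stddev because SQLite has no built-in for it, whereas most server databases do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (
    measured_at TEXT,    -- timestamp as YYYY-MM-DD HH:MM:SS
    item        TEXT,
    batch       TEXT,
    weight      REAL
);
CREATE TABLE hourly_summary (
    hour_start TEXT, item TEXT, batch TEXT,
    n INTEGER, weight_sum REAL, weight_avg REAL,
    PRIMARY KEY (hour_start, item, batch)
);
""")
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", [
    ("2024-03-01 08:05:00", "bolt", "B1", 10.0),
    ("2024-03-01 08:40:00", "bolt", "B1", 14.0),
    ("2024-03-01 09:10:00", "bolt", "B1", 12.0),
    ("2024-03-02 08:00:00", "bolt", "B1", 99.0),
])


def summarize_day(conn, day):
    """Rebuild the hourly summary rows for one calendar day."""
    with conn:  # one transaction: delete and reload together
        # Deleting first makes the job safe to re-run for the same day.
        conn.execute(
            "DELETE FROM hourly_summary WHERE substr(hour_start, 1, 10) = ?",
            (day,),
        )
        conn.execute(
            """
            INSERT INTO hourly_summary
            SELECT strftime('%Y-%m-%d %H:00:00', measured_at) AS hour_start,
                   item, batch, COUNT(*), SUM(weight), AVG(weight)
            FROM measurement
            WHERE date(measured_at) = ?
            GROUP BY hour_start, item, batch
            """,
            (day,),
        )


summarize_day(conn, "2024-03-01")
summary = conn.execute(
    "SELECT * FROM hourly_summary ORDER BY hour_start"
).fetchall()
for row in summary:
    print(row)
```

In dimensional-modelling terms, hourly_summary is already a small aggregate fact table, with hour, item and batch as its dimensions.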
Good news: it sounds like you already have a data warehouse. "Data warehouse" is a very generic term, with no real formal definition - it pretty much means whatever you want it to.
Commonly accepted characteristics are:
Data warehouses do not run on the operational databases
Data warehouse schemas are optimized for querying, not for "normal form" compliance
Data warehouses are populated by "Extract, Transform, Load" (ETL) processes.
It sounds like you're already doing all of that. If there are no business requirements to change, I'd leave it as it is. If your business users are asking to create their own queries, using different levels of aggregation, filtering, or granularity, a star schema may be the way to go.
The most effective solutions are those which are simple, adequate to meet existing needs, and stay within available skillsets.
I agree that this approach works well for your situation, and if it provides the reports and information you need then it's worth starting this way. If you need more complex functionality later, then you can go for more complex BI.
I've been reading a little about temporary tables in MySQL but I'm an admitted newbie when it comes to databases in general and MySQL in particular. I've looked at some examples and the MySQL documentation on how to create a temporary table, but I'm trying to determine just how temporary tables might benefit my applications and I guess secondly what sorts of issues I can run into. Granted, each situation is different, but I guess what I'm looking for is some general advice on the topic.
I did a little googling but didn't find exactly what I was looking for on the topic. If you have any experience with this, I'd love to hear about it.
Thanks,
Matt
Temporary tables are often valuable when you have a fairly complicated SELECT you want to perform and then perform a bunch of queries on that...
You can do something like:
-- Customers with more than 10 purchased items, computed once
CREATE TEMPORARY TABLE myTopCustomers AS
SELECT customers.*, COUNT(*) AS num
FROM customers
JOIN purchases USING (customerID)
JOIN items USING (itemID)
GROUP BY customers.customerID HAVING num > 10;
And then do a bunch of queries against myTopCustomers without having to do the joins to purchases and items on each query. When your application closes the database connection, MySQL drops the temporary table automatically, so no cleanup needs to be done.
Almost always you'll see temporary tables used for derived tables that were expensive to create.
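The same pattern can be tried out without a MySQL server. The sketch below is an adaptation, not the answer's code: it uses Python's sqlite3 module, a cut-down two-table schema, and a threshold of 2 instead of 10 so the sample data stays tiny (top_customers and the sample rows are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE purchases (customerID INTEGER, itemID INTEGER);
INSERT INTO customers VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Cy', 'Oslo');
INSERT INTO purchases VALUES (1, 7), (1, 8), (1, 9), (2, 7), (3, 7), (3, 8);

-- Pay for the join and the aggregation once.
-- The temporary table is private to this connection.
CREATE TEMPORARY TABLE top_customers AS
SELECT c.customerID, c.name, c.city, COUNT(*) AS num
FROM customers AS c
JOIN purchases AS p ON p.customerID = c.customerID
GROUP BY c.customerID, c.name, c.city
HAVING COUNT(*) >= 2;
""")

# Follow-up queries reuse the derived table without repeating the join.
names = [row[0] for row in conn.execute(
    "SELECT name FROM top_customers ORDER BY name")]
by_city = conn.execute(
    "SELECT city, COUNT(*) FROM top_customers GROUP BY city ORDER BY city"
).fetchall()

print(names)
print(by_city)
conn.close()  # the temporary table disappears with the connection
```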
First a disclaimer - my job is reporting so I wind up with far more complex queries than any normal developer would. If you're writing a simple CRUD (Create Read Update Delete) application (this would be most web applications) then you really don't want to write complex queries, and you are probably doing something wrong if you need to create temporary tables.
That said, I use temporary tables in Postgres for a number of purposes, and most will translate to MySQL. I use them to break up complex queries into a series of individually understandable pieces. I use them for consistency: by generating a complex report through a series of queries, some of which I can offload into modules that I use in multiple places, I can make sure that different reports are consistent with each other. (And make sure that if I need to fix something, I only need to fix it once.) And, rarely, I deliberately use them to force a specific query plan. (Don't try this unless you really understand what you are doing!)
So I think temp tables are great. But that said, it is very important for you to understand that databases generally come in two flavors. The first is optimized for pumping out lots of small transactions, and the other is optimized for pumping out a smaller number of complex reports. The two types need to be tuned differently, and a complex report run on a transactional database runs the risk of blocking transactions (and therefore making web pages not return quickly). Therefore you generally want to avoid using one database for both purposes.
My guess is that you're writing a web application that needs a transactional database. In that case, you shouldn't use temp tables. And if you do need complex reports generated from your transactional data, a recommended best practice is to take regular (e.g. daily) backups, restore them on another machine, then run reports against that machine.
The best place to use temporary tables is when you need to pull a bunch of data from multiple tables, do some work on that data, and then combine everything to one result set.
In MS SQL Server, temporary tables should also be used in place of cursors whenever possible because of the speed and resource impact associated with cursors.
If you are new to databases, there are some good books by Joe Celko that review best practices for ANSI SQL. SQL for Smarties will describe in great detail the use of temp tables, the impact of indexes, where clauses, etc. It's a great reference book with in-depth detail.
I've used them in the past when I needed to create evaluated data. That was before the time of views and subselects in MySQL though, and I generally use those now where I would have needed a temporary table. The only time I might use them is if the evaluated data took a long time to create.
I haven't done them in MySQL, but I've done them on other databases (Oracle, SQL Server, etc).
Among other tasks, temporary tables provide a way for you to create a queryable (and returnable, say from a sproc) dataset that's purpose-built. Let's say you have several tables of figures -- you can use a temporary table to roll those figures up to nice, clean totals (or other math), then join that temp table to others in your schema for final output. (An example of this, in one of my projects, is calculating how many scheduled calls a given sales-related employee must make per week, bi-weekly, monthly, etc.)
I also often use them as a means of "tilting" the data -- turning columns to rows, etc. They're good for advanced data processing -- but only use them when you need to. (My golden rule, as always, applies: If you don't know why you're using x, and you don't know how x works, then you probably shouldn't use it.)
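As a rough idea of what "tilting" looks like, here is a small sketch that turns columns into rows with a temporary table. It is not the answerer's production code: it uses Python's sqlite3 module, and the call_plan table, its columns and the numbers in it are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide layout: one column per period.
CREATE TABLE call_plan (employee TEXT, weekly INTEGER, monthly INTEGER);
INSERT INTO call_plan VALUES ('Ann', 5, 20), ('Bob', 3, 12);

-- Long layout: one row per employee and period, built once as a temp table.
CREATE TEMPORARY TABLE call_plan_long AS
SELECT employee, 'weekly'  AS period, weekly  AS calls FROM call_plan
UNION ALL
SELECT employee, 'monthly' AS period, monthly AS calls FROM call_plan;
""")

long_rows = conn.execute(
    "SELECT employee, period, calls FROM call_plan_long "
    "ORDER BY employee, period"
).fetchall()
for row in long_rows:
    print(row)
```

Once the data is in the long layout, the temp table can be joined or grouped by period like any other table.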
Generally, I wind up using them most in sprocs, where complex data processing is needed. I'd love to give a concrete example, but mine would be in T-SQL (as opposed to MySQL's more standard SQL), and also they're all client/production code which I can't share. I'm sure someone else here on SO will pick up and provide some genuine sample code; this was just to help you get the gist of what problem domain temp tables address.