What do you do with results after you retrieve data with query? - sql

I am sorry about the novice nature of this question, but I could not find a post related to something this elementary. Say I select the names of all customers from Germany in the table "Customers".
SELECT c.customer_name, c.Country
FROM Customers c
WHERE c.Country = 'Germany'
So, I have a nice result with all customers from Germany. So, what's next? Here is one way that I could potentially save data that I selected by creating a new table:
CREATE TABLE name_country AS
SELECT customer_name, Country
FROM Customers
WHERE Country = 'Germany'
and I created a table called "name_country" in the database. Is this the only way of saving the selected names and countries in the database?
Question 2: is there a difference between the code above and the following code?
INSERT INTO name_country
SELECT customer_name, Country
FROM Customers
WHERE Country = 'Germany'

Usually, SQL queries are executed in one of two ways (and for one of two reasons):
1. Manually, by typing the query directly into a SQL client (or phpMyAdmin, etc.), as it appears you have done. This is usually done because there isn't already an easy way to answer whatever question you need answered, or because you're testing a query for correctness before using it in #2. Maybe you want to know whether a specific customer has an account, for example, and you don't expect to need to execute the query again. It's a "one-off".
2. From inside a program written in (virtually) any programming language. All major programming languages have a way to connect to databases and execute SQL queries. You can then iterate through the result in code, where you can show the results in some form on the screen, do some useful computation with them, or both. This can happen in a couple of different "layers". For example, PHP is (unfortunately) a popular server-side language which many people use for the sole purpose of executing a SQL query and returning the result over the web. The PHP script can be triggered by a POST/GET/etc. request from a client's web browser, so you can create a web page that populates with your customers from Germany, for example.
I don't think you actually want to save the result of the query (that somewhat defeats the purpose of using a database, in my opinion: what happens when the data changes?) but rather save the query itself and execute it again whenever you want to see the answer to your question. Usually we "save" queries in a program, as discussed in #2.
Question 2:
There is a bigger difference than performance: CREATE TABLE ... AS creates name_country as part of the statement, while INSERT INTO ... SELECT requires that name_country already exists, with columns matching the SELECT list. The CREATE form fails if the table already exists; the INSERT form fails if it doesn't.
You might be more interested in a VIEW, however, because a view reflects changes to the underlying data, whereas a table created from a query will not change when the data later changes. A program could then read from the VIEW as if it were any other table, as mentioned in #2.
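For example, a minimal sketch of such a view, using the Customers table from the question (the view name here is made up):

```sql
-- A view stores the query, not the data; it is re-evaluated on every read
CREATE VIEW german_customers AS
SELECT customer_name, Country
FROM Customers
WHERE Country = 'Germany';

-- Later, query it like a table; newly added German customers show up automatically
SELECT customer_name FROM german_customers;
```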

Related

Using SQL EXISTS to check a criterion on all databases on a server

I'm using VBA to display, in an Excel sheet, all customers meeting some business-related criteria.
The server I'm using has one database per customer; I have no way around it, that's what it is. Those databases have the same structure, though (same tables, and the same columns per table).
Through coding, I found a way to do this, which works, but I'd like/need to speed it up.
I first get all the customer-related database names through one query, and then execute a query on each result to decide whether I should display it or not.
Since set-based SQL is considered faster than pretty much everything else, I think I could get my program running faster using a single query with an EXISTS clause inside it, and display all results of this single query.
Here is roughly one thing I tried which might be close to what I need, but doesn't work as it is:
SELECT name FROM dbo.sysdatabases
WHERE (name LIKE 'DBC%')
AND EXISTS (SELECT TRUE FROM name.dbo.Purchases WHERE product_category = '012')
I just changed some details so it's not specific to my company, like what comes after LIKE and the condition inside EXISTS; the main point is that I need (do I?) to use name in my EXISTS clause to run through all the relevant databases.
My problem is that SQL tells me that 'name.dbo.Purchases' is not a valid name (whereas, when I loop and send a SQL query with a real name, say DBC01.dbo.Purchases, it works fine). How can I get name to be treated as the value I'm already querying on, as opposed to the literal "name", in a single query?
I'm not 100% sure that SELECT TRUE works fine, but if I need to I'll take a simple field from my table; that's not a primary concern.
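A database name cannot be parameterized inside a single static query, so one common workaround is to generate the per-database EXISTS checks as dynamic SQL. A sketch only, assuming SQL Server and that every DBC% database really contains dbo.Purchases:

```sql
-- Build one query that probes each DBC% database, then run it once
DECLARE @sql nvarchar(max) = N'SELECT name FROM sys.databases WHERE 1 = 0';

SELECT @sql = @sql + N'
UNION ALL SELECT ''' + name + N''' WHERE EXISTS
    (SELECT 1 FROM ' + QUOTENAME(name) + N'.dbo.Purchases
     WHERE product_category = ''012'')'
FROM sys.databases
WHERE name LIKE 'DBC%';

EXEC sp_executesql @sql;  -- returns the names of databases with a match
```

The seed `SELECT name FROM sys.databases WHERE 1 = 0` contributes no rows; it just gives the UNION a typed first branch so each appended `SELECT '<dbname>' WHERE EXISTS (...)` line has something to union onto.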

Should I name tables based on date & time of creation, and use EXEC() and a variable to dynamically refer to these tables? [closed]

TL;DR: My current company creates a new table for every time period, such as sales_yyyymmdd, and uses EXEC() to dynamically refer to table names, making entire queries render as red strings that are hard to read. What changes can I suggest to improve both readability and performance?
Some background: I'm a data analyst (and not a DBA), so my SQL knowledge may be limited. I recently moved to a new company which uses MS SQL Server as its database management system.
The issues: The DAs here share a similar style of writing SQL scripts, which includes:
Naming tables based on their time of creation, e.g. each day's sales records are saved into a new table for that day, such as sales_yyyymmdd. This means there is a huge number of tables like this. Note that the DAs have their own database to tinker with, so they are allowed to create any number of tables there.
Writing queries enclosed in EXEC(), dynamically referring to table names via some variable @date. As a result, their entire scripts are one red string literal, which is difficult for me to read.
They also claim that enclosing queries in EXEC() makes the scripts run to completion when stored as scheduled jobs, because when they write them the "normal way", these jobs sometimes stop mid-way.
My questions:
Regarding naming and creating new tables for every new time period: I suppose this is obviously a bad practice, at least in terms of management, due to the sheer number of tables. I suggested merging them and adding a created_date column, but the DAs here argued that both ways take up the same amount of disk space, so why bother with such a radical change. How do I explain this to them?
Regarding the EXEC() command: My issue with this way of writing queries is that it's hard to maintain and share with other people. My quick fix for now (if issue 1 remains) is to use one single EXEC() command to copy the tables needed into temp tables, then select from these temp tables instead. If new data needs to be merged, I first insert it into temp tables, manipulate it there, and finally merge it into the final, official table. Would this method affect performance at all (as there is an extra step involving temp tables)? And is there any better way that helps with both readability and performance?
I don't have experience scheduling jobs myself, as my previous company had a dedicated data engineering team that took my SQL scripts and automated them on a server. My googling has not yielded any results yet either. Is it true that using EXEC() keeps jobs from being interrupted? If not, what is the actual issue here?
I know that the post is long, and I'm also not a native speaker. I hope I have explained my questions clearly enough, and I appreciate any help/answers.
Thanks everyone, and stay safe!
While I understand the reasons for creating a table for each day, I do not think this is the correct solution.
Modern databases do a very good job of partitioning data, and SQL Server also has this feature. In fact, such use cases are exactly the reason why partitioning was created in the first place. For me that would be the way to go, as:
it's not a WTF solution (your description is easily understandable, but it's still a WTF)
partitioning allows for optimizing partition-restricted queries, particularly time-restricted queries
it is still possible to execute a non-partition-based query, while the solution you showed would require a union, or multiple unions
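A minimal sketch of that approach (SQL Server syntax; the column names are made up, and in practice the boundary list would be generated rather than typed):

```sql
-- One Sales table, physically split by day via a partition function/scheme
CREATE PARTITION FUNCTION pf_sales_day (date)
    AS RANGE RIGHT FOR VALUES ('20200301', '20200302', '20200303');

CREATE PARTITION SCHEME ps_sales_day
    AS PARTITION pf_sales_day ALL TO ([PRIMARY]);

CREATE TABLE Sales
(
    SaleID   int           NOT NULL,
    Amount   decimal(18,2) NOT NULL,
    SaleDate date          NOT NULL
) ON ps_sales_day (SaleDate);

-- Time-restricted queries are eliminated down to the relevant partition(s)
SELECT SUM(Amount) FROM Sales WHERE SaleDate = '20200302';
```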
As everybody mentioned in the comments, you can have a single table, Sales, with an extra column holding the date the data was inserted.
Create table Sales to hold all sales data
CREATE TABLE Sales
(
col1 datatype,
col2 datatype,
...
InsertedDate date --This contains the date the sales data corresponds to
)
Insert all the existing tables' data into the above table
INSERT INTO Sales
SELECT *, '20200301' AS InsertedDate FROM Sales_20200301
UNION ALL
SELECT *, '20200302' AS InsertedDate FROM Sales_20200302
...
UNION ALL
SELECT *, '20200331' AS InsertedDate FROM Sales_20200331
Now, you can replace the EXEC() query built around the variable @date with a direct query. You can easily read the script without it being in red.
DECLARE @date DATE = '20200301'
SELECT col1, col2...
FROM Sales
WHERE InsertedDate = @date
Note:
If the data is huge, you can think of partitioning it based on InsertedDate.
The purpose of a database is not to create tables. It is to use tables. To be honest, this is a nuance that is sometimes hard to explain to DBAs.
First, understand where they are coming from. They want to protect data integrity. They want to be sure that the database is available and that people can use the data they need. They may have been around when the database was designed, and the only envisioned usage was per day. This also makes the data safe when the schema changes (i.e. new columns are added).
Obviously, things have changed. If you were to design the database from scratch, you would probably have a single partitioned table; the partitioning would be by day.
What can you do? You have some options, depending on what you are able to do and what the DBAs need. The most important thing is to communicate the importance of this issue. You are trying to do analysis. You know SQL. Before you can even get started on a problem, you have to deal with the data model, thinking about EXEC()s, date ranges, and a whole host of issues that have nothing to do with the problems you need to solve.
This affects your productivity. And it affects the utility of the database. Both of these are issues that someone should care about.
There are some potential solutions:
1. You can copy all the data into a single table each day, perhaps as a separate job. This is reasonable if the tables are small.
2. You can copy just the latest data into a single table.
3. You can create a view that combines all the daily tables.
4. The DBAs could do any of the above for you.
I obviously don't know the structure of the existing code or how busy the DBAs are. However, (4) does not seem particularly cumbersome, regardless of which solution is chosen.
If you have no available space for a view or copy of the data, I would write SQL generation code that would construct a query like this:
select * from sales_20200101 union all
select * from sales_20200102 union all
. . .
This will be a long string. I would then just start my queries with:
with sales as (
<long string here>
)
<whatever code here>;
Of course, it would be better to have a view (at least) that has all the sales you want.

Iterative union SQL query

I'm working with CA (Broadcom) UIM. I want the most efficient method of pulling distinct values from several views. I have views that start with "V_" for every QOS that exists in the S_QOS_DATA table. I specifically want to pull data for any view that starts with "V_QOS_XENDESKTOP."
The inefficient method that gave me quick results was the following:
1. Run select * from s_qos_data where qos like 'QOS_XENDESKTOP%';
2. Take that data and put it in Excel.
3. Use CONCAT to turn just the QOS names into queries such as:
SELECT DISTINCT samplevalue, 'QOS_XENDESKTOP_SITE_CONTROLLER_STATE' AS qos
FROM V_QOS_XENDESKTOP_SITE_CONTROLLER_STATE union
4. Copy the formula cell down for all rows, remove UNION from the last query, and add a semicolon.
This worked, and I got the output, but there has to be a more elegant solution. Most of the answers I've found about iterating in SQL use numbers or don't seem to be quite what I'm looking for. Examples: Multiple select queries using while loop in a single table? Is it Possible? and Syntax of for-loop in SQL Server
The most efficient method to do what you want to do is to do something like what CA's scripts do (the ones you linked to). That is, use dynamic SQL: create a string containing the SQL you want from system tables, and execute it.
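A sketch of that approach, assuming the views live in a SQL Server database (STRING_AGG needs SQL Server 2017+; the CAST keeps the aggregated string from being truncated at 8,000 bytes):

```sql
-- Generate "SELECT DISTINCT ... FROM <view> UNION ..." over every matching view
DECLARE @sql nvarchar(max);

SELECT @sql = STRING_AGG(
    CAST(N'SELECT DISTINCT samplevalue, '''
         + name + N''' AS qos FROM ' + QUOTENAME(name) AS nvarchar(max)),
    N' UNION ')
FROM sys.views
WHERE name LIKE 'V[_]QOS[_]XENDESKTOP%';  -- [_] escapes _, which LIKE treats as a wildcard

EXEC sp_executesql @sql;
```

This is exactly the Excel CONCAT trick from the question, just done server-side so it never goes stale.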
A more efficient method would be to write a different query based on the underlying tables, mimicking the criteria in the views you care about.
Unless your view definitions are changing frequently, though, I recommend against dynamic SQL. (I doubt they change frequently. You regenerate the views no more often than you get a new script, right? CA isn't adding tables willy-nilly.) AFAICT, dynamic SQL is basically what you're doing already, just with Excel as the generator.
Get yourself a list of the view names, and write your query against a union of them, explicitly. Job done: easy to understand, not much work to modify, and you give the server its best opportunity to optimize.
I can imagine that it's frustrating and error-prone not to be able to put all that work into your own view and query against it at your convenience. It's too bad most organizations don't let users write their own views and procedures (owned by their own accounts, not dbo). The best I can offer is to save what would be the view body to a file, and insert it into a WITH clause in your queries:
WITH V AS (... query ...) SELECT ... FROM V

Why not have a JOINONE keyword in SQL to hint and enforce that each record has at most one match?

I encounter this a lot when writing SQL. I have two tables that are meant to be in a one-to-one relationship with each other, and I wish I could easily assert that fact in my query. For example, the simplified query:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOIN Location ON Person.LocationID = Location.ID
When I read this query I think to myself, well what if the Location table fails to enforce uniqueness on its ID column? Suddenly you could have the same Person multiple times in your resultset. Sure, I can go look at the schema to assure myself it's unique so everything will be okay, but why shouldn't I simply be able to put it right here in my query, a la:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOINONE Location ON Person.LocationID = Location.ID
Not only would a keyword like this (made up "JOINONE") make it 100% clear to a human reading this query that we are guaranteed to get exactly one row for each Person record, but it lets the db engine optimize its execution plan because it knows there won't be more than one match each, even if the foreign key relationship isn't defined in the schema.
Another advantage of this would be that the db engine could enforce it, so if the data actually did have more than one match, an error could be thrown. This happens for subqueries already, e.g.:
SELECT Person.ID, Person.Name
, (
SELECT Location.Address1
FROM Location
WHERE Location.ID = Person.Location
) AS Address1
FROM Person
This is nice and spiffy, 100% clear to the human reader, neatly optimizable, and enforced by the db engine. In fact I often end up doing things this way for all those reasons. The problem is, besides the distracting syntax, you can only select one field this way. (What if I want City, State, and Zip too?) How nice it would be if you could flow this table right along with the rest of your JOINs and select any fields from it you wish in your SELECT clause just like all the rest of your tables.
I couldn't find any other question like this around StackOverflow, though I did find lots of repeats of a close question: people wanting to choose a single record. Close but really quite a different kind of goal, and less meaningful in my opinion.
I'm posting this question to see if there's some mechanism already in the SQL language that I'm missing, or an efficient workaround anyone has come up with. The concept of a one-to-one vs. one-to-many relationship is so fundamental to relational database design, I'm just so surprised at the absence of this language element.
SQL is two languages in one. Constraints, including uniqueness constraints, are set using the data definition language (DDL) in SQL. This is a layer above the data manipulation language (DML), where SELECT statements live, and it's understood that statements issued in the DDL might invalidate statements in the DML.
There's no way for a query to prevent someone from executing an ALTER TABLE command and changing the name of a field that the query refers to between query runs.
And there isn't much of a way for a query to be written defensively against uncertain constraints; it's OK if you need to ask someone for information outside of the database environment to address this. The information may also be available within the environment: in most engines, you get to it by querying the data dictionary. This is the INFORMATION_SCHEMA in MySQL, for instance.
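As a sketch, a data-dictionary query along these lines (standard INFORMATION_SCHEMA views; the table and column names are taken from the question) would tell you whether Location.ID carries the uniqueness guarantee the hypothetical JOINONE would assert:

```sql
-- Returns a row only if Location.ID is covered by a PRIMARY KEY or UNIQUE constraint
SELECT tc.CONSTRAINT_NAME, tc.CONSTRAINT_TYPE
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS kcu
  ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
WHERE kcu.TABLE_NAME  = 'Location'
  AND kcu.COLUMN_NAME = 'ID'
  AND tc.CONSTRAINT_TYPE IN ('PRIMARY KEY', 'UNIQUE');
```

This checks the schema rather than the query, of course; it can't stop someone from dropping the constraint later.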

Can select * usage ever be justified?

I've always preached to my developers that SELECT * is evil and should be avoided like the plague.
Are there any cases where it can be justified?
I'm not talking about COUNT(*) - which most optimizers can figure out.
Edit
I'm talking about production code.
And one great example I saw of this bad practice was a legacy asp application that used select * in a stored procedure, and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.
I'm quite happy using * in audit triggers.
In that case it can actually prove a benefit, because it ensures that if additional columns are added to the base table, the trigger will raise an error, so dealing with the change in the audit trigger and/or audit table structure cannot be forgotten.
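A minimal sketch of the idea (SQL Server; the table and audit-table names are made up, and Orders_Audit is assumed to have exactly Orders' columns plus the two audit columns):

```sql
-- SELECT * here is deliberate: if Orders gains a column that Orders_Audit
-- lacks, this INSERT starts failing, so the audit table cannot silently drift
CREATE TRIGGER trg_Orders_Audit ON Orders
AFTER INSERT, UPDATE
AS
INSERT INTO Orders_Audit
SELECT *, SYSDATETIME(), SUSER_SNAME()
FROM inserted;
```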
(Like dotjoe) I am also happy using it in derived tables and common table expressions, though I habitually do it the other way round:
WITH t
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY a) AS RN
FROM foo)
SELECT a,
b,
c,
RN
FROM t;
I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.
In principle SELECT * ought to be fine in a view as well, as it is the final SELECT from the view where it ought to be avoided. In SQL Server, however, this can cause problems: it stores column metadata for views, which is not automatically updated when the underlying tables change, so the use of * can lead to confusing and incorrect results unless sp_refreshview is run to refresh that metadata.
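A small repro of that pitfall might look like this (SQL Server; throwaway object names):

```sql
CREATE TABLE t (a int);
GO
CREATE VIEW v AS SELECT * FROM t;   -- column metadata for v is captured now
GO
ALTER TABLE t ADD b int;
GO
SELECT * FROM v;                    -- still shows only column a
EXEC sp_refreshview 'v';            -- re-reads t's current definition
SELECT * FROM v;                    -- now shows both a and b
```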
There are many scenarios where SELECT * is the optimal solution. Running ad-hoc queries in Management Studio just to get a sense of the data you're working with. Querying tables where you don't know the column names yet because it's the first time you've worked with a new schema. Building disposable quick'n'dirty tools to do a one-time migration or data export.
I'd agree that in "proper" development, you should avoid it - but there's lots of scenarios where "proper" development isn't necessarily the optimum solution to a business problem. Rules and best practices are great, as long as you know when to break them. :)
I'll use it in production when working with CTEs. But, in this case it's not really select *, because I already specified the columns in the CTE. I just don't want to respecify in the final select.
with t as (
select a, b, c from foo
)
select t.* from t;
None that I can think of, if you are talking about live code.
People saying that it makes adding columns easier to develop (so they automatically get returned and can be used without changing the Stored procedure) have no idea about writing optimal code/sql.
I only ever use it when writing ad-hoc queries that will not get reused (finding out the structure of a table, getting some data when I am not sure what the column names are).
I think using select * in an exists clause is appropriate:
select some_field from some_table
where exists
(select * from related_table [join condition...])
Some people like to use select 1 in this case, but it's not elegant, and it doesn't buy any performance improvements (early optimization strikes again).
In production code, I'd tend to agree 100% with you.
However, I think that the * more than justifies its existence when performing ad-hoc queries.
You've gotten a number of answers to your question, but you seem to be dismissing everything that isn't parroting back what you want to hear. Still, here it is for the third (so far) time: sometimes there is no bottleneck. Sometimes performance is way better than fine. Sometimes the tables are in flux, and amending every SELECT query is just one more bit of possible inconsistency to manage. Sometimes you've got to deliver on an impossible schedule and this is the last thing you need to think about.
If you live in bullet time, sure, type in all the column names. But why stop there? Re-write your app in a schema-less dbms. Hell, write your own dbms in assembly. That'd really show 'em.
And remember if you use select * and you have a join at least one field will be sent twice (the join field). This wastes database resources and network resources for no reason.
As a tool I use it to quickly refresh my memory as to what I can possibly get back from a query. As a production level query itself .. no way.
When creating an application that works directly with the database, like phpMyAdmin, and you are on a page that displays a full table, using SELECT * can be justified, I guess.
About the only thing that I can think of would be when developing a utility or SQL tool application that is being written to run against any database. Even here though, I would tend to query the system tables to get the table structure and then build any necessary query from that.
There was one recent place where my team used SELECT * and I think that it was ok... we have a database that exists as a facade against another database (call it DB_Data), so it is primarily made up of views against the tables in the other database. When we generate the views we actually generate the column lists, but there is one set of views in the DB_Data database that are automatically generated as rows are added to a generic look-up table (this design was in place before I got here). We wrote a DDL trigger so that when a view is created in DB_Data by this process then another view is automatically created in the facade. Since the view is always generated to exactly match the view in DB_Data and is always refreshed and kept in sync, we just used SELECT * for simplicity.
I wouldn't be surprised if most developers went their entire career without having a legitimate use for SELECT * in production code though.
I've used select * to query tables optimized for reading (denormalized, flat data). Very advantageous, since the purpose of the tables was simply to support various views in the application.
How else do the developers of phpmyadmin ensure they are displaying all the fields of your DB tables?
It is conceivable you'd want to design your DB and application so that you can add a column to a table without needing to rewrite your application. If your application at least checks column names it can safely use SELECT * and treat additional columns with some appropriate default action. Sure the app could consult system catalogs (or app-specific catalogs) for column information, but in some circumstances SELECT * is syntactic sugar for doing that.
There are obvious risks to this, however, and adding the required logic to the app to make it reliable could well simply mean replicating the DB's query checks in a less suitable medium. I am not going to speculate on how the costs and benefits trade off in real life.
In practice, I stick to SELECT * in 3 cases (some mentioned in other answers):
As an ad-hoc query, entered in a SQL GUI or command line.
As the contents of an EXISTS predicate.
In an application that deals with generic tables without needing to know what they mean (e.g. a dumper, or a differ).
Yes, but only in situations where the intention is to actually get all the columns from a table not because you want all the columns that a table currently has.
For example, in one system that I worked on we had UDFs (User Defined Fields) where the user could pick the fields they wanted on the report, the order as well as filtering. When building a result set it made more sense to simply "select *" from the temporary tables that I was building instead of having to keep track of which columns were active.
I have several times needed to display data from a table whose column names were unknown. So I did SELECT * and got the column names at run time.
I was handed a legacy app where a table had 200 columns and a view had 300. The risk exposure from SELECT * would have been no worse than from listing all 300 columns explicitly.
Depends on the context of the production software.
If you are writing a simple data access layer for a table management tool where the user will be selecting tables and viewing results in a grid, then it would seem SELECT * is fine.
In other words, if you choose to handle "selection of fields" through some other means (as in automatic or user-specified filters after retrieving the resultset) then it seems just fine.
If on the other hand we are talking about some sort of enterprise software with business rules, a defined schema, etc., then I agree that SELECT * is a bad idea.
EDIT: Oh, and when the source of the SELECT is a view or a stored procedure's resultset, SELECT * should be fine because you're managing the columns through other means (the view's definition or the stored proc's resultset).
Select * in production code is justifiable any time that:
it isn't a performance bottleneck
development time is critical
Why would I want the overhead of going back and having to worry about changing the relevant stored procedures, every time I add a field to the table?
Why would I even want to have to think about whether or not I've selected the right fields, when the vast majority of the time I want most of them anyway, and the vast majority of the few times I don't, something else is the bottleneck?
If I have a specific performance issue then I'll go back and fix that. Otherwise in my environment, it's just premature (and expensive) optimisation that I can do without.
Edit: following the discussion, I guess I'd add to this:
... and where people haven't done other undesirable things, like trying to access columns(i), which could break in other situations anyway :)
I know I'm very late to the party but I'll chip in that I use select * whenever I know that I'll always want all columns regardless of the column names. This may be a rather fringe case but in data warehousing, I might want to stage an entire table from a 3rd party app. My standard process for this is to drop the staging table and run
select *
into staging.aTable
from remotedb.dbo.aTable
Yes, if the schema on the remote table changes, downstream dependencies may throw errors but that's going to happen regardless.
If you want all the columns, in order, you can do the following (at least if you use MySQL):
SHOW COLUMNS FROM mytable FROM mydb; (1)
You can see all the relevant information about your fields. You can prevent problems with types, and you know all the column names for sure. This command is very quick, because it only asks for the structure of the table. From the results you select all the names and build a string like this:
"select " + fieldNames[0] + ", " + fieldNames[1] + ", " + fieldNames[2] + " from mytable" (2)
If you don't want to run two separate MySQL commands (because each round trip is expensive), you can put (1) and (2) into a stored procedure that returns the result, so a single call does all the work and all the data generation happens at the database server.
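A sketch of that stored procedure (MySQL; the database, table, and procedure names are made up), using INFORMATION_SCHEMA instead of parsing SHOW COLUMNS output, and a prepared statement to run the generated query:

```sql
DELIMITER //
CREATE PROCEDURE select_all_columns_ordered()
BEGIN
  -- (1) collect the column names in ordinal order
  SELECT GROUP_CONCAT(COLUMN_NAME ORDER BY ORDINAL_POSITION SEPARATOR ', ')
    INTO @cols
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'mytable';

  -- (2) build and execute the explicit-column SELECT
  SET @qry = CONCAT('SELECT ', @cols, ' FROM mydb.mytable');
  PREPARE stmt FROM @qry;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END //
DELIMITER ;
```

Note that GROUP_CONCAT truncates at group_concat_max_len (1024 bytes by default), so for very wide tables that session variable needs to be raised first.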