Dividing a PL/SQL application into several units - sql

Here's my application workflow.
I have a ref cursor that is populated with all my employee IDs; it's really just an identification number.
But now I want to fetch a lot of information for every employee fetched from that ref cursor. It's not simply stored data, but a lot of computed, derived data too; the sort of derivation that's more easily done with cursors, procedures and so on.
For example, the sum of all the time intervals during which an employee was stationed in Department 78 could be just one of the columns for each employee.
I think I could accomplish this with one really large SQL query (by large I mean really difficult to maintain, understand, optimize, reuse, refactor, etc.), but that isn't something I'd do except as a real last resort.
So I'm trying to find ways to use all of PL/SQL's might to split this into as many separate units (perhaps functions or procedures) as possible, so as to handle it in a simple and elegant way.
I think that some way to merge result sets (ref cursors, probably) would solve my problem. I've looked around on the internet and some things looked promising, namely pipelined table functions, although I'm not really sure that's what I need.
To sum up, what I think I need is some way to compose the resulting ref cursor (a really big table: one column for the ID and about 40 other columns, each with a specific bit of information about that ID's owner) using many procedures, which I can then send back to my server-side app and deal with there (export to Excel, in this case).
I'm at a loss, really. I hope someone with more experience can help me with this.
FA

I'm not sure if this is what you want, or how often you need to run it.
But since it sounds very heavy, maybe you don't need the data to be up to date to the second.
If it's once a day or less, you can create a table with the employee IDs and use separate MERGE statements to calculate the different fields.
Then the application can get the data from that table.
You can have a job that recalculates this every time you need updated data.
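As a rough sketch of what that job could look like (assuming Oracle's DBMS_SCHEDULER, with a hypothetical procedure refresh_emp_report that wraps the truncate/insert/MERGE steps shown below):
begin
  dbms_scheduler.create_job(
    job_name        => 'refresh_emp_report_job',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'refresh_emp_report',   -- hypothetical wrapper procedure
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',  -- recalculate nightly at 02:00
    enabled         => true);
end;
/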
You can read about the MERGE command on Wikipedia, and in the Oracle documentation for the Oracle-specific details. Since you use separate statements, you can of course put them in different procedures if that is convenient.
for example:
begin
  execute immediate 'truncate table temp_table';

  -- seed the work table with one row per employee
  insert into temp_table (emp_id)
  select emp_id from emps;

  -- one MERGE per derived column; repeat for the other ~40 columns
  merge into temp_table a
  using (select emp_id, name from emps) b
  on (a.emp_id = b.emp_id)
  when matched then
    update set a.name = b.name;
end;
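One of those separate MERGE statements could then fill in the derived figure from the question, the total time each employee spent in Department 78. This is only a sketch; emp_dept_history, its columns, and dept78_days are assumed names, not taken from the original post:
merge into temp_table a
using (
    select h.emp_id,
           sum(nvl(h.end_date, sysdate) - h.start_date) as days_in_dept_78
    from   emp_dept_history h   -- assumed assignment-history table
    where  h.dept_id = 78
    group  by h.emp_id
) b
on (a.emp_id = b.emp_id)
when matched then
    update set a.dept78_days = b.days_in_dept_78;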


Should I name tables based on date & time of creation, and use EXEC() and a variable to dynamically refer to these tables? [closed]

TL;DR: My current company creates a new table for every time period, such as sales_yyyymmdd, and uses EXEC() with dynamically built table names, which turns the entire query into one big string literal (shown in red) and makes it hard to read. What changes can I suggest to improve both readability and performance?
Some background: I'm a data analyst (and not a DBA), so my SQL knowledge can be limited. I recently moved to a new company which uses MS SQL Server as its database management system.
The issues: The DAs here share a similar style of writing SQL scripts, which includes:
Naming tables based on their time of creation, e.g. the sales records for each day are saved into a new table for that day, such as sales_yyyymmdd. This means there is a huge number of tables like this. Note that the DAs have their own database to tinker with, so they are allowed to create any number of tables there.
Writing queries enclosed in EXEC() that dynamically refer to table names based on some variable @date. As a result, their entire scripts become red string literals, which are difficult for me to read.
They also claim that enclosing queries in EXEC(), in their own words, makes the scripts run to completion when stored as scheduled jobs, because when they write them the "normal way", these jobs sometimes stop mid-way.
My questions:
Regarding naming and creating new tables for every new time period: I suppose this is obviously a bad practice, at least in terms of management, due to the sheer number of tables. I suggested merging them and adding a created_date column, but the DAs here argued that both ways take up the same amount of disk space, so why bother with such a radical change. How do I explain this to them?
Regarding the EXEC() command: My issue with this way of writing queries is that it's hard to maintain and to share with other people. My quick fix for now (if issue 1 remains) is to use a single EXEC() command to copy the tables needed into temp tables, then select from these temp tables instead. If new data needs to be merged, I first insert it into the temp tables, manipulate it there, and finally merge it into the final, official table. Would this method affect performance at all (since there is an extra step involving temp tables)? And is there any better way that helps with both readability and performance?
I don't have experience scheduling jobs myself on my own computer, as my previous company had a dedicated data engineering team that took my SQL scripts and automated them on a server. My googling has not yielded any results yet either. Is it true that using EXEC() keeps jobs from being interrupted? If not, what is the actual issue here?
I know that the post is long, and I'm also not a native speaker. I hope I have explained my questions clearly enough, and I appreciate any help/answers.
Thanks everyone, and stay safe!
While I understand the reasons for creating a table for each day, I do not think this is the correct solution.
Modern databases do a very good job of partitioning data, and SQL Server has this feature too. In fact, such use cases are exactly the reason partitioning was created in the first place. For me that would be the way to go (a minimal sketch follows the list below), as:
it's not a WTF solution (your description is easily understandable, but it's still a WTF)
partitioning allows for optimizing partition-restricted queries, particularly time-restricted queries
it is still possible to execute a non-partition-based query, whereas the solution you described would require a union, or multiple unions
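A minimal sketch of day-based partitioning in SQL Server; boundary values, the filegroup, and all names here are only illustrative:
CREATE PARTITION FUNCTION pf_sales_by_day (date)
    AS RANGE RIGHT FOR VALUES ('20200301', '20200302', '20200303');

CREATE PARTITION SCHEME ps_sales_by_day
    AS PARTITION pf_sales_by_day ALL TO ([PRIMARY]);

CREATE TABLE dbo.Sales
(
    SaleId       int           NOT NULL,
    InsertedDate date          NOT NULL,
    Amount       decimal(18,2) NULL,
    CONSTRAINT PK_Sales PRIMARY KEY (SaleId, InsertedDate)
) ON ps_sales_by_day (InsertedDate);
Queries that filter on InsertedDate then touch only the relevant partitions (the partition-restricted optimization mentioned above), while a query without the filter still works against the single table.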
As everybody mentioned in the comments, you can have a single table Sales with an extra column to hold the date the data was inserted.
Create table Sales to hold all sales data
CREATE TABLE Sales
(
    col1 datatype,
    col2 datatype,
    ...
    InsertedDate date  -- the date the sales data corresponds to
)
Insert all the existing tables' data into the above table:
INSERT INTO Sales
SELECT *, '20200301' AS InsertedDate FROM Sales_20200301
UNION ALL
SELECT *, '20200302' AS InsertedDate FROM Sales_20200302
UNION ALL
...
UNION ALL
SELECT *, '20200331' AS InsertedDate FROM Sales_20200331
Now you can change the EXEC() queries that used the @date variable into direct queries. You can easily read the script without it showing up in red.
DECLARE @date DATE = '20200301'
SELECT col1, col2...
FROM Sales
WHERE InsertedDate = @date
Note:
If the data is huge, you can think of partitioning it based on InsertedDate.
The purpose of a database is not to create tables; it is to use tables. To be honest, this is a nuance that is sometimes hard to explain to DBAs.
First, understand where they are coming from. They want to protect data integrity. They want to be sure that the database is available and that people can use the data they need. They may have been around when the database was designed, when the only envisioned usage was day-by-day. This also keeps the data safe when the schema changes (i.e. new columns are added).
Obviously, things have changed. If you were to design the database from scratch, you would probably have a single partitioned table; the partitioning would be by day.
What can you do? There are several options.
You do have some options, depending on what you are able to do and what the DBAs need. The most important thing is to communicate the importance of this issue. You are trying to do analysis. You know SQL. Before you can even get started on a problem, you have to deal with the data model, thinking about EXEC()s, date ranges, and a whole host of issues that have nothing to do with the problems you need to solve.
This affects your productivity, and it affects the utility of the database. Both of these are issues that someone should care about.
There are some potential solutions:
1. You can copy all the data into a single table each day, perhaps as a separate job. This is reasonable if the tables are small.
2. You can copy only the latest data into a single table.
3. You can create a view that combines the daily tables into a single view.
4. The DBAs could do any of the above for you.
I obviously don't know the structure of the existing code or how busy the DBAs are. However, (4) does not seem particularly cumbersome, regardless of which of the other solutions is chosen.
If there is no room for a view or a copy of the data, I would write SQL generation code that constructs a query like this:
select * from sales_20200101 union all
select * from sales_20200102 union all
. . .
This will be a long string. I would then just start my queries with:
with sales as (
<long string here>
)
<whatever code here>;
Of course, it would be better to have a view (at least) that has all the sales you want.
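If you do go the generation route, a minimal sketch of that generation code, assuming SQL Server 2017+ for STRING_AGG and that every daily table matches the sales_yyyymmdd pattern:
DECLARE @body nvarchar(max);

SELECT @body = STRING_AGG(
           CAST('select * from ' + QUOTENAME(name) AS nvarchar(max)),
           ' union all' + CHAR(10))
FROM sys.tables
WHERE name LIKE 'sales[_]2[0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

SELECT @body AS sales_union_body;  -- paste this into the WITH clause above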

Iterative union SQL query

I'm working with CA (Broadcom) UIM. I want the most efficient method of pulling distinct values from several views. I have views that start with "V_" for every QOS that exists in the S_QOS_DATA table. I specifically want to pull data for any view that starts with "V_QOS_XENDESKTOP."
The inefficient method that gave me quick results was the following:
select * from s_qos_data where qos like 'QOS_XENDESKTOP%';
Take that data and put it in Excel.
Use CONCAT to turn just the qos names into queries such as:
SELECT DISTINCT samplevalue, 'QOS_XENDESKTOP_SITE_CONTROLLER_STATE' AS qos
FROM V_QOS_XENDESKTOP_SITE_CONTROLLER_STATE union
Copy the formula cell down for all rows, then remove UNION from the last query and add a semicolon.
This worked, I got the output, but there has to be a more elegant solution. Most of the answers I've found about iterating in SQL use numbers or don't seem to be quite what I'm looking for. Examples: Multiple select queries using while loop in a single table? Is it Possible? and Syntax of for-loop in SQL Server
The most efficient method to do what you want to do is to do something like what CA's scripts do (the ones you linked to). That is, use dynamic SQL: create a string containing the SQL you want from system tables, and execute it.
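For example, something along these lines; this is a sketch only, and assumes the UIM views live in SQL Server 2017+ (for STRING_AGG):
DECLARE @sql nvarchar(max);

SELECT @sql = STRING_AGG(
           CAST('SELECT DISTINCT samplevalue, ''' + TABLE_NAME + ''' AS qos FROM '
                + QUOTENAME(TABLE_NAME) AS nvarchar(max)),
           CHAR(10) + 'UNION' + CHAR(10))
FROM INFORMATION_SCHEMA.VIEWS
WHERE TABLE_NAME LIKE 'V[_]QOS[_]XENDESKTOP%';

EXEC sp_executesql @sql;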
A more efficient method would be to write a different query based on the underlying tables, mimicking the criteria in the views you care about.
Unless your view definitions are changing frequently, though, I recommend against dynamic SQL. (I doubt they change frequently. You regenerate the views no more frequently than you get a new script, right? CA isn't adding tables willy nilly.) AFAICT, that's basically what you're doing already.
Get yourself a list of the view names, and write your query against a union of them, explicitly. Job done: easy to understand, not much work to modify, and you give the server its best opportunity to optimize.
I can imagine that it's frustrating and error-prone not to be able to put all that work into your own view, and query against it at your convenience. It's too bad most organizations don't let users write their own views and procedures (owned by their own accounts, not dbo). The best I can offer is to save what would be the view body to a file, and insert it into a WITH clause in your queries:
WITH V AS (... query ...) SELECT ... FROM V

Determine if a SQL Insert/Update statement affects the result from a stored Select Statement

Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine that the statement below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But given any possible Insert/Update/Whatever statement, as well as any possible starting Select (we know the Selects and can analyze them all day), I'd like to determine whether it would affect the result of that Select statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)
Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (at least, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but is it worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what all.) Without setting criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to warrant an NP-complete answer.
My first thought would be to put a trigger on table_A, where if any of the columns you're affecting (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), then the trigger records that an "affecting" change has taken place.
E.g. have another little table to record a "last update timestamp", which the trigger could pop a getdate() into when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).
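A minimal sketch of that trigger for the example query, assuming SQL Server; the logging table and all names are hypothetical:
-- one row per monitored select statement; seed the row once up front
CREATE TABLE dbo.query_change_log
(
    query_name  varchar(100) NOT NULL PRIMARY KEY,
    last_change datetime     NOT NULL
);
GO

CREATE TRIGGER dbo.trg_table_A_marks_change
ON dbo.table_A
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- only rows with A > 10 can affect "SELECT A, B FROM table_A WHERE A > 10"
    IF EXISTS (SELECT 1 FROM inserted WHERE A > 10)
       OR EXISTS (SELECT 1 FROM deleted WHERE A > 10)
    BEGIN
        UPDATE dbo.query_change_log
        SET    last_change = GETDATE()
        WHERE  query_name = 'select_a_b_where_a_gt_10';
    END
END;
The application then re-runs the stored SELECT only when last_change is newer than its own last run.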

Strategy for avoiding a common sql development error (misleading result on join bug)

Sometimes when I'm writing moderately complex SELECT statements with a few JOINs, the wrong key columns get used in a JOIN condition yet still return valid-looking results.
Because the auto-numbering values (especially early in development) all tend to fall in similar ranges (sub-100s or so), the SELECT still produces some results. These results often look valid at first glance, and the problem is not detected until much, much later, making debugging far more difficult because familiarity with the data structures and code has gone stale in the dev's mind.
I just spent several hours tracking down yet another instance of this issue, which I've run into too many times before. I name my tables and columns carefully and write my SQL statements methodically, but this is an issue I can't seem to completely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: has anyone come up with a clever method for avoiding what I assume is probably a common SQL bug/mistake?
I have thought of auto-numbering tables with different starting values, but this feels kludgy and would get ugly to keep straight for data models with dozens of tables... Any better ideas?
P.S.
I am very careful and methodical in naming my tables and columns. The Patient table gets a PatientId column, Facility gets a FacilityId, etc. This issue tends to arise when join tables are involved and the linkage takes on extra meaning, such as RelatedPatientId, ReferringPatientId, FavoriteItemId, etc.
When writing long, complex SELECT statements, try to limit the result to one record.
For instance, assume you have this gigantic, enormous, awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your result should limit the row count by using a WHERE clause.
My advice is, rather than getting all this code written and generalized for all cases at once, to break the problem up: use a WHERE clause to limit the row count to, say, a single record. Check all the fields; if they look OK, relax the filter and let your code return more rows. Only after further checking should you generalize.
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this unless you named them all "ID". Are you writing multiple select statements using the same tables? You may want to create views for the more common ones.
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
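A small sketch of that in SQL Server (names are illustrative); an accidental join on the wrong GUID columns then finds no matches instead of plausible-looking ones:
CREATE TABLE dbo.Patient
(
    PatientId uniqueidentifier NOT NULL
        CONSTRAINT DF_Patient_PatientId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Patient PRIMARY KEY,
    FirstName nvarchar(100) NOT NULL
);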
You could use GUIDs as your primary keys, but they have their pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs, use TableNameID; for example, for table Person, use PersonID.
Use a DB model and look at the diagram when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.

suggest a method for updating data in many tables with random data?

I've got about 25 tables that I'd like to update with random data that's picked from a subset of data. I'd like the data to be picked at random but meaningful -- like changing all the first names in a database to new first names at random. So I don't want random garbage in the fields; I'd like to pull from a temp table that's populated ahead of time.
The only way I can think of to do this is with a loop and some dynamic sql.
insert pick-from names into a temp table with an id field
foreach table name in a list of tables:
    build a dynamic sql statement that updates all first-name fields
    to a name picked at random from the temp table,
    based on rand() * max(id) from the temp table
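Written out for a single (hypothetical) table, one iteration of that loop might look like this in SQL Server; the seemingly redundant correlation in the WHERE clause is there to force a fresh random pick per row rather than one pick for the whole table:
UPDATE p
SET    p.FirstName = pick.name
FROM   dbo.Patient AS p
CROSS APPLY (SELECT TOP (1) n.name
             FROM   #names AS n                   -- temp table of pick-from names
             WHERE  p.PatientId = p.PatientId     -- forces per-row evaluation
             ORDER  BY NEWID()) AS pick;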
But anytime I think "loop" in SQL I figure I'm doing something wrong.
The database in question has a lot of denormalized tables in it, so that's why I think I'd need a loop (the first name fields are scattered across the database).
Is there a better way?
Red Gate have a product called SQL Data Generator that can generate fake names and other fake data for testing purposes. It's not free, but they have a trial so you can test it out, and it might be faster than trying to do it yourself.
(Disclaimer: I have never used this product, but I've been very happy with some of their other products.)
I wrote a stored procedure to do something like this a while back. It is not as good as the Red Gate product and only does names, but if you need something quick and dirty, you can download it from
http://www.joebooth-consulting.com/products/
The script name is GenRandNames.sql
Hope this helps
Breaking the 4th wall a bit by answering my own question.
I did try this as a SQL script. What I learned is that SQL pretty much sucks at randomness. The script was slow and weird: functions that referenced views that were only created for the script and couldn't be made in tempdb.
So I made a console app.
1. Generate your random data; this is easy to do with the Random class (just remember to use only one instance of Random).
2. Figure out which columns and table names you'd like to update, via a script that looks at information_schema (see the sketch after this list).
3. Get the IDs for all the tables that you're going to update, if possible (and wow, will it be slow if you have a large table that doesn't have any good PKs).
4. Update each table 100 rows at a time. Why 100? No idea; it could be 1000. I just picked a number. A Dictionary is handy here: pick a random ID from the dict using the Random class.
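For step 2, a quick sketch of the information_schema query, assuming SQL Server and that the relevant columns all contain 'FirstName' in their names:
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  COLUMN_NAME LIKE '%FirstName%'
ORDER  BY TABLE_SCHEMA, TABLE_NAME;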
Wash, rinse, repeat. I updated about 2.2 million rows in an hour this way. Maybe it could be faster, but it was doing many small updates so it didn't get in anyone's way.