I'm trying to create a set of data that I'm going to write out to a file. It's essentially a report composed of various fields from a number of different tables; some columns need processing done on them, while others can just be selected.
Different users will likely want different processing performed on certain columns, and in the future I'll probably need to add further functions for computed columns.
I'm considering the cleanest/most flexible approach to storing and using all the different functions I'm likely to need for these computed columns. I've got two ideas in my head, but I'm hoping there might be a more obvious solution I'm missing.
For a simple, slightly odd example, a Staff table:
Employee | DOB | VacationDays
Frank | 01/01/1970 | 25
Mike | 03/03/1975 | 24
Dave | 05/02/1980 | 30
I'm thinking I'd either end up with a query like
SELECT NameFunction(Employee, @optionID),
       DOBFunction(DOB, @optionID),
       VacationFunction(VacationDays, @optionID)
FROM Staff
With user-defined functions, where the optionID would be used in a CASE statement inside each function to decide what processing to perform.
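For illustration, a rough sketch of what one of these functions might look like (the option numbers and output formats here are just placeholders):

create function dbo.DOBFunction (@DOB datetime, @optionID int)
returns varchar(50)
as
begin
    return case @optionID
        -- long format, e.g. '1 January 1970' (ordinal suffix omitted for brevity)
        when 2 then datename(day, @DOB) + ' ' + datename(month, @DOB) + ' ' + datename(year, @DOB)
        -- seconds since the Unix epoch
        when 3 then convert(varchar(50), datediff(second, '19700101', @DOB))
        -- default: dd/mm/yyyy
        else convert(varchar(50), @DOB, 103)
    end;
end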
Or I'd want to make the way the data is returned customisable using a lookup table of other functions:
ID | Name | Description
1 | ShortName | Obtains 3 letter abbreviation of employee name
2 | LongDOB | Returns DOB in format ~ 1st January 1970
3 | TimeStampDOB | Returns Timestamp for DOB
4 | VacationSeconds | Returns seconds of vacation time
5 | VacationBusinessHours | Returns number of business hours of vacation
This seems neater, but I'm not sure how I'd formulate the query; presumably using dynamic SQL? Is there a sensible alternative?
The functions will be used on a few thousand rows.
The closest answer I've found was in this thread:
Call dynamic function name in SQL
I'm not a huge fan of dynamic SQL, although in this case I think it might be the best way to get the result I'm after?
Any replies appreciated,
Thanks,
Chris
I would go for the second solution. You could even use real stored proc names in your lookup table.
create proc ShortName (
    @param varchar(50)
) as
begin
    select 'ShortName: ' + @param
end
go

declare @proc sysname = 'ShortName'
exec @proc 'David'
As you can see in the example above, the first parameter of exec (i.e. the procedure name) can itself be a parameter. This said, with all the usual warnings regarding dynamic SQL...
In the end, you should go with whichever is faster, so you should try both ways (and any other way someone might come up with) and decide after that.
I like the first option better, as long as your functions don't do extra selects against other tables. You may not even need the user-defined functions if they aren't going to be reused in a different report.
I prefer to use dynamic SQL only to improve a query's performance, such as adding dynamic ordering or adding/removing complex WHERE conditions.
But these are all subjective opinions; the best thing is to try, compare, and decide.
Actually, this isn't a question of what's faster. It is a question of what makes the code cleaner, particularly for adding new functionality (new columns, new column formats, re-ordering them).
Don't think of your second approach as "using dynamic SQL", since that tends to have negative connotations. Instead, think of it as a data-driven approach. You want to build a table that describes the columns that users can get, and the formats. This is great! Users can then provide a list of columns, and you'll have a magical stored procedure that combines the information from the users with the information in your metadata table, and produces the desired result.
I'm a big fan of data-driven approaches, and dynamic SQL is the best SQL tool I've found so far for implementing them.
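As a minimal sketch of the idea, assuming hypothetical FunctionName, ColumnName and DisplayName columns in the metadata table (STRING_AGG needs SQL Server 2017+; FOR XML PATH works on older versions):

declare @cols nvarchar(max), @sql nvarchar(max);

-- build the select list from the metadata rows the user picked
select @cols = string_agg(FunctionName + '(' + ColumnName + ') as [' + DisplayName + ']', ', ')
from ColumnFormats
where ID in (1, 2, 5);  -- the user's chosen formats

set @sql = N'select ' + @cols + N' from Staff';
exec sp_executesql @sql;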
Lately we have had a few cases where I had to build reports where I have one table like:
1|text|text
2|text|text
3|text|text
and another table
1|1.1.2017|text
1|1.2.2017|text
2|1.1.2017|text
2|1.2.2017|text
3|1.1.2017|and so on
result should be:
id | text | text | Jan | Feb | ...
1  | text | text | 2   | 1   | ...
2  | text | text | 2   | 1   | ...
3  | text | text | 1   | 1   | ...
My first question would be whether there is a common way to do this. I have already built queries to do this, but maybe they are not as efficient as they could be. This seems to me like a very common business case, so maybe there are (standardized) techniques for it which I don't know yet.
Another question: the queried data will go into a BI tool later. So is it maybe better (faster) to run the queries first, put the tables into the BI tool and then manipulate the data as desired? Maybe someone has experience with this and could give me advice...
Thanks
Have a look at the answer to duplicate ID in sql select - except the first sentence.
In my experience it is best to do as you suggested: use SQL to group and aggregate the data as you want and then use a BI tool to present it.
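For the shape above, the usual grouping pattern is conditional aggregation; a sketch with assumed table and column names:

select h.id, h.colA, h.colB,
       sum(case when month(d.eventDate) = 1 then 1 else 0 end) as Jan,
       sum(case when month(d.eventDate) = 2 then 1 else 0 end) as Feb
       -- ...one line per month
from headerTable h
join detailTable d on d.id = h.id
group by h.id, h.colA, h.colB;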
I have the following table schema:
+------+----------+----------+
| chn  | INTEGER  | NULLABLE |
| size | STRING   | NULLABLE |
| char | REPEATED | NULLABLE |
| ped  | INTEGER  | NULLABLE |
+------+----------+----------+
When I click on 'preview' in the Google BigQuery Web UI, I get the following result:
But when I query my table, I get this result:
It seems like "preview" is interpreting my repeated field as an array. I would like to get the same result in a query, to limit the number of rows.
I did try unchecking "Use Legacy SQL", which gave me the same result, but the problem is that with my table the same query takes ~1.0 second to execute with "Use Legacy SQL" checked and ~12 seconds when it's unchecked.
I am looking for speed here so unfortunately, not using Legacy SQL is not an option...
Is there another way to render my repeated field like it does in the "preview"?
Thanks for the help :)
In legacy SQL, BigQuery flattens the result of queries by default. This means two things:
All child fields of RECORD fields are propagated to the top-level, with their names changed from record.subrecord.leaf to record_subrecord_leaf. Parent records are removed from the schema.
All repeated fields are converted to fields of optional mode, with each repeated value expanded into its own row. (As a side note, this step is very similar to the FLATTEN function exposed in legacy SQL.)
What you see here is a product of #2. Each repeated value is becoming its own row (as you can see by the row count on the left-hand side in your two images) and the values from the other columns are, well, repeated for each new row.
You can prevent this behavior and receive "unflattened results" in a couple ways.
Using standard SQL, as you note in your original question. All standard SQL queries return unflattened results.
While using legacy SQL, setting the flattenResults parameter to false. This also requires specifying a destination table and setting allowLargeResults to true. These can be found in the Show Options panel beneath the query editor if you want to set them within the UI. Mikhail has some good suggestions for managing the temporary-ness of destination tables if you aren't interested in keeping them around.
I should note that there are a number of corner cases with legacy SQL with flattenResults set to false which might trip you up if you start writing more complex queries. A prominent example is that you can't output more than one independently repeated field in query results using legacy SQL, but you can output multiple with standard SQL. These issues are unlikely to be resolved in legacy SQL, and going forward we're suggesting people use standard SQL when they run into them.
If you could provide more details about your much slower query using standard SQL (e.g. job ID in legacy SQL, job ID in standard SQL, for comparison), I, and the rest of the BigQuery team, would be very interested in investigating further.
Is there another way to render my repeated field like it does in the "preview"?
To see the original, non-flattened output in the Web UI with legacy SQL, I used to set the respective options (click Show Options): write the output to a table, with Allow Large Results checked and Flatten Results unchecked.
This not only saves the result into a table but also shows the result the same way preview does (because it actually is a preview of that table). To make sure the table gets removed afterwards, I have a "dedicated" dataset (temp) with a default expiration of 1 day (or an hour, depending on how aggressive you want to be with your junk), so you don't need to worry about those tables; they will get deleted automatically for you. Worth noting: this was quite a common pattern for us, and having to set the extra options every time was boring, so we ended up with our own custom UI that does all of this for the user automatically.
What you see is called flattening.
By default the UI flattens the query output, and there is currently no option to show query results the way you want. In order to produce unflattened results you must write to a table, but that's a different thing.
I have a financial application with a large set of rules to check. The file is stored in SQL Server. This is a web application using C#. Each file must be checked against these rules, and there are hundreds of rules to consider. These rules change every few weeks to months. My thought was to store these rules in an XML file and have my code-behind read the XML and dynamically generate the SQL queries on the file. For testing purposes we are hard-coding these rules, but we would like to move to an architecture that is more accommodating of these rule changes. I'd think that XML is a good way to go here, but I'd appreciate the advice of those who have gone down similar roads before.
The complexity of each rule check is small; they are generally just simple statements such as: "If A && B && (C || D), then write an output string to the log file".
My thought would be to code up the condition in XML (A && B && (C || D)) and attach a string to that node in the XML. If the query succeeds, the string is written; if the query does not succeed, no string is written.
Thoughts?
In response to a comment, here is a more specific example:
The database has an entity called 'assets'. There are a number of asset types supported, such as checking, savings, 401k, IRA, etc etc. An example of a rule we want to check would be: "If the file has a 401k, append warning text to the report saying ". That example is for a really simple case.
We also get into more complex and dynamic cases where, for a short period of time, a rule may be applied to deny files with clients in specific states with specific property types. A classic example is to not allow condominiums in Florida. Such a rule may exist for a while, then be removed.
The pool of rules is constantly changing at the discretion of large lending banks. We need to be able to make these rule changes outside of the source code for the site; hence my idea of using XML and having the C# parse it and apply the rules dynamically. Does this help clarify the application and its needs?
Could you just have a table with SQL in it? You could then formalize it a bit by having the SQL return a particular structure.
So your table of checks might be:
id | checkGroup    | checkName      | sql
1  | '401k checks' | '401k present' | select '401k present', count(*), 'remove 401k' from assets where x like '401k%'
you could insist that the sql in the sql column returns something of the format:
ruleName | count | comment
'401k present'| 85 |'remove 401k'
You could have different types of rules. When I have done something similar to this, I have not returned totals; instead I have returned something more like:
table | id | ruleBorken | comment
'assets' | 1 | '401k present' | 'remove 401k'
This obviously would have a query more like:
select
    'assets' as [table]
    ,id
    ,'401k present' as ruleBroken
    ,'remove 401k' as comment
from
    assets
where
    x like '401k%'
This makes it easier to generate interactive reports where the aggregate functions are done by the reporting tool (e.g. SSRS), allowing drill-down to problem records.
The queries that validate the rules can either be run within a stored procedure that selects the queries out and uses EXEC to execute them (as sketched below), or they can be run from your application code one by one.
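A minimal sketch of that driver procedure, with assumed table names (Checks, CheckResults) and using INSERT ... EXEC to capture each rule's output:

create proc RunAllChecks as
begin
    declare @sql nvarchar(max);
    declare ruleList cursor local fast_forward for
        select [sql] from Checks;
    open ruleList;
    fetch next from ruleList into @sql;
    while @@fetch_status = 0
    begin
        -- each rule's query returns rows of ([table], id, ruleBroken, comment)
        insert into CheckResults ([table], id, ruleBroken, comment)
        exec (@sql);
        fetch next from ruleList into @sql;
    end
    close ruleList;
    deallocate ruleList;
end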
Some of the columns (e.g. rule name) can be populated by the calling stored procedure or code.
The comment and rule name in this example are basically the same, but it can be handy to keep the comment separate and put a case statement in there. E.g. when failing validation rules on fields that should not be blank if you have a 401k, a case statement can spell out in the comment which fields are missing.
If you want end users or non-devs to create the rules, then you could look at ways of generating the where clause in code: let the user select a table and rule name and build a where clause through some interface, then save it to your rule table and you are good to go.
If all of your rules return a set format, you can have one report template for all rules; equally, if you have exactly three types of rule, you could have three return formats and three report templates. Basically, I like formalizing the result structure as it allows much more reuse elsewhere.
I've found a lot of bits and pieces of this, but I can't put them together. This is basically the idea of the table, where Name is a varchar, Date is a datetime, and Number is an int:
Name | Date   | Number
A    | 1-2-11 | 15
B    | 1-2-11 | 8
A    | 1-1-11 | 5
I'd like to create a view that looks like this
Name | 1-2-11 | 1-1-11
A    | 15     | 5
B    | 8      |
At first I was using a temp table and appending each date row to it. I read on another forum that that approach is a major resource hog. Is that true? Is there a better way to do this?
I would combine dynamic SQL with a pivot as I mentioned in this answer.
You want to look into "cross-tab" or "pivot" statements. In SQL Server 2005 and up, it's PIVOT, but syntax varies between platforms.
This is a very complex subject, particularly since you want to add columns to a view as your data grows over time. Besides your platform's documentation, check out the myriad other SO posts on the subject.
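For the sample table above, a hard-coded PIVOT might look like this sketch (the table name and ISO date keys are assumptions; once the set of dates isn't fixed, you'd generate the IN list with dynamic SQL):

select Name, [2011-01-02], [2011-01-01]
from (
    select Name, convert(varchar(10), [Date], 120) as DateKey, Number
    from MyTable
) src
pivot (sum(Number) for DateKey in ([2011-01-02], [2011-01-01])) p;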
If the set of dates is known, then you can use PIVOT in some cases.
It is often faster to use dynamic SQL, BUT this can be very dangerous, so be wary.
To really know what the best solution is for your problem we would need some more information -- how much data -- how much variation is expected in the different columns, etc.
However, it is true that both PIVOT and dynamic SQL will be faster than a temp table.
I would do it with Access or Excel instead of T-SQL.
I am wondering how others would handle a scenario like such:
Say I have multiple choices for a user to choose from.
Like, Color, Size, Make, Model, etc.
What is the best solution or practice for handling the build of your query for this scenario?
So what happens if they select 6 of the 8 possible colors, 4 of the 7 possible makes, and 8 of the 12 possible brands?
You could do dynamic OR statements or dynamic IN statements, but I am trying to figure out if there is a better solution for handling this "WHERE" criteria type of logic.
EDIT:
I am getting some really good feedback (thanks, everyone). One other thing to note is that some of the selections could be quite large, e.g. 40 selections out of a possible 46. Thanks again!
Thanks,
S
What I would suggest is creating a function that takes in a delimited list of makeIds, colorIds, etc. (these will probably be ints, or whatever your key type is) and splits them into a table for you.
Your SP will take in a list of makes, colors, etc. as you've said above.
YourSP '1,4,7,11', '1,6,7', '6'....
Inside your SP you'll call your splitting function, which will return a table:
SELECT *
FROM Cars C
JOIN YourFunction(@models) YF ON YF.Id = C.ModelId
JOIN YourFunction(@colors) YF2 ON YF2.Id = C.ColorId
Then, if they select nothing they get nothing. If they select everything, they'll get everything.
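On SQL Server 2016+ the built-in STRING_SPLIT covers the splitting part; on older versions, a minimal splitting function might look like this sketch (the names here, like YourFunction and Cars above, are placeholders):

create function dbo.SplitIds (@list varchar(max))
returns @ids table (Id int)
as
begin
    -- naive comma split; no trimming or validation for brevity
    declare @pos int = charindex(',', @list);
    while @pos > 0
    begin
        insert into @ids (Id) values (cast(left(@list, @pos - 1) as int));
        set @list = substring(@list, @pos + 1, len(@list));
        set @pos = charindex(',', @list);
    end
    if len(@list) > 0
        insert into @ids (Id) values (cast(@list as int));
    return;
end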
What is the best solution or practice for handling the build of your query for this scenario?
Dynamic SQL.
A single parameter represents two states: NULL/non-existent, or having a value. Each additional parameter doubles the number of total possibilities: 2 parameters yield 4, 3 yield 8, and so on (2^n). A single, non-dynamic query can contain all the possibilities, but it will perform horribly due to:
ORs
overall non-sargability
and inability to reuse the query plan
...when compared to a dynamic SQL query that constructs the query out of only the absolutely necessary parts.
The query plan is cached in SQL Server 2005+ if you use the sp_executesql command; it is not if you only use EXEC.
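A sketch of the pattern, assuming @colors/@makes are the procedure's optional list parameters and a splitter like dbo.SplitIds from the earlier sketch exists (table and column names assumed):

declare @colors varchar(max) = '1,6,7', @makes varchar(max) = null;
declare @sql nvarchar(max) = N'select * from Cars where 1 = 1';

-- append only the predicates that are actually needed
if @colors is not null
    set @sql += N' and ColorId in (select Id from dbo.SplitIds(@colors))';
if @makes is not null
    set @sql += N' and MakeId in (select Id from dbo.SplitIds(@makes))';

-- sp_executesql lets SQL Server cache one plan per query shape
exec sp_executesql @sql,
     N'@colors varchar(max), @makes varchar(max)',
     @colors = @colors, @makes = @makes;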
I highly recommend reading The Curse and Blessing of Dynamic SQL.
For something this complex, you may want a session table that you update when the user selects their criteria. Then you can join the session table to your items table.
This solution may not scale well to thousands of users, so be careful.
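A rough sketch of the session-table idea (all names here are assumptions):

create table UserSelections (
    SessionId uniqueidentifier not null,
    Dimension varchar(20) not null,  -- 'Color', 'Make', ...
    ValueId int not null
);

declare @session uniqueidentifier = newid();  -- the user's real session id in practice

select i.*
from Items i
join UserSelections c
  on c.SessionId = @session and c.Dimension = 'Color' and c.ValueId = i.ColorId
join UserSelections m
  on m.SessionId = @session and m.Dimension = 'Make' and m.ValueId = i.MakeId;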
If you want to create dynamic SQL, it won't matter whether you use the OR approach or the IN approach. SQL Server will process the statements the same way (maybe with a little variation in some situations).
You may also consider using temp tables for this scenario. You can insert the selections for each criteria into temp tables (e.g., #tmpColor, #tmpSize, #tmpMake, etc.). Then you can create a non-dynamic SELECT statement. Something like the following may work:
SELECT <column list>
FROM MyTable
WHERE MyTable.ColorID in (SELECT ColorID FROM #tmpColor)
OR MyTable.SizeID in (SELECT SizeID FROM #tmpSize)
OR MyTable.MakeID in (SELECT MakeID FROM #tmpMake)
The dynamic OR/IN and the temp table solutions work fine if each condition is independent of the other conditions. In other words, if you need to select rows where ((Color is Red and Size is Medium) or (Color is Green and Size is Large)) you'll need to try other solutions.