Creating a Table Using Previous Values (Iterative Process) - sql

I'm completely new to Visual FoxPro (9.0) and I'm having trouble creating a table that uses previous values to generate new values. What I mean is that I have a given table with two columns, age and probability of death. Using this I need to create a survival table which has the columns Age, l(x), d(x), q(x), m(x), L(x), T(x), and e(x), where:
l(x): Survivorship Function; Defined as l(x+1) = l(x) * EXP(-m(x))
d(x): Number of Deaths; Defined as l(x) - l(x+1)
q(x): Probability of Death; This is given to me already
m(x): Mortality Rate; Defined as -LN(1-q(x))
L(x): Total Person-Years of Cohorts in the Interval (x, x+1); Defined as l(x+1) + (0.5 * d(x))
T(x): Total Person-Years of all Cohorts in the Interval (x, N); Defined as SUM(L(x)) [From x, N]
e(x): Remaining Life Expectancy; Defined as T(x) / l(x)
Now I'm not asking how to get all of these values, I just need help getting started and maybe pointed in the right direction. As far as I can tell, in VFP there is no way to point to a specific row in a data-table, so I can't do what I normally do in R and just make a loop. I.E. I can't do something like:
for (i in 1:length(given_table$Age))
{
new_table$mort_rate[i] <- -log(1 - given_table$death_prop[i])  # log() is the natural logarithm in R
}
It's been a little while so I'm not sure that's 100% correct anyway, but my point is I'm used to being able to create a table, and alter the values individually by pointing to a specific row and/or column using a loop with a simple counter variable. However, from what I've read there doesn't seem to be a way to do this in VFP, and I'm completely lost.
I've tried to make a Cursor, populating it with dummy values and trying to update each value sequentially using a SCATTER NAME and SCAN/REPLACE thing, but I don't really understand what's happening or how to fine-tune this for each calculation/entry that I need. (This is the post I was referencing when I tried this: Multiply and subtract values in previous row for new row in FoxPro.)
So, how do I go about making a table that relies on iterative process to calculate subsequent values in Visual FoxPro? Are there any good resources that explain Cursors and the Scatter/Scan thing that I was trying (I couldn't find any resources that explained it in terms I could understand)?
Sorry if I've worded things poorly, I'm fairly new to programming in general. Thank you.

You absolutely can write a loop through an existing table in VFP. Use the SCAN command. However, if you're planning to add records to the same table as you go, you're going to run into some issues. Is that what you meant here? If so, my suggestion is to put the new records into a cursor as you create them and then APPEND them to the original table after you've processed all the records that were there when you started.
If you're putting records into a different table as you loop through the original, this is straightforward:
* Assumes you've already created the table or cursor to hold the result
SELECT YourOriginalTable && substitute in the alias/name for the original table
SCAN
* Do your calculations
* Substitute appropriately for YourNewTable and the two lists
INSERT INTO YourNewTable (<list of fields>) VALUES (<list of values>)
ENDSCAN
In the INSERT command, if you refer to any fields of the original table, you need to alias them, like this: YourOriginalTable.YourField, again substituting appropriately.

A bit too late but maybe still helps.
The steps to achieve what you want are:
0. close the tables, just in case (see CLOSE DATABASE)
1. open the Age table (see USE in VFP help)
2. create the Survival table structure (see CREATE TABLE)
for this you need to know the field type for each of your l(x), d(x), etc. functions
Let's say that you named the fields like your functions (i.e. Lx, Dx, etc.). VFP field names are case-insensitive, so l(x) and L(x) need distinct names.
3. select the Age table (see SELECT)
4. loop through the Age table (see SCAN)
5. pass each record into variables (see SCATTER)
6. make your calculations starting from the Age table data (variables) using the L(x), D(x), etc. formulas and store the results in variables named M.<your Survival table field>
i.e. M.mx = -LOG(1 - M.qx) && see LOG (natural logarithm in VFP); qx is assumed to be the probability-of-death field
Note: in these calculations you can use any mix of Age table variables and the newly created variables.
7. after you have calculated all the Survival fields, write them into the table (see the APPEND BLANK and GATHER commands)
8. close the tables (see CLOSE DATABASE)
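To make the order of the calculations concrete, here is a minimal sketch of the same iteration in Python rather than VFP. The function name, the dictionary keys, and the starting cohort size (radix) of 100000 are assumptions for illustration, since the question does not say what l(0) is. It also assumes every q(x) is below 1, because m(x) = -LN(1 - q(x)) is undefined otherwise.

```python
import math


def build_life_table(ages, q, radix=100000.0):
    """Return one dict per age with lx, dx, qx, mx, Lx, Tx and ex.

    ages  -- list of ages, in ascending order
    q     -- list of death probabilities q(x), same length, each < 1
    radix -- assumed starting cohort size l(0); not given in the question
    """
    n = len(ages)
    # m(x) = -LN(1 - q(x))
    m = [-math.log(1.0 - qx) for qx in q]

    # l(x+1) = l(x) * EXP(-m(x)); the list ends with one extra entry, l(N)
    l = [radix]
    for mx in m:
        l.append(l[-1] * math.exp(-mx))

    # d(x) = l(x) - l(x+1) and L(x) = l(x+1) + 0.5 * d(x)
    d = [l[i] - l[i + 1] for i in range(n)]
    big_l = [l[i + 1] + 0.5 * d[i] for i in range(n)]

    # T(x) = SUM(L(x)) from x to N, built as a running sum from the last row up
    t = [0.0] * n
    running = 0.0
    for i in range(n - 1, -1, -1):
        running += big_l[i]
        t[i] = running

    return [
        {"age": ages[i], "lx": l[i], "dx": d[i], "qx": q[i], "mx": m[i],
         "Lx": big_l[i], "Tx": t[i], "ex": t[i] / l[i]}
        for i in range(n)
    ]
```

The point of the sketch is the dependency order: m(x) needs only the current row, l(x) needs the previous row, and T(x) needs every later row, so it is easiest to fill in a second pass that walks backwards.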

How to make Hive Terminal show rows (not just headers) after code is run?

As of now, Hive Terminal is showing only column headers after a create table code is run. What settings should I change to make Hive Terminal show few rows also, say first 100 rows?
Code I am using to create table t2 from table t1 which resides in the database (I don't know how t1 is created):
create table t2 as
select *
from t1
limit 100;
During development, I currently run select * from t2 limit 100; after each create table statement to see the rows with headers.
You cannot
The Hive Create Table documentation does not mention anything about showing records. This, combined with my experience in Hive makes me quite confident that you cannot achieve this by mere regular config changes.
Of course you could tap into the code of hive itself, but that is not something to be attempted lightly.
And you should not want to
Changing the create command could lead to all kinds of problems. Especially because unlike the select command, it is in fact an operation on metadata, followed by an insert. Both of these normally would not show you anything.
If you would create a huge table, it would be problematic to show everything. If you choose always to just show the first 100 rows, that would be inconsistent.
There are ways
Now, there are some things you could do:
Change hive itself (not easy, probably not desirable)
Do it in 2 steps (what you currently do)
Write a wrapper:
If you want to automate things and don't like code duplication, you can look into writing a small wrapper function to call the create and select based on just the input of source (and limit) and destination.
This kind of wrapper could be written in bash, python, or whatever you choose.
However, note that if you like executing the commands ad-hoc/manually this may not be suitable, as you will need to start a hive JVM each time you run such a program and thus response time is expected to be slow.
All in all you are probably best off just doing the create first and select second.
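For what the wrapper idea might look like, here is a rough sketch in Python. The function names and the identifier check are assumptions for illustration. The only Hive feature relied on is the CLI's -e option for running a query string. As noted above, each call starts a new Hive JVM, so expect it to be slow.

```python
import re
import subprocess

# Hypothetical safety check: allow plain or db-qualified table names only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_statements(source, destination, limit=100):
    """Return the CREATE TABLE AS and the follow-up SELECT as two strings."""
    for name in (source, destination):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected table name: {name!r}")
    limit = int(limit)
    create = f"CREATE TABLE {destination} AS SELECT * FROM {source} LIMIT {limit}"
    preview = f"SELECT * FROM {destination} LIMIT {limit}"
    return [create, preview]


def create_and_preview(source, destination, limit=100, hive_cmd="hive"):
    """Run both statements in one Hive CLI call and return its stdout."""
    script = "; ".join(build_statements(source, destination, limit)) + ";"
    result = subprocess.run(
        [hive_cmd, "-e", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling create_and_preview("t1", "t2") would then do the create and the select in one go, which is the same two steps as before, just without retyping them.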
The below command mentioned seems to be correct to show the first 100 rows:
select * from <created_table> limit 100;
Pasting the code you used to create the table would help to diagnose the issue.
Nevertheless, check that you have correctly specified the delimiters for the elements, key-value pairs, collection items, etc. when creating the table.
If they are not defined correctly, you might end up with only the first row (the header) being shown.

Join tablevariable in a loop

I'm using SAP HANA and would like to join the output results of a procedure call inside a loop. Is there any way to do this?
Something similar to the following, but the problem is the duplicate attribute name:
FOR i IN 1..:nYEARS DO
CALL FUTUREREVENUES (:i,resulttemp);
result = SELECT * FROM :result t1
INNER JOIN :resulttemp t2
ON t1.ID = t2.ID;
END FOR;
Nope, the main problem is not the duplicate attribute name; SAP HANA actually allows that in a projection, as long as the attribute is uniquely identifiable.
What you are trying to do here is not a good idea in any statically typed language, such as SQL. Basically, the structure of your return type result depends on the input, i.e. how often the loop gets executed.
What you would need here is some way of dynamic SQL that adjusts the projected column names with each iteration. While this appears to be a straightforward approach, it's actually the opposite.
Every consumer of the result data is forced to accept whatever table comes out of this loop, without even knowing how the projected columns will be named.
That's hard to handle and makes the solution hard to reuse.
An alternative approach could be to have a fixed output table structure, say 5 years forecast (if you can even predict anything that far with any certainty), and no dynamic column names.
Instead, you could e.g. name the columns FC+Year1, FC+Year2, ...
That way, the output structure stays the same and all the client application has to do, is to match the output labels according to the baseline year (the input into your prediction).
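To illustrate the fixed-structure idea outside of SQLScript, here is a small Python sketch. The function names and the shape of the input (one dictionary of ID to value per forecast year) are assumptions made for the example. Unlike the INNER JOIN in the question, missing values are left as None instead of dropping the row.

```python
def pivot_forecasts(per_year_results, horizon=5):
    """Lay out per-year results in a fixed set of FC+YearN columns.

    per_year_results -- list of dicts {id: value}, one dict per forecast year
    horizon          -- fixed number of forecast columns in the output
    """
    columns = [f"FC+Year{k}" for k in range(1, horizon + 1)]
    ids = sorted(set().union(*per_year_results))
    rows = []
    for id_ in ids:
        row = {"ID": id_}
        for k, col in enumerate(columns):
            # Years beyond the supplied results stay empty (None)
            year_result = per_year_results[k] if k < len(per_year_results) else {}
            row[col] = year_result.get(id_)
        rows.append(row)
    return columns, rows


def forecast_labels(baseline_year, horizon=5):
    """Client side: map the fixed column names to calendar years."""
    return {f"FC+Year{k}": baseline_year + k for k in range(1, horizon + 1)}
```

The column list never changes with the number of loop iterations, so a consumer only needs the baseline year to label the output.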

TSQL: Is there a way to limit the rows returned and count the total that would have been returned without the limit (without adding it to every row)?

I'm working to update a stored procedure that currently selects up to n rows; if the number of rows returned = n, it does a select count without the limit, and then returns the original select and the total impacted rows.
Kinda like:
SELECT TOP (@rowsToReturn)
A.data1,
A.data2
FROM
mytable A
SET @maxRows = @@ROWCOUNT
-- compare against @maxRows: the SET above resets @@ROWCOUNT to 1
IF @rowsToReturn = @maxRows
BEGIN
SET @maxRows = (SELECT COUNT(1) FROM mytable A)
END
I'd like to reduce this to a single select statement. Based on this question, COUNT(*) OVER() allows this, but it is put on every single row instead of in an output parameter. Maybe something like FOUND_ROWS() in MySQL, such as a @@TOTALROWCOUNT or similar.
As a side note, since the actual select has an ORDER BY, the database will already need to traverse the entire set (to make sure that it gets the correct first n ordered records), so the database should already have this count somewhere.
As @MartinSmith mentioned in a comment on this question, there is no direct (i.e. pure T-SQL) way of getting the total number of rows that would be returned while at the same time limiting it. In the past I have done the method of:
dump the query to a temp table to grab @@ROWCOUNT (the total set)
use ROW_NUMBER() AS [ResultID] on the ordered results of the main query
SELECT TOP (n) FROM #Temp ORDER BY [ResultID] or something similar
Of course, the downside here is that you have the disk I/O cost of getting those records into the temp table. Put [tempdb] on SSD? :)
I have also experienced the "run COUNT(*) with the same rest of the query first, then run the regular SELECT" method (as advocated by @Blam), and it is not a "free" re-run of the query:
It is a full re-run in many cases. The issue is that when doing COUNT(*) (hence not returning any fields), the optimizer only needs to worry about indexes in terms of the JOIN, WHERE, GROUP BY, ORDER BY clauses. But when you want some actual data back, that could change the execution plan quite a bit, especially if the indexes used to get the COUNT(*) are not "covering" for the fields in the SELECT list.
The other issue is that even if the indexes are all the same and hence all of the data pages are still in cache, that just saves you from the physical reads. But you still have the logical reads.
I'm not saying this method doesn't work, but I think the method in the Question that only does the COUNT(*) conditionally is far less stressful on the system.
The method advocated by @Gordon is actually functionally very similar to the temp table method I described above: it dumps the full result set to [tempdb] (the INSERTED table is in [tempdb]) to get the full @@ROWCOUNT and then it gets a subset. On the downside, the INSTEAD OF TRIGGER method is:
a lot more work to set up (as in 10x - 20x more): you need a real table to represent each distinct result set, you need a trigger, the trigger needs to either be built dynamically, or get the number of rows to return from some config table, or I suppose it could get it from CONTEXT_INFO() or a temp table. Still, the whole process is quite a few steps and convoluted.
very inefficient: first it does the same amount of work dumping the full result set to a table (i.e. into the INSERTED table--which lives in [tempdb]) but then it does an additional step of selecting the desired subset of records (not really a problem as this should still be in the buffer pool) to go back into the real table. What's worse is that second step is actually double I/O as the operation is also represented in the transaction log for the database where that real table exists. But wait, there's more: what about the next run of the query? You need to clear out this real table. Whether via DELETE or TRUNCATE TABLE, it is another operation that shows up (the amount of representation based on which of those two operations is used) in the transaction log, plus is additional time spent on the additional operation. AND, let's not forget about the step that selects the subset out of INSERTED into the real table: it doesn't have the opportunity to use an index since you can't index the INSERTED and DELETED tables. Not that you always would want to add an index to the temp table, but sometimes it helps (depending on the situation) and you at least have that choice.
overly complicated: what happens when two processes need to run the query at the same time? If they are sharing the same real table to dump into and then select out of for the final output, then there needs to be another column added to distinguish between the SPIDs. It could be @@SPID. Or it could be a GUID created before the initial INSERT into the real table is called (so that it can be passed to the INSTEAD OF trigger via CONTEXT_INFO() or a temp table). Whatever the value is, it would then be used to do the DELETE operation once the final output has been selected. And if not obvious, this part influences a performance issue brought up in the prior bullet: TRUNCATE TABLE cannot be used as it clears the entire table, leaving DELETE FROM dbo.RealTable WHERE ProcessID = @WhateverID; as the only option.
Now, to be fair, it is possible to do the final SELECT from within the trigger itself. This would reduce some of the inefficiency as the data never makes it into the real table and then also never needs to be deleted. It also reduces the over-complication as there should be no need to separate the data by SPID. However, this is a very time-limited solution as the ability to return results from within a trigger is going bye-bye in the next release of SQL Server, so sayeth the MSDN page for the disallow results from triggers Server Configuration Option:
This feature will be removed in the next version of Microsoft SQL Server. Do not use this feature in new development work, and modify applications that currently use this feature as soon as possible. We recommend that you set this value to 1.
The only actual way to do:
the query one time
get a subset of rows
and still get the total row count of the full result set
is to use .Net. If the procs are being called from app code, please see "EDIT 2" at the bottom. If you want to be able to randomly run various stored procedures via ad hoc queries, then it would have to be a SQLCLR stored procedure so that it could be generic and work for any query as stored procedures can return dynamic result sets and functions cannot. The proc would need at least 3 parameters:
@QueryToExec NVARCHAR(MAX)
@RowsToReturn INT
@TotalRows INT OUTPUT
The idea is to use "Context Connection = true;" to make use of the internal / in-process connection. You then do these basic steps:
1. call ExecuteReader()
2. before you read any rows, do a GetSchemaTable()
3. from the SchemaTable you get the result set field names and datatypes
4. from the result set structure you construct a SqlDataRecord
5. with that SqlDataRecord you call SqlContext.Pipe.SendResultsStart(_DataRecord)
6. now you start calling Reader.Read()
7. for each row you call:
Reader.GetValues()
DataRecord.SetValues()
SqlContext.Pipe.SendResultsRow(_DataRecord)
RowCounter++
8. Rather than doing the typical "while (Reader.Read())", you instead include the @RowsToReturn param, checking the counter first so that no row is read and then left uncounted: while (RowCounter < RowsToReturn.Value && Reader.Read())
9. After that while loop, call SqlContext.Pipe.SendResultsEnd() to close the result set (the one that you are sending, not the one you are reading)
10. then do a second while loop that cycles through the rest of the result, but never gets any of the fields:
while (Reader.Read())
{
RowCounter++;
}
11. then just set TotalRows = RowCounter; which will pass back the number of rows for the full result set, even though you only returned the top n rows of it :)
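The core of those steps (run the query once, keep the first n rows, then keep reading without touching the fields just to count) can be sketched outside of SQLCLR. The version below uses Python with the standard sqlite3 module purely as a stand-in for SqlDataReader, so it shows the reading pattern only, not the SqlContext.Pipe part. The function name is made up for the example.

```python
import sqlite3


def fetch_top_and_total(conn, query, rows_to_return, params=()):
    """Run the query once; return (first n rows, total row count)."""
    cur = conn.execute(query, params)
    top_rows = []
    row_counter = 0

    # First loop: check the counter before reading, so no row is
    # fetched and then dropped without being counted.
    while row_counter < rows_to_return:
        row = cur.fetchone()
        if row is None:
            break
        top_rows.append(row)
        row_counter += 1

    # Second loop: cycle through the rest of the result without
    # keeping any of the values, only counting.
    for _ in cur:
        row_counter += 1

    return top_rows, row_counter


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (data1 INTEGER, data2 TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                     [(i, f"row{i}") for i in range(10)])
    rows, total = fetch_top_and_total(
        conn, "SELECT data1, data2 FROM mytable ORDER BY data1", 3)
    print(rows, total)
```

The rows past the limit still have to be produced by the database and read by the client, so this saves sending them onward, not generating them.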
Not sure how this performs against the temp table method, the dual call method, or even @M.Ali's method (which I have also tried and kinda like, but the question was specific to not sending the value as a column), but it should be fine and does accomplish the task as requested.
EDIT:
Even better! Another option (a variation on the above C# suggestion) is to use the @@ROWCOUNT from the T-SQL stored procedure, sent as an OUTPUT parameter, rather than cycling through the rest of the rows in the SqlDataReader. So the stored procedure would be similar to:
CREATE PROCEDURE SchemaName.ProcName
(
@Param1 INT,
@Param2 VARCHAR(05),
@RowCount INT = -1 OUTPUT -- default so it doesn't have to be passed in
)
AS
SET NOCOUNT ON;
{any ol' query}
SET @RowCount = @@ROWCOUNT;
Then, in the app code, create a new SqlParameter, Direction = Output, for "@RowCount". The numbered steps above stay the same, except the last two (10 and 11), which change to:
10. Instead of the 2nd while loop, just call Reader.Close()
11. Instead of using the RowCounter variable, set TotalRows = (int)RowCountOutputParam.Value;
I have tried this and it does work. But so far I have not had time to test the performance against the other methods.
EDIT 2:
If the T-SQL stored procs are being called from the app layer (i.e. no need for ad hoc execution) then this is actually a much simpler variation of the above C# methods. In this case you don't need to worry about the SqlDataRecord or the SqlContext.Pipe methods. Assuming you already have a SqlDataReader set up to pull back the results, you just need to:
Make sure the T-SQL stored proc has a @RowCount INT = -1 OUTPUT parameter
Make sure to SET @RowCount = @@ROWCOUNT; immediately after the query
Register the OUTPUT param as a SqlParameter having Direction = Output
Use a loop similar to: while(Reader.Read() && RowCounter < RowsToReturn) so that you can stop retrieving results once you have pulled back the desired amount.
Remember to not limit the result in the stored proc (i.e. no TOP (n))
At that point, just like what was mentioned in the first "EDIT" above, just close the SqlDataReader and grab the .Value of the OUTPUT param :).
How about this....
DECLARE @N INT = 10
;WITH CTE AS
(
SELECT
A.data1,
A.data2
FROM mytable A
)
SELECT TOP (@N) * , (SELECT COUNT(*) FROM CTE) Total_Rows
FROM CTE
The last column will be populated with the total number of rows it would have returned without the TOP Clause.
The issue with your requirement is that you are expecting a SINGLE select statement to return both a table and a scalar value, which is not possible.
A single select statement will return a table or a scalar value. OR you can have two separate selects, one returning a table and the other returning a scalar. Choice is yours :)
Just because you think T-SQL should have a row count because of a sort does not mean it does. And if it does, it is not currently sharing it with the outside world.
What you are missing is this is very efficient
select count(*)
from ...
where ...
select top x
from ...
where ...
order by ...
With the count(*) unless the query is just plain ugly those indexes should be in memory.
It has to perform a count to sort based on what?
Did you actually evaluate any query plans?
If TSQL has to perform a sort then explain the following.
Why is the count(*) 100% of the cost when the second had to do a count anyway?
Just where in that second query plan is there a free opportunity to count?
Why are those query plans so different if they both need to count?
I think there is an arcane way to do what you want. It involves triggers and non-temporary tables. And, I should mention, although I have implemented each piece (for different purposes), I have never put them together for this purpose.
The idea starts with this Stack Overflow question. According to this source, @@ROWCOUNT counts the number of attempted inserts, even when they don't really happen. Now, I must admit that a perusal of available documentation doesn't seem to touch on this topic, so this may or may not be "correct" behavior. This method is relying on this "problem".
So, you could do what you want by:
Creating a new table for the output -- but not a table variable or a temporary table.
Creating an "instead of" trigger that prevents more than @maxRows from going into the table.
Select the query results into the table.
Read @@ROWCOUNT after the select.
Note that you can create the table and trigger using dynamic SQL. You could also create it once, and have the trigger read the @maxRows value from some sort of parameter table. As mentioned before, this needs to be a real table that supports triggers.

Hide Empty columns

I have a table with 75 columns. What is the SQL statement to display only the columns that have values in them?
thanks
It's true that a similar statement doesn't exist (in a SELECT you can use condition filters only for the rows, not for the columns). But you could try to write a (bit tricky) procedure. It must check which are the columns that contains at least one not NULL/empty value, using queries. When you get this list of columns just join them in a string with a comma between each one and compose a query that you can run, returning what you wanted.
EDIT: I thought about it and I think you can do it with a procedure but under one of these conditions:
find a way to retrieve column names dynamically in the procedure, that is the metadata (I never heard about it, but I'm new with procedures)
or hardcode all column names (losing generality)
You could collect column names inside an array, if stored procedures of your DBMS support arrays (or write the procedure in a programming language like C), and loop on them, making a SELECT each time, checking if it's an empty* column or not. If it contains at least one value concatenate it in a string where column names are comma-separated. Finally you can make your query with only not-empty columns!
Alternatively to stored procedure you could write a short program (eg in Java) where you can deal with a better flexibility.
*if you check for NULL values it will be simple, but if you check for empty values you will need to manage with each column data type... another array with data types?
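As a sketch of the "short program" route, here is roughly how that could look in Python with the standard sqlite3 module standing in for whatever DBMS you use. The function names are made up, the column names are read from the cursor metadata of an empty SELECT, and only NULL is treated as empty (empty strings and zeros count as values).

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier for use in generated SQL."""
    return '"' + name.replace('"', '""') + '"'


def non_empty_columns(conn, table):
    """Return the columns of `table` that hold at least one non-NULL value."""
    tbl = quote_ident(table)
    # A SELECT that returns no rows still exposes the column names
    cur = conn.execute(f"SELECT * FROM {tbl} LIMIT 0")
    all_columns = [desc[0] for desc in cur.description]

    keep = []
    for col in all_columns:
        probe = f"SELECT 1 FROM {tbl} WHERE {quote_ident(col)} IS NOT NULL LIMIT 1"
        if conn.execute(probe).fetchone() is not None:
            keep.append(col)
    return keep


def select_non_empty(conn, table):
    """Build and run a SELECT listing only the non-empty columns."""
    columns = non_empty_columns(conn, table)
    if not columns:
        return [], []
    column_list = ", ".join(quote_ident(c) for c in columns)
    rows = conn.execute(f"SELECT {column_list} FROM {quote_ident(table)}").fetchall()
    return columns, rows
```

This runs one small probe query per column, so for a 75-column table it issues 75 probes plus the final SELECT.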
I would suggest that you write a SELECT statement and define which COLUMNS you wish to display and then save that QUERY as a VIEW.
This will save you the trouble of typing in the column names every time you wish to run that query.
As marc_s pointed out in the comments, there is no select statement to hide columns of data.
You could do a pre-parse and dynamically create a statement to do this, but this would be very inefficient from a SQL performance perspective. I would strongly advise against what you are trying to do.
A simplified version of this is to just select the relevant columns, which was what I needed personally. A quick search of what we're dealing with in a table
SELECT * FROM table1 LIMIT 10;
-> shows 20 columns, of which I'm interested in 3. The LIMIT is just to avoid flooding the console.
SELECT column1, column3, column19 FROM table1 WHERE column3 = 'valueX';
It is a bit of a manual filter but it works for what I need.

Can I maintain state between calls to a SQL Server UDF?

I have a SQL script that inserts data (via INSERT statements currently numbering in the thousands) One of the columns contains a unique identifier (though not an IDENTITY type, just a plain ol' int) that's actually unique across a few different tables.
I'd like to add a scalar function to my script that gets the next available ID (i.e. last used ID + 1) but I'm not sure this is possible because there doesn't seem to be a way to use a global or static variable from within a UDF, I can't use a temp table, and I can't update a permanent table from within a function.
Currently my script looks like this:
declare @v_baseID int
exec dbo.getNextID @v_baseID out --sproc to get the next available id
--Lots of these - where n is a hardcoded value
insert into tableOfStuff (someStuff, uniqueID) values ('stuff', @v_baseID + n )
exec dbo.UpdateNextID @v_baseID + lastUsedn --sproc to update the last used id
But I would like it to look like this:
--Lots of these
insert into tableOfStuff (someStuff, uniqueID) values ('stuff', getNextID() )
Hardcoding the offset is a pain in the arse, and is error prone. Packaging it up into a simple scalar function is very appealing, but I'm starting to think it can't be done that way since there doesn't seem to be a way to maintain the offset counter between calls. Is that right, or is there something I'm missing.
We're using SQL Server 2005 at the moment.
edits for clarification:
Two users hitting it won't happen. This is an upgrade script that will be run only once, and never concurrently.
The actual sproc isn't prefixed with sp_, fixed the example code.
In normal usage, we do use an id table and a sproc to get IDs as needed, I was just looking for a cleaner way to do it in this script, which essentially just dumps a bunch of data into the db.
I'm starting to think it can't be done that way since there doesn't seem to be a way to maintain the offset counter between calls. Is that right, or is there something I'm missing.
You aren't missing anything; SQL Server does not support global variables, and it doesn't support data modification within UDFs. And even if you wanted to do something as kludgy as using CONTEXT_INFO (see http://weblogs.sqlteam.com/mladenp/archive/2007/04/23/60185.aspx), you can't set that from within a UDF anyway.
Is there a way you can get around the "hardcoding" of the offset by making that a variable and looping over the iteration of it, doing the inserts within that loop?
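As a rough illustration of that loop idea, here is a sketch in Python with sqlite3 instead of T-SQL, so the counter lives in the driving script rather than in a UDF. The function name is hypothetical, the table and column names come from the question, and the sketch assumes the base ID itself is the first one to use.

```python
import sqlite3


def insert_with_sequential_ids(conn, stuff_values, base_id):
    """Insert one row per value, numbering them base_id, base_id + 1, ...

    Returns the next unused ID, to be handed to whatever updates the
    ID bookkeeping afterwards.
    """
    next_id = base_id
    for stuff in stuff_values:
        conn.execute(
            "INSERT INTO tableOfStuff (someStuff, uniqueID) VALUES (?, ?)",
            (stuff, next_id),
        )
        next_id += 1  # the offset is tracked here, not hardcoded per insert
    return next_id
```

The data to insert becomes a plain list, and the offset never has to be typed by hand.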
If you have 2 users hitting it at the same time they will get the same id. Why didn't you use an id table with an identity instead, insert into that and use that as the unique (which is guaranteed) id, this will also perform much faster
sp_getNextID
Never ever prefix procs with sp_. This has a performance implication because the optimizer first checks the master DB to see if that proc exists there and then the local DB. Also, if MS decides to create an sp_getNextID in a service pack, yours will never get executed.
It would probably be more work than it's worth, but you can use static C#/VB variables in a SQL CLR UDF, so I think you'd be able to do what you want to do by simply incrementing this variable every time the UDF is called. The static variable would be lost whenever the appdomain unloaded, of course. So if you need continuity of your ID from one day to the next, you'd need a way, on first access of NextId, to poll all of tables that use this ID, to find the highest value.