I am running into problems doing a fairly basic summation.
My dataset is composed of company IDs (cusip8) and their daily (date) abnormal returns (AR). I need to sum the abnormal returns of each company from day t+3 through day t+60.
cusip8      year  date   ret                   vwretd                AR
"00030710"  2014  19998  .                     .0007672172           .
"00030710"  2014  19999  .008108087815344334   .009108214            -.0010001262
"00030710"  2014  20002  .03163539618253708    -.00158689            .033222288
"00030710"  2014  20003  0                     -.014999760000000001  .01499976
"00030710"  2014  20004  -.005717287305742502  .0158898              -.02160709
"00030710"  2014  20005  .006272913888096809   -.02121511            .027488023
"00030710"  2014  20006  -.012987012974917889  -.01333873            .000351717
I have tried the following:
sort cusip8 date
by cusip8: gen CAR = AR if _n==1
(24,741,003 missing values generated)
by cusip8: replace CAR = AR +CAR[_n-1] if _n>3 & if _n<60
Yet I am left with nothing but missing values (.) in the newly generated variable. Does anyone know how to solve this?
I am using Stata 16.0.
You have more problems than one. First, let's address your problem report.
In each panel, CAR[2] is created missing by your code, which creates CAR only in the first observation. That messes up all subsequent calculations: for example, CAR[3] is AR[3] + CAR[2] and so missing, CAR[4] is AR[4] + CAR[3] and so missing, and so on.
Contrary to your claim, in each panel CAR[1] should be non-missing whenever AR is.
Second, evidently you have gaps for days 20000 and 20001, which fell on a weekend. dow() returns 6 for Saturday and 0 for Sunday from daily dates (for which 0 is 1 January 1960).
. di dow(20000)
6
. di dow(20001)
0
. di %td 20000
04oct2014
So, either set up a business calendar to exclude weekends and holidays, or decide that you just want to use whatever is available within particular windows based on daily dates.
Third, your wording is not precise enough to make your problem unambiguous to anyone who doesn't routinely deal with your kind of data. It seems that you seek a cumulative (running) sum, but the window could be just one window (as your question literally implies) or a moving window (which is my guess). The function sum() gives cumulative or running sums: see help sum(). Just possibly,
bysort cusip8 (date): gen wanted = sum(AR)
is a start on your solution. Otherwise, ssc describe rangestat shows you a command good for moving window calculations.
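For example, if for each observation you want the sum of AR over the window from 3 to 60 days after its date, counting calendar days so that weekend gaps simply fall out of the window, a sketch like this may be close (the new variable name AR_sum_3_60 is arbitrary):
* rangestat is user-written; install once with: ssc install rangestat
rangestat (sum) AR_sum_3_60 = AR, interval(date 3 60) by(cusip8)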
There are hundreds of posts in this territory on Statalist.
I have the Formula:
(([Measures]. [QuantityPathology])/([Measures]. [QuantityPersons],
[DimPathology.Pathologies].[All])) * 100
The numerator is the measure QuantityPathology and the denominator is QuantityPeople evaluated at the All member of the Pathologies hierarchy, that is, the total number of people served in that period regardless of the selected pathology.
But it does not work: Saiku analysis does not display any results. What may I be doing wrong?
Attached is an example of the formula
(([Measures], [QuantityPathology])/([Measures], [QuantityPeople],
[DimPathology.Pathologies][All])) * 100
QuantityPathology with Bronchitis year 2019 = 61
Number of people served in the year 2019 = 4569
61 / 4569 = 0.013350843
Result = 0.013350843 * 100 = 1.3350843
Also, avg(NumberPeople) does not work: in all rows it returns 1.
I don't see any error in catalina.out.
Thank you very much.
I am trying to improve the performance of updating only about 60K rows with data coming from different rows in the same table. At about 2 minutes, it's not terrible, but it's not great either, and my application really doesn't work if you have to wait so long between recalculations.
The app generates a set of financial statements for a business, where it calculates basic formulas on 1300 line items, like Rent, Direct Labor, or Inventory costs, all of which roll up to totals that mimic the Balance Sheet, P&Ls, Cash Flow, etc. Many of the line items need to be calculated on a month-by-month basis, where for instance it has to figure out April's On Hand Inventory before knowing what April's Inventory Value is. So the total program ends up looping through 48 months over 30 calculation passes, requiring about 8000 SQL statements. (Fortunately it figures it all out by itself!) Each SQL statement takes only a few milliseconds, but it adds up.
I'm pretty sure I can't reduce the number of loops, so I keep trying to figure out how to make each SQL quicker. The basic structure is as follows:
LI: Line item table that holds the basic info of each item, primary key LID
LID Name
123 Sales_1
124 Sales_2
200 Total Sales
Formula: Master/Detail tables that create any formula from the line items
Total sales=Sales_1 + Sales_2
or
{200}={123}+{124}
(I use curly braces to be able to find and replace the LIDs within the formula, as shown in the SQL below)
FC: Formula Calculation table: all line items by month, about 1300 items x 48 months=62K records. Primary key FID
FID   SQL_ID  LID  LID_brace  LIN          OutputMonth  Formula      Amount
3232  25      123  {123}      Sales_1      1                         1200
3255  26      124  {124}      Sales_2      1                         1500
5454  177     200  {200}      Total Sales  1            {123}+{124}
DMO: Operand join table, which links a formula to its detail lines within the same table, so once Sales_1 is calculated, it can find the Total Sales record and update it, which will then evaluate and send its amount up the chain to the other LIDs that depend on it, such as Total Income. It locates the record to update based on the SQL_ID, which is set based on the calc pass and month. It's complex to set up, but pretty straightforward once you actually run things.
Master_FID Detail_FID
5454 3232 (links total sales to sales_1)
5454 3255 (links total sales to sales_2)
SQL1:
UPDATE (FC INNER JOIN DMO ON FC.FID = DMO.Master_FID) INNER JOIN FC AS FC2 ON DMO.Detail_FID = FC2.FID
SET FC.Formula = Replace(FC.Formula, FC2.LID_brace, FC2.Amount)
WHERE FC.SQL_ID = 177
The above will change {123} + {124} to 1200+1500, which will then evaluate to 2700 when I run the following:
SQL2:
UPDATE FC SET FC.Amount = Eval([FC].[Formula]) WHERE FC.calc_sql_id = 177
So those two SQL statements are run over and over again, with the only thing changing being the SQL_ID.
There are indexes on SQL_ID, LID, FID, etc.
When measuring, the milliseconds per record can range from .04 ms when many records are included (~10K for some passes) up to 10 or 15 ms when just one record is updated. Perhaps it is the setup of the query causing a whole lot of overhead, because it doesn't seem to be a function of the actual number of records updated? Also it's not very consistent: some runs take 20+ ms compared to less than 3 ms when it runs again.
I know this is a complex question I'm asking that probably doesn't have a simple answer, but I'm just looking for directions on what might help. For instance, a parameter query if there isn't a whole lot of change between runs? Does Access have an easier time running a query if it knows about it in advance, i.e. a named query with parameters vs. dynamic SQL? Am I just doomed because it still needs to run those 8000 queries?
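For concreteness, the kind of saved parameter query I have in mind would be something like this sketch (prmSqlID is just a placeholder name, and I haven't tested whether it is actually faster):
PARAMETERS prmSqlID Long;
UPDATE (FC INNER JOIN DMO ON FC.FID = DMO.Master_FID)
INNER JOIN FC AS FC2 ON DMO.Detail_FID = FC2.FID
SET FC.Formula = Replace(FC.Formula, FC2.LID_brace, FC2.Amount)
WHERE FC.SQL_ID = prmSqlID;
Each pass would then just set prmSqlID on the saved QueryDef and execute it, instead of building a new SQL string every time.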
Also, is there inherently a problem with trying to update the same table through a secondary join table, and/or is there a better way to do it?
Is it also because string replacing isn't efficient this way? If I tried RegEx, would that be quicker? I would have to write a function that could do that within a query, but it seems like that's going to be slower.
Thanks in advance, this has been a most vexing problem!!!
I have two datasets, one in SQL Server and one in Oracle, that I'm trying to combine through SSRS. Technically I have two questions regarding the results I'm getting.
Dataset 1 - Sales - MS
Part Location Transaction_date QTY_SOLD
1234 New York 06/01/2017 1
1235 New York 06/01/2017 4
Dataset 2 - Returns - Oracle
Part Location Purchase_Date QTY_RTN Reason
1235 New york 06/01/2017 2 Broken
What I'm wanting to get:
Part Location Date QTY_SOLD QTY_RTN Reason
1234 New York 06/01/2017 1 NULL NULL
1235 New York 06/01/2017 4 2 Broken
I have lookup expressions set to join on part, location, date for qty_rtn and reason columns.
Part one: 1234, which has no returns, does not show up. The first dataset should return ~1400 items. The second dataset should theoretically return the same, but since that info is manually entered, the purchase_date does not always match the transaction_date (this is fine; half the purpose of this is to find those mistakes and get someone to go back and correct the data). When I run the query, I get ~400 items.
Part two: when I do a preview from within Studio, both the MS and Oracle data show up. When I pull from the web interface, only the MS data shows up. I've checked that the credentials on both sides are correct, and the connection strings are correct as well.
Any thoughts are appreciated.
Not sure what was broken, but I wound up deleting and re-creating the report from scratch and it works with all the data as it's supposed to. Web interface is also not missing data.
I am using SQL Server 2005.
I have a site where people can vote on awesome motorcycles. Each time a user votes, there is one vote for the first bike and one vote against the second bike. Two votes are stored in the database. The vote table looks like this:
VoteID VoteDate BikeID Vote
1 2012-01-12 123 1
2 2012-01-12 125 0
3 2012-01-12 126 0
4 2012-01-12 129 1
I want to tally the votes for each bike quite frequently, say each hour. My idea is to store the tally as a percentage of contests won versus lost, on the bike table as an attribute of the bike. So, if a bike won 10 contests and lost 20 contests, it would have a score (tally) of 33. I would tally up daily, weekly, and monthly scores.
BikeID BikeName DailyTally WeeklyTally MonthlyTally
1 Big Dog 5 10 50
2 Big Cat 3 15 40
3 Small Dog 9 8 0
4 Fish Face 19 21 0
Right now, there are about 500 votes per day being cast. We anticipate 2500 - 5000 per day in the next month or so.
What is the best way to tally the data and what is the best way to store it? Should the tallies be on their own table? Should a trigger be used to run a new tally each time a bike is voted on? Should a stored procedure be run hourly to get all tallies?
Any ideas would be very helpful!
Store your VoteDate as a datetime value instead of just date.
For your tallies, you can just make that a view and calculate it on the fly. This should be very simple to do using GROUP BY and DATEPART functions. If you need exact code for how to do this, please open a new question.
For that low volume of rows it doesn't make any sense to store aggregations in a table when you can just calculate them whenever you want to see them and get accurate and immediate results that are up-to-date.
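As a rough sketch of the shape (using the columns from the question, and assuming the vote table is named Vote and that Vote = 1 means a win), a daily tally view could look like this:
CREATE VIEW dbo.BikeDailyTally
AS
SELECT
    BikeID,
    DATEPART(YEAR, VoteDate) AS VoteYear,
    DATEPART(DAYOFYEAR, VoteDate) AS VoteDayOfYear,
    -- integer division gives a whole-number percentage, e.g. 10 wins out of 30 votes -> 33
    100 * SUM(CASE WHEN Vote = 1 THEN 1 ELSE 0 END) / COUNT(*) AS WinPct
FROM dbo.Vote
GROUP BY BikeID, DATEPART(YEAR, VoteDate), DATEPART(DAYOFYEAR, VoteDate);
Weekly and monthly versions are the same idea with DATEPART(WEEK, VoteDate) or DATEPART(MONTH, VoteDate).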
I agree with #JNK: try a view or just a normal stored proc to calculate the outputs on the fly. If you find it becomes too slow as your data grows, I would investigate other routes then (like caching the data in another table, etc.). Probably worth keeping it simple to start with; you can always reuse the logic from the SP/view later if you do want to set up a scheduled task.
Edit:
Removed the indexed view as per #Damien_The_Unbeliever's comments; it's not deterministic and I'm stupid :)
This query is supposed to run with MS Access 2003 using SQL. The JOIN keyword is NOT supported explicitly; an implicit join in the WHERE clause is fine. Implicit joins anywhere are fine, as long as the words JOIN, INNER JOIN, etc. are not used.
DayNumber PastTime
.
.
.
333 Homework
333 TV
334 Date
620 Chores
620 Date
620 Homework
725 Chores
725 Date
888 Internet
888 TV
.
.
.
Hey, I would like a query that can show the most important pastime done for each day (TV and Internet do not count!). So importance would be Homework > Chores > Date. So:
DayNumber PastTime
333 Homework
334 Date
620 Homework
725 Chores
Something that might change this problem: although all the different pastimes are listed in one table together, that is only because I appended the tables. Originally the homework entries, chore entries, date entries, internet entries, and TV entries came from different tables, e.g.
homework 333
homework 620
Is it easier to do it without appending these tables first? I would ideally like it to be done with the appended table, but yeah.
I was thinking of a mixture of INSERT and DELETE, but the hardest part is checking whether there are several things recorded for a date, and how to keep the more important thing done that day. Thank you.
Create another table with:
Pri | PastTime
--------------
1 | Homework
2 | Chores
3 | Date
This is a priority list for the items.
Next do:
SELECT MIN(Pri), DayNumber
FROM PastTime_table, Priority_table
WHERE PastTime_table.PastTime = Priority_table.PastTime
GROUP BY DayNumber
This will give you the most important pastime for each day. And because TV and Internet are not listed, they will not show up.
But it will give you a number, and not the name.
If you had a better SQL dialect, you could then join this back to Priority_table and look up the name. But I guess you will have to do that part manually.
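That said, since your restriction is only on the JOIN keyword, a correlated subquery in the WHERE clause may let you get the name back without it. A sketch against the tables above (untested in Access 2003):
SELECT DISTINCT PastTime_table.DayNumber, PastTime_table.PastTime
FROM PastTime_table, Priority_table
WHERE PastTime_table.PastTime = Priority_table.PastTime
  AND Priority_table.Pri =
      (SELECT MIN(P2.Pri)
       FROM PastTime_table AS PT2, Priority_table AS P2
       WHERE PT2.PastTime = P2.PastTime
         AND PT2.DayNumber = PastTime_table.DayNumber);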
If you are willing to change the names and call them:
A_Homework
B_Chores
C_Date
instead, then you could do this (without any extra table):
SELECT MIN(PastTime), DayNumber
FROM PastTime_table
GROUP BY DayNumber
Since it sorts the names alphabetically, it will always give you the best one.
You can add a WHERE to remove TV and Internet.
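For example, something like this (assuming the TV and Internet rows keep their original, un-prefixed names):
SELECT DayNumber, MIN(PastTime) AS MostImportant
FROM PastTime_table
WHERE PastTime <> 'TV' AND PastTime <> 'Internet'
GROUP BY DayNumber;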