I am using CTE for the first time and am running into some difficulty. I have looked online and am trying to piece together examples.
I want to insert rows between each pair of returned rows to account for all the days in between. Row 1 has date A and row 2 has date B; I want to insert a row for every day between A and B, where each inserted row has the same values as row 1.
If I run only my anchor definition on my test data, I get 341 rows. After running the full CTE, I get 682, so it appears the recursion only runs once.
Any suggestions you can provide would be great. Thanks.
I have the following table schema:
field1 (varchar(10))
field2 (smalldatetime)
field3 (numeric(18,0))
field4 (numeric(18,6))
field5 (numeric(18,6))
field6 (numeric(18,3))
An example of the input table is:
ABC 1-1-1990 0 0.1 0.1 0.125
ABC 1-5-1990 1 0.2 0.2 1.0
ABC 1-9-1990 0 0.3 0.3 0.750
ABC 1-13-1990 1 0.4 0.4 1.500
I want to turn that into this:
ABC 1-1-1990 0 0.1 0.1 0.125
ABC 1-2-1990 0 0.1 0.1 0.125
ABC 1-3-1990 0 0.1 0.1 0.125
ABC 1-4-1990 0 0.1 0.1 0.125
ABC 1-5-1990 1 0.2 0.2 1.0
ABC 1-6-1990 1 0.2 0.2 1.0
ABC 1-7-1990 1 0.2 0.2 1.0
ABC 1-8-1990 1 0.2 0.2 1.0
ABC 1-9-1990 0 0.3 0.3 0.750
ABC 1-10-1990 0 0.3 0.3 0.750
ABC 1-11-1990 0 0.3 0.3 0.750
ABC 1-12-1990 0 0.3 0.3 0.750
ABC 1-13-1990 1 0.4 0.4 1.500
Here is my current CTE:
WITH NewData (field1,field2,field3,field4,field5,field6) AS
(
SELECT m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
FROM MyTable as m
WHERE m.field1 is not null
GROUP BY m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
UNION ALL
SELECT m.field1, DATEADD(d, 1, m.field2), m.field3, m.field4, m.field5, m.field6
FROM MyTable as m
)
SELECT field1, field2, field3, field4, field5, field6
FROM NewData
ORDER BY field1, field2
OPTION(MAXRECURSION 0)
Current Output (it misses dates 1-3-1990, 1-4-1990, 1-7-1990, 1-8-1990, 1-11-1990, 1-12-1990):
ABC 1-1-1990 0 0.1 0.1 0.125
ABC 1-2-1990 0 0.1 0.1 0.125
ABC 1-5-1990 1 0.2 0.2 1.0
ABC 1-6-1990 1 0.2 0.2 1.0
ABC 1-9-1990 0 0.3 0.3 0.750
ABC 1-10-1990 0 0.3 0.3 0.750
ABC 1-13-1990 1 0.4 0.4 1.500
Your CTE is not actually defined to be recursive: the portion you think is recursive never references the CTE itself, so the whole statement is just an ordinary UNION ALL query. That is why you get twice the rows (each source row plus a copy shifted one day), which makes it look as if the recursion ran exactly once, when in fact there is no recursion at all.
http://msdn.microsoft.com/en-us/library/ms186243.aspx
WITH NewData (field1,field2,field3,field4,field5,field6) AS
(
SELECT m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
FROM MyTable as m
WHERE m.field1 is not null
GROUP BY m.field1,m.field2,m.field3,m.field4,m.field5,m.field6
UNION ALL
SELECT m.field1, DATEADD(d, 1, m.field2), m.field3, m.field4, m.field5, m.field6
FROM MyTable as m
INNER JOIN NewData n on n.field1 = m.field1
)
I am not entirely sure what join condition you want to recurse on, so I have just used field1 in the code example; basically, use that join to define how the rows relate. Note that the recursive member also needs a termination condition in a WHERE clause, otherwise the recursion never ends; a complete sketch follows.
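For the gap-filling goal in the question, here is a hedged sketch of a complete recursive CTE (assuming SQL Server 2012 or later for LEAD; table and column names as in the question, so treat it as a starting point rather than a drop-in answer). Each row carries the next row's date along, and the recursive member steps forward one day at a time until it reaches the day before that next date:
WITH Ordered AS
(
    -- attach each row's following date within the same field1
    SELECT field1, field2, field3, field4, field5, field6,
           LEAD(field2) OVER (PARTITION BY field1 ORDER BY field2) AS next_date
    FROM MyTable
    WHERE field1 IS NOT NULL
),
NewData AS
(
    SELECT field1, field2, field3, field4, field5, field6, next_date
    FROM Ordered
    UNION ALL
    -- step one day forward, stopping the day before the next row's date
    SELECT field1, DATEADD(d, 1, field2), field3, field4, field5, field6, next_date
    FROM NewData
    WHERE next_date IS NOT NULL
      AND DATEADD(d, 1, field2) < next_date
)
SELECT field1, field2, field3, field4, field5, field6
FROM NewData
ORDER BY field1, field2
OPTION (MAXRECURSION 0)
On the sample data this returns the original four rows plus the missing days (1-2 through 1-4, 1-6 through 1-8, 1-10 through 1-12), each filler row copying the values of the row before the gap.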
I created this simple example to illustrate my issue:
import pandas as pd

x = pd.DataFrame({
    "int_var1": range(3),
    "int_var2": range(3, 6),
    "cat_var": pd.Categorical(["a", "b", "a"]),
    "value": [0.1, 0.2, 0.3],
})
it yields this DataFrame:
int_var1 int_var2 cat_var value
0 3 a 0.1
1 4 b 0.2
2 5 a 0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I try to use groupby followed by agg, it seems I have only two options. Either I can show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
1 4 b 0.2
2 5 a 0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=False).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
b 0.0
4 a 0.0
b 0.0
5 a 0.0
b 0.0
1 3 a 0.0
b 0.0
4 a 0.0
b 0.2
5 a 0.0
b 0.0
2 3 a 0.0
b 0.0
4 a 0.0
b 0.0
5 a 0.3
b 0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'],observed=True)
.agg({'value':'sum'})
.unstack('cat_var',fill_value=0)
)
Output:
value
cat_var a b
int_var1 int_var2
0 3 0.1 0.0
1 4 0.0 0.2
2 5 0.3 0.0
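If you then want the long format back, with the unobserved categories present as explicit zero rows, a hedged follow-up sketch is to stack the level again:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
   .agg({'value': 'sum'})
   .unstack('cat_var', fill_value=0)   # unobserved categories become 0
   .stack('cat_var')                   # back to long form
   .reset_index())
This gives one a row and one b row for every observed (int_var1, int_var2) pair, without expanding the integer variables into every permutation.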
OK, as a Python beginner I am finding matrix multiplication between pandas DataFrames quite difficult to carry out.
I have two tables that look like this:
df1
Id lifetime 0 1 2 3 4 5 .... 30
0 1 4 0.1 0.2 0.1 0.4 0.5 0.4... 0.2
1 2 7 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
2 3 8 0.5 0.2 0.1 0.4 0.5 0.4... 0.6
.......
9 6 10 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
df2
Group lifetime 0 1 2 3 4 5 .... 30
0 2 4 0.9 0.8 0.9 0.8 0.8 0.8... 0.9
1 2 7 0.8 0.9 0.9 0.9 0.8 0.8... 0.9
2 3 8 0.9 0.7 0.8 0.8 0.9 0.9... 0.9
.......
9 5 10 0.8 0.9 0.7 0.7 0.9 0.7... 0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to be summed over is given by the lifetime value in column 1 of both dfs, e.g.,
for row 0 in df1&df2, lifetime=4:
sumproduct(df1 row 0 from column 0 to column 3,
df2 row 0 from column 0 to column 3)
for row 1 in df1&df2, lifetime=7:
sumproduct(df1 row 1 from column 0 to column 6,
df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to access rows and columns by integer position.
The row where lifetime == 4 is row 0, and if you count positions across df1 (Id, lifetime, then the day columns), the column labeled 0 is at position 2 and the column labeled 3 is at position 5; to get that interval you would enter 2:6, since the stop position is exclusive.
Once you have the correct data from both data frames with .iloc[0, 2:6], you run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, try first running
df1.iloc[0, 2:6]
on its own. Then try the np.dot call. You can read up on "pandas iloc" and "slicing" for more info.
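To apply the same idea to every row, here is a minimal sketch (assuming df1 and df2 are aligned row-for-row, that lifetime sits at position 1, and that the day columns 0..30 start at position 2, as in the tables above; the function name is just for illustration):
import numpy as np
import pandas as pd

def sumproduct_by_lifetime(df1, df2):
    results = []
    for i in range(len(df1)):
        n = int(df1.iloc[i, 1])  # lifetime for this row
        # dot product over the first n day-columns, which start at position 2
        results.append(np.dot(df1.iloc[i, 2:2 + n], df2.iloc[i, 2:2 + n]))
    return pd.Series(results, index=df1.index)

df1['sumproduct'] = sumproduct_by_lifetime(df1, df2)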
I am finding this issue quite complex:
I have the following df:
values_1 values_2 values_3 id name
0.1 0.2 0.3 1 AAAA_living_thing
0.1 0.2 0.3 1 AAA_mammals
0.1 0.2 0.3 1 AA_dog
0.2 0.4 0.6 2 AAAA_living_thing
0.2 0.4 0.6 2 AAA_something
0.2 0.4 0.6 2 AA_dog
The output should be:
values_1 values_2 values_3 id name
0.3 0.6 0.9 3 AAAA_living_thing
0.1 0.2 0.3 1 AAA_mammals
0.1 0.2 0.3 1 AA_dog
0.2 0.4 0.6 2 AAA_something
0.2 0.4 0.6 2 AA_dog
It would be like a groupby().sum(), but applied only to AAAA_living_thing, since the rows below it are children of AAAA_living_thing.
Separate the dataframe first by using query, splitting it into the rows with AAAA_living_thing and the rows without. Then use groupby on the first part and finally concat them back together:
temp = df.query('name.str.startswith("AAAA")', engine='python').groupby('name', as_index=False).sum()
temp2 = df.query('~name.str.startswith("AAAA")', engine='python')
final = pd.concat([temp, temp2])
Output
id name values_1 values_2 values_3
0 3 AAAA_living_thing 0.3 0.6 0.9
1 1 AAA_mammals 0.1 0.2 0.3
2 1 AA_dog 0.1 0.2 0.3
4 2 AAA_something 0.2 0.4 0.6
5 2 AA_dog 0.2 0.4 0.6
Another way would be to make a unique identifier for rows which are not AAAA_living_thing with np.where and then groupby on name + unique identifier:
import numpy as np
s = np.where(df['name'].str.startswith('AAAA'), 0, df.index)
final = df.groupby(['name', s], as_index=False).sum()
Output
name values_1 values_2 values_3 id
0 AAAA_living_thing 0.3 0.6 0.9 3
1 AAA_mammals 0.1 0.2 0.3 1
2 AAA_something 0.2 0.4 0.6 2
3 AA_dog 0.1 0.2 0.3 1
4 AA_dog 0.2 0.4 0.6 2
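For reference, here is what that grouping key looks like on a hypothetical reconstruction of the example frame; the point is that both AAAA rows share key 0 while every other row keeps its own index, so only AAAA_living_thing collapses in the groupby:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'values_1': [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
    'values_2': [0.2, 0.2, 0.2, 0.4, 0.4, 0.4],
    'values_3': [0.3, 0.3, 0.3, 0.6, 0.6, 0.6],
    'id': [1, 1, 1, 2, 2, 2],
    'name': ['AAAA_living_thing', 'AAA_mammals', 'AA_dog',
             'AAAA_living_thing', 'AAA_something', 'AA_dog'],
})
s = np.where(df['name'].str.startswith('AAAA'), 0, df.index)
print(s)  # [0 1 2 0 4 5]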
Hi, I have the following data in an Oracle SQL database:
AS_OF_DATE TICKER SCORE_1 SCORE_2
'20130301' 'BCO' 0.9 0.3
'20130409' 'BCO' 0.8 0.2
'20130501' 'BCO' 0.4 0.9
'20130601' 'BCO' 0.6 0.6
'20130701' 'BCO' 0.2
'20130801' 'BCO' 0.7 0.1
'20130901' 'BCO' 0.9 0.4
'20131001' 'BCO' 0.7 0.5
'20131101' 'BCO' 0.5
'20130701' 'WGO' 0.1
'20130801' 'WGO' 0.7
'20130901' 'WGO' 0.8
'20131001' 'WGO' 0.1 0.9
'20131101' 'WGO' 0.2 0.8
'20131201' 'WGO' 0.6 0.5
I need to count the number of tickers for each date for each score.
i.e.
AS_OF_DATE Count of SCORE_2 Count of SCORE_1
'20130601' 1 1
'20130701' 1 1
'20130801' 1 2
'20130901' 1 2
'20131001' 1 1
'20131101' 1
I've been trying to use GROUP BY and COUNT without success; here is my attempt. Thanks!!!
SELECT COUNT(table.TICKER),
table.AS_OF_DATE,
table.SCORE_1,
table.SCORE_2
FROM table WHERE RowNum <= 200
GROUP BY table.AS_OF_DATE
SELECT table.AS_OF_DATE,
COUNT(table.SCORE_2) "Count of SCORE_2",
COUNT(table.SCORE_1) "Count of SCORE_1"
FROM table
WHERE RowNum <= 200
GROUP BY table.AS_OF_DATE
The COUNT function counts the number of non-NULL values of its argument, which is why counting SCORE_1 and SCORE_2 directly does what you want. Another point re COUNT: it never returns NULL itself, so in your case you will see 0 instead of an empty value.
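A small self-contained illustration of that NULL behaviour (hypothetical values, runnable against DUAL):
SELECT COUNT(*)       AS all_rows,      -- 2
       COUNT(score_2) AS non_null_rows  -- 1: the NULL is skipped
FROM (
    SELECT 0.5 AS score_1, NULL AS score_2 FROM dual
    UNION ALL
    SELECT 0.7, 0.2 FROM dual
)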
I know that the same kind of question has been asked before. However, I didn't succeed in doing what I need to do, so I'm asking you.
I have a table with client_ID and, for each client, some probabilities of purchasing different product categories.
Client_ID | Prob_CategoryA | Prob_CategoryB | Prob_CategoryC
1 0.2 0.3 0.2
2 0.4 0.6 0.7
3 0.3 0.7 0.4
Now what I would like to do is transform the above table into this:
Client_ID | Category Name | Probability
1 A 0.2
1 B 0.3
1 C 0.2
2 A 0.4
2 B 0.6
2 C 0.7
3 A 0.3
3 B 0.7
3 C 0.4
Thank you very much
Simple UNPIVOT (the prefix 'Prob_Category' is 13 characters long, so SUBSTRING(Cat, 14, 1) picks out the category letter):
SELECT Client_Id, SUBSTRING(Cat, 14, 1) [Category Name], Probability
FROM Src
UNPIVOT (Probability FOR Cat IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)) UP
Result
Client_Id Category Name Probability
----------- ------------- -----------
1 A 0.2
1 B 0.3
1 C 0.2
2 A 0.4
2 B 0.6
2 C 0.7
3 A 0.3
3 B 0.7
3 C 0.4
Use UNPIVOT:
SELECT Client_ID, Cats, Probability
FROM
(SELECT Client_ID, Prob_CategoryA, Prob_CategoryB, Prob_CategoryC
FROM yourTable) t
UNPIVOT
(Probability FOR Cats IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)
) AS c
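Note that with this version the Cats column holds the full source column names (Prob_CategoryA and so on). If you want just the letter, one option is to strip the common prefix, e.g. (a sketch, assuming SQL Server):
SELECT Client_ID,
       REPLACE(Cats, 'Prob_Category', '') AS [Category Name],
       Probability
FROM
    (SELECT Client_ID, Prob_CategoryA, Prob_CategoryB, Prob_CategoryC
     FROM yourTable) t
UNPIVOT
    (Probability FOR Cats IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)) AS c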