I have a specific question about my RDLC report and a table whose data comes from a stored procedure. The table looks like this:
Object   Price   More data here...
======   =====   =================
X        10 $    ...
X        10 $    ...
Y        50 $    ...
Y        50 $    ...
Y        50 $    ...
Y        50 $    ...
Z        20 $    ...
Z        20 $    ...
         Sum(expr)
What I now need is not the total sum of those values, but the sum of the distinct values, one per object. So the result should be 80 $ (10 + 50 + 20).
I have no specific row or column groups. Grouping by Object and adding a group row wasn't a solution, because the sum only has to appear at the end of the table, and I couldn't find out how to sum over the group values...
I tried different functions like Previous() (comparing the objects) and RunningValue(), but maybe I used them wrongly, or they told me that these functions can't be used inside an aggregate function. Maybe Maximum() for each object is another idea, but it gave me the same error.
I'm stuck in tunnel vision and have run out of ideas, so maybe one of you can help me with this case.
Thanks in advance
// I am currently working with Visual Studio 2013
It's better to do the distinct sum in the stored procedure, like this (assuming a MySQL stored procedure):
SUM(DISTINCT p.Price)
Note that this sums each distinct price once, which matches the expected 80 here only because every object happens to have a different price. In RDLC there is an aggregate function called CountDistinct that counts distinct values without repetition.
Let's say I have a DataFrame with 200 values, prices for products. I want to run some operation on this DataFrame, like calculating the average price over the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate the average for each row, i.e. the first 9 rows will be NaN, and then from rows 10-200 it would calculate an average for each row.
My issue is that I need to do a lot of these calculations and performance is a concern. For that reason, I want to run the average only on, say, the last 10 values (I don't need more) out of all the values, while keeping those values in the DataFrame, i.e. I don't want to get rid of them or create a new DataFrame.
I essentially just want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices in the range [0, 1000):
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

# Use .loc with the last 12 row labels so the assignment hits the frame
# itself rather than a chained-indexing copy.
df.loc[df.index[-12:], "price"] = df.loc[df.index[-12:], "price"].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns; you simply adjust your .apply(...) call to take an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
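For instance, here is a minimal sketch of the multi-column case (the second frame and the row-wise function are both made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical two-column frame, purely for illustration.
df2 = pd.DataFrame({
    "price": (np.random.rand(200) * 1000.).round(decimals=2),
    "qty": np.random.randint(1, 10, size=200),
})

# axis=1 hands each row to the function as a Series, so you can mix columns.
# Only the last 10 rows are touched; the rest of the new column stays NaN.
df2["total"] = df2.iloc[-10:].apply(lambda row: row["price"] * row["qty"], axis=1)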
Given a dataframe like the following:
      Price
0    197.45
1     59.30
2    131.63
3    127.22
4     35.22
..      ...
195   73.05
196   47.73
197  107.58
198  162.31
199  195.02

[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
I'm trying to group by 2 columns, of which the first has 5 different values and the second has 2.
My data looks like this:
and using
df_counted = (
    df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT')
)
I was able to transform it into the cases I want:
However, I don't want a column for RESULT, just columns for the counts.
It's supposed to look like this:
          COUNT_TRUE  COUNT_FALSE
FORWARD           21          182
BACKWARD          34          170
RIGHT             24          298
LEFT              20          242
NEUTRAL           16           82
The best I could do there was this. How do I get there?
Pandas has a feature for making a pivot table from a DataFrame. Your task can also be done by making a pivot table.
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result:
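Reconstructing the output from the counts given in the question (exact dtypes and ordering may differ), the pivot looks something like this:

RESULT    False  True
TYPE
BACKWARD    170    34
FORWARD     182    21
LEFT        242    20
NEUTRAL      82    16
RIGHT       298    24

To get the exact column names asked for, a small follow-up sketch (assuming RESULT holds booleans, so False sorts before True):

out = df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
out.columns = ["COUNT_FALSE", "COUNT_TRUE"]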
Solved it and went kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# keep the first occurrence of each TYPE
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# keep the last occurrence of each TYPE
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.
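For what it's worth, a shorter route from the TYPE/RESULT/COUNT frame produced by the groupby earlier in the thread would be set_index plus unstack; this is only a sketch and not tested against the exact data:

counts = df_counted.set_index(['TYPE', 'RESULT'])['COUNT'].unstack()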
OK, so I am trying to reference one variable with another in SQL.
X = a,b,c,d (X is a string variable with a list of things in it)
Y = b (Y is a string variable that may or may not have a value that appears in X)
I tried this:
case when Y in (X) then 1 else 0 end as aa
But it doesn't work, since it looks for exact matches between X and Y.
I also tried this:
where contains(X, @Y)
but I can't create Y globally, since it is a variable that changes in each row of the table (X also changes).
A solution in SAS would also be useful.
Thanks
Maybe LIKE will help:
select *
from t
where X like ('%' + Y + '%')

or

select case when (X like ('%' + Y + '%')) then 1 else 0 end
from t
SQLFiddle example
In SAS I would use the INDEX function, either in a data step or PROC SQL. It returns the position within the string at which it finds the character(s), or zero if there is no match, so testing whether the returned value is greater than zero gives a binary 1/0 output. You need to apply the COMPRESS function to the variable containing the search characters, as SAS pads the value with blanks.
Data step solution:
aa = index(x, compress(y)) > 0;
PROC SQL solution:
index(x, compress(y)) > 0 as aa
Below is some data:
Test  Day1  Day2  Score
A     1     2     100
B     1     3      62
C     3     4      90
D     2     4      20
E     4     5      80
I am trying to take the values from columns 'Day1' and 'Day2' and use them to select row numbers for the column Score. For example, for Test A I would like to find the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window, found in the menu Transform > Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to reference the cell location of Score, and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First, take the scores dataset and transpose it so that it has one row and 5 columns (Data > Transpose).
Then match that dataset to each case in the main dataset (Data > Merge Files > Add Variables).
Next, you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR).
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose you might have batches of scores, and maybe there are some gaps. The Restructure Data Wizard (convert cases into variables) can help you generalize this, but let's not go there yet.
HTH,
Jon Peck
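This is not SPSS syntax, but for illustration here is the same indexing idea sketched in pandas, using the table from the question and reading the example as "sum Score over rows Day1 through Day2":

import pandas as pd

df = pd.DataFrame({
    "Test": ["A", "B", "C", "D", "E"],
    "Day1": [1, 1, 3, 2, 4],
    "Day2": [2, 3, 4, 4, 5],
    "Score": [100, 62, 90, 20, 80],
})

# Treat Score as a 1-based lookup vector (the VECTOR step), then sum
# Score[Day1..Day2] for each row (the COMPUTE step).
scores = df["Score"].tolist()
df["SumScore"] = [sum(scores[d1 - 1:d2]) for d1, d2 in zip(df["Day1"], df["Day2"])]
# Test A -> 162 (100 + 62), Test B -> 252 (100 + 62 + 90)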
I am working on a table with over 100 columns, many of them boolean (in case this is relevant), so I need to use avg(variable_name::int) to take each boolean column's average.
Now, I want to take the average of all columns at the same time. How do I do that?
Thank you very much.
I'll try to be more clear:
I want all the averages of all variables from A to ZZ. Some of them are integers, some are booleans; that's the sole reason why I mentioned the booleans.
PK   A   ...   GZ   ...   ZZ
----------------------------
1    T   ...   F    ...   T
2    T   ...   F    ...   T
3    F   ...   T    ...   T
4    F   ...   F    ...   F
5    T   ...   F    ...   T
There are no real sneaky or tricky ways to do this. You might be able to build a dynamic query using the data dictionary, but that's really not recommended.
If you honestly need the average of 100 different columns, you're going to have to type avg() 100 times.
I do agree with the comment above, however, that your DB would likely benefit greatly from some normalization. This is especially true if you have a bunch of columns named 'Something##', where ## is a series of numbers.
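If you did want to try the dynamic-query route anyway, here is a rough sketch of the idea in Python; PostgreSQL is assumed because of the ::int cast in the question, and the connection string, table name ("my_table"), and key column ("pk") are all placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# Pull the column list for the table out of the data dictionary.
cur.execute(
    "SELECT column_name, data_type "
    "FROM information_schema.columns "
    "WHERE table_name = %s AND column_name <> %s",
    ("my_table", "pk"),
)

# Build avg(...) for every column, casting booleans to int first.
parts = []
for name, dtype in cur.fetchall():
    expr = "avg({0}::int)".format(name) if dtype == "boolean" else "avg({0})".format(name)
    parts.append("{0} AS avg_{1}".format(expr, name))

cur.execute("SELECT " + ", ".join(parts) + " FROM my_table")
print(cur.fetchone())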
Maybe I don't understand your question. If you start with a table like this:
PK   A
------
1    T
2    T
3    F
4    F
5    T
What answer do you expect to get to the question, "What is the average value of column A?"