I am working on a table with over 100 columns. Many of them are boolean, in case that is relevant, since I need to use avg(variable_name::int) to take the average of each boolean column.
Now I want to take the average of all columns at the same time. How do I do that?
Thank you very much.
I'll try to be more clear:
I want the averages of all variables from A to ZZ. Some of them are integers, some are booleans; that's the sole reason I mentioned the booleans.
PK   A   ****   GZ   ***   ZZ
-----------------------------
1    T   ****   F    ***   T
2    T   ****   F    ***   T
3    F   ****   T    ***   T
4    F   ****   F    ***   F
5    T   ****   F    ***   T
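For a single column, this is what I run today (a minimal example; mytable is just a placeholder name):

SELECT avg(a::int) AS avg_a
FROM mytable;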
There are no real sneaky or tricky ways to do this. You might be able to build a dynamic query using the data dictionary, but that's really not recommended.
If you honestly need the average of 100 different columns, you're going to have to type avg() 100 times.
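To make that concrete, here is a minimal sketch assuming PostgreSQL and a table called mytable (both assumptions). The first query is the type-it-all-out version; the second only generates that SELECT list from the catalog so you can paste and run it yourself:

-- Explicit version: one avg() per column (booleans cast to int).
SELECT
    avg(a::int)  AS avg_a,
    avg(gz::int) AS avg_gz,
    avg(zz::int) AS avg_zz
    -- ...and so on for the remaining columns...
FROM mytable;

-- Data-dictionary version: build the SELECT list as text, then run it by hand.
SELECT string_agg(format('avg(%I::int) AS avg_%s', column_name, column_name), ', ')
FROM information_schema.columns
WHERE table_name = 'mytable'
  AND column_name <> 'pk';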
I do agree with the comment above, however, that your DB would likely benefit greatly from some normalization. This is especially true if you have a bunch of columns named Something## where ## is a series of numbers.
Maybe I don't understand your question. If you start with a table like this:
PK A
--------
1 T
2 T
3 F
4 F
5 T
What answer do you expect to get to the question, "What is the average value of column A?"
Guys, I have a dataset like this:
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly', 'merry',
                        'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
It gives this output:
Name
0 John
1 gal britt
2 mona
3 diana
4 molly
5 merry
6 mony
7 molla
8 johnathon
So I imagine that to compare every name against every other name and detect the similarity, I would use df.merge(df, how="cross").
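For reference, this is roughly what that cross merge looks like on the toy frame above (how="cross" needs pandas 1.2+, and the _x/_y suffixes are the pandas defaults):

pairs = df.merge(df, how="cross")   # every name paired with every name: 10 x 10 = 100 rows
print(pairs.head())                 # columns Name_x and Name_y; grows quadratically with row count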
The thing is, the real data has 40,000 rows, and performing this would result in a very big dataset which I don't have the memory for.
Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
I tried working with vaex instead of pandas to handle this huge amount of data, but I still run into the problem of insufficient memory allocation.
In short: I KNOW that this algorithm, or this way of thinking about such a problem, is wrong and inefficient.
I'm trying to make search a bit more friendly and wanted to exploit the Levenshtein distance. This works great, but if a value in a column is 25 characters long, the distance to an input of only 3 characters is too large. In this case it performs worse than the LIKE method. I solved this by splitting all words into their own rows using regexp_split_to_table. This is nice, but it still doesn't work if I have multiple words as input.
For example:
Let the data look as follows:
id | col1    | col2
---+---------+-------
 1 | one two | three
 2 | two     | one
 3 | horse   | tree
 4 | house   | three
Using regexp_split_to_table would transform this to:
id | col
---+-------
 1 | one
 1 | two
 1 | three
 2 | one
 2 | two
 3 | horse
 3 | tree
 4 | house
 4 | three
If I search for one tree, I'd like to compare one with each word, but also compare tree with each word, and then order by the sum of both distances.
I have no idea where to start. I also do not know if this is the best approach (it seems somewhat excessive, but I'm also not an expert). Maybe I'm overthinking this. I'd appreciate a hint in the right direction :).
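To make that concrete, something like the following is what I picture, assuming the fuzzystrmatch extension (for levenshtein) and a table t(id, col1, col2); the hard-coded search string 'one tree' and the minimum-per-search-word rule are just my guesses:

-- Split stored text and the search input into words,
-- take the best (minimum) distance per search word and row,
-- then order rows by the sum of those minima.
WITH words AS (
    SELECT id, regexp_split_to_table(concat_ws(' ', col1, col2), '\s+') AS word
    FROM t
),
terms AS (
    SELECT regexp_split_to_table('one tree', '\s+') AS term
),
best AS (
    SELECT w.id, s.term, min(levenshtein(w.word, s.term)) AS dist
    FROM words w
    CROSS JOIN terms s
    GROUP BY w.id, s.term
)
SELECT id, sum(dist) AS total_dist
FROM best
GROUP BY id
ORDER BY total_dist;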
I have a data set that allows linking friends (i.e. observing peer groups) and thereby one can observe the characteristics of an individual's friends. What I have is an 8 digit identifier, id, each id's friend id's (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that hold the foreign-born status of each friend.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Notice, MF1 means male friend 1 and then MF1id is the id number for male friend 1. The respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
edit: I did a poor job of explaining the data structure. I have a cross section, so there is 1 observation per unique id. Row 1 is the first 8-digit id number with all its variables following across the row. The ids that repeat are the friend ids listed for each person (mf1id, for example), which also appear in the id column. I hope that is a bit clearer.
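To make the lookup I want concrete, here is a rough sketch of it as a plain merge (assuming Stata 11+ merge syntax, and using the MF1id / f_born / MF1f_born names from above):

* Build a friend-level copy keyed by MF1id, then merge it back on.
preserve
keep id f_born
rename id MF1id
rename f_born MF1f_born
tempfile friends
save `friends'
restore
merge m:1 MF1id using `friends', keep(master match) nogenerate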
Kevin Crow wrote vlookup, which makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except it will not work for me. After trying vlookup with my data set, with the UCLA data that Dimitriy used for his example, and with a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' is not unique within `key';"
            di in red /*
                */ "there are multiple observations with different `value'" /*
                */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an error, "invalid file," at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why that would create a problem, but maybe it does; I really don't know. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close
input aid mf1aid fborn
1 2 1
2 1 1
3 5 0
4 2 0
5 1 0
6 4 0
7 6 1
8 2 .
9 1 0
10 8 1
end
program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)
sort aid
list
Well, this fixed my problem. Thanks to all who responded; I would not have figured this out without you.
I have a specific question about my RDLC report and a table whose data comes from a stored procedure. The table looks like this:
Object   Price      More Data here...
======   =========  =================
X        10 $       ...
X        10 $       ...
Y        50 $       ...
Y        50 $       ...
Y        50 $       ...
Y        50 $       ...
Z        20 $       ...
Z        20 $       ...
         Sum(expr)
What I need now is not the total sum of those values, but the sum of all distinct values grouped by each object. So the result should be 80 $ (10 + 50 + 20).
I have no specific row or column groups. Grouping by Object and adding a group row was not a solution for me, because the sum only has to appear at the end of the table, and I could not find out how to sum over the group values...
I tried different functions like Previous() (comparing the objects) and RunningValue(), but maybe I used them wrong, or they told me that the function can't be used in an aggregate function. Maybe Max() for each object is another idea, but it gave me the same error.
I have tunnel vision at this point and no idea what else could help with this case, so maybe one of you can help me.
Thanks in advance
// I am currently working with Visual Studio 2013
It's better to do the distinct sum in the stored procedure, like this (if you are using a MySQL stored procedure):
SUM(DISTINCT p.Price)
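For example, in context (a sketch; the alias p and the table name Prices are assumptions):

SELECT SUM(DISTINCT p.Price) AS DistinctSum   -- 10 + 50 + 20 = 80 for the sample data
FROM Prices AS p;

Note that SUM(DISTINCT ...) also collapses equal prices coming from different objects, so if two objects can share the same price, sum one price per object instead (for example via GROUP BY Object in a subquery).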
In RDLC there is an aggregate function called CountDistinct that counts distinct values without repetition.
Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from the columns 'Day1' and 'Day2' and use them to select row numbers in the column Score. For example, for Test A I would like to find the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window, found in the menu Transform > Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First, take the scores dataset and transpose it so that it has one row and 5 columns (Data > Transpose).
Then match that dataset to each case in the main dataset (Data > Merge Files > Add Variables).
Next, you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR).
Finally, you use COMPUTE to index into the scores.
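For example, a minimal sketch of those last steps, assuming the transposed scores arrived on every case as variables s1 to s5:

* Sum the scores from row Day1 through row Day2 for each case.
VECTOR s = s1 TO s5.
COMPUTE total = 0.
LOOP #i = Day1 TO Day2.
  COMPUTE total = total + s(#i).
END LOOP.
EXECUTE.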
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck