I have a df that looks something like this:
   batch    group  reading  temp  test block  delay
0   9551  Control      340  22.9     1     X     35
1   9551  Control      345  22.9     1     Y     35
I need to group by 'group' and 'block', e.g. my means would look like so:
df.groupby(['block', 'group']).reading.mean().unstack().transpose()
block             X           Y
group
Control  347.339450  350.427273
Trial    347.790909  350.668182
What would be the best way to call a 2 argument function like scipy.stats.ttest_ind on data sliced this way so I end up with a table of t tests for
control vs trial in x
control vs trial in y
x vs y in control
x vs y in trial
Do you want to group and aggregate the data before applying the t-test? I think you want to select subsets of the data. Grouping can do that, but masking might get the job done more simply.
Offhand, I'd say you want something like
scipy.stats.ttest_ind(df[(df.group == 'Control') & (df.block == 'X')].reading,
df[(df.group == 'Trial') & (df.block == 'X')].reading)
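The masking idea above can be extended to all four comparisons from the question. This is only a minimal sketch: the toy readings are made up for illustration, and the helper `subset` and the result name `ttests` are hypothetical names, not part of the original post.

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the real frame; the readings are invented for illustration.
df = pd.DataFrame({
    "group": ["Control"] * 6 + ["Trial"] * 6,
    "block": ["X", "X", "X", "Y", "Y", "Y"] * 2,
    "reading": [340, 345, 350, 352, 349, 351,
                347, 348, 346, 350, 353, 351],
})

def subset(group, block):
    """Readings for one (group, block) cell, selected with a boolean mask."""
    return df[(df.group == group) & (df.block == block)].reading

# The four comparisons asked about, as (label, sample a, sample b).
comparisons = [
    ("Control vs Trial in X", subset("Control", "X"), subset("Trial", "X")),
    ("Control vs Trial in Y", subset("Control", "Y"), subset("Trial", "Y")),
    ("X vs Y in Control", subset("Control", "X"), subset("Control", "Y")),
    ("X vs Y in Trial", subset("Trial", "X"), subset("Trial", "Y")),
]

rows = []
for label, a, b in comparisons:
    t, p = stats.ttest_ind(a, b)  # independent two-sample t-test
    rows.append({"comparison": label, "t": t, "p": p})

# One row per comparison, with the t statistic and p-value as columns.
ttests = pd.DataFrame(rows).set_index("comparison")
print(ttests)
```

Running four tests on the same data raises the usual multiple-comparison concerns, so the p-values may need a correction depending on the use case.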
I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index A B C D ..... Z
Date/Time 1 0 0 0,35 ... 1
Date/Time 0,75 1 1 1 1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example, the whole A column would be compared to the whole B column across the entire time range).
I am expecting an output like this:
(source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I print the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with every other one and plot the result, but then the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
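One possible reading of the question, sketched under assumptions: `hierarchy.linkage` treats each row of its input as one observation, so passing the transposed frame clusters whole columns rather than individual time points. The random data below is invented and only stands in for the real frame. `no_plot=True` is used so the sketch does not need a plotting backend.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)

# Made-up data: 200 time points (rows) by 5 series (columns).
base = rng.random(200)
df = pd.DataFrame({
    "A": base,
    "B": base + 0.01 * rng.random(200),  # deliberately almost identical to A
    "C": rng.random(200),
    "D": rng.random(200),
    "E": rng.random(200),
})

# linkage() clusters rows, so transpose to make each column one observation.
Z = hierarchy.linkage(df.T.to_numpy(), "ward")

# Build the dendrogram structure without drawing it; "ivl" holds leaf labels.
tree = hierarchy.dendrogram(Z, labels=df.columns.tolist(), no_plot=True)
print(tree["ivl"])
```

With 5 columns the linkage matrix has 4 rows, one per merge, so the resulting tree stays small no matter how many time points there are.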
Let's say I have the following dataframe:
df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, eg.
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
But only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? I.e., I want to group nearby points (with nearby being defined as within some epsilon).
I know this isn't trivial because point x might be near point y, and point y might be near point z, but point x might be too far from z, so it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data on regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.
Based on this answer
df.groupby( (df.a.diff() > 1).cumsum() ).mean()
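A runnable version of the one-liner above, as a sketch. It assumes that sorting by `a` first is acceptable, since `diff()` only compares neighbouring rows. The threshold name `eps` is introduced here for illustration. A new group starts wherever the gap to the previous value exceeds `eps`.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

eps = 1  # hypothetical threshold: a gap larger than this starts a new group

# diff() looks at neighbouring rows only, so sort by the grouping column first.
df_sorted = df.sort_values('a')

# True marks the first row of each new group; cumsum turns that into group ids.
group_id = (df_sorted['a'].diff() > eps).cumsum()

result = df_sorted.groupby(group_id).mean()
print(result)
```

Because only consecutive gaps are checked, a long chain of closely spaced points ends up in a single group even if its two ends are far apart. That is one way of resolving the ambiguity described in the question.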
For example, the dataset a is
id x
1 15
2 25
3 35
4 45
I want to add a column y to dataset a, y being the average of x excluding the current id.
so y_1 = (x_2+x_3+x_4)/3 = (25+35+45)/3.
The easiest way to do it without SQL is to add the mean and the n to each row (use PROC MEANS, then merge on the values), and then use math to remove the current value. I.e., if x_mean = (15+25+35+45)/4 = 30 and x = 15, then
x_mean_others = ((30*4)-15)/(4-1) = 105/3 = 35
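The arithmetic above can be checked with a few lines of Python. This is only an illustration of the formula, not SAS code: subtract the current value from the total and divide by n - 1.

```python
x = [15, 25, 35, 45]

n = len(x)
total = sum(x)  # same as mean * n, i.e. 30 * 4 = 120

# Leave-one-out mean: remove the current value, then average the remaining n - 1.
y = [(total - xi) / (n - 1) for xi in x]
print(y)
```

The first entry is (120 - 15) / 3 = 35, matching the worked example.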
Alternatively, in SQL, you can calculate it on the fly with the same idea.
proc sql;
create table want as
select x, (mean(x)*n(x) - x)/(n(x)-1) as y
from have H
;
quit;
This takes advantage of SAS's automatic remerging; in something like SQL Server you'd need a WITH clause to make this work, I imagine.
I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 lines (and many columns, of which 2 are relevant to this problem; I'll call them M1 and M2). I have another data.frame whose column V1, with about 10,000 elements, is essential to this task. My task is this:
For each element in V1, find where it occurs in M2 and pull out the corresponding M1. I am able to do this using a for-loop, but it is terribly slow! I am used to Matlab and Perl, and this is taking forever in R! Surely there's a better way. I would appreciate any suggestions for accomplishing this task...
for (x in c(1:length(V$V1))) {
start[x] = M$M1[M$M2 == V$V1[x]]
}
There is only 1 element that will match, and so I can use the logical statement to directly get the element in start vector. How can I vectorize this?
Thank you!
Here is another solution using the same example as @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
test replications elapsed relative
2 f_aix() 10000 12.907 7.068456
3 f_chase() 10000 2.010 1.100767
1 f_ramnath() 10000 1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.
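For readers more at home in Python, a rough pandas analogue of the `merge` call with `all.x=T` is a left merge. This is a sketch, not part of the original answers: the extra value 99 is added here to show the missing-value behaviour, and `start` simply reuses the name from the question.

```python
import pandas as pd

M = pd.DataFrame({"M1": [1, 2, 3, 4, 10, 3, 15],
                  "M2": [15, 6, 7, 8, -1, 12, 5]})
# 99 is added for illustration; it does not occur in M2.
V = pd.DataFrame({"V1": [-1, 12, 5, 7, 99]})

# how="left" keeps every row of V, in order, and fills unmatched rows with NaN.
merged = V.merge(M, how="left", left_on="V1", right_on="M2")

start = merged["M1"]
print(merged[["V1", "M1"]])
```

The first four rows reproduce the V1/M1 pairs shown above. The row for 99 carries NaN instead of being dropped.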
I have a SQL table with three columns X, Y, Z. I need to split it in groups in such a way that all records with same value of X or Y or Z are assigned to the same group. I need to make sure that the records with same value X or Y or Z are never split across multiple groups.
If you think of records as nodes and values of X, Y, Z as edges, this problem is the same as finding all graphs where the nodes in each graph will be connected directly or indirectly via X, Y, or Z-edge, but each graph will have no edges in common with other graphs (otherwise it would be part of the same graph).
A few years ago I knew what this was called and even remembered the algorithm, but now it escapes me. Please tell me what this problem is called so I can Google for a solution. If you know a good algorithm, please point me to it. If you have a SQL implementation, I will marry you :)
Example:
X Y Z BUCKET
--------- ---------------- --------- -----------
1 34 56 1
54 43 45 2
1 12 22 1
2 34 11 1
The last row is in bucket 1 because of the value of Y=34 which is the same as of the first row, which is in bucket 1.
It looks less like a graph and more like a simplicial complex.
But if we treat this complex as its skeletal graph (the numbers are treated as vertices, and a row in the table means that all three of its vertices are connected by edges), then we can use any algorithm for finding the connected components of this graph. I'm not sure whether there is a feasible way to do this in SQL, though; perhaps it would be more prudent to use a graph database.
However, for this specific problem there may be an easy solution attainable in SQL that I didn't look for.
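Outside SQL, the connected-components idea can be sketched with a small union-find (disjoint-set) structure. This is one possible approach, not the answerer's own code. It assumes that values are only compared within the same column, as in the question's example where the match is on Y=34, so each value is keyed by its column name. The helper names `find` and `union` are introduced here.

```python
rows = [(1, 34, 56), (54, 43, 45), (1, 12, 22), (2, 34, 11)]

parent = {}  # maps each node to its parent in the disjoint-set forest

def find(node):
    """Return the representative of the set that contains node."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving keeps trees shallow
        node = parent[node]
    return node

def union(a, b):
    """Merge the sets containing a and b."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

# Link every row to its three (column, value) nodes.
for i, (x, y, z) in enumerate(rows):
    for col, val in (("X", x), ("Y", y), ("Z", z)):
        union(("row", i), (col, val))

# Number the components in order of first appearance.
bucket_of_root = {}
buckets = []
for i in range(len(rows)):
    root = find(("row", i))
    buckets.append(bucket_of_root.setdefault(root, len(bucket_of_root) + 1))

print(buckets)  # [1, 2, 1, 1]
```

The result matches the BUCKET column in the question's example: the third row joins the first through X=1, and the fourth joins through Y=34.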
To find how many nodes are in each group x:
select x, count(x)
from mytable
group by x
Or to find the list of sets x:
select distinct x from mytable;
Why don't you initially GROUP BY one of the columns (say X) and make buckets, then do the same for Y and Z, each time merging buckets from the previous step whenever you find new groups?
Repeat the process for X, Y, and Z until the buckets stop changing.
Are you working for LinkedIn or Facebook? :)