Zipping two dataframes in C# Deedle

Would anyone be able to provide me with a working example of dataframe zipping in C#? I am a bit lost in the operation.
Thanks!

The frame.Zip operation is the same thing as zipAlign in the better-documented F# API, so have a look at zipAlign in that section of the documentation.
Given a frame df1:
A
1 -> 1
2 -> 2
And a frame df2:
A
2 -> 2
3 -> 3
When you call df1.Zip(df2, (int a, int b) => a + b), you get:
A
1 -> <missing>
2 -> 4
3 -> <missing>
That is, for cells where both frames contain a value, a + b is calculated; all other cells are missing. Note that the type annotations in the lambda are required and have to match the type of values in the frame (for cells whose types do not match, Zip just returns the values from the first frame unchanged).
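For completeness, here is a minimal self-contained C# sketch of the above. The SeriesBuilder/FrameBuilder helpers follow the Deedle C# tutorial; adjust the names to the Deedle version you are using:

using Deedle;

class ZipExample
{
    static void Main()
    {
        // df1: column "A" with row keys 1, 2 and values 1, 2
        var df1 = new FrameBuilder.Columns<int, string> {
            { "A", new SeriesBuilder<int, int> { { 1, 1 }, { 2, 2 } }.Series }
        }.Frame;

        // df2: column "A" with row keys 2, 3 and values 2, 3
        var df2 = new FrameBuilder.Columns<int, string> {
            { "A", new SeriesBuilder<int, int> { { 2, 2 }, { 3, 3 } }.Series }
        }.Frame;

        // The int annotations on the lambda tell Zip which cells to combine
        var sum = df1.Zip(df2, (int a, int b) => a + b);
        sum.Print();  // row 2 -> 4; rows 1 and 3 -> <missing>
    }
}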

Related

How can I read and parse files with variant spaces as delim?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care for the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside each file.
How to parse these files, given that the spacing between the columns varies in almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1.2345,0.0241,0.5901|
|File3|LLKM|0.9812,0.2145,0.4142,0.4921|
So, the first column of the first file (file1.no), the one with single letters, is now a list in a row together with all the information from that file, and the DataFrame has one row per file.
Any help is welcome, thanks in advance.
Here is example code that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    filter!(x -> !startswith(x, "#"), l)  # drop the comment/header lines
    sl = split.(l)                        # split each row on runs of whitespace
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),   # second column joined into one string
            SCORE=parse.(Float64, getindex.(sl, 3)))  # third column as a Float64 vector
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame.
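The variable spacing is handled by split: called without a delimiter, it splits on any run of whitespace. A quick check in the REPL (Julia 1.x shown):

julia> split("1  M   0.3821   [0.01331,0.5465]")
4-element Vector{SubString{String}}:
 "1"
 "M"
 "0.3821"
 "[0.01331,0.5465]"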

Can PICT handle independent parameters

Can PICT (= Pairwise Independent Combinatorial Testing) handle/model independent parameters?
For example, in the following input a and b are independent, so they should not be combined.
Input in PICT:
a: 1, 2, 3, 4
b: 5, 6, 7, 8
//some line which models the independence: a independent of b
Output, that I would expect:
a b
1 5
2 6
3 7
4 8
This example, with only 2 parameters, would of course not normally make much sense, but it is illustrative.
The same could be applied to 3 parameters (a,b,c), where a is independent of b, but not c.
The main goal of declaring parameters as independent would be to reduce the number of tests.
I read the paper/user guide for PICT, but I didn't find any useful information.
I will answer my question myself:
The solution is to define submodels and set the default order from 2 (=pairwise) to 1 (= no combination).
For example, parameter a = {a_1, a_2, a_3} should be independent of
b = {b_1, b_2, b_3} and
c = {c_1, ..., c_4}.
Therefore I would expect 12 tests ((b x c) + a).
Resulting in the following input file:
a: 1,2,3
b: 1,2,3
c: 1,2,3,4
{b,c} @ 2
{b,c} @ 2 defines a submodel, consisting of b and c, which uses pairwise combination (submodel order is written with @ in PICT model files).
And run PICT with the option /o:1, which sets the default combination order to 1.
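For example, assuming the model above is saved as model.txt, the call would look like:

pict model.txt /o:1

This keeps pairwise coverage inside the {b,c} submodel, while everything else (here just a) is combined at order 1.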
In PICT you can also define constraints to keep values from being combined with the rest. You can add an extra N/A value and write something like this:
IF [A] = "1" THEN [B] = "N/A";
This is one possible option for handling this case.

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index       A     B  C  D     ... Z
Date/Time   1     0  0  0.35  ... 1
Date/Time   0.75  1  1  1     ... 1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example: the whole of column A compared to the whole of column B over the entire time range).
I am expecting output like this: [example image of a clustered time-series dendrogram] (source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with each other and plot, but that way the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
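One way to read the question, sketched under the assumption that the goal is to cluster whole columns: transpose the matrix so each column becomes a single observation for linkage (the shapes and labels below are stand-ins for the real frame):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
X = rng.random((8878, 26))                       # stand-in for the 8878-row frame
labels = [chr(ord('A') + i) for i in range(26)]  # column names A..Z

# Transposing makes each column one observation, so linkage compares
# whole time series rather than individual time points.
Z = hierarchy.linkage(X.T, 'ward')
hierarchy.dendrogram(Z, labels=labels)
plt.show()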

(KeyError): MultiIndex Slicing requires the index to be fully lexsorted tuple ... Why is this caused by a list, but not by a tuple?

This question is partially here to help me understand what lex-sorting is in the context of MultiIndexes.
Say I have some MultiIndexed DataFrame df, and for the index I want to use:
a = (1, 1, 1)
So to pull the value from the dataframe I write:
df.loc[a, df.columns[i]]
Which works. But the following doesn't:
df.loc[list(a), df.columns[i]]
Giving me the error:
*** KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
Why is this?
Also, another question, what does the following performance warning mean?
PerformanceWarning: indexing past lexsort depth may impact performance.
I'll illustrate the difference between passing a tuple and a list to .loc, using an example where df is:
              0  1  2
first second
bar   one     4  4  7
      two     3  4  7
foo   one     8  1  8
      two     7  5  4
Here df.loc[('foo', 'two')] returns the row indexed by this tuple, namely (7, 5, 4). The tuple specifies both levels of the MultiIndex.
But df.loc[['foo', 'two']] means you want all rows whose top level of the MultiIndex is either 'foo' or 'two'. A list gives the options you want, and since each option provides only one level, the selection is based on the first (leftmost) level. The result:
              0  1  2
first second
foo   one     8  1  8
      two     7  5  4
(Since there are no multiindices that begin with 'two', only those with 'foo' are present.)
Without seeing your dataframe, I can't tell where this difference leads to getting KeyError, but I hope the difference itself is clear now.
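To make the difference concrete, here is a minimal reproduction of the frame above (a sketch; note that recent pandas versions raise KeyError outright when a list contains a label missing from the level, so the list lookup below only uses labels that exist):

import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']],
                                 names=['first', 'second'])
df = pd.DataFrame([[4, 4, 7], [3, 4, 7], [8, 1, 8], [7, 5, 4]], index=idx)

print(df.loc[('foo', 'two')])   # tuple: both levels given -> the row (7, 5, 4)
print(df.loc[['foo', 'bar']])   # list: options for the first level only

# Sorting the index lexsorts it to full depth, which is what the
# PerformanceWarning about "indexing past lexsort depth" refers to.
df = df.sort_index()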

Processing loading table data

I have a text file "celldata.txt" containing a very simple table of data.
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
The problem comes when accessing the data at a certain column and row.
My approach has been to load using loadTable.
Table table;
int numCols;
int numRows;

void setup() {
  size(200, 200);
  table = loadTable("celldata.txt", "tsv");
  numRows = table.getRowCount();
  numCols = table.getColumnCount();
}

void draw() {
  background(255);
  fill(0);
  text(numRows + " " + numCols, 100, 100); // check number of rows and columns
  println(table.getFloat(0, 0)); // print the value in the first cell
}
Question 1: When I do this, it says the number of rows is 5 and the number of columns is just 1. Why is it not 5 x 4?
Question 2: Why is table.getFloat(0,0) "NaN" instead of the first element of the data?
I want to use a much bigger matrix later and access certain elements (of type double) with something like getFloat(i,j), and be able to loop through all elements.
Using the same example data as mine, can someone please help me understand what is wrong with my code and how to access the text file's data? Should I be using another method than loadTable?
You've told Processing that the file contains tab-separated values (by using the "tsv" option), but your file contains space-separated values.
Since your file does not contain any tabs, it reads each entire row as a single value. So the (0, 0) position of your table is "1 2 3 4", which isn't a number, hence the NaN. This is also why it thinks your table has only one column.
You should modify your celldata.txt file to actually be separated by tabs instead of spaces:
1	2	3	4
5	6	7	8
9	10	11	12
1	2	3	4
2	3	4	5
You could also separate them by commas and then use the "csv" option.
If you're still having trouble, you can see what Processing is reading in by adding saveTable(table, "data/new.csv"); to the end of your setup() function and then looking at that file. It will be a list of values separated by commas, so you can see exactly where Processing thinks the cells of the table are.
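Once the file is correctly tab- or comma-separated, looping over every cell works with nested loops; a minimal sketch (note that Processing's Table API exposes float rather than double):

for (int r = 0; r < table.getRowCount(); r++) {
  for (int c = 0; c < table.getColumnCount(); c++) {
    float v = table.getFloat(r, c); // value at row r, column c
    println(r + ", " + c + " -> " + v);
  }
}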