R: remove rows in a data frame for which all columns contain the same content or nothing

I have a data frame:
# create a data frame
V1 = c("gene_1", "gene_1", "", "")
V2 = c("gene_2", "gene_2", "", "")
V3 = c("gene_3", "gene_3", "gene_4", "")
V4 = c("gene_4", "gene_4", "", "")
V5 = c("gene_5", "gene_5", "gene_8", "")
V6 = c("gene_6", "gene_6", "gene_6", "gene_7")
df = as.data.frame(rbind(V1, V2, V3, V4, V5, V6))
The data frame df looks like this:
V1 V2 V3 V4
V1 gene_1 gene_1
V2 gene_2 gene_2
V3 gene_3 gene_3 gene_4
V4 gene_4 gene_4
V5 gene_5 gene_5 gene_8
V6 gene_6 gene_6 gene_6 gene_7
Now, I want to remove all the rows that have only labels of the same gene, resulting in:
V1 V2 V3 V4
V3 gene_3 gene_3 gene_4
V5 gene_5 gene_5 gene_8
V6 gene_6 gene_6 gene_6 gene_7
I found several similar questions on Stack Overflow, including here, but none of those solutions works for my exact issue. I feel like this should be easy, but I can't figure out how to go about it.

I found a solution based on another post that I found here:
df[df == ''] <- NA
library(dplyr)  # for %>%, filter() and if_any()
df %>% filter(if_any(V2:V4, ~ .x != V1))
Gives:
V1 V2 V3 V4
V3 gene_3 gene_3 gene_4 <NA>
V5 gene_5 gene_5 gene_8 <NA>
V6 gene_6 gene_6 gene_6 gene_7

Multiple paired Wilcoxon signed-rank tests between vectors, with medians of differences, confidence intervals and p-values compiled in an R data frame?

I have spent several hours trying to find a solution, without success. Does anyone know how to apply a Wilcoxon signed-rank test (paired = TRUE) of v2, v3 and v4 versus v1 (so v1 is the reference vector), in such a way that the following results are compiled in an R data frame like this:
# groups (pseudo)median conf.int_low conf.int_high p.value
# v1 vs v2 6.41 4.645 8.245 1.335e-05
# v1 vs v3 21.2875 20.270 22.125 1.907e-06
# v1 vs v4 -1.899768 -2.725023 -1.349986 9.542e-05
The above results come from:
wilcox.test(df$v1, df$v2, paired = TRUE, conf.int = TRUE)
wilcox.test(df$v1, df$v3, paired = TRUE, conf.int = TRUE)
wilcox.test(df$v1, df$v4, paired = TRUE, conf.int = TRUE)
The data are:
df <-
structure(list(
v1 = c(280, 237.48, 235.7, 250.3, 242.9, 244.76, 245.74, 244.4, 246.24, 242.3, 239.64, 245.88, 247, 247.7, 242.86, 244.99, 234.52, 241.9, 244.99, 221.85),
v2 = c(284.39, 231.79, 226.53, 250.2, 237.05, 237.05, 239.68, 239.68, 237.05, 237.05, 234.42, 239.68, 231.79, 244.94, 231.79, 237.05, 226.53, 239.68, 239.68, 208.12),
v3 = c(256.43, 215.81, 215.86, 231.35, 221.7, 222.51, 222.7, 222.26, 225.59, 220.95, 218.43, 221.14, 224.53, 225.68, 224.79, 222.54, 215.08, 219.96, 225.4, 210.99),
v4 = c(282.85, 238.43, 239.2, 252.75, 243.8, 245.81, 247.14, 246.3, 247.09, 243.85, 240.79, 247.18, 248.95, 248.6, 246.41, 247.04, 241.42, 242.4, 247.44, 234.35)),
row.names = c(NA, -20L), class = "data.frame")
Thanks for the help!

add networkx layout to holoview graph

I adapted the Facebook network example from this documentation to my own data, which gives this code:
import pandas as pd
import holoviews as hv
from bokeh.io import show

edges_df = pd.read_csv('rel.csv', delimiter=";")
nodes_df = pd.read_csv('monfichier.csv', delimiter=";")
padding = dict(x=(-1.1, 1.1), y=(-1.1, 1.1))
fb_nodes = hv.Dataset(nodes_df, 'index')
fb_graph = hv.Graph((edges_df, fb_nodes)).redim.range(**padding)
colors = ['#000000'] + hv.Cycle('Category20').values
fb_graph.opts(color_index='age', show_frame=False,
              xaxis=True, yaxis=True, node_size=10, edge_line_width=1, cmap=colors)
renderer = hv.renderer('bokeh')
plot = renderer.get_plot(fb_graph).state
show(plot)
It works fine, but the resulting network is drawn without a specific layout (as shown in the attached figure). I want to specify the network layout as in networkx. How can I do that?
I found this instruction:
hv.Graph.from_networkx(G, nx.layout.spring_layout).opts(tools=['hover'])
But I did not find how to use it with my code, since my G is already a HoloViews graph and not a networkx graph.
Do you have any suggestions?
There is a function called layout_nodes in HoloViews which can apply networkx (and other) layouts to an existing graph:
import numpy as np
import holoviews as hv
import networkx as nx

N = 8
node_indices = np.arange(N)
source = np.zeros(N)
target = node_indices
padding = dict(x=(-1.2, 1.2), y=(-1.2, 1.2))
simple_graph = hv.Graph(((source, target),)).redim.range(**padding)
hv.element.graphs.layout_nodes(simple_graph, layout=nx.spring_layout)
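Applied to the graph from the question, this would look roughly like the sketch below (it reuses the fb_graph and colors variables defined above; treat it as an illustration rather than tested code):
import networkx as nx

# Re-position the nodes of the existing HoloViews graph with a networkx layout,
# then re-apply the styling from the original plot.
fb_layout = hv.element.graphs.layout_nodes(fb_graph, layout=nx.spring_layout)
fb_layout.opts(color_index='age', show_frame=False,
               xaxis=True, yaxis=True, node_size=10, edge_line_width=1, cmap=colors)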

Reflecting boundary conditions in FiPy

I am attempting to solve the convection diffusion equation in FiPy. For the moment, all I am trying to achieve is a Neumann boundary condition, so that the wave reflects back at the right-hand boundary rather than travelling out of the domain.
I have added the following line:
phi.faceGrad.constrain(0, mesh.exteriorFaces)
But this doesn't seem to change anything.
Am I imposing the wrong boundary condition? Am I imposing it incorrectly? I have searched for this, but can't seem to find an example which has the simple property of a wave reflecting off a boundary! My code is below. Thanks so much.
from fipy import *
nx = 100
L = 1.
dx = L/nx
steps = 160
dt = 0.1
t = dt * steps
mesh = Grid1D(nx=nx, dx=dx)
x = mesh.cellCenters[0]
phi = CellVariable(name="solution variable", mesh=mesh, value=0.)
phi.setValue(1., where=(x>0.03) & (x<0.09))
# Diffusion and convection coefficients
D = FaceVariable(name='diffusion coefficient',mesh=mesh,value=1.*10**(-4.))
C = (0.1,)
# Boundary conditions
phi.faceGrad.constrain(0, mesh.exteriorFaces)
eq = TransientTerm() == DiffusionTerm(coeff=D) - ConvectionTerm(coeff=C)
for step in range(steps):
    eq.solve(var=phi, dt=dt)
    if step % 20 == 0:
        viewer = Viewer(vars=phi, datamin=0., datamax=1.)
        viewer.plot()
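For reference, Grid1D meshes also expose facesLeft and facesRight, so the constraint can be applied to the right-hand boundary alone rather than to every exterior face. A minimal sketch (this only narrows where the constraint acts; whether a zero gradient by itself reflects the advected pulse is exactly the open question above):
# Constrain only the right-hand boundary instead of all exterior faces.
phi.faceGrad.constrain(0., where=mesh.facesRight)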

Numpy slogdet computation error

There appears to be a major difference between numpy's slogdet and the exact result when computing the log determinant of a Vandermonde matrix.
I compare against the exact log determinant; see e.g. here for a proof.
The minimal code to see this is:
import numpy as np

A = np.power.outer(np.linspace(0, 1, 50), range(50))
print(np.linalg.slogdet(A)[1])

s = 0
for v1 in np.linspace(0, 1, 50):
    for v2 in np.linspace(0, 1, 50):
        if v1 > v2:
            s += np.log(v1 - v2)
print(s)
This yields:
-1191.88408998
-1706.99560647
I was wondering if there is a more accurate log-determinant implementation that I could use in this situation, but also for non-Vandermonde matrices.
You can use sympy and mpmath like this:
import numpy as np
import sympy as smp
import mpmath as mp
mp.mp.dps = 50
linspace1 = list(map(smp.mpmath.mpf,np.linspace(0,1,50)))
A = np.power.outer(list(map(float,linspace1)),range(50))
first_print = smp.mpmath.mpf(np.linalg.slogdet(A)[1])
print(first_print)
s = 0
linspace2 = list(map(smp.mpmath.mpf,np.linspace(0,1,50)))
linspace3 = list(map(smp.mpmath.mpf,np.linspace(0,1,50)))
for v1 in linspace1:
    for v2 in linspace2:
        if v1 > v2:
            s += mp.log(v1 - v2)
print(s)
RESULTS
first_print = -1178.272517342130186079884879291057586669921875
s = -1706.9956064674289001970168329846189154212781094939
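A leaner variant of the same idea, sketched with mpmath alone (the sympy.mpmath path used above has been removed from recent SymPy releases): build the Vandermonde matrix from arbitrary-precision numbers and take the log of its determinant directly.
import mpmath as mp

mp.mp.dps = 50  # working precision in decimal digits; raise it for harder matrices

# nodes i/49 match np.linspace(0, 1, 50) up to float rounding
xs = [mp.mpf(i) / 49 for i in range(50)]

# Vandermonde matrix built entirely from mpf values, so no float64 rounding
A = mp.matrix([[x**j for j in range(50)] for x in xs])

# log|det(A)|; mp.det performs an LU decomposition at the working precision
print(mp.log(abs(mp.det(A))))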

How to render two pd.DataFrames in jupyter notebook side by side?

Is there an easy way to quickly see contents of two pd.DataFrames side-by-side in Jupyter notebooks?
import pandas as pd

df1 = pd.DataFrame([(1, 2), (3, 4)], columns=['a', 'b'])
df2 = pd.DataFrame([(1.1, 2.1), (3.1, 4.1)], columns=['a', 'b'])
df1, df2
You should try this function from Wes McKinney:
def side_by_side(*objs, **kwds):
    '''Print objects side by side.'''
    from pandas.io.formats.printing import adjoin
    space = kwds.get('space', 4)
    reprs = [repr(obj).split('\n') for obj in objs]
    print(adjoin(space, *reprs))
# building a test case of two DataFrame
import pandas as pd
import numpy as np
n, p = (10, 3) # dfs' shape
# dfs indexes and columns labels
index_rowA = [t[0]+str(t[1]) for t in zip(['rA']*n, range(n))]
index_colA = [t[0]+str(t[1]) for t in zip(['cA']*p, range(p))]
index_rowB = [t[0]+str(t[1]) for t in zip(['rB']*n, range(n))]
index_colB = [t[0]+str(t[1]) for t in zip(['cB']*p, range(p))]
# building dfA and dfB
dfA = pd.DataFrame(np.random.rand(n,p), index=index_rowA, columns=index_colA)
dfB = pd.DataFrame(np.random.rand(n,p), index=index_rowB, columns=index_colB)
side_by_side(dfA, dfB)
Outputs:
cA0 cA1 cA2 cB0 cB1 cB2
rA0 0.708763 0.665374 0.718613 rB0 0.320085 0.677422 0.722697
rA1 0.120551 0.277301 0.646337 rB1 0.682488 0.273689 0.871989
rA2 0.372386 0.953481 0.934957 rB2 0.015203 0.525465 0.223897
rA3 0.456871 0.170596 0.501412 rB3 0.941295 0.901428 0.329489
rA4 0.049491 0.486030 0.365886 rB4 0.597779 0.201423 0.010794
rA5 0.277720 0.436428 0.533683 rB5 0.701220 0.261684 0.502301
rA6 0.391705 0.982510 0.561823 rB6 0.182609 0.140215 0.389426
rA7 0.827597 0.105354 0.180547 rB7 0.041009 0.936011 0.613592
rA8 0.224394 0.975854 0.089130 rB8 0.697824 0.887613 0.972838
rA9 0.433850 0.489714 0.339129 rB9 0.263112 0.355122 0.447154
The closest to what you want could be:
> df1.merge(df2, right_index=1, left_index=1, suffixes=("_1", "_2"))
a_1 b_1 a_2 b_2
0 1 2 1.1 2.1
1 3 4 3.1 4.1
It's not specific to the notebook, but it works and it's not that complicated. Another solution would be to convert your data frames to images and put them side by side in subplots, but that's a bit far-fetched and complicated.
I ended up using a helper function to quickly compare two data frames:
def cmp(df1, df2, topn=10):
    n = topn
    a = df1.reset_index().head(n=n)
    b = df2.reset_index().head(n=n)
    span = pd.DataFrame(data=[('-',) for _ in range(n)], columns=['sep'])
    a = a.merge(span, right_index=True, left_index=True)
    return a.merge(b, right_index=True, left_index=True, suffixes=['_L', '_R'])
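If the goal is the rendered HTML tables rather than plain-text output, another common notebook approach is to concatenate the frames' HTML representations and display them inline. A sketch (assumes IPython is available; display_side_by_side is just an illustrative name):
from IPython.display import display_html

def display_side_by_side(*dfs):
    # Render each DataFrame to HTML, force the tables onto one line with
    # inline-block styling, then display the combined markup as raw HTML.
    html = "".join(
        df.to_html().replace('<table', '<table style="display:inline-block; margin-right:2em"')
        for df in dfs
    )
    display_html(html, raw=True)

display_side_by_side(df1, df2)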