Get node weight for a simple bipartite graph - data-science

I've created a bipartite networkx graph from a CSV file that maps Disorders to Symptoms.
So, a disorder may be linked to one or more Symptoms.
for disorder, symptoms in csv_dictionary.items():
    for i in range(len(symptoms)):
        G.add_edge(disorder, symptoms[i])
What I need is to find which Symptoms are connected to multiple diseases and sort them according to their weight.
Any suggestions?

You can use the degree of the created graph. Every symptom with a degree larger than 1 belongs to at least two diseases:
I've added an example csv_dictionary (please supply one in your next question as a minimal reproducible example) and created a set of all symptoms while building the graph. You could also think about adding this information to the graph as a node attribute (see the sketch after the code below).
import networkx as nx

csv_dictionary = {"a": ["A"], "b": ["B"], "c": ["A", "C"], "d": ["D"], "e": ["E", "B"], "f": ["F"], "g": ["F"], "h": ["F"]}

G = nx.Graph()
all_symptoms = set()
for disorder, symptoms in csv_dictionary.items():
    for i in range(len(symptoms)):
        G.add_edge(disorder, symptoms[i])
        all_symptoms.add(symptoms[i])

symptoms_with_multiple_diseases = [symptom for symptom in all_symptoms if G.degree(symptom) > 1]
print(symptoms_with_multiple_diseases)
# ['B', 'F', 'A']

sorted_symptoms = sorted(symptoms_with_multiple_diseases, key=lambda symptom: G.degree(symptom))
print(sorted_symptoms)
# ['B', 'A', 'F']
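If you also want the number of linked diseases stored on each symptom node, as hinted above, one possible sketch (my addition, not part of the original answer) uses a node attribute; the attribute name disease_count is arbitrary:

disease_counts = {symptom: G.degree(symptom) for symptom in all_symptoms}
nx.set_node_attributes(G, disease_counts, name="disease_count")

# Sort descending so the most widely shared symptoms come first.
most_shared = sorted(symptoms_with_multiple_diseases, key=lambda s: G.nodes[s]["disease_count"], reverse=True)
print(most_shared)
# e.g. ['F', 'A', 'B'] (A and B tie, so their order may vary)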

Related

pandas chained indexing: how can we predict whether a view or a copy is created?

I have a specific question on pandas (v1.5.1) chained indexing. I know this issue has been discussed a lot and there are many questions regarding copies and views of data frames in SO. I also know how to properly set values using a single accessor and by avoiding chained indexing.
Still, I would appreciate your help so that I understand what is going on.
The code below makes two attempts:
import numpy as np
import pandas as pd
data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5),
        "z": np.array([45, 98, 24, 11, 64])}
index = ["a", "b", "c", "d", "e"]
# attempt 1
df = pd.DataFrame(data=data, index=index).astype(dtype={"z": int})
print('-- attempt 1 --')
print('same base I : ',df.loc["a":"c"]["z"].to_numpy().base is df.to_numpy().base)
print('same base II: ',df.loc["a":"c"]["z"].to_numpy().base is df.to_numpy().base)
print('is view: ',df.loc["a":"c"]["z"]._is_view)
print('is copy: ',df.loc["a":"c"]["z"]._is_copy)
# SettingWithCopyWarning, df unmodified
df.loc["a":"c"]["z"] = 0
# attempt 2
df = pd.DataFrame(data=data, index=index).astype(dtype={"z": int})
print('-- attempt 2 --')
print('same base I : ',df["z"].loc["a":"c"].to_numpy().base is df.to_numpy().base)
print('same base II: ',df["z"].loc["a":"c"].to_numpy().base is df.to_numpy().base)
print('is view: ',df["z"].loc["a":"c"]._is_view)
print('is copy: ',df["z"].loc["a":"c"]._is_copy)
# df modified
df["z"].loc["a":"c"] = 0
The first attempt does not modify the data frame. The second one does. I would like to be able to predict this outcome.
The output of the code is
-- attempt 1 --
same base I : False
same base II: True
is view: True
is copy: <weakref at 0x000002A1B3868310; to 'DataFrame' at 0x000002A1B37D68E0>
-- attempt 2 --
same base I : False
same base II: True
is view: True
is copy: None
<ipython-input-719-49005df16cfc>:15: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.loc["a":"c"]["z"] = 0
I thought that I could use .to_numpy().base to predict the different outcomes of the two attempts. Unfortunately, this does not help. In addition, repeating the base check gives different results, which is even more confusing. What is going on?
I have also included calls to the internals ._is_view and ._is_copy in case it helps, although I do not understand their behaviour either.
Thank you in advance for your time and help.
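For reference, the single-accessor form the question alludes to (and which the warning message itself recommends) avoids the chained-indexing ambiguity entirely; this is just the standard pattern, not an explanation of the view/copy behaviour above:

# One .loc call with both row and column indexers writes directly into df,
# rather than into a possibly temporary intermediate object.
df.loc["a":"c", "z"] = 0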

numpy array of array adding up another array

I have the following arrays:
a = np.array([[1,2,3],[4,5,6]])
b = np.array([[1,5,10]])
and want to add the values in b to every row of a, giving
np.array([[2,7,13],[5,10,16]])
What is the best approach, performance-wise, to achieve this?
Thanks
Broadcasting does that for you, so:
>>> a+b
just works:
array([[ 2,  7, 13],
       [ 5, 10, 16]])
And it can also be done with
>>> a + np.tile(b,(2,1))
which gives the result
array([[ 2,  7, 13],
       [ 5, 10, 16]])
Depending on the size of the inputs and your time constraints, both of the following methods are worth considering.
Method 1: NumPy broadcasting
An operation on two arrays is possible if their shapes are compatible, and the operation is generally carried out with broadcasting. In layman's terms, broadcasting means repeating elements along an axis so that the shapes match.
Conditions for broadcasting:
The arrays need to be compatible, and compatibility is decided by their shapes.
The shapes are compared from right to left; at each position the dimensions must either be equal or one of them must be 1.
The smaller array is broadcast (repeated) over the bigger array.
a.shape, b.shape
((2, 3), (1, 3))
By these rules the two arrays are compatible, so they can be added: b is smaller, so it is repeated along the first dimension and can be treated as [[1, 5, 10], [1, 5, 10]]. Note that NumPy does not allocate new memory for this; it is just a view.
a + b
array([[ 2,  7, 13],
       [ 5, 10, 16]])
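To see that the broadcast really is a view rather than a new allocation, here is a small illustration using np.broadcast_to (my addition, not part of the original answer); the strides shown assume a 64-bit integer dtype:

import numpy as np

b = np.array([[1, 5, 10]])
bb = np.broadcast_to(b, (2, 3))   # virtual repetition of b's single row
print(bb)
# [[ 1  5 10]
#  [ 1  5 10]]
print(bb.strides)
# (0, 8): stride 0 along axis 0 means the same row is reused, not copied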
Method 2: Numba
Numba compiles the function to optimized machine code and gives easy parallelism.
The reason to bother is that sometimes NumPy broadcasting is not good enough: ufuncs (np.add, np.matmul, etc.) allocate temporary memory during operations, which can be time consuming if you are already near the memory limit.
With Numba, depending on your requirements, you may not need the temporary allocations or the various checks NumPy performs, which can speed up code for huge inputs; see, for example, "Why are np.hypot and np.subtract.outer very fast?".
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def sum(a, b):
    s = np.empty(a.shape, dtype=a.dtype)
    # nb.prange hints to Numba which loop to parallelize
    for i in nb.prange(a.shape[0]):
        s[i] = a[i] + b[0]  # b has shape (1, 3), so take its single row
    return s

sum(a, b)

repmat with interlace or Kronecker product in Tensorflow

Suppose I have a tensor:
A=[[1,2,3],[4,5,6]]
Which is a matrix with 2 rows and 3 columns.
I would like to replicate it, suppose twice, to get the following tensor:
A2 = [[1,2,3],
      [1,2,3],
      [4,5,6],
      [4,5,6]]
Plain tiling (the equivalent of MATLAB's repmat, i.e. tf.tile on its own) clearly replicates it in a different order, so I tried the following code (which works):
A_tiled = tf.reshape(tf.tile(A, [1, 2]), [4, 3])
Unfortunately, it seems to be working very slow when the number of columns become large. Executing it in Matlab using Kronecker product with a vector of ones (Matlab's "kron") seems to be much faster.
Can anyone help?
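One possible alternative (an untested sketch, not from the original question, assuming a TensorFlow version that provides tf.repeat, i.e. 1.15/2.x or later) repeats each row directly:

import tensorflow as tf

A = tf.constant([[1, 2, 3], [4, 5, 6]])
# tf.repeat duplicates each row consecutively, giving the interleaved layout
A2 = tf.repeat(A, repeats=2, axis=0)
print(A2)
# [[1 2 3]
#  [1 2 3]
#  [4 5 6]
#  [4 5 6]]

Whether this is faster than the tile-and-reshape approach for wide matrices would need benchmarking.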

How can I store sorted data in Redis with a repeated member?

I am new to Redis and have the following problem:
Given the Sorted Set myzset: [ [1,"A"], [2, "B"], [3, "C"] ]
I want to be able to add [4, "A"] in the set.
So far if I use
ZADD myzset 4 "A"
because the member "A" is already in the set I get back
[ [4,"A"], [2, "B"], [3, "C"] ]
rather than
[ [1,"A"], [2, "B"], [3, "C"], [4, "A"] ]
How can I insert data such that the set would be
[ [1,"A"], [2, "B"], [3, "C"], [4, "A"] ] ?
Redis' Sorted Sets (and regular Sets) do not allow duplicate members. You should reconsider what you're trying to do (perhaps even edit your question to explain about the data you're storing and how you want to retrieve it) and possibly use a different approach and/or data structure.
In cases where it is necessary and makes sense to store a non-unique member in a Sorted Set you'd usually concatenate some sort of a unique identifier to the member. For example, if you're storing a timeseries (e.g. measurements from a device) you'd store the timestamp as score and id:timestamp as the member.
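As a hedged illustration of the id:timestamp idea (using the redis-py client, version 3.x or later; the key and member names here are made up for the example):

import time
import redis

r = redis.Redis()

# Store each reading under a member that is unique per measurement,
# e.g. "<device_id>:<timestamp>", with the timestamp as the score.
ts = int(time.time())
r.zadd("readings:deviceA", {f"deviceA:{ts}": ts})
r.zadd("readings:deviceA", {f"deviceA:{ts + 1}": ts + 1})

# Range queries by time still work, and repeated values no longer collide,
# because the members themselves are unique.
print(r.zrangebyscore("readings:deviceA", ts, ts + 1))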

Inconsistent interface of Pandas Series; yielding access to underlying data

While the new Categorical Series support since pandas 0.15.0 is fantastic, I'm a bit annoyed with how they decided to make the underlying data inaccessible except through underscored variables. Consider the following code:
import numpy as np
import pandas as pd
x = np.empty(3, dtype=np.int64)
s = pd.DatetimeIndex(x, tz='UTC')
x
Out[17]: array([140556737562568, 55872352, 32])
s[0]
Out[18]: Timestamp('1970-01-02 15:02:36.737562568+0000', tz='UTC')
x[0] = 0
s[0]
Out[20]: Timestamp('1970-01-01 00:00:00+0000', tz='UTC')
y = s.values
y[0] = 5
x[0]
Out[23]: 5
s[0]
Out[24]: Timestamp('1970-01-01 00:00:00.000000005+0000', tz='UTC')
We can see that both during construction and when asked for its underlying values, this DatetimeIndex makes no deep copies of its underlying data. Not only is this potentially useful in terms of efficiency, but it's great if you are using a DataFrame as a buffer. You can easily get the numpy primitive containing the underlying data, and from there get a pointer to the raw data, which some low-level C routine can use to copy data in from some block of memory.
Now let's look at the behavior of the new Categorical Series. The underlying data of course is not the levels, but the codes.
x2 = np.zeros(3, dtype=np.int64)
s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
s2
Out[27]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
x2[0] = 1
s2[0]
Out[29]: 'hello'
y2 = s2.codes
y2[0] = 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-0366d645c98d> in <module>()
----> 1 y2[0] = 1
ValueError: assignment destination is read-only
y2 = s2._codes
y2[0] = 1
s2[0]
Out[34]: 'bye'
The net effect of this behavior is that as a developer, efficient manipulation of the underlying data for Categoricals is not part of the interface. Also as a user, the from_codes constructor is slow as it deep copies the codes, which may often be unnecessary. There should at least be an option for this.
But the fact that codes is a read-only variable and _codes needs to be used instead strikes me as worse. Why wouldn't .codes give the same behavior as .values? Is there some justification for this beyond the notion that the codes are "private"? I'm hoping some of the pandas gurus on stackoverflow can shed some light on this.
The Categorical type is different from almost all other types in that it is a compound type that carries a certain guarantee among its data, namely that the codes provide a factorization of the levels.
So the argument against mutability is that it would be easy to break the codes-categories mapping, and it could be non-performant. Of course these could possibly be mitigated with checking on the setitem instead (but with some added code complexity).
The vast majority of users are not going to manipulate the codes/categories directly (and will only use the exposed methods), so this is really a protection against accidentally breaking these guarantees.
If you need to efficiently manipulate the underlying data, best/easiest is simply to pull out the codes/categories. Mutate them, then create a new Categorical (which is cheap if codes/categories are already provided).
e.g.
In [3]: s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
In [4]: s2
Out[4]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
In [5]: s2.codes
Out[5]: array([0, 0, 0], dtype=int8)
In [6]: pd.Categorical(s2.codes+1,s2.categories,fastpath=True)
Out[6]:
[bye, bye, bye]
Categories (2, object): [hello, bye]
Of course this is quite dangerous: if you added 2 instead, the expression would blow up, because the codes would then point past the last category. Manipulating the codes directly is simply buyer-beware.
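To make the buyer-beware point concrete, here is a small sketch (my addition, not from the answer above) that goes through from_codes, which validates the codes and raises instead of silently producing a broken Categorical:

import numpy as np
import pandas as pd

s2 = pd.Categorical.from_codes(np.zeros(3, dtype=np.int8), ["hello", "bye"])

new_codes = s2.codes + 1          # shift every code by one
# from_codes validates its input, so an out-of-range shift (e.g. + 2)
# raises a ValueError instead of corrupting the Categorical.
shifted = pd.Categorical.from_codes(new_codes, s2.categories)
print(shifted)
# [bye, bye, bye]
# Categories (2, object): [hello, bye]

This pays the validation (and copy) cost that the fastpath avoids, which is exactly the trade-off described above.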