Find nodes with common connections - data-science

At the moment, I've created a bipartite networkx graph that maps Disorders to Symptoms, so a disorder may be linked to one or more Symptoms.
I also have some basic statistics, like Symptoms linked to at least one Disorder, etc.
import networkx as nx

csv_dictionary = {"Da": ["A", "C"], "Db": ["B"], "Dc": ["A", "C", "F"], "Dd": ["D"], "De": ["E", "B"], "Df": ["F"], "Dg": ["F"], "Dh": ["F"]}

G = nx.Graph()
all_symptoms = set()
for disorder, symptoms in csv_dictionary.items():
    for symptom in symptoms:
        G.add_edge(disorder, symptom)
        all_symptoms.add(symptom)

symptoms_with_multiple_diseases = [symptom for symptom in all_symptoms if G.degree(symptom) > 1]
sorted_symptoms = sorted(symptoms_with_multiple_diseases, key=lambda symptom: G.degree(symptom))
What I need is to find Disorders that share at least two Symptoms, i.e. Disorders that have two Symptoms in common with each other.
I've done some research and I think I should add weights to my edges based on how they connect, but I cannot wrap my head around it.
So, in the above example, Da and Dc share two Symptoms (A and C).

You could iterate over the length-2 combinations of the disorder nodes (optionally only those with a degree higher than 1) and find the nx.common_neighbors of each pair, keeping only those pairs that share at least 2 neighbours.
So start instead by keeping track of all disorders too:
all_symptoms = set()
all_disorders = set()
for disorder, symptoms in csv_dictionary.items():
    for symptom in symptoms:
        G.add_edge(disorder, symptom)
        all_symptoms.add(symptom)
    all_disorders.add(disorder)
Check which disorders have a degree higher than 1, i.e. are linked to more than one symptom:
disorders_with_multiple_symptoms = [disorder for disorder in all_disorders
                                    if G.degree(disorder) > 1]
And then iterate over all length-2 combinations of all_disorders:
from itertools import combinations

common_symptoms = dict()
for nodes in combinations(all_disorders, r=2):
    cn = list(nx.common_neighbors(G, *nodes))
    if len(cn) > 1:
        common_symptoms[nodes] = cn

print(common_symptoms)
# {('Da', 'Dc'): ['A', 'C']}
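If you want to pursue the edge-weight idea from the question, a minimal sketch, assuming networkx's bipartite weighted projection is suitable here, would project the graph onto the disorder nodes so that each edge weight is the number of shared symptoms:
from networkx.algorithms import bipartite

# Project the bipartite graph onto the disorder nodes; the weight of each
# edge is the number of symptoms the two disorders share.
P = bipartite.weighted_projected_graph(G, all_disorders)

# Keep only disorder pairs sharing at least two symptoms.
pairs = [(u, v) for u, v, w in P.edges(data="weight") if w > 1]
print(pairs)  # e.g. [('Da', 'Dc')]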

Related

How to add two partially overlapping numpy arrays and extend the non-overlapping parts?

I'm adding a short audio signal (a 1-D numpy array of a musical note) to roughly the end of a longer signal (the first part of the audio stream constructed so far). I'd like to add the overlapping part and extend the non-overlapping part. What is the most efficient way to achieve this? I can identify the overlapping part and add it to the main signal while concatenating the non-overlapping part, but I don't think this is sufficiently efficient. I also think making them the same size by padding with zeros is very memory-inefficient. Is there a numpy or scipy function for achieving this?
NumPy arrays are contiguous memory blocks. Those of a and b are almost guaranteed not to be contiguous with each other, so your only real options are to extend one with a copy of the second or to create a new array containing the result you want.
I don't know your constraints, but I suspect you're trying to prematurely optimize. Just write something clear first and optimize if it doesn't meet whatever needs you have:
import numpy as np

def add_signal(a, b, ai=0, bi=0):
    # Sum a and b into one array, with a starting at index ai and b at index bi.
    assert ai >= 0
    assert bi >= 0
    al = len(a)
    bl = len(b)
    cl = max(ai + al, bi + bl)
    c = np.zeros(cl)
    c[ai: ai + al] += a
    c[bi: bi + bl] += b
    return c
Example:
a = np.array([0, 1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40])
add_signal(a, b, bi=len(a)-3)
Output:
array([ 0., 1., 2., 13., 24., 35., 40.])

Create a grid of pie charts with Pandas or Seaborn

Given this DataFrame:
x = pd.DataFrame({"A": [11, 3, 7], "B": [4, 12, 8], "C": [5, 5, 5]}, index=["s1", "s2", "s3"] )
This corresponds to the grades of students s1, s2, and s3 over a semester. Student s1, for example, got 11 A's, 4 B's and 5 C's. There were 20 assignments in total.
I would like to create a collection of small pie charts showing the proportions of A, B and C grades for each student.
In my real data set I might have 80 students, so I would like a grid of, say, 8 by 10 tiny pie charts, each labeled with the student's ID.
I've pored over the docs, but I can't find a good elegant solution other than literally iterating with Python. But I feel there ought to be a nicer way.
The real test dataset
When I use the dataset below (basically the same as above) and then try variations of the following to create my grid of pies, the pies are always squashed in different directions.
df.T.plot.pie(subplots=True, figsize=[6,50], layout=[10,4], legend=False)
I can't make sense of what figsize is doing. I've looked through the docs and plenty of Stack Overflow posts to help me understand the units. Basically, the parameter seems to be ignored. Here's the data:
$ cat data.csv
,A,B,C,D
as9.2,31,0,0,0
as22.2,17,9,1,4
as21.1,16,15,0,0
as16.2,15,12,4,0
as17.1,12,15,4,0
as7.1,12,8,11,0
coursetotal,11,17,3,0
as22.1,11,17,1,2
as24.1,9,18,0,4
as22.9,7,5,0,0
as19.1,6,21,2,0
as18.2,6,18,5,2
as10.2,5,21,5,0
as14.2,4,23,4,0
as15.1,4,21,1,5
as20.1,4,16,9,2
as16.1,0,27,4,0
With pandas, layout sets how many subplots you get per row and column; here I am using one row of 3:
x.T.plot.pie(subplots=True, figsize=(7, 2),layout=(1,3))
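For a bigger grid (e.g. 80 students), a minimal sketch using plain matplotlib subplots, where the 8-column grid and the reuse of the x DataFrame above are assumptions, keeps every pie circular by forcing an equal aspect ratio on each axis:
import math
import matplotlib.pyplot as plt

ncols = 8
nrows = math.ceil(len(x) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows), squeeze=False)
axes = axes.flatten()

# one pie per student (one DataFrame row), titled with the student's id
for ax, (student, grades) in zip(axes, x.iterrows()):
    ax.pie(grades, labels=grades.index)
    ax.set_title(student)
    ax.set_aspect("equal")  # keep each pie circular instead of squashed

# hide any unused axes at the end of the grid
for ax in axes[len(x):]:
    ax.axis("off")

plt.tight_layout()
plt.show()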

How to perform matching between two sequences?

I have two mini-batches of sequences:
a = C.sequence.input_variable((10))
b = C.sequence.input_variable((10))
Both a and b have variable-length sequences.
I want to do matching between them, where matching is defined as: match (e.g. dot product) the token at each time step of a with the token at every time step of b.
How can I do this?
I have mostly answered this on GitHub, but to be consistent with SO rules, I am including a response here. In the case of something simple like a dot product, you can take advantage of the fact that it factorizes nicely, so the following code works:
axisa = C.Axis.new_unique_dynamic_axis('a')
axisb = C.Axis.new_unique_dynamic_axis('b')
a = C.sequence.input_variable(1, sequence_axis=axisa)
b = C.sequence.input_variable(1, sequence_axis=axisb)
c = C.sequence.broadcast_as(C.sequence.reduce_sum(a), b) * b
c.eval({a: [[1, 2, 3],[4, 5]], b: [[6, 7], [8]]})
[array([[ 36.],
        [ 42.]], dtype=float32), array([[ 72.]], dtype=float32)]
In the general case you need the following steps:
static_b, mask = C.sequence.unpack(b, neutral_value).outputs
scores = your_score(a, static_b)
The first line will convert the b sequence into a static tensor with one more axis than b. Because of packing, some elements of this tensor will be invalid and those will be indicated by the mask. The neutral_value will be placed as a dummy value in the static_b tensor wherever data was missing. Depending on your score you might be able to arrange for the neutral_value to not affect the final score (e.g. if your score is a dot product a 0 would be a good choice, if it involves a softmax -infinity or something close to that would be a good choice). The second line can now have access to each element of a and all the elements of b as the first axis of static_b. For a dot product static_b is a matrix and one element of a is a vector so a matrix vector multiplication will result in a sequence whose elements are all inner products between the corresponding element of a and all elements of b.
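For the dot-product case specifically, a rough sketch of those two steps might look like the one below; the choice of 0 as the neutral value and the use of broadcast_as plus times to realize the matrix-vector product described above are my assumptions, not code from the original answer:
import cntk as C

axisa = C.Axis.new_unique_dynamic_axis('a')
axisb = C.Axis.new_unique_dynamic_axis('b')
a = C.sequence.input_variable(10, sequence_axis=axisa)
b = C.sequence.input_variable(10, sequence_axis=axisb)

# Step 1: unpack b into a static tensor of shape (max_len_b, 10);
# 0 is a safe neutral value when the score is a plain dot product.
static_b, mask = C.sequence.unpack(b, 0).outputs

# Step 2: broadcast the static tensor along a's dynamic axis and take a
# matrix-vector product at every step of a, giving one score per step of b.
scores = C.times(C.sequence.broadcast_as(static_b, a), a)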

Create a dataframe from MultiLCA results in Brightway2

I am trying to create a pandas DataFrame from the results of a MultiLCA calculation, using the methods as columns and the functional units as rows. I did find a sort of solution, but it is a bit cumbersome (I am not very good with dictionaries).
...
mlca = MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)

fu_names = []
for d in mlca.func_units:
    for key in d:
        fu_names.append(str(key))

dfresults['fu'] = fu_names
dfresults.set_index('fu', inplace=True)
Is there a more elegant way of doing this? The names are also very long, but that's a different story...
Your code seems relatively elegant to me. If you want to stick with str(key), then you could simplify it somewhat with a list comprehension:
mlca=MultiLCA("my_calculation_setup")
dfresults = pd.DataFrame(mlca.results, columns=mlca.methods)
dfresults['fu'] = [str(key) for demand in mlca.func_units for key in demand]
dfresults.set_index('fu', inplace=True)
Note that this only works if your demand dictionaries have one activity each. You could have situations where one demand dictionary would have two activities (like LCA({'foo': 1, 'bar': 2})), where this would fail because there would be too many elements in the fu list.
If you do know that you only have one activity per demand, then you can make a slightly nicer dataframe as follows:
mlca = MultiLCA("my_calculation_setup")
scores = pd.DataFrame(mlca.results, columns=mlca.methods)
as_activities = [
    (get_activity(key), amount)
    for dct in mlca.func_units
    for key, amount in dct.items()
]
nicer_fu = pd.DataFrame(
    [
        (x['database'], x['code'], x['name'], x['location'], x['unit'], y)
        for x, y in as_activities
    ],
    columns=('Database', 'Code', 'Name', 'Location', 'Unit', 'Amount')
)
nicer = pd.concat([nicer_fu, scores], axis=1)
However, in the general case dataframes are not a perfect match for calculation setups. When a demand dictionary has multiple activities, there is no nice way to "squish" this into one dimension or one row.

Looking for built-in, invertible, list-of-list-accepting constructor/deconstructor pair for pandas dataframes

Are there built-in ways to construct/deconstruct a dataframe from/to a Python list-of-Python-lists?
As far as the constructor (let's call it make_df for now) that I'm looking for goes, I want to be able to write the initialization of a dataframe from literal values, including columns of arbitrary types, in an easily-readable form, like this:
df = make_df([[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]],
['d', 'i'])
For the deconstructor, I want to essentially recover from a dataframe df the arguments one would need to pass to such make_df to re-create df.
AFAIK,
officially at least, the pandas.DataFrame constructor accepts only a numpy ndarray, a dict, or another DataFrame (and not a simple Python list-of-lists) as its first argument;
the pandas.DataFrame.values property does not preserve the original data types.
I can roll my own functions to do this (e.g., see below), but I would prefer to stick to built-in methods, if available. (The Pandas API is pretty big, and some of its names not what I would expect, so it is quite possible that I have missed one or both of these functions.)
FWIW, below is a hand-rolled version of what I described above, minimally tested. (I doubt that it would be able to handle every possible corner-case.)
import pandas as pd
import collections as co
import pandas.util.testing as pdt

def make_df(values, columns):
    return pd.DataFrame(co.OrderedDict([(columns[i],
                                         [row[i] for row in values])
                                        for i in range(len(columns))]))

def unmake_df(dataframe):
    columns = list(dataframe.columns)
    return ([[dataframe[c][i] for c in columns] for i in dataframe.index],
            columns)

values = [[9.75, 1],
          [6.375, 2],
          [9., 3],
          [0.25, 1],
          [1.875, 2],
          [3.75, 3],
          [8.625, 1]]

columns = ['d', 'i']

df = make_df(values, columns)
Here's the output of the call to make_df above:
>>> df
       d  i
0  9.750  1
1  6.375  2
2  9.000  3
3  0.250  1
4  1.875  2
5  3.750  3
6  8.625  1
A simple check of the round-trip [1]:
>>> df == make_df(*unmake_df(df))
True
>>> (values, columns) == unmake_df(make_df(*(values, columns)))
True
BTW, this is an example of the loss of the original values' types:
>>> df.values
array([[ 9.75 ,  1.   ],
       [ 6.375,  2.   ],
       [ 9.   ,  3.   ],
       [ 0.25 ,  1.   ],
       [ 1.875,  2.   ],
       [ 3.75 ,  3.   ],
       [ 8.625,  1.   ]])
Notice how the values in the second column are no longer integers, as they were originally.
Hence,
>>> df == make_df(df.values, columns)
False
[1] In order to be able to use == to test for equality between dataframes above, I resorted to a little monkey-patching:
def pd_DataFrame___eq__(self, other):
    try:
        pdt.assert_frame_equal(self, other,
                               check_index_type=True,
                               check_column_type=True,
                               check_frame_type=True)
    except:
        return False
    else:
        return True

pd.DataFrame.__eq__ = pd_DataFrame___eq__
Without this hack, expressions of the form dataframe_0 == dataframe_1 would have evaluated to dataframe objects, not simple boolean values.
I'm not sure what documentation you are reading, because the link you give explicitly says that the default constructor accepts other list-like objects (one of which is a list of lists).
In [6]: pandas.DataFrame([['a', 1], ['b', 2]])
Out[6]:
   0  1
0  a  1
1  b  2

[2 rows x 2 columns]
In [7]: t = pandas.DataFrame([['a', 1], ['b', 2]])
In [8]: t.to_dict()
Out[8]: {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
Notice that I use to_dict at the end, rather than trying to get back the original list of lists. This is because it is an ill-posed problem to get the list arguments back (unless you make an overkill decorator or something to actually store the ordered arguments that the constructor was called with).
The reason is that a pandas DataFrame, by default, is not an ordered data structure, at least in the column dimension. You could have permuted the order of the column data at construction time, and you would get the "same" DataFrame.
Since there can be many differing notions of equality between two DataFrames (e.g. same columns even including type, or just same named columns, or same columns in the same order, or just same columns in mixed order, etc.), pandas defaults to trying to be the least specific about it (Python's principle of least astonishment).
So it would not be good design for the default or built-in constructors to choose an overly specific idea of equality for the purposes of returning the DataFrame back down to its arguments.
For that reason, using to_dict is better since the resulting keys will encode the column information, and you can choose to check for column types or ordering however you want to for your own application. You can even discard the keys by iterating the dict and simply pumping the contents into a list of lists if you really want to.
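For instance, a minimal sketch of that last discarding step, reusing the small frame t from above (numeric values may come back as numpy scalars depending on your pandas version), might look like this:
d = t.to_dict()                     # {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
columns = list(d)                   # the column labels
rows = [[d[c][i] for c in columns]  # pump the contents back into rows
        for i in t.index]
print(rows)                         # [['a', 1], ['b', 2]]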
In other words, because order might not matter among the columns, the "inverse" of the list-of-lists constructor maps backwards into a bigger set, namely all the permutations of the same column data. So the inverse you're looking for is not well-defined without assuming more structure, and casual users of a DataFrame might not want or need to make those extra assumptions to get the invertibility.
As mentioned elsewhere, you should use DataFrame.equals to do equality checking among DataFrames, or pandas.util.testing.assert_frame_equal, whose many options allow you to specify the specific kind of equality testing that makes sense for your application, while equals remains a reasonably generic default.
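For example, a short sketch reusing the imports and helpers from the question:
# strict built-in check: same shape, labels, dtypes and values
print(df.equals(make_df(*unmake_df(df))))       # True

# configurable check via assert_frame_equal (raises on mismatch)
pdt.assert_frame_equal(df, make_df(values, columns),
                       check_dtype=True,        # also compare column dtypes
                       check_index_type=True)   # also compare the index type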