Moving from a tabular to a graph representation of data - pandas

Suppose that I have the following data:
activity  teacher  group  students  duration  subject
One       A        a      3         45        Math
One       B        b      2         45        Math
two       A        c      7         60        P.E
One       D        a      3         45        Math
two       C        c      7         60        P.E
I want to construct graph data from this tabular data. My actual goal is to predict the teacher by applying some kind of graph ML. Is there a way to transform the tabular data into graph data, maybe using NetworkX?
I tried the following code:
import matplotlib.pyplot as plt
import networkx as nx
G = nx.from_pandas_edgelist(df, "subject", "teacher", edge_attr=True, create_using=nx.Graph())
nx.draw_networkx(G)
plt.show()
The output looks like a graph, but I don't understand how it works, how I can get the new data out of it, or what the best way is to decide which columns become nodes and which become edges.
thank you in advance for any help.
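To make clearer what the call above produces, here is a minimal sketch that rebuilds the sample table and inspects the resulting graph. It assumes the same choice as the question (each subject and each teacher becomes a node, and each row becomes one edge between them). With edge_attr=True the remaining columns are stored as attributes on that edge. Whether this is a good graph design for predicting the teacher is a separate modelling question that the sketch does not answer.

```python
import networkx as nx
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "activity": ["One", "One", "two", "One", "two"],
    "teacher": ["A", "B", "A", "D", "C"],
    "group": ["a", "b", "c", "a", "c"],
    "students": [3, 2, 7, 3, 7],
    "duration": [45, 45, 60, 45, 60],
    "subject": ["Math", "Math", "P.E", "Math", "P.E"],
})

# Each row becomes one edge subject -- teacher; the other columns
# (activity, group, students, duration) become edge attributes.
G = nx.from_pandas_edgelist(df, source="subject", target="teacher", edge_attr=True)

# Nodes are the distinct values found in the two chosen columns.
print(sorted(G.nodes()))

# Edges carry the remaining columns as a dict of attributes.
for u, v, attrs in G.edges(data=True):
    print(u, v, attrs)

# The graph can be turned back into a table of edges when needed.
edge_table = nx.to_pandas_edgelist(G)
```

Note that with an undirected nx.Graph two rows sharing the same subject and teacher would collapse into a single edge, keeping only the attributes of the last row; nx.MultiGraph could be passed as create_using if such duplicates need to be preserved.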

Related

How can I detect similarity of names in the same columns

Guys I have a dataset like this:
`
import pandas as pd
df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly', 'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
`
It gives this output:
   Name
0  John
1  gal britt
2  mona
3  diana
4  molly
5  merry
6  mony
7  molla
8  johnathon
9  dina
So I imagine that, to compare every name against every other name and detect similarity, I would use df.merge(df, how="cross").
The problem is that the real data has 40,000 rows, so a cross join would produce a very large dataset (40,000 x 40,000 = 1.6 billion rows) that I don't have the memory for.
Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
I tried working with vaex instead of pandas to handle this amount of data, but I still run into insufficient memory.
In short: I know that this way of approaching the problem is wrong and inefficient.
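One way to avoid materialising the cross join is to stream over the pairs and keep only those above a similarity threshold. The sketch below is one possible approach, not a definitive answer: it uses difflib.SequenceMatcher from the standard library as the similarity measure, and the function name similar_pairs and the 0.7 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.7):
    """Yield (name_a, name_b, score) for pairs scoring at or above threshold."""
    # Normalise once so that case and stray spaces do not affect the score.
    cleaned = [name.strip().lower() for name in names]
    # combinations() produces each unordered pair once, lazily, so memory
    # use is driven by the number of matches, not by the number of pairs.
    for (i, a), (j, b) in combinations(enumerate(cleaned), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            yield names[i], names[j], score


names = ['John', 'gal britt', 'mona', 'diana', 'molly',
         'merry', 'mony', 'molla', 'johnathon', 'dina']
matches = list(similar_pairs(names, threshold=0.7))
for a, b, score in matches:
    print(a, b, round(score, 2))
```

This keeps memory small but still performs about 800 million comparisons for 40,000 names, so it may be slow. Comparing only names that share some cheap key (for example the same first letter) would cut the work further, at the cost of missing pairs that differ in that key.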

Sequence function on existing data

picture of example dataset
I am looking for a function in R to change values recorded at every cm (0, 1, 2, 3, 4, 5, 6, etc.) to every 5 cm. I have a big dataset similar to columns A and B and would like to change it to columns H and I (as shown in the attached picture).
I'd like to go from every cm to every 5 cm, so that a species covering 5 cm or less is registered as a single point (like equ_palu). Moreover, the species bet_nana covers 0-10 cm and is therefore registered as 5 and 10 in columns H and I.

Multiple Object Tracking (MOT) benchmark data-set format for ground truth tracking

I am trying to evaluate the performance of my object detection and tracking pipeline on the standard dataset used in the industry, the 2DMOT Challenge 2015. I have downloaded the dataset, but I am unable to understand the data fields in the labelled ground-truth data.
I understand the first six columns, but not the remaining four. The following is sample data from the directory <\2DMOT2015\train\ETH-Bahnhof\gt>:
frame no.  object_id  bb_left  bb_top  bb_width  bb_height  (?)  (?)      (?)     (?)
1          1          212      204     20        57         0    -3.1784  16.34   0.45739
1          2          223      181     36        104        1    -1.407   9.0212  0.68774
Does anyone know what these four fields mean?
The last three fields represent the 3D real-world coordinates of the objects. A similar data structure can be found in the ETH-Bahnhof, ETH-Sunnyday, PETS09-S2L1 and TUD-Stadtmitte sequences of 2DMOT2015. The seventh field is a score. For ground truth, score = 1, but where it varies between 0 and 1 it acts as a flag, and a zero means that the line is not to be considered for evaluation. So the data fields are in the format:
frame no., object_id, bb_left, bb_top, bb_width, bb_height, score, X, Y, Z
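The field layout above can be turned into a small parser. This is a minimal sketch: the names GtEntry and parse_gt_line are hypothetical, and it assumes each line holds exactly the ten fields listed above, separated by commas or whitespace.

```python
import re
from typing import NamedTuple


class GtEntry(NamedTuple):
    """One ground-truth line, with fields in the order given above."""
    frame: int
    object_id: int
    bb_left: float
    bb_top: float
    bb_width: float
    bb_height: float
    score: float
    x: float
    y: float
    z: float


def parse_gt_line(line):
    """Split one line on commas and/or whitespace and convert the fields."""
    parts = re.split(r"[,\s]+", line.strip())
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    frame, object_id = int(parts[0]), int(parts[1])
    rest = [float(p) for p in parts[2:]]
    return GtEntry(frame, object_id, *rest)


# First sample row from the question.
entry = parse_gt_line("1 1 212 204 20 57 0 -3.1784 16.34 0.45739")
# A zero score flags a line that should be skipped during evaluation.
use_for_evaluation = entry.score != 0
print(entry, use_for_evaluation)
```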

SPSS Compute Variable

Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from columns 'Day1' and 'Day2' and use them to select row numbers in the Score column. For example, for Test A I would like the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window (menu Transform > Compute Variable)?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First take the scores dataset and transpose it so that it has one row and 5 columns (Data>Transpose)
Then match that dataset to each case in the main dataset (Data>Merge Files>Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR)
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck
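To pin down the numbers the question is after, here is a short sketch of the intended computation in Python (not SPSS syntax). It assumes the reading given in the question: for each test, sum Score over row numbers Day1 through Day2 inclusive, counting rows from 1. The name range_sum is hypothetical.

```python
from itertools import accumulate

# (Test, Day1, Day2) and the Score column from the question.
tests = [("A", 1, 2), ("B", 1, 3), ("C", 3, 4), ("D", 2, 4), ("E", 4, 5)]
scores = [100, 62, 90, 20, 80]

# prefix[k] holds the sum of the first k scores, so any range sum
# is a difference of two prefix values.
prefix = [0] + list(accumulate(scores))


def range_sum(day1, day2):
    """Sum of scores in rows day1..day2 (1-based, inclusive)."""
    return prefix[day2] - prefix[day1 - 1]


totals = {name: range_sum(day1, day2) for name, day1, day2 in tests}
print(totals)  # {'A': 162, 'B': 252, 'C': 110, 'D': 172, 'E': 100}
```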

Identifying graphs in a heap of connected nodes -- what is this called?

I have a SQL table with three columns X, Y, Z. I need to split it in groups in such a way that all records with same value of X or Y or Z are assigned to the same group. I need to make sure that the records with same value X or Y or Z are never split across multiple groups.
If you think of records as nodes and values of X, Y, Z as edges, this problem is the same as finding all graphs where the nodes in each graph will be connected directly or indirectly via X, Y, or Z-edge, but each graph will have no edges in common with other graphs (otherwise it would be part of the same graph).
A few years ago I knew what this was called and even remembered the algorithm, but now it escapes me. Please tell me what this problem is called so I can Google for a solution. If you know a good algorithm -- please point me to it. If you have a SQL implementation -- I will marry you :)
Example:
X Y Z BUCKET
--------- ---------------- --------- -----------
1 34 56 1
54 43 45 2
1 12 22 1
2 34 11 1
The last row is in bucket 1 because of the value of Y=34 which is the same as of the first row, which is in bucket 1.
This looks less like a graph and more like a simplicial complex.
But if we treat this complex as its skeletal graph (the numbers are treated as vertices, and a row in the table means that all three of its vertices are connected by edges), then we can use any algorithm for finding the connected components of this graph. I'm not sure whether there is a feasible way to do this in SQL, though; perhaps it would be more prudent to use a graph database.
However, for this specific problem there may be an easy solution in SQL that I didn't look for.
To find how many nodes are in each group of x:
select x, count(x)
from mytable
group by x
Or to find the list of distinct values of x:
select distinct x from mytable;
Why don't you first GROUP BY one of the columns (say X) to make buckets, then do the same for Y and Z, each time merging the buckets from the previous step whenever a new group connects them?
Repeat the process for X, Y, and Z until the buckets stop changing.
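The repeated grouping described above can be sketched as follows. This is one possible reading of the idea, written in Python rather than SQL: every row starts in its own bucket, each pass over a column gives all rows sharing a value the smallest bucket label among them, and the passes repeat until no label changes. The name propagate_buckets is hypothetical.

```python
def propagate_buckets(rows):
    """Repeat per-column grouping until bucket labels stop changing."""
    if not rows:
        return []
    labels = list(range(len(rows)))  # every row starts in its own bucket
    changed = True
    while changed:
        changed = False
        for col in range(len(rows[0])):
            # Smallest current label among the rows sharing each value.
            smallest = {}
            for i, row in enumerate(rows):
                value = row[col]
                smallest[value] = min(smallest.get(value, labels[i]), labels[i])
            # Move every row to the smallest label of its group.
            for i, row in enumerate(rows):
                if labels[i] != smallest[row[col]]:
                    labels[i] = smallest[row[col]]
                    changed = True
    return labels


# Rows (X, Y, Z) from the example in the question.
rows = [(1, 34, 56), (54, 43, 45), (1, 12, 22), (2, 34, 11)]
labels = propagate_buckets(rows)
print(labels)  # [0, 1, 0, 0]
```

Each label here is the index of the first row in its bucket, so rows 0, 2 and 3 share a bucket and row 1 is alone, matching the BUCKET column in the question.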
Are you working for LinkedIn or Facebook? :)