How to construct a temporal network using python - pandas

I have data for different stations on different days:
Station_start  Station_end  Day  Hour
A              B            1    14
B              C            1    10
C              A            1    10
B              A            2    15
A              C            2    13
D              E            2    12
E              B            2    14
F              C            3    12
I want to construct a dynamic/interactive network in which the connections change from day to day.
I found an example of this in the pathpy tutorial.
But how do I load a pandas DataFrame into it, with Station_start and Station_end as the nodes?

Here is a way to do what you want. First, load your data into a pandas DataFrame using pd.read_fwf (I saved your data in a file called data_net.txt).
Then incrementally add the edges to a pp.TemporalNetwork with its add_edge method, using the Day column as the timestamp. Run t in a cell to see the animation.
See the code below for more details:
import pandas as pd
import pathpy as pp

df = pd.read_fwf('data_net.txt')

t = pp.TemporalNetwork()
for _, row in df.iterrows():
    # each temporal edge is (source, target, timestamp); the Day column is the timestamp
    t.add_edge(row['Station_start'], row['Station_end'], int(row['Day']))

t  # run t in a cell to start the animation
Below is what this code returns (an animation of the network). Based on the link you gave, you can also control the speed of the animation by styling the network with pathpy.
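As a rough sketch of that styling (the parameter names below follow the pathpy 2 temporal-network tutorial and may not match every version, so treat them as assumptions):
# assumed pathpy 2 visualisation parameters; check the docs of your version
style = {
    'ms_per_frame': 500,   # how long each frame is shown (larger = slower animation)
    'ts_per_frame': 1,     # how many timestamps (days) are collapsed into one frame
    'look_behind': 1,      # how many past timestamps stay visible
    'look_ahead': 1,       # how many upcoming timestamps are previewed
    'node_size': 10,
}
pp.visualisation.plot(t, **style)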


How to check the normality of data in a column grouped by an index

I'm working on a dataset that represents the completion time of some activities performed in some processes. There are just 6 types of activities, which repeat throughout the dataset and are each described by a numerical value. An example dataset is as follows:
name duration
1 10
2 12
3 34
4 89
5 44
6 23
1 15
2 12
3 39
4 67
5 47
6 13
I'm trying to check if the duration of the activity is normally distributed with the following code:
import numpy as np
import pylab
import scipy.stats as stats
import seaborn as sns
from scipy.stats import normaltest

measurements = df['duration']

stats.probplot(measurements, dist='norm', plot=pylab)
pylab.show()

ax = sns.distplot(measurements)

stat, p = normaltest(measurements)
print('stat=%.3f, p=%.3f\n' % (stat, p))
if p > 0.05:
    print('probably gaussian')
else:
    print('probably non gaussian')
But I want to do this for each type of activity, i.e. apply stats.probplot(), sns.distplot() and normaltest() to each group of activities (e.g. check whether all the activities named 1 have a normally distributed duration).
Any idea how I can get the functions to return separate plots for each group of activities?
Assuming you have at least 8 samples per activity (normaltest throws an error otherwise), you can loop through your data based on the unique activity values. You'll have to call pylab.show() at the end of each graph so that the plots are not drawn on top of each other:
import numpy as np
import pandas as pd
import pylab
import scipy.stats as stats
from scipy.stats import normaltest
import seaborn as sns
import random    # only needed by me to create a mock dataframe
import warnings  # "distplot" is deprecated; look into using "displot"... in the meantime
warnings.filterwarnings('ignore')  # I got sick of seeing the warning so I muted it

# mock data: 8 samples for each of the 6 activity names
name = [1, 2, 3, 4, 5, 6] * 8
duration = [random.choice(range(0, 100)) for _ in range(8 * 6)]
df = pd.DataFrame({"name": name, "duration": duration})

for name in df.name.unique():
    nameDF = df[df.name.eq(name)]
    measurements = nameDF['duration']

    stats.probplot(measurements, dist='norm', plot=pylab)
    pylab.show()

    ax = sns.distplot(measurements)
    ax.set_title(f'Name: {name}')
    pylab.show()

    stat, p = normaltest(measurements)
    print('stat=%.3f, p=%.3f\n' % (stat, p))
    if p > 0.05:
        print('probably gaussian')
    else:
        print('probably non gaussian')
(the loop prints a probability plot, a distribution plot and a test result for each of the six activities)
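The same loop can also be written with groupby, which hands you each activity's sub-dataframe directly. A minimal sketch of that variant (not part of the original answer), reusing the df built above:
import pylab
import scipy.stats as stats
import seaborn as sns
from scipy.stats import normaltest

# df.groupby('name') yields (activity name, sub-dataframe) pairs
for activity, group in df.groupby('name'):
    measurements = group['duration']

    stats.probplot(measurements, dist='norm', plot=pylab)
    pylab.show()

    ax = sns.distplot(measurements)
    ax.set_title(f'Name: {activity}')
    pylab.show()

    stat, p = normaltest(measurements)
    print(f'name={activity}: stat={stat:.3f}, p={p:.3f}')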

How to calculate pearsonr (and correlation significance) with pandas groupby?

I would like to do a groupby correlation using pandas and pearsonr.
Currently I have:
df = pd.DataFrame(np.random.randint(0,10,size=(1000, 4)), columns=list('ABCD'))
df.groupby(['A','B'])[['C','D']].corr().unstack().iloc[:,1]
However I would like to calculate the correlation significance using pearsonr (scipy package) like this:
from scipy.stats import pearsonr
corr,pval= pearsonr(df['C'],df['D'])
How do I combine the groupby with the pearsonr, something like this:
corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))
If I understand correctly, you need to perform Pearson's test between C and D for every combination of A and B.
To carry out this task, groupby(['A','B']) as you already did. The grouped dataframe is then a "set" of dataframes (one for each A,B combination), so you can apply stats.pearsonr to each of them through the apply method. To get two distinct columns for the test statistic (r, the correlation coefficient) and for the p-value, wrap the output of pearsonr in a pd.Series.
from scipy import stats
df.groupby(['A','B']).apply(lambda d:pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))
The output is:
         corr      pval
A B
0 0 -0.318048  0.404239
  1  0.750380  0.007804
  2 -0.536679  0.109723
  3 -0.160420  0.567917
  4 -0.479591  0.229140
..        ...       ...
9 5  0.218743  0.602752
  6 -0.114155  0.662654
  7  0.053370  0.883586
  8 -0.436360  0.091069
  9 -0.047767  0.882804

[100 rows x 2 columns]
One more piece of advice: adjust the p-values to avoid false positives, since you are repeating the test many times (here corr_df is the groupby result from above, assigned to a variable):
corr_df["qval"] = p_adjust_bh(corr_df.pval)
I used the p_adjust_bh function from here (answer by @Eric Talevich).
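If you would rather not copy p_adjust_bh by hand, statsmodels ships a Benjamini-Hochberg correction. A minimal sketch (assuming the groupby result above is what gets assigned to corr_df):
from statsmodels.stats.multitest import multipletests

corr_df = df.groupby(['A', 'B']).apply(
    lambda d: pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"])
)

# Benjamini-Hochberg FDR adjustment across all group-wise p-values;
# multipletests returns (reject, corrected p-values, ...), element [1] is what we keep
corr_df["qval"] = multipletests(corr_df["pval"], method="fdr_bh")[1]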

How to set the setCheckpoint in pyspark

I don't know much Spark. At the top of the code I have
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('abc').getOrCreate()
H = spark.read.parquet('path to hdfs file')
H has about 30 million records and will be used in a loop, so I wrote
H.persist().count()
I have a list of 50 strings L = [s1, s2, …, s50], each of which is used to build a small dataframe out of H; these small dataframes are supposed to be stacked on top of each other. I created an empty dataframe Z:
from pyspark.sql.types import StructType

schema = StructType([...])   # define the schema here
Z = spark.createDataFrame([], schema)
Then comes the loop:
for st in L:
    K = process H using st   # pseudocode
    Z = Z.union(K)
where K has at most 20 rows. When L has only 2 or 3 elements this code works, but when L has 50 elements it never ends. I learned today that I can use checkpoints, so I created a Hadoop path, and right above where the loop starts I wrote:
SparkContext.setCheckpointDir(dirName='path/to/checkpoint/dir')
But I get the following error: missing 1 required positional argument: 'self'. I need to know how to fix the error and how to modify the loop to incorporate the checkpoint.
Create a SparkContext object first; then you don't need to pass the self argument. Also, drop the dirName= keyword, which is not needed.
Code like the following works:
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf())
sc.setCheckpointDir('path/to/checkpoint/dir')
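To work the checkpoint into the loop itself, one option is to truncate the growing lineage every few iterations with DataFrame.checkpoint (available since Spark 2.1). This is only a sketch: it assumes the loop was meant to union K rather than H, and the chunk size of 10 is arbitrary:
sc.setCheckpointDir('path/to/checkpoint/dir')

Z = spark.createDataFrame([], schema)
for i, st in enumerate(L):
    K = ...  # build the small dataframe from H using st, as in the question
    Z = Z.union(K)
    if (i + 1) % 10 == 0:
        # materialize Z and cut its lineage so the query plan doesn't keep growing
        Z = Z.checkpoint(eager=True)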

Combine CSV files, sort them by time and average the columns

I have many datasets in CSV files; they look like the picture I attached.
The first column is always the time in minutes, but the time steps and the total number of rows differ between the raw data files. I'd like to have one output file (a CSV file) in which all the raw files are combined and sorted by time, so that the time increases from the top to the bottom of the column.
[attached image: raw data and output]
The concentration column should be averaged when more than one value exists for the same time.
I tried this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")

d1.columns  # inspect the column names
d2.columns

merged_outer = pd.merge(d1, d2, on='time', how='outer')
print(merged_outer)
but it doesn't lead to the correct output. I'm a beginner with pandas, but I hope I explained the problem well enough. Thank you for any idea or suggestion!
Thank you for your idea. Unfortunately, when I run it I get an error message saying that dat1.txt doesn't exist. This seems strange to me, as I read the raw files initially with:
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
Sorry, here is the data as raw text:
raw data 1
time  column2  column3  concentration
1     2        4        3
2     2        4        6
4     2        4        2
7     2        4        5
raw data 2
time  column2  column3  concentration
1     2        4        6
2     2        4        2
8     2        4        9
10    2        4        5
12    2        4        7
Something like this might work
filenames = ['dat1.txt', 'dat2.txt',...]
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}
merged_outer = pd.concat(dataframes).groupby('time').mean()
When you pass a dict to pd.concat, it creates a MultiIndexed DataFrame with the dict keys as level 0 of the row index.
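To get from that merged frame to the single sorted CSV file the question asks for, here is a small follow-up sketch (the output filename combined.csv is just an illustration):
import pandas as pd

filenames = ['dat1.txt', 'dat2.txt']  # extend with all your raw files
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}

# stack all files, average every column over rows sharing the same time;
# groupby returns a time-sorted index, so the result is already ordered by time
combined = pd.concat(dataframes).groupby('time').mean()

combined.to_csv('combined.csv')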

Log values by SFrame column

Please, can anybody tell me how I can take the logarithm of every value in an SFrame (graphlab) or DataFrame (pandas) column, without iterating over the whole length of the column?
I'm especially interested in functionality similar to the groupby aggregators, but for the log function. I couldn't find it myself...
Important: I'm not interested in a for-loop over the whole length of the column. I'm only interested in a specific function that transforms all the values of a column into their log values.
I'm also very sorry if this function is in the manual. Please just give me a link...
numpy provides implementations of a wide range of basic mathematical transformations. You can use them on any data structure that builds on numpy's ndarray.
import pandas as pd
import numpy as np
data = pd.Series([np.exp(1), np.exp(2), np.exp(3)])
np.log(data)
Outputs:
0    1.0
1    2.0
2    3.0
dtype: float64
This example is for pandas data types, but it works for all data structures that are based on numpy arrays.
The same "apply" pattern works for SFrames as well. You could do:
import graphlab
import math
sf = graphlab.SFrame({'a': [1, 2, 3]})
sf['b'] = sf['a'].apply(lambda x: math.log(x))
@cel: I think in my case it would also be possible to use the following pattern.
import numpy
import pandas
import graphlab
df
a  b  c
1  1  1
1  2  3
2  1  3
...
df['log c'] = df.groupby('a')['c'].apply(lambda x: numpy.log(x))
For an SFrame (an sf object instead of df) it could look a little different:
logvals = numpy.log(sf['c'])
log_sf = graphlab.SFrame(logvals)
sf = sf.join(log_sf, how = 'outer')
Probably with numpy the code fragment is a little too long, but it works...
The main problem is, of course, time performance. I had hoped to find some specific function to minimize the time...
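For what it's worth, the groupby in the pandas snippet above isn't actually needed: np.log is already vectorized over the whole column, so a single call does the job without any Python-level loop. A small sketch in plain pandas/numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 1], 'c': [1, 3, 3]})

# np.log is applied element-wise to the whole column in one vectorized call;
# no groupby or explicit iteration is involved
df['log c'] = np.log(df['c'])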