Integrating SQL and pandas with Bokeh - pandas

Here we load the packages, write an SQL query whose result is read into pandas, and finally use Bokeh to show the plot, but Bokeh is not displaying anything.
You can consider the following as the dataset df_new2:
name  success_rate  failure_rate
A     94.7          5.3
B     94.3          5.7
C     91            9
D     88            12
E     84            16
F     81            19
G     78            22
H     74.6          25.4
The code starts here:
import pandas.io.sql
import pandas as pd
import pyodbc
from bokeh import mpl
from bokeh.plotting import output_file, show

server = 'root'  # database user (unused: the connection string below hardcodes it)
db = 'y'         # database name (likewise unused)

# Create the connection
conn = pyodbc.connect("DRIVER={MySQL ODBC 3.51 Driver};SERVER=localhost;PORT=3306;DATABASE=y;UID=root;PWD=123456789;")
cursor = conn.cursor()

# Query: count the total rows and the successes per merchant,
# joining to table b to get each merchant's name
sql = """
SELECT COUNT(*) AS TOTAL,
       COUNT(CASE WHEN status=0 THEN 1 END) AS success,
       b.name
FROM a
JOIN b
  ON b.id = a.merchant
GROUP BY merchant
LIMIT 10
"""

df = pandas.io.sql.read_sql(sql, conn)  # query result as a dataframe
df.head()
df_new = df.set_index('name')  # index by merchant name
df_new['success_rate'] = df_new['success'] * 100 / df_new['TOTAL']
df_new['failure_rate'] = 100 - df_new['success_rate']  # assigning failure rate
df_new2 = pd.DataFrame(df_new, columns=['success_rate', 'failure_rate'])
p = df_new2.plot(kind='barh', stacked=True)
output_file("pandas_series.html", title="pandas_series.py example")  # name of the output file
show(mpl.to_bokeh)  # showing the bokeh output
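As an aside, note that the last line passes the to_bokeh function object itself to show() rather than calling it, which alone would explain why nothing renders. Against the old bokeh.mpl compat layer (long since removed from Bokeh), the call would presumably have to convert the matplotlib figure first; a hedged sketch:

# Hedged sketch against the old bokeh.mpl compat API (removed in later
# Bokeh releases): convert the matplotlib figure, then show the result.
show(mpl.to_bokeh(p.get_figure()))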

I've got something a bit more useful for you now. I had to avoid mpl as I couldn't get it to work; one possible reason is that I don't think horizontal bar charts are available through it in Bokeh.
import pandas as pd
from bokeh.charts import Bar
from bokeh.plotting import output_file, show
from bokeh.charts.operations import blend
from bokeh.charts.attributes import cat, color

df_new2 = pd.DataFrame({'Success Rate': [94.7, 94.3, 91, 88, 84, 81, 78, 74.6],
                        'Failure Rate': [5.3, 5.7, 9, 12, 16, 19, 22, 25.4]})
df_new2['inds'] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

p = Bar(df_new2,
        values=blend('Failure Rate', 'Success Rate',
                     name='% Success/Failure', labels_name='stacked'),
        label=cat('inds'),
        stack=cat(columns='stacked', sort=False),
        color=color(columns='stacked', palette=['Red', 'Green'], sort=False),
        legend='top_right',
        title="Success Rate vs. Failure Rate")

output_file("pandas_series.html", title="pandas_series.py example")  # name of the output file
show(p)  # showing the bokeh output
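The bokeh.charts API used in that answer was deprecated and later removed from Bokeh, and current Bokeh does support horizontal bars directly. A minimal sketch of the same stacked chart with the modern bokeh.plotting API, assuming a reasonably recent Bokeh (the data and the green/red coloring mirror the answer above; everything else is my own choice):

# Sketch with the modern bokeh.plotting API: stacked horizontal bars.
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, show

df_new2 = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'success_rate': [94.7, 94.3, 91, 88, 84, 81, 78, 74.6],
    'failure_rate': [5.3, 5.7, 9, 12, 16, 19, 22, 25.4],
})
source = ColumnDataSource(df_new2)

p = figure(y_range=list(df_new2['name']), height=300,
           title="Success Rate vs. Failure Rate")
p.hbar_stack(['success_rate', 'failure_rate'], y='name', height=0.8,
             color=['green', 'red'], source=source,
             legend_label=['Success Rate', 'Failure Rate'])
p.legend.location = 'bottom_right'

output_file("pandas_series.html", title="pandas_series.py example")
show(p)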

Related

How to check the normality of data in a column grouped by an index

I'm working on a dataset that represents the completion time of some activities performed in some processes. There are just 6 types of activities, each described by a numerical value, that repeat throughout the dataset. The example dataset is as follows:
name  duration
1     10
2     12
3     34
4     89
5     44
6     23
1     15
2     12
3     39
4     67
5     47
6     13
I'm trying to check if the duration of the activity is normally distributed with the following code:
import numpy as np
import pylab
import scipy.stats as stats
import seaborn as sns
from scipy.stats import normaltest

measurements = df['duration']
stats.probplot(measurements, dist='norm', plot=pylab)
pylab.show()
ax = sns.distplot(measurements)
stat, p = normaltest(measurements)
print('stat=%.3f, p=%.3f\n' % (stat, p))
if p > 0.05:
    print('probably gaussian')
else:
    print('probably non gaussian')
But I want to do it for each type of activity, which means applying stats.probplot(), sns.distplot() and normaltest() to each group of activities (e.g. checking whether all the activities named 1 have a normally distributed duration).
Any idea how I can tell these functions to return a separate plot for each group of activities?
Assuming you have at least 8 samples per activity (normaltest will throw an error if you don't), you can loop through your data based on the unique activity values. You'll have to call pylab.show() at the end of each graph so that the plots are not drawn on top of each other:
import numpy as np
import pandas as pd
import pylab
import scipy.stats as stats
import seaborn as sns
from scipy.stats import normaltest  # was missing; normaltest is used below
import random    # Only needed by me to create a mock dataframe
import warnings  # "distplot" is deprecated. Look into using "displot"... in the meantime
warnings.filterwarnings('ignore')  # I got sick of seeing the warning so I muted it

name = [1, 2, 3, 4, 5, 6] * 8
duration = [random.choice(range(0, 100)) for _ in range(8 * 6)]
df = pd.DataFrame({"name": name, "duration": duration})

for name in df.name.unique():
    nameDF = df[df.name.eq(name)]
    measurements = nameDF['duration']
    stats.probplot(measurements, dist='norm', plot=pylab)
    pylab.show()
    ax = sns.distplot(measurements)
    ax.set_title(f'Name: {name}')
    pylab.show()
    stat, p = normaltest(measurements)
    print('stat=%.3f, p=%.3f\n' % (stat, p))
    if p > 0.05:
        print('probably gaussian')
    else:
        print('probably non gaussian')
(one probability plot, one distribution plot, and one test printout are produced per activity, repeated for each of the six names)
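As a side note, the same per-activity iteration can be written with pandas' own groupby, which avoids the repeated boolean filtering; a minimal sketch, reusing the df from above:

# Sketch: groupby yields (name, sub-dataframe) pairs, one per activity.
for name, nameDF in df.groupby('name'):
    measurements = nameDF['duration']
    # ...probplot / distplot / normaltest exactly as in the loop above...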

Seaborn hue with loc condition

I'm facing the following problem: I'd like to create an lmplot with seaborn, and I'd like to distinguish the colors not based on an existing column but based on a condition applied to a column.
Given the following df for a rental price prediction:
area  rental price  year build  ...
40    400           1990        ...
60    840           1995        ...
480   16            1997        ...
...   ...           ...         ...
sns.lmplot(x="area", y="rental price", data=df, hue = df.loc[df['year build'] > 1992])
The line above is not working. I know I can add a column representing this condition and address that column in "hue", but is there no way to give seaborn a condition for hue?
Thanks in advance!
You could add a new column with the boolean information and use that for the hue, for example data['at least from eighties'] = data['model_year'] >= 80. This will create a legend with the column name as its title and False and True as the entries. If you map the boolean values to strings, those strings will appear instead. Here is an example using one of seaborn's demo datasets:
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('mpg')
df['decenium'] = (df['model_year'] >= 80).map({False: "seventies", True: "eighties"})
sns.lmplot(x='weight', y='mpg', data=df, hue='decenium')
plt.tight_layout()
plt.show()
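Applied to the question's own columns, the same idea would look something like this (column names taken from the question; the label texts are my own invention):

# Sketch: turn the condition into a labeled column, then hue by it.
df['built'] = (df['year build'] > 1992).map({False: '1992 or earlier',
                                             True: 'after 1992'})
sns.lmplot(x='area', y='rental price', data=df, hue='built')
plt.show()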

How to construct a temporal network using python

I have data for different stations on different days:
Station_start  Station_end  Day  Hour
A              B            1    14
B              C            1    10
C              A            1    10
B              A            2    15
A              C            2    13
D              E            2    12
E              B            2    14
F              C            3    12
I want to construct a dynamic/interactive network, where the network connections change according to day.
I found an example of this in the pathpy tutorial.
But how do I load a pandas dataframe into it, with Station_start and Station_end as the nodes?
Here is a way to do what you want. First, load your data into a pandas dataframe using pd.read_fwf (I saved your data in a file called data_net.txt).
Then incrementally add edges to your temporal network with t.add_edge. Run t in a cell to see the animation.
See the code below for more details:
import pandas as pd
import pathpy as pp

df = pd.read_fwf('data_net.txt')

t = pp.TemporalNetwork()
for i in range(len(df)):
    t.add_edge(df['Station_start'][i], df['Station_end'][i], int(df['Day'][i]))

t  # run t in a cell to start the animation
This code returns an interactive animation of the temporal network. Based on the link you gave, you can also control the speed of the animation by styling the network with pathpy.
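If you want to slow the animation down or tweak its look, pathpy's plot function accepts style parameters; a hedged sketch with parameter names as they appear in the pathpy tutorial (treat them as assumptions and check your pathpy version's docs):

# Sketch: styling the temporal-network animation (parameter names assumed
# from the pathpy tutorial; verify against your installed version).
style = {
    'ts_per_frame': 1,    # time steps advanced per animation frame
    'ms_per_frame': 500,  # milliseconds per frame, i.e. animation speed
    'look_ahead': 2,      # time steps an edge is shown before it occurs
    'look_behind': 2,     # time steps an edge stays visible afterwards
}
pp.visualisation.plot(t, **style)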

Bokeh grouped bar chart, changing data presented

I'm experimenting with the grouped bar chart example for Bokeh using pandas, shown below. I am trying to see if I can get it to display the data differently; for example, I wanted the bar graph to show a count of the rows that fall in each group. I tried that by replacing instances of 'mpg_mean' with 'mpg_count', but just got an invalid-column error. I also experimented with having the graph show a sum, again by using 'mpg_sum', with the same error. I'm assuming the calculation of 'mpg_mean' happens in the groupby, but how do I get it to display the count or the sum? It's not at all clear in this example where any calculations happen.
Thanks in advance for any help!
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap

output_file("bar_pandas_groupby_nested.html")

df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])

index_cmap = factor_cmap('cyl_mfr', palette=Spectral5,
                         factors=sorted(df.cyl.unique()), end=1)

p = figure(width=800, height=300, title="Mean MPG by # cylinders and manufacturer",
           x_range=group, toolbar_location=None,
           tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
       line_color="white", fill_color=index_cmap)

p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None

show(p)
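For what it's worth, Bokeh's documentation says that when a pandas GroupBy is used as a source, the columns are built from group.describe(), which produces suffixed columns such as mpg_count, mpg_mean and mpg_std; that is where the calculation happens, and it is also why mpg_sum (which describe() never computes) raises an invalid-column error. One workaround is to aggregate in pandas yourself and pass a plain dataframe; a hedged sketch reusing the names defined above (the agg/count column layout is my own choice):

# Sketch: precompute aggregates with pandas, then plot from a plain
# ColumnDataSource; the nested x-coordinates are (cyl, mfr) tuples.
from bokeh.models import ColumnDataSource

agg = df.groupby(by=['cyl', 'mfr'])['mpg'].agg(['count', 'sum']).reset_index()
agg['cyl_mfr'] = list(zip(agg['cyl'], agg['mfr']))
src = ColumnDataSource(agg)

p2 = figure(width=800, height=300, title="Row count by # cylinders and manufacturer",
            x_range=sorted(agg['cyl_mfr']), toolbar_location=None,
            tooltips=[("Count", "@count"), ("Cyl, Mfr", "@cyl_mfr")])
p2.vbar(x='cyl_mfr', top='count', width=1, source=src,
        line_color="white", fill_color=index_cmap)
show(p2)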

Multiprocessing the Fuzzy match in pandas

I have two dataframes: Df_Address, which has 347k distinct addresses, and Df_Project, which has 24k records containing
Project_Id, Project_Start_Date and Project_Address.
I want to check whether there is a fuzzy match for my Project_Address values in Df_Address. If there is a match, I want to extract the corresponding Project_Id and Project_Start_Date. Below is the code of what I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
# address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(x, choices=choices, score_cutoff=cutoff)

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(Df_Project["Project_Address"], 80),
)
This code does provide output in the form of a tuple ('matched_string', score), but it is also returning strings that are only similar. I also need to extract the corresponding Project_Id and Project_Start_Date. Can someone help me achieve this using parallel processing, as the data is huge?
You can convert the tuples into a dataframe and then join that back to your base dataframe.
import pandas as pd

Df_Address = pd.DataFrame({'address': ['abc', 'cdf'], 'random_stuff': [100, 200]})
Matched = (('abc', 10), ('cdf', 20))

dist = pd.DataFrame(list(Matched))  # was pd.DataFrame(x); 'x' is undefined
dist.columns = ['address', 'distance']
final = Df_Address.merge(dist, how='left', on='address')
print(final)
Output:
  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
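The parallel-processing part of the question isn't addressed above. A minimal sketch using the standard library's multiprocessing.Pool to spread the fuzzy matching across cores, reusing fuzzy_match from the question (the worker count and cutoff are arbitrary). Note also that when choices is passed as a pandas Series, fuzzywuzzy's extractOne should return the matching index as a third tuple element, which can then be used to look up Project_Id and Project_Start_Date in Df_Project:

# Sketch: parallel fuzzy matching with multiprocessing.Pool. Assumes the
# Df_Address / Df_Project frames and fuzzy_match() defined in the question.
from functools import partial
from multiprocessing import Pool

def run_parallel(addresses, choices, cutoff=80, workers=4):
    match = partial(fuzzy_match, choices=choices, cutoff=cutoff)
    with Pool(workers) as pool:
        return pool.map(match, addresses)

if __name__ == '__main__':  # guard is required for the spawn start method
    results = run_parallel(Df_Address['Address'].tolist(),
                           Df_Project['Project_Address'])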