PySpark - numpy | pandas | matplotlib - pandas

I have already done some research to answer my question, but I didn't find anything.
How can I use these libraries in PySpark:
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
Does anyone have an article that illustrates how to use them?
I'm trying to build a dataframe to make some graphs like this (I saw this code here):
# Assumes `sc` is an existing SparkContext; collect() brings the lines to the driver
data1 = sc.textFile("TEXT").flatMap(lambda line: line.split("\n")).distinct().collect()
freqMap = {}
for line in data1:
    for item in line:
        if item not in freqMap:
            freqMap[item] = {}
        for other_item in line:
            if other_item not in freqMap:
                freqMap[other_item] = {}
            freqMap[item][other_item] = freqMap[item].get(other_item, 0) + 1
            freqMap[other_item][item] = freqMap[other_item].get(item, 0) + 1
# Build the co-occurrence table from the nested dict
df = pd.DataFrame(freqMap).fillna(0)
print(df)
plt.pcolormesh(df, edgecolors='black')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.savefig('plot.png')
Thanks!
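A minimal sketch of the pandas and matplotlib part, runnable without a cluster. It assumes the distinct lines have already been collected to the driver (for example with `collect()` on the RDD), that items on a line are separated by whitespace, and that an item is not paired with itself; the `lines` list and the output path are hypothetical stand-ins.

```python
import os
import tempfile

import matplotlib
matplotlib.use("agg")  # non-interactive backend, as in the question
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lines collected from the RDD.
lines = ["a b c", "a b", "b c"]

# Count how often each pair of distinct items appears on the same line.
freq_map = {}
for line in lines:
    items = line.split()
    for item in items:
        for other in items:
            if item != other:
                row = freq_map.setdefault(item, {})
                row[other] = row.get(other, 0) + 1

# Nested dict -> DataFrame (outer keys become columns); missing pairs become 0.
df = pd.DataFrame(freq_map).fillna(0)
df = df.sort_index().sort_index(axis=1)

# Draw the table as a grid and label the cell centres with the item names.
plt.pcolormesh(df.values, edgecolors="black")
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
out_path = os.path.join(tempfile.gettempdir(), "plot.png")
plt.savefig(out_path)
```

The Spark job only has to produce the small list of lines; numpy, pandas and matplotlib then run as ordinary Python on the driver.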

Related

Creating 3D scatter chart in Taipy

I was wondering how one would create a 3D scatter chart in Taipy.
I tried this code initially:
import pandas as pd
import numpy as np
from taipy import Gui
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1']=np.random.randint(0,3,100)
my_page ="""
Creation of a 3-D chart:
<|{df}|chart|type=Scatter3D|x=x|y=y|z=z|mode=markers|color=cluster|>
"""
Gui(page=my_page).run()
This does indeed display a 3D plot, but the colors (clusters) do not show up.
Any hint?
Yes, you need to massage your dataframes a bit to do it.
Here's some sample code that achieves this:
import pandas as pd
import numpy as np
from taipy import Gui
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1']=np.random.randint(0,3,100)
# Create a list of 3 dataframes, one per cluster
datas = [df[df['cluster1']==i] for i in range(3)]
properties = {}
# create dynamically the property list.
# str(i) points to a dataframe index
# "/x" points to the column value in the selected dataframe
for i in range(len(datas)):
    properties[f"x[{i+1}]"] = str(i)+"/x"
    properties[f"y[{i+1}]"] = str(i)+"/y"
    properties[f"z[{i+1}]"] = str(i)+"/z"
    properties[f'name[{i+1}]'] = str(i+1)
print(properties)
chart = "<|{datas}|chart|type=Scatter3D|properties={properties}|mode=markers|height=800px|>"
Gui(page=chart).run()
In fact, with the new release, Taipy 1.1, this is easy to do in a few lines of code:
import pandas as pd
import numpy as np
from taipy import Gui
color_map={0:"blue",1:'green', 2:"red"}
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1'] = np.random.randint(0,3,100)
df['cluster_colors'] = df.apply(lambda row: color_map[row.cluster1], axis=1)
marker = {"color":"cluster_colors"}
chart = "<|{df}|chart|type=Scatter3D|x=x|y=y|z=z|marker={marker}|mode=markers|height=800px|>"
Gui(page=chart).run()
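As a side note on the pandas step only: the color column above can probably be built without a row-wise `apply`. The sketch below uses `Series.map` with the same `color_map` and column names; it leaves the Taipy chart definition untouched and does not start the GUI.

```python
import numpy as np
import pandas as pd

color_map = {0: "blue", 1: "green", 2: "red"}
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=list("xyz"))
df["cluster1"] = np.random.randint(0, 3, 100)

# Series.map looks up each cluster id in the dict; no row-wise lambda is needed.
df["cluster_colors"] = df["cluster1"].map(color_map)
```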
If you want to leave it to Taipy to pick the colors for you, then you can simply use:
import pandas as pd
import numpy as np
from taipy import Gui
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('xyz'))
df['cluster1'] = np.random.randint(0,3,100)
marker = {"color":"cluster1"}
chart = "<|{df}|chart|type=Scatter3D|x=x|y=y|z=z|marker={marker}|mode=markers|height=800px|>"
Gui(page=chart).run()

using ipywidgets SelectMultiple on a dataframe

import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display
a = ['Banking', 'Auto', 'Life', 'Electric', 'Technology', 'Airlines',
     'Healthcare']
df = pd.DataFrame(np.random.randn(7, 4), columns = list('ABCD'))
df.index = a
df.head(7)
dropdown = widgets.SelectMultiple(
    options=df.index,
    description='Sector',
    disabled=False,
    layout={'height':'100px', 'width':'40%'})
display(dropdown)
I want to create a function that filters the df by Sector. For example, if I select Airlines, Banking and Electric from display(dropdown), it should return a dataframe containing only the selected sectors.
Try something like this. I have used a global variable to demonstrate in this case, but I would normally wrap the functionality in a class so you always have access to the filtered dataframe.
Rather than using interact, I have used .observe on the selection widget.
import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display, clear_output
a = ['Banking', 'Auto', 'Life', 'Electric', 'Technology', 'Airlines',
     'Healthcare']
df = pd.DataFrame(np.random.randn(7, 4), columns = list('ABCD'), index=a)
filtered_df = None
dropdown = widgets.SelectMultiple(
    options=df.index,
    description='Sector',
    disabled=False,
    layout={'height':'100px', 'width':'40%'})
def filter_dataframe(widget):
    global filtered_df
    selection = list(widget['new'])
    with out:
        clear_output()
        display(df.loc[selection])
    filtered_df = df.loc[selection]
out = widgets.Output()
dropdown.observe(filter_dataframe, names='value')
display(dropdown)
display(out)
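The class-based variant mentioned above could look roughly like this. It is only a sketch: `SectorFilter` and its `on_change` method are hypothetical names, and the widget event is simulated with a plain dict so the filtering logic can run outside a notebook.

```python
import numpy as np
import pandas as pd


class SectorFilter:
    """Keeps the latest filtered view of a DataFrame (hypothetical helper)."""

    def __init__(self, df):
        self.df = df
        self.filtered_df = df.iloc[0:0]  # empty until something is selected

    def on_change(self, change):
        # The observe callback receives a dict-like change;
        # change["new"] holds the newly selected labels.
        self.filtered_df = self.df.loc[list(change["new"])]
        return self.filtered_df


sectors = ["Banking", "Auto", "Life", "Electric", "Technology", "Airlines",
           "Healthcare"]
df = pd.DataFrame(np.random.randn(7, 4), columns=list("ABCD"), index=sectors)
selector = SectorFilter(df)

# Simulate what the widget would send when three sectors are selected.
selector.on_change({"new": ("Airlines", "Banking", "Electric")})
print(selector.filtered_df)
```

In a notebook, `dropdown.observe(selector.on_change, names='value')` would connect it to the SelectMultiple widget, and `selector.filtered_df` would then always hold the current selection.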

pd.describe() does not work

from abupy import ABuSymbolPd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
tsla_df = ABuSymbolPd.make_kl_df('usTSLA', n_folds=8)
tsla_df[['close', 'volume']].plot(subplots=True, style=['r', 'g'],
                                  grid=True)
print(tsla_df[['close', 'volume']])
plt.show()
tsla_df.info()
tsla_df.describe(include = "all")
In the Python code above, I expected the last line to list the statistics of tsla_df, but it prints nothing and gives no error message either. Does anybody have any idea?
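One likely explanation, assuming the code runs as a script rather than in an interactive session: `describe()` returns a new DataFrame and prints nothing by itself, whereas `info()` writes its report directly. The sketch below uses a small made-up frame in place of `tsla_df`, since the real data comes from abupy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tsla_df.
df = pd.DataFrame({"close": np.arange(10.0), "volume": np.arange(10, 20)})

stats = df.describe(include="all")  # returns a DataFrame, prints nothing
print(stats)                        # an explicit print is needed in a script
```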

How can I draw scatter trend line on matplot? Python-Pandas

I want to draw a trend line on a matplotlib scatter plot. How can I do that?
Python
import pandas as pd
import matplotlib.pyplot as plt
csv = pd.read_csv('/tmp/test.csv')
data = csv[['fee', 'time']]
x = data['fee']
y = data['time']
plt.scatter(x, y)
plt.show()
CSV
fee,time
100,650
90,700
80,860
70,800
60,1000
50,1200
time is an integer value.
Scatter chart
Sorry, I found the answer myself in this question:
How to add trendline in python matplotlib dot (scatter) graphs?
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
csv = pd.read_csv('/tmp/test.csv')
data = csv[['fee', 'time']]
x = data['fee']
y = data['time']
plt.scatter(x, y)
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
Chart
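To inspect the fitted line numerically for the sample CSV, a sketch along these lines should work. It reads the data from an in-memory string instead of /tmp/test.csv and computes R^2 by hand, so it needs only numpy and pandas.

```python
from io import StringIO

import numpy as np
import pandas as pd

csv_text = """fee,time
100,650
90,700
80,860
70,800
60,1000
50,1200
"""
data = pd.read_csv(StringIO(csv_text))
x = data["fee"].to_numpy()
y = data["time"].to_numpy()

# Degree-1 fit: polyfit returns [slope, intercept], highest power first.
slope, intercept = np.polyfit(x, y, 1)
trend = np.poly1d([slope, intercept])

# R^2 computed from the residual and total sums of squares.
residual = ((y - trend(x)) ** 2).sum()
total = ((y - y.mean()) ** 2).sum()
r_squared = 1 - residual / total

print(f"time = {slope:.3f} * fee + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

The slope comes out negative for this data, which matches the downward trend in the scatter chart.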
With text:
from sklearn.metrics import r2_score
plt.plot(x,y,"+", ms=10, mec="k")
z = np.polyfit(x, y, 1)
y_hat = np.poly1d(z)(x)
plt.plot(x, y_hat, "r--", lw=1)
# "\\;" keeps the LaTeX thin space without an invalid escape sequence
text = f"$y={z[0]:0.3f}\\;x{z[1]:+0.3f}$\n$R^2 = {r2_score(y,y_hat):0.3f}$"
plt.gca().text(0.05, 0.95, text,transform=plt.gca().transAxes,
               fontsize=14, verticalalignment='top')
You can also use Seaborn's lmplot:
import seaborn as sns
import pandas as pd
from io import StringIO
textfile = StringIO("""fee,time
100,650
90,700
80,860
70,800
60,1000
50,1200""")
df = pd.read_csv(textfile)
_ = sns.lmplot(x='fee', y='time', data=df, ci=None)

How do I enable the REFS_OK flag in nditer in numpy in Python 3.3?

Does anyone know how one goes about enabling the REFS_OK flag in numpy? I cannot seem to find a clear explanation online.
My code is:
import sys
import string
import numpy as np
import pandas as pd
SNP_df = pd.read_csv('SNPs.txt',sep='\t',index_col = None ,header = None,nrows = 101)
output = open('100 SNPs.fa','a')
for i in SNP_df:
    data = SNP_df[i]
    data = np.array(data)
    for j in np.nditer(data):
        if j == 0:
            output.write(("\n>%s\n")%(str(data(j))))
        else:
            output.write(data(j))
I keep getting the error message: Iterator operand or requested dtype holds references, but the REFS_OK was not enabled.
I cannot work out how to enable the REFS_OK flag so the program can continue...
I have isolated the problem. There is no need to use np.nditer. The main problem was that I misunderstood how Python handles the loop variable in a for loop. The corrected code is below.
import sys
import string
import fileinput
import numpy as np
import pandas as pd  # needed for pd.read_csv below
SNP_df = pd.read_csv('datafile.txt',sep='\t',index_col = None ,header = None,nrows = 5000)
output = open('outputFile.fa','a')
for i in range(1,51):
    data = SNP_df[i]
    data = np.array(data)
    # the first element of the column is the record header
    for j in range(0,1):
        output.write(("\n>%s\n")%(str(data[j])))
    # the remaining elements form the sequence
    for k in range(1,len(data)):
        output.write(str(data[k]))
output.close()
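A small self-contained sketch of why plain indexing is enough here. It assumes the column holds Python strings, so the array has dtype object, which is the case that triggers the error from the question; the sample values are made up.

```python
import numpy as np

# Hypothetical stand-in for one column of the tab-separated file.
data = np.array(["rs123", "A", "G", "T"], dtype=object)

# np.nditer rejects object arrays unless a flag is passed explicitly.
try:
    np.nditer(data)
    nditer_failed = False
except (TypeError, ValueError):
    nditer_failed = True

# Ordinary indexing and iteration need no flag at all.
header = data[0]
sequence = "".join(str(value) for value in data[1:])
record = "\n>%s\n%s" % (header, sequence)
print(record)
```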
If you really want to enable the flag, I have a working example.
Python 2.7, numpy 1.14.2, pandas 0.22.0
import pandas as pd
import numpy as np
# get all data as a pandas DataFrame
data = pd.read_csv("./monthdata.csv")
print(data)
# get values as numpy array
data_ar = data.values # numpy.ndarray, every element is a row
for row in data_ar:
    print(row)
    sum = 0
    count = 0
    # the flag name is lowercase: "refs_ok"
    for month in np.nditer(row, flags=["refs_ok"], op_flags=["readwrite"]):
        print(month)