Pandas API on Spark is not working - pandas

I use the Anaconda distribution and am trying to work with the Pandas API on Spark, but it is not working.
I installed the following with pip:
pip install pyspark
pip install pyspark[sql]
pip install pyspark[pandas_on_spark] plotly
I import:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession
When I try to run:
psdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])
it just hangs and never finishes.
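The question does not say whether a Java runtime is installed. PySpark starts a JVM in the background, so one thing worth ruling out is a missing or misconfigured Java installation. This is an assumption about the setup, not a confirmed cause of the hang. A minimal sketch that only reports what the Python process can see (the helper name describe_java is made up for this example):

```python
import os
import shutil


def describe_java():
    """Report where `java` is found on PATH and what JAVA_HOME is set to (hypothetical helper)."""
    return {
        "java_on_path": shutil.which("java"),      # None if no `java` executable is on PATH
        "JAVA_HOME": os.environ.get("JAVA_HOME"),  # None if the variable is not set
    }


info = describe_java()
print(info)
if info["java_on_path"] is None and info["JAVA_HOME"] is None:
    print("No Java runtime found; PySpark needs one to start its JVM.")
```

If both values are empty, installing a Java runtime and restarting the kernel would be the first thing to try.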

Related

VS Code: interactive matplotlib plot with %matplotlib widget not working in Jupyter

The following example doesn't work in VS Code. It does work with %matplotlib notebook in a Jupyter notebook in a web browser, though.
# creating 3d plot using matplotlib
# in python
# for creating a responsive plot
# use %matplotlib widget in VSCODE
#%matplotlib widget
%matplotlib notebook
# importing required libraries
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# creating random dataset
xs = [14, 24, 43, 47, 54, 66, 74, 89, 12,
44, 1, 2, 3, 4, 5, 9, 8, 7, 6, 5]
ys = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 6, 3,
5, 2, 4, 1, 8, 7, 0, 5]
zs = [9, 6, 3, 5, 2, 4, 1, 8, 7, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 0]
# creating figure
fig = plt.figure()
ax = Axes3D(fig)
# creating the plot
plot_geeks = ax.scatter(xs, ys, zs, color='green')
# setting title and labels
ax.set_title("3D plot")
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')
# displaying the plot
plt.show()
The result should be a plot that can be manipulated interactively, e.g. rotated with the mouse.
In VS Code, clicking the </> icon presents a choice of renderers. The JupyterIPWidget Renderer is selected. The other renderers show the plot but don't allow interactive manipulation.
A warning also appears:
/var/folders/kc/5p61t70n0llbn05934gj4r_w0000gn/T/ipykernel_22590/1606073246.py:23:
MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is
deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False
and use fig.add_axes(ax) to suppress this warning. The default value of
auto_add_to_figure will change to False in mpl3.5 and True values will
no longer work in 3.6. This is consistent with other Axes classes.
ax = Axes3D(fig)
This behavior is expected: %matplotlib notebook is not supported in VS Code. Use %matplotlib widget instead. See https://github.com/microsoft/vscode-jupyter/wiki/Using-%25matplotlib-widget-instead-of-%25matplotlib-notebook,tk,etc
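As a rough sketch of what the cell could look like after that change: the magic is shown as a comment so the snippet also runs as a plain script, and it needs the ipympl package when used in a notebook. The axes are created with fig.add_subplot(projection="3d") instead of Axes3D(fig), which should also avoid the deprecation warning quoted in the question. The data is a shortened version of the lists above.

```python
# In a VS Code notebook cell, run this magic first (requires the ipympl package):
# %matplotlib widget
import matplotlib.pyplot as plt

# Shortened sample data, taken from the start of the question's lists
xs = [14, 24, 43, 47, 54]
ys = [0, 1, 2, 3, 4]
zs = [9, 6, 3, 5, 2]

fig = plt.figure()
# Let the figure create the 3D axes instead of calling Axes3D(fig) directly
ax = fig.add_subplot(projection="3d")
ax.scatter(xs, ys, zs, color="green")

ax.set_title("3D plot")
ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis")
ax.set_zlabel("z-axis")

plt.show()
```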

Plot rows of df on Plotly

The df is like this:
              X                   Y  Label
0  [16, 37, 38]  [7968, 4650, 3615]    0.7
1  [29, 37, 12]  [4321, 4650, 1223]    0.8
2  [12, 2, 445]  [1264, 3456, 2112]    0.9
I want to plot three lines on the same plot, one per row, with Label treated as a continuous variable. What is the fastest and simplest way to do this with Plotly?
I am taking "this should plot three lines on the same plot" as the requirement (which is inconsistent with "I want subplots from each row of the df").
This is a simple case of creating a trace for each row, using explode (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) to prepare x and y:
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame(
{
"X": [[16, 37, 38], [29, 37, 12], [12, 2, 445]],
"Y": [[7968, 4650, 3615], [4321, 4650, 1223], [1264, 3456, 2112]],
"Label": [0.7, 0.8, 0.9],
}
)
go.Figure(
[
go.Scatter(
x=r["X"].explode(), y=r["Y"].explode(), name=str(r["Label"].values[0])
)
for _, r in df.groupby(df.index)
]
)
With a continuous color defined by the label:
import pandas as pd
import plotly.graph_objects as go
from plotly.colors import sample_colorscale
import plotly.express as px
df = pd.DataFrame(
{
"X": [[16, 37, 38], [29, 37, 12], [12, 2, 445]],
"Y": [[7968, 4650, 3615], [4321, 4650, 1223], [1264, 3456, 2112]],
"Label": [0.1, 0.5, 0.9],
}
)
fig = px.scatter(x=[0], y=[0], color=[.5], color_continuous_scale="YlGnBu")
fig = fig.add_traces(
[
go.Scatter(
x=r["X"].explode(),
y=r["Y"].explode(),
name=str(r["Label"].values[0]),
line_color=sample_colorscale("YlGnBu", r["Label"].values[0])[0],
showlegend=False
)
for _, r in df.groupby(df.index)
]
)
fig

How to split object type data into multiple columns using pandas

array([{'name': 'pt.96qt6m0l6udrwon11jybxhvz0y7p1ixjisvowh52mtr53how1dihuaub.9q248rrj4hfm4wtbu4pavc7shb31kx7vdj75uzyy5v1hpychf3zdvamb.thdmrtukbty2v5apzwmyrghi7n7gd.s31.63z.de.', 'ttl': 60, 'type': 10, 'clas': 1, 'data': 'E9000328A013AAC5747648480103C414DCCB01070E1105A87E87061E0AE72A284E67E5D966C4751EC41ACD9AF05F911ACC0A17558DF9E1285793280A69077DFDC1860075C2811DA155CDA5E40EF89FDDE1EF67035A6E17C7C3406D89E6E9D1B857532A676193F1EE39EC2A2F334D9FDB55D097F6C038546F1E5D8C31A13A30518C94ED84AEA56E41D1F44837E0D220573184CC03E3765250129C2BE7258F4046CFDA00A74D71D9F2D0EFBE81AC2966DC07B97B3DE402B057E42DC2DE69A5463B81274E8ED0EEB18ABA6EE22713B4B55232E0A37975C4E9946F7B53216662BD6F445DE2914F09BBEA7ED6EC75C3E0142C440B719D7C5589D743A10044941CB215D7995921523DE21D63F7377E97E3A8AB23D79F55603CC2737D8015'}],
dtype=object)
Pass the dict stored in your array to the DataFrame constructor:
In [199]: import numpy as np
In [198]: import pandas as pd
In [200]: s = np.array([{'name': 'pt.96qt6m0l6udrwon11jybxhvz0y7p1ixjisvowh52mtr53how1dihuaub.9q248rrj4hfm4wtbu4pavc7shb31kx7vdj75uzyy5v1hpychf3zdvamb.thdmrtukbty2v5apzwmyrghi7n7gd.s31.63z.de.', 'ttl': 60, 'type': 10, 'clas': 1, 'data': 'E9000328A013AAC5
...: 747648480103C414DCCB01070E1105A87E87061E0AE72A284E67E5D966C4751EC41ACD9AF05F911ACC0A17558DF9E1285793280A69077DFDC1860075C2811DA155CDA5E40EF89FDDE1EF67035A6E17C7C3406D89E6E9D1B857532A676193F1EE39EC2A2F334D9FDB55D097F6C038546F1E5D8C31A13A30518C94
...: ED84AEA56E41D1F44837E0D220573184CC03E3765250129C2BE7258F4046CFDA00A74D71D9F2D0EFBE81AC2966DC07B97B3DE402B057E42DC2DE69A5463B81274E8ED0EEB18ABA6EE22713B4B55232E0A37975C4E9946F7B53216662BD6F445DE2914F09BBEA7ED6EC75C3E0142C440B719D7C5589D743A10044
...: 941CB215D7995921523DE21D63F7377E97E3A8AB23D79F55603CC2737D8015'}],
...: dtype=object)
In [200]: df = pd.DataFrame(s[0], index=[0])
In [201]: df
Out[201]:
name ttl type clas data
0 pt.96qt6m0l6udrwon11jybxhvz0y7p1ixjisvowh52mtr... 60 10 1 E9000328A013AAC5747648480103C414DCCB01070E1105...
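If the array holds more than one dict, indexing with s[0] only picks up the first record. A minimal sketch for that case, assuming every element is a dict with the same keys: convert the object array to a list and let the constructor build one row per dict. The record values below are made-up placeholders, not the data from the question.

```python
import numpy as np
import pandas as pd

# Placeholder records with the same keys as in the question (values are invented)
records = np.array(
    [
        {"name": "a.example.", "ttl": 60, "type": 10, "clas": 1, "data": "E900"},
        {"name": "b.example.", "ttl": 30, "type": 1, "clas": 1, "data": "0A17"},
    ],
    dtype=object,
)

# tolist() turns the object array into a plain list of dicts;
# each dict becomes a row and each key becomes a column
df = pd.DataFrame(records.tolist())
print(df)
```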

How can pd.cut return a number as group?

Example
pd.cut(df['a'],[0,2,4,10,np.inf],right=False)
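It returns the bins [0, 2), [2, 4), [4, 10), [10, inf).
But how can I get [0], (0, 2), [2, 4), [4, 10), [10, inf) instead?
If all values are non-negative integers, this could work, because [-inf, 1) then contains only 0 and [1, 2) covers the same integers as (0, 2):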
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11, 13]})
pd.cut(df['a'], [-np.inf, 1, 2, 4, 10, np.inf], right=False)
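To make the groups read the way the question asks, the same bins can be combined with explicit labels. This is a sketch that still assumes non-negative integers; with floats, a value such as 0.5 would land in the bin labelled "0". The label strings are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 3, 5, 7, 9, 11, 13]})

bins = [-np.inf, 1, 2, 4, 10, np.inf]
# One label per bin; names chosen to match the intervals asked for in the question
labels = ["0", "(0, 2)", "[2, 4)", "[4, 10)", "[10, inf)"]

# right=False makes every bin closed on the left and open on the right
df["group"] = pd.cut(df["a"], bins=bins, right=False, labels=labels)
print(df)
```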

Trying to create a Seaborn heatmap from a Pandas Dataframe

This is my first time trying this. I actually have a dict of lists that I generate in a program, but for now I am using a dummy dict just for testing.
I am following this:
python Making heatmap from DataFrame
but it fails with the following:
Traceback (most recent call last):
File "C:/Users/Mark/PycharmProjects/main/main.py", line 20, in <module>
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True)
File "C:\Users\Mark\AppData\Roaming\Python\Python36\site-packages\seaborn\matrix.py", line 517, in heatmap
yticklabels, mask)
File "C:\Users\Mark\AppData\Roaming\Python\Python36\site-packages\seaborn\matrix.py", line 168, in __init__
cmap, center, robust)
File "C:\Users\Mark\AppData\Roaming\Python\Python36\site-packages\seaborn\matrix.py", line 205, in _determine_cmap_params
calc_data = plot_data.data[~np.isnan(plot_data.data)]
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
My code:
import pandas as pd
import seaborn as sns
Index = ['key1', 'key2', 'key3', 'key4', 'key5']
Cols = ['A', 'B', 'C', 'D']
testdict = {
"key1": [1, 2, 3, 4],
"key2": [5, 6, 7, 8],
"key3": [9, 10, 11, 12],
"key4": [13, 14, 15, 16],
"key5": [17, 18, 19, 20]
}
df = pd.DataFrame(testdict, index=Index, columns=Cols)
df = df.transpose()
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True)
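You need to switch your column and index labels. When the data is a dict, columns selects dict keys, so columns=['A', 'B', 'C', 'D'] matches none of them and the frame ends up filled with NaN, which np.isnan then cannot process: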
Cols = ['key1', 'key2', 'key3', 'key4', 'key5']
Index = ['A', 'B', 'C', 'D']
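A quick way to see the difference is to build the frame both ways and inspect it before plotting. This is a pandas-only sketch (the names df_bad and df_good are just for illustration); the heatmap call itself stays as in the question.

```python
import pandas as pd

testdict = {
    "key1": [1, 2, 3, 4],
    "key2": [5, 6, 7, 8],
    "key3": [9, 10, 11, 12],
    "key4": [13, 14, 15, 16],
    "key5": [17, 18, 19, 20],
}

# Original labels: none of the requested columns is a key of the dict,
# so every cell is NaN (typically with object dtype)
df_bad = pd.DataFrame(
    testdict,
    index=["key1", "key2", "key3", "key4", "key5"],
    columns=["A", "B", "C", "D"],
)

# Switched labels: columns match the dict keys, index matches the list length
df_good = pd.DataFrame(
    testdict,
    index=["A", "B", "C", "D"],
    columns=["key1", "key2", "key3", "key4", "key5"],
).transpose()

print(df_bad)
print(df_good)
```

df_good is numeric with keys as rows and A to D as columns, so it can be passed to sns.heatmap as in the original code.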