Save a pandas dataframe containing numpy arrays - pandas

I have a dataframe with a column full of numpy arrays.
   A         B                    C
0  1.0  0.000000    [[0. 1.],[0. 1.]]
1  2.0  0.000000  [[85. 1.],[52. 0.]]
2  3.0  0.000000    [[5. 1.],[0. 0.]]
3  1.0  3.333333   [[0. 1.],[41. 0.]]
4  2.0  3.333333  [[85. 1.],[0. 21.]]
The problem is that when I save it as a CSV file and load it in another Python script, the numpy column is read back as text.
I tried to transform the column with np.fromstring() or np.loadtxt(), but it doesn't work.
Example of an array after pd.read_csv():
"[[ 85. 1.]\n [ 52. 0. ]]"
Thanks

You can try .to_json():
import numpy as np
import pandas as pd

output = pd.DataFrame([
    {'a': 1, 'b': np.arange(4)},
    {'a': 2, 'b': np.arange(5)}
]).to_json()
But you will get only plain Python lists back when reloading with
df = pd.read_json(output)
Turn them back into numpy arrays with:
df['b'] = [np.array(v) for v in df['b']]

The code below should work. I used another question to solve it; there's a bit more explanation over there: Convert a string with brackets to numpy array
import pandas as pd
import numpy as np
from ast import literal_eval

# Recreating the DataFrame
data = np.array([0, 1, 0, 1, 85, 1, 52, 0, 5, 1, 0, 0, 0, 1, 41, 0, 85, 1, 0, 21], dtype='float')
data = data.reshape((5, 2, 2))
write_df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 1.0, 2.0],
                         'B': [0, 0, 0, 3 + 1/3, 3 + 1/3],
                         'C': data.tolist()})

# Saving the DataFrame to CSV
fpath = 'D:\\Data\\test.csv'
write_df.to_csv(fpath)

# Reading the DataFrame from CSV
read_df = pd.read_csv(fpath)

# literal_eval converts the string to a nested list;
# np.array can convert this nested list directly into an array
def makeArray(rawdata):
    parsed = literal_eval(rawdata)
    return np.array(parsed)

# Applying the function row-wise; there could be a more efficient way
read_df['C'] = read_df['C'].apply(makeArray)
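As a variation not shown in the original answer, pandas can also rebuild the arrays while reading, via read_csv's converters argument. A minimal sketch, assuming the same fpath and imports as above:
# converters maps a column name to a function applied to each raw cell string
read_df = pd.read_csv(fpath, converters={'C': lambda s: np.array(literal_eval(s))})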

Here is an ugly solution.
import pandas as pd
import numpy as np

### Create dataframe
a = [1.0, 2.0, 3.0, 1.0, 2.0]
b = [0.000000, 0.000000, 0.000000, 3.333333, 3.333333]
c = [np.array([[0., 1.], [0., 1.]]),
     np.array([[85., 1.2], [52., 0.]]),
     np.array([[5., 1.], [0., 0.]]),
     np.array([[0., 1.], [41., 0.]]),
     np.array([[85., 1.], [0., 21.]])]
df = pd.DataFrame({"a": a, "b": b, "c": c})

#### Save to csv
df.to_csv("to_trash.csv")
df = pd.read_csv("to_trash.csv")

### Bad string manipulation that could be done better with regex
df["c"] = ("np.array(" + (df
    .c
    .str.split()
    .str.join(' ')
    .str.replace(" ", ",")
    .str.replace(",,", ",")
    .str.replace("[,", "[", regex=False)
) + ")").apply(lambda x: eval(x))

The best solution I found is using pickle files.
You can save your dataframe as a pickle file:
import cv2
import pandas as pd
import pickle

img = cv2.imread('img1.jpg')
data = pd.DataFrame({'img': [img]})  # wrap the array in a list so it becomes a single cell
data.to_pickle('dataset.pkl')
Then you can read it back as a pickle file:
with open(ref_path + 'dataset.pkl', "rb") as openfile:
    df_file = pickle.load(openfile)
Let me know if it worked.
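For the original question's dataframe, a minimal pickle round trip (a sketch) could look like this; pd.read_pickle is a slightly simpler alternative to opening the file handle yourself:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [0.0, 0.0],
                   'C': [np.array([[0., 1.], [0., 1.]]),
                         np.array([[85., 1.], [52., 0.]])]})
df.to_pickle('dataset.pkl')

df2 = pd.read_pickle('dataset.pkl')
print(type(df2['C'][0]))  # <class 'numpy.ndarray'> - the arrays survive intact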

Related

Speed up applying a transformation to each index value of a given array

I need to apply a function to the result of a transformation of all index values of a given numpy array. The following code does this:
import numpy as np
from matplotlib.transforms import IdentityTransform
# some 2D array
a = np.empty((2,3))
# some affine transformation, identity is just an example here
trans = IdentityTransform()
# some function taking a 2D index and returning some value depending
# on that index, again just an example
def f(idx):
    return (idx[0] + idx[1]) / 2

# apply f to the result of transforming each index of a
b = np.empty_like(a)
for idx in np.ndindex(a.shape):
    b[idx] = f(trans.transform(idx))
print(b)
This prints the following correct result:
[[0.  0.5 1. ]
 [0.5 1.  1.5]]
The problem now is that the code is too slow when the shape of a gets larger, say 2000x3000. Is there a way to speed this up?
My idea is to create an array of the indices of a, idx = [[0,0], [0,1], ..., [1,2]], then transform this array in one go using something like tmp = trans.transform(idx), and lastly apply f to every element with np.vectorize(f)(tmp).
Is this a reasonable approach? If yes, what would this actually look like? If no, are there any alternatives?
Edit: I managed to get at tmp via the following code:
tmp = trans.transform(np.asarray([idx for idx in np.ndindex(a.shape)]))
So now I have an array containing the results of the affine transformation for every index value of a. But this seems to use an awful lot of memory.
I'll post an answer myself with what I have figured out so far. Maybe it is of use to someone.
To answer the first part of my question, I found a fast and efficient way to create the result of transforming the index values, using the result of np.indices() and then massaging it until it fits what t.transform() expects.
Given some array a = np.empty((2,3)), the indices of that array can be obtained via np.indices(a.shape). This returns two 2D arrays (one for each dimension of a, actually). What I failed to understand was how to turn these results into something transform() understands.
The key here is to apply np.ravel() to each of the arrays np.indices() returns:
>>> a=np.empty((2,3))
>>> list(map(np.ravel, np.indices(a.shape)))
[array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 1, 2])]
Now I have a list of arrays containing all the x and y indices, which just needs to be put together with np.vstack() and then transposed to get an array of all (x, y) indices, and this is the form transform() will accept.
>>> l=list(map(np.ravel, np.indices(a.shape)))
>>> np.vstack(l).transpose()
array([[0, 0],
       [0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2]])
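As an aside (my addition, not part of the original answer), the same index array can be built a bit more directly, since np.indices() already returns one stacked array:
>>> np.indices(a.shape).reshape(2, -1).T
array([[0, 0],
       [0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2]])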
And finally, for some arbitrary affine transformation:
>>> from matplotlib.transforms import Affine2D
>>> t = Affine2D().translate(10, 20).scale(0.5)
>>> t.transform(np.vstack(l).transpose())
array([[ 5. , 10. ],
       [ 5. , 10.5],
       [ 5. , 11. ],
       [ 5.5, 10. ],
       [ 5.5, 10.5],
       [ 5.5, 11. ]])
This is quite fast, even for larger array sizes. If the shape gets big enough (something like 20000x30000) I run out of memory, but for shapes around 10000x10000 it is still amazingly fast.
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20, 10)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0003051299718208611
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((200, 100)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0026413939776830375
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((2000, 1000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.35055489401565865
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20000, 10000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
43.62860555597581
Now for the second part, applying the function to each of the transformed index values, I use the following code for now, which is fast enough in my case.
xxyy = t.transform(np.vstack(...).transpose())
b = np.fromiter((f(xy) for xy in xxyy), dtype=np.short, count=len(xxyy))  # f takes a single (x, y) pair, as defined above
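For the concrete f used above, even the np.fromiter step can be avoided; a sketch, assuming f can be rewritten in terms of whole-array numpy operations:
# vectorised equivalent of f(idx) = (idx[0] + idx[1]) / 2 over all transformed indices
xxyy = t.transform(np.indices(a.shape).reshape(2, -1).T)
b = ((xxyy[:, 0] + xxyy[:, 1]) / 2).reshape(a.shape)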

Plotting ways (linestrings) over a map in Python

this is my second try for the same question, and I really hope that someone can help me...
Even though some really nice people tried to help me, there is a lot I couldn't figure out, despite their help.
From the beginning:
I created a dataframe. This dataframe is huge and gives information about travellers in a city. The dataframe looks like this (this is only the head).
origin and destination hold the ids of the city locations, move is how many people travelled from origin to destination, longitude and latitude are where the exact point is, and the linestring is the combination of the points.
I created the linestring with this code:
erg2['Linestring'] = erg2.apply(
    lambda x: LineString([(x['latitude_origin'], x['longitude_origin']),
                          (x['latitude_destination'], x['longitude_destination'])]),
    axis=1)
Now my question is how to plot the routes over a map. Even though I tried all the examples from the geopandas documentation etc., I can't work it out.
I can't show you what I have already plotted, because it doesn't make sense, and I guess it would be smarter to start plotting from the beginning.
You see that in the column move there are some 0 values. This means that no one travelled this route, so I don't need to plot it.
I have to plot the lines with the information where the traveller started (origin) and where he went (destination).
I also need to style the different lines depending on the movements.
This is the plotting code I used:
fig = px.line_mapbox(erg2, lat="latitude_origin", lon="longitude_origin", color="move",
                     hover_name=gdf["origin"] + " - " + gdf["destination"],
                     center=dict(lon=13.41053, lat=52.52437), zoom=3, height=600)
fig.update_layout(mapbox_style="stamen-terrain", mapbox_zoom=4, mapbox_center_lat=52.52437,
                  margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()
Maybe someone has an idea?
I also tried it with this code:
import requests, io, json
import geopandas as gpd
import shapely.geometry
import pandas as pd
import numpy as np
import itertools
import plotly.express as px
# get some public addresses - hospitals. data that has GPS lat / lon
dfhos = pd.read_csv(
    io.StringIO(requests.get("http://media.nhschoices.nhs.uk/data/foi/Hospital.csv").text),
    sep="¬", engine="python",
).loc[:, ["OrganisationName", "Latitude", "Longitude"]]
a = np.arange(len(dfhos))
np.random.shuffle(a)

# establish N links between hospitals
N = 10
df = (
    pd.DataFrame({0: a[0:N], 1: a[25:25 + N]})
    .merge(dfhos, left_on=0, right_index=True)
    .merge(dfhos, left_on=1, right_index=True, suffixes=("_origin", "_destination"))
)

# build a geopandas data frame that has a LineString between two hospitals
gdf = gpd.GeoDataFrame(
    data=df,
    geometry=df.apply(
        lambda r: shapely.geometry.LineString(
            [(r["Longitude_origin"], r["Latitude_origin"]),
             (r["Longitude_destination"], r["Latitude_destination"])]), axis=1)
)

# sample code https://plotly.com/python/lines-on-mapbox/#lines-on-mapbox-maps-from-geopandas
lats = []
lons = []
names = []
for feature, name in zip(gdf.geometry, gdf["OrganisationName_origin"] + " - " + gdf["OrganisationName_destination"]):
    if isinstance(feature, shapely.geometry.linestring.LineString):
        linestrings = [feature]
    elif isinstance(feature, shapely.geometry.multilinestring.MultiLineString):
        linestrings = feature.geoms
    else:
        continue
    for linestring in linestrings:
        x, y = linestring.xy
        lats = np.append(lats, y)
        lons = np.append(lons, x)
        names = np.append(names, [name] * len(y))
    lats = np.append(lats, None)
    lons = np.append(lons, None)
    names = np.append(names, None)

fig = px.line_mapbox(lat=lats, lon=lons, hover_name=names)
fig.update_layout(mapbox_style="stamen-terrain",
                  mapbox_zoom=4,
                  mapbox_center_lon=gdf.total_bounds[[0, 2]].mean(),
                  mapbox_center_lat=gdf.total_bounds[[1, 3]].mean(),
                  margin={"r": 0, "t": 0, "l": 0, "b": 0})
which looks like the perfect code, but I can't really use it for my data..
I am very new to coding, so please be patient with me ;))
Thanks a lot in advance.
All the best
I previously answered this question: How to plot visualize a Linestring over a map with Python?. I suggested that you update that question, and I still recommend that you do.
Line strings, IMHO, are not the way to go. plotly does not use line strings, so it is extra complexity to encode to line strings only to decode back to numpy arrays. Check out the examples in the official documentation https://plotly.com/python/lines-on-mapbox/. There it is very clear that geopandas is just a source that has to be encoded into numpy arrays.
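To illustrate that encoding with made-up coordinates: plotly draws one continuous line per trace and breaks it wherever the coordinate arrays contain None, which is exactly what the solution below exploits. A minimal sketch:
import plotly.express as px

# two separate segments in a single trace, delimited by None (coordinates invented)
lats = [52.45, 52.53, None, 52.48, 52.46, None]
lons = [13.48, 13.61, None, 13.48, 13.50, None]
fig = px.line_mapbox(lat=lats, lon=lons, zoom=9, center={"lat": 52.5, "lon": 13.55})
fig.update_layout(mapbox_style="stamen-terrain", margin={"r": 0, "t": 0, "l": 0, "b": 0})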
data
Your sample data, it appears, should be one DataFrame; there is no need for geopandas or line strings.
Almost all of your sample data is unusable, as every row where origin and destination differ has a move of zero, which you note should be excluded.
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame({"origin": [88, 88, 88, 88, 88, 87],
"destination": [88, 89, 110, 111, 112, 83],
"move": [20, 0, 5, 0, 0, 10],
"longitude_origin": [13.481016, 13.481016, 13.481016, 13.481016, 13.481016, 13.479667],
"latitude_origin": [52.457055, 52.457055, 52.457055, 52.457055, 52.457055, 52.4796],
"longitude_destination": [13.481016, 13.504075, 13.613772, 13.586891, 13.559341, 13.481016],
"latitude_destination": [52.457055, 52.443923, 52.533194, 52.523562, 52.507418, 52.457055]})
solution
I have further refined the line_array() function so that it can be used to encode the hover and color parameters, building on the simplified solution I previously provided.
# lines in plotly are delimited by None
def line_array(data, cols=[], empty_val=None):
    if isinstance(data, pd.DataFrame):
        vals = data.loc[:, cols].values
    elif isinstance(data, pd.Series):
        a = data.values
        vals = np.pad(a.reshape(a.shape[0], -1), [(0, 0), (0, 1)], mode="edge")
    # append the delimiter column, then flatten to one value per coordinate
    return np.pad(vals, [(0, 0), (0, 1)], constant_values=empty_val).reshape(-1)

# only draw lines where move > 0 and destination is different to origin
df = df.loc[df["move"].gt(0) & (df["origin"] != df["destination"])]

lons = line_array(df, ["longitude_origin", "longitude_destination"])
lats = line_array(df, ["latitude_origin", "latitude_destination"])

fig = px.line_mapbox(
    lat=lats,
    lon=lons,
    hover_name=line_array(
        df.loc[:, ["origin", "destination"]].astype(str).apply(" - ".join, axis=1)
    ),
    hover_data={
        "move": line_array(df, ["move", "move"], empty_val=-99),
        "origin": line_array(df, ["origin", "origin"], empty_val=-99),
    },
    color=line_array(df, ["origin", "origin"], empty_val=-99),
).update_traces(visible=False, selector={"name": "-99"})  # hide the trace built from the -99 delimiters

fig.update_layout(
    mapbox={
        "style": "stamen-terrain",
        "zoom": 9.5,
        "center": {"lat": lats[0], "lon": lons[0]},
    },
    margin={"r": 0, "t": 0, "l": 0, "b": 0},
)

Numpy equivalent of pandas replace (dictionary mapping)

I know that working on a numpy array can be quicker than working on a pandas dataframe.
I am wondering if there is an equivalent (and quicker) way to do pandas.replace on a numpy array.
In the example below, I have created a dataframe and a dictionary. The dictionary contains the names of the columns and their corresponding mappings. I wonder if there is any function which would allow me to feed a dictionary to a numpy array to do the mapping and yield a quicker processing time?
import pandas as pd
import numpy as np
# Dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data=d)
# dictionary I want to map
d_mapping = {'col1' : {1:2 , 2:1} , 'col2' : {4:1}}
# result using pandas replace
print(df.replace(d_mapping))
# Instead of a pandas dataframe, I want to perform the same operation on a numpy array
df_np = df.to_records(index=False)
You can try np.select(). I believe the relative performance depends on the number of unique elements to replace.
def replace_values(df, d_mapping):
    def replace_col(col):
        # extract numpy array and column name from pd.Series
        col, name = col.values, col.name
        # generate condlist and choicelist:
        # for every key in the mapping create a boolean mask
        condlist = [col == x for x in d_mapping[name].keys()]
        choicelist = d_mapping[name].values()
        # np.select keeps the existing value (default=col) where nothing matches
        return np.select(condlist, choicelist, col)
    return df.apply(replace_col)
usage:
replace_values(df, d_mapping)
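For the example dataframe above, both replace_values(df, d_mapping) and df.replace(d_mapping) should produce:
   col1  col2
0     2     1
1     1     5
2     3     6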
I also believe that you can speed up the code above if you use lists/arrays in the mapping instead of dicts; the keys() and values() calls and the dict lookups are also expensive. Replace them with index lookups:
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
...
m = d_mapping[name]
condlist = [col == x for x in m[0]]
choicelist = m[1]
...
(Something like np.isin(col, m[0]) could also be used to build a single membership mask for all keys of a column.)
Update:
Here is the benchmark:
import pandas as pd
import numpy as np

# Dataframe
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
# dictionary I want to map
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {
    col: dict(zip(*replacement)) for col, replacement in d_mapping.items()
}

def replace_values(df, mapping):
    def replace_col(col):
        col, (m0, m1) = col.values, mapping[col.name]
        return np.select([col == x for x in m0], m1, col)
    return df.apply(replace_col)

from timeit import timeit
print("np.select: ", timeit(lambda: replace_values(df, d_mapping), number=5000))
print("df.replace: ", timeit(lambda: df.replace(d_mapping_2), number=5000))
On my 6-year-old laptop it prints:
np.select: 3.6562702230003197
df.replace: 4.714512745998945
np.select is ~20% faster

Why is the pandas dataframe style lost when saved with "to_excel"?

Per this example the to_excel method should save the Excel file with background color. However, my saved Excel file does not have any color in it.
I tried to write using both openpyxl and xlsxwriter engines. In both cases, the Excel file was saved, but the cell color/style was lost.
I can read the file back and reformat with openpyxl, but if this to_excel method is supposed to work, why doesn't it?
Here is the sample code.
import pandas as pd # version 0.24.2
dict = {'A': [1, 1, 1, 1, 1], 'B':[2, 1, 2, 1, 2], 'C':[1, 2, 1, 2, 1]}
df = pd.DataFrame(dict)
df_styled = df.style.apply(lambda x: ["background: #ffa31a" if x.iloc[0] < v else " " for v in x], axis=1)
df_styled
''' In my jupyter notebook, this displayed my dataframe with background color where the condition is met (all the 2s highlighted) '''
'''Save the styled data frame to excel using to_excel'''
df_styled.to_excel('example_file_openpyxl.xlsx', engine='openpyxl')
df_styled.to_excel('example_file_xlsxwriter.xlsx', engine='xlsxwriter')
I stumbled across this myself, and as far as I'm aware there isn't support for exporting to excel like this yet. I've adjusted your code to match the output-to-excel example in the documentation.
This is the documentation's output-to-excel method:
df.style.\
    applymap(color_negative_red).\
    apply(highlight_max).\
    to_excel('styled.xlsx', engine='openpyxl')
This is your code adjusted:
import pandas as pd
import numpy as np

data = {'A': [1, 1, 1, 1, 1], 'B': [2, 1, 2, 1, 2], 'C': [1, 2, 1, 2, 1]}
df = pd.DataFrame(data)

def highlight(df, color="yellow"):
    attr = 'background-color: {}'.format(color)
    # boolean frame: True wherever the first value in the row is less than the cell
    df_bool = pd.DataFrame(df.apply(lambda x: [True if x.iloc[0] < v else False for v in x], axis=1).apply(pd.Series),
                           index=df.index)
    df_bool.columns = df.columns
    return pd.DataFrame(np.where(df_bool, attr, ""),
                        index=df.index, columns=df.columns)

df.style. \
    apply(highlight, axis=None). \
    to_excel("styled.xlsx", engine="openpyxl")
Inside the highlight function, I create a boolean dataframe based on the conditions applied in the list comprehension above. Then, I assign styling based on the result of this dataframe.
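As a quick sanity check (my addition, not part of the original answer), you can read a cell's fill back with openpyxl; with the index written to column A, cell C2 is the first highlighted cell ('B' value 2 in the first data row):
from openpyxl import load_workbook

wb = load_workbook("styled.xlsx")
ws = wb.active
print(ws["C2"].fill.start_color.rgb)  # expect a yellow fill such as '00FFFF00'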

How to append an element to an mxnet NDArray?

In numpy, one can append an element to an array by using np.append().
But although numpy and mxnet arrays are supposed to be similar, there is no append() function in the NDArray class.
Update(18/04/24):
Thanks Thom. In fact, what I was trying to achieve is this in numpy:
import numpy as np
np_a1 = np.empty((0,3), int)
np_a1 = np.append(np_a1, np.array([[1,2,3],[4,5,6]]), axis=0)
np_a1 = np.append(np_a1, np.array([[7,8,9]]), axis=0)
print("\nnp_a1:\n", np_a1)
print(np_a1.shape)
Thanks to your answer, I did this:
import mxnet as mx
nd_a1 = mx.nd.array([[0, 0, 0]])
# nd_a1 = mx.nd.empty((0,3))
nd_a1 = mx.nd.concat(nd_a1, mx.nd.array([[1,2,3],[4,5,6]]), dim=0)
nd_a1 = mx.nd.concat(nd_a1, mx.nd.array([[7, 8, 9]]), dim=0)
print("\nnd_a1", nd_a1)
print(nd_a1.shape)
But I can't figure out how to start from an empty nd array.
Starting from :
nd_a1 = mx.nd.empty((0,3))
does not work
You can use mx.nd.concat to achieve this. Using the example given in the numpy docs, you need to be careful with dimensions before concatenating. MXNet works well with data in batches (often the first dimension is the batch dimension), as this is useful when training/using neural networks, but this makes the example below look more complicated than it would be in practice.
import numpy as np
import mxnet as mx

# numpy version, from the np.append docs
a = np.array([1, 2, 3])
b = np.array([[4, 5, 6], [7, 8, 9]])
out = np.append(a, b)
print(out)

# mxnet equivalent
a = mx.nd.array([1, 2, 3])
b = mx.nd.array([[4, 5, 6], [7, 8, 9]])
a = a.expand_dims(0)             # make a 2D with shape (1, 3) so it can be concatenated with b
out = mx.nd.concat(a, b, dim=0)  # join along the first dimension
out = out.reshape(shape=(-1,))   # flatten, matching np.append's behaviour
print(out)
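As for the update about starting from an empty array: a common workaround (a sketch, not from the answer above) is to collect the pieces in a plain Python list and concatenate once at the end, so the empty starting shape is never needed:
import mxnet as mx

chunks = [mx.nd.array([[1, 2, 3], [4, 5, 6]]),
          mx.nd.array([[7, 8, 9]])]
nd_a1 = mx.nd.concat(*chunks, dim=0)  # one concat instead of repeated appends
print(nd_a1.shape)  # (3, 3)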