Expected a bytes object, got a 'int' object error with cudf - pandas

I have a pandas dataframe in which all the columns have object dtype. I am trying to convert it to cudf with cudf.from_pandas(df), but I get this error:
ArrowTypeError: Expected a bytes object, got a 'int' object
I don't understand why, given that the columns contain strings and not ints.
My second question is how to append a new element to a cudf DataFrame (like pandas' df.append()).

cudf does have a df.append() capability for Series:
import cudf
df1 = cudf.DataFrame({'a': [0, 1, 2, 3], 'b': [0.1, 0.2, 0.3, 0.4]})
df2 = cudf.DataFrame({'a': [4, 5, 6, 7], 'b': [0.1, 0.2, None, 0.3]})  # your new elements
df3 = df1.a.append(df2.a)
df3
Output:
0 0
1 1
2 2
3 3
0 4
1 5
2 6
3 7
Name: a, dtype: int64
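As for the first question: an object-dtype column only means the cells hold Python objects, not that they are all strings. If even one cell in a column is an int while the rest are strings, the Arrow conversion behind cudf.from_pandas fails with exactly this ArrowTypeError. A minimal sketch of the usual fix, casting each column to a single type first (the mixed column below is a made-up example):
import cudf
import pandas as pd
pdf = pd.DataFrame({'a': ['1', 2, '3']})  # hypothetical object column mixing str and int
gdf = cudf.from_pandas(pdf.astype(str))   # cast to one type before converting
For appending whole DataFrames rather than single columns, cudf also mirrors pandas' concat, e.g. cudf.concat([df1, df2]).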

Related

Two Pandas dataframes, how to interpolate row-wise using scipy

How can I use scipy interpolate on two dataframes, interpolating row-wise?
For example, if I have:
import pandas as pd

dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})
display(dfx)
display(dfy)
And say I want to interpolate for y(x=0.5), how can I get the results into an array that I can put in a new dataframe?
Expected result is: [0.761290323 0.284615385 1.1 -0.022727273]
For example, for first row, you can see the expected value is 0.761290323:
import matplotlib.pyplot as plt
import scipy.interpolate

x = [0.1, 3.2]  # from dfx, row 0
y = [0.8, 0.5]  # from dfy, row 0
fig, ax = plt.subplots(1, 1)
ax.plot(x, y)
f = scipy.interpolate.interp1d(x, y)
out = f(0.5)
print(out)
I tried the following but received ValueError: x and y arrays must be equal in length along interpolation axis.
f = scipy.interpolate.interp1d(dfx, dfy)
out = np.exp(f(0.5))
print(out)
Since you are looking for linear interpolation, you can do:
def interpolate(val, dfx, dfy):
    # t is the fractional distance from b back toward a along each row's x values
    t = (dfx['b'] - val) / (dfx['b'] - dfx['a'])
    return dfy['a'] * t + dfy['b'] * (1 - t)

interpolate(0.5, dfx, dfy)
Output:
0 0.761290
1 0.284615
2 1.100000
3 -0.022727
dtype: float64
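If you specifically want scipy, a per-row loop over interp1d gives the same numbers; fill_value='extrapolate' is needed because row 3's x range [0.6, 2.8] does not contain 0.5. A minimal sketch:
import numpy as np
import scipy.interpolate

out = np.array([
    scipy.interpolate.interp1d(dfx.iloc[i], dfy.iloc[i], fill_value='extrapolate')(0.5)
    for i in range(len(dfx))
])
print(out)  # approximately [0.7613, 0.2846, 1.1, -0.0227]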

Can we unwind a list in a pandas column?

I have a pandas df like below. It has lists in some of its columns; can they be unwound as below?
import pandas as pd
L1 = [['ID1', 0, [0, 1, 1], [0, 1]],
      ['ID2', 2, [1, 2, 3], [0, 1]]]
df1 = pd.DataFrame(L1, columns=['ID', 't', 'Key', 'Value'])
Can this be unwound like below?
import pandas as pd
L1 = [['ID1', 0, 0, 1, 1, 0, 1],
      ['ID2', 2, 1, 2, 3, 0, 1]]
df1 = pd.DataFrame(L1, columns=['ID', 't', 'Key_0', 'Key_1', 'Key_2', 'Value_0', 'Value_1'])
By turning each Series of lists into a list of lists, you can call the DataFrame constructor to explode them into multiple columns. Using pop in a list comprehension within concat removes the original columns from your DataFrame, so you just join back the exploded versions.
This works regardless of the number of elements in each list, even if the lists have varying numbers of elements across rows (see the sketch after the output below).
df2 = pd.concat([df1] + [pd.DataFrame(df1.pop(col).tolist(), index=df1.index).add_prefix(f'{col}_')
                         for col in ['Key', 'Value']],
                axis=1)
print(df2)
    ID  t  Key_0  Key_1  Key_2  Value_0  Value_1
0  ID1  0      0      1      1        0        1
1  ID2  2      1      2      3        0        1
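To see the varying-length claim in action, here is a made-up frame whose lists differ in length; the DataFrame constructor pads the short row with NaN:
df_var = pd.DataFrame({'Key': [[0, 1, 1], [1, 2]]})  # hypothetical uneven lists
print(pd.DataFrame(df_var.pop('Key').tolist(), index=df_var.index).add_prefix('Key_'))
#    Key_0  Key_1  Key_2
# 0      0      1    1.0
# 1      1      2    NaN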
You can flatten L1 before constructing the data frame:
L2 = [row[0:2] + row[2] + row[3] for row in L1]
df2 = pd.DataFrame(L2, columns=['ID', 't', 'Key_0', 'Key_1', 'Key_2', 'Value_0', 'Value_1'])
You can also explode the dataframe column-wise:
df3 = df1.apply(pd.Series.explode, axis=1)
df3.columns = ['ID', 't', 'Key_0','Key_1','Key_2','Value_0', 'Value_1']

Get max value of 3 columns from pandas DataFrame?

I have a pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is NaN instead of the expected output: US.
How can I get the max value for these 3 columns? (And what if one of them contains NaN?)
Use:
import pandas as pd
from collections import Counter

c = {'a': [['US', 'BE'], ['US']], 'b': [['US'], ['US']], 'c': [['US', 'BE'], ['US', 'BE']]}
df = pd.DataFrame(c, columns=['a', 'b', 'c'])

# lists are unhashable, so map each one to a tuple before counting per row
df = df[['a', 'b', 'c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), axis=1)
print(df)
0 [US, BE]
1 [US]
dtype: object
If, as @Erfan stated, you want the most common value in a row, then use .agg() with 'mode':
df.agg('mode', axis=1)
0
0 [US, BE]
1 [US]
While your data are lists, you can't use pandas' mode() directly, because list objects are unhashable and the mode() function won't work.
A solution is to convert the elements of your dataframe's row to strings and then use mode().
Check this:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0 ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df2.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0 ['US']
dtype: object
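If you need the winner back as a list rather than as its string form, ast.literal_eval can reverse the conversion:
>>> import ast
>>> ast.literal_eval(x.mode()[0])
['US', 'BE']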
As I can see, you have some elements of list type, so I think the below code will work fine.
First, append all values into one flat list.
Then, find the most frequently occurring element in that list.
from collections import Counter

arr = []
for i in df:                            # column names
    for j in range(len(df[i])):         # row positions
        for k in range(len(df[i][j])):  # elements of each list
            arr.append(df[i][j][k])

b = Counter(arr)
print(b.most_common())
# for the single-row frame in the question: [('US', 3), ('BE', 1)]
This will give you the answer you want.

Create polygons from points with GeoPandas

I have a geopandas dataframe containing a list of shapely POINT geometries. Another column holds IDs specifying which unique polygon each point belongs to. Simplified input code is:
import pandas as pd
from shapely.geometry import Point, LineString, Polygon
from geopandas import GeoDataFrame
data = [[1,10,10],[1,15,20],[1,20,10],[2,30,30],[2,35,40],[2,40,30]]
df_poly = pd.DataFrame(data, columns = ['poly_ID','lon', 'lat'])
geometry = [Point(xy) for xy in zip(df_poly.lon, df_poly.lat)]
geodf_poly = GeoDataFrame(df_poly, geometry=geometry)
geodf_poly.head()
I would like to group by the poly_ID in order to convert the geometry from POINT to POLYGON. The output would essentially look like:
poly_ID geometry
1 POLYGON ((10 10, 15 20, 20 10))
2 POLYGON ((30 30, 35 40, 40 30))
I imagine this is quite simple, but I'm having trouble getting it to work. I found the following code that let me convert the points to open-ended polylines, but I have not been able to figure it out for polygons. Can anyone suggest how to adapt it?
geodf_poly = geodf_poly.groupby(['poly_ID'])['geometry'].apply(lambda x: LineString(x.tolist()))
Simply replacing LineString with Polygon results in TypeError: object of type 'Point' has no len()
Your request is a bit tricky to accomplish in pandas alone, because your desired output has the text 'POLYGON' but plain numbers inside the brackets.
See if the below options work for you:
from itertools import chain

df_poly.groupby('poly_ID').agg(list).apply(
    lambda x: tuple(chain.from_iterable(zip(x['lon'], x['lat']))), axis=1
).reset_index(name='geometry')
Output:
poly_ID geometry
0 1 (10, 10, 15, 20, 20, 10)
1 2 (30, 30, 35, 40, 40, 30)
or
from itertools import chain

df_new = df_poly.groupby('poly_ID').agg(list).apply(
    lambda x: tuple(chain.from_iterable(zip(x['lon'], x['lat']))), axis=1
).reset_index(name='geometry')
df_new['geometry'] = df_new.apply(lambda x: 'POLYGON (' + str(x['geometry']) + ')', axis=1)
df_new
Output:
poly_ID geometry
0 1 POLYGON ((10, 10, 15, 20, 20, 10))
1 2 POLYGON ((30, 30, 35, 40, 40, 30))
Note: the geometry column here is a string, and I am not sure you can feed it directly into Shapely.
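A more direct fix for the TypeError in the question, assuming the points of each group are already in ring order, is to extract coordinate tuples before constructing the Polygon. A minimal sketch:
from shapely.geometry import Polygon

geodf_poly = geodf_poly.groupby(['poly_ID'])['geometry'].apply(
    lambda pts: Polygon([(p.x, p.y) for p in pts]))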
This solution works for large data via .dissolve and .convex_hull.
>>> import pandas as pd
>>> import geopandas as gpd
>>> df = pd.DataFrame(
... {
... "x": [0, 1, 0.1, 0.5, 0, 0, -1, 0],
... "y": [0, 0, 0.1, 0.5, 1, 0, 0, -1],
... "label": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
... }
... )
>>> gdf = gpd.GeoDataFrame(
... df,
... geometry=gpd.points_from_xy(df["x"], df["y"]),
... )
>>> gdf
x y label geometry
0 0.0 0.0 a POINT (0.00000 0.00000)
1 1.0 0.0 a POINT (1.00000 0.00000)
2 0.1 0.1 a POINT (0.10000 0.10000)
3 0.5 0.5 a POINT (0.50000 0.50000)
4 0.0 1.0 a POINT (0.00000 1.00000)
5 0.0 0.0 b POINT (0.00000 0.00000)
6 -1.0 0.0 b POINT (-1.00000 0.00000)
7 0.0 -1.0 b POINT (0.00000 -1.00000)
>>> res = gdf.dissolve("label").convex_hull
>>> res.to_wkt()
label
a POLYGON ((0 0, 0 1, 1 0, 0 0))
b POLYGON ((0 -1, -1 0, 0 0, 0 -1))
dtype: object
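Note that .convex_hull recovers the intended polygon only when each group of points forms a convex shape; for concave outlines you would need something like alpha shapes instead.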

DataFrame.apply unintuitively changes int to float, breaking an index lookup

Problem description
The column 'a' has type integer, not float. The apply function should not change the type just because the dataframe has another, unrelated float column.
I understand why it happens: apply detects the most suitable common type for each row Series. I still consider it unintuitive that I can select a group of columns, apply a function that only works on ints (not floats), then remove one unrelated column and suddenly get an exception, because with only numeric columns left, all the ints have become floats.
>>> import pandas as pd
# This works.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': ['', '', '']}).apply(lambda row: row['a'], axis=1)
0 1
1 2
2 3
dtype: int64
# Here we also expect 1, 2, 3, as above.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: row['a'], axis=1)
0 1.0
1 2.0
2 3.0
# Why floats?!?!?!?!?!
# It's an integer column:
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})['a'].dtype
dtype('int64')
Expected Output
0 1
1 2
2 3
dtype: int64
Specifically, in my problem I use the value inside the apply function to index into a list. I need this to be performant, so recasting to int inside the apply is too slow.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: myList[row['a']], axis=1)
This GitHub issue is the only source I could find describing the same problem: https://github.com/pandas-dev/pandas/issues/23230
It seems like your underlying problem is to index a list by the values in one of your DataFrame columns. This can be done by converting your list to an array, after which you can slice as usual:
Sample Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3], 'b': ['', '', '']})
myList = ['foo', 'bar', 'baz', 'boo']
Code:
np.array(myList)[df.a.to_numpy()]
#array(['bar', 'baz', 'boo'], dtype='<U3')
Or if you want the Series:
pd.Series(np.array(myList)[df.a.to_numpy()], index=df.index)
#0 bar
#1 foo
#2 boo
#dtype: object
Alternatively, as a list comprehension this is:
[myList[i] for i in df.a]
#['bar', 'foo', 'boo']
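As a quick illustration that the upcast comes from forming mixed-dtype rows rather than from apply itself, restricting the apply to the integer columns keeps the dtype:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
print(df[['a']].apply(lambda row: row['a'], axis=1).dtype)  # int64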
You are getting caught by pandas upcasting: certain operations result in an upcast column dtype. The 0.24 docs describe these gotchas here: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#gotchas
Below are examples of operations that trigger it.
import pandas as pd
import numpy as np
print(pd.__version__)
# float64 is the default dtype of an empty dataframe
df = pd.DataFrame({'a': [], 'b': []})
print(df['a'].dtype)  # float64
# even if the default is changed, the following is a surprise:
# enlarging via .loc converts the int columns to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# creates a row "a"; the columns are upcast to float
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# not upcast
df.loc[:"col1"] = np.int64(0)
print(df.dtypes)
Taking a shot at a performant answer that works around such upcasting behavior:
import pandas as pd
print(pd.__version__)

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
# Series.apply works column by column on scalars, so dtypes never mix
df['a'] = df['a'].apply(lambda val: val + 1)
df['b'] = df['b'].apply(lambda val: val + 1)
print(df)
print(df['a'].dtype)
print(df['b'].dtype)
dtypes are preserved.
0.24.2
a b
0 2 1.0
1 3 1.0
2 4 1.0
int64
float64