Pyspark pandas TypeError when trying to concatenate two dataframes - pandas

I got the error below while trying to concatenate two pandas dataframes:
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
At first I thought it was raised because one of the dataframes contains lists in some of its columns. So I tried to concatenate two dataframes whose columns do not contain lists, but I got the same error. I printed the types of the dataframes to be sure; both of them are pandas.core.frame.DataFrame. Why do I get this error even though they are not lists?
import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
2 split_col = split_col.toPandas()
3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
2464 for obj in objs:
2465 if not isinstance(obj, (Series, DataFrame)):
-> 2466 raise TypeError(
2467 "cannot concatenate object of type "
2468 "'{name}"
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame
I want to concatenate the 2 dataframes but I'm stuck. Do you have any suggestions?

You're getting this error because you're trying to concatenate two pandas DataFrames using the pandas API on Spark.
Instead of converting your PySpark DataFrames to pandas DataFrames with the toPandas() method, try the following:
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
More documentation on this method:
https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.to_pandas_on_spark.html
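Putting it together, a minimal sketch of the full call, assuming split_col and split_col2 start out as PySpark DataFrames as in the question:
import pyspark.pandas as ps

# Keep the data as pandas-on-Spark DataFrames instead of plain pandas DataFrames
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()

# ps.concat accepts ps.Series and ps.DataFrame objects, so this call now works
dfNew = ps.concat([split_col, split_col2], axis=1, ignore_index=True)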

Related

Dropping same rows in two pandas dataframe in python

I want to find the uncommon rows between two pandas dataframes. The two dataframes are df1 and wildone_df. When I check their type, both of them are "pandas.core.frame.DataFrame", but when I use the code below to omit their intersection:
o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
I face the following error:
TypeError Traceback (most recent call last)
<ipython-input-36-4e158c0eeb97> in <module>
----> 1 o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
561
562 table = hash_klass(size_hint or len(values))
--> 563 uniques, codes = table.factorize(
564 values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
565 )
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
**TypeError: unhashable type: 'numpy.ndarray'**
How can I solve this issue?!
Omitting the intersection of two dataframes
Either use inplace=True or re-assign your dataframe when using pandas.DataFrame.drop_duplicates or any other built-in function that has an inplace parameter. You can't use them both at the same time.
Returns (DataFrame or None)
DataFrame with duplicates removed or None if inplace=True.
Try this:
o = pd.concat([wildone_df, df1]).drop_duplicates() #keep="first" by default
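To illustrate the inplace point above, a minimal sketch with a small hypothetical dataframe (not the OP's data):
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# Option 1: re-assign the returned copy (inplace defaults to False)
df = df.drop_duplicates(keep="first")

# Option 2: mutate in place; the method then returns None,
# so do not assign its result back to df
df.drop_duplicates(keep="first", inplace=True)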
Try this:
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()].copy()
See this post for more info

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
    [
        ["Jay", "MS"],
        ["Jay", "Music"],
        ["Dorsey", "Music"],
        ["Dorsey", "Piano"],
        ["Mark", "MS"],
    ],
    columns=["Name", "Course"],
)
pandas_df_json = (
    courses_df.groupby(["Name"])
    .apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
    .reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
name="courses_json"
).compute()
The error I get is:
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from Dask should be the same as the pandas output, that is:
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do I achieve this in Dask?
My attempt so far:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument and also want the second column to have a meaningful name like courses_json.
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this, it's entirely possible that Dask may infer faulty datatypes. One partition could, for example, be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records"),
    meta=("Name", "O")
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use numeric int index instead of name as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a Dask Series, not a DataFrame, which is why you need to use NumPy dtypes here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists of an index and values, and you're not able to name the second column directly without converting it back to a DataFrame using the .to_frame() method.
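As an untested variation, you may also be able to set the column name directly through meta, since the (name, dtype) tuple form names the resulting Series; the sketch below is the same approach as above with that one change:
new_df = (
    df.groupby(["Name"])
    .apply(
        lambda x: x.drop(columns="Name").to_json(orient="records"),
        # meta as (name, dtype): names the resulting Series "courses_json"
        meta=("courses_json", "object"),
    )
    .to_frame()       # Series -> single-column DataFrame named "courses_json"
    .reset_index()    # move "Name" from the index into a column
)
new_df.compute()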

Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
    transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
    transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy Object. I am selecting the column movement_state beforehand. This column is categorical as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute this, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
In this code snippet from rolling.py in the pandas source code, I read that the data must be converted to float64 before it can be processed by Cython:
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to just use the codes of the categorical column instead of the string values. But this way, s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value.
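Not a full answer, but here is a minimal sketch of that codes-based workaround, counting by comparison against the category code so that a window missing a category simply yields 0 instead of a KeyError (transformed_df is created here just for the example):
import numpy as np
import pandas as pd

d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3

transformed_df = pd.DataFrame(index=grouped_df.index)
codes = grouped_df['movement_state'].cat.codes  # integer codes, which rolling can coerce to float64

for code, cat_name in enumerate(grouped_df['movement_state'].cat.categories):
    rolled = codes.rolling(rolling_window_size, closed='both')
    transformed_df[f'{cat_name} Count'] = rolled.apply(lambda s, c=code: (s == c).sum(), raw=True)
    transformed_df[f'{cat_name} Ratio'] = rolled.apply(lambda s, c=code: (s == c).mean(), raw=True)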

Pandas Dataframe

import numpy as np
import pandas as pd
from pandas_datareader import data as wb
from yahoofinancials import YahooFinancials
sympol = [input()]
abc = YahooFinancials(sympol)
l=abc.get_financial_stmts('annual', 'cash')
df=pd.concat([pd.DataFrame(key) for key in l['cashflowStatementHistory']['FB']],axis=1,sort=True).reset_index().rename(columns={'index':'Time'})
Thank you, it works, but only with a specific symbol. How can I make it customised so that I can enter any symbol instead? What should I write instead of 'FB'? I tried [[sympol]] but it gives me an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
7 l=abc.get_financial_stmts('annual', 'cash')
----> 8 df=pd.concat([pd.DataFrame(key) for key in l['cashflowStatementHistory'][[sympol]]],axis=1,sort=True).reset_index().rename(columns={'index':'Time'})
TypeError: unhashable type: 'list'
What do you think the problem is? How can I fix it?
Thank you for your help.
I think the simplest approach is a list comprehension that transposes each DataFrame with DataFrame.T, and then concat at the end.
If you need to work with time series later, it's best to also create a DatetimeIndex:
df = pd.concat([pd.DataFrame(x).T for x in l['cashflowStatementHistory']['FB']], sort=True)
df.index = pd.to_datetime(df.index)
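Regarding what to write instead of 'FB': sympol is a list (it was built as [input()]), and a dict can't be indexed with a list, which is what the unhashable type: 'list' error is telling you. An untested sketch that indexes with the first element of that list instead:
import pandas as pd
from yahoofinancials import YahooFinancials

sympol = [input()]                     # e.g. type FB at the prompt
abc = YahooFinancials(sympol)
l = abc.get_financial_stmts('annual', 'cash')

# Use the symbol string itself (sympol[0]) as the dictionary key, not the list
df = pd.concat(
    [pd.DataFrame(x).T for x in l['cashflowStatementHistory'][sympol[0]]],
    sort=True,
)
df.index = pd.to_datetime(df.index)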

Pandas timestamp on array

Pandas does not convert my array into an array of Timestamps:
a = np.array([1457392827660434006, 1457392828660434012, 1457392829660434023,1457474706167386148])
pd.Timestamp(a)
gives an error:
TypeError Traceback (most recent call last)
<ipython-input-42-cdf0e494942d> in <module>()
1 a = np.array([1457392827660434006, 1457392828660434012, 1457392829660434023,1457474706167386148])
----> 2 pd.Timestamp(a)
pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:8967)()
pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:23508)()
TypeError: Cannot convert input to Timestamp
Whereas looping over the array elements works just fine:
for i in range(4):
    t = pd.Timestamp(a[i])
    print(t)
gives:
2016-03-07 23:20:27.660434006
2016-03-07 23:20:28.660434012
2016-03-07 23:20:29.660434023
2016-03-08 22:05:06.167386148
As expected.
Moreover, when that array is the first column of a CSV file, it does not get parsed to a Timestamp automatically, even if I specify parse_dates correctly.
Any help please?
I think you can use to_datetime, and then .values if you need the array values:
import pandas as pd
import numpy as np
a = np.array([1457392827660434006, 1457392828660434012,
              1457392829660434023, 1457474706167386148])
print(pd.to_datetime(a).values)
['2016-03-08T00:20:27.660434006+0100' '2016-03-08T00:20:28.660434012+0100'
'2016-03-08T00:20:29.660434023+0100' '2016-03-08T23:05:06.167386148+0100']
print(pd.to_datetime(a, unit='ns').values)
['2016-03-08T00:20:27.660434006+0100' '2016-03-08T00:20:28.660434012+0100'
'2016-03-08T00:20:29.660434023+0100' '2016-03-08T23:05:06.167386148+0100']
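As a small usage note, pd.to_datetime(a, unit='ns') returns a DatetimeIndex, so its elements are already Timestamps and the explicit loop from the question is not needed:
import numpy as np
import pandas as pd

a = np.array([1457392827660434006, 1457392828660434012,
              1457392829660434023, 1457474706167386148])

idx = pd.to_datetime(a, unit='ns')  # DatetimeIndex with nanosecond precision
print(idx[0])                       # a single pandas Timestamp: 2016-03-07 23:20:27.660434006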