Converting pandas code to Dask errors out - pandas

I have pandas code which works perfectly:

import pandas as pd

courses_df = pd.DataFrame(
    [
        ["Jay", "MS"],
        ["Jay", "Music"],
        ["Dorsey", "Music"],
        ["Dorsey", "Piano"],
        ["Mark", "MS"],
    ],
    columns=["Name", "Course"],
)

pandas_df_json = (
    courses_df.groupby(["Name"])
    .apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
    .reset_index(name="courses_json")
)
But when I convert the DataFrame to Dask and try the same operation:

from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
    name="courses_json"
).compute()
And the error I get is:
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8')) for series result
  df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [37], in <module>
      1 from dask import dataframe as dd
      3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
      5     name="courses_json"
      6 ).compute()

TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from Dask should be the same as from pandas, that is:

     Name                             courses_json
0  Dorsey  [{"Course":"Music"},{"Course":"Piano"}]
1     Jay     [{"Course":"MS"},{"Course":"Music"}]
2    Mark                        [{"Course":"MS"}]

How do I achieve this in Dask?
My attempt so far:

from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8')) for series result
  df.groupby(["Name"]).apply(

Out[57]:
Name
Dorsey    [{"Course":"Piano"},{"Course":"Music"}]
Jay          [{"Course":"MS"},{"Course":"Music"}]
Mark                            [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument, and I also want the second column to have a meaningful name like courses_json.

For the meta warning: Dask expects you to specify the column datatypes of the result. It's optional, but if you do not specify it, Dask may infer faulty datatypes; one partition could, for example, be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:

from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records"),
    meta=("Name", "O"),
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use a numeric int index instead of Name, as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a Dask Series, not a DataFrame; this is why you need to use numpy dtypes here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists of an index and a value, and you cannot directly name the second column without converting it back to a DataFrame using the .to_frame() method.
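As a side note, the name you put in the meta tuple should carry through to the resulting Series, which lets you skip the manual column rename. A minimal sketch of the same approach; the (name, dtype) tuple is Dask's documented convention for Series meta, but treat the name carrying through to the final column as an assumption to verify:

from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
new_df = (
    df.groupby(["Name"])
    .apply(
        lambda x: x.drop(columns="Name").to_json(orient="records"),
        # (name, dtype) meta for a Series result; the name should become the column name
        meta=("courses_json", "object"),
    )
    .to_frame()      # Series -> single-column DataFrame named after the Series
    .reset_index()   # promote the "Name" group index to a regular column
)
new_df.compute()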

Related

dtype definition for pandas dataframe with columns of VARCHAR or String

I want to get some data in a dictionary that needs to go into a pandas DataFrame.
The DataFrame is later written to a PostgreSQL table using sqlalchemy, and I would like to get the right column types.
Hence, I specify the dtypes for the dataframe:

dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
          "forretningsområde": sqlalchemy.types.String(length=40),
          "forretningsproces": sqlalchemy.types.INTEGER(),
          "id_namespace": sqlalchemy.types.String(length=100),
          "id_lokalId": sqlalchemy.types.String(length=36),
          "kommunekode": sqlalchemy.types.INTEGER(),
          "registreringFra": sqlalchemy.types.DateTime()}

Later I use df = pd.DataFrame(item_lst, dtype=dtypes), where item_lst is a list of dictionaries.
Regardless of whether I use String(8), String(length=8) or VARCHAR(8) in the dtype definition, pd.DataFrame(item_lst, dtype=dtypes) always fails with: object of type '(String or VARCHAR)' has no len().
How do I have to define the dtype to overcome this error?
Instead of forcing data types when the DataFrame is created, let pandas infer the data types (just df = pd.DataFrame(item_lst)) and then use your dtypes dict with to_sql() when you push your DataFrame to the database, like this:
from pprint import pprint

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")

item_lst = [{"forretningshændelse": "foo"}]
df = pd.DataFrame(item_lst)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   forretningshændelse  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes
None
"""

dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8)}
df.to_sql("tbl", engine, index=False, dtype=dtypes)

insp = sqlalchemy.inspect(engine)
pprint(insp.get_columns("tbl"))
"""
[{'autoincrement': 'auto',
  'default': None,
  'name': 'forretningshændelse',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=8)}]
"""
I believe you are confusing the dtypes within the DataFrame with the dtypes on the SQL table itself.
You probably don't need to specify the datatypes in pandas manually, but if you do, here's how.
Spoiler alert: the pandas.DataFrame documentation states that the constructor accepts only a single dtype, so you need astype() or per-column work to get different types; see the sketch below.
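For completeness, if you do want per-column dtypes in pandas itself without loops, DataFrame.astype() accepts a dict mapping column names to dtypes. A small sketch with made-up values; note these must be pandas/numpy dtypes, not sqlalchemy types:

import pandas as pd

# hypothetical two-column frame where both columns arrived as strings
df = pd.DataFrame({"forretningsproces": ["1", "2"], "kommunekode": ["101", "202"]})
# cast several columns in one call; keys are column names, values are dtypes
df = df.astype({"forretningsproces": "int64", "kommunekode": "int64"})
print(df.dtypes)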
To solve your problem:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("connection_string")
df = pd.DataFrame(item_list)
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
          "forretningsområde": sqlalchemy.types.String(40),
          "forretningsproces": sqlalchemy.types.INTEGER(),
          "id_namespace": sqlalchemy.types.String(100),
          "id_lokalId": sqlalchemy.types.String(36),
          "kommunekode": sqlalchemy.types.INTEGER(),
          "registreringFra": sqlalchemy.types.DateTime()}
# use a distinct name for the connection so it doesn't shadow the engine
with engine.connect() as conn:
    df.to_sql("table_name", if_exists="replace", con=conn, dtype=dtypes)
Tip: Avoid using special characters while coding in general; it only makes maintaining code harder at some point :). I assumed you're creating a new SQL table and not appending; otherwise the types for the table would already be defined.
Happy Coding!

Python Pandas : Read dates from excel in different formats [duplicate]

I have one field in a pandas DataFrame that was imported in string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date?
Example:

df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:

>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'], columns=['Mycol'])
>>> df
                    Mycol
0  05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
...     dt.datetime.strptime(x, '%d%b%Y:%H:%M:%S.%f'))
>>> df
       Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as datetime. By passing infer_datetime_format=True, it will automatically detect the format and convert the column to datetime. (Note that infer_datetime_format is deprecated as of pandas 2.0, where strict format inference is the default behavior.)
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or not datetime at all, the errors= parameter is very useful, so that you can convert the valid rows and handle the rows with invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
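After coercing, the rows that failed to parse show up as NaT, so you can split them out for later handling (a small illustration):

# rows that failed to parse are NaT; isolate them for inspection
bad_rows = df[df['date'].isna()]
# keep only the rows that parsed cleanly
df = df.dropna(subset=['date'])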
Setting the correct format= is much faster than letting pandas find out [1]

Long story short, passing the correct format= from the beginning, as in chrisb's post, is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking a couple of minutes vs. a few seconds). All valid format options can be found at https://strftime.org/.

[1] Code used to produce the timeit test plot:
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

mdYHMSf = range(1, 13), range(1, 29), range(2000, 2024), range(24), *[range(60)]*2, range(1000)

perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m, d, Y, H, M, S, f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just like we convert object data types to float or int, use astype():
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')
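To close the loop on the question's second half (filtering based on date): once the column is a real datetime, ordinary comparisons work. A minimal sketch building on the question's example:

import pandas as pd

df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
# boolean-mask filter against a timestamp; date strings like '2014-01-01' compare fine too
recent = df[df['date'] >= pd.Timestamp('2014-01-01')]
print(recent)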

Replace None with NaN in pandas dataframe

I have table x:

                  website
0  http://www.google.com/
1    http://www.yahoo.com
2                    None

I want to replace the Python object None with pandas NaN. I tried:

x.replace(to_replace=None, value=np.nan)

But I got:

TypeError: 'regex' must be a string or a compiled regular expression or a list or dict of strings or regular expressions, you passed a 'bool'

How should I go about it?
You can use DataFrame.fillna or Series.fillna which will replace the Python object None, not the string 'None'.
import pandas as pd
import numpy as np
For dataframe:
df = df.fillna(value=np.nan)
For column or series:
df.mycol.fillna(value=np.nan, inplace=True)
Here's another option:
df.replace(to_replace=[None], value=np.nan, inplace=True)
The following line replaces the string 'None' (not the Python object None) with NaN:

df['column'].replace('None', np.nan, inplace=True)
If you use df.replace([None], np.nan, inplace=True), it changes all datetime columns with missing data to object dtype. So you may end up with broken queries unless you change them back to datetime, which can be taxing depending on the size of your data.
If you want to use this method, you can first identify the object-dtype fields in your df and then replace the None:

obj_columns = list(df.select_dtypes(include=['object']).columns.values)
df[obj_columns] = df[obj_columns].replace([None], np.nan)
This solution is straightforward because it can replace the value in all the columns easily.
You can use a dict:

import pandas as pd
import numpy as np

df = pd.DataFrame([[None, None], [None, None]])
print(df)
      0     1
0  None  None
1  None  None

# replacing
df = df.replace({None: np.nan})
print(df)
    0   1
0 NaN NaN
1 NaN NaN
It's an old question, but here is a solution for multiple columns:
values = {'col_A': 0, 'col_B': 0, 'col_C': 0, 'col_D': 0}
df.fillna(value=values, inplace=True)
For more options, check the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
DataFrame['Col_name'].replace("None", np.nan, inplace=True)
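Note that the snippets above are split between two different problems: some replace the Python object None, others replace the literal string 'None'. They are distinct values, and which one your data contains determines which replacement works. A quick illustration covering both:

import pandas as pd
import numpy as np

# row 0 holds the Python object None, row 1 holds the string 'None'
df = pd.DataFrame({'a': [None, 'None']})
# a dict passed to replace() can handle both values in one call
df['a'] = df['a'].replace({None: np.nan, 'None': np.nan})
print(df)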