Pandas series pad function not working with apply in pandas - pandas

I am trying to write the code to pad columns of my pandas dataframe with different characters. I tried to use apply function to fill '0' with zfill and it works.
print(df["Date"].apply(lambda x: x.zfill(10)))
But when I try to use pad function using apply method to my dataframe I face error:
AttributeError: 'str' object has no attribute 'pad'
The code I am trying is:
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0")))
Both the zfill and pad functions are a part of pandas.Series.str. I am confused why pad is not working and zfill works. How can I achieve this functionality?
Full code:
import pandas as pd
from io import StringIO
StringData = StringIO(
"""Date,Time
パンダ,パンダ
パンダサンDA12-3,パンダーサンDA12-3
パンダサンDA12-3,パンダサンDA12-3
"""
)
df = pd.read_csv(StringData, sep=",")
print(df["Date"].apply(lambda x: x.zfill(10))) -- works
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0"))) -- doesn't work
I am using pandas 1.5.1.

You should just not use apply, doing so you don't benefit from Series methods, but rather use pure python str methods:
print(df["Date"].str.zfill(10))
print(df["Date"].str.pad(10, side="left", fillchar="0"))
output:
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
multiple columns:
Now, you need to use apply, but this is DataFrame.apply, not Series.apply:
df[['col1', 'col2', 'col3']].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))

Related

dtype definition for pandas dataframe with columns of VARCHAR or String

I want to get some data in a dictionary that need to go into a pandas dataframe.
The dataframe is later written in a PostgreSQL table using sqlalchemy, and I would like to get the right column types.
Hence, I specify the dtypes for the dataframe
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
"forretningsområde": sqlalchemy.types.String(length=40),
"forretningsproces": sqlalchemy.types.INTEGER(),
"id_namespace": sqlalchemy.types.String(length=100),
"id_lokalId": sqlalchemy.types.String(length=36),
"kommunekode": sqlalchemy.types.INTEGER(),
"registreringFra": sqlalchemy.types.DateTime()}
Later I use df = pd.DataFrame(item_lst, dtype=dtypes), where item_lst is a list of dictionaries.
Independent from me using either String(8), String(length=8) or VARCHAR(8) in the dtype definition, the result of pd.DataFrame(item_lst, dtype=dtypes) is always object of type '(String or VARCHAR)' has no len().
How do I have to define the dtype to overcome this error?
Instead of forcing data types when the DataFrame is created, let pandas infer the data types (just df = pd.DataFrame(item_lst)) and then use your dtypes dict with to_sql() when you push your DataFrame to the database, like this:
from pprint import pprint
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite://")
item_lst = [{"forretningshændelse": "foo"}]
df = pd.DataFrame(item_lst)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 forretningshændelse 1 non-null object
dtypes: object(1)
memory usage: 136.0+ bytes
None
"""
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8)}
df.to_sql("tbl", engine, index=False, dtype=dtypes)
insp = sqlalchemy.inspect(engine)
pprint(insp.get_columns("tbl"))
"""
[{'autoincrement': 'auto',
'default': None,
'name': 'forretningshændelse',
'nullable': True,
'primary_key': 0,
'type': VARCHAR(length=8)}]
"""
I believe you are confusing the dtypes within the DataFrame with the dtypes on the SQL table itself.
You probably don't need to manually specify the datatypes in pandas itself but if you do, here's how.
Spoiler alert: it is written in the pandas.Dataframe documentation that only a single dtype must be specified so you will need some loops or manual column work to get different types.
To solve your problem:
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("connection_string")
df = pd.DataFrame(item_list)
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
"forretningsområde": sqlalchemy.types.String(40),
"forretningsproces": sqlalchemy.types.INTEGER(),
"id_namespace": sqlalchemy.types.String(100),
"id_lokalId": sqlalchemy.types.String(36),
"kommunekode": sqlalchemy.types.INTEGER(),
"registreringFra": sqlalchemy.types.DateTime()}
with engine.connect() as engine:
df.to_sql("table_name",if_exists="replace", con=engine, dtype=dtypes)
Tip: Avoid using special characters while coding in general, it only makes maintaining code harder at some point :). I assumed you're creating a new sql table and not appending, otherwise types for the table would already be defined.
Happy Coding!

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
[
["Jay", "MS"],
["Jay", "Music"],
["Dorsey", "Music"],
["Dorsey", "Piano"],
["Mark", "MS"],
],
columns=["Name", "Course"],
)
pandas_df_json = (
courses_df.groupby(["Name"])
.apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
.reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation.
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
name="courses_json"
).compute()
And the error i get is
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from dask and pandas should be same that is
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do i achieve this in dask ?
My try so far
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta arguement and also want the second column
to have a meaningful name like courses_json
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this it's entirely possible that Dask may infer faulty datatypes. One partition could for example be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records"),
meta=("Name", "O")
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use numeric int index instead of name as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a dask Series, not a Dataframe. This is why you need to use numpy types here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). It consists of an index and a value. And you're not directly able to name the second column without converting it back to a dataframe using the .to_frame() method.

Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy Object. I am selecting the column movement_state beforehand. This column is categorical as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this code snippet of rolling.py from the pandas source code I read that the data must be converted to float64 before it can be processed by cython.
def _prep_values(self, values: ArrayLike) -> np.ndarray:
"""Convert input to numpy arrays for Cython routines"""
if needs_i8_conversion(values.dtype):
raise NotImplementedError(
f"ops for {type(self).__name__} for this "
f"dtype {values.dtype} are not implemented"
)
else:
# GH #12373 : rolling functions error on float32 data
# make sure the data is coerced to float64
try:
if isinstance(values, ExtensionArray):
values = values.to_numpy(np.float64, na_value=np.nan)
else:
values = ensure_float64(values)
except (ValueError, TypeError) as err:
raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround a came up with is to just use the codes of the categorical column instead of the string values. But this way, s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value.

PySpark Create DataFrame With Float TypeError

I have Data Sets as Below:
I am using PySpark to parse the data and create a DataFrame later using below code:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f
def parseInput(line):
fields = line.split(',')
stationID=fields[0]
entryType=fields[2]
temperature= fields[3]*0.3
return Row(stationID,entryType,temperature)
spark = SparkSession.builder.appName("MinTemperatures").getOrCreate()
lines = spark.sparkContext.textFile("data/1800.csv")
temperatures = lines.map(parseInput)
minTemps=temperatures.filter(lambda x:x[1]=='TMIN')
df = spark.createDataFrame(minTemps)
I got below error:
TypeError: can't multiply sequence by non-int of type 'float'
Obviously, if I remove 0.3 out of temperature= fields[3]*0.3, the create DataFrame work. How can I return the temperature with float number and some basic math operation?
Try temperature= float(fields[3])*0.3
You can read the file without multiplication first and then cast it to Type Double, do the multiplication finally.
I assume your csv file have header.
The following code is for casting:
data = data.withColumn("COLUMN_NAME", data["COLUMN_NAME"].cast("double"))

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a':[['1234','abc','444'],
['5678'],
['2468','def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index,row in t.iterrows():
if (len(row['a']) > 1):
_.append(row['a'][1])
else:
_.append(np.nan)
t['element_two'] = _
And I gave an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use str accesor :
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])