Pandas' read_csv accepts converters to pre-process each field. This is very useful, especially for int64 validation or mixed date formats. Could you please provide a way to read multiple columns as pl.Utf8 and then cast them to Int64, Float64, Date, etc.?
If you need to preprocess a column the way converters do in pandas, you can read that column as the pl.Utf8 dtype and use polars expressions to process it before casting.
csv = """a,b,c
#12,1,2
#1,3,4
1,45,5""".encode()
(pl.read_csv(csv, dtypes={"a": pl.Utf8})
 .with_columns(pl.col("a").str.replace("#", "").cast(pl.Int64))
)
Or, if you want to do the same to multiple columns of that dtype:
csv = """a,b,c,str_col
#12,1#,2,
#1,3#,4,bar
1,45#,5,ham""".encode()
pl.read_csv(
    file=csv,
).with_columns(
    pl.col(pl.Utf8).exclude("str_col").str.replace("#", "").cast(pl.Int64)
)
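Along the same lines, a column of date strings can be read as pl.Utf8 and parsed with str.strptime. A rough sketch, where the column name and date format below are illustrative:

csv = """a,date_col
1,2021-01-01
2,2021-02-01""".encode()

(pl.read_csv(csv, dtypes={"date_col": pl.Utf8})
 .with_columns(pl.col("date_col").str.strptime(pl.Date, "%Y-%m-%d"))  # parse strings into pl.Date
)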
Related
How can I convert a matrix to a DataFrame in Julia?
I have a 10×2 Matrix{Any}, and when I try to convert it to a DataFrame using this:
df2 = convert(DataFrame,Xt2)
i get this error:
MethodError: Cannot `convert` an object of type Matrix{Any} to an object of type DataFrame
Try instead
df2 = DataFrame(Xt2,:auto)
You cannot use convert for this; you can use the DataFrame constructor, but then as the documentation (simply type ? DataFrame in the Julia REPL) will tell you, you need to either provide a vector of column names, or :auto to auto-generate column names.
Tangentially, I would also strongly recommend avoiding Matrix{Any} (or really anything involving Any) for any scenario where performance is at all important.
I have a dataframe which I saved as CSV in this way:
df.to_csv("df.csv", index=False)
The dataframe values are ndarray type, for example:
type(df.iloc[0][2]) = ndarray
I'm reading the csv file as:
df = pd.read_csv("df.csv", sep=',')
And the values turned from ndarray to string:
type(df.iloc[0][2]) = string
How can I read the dataframe from CSV while preserving the type (ndarray) of each item?
You can try to use the dtype or converters arguments of the read_csv function; see the documentation.
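For instance, a converter could parse each serialized array string back into an ndarray. A rough sketch, assuming the column is named "arr_col" and the arrays were written as space-separated strings such as "[1 2 3]":

import numpy as np
import pandas as pd

def parse_array(s):
    # turn a string like "[1 2 3]" back into a NumPy array
    return np.array(s.strip("[]").split(), dtype=float)

df = pd.read_csv("df.csv", converters={"arr_col": parse_array})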
I have a big data dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following.
Just filling the column with numbers leads to the data type int32, and the data is stored without problems.
Setting one element to abc changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and is there a way to prevent this? The only way I found was to go through all columns, check if they are dtype('O'), and explicitly convert them to str.
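That workaround might look roughly like this, a minimal sketch of the loop described above:

# cast every object-dtype column to str so to_hdf sees consistent string data
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].astype(str)
df.to_hdf("df.hdf5", key="data", format="table")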
Instead of using hdf5, I found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")
I am currently processing a bunch of CSV files and transforming them into Parquet. I use these with Hive and query the files directly. I would like to switch over to Dask for my data processing. The data I am reading has optional columns, some of which are Boolean types. I know pandas does not support optional bool types at this time, but is there any way to specify to either fastparquet or PyArrow what type I would like a field to be? I am fine with the data being a float64 in my DF, but can't have it as such in my Parquet store due to existing files already being an optional Boolean type.
You should try using the fastparquet engine and the following keyword argument:
object_encoding={'bool_col': 'bool'}
Also, pandas now allows boolean columns with NaNs as an extension type ("boolean"), although it is not yet the default. That should work directly.
Example
import pandas as pd
import fastparquet as fp

df = pd.DataFrame({'a': [0, 1, 'nan']})
fp.write('out.parq', df, object_encoding={'a': 'bool'})
fp.write('out.parq', df.astype(float), object_encoding={'a': 'bool'})
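The nullable-boolean route mentioned above would look roughly like this; a sketch assuming a column that may contain missing values, with an illustrative column name:

import numpy as np
import pandas as pd

df = pd.DataFrame({"bool_col": [True, False, np.nan]})
df["bool_col"] = df["bool_col"].astype("boolean")  # pandas nullable BooleanDtype
df.to_parquet("out.parq", engine="pyarrow")        # written as an optional boolean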
How can I turn a DataFrame into DataFrame of strings according to the same rules that str(df) uses?
I tried df.astype("str") and df.applymap(str), but both left floats with larger precision than indicated by the display.precision setting.
Use .round() before converting to str:
p = pd.get_option('display.precision')
df.round(p).astype(str)
Pandas rounds numerical data when you try to display it, to the precision specified by display.precision; the data is still stored at its full precision.
Directly casting to str makes pandas use the full precision of the float; it is independent of whatever setting you have for display.precision.
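A small illustration of the difference, assuming the default display.precision of 6:

import pandas as pd

df = pd.DataFrame({"x": [1.23456789]})
p = pd.get_option("display.precision")
print(df.astype(str))           # 1.23456789  -- full stored precision
print(df.round(p).astype(str))  # 1.234568    -- matches what str(df) shows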
You can use applymap with a format string, e.g.:
df.applymap(lambda x: '{0:.2f}'.format(x))