Why does loading a dataframe from CSV turn all the ndarrays into strings? - pandas

I have a dataframe which I saved as CSV in this way:
df.to_csv("df.csv", index=False)
The dataframe values are of ndarray type, for example:
type(df.iloc[0][2])  # numpy.ndarray
I'm reading the CSV file back with:
df = pd.read_csv("df.csv", sep=',')
and the values have turned from ndarray into string:
type(df.iloc[0][2])  # str
How can I read the dataframe from CSV while preserving the type (ndarray) of each item?

You can try the dtype or converters arguments of the read_csv function; see the documentation.
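For instance, a minimal sketch, assuming the arrays are 1-D numeric and were written by to_csv as their default string form (e.g. "[1. 2. 3.]"), and using a placeholder column name "arr_col":
import numpy as np
import pandas as pd

def parse_array(cell):
    # "[1. 2. 3.]" -> array([1., 2., 3.])
    return np.array(cell.strip("[]").split(), dtype=float)

# converters runs the function on every cell of the named column while parsing
df = pd.read_csv("df.csv", converters={"arr_col": parse_array})
type(df.iloc[0]["arr_col"])  # numpy.ndarray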

Related

Can not infer schema for type when converting pandas dataframe to pyspark dataframe

I am trying to use pyspark.pandas to read an Excel file, and I need to convert the pandas dataframe to a pyspark dataframe.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True)
pyspark_df = spark.createDataFrame(df)
When I do this, I get the error:
TypeError: Can not infer schema for type:
Even though I tried to specify the dtype for read_excel and define the schema, I still get the error.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True, dtype=dtypetest)
pyspark_df = spark.createDataFrame(df, schema)
Would you tell me how to solve it?
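For reference, "define the schema" here means passing an explicit StructType to createDataFrame; a hypothetical sketch (the column names and types are placeholders, not taken from the spreadsheet) looks like:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical column names/types -- the real ones come from the Excel sheet
schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True),
])
pyspark_df = spark.createDataFrame(df, schema=schema)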

pyspark : TypeError: element in array element in array element in array field prediction: ArrayType(FloatType,true) can not accept object

I am looking for advice to resolve a TypeError when operating on the entire PySpark DataFrame.
I have a PySpark DataFrame with the following schema, and I want to apply further operations like count() or show() to the DataFrame and then convert the Spark DataFrame to a Pandas DataFrame. As you can see below, the error returned when performing .count() on the PySpark DataFrame is as follows:
TypeError: element in array element in array element in array field prediction: ArrayType(FloatType,true) can not accept object -1.2425838708877563 in type <class 'float'>
I am able to successfully display the first row from the Pandas DataFrame, using .head(), after converting the PySpark DataFrame to Pandas DataFrame.

Polars converters like pandas

Pandas read_csv accepts converters to pre-process each field. This is very useful, especially for int64 validation or mixed date formats. Could you please provide a way to read multiple columns as pl.Utf8 and then cast them to Int64, Float64, Date, etc.?
If you need to preprocess a column the way converters do in pandas, you can read that column as the pl.Utf8 dtype and use polars expressions to process it before casting.
csv = """a,b,c
#12,1,2,
#1,3,4
1,45,5""".encode()
(pl.read_csv(csv, dtypes={"a": pl.Utf8})
.with_column(pl.col("a").str.replace("#", "").cast(pl.Int64))
)
Or if you want to do the same to multiple columns of that dtype
csv = """a,b,c,str_col
#12,1#,2foo,
#1,3#,4,bar
1,45#,5,ham""".encode()
pl.read_csv(
file = csv,
).with_columns([
pl.col(pl.Utf8).exclude("str_col").str.replace("#","").cast(pl.Int64),
])
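The question also asks about Date; under the same approach, a hypothetical date column read as pl.Utf8 could be parsed with strptime, for example:
csv = """a,d
1,2019-01-31
2,2019-02-28""".encode()
pl.read_csv(csv, dtypes={"d": pl.Utf8}).with_columns([
    pl.col("d").str.strptime(pl.Date, "%Y-%m-%d")  # cast string -> Date
])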

saving a dask dataframe to hdf5

I have a dask dataframe that has the columns
['ID', 'PERIOD', 'CURRENCY']
where I created PERIOD as
datetime.datetime.strptime('201901', "%Y%m").date()
When I try to save this dataframe using:
dd.to_hdf('table.h5', key='df', append=True, complib='zlib', format='table', data_columns=True)
I get an error:
TypeError: Cannot serialize the column [PERIOD] because its data contents are [date] object dtype
However, when I save the dataframe to CSV/Parquet I don't see any error. I'm using dask version 2.5.2.
Apparently converting to unix timestamp works:
time.mktime(datetime.datetime.strptime('201901', "%Y%m").date().timetuple())
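Applied to the dataframe itself, that workaround might look like the sketch below (assuming the dask dataframe is named ddf; using Series.map is just one way to apply the conversion):
import time

# Convert the PERIOD date objects to unix timestamps (floats),
# which PyTables can serialize, before writing to HDF5
ddf['PERIOD'] = ddf['PERIOD'].map(lambda d: time.mktime(d.timetuple()),
                                  meta=('PERIOD', 'f8'))
ddf.to_hdf('table.h5', key='df', append=True, complib='zlib',
           format='table', data_columns=True)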

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is that I have to aggregate the data into one format and then dump it into HDF5. This is ~1 TB of data, so naturally it cannot fit into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data into one pandas dataframe, I would do this:
import pandas as pd
import csv
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame() # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
# parse create dictionary of key:value pairs by table field:value, "dictionary_line"
# save dictionary as pandas dataframe
df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
total_df = pd.concat([total_df, df]) # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"] # define columns
readcsvfile = csv.reader(csvfile) # read in file, if csv
# somehow define empty dask dataframe total_df = dd.Dataframe()?
for i, line in enumerate(readcsvfile):
# parse create dictionary of key:value pairs by table field:value, "dictionary_line"
# save dictionary as pandas dataframe
df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
total_df = da.concatenate([total_df, df]) # creates one big dataframe
After creating a ~TB dataframe, I will save it into HDF5.
My problem is that total_df does not fit into RAM and must be saved to disk. Can a dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular, if the concern is that your data is separated by tabs rather than commas, this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
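A minimal sketch of that path (the filename is a placeholder and csv_columns is the column list from the question):
import dask.dataframe as dd

# Lazily read the tab-separated file; nothing is loaded into RAM yet
df = dd.read_csv('myfile.tsv', sep='\t', names=csv_columns)
df.head()  # work happens only when a result is requested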
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have a dask.dataframe, try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
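Putting these pieces together, a rough end-to-end sketch might look like the following (parse is a stand-in for your own parsing logic and csv_columns is the column list from the question):
import dask.bag as db

def parse(fields):
    # Hypothetical parser: map the tab-separated fields to column names
    return dict(zip(csv_columns, fields))

b = db.read_text('myfile.tsv', blocksize=10000000)  # 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe()
df.to_hdf('myfile.hdf5', '/df')  # partitions are written one at a time, never all in RAM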