Dask unable to write to parquet with concatenated data - pandas

I am trying to do the following:
Read a .dat file with pandas, convert it to a Dask dataframe, concatenate it with another Dask dataframe that I read in from a parquet file, and then write the result to a new parquet file. I do the following:
import dask.dataframe as dd
import pandas as pd
hist_pth = r"\path\to\hist_file"   # raw strings so the backslashes aren't treated as escape sequences
hist_file = dd.read_parquet(hist_pth)
pth = r"\path\to\file"
daily_file = pd.read_csv(pth, sep="|", encoding="latin")
daily_file = daily_file.astype(hist_file.dtypes.to_dict(), errors="ignore")  # copy the historical schema
dask_daily_file = dd.from_pandas(daily_file, npartitions=1)
combined_file = dd.concat([dask_daily_file, hist_file])
output_path = r"\path\to\output"
combined_file.to_parquet(output_path)
The combined_file.to_parquet(output_path) call always starts and then stalls or fails to complete correctly. In a Jupyter notebook I get a kernel failure error. When I run it as a Python script, the script completes but the whole combined file isn't written (I know because of the size: the CSV is 140 MB and the parquet file is around 1 GB, yet the output of to_parquet is only 20 MB).
Some context: this is for an ETL process, and with the amount of data we're adding daily I'm soon going to run out of memory on the historical and combined datasets, so I'm trying to migrate the process from plain pandas to Dask to handle the larger-than-memory data I will soon have. The current data, daily + historical, still fits in memory, but just barely (I already make use of categoricals; these are stored in the parquet file and then I copy that schema to the new file).
I also noticed that after the dd.concat([dask_daily_file, hist_file]) I am unable to call .compute(), even on simple tasks, without it crashing the same way it does when writing to parquet. For example, on the original, pre-concatenated data I can call hist_file["Value"].div(100).compute() and get the expected value, but the same method on combined_file crashes. Even just combined_file.compute() to turn it into a pandas dataframe crashes. I have tried repartitioning as well, with no luck.
I was able to do these exact operations in plain pandas without issue. But again, I'm going to be running out of memory soon, which is why I am moving to Dask.
Is this something Dask isn't able to handle? If it can handle it, am I processing it correctly? Specifically, it seems like the concat is causing the issues. Any help is appreciated!
UPDATE
After playing around more I ended up with the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
There is an existing GitHub issue that seems like it could be related to this - I asked there and am waiting for confirmation.
As a workaround I converted all categorical columns to strings/objects and tried again, and then ended up with:
ArrowTypeError: ("Expected a bytes object, got a 'int' object", 'Conversion failed for column Account with type object')
When I check that column with df["Account"].dtype it returns dtype('O'), so I think I already have the correct dtype. The values in this column are mainly numbers, but there are some records with just letters.
Is there a way to resolve this?

I got this error in pandas after concatenating dataframes and saving the result to Parquet format...
data = pd.concat([df_1, df_2, df_3], axis=0, ignore_index=True)
data.to_parquet(filename)
...apparently because the rows contained different data types, either int or float. Forcing the affected columns to a single data type before saving makes the error go away:
cols = ["first affected col", "second affected col", ..]
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
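Applied to the question above, the same idea would mean giving the mixed letters-and-numbers Account column a single type on both frames before concatenating and writing. A sketch, reusing the variable names from the question:
import dask.dataframe as dd
# Cast "Account" to str on BOTH frames so pyarrow sees one consistent type per column
daily_file["Account"] = daily_file["Account"].astype(str)
hist_file["Account"] = hist_file["Account"].astype(str)
combined_file = dd.concat([dd.from_pandas(daily_file, npartitions=1), hist_file])
combined_file.to_parquet(output_path)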

Related

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediate & end results were previously stored in an SQL database using sqlalchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the executor nodes reach that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make this work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package method itself. Is that maybe the solution, to rewrite the entire package & parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example, but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when producing output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()) but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works:
def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        if type(value) == pd.Series:
            result[key] = value.tolist()          # Series -> plain list
        elif type(value) == pd.DataFrame:
            result[key] = value.to_dict("list")   # DataFrame -> dict of lists
        else:
            result[key] = value
    return result

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
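A hypothetical usage of that UDF (the dataframe name parsed_df, the column name inputDict and the output path are illustrative assumptions, not part of the original answer) might then look like:
from pyspark.sql import functions as F
# parsed_df holds one row per file, with the dict produced by the earlier fileparser UDF
spectra_df = parsed_df.withColumn("spectrum", spectrumparser(F.col("inputDict")))
spectra_df.write.format("delta").mode("append").save("/path/to/delta/table")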

Values begin with 'b' when reading an arff file to pandas dataframe

I'm reading in this arff file to a pandas dataframe in Colab. I've used the following code, which seems to be fairly standard, from what a quick scan of top search results tells me.
from scipy.io.arff import loadarff
raw_data = loadarff('/speeddating.arff')
df = pd.DataFrame(raw_data[0])
When I inspect the dataframe, many of the values appear in this format: b'some_text'.
When I call type(df.iloc[0,0]) it returns bytes.
What is happening, and how do I get it to not be that way?
If anyone else stumbles upon this question, I found it answered here: Letter appeared in data when arff loaded into Python
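The short version: scipy's loadarff returns nominal/string attributes as Python bytes, so the usual fix is to decode those columns after loading. A minimal sketch:
import pandas as pd
from scipy.io.arff import loadarff

raw_data = loadarff('/speeddating.arff')
df = pd.DataFrame(raw_data[0])
# Decode every object (bytes) column to regular str
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode('utf-8')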

Which characters are allowed in a BigQuery STRING column (getting "UDF out of memory" error)

I have a dataframe containing receipt data. The column text in my dataframe contains the text from the receipt and seems to be the problem when I try to upload the data to BigQuery using df.to_gbq(...), since it produces the error:
GenericGBQException: Reason: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file /some/file.
This might happen if the file contains a row that is too large,
or if the total size of the pages loaded for the queried columns is too large.
According to the error message it seems to be a memory error, but I have tried converting all characters in each text to an "a" (to see if the strings contained too many characters) and that worked fine, so I doubt that is the issue.
I have tried converting all characters to UTF-8 with df["text"] = df["text"].str.encode('utf-8') (since according to the docs they should be), but that failed. I have also tried replacing "\n" with " ", but that fails as well.
It seems like there are some values in my receipt text that cause trouble, but it's very difficult to figure out which (and since I have ~3 million rows, it takes a while to try each and every row at a time). Are there any values that are not allowed in a BigQuery table?
It turns out that chunksize in to_gbq does not split up the chunks in the way I thought it did. Manually looping over the dataframe in chunks like
CHUNKSIZE = 100_000
for i in range(0, df.shape[0] // CHUNKSIZE):
    print(i)
    df_temp = df.iloc[i * CHUNKSIZE:(i + 1) * CHUNKSIZE]
    df_temp.to_gbq(destination_table="Dataset.my_table",
                   project_id="my-project",
                   if_exists="append",
                   )
did the trick (setting chunksize=100_000 did not work)
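One caveat with that loop: integer division drops any trailing partial chunk, so rows past the last full CHUNKSIZE block would never be uploaded. A small variant (a sketch, same table/project names as above) that covers the remainder:
import math

CHUNKSIZE = 100_000
n_chunks = math.ceil(df.shape[0] / CHUNKSIZE)   # include the trailing partial chunk
for i in range(n_chunks):
    df_temp = df.iloc[i * CHUNKSIZE:(i + 1) * CHUNKSIZE]
    df_temp.to_gbq(destination_table="Dataset.my_table",
                   project_id="my-project",
                   if_exists="append")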

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF)?
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts (such as the one produced by CountVectorizer). The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the sklearn documentation for TfidfTransformer).
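As a side note, sklearn also provides TfidfVectorizer, which combines the counting and TF/IDF weighting steps in a single estimator. A minimal sketch, assuming the same excerpts Series as above:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(excerpts)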

Reading Fortran binary file in Python

I'm having trouble reading an unformatted F77 binary file in Python.
I've tried the scipy.io.FortranFile method and the numpy.fromfile method, both to no avail. I have also read the file in IDL, which works, so I have a benchmark for what the data should look like. I'm hoping that someone can point out a silly mistake on my part -- there's nothing better than having an idiot moment and then washing your hands of it...
The data, bcube1, has dimensions 101x101x101x3 and is of type real*8 (double precision). There are 3,090,903 entries in total. They are written using the following statement (not my code, copied from the source).
open (unit=21, file=bendnm, status='new'
. ,form='unformatted')
write (21) bcube1
close (unit=21)
I can successfully read it in IDL using the following (also not my code, copied from colleague):
bcube=dblarr(101,101,101,3)
openr,lun,'bcube.0000000',/get_lun,/f77_unformatted,/swap_if_little_endian
readu,lun,bcube
free_lun,lun
The returned data (bcube) is double precision, with dimensions 101x101x101x3, so the file's header information is aware of its dimensions (not flattened).
Now I try to get the same effect using Python, but no luck. I've tried the following methods.
In [30]: f = scipy.io.FortranFile('bcube.0000000', header_dtype='uint32')
In [31]: b = f.read_record(dtype='float64')
which returns the error Size obtained (3092150529) is not a multiple of the dtypes given (8). Changing the dtype changes the size obtained but it remains indivisible by 8.
Alternately, using fromfile results in no errors but returns one more value than is in the array (a footer perhaps?), and the individual array values are wildly wrong (they should all be of order unity).
In [38]: f = np.fromfile('bcube.0000000')
In [39]: f.shape
Out[39]: (3090904,)
In [42]: f
Out[42]: array([ -3.09179121e-030, 4.97284231e-020, -1.06514594e+299, ...,
8.97359707e-029, 6.79921640e-316, -1.79102266e-037])
I've tried using byteswap to see if this makes the floating point values more reasonable but it does not.
It seems to me that the np.fromfile method is very close to working, but there must be something wrong with the way it's reading the header information. Can anyone suggest how I can figure out what is in the header that allows IDL to know about the array dimensions and datatype? Is there a way to pass header information to fromfile so that it knows how to treat the leading entry?
I played around with it a bit, and I think I have an idea.
How Fortran stores unformatted data is not standardized, so you have to experiment a bit, but you need three pieces of information:
The format of the data. You suggest that it is 64-bit reals, or 'f8' in Python.
The type of the header. That is an unsigned integer, but you need its length in bytes; if unsure, try 4. The header usually stores the length of the record in bytes and is repeated at the end. Then again, it is not standardized, so no guarantees.
The endianness, little or big. Technically this applies to both header and values, but I assume they're the same. Python defaults to little endian, so if that were the correct setting for your data, I think you would have already solved it.
When you open the file with scipy.io.FortranFile, you need to give the data type of the header. So if the data is stored big-endian and you have a 4-byte unsigned integer header, you need this:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '>u4')
When you read the data, you need the data type of the values. Again, assuming big-endian, you want type >f8:
vals = ff.read_reals('>f8')
Look here for a description of the syntax of the data type.
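Put together for the file in the question, a sketch might look like this (assuming a big-endian file with 4-byte record markers; flip the '>' prefixes if that turns out to be wrong):
import numpy as np
from scipy.io import FortranFile

ff = FortranFile('bcube.0000000', 'r', '>u4')   # 4-byte big-endian record markers
flat = ff.read_reals('>f8')                     # 3,090,903 double-precision values
ff.close()

# Fortran writes arrays column-major, so reshape with order='F'
bcube = flat.reshape((101, 101, 101, 3), order='F')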
If you have control over the program that writes the data, I strongly suggest you write it as data streams (Fortran stream access), which can be more easily read by Python.
Fortran has record demarcations which are poorly documented, even in binary files.
So every write to an unformatted file:
integer*4 Test1
real*4 Matrix(3,3)
open(78, form='unformatted')
write(78) Test1
write(78) Matrix
close(78)
should ultimately be padded with np.int32 values, one before and one after each record. (I've seen references that these tell you the record length, but I haven't verified that personally.)
The above could be read in Python via numpy as:
input_file = open(file_location, 'rb')
datum = np.dtype([('P1', np.int32), ('Test1', np.int32), ('P2', np.int32),
                  ('P3', np.int32), ('MatrixT', (np.float32, (3, 3))), ('P4', np.int32)])
data = np.fromfile(input_file, datum)
Which should fully populate the data array with the individual data sets in the format above. Do note that numpy expects data to be packed in C order (row major), while Fortran data is column major. For square matrix shapes like the one above, this means getting the data out of the matrix requires a transpose before use. For non-square matrices, you will need to reshape and transpose:
Matrix = np.transpose(data[0]['MatrixT'])
Transposing your 4-D data structure will need to be done carefully. You might look into SciPy for automated ways to do so; the SciPy package has Fortran-related utilities which I have not fully explored.