Writing csv file with CSV.jl throws StackOverflowError message - dataframe

CSV.write() returns ERROR: LoadError: StackOverflowError:, and further down claims --- the last 2 lines are repeated 39990 more times ---, which is nonsense since the data frame to be written is acknowledged to be of dimension 129x6 with no empty fields. I fear the function is counting rows beyond the dimension of the data frame and shouldn't be doing so. How do I make this function work?
using CSV,DataFrames
file=DataFrame(
a=1:50,
b=1:50,
c=Vector{Any}(missing,50),
d=Vector{Any}(missing,50))
CSV.write("a.csv",file)
file=CSV.read("a.csv",DataFrame)
c,d=[],[]
for i in 1:nrow(file)
    push!(c, file.a[i]*file.b[i])
    push!(d, file.a[i]*file.b[i]/file.a[i])
end
file[!,:c]=c
file[!,:d]=d
# I found no straightforward way to fill empty columns based on the values of filled columns in a for loop that wouldn't generate error messages
CSV.write("a.csv",DataFrame) # Error in question here

The StackOverflowError is indeed due to the line
CSV.write("a.csv",DataFrame)
which tries to write the data type itself to a file instead of the actual dataframe variable. (The reason that leads to this particular error is an implementation detail that's described here.)
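So the fix is simply to write the actual variable:
CSV.write("a.csv", file)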
# I found no straightforward way to fill empty columns based on the values of filled columns in a for loop that wouldn't generate error messages
You don't need a for loop for this, just
file.c = file.a .* file.b
file.d = file.a .* file.b ./ file.a
will do the job. (Though I don't understand the point of the file.d calculation - maybe it is just a dummy calculation chosen for illustration.)
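For completeness, a sketch of the whole read-modify-write round trip with both fixes applied (same file name and columns as in the question):
using CSV, DataFrames

file = CSV.read("a.csv", DataFrame)
file.c = file.a .* file.b            # broadcast instead of the for loop
file.d = file.a .* file.b ./ file.a
CSV.write("a.csv", file)             # write the variable, not the DataFrame type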

Related

Which characters are allowed in a BigQuery STRING column (getting "UDF out of memory" error)

I have a dataframe containing receipt data. The column text in my dataframe contains the text from the receipts and seems to be the issue when I try to upload the data to BigQuery using df.to_gbq(...), since it produces the error
GenericGBQException: Reason: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file /some/file.
This might happen if the file contains a row that is too large,
or if the total size of the pages loaded for the queried columns is too large.
According to the error message it seems to be a memory error, but I have tried converting every character in each text to an "a" (to see if the strings contained too many characters) and that worked fine, i.e. I doubt it is that.
I have tried converting all characters to utf8 with
df["text"] = df["text"].str.encode('utf-8') (since according to the docs they should be so), but that failed. I have tried replacing "\n" with " ", but that fails as well.
It seems like there are some values in my receipt text that cause trouble, but it's very difficult to figure out what (and since I have ~3 million rows, it takes a while to try each row one at a time) - are there any values that are not allowed in a BigQuery table?
It turns out that chunksize in to_gbq does not split up the chunks in the way I thought it did. Manually looping over the dataframe in chunks like
CHUNKSIZE = 100_000
# ceil-division so the final partial chunk is not dropped
for i in range(0, -(-df.shape[0] // CHUNKSIZE)):
    print(i)
    df_temp = df.iloc[i*CHUNKSIZE:(i+1)*CHUNKSIZE]
    df_temp.to_gbq(destination_table="Dataset.my_table",
                   project_id="my-project",
                   if_exists="append",
                   )
did the trick (setting chunksize=100_000 did not work)

ftell/fseek fail when near end of file

Reading a text file (which happens to be a PDS member, FB 80)
hFile = fopen(filename,"r");
I have reached the point in the file where there is only an empty line left.
FilePos = ftell(hFile);
Then read the last line, which only contains a '\n' character.
fseek(hFile, FilePos, SEEK_SET);
fails with:
errno=(27) EDC5027I The position specified to fseek() was invalid.
The position specified to fseek() was returned by ftell() a few lines earlier; it has the value 841 in the specific error case I have seen. Checking in the debugger confirms this is indeed the value ftell() returned - it has not been corrupted.
The same code works at other positions in the file, and only fails at the point where there is a single empty line left to read when the position is remembered.
My understanding of how ftell/fseek should work is succinctly captured by another answer on SO.
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
It would seem that I cannot rely on the one thing I should be able to rely on.
My question is: why does fseek fail in this way?
As z/OS has some unique file formats, you might find the answer in this Knowledge Center article.
Given that you are processing a PDS member I would suspect that this is record level I/O which is handled differently than stream I/O which is more common in distributed implementations.
I do not know why fseek fails in this way, but if your common usage pattern is to use ftell to get the position and then fseek to go to that position, I strongly suggest using fgetpos and fsetpos instead for data set I/O. Not only will you avoid this problem that you are finding, but it is also better performing for certain data set characteristics.

How to assign Pandas.Series.str.extractall() result back to original dataset? (TypeError: incompatible index of inserted column with frame index)

Dataset brief overview
dete_resignations['cease_date'].head() and dete_resignations['cease_date'].value_counts() give the output shown in the original post's screenshots: values like 05/2012, with some rows not in that format.
What I tried
I was trying to extract only the year (e.g. 05/2012 -> 2012) from dete_resignations['cease_date'] using Pandas.Series.str.extractall() and assign the result back to the original dataframe. However, since not all the rows contain that specific string pattern (e.g. 05/2012), an error occurred.
Here is the code I wrote.
pattern = r"(?P<month>[0-1][0-9])/?(?P<year>[0-2][0-9]{3})"
years = dete_resignations['cease_date'].str.extractall(pattern)
dete_resignations['cease_date_'] = years['year']
'TypeError: incompatible index of inserted column with frame index'
I thought years shared the same index as dete_resignations['cease_date']. Therefore, even though the two indexes are not identical, I expected pandas to automatically match and assign the values to the right rows. But it didn't.
Can anyone help solve this issue?
Much appreciated if someone can enlighten me!
If you only want the years, then don't catch the month in pattern, and you can use extract instead of extractall:
# the $ indicates end of string
# \d is equivalent to [0-9]
# pattern extracts the last digit groups
pattern = r'(?P<year>\d+)$'
years = dete_resignations['cease_date'].str.extract(pattern)
dete_resignations['cease_date_'] = years['year']

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
using DataFrames

function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
This appends a vector to one column of df. Note: in my whole code, I append a more complicated Vector{Int} than what ones returns.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
which, I suppose, means this isn't efficient. I can't figure out what this @code_warntype warning means. More generally, how can I understand the warnings returned by @code_warntype and fix them? This is a recurring, unclear issue for me.
EDIT: in response to @BogumiłKamiński's answer
How would one then write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a vector of length nb_simulations.
You should never do such things, as it will cause many functions from DataFrames.jl to stop working properly. Actually, such code will soon throw an error; see https://github.com/JuliaData/DataFrames.jl/issues/1844, which is exactly about patching this hole in the DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you have type instability is that a DataFrame can hold vectors of any type (technically, columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
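One common way to limit the impact of this kind of instability is a function barrier: fetch the column once and pass it to an inner function, so that inside the function the compiler knows the concrete element type. A minimal sketch (names purely illustrative):
using DataFrames

# inside this function the vector has a concrete type, so work on it is type-stable
sum_column(v::AbstractVector) = sum(v)

df = DataFrame(NbAlternative = [1, 2, 3])
sum_column(df.NbAlternative)  # only this call site is type-unstable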
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 in nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative = ones(Int, nb_simulations)*na,
                     NbMonteCarlo = ones(Int, nb_simulations)*mt,
                     Similarity = compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different for push!, which also allows pushing a row that does not specify column names (in my example above I show that a Tuple can be pushed).
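For illustration, a minimal sketch of both forms with the column names from the example above:
df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
push!(df, (1, 10, 0.5))                                               # positional Tuple, no column names needed
push!(df, (NbAlternative = 2, NbMonteCarlo = 20, Similarity = 0.7))   # NamedTuple, matched by column name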

Any idea about this weird type error in Pentaho Data Integration?

I have this:
Insertion des données dans table some_table.0 - SOME_AUTO_GENERATED_DB_KEY Integer : There was a data type error: the data type of java.lang.Boolean object [true] does not correspond to value meta [Integer]
What boolean??? Where do you see a boolean? I added a trace write to the step just before this failing insert step, and I see a perfectly fine integer as the value of SOME_AUTO_GENERATED_DB_KEY.
How can this be possible? I am very new to Kettle; any ideas or tips would be awesome.
Here is a screenshot of the transformation:
Just before the failed insert, you have a filter that splits the stream. On one half of the stream, it looks like you have an Add Constant step. If I'm reading this right, then the two inputs to the Insert step don't have the same fields in the same order. A few steps earlier, there is a similar splitting of the paths that goes off to the right, which could have the same effect.
Whenever you remerge streams like this without being very careful, strange errors like this can pop up. Pentaho usually tries to warn you when you create the hop to remerge the streams, but there are ways to miss that warning.
Suggestion: For each time the stream remerges, right-click on each of the two previous steps, and have it show you the output fields. Compare the two lists side-by-side to verify they are the same. If not, then you will have to add or remove fields as appropriate to make them the same on both sides.