I am trying to read from a file where each line contains an integer. But when I write it like this:
f=open("data.txt")
a=readline(f)
arr=int64[]
push!(arr,int(a))
I am getting
ERROR: no method getindex(Function)
in include_from_node1 at loading.jl:120
The error comes from int64[]: int64 is a function, and you are trying to index it with []. To create an array of Int64 (note the case), you should use, e.g., arr = Int64[].
Another problem in your code is int(a): since you have an array of Int64, you should also specify the same type when parsing, e.g., push!(arr, parseint(Int64, a)).
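Putting both fixes together, a minimal sketch of the corrected loop (note that parseint belongs to older Julia versions; on current Julia it has been replaced by parse):
arr = Int64[]                # typed empty array (capital I)
open("data.txt") do f
    for line in eachline(f)
        # parse each line as an Int64; on old Julia use parseint(Int64, line)
        push!(arr, parse(Int64, line))
    end
end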
I am using this line of code, which uses the replace method to combine low-frequency values in the column:
psdf['method_name'] = psdf['method_name'].replace(small_categoris, 'Other')
The error I am getting is:
'to_replace' should be one of str, list, tuple, dict, int, float
So I tried running this line of code before the replace method:
psdf['layer'] = psdf['layer'].astype("string")
Now the column is of type string, but the same error still appears. For context, I am working with the pandas API on Spark. Also, is there a more efficient way than replace, especially if we want to do the same for more than one column?
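One way to sidestep replace entirely is isin plus mask, which also extends naturally to several columns. This is a hedged sketch, not from the original thread, assuming small_categoris is list-like and psdf is the pandas-on-Spark dataframe from the question:
# hypothetical list of columns that need the same treatment
cols = ['method_name', 'layer']
for c in cols:
    # mask() replaces the values where the condition is True
    psdf[c] = psdf[c].mask(psdf[c].isin(list(small_categoris)), 'Other')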
For a stream read from FileStore, I'm trying to check whether the first column of the first row equals some string. Unfortunately, however I access this column, e.g., calling .toList() on it, it throws
if df["Name"].iloc[0].item() == "Bob":
TypeError: 'Column' object is not callable
I'm calling the customProcessing function from:
df.writeStream\
.format("delta")\
.foreachBatch(customProcessing)\
[...]
And inside this function I'm trying to get the value, but none of the ways of getting the data work; the same error is thrown every time.
def customProcessing(df, epochId):
if df["Name"].iloc[0].item() == "Bob":
[...]
Is there a way to read single columns? Or is this specific to writeStream, so that I can't use conditions on that input?
There is no iloc for Spark DataFrames (this is not pandas), and there is no concept of a row index either.
If you want to get the first item you could try
df.select('Name').limit(1).collect()[0][0] == "Bob"
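In context, a minimal sketch of how that check might sit inside the foreachBatch callback (the column name and value are just the ones from the question; a micro-batch may be empty, so guard for that):
def customProcessing(df, epochId):
    rows = df.select('Name').limit(1).collect()  # list of Row objects
    if rows and rows[0][0] == "Bob":             # guard against an empty micro-batch
        ...  # handle the matching batch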
I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. I reasoned that something might be inferring the type of that Series to be float, so I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object.
I tried casting the entire dataframe that includes "excerpts" to dtype str using pandas.DataFrame.astype(), but the "excerpts" column stubbornly keeps dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way to see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF)?
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts, of shape (n_samples, n_features), such as the one produced by CountVectorizer. The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform( excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts_token_counts )
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the sklearn documentation for TfidfTransformer).
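As a footnote, sklearn also ships TfidfVectorizer, which folds both steps into one; a minimal equivalent sketch:
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer = CountVectorizer + TfidfTransformer in a single step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(excerpts)  # sparse (n_samples, n_features)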
I'm an R user with a great interest in Julia. I don't have a computer science background. I just tried to read a csv file in Juno with the following command:
using CSV
using DataFrames
df = CSV.read(joinpath(Pkg.dir("DataFrames"),
"path/to/database.csv"));
and got the following error message
CSV.CSVError("error parsing a 'Int64' value on column 26, row 289; encountered '.'")
in read at CSV/src/Source.jl:294
in #read#29 at CSV/src/Source.jl:299
in stream! at DataStreams/src/DataStreams.jl:145
in stream!#5 at DataStreams/src/DataStreams.jl:151
in stream! at DataStreams/src/DataStreams.jl:187
in streamto! at DataStreams/src/DataStreams.jl:173
in streamfrom at CSV/src/Source.jl:195
in parsefield at CSV/src/parsefield.jl:107
in parsefield at CSV/src/parsefield.jl:127
in checknullend at CSV/src/parsefield.jl:56
I looked at the entries indicated in the data frame: rows 287 and 288 contain 30 and 33 respectively (which seem to be of type Integer), and row 289 contains 30.445 (which is a float).
Is the problem that DataFrames was filling the column with Ints and stopped when it saw a Float?
Many thanks in advance
The problem is that the first float appears too late in the data set. By default CSV.jl uses a rows_for_type_detect value equal to 100, which means that only the first 100 rows are used to determine the type of each column in the output. Set the rows_for_type_detect keyword parameter in CSV.read to e.g. 300 and all should work correctly.
Alternatively you can pass types keyword argument to manually set column type (in this case Float64 for this column would be appropriate).
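Both fixes in a hedged sketch (the keyword names follow the CSV.jl version used in the question; the path and column index are just the ones reported there):
using CSV

# Option 1: inspect more rows before deciding on column types
df = CSV.read("path/to/database.csv"; rows_for_type_detect = 300)

# Option 2: pin the offending column's type explicitly
df = CSV.read("path/to/database.csv"; types = Dict(26 => Float64))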
I am trying to convert a column in a dataframe from float to string. I have tried
df = readtable("data.csv", coltypes = {String, String, String, String, String, Float64, Float64, String});
but I got the complaint
syntax: { } vector syntax is discontinued
I also have tried
dfB[:serial] = string(dfB[:serial])
but it didn't work either. So I'd like to know what would be the proper approach to change a column's data type in Julia.
thx
On your first attempt, Julia tells you what the problem is - you can't make a vector with {}, you need to use []. Also, the name of the keyword argument should be eltypes rather than coltypes.
On the second try, you don't have a float, you have a Vector of floats, so to change the type you need to convert every element. In Julia, elementwise operations on vectors are generalized by the 'dot' broadcasting syntax, e.g. string.(collect(dfB[:serial])). The collect is currently needed to turn the DataArray into a normal Array first, and it will fail if the DataArray contains NAs. IMHO the DataFrames interface is still rather wonky, so expect a few headaches like this for now.
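Put together, a hedged sketch of both fixes (readtable and DataArrays belong to the older DataFrames.jl API that the question uses):
using DataFrames

# [] instead of {}, and eltypes instead of coltypes
df = readtable("data.csv",
               eltypes = [String, String, String, String, String,
                          Float64, Float64, String])

# Broadcast string() over the column; collect() first converts the
# DataArray to a plain Array (this fails if the column contains NAs)
dfB[:serial] = string.(collect(dfB[:serial]))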