Indexing in Function, Pandas - pandas

I work with pandas dataframes and want to write a function that slices them by index.
def func(L, ra, N):
    df_N = df[T_N].diff()
    df[T_N] = df[T_N].iloc[df_N.index.min():df_N.index.max()]
    df[T_N] = df[T_N].mean()
    Value = [L, df[TR[0]].iloc[1] - (df[T_N[0]].iloc[1] - df[T_N[0]].iloc[1]), 1]
    return Value
The line
df[T_N] = df[T_N].iloc[df_N.index.min():df_N.index.max()]
raises the error
TypeError: cannot do positional indexing on Int64Index with these indexers [nan] of type float
Is there a way to avoid this, and is it even possible to slice like that? Similar indexing works on other lines; only this one fails.
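A minimal sketch of one way to sidestep the error, assuming the NaN comes from taking min() or max() of an empty index (one likely cause, not confirmed by the question). The helper name slice_by_index_range and the sample data are hypothetical. Note that .iloc expects integer positions, while index.min() and index.max() return labels, so the sketch uses label-based .loc and guards against the empty case.

```python
import pandas as pd


def slice_by_index_range(frame, columns):
    """Return frame[columns] restricted to the label range covered by its diff()."""
    # diff() leaves NaN in the first row; dropna() removes it
    diffs = frame[columns].diff().dropna()
    if diffs.empty:
        # Nothing to slice: return an empty frame instead of indexing with NaN
        return frame[columns].iloc[0:0]
    # .loc slices by label and includes both end points
    return frame[columns].loc[diffs.index.min():diffs.index.max()]


sample = pd.DataFrame({"a": [1.0, 3.0, 6.0]})  # made-up data
sliced = slice_by_index_range(sample, ["a"])

empty = pd.DataFrame({"a": pd.Series([], dtype="float64")})
sliced_empty = slice_by_index_range(empty, ["a"])
```

Passing the frame and the column list as arguments, rather than reading df and T_N from the enclosing scope, also makes it easier to check what the function actually receives.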

Related

Can't convert Matrix to DataFrame JULIA

How can I convert a matrix to a DataFrame in Julia?
I have a 10×2 Matrix{Any}, and when I try to convert it to a dataframe using this:
df2 = convert(DataFrame,Xt2)
I get this error:
MethodError: Cannot `convert` an object of type Matrix{Any} to an object of type DataFrame
Try instead
df2 = DataFrame(Xt2,:auto)
You cannot use convert for this; you can use the DataFrame constructor, but then as the documentation (simply type ? DataFrame in the Julia REPL) will tell you, you need to either provide a vector of column names, or :auto to auto-generate column names.
Tangentially, I would also strongly recommend avoiding Matrix{Any} (or really anything involving Any) for any scenario where performance is at all important.

Writing data frame with object dtype to HDF5 only works after converting to string

I have a large dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. What is the correct way to cope with this?
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads to the following results.
Filling the column with numbers only gives the data type int32, and the data is stored without problems.
Setting one element to abc changes the data type to object, but to_hdf seems to infer another data type internally and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
What is happening in the second case, and is there a way to prevent it? The only way I found was to go through all columns, check whether they are dtype('O'), and explicitly convert them to str.
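The workaround described above (checking each column for dtype('O') and converting it to str) could be sketched as follows. The helper name stringify_object_columns and the sample values are hypothetical, and the to_hdf call is left as a comment because it needs the optional PyTables dependency.

```python
import pandas as pd


def stringify_object_columns(frame):
    """Return a copy in which every object-dtype column holds only strings."""
    out = frame.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # Mixed int/str columns end up as object; make them uniformly str
            out[col] = out[col].astype(str)
    return out


df = pd.DataFrame({"a": [10000001, "abc", 10000003], "b": [1.0, 2.0, 3.0]})
clean = stringify_object_columns(df)

# Assumed usage, requires PyTables:
# clean.to_hdf("df.hdf5", key="data", format="table")
```

The numbers in the converted column come back as strings after loading, so they would need to be parsed again if numeric values are required.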
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

DataFrame.quantile() function doesn't work when used with the apply function

I wrote a function to get the middle part of a DataFrame, but the quantile function doesn't work as expected, so I can't get the part of the dataframe I want.
I'm using pandas 0.23.4.
Here is the function body.
def func(df, key_name, tail):
    df_new = df[df[key_name] > df[key_name].quantile(tail)]
    return df_new
It works well when I call it directly:
func(clean_video,'all_vv',0.1)
But when I apply it like this:
clean_video.apply(func,axis = 1,args = ('all_vv',0.1))
I get the error below.
AttributeError: ("'int' object has no attribute 'quantile'", 'occurred at index 0')
Thanks for all help.
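A short sketch of why the two calls differ, using made-up data in place of clean_video: with axis=1, apply passes one row at a time to func, so df[key_name] inside the function is a single value (here an int) rather than a column, and a single value has no quantile method. Calling the function on the whole frame, or through DataFrame.pipe, keeps the column intact.

```python
import pandas as pd


def func(df, key_name, tail):
    # Keep the rows whose value lies above the given quantile of the column
    return df[df[key_name] > df[key_name].quantile(tail)]


clean_video = pd.DataFrame({"all_vv": [1, 5, 10, 50, 100]})  # hypothetical data

# Direct call: func receives the whole DataFrame
upper = func(clean_video, "all_vv", 0.1)

# Equivalent call through pipe, which also passes the whole DataFrame
piped = clean_video.pipe(func, "all_vv", 0.1)
```

To get a middle part, the same comparison can be combined with an upper bound such as df[key_name].quantile(1 - tail).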

Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes I found out the fit_transform function returns a scipy.sparse.csr.csr_matrix and that is not callable.
Is there another way to do this?
Thanks!
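There are a number of problems with your code. DataFrame.apply expects a callable as its first argument, but you pass it the matrix returned by fit_transform, hence the error. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects.
fit_transform returns a sparse csr matrix with one row per document.
.toarray() converts it to a dense NumPy array, from which pd.DataFrame(...) creates a DataFrame.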

DataFrame.apply(func, raw=True) doesn't seem to take effect?

I am trying to hash together only a few columns of my dataframe df, so I do
temp = df[['field1', 'field2']]
df["hash"] = temp.apply(lambda x: hash(x), raw=True, axis=1)
I set raw to True because the docs (I am using 0.22) say it will pass a numpy array instead of a mutable Series. But even with raw=True I am getting a Series. Why?
  File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
  File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/frame.py", line 4973, in _apply_standard
    results[i] = func(v)
  File "/home/teto/mptcpanalyzer/mptcpanalyzer/data.py", line 190, in _hash_row
    return hash(x)
  File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/generic.py", line 1045, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 1')
It's strange, as I can't reproduce your exact error (that is, for me, raw=True does result in an np.ndarray being passed). In any case, neither a Series nor an np.ndarray is hashable. The following works, though:
temp.apply(lambda x: hash(tuple(x)), axis=1)
A tuple is hashable.
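A self-contained sketch of the tuple approach, with made-up data; the column names field1 and field2 come from the question, the rest is assumed. As a possible alternative, pandas also provides pd.util.hash_pandas_object, which computes one hash per row without a Python-level lambda.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "field1": [1, 2, 1],
        "field2": ["a", "b", "a"],
        "other": [0.1, 0.2, 0.3],  # not part of the hash
    }
)

# Select the columns to hash with double brackets, giving a DataFrame
temp = df[["field1", "field2"]]

# Each row is converted to a tuple, which is immutable and therefore hashable
df["hash"] = temp.apply(lambda row: hash(tuple(row)), axis=1)

# Alternative: vectorized row hashes; index=False leaves the index out of the hash
df["hash_pandas"] = pd.util.hash_pandas_object(temp, index=False)
```

Python's built-in hash of strings is randomized per process unless PYTHONHASHSEED is set, so the first column should not be compared across runs.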