Filtering a DataFrame with a static date value

I am trying to filter a DataFrame to get all dates greater than '2012-09-15'.
I tried the solution from another post, which suggested using
data.filter(data("date").lt(lit("2015-03-14")))
but I am getting an error:
TypeError: 'DataFrame' object is not callable
What is the solution for this?

You need square brackets around "date", i.e.
data.filter(data["date"] < lit("2015-03-14"))
Calling data("date") is treating data as a function (rather than a dataframe)

Related

Error tokenizing data. C error: EOF inside string starting at row 148480 with groupby function (dask)

I'm working with a dask dataframe, and when I apply the groupby method I get the following ParserError:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 148480
I'm pretty new to Python, and I don't see how to fix a parsing error that occurs when using a method.
here's my code :
df1 = df.groupby('date')[['compound score']].mean().compute()
where date (string) and compound score (float) are two columns of df (a dask dataframe)
Each date appears in many rows of the dataframe, that's why I want to use the groupby method.
I was expecting df1 to be a new dask dataframe with only 2 columns, the date and the mean of compound score for each date. Instead I get the parsing error.
I see many people are having this issue with pandas.read_csv(), but none with the groupby method.
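A likely explanation: dask reads the CSV lazily, so the file is only actually parsed when .compute() runs; the error therefore comes from the underlying read_csv, not from groupby. One common cause of "EOF inside string" with dask is a partition boundary landing inside a quoted field. A sketch of a possible workaround, with a hypothetical file path:
import dask.dataframe as dd

# blocksize=None reads the file as a single partition, so dask cannot
# split it in the middle of a quoted field. (If stray quote characters
# are the real culprit, pandas-style keywords such as quoting can also
# be passed through dd.read_csv.)
df = dd.read_csv("data.csv", blocksize=None)
df1 = df.groupby('date')[['compound score']].mean().compute()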

Getting DataFrame's Column value results in 'Column' object is not callable

For a stream read from FileStore, I'm trying to check whether the value of the first column of the first row equals some string. Unfortunately, however I access this column, e.g. by calling .toList() on it, it throws
if df["Name"].iloc[0].item() == "Bob":
TypeError: 'Column' object is not callable
I'm calling the customProcessing function from:
df.writeStream\
.format("delta")\
.foreachBatch(customProcessing)\
[...]
And inside this function I'm trying to get the value, but none of the ways of getting the data work; the same error is thrown.
def customProcessing(df, epochId):
if df["Name"].iloc[0].item() == "Bob":
[...]
Is there a way to read single columns? Or is this writeStream-specific, so that I'm unable to use conditions on that input?
There is no iloc for Spark DataFrames; this is not pandas, and there is no concept of a positional index.
If you want to get the first item you could try
df.select('Name').limit(1).collect()[0][0] == "Bob"
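Put together, a minimal sketch of the batch handler, under these assumptions: the micro-batch contains a "Name" column, and taking "the first row" without an explicit orderBy is acceptable (Spark does not guarantee row order otherwise):
def customProcessing(batch_df, epochId):
    # collect() returns a list of Rows; guard against an empty batch.
    first = batch_df.select("Name").limit(1).collect()
    if first and first[0][0] == "Bob":
        # [...] handle the matching batch here
        pass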

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (which maps sets of words into count vectors) based on a pandas Series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF).
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its input. Instead, it takes a matrix of token counts with shape (n_samples, n_features), such as the sparse matrix produced by CountVectorizer. (The object dtype was a red herring: object is the normal pandas dtype for strings.) The correct usage is more like:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Turn the raw strings into a count matrix, then re-weight with TF-IDF.
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the scikit-learn documentation for TfidfTransformer).
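As a side note, scikit-learn also provides TfidfVectorizer, which combines both steps; a minimal sketch, assuming excerpts is the same Series of raw strings:
from sklearn.feature_extraction.text import TfidfVectorizer

# One step: tokenize, count, and TF-IDF-weight the raw strings.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(excerpts)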

pandas questions about argmin and timestamp

final_month = pd.Timestamp('2018-02-01')
df_final_month = df[df['week'] >= final_month]
df_final_month.iloc[:, 1:].sum().argmax()
index = df.set_index('week')
index['storeC'].argmin()
The code above is correct; I just don't exactly understand how it works inside. I have some questions:
1. The type of week is datetime. Is the reason for setting final_month as a Timestamp that datetime is almost the same as Timestamp, so they recognise each other in Python?
2. About argmax() and argmin(): for df_final_month.iloc[:, 1:].sum().argmax(), I removed sum() and tried df_final_month.iloc[:, 1:].argmax(), which returns
`AttributeError: 'DataFrame' object has no attribute 'argmax'`
Why is that? Why doesn't the second snippet need a max() or something before calling argmin()? What is the requirement for using argmin()/argmax()?
Please explain the details of how Python or pandas handle these data; the more detail the better.
Thanks! I am new to Python.
Is Timestamp almost the same as datetime?
Here is quote from pandas documentation itself:
Timestamp is the pandas equivalent of python's Datetime and is interchangeable with it in most cases
In fact, if you look at the source code of pandas, you will see that Timestamp actually inherits from datetime. Here is code to check that these statements are true:
import datetime
import pandas as pd

dt = datetime.datetime(2018, 1, 1)
ts = pd.Timestamp('2018-01-01')
dt == ts                           # True
isinstance(ts, datetime.datetime)  # True
Why does calling the argmax method on a DataFrame, without calling sum, throw an error?
Because the DataFrame object doesn't have an argmax method; only Series does. And sum, in your case, returns a Series instance.
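A small sketch of that chain, with made-up store columns, may make this concrete:
import pandas as pd

df = pd.DataFrame({
    'week': pd.to_datetime(['2018-02-05', '2018-02-12']),
    'storeA': [10, 20],
    'storeB': [30, 5],
})

totals = df.iloc[:, 1:].sum()  # a Series: storeA -> 30, storeB -> 35
totals.idxmax()                # 'storeB' (the label of the maximum)
# In recent pandas, Series.argmax() returns the integer position instead,
# and DataFrame has idxmax but no argmax at all.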

What's the difference between .col and ['col'] in pandas

I've been using pandas for a little while now and I've realised that I use
df.col
df['col']
interchangeably. Are they actually the same or am I missing something?
Following on from the link in the comments.
df.col
Simply refers to an attribute of the dataframe, similar to say
df.shape
Now, if 'col' is a column name in the dataframe, then accessing this attribute returns the column as a Series. This will sometimes be sufficient, but
df['col']
will always work, and can also be used to add a new column to a dataframe.
I think this is kind of obvious, but worth stating: you cannot use df.col if the column name 'col' has a space in it. df['col'], however, always works.
e.g., df['my col'] works, but df.my col will not work.
I'll note that there's a difference in how some methods consume data. For example, in the LifeTimes library, if I use dataframe.col with some methods, the method will treat the column as an ndarray and throw an exception saying the data must be 1-dimensional. If, however, I use dataframe['col'], the method consumes the data as expected.
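To summarize the practical differences, a short demonstration sketch (the column names here are made up):
import pandas as pd

df = pd.DataFrame({'col': [1, 2], 'my col': [3, 4], 'count': [5, 6]})

df.col        # works: returns the 'col' column as a Series
df['my col']  # brackets handle spaces; df.my col is a SyntaxError
df['count']   # the column; df.count is the DataFrame method, not the column

# Attribute assignment does not create a new column (pandas only sets an
# instance attribute and emits a UserWarning); bracket assignment does.
df['new'] = df['col'] * 2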