It happens quite often that I'm adding a new line to a pipe that throws an error for one reason or another. Normally this isn't a problem: I fix my code and it works. However, when using pipes, I often have to re-run all the code that created the DataFrame up to the point where the pipe starts, because the pipe itself changes the DataFrame in such a way that it's no longer valid input.
For example:
df = CSV.read("myfile.csv", DataFrame)
# ----
# all kinds of code working on df
# ----
@pipe df |>
    rename!(_, "A ha" => :Aha) |> # This works fine the first time
    select!(_, :typo, :)         # throws an error for some reason

@pipe df |>
    rename!(_, "A ha" => :Aha) |> # Now this throws an error
    select!(_, :fixed_the_typo, :)
ArgumentError: Tried renaming :A ha to :Aha, when :A ha does not exist in the Index.
Is there a way to either make a pipeline atomic (either it all runs or nothing runs), or write my code in a way that prevents this problem?
I guess what I'm looking for is something like this:
@pipe df |>
    rename(_, "A ha" => :Aha) |>
    select(_, :typo, :) |>
    commit!(_)
The issue is that running each of the piped commands does an in-place modification of the DataFrame. If you instead do
@pipe df |>
    rename(_, "A ha" => :Aha) |>
    select(_, :typo, :)
(notice I omitted the exclamation marks), then instead of modifying the DataFrame df in place, each operation will create a new DataFrame to operate on.
For the exact behavior you asked for you could do
df = @pipe df |>
    rename(_, "A ha" => :Aha) |>
    select(_, :typo, :)
which assigns the result to df when it finishes.
Or, to create a new DataFrame only for the first operation, leave out its exclamation mark and keep the exclamation marks for all the rest of the pipe:
df = @pipe df |>
    rename(_, "A ha" => :Aha) |>
    select!(_, :typo, :)
Now, in the first operation, a new DataFrame is created, and the same DataFrame is operated on from then on. This will give you the best possible performance while doing what you asked.
from polygon import WebSocketClient
from polygon.websocket.models import WebSocketMessage
from typing import List
import pandas as pd

df = pd.DataFrame()
c = WebSocketClient(api_key='APIKEYHERE', feed='socket.polygon.io', market='crypto', subscriptions=["XT.BTC-USD"])

def handle_msg(msgs: List[WebSocketMessage]):
    global df
    df = df.append(msgs, ignore_index=True)
    print(df)

c.run(handle_msg)
I have a WebSocket client open through polygon.io. When I run this I get exactly what I want, but I also get a warning that frame.append is being deprecated and that I should use pandas.concat instead. Unfortunately, my little fragile brain has no idea how to do this.
I tried doing df = pd.concat(msgs, ignore_index=True) but get TypeError: cannot concatenate object of type '<class 'polygon.websocket.models.models.CryptoTrade'>';
Thanks for any help
To use pandas.concat instead of DataFrame.append, you need to convert the WebSocketMessage objects in the msgs list to a DataFrame and then concatenate them. Here's an example:
def handle_msg(msgs: List[WebSocketMessage]):
    global df
    msgs_df = pd.DataFrame([msg.to_dict() for msg in msgs])
    df = pd.concat([df, msgs_df], ignore_index=True)
    print(df)
This code converts each WebSocketMessage object in the msgs list to a dictionary using msg.to_dict() and then creates a DataFrame from the list of dictionaries. Finally, it concatenates this DataFrame with the existing df using pd.concat.
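If the message objects turn out not to have a to_dict() method, vars(msg) (or dataclasses.asdict for dataclass-based models) is a common fallback for turning an object's attributes into a dictionary. Purely as a sketch of the concat pattern itself, here is a minimal, self-contained example; the field names are invented and will not match the real CryptoTrade attributes:

import pandas as pd

df = pd.DataFrame()

def handle_batch(batch_of_dicts):
    # Append one batch of already-converted message dicts to the global df.
    global df
    batch_df = pd.DataFrame(batch_of_dicts)
    # Skip the concat for the very first batch so we never concatenate with an all-empty frame.
    df = batch_df if df.empty else pd.concat([df, batch_df], ignore_index=True)

# Invented sample data, purely for illustration.
handle_batch([{"pair": "BTC-USD", "price": 43000.5, "size": 0.01}])
handle_batch([{"pair": "BTC-USD", "price": 43001.0, "size": 0.25}])
print(df)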
I'm using Pingouin.jl to test normality.
In their docs, we have
dataset = Pingouin.read_dataset("mediation")
Pingouin.normality(dataset, method="jarque_bera")
This should return a DataFrame with normality true or false for each column in the dataset.
Currently, that broadcasting over the whole dataset is deprecated, and I'm unable to concatenate the per-column results into one DataFrame (calling normality on a single column works and outputs a DataFrame).
So, this is what I have so far:
function var_norm(df)
    norm = DataFrame([])
    for i in 1:length(names(df))
        push!(norm, Pingouin.normality(df[!, names(df)[i]], method="jarque_bera"))
    end
    return norm
end
The error I get:
julia> push!(norm, Pingouin.normality(df[!, names(df)[1]], method="jarque_bera"))
ERROR: ArgumentError: `push!` does not allow passing collections of type DataFrame to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` are allowed.
Stacktrace:
 [1] push!(df::DataFrame, row::DataFrame; promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1603
 [2] push!(df::DataFrame, row::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1601
 [3] top-level scope
   @ REPL[163]:1
EDIT: the push! call was not written properly in the first version of this post, but the error persists after the fix. How can I convert the DataFrame output from Pingouin into a DataFrameRow?
As Pingouin.normality returns a DataFrame, you will have to iterate over its rows and push them one by one:
df = Pingouin.normality(…)
for row in eachrow(df)
    push!(norms, row)
end
If you are sure Pingouin.normality returns a DataFrame with exactly one row, you can simply write
push!(norms, only(Pingouin.normality(…)))
I was wondering if you can subset a DataFrame like the one below based on the values of one of its columns (such as ids). You can use the equality operator, as in df2 below. However, if you want to subset based on a list like ids, I cannot find an operator that works: the .in operator does not seem to work with DataFrames. Is there another operator I could use?
df = DataFrame(ids = [1, 1000, 10000, 100000, 1, 2, 3, 4], B = [1, 2, 3, 4, 123, 6, 2, 7], D = ["N", "M", "I", "J", "hi", "CE", "M", "S"])
df2 = df[df[:pmid] .== 1000, :]

ids = [2, 3, 10000]
df3 = df[df[:pmid] .in ids, :]
As of right now df3 gives me a bounds error.
Also I am running this on Julia 0.6.4
I guess there's a typo in your first line: ids = should be pmid =, since you're filtering on that name later.
As for df3, the correct syntax should be (I tried on 1.0.2):
df3 = df[in.(df[:pmid], [ids]), :]
Note the added [] around ids: the second argument needs to be a vector of vectors so that ids is treated as a single collection when broadcasting.
I'd also like to point you to the DataFramesMeta.jl package, which provides much clearer syntax:
using DataFramesMeta
@where df (in.(:pmid, [ids]))
There was also quite an interesting discussion on discourse.julialang.org regarding syntax for filtering by a list, including performance tips.
This seems like a very fundamental question that would have been asked before, but I can't find an answer.
I have a dataframe. I want to do a groupby, then apply a function. I want the function to modify a column in the original dataframe. None of these options work:
import pandas as pd

df = pd.DataFrame(list(zip(list('abababa'), [1, 2, 3, 4, 1, 2, 3], [5, 4, 3, 2, 1, 2, 3])),
                  columns=["ab", "x", "y"])
print(df, "\n")

### Attempt #1
def change_y(tab):
    tab.y = tab.y.min()

df.groupby(df.ab).apply(change_y)

### Attempt #2
def change_y(tab):
    tab.loc[:, "y"] = tab.y.min()

df.groupby(df.ab).apply(change_y)

### Attempt #3
def change_y(tab):
    tab.at[:, "y"] = tab.y.min()

df.groupby(df.ab).apply(change_y)

### Attempt #4
def change_y(tab):
    tab.loc[tab.index, "y"] = tab.y.min()

df.groupby(df.ab).apply(change_y)
However, this works:
### Attempt #5 -- This one works
def change_y(big_tab, tab):
    big_tab.loc[tab.index, "y"] = tab.y.min()

df.groupby(df.ab).apply(lambda tab: change_y(df, tab))
print(df, "\n")
So, I understand why #5 works, but I don't understand why none of 1-4 works. Have I misunderstood groupby? I thought that it did not make a copy of the underlying dataframe, but merely constructed indices on the underlying dataframe and passed them to the function. In that case, it seems at least one of 1-4 should work!
Does groupby in fact make a copy of the dataframe for each group? It seems like this would be unnecessary and inefficient.
If it does make a copy, is there any other solution than #5? I do understand that I could simply have the function create a new Series and assign it at the end:
df.y = df.groupby(df.ab).y.transform(lambda tab: tab.min())
but for other reasons, that's not what I want to do in this case.
It appears that groupby does in fact make copies. I wanted to avoid this because I have a really giant dataframe. I did come up with this solution, which is as ugly as my dataframe is giant. I have to copy only one column of the dataframe. (It could be any column; I'm only interested in the index.)
Thanks to all who helped.
import pandas as pd

df = pd.DataFrame(list(zip(list('abababa'), [1, 2, 3, 4, 1, 2, 3], [5, 4, 3, 2, 1, 2, 3])),
                  columns=["ab", "x", "y"])
print(df, "\n")

def change_y(big_tab, ab_tab):
    indx = ab_tab.index
    tab = big_tab.loc[indx]  # I can do all but the update with tab.
    big_tab.loc[indx, "y"] = tab.y.min()

df.ab.groupby(df.ab).apply(lambda ab_tab: change_y(df, ab_tab))
print(df, "\n")
I am pretty new to Pandas and trying to find out where my code breaks. Say, I am doing a type conversion:
df['x'] = df['x'].astype('int')
...and I get the error "ValueError: invalid literal for long() with base 10: '1.0692e+06'".
In general, if I have 1000 entries in the dataframe, how can I find out which entry causes the break? Is there anything in ipdb to output the current location (i.e., where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.
The error you are seeing might be due to the value(s) in the x column being strings:
In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
In [16]: df['x'].astype('int')
ValueError: invalid literal for long() with base 10: '1.0692e+06'
Ideally, the problem can be avoided by making sure the values stored in the DataFrame are already ints, not strings, when the DataFrame is built. How to do that depends, of course, on how you are building the DataFrame.
After the fact, the DataFrame could be fixed using applymap:
import ast
df = df.applymap(ast.literal_eval).astype('int')
but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
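For example, a small self-contained run of the same fix (the sample values here are made up):

import ast
import pandas as pd

df = pd.DataFrame({'x': ['1.0692e+06', '2']})

# literal_eval turns each string into the number it spells out;
# astype('int') then converts the resulting values to integers.
fixed = df.applymap(ast.literal_eval).astype('int')
print(fixed)  # x becomes 1069200 and 2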
Usually you could drop into a debugger when an exception is raised and inspect the problematic value.
However, in this case the exception happens inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code does the looping through the values in df['x'], so the Python debugger is not helpful here -- it won't let you introspect which value inside the C-compiled loop the exception is being raised for.
There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.
So instead I would revert to a low-brow solution: iterate through the values in a Python loop and use try...except to catch the first error:
df = pd.DataFrame({'x':['1.0692e+06']})
for i, item in enumerate(df['x']):
    try:
        int(item)
    except ValueError:
        print('ERROR at index {}: {!r}'.format(i, item))
yields
ERROR at index 0: '1.0692e+06'
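The same idea can be wrapped into a small reusable helper that collects every failing value keyed by its index. This is only a sketch generalizing the loop above; find_failures is not an existing pandas function:

import pandas as pd

def find_failures(series, converter=int):
    # Return the values that raise when passed to `converter`, keyed by index.
    bad = {}
    for idx, value in series.items():
        try:
            converter(value)
        except (ValueError, TypeError):
            bad[idx] = value
    return pd.Series(bad, dtype=object)

df = pd.DataFrame({'x': ['5', '1.0692e+06', '7']})
print(find_failures(df['x']))  # index 1 -> '1.0692e+06'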
I hit the same problem, and as I have a big input file (3 million rows), enumerating all rows would take a long time. Therefore I wrote a binary search to locate the offending row.
import pandas as pd
import sys

def binarySearch(df, l, r, func):
    while l <= r:
        mid = l + (r - l) // 2
        result = func(df, mid, mid + 1)
        if result:
            # Check if we hit the exception at mid
            return mid, result
        result = func(df, l, mid)
        if result is None:
            # If no exception in the left half, ignore the left half
            l = mid + 1
        else:
            r = mid - 1
    # If we reach here, no offending element was found
    return -1, None

def check(df, start, end):
    result = None
    try:
        # In my case, I want to find out which row causes this failure
        df.iloc[start:end].uid.astype(int)
    except Exception as e:
        result = str(e)
    return result
df = pd.read_csv(sys.argv[1])
index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))
print(result)
To report all rows that fail to map due to any exception:
df.apply(my_function, axis=1)  # throws various exceptions at unknown rows

# print the exception, index, and row content for every failing row
for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row))
        print(e)
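And a self-contained toy run of the same pattern, with a stand-in my_function (both the function and the data are invented for illustration):

import pandas as pd

df = pd.DataFrame({'x': ['5', 'oops', '7']})

def my_function(row):
    # Stand-in for the real per-row function; fails on non-numeric strings.
    return int(row['x']) * 2

for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row['x']))
        print(e)
# Error at index 1: 'oops'
# invalid literal for int() with base 10: 'oops'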