Good afternoon,
I have a few .csv files to be transformed into pandas DataFrames. Although they contain the same type of data in the same columns, they have different column names. I am trying to do all the small transformations on the fly to be able to concatenate the DataFrames all at once. The problem I am having is that as far as I know there is not way to access the attributes of the DataFrame "on the fry", first you assign it to a variable and then access the data. In the following way:
df = pd.read_csv("my_csv.csv")
df = df.rename(columns=dict(zip(df.columns, [my_columns])))
So I was wondering if anyone knows a way to do something like the following:
(pd.read_csv("my_csv.csv")
.rename(columns=dict(zip(SELF.columns, [my_columns])))
)
where SELF references the DataFrame that has been just created.
So far I have tried unsuccessfully to use lambda functions as I know they can be used to subset the DataFrame by conditions set on the just created object like [lambda x: x.ColumnA > 20]
Thank you in advance.
EDIT:
I was able to do what I was looking for with the help of .pipe() I did the following:
def rename_columns(self, columns):
return self.rename(columns=dict(zip(self.columns, columns)))
(pd.DataFrame([{'a':1},{'a':1},{'a':1},{'a':1},{'a':1}])
.pipe(rename_columns, ['b'])
)
You can use .set_axis for this:
(pd.DataFrame(np.random.randn(5, 5))
.set_axis(['A', 'B', 'C', 'D', 'E'], axis=1, inplace=False)
)
inplace will change in a future version of pandas, but currently defaults to True; axis=1 operates on columns.
Related
I am conducting an analysis of a dataset. To find my results, I use this line of code:
new_df = df_ncis.groupby(['state', 'year'])['totals'].mean()
The object returned by this statement is a Series, when it should be a dataframe. I don't understand why this happened, or how to solve this issue. Also, one of the columns of the new object is missing its name. Here is the github link for the project: https://github.com/louishrm/gundataUS.
Any help would be great.
You are filtering the result by ['totals'] which is a series.
try this instead
new_df = df_ncis[['state', 'year', 'totals']].groupby(['state', 'year']).mean()
which will give you a dataframe with your 3 columns.
or if you want it as a dataframe of one column (Note the double brackets)
new_df = df_ncis.groupby(['state', 'year'])[['totals']].mean()
I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is the code that I was using in Pandas:
new = old.groupby(['A', 'B']) \
.apply(lambda x: pd.Series({
'v1': x['v1'].sum(),
'v2': x['v2'].sum(),
'v3': (x['v1'].sum() / x['v2'].sum()),
'v4': x['v4'].min()
})
)
I believe that it is working well and the resulting dataframe appears to be correct value-wise.
I just have a few questions:
Does this error mean that my method will be deprecated in the future?
/databricks/spark/python/pyspark/sql/pandas/group_ops.py:76: UserWarning: It is preferred to use 'applyInPandas' over this API. This API will be deprecated in the future releases. See SPARK-28264 for more details.
How can I rename the group-by columns to 'A' and 'B' instead of "__groupkey_0__ __groupkey_1__"?
As you noticed, I had to call pd.Series -- is there a way to do this in Koalas? Calling ks.Series gives me the following error that I am unsure how to implement:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Thanks for any help that you can provide!
I'm not sure about the error. I am using koalas==1.2.0 and pandas==1.0.5 and I don't get the error so I wouldn't worry about it
The groupby columns are already called A and B when I run the code. This again may have been a bug which has since been patched.
For this you have 3 options:
Keep utilising pd.Series. As long as your original Dataframe is a koalas Dataframe, your output will also be a koalas Dataframe (with the pd.Series automatically converted to ks.Series)
Keep the function and the data exactly the same and just convert the final dataframe to koalas using the from_pandas function
Do the whole thing in koalas. This is slightly more tricky because you are computing an aggregate column based on two GroupBy columns and koalas doesn't support lambda functions as a valid aggregation. One way we can get around this is by computing the other aggregations together and adding the multi-column aggregation afterwards:
import databricks.koalas as ks
ks.set_option('compute.ops_on_diff_frames', True)
# Dummy data
old = ks.DataFrame({"A":[1,2,3,1,2,3], "B":[1,2,3,3,2,3], "v1":[10,20,30,40,50,60], "v2":[4,5,6,7,8,9], "v4":[0,0,1,1,2,2]})
new = old.groupby(['A', 'B']).agg({'v1':'sum', 'v2':'sum', 'v4': 'min'})
new['v3'] = old.groupby(['A', 'B']).apply(lambda x: x['v1'].sum() / x['v2'].sum())
I have a Dataframe containing a single column with a list of file names. I want to find all rows in the Dataframe that their value has a prefix from a set of know prefixes.
I know I can run a simple for loop, but I want to run in a Dataframe to check speeds and run benchmarks - it's also a nice exercise.
What I had in mind is combining str.slice with str.index but I can't get it to work. This is what I have in mind:
import pandas as pd
file_prefixes = {...}
file_df = pd.Dataframe(list_of_file_names)
file_df.loc[file_df.file.str.slice(start=0, stop=upload_df.file.str.index('/')-1).isin(file_prefixes), :] # this doesn't work as the index returns a dataframe
My hope is that said code will return all rows that the value there starts with a file prefix from the list above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I will use startswith
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
I have multiple dataframes which all share the same number of columns with the same names. For some reason, I would like to rename all these columnes with a dictionary.
I know how to do it for one dataframe at a time using the rename function of pandas such as:
df = df.rename(columns={"1": "a", etc.})
I would like to add a loop for each dataframe from a list of dataframes (df_list) but for some reason it does not act as I would expect it to.
df_list = (df1, df2)
for i in df_list:
i = i.rename(columns={
'1':'a',
'2':'b',
'3':'c', etc.})
The code above does not provide any error, nor change anything. Again, both df1 and df2 share the exact same structure (columns = "1", "2", "3", etc.).
I would happily read any suggestion as to how I can proceed to automate the renaming of these columns... Thanks !
You are trying to modify a tuple, which is immutable. Use a list instead:
df_list = [df1, df2]
for i in df_list:
i = i.rename(columns={
'1':'a',
'2':'b',
'3':'c', ...})
I am trying to remove some columns from my data frame and would prefer not to return the modified data frame and reassign it to the old. Instead, I would like the function to just modify the data frame. This is what I tried but it does not seem to be doing what I except. I was under the impression arguments as passed as reference and not by value?
function remove_cols! (df::DataFrame, cols)
df = df[setdiff(names(df), cols)];
end
df = DataFrame(x = [1:10], y = [11:20]);
remove_cols!(df, [:y]); # this does not modify the original data frame
Of course the below works but I would prefer if remove_cols! just changed the df in place
df = remove_cols!(df, [:y]);
How can I change the df in place inside my function?
Thanks!
As I understand Julia it uses what is called pass by sharing, meaning that the reference is passed by value. So when you pass the DataFrame to the function a new reference to the DataFrame is created which is local to the function. When you reassign the local df variable with its own reference to the DataFrame it has no effect on the separate global variable and its separate reference to the DataFrame.
There is a function in DataFrames.jl for deleting columns in DataFrames.
To answer the question of how to mutate a dataframe in your own function in general, the key is to use functions and operations that mutate the dataframe within the function. For example, see the function below which builds upon the standard dataframe append! function with some added benefits like it can append from any number of dataframes, the order of columns does not matter and missing columns will be added to the dataframes:
function append_with_missing!(df1::DataFrame, dfs::AbstractDataFrame...)
columns = Dict{Symbol, Type}(zip(names(df1), colwise(eltype, df1)))
for df in dfs
columns_temp = Dict(zip(names(df), colwise(eltype, df)))
merge!(columns, columns_temp)
end
for (n, t) in columns, df in [df1; [i for i in dfs]]
n in names(df) || (df[n] = Vector{Union{Missing,t}}(missing, size(df, 1)))
end
for df in dfs
append!(df1, df[names(df1)])
end
end
Here, the first dataframe passed itself is mutated with rows added from the other dataframes.
(The functionality for adding missing columns is based upon the answer given by #Bogumił Kamiński here: Breaking change on vcat when columns are missing)