I'd like to merge/concatenate multiple dataframes together; basically it's to add up many feature columns based on the same first column 'Name'.
F1.merge(F2, on='Name', how='outer').merge(F3, on='Name', how='outer').merge(F4,on='Name', how='outer')...
I tried the code above and it works. But I've got, say, 100 features to add up together, so I'm wondering: is there a better way?
Without data it is not easy, but this can work:
df = pd.concat([x.set_index('Name') for x in [df1, df2, df3]], axis=1).reset_index()
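When there are many frames, functools.reduce can also chain the merges without writing them out by hand. A minimal sketch with made-up frames f1, f2, f3 (names and values are purely illustrative):

```python
import functools

import pandas as pd

# Hypothetical small feature frames sharing a 'Name' column.
f1 = pd.DataFrame({'Name': ['a', 'b'], 'x': [1, 2]})
f2 = pd.DataFrame({'Name': ['a', 'c'], 'y': [3, 4]})
f3 = pd.DataFrame({'Name': ['b', 'c'], 'z': [5, 6]})

frames = [f1, f2, f3]

# Fold the list into one frame with repeated outer merges on 'Name'.
merged = functools.reduce(
    lambda left, right: left.merge(right, on='Name', how='outer'), frames
)
```

With 100 feature frames you would just build `frames` as a list and the reduce stays one line.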
I got the below performance warning
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()
when I tried to add columns to a dataframe from a list.
The warning asks me to consider using pd.concat. But it looks like pd.concat does not take lists.
I am trying to create a set of dataframes from Excel with the same columns and rows. Each file I'm working with has a slightly different set of numbered columns. I tried iterating over a list to add the columns that are missing.
But since that throws a performance warning, I'd like to improve performance. A sample of my code is below.
missing_columns = list(map(str, range(1900, 2022)))
for f in files:
    data = pd.read_excel(f)
    new_cols = data.columns.values.tolist()
    new_cols[0] = 'Company'
    data.columns = new_cols
    for missed in missing_columns:
        if missed not in new_cols:
            data[missed] = np.nan
    data = data.set_index('Company')
    data = data.reindex(sorted(data.columns), axis=1)
Any help is appreciated!
I am trying to process a set of Excel files that have different sets of columns, and I want to add any missing columns to the dataframes so that they all have the same set of columns.
One issue with my code was that it used a for loop to iterate over the missing_columns list and add the missing columns to each dataframe. This approach can be inefficient because the code has to iterate over the entire missing_columns list for each dataframe, which can take a long time if the list and the dataframes are large.
One way to improve the efficiency of the code is to use the reindex method to add missing columns to the dataframes, rather than a for loop. The reindex method can add missing columns to a dataframe and fill them with a specified value, such as NaN, which is what I tried in the for loop. By using reindex, I can avoid iterating over the missing_columns list and adding the columns one at a time, which improves the performance of the code.
Here is how I used the reindex method to add the missing columns to the dataframes:
# Get the complete set of columns for the dataframes
all_columns = ['Company'] + missing_columns
# Reindex the dataframes to include all columns
data = data.reindex(columns=all_columns, fill_value=np.nan)
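A self-contained sketch of this reindex approach, using small in-memory frames in place of the Excel files (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical frames standing in for the Excel files; their column sets differ.
frames = [
    pd.DataFrame({'Company': ['A'], '1900': [1.0], '1902': [2.0]}),
    pd.DataFrame({'Company': ['B'], '1901': [3.0]}),
]

# The full set of columns every frame should end up with.
all_columns = ['Company'] + list(map(str, range(1900, 1903)))

# reindex adds any missing columns, filling them with NaN by default.
aligned = [df.reindex(columns=all_columns) for df in frames]
```

After this, every frame in `aligned` has the same columns in the same order, which replaces the inner for loop entirely.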
I have a function that returns tuples. When I apply this to my pandas dataframe using the pd.apply() function, the results look like this.
The Date here is an index and I am not interested in it.
I want to create two new columns in a dataframe and set their values to the values you see in these tuples.
How do I do this?
I tried the following:
This errors out, citing a mismatch between expected and available values. It is seeing these tuples as a single entity, so the two columns I specified on the left-hand side are a problem. It's expecting only one.
And what I need is to break it down into two parts that can be used to set two different columns.
What's the correct way to achieve this?
Make your function return a pd.Series; this will be expanded into a frame.
orders.apply(lambda x: pd.Series(myFunc(x)), axis=1)
Use zip:
orders['a'], orders['b'] = zip(*orders['your_column'])
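Both answers can be sketched together on a toy frame; my_func and the column names here are made up for illustration:

```python
import pandas as pd

def my_func(row):
    # Hypothetical function returning a tuple per row.
    return row['n'] * 2, row['n'] * 3

orders = pd.DataFrame({'n': [1, 2, 3]})

# Option 1: wrap the tuple in a pd.Series so apply expands it into columns.
orders[['a', 'b']] = orders.apply(
    lambda r: pd.Series(my_func(r), index=['a', 'b']), axis=1
)

# Option 2: build the Series of tuples once, then unzip it with zip(*...).
orders['a2'], orders['b2'] = zip(*orders.apply(my_func, axis=1))
```

Both options produce the same two columns; the zip version avoids constructing one pd.Series per row.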
I am relatively new to Pandas, and was hoping for guidance on the most efficient and clean way to handle multiple rules/masks to the same dataframe column.
I have two unique and independent conditions working:
Condition 1
df["price"]= df["price"].mask(df["price"].eq("£ 0.00"), df["product_price_old"])
df.drop(axis=1, inplace=True, columns='product_price_old')
Condition 2
df["price"] = df["price"].mask(df["product_price_old"].gt(df["price"]), df["product_price_old"])
df.drop(axis=1, inplace=True, columns='product_price_old')
What is the best syntax in Pandas to merge these conditions together and remove the duplication?
Would a separate Python function called via .agg be the way? I came across .pipe in the docs earlier; would this be a suitable use case?
Any help would be appreciated.
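One possible way to remove the duplication (a sketch with made-up data, not from the original post) is to combine both rules into a single boolean mask and drop the column once. Note that the prices here are strings, as in the question, so .gt compares them lexicographically; this sketch assumes that is acceptable for same-format price strings:

```python
import pandas as pd

# Hypothetical data in the same string format as the question.
df = pd.DataFrame({
    'price': ['£ 0.00', '£ 5.00', '£ 2.00'],
    'product_price_old': ['£ 1.00', '£ 4.00', '£ 3.00'],
})

# Rule 1 OR rule 2: fall back to the old price in either case.
use_old = df['price'].eq('£ 0.00') | df['product_price_old'].gt(df['price'])
df['price'] = df['price'].mask(use_old, df['product_price_old'])

# Drop the helper column once, after both rules have been applied.
df = df.drop(columns='product_price_old')
```

Converting the prices to numeric first (e.g. stripping the currency symbol and using pd.to_numeric) would make the greater-than comparison more robust.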
I have some dataframes
df_1
df_2
…
df_99
df_100
over which I would like to iterate to perform some operations on a specific column, say Column_A, which exists in each dataframe.
I can create strings with the names of the dataframes using
for i in range(1, 101):
    'df_' + str(i)
but when I try to use these to access the dataframes like this
for i in range(1, 101):
    df_x = 'df_' + str(i)
    df_x['Column_A'].someoperation(i)
    # the operation involves the number of the dataframe
I get a TypeError: "string indices must be integers".
I searched extensively, and the solution I found suggested most often for this kind of problem was to create a dictionary with the names of the dataframes as keys and the actual dataframes as the associated values.
However, I would not like to proceed like this, for three reasons:
For one, as I am still rather new to pandas, I am not sure how to address a specific column of a dataframe that is stored as a value in a dictionary.
Additionally, putting the dataframes in a dictionary would create copies of them (if I understand correctly), which is not ideal if there are very many dataframes or if the dataframes are large.
But most importantly, since I do not know how to iterate over the names, putting the dataframes in a dictionary would have to be done manually, so in a way it is still the same problem.
I tried creating a list with the names of the dataframes to loop over:
df_list = []
for i in range(1, 101):
    df_list.append('df_' + str(i))
for df in df_list:
    df['Column_A'].someoperation
but that approach results in the same TypeError as above, and I cannot conveniently involve the number of the dataframe in "someoperation".
Apparently pandas takes df_1, df_2, etc. as the strings they are, not as the names of the already existing dataframes I would like to access, but I don't know how to tell it to do otherwise.
Any suggestions on how this could be solved are much appreciated.
You're defining a list of strings, but you're not giving Python any way of knowing that the string "df_1" is in some way connected to the dataframe df_1.
To answer your question, you're looking for the eval function, which takes a string, executes it as code, and returns the output. So eval("df_1") will give you the dataframe df_1.
df_list = []
for i in range(1, 101):  #~ look up list comprehensions for a more elegant way to do this.
    df_list.append('df_' + str(i))
for df in df_list:
    eval(df)['Column_A'].someoperation
However, you should take the advice you've gotten and use a dictionary or list. Putting the dataframes in a dictionary would definitely not create copies of them. The dictionary is simply a mapping from a set of strings to the corresponding object in memory. This is also a much more elegant solution, keeping all of the relevant dataframes in one place without having to adhere to a strict naming convention that will inevitably get messed up in some way.
If you don't really need names for each dataframe and just want them accessible together, an even simpler solution would be to put them in a list and access each one as dfs[0] through dfs[99].
If you've already got df_1 through df_100 loaded the way you're describing, eval will let you organize them all into one place like this: dfs = [eval("df_"+str(i)) for i in range(1,101)] or dfs = {i: eval(f"df_{i}") for i in range(1,101)}
Finally, you can access columns and do operations on dataframes accessed through lists and dictionaries in the normal way. E.g.
dfs[0]['column 1'] = 1.
means = dfs[40].groupby('date').mean()
#~ etc.
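A small sketch of the dictionary approach, with made-up frames; note that the dict key doubles as the frame's number, which answers the "involve the number of the dataframe" requirement:

```python
import pandas as pd

# Hypothetical: build the frames in a dict from the start instead of df_1..df_100.
dfs = {i: pd.DataFrame({'Column_A': [i, i * 2]}) for i in range(1, 4)}

# Columns of a dict-held frame are addressed the usual way,
# and iterating over items() gives each frame together with its number.
for i, df in dfs.items():
    df['Column_A'] = df['Column_A'] + i
```

No copies are made here: the dict values are references to the same DataFrame objects.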
I am trying to join/merge two dataframes (df_apply and df_result) based on a common column (name). Sounds simple enough, but one of the dataframes has column type pandas.core.series.Series and the other one has column type pandas.core.frame.DataFrame. This causes the merge (pd.merge(df_apply, df_result, on='name')) to result in an error:
ValueError: The column label 'name' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
After dropping the indexes of both tables I was able to join the tables (df_apply.join(df_result)), but this results in a dataframe with weird column names that are inaccessible in any way: the column names become tuples such as (sbt,), (gra,), (pot,), (oni,), (wwh,), (class_max,), (prob_max,), (tf_time,), (name,), (processing_time,).
I've tried converting the pandas.core.series.Series to a pandas.core.frame.Dataframe like so:
df_apply.name = df_apply.name.rename(None).to_frame()
df_apply.name = df_apply.name.to_frame()
but in the end the result of type(df_apply.name) is always: pandas.core.series.Series and the result of type(df_result.name) is always pandas.core.frame.DataFrame.
The two dataframes (a single row of each) look like this:
df_result:
df_apply:
I expect to be able to easily join these tables based on name, but these pesky pandas column type structures are making it very hard. How does one go about it?
UPDATE:
I solved the issue by exporting df_result to csv and importing it back again. At that point both columns have the column type pandas.core.series.Series. I hope this helps, but it still doesn't answer my question of how to join such tables without doing this...?
I have found that when df[colvar] results in a core.series.Series,
it can be changed to a data frame by referencing it with additional brackets:
df[[colvar]].
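A quick illustration of the difference (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'val': [1, 2]})

s = df['name']      # single brackets -> a Series
f = df[['name']]    # double brackets -> a one-column DataFrame
```

Selecting with a list of labels always yields a DataFrame, even when the list has a single element, which is why the extra brackets change the type.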