Merge two unequal length data frames by factor matching

I'm new to R and I have been searching all over for a solution to merge two data frames and match them by factor. Some of the data contains whitespace. Here is a simple example of what I am trying to do:
df1 = data.frame(id=c(1,2,3,4,5), item=c("apple", " ", "coffee", "orange", "bread"))
df2 = data.frame(item=c("orange", "carrot", "peas", "coffee", "cheese", "apple", "bacon"),count=c(2,5,13,4,11,9,3))
When I use the merge() function to combine df2 into df1, matching on the 'item' column, I end up with an "item" column of NAs.
ndf = merge(df1, df2, by="item")
How do I resolve this issue? Am I getting this because I have white space in my data? Any help would be great. Thanks,

Related

Unable to create a pandas DataFrame from itertools.product output

I want to create a DataFrame of all possible combinations:
import itertools
import pandas as pd

Primary_function=['Office','Hotel','Hospital(General Medical & Surgical)','Other - Education']
City=['Miami','Houston','Phoenix','Atlanta','Las Vegas','San Francisco','Baltimore','Chicago','Boulder','Minneapolis']
Gross_Floor_Area=[50,100,200]
Years_Built=[1950,1985,2021]
Floors_above_grade=[2,6,15]
Heat=['Electricity - Grid Purchase','Natural Gas','District Steam']
WWR=[30,50,70]
Buildings=[Primary_function,City,Gross_Floor_Area,Years_Built,Floors_above_grade,Heat,WWR]
a=list(itertools.product(*Buildings))
df=pd.DataFrame(a,columns=Buildings)
The error that I am getting is :
ValueError: Length of columns passed for MultiIndex columns is different
Pass a list of column-name strings instead, i.e.
columns = ["Primary Function", "City", "Gross Floor Area", "Year Built", "Floors Above Grade", "Heat", "WWR"]
df = pd.DataFrame(a, columns = columns)
As Mr. T suggests, if you do this frequently you will be better off using a dict.
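The error arises because `columns=Buildings` passes the nested option lists themselves as column labels, which pandas interprets as a MultiIndex specification of mismatched lengths. A minimal sketch of the fix, with the question's option lists shortened for brevity:

```python
import itertools
import pandas as pd

# Shortened versions of the question's option lists.
Primary_function = ['Office', 'Hotel']
City = ['Miami', 'Houston']
WWR = [30, 50]

Buildings = [Primary_function, City, WWR]
a = list(itertools.product(*Buildings))

# Pass plain string labels, not the option lists themselves.
df = pd.DataFrame(a, columns=['Primary_function', 'City', 'WWR'])
print(df.shape)  # (8, 3): 2 * 2 * 2 combinations
```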

How to build a loop for converting entries of categorical columns to numerical values in Pandas?

I have a Pandas data frame with several columns, with some columns comprising categorical entries. I am 'manually' converting these entries to numerical values. For example,
df['gender'] = pd.Series(df['gender'].factorize()[0])
df['race'] = pd.Series(df['race'].factorize()[0])
df['city'] = pd.Series(df['city'].factorize()[0])
df['state'] = pd.Series(df['state'].factorize()[0])
If the number of columns is huge, this method is obviously inefficient. Is there a way to do this by constructing a loop over all columns (only those columns with categorical entries)?
Use DataFrame.apply on the columns stored in variable cols:
cols = df.select_dtypes(['category']).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])
EDIT:
Your solution can be simplified:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize()[0]
I tried the following, which seems to work fine:
for column in df.select_dtypes(['category']):
    df[column] = pd.Series(df[column].factorize()[0])
where 'category' could be 'bool', 'object', etc.
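A self-contained sketch of the loop approach, assuming the categorical columns are stored as plain `object` dtype (the column names and data here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M'],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'age': [25, 32, 47, 19],       # numeric column, left untouched
})

# Factorize every object-dtype column in place.
for column in df.select_dtypes(['object']):
    df[column] = df[column].factorize()[0]

print(df['gender'].tolist())  # [0, 1, 1, 0]
print(df['city'].tolist())    # [0, 1, 0, 2]
```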

Pandas - How best to combine dataframes based on specific column values

I have my main data frame (df) with the six columns defined in 'columns_main'.
The needed data comes from two much larger df's. Let's call them df1 and df2.
Plus, df1 and df2 do not have the same column labels, but they both include the required df columns.
The df just needs the few pieces from each of the two bigger ones. And by bigger, I mean many times the number of columns.
Since it is all going into a DB, I want to get rid of all the unwanted columns.
How do I combine/merge/join/mask the needed data from the large data frames into the main (smaller) data frame? Or maybe drop the columns not covered by 'columns_main'?
df = pd.DataFrame(columns = columns_main)
The other two df's are coming from Excel workbooks with a lot of unwanted trash.
from openpyxl import load_workbook

wb = load_workbook(filename=filename)
ws = wb[_sheets[0]]
df1 = pd.DataFrame(ws.values)
ws = wb[_sheets[1]]
df2 = pd.DataFrame(ws.values)
How can I do this without some sort of crazy looping?
Thank you.
You can subset the other DataFrames with the list of wanted column names:
df1[columns_main]
df2[columns_main]
If some columns might not match, use Index.intersection:
cols = pd.Index(columns_main)
df1[df1.columns.intersection(cols)]
df2[df2.columns.intersection(cols)]
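A minimal sketch of the subsetting, with made-up column names standing in for the question's 'columns_main' and small frames standing in for the Excel data:

```python
import pandas as pd

columns_main = ['id', 'name', 'qty']

# Wide frames with extra, unwanted columns (stand-ins for the workbook data).
df1 = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'], 'qty': [3, 4], 'junk': [0, 0]})
df2 = pd.DataFrame({'id': [5], 'name': ['c'], 'trash': ['x']})  # 'qty' missing here

# Keep only the wanted columns that actually exist in each frame.
sub1 = df1[df1.columns.intersection(columns_main)]
sub2 = df2[df2.columns.intersection(columns_main)]

# Stack the trimmed frames into the single main frame.
df = pd.concat([sub1, sub2], ignore_index=True)
print(list(df.columns))  # ['id', 'name', 'qty']
```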

Combine two dataframes to send an automated message [duplicate]

Is there a way to conveniently merge two data frames side by side?
Both data frames have 30 rows; they have different numbers of columns, say, df1 has 20 columns and df2 has 40 columns.
How can I easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
Or maybe there is some special parameter in append:
df3 = pd.append(df1, df2, left_index=False, right_index=False, how='left')
P.S.: if possible, I hope the duplicated column names could be resolved automatically.
Thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
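A minimal sketch of the `concat` approach (column names and data invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [7, 8, 9], 'd': [10, 11, 12]})

# axis=1 concatenates side by side, aligning rows on the index.
df3 = pd.concat([df1, df2], axis=1)
print(df3.shape)  # (3, 4)
```

Note that `concat` aligns on the index, so two 30-row frames with matching indices yield a 30-row, 60-column result.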
I came across your question while I was trying to achieve something like the following:
So once I sliced my dataframes, I first ensured that their indices were the same. In your case both dataframes need to be indexed from 0 to 29. Then I merged both dataframes on the index.
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
If you want to combine 2 data frames with common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
There is a way: you can do it via a Pipeline.
Use a pipeline to transform your numerical data, for example:
num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector([columns with numerical values])),
    ("imputer", SimpleImputer(strategy="median")),
])
And for categorical data:
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector([columns with categorical data])),
    ("cat_encoder", OneHotEncoder(sparse=False)),
])
Then use a FeatureUnion to apply these transformations together:
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
(Note that DataFrameSelector is a custom transformer, not part of scikit-learn.)
Read more here - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()
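A minimal sketch of that in-place assignment, with invented column names, showing the differing-index case that `.to_numpy()` handles:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'b': [4, 5, 6], 'c': [7, 8, 9]}, index=[10, 11, 12])

# .to_numpy() drops df2's index, so values are assigned by position
# even though the two indices differ.
df1.loc[:, df2.columns] = df2.to_numpy()
print(list(df1.columns))  # ['a', 'b', 'c']
```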

Numpy where: perform multiple actions

I have two dataframe columns where I want to check whether the elements of one are contained in the other. I do this using the pandas isin method.
However, if an element is present in the second dataframe, I also want to subtract it from both:
attivo['S'] = np.where(attivo['SKU'].isin(stampate['SKU-S']), attivo['S'] - 1, attivo['S'])
In this example, if an item in the SKU column of the attivo dataframe is present in the SKU-S column of the stampate dataframe, the S column will decrease by one unit; however, I also want the same S column to decrease in the stampate dataframe.
How is it possible to achieve this?
EDIT with sample data:
df1 = pd.DataFrame({'SKU': 'productSKU', 'S': 5}, index=[0])
df2 = pd.DataFrame({'SKU-S': 'productSKU', 'S': 5}, index=[0])
Currently, I am achieving this:
df1['S'] = np.where(df1['SKU'].isin(df2['SKU-S']), df1['S'] - 1, df1['S'])
However, I would like that both dataframes are updated, in this case, both of them will display 4 in the S column.
IIUC:
s = df1['SKU'].isin(df2['SKU-S'])
# modify df1
df1['S'] -= s
# count the SKUs in df1 that belong to df2
counts = df1['SKU'].where(s).value_counts()
# modify df2
df2['S'] -= df2['SKU-S'].map(counts).fillna(0)
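Putting the answer's steps together with the question's sample data (a sketch; `fillna(0)` keeps rows of df2 whose SKU has no match unchanged):

```python
import pandas as pd

df1 = pd.DataFrame({'SKU': ['productSKU'], 'S': [5]})
df2 = pd.DataFrame({'SKU-S': ['productSKU'], 'S': [5]})

# Boolean mask: which rows of df1 have a SKU present in df2.
s = df1['SKU'].isin(df2['SKU-S'])

# Subtract 1 from df1 wherever the mask is True (bool casts to 0/1).
df1['S'] -= s

# Count matched SKUs in df1, then subtract those counts from df2.
counts = df1['SKU'].where(s).value_counts()
df2['S'] -= df2['SKU-S'].map(counts).fillna(0)

print(df1['S'].tolist())  # [4]
print(df2['S'].tolist())  # [4]
```

Both frames now show 4 in their S column, as the question asks.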