I'm pretty new to this and am having trouble finding the right way to do it.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like to produce a new dataframe by multiplying each row of the second dataframe against each row of the first dataframe: the value of Name1 is multiplied by the value in the dataframe1 column whose name matches that row's Name4 value in dataframe2.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
import pandas as pd
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First I changed the data in df1 at this point so this new example will turn out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged but, depending on the letter in col4, col1 was multiplied by the corresponding factor from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are multiplied only with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    # Repeat this df2 row once for every row of df1
    repeated_rows = pd.concat([row]*len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    # Scale the df1 column named by col4 and use it as the new col1
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows

pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
    col1  col2  col3  col4
0      3     2  data     D
1      6     2  data     D
2      9     2  data     D
3     12     2  data     D
4     15     2  data     D
5     18     2  data     D
6     12     2  data     L
7     10     2  data     L
8      8     2  data     L
9      6     2  data     L
10     4     2  data     L
11     2     2  data     L
12     7     2  data     D
13    14     2  data     D
14    21     2  data     D
15    28     2  data     D
16    35     2  data     D
17    42     2  data     D
18     8     2  data     W
19     8     2  data     W
20     8     2  data     W
21     8     2  data     W
22     8     2  data     W
23     8     2  data     W
...
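For what it's worth, the join idea mentioned above can probably be made to work by reshaping df1 to long format first instead of transposing it. Below is a minimal sketch, assuming the df1 and df2 from the question's fake data; the names long_df1, merged, factor, and result are my own and not from the original code.

```python
import pandas as pd

d1 = {'D': [1, 2, 3, 4, 5, 6], 'W': [2, 2, 2, 2, 2, 2],
      'L': [6, 5, 4, 3, 2, 1], 'S': [1, 2, 3, 4, 5, 6]}
d2 = {'col1': [3, 2, 7, 4, 5, 6], 'col2': [2, 2, 2, 2, 3, 4],
      'col3': ['data'] * 6, 'col4': ['D', 'L', 'D', 'W', 'S', 'S']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)

# Reshape df1 to long format: one row per (column label, value) pair.
# 'factor' is a hypothetical helper column name.
long_df1 = df1.melt(var_name='col4', value_name='factor')

# A left merge on the label repeats each df2 row once per df1 row.
merged = df2.merge(long_df1, on='col4', how='left')

# Scale col1 by the matching df1 value, then drop the helper column.
merged['col1'] = merged['col1'] * merged['factor']
result = merged.drop(columns='factor')
```

With a left merge the rows should come out in df2 order, so the first 24 values of result['col1'] ought to match the d3 example from the question. This avoids the Python-level loop over df2.iterrows(), which may matter for larger frames.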
Hi, I have a dataframe full of strings, and I want to encode these strings and store their corresponding codes.
I want to learn the codes from one column and apply them to another column.
When I apply the codes to another column that contains a string I haven't seen in my training column, I want to create a new unique value for it.
I have tried LabelEncoder, but it raises an error on previously unseen strings.
For example, I have this dataframe:
col1 col2
a a
b b
c e
d f
After fitting LabelEncoder on the first column, I get something like this:
col1 col2
1 a
2 b
3 e
4 f
After applying the created codes to the second column, I want to have something like this:
col1 col2
1 1
2 2
3 5
4 6
What is the easiest way to do this? Thank you.
I created the df dataframe by copying the sample from the OP's post to the clipboard:
df=pd.read_clipboard()
Its value will be as follows when we print it:
col1 col2
0 a a
1 b b
2 c e
3 d f
Could you please try the following? I have listed only the first six letters here; you could add the rest if they occur in your actual data.
dict1 = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6}
df.applymap(lambda s: dict1.get(s) if s in dict1 else s)
Output will be as follows.
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
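If the mapping should be learned from col1 rather than typed by hand, with fresh codes handed out to strings that only appear in col2, something along these lines might work. This is only a sketch, assuming the sample df above; the names codes and encoded are my own, and the codes start at 1 to match the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['a', 'b', 'e', 'f']})

# Learn codes from the "training" column, in order of first appearance.
codes = {value: i for i, value in enumerate(df['col1'].unique(), start=1)}

# Give every previously unseen string in col2 the next free code.
for value in df['col2'].unique():
    codes.setdefault(value, len(codes) + 1)

# Apply the combined mapping to both columns.
encoded = df.apply(lambda col: col.map(codes))
```

Keeping the codes dict around means the same mapping can be extended and reused on further columns later.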
You could do the encoding yourself using pd.factorize:
v, k = pd.factorize(sorted(df.stack().unique()))
m = dict(zip(k.tolist(), (v+1).tolist()))
df.replace(m)
Output:
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
I think the real trick is to stack col1 and col2, then encode the values of both columns as one:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack())
I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category    Average
A           a
B           b
C           c
etc...
I now want to take these averages and add them as another column to my original data.
So, I want it to look something like this:
Categories  Averages
B           b
A           a
B           b
C           c
B           b
C           c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- lapply(split(data$Value, data$Category), mean)
> avg
$A
[1] 2
$B
[1] 5.5
$C
[1] 8.5
> data$Averages <- avg[data$Category]
> data
Category Value Averages
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
You can use plyr, data.table, etc. more efficiently for larger datasets.