Python Pandas: LabelEncoding fitting unknown variables

Hi, I have a dataframe full of strings, and I want to encode these strings and store their corresponding codes.
I want to produce these codes from one column and fit them onto another column.
When I fit these codes on some other column that has a string I haven't seen in my training column, I want to create another unique code for it.
I have tried the LabelEncoder class, but it gives an error on previously unseen strings.
For example, I have this dataframe:
col1 col2
a a
b b
c e
d f
After training the LabelEncoder on the first column I get something like this:
col1 col2
1 a
2 b
3 e
4 f
After fitting the created codes on the second column I want to have something like this:
col1 col2
1 1
2 2
3 5
4 6
What is the easiest way to do this? Thank you.

I created the df dataframe by copying the sample from the OP's post:
df = pd.read_clipboard()
Printing it gives:
col1 col2
0 a a
1 b b
2 c e
3 d f
Could you please try the following? I have listed only the first six letters here; you can add all of them if your actual input file contains more.
dict1 = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6}
df.applymap(lambda s: dict1.get(s, s))
Output will be as follows.
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
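If you don't want to hard-code the mapping, the same dictionary idea can be built automatically: learn codes from col1, then hand out fresh codes for any value that only appears in col2. A minimal sketch (the sample data and 1-based codes match the example above; `codes` and `encoded` are names made up here):

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcd'), 'col2': list('abef')})

# learn 1-based codes from the training column
codes = {v: i + 1 for i, v in enumerate(df['col1'].unique())}

# extend the mapping with fresh codes for values only seen in col2
for v in df['col2']:
    if v not in codes:
        codes[v] = len(codes) + 1

encoded = df.replace(codes)
```

This reproduces the desired output in the question: col1 becomes 1-4, and the unseen values e and f in col2 become 5 and 6.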

You could do the encoding yourself using pd.factorize:
v, k = pd.factorize(sorted(df.stack().unique()))
m = dict(zip(k.tolist(), (v+1).tolist()))
df.replace(m)
Output:
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
I think the real trick is to stack col1 and col2 and then encode the values of both columns as one:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(df.stack())      # one shared code table for every value in either column
df.apply(le.transform)  # codes are 0-based and consistent across columns

Related

How to group by one column or another in pandas

I have a table like:
col1 col2
0 1 a
1 2 b
2 2 c
3 3 c
4 4 d
I'd like rows to be grouped together if they have a matching value in col1 or col2. That is, I'd like something like this:
(df
 .groupby(set('col1', 'col2'))  # Made-up syntax
 .ngroup())
0 0
1 1
2 1
3 1
4 2
Is there a way to do this with pandas?
This is not easy to achieve with pandas alone. Indeed, two groups that are far apart in one column can become connected when two of their items share a value in the other column.
You can approach this using graph theory: find the connected components of the graph whose edges are formed by the two (or more) grouping columns. A Python library for this is networkx:
import networkx as nx
g1 = df.groupby('col1').ngroup()
g2 = 'a' + df.groupby('col2').ngroup().astype(str)  # prefix so labels don't collide with g1
# make graph and get connected components to form a mapping dictionary
G = nx.from_edgelist(zip(g1, g2))
d = {k:v for v,s in enumerate(nx.connected_components(G)) for k in s}
# find common group
group = g1.map(d)
df.groupby(group).ngroup()
output:
0 0
1 1
2 1
3 1
4 2
dtype: int64
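If you would rather avoid the networkx dependency, the same connected-components idea can be sketched with a small hand-rolled union-find over the labels of both columns. This is not from the original answer; `find` and `union` are hypothetical helpers, and col2 values are prefixed so they cannot collide with col1's integers:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 4], 'col2': list('abccd')})

# union-find over the labels of both columns
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# connect each row's col1 label to its col2 label;
# prefix col2 values so they cannot collide with col1's integers
for a, b in zip(df['col1'], 'c2_' + df['col2']):
    union(a, b)

roots = df['col1'].map(find)
group = roots.map({r: i for i, r in enumerate(roots.unique())})
```

On the sample table this yields the same grouping as the networkx approach: rows 1-3 merge into one group because they are linked through col1 value 2 and col2 value c.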

Use row values as data frame headers

I am dealing with several data frames (DataFrames = [DataFrame_a, b, c, ..., z]) whose headers are long descriptions. For example:
a = pd.DataFrame(data=[[1, 2, 7], ['ABC', 'BCD', 'CDE'], [5, 6, 0]], columns=['SuperSuperlong name columnA', 'SuperSuperlong name columnB', 'SuperSuperlong name columnC'])
SuperSuperlong_name_columnA SuperSuperlong_name_columnB SuperSuperlong_name_columnC
0 1 2 7
1 ABC BCD CDE
2 5 6 0
I'd like it to be transformed to
ABC BCD CDE
0 SuperSuperlong_name_columnA SuperSuperlong_name_columnB SuperSuperlong_name_columnC
1 1 2 7
2 5 6 0
What's the easiest way?
I'd also like to apply the method to all the data frames I have. How should I do that?
Hope this helps.
# Append the column names as a new data row and reset the index
df.loc['new'] = df.columns
df.reset_index(inplace=True, drop=True)
# Use the desired row as the column names, then drop it from the data
df.columns = df.iloc[1]
df = df.drop(1).reset_index(drop=True)
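To get the exact layout asked for (the promoted row as the header, the old header kept as the first data row), the steps above can be wrapped in a function and applied to every frame in a list. A sketch, where `promote_row` is a hypothetical helper name:

```python
import pandas as pd

a = pd.DataFrame(
    data=[[1, 2, 7], ['ABC', 'BCD', 'CDE'], [5, 6, 0]],
    columns=['SuperSuperlong name columnA',
             'SuperSuperlong name columnB',
             'SuperSuperlong name columnC'])

def promote_row(df, row):
    # use the values of one row as the header and keep the old
    # header as the first data row
    new_header = df.iloc[row]
    old_header = pd.DataFrame([list(df.columns)], columns=df.columns)
    out = pd.concat([old_header, df.drop(df.index[row])], ignore_index=True)
    out.columns = list(new_header)
    return out

# apply the same transformation to every frame in a list
frames = [promote_row(d, row=1) for d in [a]]
```

Because the helper takes the frame as an argument, applying it to all your data frames is just a list comprehension over them.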

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like to produce a new dataframe by multiplying each row of the second dataframe against each row of the first: the Name1 value is multiplied by the values in the dataframe1 column whose name matches that row's Name4 value.
Is there any nice way to do this? I was trying to look at using methods like where and apply, but I haven't understood them well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First, I changed the data in df1 at this point so this new example will turn out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding column from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2, but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    repeated_rows = pd.concat([row] * len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows

pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
Resulting in (with ignore_index=True the index runs 0-23):
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
6 12 2 data L
7 10 2 data L
8 8 2 data L
9 6 2 data L
10 4 2 data L
11 2 2 data L
12 7 2 data D
13 14 2 data D
14 21 2 data D
15 28 2 data D
16 35 2 data D
17 42 2 data D
18 8 2 data W
19 8 2 data W
20 8 2 data W
21 8 2 data W
22 8 2 data W
23 8 2 data W
...
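If the row-wise concat above gets slow on larger frames, a vectorized alternative is to repeat every df2 row len(df1) times and build the factor column in one shot. A sketch using the fake data from the EDIT, truncated to the first four rows of df2 so it matches the shown output:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'D': [1, 2, 3, 4, 5, 6], 'W': [2] * 6,
                    'L': [6, 5, 4, 3, 2, 1], 'S': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col1': [3, 2, 7, 4], 'col2': [2] * 4,
                    'col3': ['data'] * 4, 'col4': ['D', 'L', 'D', 'W']})

# repeat every df2 row once per row of df1
out = df2.loc[df2.index.repeat(len(df1))].reset_index(drop=True)

# for each df2 row, take the whole df1 column named in col4 as factors
factors = np.concatenate([df1[c].to_numpy() for c in df2['col4']])
out['col1'] = out['col1'] * factors
```

This produces the same 24 rows as the stretch-based answer, without a Python-level loop over rows.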

Map column names if data is same in two dataframes

I have two pandas dataframes
df1 = A B C
1 2 3
2 3 4
3 4 5
df2 = X Y Z
1 2 3
2 3 4
3 4 5
I need to map based on the data: if the data is the same, map the column names.
Output = col1 col2
A X
B Y
C Z
I cannot find any built-in function to support this, hence simply loop over all columns:
import pandas as pd

pairs = []
for col1 in df1.columns:
    for col2 in df2.columns:
        if df1[col1].equals(df2[col2]):
            pairs.append((col1, col2))
output = pd.DataFrame(pairs, columns=['col1', 'col2'])
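One note on the design: the nested loop compares every column pair. If the frames are wide, a hash-based lookup can be sketched instead, indexing df2's columns by their values as a tuple. This assumes the matching columns are exactly equal and NaN-free, and a duplicate value column in df2 would silently keep only the last name:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [3, 4, 5]})
df2 = pd.DataFrame({'X': [1, 2, 3], 'Y': [2, 3, 4], 'Z': [3, 4, 5]})

# index df2's columns by their values so each df1 column needs one lookup
by_values = {tuple(df2[c]): c for c in df2.columns}
pairs = [(c1, by_values[key]) for c1 in df1.columns
         if (key := tuple(df1[c1])) in by_values]
output = pd.DataFrame(pairs, columns=['col1', 'col2'])
```

For the sample frames this gives the same A-X, B-Y, C-Z pairing as the loop, with one dictionary lookup per df1 column instead of a full pairwise scan.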

PIG generate data using different load variables

I have 2 files and I want to generate data using different columns of the two files.
Here is my problem with an example:
I have 2 files, abc.txt (col1, col2) and xyz.txt (col3, col4). The number of records in the two files differs: say abc.txt has 1000 records and xyz.txt has 100.
I want to store the output in a file such that I get col1 and col2 from abc.txt and col3 from xyz.txt (since xyz.txt has fewer records than abc.txt, I want my col3 values to be repeated, either randomly or in the same sequence as the input file; anything is okay).
Input
abc.txt xyz.txt
col1 col2 col3 col4
1 A 4 X
2 B 5 Y
3 C 6 Z
4 D
5 D
6 F
7 A
A = LOAD '/user/abc.txt' Using PigStorage('|');
B = LOAD '/user/xyz.txt' Using PigStorage('|');
C = FOREACH A GENERATE A.$0,A.$1,B.$0;
Output
col1 col2 col3
1 A 4
2 B 5
3 C 6
4 D 5
5 D 4
6 F 4
7 A 6
Is it possible to do this using PIG?
GENERATE is not a standalone operator in Pig, so you cannot use it on its own to generate data. Pig provides FOREACH for iterating over a relation, and it works on one relation at a time. To me it looks like you cannot generate the data as you have specified in the question unless you perform some sort of JOIN on the data.