I have the following dataframe, dfe:
id  categ  level  cols    value  comment
1   A      PG     Apple     428  comment1
1   A      CD     Apple     175  comment1
1   C      PG     Apple     226  comment1
1   C      AB     Apple     884  comment1
1   C      CD     Apple     288  comment1
1   B      PG     Apple     712  comment1
1   B      AB     Apple     849  comment1
2   B      CD     Apple     376  comment1
2   C      None   Orange    591  comment1
2   B      CD     Orange    135  comment1
2   D      None   Orange    423  comment1
2   A      AB     Orange   1e13  comment1
2   D      PG     Orange   1e15  comment2
df2 = pd.DataFrame({'s2': {0: 1, 1: 2, 2: 3}, 'level': {0: 'PG', 1: 'AB', 2: 'CD'}})
df1 = pd.DataFrame({'sl': {0: 1, 1: 2, 2: 3, 3: 4}, 'set': {0: 'A', 1: 'C', 2: 'B', 3: 'D'}})
dfe = (dfe[['categ','level','cols','id','comment','value']]
       .merge(df1.rename({'set': 'categ'}, axis=1), how='left', on='categ')
       .merge(df2, how='left', on='level'))
na = dfe['level'].isna()
dfs = {'no_null': dfe[~na], 'null': dfe[na]}
with pd.ExcelWriter('XYZ.xlsx') as writer:
    for p, r in dfs.items():
        if p == 'no_null':
            c = ['cols', 's2', 'level']
        else:
            c = 'cols'
        df = r.pivot_table(index=['id','sl','comment','categ'], columns=c, values=['value'])
        df.columns = df.columns.droplevel([0, 2])
        df = df.reset_index().drop(('sl',''), axis=1).set_index('categ')
        for (id, comment), sdf in df.groupby(['id','comment']):
            df = sdf.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
            # `name` was undefined in the original snippet; assuming one sheet per (id, comment) group
            df.to_excel(writer, sheet_name=f'{id}_{comment}')
Running this, I get the results displayed in Excel this way:
I want to order the headings in a certain way. What I tried:
df = r.pivot_table(index=['id','sl','comment','categ'], columns=c, values='value')
df.columns = df.columns.droplevel([1])
df = df.reset_index().drop(('sl',''), axis=1).set_index('categ')
This gives me a "Too many levels: Index has only 2 levels, not 3" error; I don't know what I'm missing or doing wrong here.
My expected output for the arrangement of headings is:
I would like to know if the headings can be written to Excel in CAPS, as shown in the expected output.
EDIT 1
I tried the answer and I'm getting this view:
I want to display ID & COMMENT only once (as the data is already grouped by ID in the code logic), drop the sl column and the first column (0, 1, 2), and also delete the blank row above 0.
Given dfe as:
   categ level    cols  id   comment         value  sl   s2
0      A    PG   Apple   1  comment1  4.280000e+02   1  1.0
1      A    CD   Apple   1  comment1  1.750000e+02   1  3.0
2      C    PG   Apple   1  comment1  2.260000e+02   2  1.0
3      C    AB   Apple   1  comment1  8.840000e+02   2  2.0
4      C    CD   Apple   1  comment1  2.880000e+02   2  3.0
5      B    PG   Apple   1  comment1  7.120000e+02   3  1.0
6      B    AB   Apple   1  comment1  8.490000e+02   3  2.0
7      B    CD   Apple   2  comment1  3.760000e+02   3  3.0
8      C  None  Orange   2  comment1  5.910000e+02   2  NaN
9      B    CD  Orange   2  comment1  1.350000e+02   3  3.0
10     D  None  Orange   2  comment1  4.230000e+02   4  NaN
11     A    AB  Orange   2  comment1  1.000000e+13   1  2.0
12     D    PG  Orange   2  comment2  1.000000e+15   4  1.0
Then try:
c = ['cols', 's2', 'level']  # as defined for the no_null case in the question
df = dfe.pivot_table(index=['id','comment','categ'], columns=c, values='value')
df.columns = df.columns.droplevel([1])
df = (df.rename_axis(columns=[None, None])
        .reset_index(col_level=1)
        .rename(columns=lambda x: x.upper()))
df.to_excel('testa1.xlsx')
Output:
Notes:
Removed the [] around 'value' in pivot_table so that 'value' is not included as a column index level.
Aligned 'id', 'comment' and 'categ' with column index level 1 using the col_level parameter.
See this post about the blank line, https://stackoverflow.com/a/52498899/6361531.
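As a minimal sketch of that workaround (assuming the file written above and the openpyxl engine), the blank row can be deleted after writing:
import openpyxl

wb = openpyxl.load_workbook('testa1.xlsx')
ws = wb.active
ws.delete_rows(3)  # assumed position: the blank row sits directly under the two header rows
wb.save('testa1.xlsx')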
I think it would be easier to drop the columns' names and then replace them with custom ones:
df.columns = df.columns.droplevel()
df.columns = pd.MultiIndex.from_tuples([("", "ID"), ("", "CATEG"), ("apple", "PG"), ("apple", "AB"), ("apple", "CD"), ("orange", "PG"), ("orange", "AB"), ("orange", "CD")])
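Then write as before. One caveat worth hedging: to_excel(index=False) raises NotImplementedError for MultiIndex columns in many pandas versions, so the running index column stays when you write:
df.to_excel('testa1.xlsx')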
Related
I want to merge the rows of the two dataframes below when the strings in the Test1 column of DF2 contain a substring from the Test1 column of DF1.
DF1 = pd.DataFrame({'Test1':list('ABC'),
'Test2':[1,2,3]})
print (DF1)
Test1 Test2
0 A 1
1 B 2
2 C 3
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
'Test2':[1,2,3,4]})
print (DF2)
Test1 Test2
0 ee 1
1 bA 2
2 cCc 3
3 D 4
For that, with str.contains I am able to identify which substrings from DF1.Test1 appear in the strings of DF2.Test1:
INPUT:
for i in DF1.Test1:
    ok = DF2[DF2.Test1.str.contains(i)]
    print(ok)
OUTPUT:
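For reference, str.contains is case-sensitive here, so the loop's print output (reconstructed from the frames above; 'B' matches nothing) is:
  Test1  Test2
1    bA      2
Empty DataFrame
Columns: [Test1, Test2]
Index: []
  Test1  Test2
2   cCc      3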
Now, I would like to add to the output the merge of the rows where a substring from Test1 of DF1 matches a string in Test1 of DF2:
OUTPUT:
For that, I tried with pd.merge and if, but I am not able to find the right code yet.
Do you have suggestions please?
for i in DF1.Test1:
    if DF2.Test1.str.contains(i) == 'True':
        ok = pd.merge(DF1, DF2, on=['Test1'[i]], how='outer')
        print(ok)
Thank you for your ideas :)
I could not respond to jezrael's comment because of my reputation, but I changed his answer into a function that merges on non-capitalized text.
def str_merge(part_string_df, full_string_df, merge_column):
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('(' + pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df, left_on=merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x', merge_column_lower + '_y', 'Test3'], axis=1)
    return DF
Used with example:
DF1 = pd.DataFrame({'Test1':list('ABC'),
'Test2':[1,2,3]})
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
'Test2':[1,2,3,4]})
print(str_merge(DF1,DF2, 'Test1'))
  Test1_x  Test2_x Test1_y  Test2_y
0       B        2      bA        2
1       C        3     cCc        3
I believe you need to extract the values to a new column and then merge, and last remove the helper column Test3:
pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('('+ pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on= 'Test1', right_on='Test3').drop('Test3', axis=1)
print (DF)
  Test1_x  Test2_x Test1_y  Test2_y
0       A        1      bA        2
1       C        3     cCc        3
Detail:
print (DF2)
  Test1  Test2 Test3
0    ee      1   NaN
1    bA      2     A
2   cCc      3     C
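One caveat worth noting: the pattern above is built from the raw strings, so any regex metacharacters in Test1 would leak into it; escaping them is a safer variant (a small defensive tweak, not in the original answer):
import re
pat = '|'.join(re.escape(x) for x in DF1.Test1)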
I have a dataframe like this. I want to know how I can apply a map function to its index to rename it into an easier format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
d
apple_017 1
orange_054 2
orange_061 3
orange_053 4
There are only two labels in the indices of the dataframe, so it's either apple or orange in this case, and this is how I tried:
data.index = data.index.map(i = "apple" if "apple" in i else "orange")
(Apparently that's not how it works.)
Desired output:
d
apple 1
orange 2
orange 3
orange 4
Appreciate anyone's help and suggestion!
Try via split():
df.index=df.index.str.split('_').str[0]
OR
via map():
df.index=df.index.map(lambda x:'apple' if 'apple' in x else 'orange')
output of df:
d
apple 1
orange 2
orange 3
orange 4
I have an id column for each person (data with the same id belongs to one person). The id column is not based on sequential numbering; it's 10 digits. How can I reset the ids to integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
Use factorize:
df['id']=df['id'].factorize()[0]+1
Output:
id col1
0 1 summer
1 1 goest
2 2 yes
3 2 No
4 2 why
5 3 Hi
Another option is to use categorical data:
df['id'] = df['id'].astype('category').cat.codes + 1
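A note on the difference between the two (a quick illustrative sketch): factorize numbers values in order of appearance, while cat.codes follows sorted category order, so they can assign different numbers when the data is not already sorted:
import pandas as pd

s = pd.Series(['b', 'a', 'b'])
print(s.factorize()[0] + 1)                       # [1 2 1] - order of appearance
print(list(s.astype('category').cat.codes + 1))   # [2, 1, 2] - sorted category order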
Try:
df.reset_index(inplace=True)
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame([('bird', 389.0),
('bird', 24.0),
('mammal', 80.5),
('mammal', np.nan)],
index=['falcon', 'parrot', 'lion', 'monkey'],
columns=('class', 'max_speed'))
print(df)
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
This is how it looks; let's replace the index:
df.reset_index(inplace=True)
print(df)
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['12a4', '12a4', '3b', '3b', '3b', '4t'],
'col1': ['summer', 'goest', 'yes', 'No', 'why', 'Hi']})
unique_id = df.drop_duplicates(subset=['id']).reset_index(drop=True)
id_dict = dict(zip(unique_id['id'], unique_id.index))
df['id'] = df['id'].apply(lambda x: id_dict[x])
df.drop_duplicates(subset=['id']).reset_index(drop=True) removes rows with duplicate values in column id.
# print(unique_id)
id col1
0 12a4 summer
1 3b yes
2 4t Hi
dict(zip(unique_id['id'], unique_id.index)) creates a dictionary from column id and index value.
# print(id_dict)
{'12a4': 0, '3b': 1, '4t': 2}
df['id'].apply(lambda x: id_dict[x]) replaces each value in the column with its mapped value from the dict.
# print(df)
id col1
0 0 summer
1 0 goest
2 1 yes
3 1 No
4 1 why
5 2 Hi
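Note that this version numbers the ids from 0, while the expected output starts at 1; a one-line shift after the mapping matches it:
df['id'] = df['id'] + 1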
I have a dataframe like this:
matrix = [(222, {'a': 1, 'b':3, 'c':2, 'd':1}),
(333, {'a': 1, 'b':0, 'c':0, 'd':1})]
df = pd.DataFrame(matrix, columns=['ordernum', 'dict_of item_counts'])
ordernum dict_of item_counts
0 222 {'a': 1, 'b': 3, 'c': 2, 'd': 1}
1 333 {'a': 1, 'b': 0, 'c': 0, 'd': 1}
and I would like to create a dataframe in which each ordernum is repeated for each dictionary key in dict_of_item_counts that is not 0. I would also like to create a key column that shows the corresponding dictionary key for each row, as well as a value column that contains the dictionary values. Finally, I would also like an ordernum_index that counts the rows in the dataframe for each ordernum.
The final dataframe should look like this:
ordernum ordernum_index key value
222 1 a 1
222 2 b 3
222 3 c 2
222 4 d 1
333 1 a 1
333 2 d 1
Any help would be much appreciated :)
Always try to structure your data. It can be done easily like below:
>>> matrix
[(222, {'a': 1, 'b': 3, 'c': 2, 'd': 1}), (333, {'a': 1, 'b': 0, 'c': 0, 'd': 1})]
>>> data = [[item[0]]+[i+1]+list(value) for item in matrix for i,value in enumerate(item[1].items()) if value[-1]!=0]
>>> data
[[222, 1, 'a', 1], [222, 2, 'b', 3], [222, 3, 'c', 2], [222, 4, 'd', 1], [333, 1, 'a', 1], [333, 4, 'd', 1]]
>>> pd.DataFrame(data, columns=['ordernum', 'ordernum_index', 'key', 'value'])
   ordernum  ordernum_index key  value
0       222               1   a      1
1       222               2   b      3
2       222               3   c      2
3       222               4   d      1
4       333               1   a      1
5       333               4   d      1
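For readability, here is that comprehension written out as an equivalent loop (a sketch; note i + 1 is the key's position within the dict, which is why 333/d gets index 4 rather than the 2 shown in the expected output):
data = []
for ordernum, counts in matrix:
    for i, (key, count) in enumerate(counts.items()):
        if count != 0:  # skip zero counts, but keep the dict position
            data.append([ordernum, i + 1, key, count])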
Expand the dictionary by using apply with pd.Series and use concat to concatenate that to your other column (ordernum); see below for the in-between result, df2.
Now, to turn every column into a row, use melt, then use query to drop all the 0-rows, and finally assign the cumcount to get the index (after ordering), adding 1 to start counting from 1, not 0.
df2 = pd.concat([df[['ordernum']], df['dict_of item_counts'].apply(pd.Series)], axis=1)
(df2.melt(id_vars='ordernum', var_name='key')
    .query('value != 0')
    .sort_values(['ordernum', 'key'])
    .assign(ordernum_index=lambda df: df.groupby('ordernum').cumcount().add(1)))
#  ordernum key  value  ordernum_index
#0      222   a      1               1
#2      222   b      3               2
#4      222   c      2               3
#6      222   d      1               4
#1      333   a      1               1
#7      333   d      1               2
Now df2 looks like:
#  ordernum  a  b  c  d
#0      222  1  3  2  1
#1      333  1  0  0  1
You can do this by unpacking your dictionaries while accessing them with iterrows and creating a tuple out of the ordernum, key, and value.
Finally, to create your ordernum_index, we groupby on ordernum and do a cumcount:
data = [(r['ordernum'], k, v) for _, r in df.iterrows() for k, v in r['dict_of item_counts'].items() ]
new = pd.DataFrame(data, columns=['ordernum', 'key', 'value']).sort_values('ordernum').reset_index(drop=True)
new['ordernum_index'] = new[new['value'].ne(0)].groupby('ordernum').cumcount().add(1)
new.dropna(inplace=True)
  ordernum key  value  ordernum_index
0      222   a      1             1.0
1      222   b      3             2.0
2      222   c      2             3.0
3      222   d      1             4.0
4      333   a      1             1.0
7      333   d      1             2.0
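Because the rows with value 0 hold NaN in ordernum_index before dropna, the column comes out as float; an optional cast (assuming integer indices are wanted) restores it:
new['ordernum_index'] = new['ordernum_index'].astype(int)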
Construct dataframe df1 using df['dict_of item_counts'].tolist() for the values and df.ordernum for the index. Replace 0 with np.nan and stack with dropna=True to ignore the 0 values. Use reset_index to get all columns back.
Next, create the column ordernum_index by using groupby and cumcount.
Finally, change the column names to appropriate names.
df1 = pd.DataFrame(df['dict_of item_counts'].tolist(), index=df.ordernum).replace(0, np.nan).stack(dropna=True).reset_index(name='value')
df1['ordernum_index'] = df1.groupby('ordernum')['value'].cumcount() + 1
df1 = df1.rename(columns={'level_1': 'key'})
Out[732]:
   ordernum key  value  ordernum_index
0       222   a    1.0               1
1       222   b    3.0               2
2       222   c    2.0               3
3       222   d    1.0               4
4       333   a    1.0               1
5       333   d    1.0               2
I have three columns c1, c2, c3 in a pandas dataframe. My aim is to replace c3_i by c2_j whenever c3_i = c1_j. These are all strings. I was trying where but failed. What is a good way to do this while avoiding a for loop?
If my data frame is
df=pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
Then I want c3 to be replaced by ['f','z','e']
I tried this, which takes a very long time.
for i in range(0, len(df)):
    for j in range(0, len(df)):
        if df.iloc[i]['c1'] == df.iloc[j]['c3']:
            df.iloc[j]['c3'] = df.iloc[i]['c2']
Use map with a Series created by set_index:
df['c3'] = df['c3'].map(df.set_index('c1')['c2']).fillna(df['c3'])
Alternative solution with update:
df['c3'].update(df['c3'].map(df.set_index('c1')['c2']))
print (df)
  c1 c2 c3
0  a  d  f
1  b  e  z
2  c  f  e
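For clarity, the mapping Series that set_index builds looks like this (a quick sketch using the example df):
print(df.set_index('c1')['c2'])
# c1
# a    d
# b    e
# c    f
# Name: c2, dtype: object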
Example data:
dataframe = pd.DataFrame({'a':['10','4','3','40','5'], 'b':['5','4','3','2','1'], 'c':['s','d','f','g','h']})
Output:
    a  b  c
0  10  5  s
1   4  4  d
2   3  3  f
3  40  2  g
4   5  1  h
Code:
def replace(df):
    if len(dataframe[dataframe.b == df.a]) != 0:
        df['a'] = dataframe[dataframe.b == df.a].c.values[0]
    return df

dataframe = dataframe.apply(replace, 1)
Output:
    a  b  c
0  10  5  s
1   d  4  d
2   f  3  f
3  40  2  g
4   s  1  h
Is this what you want?