How to add rows according to another column - pandas

Now the result looks like this:
file_name  text                                 1
2a.txt     0 0.712518 0.61525 0.43918 0.2065    1 0.635078 0.81175 0.292786 0.0925
2b.txt     2 0.551273 0.5705 0.30198 0.0922     0 0.550212 0.31125 0.486563 0.2455
But I want to duplicate rows according to the third column (as shown below). Is there an easy way to do this?
file_name  text
2a.txt     0 0.712518 0.61525 0.43918 0.2065
2a.txt     1 0.635078 0.81175 0.292786 0.0925
2b.txt     2 0.551273 0.5705 0.30198 0.0922
2b.txt     0 0.550212 0.31125 0.486563 0.2455

This should help:
# melt the two value columns into rows, then drop the melt helper column
df = pd.melt(df, id_vars='file_name', value_vars=['text', '1'])
df = df.drop('variable', axis=1)
df = df.sort_values(by='file_name')
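A self-contained sketch of the same approach, with the question's sample rows pasted in as plain strings (the column names 'text' and '1' are assumed from the question's header):
import pandas as pd

df = pd.DataFrame({
    'file_name': ['2a.txt', '2b.txt'],
    'text': ['0 0.712518 0.61525 0.43918 0.2065',
             '2 0.551273 0.5705 0.30198 0.0922'],
    '1': ['1 0.635078 0.81175 0.292786 0.0925',
          '0 0.550212 0.31125 0.486563 0.2455'],
})

# melt turns each of the two value columns into its own row per file_name
long_df = pd.melt(df, id_vars='file_name', value_vars=['text', '1'])
long_df = long_df.drop('variable', axis=1).sort_values(by='file_name')
print(long_df)  # four rows, two per file_name, as in the desired output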

If column is a substring of another dataframe column, set value

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Key': ['OK340820.1', 'OK340821.1'], 'Length': [50000, 67000]})
df2 = pd.DataFrame({'Key': ['OK340820', 'OK340821'], 'Length': [np.nan, np.nan]})
If df2.Key is a substring of df1.Key, set df2's Length to the corresponding Length in df1.
I tried doing this:
df2['Length'] = np.where(df2.Key.isin(df1.Key.str.extract(r'(.+?(?=\.))')), df1.Length, '')
But it's not returning the matches.
Map df2.Key to the "prepared" Key values of df1:
df2['Length'] = df2.Key.map(dict(zip(df1.Key.str.replace(r'\..+', '', regex=True), df1.Length)))
In [45]: df2
Out[45]:
        Key  Length
0  OK340820   50000
1  OK340821   67000
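Decomposed into steps on the question's data, the one-liner above does this (a sketch):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Key': ['OK340820.1', 'OK340821.1'], 'Length': [50000, 67000]})
df2 = pd.DataFrame({'Key': ['OK340820', 'OK340821'], 'Length': [np.nan, np.nan]})

# strip the '.suffix' so df1's keys match df2's keys exactly
stripped = df1['Key'].str.replace(r'\..+', '', regex=True)  # OK340820, OK340821
mapping = dict(zip(stripped, df1['Length']))                # {'OK340820': 50000, ...}
df2['Length'] = df2['Key'].map(mapping)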
You can use a regex to extract the string, then map the values:
import re
# alternation pattern of all df2 keys, escaped for regex use
pattern = '|'.join(map(re.escape, df2['Key']))
# Series of df1 lengths indexed by the matching df2-style key
s = pd.Series(df1['Length'].values, index=df1['Key'].str.extract(f'({pattern})', expand=False))
df2['Length'] = df2['Key'].map(s)
Updated df2:
        Key  Length
0  OK340820   50000
1  OK340821   67000
Or with a merge:
import re
pattern = '|'.join(map(re.escape, df2['Key']))
(df2.drop(columns='Length')
    .merge(df1, how='left', left_on='Key', suffixes=(None, '_'),
           right_on=df1['Key'].str.extract(f'({pattern})', expand=False))
    .drop(columns='Key_')  # drop df1's suffixed Key, keeping df2's original
)
An alternative, if the Key in df1 is always of the form XXX.1 and removing the .1 is enough:
df2['Length'] = df2['Key'].map(df1.set_index(df1['Key'].str.extract('([^.]+)', expand=False))['Length'])
Another possible solution, which is based on pandas.DataFrame.update:
df2.update(df1.assign(Key=df1['Key'].str.extract(r'(.*)\.')))
Output:
        Key   Length
0  OK340820  50000.0
1  OK340821  67000.0
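One caveat worth knowing: DataFrame.update aligns on the index (and column names), not on Key, so this works because df1 and df2 happen to share the default 0..n-1 index in the same row order. A quick check (sketch):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Key': ['OK340820.1', 'OK340821.1'], 'Length': [50000, 67000]})
df2 = pd.DataFrame({'Key': ['OK340820', 'OK340821'], 'Length': [np.nan, np.nan]})

# rows are matched by index label, so reordering df1 would mismatch the update
df2.update(df1.assign(Key=df1['Key'].str.extract(r'(.*)\.', expand=False)))
print(df2)  # Length filled: 50000.0, 67000.0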

Mapping selective data with another dataframe in pandas

I want to map data (df2 & df1) using selective columns.
import pandas as pd

df_data = [
    {'id': '1234', 'task': 'data_trasnfer', 'filename': 'orance_bank', 'date': '17-3-22'},
    {'id': '234', 'task': 'data2trasnfer', 'filename': 'ftr_data', 'date': '16-03-2022'},
    {'id': '4567', 'task': 'data3_transfer', 'filename': 'trnienn_data', 'date': '15-2-22'},
]
df1 = pd.DataFrame(df_data)
df1
     id            task      filename        date
0  1234   data_trasnfer   orance_bank     17-3-22
1   234   data2trasnfer      ftr_data  16-03-2022
2  4567  data3_transfer  trnienn_data     15-2-22
df_data1 = [
    {'target': 'ed34', 'status': 'sucess', 'flow_in': 'ntfc_to_pad'},
    {'target': 'der456', 'status': 'error', 'flow_in': 'htr_tokid'},
]
df2 = pd.DataFrame(df_data1)
df2
   target  status      flow_in
0    ed34  sucess  ntfc_to_pad
1  der456   error    htr_tokid
Expected output:
In df2, ed34 should map only to filename orance_bank, and der456 should map only to trnienn_data.
     id            task      filename        date  target  status      flow_in
0  1234   data_trasnfer   orance_bank     17-3-22    ed34  sucess  ntfc_to_pad
1   234   data2trasnfer      ftr_data  16-03-2022
2  4567  data3_transfer  trnienn_data     15-2-22  der456   error    htr_tokid
First build a mapping, like this:
filemap = {
    "ed34": "orance_bank",
    "der456": "trnienn_data"
}
df2['filename'] = df2['target'].map(filemap)
Then merge the two dataframes:
df1.merge(df2, on='filename', how='outer').fillna('')
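Both steps together, as a runnable sketch on the question's data (the target-to-filename pairs are read off the expected output):
import pandas as pd

df1 = pd.DataFrame([
    {'id': '1234', 'task': 'data_trasnfer', 'filename': 'orance_bank', 'date': '17-3-22'},
    {'id': '234', 'task': 'data2trasnfer', 'filename': 'ftr_data', 'date': '16-03-2022'},
    {'id': '4567', 'task': 'data3_transfer', 'filename': 'trnienn_data', 'date': '15-2-22'},
])
df2 = pd.DataFrame([
    {'target': 'ed34', 'status': 'sucess', 'flow_in': 'ntfc_to_pad'},
    {'target': 'der456', 'status': 'error', 'flow_in': 'htr_tokid'},
])

# attach the filename each target belongs to, then merge on it
filemap = {'ed34': 'orance_bank', 'der456': 'trnienn_data'}
df2['filename'] = df2['target'].map(filemap)
print(df1.merge(df2, on='filename', how='outer').fillna(''))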

Recursively update the dataframe

I have a dataframe called datafe in which I want to combine the hyphenated words.
For example, the input dataframe looks like this:
,author_ex
0,Marios
1,Christodoulou
2,Intro-
3,duction
4,Simone
5,Speziale
6,Exper-
7,iment
And the output dataframe should look like this:
,author_ex
0,Marios
1,Christodoulou
2,Introduction
3,Simone
4,Speziale
5,Experiment
I have written sample code to achieve this, but I am not able to get out of the recursion safely.
def rm_actual(datafe, index):
    stem1 = datafe.iloc[index]['author_ex']
    stem2 = datafe.iloc[index + 1]['author_ex']
    fixed_token = stem1[:-1] + stem2
    datafe.drop(index=index + 1, inplace=True, axis=0)
    newdf = datafe.reset_index(drop=True)
    newdf.iloc[index]['author_ex'] = fixed_token
    return newdf

def remove_hyphens(datafe):
    for index, row in datafe.iterrows():
        flag = False
        token = row['author_ex']
        if token[-1:] == '-':
            datafe = rm_actual(datafe, index)
            flag = True
            break
    if flag == True:
        datafe = remove_hyphens(datafe)
    if flag == False:
        return datafe

datafe = remove_hyphens(datafe)
print(datafe)
Is there any possibility of getting out of this recursion with the expected output?
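For reference, the recursion above does terminate; the real problem is that remove_hyphens returns None whenever flag is True, because nothing is returned after the recursive call. A minimal sketch of a fixed version (it also writes with .loc instead of the chained iloc[...][...] assignment in rm_actual, which may silently fail to modify the frame):
import pandas as pd

def remove_hyphens(datafe):
    for index in range(len(datafe) - 1):
        token = datafe.loc[index, 'author_ex']
        if token.endswith('-'):
            # merge this row with the next, drop the leftover, and rescan
            datafe.loc[index, 'author_ex'] = token[:-1] + datafe.loc[index + 1, 'author_ex']
            datafe = datafe.drop(index + 1).reset_index(drop=True)
            return remove_hyphens(datafe)
    return datafe  # no hyphenated rows left

datafe = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-',
                                     'duction', 'Simone', 'Speziale',
                                     'Exper-', 'iment']})
print(remove_hyphens(datafe))  # assumes a default RangeIndex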
Another option:
Given/Input:
author_ex
0 Marios
1 Christodoulou
2 Intro-
3 duction
4 Simone
5 Speziale
6 Exper-
7 iment
Code:
import pandas as pd

# read/open file or create dataframe
df = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-',
                                 'duction', 'Simone', 'Speziale',
                                 'Exper-', 'iment']})
# check input format
print(df)
# new column 'Ending': True if the previous row's 'author_ex' ends with '-'
df['Ending'] = df['author_ex'].shift(1).str.contains('-$', na=False, regex=True)
# remove the trailing '-' from the 'author_ex' column
df['author_ex'] = df['author_ex'].str.replace('-$', '', regex=True)
# new column with 'author_ex' and the next row's 'author_ex' concatenated
df['author_ex_combined'] = df['author_ex'] + df.shift(-1)['author_ex']
# shift the True/False series up so it marks the rows that ended with '-'
index = (df['Ending'] == True).shift(-1)
# set the last row to 'False' after the shift
index.iloc[-1] = False
# replace 'author_ex' with 'author_ex_combined' on the marked rows
df.loc[index, 'author_ex'] = df['author_ex_combined']
# remove rows that held the 2nd part of a word and are no longer required
df = df[~df.Ending]
# remove the helper columns
df.drop(['Ending', 'author_ex_combined'], axis=1, inplace=True)
# output final dataframe
print('\n\n')
print(df)
# notice index 3 and 7 are missing
Outputs:
author_ex
0 Marios
1 Christodoulou
2 Introduction
4 Simone
5 Speziale
6 Experiment
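The same shift idea can be written more compactly; a sketch, assuming a word is split across at most two consecutive rows:
import pandas as pd

df = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-',
                                 'duction', 'Simone', 'Speziale',
                                 'Exper-', 'iment']})

mask = df['author_ex'].str.endswith('-')
# on hyphen rows, replace the value with its de-hyphenated self plus the next row
df.loc[mask, 'author_ex'] = (df['author_ex'].str.rstrip('-')
                             + df['author_ex'].shift(-1))[mask]
# drop the rows that held the second half of a word
df = df[~mask.shift(fill_value=False)].reset_index(drop=True)
print(df)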

Pandas get row if column is a substring of string

I can do the following if I want to extract rows whose column "A" contains the substring "hello".
df[df['A'].str.contains("hello")]
How can I select rows whose column value is a substring of another word? E.g.
df["hello".contains(df['A'].str)]
Here's an example dataframe:
df = pd.DataFrame.from_dict({"A":["hel"]})
df["hello".contains(df['A'].str)]
IIUC, you could apply str.find:
import pandas as pd
df = pd.DataFrame(['hell', 'world', 'hello'], columns=['A'])
res = df[df['A'].apply("hello".find).ne(-1)]
print(res)
Output
A
0 hell
2 hello
As an alternative, use __contains__:
res = df[df['A'].apply("hello".__contains__)]
print(res)
Output
A
0 hell
2 hello
Or simply:
res = df[df['A'].apply(lambda x: x in "hello")]
print(res)
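Note that all three variants treat any contiguous substring of "hello" as a match, not just prefixes, and the empty string matches as well. A quick demonstration (sketch):
import pandas as pd

df = pd.DataFrame({'A': ['hel', 'ell', 'xyz']})
print(df[df['A'].apply(lambda x: x in "hello")])
#      A
# 0  hel
# 1  ell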

Why is astype not changing the type of values?

df = pd.DataFrame({"A" : ["1", "7.0", "xyz"]})
type(df.A[0])
The result is "str".
df.A = df.A.astype(int, errors = "ignore")
type(df.A[0])
The result is also "str". I want to convert "1" and "7.0" to 1 and 7.
Where did I go wrong?
Why is astype not changing the type of values?
Because errors="ignore" works differently than you think: if the conversion fails, it returns the same values, so nothing changes.
If you want numeric values, with NaN where conversion fails:
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')
print(df)
      A
0     1
1     7
2  <NA>
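The capital-I 'Int64' here is pandas' nullable integer dtype, which is what lets <NA> sit alongside integers; a plain astype(int) on the coerced column would fail because of the NaN. A quick sketch:
import pandas as pd

df = pd.DataFrame({"A": ["1", "7.0", "xyz"]})
s = pd.to_numeric(df["A"], errors="coerce")  # 1.0, 7.0, NaN (float64)
print(s.astype("Int64"))                     # 1, 7, <NA> (nullable integers)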
For mixed values (numbers with strings):
def num(x):
    try:
        # parse via float first so strings like "7.0" convert cleanly
        return int(float(x))
    except ValueError:
        return x

df['A'] = df['A'].apply(num)
print(df)
     A
0    1
1    7
2  xyz