I have a data frame below:
import pandas as pd
df = pd.DataFrame({"SK":["EYF","EYF","RMK","MB","RMK","GYF","RMK","MYF"],
"SA":["a","b","tm","tmb","tm","cd","tms","alb"],
"C":["","11","12","13","","15","16","17"]})
df
I want to map some values of "SK","SA" and "C" to a new column:
df["D"]= df["SK"].map({"EYF":1,"MB":2,"GYF":3})
df
df["D"]= df["SA"].map({"tm":4})
df
df["D"]= df["C"].map({"16":5,"17":6})
df
But each time I run the next map call, the "D" values set by the previous map call turn to NaN.
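For example, right after the second map call the column looks like this (each assignment rebuilds the whole column from map, so rows not covered by the current dict become NaN):
print(df["D"])
0    NaN
1    NaN
2    4.0
3    NaN
4    4.0
5    NaN
6    NaN
7    NaN
Name: D, dtype: float64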
I want to get the df below:
Any help will be appreciated.
You can create 3 Series and then fill missing values from the previous Series with Series.fillna or Series.combine_first:
a = df["SK"].map({"EYF":1,"MB":2,"GYF":3})
b = df["SA"].map({"tm":4})
c = df["C"].map({"16":5,"17":6})
df["D"] = a.fillna(b).fillna(c)
#alternative
df["D"] = a.combine_first(b).combine_first(c)
print (df)
    SK   SA   C    D
0  EYF    a      1.0
1  EYF    b  11  1.0
2  RMK   tm  12  4.0
3   MB  tmb  13  2.0
4  RMK   tm      4.0
5  GYF   cd  15  3.0
6  RMK  tms  16  5.0
7  MYF  alb  17  6.0
The order is important: it sets the priority when more than one mapping matches the same row:
df = pd.DataFrame({"SK":["EYF","EYF"],
"SA":["a","tm"],
"C":["16","17"]})
a = df["SK"].map({"EYF":1,"MB":2,"GYF":3})
b = df["SA"].map({"tm":4})
c = df["C"].map({"16":5,"17":6})
df["D1"] = a.fillna(b).fillna(c)
df["D2"] = b.fillna(a).fillna(c)
df["D3"] = c.fillna(b).fillna(a)
print (df)
SK SA C D1 D2 D3
0 EYF a 16 1 1.0 5
1 EYF tm 17 1 4.0 6
I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add a new extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Possible solution, but I think inplace is not good practice, check this and this:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
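With either approach the original plot call should then work; a small sketch, reusing the column names from the question:
df3 = df3.reset_index()
#The former DatetimeIndex is now an ordinary column, so it can be selected
#plt.plot takes x first, then y - swap the arguments to put the time on the x-axis
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
plt.legend()
plt.show()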
I think you can set this up directly in read_csv:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #if it doesn't work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If the MultiIndex or Index comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
      C   D
A B
a c   1   9
  d   5  13
b e   9  17
  f  13  21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can directly access the index and plot it; the following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Plot the index on the horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Plot the index on the vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E
list1 = [ 1,2]
list2 = [2,3,4]
main = pd.DataFrame( columns = ['a','b'])
main = main.append(pd.DataFrame(list1, columns=['a']), ignore_index= True)
main = main.append(pd.DataFrame(list2, columns=['b']), ignore_index= True)
Output :
a b
1 NA
2 NA
NA 2
NA 3
NA 4
I expected the values of the two lists to appear in the same rows, each list in its own column.
Your approach of appending the list values one after the other is possible if you use concat with axis=1:
main = pd.DataFrame()
main = pd.concat([main, pd.DataFrame(list1, columns=['a'])], axis=1)
main = pd.concat([main, pd.DataFrame(list2, columns=['b'])], axis=1)
print (main)
a b
0 1.0 2
1 2.0 3
2 NaN 4
If possible, create the DataFrame in one step, converting the lists to Series:
main = pd.DataFrame({'a':pd.Series(list1), 'b':pd.Series(list2)})
print (main)
a b
0 1.0 2
1 2.0 3
2 NaN 4
import pandas as pd
import recordlinkage as rl
lst_left = [...]
lst_right = [...]
df_left = pd.DataFrame(lst_left, columns=pd.Index(["city_id", "street_name"]))
df_right = pd.DataFrame(lst_right, columns=pd.Index(["city_id", "street_name"]))
indexer = rl.Index()
indexer.block("city_id")
pairs = indexer.index(df_left, df_right)
compare = rl.Compare(indexing_type="label")
compare.string("street_name", "street_name", method="damerau_levenshtein", threshold=0.7)
features = compare.compute(pairs, df_left, df_right)
matches = features[features[0] == 1.0]
And I get the matched pairs as a MultiIndex:
Out[4]:
0
0 0 1.0
1 1 1.0
2 2 1.0
4 3 1.0
6 5 1.0
7 6 1.0
8 7 1.0
10 8 1.0
12 9 1.0
13 10 1.0
14 11 1.0
15 12 1.0
Now I want to left join (SQL left outer join) the df_left and df_right dataframes based on those matched pairs, keeping the unmatched rows from the df_left DataFrame.
How can I do that?
P.S. To get only matched records I use
df_left.loc[matches.index.get_level_values(0)].reset_index().merge(
    df_right.loc[matches.index.get_level_values(1)].reset_index(),
    how="left", left_index=True, right_index=True)
But I don't know how to merge and keep the unmatched rows from the left DataFrame.
Thank You
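One possible sketch (not from the original thread; variable and column names as above): build a mapping from each left-hand index to its matched right-hand index, then merge with how="left" so unmatched rows from df_left survive with NaNs:
#Build a {left index -> right index} mapping from the matched pairs
match_map = dict(zip(matches.index.get_level_values(0),
                     matches.index.get_level_values(1)))
tmp = df_left.copy()
tmp["_match"] = tmp.index.map(match_map)   #NaN where no match was found
result = tmp.merge(df_right, how="left",
                   left_on="_match", right_index=True,
                   suffixes=("_left", "_right")).drop(columns="_match")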
I'm trying to combine two dataframes using concat with axis=0, so the columns stay the same but the index grows. One of the dataframes contains a specific column with a serial number (counting up from one, but not necessarily consecutive, e.g. 1, 2, 3, 4, 5, etc.).
import pandas as pd
import numpy as np
a = pd.DataFrame(data = {'Name': ['A', 'B','C'],
'Serial Number': [1, 2,5]} )
b = pd.DataFrame(data = {'Name': ['D','E','F'],
'Serial Number': [np.nan,np.nan,np.nan]})
c = pd.concat([a,b],axis=0).reset_index()
I would like the 'Serial Number' column in dataframe c to continue from the last value: 5+1 for the next row, then 6+1, and so on.
I've tried a variety of things, e.g.:
c.loc[c['B'].isna(), 'B'] = c['B'].shift(1)+1
But it doesn't seem to work.
Desired output:
| Name | Serial Number|
-------------------------
1 A | 1
2 B | 2
3 C | 5
4 D | 6
5 E | 7
6 F | 8
One idea is to create an arange with the number of missing values and add the maximal value plus 1:
a = np.arange(c['Serial Number'].isna().sum()) + c['Serial Number'].max() + 1
c.loc[c['Serial Number'].isna(), 'Serial Number'] = a
print (c)
index Name Serial Number
0 0 A 1.0
1 1 B 2.0
2 2 C 5.0
3 0 D 6.0
4 1 E 7.0
5 2 F 8.0