Add column to pandas df depending on a condition statement - pandas

I have a df called lin_reg_df with two columns Surface_Elevation_mAHD and Adopted_SS_WL.
The df holds measurements of each for 88 groundwater wells that each have a particular well name.
lin_reg_df is indexed by well name.
I want to add another column to the df that is called Aquifer_Type and specifies if the well is deep or shallow.
The well names of all the deep wells are held in a list called deep_wells, and the shallow wells are held in shallow_wells.
I want to cycle through the well names (the index of the df) and, if the well name is listed in the
list called deep_wells, put 'deep' in the Aquifer_Type column. If it is listed
in the list called shallow_wells, put 'shallow' in the Aquifer_Type column.
I tried using isin within the loop but I couldn't get it to work.
Any advice?

I tried to create a minimal example from your information, with the following code:
well_name = ['a','b','c','d','e','f','g']
deep_wells = ['a','c','g']
shallow_wells = ['b','d','e','f']
lin_reg_df = pd.DataFrame({'Surface_Elevation_mAHD': [1,2,3,4,5,6,7],
                           'Adopted_SS_WL': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]})
lin_reg_df.index = well_name
so my DataFrame initially looks like this:
Surface_Elevation_mAHD Adopted_SS_WL
a 1 0.1
b 2 0.2
c 3 0.3
d 4 0.4
e 5 0.5
f 6 0.6
g 7 0.7
I would then just use this code snippet to do the job:
for well in well_name:
    if well in deep_wells:
        lin_reg_df.loc[well, 'Aquifer_Type'] = 'deep'
    elif well in shallow_wells:
        lin_reg_df.loc[well, 'Aquifer_Type'] = 'shallow'
The output will be:
Surface_Elevation_mAHD Adopted_SS_WL Aquifer_Type
a 1 0.1 deep
b 2 0.2 shallow
c 3 0.3 deep
d 4 0.4 shallow
e 5 0.5 shallow
f 6 0.6 shallow
g 7 0.7 deep

Always include sample data to get help faster
Data
df=pd.DataFrame({'Surface_Elevation_mAHD':[3,4,6,7.8,9,2,5],'Adopted_SS_WL':[1,2,3,4,5,6,7],'well name':['WQ','RT','KL','SZ','TR','YH','YP']})
df
Lists
deep_wells=['WQ','RT','YH','YP']
shallow_wells=['KL','SZ','TR']
Another way is to use np.where, whose signature is numpy.where(condition, value_if_true, value_if_false):
df['Aquifer_Type'] = np.where(df['well name'].isin(shallow_wells), 'shallow', 'deep')
df
If well name is the index, reset it before applying np.where. You can do that with df.reset_index(inplace=True).
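If some wells might appear in neither list, np.select is another option: it takes a list of conditions plus a fallback label. A minimal sketch, assuming the df and lists defined above (the 'unknown' default is just an illustrative choice):
import numpy as np

# Conditions are checked in order; rows matching none get the default.
conditions = [df['well name'].isin(deep_wells),
              df['well name'].isin(shallow_wells)]
df['Aquifer_Type'] = np.select(conditions, ['deep', 'shallow'], default='unknown')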

Related

Some trouble as a beginner [duplicate]

This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
What's the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15
To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
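As a quick illustration of the dtype point, here is a small sketch using the df_test frame above, which mixes int and float columns:
# Row-first: the int columns are upcast so the whole row fits one dtype
df_test.iloc[0].dtype          # float64
# Column-first: the column keeps its original dtype
df_test['X'].dtype             # int64
df_test['Btime'].iloc[0]       # 1.2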
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally raise a SettingWithCopyWarning, even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'
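Since .ix has been removed from recent versions of pandas, here is a small sketch of the same label-versus-position distinction using .loc and .iloc on that frame:
df = pd.DataFrame({'foo': list('ABC')}, index=[0, 2, 1])
df.loc[1, 'foo']       # 'C' -- label-based: the row whose index label is 1
df['foo'].iloc[1]      # 'B' -- position-based: the second row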
Note that the answer from @unutbu will be correct until you want to set the value to something new; then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100
Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.iloc[0].head(1) - only the first field of the entire first row.
df.iloc[0] - the entire first row.
More generally, if you want to pick up the first N rows of column J (by position) from a pandas DataFrame, you can do:
data = dataframe.iloc[0:N, J]
To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take, which expects a list of positions and returns a Series:
df['Btime'].take([0])
.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.
To get, e.g., the value from column 'test' and row 1, you can use df[['test']].values[0][0],
since df[['test']].values[0] gives back an array.
Another way of getting the first row and preserving the index (note this only works with a DatetimeIndex):
x = df.first('d') # Returns the first day. '3d' gives the first three days.
According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer, because dataframes don't necessarily have a range index it might be more complete to index df.index (since dataframe indexes are built on numpy arrays, you can index them like an array) or call get_loc() on columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
A common situation is that you used a boolean mask to get a single value but ended up with a one-element Series (a value that still carries its index); e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()
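For a concrete picture, a minimal sketch with the df_test frame from above, where only one row has Btime < 1.3:
masked = df_test.loc[df_test['Btime'] < 1.3, 'Btime']
masked            # 0    1.2  -- still a one-element Series
masked.squeeze()  # 1.2       -- a plain scalar
# If the mask had matched several rows, squeeze() would return the
# Series unchanged rather than a scalar.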

Locating column of dataframe in a dataframe

I have a dataframe in a dataframe similar to this one (my real one is much larger)
df_peter = pd.DataFrame({"height": [50,np.nan,65], "weight": [20,25,27]})
df_anna = pd.DataFrame({"height": [47,55,np.nan], "weight": [18,23,30]})
df = pd.DataFrame({"Name":["Peter", "Anna"], "Year":[2000, 2002], "Data":[df_peter, df_anna]})
df
Name Year Data
0 Peter 2000 height weight 0 50.0 20 1 NaN ...
1 Anna 2002 height weight 0 47.0 18 1 55.0 ...
My final goal is to use the .fillna(method="ffill") function on the height column of Peter and Anna,
so I need a way to locate these two height columns.
It would also be practical to plot the data.
Using .fillna(method="ffill") for one row is easy:
df.loc[0,"Data"]["height"].fillna(method="ffill")
But for both/all rows it is not that easy, because df["Data"]["height"] doesn't work.
Actually, I found a solution while writing this question:
df["Data"].apply(lambda x: x["height"].fillna(method = "ffill", inplace = True))
But is there a way to do this without apply and lambda?
As an alternative, you can use a loop:
for i in range(len(df)):
    df.loc[i, "Data"]["height"] = df.loc[i, "Data"]["height"].fillna(method="ffill")

How to split this data into neat dataframe in pandas. NOTE the dtype is object [duplicate]

I am trying to split a column into multiple columns based on comma/space separation.
My dataframe currently looks like
KEYS 1
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420, 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490, 4000.0499, 4000.0500, 4000.0504,
I would like
KEYS 1 2 3 4
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490 4000.0499 4000.0500 4000.0504
My code currently removes the KEYS column and I'm not sure why. Could anyone improve or help fix the issue?
v = dfcleancsv[1]
#splits the columns by spaces into new columns but removes KEYS?
dfcleancsv = dfcleancsv[1].str.split(' ').apply(Series, 1)
In case someone else wants to split a single column (delimited by a value) into multiple columns, try this:
series.str.split(',', expand=True)
This answered the question I came here looking for.
Credit to EdChum's code that includes adding the split columns back to the dataframe.
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
Note: the first argument, df[[0]], is a DataFrame.
The second argument, df[1].str.split, is the Series that you want to split.
split Documentation
concat Documentation
Using EdChum's answer of
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
I was able to solve it by substituting my variables.
dfcleancsv = pd.concat([dfcleancsv['KEYS'], dfcleancsv[1].str.split(', ', expand=True)], axis=1)
The OP had a variable number of output columns.
In the particular case of a fixed number of output columns, another elegant way to name the resulting columns is to use multiple assignment.
Load a sample dataset and reshape it to long format to obtain a variable
called organ_dimension.
import seaborn
iris = seaborn.load_dataset('iris')
df = iris.melt(id_vars='species', var_name='organ_dimension', value_name='value')
Split the organ_dimension variable in 2 variables organ and dimension based on the _ separator.
df[['organ', 'dimension']] = df['organ_dimension'].str.split('_', expand=True)
df.head()
Out[10]:
species organ_dimension value organ dimension
0 setosa sepal_length 5.1 sepal length
1 setosa sepal_length 4.9 sepal length
2 setosa sepal_length 4.7 sepal length
3 setosa sepal_length 4.6 sepal length
4 setosa sepal_length 5.0 sepal length
Based on this answer "How to split a column into two columns?"
Another simple option is to expand the split values into a Series with apply:
df = df[1].str.split(', ').apply(pd.Series)
Maybe this should work:
df = pd.concat([df['KEYS'], df[1].str.split(', ').apply(pd.Series)], axis=1)
Check this out
Responder_id LanguagesWorkedWith
0 1 HTML/CSS;Java;JavaScript;Python
1 2 C++;HTML/CSS;Python
2 3 HTML/CSS
3 4 C;C++;C#;Python;SQL
4 5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
... ... ...
87564 88182 HTML/CSS;Java;JavaScript
87565 88212 HTML/CSS;JavaScript;Python
87566 88282 Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...
87567 88377 HTML/CSS;JavaScript;Other(s):
87568 88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...
Split the LanguagesWorkedWith column into multiple columns by using data = data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series):
data1 = pd.read_csv('data.csv', sep=',')
data1.set_index('Responder_id', inplace=True)
data1
data1.loc[1,:]
data = data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)
data.head()
You may also want to try datar, a package that ports dplyr, tidyr, and related R packages to Python:
>>> df
i j A
<object> <int64> <object>
0 AR 5 Paris,Green
1 For 3 Moscow,Yellow
2 For 4 NewYork,Black
>>> from datar import f
>>> from datar.tidyr import separate
>>> separate(df, f.A, ['City', 'Color'])
i j City Color
<object> <int64> <object> <object>
0 AR 5 Paris Green
1 For 3 Moscow Yellow
2 For 4 NewYork Black

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas Series using a user-defined function and to write the output into a new column. I figured out the individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
the right result for these particular strings is 2
Note: when I wrap this in def func(x, y), something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
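Another way to express the same count without apply, shown here only as a sketch on the same df: pair each code with the previous one via shift and use a list comprehension.
# Count how many characters of the current code appear in the previous code;
# the first row has no predecessor, so it gets None.
prev = df['Code'].shift(1)
df['R1SB'] = [
    None if pd.isna(p) else sum(ch in p for ch in c)
    for c, p in zip(df['Code'], prev)
]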

Pipelining pandas: create columns that depend on freshly created ones

Let's say you have the following DataFrame
df=pd.DataFrame({'A': [1, 2]})
Now I want to construct the column B = A+1, then the column C = A+2, and D = B+C. These calculations are only here for simplicity. Typically, I want to use e.g. nonlinear transformations, normalizations, etc.
What one could do is the following:
df.assign(**{'B': lambda x: x['A'] + 1, 'C': lambda x: x['A'] + 2})\
  .assign(**{'D': lambda x: x['B'] + x['C']})
However, this is obviously a bit annoying, especially if you have a large number of preprocessing steps in a pipeline. Putting both dictionaries together (even in an OrderedDict) fails.
Is there a way to obtain a similar result faster or more elegantly?
Additionally, the same problem occurs, if you want to add a column that uses e.g. the sum of a just defined column. This, as far as I know, will always require two assign calls.
You can use eval:
df.eval("""
B= A+1
C= A+2
D = B+C""", inplace=False)
Out[625]:
A B C D
0 1 2 3 5
1 2 3 4 7
If you want to use a calculation (e.g. an aggregation) within the expression:
df.eval('B=A.max()',inplace=True)
df
Out[647]:
A B
0 1 2
1 2 2
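Back to the original question about assign: since pandas 0.23 (on Python 3.6+), the keyword arguments of a single assign call are evaluated in order, so later lambdas can already see the columns created by earlier ones. A minimal sketch of the original B/C/D example using that behaviour:
df = pd.DataFrame({'A': [1, 2]})
out = df.assign(
    B=lambda x: x['A'] + 1,
    C=lambda x: x['A'] + 2,
    D=lambda x: x['B'] + x['C'],   # B and C already exist at this point
)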