Move strings within a mixed string and float column to a new column in Pandas

Can't seem to find the answer anywhere. I have a column 'q' within my dataframe that has both strings and floats. I would like to remove the string values from 'q' and move them into an existing string column 'comments'. Any help is appreciated.
I have tried:
df['comments']=[isinstance(x, str) for x in df.q]
I have also tried some str methods on q, but to no avail. Any direction on this would be appreciated.

If series is:
s=pd.Series([1.0,1.1,1.2,1.3,'this','is',1.4,'a',1.5,'comment'])
s
Out[24]:
0        1.0
1        1.1
2        1.2
3        1.3
4       this
5         is
6        1.4
7          a
8        1.5
9    comment
dtype: object
then the floats alone are:
[e if type(e) is float else np.nan for e in s]
Out[25]: [1.0, 1.1, 1.2, 1.3, nan, nan, 1.4, nan, 1.5, nan]
And the comments are:
[e if type(e) is not float else '' for e in s]
Out[26]: ['', '', '', '', 'this', 'is', '', 'a', '', 'comment']
This is what you are trying to do.
But element-wise iteration in pandas does not scale well, so instead extract the floats with:
pd.to_numeric(s,errors='coerce')
Out[27]:
0    1.0
1    1.1
2    1.2
3    1.3
4    NaN
5    NaN
6    1.4
7    NaN
8    1.5
9    NaN
dtype: float64
and then:
floats = pd.to_numeric(s, errors='coerce')
floats.to_frame('floats').merge(
    s.loc[floats.isnull()].to_frame('comments'),
    left_index=True, right_index=True, how='outer'
)
Out[71]:
   floats comments
0     1.0      NaN
1     1.1      NaN
2     1.2      NaN
3     1.3      NaN
4     NaN     this
5     NaN       is
6     1.4      NaN
7     NaN        a
8     1.5      NaN
9     NaN  comment
There is a side effect to pd.to_numeric(s, errors='coerce'): it converts any string containing a float literal to a float instead of keeping it as a string.
pd.to_numeric(pd.Series([1.0,1.1,1.2,1.3,'this','is',1.4,'a',1.5,'comment','12.345']), errors='coerce')
Out[73]:
0      1.000
1      1.100
2      1.200
3      1.300
4        NaN
5        NaN
6      1.400
7        NaN
8      1.500
9        NaN
10    12.345   <--- this is now the float 12.345, not the str '12.345'
dtype: float64

If you don't want to convert strings with float literals into floats, you can also use the str.isnumeric() method:
df = pd.DataFrame({'q':[1.5,2.5,3.5,'a', 'b', 5.1,'3.55','1.44']})
df['comments'] = df.loc[df['q'].str.isnumeric()==False, 'q']
In [4]: df
Out[4]:
      q comments
0   1.5      NaN
1   2.5      NaN
2   3.5      NaN
3     a        a
4     b        b
5   5.1      NaN
6  3.55     3.55   <-- strings are not converted into floats
7  1.44     1.44
Or something like this:
criterion = df.q.apply(lambda x: isinstance(x,str))
df['comments'] = df.loc[criterion, 'q']
Again, it won't convert strings into floats.
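Putting the isinstance approach together end to end, a minimal sketch (with a made-up q column, not the asker's data) that moves the strings into comments and leaves q purely float:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'q': [1.5, 2.5, 'a', 'b', 5.1, '3.55']})

# Rows whose value is a Python str (numeric-looking strings like '3.55' included)
is_str = df['q'].apply(lambda x: isinstance(x, str))

df['comments'] = df.loc[is_str, 'q']   # move the strings into 'comments'
df.loc[is_str, 'q'] = np.nan           # blank them out of 'q'
df['q'] = df['q'].astype(float)        # 'q' now holds only floats
```

Because the mask tests the Python type rather than parsing the text, '3.55' lands in comments as a string instead of being coerced to a float.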

Related

In pandas, replace table column with Series while joining indexes

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
import random
import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2) * 10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series
Output:
[1    10
 2     0
 dtype: int32,
 2     0
 0    20
 dtype: int32,
 2    20
 1     0
 dtype: int32,
 2     0
 0    10
 dtype: int32,
 1    20
 2    10
 dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
    col = cols[i]
    table[col] = series[i]
table
Output:
    a    b   c    d   e   f   g
1  10  NaN   0  NaN  20 NaN NaN
2   0    0  20    0  10 NaN NaN
because the assignment won't add any new indexes after the first series is assigned.
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
a b c d e f g
0 NaN 20.0 NaN 10.0 NaN NaN NaN
1 10.0 NaN 0.0 NaN 20.0 NaN NaN
2 0.0 0.0 20.0 0.0 10.0 NaN NaN
But this is quite a roundabout way of doing it, and it still isn't robust to column name collisions, so you'd have to add a bit more code to guard against those. That produces quite verbose and ugly code for what is conceptually a very simple operation; I believe there must be a better way.
pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)
# Output
0 1 2 3 4
0 10.0 0.0 0.0 NaN 10.0
1 NaN 10.0 NaN 0.0 20.0
2 0.0 NaN 0.0 0.0 NaN
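For instance, with a few stand-in series (made up here, since the original data is random), concat plus a reindex also recovers the intended column names and the trailing all-NaN columns:

```python
import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# Stand-in series with partially overlapping indexes
series = [pd.Series([10, 0], index=[1, 2]),
          pd.Series([0, 20], index=[2, 0]),
          pd.Series([20, 0], index=[2, 1])]

out = pd.concat(series, axis=1)   # outer-joins all the indexes
out.columns = cols[:len(series)]  # restore the intended column names
out = out.reindex(columns=cols)   # append the remaining all-NaN columns
```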
You could try constructing the dataframe using a dict comprehension like this:
series:
[0 10
1 0
dtype: int64,
0 0
1 0
dtype: int64,
2 20
0 0
dtype: int64,
0 20
2 0
dtype: int64,
0 0
1 0
dtype: int64]
code:
table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table
output:
a b c d e
0 10.0 0.0 0.0 20.0 0.0
1 0.0 0.0 NaN NaN 0.0
2 NaN NaN 20.0 0.0 NaN
If you really need the nan columns at the end you could do:
table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})
Output:
a b c d e f g
0 10.0 0.0 0.0 20.0 0.0 NaN NaN
1 0.0 0.0 NaN NaN 0.0 NaN NaN
2 NaN NaN 20.0 0.0 NaN NaN NaN

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
      A     B
0   NaN  0.05
1   NaN  0.05
2  0.16   NaN
3  0.16   NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
     A    B
0  NaN    9
1  NaN    6
2    3  NaN
3    4  NaN
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float) # This needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
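Equivalently, DataFrame.where applies the same mask in a single call. A sketch reconstructing the sample frames from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 0.16, 0.16],
                    'B': [0.05, 0.05, np.nan, np.nan]})

# where() keeps df1's values wherever the mask is True and puts NaN elsewhere
df3 = df1.where(df2.notna())
```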

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame
Given:
import pandas as pd
import numpy as np
d = [[1,np.NaN,3,4],[2,0,0,np.NaN],[3,np.NaN,0,0],[4,np.NaN,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where column c & d are affected but column b is NOT affected as it only has 1 zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.
Let us do shift with mask:
df = df.mask((df.shift().eq(df) | df.eq(df.shift(-1))) & (df == 0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
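To see why the single zero in column b survives: the mask only fires on a zero whose vertical neighbour (the cell directly above or below, via shift) holds an equal value, i.e. a zero that is part of a run of two or more. A self-contained sketch of the same answer:

```python
import numpy as np
import pandas as pd

d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])

# A zero is masked only when the cell above or below it in the same column
# is equal, i.e. it belongs to a run of two or more consecutive zeros.
run_zero = (df.shift().eq(df) | df.eq(df.shift(-1))) & (df == 0)
out = df.mask(run_zero)
```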

Filling missing values in each group with the previous value

There are three columns of data: the first column is a category id, and the second and third columns have some missing values. I want to group by the id in the first column, and then fill the missing values in the third column within each group using the 'ffill' method.
I found a good idea here: Pandas: filling missing values by weighted average in each group! , but it didn't solve my problem because the output it produced was not what I wanted
The following code sets up an example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
Fill in missing values with a previous value after grouping
Then I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis=0, method='ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
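Not from the original thread, but a likely diagnosis: groupby(...).transform(...) runs the fill over every non-grouping column, and the forward-filled value column is what ends up landing in sss. Selecting the single column before transforming avoids that:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})

# Forward-fill only the 'sss' column within each 'name' group
df['sss'] = df.groupby('name')['sss'].ffill()
```

This reproduces the desired output above: each group carries its first observed sss value forward.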

Strange pandas.DataFrame.sum(axis=1) behaviour

I have a pandas DataFrame compiled from some web data (for tennis games) that exhibits strange behaviour when summing across selected rows.
DataFrame:
In [178]: tdf.shape
Out[178]: (47028, 57)
In [201]: cols
Out[201]: ['L1', 'L2', 'L3', 'L4', 'L5', 'W1', 'W2', 'W3', 'W4', 'W5']
In [177]: tdf[cols].head()
Out[177]:
L1 L2 L3 L4 L5 W1 W2 W3 W4 W5
0 4.0 2 NaN NaN NaN 6.0 6 NaN NaN NaN
1 3.0 3 NaN NaN NaN 6.0 6 NaN NaN NaN
2 7.0 5 3 NaN NaN 6.0 7 6 NaN NaN
3 1.0 4 NaN NaN NaN 6.0 6 NaN NaN NaN
4 6.0 7 4 NaN NaN 7.0 5 6 NaN NaN
I then try to compute the sum over the rows using tdf[cols].sum(axis=1). From the above table, the sum for the first row should be 18.0, but it is reported as 10, as below:
In [180]: tdf[cols].sum(axis=1).head()
Out[180]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
The problem seems to be caused by a specific record (row 13771), because when I exclude this row, the sum is calculated correctly:
In [182]: tdf.iloc[:13771][cols].sum(axis=1).head()
Out[182]:
0 18.0
1 18.0
2 34.0
3 17.0
4 35.0
dtype: float64
whereas, including it:
In [183]: tdf.iloc[:13772][cols].sum(axis=1).head()
Out[183]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
gives the wrong result for the entire column.
The offending record is as follows:
In [196]: tdf[cols].iloc[13771]
Out[196]:
L1 1
L2 1
L3 NaN
L4 NaN
L5 NaN
W1 6
W2 0
W3
W4 NaN
W5 NaN
Name: 13771, dtype: object
In [197]: tdf[cols].iloc[13771].W3
Out[197]: ' '
In [198]: type(tdf[cols].iloc[13771].W3)
Out[198]: str
I'm running the following versions:
In [192]: sys.version
Out[192]: '3.4.3 (default, Nov 17 2016, 01:08:31) \n[GCC 4.8.4]'
In [193]: pd.__version__
Out[193]: '0.19.2'
In [194]: np.__version__
Out[194]: '1.12.0'
Surely a single poorly formatted record should not influence the sum of other records? Is this a bug or am I doing something wrong?
Help much appreciated!
The problem is the empty string: it makes the dtype of column W3 object (i.e. string), and sum omits it.
Solutions:
Try replacing the problematic empty string value with NaN and then casting to float:
tdf.loc[13771, 'W3'] = np.nan
tdf.W3 = tdf.W3.astype(float)
Or replace all empty or whitespace-only strings with NaN in the subset cols (note the offending value here is ' ', a single space, so a plain '' replacement would miss it):
tdf[cols] = tdf[cols].replace(r'^\s*$', np.nan, regex=True)
# if necessary
tdf[cols] = tdf[cols].astype(float)
Another solution is to use to_numeric on the problematic column, replacing everything non-numeric with NaN:
tdf.W3 = pd.to_numeric(tdf.W3, errors='coerce')
Or apply it generally over the cols columns:
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
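Applied to a tiny stand-in frame (two made-up rows, not the real 47,028-row data), the column-wise coercion makes the row sums come out right:

```python
import numpy as np
import pandas as pd

cols = ['W1', 'W2', 'W3']
# Stand-in for the real data: W3 contains a lone blank string, so its dtype is object
tdf = pd.DataFrame({'W1': [6, 6], 'W2': [6, 0], 'W3': [np.nan, ' ']})

# Coerce every column to numeric; the blank string becomes NaN
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
row_sums = tdf[cols].sum(axis=1)   # NaN is skipped, numeric values are summed
```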