Can't seem to find the answer anywhere. I have a column 'q' within my dataframe that has both strings and floats. I would like to remove the string values from 'q' and move them into an existing string column 'comments'. Any help is appreciated.
I have tried:
df['comments']=[isinstance(x, str) for x in df.q]
I have also tried some str methods on q but to no avail. Any direction on this would be appreciated
If series is:
s=pd.Series([1.0,1.1,1.2,1.3,'this','is',1.4,'a',1.5,'comment'])
s
Out[24]:
0 1
1 1.1
2 1.2
3 1.3
4 this
5 is
6 1.4
7 a
8 1.5
9 comment
dtype: object
then the floats alone can be extracted with:
[e if type(e) is float else np.nan for e in s]
Out[25]: [1.0, 1.1, 1.2, 1.3, nan, nan, 1.4, nan, 1.5, nan]
And the comments can be extracted with:
[e if type(e) is not float else '' for e in s]
Out[26]: ['', '', '', '', 'this', 'is', '', 'a', '', 'comment']
This is what you are trying to do.
But element-wise iteration with pandas does not scale well, so extract floats only using:
pd.to_numeric(s,errors='coerce')
Out[27]:
0 1.0
1 1.1
2 1.2
3 1.3
4 NaN
5 NaN
6 1.4
7 NaN
8 1.5
9 NaN
dtype: float64
and build both columns in a single frame with:
pd.to_numeric(s,errors='coerce').to_frame('floats').merge(s.loc[pd.to_numeric(s,errors='coerce').isnull()].to_frame('comments'), left_index=True, right_index=True, how='outer')
Out[71]:
floats comments
0 1.0 NaN
1 1.1 NaN
2 1.2 NaN
3 1.3 NaN
4 NaN this
5 NaN is
6 1.4 NaN
7 NaN a
8 1.5 NaN
9 NaN comment
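A shorter way to get the same two-column frame is to compute the numeric conversion once and reuse its null mask with Series.where. This is only a sketch: it assumes the series holds no genuine NaN values that must be told apart from strings, and the names floats and out are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 1.1, 'this', 1.4, 'a'])

# Convert once; anything that is not numeric becomes NaN.
floats = pd.to_numeric(s, errors='coerce')

# Keep the original value only where the conversion failed.
out = pd.DataFrame({
    'floats': floats,
    'comments': s.where(floats.isna()),
})
print(out)
```

The same caveat as for the merge version applies: strings that look like numbers end up in the floats column.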
Note that pd.to_numeric(s, errors='coerce') has a side effect: it converts every string that is a valid float literal into a float instead of keeping it as a string.
pd.to_numeric(pd.Series([1.0,1.1,1.2,1.3,'this','is',1.4,'a',1.5,'comment','12.345']), errors='coerce')
Out[73]:
0 1.000
1 1.100
2 1.200
3 1.300
4 NaN
5 NaN
6 1.400
7 NaN
8 1.500
9 NaN
10 12.345 <--- this is now the float 12.345 not str
dtype: float64
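If you don't want to convert strings with float literals into floats, you can also use the str.isnumeric() method. For non-string elements it returns NaN, so the comparison with False below selects only the strings that do not consist purely of digits: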
df = pd.DataFrame({'q':[1.5,2.5,3.5,'a', 'b', 5.1,'3.55','1.44']})
df['comments'] = df.loc[df['q'].str.isnumeric()==False, 'q']
In [4]: df
Out[4]:
q comments
0 1.5 NaN
1 2.5 NaN
2 3.5 NaN
3 a a
4 b b
5 5.1 NaN
6 3.55 3.55 <-- strings are not converted into floats
7 1.44 1.44
Or something like this:
criterion = df.q.apply(lambda x: isinstance(x,str))
df['comments'] = df.loc[criterion, 'q']
Again, it won't convert strings into floats.
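Putting this together for the original question, here is a minimal sketch that moves the strings from 'q' into an existing 'comments' column and leaves NaN behind in 'q'. The sample data and the is_str name are made up for illustration. It overwrites whatever comment was already on those rows, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 'q' mixes floats and strings.
df = pd.DataFrame({
    'q': [1.5, 'a', 2.5, '3.55'],
    'comments': ['', '', 'existing', ''],
})

# True where the element of 'q' is a Python string.
is_str = df['q'].map(lambda x: isinstance(x, str))

# Copy the strings over, then blank them out in 'q'.
df.loc[is_str, 'comments'] = df.loc[is_str, 'q']
df.loc[is_str, 'q'] = np.nan

# Only floats and NaN remain, so the column can become numeric.
df['q'] = df['q'].astype(float)
print(df)
```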
I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
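import random
import numpy as np
import pandas as pd
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    # each series gets two random values on a random 2-element subset of the index 0..2
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2) * 10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series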
Output:
[1 10
2 0
dtype: int32,
2 0
0 20
dtype: int32,
2 20
1 0
dtype: int32,
2 0
0 10
dtype: int32,
1 20
2 10
dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
    col = cols[i]
    table[col] = series[i]
table
Output:
a b c d e f g
1 10 NaN 0 NaN 20 NaN NaN
2 0 0 20 0 10 NaN NaN
because the assignment won't add any more indexes after the first series is assigned
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
a b c d e f g
0 NaN 20.0 NaN 10.0 NaN NaN NaN
1 10.0 NaN 0.0 NaN 20.0 NaN NaN
2 0.0 0.0 20.0 0.0 10.0 NaN NaN
But this is quite a roundabout way of doing it, and it still isn't robust to column name collisions, so you'd have to add more code to fiddle with column names and protect against that. This produces verbose and ugly code for what is conceptually a very simple operation, so I believe there must be a better way of doing it.
pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)
# Output
0 1 2 3 4
0 10.0 0.0 0.0 NaN 10.0
1 NaN 10.0 NaN 0.0 20.0
2 0.0 NaN 0.0 0.0 NaN
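To get the column names from cols instead of 0..4, and to keep the unused columns, the concat call can be combined with keys and reindex. This is a sketch with fixed sample series (not the random ones above), and it builds a fresh table rather than updating an existing one.

```python
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# Fixed sample data so the result is reproducible.
series = [
    pd.Series([10, 0], index=[1, 2]),
    pd.Series([0, 20], index=[2, 0]),
    pd.Series([20, 0], index=[2, 1]),
]

table = (
    pd.concat(series, axis=1, keys=cols[:len(series)])  # name the columns a, b, c
    .reindex(columns=cols)                               # add the remaining columns as NaN
    .sort_index()                                        # put the union of indexes in order
)
print(table)
```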
You could try constructing the dataframe using a dict comprehension like this:
series:
[0 10
1 0
dtype: int64,
0 0
1 0
dtype: int64,
2 20
0 0
dtype: int64,
0 20
2 0
dtype: int64,
0 0
1 0
dtype: int64]
code:
table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table
output:
a b c d e
0 10.0 0.0 0.0 20.0 0.0
1 0.0 0.0 NaN NaN 0.0
2 NaN NaN 20.0 0.0 NaN
If you really need the nan columns at the end you could do:
table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})
Output:
a b c d e f g
0 10.0 0.0 0.0 20.0 0.0 NaN NaN
1 0.0 0.0 NaN NaN 0.0 NaN NaN
2 NaN NaN 20.0 0.0 NaN NaN NaN
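If the table already holds data that must be kept, one option is to grow its index to the union with the series index before assigning. This is a sketch with made-up data; the names table and new are illustrative.

```python
import pandas as pd

# Existing table with data in both columns.
table = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[0, 1])
# Replacement for column 'a', carrying an index label the table lacks.
new = pd.Series([10, 20], index=[1, 2])

# Add any missing index labels first, then replace the column.
table = table.reindex(table.index.union(new.index))
table['a'] = new
print(table)
```

Column 'a' is fully replaced (row 0 becomes NaN because the series has no value there), while column 'b' keeps its values and gets NaN only in the newly added row.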
I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 Nan 0.05
1 Nan 0.05
2 0.16 Nan
3 0.16 Nan
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 Nan 9
1 Nan 6
2 3 Nan
3 4 Nan
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking using DataFrame.notna.
# df2 = df2.astype(float) # This is needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
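For reference, here is a self-contained version of the masking approach, written with DataFrame.where, which should give the same result as df1[m]. The frames are rebuilt from the question, assuming df2 really holds NaN and not the string 'Nan'.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({
    'A': [np.nan, np.nan, 0.16, 0.16],
    'B': [0.05, 0.05, np.nan, np.nan],
})

# Keep df1 where df2 has a value; everything else becomes NaN.
df3 = df1.where(df2.notna())
print(df3)
```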
I want to clean some data by replacing only CONSECUTIVE 0s in a data frame
Given:
import pandas as pd
import numpy as np
d = [[1,np.nan,3,4],[2,0,0,np.nan],[3,np.nan,0,0],[4,np.nan,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where column c & d are affected but column b is NOT affected as it only has 1 zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.
Let us do shift with mask
df=df.mask((df.shift().eq(df)|df.eq(df.shift(-1)))&(df==0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
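The one-liner can be unpacked into named steps. This is a sketch of the same idea, not a different algorithm: a zero is masked only if the cell directly above or below it in the same column is also zero. The intermediate names is_zero and has_zero_neighbour are made up for readability.

```python
import numpy as np
import pandas as pd

d = [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])

# True where a cell is exactly 0 (NaN compares as False).
is_zero = df.eq(0)

# True where the previous or the next row in the same column is 0.
has_zero_neighbour = (
    is_zero.shift(1, fill_value=False) | is_zero.shift(-1, fill_value=False)
)

# Replace only zeros that are part of a run of two or more.
result = df.mask(is_zero & has_zero_neighbour)
print(result)
```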
Suppose there are three columns of data: the first is a category id, and the second and third have some missing values. I want to group by the first column and, within each group, fill the missing values in the third column using the 'ffill' method.
I found a good idea here: "Pandas: filling missing values by weighted average in each group!", but it didn't solve my problem because the output it produced was not what I wanted.
Enter the following code to get an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['A','A', 'B','B','B','B', 'C','C','C'],'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
'sss':[1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
Fill in missing values with a previous value after grouping
Then I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis = 0,method = 'ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
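One likely cause: the transform runs over every non-key column, so it returns both 'value' and 'sss', and the 'sss' output shown above looks like the forward-filled 'value' column. A minimal sketch of a fix, assuming the goal is only to fill 'sss' within each 'name' group, is to select that column before filling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan],
})

# Forward-fill 'sss' only, and only within each 'name' group.
df['sss'] = df.groupby('name')['sss'].ffill()
print(df)
```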
I have a pandas DataFrame compiled from some web data (for tennis games) that exhibits strange behaviour when summing across selected rows.
DataFrame:
In [178]: tdf.shape
Out[178]: (47028, 57)
In [201]: cols
Out[201]: ['L1', 'L2', 'L3', 'L4', 'L5', 'W1', 'W2', 'W3', 'W4', 'W5']
In [177]: tdf[cols].head()
Out[177]:
L1 L2 L3 L4 L5 W1 W2 W3 W4 W5
0 4.0 2 NaN NaN NaN 6.0 6 NaN NaN NaN
1 3.0 3 NaN NaN NaN 6.0 6 NaN NaN NaN
2 7.0 5 3 NaN NaN 6.0 7 6 NaN NaN
3 1.0 4 NaN NaN NaN 6.0 6 NaN NaN NaN
4 6.0 7 4 NaN NaN 7.0 5 6 NaN NaN
I then try to compute the sum over the rows using tdf[cols].sum(axis=1). From the table above, the sum for the first row should be 18.0, but it is reported as 10.0, as shown below:
In [180]: tdf[cols].sum(axis=1).head()
Out[180]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
The problem seems to be caused by a specific record (row 13771), because when I exclude this row, the sum is calculated correctly:
In [182]: tdf.iloc[:13771][cols].sum(axis=1).head()
Out[182]:
0 18.0
1 18.0
2 34.0
3 17.0
4 35.0
dtype: float64
whereas, including it:
In [183]: tdf.iloc[:13772][cols].sum(axis=1).head()
Out[183]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
Gives the wrong result for the entire column.
The offending record is as follows:
In [196]: tdf[cols].iloc[13771]
Out[196]:
L1 1
L2 1
L3 NaN
L4 NaN
L5 NaN
W1 6
W2 0
W3
W4 NaN
W5 NaN
Name: 13771, dtype: object
In [197]: tdf[cols].iloc[13771].W3
Out[197]: ' '
In [198]: type(tdf[cols].iloc[13771].W3)
Out[198]: str
I'm running the following versions:
In [192]: sys.version
Out[192]: '3.4.3 (default, Nov 17 2016, 01:08:31) \n[GCC 4.8.4]'
In [193]: pd.__version__
Out[193]: '0.19.2'
In [194]: np.__version__
Out[194]: '1.12.0'
Surely a single poorly formatted record should not influence the sum of other records? Is this a bug or am I doing something wrong?
Help much appreciated!
The problem is the empty string: because of it, column W3 has dtype object (it holds a string), so sum omits that column.
Solutions:
Replace the problematic empty-string value with NaN and then cast the column to float:
tdf.loc[13771, 'W3'] = np.nan
tdf.W3 = tdf.W3.astype(float)
Or replace all empty strings with NaN in the subset cols (if the stored value is a space, as the output above suggests, use that exact string as the key):
tdf[cols] = tdf[cols].replace({'':np.nan})
#if necessary
tdf[cols] = tdf[cols].astype(float)
Another solution is to use to_numeric on the problematic column, which replaces all non-numeric values with NaN:
tdf.W3 = pd.to_numeric(tdf.W3, errors='coerce')
Or apply it generally to all columns in cols:
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
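A small end-to-end sketch of the last approach follows, using a made-up three-column frame in place of the real tennis data. One cell holds a whitespace string, which to_numeric with errors='coerce' turns into NaN, so the row sums afterwards cover every column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the problem: one stray string in W3.
tdf = pd.DataFrame({
    'L1': [4.0, 1.0],
    'W1': [6.0, 6.0],
    'W3': [np.nan, ' '],
})
cols = ['L1', 'W1', 'W3']

# Coerce every selected column to numeric; non-numeric cells become NaN.
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))

# NaN is skipped by default, so each row sums its numeric cells.
row_sums = tdf[cols].sum(axis=1)
print(row_sums)
```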