iteration through column values in python w.r.t index - pandas

[image: dataframe with commodities (Natural gas, Copper, Nickel, Aluminium) as the index and candlestick-pattern columns containing True or blank]
Where a column is True, I want the column name.
Where more than one column is True: the column names separated with '/'.
Blank cells should be an empty string.
Required output:
Natural gas: Dark cloud
Copper: Bearish Harami
Nickel: bearish belthold
Aluminium: inverted hammer/bearish harami

You can do this using pandas.DataFrame.apply() with axis=1.
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, True], 'b': [True, np.nan], 'c': [True, np.nan]})
>>> df
      a     b     c
0   NaN  True  True
1  True   NaN   NaN
df_ = df.apply(lambda x: ' / '.join(x.dropna().index), axis=1).to_frame()
>>> df_
       0
0  b / c
1      a
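For the table in the question, the same pattern gives the required strings. This is only a sketch - the column names and the True/blank layout are assumed from the screenshot:
import numpy as np
import pandas as pd

# Assumed reconstruction of the screenshot: commodities as the index,
# candlestick patterns as columns, blank cells as NaN.
df = pd.DataFrame(
    {'Dark cloud':      [True,   np.nan, np.nan],
     'Bearish Harami':  [np.nan, True,   True],
     'Inverted hammer': [np.nan, np.nan, True]},
    index=['Natural gas', 'Copper', 'Aluminium'])

# Join the names of the True columns per row; rows with no True become ''.
patterns = df.apply(lambda row: '/'.join(row.dropna().index), axis=1)
print(patterns)
# Natural gas                        Dark cloud
# Copper                         Bearish Harami
# Aluminium      Bearish Harami/Inverted hammer
# dtype: object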
I hope this is what you are asking for.

Related

Dataframe columns cleaning

I am trying to clean a number of columns in a dataset and to iterate over the different columns.
import pandas as pd

df = pd.DataFrame({
    'A': ['7.3\\N\\P', 'nan\\T\\Z', '11.0\\R\\Z'],
    'B': ['nan\\J\\N', 'nan\\A\\G', '10.8\\F\\U'],
    'C': ['12.4\\A\\I', '13.3\\H\\Z', '8.200000000000001\\B\\W']})

for name, values in df.iloc[:, 0:3].iteritems():
    def myreplace(s):
        for char in ['\A','\B','\C','\D','\E','\F','\G','\H','\I',
                     '\J','\K','\L','\M','\\N','\O','\P','\Q','\R',
                     '\S','\T','\V','\W','\X','\Y','\Z','\\U']:
            s = s.map(lambda x: x.replace(char, ''))
        return s
    df = df.apply(myreplace)
I get the error: 'float' object has no attribute 'replace'.
I could run this part on one column and it works, but I need to run it on several columns, and then it does not work because I get the error that 'DataFrame' object has no attribute 'str':
df_data.str.replace('[\\\|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z]', '')
I am really new to Python and pandas DataFrames. I will appreciate the help.
Given, assuming the goal is to extract numbers from the strings:
A B C
0 7.3\N\P nan\J\N 12.4\A\I
1 nan\T\Z nan\A\G 13.3\H\Z
2 11.0\R\Z 10.8\F\U 8.200000000000001\B\W
Doing:
cols = ['A', 'B', 'C']
for col in cols:
    df[col] = df[col].str.extract(r'(\d*\.\d*)').astype(float)
Output:
A B C
0 7.3 NaN 12.4
1 NaN NaN 13.3
2 11.0 10.8 8.2
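If you prefer to avoid the explicit loop, the same extraction can be applied to every column of the original string frame at once; this is a sketch of the same idea, not part of the original answer:
# Extract the first decimal number from every column and convert to float;
# cells without a number (e.g. 'nan\T\Z') become NaN.
df = df.apply(lambda s: s.str.extract(r'(\d*\.\d*)', expand=False).astype(float))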

change value in row of dataframe on condition

I have a dataframe with certain values and want to exchange values in one row on a condition. If the value is greater than x I want it to change to zero. I tried with .loc, but somehow I get a KeyError every time I try. Does .loc work to select rows instead of columns? I used it for columns before, but I can't get it to work for rows.
df = pd.DataFrame({'a': np.random.randn(4), 'b': np.random.randn(4), 'c': np.random.randn(4)})
print(df)
df.loc['Total'] = df.sum()
df.loc[(df['Total'] < x), ['Total']] = 0
I also tried using iloc, but get another error. I don't think it's a complex problem, but I'm kind of stuck, so help would be much appreciated!
You can assign values with loc - first select the row to replace by its string label, here Total (the row created by df.loc['Total'] = df.sum()), and then compare the values of that row, also selected by loc; the comparison returns a boolean mask:
np.random.seed(2019)
df = pd.DataFrame({'a': np.random.randn(4), 'b': np.random.randn(4), 'c': np.random.randn(4)})
print(df)
a b c
0 -0.217679 -0.361865 -0.235634
1 0.821455 0.685609 0.953490
2 1.481278 0.573761 -1.689625
3 1.331864 0.287728 -0.344943
df.loc['Total'] = df.sum()
x = 1
df.loc['Total', df.loc['Total'] < x] = 0
print (df)
a b c
0 -0.217679 -0.361865 -0.235634
1 0.821455 0.685609 0.953490
2 1.481278 0.573761 -1.689625
3 1.331864 0.287728 -0.344943
Total 3.416918 1.185233 0.000000
Detail:
print (df.loc['Total'] < x)
a False
b False
c True
Name: Total, dtype: bool
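The same replacement can also be written with Series.where, which keeps the values that meet the condition and substitutes the rest; this equivalent one-liner is my own addition, not part of the original answer:
# Keep Total values that are >= x, replace the others with 0.
df.loc['Total'] = df.loc['Total'].where(df.loc['Total'] >= x, 0)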

what is the simplest way to check for occurrence of character/substring in dataframe values?

consider a pandas dataframe that has values such as 'a - b'. I would like to check for the occurrence of '-' anywhere across all values of the dataframe without looping through individual columns. Clearly a check such as the following won't work:
if '-' in df.values
Any suggestions on how to check for this? Thanks.
I'd use stack() + .str.contains() in this case:
In [10]: df
Out[10]:
   a      b      c
0  1  a - b      w
1  2      c      z
2  3      d  2 - 3
In [11]: df.stack().str.contains('-').any()
Out[11]: True
In [12]: df.stack().str.contains('-')
Out[12]:
0 a NaN
b True
c False
1 a NaN
b False
c False
2 a NaN
b False
c True
dtype: object
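If you also want to know which cells matched, the same stacked mask can be filtered; this small extension is my own addition, not part of the original answer:
hits = df.stack().str.contains('-')
hits[hits == True]        # (row label, column) pairs of the matching cells
# 0  b    True
# 2  c    True
# dtype: object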
You can use replace to swap a regex match with something else, then check for equality:
df.replace('.*-.*', True, regex=True).eq(True)
One way may be to flatten the values and use a list comprehension.
df = pd.DataFrame([['val1','a-b', 'val3'],['val4','3', 'val5']],columns=['col1','col2', 'col3'])
print(df)
Output:
col1 col2 col3
0 val1 a-b val3
1 val4 3 val5
Now, to search for -:
find_value = [val for val in df.values.flatten() if '-' in val]
print(find_value)
Output:
['a-b']
Using NumPy: np.core.defchararray.find(a,s) returns an array of indices where the substring s appears in a;
if it's not present, -1 is returned.
(np.core.defchararray.find(df.values.astype(str),'-') > -1).any()
returns True if '-' is present anywhere in df.

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that I then want to add data to, one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID',inplace = True)
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That is because you don't have any rows in your dataframe when you first create it.
df = pd.DataFrame({'customer_ID': ['2'], 'a': ['1'], 'b': ['A'], 'c': ['1']})
df = df.set_index('customer_ID', drop=False)
df
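A one-step variant, offered as my own sketch rather than something from the answers above: create the frame with an empty, named index, so the placeholder row never appears:
import pandas as pd

# Empty frame whose (empty) index is already named 'customer_ID'.
df = pd.DataFrame(columns=['a', 'b', 'c'],
                  index=pd.Index([], name='customer_ID'))

df.loc['x123'] = [4, 5, 6]   # rows can then be added by label as they arrive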

pandas string contains lookup: NaN leads to Value Error

If you would like to filter those rows for which a string is in a column value, it is possible to use something like data.sample_id.str.contains('hph') (answered before: check if string in pandas dataframe column is in list, or Check if string is in a pandas dataframe).
However, my lookup column contains empty cells. Therefore, str.contains() yields NaN values and I get a ValueError upon indexing:
ValueError: cannot index with vector containing NA / NaN values
What works:
# get all runs
mask = [index for index, item in enumerate(data.sample_id.values) if 'zent' in str(item)]
Is there a more elegant and faster method (similar to str.contains()) than this one ?
You can set parameter na in str.contains to False:
print (df.a.str.contains('hph', na=False))
Using EdChum's sample:
df = pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
print (df)
a
0 hph
1 NaN
2 sadhphsad
3 hello
print (df.a.str.contains('hph', na=False))
0 True
1 False
2 True
3 False
Name: a, dtype: bool
IIUC, you can also filter those rows out first:
data['sample'].dropna().str.contains('hph')
Example:
In [38]:
df =pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
df
Out[38]:
a
0 hph
1 NaN
2 sadhphsad
3 hello
In [39]:
df['a'].dropna().str.contains('hph')
Out[39]:
0 True
2 True
3 False
Name: a, dtype: bool
So by calling dropna first, you can then safely use str.contains on the Series, as there will be no NaN values.
Another way to handle the null values would be to use notnull:
In [43]:
(df['a'].notnull()) & (df['a'].str.contains('hph'))
Out[43]:
0 True
1 False
2 True
3 False
Name: a, dtype: bool
but I think passing na=False would be cleaner (@jezrael's answer).
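Coming back to the original use case of indexing: with na=False the mask is purely boolean, so it can be used directly to filter the rows (the column name sample_id is taken from the question):
# Rows whose sample_id contains 'hph'; NaN cells count as non-matches.
data[data['sample_id'].str.contains('hph', na=False)]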