changing case in pandas dataframe

I have a pandas dataframe x:
sequence score
AAtttGG 2
CCCgttT 3
I want the output
sequence score
AATTTGG 2
CCCGTTT 3
so that all sequences are uppercase. I would appreciate a one-liner that will do this.

Call the vectorised str method upper:
In [98]:
df.sequence = df.sequence.str.upper()
df
Out[98]:
sequence score
0 AATTTGG 2
1 CCCGTTT 3
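As a side note, attribute assignment (df.sequence = ...) only works because the column already exists and its name doesn't clash with a DataFrame attribute; bracket assignment is the more defensive spelling:
# same result, but robust for any column name
df['sequence'] = df['sequence'].str.upper()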

How to split this data into neat dataframe in pandas. NOTE the dtype is object [duplicate]

I am trying to split a column into multiple columns based on comma/space separation.
My dataframe currently looks like
KEYS 1
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420, 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490, 4000.0499, 4000.0500, 4000.0504,
I would like
KEYS 1 2 3 4
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490 4000.0499 4000.0500 4000.0504
My code currently removes the KEYS column and I'm not sure why. Could anyone improve it or help fix the issue?
v = dfcleancsv[1]
#splits the columns by spaces into new columns but removes KEYS?
dfcleancsv = dfcleancsv[1].str.split(' ').apply(Series, 1)
In case someone else wants to split a single column (delimited by a value) into multiple columns, try this:
series.str.split(',', expand=True)
This answered the question I came here looking for.
Credit to EdChum's code that includes adding the split columns back to the dataframe.
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
Note: the first argument, df[[0]], is a DataFrame. The second argument, df[1].str.split, is the series that you want to split.
See the split documentation and the concat documentation.
Using EdChum's answer of
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
I was able to solve it by substituting my variables.
dfcleancsv = pd.concat([dfcleancsv['KEYS'], dfcleancsv[1].str.split(', ', expand=True)], axis=1)
The OP had a variable number of output columns.
In the particular case of a fixed number of output columns, another elegant solution for naming the resulting columns is to use multiple assignment.
Load a sample dataset and reshape it to long format to obtain a variable
called organ_dimension.
import seaborn
iris = seaborn.load_dataset('iris')
df = iris.melt(id_vars='species', var_name='organ_dimension', value_name='value')
Split the organ_dimension variable in 2 variables organ and dimension based on the _ separator.
df[['organ', 'dimension']] = df['organ_dimension'].str.split('_', expand=True)
df.head()
Out[10]:
species organ_dimension value organ dimension
0 setosa sepal_length 5.1 sepal length
1 setosa sepal_length 4.9 sepal length
2 setosa sepal_length 4.7 sepal length
3 setosa sepal_length 4.6 sepal length
4 setosa sepal_length 5.0 sepal length
Based on this answer to "How to split a column into two columns?":
The simplest way is vectorization:
df = df.apply(lambda x: pd.Series(x))
Maybe this will work:
df = pd.concat([df['KEYS'],df[1].apply(pd.Series)],axis=1)
Check this out
Responder_id LanguagesWorkedWith
0 1 HTML/CSS;Java;JavaScript;Python
1 2 C++;HTML/CSS;Python
2 3 HTML/CSS
3 4 C;C++;C#;Python;SQL
4 5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
... ... ...
87564 88182 HTML/CSS;Java;JavaScript
87565 88212 HTML/CSS;JavaScript;Python
87566 88282 Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...
87567 88377 HTML/CSS;JavaScript;Other(s):
87568 88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...
Split the LanguagesWorkedWith column into multiple columns by using data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series):
data1 = pd.read_csv('data.csv', sep=',')
data1.set_index('Responder_id', inplace=True)
data1                 # inspect the frame
data1.loc[1, :]       # inspect one responder
data = data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)
data.head()
You may also want to try datar, a package that ports dplyr, tidyr and related R packages to Python:
>>> df
i j A
<object> <int64> <object>
0 AR 5 Paris,Green
1 For 3 Moscow,Yellow
2 For 4 NewYork,Black
>>> from datar import f
>>> from datar.tidyr import separate
>>> separate(df, f.A, ['City', 'Color'])
i j City Color
<object> <int64> <object> <object>
0 AR 5 Paris Green
1 For 3 Moscow Yellow
2 For 4 NewYork Black

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas series using a user-defined function, and to write the output into a new column. I figured out the individual steps, but when I put them together I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x, y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
import numpy as np

df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
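For comparison, here is an equivalent sketch without apply, built on the question's own dataframe: a plain comprehension over the shifted column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
prev = df['Code'].shift(1)
# count how many characters of the current code appear in the previous one
df['R1SB'] = [
    np.nan if pd.isna(p) else sum(ch in p for ch in c)
    for c, p in zip(df['Code'], prev)
]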

How to (idiomatically) use pandas .loc to return an empty dataframe when key is not in index [duplicate]

This question already has answers here:
Pandas .loc without KeyError
(6 answers)
Closed 2 years ago.
Say I have a DataFrame (with a multi-index, for that matter), and I wish to take the values at some index - but, if that index does not exist, I wish for it to return an empty df instead of a KeyError.
I've searched for similar questions, but they are all about pandas returning an empty dataframe when it is not desired at some cases (conversely, I do desire an empty dataframe in return).
For example:
import pandas as pd
df = pd.DataFrame(index=pd.MultiIndex.from_tuples([(1,1),(1,2),(3,1)]),
                  columns=['a','b'], data=[[1,2],[3,4],[10,20]])
so, df is:
a b
1 1 1 2
2 3 4
3 1 10 20
and df.loc[1] is:
a b
1 1 2
2 3 4
df.loc[2] raises a KeyError, and I'd like something that returns
a b
The closest I could get is by calling df.loc[idx:idx] as a slice, which gives the correct result for idx=2, but for idx=1 it returns
a b
1 1 1 2
2 3 4
instead of the desired result.
Of course I can define a function to do it, but is there a more idiomatic way?
One idea with an if-else statement:
def get_val(x):
    return df.loc[x] if x in df.index.levels[0] else pd.DataFrame(columns=df.columns)
Or, more generally, with a try-except statement:
def get_val(x):
    try:
        return df.loc[x]
    except KeyError:
        return pd.DataFrame(columns=df.columns)
print (get_val(1))
a b
1 1 2
2 3 4
print (get_val(2))
Empty DataFrame
Columns: [a, b]
Index: []
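A related idiom is a boolean mask over the index level values, which never raises a KeyError; note, though, that unlike df.loc[1] it keeps level 0 in the result (a sketch):
df[df.index.get_level_values(0) == 2]   # empty DataFrame, no KeyError
df[df.index.get_level_values(0) == 1]   # rows under key 1; level 0 stays in the index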

Pandas: find most frequent values in columns of lists

x animal
0 5 [dog, cat]
1 6 [dog]
2 8 [elephant]
I have a dataframe like this. How can I find the most frequent animals contained across all the lists in the column?
The value_counts() method treats a list as one element, so I can't use it.
something along these lines?
import pandas as pd
df = pd.DataFrame({'x' : [5,6,8], 'animal' : [['dog', 'cat'], ['elephant'], ['dog']]})
x = sum(df.animal, [])
#x
#Out[15]: ['dog', 'cat', 'elephant', 'dog']
from collections import Counter
c = Counter(x)
c.most_common(1)
#Out[17]: [('dog', 2)]
Maybe take a step back and redefine your data structure? Pandas is better suited to a "flat" dataframe.
Instead of:
x animal
0 5 [dog, cat]
1 6 [dog]
2 8 [elephant]
Do:
x animal
0 5 dog
1 5 cat
2 6 dog
3 8 elephant
Now you can count easily with len(df[df['animal'] == 'dog']) as well as many other Pandas things!
To flatten your dataframe, reference this answer:
Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas
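As a concrete sketch of that flattening step, DataFrame.explode (available since pandas 0.25) gives one row per list element, after which value_counts applies directly:
import pandas as pd

df = pd.DataFrame({'x': [5, 6, 8],
                   'animal': [['dog', 'cat'], ['dog'], ['elephant']]})
flat = df.explode('animal')        # one row per list element; x is repeated
flat['animal'].value_counts().head(1)
# dog    2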

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2),index=MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [3]: df
Out [3]:
A B
one 1 -2.028736 -0.466668
2 -1.877478 0.179211
3 0.886038 0.679528
two 1 1.101735 0.169177
2 0.756676 -1.043739
3 1.189944 1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each row (index level 0) and each column. So I need a DataFrame which would look like
A B
one 1 mean(df['A'].ix['one'][1:3]) mean(df['B'].ix['one'][1:3])
two 1 mean(df['A'].ix['two'][1:3]) mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
A B
one 1 -2.712137 -0.131805
2 -0.390227 -1.333230
3 0.047128 0.438284
two 1 0.055254 -1.434262
2 2.392265 -1.474072
3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one':df.xs('one',level=0)[1:3].apply(np.mean), 'two':df.xs('two',level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
A B
one -0.171549 -0.447473
two 0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for g in grouped:
    d[g[0]] = g[1][1:3].apply(np.mean)
DataFrame(d).transpose()
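The loop above can also be collapsed into a single expression; a sketch assuming the same setup, with iloc[1:3] selecting elements 2 and 3 within each group:
# per group, take the 2nd and 3rd rows and average each column
df.groupby(level=0).apply(lambda g: g.iloc[1:3].mean())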
I'm not sure about panels - it's not as well documented, but something similar should be possible.
I know this is an old question, but for reference for anyone who searches and finds this page, the easier solution I think is the level keyword of mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2), index=pd.MultiIndex.from_tuples(zip(*arrays)), columns=['A','B'])
In [6]: df
Out[6]:
A B
one 1 -0.472890 2.297778
2 -2.002773 -0.114489
3 -1.337794 -1.464213
two 1 1.964838 -0.623666
2 0.838388 0.229361
3 1.735198 0.170260
In [7]: df.mean(level=0)
Out[7]:
A B
one -1.271152 0.239692
two 1.512808 -0.074682
In this case it means that level 0 is kept while taking the mean over axis 0 (the rows, the default axis for mean).
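On recent pandas versions the level keyword of mean has been deprecated and removed; the equivalent modern spelling goes through groupby:
# same result as df.mean(level=0) on older pandas
df.groupby(level=0).mean()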
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2,3]] + [("two", elem) for elem in [2,3]]
# Compute grouped mean over only those indices.
df.ix[idxs].mean(level=0)
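Note that .ix has since been removed from pandas; on modern versions the same selection can be written with .loc plus groupby (a sketch):
# select the (level0, level1) pairs by label, then average within level 0
idxs = [("one", elem) for elem in [2, 3]] + [("two", elem) for elem in [2, 3]]
df.loc[idxs].groupby(level=0).mean()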