How do you strip out only the integers of a column in pandas? - pandas

I am trying to extract only the numeric values, which are the first one or two digits. Some values in the column are pure strings and others contain special characters. (The original post showed the value counts in an image.)
I have tried multiple methods:
breaks['_Size'] = breaks['Size'].fillna(0)
breaks[breaks['_Size'].astype(str).str.isdigit()]
breaks['_Size'] = breaks['_Size'].replace('\*','',regex=True).astype(float)
breaks['_Size'] = breaks['_Size'].str.extract('(\d+)').astype(int)
breaks['_Size'].map(lambda x: x.rstrip('aAbBcC'))
None of them work. The dtype is object. To be clear, I am trying to create a new column containing only the digits (as an int or float); converting the fractions to decimals would be a bonus.

This works for dividing fractions and also allows for extra numbers to be present in the string (it returns you just the first sequence of numbers):
In [60]: import pandas as pd
In [61]: import re
In [62]: df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])
In [63]: df
Out[63]:
  _Size
0     0
1   6''
2    7"
3   8in
4  text
5  3/4"
6   1a3
In [64]: def cleaning_function(row):
    ...:     row = str(row)
    ...:     fractions = re.findall(r'(\d+)/(\d+)', row)
    ...:     if fractions:
    ...:         return float(int(fractions[0][0])/int(fractions[0][1]))
    ...:     numbers = re.findall(r'[0-9]+', str(row))
    ...:     if numbers:
    ...:         return numbers[0]
    ...:     return 0
    ...:
In [65]: df._Size.apply(cleaning_function)
Out[65]:
0       0
1       6
2       7
3       8
4       0
5    0.75
6       1
Name: _Size, dtype: object
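Note that the result above still has dtype object, because the function returns a mix of strings, ints, and floats. A minimal sketch of a variant that always returns a float is shown here; the name parse_size and the 'Size' sample column are illustrative assumptions, not part of the original answer.

```python
import re

import pandas as pd


def parse_size(value):
    """Return the first number in value as a float; a fraction such as '3/4' is divided."""
    text = str(value)
    fraction = re.search(r'(\d+)\s*/\s*(\d+)', text)
    # Only divide when the denominator is non-zero.
    if fraction and int(fraction.group(2)) != 0:
        return int(fraction.group(1)) / int(fraction.group(2))
    number = re.search(r'\d+', text)
    # Rows without any digits fall back to 0.0, mirroring the answer above.
    return float(number.group()) if number else 0.0


# Hypothetical sample data, reusing the values from the answer above.
df = pd.DataFrame({'Size': [0, "6''", '7"', '8in', 'text', '3/4"', '1a3']})
df['_Size'] = df['Size'].apply(parse_size)
print(df['_Size'].dtype)
```

Because every branch returns a float, the new column comes out as float64 and can be used in arithmetic directly.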

Related

New column with word at nth position of string from other column pandas

import numpy as np
import pandas as pd
d = {'ABSTRACT_ID': [14145090,1900667, 8157202,6784974],
     'TEXT': [
         "velvet antlers vas are commonly used in tradit",
         "we have taken a basic biologic RPA to elucidat4",
         "ceftobiprole bpr is an investigational cephalo",
         "lipoperoxidationderived aldehydes for example",],
     'LOCATION': [1, 4, 2, 1]}
df = pd.DataFrame(data=d)
df
def word_at_pos(x,y):
    pos=x
    string= y
    count = 0
    res = ""
    for word in string:
        if word == ' ':
            count = count + 1
            if count == pos:
                break
            res = ""
        else :
            res = res + word
    print(res)
word_at_pos(df.iloc[0,2],df.iloc[0,1])
For this df I want to create a new column WORD that contains the word from TEXT at the position indicated by LOCATION. e.g. first line would be "velvet".
I can do this for a single row with the standalone function word_at_pos(x, y), but I can't work out how to apply it to the whole column. I have created new columns with lambda functions before, but I can't work out how to fit this function into a lambda.
Looping over TEXT and LOCATION could be the best idea because splitting creates a jagged array, so filtering using numpy advanced indexing won't be possible.
df["WORDS"] = [txt.split()[loc] for txt, loc in zip(df["TEXT"], df["LOCATION"]-1)]
print(df)
   ABSTRACT_ID  ...                    WORDS
0     14145090  ...                   velvet
1      1900667  ...                        a
2      8157202  ...                      bpr
3      6784974  ...  lipoperoxidationderived
[4 rows x 4 columns]
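Since the question asks how to fit the function into a lambda, here is a hedged sketch of the row-wise alternative using DataFrame.apply with axis=1. The helper is rewritten to return the word rather than print it, and the out-of-range handling (returning None) is an assumption that the original post does not specify.

```python
import pandas as pd

df = pd.DataFrame({
    'TEXT': [
        "velvet antlers vas are commonly used in tradit",
        "we have taken a basic biologic RPA to elucidat4",
        "ceftobiprole bpr is an investigational cephalo",
        "lipoperoxidationderived aldehydes for example",
    ],
    'LOCATION': [1, 4, 2, 1],
})


def word_at_pos(text, pos):
    """Return the pos-th word (1-based) of text, or None if pos is out of range."""
    words = text.split()
    return words[pos - 1] if 1 <= pos <= len(words) else None


# axis=1 passes each row to the lambda, so both columns are available at once.
df['WORD'] = df.apply(lambda row: word_at_pos(row['TEXT'], row['LOCATION']), axis=1)
print(df['WORD'].tolist())
```

The list comprehension in the answer above is usually faster; the apply form is mainly convenient when the per-row logic needs several columns or extra checks.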

Pandas create new column base on groupby and apply lambda if statement

I have an issue with groupby and apply:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C that, within each group, takes the value 1 if the absolute z-score of B is greater than 2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, this code fails and I cannot figure out why.
Use GroupBy.transform with a lambda function to compute the z-scores, then compare against the threshold and convert the resulting True/False values to 1/0 by casting to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
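For readers without scipy, the same idea can be sketched with pandas alone. This is an illustrative variant, not the answer's code: flag_outliers and the demo frame are hypothetical names, and the z-score is computed by hand with the population standard deviation (ddof=0), which is the default used by scipy.stats.zscore.

```python
import numpy as np
import pandas as pd


def flag_outliers(frame, group_col, value_col, threshold=2.0):
    """Return 1 where the absolute within-group z-score exceeds threshold, else 0."""
    grouped = frame.groupby(group_col)[value_col]
    group_mean = grouped.transform('mean')
    # ddof=0 gives the population standard deviation.
    group_std = grouped.transform(lambda x: x.std(ddof=0))
    z = (frame[value_col] - group_mean) / group_std
    return (z.abs() > threshold).astype(int)


# The frame from the question: every flag is 0.
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
df['C'] = flag_outliers(df, 'A', 'B')

# Hypothetical data with one clear outlier in a group of ten.
demo = pd.DataFrame({'A': ['a'] * 10, 'B': [1] * 9 + [100]})
demo['C'] = flag_outliers(demo, 'A', 'B')
print(demo['C'].tolist())
```

With ddof=0, the absolute z-score in a group of n values cannot exceed sqrt(n - 1), so groups of three or four rows, as in the question's data, can never produce a value above 2.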
The error in your solution is raised for each group:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
See the gotchas section of the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So use the comparison-and-cast approach instead of if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a       [0, 0, 0]
b    [0, 0, 0, 0]
Name: B, dtype: object
But the result then has to be converted back into a column; GroupBy.transform avoids this problem.
You can use groupby + apply a function that finds the z-scores of each item in each group; explode the resulting list; use gt to create a boolean series and convert it to dtype int
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
   A  B  C
0  a  1  0
1  a  2  0
2  a  3  0
3  b  4  0
4  b  5  0
5  b  6  0
6  b  7  0

How to use the values of one column to access values in another column?

import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I access the value at position 'bleh' for each row? I tried:
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DataFrame constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
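The lookup above can also be sketched with plain NumPy positional indexing, which avoids the helper columns. This is a hedged illustration rather than the poster's code: it uses numpy.random.default_rng instead of numpy.random.seed, so the random values differ from those in the post, and the name positions is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)  # assumption: a Generator instead of numpy.random.seed
df = pd.DataFrame({'Value': rng.normal(0, 1, 10)})

# bleh[i] is a random look-back offset between 0 and i (inclusive).
df['bleh'] = [int(rng.integers(0, i + 1)) for i in range(len(df))]

# Row i reads Value from row i - bleh[i]; fancy indexing does this in one step.
positions = np.arange(len(df)) - df['bleh'].to_numpy()
df['newcol'] = df['Value'].to_numpy()[positions]
print(df)
```

Indexing the underlying array sidesteps index alignment, which is why the original answer needed the trailing .values when assigning the result of iloc.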
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
         Value
bleh
0    -1.085631
1     0.997345
2     0.282978
1    -1.506295
4    -0.578600
0     1.651437
0    -2.426679
4    -0.428913
1     1.265936
7    -0.866740
If so, your DataFrame constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter of DataFrame takes a list, not a list of lists.
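To see why the extra brackets matter, here is a small sketch comparing the two constructors. The variable names nested and flat are illustrative; the point is that a list of lists is interpreted as the levels of a MultiIndex.

```python
import numpy as np
import pandas as pd

values = np.arange(3.0)

nested = pd.DataFrame(values, columns=[['Value']])  # list of lists
flat = pd.DataFrame(values, columns=['Value'])      # plain list

# The nested form builds a one-level MultiIndex for the columns.
print(type(nested.columns).__name__, nested.columns.nlevels)
print(type(flat.columns).__name__)

# With flat columns, selecting 'Value' yields an ordinary Series.
print(type(flat['Value']).__name__)
```

With MultiIndex columns, selections such as df.Value no longer return a plain Series, which is what broke the iloc lookup in the question.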

Return count for specific value in pandas .value_counts()?

Assume running pandas' dataframe['prod_code'].value_counts() and storing result as 'df'. The operation outputs:
125011 90300
762 72816
None 55512
7156 14892
75162 8825
How would I extract the count for None? I'd expect the result to be 55512.
I've tried
>>> df.loc[df.index.isin(['None'])]
Series([], Name: prod_code, dtype: int64)
and also
>>> df.loc['None']
KeyError: 'the label [None] is not in the [index]'
It seems you need None, not string 'None':
df.loc[df.index.isin([None])]
df.loc[None]
EDIT:
If you need to check for NaN in the index:
print (s1.loc[np.nan])
#or
print (df[pd.isnull(df.index)])
Sample:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', None, np.nan])
s1 = s.value_counts(dropna=False)
print (s1)
8825 3
90300 2
NaN 2
dtype: int64
print (s1[pd.isnull(s1.index)])
NaN 2
dtype: int64
print (s1.loc[np.nan])
2
print (s1.loc[None])
2
EDIT1:
For stripping whitespaces:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])
print (s)
0 90300
1 90300
2 8825
3 8825
4 8825
5 None
6 NaN
dtype: object
s1 = s.value_counts()
print (s1)
8825 3
90300 2
None 1
dtype: int64
s1.index = s1.index.str.strip()
print (s1.loc['None'])
1
A couple of things:
pd.Series([None] * 2 + [1] * 3).value_counts() automatically drops the None.
pd.Series([None] * 2 + [1] * 3).value_counts(dropna=False) converts the None to np.nan.
That tells me that your None is a string. But since df.loc['None'] didn't work, I suspect your string has whitespace around it.
Try:
df.filter(regex='None', axis=0)
Or:
df.index = df.index.to_series().str.strip().combine_first(df.index.to_series())
df.loc['None']
All that said, I was curious how to reference np.nan in the index:
s = pd.Series([1, 2], [0, np.nan])
s.iloc[s.index.get_loc(np.nan)]
2
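The two cases discussed above (a truly missing value versus the string 'None' with stray whitespace) can be told apart with a short check. This is a sketch on made-up data; the names missing_count and none_string_count are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one padded 'None ' string plus two genuinely missing values.
s = pd.Series(['90300', '90300', '8825', 'None ', None, np.nan])
counts = s.value_counts(dropna=False)

# Case 1: missing values. Select index labels that are null and sum their counts.
missing_count = int(counts[counts.index.isna()].sum())

# Case 2: the literal string 'None', after stripping surrounding whitespace.
none_string_count = int((s.str.strip() == 'None').sum())

print(missing_count, none_string_count)
```

Masking with counts.index.isna() avoids having to know whether the missing label is stored as None or NaN in the index.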

Defining a function to play a graph from CSV data - Python panda

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should let the user type in a value and then plot a bar graph from the DataFrame, as shown below:
def analysis_currency_pair():
    x=raw_input("what currency pair would you like to analyse ? :")
    print type(x)
    global dataFrame
    df1=dataFrame
    df2=df1[['currencyPair','amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind = 'bar')
When I call the function, the code returns my question, along with giving the output of the currency pair. However, it doesn't seem to put x (the value input by the user) into the later half of the function, and so no graph is produced.
Am I doing something wrong here?
This code works when we just put the value in, and not within a function.
I am confused!
I think you need to rewrite your function to take two parameters, the currency pair and the DataFrame, which are passed to analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
#   a  amount currencyPair
#1  7     2.0       EURUSD
#2  8     2.0       EURGBP
#3  9     3.5       CADUSD
def analysis_currency_pair(x, df1):
    print type(x)
    df2=df1[['currencyPair','amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind = 'bar')
#raw input is EURUSD or EURGBP or CADUSD
pair=raw_input("what currency pair would you like to analyse ? :")
analysis_currency_pair(pair, df)
Or you can pass string to function analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": [ 'EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [ 1, 2, 3, 4, 5],
                   "amount1": [ 5, 4, 3, 2, 1]})
print df
#   amount  amount1 currencyPair
#0       1        5       EURUSD
#1       2        4       EURGBP
#2       3        3       CADUSD
#3       4        2       EURUSD
#4       5        1       EURGBP
def analysis_currency_pair(x, df1):
    print type(x)
    #<type 'str'>
    df2=df1[['currencyPair','amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print df2
    #              amount
    #currencyPair
    #CADUSD             3
    #EURGBP             7
    #EURUSD             5
    df2 = df2.loc[x].plot(kind = 'bar')
analysis_currency_pair('CADUSD', df)
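The code in this thread is Python 2 (raw_input, print statements). A rough Python 3 sketch of the same idea follows, splitting the aggregation from the plotting so each part can be checked separately. The function names total_for_pair and plot_pair are hypothetical, and the explicit plt.show() call reflects an assumption that the script runs outside an interactive plotting session; the thread does not confirm that this was the cause of the missing graph.

```python
import pandas as pd


def total_for_pair(df, pair):
    """Sum 'amount' for one currency pair; raises KeyError if the pair is absent."""
    totals = df.groupby('currencyPair')['amount'].sum()
    return totals.loc[pair]


def plot_pair(df, pair):
    """Draw a single-bar chart for the pair (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so the aggregation works without it

    totals = df.groupby('currencyPair')['amount'].sum()
    # Selecting with a list keeps a Series, so the bar is labelled with the pair.
    totals.loc[[pair]].plot(kind='bar')
    plt.show()  # assumption: needed when not running in an interactive session


df = pd.DataFrame({
    'currencyPair': ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
    'amount': [1, 2, 3, 4, 5],
})
total = total_for_pair(df, 'EURUSD')
print(total)
```

In an interactive script, the pair could be read with input() and passed to plot_pair(df, pair), which plays the role of analysis_currency_pair in the answer above.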