pandas dataframe convert column type to string or categorical - pandas

How do I convert a single column of a pandas dataframe to type string? In the df of housing data below I need to convert zipcode to string so that when I run linear regression, zipcode is treated as categorical and not numeric. Thanks!
df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})
print (df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109

You need astype:
df['zipcode'] = df.zipcode.astype(str)
#df.zipcode = df.zipcode.astype(str)
For converting to categorical:
df['zipcode'] = df.zipcode.astype('category')
#df.zipcode = df.zipcode.astype('category')
Another solution is Categorical:
df['zipcode'] = pd.Categorical(df.zipcode)
Sample with data:
import pandas as pd
df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})
print (df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109
print (df.dtypes)
bathrooms float64
bedrooms int64
floors float64
sqft_living int64
sqft_lot int64
zipcode int64
dtype: object
df['zipcode'] = df.zipcode.astype('category')
print (df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109
print (df.dtypes)
bathrooms float64
bedrooms int64
floors float64
sqft_living int64
sqft_lot int64
zipcode category
dtype: object
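Since the stated goal is a linear regression that treats zipcode as categorical, note that after the astype conversion many libraries (scikit-learn, for example) still expect an explicit one-hot/dummy encoding. A minimal sketch with pd.get_dummies; the zip_ prefix and drop_first=True are only illustrative choices, not something from the question:
# each zipcode becomes its own 0/1 indicator column
dummies = pd.get_dummies(df, columns=['zipcode'], prefix='zip', drop_first=True)
print (dummies.filter(like='zip_'))
The resulting indicator columns are what a plain numeric regression can consume directly.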

With pandas >= 1.0 there is now a dedicated string datatype:
1) You can convert your column to this pandas string datatype using .astype('string'):
df['zipcode'] = df['zipcode'].astype('string')
2) This is different from using str which sets the pandas object datatype:
df['zipcode'] = df['zipcode'].astype(str)
3) For changing into categorical datatype use:
df['zipcode'] = df['zipcode'].astype('category')
You can see this difference in datatypes when you look at the info of the dataframe:
df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
    'zipcode_category': [90210, 90211],
})
df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_string'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')
df.info()
# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   zipcode_str       2 non-null      object
 1   zipcode_string    2 non-null      string
 2   zipcode_category  2 non-null      category
dtypes: category(1), object(1), string(1)
From the docs:
The 'string' extension type solves several issues with object-dtype NumPy arrays:
1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.
2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn't a clear way to select just text while excluding non-text but still object-dtype columns.
3. When reading code, the contents of an object dtype array is less clear than 'string'.
More info on working with the new string datatype can be found here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
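A small sketch of the select_dtypes() point above (assumes pandas >= 1.0; the column names and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'zip_obj': pd.Series(['90210', '90211'], dtype='object'),
    'zip_str': pd.Series(['90210', '90211'], dtype='string'),
    'price': [1000000, 1250000],
})
# only the 'string'-dtype column can be selected as text directly
print (df.select_dtypes(include='string').columns.tolist())  # ['zip_str']
print (df.select_dtypes(include='object').columns.tolist())  # ['zip_obj']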

Prior answers focused on nominal data (i.e. unordered). If there is a reason to impose an order for an ordinal variable, then one would use:
# Transform to category
df['zipcode_category'] = df['zipcode_category'].astype('category')
# Add an ordered copy of the column
df['zipcode_ordered'] = df['zipcode_category']
# Set up the ordering (assign the result back; set_categories no longer
# supports inplace=True in current pandas)
df['zipcode_ordered'] = df.zipcode_ordered.cat.set_categories(
    new_categories=[90211, 90210], ordered=True
)
# Output IDs
df['zipcode_ordered_id'] = df.zipcode_ordered.cat.codes
print(df)
#   zipcode_category zipcode_ordered  zipcode_ordered_id
# 0            90210           90210                   1
# 1            90211           90211                   0
More details on setting ordered categories can be found at the pandas website:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#sorting-and-order
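A quick sketch of what the ordering enables, reusing the df built above (it assumes the set_categories step has been run): sorting, min/max and comparisons now respect the imposed category order rather than the numeric order.
print (df.zipcode_ordered.min(), df.zipcode_ordered.max())          # 90211 90210
print (df.sort_values('zipcode_ordered').zipcode_ordered.tolist())  # [90211, 90210]
print ((df.zipcode_ordered > 90211).tolist())                       # [True, False]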

To convert a column into a string type (which will be an object column per se in pandas), use astype:
df.zipcode = df.zipcode.astype(str)
If you want a Categorical column instead, pass 'category' to the function:
df.zipcode = df.zipcode.astype('category')

Related

Product demand down calculation in pandas df without loop

I'm having trouble with shift and diff, and I feel it should be simple.
Assume I have customers with different product demands, which get handled with priority from the top down. I'd like an efficient solution without looping.
df_situation = pd.DataFrame(
    {
        "cust": [1, 2, 3, 3, 4],
        "prod": [1, 1, 1, 2, 2],
        "available": [1000, np.nan, np.nan, 2000, np.nan],
        "needed": [200, 300, 1000, 1000, 1000],
    }
)
My objective is to get some additional columns like this, but it looks like the difference calculations and the shift operation are in a "chicken and egg" situation.
Thanks in advance for any hint.
leftover_prod is the forward-filled available minus the cumulative demand per prod group (groupby + cumsum):
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
0 800.0
1 500.0
2 -500.0
3 1000.0
4 0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is either the demand, if the previous leftover_prod within the prod group (groupby + shift) is large enough, or that leftover amount itself, selected with np.where:
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
0 200.0
1 300.0
2 500.0
3 1000.0
4 1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand - the fulfilled_cust:
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
0 0.0
1 0.0
2 500.0
3 0.0
4 0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
   cust  prod  available  demand  leftover_prod  fulfilled_cust  missing_cust
0     1     1     1000.0     200          800.0           200.0           0.0
1     2     1        NaN     300          500.0           300.0           0.0
2     3     1        NaN    1000         -500.0           500.0         500.0
3     3     2     2000.0    1000         1000.0          1000.0           0.0
4     4     2        NaN    1000            0.0          1000.0           0.0
imports and DataFrame used:
import numpy as np
import pandas as pd
df_situation = pd.DataFrame({
"cust": [1, 2, 3, 3, 4],
"prod": [1, 1, 1, 2, 2],
"available": [1000, np.nan, np.nan, 2000, np.nan],
"demand": [200, 300, 1000, 1000, 1000],
})
(changed "needed" to "demand" as it appears in image.)

Group Pandas rows by ID and forward fill them to the right retaining NaN when it appears on all the rows with the same ID

I have a Pandas DataFrame that I need to:
group by the ID column (not in index)
forward fill rows to the right with the previous value (multiple columns) only if it's not a NaN (np.nan)
For each ID value and each metric column (see the aX columns in the examples below) there is only one non-missing value (the others, when there are multiple rows with the same ID, are NaN - np.nan).
Take this as an example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: my_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
...: {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
...: ])
In [4]: my_df.head(len(my_df))
Out[4]:
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 NaN NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
I have many more columns like a1 to a4.
I would like to:
treat np.nan as zero (0.0) when the same column has a number in a different row with the same ID, so that I can sum the rows together as with groupby and subsequent aggregation functions
forward fill to the right on the same collapsed row (by ID), but only if somewhere in a column to the left there was a number
So basically in the example this means that:
for ID 1, "a2" = 100.0
for ID 20, "a1" and "a2" are both np.nan
See here:
In [5]: wanted_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0},
...: ])
In [6]: wanted_df.head(len(wanted_df))
Out[6]:
id a1 a2 a3 a4
0 1 100.0 100.0 80.0 90.0
1 20 NaN NaN 100.0 30.0
In [7]:
The forward filling to the right should apply to multiple columns on the same row,
not only to the closest column to the right.
When I use my_df.interpolate(method='pad', axis=1,limit=None,limit_direction='forward',limit_area=None,downcast=None,) then I still get multiple rows for the same ID.
When I use my_df.groupby("id").sum() then I see 0.0 everywhere rather than retaining the NaN values in those scenarios defined above.
When I use my_df.groupby("id").apply(np.sum) the ID column is summed as well, which is wrong since it should be retained.
How do I do this?
One idea is to use min_count=1 with sum:
df = my_df.groupby("id").sum(min_count=1)
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
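The min_count=1 part is what keeps the all-NaN groups as NaN; a quick check using the my_df from the question (the plain sum() is what produced the 0.0 everywhere mentioned above):
print (my_df.groupby("id")["a2"].sum())             # 0.0 for both ids
print (my_df.groupby("id")["a2"].sum(min_count=1))  # NaN for both ids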
Or, if you need the first non-missing value per group, it is possible to use GroupBy.first:
df = my_df.groupby("id").first()
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
It is more problematic if there are multiple non-missing values per group and you need all of them:
#added 20 to a1
my_df = pd.DataFrame([
{"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
{"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
{"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
{"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
print (my_df)
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 20.0 NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))
df1 = (my_df.set_index('id')
.groupby("id")
.apply(f)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
The first and second solutions then work differently:
df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
a1 a2 a3 a4
id
1 120.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
df3 = my_df.groupby("id").first()
print (df3)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
If the values are all the same type (here, numbers), it is also possible to use:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
f = lambda x: (pd.DataFrame(justify(x.to_numpy(),
                                    invalid_val=np.nan,
                                    axis=0,
                                    side='up'),
                            columns=my_df.columns.drop('id'))
               .dropna(how='all'))
df1 = (my_df.set_index('id')
.groupby("id")
.apply(f)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
            opt_expiry  strike type  sigma  delta
timestamp
2004-09-06  2005-12-16     2.0    c  0.250   0.70
2004-09-06  2005-12-16     2.0    p  0.250  -0.30
2004-09-06  2005-12-16     2.5    c  0.001   1.00
2004-09-06  2005-12-16     2.5    p  0.170  -0.25
2004-09-07  2005-06-17     1.5    c  0.195   0.60
2004-09-07  2005-06-17     1.5    p  0.190  -0.40
here is what I am looking to achieve:
1) find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) for each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudo code:
for group in groups:
    # if sum of absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
            opt_expiry  strike type  sigma  delta
timestamp
2004-09-06  2005-12-16     2.0    c  0.250   0.70
2004-09-06  2005-12-16     2.0    p  0.250  -0.30
2004-09-06  2005-12-16     2.5    c  0.170   1.00
2004-09-06  2005-12-16     2.5    p  0.170  -0.25
2004-09-07  2005-06-17     1.5    c  0.195   0.60
2004-09-07  2005-06-17     1.5    p  0.190  -0.40
Revised answer. I believe there could be a shorter answer out there; maybe put it up as a bounty.
Data
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row:
df['absdelta']=df['delta'].abs()
Absolute delta sum and maximum sigma for each group in a new dataframe df2
df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2
Merge df2 with df
df3=df.merge(df2, left_on='strike', right_on='strike',
suffixes=('', '_right'))
df3
mask groups with sum absolute delta not equal to 1
m=df3['absdelta_right']!=1
m
Using mask, apply maximum sigma to entities in groups masked above
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Drop the helper columns to return to the original dataframe columns
df3.drop(columns=['absdelta', 'absdelta_right', 'sigma_right'])
Output
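The answer above allows that a shorter version probably exists. One possible sketch (my own, not the original answer) replaces the merge with transform; it relies on groupby accepting the timestamp index level name together with column names, as the question's own groupby call does, and uses np.isclose instead of the question's strict != 1 to guard against floating point noise such as 0.7 + 0.3 != 1.0:
import numpy as np

g = df.groupby(['timestamp', 'opt_expiry', 'strike'])
# flag groups whose absolute deltas do not sum to 1
bad = ~np.isclose(g['delta'].transform(lambda d: d.abs().sum()), 1.0)
# overwrite sigma with the group maximum only in those groups
df['sigma'] = np.where(bad, g['sigma'].transform('max'), df['sigma'])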

Barplot not grouped together

I want to create a barplot where the 'Age_round' values are grouped together and in ascending order. Right now the bars are all separate.
import matplotlib.pyplot as plt
df.plot(kind='bar',x='Age_round',y='number of purchased hours(mins)')
plt.xlabel('Age_round')
plt.ylabel('number of purchased hours(mins)')
# plt.xticks(np.arange(start = 4, stop = 17, step = 1))
plt.title('Age Distribution Graph')
plt.grid()
This is my dataframe below
Package Age_round gender
1 7000 9.0 1
2 7000 10.0 0
3 5000 9.0 0
4 9000 10.0 1
5 3000 12.0 1
6 5000 9.0 1
7 9000 10.0 1
8 6000 16.0 1
9 6000 12.0 0
10 6000 7.0 1
11 12000 7.0 1
12 12000 15.0 1
13 6000 10.0 1
Essentially, I would love to create a barplot where the x-axis is 'Age_round', the y-axis shows the frequency, and the 'Package' values are differentiated by bars of different colour.
I wrote a piece of code that does this job, but I am not sure it is the best way:
I made a new dataframe holding the frequency of each Package for each age and assigned the sorted ages ('values') as its index.
values = df.Age_round.unique()
values.sort()
newdf = pd.DataFrame()
for x in values:
    freq_x = df[df['Age_round'] == x]['Package'].value_counts()
    newdf = newdf.append(freq_x)
newdf.index = values
newdf.plot(kind='bar', stacked=True, figsize=(15, 6))
Here is a possible implementation:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Package', 'Age_round', 'gender'],
data=[[7000, 9.0, 1], [7000, 10.0, 0], [5000, 9.0, 0], [9000, 10.0, 1], [3000, 12.0, 1],
[5000, 9.0, 1], [9000, 10.0, 1], [6000, 16.0, 1], [6000, 12.0, 0], [6000, 7.0, 1],
[12000, 7.0, 1], [12000, 15.0, 1], [6000, 10.0, 1]])
df['Age_round'] = df['Age_round'].astype(int) # optionally round the numbers to integers
df.sort_values(['Age_round', 'Package']).plot(kind='bar', x='Age_round', y='Package', rot=0, color='deeppink')
plt.xlabel('Age (rounded)')
plt.ylabel('Number of purchased hours(mins)')
plt.title('Age Distribution Graph')
plt.grid(True, axis='y')
plt.show()
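If the y-axis should show frequency with Package as the colour, as the question describes, a crosstab avoids the manual loop. A sketch reusing df, pd and plt from the answer above; whether the bars should be stacked is an assumption based on the question's own attempt:
# rows: rounded age, columns: package, values: frequency
counts = pd.crosstab(df['Age_round'], df['Package'])
counts.plot(kind='bar', stacked=True, rot=0, figsize=(10, 5))
plt.xlabel('Age (rounded)')
plt.ylabel('Frequency')
plt.title('Age Distribution Graph')
plt.show()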

pandas 0.20.3 DataFrame behavior changes for pyspark.ml.vectors object in a column

The following code works in pandas 0.20.0 but not in 0.20.3:
import pandas as pd
from pyspark.ml.linalg import Vectors
df = pd.DataFrame({'A': [1,2,3,4],
'B': [1,2,3,4],
'C': [1,2,3,4],
'D': [1,2,3,4]},
index=[0, 1, 2, 3])
df.apply(lambda x: pd.Series(Vectors.dense([x["A"], x["B"]])), axis=1)
This produces from pandas 0.20.0:
0
0 [1.0, 1.0]
1 [2.0, 2.0]
2 [3.0, 3.0]
3 [4.0, 4.0]
but it is different in pandas 0.20.3:
0 1
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
How can I achieve the first behavior in 0.20.3?
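No answer is included here, but one version-independent workaround (a sketch of a possible approach, not a documented pandas guarantee) is to build the object column explicitly instead of relying on apply, so pandas never tries to expand each DenseVector into separate columns:
# keep each DenseVector as a single object in one column
vectors = pd.Series(
    [Vectors.dense([a, b]) for a, b in zip(df['A'], df['B'])],
    index=df.index,
)
result = vectors.to_frame(name=0)
print(result)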