I have a data frame as shown below
ID Name Address
1 Kohli Country: India; State: Delhi; Sector: SE25
2 Sachin Country: India; State: Mumbai; Sector: SE39
3 Ponting Country: Australia; State: Tasmania
4 Ponting State: Tasmania; Sector: SE27
From the above I would like to prepare below data frame
ID Name Country State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania None
4 Ponting None Tasmania SE27
I tried below code
df[['Country', 'State', 'Sector']] = pd.DataFrame(df['ADDRESS'].str.split(';',2).tolist(),
columns = ['Country', 'State', 'Sector'])
But after that I still have to clean the data by slicing each column. Is there an easier method than this?
Use a list comprehension with a dict comprehension to build a list of dictionaries, then pass it to the DataFrame constructor:
L = [{k:v for y in x.split('; ') for k, v in dict([y.split(': ')]).items()}
for x in df.pop('Address')]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
ID Name Country State Sector
0 1 Kohli India Delhi SE25
1 2 Sachin India Mumbai SE39
2 3 Ponting Australia Tasmania NaN
Or use split and reshape with stack and unstack:
df1 = (df.pop('Address')
.str.split('; ', expand=True)
.stack()
.reset_index(level=1, drop=True)
.str.split(': ', expand=True)
.set_index(0, append=True)[1]
.unstack()
)
print (df1)
0 Country Sector State
0 India SE25 Delhi
1 India SE39 Mumbai
2 Australia NaN Tasmania
df = df.join(df1)
print (df)
ID Name Country Sector State
0 1 Kohli India SE25 Delhi
1 2 Sachin India SE39 Mumbai
2 3 Ponting Australia NaN Tasmania
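The outputs above were produced before the fourth row (the one without a Country) was added to the question. As a minimal sketch of the same key-based idea on all four rows, the following uses a small helper, parse_address, which is a hypothetical name introduced here for illustration and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Kohli', 'Sachin', 'Ponting', 'Ponting'],
    'Address': [
        'Country: India; State: Delhi; Sector: SE25',
        'Country: India; State: Mumbai; Sector: SE39',
        'Country: Australia; State: Tasmania',
        'State: Tasmania; Sector: SE27',
    ],
})


def parse_address(text):
    """Turn 'Key: value; Key: value' into a {key: value} dict (hypothetical helper)."""
    pairs = (part.split(':', 1) for part in text.split(';'))
    return {key.strip(): value.strip() for key, value in pairs}


# One dict per row; keys missing from a row become NaN in the resulting frame.
parsed = pd.DataFrame(df['Address'].map(parse_address).tolist(), index=df.index)
# Fix the column order explicitly instead of relying on first-seen key order.
parsed = parsed.reindex(columns=['Country', 'State', 'Sector'])

result = df.drop(columns='Address').join(parsed)
print(result)
```

Because the values are matched by key rather than by position, the row with no Country still gets its State and Sector in the right columns.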
You are almost there
cols = ['ZONE', 'State', 'Sector']
df[cols] = pd.DataFrame(df['Address'].str.split('; ', n=2).tolist(),
                        columns = cols)
for col in cols:
    # .str[1] leaves missing pieces as NaN instead of raising an error
    df[col] = df[col].str.split(': ').str[1]
Original answer
This can also do the job:
import pandas as pd
df = pd.DataFrame(
[
{'ID': 1, 'Name': 'Kohli', 'Address': 'Country: India; State: Delhi; Sector: SE25'},
{'ID': 2, 'Name': 'Sachin','Address': 'Country: India; State: Mumbai; Sector: SE39'},
{'ID': 3,'Name': 'Ponting','Address': 'Country: Australia; State: Tasmania'}
]
)
cols_to_extract = ['ZONE', 'State', 'Sector']
list_of_rows = df['Address'].str.split(';', n=2).tolist()
df[cols_to_extract] = pd.DataFrame(
[[item.split(': ')[1] for item in row] for row in list_of_rows],
columns=cols_to_extract)
Output would be the following:
>> df[['ID', 'Name', 'ZONE', 'State', 'Sector']]
ID Name ZONE State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania None
Edited answer
As @jezrael rightly pointed out in a comment on the question, my original answer was wrong: it aligned values by position, which could lead to wrong key-value pairs when some of the values were missing. The following code should work on the edited data set.
import pandas as pd
df = pd.DataFrame(
[
{'ID': 1, 'Name': 'Kohli', 'Address': 'Country: India; State: Delhi; Sector: SE25'},
{'ID': 2, 'Name': 'Sachin','Address': 'Country: India; State: Mumbai; Sector: SE39'},
{'ID': 3,'Name': 'Ponting','Address': 'Country: Australia; State: Tasmania'},
{'ID': 4, 'Name': 'Ponting','Address': 'State: Tasmania; Sector: SE27'}
]
)
cols_to_extract = ['Country', 'State', 'Sector']
list_of_rows = df['Address'].str.split(';', n=2).tolist()
df[cols_to_extract] = pd.DataFrame(
[{item.split(': ')[0].strip(): item.split(': ')[1] for item in row} for row in list_of_rows],
columns=cols_to_extract)
df = df.rename(columns={'Country': 'ZONE'})
Output would be:
>> df[['ID', 'Name', 'ZONE', 'State', 'Sector']]
ID Name ZONE State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania NaN
4 Ponting NaN Tasmania SE27
I have a dataframe with the following nested dictionary in one of the columns:
ID dict
1 {Comp A: {Street: 123 Street}, Comp B: {Street: 456 Street}}
2 {Comp C: {Street: 749 Street}}
3 {Comp D: {Street: }}
I want to expand out the dictionary with the resulting data frame
ID company_name street
1 Comp A 123 Street
1 Comp B 456 Street
2 Comp C 749 Street
3 Comp D
I have tried the following
dft['dict'] = df.dict.apply(eval)
dft = dft.explode('dict')
Which gives me the ID and company_name column correctly, though I haven't been able to figure out how to expand out the street column as well.
This is the data in dictionary form, for reproducibility:
data = [{'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
{'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},{'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'},'comp d': {'street_address': '585 street'}, 'comp e': {'street_address': '873 street'}}"},
{'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"}]
A for loop should suffice and be efficient for your use case. The key is to export the data into a dictionary, which you've already done with the df.to_dict() code, and then iterate over it. If you are on Python 3.10 or later, structural pattern matching can make this simpler.
out = []
for entry in data:
    for key, value in entry.items():
        if key == "entity_details":
            val = eval(value)
            for k, v in val.items():
                result = (entry['ID'], k, v['street_address'])
                out.append(result)
pd.DataFrame(out, columns = ['ID', 'company_name', 'street_address'])
ID company_name street_address
0 1 comp a 123 street
1 2 comp b 456 street
2 3 comp c 555 street
3 3 comp d 585 street
4 3 comp e 873 street
5 4 comp f 898 street
Below is a possible structural pattern matching option; however, I feel the for loop above is more explicit:
out = []
for entry in data:
    match entry:
        case {"entity_details": other}:
            output = eval(other)
            output = [(entry['ID'], key, value['street_address'])
                      for key, value in output.items()]
            out.extend(output)
pd.DataFrame(out, columns = ['ID', 'company_name', 'street_address'])
ID company_name street_address
0 1 comp a 123 street
1 2 comp b 456 street
2 3 comp c 555 street
3 3 comp d 585 street
4 3 comp e 873 street
5 4 comp f 898 street
Of course, for more complex destructuring, pattern matching can come in handy.
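Both loops call eval on the stored strings, which will execute any Python expression it is given. A minimal sketch of the same flattening with ast.literal_eval, which only accepts Python literals, might look like this (it assumes every entity_details value is a plain dict literal, as in the sample data):

```python
from ast import literal_eval

import pandas as pd

data = [
    {'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
    {'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},
    {'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'},"
                                "'comp d': {'street_address': '585 street'}, "
                                "'comp e': {'street_address': '873 street'}}"},
    {'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"},
]

# One (ID, company, street) tuple per company found in each entry.
rows = [
    (entry['ID'], company, details['street_address'])
    for entry in data
    for company, details in literal_eval(entry['entity_details']).items()
]

out = pd.DataFrame(rows, columns=['ID', 'company_name', 'street_address'])
print(out)
```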
Initial data
Since the original data wasn't provided, I'll suppose we have the following:
data = {
1: "{'Comp A': {'Street': '123 Street'}, 'Comp B': {'Street': '456 Street'}}",
2: "{'Comp C': {'Street': '749 Street'}}",
3: "{'Comp D': {'Street': ''}}",
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['dict'])
At least, the use of the eval function is justified with these data.
The main idea
To transform the data into the format Company_name, Street, we can combine DataFrame.from_dict and pd.concat with map(literal_eval), a safer stand-in for apply(eval), like this:
f = partial(pd.DataFrame.from_dict, orient='index')
df_transformed = pd.concat(map(f, df['dict'].map(literal_eval)))
Here:
f converts a dictionary into a DataFrame, using its keys as the index;
.map(literal_eval) converts the strings (Python dict literals) into dictionaries;
map(f, ...) supplies the resulting data frames to pd.concat.
The final touch could be setting the index and renaming the columns, which we can do inside pd.concat like this:
pd.concat(..., keys=df.index, names=['id', 'company']).reset_index('company')
The code
import pandas as pd
from functools import partial
from ast import literal_eval
data = {
1: "{'Comp A': {'Street': '123 Street'}, 'Comp B': {'Street': '456 Street'}}",
2: "{'Comp C': {'Street': '749 Street'}}",
3: "{'Comp D': {'Street': ''}}",
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['dict'])
f = partial(pd.DataFrame.from_dict, orient='index')
dft = pd.concat(
map(f, df['dict'].map(literal_eval)),
keys=df.index, # use the original index to identify where each record comes from
names=['id', 'Company']
).reset_index('Company')
print(dft)
The output:
Company Street
id
1 Comp A 123 Street
1 Comp B 456 Street
2 Comp C 749 Street
3 Comp D
P.S.
Let's say, that:
data = \
[{'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
{'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},{'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'},'comp d': {'street_address': '585 street'}, 'comp e': {'street_address': '873 street'}}"},
{'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"}]
df = pd.DataFrame(data).set_index('ID')
In this case the only thing we should change in the code is the initial column name. It was dict, and now it's entity_details:
pd.concat(
map(f, df['entity_details'].map(literal_eval)),
keys=df.index,
names=['id', 'Company']
).reset_index('Company')
I have a dataframe with the population by age in several cities:
City Age_25 Age_26 Age_27 Age_28 Age_29 Age_30
New York 11312 3646 4242 4344 4242 6464
London 6446 2534 3343 63475 34433 34434
Paris 5242 34343 6667 132 323 3434
Hong Kong 354 979 878 6776 7676 898
Buenos Aires 4244 7687 78 8676 786 9798
I want to create a new dataframe with the sum of the columns based on ranges of three years. That is, people from 25 to 27 and people from 28 to 30. Like this:
City Age_25_27 Age_28_30
New York 19200 15050
London 12323 132342
Paris 46252 3889
Hong Kong 2211 15350
Buenos Aires 12009 19260
In this example I used ranges of three years, but in my real database the ranges have to be five years wide, and there are 100 ages.
How could I do that? I've seen some related answers, but none of them works very well in my case.
Try this:
age_columns = df.filter(like='Age_').columns
n = age_columns.str.split('_').str[-1].astype(int)
df['Age_25-27'] = df[age_columns[(n >= 25) & (n <= 27)]].sum(axis=1)
df['Age_28-30'] = df[age_columns[(n >= 28) & (n <= 30)]].sum(axis=1)
Output:
>>> df
City Age_25 Age_26 Age_27 Age_28 Age_29 Age_30 Age_25-27 Age_28-30
New York 11312 3646 4242 4344 4242 6464 19200 15050
London 6446 2534 3343 63475 34433 34434 12323 132342
Paris 5242 34343 6667 132 323 3434 46252 3889
Hong Kong 354 979 878 6776 7676 898 2211 15350
Buenos Aires 4244 7687 78 8676 786 9798 12009 19260
You can use groupby:
In [1]: import pandas as pd
...: import numpy as np
In [2]: d = {
...: 'City': ['New York', 'London', 'Paris', 'Hong Kong', 'Buenos Aires'],
...: 'Age_25': [11312, 6446, 5242, 354, 4244],
...: 'Age_26': [3646, 2534, 34343, 979, 7687],
...: 'Age_27': [4242, 3343, 6667, 878, 78],
...: 'Age_28': [4344, 63475, 132, 6776, 8676],
...: 'Age_29': [4242, 34433, 323, 7676, 786],
...: 'Age_30': [6464, 34434, 3434, 898, 9798]
...: }
...:
...: df = pd.DataFrame(data=d)
...: df = df.set_index('City')
...: df
Out[2]:
Age_25 Age_26 Age_27 Age_28 Age_29 Age_30
City
New York 11312 3646 4242 4344 4242 6464
London 6446 2534 3343 63475 34433 34434
Paris 5242 34343 6667 132 323 3434
Hong Kong 354 979 878 6776 7676 898
Buenos Aires 4244 7687 78 8676 786 9798
In [3]: n_cols = 3 # change to 5 for actual dataset
...: sum_every_n_cols_df = df.groupby((np.arange(len(df.columns)) // n_cols) + 1, axis=1).sum()
...: sum_every_n_cols_df
Out[3]:
1 2
City
New York 19200 15050
London 12323 132342
Paris 46252 3889
Hong Kong 2211 15350
Buenos Aires 12009 19260
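In pandas 2.1 and later, passing axis=1 to groupby is deprecated. A sketch of the same idea that avoids it is to transpose, group the rows, and transpose back. The band labels such as Age_25_27 are built here from the column names; that naming follows the question's desired output and is an assumption, not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'London', 'Paris', 'Hong Kong', 'Buenos Aires'],
    'Age_25': [11312, 6446, 5242, 354, 4244],
    'Age_26': [3646, 2534, 34343, 979, 7687],
    'Age_27': [4242, 3343, 6667, 878, 78],
    'Age_28': [4344, 63475, 132, 6776, 8676],
    'Age_29': [4242, 34433, 323, 7676, 786],
    'Age_30': [6464, 34434, 3434, 898, 9798],
}).set_index('City')

n_cols = 3  # change to 5 for the actual dataset

# Group number for each column position: 0, 0, 0, 1, 1, 1, ...
group_ids = np.arange(len(df.columns)) // n_cols

# Transpose so the age columns become rows, sum each group, transpose back.
summed = df.T.groupby(group_ids).sum().T

# Label each band with the first and last age it covers.
ages = df.columns.str.split('_').str[-1].astype(int)
summed.columns = [
    f'Age_{ages[i]}_{ages[min(i + n_cols, len(ages)) - 1]}'
    for i in range(0, len(ages), n_cols)
]
print(summed)
```

This relies on the age columns being in ascending order with no gaps, since the grouping is purely positional.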
You can extract the columns of the dataframe and put them in a list. Use
col_list = df.columns
But ultimately, I think what you'd want to do is more of a while loop with your inputs (band of 5 and up to 100 ages) as static values that you iterate over.
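band = 5
start = 20
max_age = 120
age_start = start
while age_start < max_age:
    age_end = age_start + band - 1  # last age in this band (inclusive)
    col_name = 'Age_' + str(age_start) + '_to_' + str(age_end)
    # keep only the single-age columns that actually exist in df
    sum_cols = ['Age_' + str(age) for age in range(age_start, age_end + 1)
                if 'Age_' + str(age) in df.columns]
    if sum_cols:
        df[col_name] = df[sum_cols].sum(axis=1)
    age_start += band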
I have a dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'number': [0, 29, 0, 52, 0, 0]})
I am working on a solution for numbering the 0 values in the number column.
My code, which is not working right:
for i, row in df.iterrows():
    df.loc[df['number'] == 0, 'number'] = i+1
df
This code replaces every 0 with 1, but it should replace the first 0 with 1, the second 0 with 2, and so on.
I would like a solution based on an iteration method.
Note: the numbers 29, 52, etc. must not be changed.
Try np.where with a Boolean mask of the 0 values in df, replacing them with the cumsum of the mask to enumerate them:
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
Or use Series.mask
m = df['number'].eq(0)
df['number'] = df['number'].mask(m, m.cumsum())
df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
m.cumsum():
0 1
1 1
2 2
3 2
4 3
5 4
Name: number, dtype: int32
Complete Working Example:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'number': [0, 29, 0, 52, 0, 0]
})
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
print(df)
Try via boolean masking and loc accessor:
mask=df['number']==0 #created boolean mask
df.loc[mask,'number']=mask.cumsum()
OR
via where() method:
df['number']=df.where(~mask,mask.cumsum(),axis=0)['number']
OR
via boolean masking and assign() method
df[mask]=df[mask].assign(number=mask.cumsum())
Output of df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
Alternative via replace and fillna:
df.number = df.number.replace(0,np.nan).fillna(df.number.eq(0).cumsum())
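The question asked for an iteration-based method. The vectorised answers above are usually preferable, but a minimal sketch of an explicit loop, using a running counter (the name counter is just illustrative), could be:

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0],
})

counter = 0  # how many zeros have been seen so far
for idx in df.index:
    # Only touch rows whose value is 0; other numbers stay as they are.
    if df.at[idx, 'number'] == 0:
        counter += 1
        df.at[idx, 'number'] = counter

print(df)
```

The loop in the question fails because df.loc[df['number'] == 0, 'number'] selects every zero at once, so the first pass overwrites them all with 1.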
How do I convert a dictionary to a DataFrame with a default index and column names?
Dictionary: d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
Expected df:
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
Use DataFrame.from_dict with orient='index' parameter:
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
df = pd.DataFrame.from_dict(d, orient='index', columns=['id','type','value'])
print (df)
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
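With orient='index' the dictionary keys become the index, which happens to match the default 0, 1, 2 here. As a small sketch, assuming a hypothetical variant of the dictionary whose keys are not consecutive, reset_index(drop=True) restores the default index:

```python
import pandas as pd

# Hypothetical variant of the question's dictionary with arbitrary keys.
d = {10: [1, 'Sports', 222], 42: [2, 'Tools', 11], 7: [3, 'Clothing', 23]}

df = pd.DataFrame.from_dict(d, orient='index', columns=['id', 'type', 'value'])

# Discard the key-based index and fall back to 0, 1, 2, ...
df = df.reset_index(drop=True)
print(df)
```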
How do I convert a single column of a pandas dataframe to type string? In the df of housing data below I need to convert zipcode to string so that when I run linear regression, zipcode is treated as categorical and not numeric. Thanks!
df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155}, 'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5}, 'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603}, 'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4}, 'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180}, 'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0}})
print (df)
bathrooms bedrooms floors sqft_living sqft_lot zipcode
722 3.25 4 2.0 4670 51836 98005
2680 0.75 2 1.0 1440 3700 98107
14554 2.50 4 2.0 3180 9603 98155
17384 1.50 2 3.0 1430 1650 98125
18754 1.00 2 1.0 1130 2640 98109
You need astype:
df['zipcode'] = df.zipcode.astype(str)
#df.zipcode = df.zipcode.astype(str)
For converting to categorical:
df['zipcode'] = df.zipcode.astype('category')
#df.zipcode = df.zipcode.astype('category')
Another solution is Categorical:
df['zipcode'] = pd.Categorical(df.zipcode)
Sample with data:
import pandas as pd
df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155}, 'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5}, 'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603}, 'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4}, 'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180}, 'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0}})
print (df)
bathrooms bedrooms floors sqft_living sqft_lot zipcode
722 3.25 4 2.0 4670 51836 98005
2680 0.75 2 1.0 1440 3700 98107
14554 2.50 4 2.0 3180 9603 98155
17384 1.50 2 3.0 1430 1650 98125
18754 1.00 2 1.0 1130 2640 98109
print (df.dtypes)
bathrooms float64
bedrooms int64
floors float64
sqft_living int64
sqft_lot int64
zipcode int64
dtype: object
df['zipcode'] = df.zipcode.astype('category')
print (df)
bathrooms bedrooms floors sqft_living sqft_lot zipcode
722 3.25 4 2.0 4670 51836 98005
2680 0.75 2 1.0 1440 3700 98107
14554 2.50 4 2.0 3180 9603 98155
17384 1.50 2 3.0 1430 1650 98125
18754 1.00 2 1.0 1130 2640 98109
print (df.dtypes)
bathrooms float64
bedrooms int64
floors float64
sqft_living int64
sqft_lot int64
zipcode category
dtype: object
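Whether a regression routine treats a category column as categorical depends on the library. One common, library-independent option is to one-hot encode the column first. The following is only a sketch on a trimmed-down version of the sample data, using pd.get_dummies; the choice of drop_first=True (to avoid a redundant column) is an assumption about the modelling setup, not something stated in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'zipcode': [98125, 98107, 98005, 98109, 98155],
    'bedrooms': [2, 2, 4, 2, 4],
    'sqft_living': [1430, 1440, 4670, 1130, 3180],
})

df['zipcode'] = df['zipcode'].astype('category')

# Replace zipcode with one indicator column per category,
# dropping the first category to serve as the baseline.
encoded = pd.get_dummies(df, columns=['zipcode'], drop_first=True)
print(encoded.columns.tolist())
```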
With pandas >= 1.0 there is now a dedicated string datatype:
1) You can convert your column to this pandas string datatype using .astype('string'):
df['zipcode'] = df['zipcode'].astype('string')
2) This is different from using str which sets the pandas object datatype:
df['zipcode'] = df['zipcode'].astype(str)
3) For changing into categorical datatype use:
df['zipcode'] = df['zipcode'].astype('category')
You can see this difference in datatypes when you look at the info of the dataframe:
df = pd.DataFrame({
'zipcode_str': [90210, 90211] ,
'zipcode_string': [90210, 90211],
'zipcode_category': [90210, 90211],
})
df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_string'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')
df.info()
# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 zipcode_str 2 non-null object
1 zipcode_string 2 non-null string
2 zipcode_category 2 non-null category
dtypes: category(1), object(1), string(1)
From the docs:
The 'string' extension type solves several issues with object-dtype
NumPy arrays:
You can accidentally store a mixture of strings and non-strings in an
object dtype array. A StringArray can only store strings.
object dtype breaks dtype-specific operations like
DataFrame.select_dtypes(). There isn’t a clear way to select just text
while excluding non-text, but still object-dtype columns.
When reading code, the contents of an object dtype array is less clear
than string.
More info on working with the new string datatype can be found here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
Prior answers focused on nominal data (i.e. unordered). If there is a reason to impose an order, as for an ordinal variable, then one would use:
# Transform to category
df['zipcode_category'] = df['zipcode_category'].astype('category')
# Add ordered category
df['zipcode_ordered'] = df['zipcode_category']
# Setup the ordering
# (the inplace argument was removed in pandas 2.0, so assign the result back)
df['zipcode_ordered'] = df['zipcode_ordered'].cat.set_categories(
    new_categories = [90211, 90210], ordered = True
)
# Output IDs
df['zipcode_ordered_id'] = df.zipcode_ordered.cat.codes
print(df)
# zipcode_category zipcode_ordered zipcode_ordered_id
# 90210 90210 1
# 90211 90211 0
More details on setting ordered categories can be found at the pandas website:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#sorting-and-order
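An alternative sketch of the same ordering declares it up front with pd.CategoricalDtype and converts in a single astype call. The column name zipcode and the variable zip_order are illustrative choices for this example:

```python
import pandas as pd

df = pd.DataFrame({'zipcode': [90210, 90211]})

# Declare the categories and their order once, then reuse the dtype.
zip_order = pd.CategoricalDtype(categories=[90211, 90210], ordered=True)

df['zipcode_ordered'] = df['zipcode'].astype(zip_order)
# Integer position of each value within the declared order.
df['zipcode_ordered_id'] = df['zipcode_ordered'].cat.codes
print(df)
```

Since the dtype is an ordinary object, the same ordering can be applied to other columns or frames without repeating the category list.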
To convert a column to a string type (which pandas stores as an object column), use astype:
df.zipcode = df.zipcode.astype(str)
If you want a Categorical column instead, pass 'category' to the same function:
df.zipcode = df.zipcode.astype('category')