Extract dictionary value from a list contained in Pandas dataframe column - pandas

I'm trying to extract values from dictionaries contained within lists in a Pandas dataframe column. The objective is to split the id key into multiple columns. Sample data looks like:
Column_Header
[{'id': '498', 'relTypeId': '2'}, {'id': '499', 'relTypeId': '3'}]
[{'id': '499', 'relTypeId': '3'}, {'id': '500', 'relTypeId': '4'}, {'id': '501', 'relTypeId': '5'}]
I have tried as below
list(map(lambda x: x["id"], df["Column_Header"]))
But I get the following error:
"list indices must be integers or slices, not str". The desired output is:
col1|col2|col3
498 |499 |
499 |500 |501
Can someone please help?

We can do explode first, then create the additional key with cumcount, and pivot:
s=df.Column_Header.explode().str['id']
s=pd.crosstab(index=s.index,columns=s.groupby(level=0).cumcount(),values=s,aggfunc='sum')
Out[133]:
col_0    0    1    2
row_0
0      498  499  NaN
1      499  500  501
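As a runnable sketch of the approach above (the sample frame is assumed from the question, and aggfunc='first' is used here since each cell holds a single value):

```python
import pandas as pd

# sample frame assumed to match the question
df = pd.DataFrame({"Column_Header": [
    [{"id": "498", "relTypeId": "2"}, {"id": "499", "relTypeId": "3"}],
    [{"id": "499", "relTypeId": "3"}, {"id": "500", "relTypeId": "4"},
     {"id": "501", "relTypeId": "5"}],
]})

# explode the lists so each dict gets its own row, then pull out the 'id' key
s = df["Column_Header"].explode().str["id"]

# pivot on a per-row counter so each id lands in its own column
out = pd.crosstab(index=s.index,
                  columns=s.groupby(level=0).cumcount(),
                  values=s, aggfunc="first")
print(out)
```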

If performance is important, use a nested list comprehension selecting the id key of each dictionary:
df = pd.DataFrame([[y['id'] for y in x] for x in df['Column_Header']], index=df.index)
print (df)
     0    1     2
0  498  499  None
1  499  500  501
If some values might be missing (rows that are not lists), use:
L = [[y['id'] for y in x] if isinstance(x, list) else [None] for x in df['Column_Header']]
df = pd.DataFrame(L, index=df.index)

Related

Getting variable no of pandas rows w.r.t. a dictionary lookup

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
                   'qty': np.random.randn(data_size)}).sort_values('symbol')
How to get a dataframe with variable rows from a dictionary?
I tried [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)], but it gives a RuntimeWarning and the result looks incorrect.
You can use concat with list comprehension:
print (pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k,v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Another method uses groupby + cumcount and df.query:
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012
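A minimal check of the boolean-mask version, using small hand-made data instead of the random sample so the counts are predictable; the symbol 'C' is deliberately absent from max_rows to show such rows are dropped:

```python
import pandas as pd
import numpy as np

max_rows = {'A': 3, 'B': 2, 'D': 4}
df = pd.DataFrame({'symbol': list('AAABBBBCCDDDDDD'),
                   'qty': np.arange(15.0)})

# keep at most max_rows[symbol] rows per symbol; symbols missing from
# the dict map to NaN, the comparison is False, and they are dropped
out = df[df.groupby('symbol').cumcount() + 1 <= df['symbol'].map(max_rows)]
print(out['symbol'].value_counts().sort_index())
```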

Pandas dataframe to COO matrix and to LIL matrix

I have following series:
groups['combined']
0 (28, 1) 1
1 (32, 1) 1
2 (36, 1) 1
3 (37, 1) 1
4 (84, 1) 1
....
Name: combined, Length: 14476, dtype: object
How can I convert this series into a COO matrix (.tocoo()) and a LIL matrix (.tolil())?
For reference, the combined column is formed from the original Pandas DataFrame:
import pandas as pd
pd.DataFrame({0: [28, 32, 36, 37, 84], 1: [1, 1, 1, 1, 1], 2: [1, 1, 1, 1, 1]})
Column 0 has over 10K unique features, column 1 has 39 groups, and column 2 is just 1.
Formation of COOrdinate format from original pandas DataFrame
import scipy.sparse as sps
groups.set_index([0, 1], inplace=True)
# note: MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
sps.coo_matrix((groups[2], (groups.index.codes[0], groups.index.codes[1])))
which results in:
<10312x39 sparse matrix of type '<class 'numpy.int64'>'
with 14476 stored elements in COOrdinate format>
As regards the LIL matrix:
print(len(networks[0]), len(networks[1]), networks[0].nunique(), networks[1].nunique())
667966 667966 10312 10312
networks[:5]
0 1
0 176 1
1 233 1
2 283 1
3 371 1
4 394 1
# make row and col labels
rows = networks[0]
cols = networks[1]
# crucial third array in python
networks.set_index([0, 1], inplace=True)
# as above, use .codes on pandas >= 0.24 (formerly .labels)
Ntw = sps.coo_matrix((networks[2], (networks.index.codes[0],
                                    networks.index.codes[1])))
d=Ntw.tolil()
d
generates
<10312x10312 sparse matrix of type '<class 'numpy.int64'>'
with 667966 stored elements in LInked List format>
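Since MultiIndex.labels no longer exists on current pandas, a sketch of the same construction that sidesteps the index entirely via pd.factorize (toy data assumed, shapes scaled down from the question):

```python
import pandas as pd
import scipy.sparse as sps

groups = pd.DataFrame({0: [28, 32, 36, 37, 84],
                       1: [1, 1, 1, 1, 1],
                       2: [1, 1, 1, 1, 1]})

# factorize maps each distinct value to a dense integer code
row_codes, _ = pd.factorize(groups[0])
col_codes, _ = pd.factorize(groups[1])

coo = sps.coo_matrix((groups[2], (row_codes, col_codes)))
lil = coo.tolil()
print(repr(lil))
```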

Two-level header in pandas?

I created a new dataframe from an old one and now I have something like this:
df = pd.DataFrame({0:[1,5,1,1,3]}, index=[243,254,507,1903,2358]).rename_axis('uid')
print (df)
0
uid
243 1
254 5
507 1
1903 1
2358 3
I don't really understand what it means. Is that a double header, with the first header having just one index and the second having the other one? How can I transform this dataframe to have a single header, with names ['userID', 'counts']?
This is not a double header: it is a one-column DataFrame with a column named 0 and an index named uid.
So you need:
df = df.reset_index()
df.columns = ['userID' , 'counts']
print (df)
userID counts
0 243 1
1 254 5
2 507 1
3 1903 1
4 2358 3
Another solution:
df = df.rename_axis('userID').squeeze().reset_index(name='counts')
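A runnable sketch of that one-liner, starting from the sample frame in the question:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 5, 1, 1, 3]},
                  index=pd.Index([243, 254, 507, 1903, 2358], name='uid'))

# rename the index axis, squeeze the single column into a Series,
# then move the index back out as a regular column
out = df.rename_axis('userID').squeeze().reset_index(name='counts')
print(out)
```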

Issue looping through dataframes in Pandas

I have a dict 'd' whose values are dataframes, e.g.:
d["DataFrame1"]
Will return that dataframe with all its columns:
ID Name
0 123 John
1 548 Eric
2 184 Sam
3 175 Andy
Each dataframe has a column in it called 'Names'. I want to extract this column from each dataframe in the dict and to create a new dataframe consisting of these columns.
df_All_Names = pd.DataFrame()
for df in d:
    df_All_Names[df] = df['Names']
Returns the error:
TypeError: string indices must be integers
Unsure where I'm going wrong here.
Iterating over a dict yields its keys (strings), not the dataframes, which is why indexing with 'Names' fails. For example, suppose you have df as follows:
df=pd.DataFrame({'Name':['X', 'Y']})
df1=pd.DataFrame({'Name':['X1', 'Y1']})
And we create a dict
d=dict()
d['df']=df
d['df1']=df1
Then preset an empty DataFrame:
yourdf=pd.DataFrame()
Use items with a for loop:
for key, val in d.items():
    yourdf[key] = val['Name']
which yields:
yourdf
Out[98]:
df df1
0 X X1
1 Y Y1
You can use reduce to concatenate all of the columns named 'Name' in your dictionary of dataframes.
Sample Data
from functools import reduce
d = {'df1':pd.DataFrame({'ID':[0,1,2],'Name':['John','Sam','Andy']}),'df2':pd.DataFrame({'ID':[3,4,5],'Name':['Jen','Cara','Jess']})}
You can stack the data side by side using axis=1
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=1),d.values())
Name Name
0 John Jen
1 Sam Cara
2 Andy Jess
Or on top of one another using axis=0:
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=0),d.values())
0 John
1 Sam
2 Andy
0 Jen
1 Cara
2 Jess
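A variation not shown above: pd.concat accepts a dict of Series directly, so the side-by-side frame can be built in one call (a sketch using the sample d from this answer):

```python
import pandas as pd

d = {'df': pd.DataFrame({'Name': ['X', 'Y']}),
     'df1': pd.DataFrame({'Name': ['X1', 'Y1']})}

# dict keys become the column labels when concatenating along axis=1
all_names = pd.concat({k: v['Name'] for k, v in d.items()}, axis=1)
print(all_names)
```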

Pandas dataframe apply function

I have a dataframe which looks like this.
df.head()
Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
I had to group the data week-wise, for which I did:
df['week_num'] = pd.DatetimeIndex(df['Ship Date']).week
x = df.groupby('week_num').sum()
it produces a dataframe which looks like this:
Cost Amount
week_num
30 3.273473e+06
31 9.715421e+07
32 9.914568e+07
33 9.843721e+07
34 1.065546e+08
35 1.087598e+08
36 8.050456e+07
Now I wanted to add a column with week and year information. To do this I defined:
def my_conc(row):
    return str(row['week_num']) + '2011'
and
x['year_week'] = x.apply(my_conc,axis= 1)
This gives me an error message:
KeyError: ('week_num', u'occurred at index 30')
Now my questions are:
1) Why does groupby produce a dataframe that looks a little odd, without week_num as a column name?
2) Is there a better way of producing the dataframe with grouped data?
3) How do I use an apply function on the grouped dataframe above?
Here's one way to do it.
Use as_index=False in groupby so that week_num stays a regular column instead of becoming the index (which is also why your apply raised KeyError).
In [50]: df_grp = df.groupby('week_num', as_index=False).sum()
Then apply lambda function.
In [51]: df_grp['year_week'] = df_grp.apply(lambda x: str(x['week_num']) + '2011',
axis=1)
In [52]: df_grp
Out[52]:
week_num Cost year_week
0 30 3273473 302011
1 31 97154210 312011
2 32 99145680 322011
3 33 98437210 332011
4 34 106554600 342011
5 35 108759800 352011
6 36 80504560 362011
Or use df_grp.apply(lambda x: '%d2011' % x['week_num'], axis=1)
On your first question, I have no idea. When I try to replicate it, I just get an error.
On the other questions, use the .dt accessor with groupby():
# get your data into a DataFrame
data = """Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
"""
from StringIO import StringIO # import from io for Python 3
df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=' ', skipinitialspace=True)
# make the dtype for the column datetime64[ns]
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# then you can use the .dt accessor to group on
x = df.groupby(df['Ship Date'].dt.dayofyear).sum()
y = df.groupby(df['Ship Date'].dt.weekofyear).sum()
There are a host more of these .dt accessors; see the pandas documentation.
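On current pandas (where Series.dt.weekofyear has been removed in favour of isocalendar().week), a sketch of the weekly grouping using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Ship Date': ['2010-08-01'] * 5,
                   'Cost Amount': [4257.233, 9846.9454, 35.77764,
                                   420.8292, 129.49638]})
df['Ship Date'] = pd.to_datetime(df['Ship Date'])

# isocalendar() returns a frame with year/week/day columns
week = df['Ship Date'].dt.isocalendar().week
x = df.groupby(week).sum(numeric_only=True)
print(x)
```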