Pandas Column Transformation with list of dict in column - pandas

I am getting data from a NoSQL database owned by a third party. After fetching, the dataframe looks like the one below. I would like to explode the performance column but can't figure out how. Is it even possible?
import pandas as pd
cols = ['name', 'performance']
data = [
['bob', [{'dates': '15-12-2021', 'gdp': 19},
{'dates': '16-12-2021', 'gdp': 36},
{'dates': '12-12-2022', 'gdp': 39},
{'dates': '13-12-2022', 'gdp': 35},
{'dates': '14-12-2022', 'gdp': 35}]]]
df = pd.DataFrame(data, columns=cols)
Expected output:
cols = ['name', 'dates', 'gdp']
data = [
['bob', '15-12-2021', 19],
['bob', '16-12-2021', 36],
['bob', '12-12-2022', 39],
['bob', '13-12-2022', 35],
['bob', '14-12-2022', 35]]
df = pd.DataFrame(data, columns=cols)

Use DataFrame.explode followed by DataFrame.reset_index (so that every exploded row gets a unique index and the later join aligns correctly), then flatten the dictionaries with json_normalize. DataFrame.pop removes the performance column from the output DataFrame:
df1 = df.explode('performance').reset_index(drop=True)
df1 = df1.join(pd.json_normalize(df1.pop('performance')))
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
Another solution uses a list comprehension, if the input DataFrame has only these 2 columns:
L = [{**{'name':a},**x} for a, b in zip(df['name'], df['performance']) for x in b]
df1 = pd.DataFrame(L)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
If there are more columns, use DataFrame.join with the original DataFrame:
L = [{**{'i':a},**x} for a, b in df.pop('performance').items() for x in b]
df1 = df.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print (df1)
name dates gdp
0 bob 15-12-2021 19
1 bob 16-12-2021 36
2 bob 12-12-2022 39
3 bob 13-12-2022 35
4 bob 14-12-2022 35
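As a rough sketch, the explode and json_normalize approach can be wrapped in a small helper. The function name explode_dict_column, the extra city column, and the second person are made up for illustration and are not part of the original data:

```python
import pandas as pd

def explode_dict_column(df, column):
    """Turn a column holding lists of dicts into one row per dict (hypothetical helper)."""
    # One row per list element; a fresh 0..n-1 index keeps the join aligned
    exploded = df.explode(column).reset_index(drop=True)
    # Each dict becomes a row whose keys are the new column names
    flat = pd.json_normalize(exploded[column].tolist())
    return exploded.drop(columns=column).join(flat)

# Made-up input with more than one person and an extra column
df = pd.DataFrame({
    "name": ["bob", "amy"],
    "city": ["Oslo", "Lima"],
    "performance": [
        [{"dates": "15-12-2021", "gdp": 19}, {"dates": "16-12-2021", "gdp": 36}],
        [{"dates": "12-12-2022", "gdp": 39}],
    ],
})

result = explode_dict_column(df, "performance")
print(result)
```

The helper leaves the input DataFrame untouched, unlike the pop-based variants, which remove the performance column from the frame they are called on.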

Related

Is there a way i could work with this multiindex?

I have a dataframe like this one: https://i.stack.imgur.com/2Sr29.png. RBD is a code that identifies each school, LET_CUR corresponds to a class, and MRUN is the number of students in each class. What I need is the following:
I would like to know how many of the schools have at least one class with more than 45 students. So far I haven't figured out the code to do that.
Thanks.
From your DataFrame :
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
RBD,LET_CUR,MRUN
1,A,65
1,B,23
1,C,21
2,A,22
2,B,20
2,C,34
3,A,54
4,A,23
4,B,11
5,A,15
5,C,16
6,A,76"""))
>>> df = df.set_index(['RBD', 'LET_CUR'])
>>> df
MRUN
RBD LET_CUR
1 A 65
B 23
C 21
2 A 22
B 20
C 34
3 A 54
4 A 23
B 11
5 A 15
C 16
6 A 76
Since we want the number of schools with at least one class of more than 45 students, we can first filter the DataFrame on the MRUN column and then use the nunique() method to count the unique schools:
>>> df_filtered = df[df['MRUN'] > 45].reset_index()
>>> df_filtered['RBD'].nunique()
3
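An equivalent way to get the same count, sketched here on the same sample data, is to take the largest class per school and count how many of those exceed 45. The variable names largest_class and n_schools are only illustrative:

```python
import pandas as pd
from io import StringIO

csv_text = """RBD,LET_CUR,MRUN
1,A,65
1,B,23
1,C,21
2,A,22
2,B,20
2,C,34
3,A,54
4,A,23
4,B,11
5,A,15
5,C,16
6,A,76
"""

df = pd.read_csv(StringIO(csv_text)).set_index(["RBD", "LET_CUR"])

# Largest class size per school, grouped on the first index level
largest_class = df.groupby(level="RBD")["MRUN"].max()

# A school qualifies when its largest class has more than 45 students
n_schools = int((largest_class > 45).sum())
print(n_schools)  # 3 (schools 1, 3 and 6)
```

Grouping by school first makes it explicit that each school is counted once, however many of its classes are above the threshold.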
Try the following (here I build a dataframe with a structure similar to yours):
import pandas as pd
df = pd.DataFrame({'RBD': [1, 1, 2, 3],
                   'COD_GRADO': ['1', '2', '1', '3'],
                   'LET_CUR': ['A', 'C', 'B', 'A'],
                   'MRUN': [65, 34, 64, 25]},
                  columns=['RBD', 'COD_GRADO', 'LET_CUR', 'MRUN'])
print(df)
# Keep classes with more than 45 students, then count distinct schools (RBD)
n_schools = df.loc[df['MRUN'] > 45, 'RBD'].nunique()
print(f"Number of schools with a class of more than 45 students is {n_schools}")
And the output for my example would be:
   RBD COD_GRADO LET_CUR  MRUN
0    1         1       A    65
1    1         2       C    34
2    2         1       B    64
3    3         3       A    25
Number of schools with a class of more than 45 students is 2

Comparing strings in two different dataframe and adding a column [duplicate]

I have two dataframes as follows:
df1 =
Index Name Age
0 Bob1 20
1 Bob2 21
2 Bob3 22
The second dataframe is as follows -
df2 =
Index Country Name
0 US Bob1
1 UK Bob123
2 US Bob234
3 Canada Bob2
4 Canada Bob987
5 US Bob3
6 UK Mary1
7 UK Mary2
8 UK Mary3
9 Canada Mary65
I would like to match the names in df1 against the names in df2 and create a new dataframe as follows:
Index Country Name Age
0 US Bob1 20
1 Canada Bob2 21
2 US Bob3 22
Thank you.
Using merge() should solve the problem. By default it performs an inner join, so only names present in both dataframes are kept:
df3 = pd.merge(df1, df2, on='Name')
Full example:
import pandas as pd
df1 = pd.DataFrame({ "Name":["Bob1", "Bob2", "Bob3"], "Age":[20,21,22]})
df2 = pd.DataFrame({ "Country":["US", "UK", "US", "Canada", "Canada", "US", "UK", "UK", "UK", "Canada"],
"Name":["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3", "Mary1", "Mary2", "Mary3", "Mary65"]})
df3 = pd.merge(df1, df2, on='Name')
df3
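The merge above returns the columns in the order Name, Age, Country. A minimal sketch that reproduces the column order of the expected output, assuming the same two dataframes, is to merge from df2 and select the columns explicitly:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob1", "Bob2", "Bob3"], "Age": [20, 21, 22]})
df2 = pd.DataFrame({
    "Country": ["US", "UK", "US", "Canada", "Canada", "US", "UK", "UK", "UK", "Canada"],
    "Name": ["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3",
             "Mary1", "Mary2", "Mary3", "Mary65"],
})

# Inner join on Name keeps only names found in both frames;
# the column selection fixes the order shown in the question
df3 = df2.merge(df1, on="Name", how="inner")[["Country", "Name", "Age"]]
print(df3)
```

The result gets a fresh 0, 1, 2 index, which matches the Index column in the question.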

How to make new cell based on appearance in dataframe cell

I want to create a new column in a dataframe, set when an existing list-valued column contains a given value and another column matches a condition.
Dataset:
name loto
0 Jason [22]
1 Molly [222]
2 Tina [232]
3 Jake [223]
4 Amy [73, 1, 2, 3]
If name == "Jason" and loto contains 22, then new = 1.
I tried to use np.where, but I am having trouble checking for an element in the list.
import numpy as np
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'loto': [[22], [222], [232], [223], [73,1,2,3]]}
df = pd.DataFrame(data, columns = ['name', 'loto'])
df['new'] = np.where((22 in df['loto']) & (df[name]=="Jason"), 1, 0)
First put the value you want to check for in a set, like set([22]).
Then pass its issubset method (loto_chck) to map and apply the combined condition in .loc:
loto_val = set([22])
loto_chck= loto_val.issubset
df.loc[(df['loto'].map(loto_chck))&(df['name']=='Jason'),"new"]=1
    name            loto  new
0  Jason            [22]  1.0
1  Molly           [222]  NaN
2   Tina           [232]  NaN
3   Jake           [223]  NaN
4    Amy  [73, 1, 2, 3]  NaN
You could try:
df['new'] = ((df.apply(lambda x : 22 in x.loto , axis = 1)) & \
(df.name =='Jason')).astype(int)
That said, storing lists in a dataframe is generally not a good idea.
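To see why the np.where attempt in the question does not work, note that the in operator applied to a Series tests the index labels, not the values. A small sketch on the question's data that keeps np.where but builds the membership mask element by element (the names in_index and has_22 are illustrative):

```python
import numpy as np
import pandas as pd

data = {"name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
        "loto": [[22], [222], [232], [223], [73, 1, 2, 3]]}
df = pd.DataFrame(data, columns=["name", "loto"])

# `in` on a Series looks at the index (0..4 here), so this is False
in_index = 22 in df["loto"]
print(in_index)

# Test each list individually to get a boolean Series
has_22 = df["loto"].apply(lambda values: 22 in values)

# Combine both conditions and map them to 1 / 0
df["new"] = np.where(has_22 & (df["name"] == "Jason"), 1, 0)
print(df)
```

This gives 1 for Jason and 0 for every other row, rather than NaN for the non-matching rows.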

Python Pandas: how to overwrite subset of a dataframe with a subset of another dataframe?

Given df1 and df2, how do I get df3 using pandas, where df3 has df1 elements:
[11, 12, 21, 22]
in the place of df2 elements
[22, 23, 32, 33]
Condition: indexes of row 1 & 2 in df1 are the same as indexes of row 2 & 3 in df2
You are looking for the DataFrame.loc indexer.
Small example:
import pandas as pd
df1 = pd.DataFrame({"data":[1,2,3,4,5]})
df2 = pd.DataFrame({"data":[11,12,13,14,15]})
df3 = df1.copy()
df3.loc[3:4] = df2.loc[3:4]
df3
data
0 1
1 2
2 3
3 14
4 15
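The frames from the question are not shown in the text, so the following is only a sketch with made-up data. It assumes the source block and the target block sit at different positions and under different labels, in which case a label-aligned assignment would not line them up. Selecting by position with iloc and assigning the raw values via to_numpy() sidesteps alignment:

```python
import pandas as pd

# Hypothetical frames: a 2x2 source and a 3x3 target filled with zeros
df1 = pd.DataFrame([[11, 12], [21, 22]], index=["x", "y"], columns=["a", "b"])
df2 = pd.DataFrame(0, index=["w", "x", "y"], columns=["a", "b", "c"])

df3 = df2.copy()
# Overwrite rows 1-2, columns 1-2 of the target with the source values;
# to_numpy() drops the labels so nothing is re-aligned
df3.iloc[1:3, 1:3] = df1.iloc[0:2, 0:2].to_numpy()
print(df3)
```

The two blocks must have the same shape for this positional assignment to work.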

Pandas Column Construction with np.where()

I'm working through an assignment with Pandas and am using np.where() to add a column to a Pandas DataFrame with three possible values:
fips_df['geog_type'] = np.where(fips_df.fips.str[-3:] != '000', 'county', np.where(fips_df.fips.str[:] == '00000', 'country', 'state'))
The state of the DataFrame after adding the column is like this:
print(fips_df[:5])
fips geog_entity fips_prefix geog_type
0 00000 UNITED STATES 00 country
1 01000 ALABAMA 01 state
2 01001 Autauga County, AL 01 county
3 01003 Baldwin County, AL 01 county
4 01005 Barbour County, AL 01 county
This column construction is tested by two asserts. The first passes and the second fails.
## check the numbers of geog_type
assert set(fips_df['geog_type'].value_counts().items()) == set([('state', 51), ('country', 1), ('county', 3143)])
assert set(fips_df.geog_type.value_counts().items()) == set([('state', 51), ('country', 1), ('county', 3143)])
What is the difference between calling columns as fips_df.geog_type and fips_df['geog_type'] that causes my second assert to fail?
Just in case, you can create a new column with much less effort. E.g.:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.uniform(size=10))
In [4]: df
Out[4]:
0
0 0.366489
1 0.697744
2 0.570066
3 0.756647
4 0.036149
5 0.817588
6 0.884244
7 0.741609
8 0.628303
9 0.642807
In [5]: categorize = lambda value: "ABC"[int(value > 0.3) + int(value > 0.6)]
In [6]: df["new_col"] = df[0].apply(categorize)
In [7]: df
Out[7]:
0 new_col
0 0.366489 B
1 0.697744 C
2 0.570066 B
3 0.756647 C
4 0.036149 A
5 0.817588 C
6 0.884244 C
7 0.741609 C
8 0.628303 C
9 0.642807 C
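For the three-way labelling in the question specifically, np.select is one alternative to nesting np.where calls. This is a sketch on the five rows printed in the question, not on the full dataset:

```python
import numpy as np
import pandas as pd

# Only the five sample rows shown in the question
fips_df = pd.DataFrame({"fips": ["00000", "01000", "01001", "01003", "01005"]})

# Conditions are checked in order; the first match wins
conditions = [
    fips_df["fips"] == "00000",          # the whole country
    fips_df["fips"].str[-3:] == "000",   # any other code ending in 000
]
choices = ["country", "state"]

# Everything that matches neither condition is a county
fips_df["geog_type"] = np.select(conditions, choices, default="county")
print(fips_df)
```

Because the conditions are evaluated in order, the country test has to come before the more general state test.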
The two should be the same (and will be most of the time)...
One situation where they are not is when the DataFrame already has an attribute or method with that name (in which case it won't be overridden, and the column won't be accessible with dot notation):
In [1]: df = pd.DataFrame([[1, 2] ,[3 ,4]])
In [2]: df.A = 7
In [3]: df.B = lambda: 42
In [4]: df.columns = list('AB')
In [5]: df.A
Out[5]: 7
In [6]: df.B()
Out[6]: 42
In [7]: df['A']
Out[7]:
0 1
1 3
Name: A
Interestingly, dot notation for accessing columns isn't mentioned in the selection syntax.
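A common way to run into this in practice is a column whose name matches an existing DataFrame method. In this small sketch with made-up data, the count column collides with DataFrame.count, while geog_type does not collide with anything:

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2], "geog_type": ["state", "county"]})

# Dot notation finds the built-in method, not the column
is_method = callable(df.count)
print(is_method)

# Bracket notation always returns the column
count_values = df["count"].tolist()
print(count_values)

# With no name clash, both notations return the same Series
same = df.geog_type.equals(df["geog_type"])
print(same)
```

Bracket notation is therefore the safer choice whenever column names are not under your control.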