Exploding nested lists using Pandas Series keeps failing - pandas

I haven't used pandas explode before. I get the gist of pd.explode, but I heard that pd.Series.explode is useful when selected columns hold nested lists. However, I keep getting: KeyError: "None of ['city'] are in the columns". Yet 'city' is defined in the keys:
keys = ["city", "temp"]
values = [["chicago","london","berlin"], [[32,30,28],[39,40,25],[33,34,35]]]
df = pd.DataFrame({"keys":keys,"values":values})
df2 = df.set_index(['city']).apply(pd.Series.explode).reset_index()
The desired output is:
city temp
chicago 32
chicago 30
chicago 28
etc.
I would appreciate an expert weighing in on why this throws an error, and on a fix. Thank you.

The problem comes from how you define df:
df = pd.DataFrame({"keys":keys,"values":values})
This actually gives you the following dataframe:
keys values
0 city [chicago, london, berlin]
1 temp [[32, 30, 28], [39, 40, 25], [33, 34, 35]]
You probably meant:
df = pd.DataFrame(dict(zip(keys, values)))
Which gives you:
city temp
0 chicago [32, 30, 28]
1 london [39, 40, 25]
2 berlin [33, 34, 35]
You can then use explode:
print(df.explode('temp'))
Output:
city temp
0 chicago 32
0 chicago 30
0 chicago 28
1 london 39
1 london 40
1 london 25
2 berlin 33
2 berlin 34
2 berlin 35
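As a side note, once the DataFrame is built correctly, the set_index/apply approach from the question should also work (a sketch, assuming pandas >= 0.25, where Series.explode was introduced):

df2 = df.set_index('city').apply(pd.Series.explode).reset_index()
# Every remaining column (here just 'temp') is exploded,
# and the repeated 'city' index is restored as a column.
print(df2)

On pandas >= 1.1 you can also pass ignore_index=True, i.e. df.explode('temp', ignore_index=True), to get a fresh 0..n-1 index instead of the repeated one.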

Related

how to get cell value of a pd data frame [duplicate]

Let's say we have a pandas dataframe:
name age sal
0 Alex 20 100
1 Jane 15 200
2 John 25 300
3 Lsd 23 392
4 Mari 21 380
Let's say a few rows are now deleted and we don't know which indexes were deleted. For example, we delete row index 1 using df.drop([1]). And now the data frame comes down to this:
name age sal
0 Alex 20 100
2 John 25 300
3 Lsd 23 392
4 Mari 21 380
I would like to get the value from row index 3 and column "age". It should return 23. How do I do that?
df.iloc[3, df.columns.get_loc('age')] does not work because it returns 21. I guess iloc uses the positional row index?
Use .loc to get rows by label and .iloc to get rows by position:
>>> df.loc[3, 'age']
23
>>> df.iloc[2, df.columns.get_loc('age')]
23
More about Indexing and selecting data
import pandas as pd

dataset = {'name': ['Alex', 'Jane', 'John', 'Lsd', 'Mari'],
           'age': [20, 15, 25, 23, 21],
           'sal': [100, 200, 300, 392, 380]}
df = pd.DataFrame(dataset)
df.drop([1], inplace=True)
df.loc[3, ['age']]
Try this one, using [label, column name]:
value = df.loc[1, "column_name"]
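Putting it together, a minimal runnable sketch of both lookups (using the example frame from the question):

import pandas as pd

df = pd.DataFrame({'name': ['Alex', 'Jane', 'John', 'Lsd', 'Mari'],
                   'age': [20, 15, 25, 23, 21],
                   'sal': [100, 200, 300, 392, 380]})
df = df.drop([1])        # row labels are now 0, 2, 3, 4
print(df.loc[3, 'age'])  # label-based lookup -> 23
print(df.iloc[2, df.columns.get_loc('age')])  # position 2 holds label 3 -> 23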

Pandas Decile Rank

I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each ranking? Basically, how do I know what numbers fall in the range of rank 1 or 2 or 3, etc.?
I hope the following Python code with two short examples helps. The second example uses the isin method.
import pandas as pd

data = {'Name': ['Mike', 'Anton', 'Simon', 'Amy',
                 'Claudia', 'Peter', 'David', 'Tom'],
        'Score': [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(data, columns=['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10, labels=False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
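To get the actual bounds of each decile, which the question asks about, qcut can also return the bin edges via retbins=True (a sketch using the same data):

df['decile_rank'], bins = pd.qcut(df['Score'], 10, labels=False, retbins=True)
print(bins)  # 11 edges delimiting the 10 deciles; bins[i] and bins[i+1] bound rank i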

Selecting the higher of two data

I'm working with Python Pandas trying to sort some student testing data. On occasion, students will test twice during the same testing window, and I want to keep only the higher of the two tests. Here's an example of my dataset.
Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60
Any ideas on how I can keep only the higher score for each student?
If your data is in the dataframe df, you can sort by the score in descending order and drop duplicate names, keeping the first:
df.sort_values(by='Score', ascending=False).drop_duplicates(subset='Name', keep='first')
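Which, for the sample data, should give:
Name Score
2 John 89
1 Alice 75
4 Mark 70
5 Amy 60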
You can do this with groupby. It works like this:
df.groupby('Name').agg({'Score': 'max'})
It results in:
Score
Name
Alice 75
Amy 60
John 89
Mark 70
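If you want Name back as a regular column instead of the index, as_index=False keeps the grouping key as a column (a small sketch):
df.groupby('Name', as_index=False)['Score'].max()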
By the way, in this particular setup you could also use drop_duplicates to make the names unique after sorting on the score. This yields the same result, but is less extensible (e.g. if you later want to add the average score as well). It would look like this:
df.sort_values(['Name', 'Score']).drop_duplicates(['Name'], keep='last')
From the test data you posted:
import pandas as pd
from io import StringIO
sio = StringIO("""Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60""")
df = pd.read_csv(sio, sep=r'\s+')
There are multiple ways to do that, two of them are:
In [8]: df = pd.DataFrame({"Score" : [32, 75, 89, 40, 70, 60],
...: "Name" : ["Alice", "Alice", "John", "Mark", "Mark", "Amy"]})
...: df
Out[8]:
Score Name
0 32 Alice
1 75 Alice
2 89 John
3 40 Mark
4 70 Mark
5 60 Amy
In [13]: %time df.groupby("Name").max()
CPU times: user 2.26 ms, sys: 286 µs, total: 2.54 ms
Wall time: 2.11 ms
Out[13]:
Score
Name
Alice 75
Amy 60
John 89
Mark 70
In [14]: %time df.sort_values("Name").drop_duplicates(subset="Name", keep="last")
CPU times: user 2.25 ms, sys: 0 ns, total: 2.25 ms
Wall time: 1.89 ms
Out[14]:
Score Name
1 75 Alice
5 60 Amy
2 89 John
4 70 Mark
You can concatenate two pandas DataFrames and then take the maximum in each row. Here df1 and df2 hold the students' scores:
import pandas as pd

df1 = pd.DataFrame({'Alice': 3, 'John': 8, 'Mark': 7.5, 'Amy': 0}, index=[0])
df2 = pd.DataFrame({'Alice': 7, 'Mark': 7}, index=[0])

result = pd.concat([df1, df2], sort=True)
result = result.T
result["maxvalue"] = result.max(axis=1)
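For the sample scores above, the resulting maxvalue column should be:
print(result['maxvalue'])
Alice 7.0
Amy 0.0
John 8.0
Mark 7.5
Name: maxvalue, dtype: float64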

How do I make a Dataframe of columns and unique values stacked?

I have a large data frame that I would like to develop a summation table from. In other words, column 1 would be the columns of the first data frame, column 2 would be each unique value of each column, and columns three onward would be summations of different variables I choose. Like the below:
Variable Level Summed_Column
Here is some sample code:
data = {"name": ['bob', 'john', 'mary', 'timmy']
, "age": [32, 32, 29, 28]
, "location": ['philly', 'philly', 'philly', 'ny']
, "amt": [100, 2000, 300, 40]}
df = pd.DataFrame(data)
df.head()
So the output in the above example would be as follows:
Variable Level Summed_Column
name bob 100
name john 2000
name mary 300
name timmy 40
age 32 2100
age 29 300
age 28 40
location philly 2400
location ny 40
I'm not even sure where to start. The actual dataframe has 32 columns, of which 4 will be summed and 28 put into the Variable and Level format.
You don't need a loop and concatenation for this; you can do it in one go by combining melt with groupby and the agg method:
final = (df.melt(value_vars=['name', 'age', 'location'], id_vars='amt')
           .groupby(['variable', 'value'])
           .agg({'amt': 'sum'})
           .reset_index())
Which yields:
print(final)
variable value amt
0 age 28 40
1 age 29 300
2 age 32 2100
3 location ny 40
4 location philly 2400
5 name bob 100
6 name john 2000
7 name mary 300
8 name timmy 40
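If you also want the exact headers from your desired output, a rename at the end should do it (a sketch using the names from your example):

final = final.rename(columns={'variable': 'Variable',
                              'value': 'Level',
                              'amt': 'Summed_Column'})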
OK @Datanovice, I figured out how to do this using a for loop with pd.melt:
id_cols = ['name', 'age', 'location']  # renamed from `id` to avoid shadowing the builtin
final = pd.DataFrame(columns=['variable', 'value', 'amt'])
for i in id_cols:
    table = df.groupby(i).agg({'amt': 'sum'}).reset_index()
    table2 = pd.melt(table, value_vars=i, id_vars=['amt'])
    final = pd.concat([final, table2])
print(final)

Return subset of DataFrame in Python Pandas

I have the following DataFrame:
import pandas as pd
# create simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
'Location': ["New York", "Paris", "Berlin", "London"],
'Age': [24, 13, 53, 33]
}
data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
#display(data_pandas)
data_pandas
What is returned is the following DF:
Age Location Name
0 24 New York John
1 13 Paris Anna
2 53 Berlin Peter
3 33 London Linda
I then do this:
olderThan30 = data_pandas[data_pandas > 30]
olderThan30
And it returns the following:
Age Location Name
0 NaN New York John
1 NaN Paris Anna
2 53.0 Berlin Peter
3 33.0 London Linda
What I would like to return is only those that have the Age column greater than 30. Something like this:
Age Location Name
2 53.0 Berlin Peter
3 33.0 London Linda
How do I do that?
You need to pass the appropriate boolean condition as a mask:
In [104]:
data_pandas[data_pandas['Age'] > 30]
Out[104]:
Age Location Name
2 53 Berlin Peter
3 33 London Linda
What you did was compare the entire df:
In [105]:
data_pandas > 30
Out[105]:
Age Location Name
0 False True True
1 False True True
2 True True True
3 True True True
This masks the cells across the entire df, which is why you get NaN in the first two rows of Age.
Whereas masking just the column of interest:
In [106]:
data_pandas['Age'] > 30
Out[106]:
0 False
1 False
2 True
3 True
Name: Age, dtype: bool
when passed as a mask to a df, filters the rows.
As @JonClements suggested, you may feel more comfortable using query:
In [110]:
data_pandas.query('Age > 30')
Out[110]:
Age Location Name
2 53 Berlin Peter
3 33 London Linda
This depends on the numexpr library, but in my experience it is normally installed alongside pandas.
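As a side note, the same boolean masking extends to multiple conditions; a sketch combining two filters (the parentheses around each condition are required):

# & is element-wise AND; | would be element-wise OR
data_pandas[(data_pandas['Age'] > 30) & (data_pandas['Location'] == 'Berlin')]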