I have a large DataFrame from which I would like to build a summary table. Column 1 would hold the column names of the original DataFrame, column 2 each unique value of those columns, and columns 3 onward the sums of whichever variables I choose, like the below:
Variable Level Summed_Column
Here is some sample code:
import pandas as pd

data = {"name": ['bob', 'john', 'mary', 'timmy'],
        "age": [32, 32, 29, 28],
        "location": ['philly', 'philly', 'philly', 'ny'],
        "amt": [100, 2000, 300, 40]}
df = pd.DataFrame(data)
df.head()
So the output in the above example would be as follows:
Variable Level Summed_Column
name      bob     100
name      john    2000
name      mary    300
name      timmy   40
age       32      2100
age       29      300
age       28      40
location  philly  2400
location  ny      40
I'm not even sure where to start. The actual DataFrame has 32 columns, of which 4 will be summed and 28 put into the Variable/Level format.
You don't need a loop and concatenation for this; you can do it in one go by combining melt with groupby and using the agg method:
final = (df.melt(value_vars=['name', 'age', 'location'], id_vars='amt')
           .groupby(['variable', 'value'])
           .agg({'amt': 'sum'})
           .reset_index())
Which yields:
print(final)
variable value amt
0 age 28 40
1 age 29 300
2 age 32 2100
3 location ny 40
4 location philly 2400
5 name bob 100
6 name john 2000
7 name mary 300
8 name timmy 40
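Since the question mentions 4 columns to sum and 28 to melt, the same one-liner scales by passing lists to both `id_vars` and the aggregation. A sketch with a hypothetical second sum column `qty` (not in the original data):

```python
import pandas as pd

df = pd.DataFrame({"name": ['bob', 'john', 'mary', 'timmy'],
                   "age": [32, 32, 29, 28],
                   "location": ['philly', 'philly', 'philly', 'ny'],
                   "amt": [100, 2000, 300, 40],
                   "qty": [1, 2, 3, 4]})   # hypothetical extra column to sum

sum_cols = ["amt", "qty"]                                # the 4 sum columns in the real data
var_cols = [c for c in df.columns if c not in sum_cols]  # the other 28

final = (df.melt(id_vars=sum_cols, value_vars=var_cols)
           .groupby(['variable', 'value'])[sum_cols].sum()
           .reset_index())
print(final)
```

Every column named in `sum_cols` rides along through the melt as an id variable and gets summed per (variable, value) pair.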
OK @Datanovice, I figured out how to do this using a for loop with pd.melt:
cols = ['name', 'age', 'location']   # renamed from `id` to avoid shadowing the built-in
final = pd.DataFrame(columns=['variable', 'value', 'amt'])
for col in cols:
    table = df.groupby(col).agg({'amt': 'sum'}).reset_index()
    table2 = pd.melt(table, value_vars=col, id_vars=['amt'])
    final = pd.concat([final, table2])
print(final)
Related
Let's say we have a pandas dataframe:
name age sal
0 Alex 20 100
1 Jane 15 200
2 John 25 300
3 Lsd 23 392
4 Mari 21 380
Let's say, a few rows are now deleted and we don't know the indexes that have been deleted. For example, we delete row index 1 using df.drop([1]). And now the data frame comes down to this:
name age sal
0 Alex 20 100
2 John 25 300
3 Lsd 23 392
4 Mari 21 380
I would like to get the value from row index 3 and column "age". It should return 23. How do I do that?
df.iloc[3, df.columns.get_loc('age')] does not work because it returns 21. I guess iloc uses the consecutive positional row index?
Use .loc to get rows by label and .iloc to get rows by position:
>>> df.loc[3, 'age']
23
>>> df.iloc[2, df.columns.get_loc('age')]
23
More about Indexing and selecting data
import pandas as pd

dataset = {'name': ['Alex', 'Jane', 'John', 'Lsd', 'Mari'],
           'age': [20, 15, 25, 23, 21],
           'sal': [100, 200, 300, 392, 380]}
df = pd.DataFrame(dataset)
df.drop([1], inplace=True)
df.loc[3, ['age']]
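If you want positional and label access to line up again after dropping rows, `reset_index(drop=True)` renumbers the index 0..n-1; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alex', 'Jane', 'John', 'Lsd', 'Mari'],
                   'age': [20, 15, 25, 23, 21],
                   'sal': [100, 200, 300, 392, 380]})
df.drop([1], inplace=True)

# After the drop, label 3 sits at position 2, so loc and iloc disagree.
assert df.loc[3, 'age'] == 23

# reset_index(drop=True) discards the old labels and renumbers 0..n-1,
# so label-based and position-based access agree again.
df = df.reset_index(drop=True)
assert df.loc[2, 'age'] == 23
assert df.iloc[2, df.columns.get_loc('age')] == 23
```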
try this one, using .loc[label, column_name]:
value = df.loc[1, "column_name"]
I have not used pandas explode before. I get the gist of pd.explode, but for value lists where selected columns have nested lists I heard that pd.Series.explode is useful. However, I keep getting: "KeyError: "None of ['city'] are in the columns"", yet 'city' is defined in the keys:
keys = ["city", "temp"]
values = [["chicago","london","berlin"], [[32,30,28],[39,40,25],[33,34,35]]]
df = pd.DataFrame({"keys":keys,"values":values})
df2 = df.set_index(['city']).apply(pd.Series.explode).reset_index()
desired output is:
city / temp
chicago / 32
chicago / 30
chicago / 28
etc.
I would appreciate an expert weighing in as to why this throws an error, and a fix, thank you.
The problem comes from how you define df:
df = pd.DataFrame({"keys":keys,"values":values})
This actually gives you the following dataframe:
keys values
0 city [chicago, london, berlin]
1 temp [[32, 30, 28], [39, 40, 25], [33, 34, 35]]
You probably meant:
df = pd.DataFrame(dict(zip(keys, values)))
Which gives you:
city temp
0 chicago [32, 30, 28]
1 london [39, 40, 25]
2 berlin [33, 34, 35]
You can then use explode:
print(df.explode('temp'))
Output:
city temp
0 chicago 32
0 chicago 30
0 chicago 28
1 london 39
1 london 40
1 london 25
2 berlin 33
2 berlin 34
2 berlin 35
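With the corrected constructor, the `set_index` + `apply(pd.Series.explode)` pattern the question attempted also works; it is handy when several columns hold equal-length lists. A sketch:

```python
import pandas as pd

keys = ["city", "temp"]
values = [["chicago", "london", "berlin"],
          [[32, 30, 28], [39, 40, 25], [33, 34, 35]]]
df = pd.DataFrame(dict(zip(keys, values)))

# Every non-index column holding lists is exploded row-wise;
# here that is just 'temp', but the pattern scales to many list columns.
df2 = df.set_index('city').apply(pd.Series.explode).reset_index()
print(df2)
```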
I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each rank? Basically, how do I know what numbers fall in the range of rank 1 or 2 or 3, etc.?
I hope the following Python code with two short examples can help you. The second example uses the isin method.
import pandas as pd

df = pd.DataFrame({'Name': ['Mike', 'Anton', 'Simon', 'Amy',
                            'Claudia', 'Peter', 'David', 'Tom'],
                   'Score': [42, 63, 75, 97, 61, 30, 80, 13]})
df['decile_rank'] = pd.qcut(df['Score'], 10, labels=False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
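To answer the part about the bounds: `pd.qcut` accepts `retbins=True`, which additionally returns the bin edges, so you can see exactly which score range maps to each decile. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mike', 'Anton', 'Simon', 'Amy',
                            'Claudia', 'Peter', 'David', 'Tom'],
                   'Score': [42, 63, 75, 97, 61, 30, 80, 13]})

# retbins=True returns the 11 edges bounding the 10 deciles
df['decile_rank'], edges = pd.qcut(df['Score'], 10, labels=False, retbins=True)
print(edges)
```

Decile k covers the interval from edges[k] to edges[k+1].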
I have a dataframe that collects readings from a device. Sometimes there are multiple readings for the same sample, and those are stored under separate IDs in my dataframe. Is there a way to detect the duplicated IDs by using the columns that have the same values?
Sample dataframe:
test_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                        'Age': [18, 18, 19, 19, 20, 21],
                        'Sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
                        'Values': [1200, 200, 300, 400, 500, 600]})
I want the result to return IDs 1, 2, 3 and 4, since they are duplicated when we compare the Age and Sex column values.
Expected Output:
ID Age Sex Values
1 18 Male 1200
2 18 Male 200
3 19 Female 300
4 19 Female 400
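One way to flag every row whose (Age, Sex) pair appears more than once is `DataFrame.duplicated` with `keep=False`, which marks all members of a duplicate group rather than only the repeats; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Age': [18, 18, 19, 19, 20, 21],
                   'Sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
                   'Values': [1200, 200, 300, 400, 500, 600]})

# keep=False marks every member of a duplicated (Age, Sex) group,
# so IDs 1-4 all survive the filter.
dupes = df[df.duplicated(subset=['Age', 'Sex'], keep=False)]
print(dupes)
```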
I have a dataframe as below:
import pandas as pd

raw_data = {'age': [20, 20, 20, 22, 21],
            'favorite_color': ['blue', 'blue', 'blue', 'yellow', 'green'],
            'grade': [92, "", 92, 95, 70],
            'key': ['a', 'b', 'Total', 'a', 'b']}
df = pd.DataFrame(raw_data)
df
age favorite_color grade key
20 blue 92 a
20 blue b
20 blue 92 Total
22 yellow 95 a
21 green 70 b
For equal values of "age" and "favorite_color": if the "grade" for the "Total" value of "key" equals the sum of "grade" over the non-Total values of "key", replace the grade for "Total" with 0.
So my output dataframe should look like the below:
age favorite_color grade key
20 blue 92 a
20 blue b
20 blue 0 Total
22 yellow 95 a
21 green 70 b
Here is my solution:
First convert non-blank grades to ints:
df.grade = df.grade.astype(int, errors='ignore')
Make a function to check whether the non-Total sum equals the Total value:
def fixer(x):
    if x[x.key != 'Total']['grade'].sum() == x[x.key == 'Total']['grade'].sum():
        x.loc[x.key == 'Total', 'grade'] = 0
    return x
Profit!
df[df.grade.apply(lambda x: isinstance(x, int))].groupby(
['age', 'favorite_color']).apply(fixer)
age favorite_color grade key
0 20 blue 92 a
2 20 blue 0 Total
3 22 yellow 95 a
4 21 green 70 b
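The same fix can also be written without `apply`, using `groupby(...).transform('sum')` to compare the Total and non-Total sums within each group; a vectorized sketch that coerces the blank grade to 0 only for the comparison:

```python
import pandas as pd

raw_data = {'age': [20, 20, 20, 22, 21],
            'favorite_color': ['blue', 'blue', 'blue', 'yellow', 'green'],
            'grade': [92, "", 92, 95, 70],
            'key': ['a', 'b', 'Total', 'a', 'b']}
df = pd.DataFrame(raw_data)

# Numeric view of grade with blanks treated as 0 (original column untouched)
g = pd.to_numeric(df['grade'], errors='coerce').fillna(0)
keys = [df['age'], df['favorite_color']]

# Per-group sums of the Total rows and of the non-Total rows
total_sum = g.where(df['key'] == 'Total', 0).groupby(keys).transform('sum')
other_sum = g.where(df['key'] != 'Total', 0).groupby(keys).transform('sum')

# Zero out the Total grade where the two sums match
df.loc[(df['key'] == 'Total') & (total_sum == other_sum), 'grade'] = 0
print(df)
```

Unlike the groupby/apply version, this keeps the blank-grade row in the output.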