Pandas pivot table value counts

I have a dataframe in this format:
Name Score Bin
John 90 80-100
Marc 30 20-40
John 10 0-20
David 20 0-20
...
I want to create a pivot table that looks like this:
Name 0-20 20-40 40-60 60-80 80-100 Total count Avg score
John 1 2 nan nan 2 5 60.53
Marc nan 2 nan nan nan 2 32.13
David 3 2 nan nan nan 5 21.80
So I want to have columns that show count of values for each bucket, as well as total count of values and average score.
I have tried:
table = pd.pivot_table(df, values=['Score', 'Bin'], index=['Name'],
                       aggfunc={'Score': np.average, 'Bin': 'count'},
                       dropna=True, margins=True)
however, I just get an overall count, not one broken down per bucket.

You can do this in 3 steps:
Generate a pivot_table:
df2 = pd.pivot_table(df, index='Name', columns='Bin', values='Score', aggfunc='count')\
    .reindex(columns=['0-20', '20-40', '40-60', '60-80', '80-100'])\
    .rename_axis(columns='')
The result, for your source data extended to give roughly your expected
result, is:
0-20 20-40 40-60 60-80 80-100
Name
David 3.0 2.0 NaN NaN NaN
John 1.0 2.0 NaN NaN 2.0
Marc NaN 2.0 NaN NaN NaN
Note: since NaN is a special case of float, the counts are also of float type.
Generate Total_count and Avg_score:
df3 = df.groupby('Name')\
    .agg(Total_count=('Score', 'count'), Avg_score=('Score', 'mean'))\
    .rename(columns={'Total_count': 'Total count', 'Avg_score': 'Avg score'})
The result is:
Total count Avg score
Name
David 5 21.8
John 5 61.0
Marc 2 32.0
Join the two tables above:
result = df2.join(df3)
The result is:
0-20 20-40 40-60 60-80 80-100 Total count Avg score
Name
David 3.0 2.0 NaN NaN NaN 5 21.8
John 1.0 2.0 NaN NaN 2.0 5 61.0
Marc NaN 2.0 NaN NaN NaN 2 32.0
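If you prefer integer counts with 0 instead of NaN for the empty buckets, one optional tweak (a sketch of mine, not part of the original steps) is to fill before joining:
result = df2.fillna(0).astype(int).join(df3)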

Related

How to split this kind of column in a dataframe?

I have a data frame like the one below:
Name Rating Review Price
1 The park NaN NaN 5040
2 The Westin Good 7.6 NaN 6045
3 Courtyard NaN NaN 4850
4 Radisson Excellent 9.8 NaN 7050
5 Banjara Average 6.7 NaN 5820
6 Mindspace NaN NaN 8000
My required output is like this:
Name Review Rating Price
1 The park NaN NaN 5040
2 The Westin Good 7.6 6045
3 Courtyard NaN NaN 4850
4 Radisson Excellent 9.8 7050
5 Banjara Average 6.7 5820
6 Mindspace NaN NaN 8000
I use this split function:
df[["review","ratings"]] = df["Rating"].str.split(expand=True)
But I got this error: 'Columns must be same length as key'.
How can I split this type of data? Can anyone help me?
The problem is that at least one value contains more than one space, not just one.
You can add n=1 to split only on the first space:
df[["review","ratings"]] = df["Rating"].str.split(expand=True, n=1)
Or use rsplit with n=1 to split on the last space:
df[["review","ratings"]] = df["Rating"].str.rsplit(expand=True, n=1)
Another idea is to use Series.str.extract with a regex to get all values before the space preceding the float:
df[["review","ratings"]] = df["Rating"].str.extract(r'(.*)\s+(\d+\.\d+)')

How to perform nearest neighbor imputation in a time-series dataset?

I have a pandas Series of the form
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 0.0
9 1.0
10 NaN
11 NaN
12 NaN
13 0.0
...
The values can either be 0.0 or 1.0. From my knowledge of the data, however, the 0's come in groups: entries 0-8 should be 0, entries 9-12 should all be 1, and entries from 13 on will be 0. Therefore, the best way to impute the NaNs would be some kind of nearest-neighbor fill. However, it should obviously return a 0 or 1, not an average value. Please let me know of any way to do this!
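One possible approach (a sketch, measuring distance by integer position; the series below reproduces the one in the question): for every position, copy the value of the closest non-NaN neighbor.
import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 8 + [0.0, 1.0, np.nan, np.nan, np.nan, 0.0])

# Distance from every index to every known (non-NaN) index;
# ties resolve to the earlier neighbor
known = s.dropna()
nearest = np.abs(s.index.to_numpy()[:, None] - known.index.to_numpy()[None, :]).argmin(axis=1)
imputed = pd.Series(known.to_numpy()[nearest], index=s.index)
Because values are copied rather than interpolated, the result is always an existing 0.0 or 1.0, never an average. s.interpolate(method='nearest').ffill().bfill() is an alternative, but it requires SciPy and falls back to forward/backward fill at the edges.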

Extract recommendations for user from pivot table

I have the following pivot table with user/item numbers of purchases:
originalName Red t-shirt Black t-shirt ... Orange sweater Pink sweater
customer ...
165 NaN NaN ... NaN NaN
265 NaN 1.0 ... NaN NaN
288 NaN NaN ... NaN NaN
368 1.0 NaN ... NaN 2.0
396 NaN NaN ... 3.0 NaN
I wrote this method, which uses Pearson's correlation to get items related to a given input item:
def get_related_items(name, M, num):
    number_of_orders = []
    for title in M.columns:
        if title == name:
            continue
        cor = pearson(M[name], M[title])
        if np.isnan(cor):
            continue
        else:
            number_of_orders.append((title, cor))
    number_of_orders.sort(key=lambda tup: tup[1], reverse=True)
    return number_of_orders[:num]
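Note that pearson here is the asker's own helper. A minimal stand-in (an assumption, not code from the thread) could be:
def pearson(s1, s2):
    # Pearson correlation between two item columns; pandas skips
    # rows where either side is NaN
    return s1.corr(s2)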
I am not sure what the logic should be to get the list of recommended items for a specific customer.
And how can I evaluate that?
Thanks!
import pandas as pd
import numpy as np

df = pd.DataFrame({'customer': [165, 265, 288, 268, 296],
                   'R_shirt': [np.nan, 1.0, np.nan, 1.0, np.nan],
                   'B_shirt': [np.nan, np.nan, 2.0, np.nan, np.nan],
                   'X_shirt': [5.0, np.nan, 2.0, np.nan, np.nan],
                   'Y_shirt': [3.0, np.nan, 2.0, 3.0, np.nan]})
print(df)
customer R_shirt B_shirt X_shirt Y_shirt
0 165 NaN NaN 5.0 3.0
1 265 1.0 NaN NaN NaN
2 288 NaN 2.0 2.0 2.0
3 268 1.0 NaN NaN 3.0
4 296 NaN NaN NaN NaN
df['customer'] = df['customer'].astype(str)
df = df.pivot_table(columns='customer')
customer = '165'
print(df)
customer 165 265 268 288
B_shirt NaN NaN NaN 2.0
R_shirt NaN 1.0 1.0 NaN
X_shirt 5.0 NaN NaN 2.0
Y_shirt 3.0 NaN 3.0 2.0
best_for_customer = df[customer].dropna().sort_values(ascending=False).to_frame()
print(best_for_customer)
165
X_shirt 5.0
Y_shirt 3.0
The variable customer holds the name of the customer you want to check.
The expected rating for a user for a given item will be the weighted average of the ratings given to the item by other users, where the weights will be the Pearson coefficients you calculated. Then you can pick the item with the highest expected rating and recommend it.
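A sketch of that suggestion, reusing get_related_items from the question (the helper name predict_rating and the item-based reading are my assumptions, with M being the user-by-item pivot table):
import numpy as np

def predict_rating(customer, item, M, num=10):
    # Correlation-weighted average of this customer's ratings
    # of the items most related to `item`
    weighted_sum, weight_total = 0.0, 0.0
    for other, cor in get_related_items(item, M, num):
        rating = M.loc[customer, other]
        if np.isnan(rating):
            continue
        weighted_sum += cor * rating
        weight_total += abs(cor)
    return weighted_sum / weight_total if weight_total else np.nan
You can then score every item the customer has not bought yet and recommend the ones with the highest predicted rating. To evaluate, hold out some known purchases and check how highly the recommender ranks them.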

How do I make separate columns from one column that contains null and multiple values?

I have this file converted from PDF to CSV to train a model. Three columns from the PDF file have merged into one in the CSV: the ProductID, Commodity, and Country.
I was trying to separate these columns with regular expressions, but I am not quite sure how to go about it.
This set of data is what I am dealing with:
country/commodity Unit Quantity Value
1 0011101 BREEDING BULLS (OXEN) NO NaN 75
2 DUBAI NaN NaN 75
3 0011102 BREEDING BULLS (BUFFALO) NO 248 1921
4 SRI LUNKA NaN 248 1921
5 0011103 BUFFALO,BREEDING NO NaN 90
6 SRI LUNKA NaN NaN 90
7 0011104 COWS BREEDING NO 1249 258921665
8 AJMAN NaN NaN NaN
9 CYPRUS NaN NaN NaN
I need this data to be in this format:
0 ProductID Commodity Country Unit Quantity Value
1 0011101 BREEDING BULLS (OXEN) DUBAI NaN NaN 75
3 0011102 BREEDING BULLS (BUFFALO) SRI LUNKA NaN 248 1921
4 0011103 BUFFALO,BREEDING SRI LUNKA NaN NaN 90
7 0011104 COWS BREEDING AJMAN NaN NaN NaN
8 0011104 COWS BREEDING CYPRUS NaN NaN NaN
9 0011104 COWS BREEDING CHINA NaN 590 3290
First we make your columns ProductID, Commodity, Country by extracting the information from the country/commodity column with:
str.split
str.extract
Series.where
Series.mask
str.contains
Then we GroupBy on ProductID to get the information of the corresponding products together and we use named aggregation for this, which is new since pandas 0.25.0:
# Extract information from country/commodity
df['ProductID'] = df['country/commodity'].str.split(' ', n=1).str[0].str.extract(r'(\d+)').ffill()
df['Commodity'] = df['country/commodity'].str.split(r'\d+').str[-1].where(df['Unit'].notna())
df['Country'] = df['country/commodity'].mask(df['country/commodity'].str.contains(r'\d+')).fillna('')
# Groupby ProductID to get information together
df_new = df.groupby(['ProductID']).agg(
    Commodity=('Commodity', 'first'),
    Country=('Country', ', '.join),
    Unit=('Unit', 'first'),
    Quantity=('Quantity', 'first'),
    Value=('Value', 'first')
).reset_index()
# Remove unnecessary commas
df_new['Country'] = df_new['Country'].str.lstrip(', ')
Output
ProductID Commodity Country Unit Quantity \
0 0011101 BREEDING BULLS (OXEN) DUBAI NO NaN
1 0011102 BREEDING BULLS (BUFFALO) SRI LUNKA NO 248.0
2 0011103 BUFFALO,BREEDING SRI LUNKA NO NaN
3 0011104 COWS BREEDING AJMAN, CYPRUS NO 1249.0
Value
0 75.0
1 1921.0
2 90.0
3 258921665.0

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then I created a second dataframe df2 that I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 to Experiment 12 in df2, and the first column is filled with the right values, but all following columns are filled with N/A.
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: I had to use .values.
Assigning df1.iloc[3*x:3*x+3] directly aligns on the index; the slices for x >= 1 carry index labels 3, 4, 5 and up, which don't match df2's index 0, 1, 2, so everything after the first column becomes NaN. Using .values drops the index and assigns by position.
So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output.
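As an aside, if the layout is always 12 experiments of 3 rows each, a loop-free sketch (assuming len(df1) == 36) would be:
df2 = pd.DataFrame(
    df1['SQ'].to_numpy().reshape(12, 3).T,
    columns=['Experiment %d' % (i + 1) for i in range(12)]
)
Reshaping the underlying NumPy array sidesteps index alignment entirely.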