How to assign one value to a column based on the customer ID - data-science

How do I assign a single value to another column (Monthly_Inhand_Salary) based on the customer ID?
For example:
If a customer has the customer ID "CUS_0xd40", then only that customer's value should change in the target column (Monthly_Inhand_Salary).
I used this code:
a = []
b = []
sum = 0
for i in credit_score_df['Customer_ID'].unique().tolist():
    a.append(i)
    # print(i)
    if i == a[sum]:
        print(i)
        # print(credit_score_df.head())
        credit_score_df['Monthly_Inhand_Salary'] = credit_score_df['Monthly_Inhand_Salary'].fillna(credit_score_df['Monthly_Inhand_Salary'][sum])
        sum = sum + 1
credit_score_df.head(20)
Customer id    Monthly_Inhand_Salary
CUS_0xd40      1824.843333
CUS_0xd40      NaN
CUS_0xd40      NaN
CUS_0xd50      1354.265756
CUS_0xd50      NaN
CUS_0xd60      3541.5675
CUS_0xd70      6434.5541
Expected output:
Customer id    Monthly_Inhand_Salary
CUS_0xd40      1824.843333
CUS_0xd40      1824.843333
CUS_0xd40      1824.843333
CUS_0xd50      1354.265756
CUS_0xd50      1354.265756
CUS_0xd60      3541.5675
CUS_0xd70      6434.5541
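One way to get that result without the manual loop, as a minimal sketch assuming every customer has at least one non-missing salary row, is to fill the gaps group-wise with groupby and transform (the small frame below is built from the sample rows above):

import pandas as pd

# Sample rows from the question.
credit_score_df = pd.DataFrame({
    'Customer_ID': ['CUS_0xd40', 'CUS_0xd40', 'CUS_0xd40', 'CUS_0xd50',
                    'CUS_0xd50', 'CUS_0xd60', 'CUS_0xd70'],
    'Monthly_Inhand_Salary': [1824.843333, None, None, 1354.265756,
                              None, 3541.5675, 6434.5541],
})

# Fill each missing salary with the first non-missing salary seen for that customer.
credit_score_df['Monthly_Inhand_Salary'] = credit_score_df['Monthly_Inhand_Salary'].fillna(
    credit_score_df.groupby('Customer_ID')['Monthly_Inhand_Salary'].transform('first')
)
print(credit_score_df)

This avoids the a, b and sum bookkeeping variables entirely.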

Related

Expanding group by window to count nunique

I have the following df:
df=pd.DataFrame(data={'month':[1]*4+[2]*4+[3]*4,'customer':[1,2,3,4,1,5,6,7,2,3,10,7]})
I want to create an expanding window to count the number of unique customers at any point.
The output for this df should be:
{1: 4, 2: 7, 3: 8}
because in the first month we had 4 different customers, in the second month 3 more were added (the fourth one had already appeared in the first month), and in the last month only one was added (number 10).
Thanks
You can first drop the duplicated customers (only keep the first ones that appeared) and then cumulatively sum the number of (now unique) customers per month:
counts = df.drop_duplicates("customer").groupby("month").size().cumsum().to_dict()
to get
>>> counts
{1: 4, 2: 7, 3: 8}
Since there are repeated customers, you can drop those repeated customers using
df.drop_duplicates(subset='customer',ignore_index=True,inplace=True)
By default it will keep the first occurrence of each customer number and drop the subsequent occurrences. To count the number of unique customers each month,
df['customer'] = df.groupby('month')['customer'].transform('count')
df = df.drop_duplicates(ignore_index=True)
To roll the window over the customer column, calculate the cumulative sum of that column:
df['customer'] = df['customer'].cumsum()
This gives the desired output:
month  customer
    1         4
    2         7
    3         8
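Put together as a runnable sketch (reusing the df from the question), the second approach looks like this:

import pandas as pd

df = pd.DataFrame(data={'month': [1]*4 + [2]*4 + [3]*4,
                        'customer': [1, 2, 3, 4, 1, 5, 6, 7, 2, 3, 10, 7]})

# Keep only the first occurrence of every customer.
df = df.drop_duplicates(subset='customer', ignore_index=True)

# Count the (now unique) new customers per month, one row per month.
df['customer'] = df.groupby('month')['customer'].transform('count')
df = df.drop_duplicates(ignore_index=True)

# Expanding total: cumulative sum over the monthly counts.
df['customer'] = df['customer'].cumsum()
print(df)   # month/customer: 1 -> 4, 2 -> 7, 3 -> 8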

Counting unique values in a column using a sub-id

I have a df containing sub-trajectories (segments) of users, with the mode of travel indicated by 0, 1, 2, ..., which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id is for the segments, but a full trajectory may be covered by different modes (i.e. contain multiple segments).
So the first 4 digits of id identify the unique trajectory, and the last 3 digits the unique segment within that trajectory.
I know that I can count the number of unique segments in the df using:
df.groupby('id')['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use indexing to get the first 4 characters with str; if necessary, first convert the values to strings with Series.astype:
df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
If you need to process the values after the first 4 digits of the id:
s = df['id'].astype(str)
df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
Another idea is to use a lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()
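As a self-contained sketch, hard-coding the sample rows from the question (only id and mode matter here), the first variant runs like this:

import pandas as pd

df = pd.DataFrame({
    'id':   [5138001, 5138001, 5138001, 5138003, 5138003,
             5145001, 5145001, 5145015, 5145015, 5145015],
    'mode': [0, 0, 0, 1, 1, 0, 0, 2, 2, 2],
})

# Group by the first four characters of id (the trajectory part) and count
# the distinct modes within each trajectory.
out = (df.groupby(df['id'].astype(str).str[:4])['mode']
         .nunique()
         .reset_index(name='count'))
print(out)   # 5138 -> 2, 5145 -> 2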

Groupby two columns in pandas, and perform operations over totals for each group

The code below:
df = pd.read_csv('./filename.csv', header='infer').dropna()
df.groupby(['category_code','event_type']).event_type.count().head(20)
Returns the following table:
How can I obtain, for all the sub groups under event_type that have both "purchase" and "view", the ratio between the total of "purchase" and the total of "view"?
In this specific case, for instance, I need a function that returns:
1/57
1/232
3/249
Eventually, I will need to plot such result.
I have been trying for a day, without success. I am still new to pandas, and I searched across every possible forum without finding anything useful.
Next time, please consider adding a sample of your data as text instead of as an image. It makes testing easier.
Anyway, in your case you can combine different dataframe methods, such as groupby, as you have already done, and pivot_table. I used this data just as an example:
category_code event_type
0 A purchase
1 A view
2 B view
3 B view
4 C view
5 D purchase
6 D view
7 D view
You can create a new column from your groupby
df['event_count'] = df.groupby(['category_code', 'event_type'])\
                      .event_type.transform('count')
Then create a pivot_table
my_table = df.pivot_table(values='event_count',
                          index='category_code',
                          columns='event_type',
                          fill_value=0)
Then, finally, you can calculate the purchase_ratio directly:
my_table['purchase_ratio'] = my_table['purchase'] / my_table['view']
Which results in the following DataFrame:
event_type     purchase  view  purchase_ratio
category_code
A                     1     1             1.0
B                     0     2             0.0
C                     0     1             0.0
D                     1     2             0.5
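Putting those steps together as a runnable sketch with the example data above (swap in your own pd.read_csv call as in the question):

import pandas as pd

df = pd.DataFrame({
    'category_code': ['A', 'A', 'B', 'B', 'C', 'D', 'D', 'D'],
    'event_type':    ['purchase', 'view', 'view', 'view', 'view',
                      'purchase', 'view', 'view'],
})

# Attach the per-(category, event type) count to every row.
df['event_count'] = (df.groupby(['category_code', 'event_type'])
                       .event_type.transform('count'))

# Reshape so purchase and view counts sit side by side per category.
my_table = df.pivot_table(values='event_count',
                          index='category_code',
                          columns='event_type',
                          fill_value=0)

# Ratio of purchases to views per category.
my_table['purchase_ratio'] = my_table['purchase'] / my_table['view']
print(my_table)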

Getting max value of counts of unique values in pandas dataframe

I have a dataframe of tweet data that is originally like this:
lang long lat hashtag country
1 it -69.940500 18.486700 DaddyYankeeAlertaRoja DO
2 it -69.940500 18.486700 QueremosConciertoDeAURA DO
3 it -69.940500 18.486700 LoQueDiceLaFoto DO
4 sv 14.167014 56.041735 escSE S
I have converted it into count information sorted by country and hashtag via:
d = pd.DataFrame({'count' : all_tweets.groupby(['country', 'hashtag']).size()}).reset_index()
d=
country hashtag count
0 A 100DaysofJapaneseLettering 3
1 A 100happydays 1
2 A 10cities1backpack 2
3 A 12points 6
... ... ... ...
848857 ZW reflections 1
848858 ZW saveKBD 1
848859 ZW sekuru 1
848860 ZW selfie 2
I ultimately want to plot the top hashtag per country. How do I take the max count for each country in the df and plot it?
I realized this question was a bit of a duplicate to Extract row with maximum value in a group pandas dataframe.
I extracted the most popular hashtag with this command:
max = d.iloc[d.groupby(['country']).apply(lambda x: x['count'].idxmax())]
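To actually plot it, one option is a simple bar chart of those extracted rows; a sketch assuming matplotlib is available and that d is the counts frame built above (using .loc and a name that does not shadow the built-in max):

import matplotlib.pyplot as plt

# Row with the highest count for each country.
top = d.loc[d.groupby('country')['count'].idxmax()]

# One bar per country, height = count of its most popular hashtag.
plt.figure(figsize=(10, 4))
plt.bar(top['country'], top['count'])
plt.ylabel('count of top hashtag')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()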

dimple.js aggregating non-numerical values

Say I have a CSV containing exactly one column, Name; here are the first few values:
Name
A
B
B
C
C
C
D
Now I wish to create a bar plot using dimple.js that displays the distribution of Name (i.e. the count of each distinct Name), but I can't find a way to do that in dimple without having the actual counts of each name. I looked up the documentation of chart.addMeasureAxis (https://github.com/PMSI-AlignAlytics/dimple/wiki/dimple.chart#addMeasureAxis) and found that for the measure argument:
measure (Required): This should be a field name from the data. If the field contains numerical values they will be aggregated, if it contains any non-numeric values the distinct count of values will be used.
So basically, what that says is that it doesn't matter how many times each category occurred; dimple is just going to treat each of them as if it occurred once?
I tried chart.addMeasureAxis('y', 'Name') and the result was a bar plot with every bar at a value of 1 (i.e. as if each name occurred only once).
Is there a way to achieve this without changing the data set?
You do need something else in your data which distinguishes the rows:
e.g.
Type Name
X A
Y A
Z A
X B
Y B
A C
You could do it with:
chart.addCategoryAxis("x", "Name");
chart.addMeasureAxis("y", "Type");
or add a count in:
Name Count
A 1
A 1
A 1
B 1
B 1
C 1
Then it's just
chart.addCategoryAxis("x", "Name");
chart.addMeasureAxis("y", "Count");
But there isn't a way to do it with just the name.
Hope this helps. First use d3.nest() to get the unique values of the dataset, then add the count of each distinct value to the newly created object.
var countData = d3.nest()
    .key(function (d) {
        return d["Name"];
    })
    .entries(yourData);

countData.forEach(function (c) {
    c.count = c.values.length;
});
After this, create the dimple chart from the new countData object:
var myChart = new dimple.chart(svg, countData);
Then for the 'x' and 'y' axes:
myChart.addCategoryAxis("x", "key");
myChart.addMeasureAxis("y", "count");
Please look into d3.nest() to understand the countData object.