I have a data source like this:
ID Product Age
1 Amazon 18
2 Google 19
3 Facebook 20
4 Apple 21
5 Apple 22
6 Google 23
7 Amazon 25
8 Google 25
9 Facebook 27
10 Apple 29
11 Apple 28
12 Google 31
13 Amazon 32
14 Google 33
15 Facebook 34
And I want to create custom age segments like this (My visualization should look like this):
                   Product
Custom Age Group   Amazon   Facebook   Google   Apple
18-21              1        1          1        1
22-25              1        0          2        1
26-30              0        1          0        2
31-34              1        1          2        0
The user should be able to dynamically set the age groups.
For example, if the user wants to look at 18-25 and 28-34 age groups, the user should be able to make the changes on their own. Is this possible?
Maybe not the most elegant solution, but you could create an inline table that maps each age to its group, like this:
LOAD * INLINE [
Age, AgeGroup
18, 18-21
19, 18-21
20, 18-21
21, 18-21
22, 22-25
23, 22-25
...
];
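The lookup-table idea above (one row per age, mapped to a group label) can be sketched outside the load script as well. The snippet below is plain Python, not Qlik script, and the helper names `build_age_lookup` and `count_by_group` are made up for illustration. It only shows that regenerating the Age-to-AgeGroup mapping from a user-supplied list of ranges is enough to change the segments:

```python
from collections import Counter

# Rows from the question: (ID, Product, Age)
rows = [
    (1, "Amazon", 18), (2, "Google", 19), (3, "Facebook", 20),
    (4, "Apple", 21), (5, "Apple", 22), (6, "Google", 23),
    (7, "Amazon", 25), (8, "Google", 25), (9, "Facebook", 27),
    (10, "Apple", 29), (11, "Apple", 28), (12, "Google", 31),
    (13, "Amazon", 32), (14, "Google", 33), (15, "Facebook", 34),
]


def build_age_lookup(ranges):
    """Expand (low, high) ranges into an Age -> 'low-high' mapping,
    one entry per age, like the rows of the inline table."""
    return {
        age: f"{low}-{high}"
        for low, high in ranges
        for age in range(low, high + 1)
    }


def count_by_group(rows, ranges):
    """Count rows per (age group, product); ages outside every range are skipped."""
    lookup = build_age_lookup(ranges)
    counts = Counter()
    for _id, product, age in rows:
        group = lookup.get(age)
        if group is not None:
            counts[(group, product)] += 1
    return counts


# The segments from the question, then the user-defined ones (18-25 and 28-34).
default_counts = count_by_group(rows, [(18, 21), (22, 25), (26, 30), (31, 34)])
custom_counts = count_by_group(rows, [(18, 25), (28, 34)])
print(default_counts[("22-25", "Google")])  # 2
print(custom_counts[("18-25", "Google")])   # 3
```

With the custom ranges, the row with age 27 falls into no group and is left out of the counts.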
Can anyone help me extract the car model names from the following sample dataframe?
index,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type
0,Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate
1,Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual
2,Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual
3,Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual
I have used this code:
model_name = df['Model'].str.extract(r'(\w+)')
However, I'm unable to get the model names that contain a hyphen or a space, such as WR-V or CR-V.
This is the link to the full dataset: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=car+details+v4.csv
Desired output should be:
index,0
0,Amaze
1,Swift
2,i10
3,Glanza
4,Innova
5,Ciaz
6,CLA
7,X1 xDrive20d
8,Octavia
9,Terrano
10,Elite
11,Kwid
12,Ciaz
13,Harrier
14,Polo
15,Celerio
16,Alto
17,Baleno
18,Wagon
19,Creta
20,S-Presso
21,Vento
22,Santro
23,Venue
24,Alto
25,Ritz
26,Creta
27,Brio
28,Elite
29,WR-V
30,Venue
Please help me!!
The exact logic is unclear, but assuming you want the first word (including special characters) or the first two words if the first word has only one or two characters:
df['Model'].str.extract(r'(\S{3,}|\S{1,2}\s+\S+)', expand=False)
Output:
0 Amaze
1 Swift
2 i10
3 Glanza
4 Innova
5 Ciaz
6 CLA
7 X1 xDrive20d
8 Octavia
9 Terrano
10 Elite
11 Kwid
12 Ciaz
13 Harrier
14 Polo
15 Celerio
16 Alto
17 Baleno
18 Wagon
19 Creta
20 S-Presso
21 Vento
22 Santro
23 Venue
24 Alto
25 Ritz
26 Creta
27 Brio
28 Elite
29 WR-V
... ...
Name: Model, dtype: object
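To see how the two alternatives in that pattern behave, here is a small sketch that uses only the standard `re` module (`str.extract` returns the first match of the pattern, which is what `re.search` gives). The first four strings come from the question; the last four are made-up model strings chosen to exercise the hyphen and short-first-word cases, and `first_model_token` is a hypothetical helper name:

```python
import re

# Same pattern as in the answer: either a first token of 3+ non-space
# characters, or a 1-2 character token followed by the next token.
PATTERN = re.compile(r'(\S{3,}|\S{1,2}\s+\S+)')


def first_model_token(model):
    """Return the first match of the pattern, or None if there is none."""
    match = PATTERN.search(model)
    return match.group(1) if match else None


samples = [
    "Amaze 1.2 VX i-VTEC",    # from the question
    "Swift DZire VDI",        # from the question
    "i10 Magna 1.2 Kappa2",   # from the question
    "Glanza G",               # from the question
    "WR-V S MT Petrol",       # made-up: hyphenated name
    "S-Presso VXI Plus",      # made-up: hyphenated name
    "X1 xDrive20d xLine",     # made-up: two-character first token
    "Wagon R VXi",            # made-up: model name with a space
]

extracted = [first_model_token(s) for s in samples]
for sample, name in zip(samples, extracted):
    print(f"{sample!r} -> {name!r}")
```

Because `\S` matches any non-whitespace character, hyphens are kept, which is what `\w+` was missing. A name such as "Wagon R" is still cut to "Wagon", which matches the desired output listed in the question.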
I want to drop both rows in a pandas DataFrame when the value in one column (account) is not duplicated but the value in another column (recharge_number) is. An illustrative example:
data = {'account': [43,43,43,43,45,45],
'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999] ,
'year': [2021,2021,2021,2021,2020,2020],
'month': [2,3,5,6,2,9]}
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
input data
output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2
output data
Another method is to drop rows instead of keeping them:
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
& df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
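The first condition protects all records whose ('id', 'number') pair is duplicated. The second condition selects all records whose 'number' appears more than once; rows that meet the second condition without being protected by the first are dropped.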
Basically, you want "the full row (or the two columns if larger dataframe) is duplicated" or "number is not duplicated"
You can use duplicated:
df[df[['id', 'number']].duplicated(keep=False) | ~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40
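The answers above use the column names `id` and `number`. As a sketch, the same `duplicated` logic applied to the `account` and `recharge_number` columns from the question would look like this (the variable names `pair_dup`, `number_dup` and `result` are chosen here for illustration):

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'account': [43, 43, 43, 43, 45, 45],
    'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
    'year': [2021, 2021, 2021, 2021, 2020, 2020],
    'month': [2, 3, 5, 6, 2, 9],
})

# True where the (account, recharge_number) pair occurs more than once
pair_dup = df.duplicated(['account', 'recharge_number'], keep=False)

# True where the recharge_number occurs more than once, in any account
number_dup = df.duplicated('recharge_number', keep=False)

# Keep a row if its pair is repeated, or if its recharge_number is unique
result = df[pair_dup | ~number_dup]
print(result)
```

Rows 2 and 5 share recharge_number 17999 across two different accounts, so both are dropped.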
Solution with .crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
I have a df_trg with, say 10 rows numbered 0-9.
I get from various sources values for an additional column foo which contains only a subset of rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with
updating data) and a list of them (sources):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation from adding foo column, initially filled with
an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer than yours and
shorter.
Alternative: Fill foo column initially with NaN, but this time
updating values will be converted to float (side effect of using NaN).
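A loop-free variant of the NaN alternative, as a sketch: concatenate the sources and assign the result as a column, letting pandas align on the index. This assumes the sources do not overlap (each row label appears in at most one source); with overlapping labels the alignment would fail on the duplicate index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]

# Assigning a Series as a column aligns it on the DataFrame index;
# rows not covered by any source (here 5 and 8) end up as NaN.
df['foo'] = pd.concat(sources)
print(df)
```

As noted above, the presence of NaN makes the `foo` column float.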
I am trying to build an auction system but cannot figure out the logical conditions for it.
Let's say that I have 10 credits
$credit
I have already bet 5 credits on another auction... so I owe 5 from 10 $owe
I thus have 5 available... $available = $credit - $owe (=5)
I bet 3 from available (on a different item)...
I now want to raise that bet to 4 (cancel the 3, replace it with 4), but my available credit is now $available - 3 (= 2)
I can't find a logical way to express this in code.
What is the condition for placing a bet?
I made up a matrix showing the dependence between the variables:
bet available owe lastbet
1 10 10 0
2 9 11 1
3 7 13 2
4 4 16 3
5 0 20 4
6 -5 25 5
7 -11 31 6
8 -18 38 7
9 -26 46 8
10 -35 55 9
11 -45 65 10
Need to translate it into a condition statement.... (the next row would not meet the conditions)
The condition should fail on the 11th row....
Based on the Matrix... I found out that the condition is:
if ($bet <= (($owe + $available) / 2)) {}
Not very intuitive......
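As a quick check, the condition can be run against every row of the matrix. The sketch below is a Python translation of the PHP condition (the function name `can_bet` is made up for illustration). It also shows why the condition looks unintuitive: in every row of this particular matrix `owe + available` equals 20, so the test reduces to `bet <= 10`.

```python
# Rows of the matrix: (bet, available, owe, lastbet)
matrix = [
    (1, 10, 10, 0), (2, 9, 11, 1), (3, 7, 13, 2), (4, 4, 16, 3),
    (5, 0, 20, 4), (6, -5, 25, 5), (7, -11, 31, 6), (8, -18, 38, 7),
    (9, -26, 46, 8), (10, -35, 55, 9), (11, -45, 65, 10),
]


def can_bet(bet, available, owe):
    """Python translation of: if ($bet <= (($owe + $available) / 2))"""
    return bet <= (owe + available) / 2


results = [can_bet(bet, available, owe) for bet, available, owe, _ in matrix]
totals = {available + owe for _, available, owe, _ in matrix}

print(results)  # ten True values, then False for the 11th row
print(totals)   # {20}
```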
In this dataframe, column key values correspond to integer notation of each song key.
df
track key
0 Last Resort 4
1 Casimir Pulaski Day 8
2 Glass Eyes 8
3 Ohio - Live At Massey Hall 1971 7
4 Ballad of a Thin Man 11
5 Can You Forgive Her? 11
6 The Only Thing 3
7 Goodbye Baby (Baby Goodbye) 4
8 Heart Of Stone 0
9 Ohio 0
10 the gate 2
11 Clampdown 2
12 Cry, Cry, Cry 4
13 What's Happening Brother 8
14 Stupid Girl 11
15 I Don't Wanna Play House 7
16 Inner City Blues (Make Me Wanna Holler) 11
17 The Lonesome Death of Hattie Carroll 4
18 Paint It, Black - (Original Single Mono Version) 5
19 Let Him Run Wild 11
20 Undercover (Of The Night) - Remastered 5
21 Between the Bars 7
22 Like a Rolling Stone 0
23 Once 2
24 Pale Blue Eyes 5
25 The Way You Make Me Feel - 2012 Remaster 1
26 Jeremy 2
27 The Entertainer 7
28 Pressure 9
29 Play With Fire - Mono Version / Remastered 2002 2
30 D-I-V-O-R-C-E 9
31 Big Shot 0
32 What's Going On 1
33 Folsom Prison Blues - Live 0
34 American Woman 1
35 Cocaine Blues - Live 8
36 Jesus, etc. 5
the notation is as follows:
'C' --> 0
'C#'--> 1
'D' --> 2
'Eb'--> 3
'E' --> 4
'F' --> 5
'F#'--> 6
'G' --> 7
'Ab'--> 8
'A' --> 9
'Bb'--> 10
'B' --> 11
what is specific about this notation is that 11 is closer to 0 than 2, for instance.
GOAL:
given an input_notation = 0, I would like to sort according to closeness to key 0, or 'C'.
you can get closest value by doing:
closest_key = (input_notation -1) % 12
so I would like to sort according to this logic, having on top input_notation values and then closest matches, like so:
8 Heart Of Stone 0
9 Ohio 0
22 Like a Rolling Stone 0
31 Big Shot 0
33 Folsom Prison Blues - Live 0
(...)
I have tried:
v = df[['key']].values
df = df.iloc[np.lexsort(np.abs(v - (input_notation - 1) %12 ).T)]
but this does not work..
any clues?
You can define the closeness first and then use argsort with iloc to sort the data frame:
import numpy as np

input_notation = 0
# define the closeness or distance
diff = (df.key - input_notation).abs()
closeness = np.minimum(diff, 12 - diff)
# use argsort to calculate the sorting index, and iloc to reorder the data frame
closest_to_input = df.iloc[closeness.argsort(kind='mergesort')]
closest_to_input.head()
# track key
#8 Heart Of Stone 0
#9 Ohio 0
#22 Like a Rolling Stone 0
#31 Big Shot 0
#33 Folsom Prison Blues - Live 0
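The wrap-around distance used above can also be illustrated without pandas. This is a minimal sketch on six tracks taken from the question; `circular_distance` is a name chosen here, and `n_keys=12` reflects the twelve keys in the notation:

```python
def circular_distance(key, target, n_keys=12):
    """Distance between two keys on a circle of n_keys positions."""
    diff = abs(key - target) % n_keys
    return min(diff, n_keys - diff)


# (track, key) pairs taken from the question
tracks = [
    ("Last Resort", 4),
    ("Casimir Pulaski Day", 8),
    ("Ballad of a Thin Man", 11),
    ("Heart Of Stone", 0),
    ("the gate", 2),
    ("What's Going On", 1),
]

input_notation = 0

# sorted() is stable, so tracks at the same distance keep their original order
ordered = sorted(tracks, key=lambda t: circular_distance(t[1], input_notation))
for name, key in ordered:
    print(key, name)
```

Keys 11 and 1 are both one step away from 0, and keys 4 and 8 are both four steps away, so each pair is tied and stays in its original relative order, just as `kind='mergesort'` guarantees in the answer.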