filling column with another column value with condition - pandas

How do I fill the root_cp_id column with the cp_id of the location that doesn't end with a -?
The table I have:
cp_id  location
1998   180
2294   180-1
2000   220
2150   2000
2001   240
2139   240-1
2157   120
2164   120-1
2244   120-2
2227   130
The expected result:
cp_id  root_cp_id  location
1998   1998        180
2294   1998        180-1
2000   2000        220
2150   2000        2000
2001   2001        240
2139   2001        240-1
2157   2157        120
2164   2157        120-1
2244   2157        120-2
2227   2227        130

Use Series.mask to set values to missing where the condition holds, and then forward fill the previous non-NaN values:
df['root_cp_id'] = df['cp_id'].mask(df['location'].str.contains('-')).ffill()
print (df)
cp_id location root_cp_id
0 1998 180 1998.0
1 2294 180-1 1998.0
2 2000 220 2000.0
3 2150 2000 2150.0
4 2001 240 2001.0
5 2139 240-1 2001.0
6 2157 120 2157.0
7 2164 120-1 2157.0
8 2244 120-2 2157.0
9 2227 130 2227.0
Or, if you need it as a new second column, use DataFrame.insert:
df.insert(1, 'root_cp_id', df['cp_id'].mask(df['location'].str.contains('-')).ffill())
print (df)
cp_id root_cp_id location
0 1998 1998.0 180
1 2294 1998.0 180-1
2 2000 2000.0 220
3 2150 2150.0 2000
4 2001 2001.0 240
5 2139 2001.0 240-1
6 2157 2157.0 120
7 2164 2157.0 120-1
8 2244 2157.0 120-2
9 2227 2227.0 130
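For reference, here is a minimal, self-contained reproduction of the DataFrame.insert variant, with the sample frame rebuilt from the question's table:
import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({'cp_id': [1998, 2294, 2000, 2150, 2001, 2139, 2157, 2164, 2244, 2227],
                   'location': ['180', '180-1', '220', '2000', '240', '240-1',
                                '120', '120-1', '120-2', '130']})

# hide cp_id wherever location contains a '-', then forward fill the last visible cp_id
df.insert(1, 'root_cp_id', df['cp_id'].mask(df['location'].str.contains('-')).ffill())
print(df)
Because mask introduces NaN before the fill, root_cp_id comes back as float; once every gap is filled it can be cast back with df['root_cp_id'].astype(int) if integer output is preferred.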

Related

Create pandas new columns based on rows category + convert rows into column - in groupby method

The dataset I'm working with is shown below. I want to add new columns to the table based on the row classifications, and at the same time have only one row per month. I'm not sure whether I should use groupby or not.
Year Month Index Humidity Temperature Pressure date
2019 1 High 100% 30 °C 1021 mbar 20191
2019 1 Low 28% 9 °C 1011 mbar 20191
2019 1 Average 65% 21 °C 1016 mbar 20191
2019 2 High 100% 32 °C 1020 mbar 20192
2019 2 Low 28% 10 °C 1008 mbar 20192
2019 2 Average 63% 18 °C 1014 mbar 20192
So the desired output looks like this:
Year Month HighHumidity LowHumidity AverageHumidity HighTemperature LowTemperature AverageTemperature HighPressure LowPressure AveragePressure date
2019 1 100% 28% 65% 30 °C 9 °C 21 °C 1021 mbar 1011 mbar 1016 mbar 20191
2019 2 100% 27% 63% 32 °C 10 °C 18 °C 1020 mbar 1008 mbar 1014 mbar 20192
I experimented with the following code. However, the pivot uses all the other columns as values, and I only want particular columns in this operation:
df = df.pivot(index='date', columns=['Index'])
df.columns = ['_'.join((col[1], col[0])) for col in df.columns]
Average_Year High_Year Low_Year Average_Month High_Month Low_Month Average_Humidity High_Humidity Low_Humidity Average_Temperature High_Temperature Low_Temperature Average_Pressure High_Pressure Low_Pressure
date
20191 2019 2019 2019 1 1 1 65% 100% 28% 21 °C 30 °C 9 °C 1016 mbar 1021 mbar 1011 mbar
201910 2019 2019 2019 10 10 10 81% 100% 49% 29 °C 35 °C 22 °C 1011 mbar 1016 mbar 1007 mbar
201911 2019 2019 2019 11 11 11 77% 100% 49% 26 °C 33 °C 16 °C 1013 mbar 1017 mbar 1006 mbar
201912 2019 2019 2019 12 12 12 77% 100% 38% 21 °C 31 °C 10 °C 1016 mbar 1021 mbar 1012 mbar
20192 2019 2019 2019 2 2 2 65% 100% 28% 23 °C 32 °C 10 °C 1015 mbar 1020 mbar 1008 mbar
You can add Year and Month to the index parameter:
df = df.pivot(index=['Year','Month','date'], columns=['Index'])
df.columns = ['_'.join((col[1], col[0])) for col in df.columns]
df = df.reset_index()
# move the date column to the end so the column order is correct
df['date'] = df.pop('date')
print (df)
Year Month Average_Humidity High_Humidity Low_Humidity \
0 2019 1 65% 100% 28%
1 2019 2 63% 100% 28%
Average_Temperature High_Temperature Low_Temperature Average_Pressure \
0 21°C 30°C 9°C 1016mbar
1 18°C 32°C 10°C 1014mbar
High_Pressure Low_Pressure date
0 1021mbar 1011mbar 20191
1 1020mbar 1008mbar 20192
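For reference, a self-contained version of this answer, with the two months shown in the question typed in by hand (the value strings are reconstructed from the sample, so treat them as illustrative):
import pandas as pd

# two months of the sample data from the question
df = pd.DataFrame({'Year': [2019] * 6,
                   'Month': [1, 1, 1, 2, 2, 2],
                   'Index': ['High', 'Low', 'Average'] * 2,
                   'Humidity': ['100%', '28%', '65%', '100%', '28%', '63%'],
                   'Temperature': ['30 °C', '9 °C', '21 °C', '32 °C', '10 °C', '18 °C'],
                   'Pressure': ['1021 mbar', '1011 mbar', '1016 mbar',
                                '1020 mbar', '1008 mbar', '1014 mbar'],
                   'date': [20191, 20191, 20191, 20192, 20192, 20192]})

# pivot on the full identifier (Year, Month, date) so those columns are preserved,
# flatten the resulting MultiIndex columns, then move date back to the end
df = df.pivot(index=['Year', 'Month', 'date'], columns=['Index'])
df.columns = ['_'.join((col[1], col[0])) for col in df.columns]
df = df.reset_index()
df['date'] = df.pop('date')
print(df)
Note that pivot with a list of index columns requires a reasonably recent pandas (1.1 or newer).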

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter TableA, keeping only those rows whose TotalInvoice field falls between the minimum and maximum values given in ViewB for the matching month, year, and RepairShopId (the sample data only has one RepairShopId, but the full data has multiple IDs).
In the view I have minimum and maximum values for each business and each month and year.
TableA
RepairOrderDataId
RepairShopId
LastUpdated
TotalInvoice
1
10
2017-06-01 07:00:00.000
765
1
10
2017-06-05 12:15:00.000
765
2
10
2017-02-25 13:00:00.000
400
3
10
2017-10-19 12:15:00.000
295679
4
10
2016-11-29 11:00:00.000
133409.41
5
10
2016-10-28 12:30:00.000
127769
6
10
2016-11-25 16:15:00.000
122400
7
10
2016-10-18 11:15:00.000
1950
8
10
2016-11-07 16:45:00.000
79342.7
9
10
2016-11-25 19:15:00.000
1950
10
10
2016-12-09 14:00:00.000
111559
11
10
2016-11-28 10:30:00.000
106333
12
10
2016-12-13 18:00:00.000
23847.4
13
10
2016-11-01 17:00:00.000
22782.9
14
10
2016-10-07 15:30:00.000
NULL
15
10
2017-01-06 15:30:00.000
138958
16
10
2017-01-31 13:00:00.000
244484
17
10
2016-12-05 09:30:00.000
180236
18
10
2017-02-14 18:30:00.000
92752.6
19
10
2016-10-05 08:30:00.000
161952
20
10
2016-10-05 08:30:00.000
8713.08
ViewB
RepairShopId  Orders  Average       MinimumValue      MaximumValue      year  month  yearMonth
10            1       370343        370343            370343            2015  7      2015-7
10            1       109645        109645            109645            2015  10     2015-10
10            1       148487        148487            148487            2015  12     2015-12
10            1       133409.41     133409.41         133409.41         2016  3      2016-3
10            1       19261         19261             19261             2016  8      2016-8
10            4       10477.3575    2656.65644879821  18298.0585512018  2016  9      2016-9
10            69      15047.709565  10                90942.6052417394  2016  10     2016-10
10            98      22312.077244  10                147265.581935242  2016  11     2016-11
10            96      20068.147395  10                99974.1750708773  2016  12     2016-12
10            86      25334.053372  10                184186.985160105  2017  1      2017-1
10            69      21410.63855   10                153417.00126689   2017  2      2017-2
10            100     13009.797     10                59002.3589332934  2017  3      2017-3
10            101     11746.191287  10                71405.3391452842  2017  4      2017-4
10            123     11143.49756   10                55306.8202091131  2017  5      2017-5
10            197     15980.55406   10                204538.144334771  2017  6      2017-6
10            99      10852.496969  10                63283.9899761938  2017  7      2017-7
10            131     52601.981526  10                1314998.61355187  2017  8      2017-8
10            124     10983.221854  10                59444.0535811233  2017  9      2017-9
10            115     12467.148434  10                72996.6054527277  2017  10     2017-10
10            123     14843.379593  10                129673.931373139  2017  11     2017-11
10            111     8535.455945   10                50328.1495501884  2017  12     2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field LastUpdated.
Any help is very appreciated!
Here is how you can do it. I assumed LastUpdated is the column in TableA that indicates the date of each record:
SELECT *
FROM TableA A
INNER JOIN ViewB B
ON A.RepairShopId = B.RepairShopId
AND A.TotalInvoice > B.MinimumValue
AND A.TotalInvoice < B.MaximumValue
AND YEAR(LastUpdated) = B.year
AND MONTH(LastUpdated) = B.month
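Since this thread is otherwise pandas-focused, here is a rough pandas equivalent of the same join logic, offered only as an illustrative sketch; the tiny frames below stand in for TableA and ViewB and use a hand-picked subset of the sample values:
import pandas as pd

# tiny illustrative frames standing in for TableA and ViewB
table_a = pd.DataFrame({'RepairOrderDataId': [1, 3],
                        'RepairShopId': [10, 10],
                        'LastUpdated': ['2017-06-01 07:00:00.000', '2017-10-19 12:15:00.000'],
                        'TotalInvoice': [765, 295679]})
view_b = pd.DataFrame({'RepairShopId': [10, 10],
                       'MinimumValue': [10, 10],
                       'MaximumValue': [204538.144334771, 129673.931373139],
                       'year': [2017, 2017],
                       'month': [6, 10]})

# derive year/month from LastUpdated so the frames can be joined the same way as in SQL
table_a['LastUpdated'] = pd.to_datetime(table_a['LastUpdated'])
table_a['year'] = table_a['LastUpdated'].dt.year
table_a['month'] = table_a['LastUpdated'].dt.month

# inner join on shop id + year + month, then keep rows whose TotalInvoice lies
# strictly between that month's MinimumValue and MaximumValue
merged = table_a.merge(view_b, on=['RepairShopId', 'year', 'month'])
result = merged[(merged['TotalInvoice'] > merged['MinimumValue']) &
                (merged['TotalInvoice'] < merged['MaximumValue'])]
print(result)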

Groupby sum in years in pandas

I have a data frame as shown below, which contains sales data for two health care products from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above I would like to prepare the dataframe below and plot it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit
Use groupby with dt.year and .agg to name your column:
df1 = df.groupby(df['sale_date'].dt.year).agg(total_profit=('profit','sum'))\
.reset_index().rename(columns={'sale_date': 'bought_year'})
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot(kind='bar')
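A self-contained version of the same approach, using a few rows of the sample data (assuming sale_date arrives as text and is parsed with pd.to_datetime first; since the question asked for a line plot, kind='line' is used here instead of the bar chart above):
import pandas as pd

# small subset of the sample data; sale_date is parsed to datetime first
df = pd.DataFrame({'product': ['A', 'B', 'A', 'B', 'B'],
                   'profit': [50, 200, 50, 200, 300],
                   'sale_date': ['2016-12-01', '2016-12-24', '2017-01-03',
                                 '2017-12-08', '2018-06-08'],
                   'discount': [5, 10, 4, 25, 45]})
df['sale_date'] = pd.to_datetime(df['sale_date'])

# group by the sale year, sum the profit, and rename the grouping column
df1 = (df.groupby(df['sale_date'].dt.year)
         .agg(total_profit=('profit', 'sum'))
         .reset_index()
         .rename(columns={'sale_date': 'bought_year'}))
print(df1)

# line plot with bought_year on the x axis and total_profit on the y axis
df1.plot(x='bought_year', y='total_profit', kind='line')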

pd transpose multiple rows into single column

Dear friends, I want to transpose the following dataframe into a single column. I can't figure out a way to transform it, so your help is welcome! I tried a pivot table but so far no success.
X 0.00 1.25 1.75 2.25 2.99 3.25
X 3.99 4.50 4.75 5.25 5.50 6.00
X 6.25 6.50 6.75 7.50 8.24 9.00
X 9.50 9.75 10.25 10.50 10.75 11.25
X 11.50 11.75 12.00 12.25 12.49 12.75
X 13.25 13.99 14.25 14.49 14.99 15.50
and it should look like this:
X
0.00
1.25
1.75
2.25
2.99
3.25
3.99
4.5
4.75
5.25
5.50
6.00
6.25
etc..
This will do it; df.columns[0] is used because I don't know what your headers are:
df = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
df
X
0 0.00
1 1.25
2 1.75
3 2.25
4 2.99
5 3.25
6 3.99
7 4.50
8 4.75
9 5.25
10 5.50
11 6.00
12 6.25
13 6.50
14 6.75
15 7.50
16 8.24
17 9.00
18 9.50
19 9.75
20 10.25
21 10.50
22 10.75
23 11.25
24 11.50
25 11.75
26 12.00
27 12.25
28 12.49
29 12.75
30 13.25
31 13.99
32 14.25
33 14.49
34 14.99
35 15.50
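For reference, a self-contained sketch of the same stack approach using just the first two rows of the sample data:
import pandas as pd

# first two rows of the sample; column 0 holds the 'X' label, the rest hold the values
df = pd.DataFrame([['X', 0.00, 1.25, 1.75, 2.25, 2.99, 3.25],
                   ['X', 3.99, 4.50, 4.75, 5.25, 5.50, 6.00]])

# move the label column into the index, stack the remaining columns row by row,
# and drop the old index so the result is one flat column named X
out = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
print(out)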
Thank you so much! A follow-up question: is it also possible to stack the df into 2 columns, X and Y?
This is the data set:
1 2 3 4 5 6 7
X 0.00 1.25 1.75 2.25 2.99 3.25
Y -1.08 -1.07 -1.07 -1.00 -0.81 -0.73
X 3.99 4.50 4.75 5.25 5.50 6.00
Y -0.37 -0.20 -0.15 -0.17 -0.15 -0.16
X 6.25 6.50 6.75 7.50 8.24 9.00
Y -0.17 -0.18 -0.24 -0.58 -0.93 -1.24
X 9.50 9.75 10.25 10.50 10.75 11.25
Y -1.38 -1.42 -1.51 -1.57 -1.64 -1.75
X 11.50 11.75 12.00 12.25 12.49 12.75
Y -1.89 -2.00 -2.00 -2.04 -2.04 -2.10
X 13.25 13.99 14.25 14.49 14.99 15.50
Y -2.08 -2.13 -2.18 -2.18 -2.27 -2.46

Access unnamed first column in a data frame?

For example, mtcars:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
The first “column” isn’t named. I want to access it like:
mtcars$hp[mtcars$***carnamecolumn*** == "Mazda RX4"]
to ask “What’s the horsepower of the Mazda RX4?”
Can I use a $ accessor for this?