Access unnamed first column in a data frame?

For example, mtcars:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
The first "column" isn't named. I want to access it like:
mtcars$hp[mtcars$***carnamecolumn*** == "Mazda RX4"]
to ask "What's the horsepower of the Mazda RX4?"
Can I use a $ accessor for this?
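For what it's worth, in base R the car names in mtcars are row names rather than a column, so there is nothing for $ to reach; a minimal sketch of the usual row-name lookup (not part of the original thread) would be:

mtcars$hp[rownames(mtcars) == "Mazda RX4"]
#> [1] 110

# or, equivalently, index by row name and column name
mtcars["Mazda RX4", "hp"]
#> [1] 110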

Related

filling column with another column value with condition

How do I fill the root_cp_id column with the cp_id of the location that has no - suffix?
The table I have:
cp_id  location
1998   180
2294   180-1
2000   220
2150   2000
2001   240
2139   240-1
2157   120
2164   120-1
2244   120-2
2227   130
The expected result:
cp_id  root_cp_id  location
1998   1998        180
2294   1998        180-1
2000   2000        220
2150   2000        2000
2001   2001        240
2139   2001        240-1
2157   2157        120
2164   2157        120-1
2244   2157        120-2
2227   2227        130
Use Series.mask to set values to missing where the location contains -, and then forward fill the previous non-NaN values:
df['root_cp_id'] = df['cp_id'].mask(df['location'].str.contains('-')).ffill()
print (df)
cp_id location root_cp_id
0 1998 180 1998.0
1 2294 180-1 1998.0
2 2000 220 2000.0
3 2150 2000 2150.0
4 2001 240 2001.0
5 2139 240-1 2001.0
6 2157 120 2157.0
7 2164 120-1 2157.0
8 2244 120-2 2157.0
9 2227 130 2227.0
Or if a new second column is needed, use DataFrame.insert:
df.insert(1, 'root_cp_id', df['cp_id'].mask(df['location'].str.contains('-')).ffill())
print (df)
cp_id root_cp_id location
0 1998 1998.0 180
1 2294 1998.0 180-1
2 2000 2000.0 220
3 2150 2150.0 2000
4 2001 2001.0 240
5 2139 2001.0 240-1
6 2157 2157.0 120
7 2164 2157.0 120-1
8 2244 2157.0 120-2
9 2227 2227.0 130
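One caveat not mentioned above: the mask/ffill step yields a float column because of the intermediate NaNs, which is why root_cp_id prints as 1998.0 and so on. If integer ids are wanted, a cast afterwards should be enough (assuming the forward fill leaves no NaN behind):

df['root_cp_id'] = df['root_cp_id'].astype(int)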

Moving Average Pandas Across Group

My data has the following structure:
import numpy as np
import pandas as pd

np.random.seed(25)
tdf = pd.DataFrame({'person_id': [1,1,1,1,
                                  2,2,
                                  3,3,3,3,3,
                                  4,4,4,
                                  5,5,5,5,5,5,5,
                                  6,
                                  7,7,
                                  8,8,8,8,8,8,8,
                                  9,9,
                                  10,10],
                    'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
                             '2021-01-02','2021-01-05','2021-01-07',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
                             '2021-01-02',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05'],
                    'Quantity': np.floor(np.random.random(size=35)*100)})
And I want to calculate a moving average (2 periods) over Date, so the final output looks like the following. For the first MA, we take 2021-01-02 & 2021-01-05 across all observations and calculate the MA (50). Similarly for the other dates. The output need not be in the structure I'm showing here; I just need the date & MA columns in the final data.
Thanks!
IIUC, you can aggregate the similar dates first, getting the sum and count.
Then take the sum per rolling 2 dates (it doesn't look like you want a defined time period here but rather raw successive values, so I am assuming prior sorting).
Finally, take the ratio of the sums and counts to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
Joining with the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000
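If, as the question says, only the date and the MA are needed in the final data, the rolling Series can simply be turned into a two-column frame instead of being merged back; a small sketch reusing s from above (the column name is just an example):

ma = s.rename('Quantity_MA(2)').reset_index()   # columns: Date, Quantity_MA(2)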

pd transpose multiple rows into single column

Dear friends, I want to transpose the following dataframe into a single column. I can't figure out a way to transform it, so your help is welcome!! I tried pivot_table but so far no success.
X 0.00 1.25 1.75 2.25 2.99 3.25
X 3.99 4.50 4.75 5.25 5.50 6.00
X 6.25 6.50 6.75 7.50 8.24 9.00
X 9.50 9.75 10.25 10.50 10.75 11.25
X 11.50 11.75 12.00 12.25 12.49 12.75
X 13.25 13.99 14.25 14.49 14.99 15.50
and it should look like this
X
0.00
1.25
1.75
2.25
2.99
3.25
3.99
4.5
4.75
5.25
5.50
6.00
6.25
etc..
This will do it; df.columns[0] is used as I don't know what your headers are:
df = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
df
X
0 0.00
1 1.25
2 1.75
3 2.25
4 2.99
5 3.25
6 3.99
7 4.50
8 4.75
9 5.25
10 5.50
11 6.00
12 6.25
13 6.50
14 6.75
15 7.50
16 8.24
17 9.00
18 9.50
19 9.75
20 10.25
21 10.50
22 10.75
23 11.25
24 11.50
25 11.75
26 12.00
27 12.25
28 12.49
29 12.75
30 13.25
31 13.99
32 14.25
33 14.49
34 14.99
35 15.50
Thank you so much!! A follow-up question:
Is it also possible to stack the df into 2 columns, X and Y?
This is the data set.
1 2 3 4 5 6 7
X 0.00 1.25 1.75 2.25 2.99 3.25
Y -1.08 -1.07 -1.07 -1.00 -0.81 -0.73
X 3.99 4.50 4.75 5.25 5.50 6.00
Y -0.37 -0.20 -0.15 -0.17 -0.15 -0.16
X 6.25 6.50 6.75 7.50 8.24 9.00
Y -0.17 -0.18 -0.24 -0.58 -0.93 -1.24
X 9.50 9.75 10.25 10.50 10.75 11.25
Y -1.38 -1.42 -1.51 -1.57 -1.64 -1.75
X 11.50 11.75 12.00 12.25 12.49 12.75
Y -1.89 -2.00 -2.00 -2.04 -2.04 -2.10
X 13.25 13.99 14.25 14.49 14.99 15.50
Y -2.08 -2.13 -2.18 -2.18 -2.27 -2.46
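The follow-up isn't answered above, but the same stacking idea should work for two columns, assuming the first column holds the X/Y label and each X row is followed by a Y row with the same number of values:

# split the rows by their label, drop the label column, and flatten row by row
lab = df.iloc[:, 0]
x = df[lab == 'X'].iloc[:, 1:].to_numpy().ravel()
y = df[lab == 'Y'].iloc[:, 1:].to_numpy().ravel()
out = pd.DataFrame({'X': x, 'Y': y})

This pairs the i-th X value with the i-th Y value, so it relies on the X and Y rows carrying their values in the same order.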

How can I divide and print the beginning of fields for each unique ID?

I need to divide the first record of $6 by the first record of $4 for each unique ID ($1).
4 2016-07-19 06:09:50 546.5 3 11.5
4 2016-07-20 06:40:03 543.667 3 11.5
4 2016-07-21 05:43:18 539 3 11.5
4 2016-07-22 07:18:20 535 3 11.5
10 2016-07-20 08:08:45 488 3 17.5
10 2016-07-21 07:32:35 490.5 3 17.5
10 2016-07-23 06:01:58 470.5 3 17.5
10 2016-07-24 08:26:02 472 3 17.5
The output should look like this:
4 2016-07-19 06:09:50 546.5 3 11.5 0.02
4 2016-07-20 06:40:03 543.667 3 11.5 0.02
4 2016-07-21 05:43:18 539 3 11.5 0.02
4 2016-07-22 07:18:20 535 3 11.5 0.02
10 2016-07-20 08:08:45 488 3 17.5 0.036
10 2016-07-21 07:32:35 490.5 3 17.5 0.036
10 2016-07-23 06:01:58 470.5 3 17.5 0.036
10 2016-07-24 08:26:02 472 3 17.5 0.036
$ awk 'p!=$1{q=sprintf("%.3f", $6/$4)} {$(NF+1)=q;p=$1}1' file
4 2016-07-19 06:09:50 546.5 3 11.5 0.021
4 2016-07-20 06:40:03 543.667 3 11.5 0.021
4 2016-07-21 05:43:18 539 3 11.5 0.021
4 2016-07-22 07:18:20 535 3 11.5 0.021
10 2016-07-20 08:08:45 488 3 17.5 0.036
10 2016-07-21 07:32:35 490.5 3 17.5 0.036
10 2016-07-23 06:01:58 470.5 3 17.5 0.036
10 2016-07-24 08:26:02 472 3 17.5 0.036
Explained:
p!=$1 {                           # when $1 changes
    q = sprintf("%.3f", $6/$4)    # calculate the value q to append to records
}
{                                 # for all records
    $(NF+1) = q                   # append q to them
    p = $1                        # remember previous $1
} 1                               # print
awk to the rescue!
$ awk '!($1 in a){a[$1]=$6/$4} {printf "%s\t%.3f\n",$0,a[$1]}' file
4 2016-07-19 06:09:50 546.5 3 11.5 0.021
4 2016-07-20 06:40:03 543.667 3 11.5 0.021
4 2016-07-21 05:43:18 539 3 11.5 0.021
4 2016-07-22 07:18:20 535 3 11.5 0.021
10 2016-07-20 08:08:45 488 3 17.5 0.036
10 2016-07-21 07:32:35 490.5 3 17.5 0.036
10 2016-07-23 06:01:58 470.5 3 17.5 0.036
10 2016-07-24 08:26:02 472 3 17.5 0.036
Your output format is not consistent (2 or 3 decimal digits); there are ways to match it exactly, but I'm not sure that was intentional.
@Alula - same logic as karakfa, but instead of going through the loop first and then printing, doing that check within the print itself.
awk '{printf "%s\t%.3f\n",$0,!a[$1]?a[$1]=$6/$4:a[$1]}' Input_file
I hope this helps you.

Randomly sum rows

I would like to randomly insert into a new temp_table the records from the Initial Table below, grouping them by a new PO number (1234-1, 1234-2, etc.) where each group's sum(T.KG) is < 20 and sum(T.VOL) is < 0.1.
INITIAL TABLE
lineID PO Item QTY Weight Volume T.KG T.VOL
1 1234 ABCD 12 0.40 0.0030 4.80 0.036
2 1234 EFGH 8 0.39 0.0050 3.12 0.040
3 1234 IJKL 5 0.48 0.0070 2.40 0.035
4 1234 MNOP 8 0.69 0.0040 5.53 0.032
5 1234 QRST 9 0.58 0.0025 5.22 0.023
6 1234 UVWX 7 0.87 0.0087 6.09 0.061
7 1234 YZAB 10 0.71 0.0064 7.10 0.064
8 1234 CDEF 6 0.69 0.0054 4.14 0.032
9 1234 GHIJ 7 0.65 0.0036 4.55 0.025
10 1234 KLMN 9 0.67 0.0040 6.03 0.036
NEW Temp_Table should look like:
LineID PO Item QTY Weight Volume T.KG T.VOL
1 1234-1 ABCD 12 0.40 0.0030 4.80 0.036
2 1234-1 EFGH 8 0.39 0.0050 3.12 0.040
5 1234-1 QRST 9 0.58 0.0025 5.22 0.023
3 1234-2 IJKL 5 0.48 0.0070 2.40 0.035
4 1234-2 MNOP 8 0.69 0.0040 5.53 0.032
8 1234-2 CDEF 6 0.69 0.0054 4.14 0.032
6 1234-3 UVWX 7 0.87 0.0087 6.09 0.061
10 1234-3 KLMN 9 0.67 0.0040 6.03 0.036
9 1234-4 GHIJ 7 0.65 0.0036 4.55 0.025
7 1234-4 YZAB 10 0.71 0.0064 7.10 0.064
I can't figure out how to code this...
It's probably a job for a cursor.
The algorithm could basically be like this:
Collect the rows from the initial table one by one, accumulating sum(TKG) and sum(TVOL):
pick out the rows into the temp while the conditions are still met (omit those that exceed either sum);
use lineID as the order;
iterate up to the end of the list.
Upon hitting the end of the table call it a group, then start all over again, omitting the rows that have already been collected into the temp.
Continue while there still are rows not collected.
But I'm too lazy at the moment to give out the actual code; besides, it's homework, and cursors hate me anyway.
The logic of the 1234-1, 1234-2, etc. is to break the records into groups that represent a carton. If the order has 100 line items, I may need n cartons (n groups) to pack all the items.
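For illustration only (not the cursor code the answer above declined to write), here is the same first-fit idea as a short Python sketch, using the 20 kg / 0.1 volume limits and the T.KG / T.VOL values from the question:

items = [(1, 4.80, 0.036), (2, 3.12, 0.040), (3, 2.40, 0.035), (4, 5.53, 0.032),
         (5, 5.22, 0.023), (6, 6.09, 0.061), (7, 7.10, 0.064), (8, 4.14, 0.032),
         (9, 4.55, 0.025), (10, 6.03, 0.036)]        # (lineID, T.KG, T.VOL)
remaining, carton, result = list(items), 0, []
while remaining:                                     # keep opening cartons until all lines are packed
    carton += 1
    kg = vol = 0.0
    leftover = []
    for line_id, tkg, tvol in remaining:             # walk the uncollected rows in lineID order
        if kg + tkg < 20 and vol + tvol < 0.1:       # line still fits in the current carton
            kg += tkg
            vol += tvol
            result.append((line_id, f'1234-{carton}'))
        else:                                        # doesn't fit; retry in a later pass
            leftover.append((line_id, tkg, tvol))
    remaining = leftover
print(result)                                        # each lineID with its new PO suffix

The groups produced this way follow lineID order, so they won't necessarily match the "randomly" packed example grouping, but the weight and volume constraints are respected.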