pd transpose multiple rows into single column - pandas

Dear friends i want to transpose the following dataframe into a single column. I cant figure out a way to transform it so your help is welcome!! I tried pivottable but sofar no succes
X 0.00 1.25 1.75 2.25 2.99 3.25
X 3.99 4.50 4.75 5.25 5.50 6.00
X 6.25 6.50 6.75 7.50 8.24 9.00
X 9.50 9.75 10.25 10.50 10.75 11.25
X 11.50 11.75 12.00 12.25 12.49 12.75
X 13.25 13.99 14.25 14.49 14.99 15.50
and it should look like this
X
0.00
1.25
1.75
2.25
2.99
3.25
3.99
4.5
4.75
5.25
5.50
6.00
6.25
etc..

This will do it, df.columns[0] is used as I don't know what are your headers:
df = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
df
X
0 0.00
1 1.25
2 1.75
3 2.25
4 2.99
5 3.25
6 3.99
7 4.50
8 4.75
9 5.25
10 5.50
11 6.00
12 6.25
13 6.50
14 6.75
15 7.50
16 8.24
17 9.00
18 9.50
19 9.75
20 10.25
21 10.50
22 10.75
23 11.25
24 11.50
25 11.75
26 12.00
27 12.25
28 12.49
29 12.75
30 13.25
31 13.99
32 14.25
33 14.49
34 14.99
35 15.50

ty so much!! A follow up question(a)
Is it also possible to stack the df into 2 columns X and Y
this is the data set
This is the data set.
1 2 3 4 5 6 7
X 0.00 1.25 1.75 2.25 2.99 3.25
Y -1.08 -1.07 -1.07 -1.00 -0.81 -0.73
X 3.99 4.50 4.75 5.25 5.50 6.00
Y -0.37 -0.20 -0.15 -0.17 -0.15 -0.16
X 6.25 6.50 6.75 7.50 8.24 9.00
Y -0.17 -0.18 -0.24 -0.58 -0.93 -1.24
X 9.50 9.75 10.25 10.50 10.75 11.25
Y -1.38 -1.42 -1.51 -1.57 -1.64 -1.75
X 11.50 11.75 12.00 12.25 12.49 12.75
Y -1.89 -2.00 -2.00 -2.04 -2.04 -2.10
X 13.25 13.99 14.25 14.49 14.99 15.50
Y -2.08 -2.13 -2.18 -2.18 -2.27 -2.46

Related

Moving Average Pandas Across Group

My data has the following structure:
np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
2,2,
3,3,3,3,3,
4,4,4,
5,5,5,5,5,5,5,
6,
7,7,
8,8,8,8,8,8,8,
9,9,
10,10
],
'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
'2021-01-02','2021-01-05','2021-01-07',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05'
],
'Quantity': np.floor(np.random.random(size=35)*100)
})
And I want to calculate moving average (2 periods) over Date. So, the final output looks like the following. For first MA, we are taking 2021-01-02 & 2021-01-05 across all observations & calculate the MA (50). Similarly for other dates. The output need not be in the structure I'm showing the report. I just need date & MA column in the final data.
Thanks!
IIUC, you can aggregate the similar dates first, getting the sum and count.
Then take the sum per rolling 2 dates (here it doesn't look like you want to take care of a defined period but rather raw successive values, so I am assuming here prior sorting).
Finally, perform the ratio of sum and count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
joining the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000

Write & Apply Python Function with Grouped Pandas Data

I have data that is grouped by a column 'plant_name' and I need to write & apply a function to test for a trend on one of the columns, i.e., named "10%" or '90%' for example.
My data looks like this -
plant_name year count mean std min 10% 50% 90% max
0 ARIZONA I 2005 8760.0 8.25 2.21 1.08 5.55 8.19 11.09 15.71
1 ARIZONA I 2006 8760.0 7.87 2.33 0.15 4.84 7.82 10.74 16.75
2 ARIZONA I 2007 8760.0 8.31 2.25 0.03 5.52 8.27 11.23 16.64
3 ARIZONA I 2008 8784.0 7.67 2.46 0.21 4.22 7.72 10.78 15.73
4 ARIZONA I 2009 8760.0 6.92 2.33 0.23 3.79 6.95 9.96 14.64
5 ARIZONA I 2010 8760.0 8.07 2.21 0.68 5.51 7.85 11.14 17.31
6 ARIZONA I 2011 8760.0 7.54 2.38 0.33 4.44 7.45 10.54 17.77
7 ARIZONA I 2012 8784.0 8.61 1.92 0.33 6.37 8.48 11.07 15.84
8 ARIZONA I 2015 8760.0 8.21 2.13 0.60 5.58 8.24 10.88 16.74
9 ARIZONA I 2016 8784.0 8.39 2.27 0.46 5.55 8.32 11.34 16.09
10 ARIZONA I 2017 8760.0 8.32 2.11 0.85 5.70 8.25 11.12 17.96
11 ARIZONA I 2018 8760.0 7.94 2.28 0.07 5.17 7.72 11.04 16.31
12 ARIZONA I 2019 8760.0 7.71 2.49 0.38 4.28 7.75 10.87 15.79
13 ARIZONA I 2020 8784.0 7.57 2.43 0.50 4.36 7.47 10.78 15.69
14 CAETITE I 2005 8760.0 8.11 3.15 0.45 3.76 8.38 12.08 18.89
15 CAETITE I 2006 8760.0 7.70 3.21 0.05 3.50 7.66 12.05 19.08
16 CAETITE I 2007 8760.0 8.64 3.18 0.01 4.05 8.83 12.63 18.57
17 CAETITE I 2008 8784.0 7.87 3.09 0.28 3.75 7.80 11.92 18.54
18 CAETITE I 2009 8760.0 7.31 3.02 0.17 3.46 7.21 11.40 19.46
19 CAETITE I 2010 8760.0 8.00 3.24 0.34 3.63 8.03 12.29 17.27
I'm using this function from here -
import pymannkendall as mk
and you apply the function like this:
mk.original_test(dataframe)
I need the final dataframe to look like this which is the result of the series columns returned by the function (mk.original_test):
trend, h, p, z, Tau, s, var_s, slope, intercept = mk.original_test(data)
plant_name trend h p z Tau s var_s slope intercept
0 ARIZONA I no trend False 0.416 0.812 xxx x x x x
1 CAETITE I increasing True 0.002 3.6 xxx x x x x
I just am not sure how to use groupby to group by plant_name column and then apply the mk function by plant_name to either of the columns in the data shown. Thank you,
For a given column, you can run the test in a GroupBy.apply() and return the result as a Series indexed by result._fields:
def mktest(x):
result = mk.original_test(x)
return pd.Series(result, index=result._fields)
column = '10%'
df.groupby('plant_name', as_index=False)[column].apply(mktest)
plant_name
trend
h
p
z
Tau
s
var_s
slope
intercept
ARIZONA I
no trend
False
0.956276
-0.054827
-0.021978
-2.0
332.666667
-0.003333
5.361667
CAETITE I
no trend
False
0.452370
-0.751469
-0.333333
-5.0
28.333333
-0.026000
3.755000

Pandas df.shift(axis=1) adds extra entries, why?

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column
wind_rass_table_df=pd.read_csv(file_path, header=j+3, engine='python', nrows=77,sep=r'\s{2,}',skip_blank_lines=False,index_col=False)
wind_rass_table_df=wind_rass_table_df.shift(periods=1,axis=1)
Supposedly df.shift(axis=1) should shift the dataframe over by 1 column but it does more than that, it does this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted into the 7th column, shifted the 7th into the 8th and repeated the 8th, shifting the 9th over and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!
You can use iloc and create another dataframe:
df = pd.DataFrame(data=df.iloc[:, :-1], columns=df.columns[1:], index=df.index)

How can I divide and print the beginning of fields for each unique ID?

I need to divide the the first record of $6 by first record of $4 for each unique ID($1).
4 2016-07-19 06:09:50 546.5 3 11.5
4 2016-07-20 06:40:03 543.667 3 11.5
4 2016-07-21 05:43:18 539 3 11.5
4 2016-07-22 07:18:20 535 3 11.5
10 2016-07-20 08:08:45 488 3 17.5
10 2016-07-21 07:32:35 490.5 3 17.5
10 2016-07-23 06:01:58 470.5 3 17.5
10 2016-07-24 08:26:02 472 3 17.5
the output will look like this,
4 2016-07-19 06:09:50 546.5 3 11.5 0.02
4 2016-07-20 06:40:03 543.667 3 11.5 0.02
4 2016-07-21 05:43:18 539 3 11.5 0.02
4 2016-07-22 07:18:20 535 3 11.5 0.02
10 2016-07-20 08:08:45 488 3 17.5 0.036
10 2016-07-21 07:32:35 490.5 3 17.5 0.036
10 2016-07-23 06:01:58 470.5 3 17.5 0.036
10 2016-07-24 08:26:02 472 3 17.5 0.036
$ awk 'p!=$1{q=sprintf("%.3f", $6/$4)} {$(NF+1)=q;p=$1}1' file
4 2016-07-19 06:09:50 546.5 3 11.5 0.021
4 2016-07-20 06:40:03 543.667 3 11.5 0.021
4 2016-07-21 05:43:18 539 3 11.5 0.021
4 2016-07-22 07:18:20 535 3 11.5 0.021
10 2016-07-20 08:08:45 488 3 17.5 0.036
10 2016-07-21 07:32:35 490.5 3 17.5 0.036
10 2016-07-23 06:01:58 470.5 3 17.5 0.036
10 2016-07-24 08:26:02 472 3 17.5 0.036
Explained:
p!=$1 { # when the $1 changes
q=sprintf("%.3f", $6/$4) # calculate the value q to append to records
}
{ # for all records
$(NF+1)=q # append q to them
p=$1 # remember previous $1
} 1 # print
awk to the rescue!
$ awk '!($1 in a){a[$1]=$6/$4} {printf "%s\t%.3f\n",$0,a[$1]}' file
4 2016-07-19 06:09:50 546.5 3 11.5 0.021
4 2016-07-20 06:40:03 543.667 3 11.5 0.021
4 2016-07-21 05:43:18 539 3 11.5 0.021
4 2016-07-22 07:18:20 535 3 11.5 0.021
10 2016-07-20 08:08:45 488 3 17.5 0.036
10 2016-07-21 07:32:35 490.5 3 17.5 0.036
10 2016-07-23 06:01:58 470.5 3 17.5 0.036
10 2016-07-24 08:26:02 472 3 17.5 0.036
your output format is not consistent (2 or 3 decimal digits), there are ways to match exactly but not sure it was intentional.
#Alula- same logic as karakfa, but in spite of going through the loop first and then printing, doing that check within print itself.
awk '{printf "%s\t%.3f\n",$0,!a[$1]?a[$1]=$6/$4:a[$1]}' Input_file
I hope this helps you.

Randomly sum rows

I would like to randomly insert in a new temp_table the records from the Initial Table below, grouping them by a new PO number (1234-1, 1234-2,etc..) where each group sum(TKG) is <20 and sum(TVOL) is <0.1
INITIAL TABLE
lineID PO Item QTY Weight Volume T.KG T.VOL
1 1234 ABCD 12 0.40 0.0030 4.80 0.036
2 1234 EFGH 8 0.39 0.0050 3.12 0.040
3 1234 IJKL 5 0.48 0.0070 2.40 0.035
4 1234 MNOP 8 0.69 0.0040 5.53 0.032
5 1234 QRST 9 0.58 0.0025 5.22 0.023
6 1234 UVWX 7 0.87 0.0087 6.09 0.061
7 1234 YZAB 10 0.71 0.0064 7.10 0.064
8 1234 CDEF 6 0.69 0.0054 4.14 0.032
9 1234 GHIJ 7 0.65 0.0036 4.55 0.025
10 1234 KLMN 9 0.67 0.0040 6.03 0.036
NEW Temp_Table should look like:
LineID PO Item QTY Weight Volume T.KG T.VOL
1 1234-1 ABCD 12 0.40 0.0030 4.80 0.036
2 1234-1 EFGH 8 0.39 0.0050 3.12 0.040
5 1234-1 QRST 9 0.58 0.0025 5.22 0.023
3 1234-2 IJKL 5 0.48 0.0070 2.40 0.035
4 1234-2 MNOP 8 0.69 0.0040 5.53 0.032
8 1234-2 CDEF 6 0.69 0.0054 4.14 0.032
6 1234-3 UVWX 7 0.87 0.0087 6.09 0.061
10 1234-3 KLMN 9 0.67 0.0040 6.03 0.036
9 1234-4 GHIJ 7 0.65 0.0036 4.55 0.025
7 1234-4 YZAB 10 0.71 0.0064 7.10 0.064
I can't figure out how to code this...
It's probably a job for a cursor.
The algorithm could basically be like this:
Collect the rows from the initial table one by one, accumulating sum(TKG) and sum(TVOL):
pick out the rows into the temp while the conditions are still met (omit those that exceed either sum);
use lineID as the order;
iterate up to the end of the list.
Upon hitting the end of the table call it a group, then start all over again, omitting the rows that have already been collected into the temp.
Continue while there still are rows not collected.
But I'm too lazy at the moment to give out the actual code, besides it's a homework, and cursors hate me anyway.
The logic of the 1234-1, 1234-2, etc is to break the records into groups that represent a carton. If the order has 100 line items, I may need n cartons (n groups) to pack all the items.