Pandas df.shift(axis=1) adds extra entries, why? - pandas

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column
wind_rass_table_df=pd.read_csv(file_path, header=j+3, engine='python', nrows=77,sep=r'\s{2,}',skip_blank_lines=False,index_col=False)
wind_rass_table_df=wind_rass_table_df.shift(periods=1,axis=1)
Supposedly df.shift(axis=1) should shift the dataframe over by 1 column but it does more than that, it does this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted into the 7th column, shifted the 7th into the 8th and repeated the 8th, shifting the 9th over and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!

You can use iloc and create another dataframe:
df = pd.DataFrame(data=df.iloc[:, :-1], columns=df.columns[1:], index=df.index)

Related

Groupby multiple rows and get values in Pandas Dataframe

I would like to know that how would I get a data like, Date in first column and all stations ID in second column and all respective values of Average Wind speed, Sunshine duration and all and below that, Next date in next row and all stations in column beside this. I have data as below,
Stations_ID Date Average windspeed (Beaufort) sunshine duration average cloud cover
0 102 2016-01-01 6.8 5.733 NaN
1 164 2016-01-01 1.6 0.000 8.0
2 232 2016-01-01 2.0 0.000 7.8
3 282 2016-01-01 1.2 0.000 7.8
4 183 2016-01-01 2.9 0.000 7.8
... ... ... ... ... ...
164035 4271 2021-12-31 6.7 0.000 7.6
164036 4625 2021-12-31 7.1 0.000 7.3
164037 4177 2021-12-31 2.6 4.167 3.9
164038 4887 2021-12-31 4.7 5.333 3.8
164039 4336 2021-12-31 3.4 0.000 7.4
I have used below line of code,
df_data_3111 = pd.DataFrame(df_data_311.groupby(['Date','Stations_ID'])['Minimum Temperature'])
But this does not work.

pd transpose multiple rows into single column

Dear friends i want to transpose the following dataframe into a single column. I cant figure out a way to transform it so your help is welcome!! I tried pivottable but sofar no succes
X 0.00 1.25 1.75 2.25 2.99 3.25
X 3.99 4.50 4.75 5.25 5.50 6.00
X 6.25 6.50 6.75 7.50 8.24 9.00
X 9.50 9.75 10.25 10.50 10.75 11.25
X 11.50 11.75 12.00 12.25 12.49 12.75
X 13.25 13.99 14.25 14.49 14.99 15.50
and it should look like this
X
0.00
1.25
1.75
2.25
2.99
3.25
3.99
4.5
4.75
5.25
5.50
6.00
6.25
etc..
This will do it, df.columns[0] is used as I don't know what are your headers:
df = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
df
X
0 0.00
1 1.25
2 1.75
3 2.25
4 2.99
5 3.25
6 3.99
7 4.50
8 4.75
9 5.25
10 5.50
11 6.00
12 6.25
13 6.50
14 6.75
15 7.50
16 8.24
17 9.00
18 9.50
19 9.75
20 10.25
21 10.50
22 10.75
23 11.25
24 11.50
25 11.75
26 12.00
27 12.25
28 12.49
29 12.75
30 13.25
31 13.99
32 14.25
33 14.49
34 14.99
35 15.50
ty so much!! A follow up question(a)
Is it also possible to stack the df into 2 columns X and Y
this is the data set
This is the data set.
1 2 3 4 5 6 7
X 0.00 1.25 1.75 2.25 2.99 3.25
Y -1.08 -1.07 -1.07 -1.00 -0.81 -0.73
X 3.99 4.50 4.75 5.25 5.50 6.00
Y -0.37 -0.20 -0.15 -0.17 -0.15 -0.16
X 6.25 6.50 6.75 7.50 8.24 9.00
Y -0.17 -0.18 -0.24 -0.58 -0.93 -1.24
X 9.50 9.75 10.25 10.50 10.75 11.25
Y -1.38 -1.42 -1.51 -1.57 -1.64 -1.75
X 11.50 11.75 12.00 12.25 12.49 12.75
Y -1.89 -2.00 -2.00 -2.04 -2.04 -2.10
X 13.25 13.99 14.25 14.49 14.99 15.50
Y -2.08 -2.13 -2.18 -2.18 -2.27 -2.46

How to pivot a dataframe using multiple column?

How can I use pandas.pivot_table or any other method to split the following data frame into two?
This is my input data frame:
Method N Mean Max Min Median Mode Meduim
0 A 5 0.40 0.55 0.25 0.39 N/A m1
2 A 10 0.26 0.47 0.10 0.25 N/A m2
1 B 5 0.48 0.62 0.33 0.50 N/A m1
3 B 7 0.41 0.47 0.36 0.42 0.36 m2
And I want to output the two following data frames
A m1 m2
N 5 10
Mean 0.40 0.26
Max 0.55 0.47
Min 0.25 0.10
Median 0.39 0.25
Mode N/A N/A
and
B m1 m2
N 5 7
Mean 0.48 0.41
Max 0.62 0.47
Min 0.33 0.36
Median 0.50 0.42
Mode N/A 0.36
Thank you.
Is it pivot?
df.set_index(['Method','Meduim']).T
gives:
Method A B
Meduim m1 m2 m1 m2
N 5.00 10.00 5.00 7.00
Mean 0.40 0.26 0.48 0.41
Max 0.55 0.47 0.62 0.47
Min 0.25 0.10 0.33 0.36
Median 0.39 0.25 0.50 0.42
Mode NaN NaN NaN 0.36

Pandas - ufunc 'subtract' did not contain a loop with signature matching type

I have this piece of code:
self.value=0.8
for col in df.ix[:,'value1':'value3']:
df = df.iloc[abs(df[col] - self.value).argsort()]
which works perfectly as part of main() function. at return, it prints:
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
6 Radiohead Codex 0.00 1.00 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
however, when I import this function as part of a module, and even though I'm passing and printing the same 0.8 self.value, I get the following error:
df = df.iloc[(df[col] - self.flavor).argsort()]
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 721, in wrapper
result = wrap_results(safe_na_op(lvalues, rvalues))
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 682, in safe_na_op
return na_op(lvalues, rvalues)
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 668, in na_op
result[mask] = op(x[mask], y)
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
why is it so? what is going on?
pd.DataFrame.ix is has been deprecated. You should stop using it.
Your use of 'value1':'value3' is dangerous as it can include columns you didn't expect if your columns aren't positioned in the order you thought.
df = pd.DataFrame(
[['a', 'b', 1, 2, 3]],
columns='artist track v1 v2 v3'.split()
)
list(df.loc[:, 'v1':'v3'])
['v1', 'v2', 'v3']
But rearrange the columns and
list(df.loc[:, ['v1', 'v2', 'artist', 'v3', 'b']].loc[:, 'v1':'v3'])
['v1', 'v2', 'artist', 'v3']
You got 'artist' in the the list. And column 'artist' is of type string and can't be subtracted from or by an integer or float.
df['artist'] - df['v1']
> TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')
Setup
Shuffle df
df = df.sample(frac=1)
df
artist track pos neg neu
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
Solution
Use np.lexsort
value = 0.8
v = df[['pos', 'neg', 'neu']].values
df.iloc[np.lexsort(np.abs(v - value).T)]
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0

Randomly sum rows

I would like to randomly insert in a new temp_table the records from the Initial Table below, grouping them by a new PO number (1234-1, 1234-2,etc..) where each group sum(TKG) is <20 and sum(TVOL) is <0.1
INITIAL TABLE
lineID PO Item QTY Weight Volume T.KG T.VOL
1 1234 ABCD 12 0.40 0.0030 4.80 0.036
2 1234 EFGH 8 0.39 0.0050 3.12 0.040
3 1234 IJKL 5 0.48 0.0070 2.40 0.035
4 1234 MNOP 8 0.69 0.0040 5.53 0.032
5 1234 QRST 9 0.58 0.0025 5.22 0.023
6 1234 UVWX 7 0.87 0.0087 6.09 0.061
7 1234 YZAB 10 0.71 0.0064 7.10 0.064
8 1234 CDEF 6 0.69 0.0054 4.14 0.032
9 1234 GHIJ 7 0.65 0.0036 4.55 0.025
10 1234 KLMN 9 0.67 0.0040 6.03 0.036
NEW Temp_Table should look like:
LineID PO Item QTY Weight Volume T.KG T.VOL
1 1234-1 ABCD 12 0.40 0.0030 4.80 0.036
2 1234-1 EFGH 8 0.39 0.0050 3.12 0.040
5 1234-1 QRST 9 0.58 0.0025 5.22 0.023
3 1234-2 IJKL 5 0.48 0.0070 2.40 0.035
4 1234-2 MNOP 8 0.69 0.0040 5.53 0.032
8 1234-2 CDEF 6 0.69 0.0054 4.14 0.032
6 1234-3 UVWX 7 0.87 0.0087 6.09 0.061
10 1234-3 KLMN 9 0.67 0.0040 6.03 0.036
9 1234-4 GHIJ 7 0.65 0.0036 4.55 0.025
7 1234-4 YZAB 10 0.71 0.0064 7.10 0.064
I can't figure out how to code this...
It's probably a job for a cursor.
The algorithm could basically be like this:
Collect the rows from the initial table one by one, accumulating sum(TKG) and sum(TVOL):
pick out the rows into the temp while the conditions are still met (omit those that exceed either sum);
use lineID as the order;
iterate up to the end of the list.
Upon hitting the end of the table call it a group, then start all over again, omitting the rows that have already been collected into the temp.
Continue while there still are rows not collected.
But I'm too lazy at the moment to give out the actual code, besides it's a homework, and cursors hate me anyway.
The logic of the 1234-1, 1234-2, etc is to break the records into groups that represent a carton. If the order has 100 line items, I may need n cartons (n groups) to pack all the items.