Pandas pivot_table -- index values extended through rows - pandas

I'm trying to tidy some data, specifically by taking two columns "measure" and "value" and making more columns for each unique value of measure.
So far I have some python (3) code that reads in data and pivots it to the form that I want--roughly. This code looks like so:
import pandas as pd
#Load the data
df = pd.read_csv(r"C:\Users\User\Documents\example data.csv")
#Pivot the dataframe
df_pivot = df.pivot_table(index=['Geography Type', 'Geography Name', 'Week Ending',
'Item Name'], columns='Measure', values='Value')
print(df_pivot.head())
This outputs:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Item B 95 37 17
1/8/2018 Item A 92 8 32
Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
This is almost perfect, but for my work I need to put this file in software and for the software to read the data correctly it needs values for each of the rows, so I need the columns values for each of those indexes to extend through the rows, like so:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Type 1 Total US 1/1/2018 Item B 95 37 17
Type 1 Total US 1/8/2018 Item A 92 8 32
Type 1 Total US 1/8/2018 Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
and so on.

Related

How to add new column by getting the data from each?

I have Dataframe:
teamId pts xpts
Liverpool 82 59
Man City 57 63
Leicester 53 47
Chelsea 48 55
And I'm trying to add new columns that identify the team position by each column
I wanna get this:
teamId pts xpts №pts №xpts
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 53 4 3
I tried to do something similar with the following code, but to no avail. The result is a list
df = [df.sort_values(by=i, ascending=False).assign(new_col=lambda x: range(1, len(df) + 1)) for i in df.columns]
You can use np.argsort:
df[["no_pts", "no_xpts"]] = df.apply(lambda x: np.argsort(-x)) + 1
We pass each column but negated so that it gives the ascending order sort's indices. Since it starts from 0, we add 1 at the end.
to get
teamId pts xpts no_pts no_xpts
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 55 4 3
Solutions if teamId is index:
Use DataFrame.rank with all columns, add prefix and append to original DataFrame:
df = df.join(df.rank(method='dense', ascending=False).astype(int).add_prefix('no_'))
print (df)
pts xpts no_pts no_xpts
teamId
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 55 4 3
If need specify columns for processing use list:
cols = ['pts','xpts']
df = df.join(df[cols].rank(method='dense', ascending=False).astype(int).add_prefix('no_'))

How merge several time series data sets to a data frame and then using cross validation on it?

Hi I have 50 time series datasets in 50 .CSV files and each of them have 2 column: data and label. exactly like below. each of these .CSV file contain more than 800,000 row of signal records and none of them is equal to each other
How i can merge these .CSV file in to a data frame in order to select training and testing data with Cross validation?
because i working with RNN the sequence of data is important.
DATA OF CASE 1:
Data label
23 A
88 A
56 B
87 B
56 c
87 D
17 C
44 B
12 A
----------------
DATA OF CASE 2:
Data label
13 A
98 B
56 B
77 C
49 D
89 c
19 B
23 B
32 A

Summing columns and rows

How do I add up rows and columns.
The last column Sum needs to be the sum of the rows R0+R1+R2.
The last row needs to be the sum of these columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc position selection
df.iloc[-1,1:]=df.iloc[:-1,1:].sum()
df['Sum']=df.iloc[:,1:].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum']=df[cols_to_sum].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341

Column names after transposing a dataframe

I have a small dataframe - six rows (not counting the header) and 53 columns (a store name, and the rest weekly sales for the past year). Each row contains a particular store and each column the store's name and sales for each week. I need to transpose the data so that the weeks appear as rows, the stores appear as columns, and their sales appear as the rows.
To generate the input data:
df_store = pd.read_excel(SourcePath+SourceFile, sheet_name='StoreSales', header=0, usecols=['StoreName'])
# Number rows of all irrelevant stores.
row_numbers = [x+1 for x in df_stores[(df_store['StoreName'] != 'Store1') & (df_store['StoreName'] != 'Store2')
& (df_store['StoreName'] !='Store3')].index]
# Read in entire Excel file, skipping the rows of irrelevant stores.
df_store = pd.read_excel(SourcePath+SourceFile, sheet_name='StoreSales', header=0, usecols = "A:BE",
skiprows = row_numbers, converters = {'StoreName' : str})
# Transpose dataframe
df_store_t = df_store.transpose()
My output puts index numbers above each store name ( 0 to 5), and then each column starts out as StoreName (above the week), then each store name. Yet, I cannot manipulate them by their names.
Is there a way to clear those index numbers so that I can work directly with the resulting column names (e.g., rename "StoreName" to "WeekEnding" and make reference to each store columns ("Store1", "Store2", etc.?)
IIUC, you need to set_index first, then transpose, T:
See this example:
df = pd.DataFrame({'Store':[*'ABCDE'],
'Week 1':np.random.randint(50,200, 5),
'Week 2':np.random.randint(50,200, 5),
'Week 3':np.random.randint(50,200, 5)})
Input Dataframe:
Store Week 1 Week 2 Week 3
0 A 99 163 148
1 B 119 86 92
2 C 145 98 162
3 D 144 143 199
4 E 50 181 177
Now, set_index and transpose:
df_out = df.set_index('Store').T
df_out
Output:
Store A B C D E
Week 1 99 119 145 144 50
Week 2 163 86 98 143 181
Week 3 148 92 162 199 177

PowerPivot formula for row wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. This table is filled the velocity and the number of vehicles that pass this camera during a specific time(e.g. 14:10 - 15:25). Now I want to know that how can I get the average velocity of cars for an specific hour and list them in a separate table with 24 rows(hour 0 - 23) where the second column of each row is the weighted average velocity of that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns but when I enter my formula, the whole rows get updated with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows
WeightedVelocity = [count]*[vel]
Create a measure "WeightedAverage" as follows
WeightedAverage = sum(stat_table[WeightedVelocity]) / sum(stat_table[count])
Use measure "WeightedAverage" in VALUES area of pivot Table and use "hour" column in ROWS to get desired result.