Trying to transpose row data into their own columns in a dataframe - pandas

I'm trying to transpose row data in a dataframe so that each value gets its own column. I have a vertical, top-down report and need to break it out into a horizontal-style report. I'm using the Alpaca API, which returns a dataframe, but a new version of the API produces a different dataframe structure: columns have been eliminated and I need them back. Other code that relies on the old-style dataframes would be hard to re-tool.
The original dataframe was horizontal, with a unique column for each stock ticker and four sub-columns under each ticker in the header.
But now it produces a vertical dataframe: the formerly unique columns appear as row values under a single symbol column.
This is how I build the dataframe using the API:
df_ticker = api.get_bars(
    tickers,
    timeframe,
    start=start_date,
    end=end_date,
    limit=1000
).df  # to make it a dataframe
I tried to transpose the values of the symbol column using pivot_table() to make the stock values their own columns again, like the original dataframe, but it didn't come out right.
df_ticker_fixed = df_ticker.reset_index()
df_ticker_fixed = df_ticker_fixed.pivot_table(
    index='timestamp',
    columns='symbol',
    values=['open', 'high', 'low', 'close', 'volume']
)
df_ticker_fixed.head()
The result comes out with the column headers arranged wrong (the value names end up on top, with the tickers underneath).
What I basically need is to make that double-header column format again, where there's a column with sub-divided columns underneath it. I don't know what it's called when you have two layers of columns in a report.

This is called a MultiIndex and you want to swaplevel after your pivot, with an optional sort_index:
df_ticker_fixed = (df_ticker
    .reset_index()
    .pivot_table(
        index='timestamp',
        columns='symbol',
        values=['open', 'high', 'low', 'close', 'volume']
    )
    .swaplevel(axis=1)
    .sort_index(level='symbol', axis=1, sort_remaining=False)
)
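To see the shape this produces, here is a self-contained reproduction with made-up bars for two hypothetical tickers (the data and symbols are invented for illustration):
import pandas as pd

# Hypothetical long-format bars, mimicking the new API's output
df_ticker = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-02', '2023-01-02',
                                 '2023-01-03', '2023-01-03']),
    'symbol': ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
    'open':   [130.3, 243.1, 126.9, 235.8],
    'high':   [130.9, 245.8, 128.7, 239.9],
    'low':    [124.2, 237.4, 125.1, 233.9],
    'close':  [125.1, 239.6, 126.4, 239.6],
    'volume': [1000, 2000, 1500, 2500],
}).set_index('timestamp')

df_ticker_fixed = (df_ticker
    .reset_index()
    .pivot_table(index='timestamp', columns='symbol',
                 values=['open', 'high', 'low', 'close', 'volume'])
    .swaplevel(axis=1)                 # put symbol on the outer level
    .sort_index(level='symbol', axis=1, sort_remaining=False)
)

print(df_ticker_fixed.columns.tolist())
# [('AAPL', 'close'), ('AAPL', 'high'), ('AAPL', 'low'),
#  ('AAPL', 'open'), ('AAPL', 'volume'), ('MSFT', 'close'), ...]
Note that pivot_table orders the sub-columns alphabetically; if the original OHLCV order matters, a .reindex(['open', 'high', 'low', 'close', 'volume'], axis=1, level=1) on the result should restore it.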

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
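For the record, a vectorized alternative that sidesteps the nested loop is groupby + transform('size'), which broadcasts each group's size back onto every row (a sketch; assumes the key columns contain no NaNs, since groupby drops those by default):
import pandas as pd

df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]

# transform('size') computes each group's row count and returns it
# index-aligned with the original frame, one value per row.
group_size = df_k1.groupby(columns_for_groups)[df_k1.columns[-1]].transform('size')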

How to broadcast a list of data into a dataframe (or MultiIndex)

I have a big dataframe, about 200k rows and 3 columns (x, y, z). Some rows don't have y and z values, just an x value. I want to make a new column where the first set of data with z values gets 1, the second gets 2, then 3, etc. Or make a MultiIndex in the same format.
I made a new column called "NO." with zero as the initial value, then tried to record the indices where the new column should get a new value, with the following code:
df = pd.read_fwf(path, header=None, names=['x', 'y', 'z'])
df['NO.'] = 0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed)-1):
    df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i+1]] = i + 1
df['NO.'].iloc[index_NO_changed[-1]:] = len(index_NO_changed)
But the problem is I get the warning: "A value is trying to be set on a copy of a slice from a DataFrame".
I was wondering: is there any better way? Is creating a MultiIndex instead of adding another column easier, considering the size of the dataframe?
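No answer was posted, but a vectorized approach would avoid both the loop and the warning: if every set begins with one of the separator rows (z is NaN), a cumulative sum of the null mask numbers the sets directly (a sketch under that assumption):
import pandas as pd

df = pd.read_fwf(path, header=None, names=['x', 'y', 'z'])

# Each null z starts a new set, so the running count of nulls
# labels the rows 1, 2, 3, ...; rows before the first null keep 0.
df['NO.'] = df['z'].isnull().cumsum()

# If a MultiIndex is preferred over an extra column:
df = df.set_index('NO.', append=True)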

How to change values of a column in a Pyspark dataframe from a map of two columns of the same df

So I have a dataframe, say df, with multiple columns. I now create a dataframe from df, say map, containing only columns A and B and keeping only the unique rows. Now I want to modify df such that, for a row in df, if I find df['B'] in the map, then df['A'] should be the key value from the map; otherwise df['A'] remains the same.
Any useful suggestions would be appreciated.
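No answer was posted here either, but a common pattern for this (a sketch, assuming the map holds exactly one row per B value; it is named mapping below since map shadows a Python builtin) is a left join followed by coalesce:
from pyspark.sql import functions as F

# One row per B value; rename A so it doesn't collide after the join
mapping = df.select(F.col('B'), F.col('A').alias('A_mapped')).dropDuplicates(['B'])

# Left-join on B: matched rows take the mapped A,
# unmatched rows keep their original A via coalesce.
df = (df.join(mapping, on='B', how='left')
        .withColumn('A', F.coalesce(F.col('A_mapped'), F.col('A')))
        .drop('A_mapped'))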

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly resolution with several columns. I want to extract the entire rows (all columns) for the top 10 values of a specific column, for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column to look for the largest values in. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)
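For what it's worth, the extra year level that apply prepends to the index can be avoided entirely with group_keys=False (a sketch; the exact interplay of group_keys and apply has shifted across pandas versions):
# Keep the original hourly index on the result instead of
# prepending a year level, so rows line up with the source frame.
top10 = (df.groupby(df.index.year, group_keys=False)
           .apply(lambda grp: grp.nlargest(10, 'totaldemand')))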

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I am getting a Series object where row.index.values contains the names of the df columns. But I wanted a dataframe with only one row, keeping the dataframe columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe where the index contains the names of the columns and .columns outputs Int64Index([0], dtype='int64').
But all I want is the original dataframe's columns with only one row. How do I do it? Or, how do I convert the row.index values into column values and change the 85x1 shape to 1x85?
You just need to add .T:
row.to_frame().T
Alternatively, change your for loop by adding [] around i:
for i in range(num_rows):
    row = df.iloc[[i]]
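Either way the shape comes out as one row by N columns; a quick check with a toy frame (hypothetical column names):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

row = df.iloc[0]                   # Series with the column names as its index
print(row.to_frame().T.shape)      # (1, 2): transposing restores row orientation
print(df.iloc[[0]].shape)          # (1, 2): list indexing keeps it a DataFrame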