How to show values used for color in texttemplate in px.treemap - plotly-python

I'm doing my first work with a Plotly Treemap. I want to show the values used for the colors in the blocks.
Using this answer, I was able to create customdata, and the treemap renders; however, the values are not in the same order as the dataframe, so in some cases it shows the 'Growth' value for a different record. The mismatch is not a simple off-by-one, either.
Sample data:
Type,Stock,Holding,Growth
Equities,Sasol,500,118.06
ETFs,SATRIX Top 40,500,12.99
Code:
shares_data = pd.read_csv("stocks.csv", index_col=False)
fig = px.treemap(
    shares_data,
    path=['Type', 'Stock'],
    values='Holding',
    color='Growth',
    color_continuous_scale=[(0, "red"), (0.05, "yellow"), (1, "green")],
)
fig.data[0].customdata = shares_data.Growth
fig.data[0].texttemplate = "<b>%{label}</b><br>Holding: R%{value}<br>Growth: %{customdata:.2f}%<br>"
fig.show()
Rendered: [treemap screenshot omitted]

Having read this, I was able to print the fig's data, which included:
'marker': {'coloraxis': 'coloraxis',
           'colors': array([ 53.02, 25.17, -23.29, 10.21, 57.59,...
So I could then access the values with:
fig.data[0].customdata = fig.data[0].marker.colors
And now it works.
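For reference, here is a minimal end-to-end sketch of the working version (assuming the sample stocks.csv above; the comment on the customdata line is my reading of why the alignment fix works):

import pandas as pd
import plotly.express as px

shares_data = pd.read_csv("stocks.csv", index_col=False)
fig = px.treemap(
    shares_data,
    path=['Type', 'Stock'],
    values='Holding',
    color='Growth',
    color_continuous_scale=[(0, "red"), (0.05, "yellow"), (1, "green")],
)
# px.treemap reorders the rows while building the sector hierarchy, so take
# the per-sector color values (already in trace order) rather than the raw
# dataframe column.
fig.data[0].customdata = fig.data[0].marker.colors
fig.data[0].texttemplate = "<b>%{label}</b><br>Holding: R%{value}<br>Growth: %{customdata:.2f}%<br>"
fig.show()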
[Because I really struggled with this, I decided to post the answer instead of deleting the question, in case someone else finds it useful in the future.]

Related

How do I return this groupby calculated values back to the dataframe as a single column?

I am new to Pandas. Sorry for using images instead of tables here; I tried to follow the instructions for inserting a table, but I couldn't.
Pandas version: '1.3.2'
Given this dataframe with Close and Volume for stocks, I've managed to calculate OBV, using pandas, like this:
df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
The above gave me the correct values for OBV, as shown here.
However, I'm not able to assign the calculated values to a new column.
I would like to do something like this:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
But simply running the expression above of course throws this error:
ValueError: Columns must be same length as key
What am I missing?
How can I insert the calculated values into the original dataframe as a single column, df['OBV'] ?
I've checked this thread, so I'm sure I should use apply.
This discussion looked promising, but it does not cover my case.
Use Series.droplevel to remove the first level of the MultiIndex:
df['OBV'] = df.groupby('Ticker').apply(lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum()).droplevel(0)
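To see why this works, here is a minimal, self-contained sketch with invented data (tickers and prices are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['A', 'A', 'A', 'B', 'B'],
    'Close':  [10, 11, 10, 20, 22],
    'Volume': [100, 200, 150, 300, 250],
})

obv = df.groupby('Ticker').apply(
    lambda x: (np.sign(x['Close'].diff().fillna(0)) * x['Volume']).cumsum())
# obv carries a MultiIndex of (Ticker, original row label); dropping level 0
# leaves the original row labels, which align with df's index on assignment.
df['OBV'] = obv.droplevel(0)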

Adding a row to a FITS table with astropy

I have a problem which ought to be trivial but seems to have been massively over-complicated by the column-based nature of FITS BinTableHDU.
The script I'm writing should be trivial: iterate through a FITS file and write a subset of rows to an identically formatted FITS file, reducing the row count from c. 700k / 3.6 GB to about 350 rows. I have processed the input file and have each row that I want to save in a Python list of FITS records:
outarray = []
self._indata = Table.read(self.infile, hdu=1)
for r in self._indata:
    RecPassesFilter = FilterProc(r, self)
    #
    # Add to output array only if it passes all filters...
    #
    if RecPassesFilter:
        outarray.append(r)
Now, I've created an empty BinTableHDU with exactly the same columns and formats, and I want to add the filtered data:
[...much omitted code later...]
mycols = []
for inputcol in self._coldefs:
    mycols.append(fits.Column(name=inputcol.name, format=inputcol.format))
# Next line should produce an empty BinTableHDU in the identical format to the output data
SaveData = fits.BinTableHDU.from_columns(mycols)
for s in self._outdata:
    SaveData.data.append(s)
Now that last line not only fails, but every variant of it (SaveData.append() or .add_row() or whatever) also fails with a "no such method" error. There seems to be a singular lack of documentation on how to do the trivial task of adding a record. Clearly I am missing something, but two days later I'm still drawing a blank.
Can anyone point me in the right direction here?
OK, I managed to resolve this with some brute force and nested iterations, essentially creating the column data arrays on the fly. It's not much code, and I don't care that it's inefficient as I won't need to run it too often. Example code here:
with fits.open(self._infile) as HDUSet:
    tableHDU = HDUSet[1]
    self._coldefs = tableHDU.columns

FITScols = []
for inputcol in self._coldefs:
    NewColData = []
    for r in self._outdata:
        NewColData.append(r[inputcol.name])
    FITScols.append(fits.Column(name=inputcol.name, format=inputcol.format, array=NewColData))
SaveData = fits.BinTableHDU.from_columns(FITScols)
SaveData.writeto(fname)
This solves my problem for a 350-row subset. I haven't yet dared to try it for the 250k-row subset that I need for the next part of my project!
I just recalled that BinTableHDU.from_columns takes an nrows argument. If you pass that along with the columns of an existing table HDU, it will copy the column structure but initialize subsequent rows with empty data:
>>> hdul = fits.open('astropy/io/fits/tests/data/table.fits')
>>> table = hdul[1]
>>> table.columns
ColDefs(
    name = 'target'; format = '20A'
    name = 'V_mag'; format = 'E'
)
>>> table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
         dtype=(numpy.record, [('target', 'S20'), ('V_mag', '>f4')]))
>>> new_table = fits.BinTableHDU.from_columns(table.columns, nrows=5)
>>> new_table.columns
ColDefs(
    name = 'target'; format = '20A'
    name = 'V_mag'; format = 'E'
)
>>> new_table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2),
          ('', 0.), ('', 0.)],
         dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
As you can see, this still copies the data from the original columns. I think the idea behind this originally was for adding new rows to an existing table. However, you can also initialize a completely empty new table by passing fill=True:
>>> new_table_zeroed = fits.BinTableHDU.from_columns(table.columns, nrows=5, fill=True)
>>> new_table_zeroed.data
FITS_rec([('', 0.), ('', 0.), ('', 0.), ('', 0.), ('', 0.)],
         dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
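Tying this back to the question, here is a hedged sketch of one way to fill the pre-sized table with the filtered rows (outarray is the list built in the question; the output filename is invented):

# Size the new HDU to the filtered row count, starting from zeroed rows.
new_hdu = fits.BinTableHDU.from_columns(table.columns, nrows=len(outarray), fill=True)
for i, row in enumerate(outarray):
    # Copy each filtered row field by field into the pre-sized FITS_rec.
    for name in new_hdu.columns.names:
        new_hdu.data[name][i] = row[name]
new_hdu.writeto('filtered.fits', overwrite=True)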

How to insert a column in a julia DataFrame at specific position (without referring to existing column names)

I have a DataFrame in Julia with hundreds of columns, and I would like to insert a column after the first one.
For example in this DataFrame:
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
I would like to insert a column area after colour, but without referring specifically to shape and border (which in my real case are hundreds of different columns).
df[:area] = [1,2]
In this example I can use the following (but it refers specifically to shape and border):
df = df[[:colour, :area, :shape, :border]] # with specific reference to shape and border names
Update: This function has changed. See @DiegoJavierZea's comment.
Well, congratulations on finding a workaround yourself, but there is a built-in function that is semantically clearer and possibly a little bit faster:
using DataFrames
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
insert!(df, 3, [1,2], :area)
Where 3 is the expected index for the new column after the insertion, [1,2] is its content, and :area is the name. You can find a more detailed document by typing ?insert! in REPL after loading the DataFrames package.
It is worth noting that the ! is part of the function name. It's a Julia convention to indicate that the function will mutate its argument.
rows = size(df)[1]   # the tuple gives you (rows, columns) of the DataFrame
insertcols!(df,      # DataFrame to be changed
    1,               # insert as column 1
    :Day => 1:rows,  # populate as "Day" with 1,2,3,..
    makeunique=true) # if a column of that name already exists, make it Day_1
While writing the question I also found a solution (as often happens).
I am still posting the question here to keep it on record (for myself) and for others.
It is enough to save the column names before "adding" the new column:
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
dfnames = names(df)
df[:area] = [1,2]
df = df[vcat(dfnames[1:1], :area, dfnames[2:end])]

Reading csv file with pandas

I have a csv file with 2 columns, text and boolean (y/n), and I am trying to put all the positive values in one file and the negative ones in another. Here is what I tried:
df = pd.read_csv('text_trait_with_binary_EXT.csv','rb',delimiter=',',quotechar='"')
#print(df)
df.columns = ["STATUS", "xEXT"]
positive = []
negative = []
for line in df:
    text = line[0].strip()
    if line[1].strip() == "y":
        positive.append(text)
    elif line[1].strip() == "n":
        negative.append(text)
print(positive)
print(negative)
And when I run this it just gives empty lists!
I am new to using pandas, so if any of you can help, that would be great.
As others have commented, there is almost always a better approach than iteration in Pandas; it has a lot of built-in functions to help avoid loops. (Note that iterating over a DataFrame directly yields its column labels, not its rows, which is why your lists come out empty.)
If I understand your intentions correctly, you want to take the values from column 1 (named 'STATUS'), split them according to whether the corresponding value in column 2 (named 'xEXT') is 'y' or 'n', and generate two lists containing the column 1 values. The following should work (to be used after the first two lines of code you posted):
positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()
Here is a link to the documentation on loc, which is useful for problems like this.
The above solution assumes that your data has been read in correctly. If it does not work for you, please do as others have commented and add a small sample of your data so that we are able to try out our proposed solutions.
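Putting it together, a minimal end-to-end sketch (the output filenames positive.txt and negative.txt are invented for illustration, and I'm assuming the file parses with the default comma delimiter, dropping the stray 'rb' argument from the original call):

import pandas as pd

df = pd.read_csv('text_trait_with_binary_EXT.csv')
df.columns = ["STATUS", "xEXT"]

positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()

# Write one STATUS text per line to each output file.
with open('positive.txt', 'w') as f:
    f.write('\n'.join(positive))
with open('negative.txt', 'w') as f:
    f.write('\n'.join(negative))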

Pandas column cleanup

I have a dataset with a complex column in pandas. One column, productInfo, has various types of contents:
# Input type 1
df['productInfo'][0]
# Output type 1
'Salt & pepper shakers,Material: stoneware,Dimensions: H6.5cm,Dachshund designs,1x black and tan, 1x brown,Hand painted,Dishwasher safe'
# Output type 2
'Pineapple string lights,Dimensions: 400x6x10cm,10 pineapple shaped LED lights,In a gold hue,3x AA batteries required (not included)'
# Output type 3
''
So essentially my productInfo column contains the above 3 kinds of contents.
What I want is to extract the Material from the productInfo column for groupby analysis, but of course only when the value exists; when it doesn't, just set it to null/None or whatever.
I have tried boolean masks but can't seem to make them work; any suggestion is highly appreciated.
Thanks in advance
Edit:
this was my original df: [screenshot]
My df after extracting Material from productInfo: [screenshot]
My df after extracting Material and Dimensions from productInfo: [screenshot]
Hopefully, you get an idea of what I'm trying to achieve. Most of my task is text extraction from complex text blobs inside each column.
If I find the relevant values in the text clumps using regex, I update the columns; otherwise I set them to null. It has proven to be a big challenge, so if any of you can help me extract useful info like Material and Dimensions from the productInfo text clump into their own columns, that would be very helpful.
Thanks in advance to anyone who tries to help, and sorry for my vague question without providing relevant information.
Happy Panda-ing (if that's a word!!)
:)
I imported pandas and re
import pandas as pd
import re
I created a helper function that does a simple regex to get the material and dimensions. I delete the material and dimension strings from the original string, returning a Series with the updated description, material, and dimensions.
def get_material_and_dimensions(row):
    description = row['productInfo']
    material = re.search(r'Material: (.*?),', description)
    if material:
        material = material.group(1)
        description = description.replace(f'Material: {material},', '')
    dimensions = re.search(r'Dimensions: (.*?),', description)
    if dimensions:
        dimensions = dimensions.group(1)
        description = description.replace(f'Dimensions: {dimensions},', '')
    return pd.Series([description, material, dimensions], index=['description', 'material', 'dimensions'])
Apply the function to the DataFrame:
myseries = df.apply(get_material_and_dimensions, axis=1)
Then add the series to the original DataFrame, replacing df['productInfo'] with the clean df['description']:
df = df.join(myseries)
df['productInfo'] = df['description']
df.drop('description', inplace=True, axis=1)
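As a quick sanity check, here is a small demo using the three sample strings from the question (the toy DataFrame is invented; the expected outcome is material and dimensions filled where present, None otherwise):

demo = pd.DataFrame({'productInfo': [
    'Salt & pepper shakers,Material: stoneware,Dimensions: H6.5cm,Dachshund designs,1x black and tan, 1x brown,Hand painted,Dishwasher safe',
    'Pineapple string lights,Dimensions: 400x6x10cm,10 pineapple shaped LED lights,In a gold hue,3x AA batteries required (not included)',
    '',
]})
result = demo.apply(get_material_and_dimensions, axis=1)
# result['material']   -> 'stoneware', None, None
# result['dimensions'] -> 'H6.5cm', '400x6x10cm', None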