Change order of categorical bars in Plotly parallel categories - dataframe

I am trying to visualize changes in gene expression as categorical variables (up, down, no change) over various timepoints.
I have a dataframe describing differential expression data that looks like this:
import pandas as pd

data = {'gene': ['Svm3G0018840', 'Svm5G0011050', 'Svm9G0059770'],
        '01h': ['nc', 'up', 'down'],
        '04h': ['up', 'down', 'nc'],
        '08h': ['nc', 'down', 'up']}
df = pd.DataFrame.from_dict(data)
df = df.set_index('gene')
I can create the parallel categories plot from a dataframe like this (herbdf is my full dataset, which has additional timepoints) using the following code:
import plotly.express as px

fig = px.parallel_categories(herbdf, dimensions=['01h', '04h', '08h', '24h', '48h'],
                             labels={'01h': '', '04h': '', '08h': '', '24h': '', '48h': ''})
fig.show()
However, the categories (up, down, nc) are not always in the same order for every time point, which makes the figure very difficult to read. I can change the order in the interactive figure in a notebook, but from there I can only export the corrected figure as a low-quality PNG. I need the image in SVG format, which means I need to use the line:
fig.write_image("/figs/herb_de_pp.svg")
But when I add this line to the code block to save the figure, I have no control over the order the categorical boxes end up in.
I have tried adding fig.update_* calls to solve this problem, such as:
fig.update_layout(xaxis={'categoryorder':'total descending'})
but this doesn't seem to change the output at all.
I could be missing something simple- any help would be much appreciated!

Parallel categories diagrams don't have xaxis/yaxis properties; you need to update the traces' dimensions in order to change the category order:
dimensions = ['01h', '04h', '08h','24h','48h']
...
fig.update_traces(dimensions=[{"categoryorder": "category descending"} for _ in dimensions])
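For instance, here is a fuller sketch of that fix, assuming a dataframe herbdf with the timepoint columns from the question (the explicit categoryarray is only needed if you want a specific order such as up/nc/down rather than an alphabetical one, and write_image needs the kaleido package installed):

import plotly.express as px

dims = ['01h', '04h', '08h', '24h', '48h']
fig = px.parallel_categories(herbdf, dimensions=dims,
                             labels={d: '' for d in dims})
# pin the same explicit category order on every dimension
fig.update_traces(dimensions=[{"categoryorder": "array",
                               "categoryarray": ["up", "nc", "down"]}
                              for _ in dims])
fig.write_image("/figs/herb_de_pp.svg")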

Not a great answer here, but something that I think will work in a pinch...
It looks like the category order in each column comes from the order in which the values appear in the original dataset. That is, in your first column, nc is the first unique item, down is the second unique item, and up is third.
So, if you can rearrange/sort your data so that the values show up in the order you want them displayed, that should work.
Have your first row be nc | nc | nc | nc | nc, your second row down | down | down | down | down, and your third row up | up | up | up | up (assuming you actually have records like that). That should do it, but it isn't very elegant...

Given the above solution, this is the line needed to sort the dataframe and produce the figure with ordered categories:
sorteddf = df.sort_values(by=['01h','04h','08h'], axis=0, ascending=False)
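Putting the pieces together, a small sketch using the sample dataframe from the question (whether every column ends up in the same order depends on the actual data, as noted above; write_image needs the kaleido package):

import pandas as pd
import plotly.express as px

data = {'gene': ['Svm3G0018840', 'Svm5G0011050', 'Svm9G0059770'],
        '01h': ['nc', 'up', 'down'],
        '04h': ['up', 'down', 'nc'],
        '08h': ['nc', 'down', 'up']}
df = pd.DataFrame.from_dict(data).set_index('gene')

# sort the rows so that the first occurrences set the category order Plotly picks up
sorteddf = df.sort_values(by=['01h', '04h', '08h'], axis=0, ascending=False)

fig = px.parallel_categories(sorteddf, dimensions=['01h', '04h', '08h'],
                             labels={'01h': '', '04h': '', '08h': ''})
fig.write_image("herb_de_pp.svg")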

Related

How to make pie chart of these values in Splunk

I have the following query: index=app (splunk_server_group=bex OR splunk_server_group=default) sourcetype=rpm-web* host=rpm-web* "CACHE_NAME=RATE_SHOPPER" method=GET | stats count(eval(searchmatch("found=true"))) as Hit, count(eval(searchmatch("found=false"))) as Miss
I need to make a pie chart of the two values, "Hit and Miss rates".
The field where it is possible to distinguish the values is Message=[CACHE_NAME=RATE_SHOPPER some_other_strings method=GET found=false], or found can be true.
Without knowing the structure of your data it's hard to say exactly what you need to do, but:
A pie chart is a single data series, so you need to use a transforming command to generate a single series. PieChart Doc
If you have a field that denotes a hit or miss (you could use an eval statement to create one if you don't already have it), you can use it to create the single series like this.
Let's say this field is called result.
| stats count by result
Here is a link to the documentation for the Eval Command
Good luck, hope you can get the results you're looking for.
Since you seem to be concerned only about whether "found" equals either "hit" or "miss", try this:
index=app (splunk_server_group=bex OR splunk_server_group=default) sourcetype=rpm-web* host=rpm-web* "CACHE_NAME=RATE_SHOPPER" method=GET found IN("hit","miss")
| stats count by found
Pie charts require a single field so it's not possible to graph the Hit and Miss fields in a pie. However, if the two fields are combined into one field with two possible values, then it will work.
index=app (splunk_server_group=bex OR splunk_server_group=default) sourcetype=rpm-web* host=rpm-web* "CACHE_NAME=RATE_SHOPPER" method = GET
| eval result=if(searchmatch("found=true"), "Hit", "Miss")
| stats count by result

How can I put several extracted values from a Json in an array in Kusto?

I'm trying to write a query that returns the vulnerabilities found by "Built-in Qualys vulnerability assessment" in log analytics.
It was all going smoothly: I was getting the values from the properties JSON and turning them into separate strings, but I found out that some of the properties have more than one value, and I need to get all of them in a single cell.
My query looks like this right now:
securityresources | where type =~ "microsoft.security/assessments/subassessments"
| extend assessmentKey=extract(@"(?i)providers/Microsoft.Security/assessments/([^/]*)", 1, id), IdAzure=tostring(properties.id)
| extend IdRecurso = tostring(properties.resourceDetails.id)
| extend NomeVulnerabilidade=tostring(properties.displayName),
Correcao=tostring(properties.remediation),
Categoria=tostring(properties.category),
Impacto=tostring(properties.impact),
Ameaca=tostring(properties.additionalData.threat),
severidade=tostring(properties.status.severity),
status=tostring(properties.status.code),
Referencia=tostring(properties.additionalData.vendorReferences[0].link),
CVE=tostring(properties.additionalData.cve[0].link)
| where assessmentKey == "1195afff-c881-495e-9bc5-1486211ae03f"
| where status == "Unhealthy"
| project IdRecurso, IdAzure, NomeVulnerabilidade, severidade, Categoria, CVE, Referencia, status, Impacto, Ameaca, Correcao
Ignore the awkward names of the columns, for they are in Portuguese.
As you can see in the "Referencia" and "CVE" columns, I'm able to extract the value at a specific index of the array, but I want all the links in the whole array.
Without sample input and expected output it's hard to understand what you need, so trying to guess here...
I think that summarize make_list(...) by ... will help you (see this to learn how to use make_list)
If this is not what you're looking for, please delete the question, and post a new one with minimal sample input (using datatable operator), and expected output, and we'll gladly help.

Matplotlib - Draw H and V line by specifying X or Y value on a plot

I was wondering today how to find a specific value on a plot and draw the corresponding lines to the axes. I used to do that with an old chart library, and I was wondering whether this functionality exists but I just don't know how to find it.
The result should look like this: https://miro.medium.com/max/1070/1*Ckhi9soE9Lx2lIf9tPVLMQ.png
To provide some context, I'm doing a PCA over my data, and I would like to point out some thresholds at 97.5, 99 and 99.5% of cumulative explained variance.
Have a great day!
EDIT:
See Answer
As solved by ImportanceOfBeingErnest, here is the code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

whole_pca = PCA().fit(np.array(inputs['Scale'].tolist()))
# cumulative explained variance, in percent
cumul = np.cumsum(np.round(whole_pca.explained_variance_ratio_, decimals=3)*100)
# index of the first component that pushes the cumulative variance past each threshold
over_95 = np.argmax(cumul>95)
over_99 = np.argmax(cumul>99)
over_995 = np.argmax(cumul>99.5)
plt.plot(cumul)
# horizontal segment to the threshold, then vertical segment down to the x axis
plt.plot([0,over_95,over_95], [95,95,0])
plt.plot([0,over_99,over_99], [99,99,0])
plt.plot([0,over_995,over_995], [99.5,99.5,0])
plt.xlim(left=0)
plt.ylim(bottom=80)
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
Results in:
Thank you!
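As a side note, matplotlib also provides hlines/vlines (and axhline/axvline for lines spanning the whole axes), which avoid building the segment lists by hand. A minimal sketch with made-up cumulative-variance numbers:

import numpy as np
import matplotlib.pyplot as plt

# made-up cumulative explained variance, standing in for the PCA result above
cumul = np.array([62.0, 78.0, 88.0, 93.0, 96.5, 98.2, 99.1, 99.6, 100.0])

plt.plot(cumul)
for threshold in (95, 99, 99.5):
    x = np.argmax(cumul > threshold)  # first component past the threshold
    plt.hlines(threshold, xmin=0, xmax=x, linestyles='dashed')
    plt.vlines(x, ymin=cumul.min(), ymax=threshold, linestyles='dashed')
plt.xlim(left=0)
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.show()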

Group Pandas: How to concat or merge/join/append two csv files with same index but different extensions in grouped data?

I'd like to concat or merge or append/join two csv files that share the same ID index but contain different columns for each ID. The data are grouped by ID as well. The 1st file looks like this:
ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
and the 2nd file looks like this:
ID,year,from1,to1
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15825,16523
810033622,2000,13211,14876
810033622,2001,14761,14987
I have set ID as the index for both files after reading them into dataframes, and then concatenated them together, but I get the error message "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)".
I've tried the following codes:
import pandas as pd

sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join='outer', sort=False, axis=1)
I've also tried concatenating the two files without setting an index, and the result seemed to have the correct rows, but the values were not aligned correctly across the columns.
I've also tried merge, but it produced many unnecessary duplicated rows within each group. Since each group has different year and age values, I found it quite difficult to delete those newly generated rows with this method.
full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)
Maybe somebody can suggest an easy way to delete these duplicates; that would also work for me, because the merged file became huge! Thanks in advance!
The expected result would include all the different values from both csv files. It is somewhat like this:
ID,year,age,from1,to1
810006862,2000,49,,
810006862,2001,,,
810006862,2002,,15341,15705
810006862,2003,52,15706,16070
810006862,2004,,16071,16436
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15825,16523
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,13211,14876
810033622,2001,25,14761,14987
I've searched online for similar posts for quite some time, but I have been unable to solve my problem. Can anybody offer any clue how to do this? Thanks a lot!
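For what it's worth, the expected output above looks like an outer merge on both ID and year rather than on ID alone, which is what produces the duplicated rows. A minimal sketch, assuming the two files are named sp1.csv and op1.csv as in the attempts above:

import pandas as pd

sp = pd.read_csv('sp1.csv')  # columns: ID, year, age
op = pd.read_csv('op1.csv')  # columns: ID, year, from1, to1

# merging on both keys keeps one row per (ID, year) pair,
# so no duplicated rows are generated within each group
full = pd.merge(sp, op, on=['ID', 'year'], how='outer', sort=False)
full = full.sort_values(['ID', 'year'])
print(full)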

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried a couple of methods, unsuccessfully:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
    print(vals)
Relevant documentation here.
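If you'd rather not guess an upper bound, passing None removes the row limit entirely; a small sketch (be careful with very long Series):

import pandas as pd

# temporarily lift the row limit completely; it is restored when the block exits
with pd.option_context('display.max_rows', None):
    print(df.col1.value_counts())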
I think you need option_context and to set it to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())