matplotlib strings as labels on x axis - matplotlib

I am building a small tool for data analysis and I have come to the point, where I have to plot the prepared data. The code before this produces the following two lists with equal length.
t11 = ['00', '01', '02', '03', '04', '05', '10', '11', '12', '13', '14', '15', '20', '21', '22', '23', '24', '25', '30', '31', '32', '33', '34', '35', '40', '41', '42', '43', '44', '45', '50', '51', '52', '53', '54', '55']
t12 = [173, 135, 141, 148, 140, 149, 152, 178, 135, 96, 109, 164, 137, 152, 172, 149, 93, 78, 116, 81, 149, 202, 172, 99, 134, 85, 104, 172, 177, 150, 130, 131, 111, 99, 143, 194]
Based on this, I want to built a histogram with matplotlib.plt.hist. However, there are a couple of problems:
1. t11[x] and t12[x] are connected for all x. Where t11[x] is actually a string. It represents a certain detector combination. For example: '01' tells that the detection was made in the 0th segment of the first detector, and 1st segment of the second detector. My goal is to have each entry from t11 as a labeled point on the x axis. The t12 entry is going to define the hight of the bar above the t11 entry (on a logarithmic y axis)
How does one configure such an x axis?
2. This is all very new to me. I could not find anything related in the documentation. Most probably because I did not know what to search for. SO: Is there an "official" name for what I am trying to achieve. This would also help me alot.

Use the xticks command.
import matplotlib.pyplot as plt
t11 = ['00', '01', '02', '03', '04', '05', '10', '11', '12', '13', '14', '15',
'20', '21', '22', '23', '24', '25', '30', '31', '32', '33', '34', '35',
'40', '41', '42', '43', '44', '45', '50', '51', '52', '53', '54', '55']
t12 = [173, 135, 141, 148, 140, 149, 152, 178, 135, 96, 109, 164, 137, 152,
172, 149, 93, 78, 116, 81, 149, 202, 172, 99, 134, 85, 104, 172, 177,
150, 130, 131, 111, 99, 143, 194]
plt.bar(range(len(t12)), t12, align='center')
plt.xticks(range(len(t11)), t11, size='small')
plt.show()

For the object oriented API of matplotlib one can plot custom text on the x-ticks of an axis with following code:
x = np.arange(2,10,2)
y = x.copy()
x_ticks_labels = ['jan','feb','mar','apr']
fig, ax = plt.subplots(1,1)
ax.plot(x,y)
# Set number of ticks for x-axis
ax.set_xticks(x)
# Set ticks labels for x-axis
ax.set_xticklabels(x_ticks_labels, rotation='vertical', fontsize=18)

In matplotlib lingo, you are looking for a way to set custom ticks.
It seems you cannot achieve this with the pyplot.hist shortcut. You will need to build your image step by step instead. There is already an answer here on Stack Overflow to question which is very similar to yours and should get you started: Matplotlib - label each bin

Firstly you need to access the axis object:
fig, ax = plt.subplots(1,1)
then:
ax.set_yticks([x for x in range(-10,11)])
ax.set_yticklabels( ['{0:2d}'.format(abs(x)) for x in range(-10, 11)])

Related

Sorting Pandas dataframe by multiple conditions

I have a large dataframe (thousands of rows by hundreds of columns), a short excerpt is as the following:
data = {'Step':['', '', '', 'First', 'First', 'Second', 'Third', 'Second', 'First', 'Second', 'First', 'First', 'Second', 'Second'],
'Stuff':['tot', 'white', 'random', 7583, 3563, 824, 521, 7658, 2045, 33, 9823, 5, 8090, 51],
'Mark':['marking', '', '', 1, 5, 5, 5, 1, 27, 27, 1, 6, 1, 9],
'A':['item_a', 100, 'st1', 142, 2, 2, 2, 100, 150, 105, 118, 118, 162, 156],
'B':['skill', 66, 'abc', 160, 2, 130, 140, 169, 1, 2, 130, 140, 144, 127],
'C':['item', 50, 'st1', 2000, 2, 65, 2001, 1999, 1, 2, 2000, 4, 2205, 2222],
'D':['item_c', 100, 'st1', 433, 430, 150, 170, 130, 1, 2, 300, 4, 291, 606],
'E':['test', 90, 'st1', 111, 130, 5, 10, 160, 1, 2, 232, 4, 144, 113],
'F':['done', 80, 'abc', 765, 755, 5, 10, 160, 1, 2, 733, 4, 666, 500],
'G':['nd', 90, 'mag', 500, 420, 5, 10, 160, 1, 2, 300, 4, 469, 500],
'H':['prt', 100, 'st1', 999, 200, 5, 10, 160, 1, 2, 477, 4, 620, 7],
'Name':['NS', '', '', "Pat", "Lucy", "Lucy", "Lucy", "Nick", "Kirk", "Kirk", "Joe", "Nico", "Nico", "Bryan"],
'Value':[ -1, 0, 0, 0, 3, 6, 5, 0, 7, 7, 0, 6, 0, 1]}
df = pd.DataFrame(data)
I need to sort this dataframe according to the following conditions that have to be satisfied all together:
In the "Name" column, names that are the same are to remain grouped (e.g. there are 3
records of "Lucy" next to each other, and they cannot be moved apart)
For each group of names, the appearance order has to remain the one
given by the "Step" column (e.g. the first appearance of "Lucy" is
related to the value "First" in the "Step" column, the second to
"Second" and so on)
All the remaining names that in the "Value" column have a value = 0,
have to be moved below the others (e.g. "Pat" can be moved after the
others, but not "Nico" because there are two records of "Nico" and
the other one has a value = 6)
The first three rows cannot be moved
What I have done is to concatenate different sub-dataframes:
df_groupnames=df[df.duplicated(subset=['Name'], keep=False)]
df_nogroup = df[~df.duplicated(subset=['Name'], keep=False)]
df_nogroup_high = df_nogroup[df_nogroup["Value"] > 0 ]
df_nogroup_null = df_nogroup[df_nogroup["Value"] == 0]
# Let's concatenate these dataframes to get the sorted one
df_sorted = pd.concat([df_groupnames, df_nogroup_high, df_nogroup_null])
It works, but I wonder if there's a smarter, simpler way, and maybe faster, to obtain the same.
Thank you for your attention.

Make a color legend in Plotly graph objects

I have the two following codes with their output. One is done in graph objects and the other using plotly express. As you can see, the one in ‘go’ doesn’t have a legend, and the one in ‘px’ doesn’t have individual column width. So how can I either get a legend for the first one, or fix the width in the other?
import plotly.graph_objects as go
import pandas as pd
df = pd.DataFrame({'PHA': [451, 149, 174, 128, 181, 175, 184, 545, 131, 106, 1780, 131, 344, 624, 236, 224, 178, 277, 141, 171, 164, 410],
'PHA_cum': [451, 600, 774, 902, 1083, 1258, 1442, 1987, 2118, 2224, 4004, 4135, 4479, 5103, 5339, 5563, 5741, 6018, 6159, 6330, 6494, 6904],
'trans_cost_cum': [0.14, 0.36, 0.6, 0.99, 1.4, 2.07, 2.76, 3.56, 4.01, 4.5, 5.05, 5.82, 5.97, 6.13, 6.33, 6.53, 6.65, 6.77, 6.9, 7.03, 7.45, 7.9],
'Province': ['East', 'East', 'East', 'East', 'East', 'Lapland', 'Lapland', 'Lapland', 'Oulu', 'Oulu', 'Oulu', 'Oulu', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West'],
})
col_list = {'South': 'rgb(222,203,228)',
'West': 'rgb(204,235,197)',
'East': 'rgb(255,255,204)',
'Oulu': 'rgb(179,205,227)',
'Lapland': 'rgb(254,217,166)'}
provs = df['Province'].to_list()
colors = [col_list.get(item, item) for item in provs]
fig = go.Figure(data=[go.Bar(
x=df['PHA_cum']-df['PHA']/2,
y=df['trans_cost_cum'],
width=df['PHA'],
marker_color=colors
)])
fig.show()
import plotly.express as px
import pandas as pd
df = pd.DataFrame({'PHA': [451, 149, 174, 128, 181, 175, 184, 545, 131, 106, 1780, 131, 344, 624, 236, 224, 178, 277, 141, 171, 164, 410],
'PHA_cum': [451, 600, 774, 902, 1083, 1258, 1442, 1987, 2118, 2224, 4004, 4135, 4479, 5103, 5339, 5563, 5741, 6018, 6159, 6330, 6494, 6904],
'trans_cost_cum': [0.14, 0.36, 0.6, 0.99, 1.4, 2.07, 2.76, 3.56, 4.01, 4.5, 5.05, 5.82, 5.97, 6.13, 6.33, 6.53, 6.65, 6.77, 6.9, 7.03, 7.45, 7.9],
'Province': ['East', 'East', 'East', 'East', 'East', 'Lapland', 'Lapland', 'Lapland', 'Oulu', 'Oulu', 'Oulu', 'Oulu', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West'],
})
fig = px.bar(df,
x=df['PHA_cum']-df['PHA']/2,
y=df['trans_cost_cum'],
color="Province",
color_discrete_sequence=px.colors.qualitative.Pastel1
)
fig.show()
Using graph_objects you'll need to pass in each Province as a trace in order for the legend to populate. See below, the only real change is looping through the data per Province.
df = pd.DataFrame({'PHA': [451, 149, 174, 128, 181, 175, 184, 545, 131, 106, 1780, 131, 344, 624, 236, 224, 178, 277, 141, 171, 164, 410],
'PHA_cum': [451, 600, 774, 902, 1083, 1258, 1442, 1987, 2118, 2224, 4004, 4135, 4479, 5103, 5339, 5563, 5741, 6018, 6159, 6330, 6494, 6904],
'trans_cost_cum': [0.14, 0.36, 0.6, 0.99, 1.4, 2.07, 2.76, 3.56, 4.01, 4.5, 5.05, 5.82, 5.97, 6.13, 6.33, 6.53, 6.65, 6.77, 6.9, 7.03, 7.45, 7.9],
'Province': ['East', 'East', 'East', 'East', 'East', 'Lapland', 'Lapland', 'Lapland', 'Oulu', 'Oulu', 'Oulu', 'Oulu', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West'],
})
col_list = {'South': 'rgb(222,203,228)',
'West': 'rgb(204,235,197)',
'East': 'rgb(255,255,204)',
'Oulu': 'rgb(179,205,227)',
'Lapland': 'rgb(254,217,166)'}
#provs = df['Province'].to_list()
#colors = [col_list.get(item, item) for item in provs]
fig = go.Figure()
for p in df['Province'].unique():
dat = df[df.Province == p]
fig.add_trace(go.Bar(
name = p,
x=dat['PHA_cum']-dat['PHA']/2,
y=dat['trans_cost_cum'],
width=dat['PHA'],
marker_color= col_list[p]
))
fig.show()

Huggingface's BERT tokenizer not adding pad token

It's not entirely clear from the documentation, but I can see that BertTokenizer is initialised with pad_token='[PAD]', so I assume when you encode with add_special_tokens=True then it would automatically pad it. Given that pad_token_id=0, I can't see any 0s in the token_ids however:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text, add_special_tokens=True, max_length=2048)
# Print the original sentence.
print('Original: ', text)
# Print the sentence split into tokens.
print('\nTokenized: ', tokens)
# Print the sentence mapped to token ids.
print('\nToken IDs: ', token_ids)
Output:
Original: Toronto's key stock index ended higher in brisk trading on Thursday, extending Wednesday's rally despite being weighed down by losses on Wall Street.
The TSE 300 Composite Index rose 29.80 points to close at 5828.62, outperforming the Dow Jones Industrial Average which slumped 21.27 points to finish at 6658.60.
Toronto added to Wednesday's 55-point rally while investors took profits in New York after the Dow's 92-point gains, said MMS International analyst Katherine Beattie.
"That shows that the markets are very fragile," Beattie said. "They (investors) want to take advantage of any strength to sell," she said.
Toronto was also buoyed by its heavyweight gold group which jumped nearly 2.2 percent, aided by firmer COMEX gold prices. The key June contract rose $1.00 to $344.30.
Ten of Toronto's 14 sub-indices posted gains, led by golds, transportation, forestry products and consumer products.
The weak side included conglomerates, base metals and utilities.
Trading was heavy at 100 million shares worth C$1.54 billion ($1.1 billion).
Advancing stocks outnumbered declines 556 to 395, with 276 issues flat.
Among hot stocks, Bre-X Minerals Ltd. rose 0.13 to 2.30 on 5.0 million shares as investors continued to consider the viability of its Busang gold discovery in Indonesia.
Kenting Energy Services Inc. rose 0.25 to 9.05 after Precision Drilling Corp. amended its takeover offer
Bakery and foodstuffs maker George Weston Ltd. jumped 4.50 to close at 74.50, the TSE's top gainer.
Tokenized: ['toronto', "'", 's', 'key', 'stock', 'index', 'ended', 'higher', 'in', 'brisk', 'trading', 'on', 'thursday', ',', 'extending', 'wednesday', "'", 's', 'rally', 'despite', 'being', 'weighed', 'down', 'by', 'losses', 'on', 'wall', 'street', '.', 'the', 'ts', '##e', '300', 'composite', 'index', 'rose', '29', '.', '80', 'points', 'to', 'close', 'at', '58', '##28', '.', '62', ',', 'out', '##per', '##form', '##ing', 'the', 'dow', 'jones', 'industrial', 'average', 'which', 'slumped', '21', '.', '27', 'points', 'to', 'finish', 'at', '66', '##58', '.', '60', '.', 'toronto', 'added', 'to', 'wednesday', "'", 's', '55', '-', 'point', 'rally', 'while', 'investors', 'took', 'profits', 'in', 'new', 'york', 'after', 'the', 'dow', "'", 's', '92', '-', 'point', 'gains', ',', 'said', 'mm', '##s', 'international', 'analyst', 'katherine', 'beat', '##tie', '.', '"', 'that', 'shows', 'that', 'the', 'markets', 'are', 'very', 'fragile', ',', '"', 'beat', '##tie', 'said', '.', '"', 'they', '(', 'investors', ')', 'want', 'to', 'take', 'advantage', 'of', 'any', 'strength', 'to', 'sell', ',', '"', 'she', 'said', '.', 'toronto', 'was', 'also', 'bu', '##oy', '##ed', 'by', 'its', 'heavyweight', 'gold', 'group', 'which', 'jumped', 'nearly', '2', '.', '2', 'percent', ',', 'aided', 'by', 'firm', '##er', 'come', '##x', 'gold', 'prices', '.', 'the', 'key', 'june', 'contract', 'rose', '$', '1', '.', '00', 'to', '$', '344', '.', '30', '.', 'ten', 'of', 'toronto', "'", 's', '14', 'sub', '-', 'indices', 'posted', 'gains', ',', 'led', 'by', 'gold', '##s', ',', 'transportation', ',', 'forestry', 'products', 'and', 'consumer', 'products', '.', 'the', 'weak', 'side', 'included', 'conglomerate', '##s', ',', 'base', 'metals', 'and', 'utilities', '.', 'trading', 'was', 'heavy', 'at', '100', 'million', 'shares', 'worth', 'c', '$', '1', '.', '54', 'billion', '(', '$', '1', '.', '1', 'billion', ')', '.', 'advancing', 'stocks', 'outnumbered', 'declines', '55', '##6', 'to', '395', ',', 'with', '276', 'issues', 'flat', '.', 'among', 'hot', 'stocks', ',', 'br', '##e', '-', 'x', 'minerals', 'ltd', '.', 'rose', '0', '.', '13', 'to', '2', '.', '30', 'on', '5', '.', '0', 'million', 'shares', 'as', 'investors', 'continued', 'to', 'consider', 'the', 'via', '##bility', 'of', 'its', 'bus', '##ang', 'gold', 'discovery', 'in', 'indonesia', '.', 'kent', '##ing', 'energy', 'services', 'inc', '.', 'rose', '0', '.', '25', 'to', '9', '.', '05', 'after', 'precision', 'drilling', 'corp', '.', 'amended', 'its', 'takeover', 'offer', 'bakery', 'and', 'foods', '##tu', '##ffs', 'maker', 'george', 'weston', 'ltd', '.', 'jumped', '4', '.', '50', 'to', 'close', 'at', '74', '.', '50', ',', 'the', 'ts', '##e', "'", 's', 'top', 'gain', '##er', '.']
Token IDs: [101, 4361, 1005, 1055, 3145, 4518, 5950, 3092, 3020, 1999, 28022, 6202, 2006, 9432, 1010, 8402, 9317, 1005, 1055, 8320, 2750, 2108, 12781, 2091, 2011, 6409, 2006, 2813, 2395, 1012, 1996, 24529, 2063, 3998, 12490, 5950, 3123, 2756, 1012, 3770, 2685, 2000, 2485, 2012, 5388, 22407, 1012, 5786, 1010, 2041, 4842, 14192, 2075, 1996, 23268, 3557, 3919, 2779, 2029, 14319, 2538, 1012, 2676, 2685, 2000, 3926, 2012, 5764, 27814, 1012, 3438, 1012, 4361, 2794, 2000, 9317, 1005, 1055, 4583, 1011, 2391, 8320, 2096, 9387, 2165, 11372, 1999, 2047, 2259, 2044, 1996, 23268, 1005, 1055, 6227, 1011, 2391, 12154, 1010, 2056, 3461, 2015, 2248, 12941, 9477, 3786, 9515, 1012, 1000, 2008, 3065, 2008, 1996, 6089, 2024, 2200, 13072, 1010, 1000, 3786, 9515, 2056, 1012, 1000, 2027, 1006, 9387, 1007, 2215, 2000, 2202, 5056, 1997, 2151, 3997, 2000, 5271, 1010, 1000, 2016, 2056, 1012, 4361, 2001, 2036, 20934, 6977, 2098, 2011, 2049, 8366, 2751, 2177, 2029, 5598, 3053, 1016, 1012, 1016, 3867, 1010, 11553, 2011, 3813, 2121, 2272, 2595, 2751, 7597, 1012, 1996, 3145, 2238, 3206, 3123, 1002, 1015, 1012, 4002, 2000, 1002, 29386, 1012, 2382, 1012, 2702, 1997, 4361, 1005, 1055, 2403, 4942, 1011, 29299, 6866, 12154, 1010, 2419, 2011, 2751, 2015, 1010, 5193, 1010, 13116, 3688, 1998, 7325, 3688, 1012, 1996, 5410, 2217, 2443, 22453, 2015, 1010, 2918, 11970, 1998, 16548, 1012, 6202, 2001, 3082, 2012, 2531, 2454, 6661, 4276, 1039, 1002, 1015, 1012, 5139, 4551, 1006, 1002, 1015, 1012, 1015, 4551, 1007, 1012, 10787, 15768, 21943, 26451, 4583, 2575, 2000, 24673, 1010, 2007, 25113, 3314, 4257, 1012, 2426, 2980, 15768, 1010, 7987, 2063, 1011, 1060, 13246, 5183, 1012, 3123, 1014, 1012, 2410, 2000, 1016, 1012, 2382, 2006, 1019, 1012, 1014, 2454, 6661, 2004, 9387, 2506, 2000, 5136, 1996, 3081, 8553, 1997, 2049, 3902, 5654, 2751, 5456, 1999, 6239, 1012, 5982, 2075, 2943, 2578, 4297, 1012, 3123, 1014, 1012, 2423, 2000, 1023, 1012, 5709, 2044, 11718, 15827, 13058, 1012, 13266, 2049, 15336, 3749, 18112, 1998, 9440, 8525, 21807, 9338, 2577, 12755, 5183, 1012, 5598, 1018, 1012, 2753, 2000, 2485, 2012, 6356, 1012, 2753, 1010, 1996, 24529, 2063, 1005, 1055, 2327, 5114, 2121, 1012, 102]
No, it would not. There is a different parameter to allow padding:
transformers >=3.0.0 padding (accepts True, max_length and False as values)
transformers < 3.0.0 pad_to_max_length (accepts True or False as Values)
add_special_tokens will add the [CLS] and the [SEP] token (101 and 102 respectively).

plt.bar -> TypeError: cannot concatenate 'str' and 'float' objects

I have a variable x_axis that represents a numpy array:
array(['administrator', 'retired', 'lawyer', 'none', 'student',
'technician', 'programmer', 'salesman', 'homemaker', 'executive',
'doctor', 'entertainment', 'marketing', 'writer', 'scientist',
'educator', 'healthcare', 'librarian', 'artist', 'other', 'engineer'],
dtype='|S13')
... and my y_axis looks like this:
array([ 79, 14, 12, 9, 196, 27, 66, 12, 7, 32, 7, 18, 26,
45, 31, 95, 16, 51, 28, 105, 67])
When I try to plot them:
import matplotlib.pyplot as plt
plt.bar(x_axis,y_axis)
I receive the error:
TypeError: cannot concatenate 'str' and 'float' objects
Note:
I've seen 'similar' questions, but not specifically asking about this error in reference to matplotlib.bar.
That is because bar needs x-coordinates, but your x_axis is an array of strings. So, bar does not know where to plot the bars. What you need is the following:
import numpy as np
import matplotlib.pyplot as plt
y_axis = np.array([ 79, 14, 12, 9, 196, 27, 66, 12, 7, 32, 7, 18, 26,
45, 31, 95, 16, 51, 28, 105, 67])
x_labels = np.array(['administrator', 'retired', 'lawyer', 'none', 'student',
'technician', 'programmer', 'salesman', 'homemaker', 'executive',
'doctor', 'entertainment', 'marketing', 'writer', 'scientist',
'educator', 'healthcare', 'librarian', 'artist', 'other', 'engineer'],
dtype='|S13')
w = 3
nitems = len(y_axis)
x_axis = np.arange(0, nitems*w, w) # set up a array of x-coordinates
fig, ax = plt.subplots(1)
ax.bar(x_axis, y_axis, width=w, align='center')
ax.set_xticks(x_axis);
ax.set_xticklabels(x_labels, rotation=90);
plt.show()

Selecting ID in a certain order specified within IN operator

I would like to try to select a certain set of numbers in a particular order, for use with loops.
SELECT ID
FROM filter
WHERE id in (87, 97, 117, 52, 240, 76, 141, 137, 157, 255, 186, 196, 133,
175, 153, 224, 59, 205, 65, 47, 105, 80, 113, 293, 161, 145,
192, 149, 231, 91, 101, 109, 215, 121, 125, 64, 41, 291, 367,
388, 391, 462, 467)
Doing this returns results sorted by ID, rather than in the order I specified. In most other similar questions a preferred answer was using CASE for particular entries, but what about selecting hundreds of records in a predetermined order?
If you have hundreds of items, then use a derived table, such as:
select f.id
from filter f join
(values(1, 87), (2, 97), (3, 117), . . .) as v(ord, id)
on f.id = v.id
order by ord;