Subclassing dataframe to store modified strings - pandas

I want to subclass a dataframe such that when I do say
df.loc[index,column] = 'green'
the corresponding value specified by [index,column] takes on the value
'background-color: green'
as an example.
Basically I have a lot of custom formatting to do to my dataframes before I send them to excel and since the styler requires storing the styles in dataframes with these strings, I want to create a subclassed dataframe for the styles I am applying to make storing them easier.
I hope my explanation makes sense.
The conditional formatting is no good for me, as a lot of the colors I am getting are being pulled from existing html tables that I have to do a lot of web-scraping to obtain.
Edit: The way the pandas styler object works is that you write a function which takes the df you wish to style as input and returns another dataframe with the same index and columns, holding the style options.
For example I have dataframe
df
              A  B  C   D
first second
A1    2       1  3  2   4
A2    2       5  2  2   3
A3    1       3  4  7   6
A4    4       4  3  3  12
A5    6       2  4  4   7
For the values in df that equal 2, I want to make their background color blue when exporting this df to excel.
So I'll make a function
def make_2_blue(df):
    background_color = pd.DataFrame('', columns=df.columns, index=df.index)
    background_color[df == 2] = 'background-color: blue'
    return background_color
Then when I want to export this to excel I'd do
df.style.apply(make_2_blue, axis=None).to_excel('blah.xlsx', engine='openpyxl')
Now in my example I am webscraping html to get the individual cells background color.
What I want to do is subclass a dataframe to make applying these formats easier.
Essentially I would like to create some styler-dataframe where I specify the style, and then when I assign to individual values in the df it will take in my string and apply the desired formatting.
So lets say I make a background-color dataframe class. I'll make an instance of it called background_color_df.
background_color_df = pd.DataFrame('',index=df.index,columns=df.columns)
>>> background_color_df
A B C \
first second
A1 2 background-color: background-color: background-color:
A2 2 background-color: background-color: background-color:
A3 1 background-color: background-color: background-color:
A4 4 background-color: background-color: background-color:
A5 6 background-color: background-color: background-color:
D
first second
A1 2 background-color:
A2 2 background-color:
A3 1 background-color:
A4 4 background-color:
A5 6 background-color:
Now I go to assign the color blue at a specific index and column
background_color_df.loc['A1','A'] = 'blue'
background_color_df
A B C \
first second
A1 2 background-color: blue background-color: background-color:
A2 2 background-color: background-color: background-color:
A3 1 background-color: background-color: background-color:
A4 4 background-color: background-color: background-color:
A5 6 background-color: background-color: background-color:
D
first second
A1 2 background-color:
A2 2 background-color:
A3 1 background-color:
A4 4 background-color:
A5 6 background-color:

If I got your question right, you have data like this:
color foo
background green bar
foreground red baz
That is, a dataframe with attributes for some kind of category like foreground and background and you need them combined like:
color foo
background background-color: green background-foo: bar
foreground foreground-color: red foreground-foo: baz
Please ignore the column names and index names here, as you could change them easily.
If that is what you need, you don't need to subclass DataFrame, which would surely be a lot of work. You could just postprocess your dataframe so it looks like what you want, e.g. by doing:
import pandas as pd
# example data
df = pd.DataFrame({'color': ['green', 'red'], 'foo': ['bar', 'baz']}, index=['background', 'foreground'])
columns_to_transform = ['color', 'foo']  # maybe even df.columns if you need to transform all columns
for col in columns_to_transform:
    df[col] = df.index + '-' + col + ': ' + df[col]
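If the goal is only to avoid typing the 'background-color: ' prefix for every cell, a small helper may be enough instead of a subclass. The following is a minimal sketch and not part of the answer above: `to_css` and the `colors` frame are hypothetical names, and an empty string is assumed to mean "no style for this cell".

```python
import pandas as pd

def to_css(colors: pd.DataFrame, prop: str = "background-color") -> pd.DataFrame:
    """Prefix every non-empty cell with a CSS property name (hypothetical helper)."""
    return colors.apply(lambda col: col.map(lambda v: f"{prop}: {v}" if v else ""))

# Plain color names, e.g. as scraped from an HTML table; '' means "leave unstyled".
colors = pd.DataFrame("", index=["A1", "A2"], columns=["A", "B"])
colors.loc["A1", "A"] = "blue"

styles = to_css(colors)
print(styles.loc["A1", "A"])  # background-color: blue
```

The result keeps the index and columns of `colors`, so, provided `colors` is aligned with the data frame being styled, it could be handed to the styler with something like df.style.apply(lambda _: to_css(colors), axis=None).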


How can I make this pandas code faster, using a dask dataframe or vectorization?

import pandas as pd
# list of name, degree, score
label1 = ["a1", "a1", "a1","a1", "a2","a2","a2","a2", "b1","b1","b1","b1", "b2","b2","b2","b2"]
label2 = ["a1", "a2", "b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2"]
m1 = [ 0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]
# dictionary of lists
data = {'label1': label1, 'label2': label2, 'm1': m1}  # avoid shadowing the built-in dict
df = pd.DataFrame(data)
df
output of this dataframe:
label1 label2 m1
0 a1 a1 0
1 a1 a2 3
2 a1 b1 2
3 a1 b2 7
4 a2 a1 3
5 a2 a2 0
6 a2 b1 5
7 a2 b2 8
8 b1 a1 2
9 b1 a2 5
10 b1 b1 0
11 b1 b2 9
12 b2 a1 7
13 b2 a2 8
14 b2 b1 9
15 b2 b2 0
I want to write a function that takes two strings, samp1 (e.g. a) and samp2 (e.g. b), and a data frame df as input. The two strings are first preprocessed into the labels that appear in the data frame. Then we look up the indices of particular rows (like (a1,b1) or (a2,b2)) to get their corresponding 'm1' values. Next, we add those m1 values, store the sums in two variables, and return the minimum of the two. [Looking at the code snippet may make this easier to understand.]
The following is the code for this function:
def min_4line(samp1, samp2, df):
    k = ['1', '2']
    # k and samp together generate the numbered labels,
    # for example 'a' and 'b' become a1, a2, b1, b2
    samp1_1 = samp1 + k[0]  # a1
    samp1_2 = samp1 + k[1]  # a2
    samp2_1 = samp2 + k[0]  # b1
    samp2_2 = samp2 + k[1]  # b2
    # To compute comb1 we need the indices of particular rows:
    # comb1 sums m1 at (a1,b1) [located at ind1] and (a2,b2) [located at ind2];
    # comb2 does the same for (a2,b1) and (a1,b2)
    ind1 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_1)].tolist()
    ind2 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_2)].tolist()
    # print(ind1)  # [2]
    # print(ind2)  # [7]
    # .iloc[0] extracts the scalar; calling int() on a one-element Series is deprecated
    comb1 = int(df.loc[ind1, 'm1'].iloc[0]) + int(df.loc[ind2, 'm1'].iloc[0])
    # print('comb1: ', comb1)  # comb1: 10
    ind3 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_1)].tolist()
    ind4 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_2)].tolist()
    # print(ind3)  # [6]
    # print(ind4)  # [3]
    comb2 = int(df.loc[ind3, 'm1'].iloc[0]) + int(df.loc[ind4, 'm1'].iloc[0])
    # print('comb2: ', comb2)  # comb2: 12
    return min(comb1, comb2)  # 10
To append unique char like a,b from the dataframe we need to do a list operation:
#this list is needed so that I can compare how many unique values are there...
#it could get a,b,c,d.... and make comparison
#like (a,b), (a,c),(a,d), (b,c),(b,d),(c,d) for the function
list_line=list(df['label1'].unique())
string_test=[a[:-1] for a in list_line]
#string_test will exclude number portion of character
list_img=sorted(list(set(string_test)))
#print(list_img)#['a', 'b']
#print(len(list_img))#2
Now we need to create a data frame by going over 'list_img' and calling the min_4line function for each pair, such as (a,b) or (a,c), storing the corresponding output of the function. A nested loop is necessary here: if the list is [a,b,c,d], it goes through (a,b),(a,c),(a,d),(b,c),(b,d),(c,d), so that each unique pair is visited once. The code for this is:
%%time
d = []
for i in range(len(list_img)):
    for j in range(i+1, len(list_img)):
        a = min_4line(list_img[i], list_img[j], df)
        print(a)
        d.append({'label1': str(list_img[i]), 'label2': str(list_img[j]), 'metric': str(a)})
dataf = pd.DataFrame(d)
dataf.head(5)
output is:
  label1 label2 metric
0 a b 10
Is there any way to make the code faster? I broke the problem down into small parts. This operation is needed for 16 million rows, and I am interested in using dask for it. When I asked this type of question previously, many people failed to understand it because I was not able to state the problem clearly; I hope this time it is in an easier format. You can copy the code cells and run them in a Jupyter notebook to check the output and suggest a good way to make the program faster.
[updated]
Can anyone suggest, how can I get those particular indices of those rows using numpy or any kind of vectorized operation?
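One possible direction, offered only as a sketch: each call to min_4line scans the whole frame four times with boolean masks, so building a single lookup table keyed by (label1, label2) once and reading the four values from it may already help a lot. This is a lookup-table rewrite rather than a true NumPy vectorization or a dask solution; `pairwise_min` is a hypothetical name, and the code assumes every (label1, label2) pair occurs exactly once and that labels end in a single digit, 1 or 2.

```python
from itertools import combinations

import pandas as pd

label1 = ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2",
          "b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]
label2 = ["a1", "a2", "b1", "b2", "a1", "a2", "b1", "b2",
          "a1", "a2", "b1", "b2", "a1", "a2", "b1", "b2"]
m1 = [0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]
df = pd.DataFrame({"label1": label1, "label2": label2, "m1": m1})

def pairwise_min(df):
    # One pass over the frame: (label1, label2) -> m1.
    # Assumption: each (label1, label2) pair appears exactly once.
    lookup = dict(zip(zip(df["label1"], df["label2"]), df["m1"]))
    # Strip the trailing digit to get the distinct prefixes, e.g. ['a', 'b'].
    prefixes = sorted({lab[:-1] for lab in df["label1"].unique()})
    rows = []
    for p, q in combinations(prefixes, 2):
        comb1 = lookup[(p + "1", q + "1")] + lookup[(p + "2", q + "2")]
        comb2 = lookup[(p + "2", q + "1")] + lookup[(p + "1", q + "2")]
        rows.append({"label1": p, "label2": q, "metric": int(min(comb1, comb2))})
    return pd.DataFrame(rows)

dataf = pairwise_min(df)
print(dataf)
```

For the example frame this gives a single row for the pair (a, b) with metric 10, the same value min_4line returns.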

Add rows based on condition and also create and update new columns

I have a pandas dataframe with a few thousand rows; a subset of it is below.
fr var
1.1 10px
2.9 12pz
Expected Output:
fr var vard varv
1.1 10px -5 xval
1.1 10px 5 zval
2.9 12pz -6 zval
2.9 12pz 6 xval
For rows - Each row is to be split into two
Conditions for new columns:
'vard' - divide the numeric part of 'var' column by 2 and store it as two rows in 'vard', one negative and one positive value.
'varv' - if 'px' is in 'var column' and 'vard' has negative value, then 'varv' should be 'xval' else 'zval'.
Similarly if 'pz' is in 'var' column and 'vard' has negative value, then 'varv' should be 'zval' else 'xval'.
I have read through various answers to similar problems and tried many options like 'iterrows', 'shift', 'explode' etc., but I am not able to get the expected output.
Use Series.str.extract first to split the numeric and non-numeric parts, convert the numeric part to integers and divide by 2. Then use concat to append a copy with vard multiplied by -1, sort the index and reset it to a default one. Finally, use numpy.where to set the new varv values by condition:
df[['vard','varv']] = df['var'].str.extract(r'(\d+)(\D+)')
df['vard'] = df['vard'].astype(int).div(2)
df = pd.concat([df, df.assign(vard = df['vard']*-1)]).sort_index().reset_index(drop=True)
# True where varv should be 'xval': px with a negative vard, or pz with a positive vard
m = (df['varv'].eq('px') & df['vard'].lt(0)) | (df['varv'].eq('pz') & df['vard'].gt(0))
df['varv'] = np.where(m, 'xval', 'zval')
print (df)
    fr   var  vard  varv
0  1.1  10px   5.0  zval
1  1.1  10px  -5.0  xval
2  2.9  12pz   6.0  xval
3  2.9  12pz  -6.0  zval
This is something that can easily be done using the melt function.
# recreate your dataframe
df = pd.DataFrame(columns=['fr','var'])
df['fr']=[1.1,2.9]
df['var']=['10px','12pz']
# split the var into its two components by creating two new columns
df['vard_p'] = df['var'].str[:-2]
df['vard_p'] = df['vard_p'].astype(float)/2
df['vard_n'] = -df['vard_p']
# get the varv from the var (I assumed it was simply the last character in the string)
df['varv'] = df['var'].str[-1]+'val'
# and here you melt on the two new vard columns to get the dataframe in the format you wanted
df = pd.melt(df, id_vars=['fr','var','varv'], value_vars=['vard_p','vard_n'])
# now rename or drop the new columns
df.rename(columns={'value':'vard'},inplace=True)
df.drop('variable',axis=1,inplace=True)
df
Output:
fr var varv vard
0 1.1 10px xval 5.0
1 2.9 12pz zval 6.0
2 1.1 10px xval -5.0
3 2.9 12pz zval -6.0
Hope it helped

postgreSQL Combinations

I'm trying to figure out how to find all possible combinations (using SQL) for the following situation:
I have 100 ping pong balls in a bowl (id = 1...100)
Each ball is one of 4 colors (color = red, green, blue, yellow)
I want to pick 9 balls (without replacement) as follows.
1 red ball
2 green balls
3 blue balls
2 yellow balls
1 ball that is green, blue, or yellow
How can I determine all possible combinations using SQL as efficiently as possible?
Below is the best I could come up with, but I don't want order to matter (combinations) and I want no replacement:
SELECT pick1.id, pick2.id, pick3.id, pick4.id, pick5.id, pick6.id, pick7.id, pick8.id, pick9.id
FROM bowl AS pick1, bowl AS pick2, bowl AS pick3, bowl AS pick4, bowl AS pick5, bowl AS pick6,
     bowl AS pick7, bowl AS pick8, bowl AS pick9
WHERE
    pick1.color = 'red' AND
    pick2.color = 'green' AND
    pick3.color = 'green' AND
    pick4.color = 'blue' AND
    pick5.color = 'blue' AND
    pick6.color = 'blue' AND
    pick7.color = 'yellow' AND
    pick8.color = 'yellow' AND
    (pick9.color = 'green' OR
     pick9.color = 'blue' OR
     pick9.color = 'yellow')
I haven't tried this in an actual postgresql server but here is an idea.
First I would codify the colors in integers:
0 = red
1 = green
2 = blue
3 = yellow
Now, for example, I want to draw 3 balls: 1 red, 1 green, and 1 green or yellow. The corresponding color codes, after sorting, will be used as a filter in the where clause of the final SQL statement:
[0, 1, 1]
[0, 1, 3]
Then the not in (...) basically ensures that there are no repeating ids, and the sorted array of colors is limited to the set that we specified above.
CREATE EXTENSION intarray;
select p1.id, p2.id, p3.id
from bowl as p1
cross join bowl as p2
cross join bowl as p3
where
    p2.id not in (p1.id)
    and p3.id not in (p1.id, p2.id)
    and sort(array[p1.color, p2.color, p3.color]) in (
        array[0,1,1],
        array[0,1,3]
    )
The intarray extension is needed for the sort() function.
A variation not involving array[] nor the intarray extension is also possible as long as you list out all desired combinations of colors in the IN (..) predicate. See link.
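To make the filtering idea concrete outside the database, here is a small Python sketch of the same logic for the 3-ball example. It is an illustration only, not a replacement for the SQL: the contents of `bowl` are made up, and itertools.combinations is used so that each set of ids appears once regardless of order, which the cross join above does not guarantee on its own.

```python
from itertools import combinations

# Hypothetical bowl: id -> color code (0 = red, 1 = green, 2 = blue, 3 = yellow)
bowl = {1: 0, 2: 1, 3: 1, 4: 3, 5: 2}

# Sorted color codes of the draws we accept:
# 1 red + 1 green + (1 green or 1 yellow)
allowed = {(0, 1, 1), (0, 1, 3)}

# combinations() yields each id set once, in ascending id order,
# so there are no repeated ids and no reordered duplicates.
picks = [
    ids
    for ids in combinations(sorted(bowl), 3)
    if tuple(sorted(bowl[i] for i in ids)) in allowed
]

print(picks)  # [(1, 2, 3), (1, 2, 4), (1, 3, 4)]
```

The SQL analogue of the ordering trick would be to require p1.id < p2.id < p3.id instead of the not in (...) checks, at the cost of no longer tying a particular alias to a particular color.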

How to make points one color when a third column equals zero, and another color otherwise, in Gnuplot?

I need to vary the point color for a row of values based on the value in one column. The data:
# x y z
1, 3, 0
1, 5, 6
3, 5, 2
4, 5, 0
The color should be one value if the column is zero and a different color if the value in the third column is non-zero.
So, I'm assuming:
plot "./file.dat" u 1:2:3 with points palette
as found here: https://stackoverflow.com/a/4115001 will not quite work.
In the above example data, that gnuplot command provides three different colors instead of the two I'm looking for.
This is probably close to what you want:
set palette model RGB defined ( 0 'red', 1 'green' )
plot[0:5][0:6] "file.dat" u 1:2:( $3 == 0 ? 0 : 1 ) with points palette
You could go one step further and remove the "noise":
unset key
unset colorbox
plot[0:5][0:6] "file.dat" u 1:2:( $3 == 0 ? 0 : 1 ) with points pt 7 ps 3 palette
if only the differentiation between zero and non-zero matters.
You can adjust the palette by
set palette defined (-0.1 "blue", 0 "red", 0.1 "blue")

Textile: how do I center all the text in the table cells?

So... here is my table
| Locking | x | x | x | x |
| Public / Private | x | x | x | Default: Private |
Now, how do I change it so that certain columns are centered, or every cell is centered, or just a single cell?
A single cell is centered as follows:
|=. some-text |
For details see documentation or User Manual.
Update
This is a table example tested on github wiki.
Whole table center
table{text-align:center;}.
table{border:1px solid black; text-align:center;}.
|This|is|a centered table|row|
|This|is|a|row-row-row-row|
Single cell center
|=. centered-cell-text|
table{border:1px solid black; }.
|This|is|a centered table|=. cell|
|This|is|a|row-row-row-row|
Single cell formatting
The cell must start with {css code}. (mind the space).
Outline Every Cell In Every Row Of A Table Using Textile Formatting
E.g.
|_{border:2px solid black}. Provider |_{border:2px solid black}. Text Email Address|_{border:2px solid black}. Verified?|
|{border:2px solid black}. 3 River Wireless|{border:2px solid black}. 10digitphonenumber#sms.3rivers.net |{border:2px solid black}. *Not Verified* |
Try Textile in a sandbox: https://txstyle.org/