How to write lists in different directions in pandas in Python

I have three lists as follows.
list_names_1 = ["Salad", "Bread"]
list_names_2 = ["Oil", "Fat", "Salt"]
list_values = [[0.2, 0.1, 0.8], [0.2, 0.9, 0.8]]
Now I want to write the aforementioned three lists to a csv file as follows.
NAMES, Oil, Fat, Salt
Salad, 0.2, 0.1, 0.8
Bread, 0.2, 0.9, 0.8
That is, I want list_names_1 to run vertically (as the row index), list_names_2 to run horizontally (as the column headers), and list_values as the cell values.
Is it possible to do this in pandas?

Use DataFrame constructor with to_csv:
df = pd.DataFrame(list_values, columns=list_names_2, index=list_names_1)
df.index.name = 'NAMES'
print (df)
       Oil  Fat  Salt
NAMES
Salad  0.2  0.1   0.8
Bread  0.2  0.9   0.8
df.to_csv('file')
NAMES,Oil,Fat,Salt
Salad,0.2,0.1,0.8
Bread,0.2,0.9,0.8

Use pd.DataFrame(data=.., columns=..., index=...) to construct the dataframe.
And use index_label in to_csv so that the index name comes out as NAMES in the output.
In [2167]: print (pd.DataFrame(data=list_values, columns=list_names_2, index=list_names_1)
.to_csv(index_label='NAMES'))
NAMES,Oil,Fat,Salt
Salad,0.2,0.1,0.8
Bread,0.2,0.9,0.8
(pd.DataFrame(data=list_values, columns=list_names_2, index=list_names_1)
.to_csv('name.csv', index_label='NAMES'))
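Putting it together, a minimal self-contained sketch of the same approach (the filename output.csv is just an example):
import pandas as pd

list_names_1 = ["Salad", "Bread"]
list_names_2 = ["Oil", "Fat", "Salt"]
list_values = [[0.2, 0.1, 0.8], [0.2, 0.9, 0.8]]

# rows come from list_names_1, columns from list_names_2
df = pd.DataFrame(list_values, index=list_names_1, columns=list_names_2)
# index_label='NAMES' writes the index header shown in the desired csv
df.to_csv('output.csv', index_label='NAMES')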

Related

Pandas: take the minimum of two operations on two dataframes, while preserving index

I'm a beginner with Pandas. I've got two dataframes df1 and df2 of three columns each, labelled by some index.
I would like to get a third dataframe whose entries are
min( df1-df2, 1-df1-df2 )
for each column, while preserving the index.
I don't know how to do this on all the three columns at once. If I try e.g. np.min( df1-df2, 1-df1-df2 ) I get TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed, whereas min( df1-df2, 1-df1+df2 ) gives ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I can't use apply because I've got more than one dataframe. Basically, I would like to use something like subtract, but with the ability to define my own function.
Example: consider these two dataframes
df0 = pd.DataFrame( [[0.1,0.2,0.3], [0.3, 0.1, 0.2], [0.1, 0.3, 0.9]], index=[2,1,3], columns=['px', 'py', 'pz'] )
In [4]: df0
Out[4]:
px py pz
2 0.1 0.2 0.3
1 0.3 0.1 0.2
3 0.1 0.3 0.9
and
df1 = pd.DataFrame( [[0.9,0.1,0.9], [0.1,0.2,0.1], [0.3,0.1,0.8]], index=[3,1,2], columns=['px', 'py', 'pz'])
px py pz
3 0.9 0.1 0.9
1 0.1 0.2 0.1
2 0.3 0.1 0.8
my desired output is a new dataframe df, made up of three columns 'px', 'py', 'pz', whose entries are:
for j in range(1,4):
    dfx[j-1] = min( df0['px'][j] - df1['px'][j], 1 - df0['px'][j] + df1['px'][j] )
for df['px'], and similarly for 'py' and 'pz'.
px py pz
1 0.2 -0.1 0.1
2 -0.2 0.1 -0.5
3 -0.8 0.2 0.0
I hope it's clear now! Thanks in advance!
pandas is smart enough to match up the columns and index values for you in a vectorized way. If you're looping over a dataframe, you're probably doing it wrong.
m1 = df0 - df1
m2 = 1 - (df0 + df1)
# Take the values from m1 where they're less than
# the corresponding value in m2; otherwise, take m2:
out = m1[m1.lt(m2)].combine_first(m2)
# Another method: Combine our two calculated frames,
# groupby the index, and take the minimum.
out = pd.concat([m1, m2]).groupby(level=0).min()
print(out)
# Output:
px py pz
1 0.2 -0.1 0.1
2 -0.2 0.1 -0.5
3 -0.8 0.2 -0.8
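If you prefer a single expression, numpy's element-wise minimum also works here, since both intermediate frames end up with the same aligned index and columns; a minimal sketch of the same idea:
import numpy as np

# element-wise minimum of two aligned DataFrames; the result keeps the index
out = np.minimum(df0 - df1, 1 - (df0 + df1))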

Histogram with Seaborn

I'd like to plot a histogram that compares two arrays of data. Basically, I want to make exactly this:
Suppose I want to make this plot, but using two arrays with four entries, one with the numbers that should go to the blue bars and the other with the numbers for the second set of bars. I have tried this:
x1 = np.array([0.1,0.2,0.3])
x2 = np.array([0.1,0.2,0.5])
sns.histplot(data=[x1,x2], x=['1','2','3'], multiple="dodge", hue=['a','b'], shrink=.8)
But it gives me the error “ValueError: arrays must all be same length”
I know that I'm supposed to pass a df and not arrays, but sadly I'm not really an expert on how to use them.
How can I solve this problem? Simply put, I'm looking for a copy-and-paste solution here, in which I can then change the numbers and the names of the columns.
It looks like you want a barplot, not a histogram. Creating a seaborn plot from multiple columns usually involves converting them to "long form", making the process less straightforward.
Here is an example:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
x1 = np.array([0.1, 0.2, 0.3])
x2 = np.array([0.1, 0.2, 0.5])
x = ['1', '2', '3'] # or, simpler, x = np.arange(len(x1)) + 1
df = pd.DataFrame({'a': x1, 'b': x2, 'x': x})
df_long = df.melt('x')
ax = sns.barplot(data=df_long, x='x', y='value', dodge=True, hue='variable')
plt.show()
The long form looks like:
x variable value
0 1 a 0.1
1 2 a 0.2
2 3 a 0.3
3 1 b 0.1
4 2 b 0.2
5 3 b 0.5
See pandas' melt for additional options, such as naming the created columns.
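For instance, a small sketch naming the melted columns explicitly (the names group and height are just illustrative):
df_long = df.melt('x', var_name='group', value_name='height')
ax = sns.barplot(data=df_long, x='x', y='height', hue='group', dodge=True)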

Pandas convert array column into multiple columns with a condition

I have a pandas data frame with 2 columns:
embedding, an array column whose arrays have length size_of_embedding
language
like this:
embedding language
[0.1 0.2 0.3] fr
[0.1 0.4 0.4] en
[0.8 0.1 0.1] fr
Given a beginning integer n = 10, for each value of the embedding column I want to add columns to the above data frame like this:
embedding language feature1 feature2 feature3
[0.1 0.2 0.3] fr 10:0.1 11:0.2 12:0.3
[0.1 0.4 0.4] en 13:0.1 14:0.4 15:0.4
[0.8 0.1 0.1] fr 10:0.8 11:0.1 12:0.1
So, feature1 = 1st embedding value, feature2 = 2nd embedding value, and so on. For the next language the beginning feature index is n + size_of_embedding.
So, for each language, the number of columns added is exactly the size of the embedding array, and for each next language encountered we start at n + size_of_embedding. Is there an easy way of doing this? Thanks.
First, ensure that the embedding column is in fact an array column. If it is stored as a string, you can convert it to a numpy array like so:
df.embedding = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' '))
Next, create a lookup dict of languages and their starting values, and use that to generate the features:
lookup = {'fr': 10, 'en': 13}
If you have too many languages to create this by hand, you could try the following statement, replacing 10 and 3 as appropriate for your actual dataset:
lookup = {l:10+i*3 for i, l in enumerate(df.language.drop_duplicates().to_list())}
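If you would rather not hard-code those numbers, the same dictionary can be built from the question's n and size_of_embedding; a small sketch, assuming the embedding column already holds arrays:
n = 10  # the beginning integer from the question
size_of_embedding = len(df.embedding.iloc[0])
lookup = {lang: n + i * size_of_embedding
          for i, lang in enumerate(df.language.drop_duplicates())}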
Generating the features is then just a lookup & a list comprehension. Here I've used the helper function f to keep the code tidy.
def f(lang, embedding):
    return [f'{lookup[lang]+i}:{e}' for i, e in enumerate(embedding)]
new_names = ['feature1', 'feature2', 'feature3']
df[new_names] = df.apply(lambda x: f(x.language, x.embedding), axis=1, result_type='expand')
df now looks like:
embedding language feature1 feature2 feature3
0 [0.1, 0.2, 0.3] fr 10:0.1 11:0.2 12:0.3
1 [0.1, 0.4, 0.4] en 13:0.1 14:0.4 15:0.4
2 [0.8, 0.1, 0.1] fr 10:0.8 11:0.1 12:0.1
Longhand
df=pd.DataFrame({'embedding':['[0.1 0.2 0.3]','[0.1 0.4 0.4]','[0.8 0.1 0.1]'],'language':['fre','en','fr']})
df['feature1']=0
df['feature2']=0
df['feature3']=0
df['z'] = df.embedding.str.strip('[]')  # remove the square brackets
df['y'] = df.z.str.findall(r'(\d+\.\d+)')  # extract each digit-dot-digit value in the list
lst = ['10:', '11:', '12:']  # list lookup for `fr/fre`
lst2 = ['13:', '14:', '15:']  # list lookup for `en`
Create two frames for fr and en using a boolean selection
m = df.language.isin(['en'])
df2 = df[~m].copy()
df3 = df[m].copy()
Compute feature1, feature2 and feature3
df2['k'] = [lst + i for i in df2['y']]
df3['m'] = [lst2 + i for i in df3['y']]
df2['feature1'] = [i[0] + i[len(lst)] for i in df2['k']]
df2['feature2'] = [i[1] + i[len(lst) + 1] for i in df2['k']]
df2['feature3'] = [i[2] + i[len(lst) + 2] for i in df2['k']]
df3['feature1'] = [i[0] + i[len(lst2)] for i in df3['m']]
df3['feature2'] = [i[1] + i[len(lst2) + 1] for i in df3['m']]
df3['feature3'] = [i[2] + i[len(lst2) + 2] for i in df3['m']]
Concat df2 and df3
pd.concat([df3.iloc[:, :5], df2.iloc[:, :5]])

Add column to pandas df depending on a condition statement

I have a df called lin_reg_df with two columns Surface_Elevation_mAHD and Adopted_SS_WL.
The df holds measurements of each for 88 groundwater wells that each have a particular well name.
lin_reg_df is indexed by well name.
I want to add another column to the df that is called Aquifer_Type and specifies if the well is deep or shallow.
The names of all the deep wells are held in a list called deep_wells, and the shallow well names are held in a list called shallow_wells.
I want to cycle through the well names (the index of the df): if the well name is listed in deep_wells, I want to put "deep" in the Aquifer_Type column, and if it is listed in shallow_wells, I want to put "shallow" in the Aquifer_Type column.
I tried using isin within the loop but I couldn't get it to work.
Any advice?
I tried to create a minimal example from your information, with the following code:
well_name = ['a','b','c','d','e','f','g']
deep_wells = ['a','c','g']
shallow_wells = ['b','d','e','f']
lin_reg_df = pd.DataFrame({'Surface_Elevation_mAHD': [1, 2, 3, 4, 5, 6, 7],
                           'Adopted_SS_WL': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]})
lin_reg_df.index = well_name
so my DataFrame initially looks like this:
Surface_Elevation_mAHD Adopted_SS_WL
a 1 0.1
b 2 0.2
c 3 0.3
d 4 0.4
e 5 0.5
f 6 0.6
g 7 0.7
I would then just use this code snippet to do the job:
for well in well_name:
    if well in deep_wells:
        lin_reg_df.loc[well, 'Aquifer_Type'] = 'deep'
    elif well in shallow_wells:
        lin_reg_df.loc[well, 'Aquifer_Type'] = 'shallow'
The output will be:
Surface_Elevation_mAHD Adopted_SS_WL Aquifer_Type
a 1 0.1 deep
b 2 0.2 shallow
c 3 0.3 deep
d 4 0.4 shallow
e 5 0.5 shallow
f 6 0.6 shallow
g 7 0.7 deep
Always give data to get help faster
Data
df=pd.DataFrame({'Surface_Elevation_mAHD':[3,4,6,7.8,9,2,5],'Adopted_SS_WL':[1,2,3,4,5,6,7],'well name':['WQ','RT','KL','SZ','TR','YH','YP']})
df
Lists
deep_wells=['WQ','RT','YH','YP']
shallow_wells=['KL','SZ','TR']
Another way is to use np.where, i.e. numpy.where(condition, value_if_true, value_if_false):
df['Aquifer_Type'] = np.where(df['well name'].isin(shallow_wells), 'shallow', 'deep')
df
If well name is the index, reset it before applying np.where; you can do that with df.reset_index(inplace=True).
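Alternatively, if the well names are the index, you can test the index directly and skip the reset; a minimal sketch using the lists defined in the question:
lin_reg_df['Aquifer_Type'] = np.where(lin_reg_df.index.isin(deep_wells), 'deep', 'shallow')
# any well not listed in deep_wells is labelled 'shallow' here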

Tensorflow: "roulette wheel" selection

I'm trying to implement roulette wheel selection in Tensorflow, so I started with this:
x = tf.random_uniform([tf.shape(probabilities)[0]])
cumsum = tf.cumsum(probabilities, axis=1) # cumulative sum
b = tf.greater_equal(x, cumsum) # Boolean values now
...
indices = tf.where(b) # this gives indices for all the True values; I need only the first one per row
indices = indices[:,1] # we only need column index
Any suggestions for this? Or a better procedure to do the roulette wheel selection?
So a small example to make it more clear
probabilities = [[0.2, 0.3, 0.5],
                 [0.1, 0.6, 0.3],
                 [0.5, 0.4, 0.1]]
x = [0.27, 0.86, 0.73] # drawn randomly
Then I want as output [1, 2, 1]
As far as I understand, you want to draw the samples from a multinomial distribution. To do that, it is easiest to simply use tf.multinomial:
samples = tf.multinomial(tf.log(probabilities), 1)
Possibly followed by reshaping:
samples_vector = tf.reshape(samples, [-1])
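If you want to stay with the cumulative-sum approach from the question, a rough sketch (TF 1.x style, matching tf.random_uniform above) that picks, per row, the first position where the cumulative sum exceeds the drawn number:
x = tf.random_uniform([tf.shape(probabilities)[0]])
cumsum = tf.cumsum(probabilities, axis=1)
# compare each row's draw against that row's cumulative sums (note the added dimension),
# then take the first True per row: argmax returns the first maximal (i.e. first True) entry
b = tf.less(tf.expand_dims(x, 1), cumsum)
indices = tf.argmax(tf.cast(b, tf.int32), axis=1)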