How can I create a new slide with elements from a dataframe?

I want to generate a PowerPoint presentation with python-pptx in VS Code.
The presentation should get a new slide for each row of a df.
My df looks like this:
```
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Vorname           14 non-null     string
 1   Nachname          14 non-null     string
 2   PSNR              14 non-null     string
 3   Ausbildung        14 non-null     string
 4   Tätigkeit         14 non-null     string
 5   Geburtsdatum      14 non-null     string
 6   Geschlecht        14 non-null     string
 7   LinkedInImageUrl  5 non-null      string
 8   Vorname_Nachname  14 non-null     object
dtypes: object(1), string(8)
```
I want to combine Vorname and Nachname into a new string column with this:
df['Vorname_Nachname'] = df[['Vorname', 'Nachname']].agg(' '.join, axis=1)
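For reference, a NaN-tolerant equivalent using Series.str.cat (just a sketch; the agg/' '.join line above works fine here, since both columns are fully non-null):

```python
# Same result as the agg line; na_rep keeps it from failing if a name is ever missing.
df['Vorname_Nachname'] = df['Vorname'].str.cat(df['Nachname'], sep=' ', na_rep='')
```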
Then I try to use a for loop over my df with this:
```python
from pptx import Presentation

# Create the PowerPoint presentation from the template: dynamicPowerPointTemplate.pptx
prs = Presentation('dynamicPowerPointTemplate.pptx')

# New slide per for loop, using: Vorname_Nachname and PSNR
for row in df.itertuples():
    print(row.Vorname_Nachname)
    title_slide_layout = prs.slide_layouts[2]
    slide = prs.slides.add_slide(title_slide_layout)
    title = slide.shapes.title
    subtitle = slide.placeholders[1]
    title.text = row.Vorname_Nachname
    subtitle.text = row.Tätigkeit

# Save the PowerPoint file: dynamicPowerPointTemplate-test.pptx
prs.save('dynamicPowerPointTemplate-test.pptx')
```
Is this the best way to do it? Thank you all for any hints.
Sebastian

Related

Cosine Similarity between 2 cells in a dataframe

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_1 = data.iloc[15]['Description']   # 'data' is the source dataframe
doc_2 = "Data is a new oil"
data = [doc_1, doc_2]                  # note: this overwrites the dataframe name

count_vectorizer = CountVectorizer()
vector_matrix = count_vectorizer.fit_transform(data)
tokens = count_vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
vector_matrix.toarray()

def create_dataframe(matrix, tokens):
    doc_names = [f'doc_{i+1}' for i, _ in enumerate(matrix)]
    df = pd.DataFrame(data=matrix, index=doc_names, columns=tokens)
    return df

create_dataframe(vector_matrix.toarray(), tokens)
cosine_similarity_matrix = cosine_similarity(vector_matrix)
create_dataframe(cosine_similarity_matrix, ['doc_1', 'doc_2'])
```
The code calculates the cosine similarity between a cell and the string, but how can I improve my code so that I can calculate the cosine similarity between cells? I want doc1 compared with all the other cells in the column.
So I will get a table like this, where the dots are the cosine similarities:
```
x      doc2  doc3
doc1   ....  ....
```
(picture of how the table should look)
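A minimal sketch of one way to get doc1 compared with every cell in the column: vectorize the whole column at once and read the first row of the pairwise similarity matrix. This assumes the original dataframe is called data and the text lives in its Description column, as in the snippet above; everything else is illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: 'data' is the original dataframe with a text column 'Description'.
docs = data['Description'].fillna('').tolist()

count_vectorizer = CountVectorizer()
vector_matrix = count_vectorizer.fit_transform(docs)   # one row per cell

# Entry [i, j] is the cosine similarity between cell i and cell j.
cosine_similarity_matrix = cosine_similarity(vector_matrix)

doc_names = [f'doc_{i+1}' for i in range(len(docs))]
similarity_df = pd.DataFrame(cosine_similarity_matrix, index=doc_names, columns=doc_names)

# doc_1 compared with all the other cells in the column:
print(similarity_df.loc['doc_1'])
```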

Pandas: Getting indices (numeric position) from external array for each value in Column

I have a fixed array of values: ['string1', 'string2', 'string3'] and a pandas DataFrame:
```python
>>> pd.DataFrame({'column': ['string1', 'string1', 'string2']})
    column
0  string1
1  string1
2  string2
```
And I want to add a new column with the index positions from the previous array, so it becomes:
```python
>>> pd.DataFrame({'column': ['string1', 'string1', 'string2', pd.NA], 'indices': [0, 0, 1, pd.NA]})
    column indices
0  string1       0
1  string1       0
2  string2       1
3     <NA>    <NA>
```
I.e. the position of the value in the main array. This will later be fed into pyarrow's DictionaryArray [1]. The DataFrame can have null values as well.
Is there any fast way to do this? I have been trying to figure out how to vectorize it. Naive implementation:
```python
import pandas as pd
import pyarrow as pa

def create_dictionary_array_indices(column_name, arrow_array):
    global dictionary_values
    values = arrow_array.to_pylist()
    indices = []
    for i, value in enumerate(values):
        if not value or value != value:  # None, empty, or NaN
            indices.append(None)
        else:
            indices.append(
                dictionary_values[column_name].index(value)
            )
    indices = pd.array(indices, dtype=pd.Int32Dtype())
    return pa.DictionaryArray.from_arrays(indices, dictionary_values[column_name])
```
[1] https://lists.apache.org/thread/xkpyb3zboksbhmyqzzkj983y6l0t9bjs
Given your two dataframes:
```python
import pandas as pd

df1 = pd.DataFrame({"column": ["string1", "string1", "string2"]})
df2 = pd.DataFrame({"column": ["string1", "string1", "string2", pd.NA]})
```
Here is one way to do it:
```python
df1 = df1.drop_duplicates(keep="first").reset_index(drop=True)
indices = {value: key for key, value in df1["column"].items()}
df2["indices"] = df2["column"].apply(lambda x: indices.get(x, pd.NA))
print(df2)

# Output
    column indices
0  string1       0
1  string1       0
2  string2       1
3     <NA>    <NA>
```
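Not from the answer above, but a vectorized sketch of the same lookup using pd.Categorical with the fixed array as the categories; codes of -1 (missing or unknown values) are turned into <NA>:

```python
import pandas as pd

fixed_values = ['string1', 'string2', 'string3']
df2 = pd.DataFrame({"column": ["string1", "string1", "string2", pd.NA]})

# Categorical.codes gives the position of each value in fixed_values, or -1 if absent/NA.
codes = pd.Categorical(df2["column"], categories=fixed_values).codes
codes = pd.Series(codes, index=df2.index, dtype="Int32")

df2["indices"] = codes.mask(codes == -1)  # replace -1 with <NA>
print(df2)
```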

to_string(index = False) results in non empty string even when dataframe is empty

I am doing the following in my Python script and I want to hide the index column when I print the dataframe. So I used .to_string(index=False) and then use len() to check whether it is zero or not. However, when I call to_string() on an empty dataframe, len() does not return zero. If I print procinject1, it says "Empty DataFrame". Any help to fix this would be greatly appreciated.
```python
from colorama import Fore

procinject1 = dfmalfind[dfmalfind["Hexdump"].str.contains("MZ") == True].to_string(index=False)
if len(procinject1) == 0:
    print(Fore.GREEN + "[✓]No MZ header detected in malfind preview output")
else:
    print(Fore.RED + "[!]MZ header detected within malfind preview (Process Injection indicator)")
    print(procinject1)
```
That's the expected behaviour of a pandas DataFrame.
In your case, procinject1 stores the string representation of the dataframe, which is non-empty even if the corresponding dataframe is empty.
For example, check the code snippet below, where I create an empty dataframe df and check its string representation:
```python
import pandas as pd

df = pd.DataFrame()
print(df.to_string(index=False))
print(df.to_string(index=True))
```
For both index = False and index = True cases, the output will be the same, which is given below (and that is the expected behaviour). So your corresponding len() will always return non-zero.
```
Empty DataFrame
Columns: []
Index: []
```
But if you use a non-empty dataframe, then the outputs for index = False and index = True cases will be different as given below:
```python
data = [{'A': 10, 'B': 20, 'C': 30}, {'A': 5, 'B': 10, 'C': 15}]
df = pd.DataFrame(data)
print(df.to_string(index=False))
print(df.to_string(index=True))
```
Then the outputs for the index = False and index = True cases respectively will be:
```
 A   B   C
10  20  30
 5  10  15

   A   B   C
0  10  20  30
1   5  10  15
```
Since pandas handles empty dataframes differently, to solve your problem, you should first check whether your dataframe is empty or not, using pandas.DataFrame.empty.
Then if the dataframe is actually non-empty, you could print the string representation of that dataframe, while keeping index = False to hide the index column.
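A minimal sketch of that check applied to the snippet from the question (dfmalfind, the "Hexdump" column, and the Fore colours are taken from the question; nothing else is assumed):

```python
from colorama import Fore

# Filter first, then test with DataFrame.empty instead of len() on the string dump.
matches = dfmalfind[dfmalfind["Hexdump"].str.contains("MZ") == True]

if matches.empty:
    print(Fore.GREEN + "[✓]No MZ header detected in malfind preview output")
else:
    print(Fore.RED + "[!]MZ header detected within malfind preview (Process Injection indicator)")
    print(matches.to_string(index=False))
```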

Pandas dataframe dump to excel with color formatting

I have a large pandas dataframe df as:
```
Sou    ATC  P25  P75  Avg
A     11      9   15   10
B      6.63  15   15   25
C      6.63   5   10    8
```
I want to write this dataframe to an Excel file, but I want to apply formatting to each row of the Excel file such that the following rules are applied to the cells in the ATC and Avg columns:
colored red if the value is less than P25
colored green if the value is greater than P75
colored yellow if the value is between P25 and P75
Sample display in Excel is as follows (screenshot omitted).
I am not sure how to approach this.
You can use style.Styler.apply with a DataFrame of styles, filled via numpy.select from masks created by DataFrame.lt and DataFrame.gt:
```python
import numpy as np
import pandas as pd

def color(x):
    c1 = 'background-color: red'
    c2 = 'background-color: green'
    c3 = 'background-color: yellow'
    c = ''
    cols = ['ATC', 'Avg']
    m1 = x[cols].lt(x['P25'], axis=0)
    m2 = x[cols].gt(x['P75'], axis=0)
    arr = np.select([m1, m2], [c1, c2], default=c3)
    df1 = pd.DataFrame(arr, index=x.index, columns=cols)
    return df1.reindex(columns=x.columns, fill_value=c)

df.style.apply(color, axis=None).to_excel('format_file.xlsx', index=False, engine='openpyxl')
```
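For reference, a self-contained run on the sample data from the question, assuming the color function above has been defined (pandas and openpyxl need to be installed):

```python
import pandas as pd

# Sample dataframe from the question.
df = pd.DataFrame({
    'Sou': ['A', 'B', 'C'],
    'ATC': [11, 6.63, 6.63],
    'P25': [9, 15, 5],
    'P75': [15, 15, 10],
    'Avg': [10, 25, 8],
})

# ATC and Avg cells come out red, green, or yellow relative to P25/P75.
df.style.apply(color, axis=None).to_excel('format_file.xlsx', index=False, engine='openpyxl')
```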

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How is this working?
I know the intuition behind it: given the movie dataset (loaded into "md" using pandas), we are finding those rows in 'vote_count' which are not null and converting them to int.
But I am not understanding the syntax.
md[md['vote_count'].notnull()] returns a filtered view of your current md dataframe where vote_count is not null, which is then assigned to the variable vote_counts. This is boolean indexing.
```python
import numpy as np
import pandas as pd

# Assume this dataframe
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df.loc[2, 'B'] = np.nan
```
When you do df['B'].notnull(), it will return a boolean vector which can be used to filter your data where the value is True:
```python
df['B'].notnull()

0     True
1     True
2    False
3     True
4     True
Name: B, dtype: bool
```
```python
df[df['B'].notnull()]

          A         B         C
0 -0.516625 -0.596213 -0.035508
1  0.450260  1.123950 -0.317217
3  0.405783  0.497761 -1.759510
4  0.307594 -0.357566  0.279341
```
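A minimal sketch that breaks the chained expression from the question into steps (md is the asker's movie dataframe and vote_count its column; nothing else is assumed):

```python
# vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
# is equivalent to:

mask = md['vote_count'].notnull()          # boolean Series: True where vote_count is not null
not_null_rows = md[mask]                   # boolean indexing: keep only those rows
vote_counts = not_null_rows['vote_count'].astype('int')   # select the column and cast to int
```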