read_excel modifies unexpected data - pandas

Is it possible that the pandas function read_excel is modifying data from the Excel file?
It seems to change, for example, TRUE to 1 and FALSE to False.
I use this code:
df = pd.read_excel(DEFAULT_PATH_2_XLSX_FILE, dtype=str)

There is a good explanation for this: pandas reads the raw values of the cells, not the displayed values:
df = pd.read_excel('data.xlsx', dtype=str, header=None)
print(df)
# Output
   0  1
0  1  0   # A1, B1
1  1  0   # A2, B2
2  1  0   # A3, B3
Here is how the values were entered in the spreadsheet:
A1: 1
B1: 0
A2: 1, then formatted as a boolean value
B2: 0, then formatted as a boolean value
A3: TRUE (typed as-is)
B3: FALSE (typed as-is)
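The effect can be reproduced without an Excel file at all. A minimal sketch: a boolean cell is stored as a raw boolean value, and dtype=str stringifies that raw value rather than the TRUE/FALSE text Excel displays:

```python
import pandas as pd

# Minimal reproduction without an Excel file: the raw stored value of a
# boolean cell is True/False, and converting to str yields 'True'/'False',
# not the 'TRUE'/'FALSE' text Excel displays.
df = pd.DataFrame({0: [1, 1, True], 1: [0, 0, False]}).astype(str)
print(df[0].tolist())  # ['1', '1', 'True']
print(df[1].tolist())  # ['0', '0', 'False']
```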

Related

How to find the indices of the elements that exit in another array in numpy?

I have two arrays, a1 and a2. a1 contains only unique elements and is sorted, and a1 contains every element of a2.
a1 = np.array([3,4,5,7,8,9])
a2 = np.array([5,3,5,8])
I want the output to be
res = np.array([2,0,2,4])
So that a1[res] == a2. How can I do this quickly?
I'm not sure if this is what you want:
import numpy as np
a1 = np.array([3,4,5,7,8,9])
a2 = np.array([5,3,5,8])
res = []
for i in a2:
    res.append(*np.where(a1 == i)[0])
res = np.array(res)
print(res)
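Since the question states that a1 is sorted and contains every element of a2, a vectorized alternative (a sketch using np.searchsorted) avoids the Python loop entirely:

```python
import numpy as np

a1 = np.array([3, 4, 5, 7, 8, 9])
a2 = np.array([5, 3, 5, 8])

# searchsorted returns, for each element of a2, the index where it would be
# inserted into the sorted a1 -- which is its actual index when it is present
res = np.searchsorted(a1, a2)
print(res)             # [2 0 2 4]
print(a1[res] == a2)   # [ True  True  True  True]
```

Note this only works because a1 is sorted and a2 contains no values missing from a1; otherwise the returned positions are insertion points, not matches.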

Pandas: Getting indices (numeric position) from external array for each value in Column

I have a fixed array of values: ['string1', 'string2', 'string3'] and a pandas DataFrame:
>>> pd.DataFrame({'column': ['string1', 'string1', 'string2']})
    column
0  string1
1  string1
2  string2
And I want to add a new column with the index positions from that array, so it becomes:
>>> pd.DataFrame({'column': ['string1', 'string1', 'string2', pd.NA], 'indices': [0,0,1, pd.NA]})
    column indices
0  string1       0
1  string1       0
2  string2       1
3     <NA>    <NA>
I.e., the position of each value in the main array. This will later be fed into pyarrow's DictionaryArray [1]. The DataFrame can contain null values as well.
Is there any fast way to do this? I've been trying to figure out how to vectorize it. Naive implementation:
def create_dictionary_array_indices(column_name, arrow_array):
    global dictionary_values
    values = arrow_array.to_pylist()
    indices = []
    for i, value in enumerate(values):
        if not value or value != value:
            indices.append(None)
        else:
            indices.append(
                dictionary_values[column_name].index(value)
            )
    indices = pd.array(indices, dtype=pd.Int32Dtype())
    return pa.DictionaryArray.from_arrays(indices, dictionary_values[column_name])
[1] https://lists.apache.org/thread/xkpyb3zboksbhmyqzzkj983y6l0t9bjs
Given your two dataframes:
import pandas as pd
df1 = pd.DataFrame({"column": ["string1", "string1", "string2"]})
df2 = pd.DataFrame({"column": ["string1", "string1", "string2", pd.NA]})
Here is one way to do it:
df1 = df1.drop_duplicates(keep="first").reset_index(drop=True)
indices = {value: key for key, value in df1["column"].items()}
df2["indices"] = df2["column"].apply(lambda x: indices.get(x, pd.NA))
print(df2)
# Output
    column indices
0  string1       0
1  string1       0
2  string2       1
3     <NA>    <NA>
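If the lookup array is fixed up front, you can skip the drop_duplicates step and use Series.map with a plain dict, which is vectorized-ish and handles missing values for you. A sketch under that assumption:

```python
import pandas as pd

values = ['string1', 'string2', 'string3']
df2 = pd.DataFrame({"column": ["string1", "string1", "string2", pd.NA]})

# Build a value -> position lookup once, then map it over the column.
# Keys missing from the lookup (including NA) come back as NaN, which the
# nullable Int32 dtype then represents as <NA>.
mapping = {v: i for i, v in enumerate(values)}
df2["indices"] = df2["column"].map(mapping).astype(pd.Int32Dtype())
print(df2)
```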

Populate empty pandas dataframe with specific conditions

I want to create a pandas dataframe with 5000 columns (n=5000) and one row (row G), where each value in row G is 1 (in 10% of samples) or 0 (in 90% of samples).
import numpy as np
import pandas as pd
df = pd.DataFrame({"G": np.random.choice([1, 0], p=[0.1, 0.9], size=5000)}).T
I also want to add column names "Cell1" through "Cell5000", so the result looks like:
   Cell1  Cell2  Cell3  ...  Cell5000
G      0      0      1  ...         0
The columns will default to a RangeIndex from 0-4999. You can add 1 to the column values, and then use DataFrame.add_prefix to add the string "Cell" before all of the column names.
df.columns += 1
df = df.add_prefix("Cell")
print(df)
   Cell1  Cell2  Cell3  ...  Cell5000
G      0      0      0  ...         0
As a one-liner, you can also add 1 and prefix with "Cell" by converting the column index dtype manually.
df.columns = "Cell" + (df.columns + 1).astype(str)
To make a single row DataFrame, I would construct my data with numpy in the correct shape instead of transposing a DataFrame. You can also pass in the columns as you want them numbered and the index labelled.
import numpy as np
import pandas as pd
size = 5000
df = pd.DataFrame(
    np.random.choice([1, 0], p=[0.1, 0.9], size=(1, size)),
    columns=np.arange(1, size + 1),
    index=["G"]
).add_prefix("Cell")
print(df)
   Cell1  Cell2  Cell3  ...  Cell4999  Cell5000
G      0      0      0  ...         0         0
Another method:
size = 5000
pd.DataFrame.from_dict(
    {"G": np.random.choice([1, 0], p=[0.1, 0.9], size=size)},
    columns=[f'Cell{x}' for x in range(1, size + 1)],
    orient='index'
)
Output:
Cell1 Cell2 Cell3 Cell4 Cell5 Cell6 Cell7 Cell8 Cell9 ... Cell4992 Cell4993 Cell4994 Cell4995 Cell4996 Cell4997 Cell4998 Cell4999 Cell5000
G 0 0 0 0 0 1 0 1 0 ... 0 0 0 0 0 0 0 0 0
[1 rows x 5000 columns]
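Putting the pieces together, a self-contained sketch of the whole task (same size and column naming as above), with a quick sanity check that roughly 10% of the values are 1:

```python
import numpy as np
import pandas as pd

size = 5000
df = pd.DataFrame(
    np.random.choice([1, 0], p=[0.1, 0.9], size=(1, size)),
    columns=[f"Cell{i}" for i in range(1, size + 1)],
    index=["G"],
)
print(df.shape)            # (1, 5000)
print(df.loc["G"].mean())  # roughly 0.1
```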

Pandas dataframe dump to excel with color formatting

I have a large pandas dataframe df as:
Sou    ATC  P25  P75  Avg
A    11.00    9   15   10
B     6.63   15   15   25
C     6.63    5   10    8
I want to write this dataframe to an Excel file, applying formatting to each row such that the following rules govern the cells in the ATC and Avg columns:
colored red if the value is less than P25
colored green if the value is greater than P75
colored yellow if the value is between P25 and P75
I am not sure how to approach this.
You can use style.Styler.apply with a DataFrame of styles, filling by masks created with DataFrame.lt and DataFrame.gt via numpy.select:
import numpy as np
import pandas as pd

def color(x):
    c1 = 'background-color: red'
    c2 = 'background-color: green'
    c3 = 'background-color: yellow'
    c = ''
    cols = ['ATC', 'Avg']
    m1 = x[cols].lt(x['P25'], axis=0)
    m2 = x[cols].gt(x['P75'], axis=0)
    arr = np.select([m1, m2], [c1, c2], default=c3)
    df1 = pd.DataFrame(arr, index=x.index, columns=cols)
    return df1.reindex(columns=x.columns, fill_value=c)

df.style.apply(color, axis=None).to_excel('format_file.xlsx', index=False, engine='openpyxl')
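To check that the masks implement the three rules on the sample data, you can inspect the selection array directly. A sketch using plain color names in place of the CSS strings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sou": ["A", "B", "C"],
    "ATC": [11, 6.63, 6.63],
    "P25": [9, 15, 5],
    "P75": [15, 15, 10],
    "Avg": [10, 25, 8],
})

cols = ["ATC", "Avg"]
m1 = df[cols].lt(df["P25"], axis=0)   # below P25 -> red
m2 = df[cols].gt(df["P75"], axis=0)   # above P75 -> green
arr = np.select([m1, m2], ["red", "green"], default="yellow")
print(arr)
# [['yellow' 'yellow']
#  ['red' 'green']
#  ['yellow' 'yellow']]
```

Row B illustrates both boundary rules: ATC (6.63) is below its P25 (15) so it is red, while Avg (25) is above its P75 (15) so it is green.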

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How does this work?
I know the intuition behind it: given the movie dataset (loaded into "md" with pandas), we are finding the rows where 'vote_count' is not null and converting them to int.
But I don't understand the syntax.
md[md['vote_count'].notnull()] returns a filtered view of your md dataframe where vote_count is not NULL, which is then assigned to the variable vote_counts. This is Boolean Indexing.
# Assume this dataframe
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[2,'B'] = np.nan
When you do df['B'].notnull(), it returns a boolean vector that can be used to filter your data where the value is True:
df['B'].notnull()
0 True
1 True
2 False
3 True
4 True
Name: B, dtype: bool
df[df['B'].notnull()]
A B C
0 -0.516625 -0.596213 -0.035508
1 0.450260 1.123950 -0.317217
3 0.405783 0.497761 -1.759510
4 0.307594 -0.357566 0.279341
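The original one-liner can be unpacked into three steps. A sketch, assuming md is a DataFrame with a vote_count column containing some nulls:

```python
import numpy as np
import pandas as pd

# toy stand-in for the movie dataset
md = pd.DataFrame({'vote_count': [10.0, np.nan, 25.0]})

mask = md['vote_count'].notnull()   # boolean Series: [True, False, True]
filtered = md[mask]                 # keep only rows where vote_count is not null
vote_counts = filtered['vote_count'].astype('int')
print(vote_counts.tolist())         # [10, 25]
```

The conversion to int is only safe after the filter, since NaN cannot be cast to an integer dtype.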