Select column for formatting based on cell content - xlsxwriter

In XlsxWriter I can format a column by selecting it with its Excel column letters, e.g. worksheet.set_column('V:V', 20, cell_format). Is it also possible to define which column to format based on its value (e.g. header columns)?
For example worksheet.set_column('cell_value', 20, cell_format)?

Is it also possible to define which column to format based on its value (e.g. header columns)?
That isn't possible using XlsxWriter, which in general is a write-only data store, so it cannot look a column up by the value written into it. Instead you could track the column headers and their indices in a dict and use that. Something like:
cols = {'Year': 0, 'Income': 1, 'Outgoings': 2}
worksheet.set_column(cols['Year'], cols['Year'], 20, cell_format)
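A minimal sketch of that idea, assuming the header names are known at the time the sheet is written. The header list, the bold format, and the in-memory BytesIO target are illustrative choices, not part of the original question:

```python
import io

import xlsxwriter

# Hypothetical header row; the dict maps each header to its zero-based column index.
headers = ["Year", "Income", "Outgoings"]
cols = {name: idx for idx, name in enumerate(headers)}

# Write to an in-memory buffer so the sketch leaves no file behind.
output = io.BytesIO()
workbook = xlsxwriter.Workbook(output, {"in_memory": True})
worksheet = workbook.add_worksheet()
cell_format = workbook.add_format({"bold": True})

# Write the headers, then format a column by looking up its header name.
worksheet.write_row(0, 0, headers)
worksheet.set_column(cols["Income"], cols["Income"], 20, cell_format)

workbook.close()
```

Building the dict from the same list that is written to the sheet keeps the lookup and the header row in sync.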

Related

How can I query a column of a dataframe on a specific value and get the values of two other columns corresponding to that value

I have a data frame where the first column contains various countries' ISO codes, while the other two columns contain dataset numbers and LinkedIn profile links.
Please refer to the image.
I need to query the data frame's first "FBC" column on the "IND" value and get the corresponding values of the "no" and "Linkedin" columns.
Can somebody please suggest a solution?
Using query():
If you want just the no and Linkedin values:
df = df.query("FBC.eq('IND')")[["no", "Linkedin"]]
If you want all 3:
df = df.query("FBC.eq('IND')")
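For comparison, here is a sketch of the same selection written with a plain equality test, once as a boolean mask with .loc and once with query(). The sample rows and the example.com links are made up for illustration; only the column names come from the question:

```python
import pandas as pd

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "FBC": ["IND", "USA", "IND"],
    "no": [1, 2, 3],
    "Linkedin": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
})

# Boolean mask: keep rows where FBC equals 'IND', and only the two wanted columns.
subset = df.loc[df["FBC"] == "IND", ["no", "Linkedin"]]

# The same filter expressed with query() and a plain comparison.
same = df.query("FBC == 'IND'")[["no", "Linkedin"]]
```

Assigning the result to a new name rather than back to df keeps the original frame available for further queries.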

AttributeError for df.apply() when trying to subtract the column mean and divide by the column standard deviation for each column in a dataframe

I have a data frame with roughly 26 columns. For 14 of these columns (all data are floats) I want to determine the mean and standard deviation for each column, then for each value in each column I want to subtract the column mean and divide by the column standard deviation (only for the column to which the value belongs).
I can do this separately for each column like so:
chla_array = df['Chla'].to_numpy()
mean_chla = np.nanmean(chla_array)
std_chla = np.nanstd(chla_array)
df['Chla_standardized'] = (df['Chla'] - mean_chla) / std_chla
Because I have 14 columns to do this for, I am looking for a more concise way of coding this, rather than copying and pasting the above code thirteen more times and changing the column headers. I was thinking of using df.apply() but I can't get it to work. Here is what I have:
df = df.iloc[:, [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]]
df_standardized = df.apply((df - df.mean(skipna=True)) / df.std(skipna=True, ddof=0))
The error I encounter is this:
AttributeError: 'Canyon_dist' is not a valid function for 'Series' object
Where 'Canyon_dist' is the header for the first column the code encounters.
I'm not sure that df.apply is appropriate for what I am trying to achieve, so if there is a more appropriate way of doing this please let me know (perhaps using a for loop?).
I am open to all suggestions and thank you.
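One possible approach, offered as a sketch rather than a definitive answer: df.apply() expects a function, whereas the attempt above passes it an already computed data frame. pandas arithmetic aligns on column labels, so the whole calculation can be written without apply at all. The sample values and the cols_to_scale list below are made up; ddof=0 is used so the result matches np.nanstd:

```python
import numpy as np
import pandas as pd

# Made-up sample data; 'Site' stands in for the columns that should not be scaled.
df = pd.DataFrame({
    "Canyon_dist": [1.0, 2.0, np.nan, 4.0],
    "Chla": [0.5, np.nan, 1.5, 2.5],
    "Site": ["a", "b", "c", "d"],
})

# Hypothetical list of the columns to standardize.
cols_to_scale = ["Canyon_dist", "Chla"]
subset = df[cols_to_scale]

# mean() and std() skip NaN by default and return one value per column;
# the subtraction and division are then applied column by column.
df_standardized = (subset - subset.mean()) / subset.std(ddof=0)

# Equivalent version with apply(), which needs a function of one column.
df_via_apply = subset.apply(lambda col: (col - col.mean()) / col.std(ddof=0))
```

If the standardized values should sit alongside the originals, they could be joined back with something like df.join(df_standardized.add_suffix('_standardized')).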

How do you split all columns in a large pandas data frame?

I have a very large data frame, and I want to split all of its columns except the first two on a comma delimiter. So I need a way to reference the column names programmatically, in a loop or otherwise, and split all the columns in one go.
In my testing of the split method, I was able to hard-code a single column name (rs145629793) as the column to split, and the result was two new columns, as I wanted.
See the Python code below, with the hard-coded column name:
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to write out each actual column name and repeat the code.
I then replaced the actual column name rs145629793 with columns[2] in the split call.
That results in an error:
'str has no str attribute'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
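A sketch of such a loop, assuming the first two columns are to be left alone and every later column holds comma-separated strings. The sample frame and the naming scheme for the new columns (original name plus _1, _2, ...) are made up for illustration:

```python
import pandas as pd

# Made-up sample data: two leading columns to keep, then columns to split.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["x", "y"],
    "rs1": ["A,G", "C,T"],
    "rs2": ["G,G", "T,A"],
})

# Start with the first two columns untouched.
parts = [df.iloc[:, :2]]

# Loop over the remaining columns by position.
for i in range(2, df.shape[1]):
    col = df.columns[i]
    split = df.iloc[:, i].str.split(",", expand=True)
    # Name the pieces after the source column so the names stay unique.
    split.columns = [f"{col}_{j + 1}" for j in range(split.shape[1])]
    parts.append(split)

# Put everything back together side by side.
result = pd.concat(parts, axis=1)
```

Collecting the pieces in a list and concatenating once avoids repeatedly rebuilding a large frame inside the loop.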
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
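import copy
import pandas as pd
# Random input data, so the example can be copied and run as-is
your_dataframe = pd.DataFrame({'a': ['1,2,3', '9,8,7'],
                               'b': ['4,5,6', '6,5,4'],
                               'c': ['7,8,9', '3,2,1']})
def split_cols(df):
    dict_of_df = {}
    cols = df.columns.to_list()
    for col in cols:
        # Keep a snapshot of the frame as it was before this column is split
        key_name = 'df' + str(col)
        dict_of_df[key_name] = copy.deepcopy(df)
        # Split on commas; the prefix keeps the new column names unique
        var = df[col].str.split(',', expand=True).add_prefix(col)
        # Merge the new columns back on the index and drop the original column
        df = pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
    return df
split_cols(your_dataframe)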
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create a new dataframe for each column by running the split() function on it. Then you merge everything back together on the index. I also included a prefix of the column name, so that the new columns do not have duplicate names and are more easily identifiable, and dropped the old column that the split was done on.
Just import copy, use the split_cols() function that I have created, and pass it your dataframe.

Format of data in a column in a data frame

I have read a fixed width file and created a dataframe.
I have a field called claim number which is of length 15. In the data frame I see this field appearing as "1.902431e+14" rather than the full 15-digit claim number.
How can I resolve this so that I can see the entire 15-digit claim number in the data frame?
For example, use the pandas float_format display option as follows:
import pandas as pd
# Dictionary holding the claim_number data
dictionary = {'claim_number': [1.902431111111141]}
# Specify the display format for floats
pd.options.display.float_format = '{:,.15f}'.format
# Create the dataframe
df = pd.DataFrame(data=dictionary)
If you want to apply your format to one column only, you can use style.format instead of pd.options.display.float_format as follows:
#data claim_number dictionary
dictionary = {'claim_number_1': [1.902431111111141], 'claim_number_2': ['some_string'], 'claim_number_3': [0.2323]}
#Create dataframe
df = pd.DataFrame(data=dictionary)
#style format single column
df.style.format({'claim_number_1': "{:,.15f}"})
More options on how to use style.format can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
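As an alternative sketch, and assuming the claim number is an identifier rather than a quantity to do arithmetic on, the column could be read as text in the first place so that no float conversion happens. The file contents, field widths, and column names below are made up; an in-memory buffer stands in for the real fixed-width file:

```python
import io

import pandas as pd

# Made-up fixed-width data: a 15-character claim number followed by an 8-character amount.
raw = io.StringIO(
    "190243111111114  100.50\n"
    "190243111111115  200.75\n"
)

# dtype=str for claim_number keeps all 15 digits exactly as they appear in the file.
df = pd.read_fwf(
    raw,
    widths=[15, 8],
    names=["claim_number", "amount"],
    dtype={"claim_number": str},
)
```

Reading the field as a string also preserves any leading zeros, which a numeric type would drop.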

Write pandas data to a CSV file if column sums are greater than a specified value

I have a CSV file whose columns are frequency counts of words, and whose rows are time periods. I want to sum the frequencies in each column. Then I want to write to a CSV file only those columns whose sums are greater than or equal to 30, with all of their row values, thus dropping the columns whose sums are less than 30.
I am just learning Python and pandas. I know it is a simple question, but my knowledge is at that level. Your help is most appreciated.
I can read in the CSV file and compute the column sums.
df = pd.read_csv('data.csv')
Excerpt of data file containing 3,874 columns and 100 rows
df.sum(axis = 0, skipna = True)
Excerpt of sums for columns
I am stuck on how to create the output file. It should have the same layout as the input file and contain every row for each column whose sum is greater than or equal to 30, with the columns whose sums are less than 30 dropped. The sums themselves would not be included in the output.
Thanks very much for your help.
So, here is a link showing an excerpt of a file containing 100 rows and 3,857 columns:
It's easiest to do this in two steps:
1. Filter the DataFrame to just the columns you want to save
df_to_save = df.loc[:, (df.sum(axis=0, skipna=True) >= 30)]
.loc is for picking rows/columns based either on labels or conditions; the syntax is .loc[rows, columns], so : means "take all the rows", and the second part is the condition on the columns: the column sums from your question, tested for being greater than or equal to 30.
2. Save the filtered DataFrame to CSV
df_to_save.to_csv('path/to/write_file.csv', header=True, index=False)
Just put your filepath in as the first argument. header=True means the header labels from the table will be written back out to the file, and index=False means the numbered row labels Pandas automatically created when you read in the CSV won't be included in the export.
See this answer here: "How to delete a column in pandas dataframe based on a condition?". Note that the solution for your question doesn't need isnull() before the sum(), as that is specific to their question, which counts NaN values.
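Putting the two steps together, here is a small end-to-end sketch. The word columns and counts are made up, and in-memory buffers stand in for the input and output files so the example runs without touching disk:

```python
import io

import pandas as pd

# Made-up input: three word columns, two time-period rows.
raw = io.StringIO("apple,banana,cherry\n10,1,20\n25,2,10\n")
df = pd.read_csv(raw)

# Column sums: one total per word.
col_sums = df.sum(axis=0, skipna=True)

# Keep all rows, and only the columns whose sum is at least 30.
df_to_save = df.loc[:, col_sums >= 30]

# Write the filtered frame in the same layout as the input, without the sums.
out = io.StringIO()
df_to_save.to_csv(out, header=True, index=False)
csv_text = out.getvalue()
```

With a real file, the StringIO objects would be replaced by the input and output paths.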