keep the extra whitespaces in display of pandas dataframe in jupyter notebook - pandas

In jupyter notebook, extra whitespaces in dataframe are removed. But sometime that is not preferred, e.g.
df=pd.DataFrame({'A':['a b','c'],'B':[1,2]})
df
The result I get:
| | A | B |
|---|-----|---|
| 0 | a b | 1 |
| 1 | c | 2 |
But I want:
| | A | B |
|---|-------|---|
| 0 | a b | 1 |
| 1 | c | 2 |
Is it possible? Thanks

It's actually HTML: pandas dutifully write all the spaces into the HTML markup (the front end format used by Jupyter Notebook). HTML, by default, collapses multiple adjacent whitespaces into one. Use the style object to change this:
df.style.set_properties(**{'white-space': 'pre'})
You unfortunately can't change the default render style of a DataFrame yet. You can write a function to wrap that line:
def print_df(df):
return df.style.set_properties(**{'white-space': 'pre'})
print_df(df)

Related

Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (ie subset of the dataframe)
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
Headers
|---------------|
| [South] |
| |
| [South, East] |
| |
The following one liner manages to return column headers for the entire dataframe.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
df=pd.DataFrame({'North':[8,4,1,7],'South':[1,4,1,3],'East':[8,8,1,7],\
'West':[6,4,2,8]})
df1=df.melt(ignore_index=False)
condition1=df1['variable']=='South'
condition2=df1['variable']=='East'
condition3=df1['value']==1
df1=df1.loc[(condition1|condition2)&condition3]
df1=df1.groupby(df1.index)['variable'].apply(list)
df=df.join(df1)

Missing 1 digit in time column read excel in pandas

I have excel that have format like this
| No | Date | Time | Name | ID | Serial | Total |
| 1 |2021-03-01| 11.45 | AB | 124535 | 5215635 | 50 |
Im trying to convert excel to pandas dataframe using below code
pd.read_excel(r'path', header=0)
pandas read the excel successfully however, I found strange result when I see the column time.
the dataframe have below result
| No | Date | Time | Name | ID | Serial | Total |
| 1.0 |2021-03-01| 11.4 | AB | 124535 | 5215635.0 | 50.0 |
Column Time is missing 1 digit. is my method to read excel is not correct?
read_excel is interpreting your dot-separated time as a float, which is quite expected.
I suggest telling read_excel to see this column as a string and convert it to datetime afterwards:
df = pd.read_excel(r'path', header=0, converters={'Time': str})
df['Time'] = pd.to_datetime(df.Time, format="%H.%M")

Create new column in Pyspark Dataframe by filling existing Column

I am trying to create new column in an existing Pyspark DataFrame. Currently the DataFrame looks as follows:
+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
| B| 10|200|null|null| 20|null|
| C|1000|100| 10|null|null|null|
| A| 100|200| 200| 200| 300| 10|
+----+----+---+----+----+----+----+
I want to fill null values in column M2C with 0 and create a new column Ratio. My expected output would be as follows:
+------+------+-----+------+------+------+------+-------+
| Acct | M1D | M1C | M2D | M2C | M3D | M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
| B | 10 | 200 | null | null | 20 | null | 0 |
| C | 1000 | 100 | 10 | null | null | null | 0 |
| A | 100 | 200 | 200 | 200 | 300 | 10 | 200 |
+------+------+-----+------+------+------+------+-------+
I was trying to achieve my desired results by using following lines of code.
df = df.withColumn('Ratio', df.select('M2C').na.fill(0))
The above line of code resulted in an assertion error as shown below.
AssertionError: col should be Column
The possible solution that I found using this link was to use lit function.
I changed my code to
df = df.withColumn('Ratio', lit(df.select('M2C').na.fill(0)))
The above code led to AttributeError: 'DataFrame' object has no attribute '_get_object_id'
How can I achieve my desired output?
You're doing two things wrong here.
df.select will return a dataframe, not a column.
na.fill will replace null values in all columns, not just in specific columns.
The following code snippet will solve your usecase
from pyspark.sql.functions import col
df = df.withColumn('Ratio', col('M2C')).fillna(0, subset=['Ratio'])

pandas cumcount in pyspark

Currently attempting to convert a script I made from pandas to pyspark, I have a dataframe that contains data in the form of:
index | letter
------|-------
0 | a
1 | a
2 | b
3 | c
4 | a
5 | a
6 | b
I want to create the following dataframe in which the occurrence count for each instance of a letter is stored, for example the first time we see "a" its occurrence count is 0, second time 1, third time 2:
index | letter | occurrence
------|--------|-----------
0 | a | 0
1 | a | 1
2 | b | 0
3 | c | 0
4 | a | 2
5 | a | 3
6 | b | 1
I can achieve this in pandas using:
df['occurrence'] = df.groupby('letter').cumcount()
How would I go about doing this in pyspark? Cannot find an existing method that is similar.
The feature you're looking for is called window functions
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
df.withColumn("occurence", row_number().over(Window.partitionBy("letter").orderBy("index")))

Let pandas use 0-based row number as index when reading Excel files

I am trying to use pandas to process a series of XLS files. The code I am currently using looks like:
with pandas.ExcelFile(data_file) as xls:
data_frame = pandas.read_excel(xls, header=[0, 1], skiprows=2, index_col=None)
And the format of the XLS file looks like
+---------------------------------------------------------------------------+
| REPORT |
+---------------------------------------------------------------------------+
| Unit: 1000000 USD |
+---------------------------------------------------------------------------+
| | | | | Balance |
+ ID + Branch + Customer ID + Customer Name +--------------------------+
| | | | | Daily | Monthly | Yearly |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 1 | Company A | 10 | 5 | 2 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 222222 | Branch2 | 2 | Company B | 20 | 25 | 20 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 3 | Company C | 30 | 35 | 40 |
+--------+---------+-------------+---------------+-------+---------+--------+
Even I explicitly gave index_col=None, pandas still take ID column as the index. I am wondering the right way of making row numbers to be the index.
pandas currently doesn't support parsing a MultiIndex columns without also parsing a row index. Related issue here - it probably could be supported, but this gets tricky to define in a non-ambiguous way.
It's a hack, but the easiest way to work around this right now is to add a blank column on the left side of data, then read it in like this.
pd.read_excel('file.xlsx', header=[0,1], skiprows=2).reset_index(drop=True)
Edit:
If you can't / don't want to modify the files, a couple options are:
If the data has a known / common header, use pd.read_excel(..., skiprows=4, header=None) and assign the columns yourself, suggested by #ayhan.
If you need to parse the header, use pd.read_excel(..., skiprows=2, header=0), then munge the second level of labels into a MultiIndex. This will probably mess up dtypes, so you may also need to do some typecasting (pd.to_numeric) as well.