Pandas: Read CSV with multiple headers

I have the following caret-delimited CSV (the file needs to be in this format):
HEADER^20181130
[Col1]^[Col2]^[Col3]^[Col4]^[Col5]
The^quick^"bro,wn"^fox^jumped
over^the^fat^lazy^dog
m1213^4,12r4^fr,34^,56,gt^12fr,12fr
Trailer^N
and I need to read the file while preserving the order of the headers, so that both header rows are kept.
However, when I try:
df = pd.read_csv(source_file, header=[0,1], sep=r"[|^]", engine='python')
I don't get the expected output (the screenshot of that output is not reproduced here), and if I try:
df = pd.read_csv(source_file, header=[1], sep=r"[|^]", engine='python')
I get only the second header row.
Any way to import this file with both headers? Bonus points if we can remove the opening and closing brackets for the header without removing them elsewhere in the file.
Note: I have sep=r"[|^]" because the file could be delimited with pipes as well.

To keep both header rows, I would suggest creating a pd.MultiIndex from the first two rows of your data.
To do that, you first need to import the data without a header.
import numpy as np
import pandas as pd
df = pd.read_csv('~/Desktop/stackoverflow_data.csv', sep=r"[|^]", header=None, engine='python')
# the ragged rows make pandas infer a three-level index,
# so move those levels back into regular columns
df.reset_index(inplace=True)
df.fillna(np.nan, inplace=True)
df.head()
Output:
  level_0   level_1   level_2       0          1
0  HEADER  20181130       NaN     NaN        NaN
1  [Col1]    [Col2]    [Col3]  [Col4]     [Col5]
2     The     quick  "bro,wn"     fox     jumped
3    over       the       fat    lazy        dog
4   m1213    4,12r4     fr,34  ,56,gt  12fr,12fr
Then you need to zip the first two rows into tuples (removing the square brackets from the second row along the way) and create a MultiIndex object:
# strip the surrounding brackets from the second header row while zipping
cols = tuple(zip(df.iloc[0], df.iloc[1].apply(lambda x: x[1:-1])))
header = pd.MultiIndex.from_tuples(cols, names=['Lvl_1', 'Lvl_2'])
# delete the header rows and assign new header
df.drop([0,1], inplace=True)
df.columns = header
df.head()
This is the output:
Lvl_1  HEADER  20181130       NaN     NaN        NaN
Lvl_2    Col1      Col2      Col3    Col4       Col5
2         The     quick  "bro,wn"     fox     jumped
3        over       the       fat    lazy        dog
4       m1213    4,12r4     fr,34  ,56,gt  12fr,12fr
5     Trailer         N       NaN     NaN        NaN
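If the Trailer row should not end up in the data, a small follow-up sketch (assuming the trailer is always marked by the literal string 'Trailer' in the first field):
# keep only rows whose first column is not the trailer marker;
# iloc avoids spelling out the MultiIndex column tuple
df = df[df.iloc[:, 0] != 'Trailer']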

Related

What is "roundtripping" in the context of Pandas?

The documentation for pandas.read_excel mentions something called 'roundtripping', under the description of the index_col parameter.
Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True.
I have never heard of this term before, and if I search for a definition, I can find one only in the context of finance. I have seen it referred to in the context of merging dataframes in Pandas, but I have not found a definition.
For context, this is the complete description of the index_col parameter:
index_col : int, list of int, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass
None if there is no such column. If a list is passed, those columns
will be combined into a MultiIndex. If a subset of data is selected
with usecols, index_col is based on the subset.
Missing values will be forward filled to allow roundtripping with
to_excel for merged_cells=True. To avoid forward filling the
missing values use set_index after reading the data instead of
index_col.
For a general idea of the meaning of roundtripping, have a look at the answers to this post on SE. Applied to your example, "allow roundtripping" is used to mean something like this:
facilitate an easy back-and-forth between the data in an Excel file
and the same data in a df. I.e. while maintaining the intended
structure throughout.
Example round trip
The usefulness of this idea is perhaps best seen if we start with a somewhat complex df with both index and columns as named MultiIndices (for the constructor, see pd.MultiIndex.from_product):
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.rand(4,4),
                  columns=pd.MultiIndex.from_product([['A','B'],[1,2]],
                                                     names=['col_0','col_1']),
                  index=pd.MultiIndex.from_product([[0,1],[1,2]],
                                                   names=['idx_0','idx_1']))
print(df)
col_0             A                   B
col_1             1         2         1         2
idx_0 idx_1
0     1    0.952749  0.447125  0.846409  0.699479
      2    0.297437  0.813798  0.396506  0.881103
1     1    0.581273  0.881735  0.692532  0.725254
      2    0.501324  0.956084  0.643990  0.423855
If we now use df.to_excel with the default for merge_cells (i.e. True) to write this data to an Excel file, we will end up with data as follows:
df.to_excel('file.xlsx')
Result (screenshot of the spreadsheet omitted): aesthetics aside, the structure here is very clear, and indeed the same as the structure in our df. Take notice especially of the merged cells.
Now, let's suppose we want to retrieve this data again from the Excel file at some later point, and we use pd.read_excel with default parameters. Problematically, we will end up with a complete mess:
df = pd.read_excel('file.xlsx')
print(df)
  Unnamed: 0  col_0         A  Unnamed: 3         B  Unnamed: 5
0        NaN  col_1  1.000000    2.000000  1.000000    2.000000
1      idx_0  idx_1       NaN         NaN       NaN         NaN
2          0      1  0.952749    0.447125  0.846409    0.699479
3        NaN      2  0.297437    0.813798  0.396506    0.881103
4          1      1  0.581273    0.881735  0.692532    0.725254
5        NaN      2  0.501324    0.956084  0.643990    0.423855
Getting this data "back into shape" would be quite time-consuming. To avoid such a hassle, we can rely on the parameters index_col and header inside pd.read_excel:
df2 = pd.read_excel('file.xlsx', index_col=[0,1], header=[0,1])
print(df2)
col_0             A                   B
col_1             1         2         1         2
idx_0 idx_1
0     1    0.952749  0.447125  0.846409  0.699479
      2    0.297437  0.813798  0.396506  0.881103
1     1    0.581273  0.881735  0.692532  0.725254
      2    0.501324  0.956084  0.643990  0.423855
# check for equality
df.equals(df2)
# True
As you can see, we have made a "round trip" here, and index_col and header made it smooth sailing!
Two final notes:
(minor) The docs for pd.read_excel contain a typo in the index_col section: it should read merge_cells=True, not merged_cells=True.
The header section is missing a similar comment (or a reference to the comment at index_col). This is somewhat confusing. As we saw above, the two behave exactly the same (for present purposes, at least).

Change NaN to None in Pandas dataframe

I am trying to replace NaN with None in a pandas dataframe. It used to work with df.where(df.notnull(), None).
Here is the thread for this method:
Use None instead of np.nan for null values in pandas DataFrame
When I try to use the same method on another dataframe, it fails.
The new dataframe prints like this:
  Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0          A        NaN          B          C          D          E
Even the previously working code fails when run against the new dataframe. I am wondering whether the cell format in the Excel source has to be a certain type.
Any suggestions?
This always works for me:
df = df.replace({np.nan: None})
You can check this related question for more background.
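As a quick check that this really produces None objects, a minimal sketch with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan]})
df = df.replace({np.nan: None})
print(df['a'].tolist())  # [1.0, None] - the column is now object dtype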
The problem was that I did not assign the result back. df.where does not modify the dataframe in place, so on its own
df.where(df.notnull(), None)
does nothing visible. Written like this, there is no problem:
df = df.where(df.notnull(), None)
To do it over just one column:
df.col_name.replace({np.nan: None}, inplace=True)
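Note that inplace=True on a single column can run into pandas' chained-assignment pitfalls, and inplace is discouraged in recent pandas; assigning the result back is the safer pattern:
df['col_name'] = df['col_name'].replace({np.nan: None})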
This is not as easy as it looks.
1. NaN is the value set for any cell that is empty when reading a file with pandas.read_csv().
2. None is the value set for any cell that is NULL when reading from a database with pandas.read_sql().
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.replace({np.nan: None})      # replaced columns become object dtype
df['prog'] = df['prog'].astype(str)  # then cast to the dtype you need
print(df)
Replacing np.nan with None turns the affected columns into object dtype, which can cause datatype compatibility issues. In that case, first replace np.nan with None and then cast the column to the required dtype.
In file.csv the column names are batch, prog, name, and the 'prog' column is empty.

Reading from CSV - is it possible to replace all strings with NaN using na_values?

I am using the following line to load a file from CSV. It is very large, and about 1% of its elements are strings, but I don't know what their values are.
Is there a way to use read_csv and na_values or something to replace ALL strings with np.nan?
data = pd.read_csv(fileName,header=None,na_values=['*'])
The method below does not manipulate the data while reading the CSV file; it runs after you have read the file into memory.
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame([[1, 'abc', '123'], ['asdf', 2, 222]], columns=list('ABC'))
# stack all your columns into one long Series
s = df.stack()
# boolean indexing to find all the strings and set them to NaN
# (np.str was removed from NumPy; the builtin str works here)
s.loc[s.apply(type) == str] = np.nan
# unstack the series back to the original shape
new_df = s.unstack()
     A    B    C
0    1  NaN  NaN
1  NaN    2  222
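An alternative sketch, assuming every non-string element is numeric: coerce each column with pd.to_numeric, which turns anything that cannot be parsed as a number into NaN:
import pandas as pd

# the file name here is a placeholder; header=None as in the question
data = pd.read_csv('file.csv', header=None)
# errors='coerce' replaces unparseable values (the strings) with NaN
data = data.apply(pd.to_numeric, errors='coerce')
Unlike the stack approach above, note that numeric-looking strings such as '123' would be parsed as numbers rather than set to NaN.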

Instead of appending value as a new column on the same row, pandas adds a new column AND new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but I get is
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how it's done. Any advice?
This is because the Series you are appending has an incompatible index. The original dataframe has ['a'] as its index, while the Series gets the default index [0], so append lines it up as a new row rather than a new column. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
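One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea would be written with pd.concat (a sketch under that assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)),
                  columns=['col1', 'col2', 'col3'], index=['a'])
df2 = pd.concat([pd.DataFrame(), df])  # modern replacement for .append
df2['label'] = 1                       # add the new column across all rows
print(df2)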

Removing invalid data from a pandas dataframe (Python)

I have a file with data that I'm trying to put into graphs and such. In some parts of the data there are '-' characters that represent data that was not collected. I know data.dropna() would normally do the job, but here the missing data is represented by '-' instead of NaN.
Suppose I have a csv file test.csv that looks like
col1,col2,col3
1,-,2
-,3,4
I can tell pd.read_csv to treat '-' as NaN when the file is read in:
df = pd.read_csv('test.csv', na_values=['-'])
df
col1 col2 col3
0 1.0 NaN 2
1 NaN 3.0 4
From there, you can dropna as normal.
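For example, continuing with the df above: in this particular frame every row contains a NaN, so a blanket dropna() would remove everything; subset= restricts the check to chosen columns:
df.dropna(subset=['col1'])
   col1  col2  col3
0   1.0   NaN     2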