Read TXT or DAT file in Python - pandas

I need to read a .DAT or .TXT file, extract the column names, assign them new names, and write the data to a pandas dataframe.
I have an environment variable called 'filetype', and based on its value (DAT or TXT) I need to read the file accordingly, extract the column names from it, and assign new column names.
My input .dat/.txt file has just 2 columns and looks like this:
LN_ID,LN_DT
1234,10/01/2020
4567,10/01/2020
8888,10/01/2020
9999,10/01/2020
Read the above file, create new columns new_ln_id (from LN_ID) and new_ln_dt (from LN_DT), and write the data to a pandas dataframe.
I've tried something like the code below with pandas, but it's giving an error. I also want to check first whether my file is .dat or .txt, based on the environment variable 'filetype' value, and then proceed.
df=pd.read_csv('myfile.dat',sep=',')
new_cols=['new_ln_id','new_ln_dt']
df.columns=new_cols
I think there could be a better and easier way. I'd appreciate it if anyone can help. Thanks!

It is unclear from your question whether you want two new empty columns or whether you want to replace the existing names. Either way, you can do this for a dataframe dte given by:
Add columns
LN_ID LN_DT
0 1234 10/01/2020
1 4567 10/01/2020
2 8888 10/01/2020
3 9999 10/01/2020
Define the new columns
cols = ['new_ln_id','new_ln_dt']
and then concatenate:
print(pd.concat([dte,pd.DataFrame(columns=cols)]))
which gives
LN_ID LN_DT new_ln_id new_ln_dt
0 1234.0 10/01/2020 NaN NaN
1 4567.0 10/01/2020 NaN NaN
2 8888.0 10/01/2020 NaN NaN
3 9999.0 10/01/2020 NaN NaN
Replace column names
df = df.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})
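Note that rename returns a new DataFrame by default, which is why the result is assigned back above; alternatively you can pass inplace=True:
df.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"}, inplace=True)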
Thanks for your response, and sorry for the confusion. I want to rename the 2 columns. But first I want to check whether it's a .dat or .txt file, based on a Unix environment variable called 'filetype'.
For example: if filetype is 'TXT' or 'DAT', then read the input file (say 'abc.txt' or 'abc.dat') into a new pandas dataframe and rename the 2 columns. I hope that's clear.
Here is what I did. I've created a function that checks whether the filetype is "dat" or "txt", reads the file into a pandas dataframe, and then renames the 2 columns. The function is loading the data, but it's not renaming the columns as required. I'd appreciate it if anyone can point out what I am missing.
filetype=os.environ['TYPE']
print(filetype)
DAT
def load(file_type):
    if file_type.lower()=="dat":
        df=pd.read_csv(input_file, sep=',',engine='python')
        if df.columns[0]=="LN_ID":
            df.columns[0]="new_ln_id"
        if df.columns[1]=="LN_DT":
            df.columns[1]="new_ln_dt"
        return(df)
    else:
        if file_type.lower()=="txt":
            df=pd.read_csv("infile",sep=",",engine='python')
            if df.columns[0]=="LN_ID":
                df.columns[0]="new_ln_id"
            if df.columns[1]=="LN_DT":
                df.columns[1]="new_ln_dt"
            return(df)
load(filetype)
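A likely cause (a guess, since the error output isn't shown): even when the if checks match, item assignment such as df.columns[0]="new_ln_id" fails because a pandas Index is immutable; and if the checks never match (for example due to stray whitespace in the header), the function returns the data unrenamed. A sketch of a corrected version, assuming the same input_file variable from the snippet above:
def load(file_type, input_file):
    # Both .dat and .txt are plain comma-separated text here, so the read is identical.
    if file_type.lower() in ("dat", "txt"):
        df = pd.read_csv(input_file, sep=',')
        # rename ignores mapping keys that are not present, so this is safe either way.
        return df.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})
    raise ValueError("Unsupported file type: " + file_type)

df = load(filetype, 'myfile.dat')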
Alternative
import os
from os import listdir
from os.path import isfile, join

# path points at the directory containing the input file
onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
filename = os.path.join(path, onlyfiles[0])
if filename.endswith('.txt'):
    dte = pd.read_csv(filename, sep=",")
elif filename.endswith('.dat'):
    dte = pd.read_csv(filename, sep=",")
dte = dte.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})

Related

What is "roundtripping" in the context of Pandas?

The documentation for pandas.read_excel mentions something called 'roundtripping', under the description of the index_col parameter.
Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True.
I have never heard of this term before, and if I search for a definition, I can find one only in the context of finance. I have seen it referred to in the context of merging dataframes in Pandas, but I have not found a definition.
For context, this is the complete description of the index_col parameter:
index_col : int, list of int, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass
None if there is no such column. If a list is passed, those columns
will be combined into a MultiIndex. If a subset of data is selected
with usecols, index_col is based on the subset.
Missing values will be forward filled to allow roundtripping with
to_excel for merged_cells=True. To avoid forward filling the
missing values use set_index after reading the data instead of
index_col.
For a general idea of the meaning of roundtripping, have a look at the answers to this post on SE. Applied to your example, "allow roundtripping" is used to mean something like this:
facilitate an easy back-and-forth between the data in an Excel file
and the same data in a df. I.e. while maintaining the intended
structure throughout.
Example round trip
The usefulness of this idea is perhaps best seen if we start with a somewhat complex df with both index and columns as named MultiIndices (for the constructor, see pd.MultiIndex.from_product):
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.rand(4,4),
                  columns=pd.MultiIndex.from_product([['A','B'],[1,2]],
                                                      names=['col_0','col_1']),
                  index=pd.MultiIndex.from_product([[0,1],[1,2]],
                                                   names=['idx_0','idx_1']))
print(df)
col_0 A B
col_1 1 2 1 2
idx_0 idx_1
0 1 0.952749 0.447125 0.846409 0.699479
2 0.297437 0.813798 0.396506 0.881103
1 1 0.581273 0.881735 0.692532 0.725254
2 0.501324 0.956084 0.643990 0.423855
If we now use df.to_excel with the default for merge_cells (i.e. True) to write this data to an Excel file, we will end up with data as follows:
df.to_excel('file.xlsx')
Result (screenshot of the Excel file not shown here):
Aesthetics aside, the structure here is very clear, and indeed, the same as the structure in our df. Take notice of the merged cells especially.
Now, let's suppose we want to retrieve this data again from the Excel file at some later point, and we use pd.read_excel with default parameters. Problematically, we will end up with a complete mess:
df_messy = pd.read_excel('file.xlsx')
print(df_messy)
Unnamed: 0 col_0 A Unnamed: 3 B Unnamed: 5
0 NaN col_1 1.000000 2.000000 1.000000 2.000000
1 idx_0 idx_1 NaN NaN NaN NaN
2 0 1 0.952749 0.447125 0.846409 0.699479
3 NaN 2 0.297437 0.813798 0.396506 0.881103
4 1 1 0.581273 0.881735 0.692532 0.725254
5 NaN 2 0.501324 0.956084 0.643990 0.423855
Getting this data "back into shape" would be quite time-consuming. To avoid such a hassle, we can rely on the parameters index_col and header inside pd.read_excel:
df2 = pd.read_excel('file.xlsx', index_col=[0,1], header=[0,1])
print(df2)
col_0 A B
col_1 1 2 1 2
idx_0 idx_1
0 1 0.952749 0.447125 0.846409 0.699479
2 0.297437 0.813798 0.396506 0.881103
1 1 0.581273 0.881735 0.692532 0.725254
2 0.501324 0.956084 0.643990 0.423855
# check for equality
df.equals(df2)
# True
As you can see, we have made a "round trip" here, and index_col and header allow for it to have been smooth sailing!
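To make the round trip explicit in both directions, a small extra check (not part of the original example, using a hypothetical second file name) writes df2 back out and reads it in again with the same parameters:
df2.to_excel('file2.xlsx')
df3 = pd.read_excel('file2.xlsx', index_col=[0, 1], header=[0, 1])
df2.equals(df3)
# expected: True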
Two final notes:
(minor) The docs for pd.read_excel contain a typo in the index_col section: it should read merge_cells=True, not merged_cells=True.
The header section is missing a similar comment (or a reference to the comment at index_col). This is somewhat confusing. As we saw above, the two behave exactly the same (for present purposes, at least).

Change NaN to None in Pandas dataframe

I am trying to replace NaN with None in a pandas dataframe. It used to work with df.where(df.notnull(), None).
Here is the thread for this method.
Use None instead of np.nan for null values in pandas DataFrame
When I tried to use the same method on another dataframe, it failed.
The new dataframe contains a single row, A NaN B C D E, and the printout of the dataframe looks like this:
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 A NaN B C D E
Even when I run the previously working code against the new dataframe, it fails.
I am just wondering whether it is because, in the Excel file, the cell format has to be a certain type.
Any suggestions on this?
This always works for me
df = df.replace({np.nan:None})
You can check this related question; credit goes to the answer there.
The problem is that I did not assign the result back.
The form I used that caused the problem was
df.where(df.notnull(), None)
If I write the code like this, there is no problem:
df = df.where(df.notnull(), None)
To do it just over one column
df.col_name.replace({np.nan: None}, inplace=True)
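Note (an added caveat, not from the original answer) that calling replace with inplace=True on a column accessed as an attribute can raise a SettingWithCopyWarning and, depending on the pandas version, may not modify the original dataframe; assigning the result back is the safer form:
df['col_name'] = df['col_name'].replace({np.nan: None})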
This is not as easy as it looks.
1. NaN is the value set for any cell that is empty when we read a file using pandas.read_csv().
2. None is the value set for any cell that is NULL when we read data using pandas.read_sql() or from a database.
import pandas as pd
import numpy as np
x=pd.DataFrame()
df=pd.read_csv('file.csv')
df=df.replace({np.NaN:None})
df['prog']=df['prog'].astype(str)
print(df)
If there is a datatype compatibility issue, it is because replacing np.nan with None turns the affected dataframe column into object dtype.
So in that case, first replace np.nan with None and then cast the column to the required datatype.
file.csv
column names: batch, prog, name
the 'prog' column is empty

Saving a kdb table to a dataframe then saving the dataframe to a csv. null and string values outputting to csv incorrectly?

I am saving a kdb table to a dataframe, then saving the dataframe to a csv. This works; however, both in the csv file and when I print(dataframe), null values are showing as b"" and all other string values are showing as b'STRING'.
Running Python 3.7, pandas 0.24.2 and qpython 2.0.0.
df = pandas.DataFrame(qpython query)
df.to_csv(path_or_buf="",
          sep=",", na_rep='',
          float_format=None,
          columns=None,
          header=True, index=False,
          index_label=None, mode='w+', compression=None, quoting=None, quotechar='"',
          line_terminator="\n", chunksize=50, tupleize_cols=None, date_format=None,
          doublequote=True,
          escapechar=None, decimal='.', encoding='utf-8')
I expected the KDB table to output to the csv correctly, with nulls being an empty column and strings just showing the string, without " b'STRING' ".
Any advice or help would be greatly appreciated. If anyone needs any more information, I'd be happy to provide.
Example in the csv:
Null cells show as: b""
Cells containing strings show as: b'Euro', when in fact they should just show Euro
qPython has some functionality for converting a kdb table to a pandas dataframe. I begin by creating a table "t" in kdb that has 4 columns, where the third column is a column of symbols and the 4th is a column of characters. The entries in the first row are entirely nulls.
t:([] a: 0N, til 99; b: 0Nf, 99?1f; c: `, 99?`3; d: " ", 99?" ")
a b c d
-----------------
0 0.4123573 iee x
1 0.8397208 app l
2 0.3392927 ncm w
3 0.285506 pjn c
The table can then be read into Python using QConnection. If we convert the table to a dataframe after it is read in we can see that the symbols and chars are converted to bytes and the nulls are not converted correctly.
df=pandas.DataFrame(q('t'))
df.head()
a b c d
0 -9223372036854775808 NaN b'' b' '
1 0 0.412357 b'iee' b'x'
2 1 0.839721 b'app' b'l'
3 2 0.339293 b'ncm' b'w'
4 3 0.285506 b'pjn' b'c'
However if we use the pandas=True argument with our q query then most of the table is converted appropriately as desired:
df=q('t', pandas=True)
df.head()
a b c d
0 NaN NaN b''
1 0.0 0.412357 b'iee' x
2 1.0 0.839721 b'app' l
3 2.0 0.339293 b'ncm' w
4 3.0 0.285506 b'pjn' c
However notice that entries stored as symbols in kdb are not converted as desired. In this case the following code will manually decode any columns specified in string_cols from bytes into strings using a similar method to the one suggested by Callum.
string_cols = ['c']
df[string_cols] = df[string_cols].applymap(lambda s : s.decode('utf-8'))
giving an end result of:
df.head()
a b c d
0 NaN NaN
1 0.0 0.412357 iee x
2 1.0 0.839721 app l
3 2.0 0.339293 ncm w
4 3.0 0.285506 pjn c
Which can easily be converted to a csv file.
Hope this helps
I would have expected strings in kdb to be handled fine, as qPython should convert null strings to Python null strings. Null symbols, however, are converted to _QNULL_SYM. In this case, I think the 'b' prefix indicates a byte literal. You can try to decode the byte objects before saving to a csv.
Normally in Python I would do something along the following lines:
df['STRINGCOL'] = df['STRINGCOL'].apply(lambda s: s.decode('utf-8'))
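To handle the general case (a sketch, not from the original answer), you could decode every value that is a bytes object across the whole dataframe and leave everything else untouched:
def decode_bytes(value):
    # Decode bytes to str; leave any non-bytes value as-is.
    return value.decode('utf-8') if isinstance(value, bytes) else value

for col in df.columns:
    df[col] = df[col].apply(decode_bytes)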
I don't have much experience with QPYTHON but I believe using qnull() will convert the null to a pythonic value.
df['STRINGCOL'] = df['STRINGCOL'].apply(lambda s: qnull(s))

Python Pandas extract columns from .csv

I have a .csv file, which I can read in Pandas. The .csv file looks like the following.
a b c d e
1 4 3 2 5
6 7 8 3 6
...
What I need to achieve is to extract a and b as column vectors and
[c d e] as a matrix. I used Pandas with the following code to read the .csv file:
pd.read_csv('data.csv', sep=',',header=None)
But this will give me a vector like this: [[a,b,c,d,e],[1,4,3,2,5],...]
How can I extract the columns? I heard about df.iloc, but this cannot be used here, since after pd.read_csv there is only one column.
You should be able to do that with:
ds = pd.read_csv('data.csv', sep=',',header=0)
column_a = ds["a"]
matrix = ds[["c","d","e"]]
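If you need plain NumPy arrays rather than pandas objects (e.g. a true vector and matrix), you can convert the selections; a small sketch reusing the same column names:
ds = pd.read_csv('data.csv', sep=',', header=0)
vector_a = ds["a"].to_numpy()               # 1-D array, shape (n,)
vector_b = ds["b"].to_numpy()               # 1-D array, shape (n,)
matrix = ds[["c", "d", "e"]].to_numpy()     # 2-D array, shape (n, 3)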

Why is column name not going over actual column and creating new columns in dataframe?

I am assigning column names to a dataframe in pandas, but the column names are creating new columns. How do I get around this issue?
What dataframe looks like now:
abs_subdv_cd abs_subdv_desc
0 A0001A ASHTON ... NaN
1 A0002A J. AYERS ... NaN
2 A0003A NEWTON ALLSUP ... NaN
3 A0004A M. AUSTIN ... NaN
4 A0005A RICHARD W. ALLEN ... NaN
What I want dataframe look like:
abs_subdv_cd abs_subdv_desc
0 A0001A ASHTON
1 A0002A J. AYERS
2 A0003A NEWTON ALLSUP
3 A0004A M. AUSTIN
4 A0005A RICHARD W. ALLEN
code so far:
import pandas as pd
###Declaring path###
path = ('file_path')
###Calling file in folder###
appraisal_abstract_subdv = pd.read_table(path + '/2015-07-28_003820_APPRAISAL_ABSTRACT_SUBDV.txt',
                                         encoding='iso-8859-1', error_bad_lines=False,
                                         names=['abs_subdv_cd', 'abs_subdv_desc'])
print(appraisal_abstract_subdv.head())
Edit:
When I try appraisal_abstract_subdv.shape, the dataframe is showing its shape as (4000, 1), whereas the data has two columns.
This is an example of the data I am using:
A0001A ASHTON
A0002A J. AYERS
Thank you in advance.
It looks like your data file uses a different delimiter (not a TAB, which is the default separator for pd.read_table()), so try the sep='\s+' or delim_whitespace=True parameter.
In order to check your columns after reading your data file, do the following:
print(df.columns.tolist())
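One caveat for this particular data (an added note, not from the answer above): the description column itself contains spaces (e.g. 'J. AYERS'), so splitting on every run of whitespace would create extra fields. A sketch of a workaround is to split each line only on the first run of whitespace and build the dataframe from that:
import pandas as pd

filepath = path + '/2015-07-28_003820_APPRAISAL_ABSTRACT_SUBDV.txt'
with open(filepath, encoding='iso-8859-1') as fh:
    # Split each non-blank line on the first run of whitespace only,
    # so descriptions such as 'J. AYERS' stay in one field.
    rows = [line.rstrip('\n').split(None, 1) for line in fh if line.strip()]

appraisal_abstract_subdv = pd.DataFrame(rows, columns=['abs_subdv_cd', 'abs_subdv_desc'])
print(appraisal_abstract_subdv.shape)  # should now report two columns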
There is a rename function in pandas that you can use. First inspect the current column names with
appraisal_abstract_subdv.columns.values
and then, with those column names, use this method to rename them appropriately:
df.rename(columns={'OldColumn1': 'Newcolumn1', 'OldColumn2': 'Newcolumn2'}, inplace=True)