Read csv in pandas with different separator (commas) - pandas

I want to read a CSV file and save it as a DataFrame in pandas.
But I have a problem because I have rows like this:
BG,6141.6,6141.6,,3.0,,,ic
As you can see, there are three separators: ',,,', ',,' and ','.
How can I load it correctly into pandas?

Use the regex separator [,]+ (one or more commas), which needs engine='python':
import pandas as pd
from io import StringIO
temp=u"""iBG,6141.6,6141.6,,3.0,,,ic"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="[,]+", header=None, engine='python')
print(df)
0 1 2 3 4
0 iBG 6141.6 6141.6 3.0 ic
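One thing to keep in mind is that a regex separator collapses empty fields, so columns can shift if different rows have their gaps in different places. The following is only a sketch, using the row from the question as sample data, that compares the regex approach with the default parser, which keeps empty fields as NaN:

```python
from io import StringIO

import pandas as pd

# Sample row from the question; in real use pass a file path instead.
raw = "BG,6141.6,6141.6,,3.0,,,ic"

# Regex separator: a run of commas counts as one delimiter, so empty
# fields disappear (this needs the slower python engine).
collapsed = pd.read_csv(StringIO(raw), sep="[,]+", header=None, engine="python")

# Default separator: every comma delimits a field, so empty fields are
# kept as NaN and column positions stay aligned across rows.
aligned = pd.read_csv(StringIO(raw), header=None)

print(collapsed.shape)  # (1, 5)
print(aligned.shape)    # (1, 8)
```

If the empty columns are empty in every row, calling aligned.dropna(axis=1, how='all') afterwards gives the same five columns without the regex.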

Related

assigning csv file to a variable name

I have a .csv file, and I use pandas to read it.
import pandas as pd
from pandas import read_csv
data=read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data later, for example by doing something like
new_data= df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read in your data as one column (by using a 'wrong' separator) and then rename the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you wish, you could then add other files by doing the same thing (with a different name for df at first) and concatenating the result with df.
A more useful approach IMHO would be to add the dataframe to a dictionary (a list would also be possible).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv'].
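The loop mentioned in the comment above could look roughly like this. It is only a sketch: the temporary directory and the file names a.csv and b.csv are made up so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample files, created here only so the sketch is runnable.
workdir = tempfile.mkdtemp()
for name, rows in [("a.csv", "x,y\n1,2\n"), ("b.csv", "x,y\n3,4\n5,6\n")]:
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(rows)

# Read every CSV file in the directory into a dict keyed by file name.
data_dict = {}
for name in sorted(os.listdir(workdir)):
    if name.endswith(".csv"):
        data_dict[name] = pd.read_csv(os.path.join(workdir, name))

print(list(data_dict))           # ['a.csv', 'b.csv']
print(data_dict["b.csv"].shape)  # (2, 2)
```

Each file keeps its own columns and dtypes this way, which the single-column trick does not.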

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get back only the first result (based on the index), like this in pandas:
df.loc[df.col_1 >3].iloc[0]
col_1 col_2
2 4 d
I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result like in SQL?
Got it, but I'm not sure about the efficiency here:
tmp = df.loc[df.col_1 >3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()

How to add/edit text in pandas.io.parsers.TextFileReader

I have a large CSV file. Since it is so large (almost 7 GB), it cannot be loaded into a single pandas dataframe.
import pandas as pd
df1 = pd.read_csv('tblViewPromotionDataVolume_202004070600.csv', sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
df1 is of type pandas.io.parsers.TextFileReader
Now I want to edit/add/insert some text (a new row) into this file and convert it back to a pandas dataframe. Please let me know of possible solutions. Thanks in advance.
Each chunk in the loop is a DataFrame, so do your processing on it. To write the result to a file, use DataFrame.to_csv with mode='a' (append mode):
import pandas as pd
import os
infile = 'tblViewPromotionDataVolume_202004070600.csv'
outfile = 'out.csv'
df1 = pd.read_csv(infile, sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
    # process the chunk here
    # https://stackoverflow.com/a/30991707/2901002
    # if the file does not exist yet, write the first chunk with its header
    if not os.path.isfile(outfile):
        chunk.to_csv(outfile, sep='\t')
    else:  # the file exists, so append without writing the header again
        chunk.to_csv(outfile, sep='\t', mode='a', header=False)
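To make the "edit and add a row" part concrete, here is a small self-contained sketch. The file names, the columns id and value, and the upper-casing step are all invented for illustration. A tiny sample file stands in for the 7 GB one.

```python
import os
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "in.csv")    # hypothetical input file
outfile = os.path.join(workdir, "out.csv")  # hypothetical output file

# Small tab-separated sample standing in for the large file.
pd.DataFrame({"id": range(5), "value": list("abcde")}).to_csv(
    infile, sep="\t", index=False
)

first = True
for chunk in pd.read_csv(infile, sep="\t", chunksize=2):
    # Example edit: upper-case one column in each chunk.
    chunk["value"] = chunk["value"].str.upper()
    # Write the header once, then append the remaining chunks.
    chunk.to_csv(outfile, sep="\t", index=False,
                 mode="w" if first else "a", header=first)
    first = False

# Add a new row by appending it after the last chunk.
new_row = pd.DataFrame({"id": [99], "value": ["NEW"]})
new_row.to_csv(outfile, sep="\t", index=False, mode="a", header=False)

result = pd.read_csv(outfile, sep="\t")
print(len(result))  # 6
```

Using a first flag instead of os.path.isfile means a leftover out.csv from an earlier run gets overwritten rather than appended to.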

Concatenate a pandas dataframe to CSV file without reading the entire file

I have a quite large CSV file. I have a pandas dataframe that has exactly the same columns as the CSV file.
I checked on Stack Overflow and saw several answers suggesting to read_csv the file, concatenate the resulting dataframe with the current one, and then write it back to a CSV file.
But for a large file I don't think that is the best way.
Can I concatenate a pandas dataframe to an existing CSV file without reading the whole file?
Update: Example
import pandas as pd
df1 = pd.DataFrame ({'a':1,'b':2}, index = [0])
df1.to_csv('my.csv')
df2 = pd.DataFrame ({'a':3, 'b':4}, index = [1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
df1 = pd.DataFrame ({'a':1,'b':2}, index = [0])
df1.to_csv('my.csv')
df2 = pd.DataFrame ({'a':3, 'b':4}, index = [1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv #Windows machine use 'type' command or on unix use 'cat'
Output:
,a,b
0,1,2
1,3,4
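If the same code has to work whether or not the file already exists, the header can be made conditional. This is only a sketch: append_to_csv is a made-up helper name and the temporary path is just there so it runs anywhere.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Append df to path, writing the header only if the file is new."""
    is_new = not os.path.isfile(path)
    df.to_csv(path, mode="a", header=is_new)


path = os.path.join(tempfile.mkdtemp(), "my.csv")  # hypothetical location
append_to_csv(pd.DataFrame({"a": 1, "b": 2}, index=[0]), path)
append_to_csv(pd.DataFrame({"a": 3, "b": 4}, index=[1]), path)

combined = pd.read_csv(path, index_col=0)
print(combined)
```

Note that nothing checks that the appended columns are in the same order as the ones already in the file, so that is still up to the caller.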

How to shift the column headers in pandas

I have .txt files I'm reading in with pandas and the header line starts with '~A'. I need to ignore the '~A' and have the next header correspond to the data in the first column. Thanks!
You can do this:
import pandas as pd
data = pd.read_csv("./test.txt", names=[ 'A', 'B' ], skiprows=1)
print(data)
and the output for input:
~A, A, B
1, 2
3, 4
is:
c:\Temp\python>python test.py
A B
0 1 2
1 3 4
You have to name the columns yourself, but given that your file seems to be malformed, I guess that is not too bad.
If your header lines are not the same in all files, then you can just read them in Python:
import pandas as pd
# read the first line
with open("./test.txt") as myfile:
    headRow = next(myfile)
# read the column names
columns = [x.strip() for x in headRow.split(',')]
# let pandas parse the rest, dropping the leading '~A' name
data = pd.read_csv("./test.txt", names=columns[1:], skiprows=1)
print(data)