How to add/edit text in pandas.io.parsers.TextFileReader - pandas

I have a large file in CSV. Since it is a large file (almost 7 GB), it cannot be loaded into a single pandas DataFrame.
import pandas as pd

df1 = pd.read_csv('tblViewPromotionDataVolume_202004070600.csv', sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
df1 is of type pandas.io.parsers.TextFileReader.
Now I want to edit/add/insert some text (a new row) into this file, and convert it back to a pandas DataFrame. Please let me know of possible solutions. Thanks in advance.

Each chunk here is a DataFrame, so process it directly; then write to the output file with DataFrame.to_csv, using mode='a' for append mode:
import pandas as pd
import os

infile = 'tblViewPromotionDataVolume_202004070600.csv'
outfile = 'out.csv'

df1 = pd.read_csv(infile, sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
    # processing with chunk
    # https://stackoverflow.com/a/30991707/2901002
    # if file does not exist, write header with first chunk
    if not os.path.isfile(outfile):
        chunk.to_csv(outfile, sep='\t')
    else:
        # file exists, so append without writing the header
        chunk.to_csv(outfile, sep='\t', mode='a', header=False)
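To actually insert a new row while chunking, one option is to append the row to whichever chunk should contain it, before that chunk is written out. A minimal sketch — the column names 'a'/'b' and the values are placeholders, not from the real file:

```python
import pandas as pd

# hypothetical chunk; in the real loop this is one DataFrame yielded by read_csv
chunk = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# build the new row as a one-row DataFrame with the same columns
new_row = pd.DataFrame([{'a': 99, 'b': 100}])

# append it to the chunk; ignore_index renumbers the rows 0..n-1
chunk = pd.concat([chunk, new_row], ignore_index=True)
print(chunk)
```

The modified chunk is then written with to_csv exactly as in the loop above, so the full file is never held in memory.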

Related

adding source.csv file name to first column while merging multiple csv

I have thousands of CSV files, which I have to merge before analysis. I want to add a column holding the source file name, with each row getting a sub-number (for example SRR1313_1, SRR1313_2, SRR1313_3 and so on).
One limitation of merging huge data is RAM, which results in an error. Here are the commands I tried, but I can't add the sub-numbering after the file name:
import pandas as pd
import glob
import os

path = '/home/shoeb/Desktop/vergDB/DATA'
folder = glob.glob(os.path.join(path, "*.csv"))
out_file = "final_big_file.tsv"

first_file = True
for file in folder:
    df = pd.read_csv(file, skiprows=1)
    df = df.dropna()
    df.insert(loc=0, column="sequence_id", value=file[0:10])
    if first_file:
        df.to_csv(out_file, sep="\t", index=False)
        first_file = False
    else:
        df.to_csv(out_file, sep="\t", index=False, header=False, mode='a')
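For the missing sub-numbering, one hedged sketch: pass a list of per-row ids to df.insert instead of a single scalar, so each row gets its own suffix. The sample data, file name, and slice length below are made up — adjust them to your naming scheme:

```python
import pandas as pd

# hypothetical stand-in for one source file's contents
df = pd.DataFrame({'seq': ['AAA', 'CCC', 'GGG']})
file = 'SRR1313.csv'  # hypothetical file name

base = file[0:7]  # e.g. 'SRR1313'; adjust the slice to your file names
# one id per row: SRR1313_1, SRR1313_2, SRR1313_3, ...
df.insert(loc=0, column='sequence_id',
          value=[f'{base}_{i}' for i in range(1, len(df) + 1)])
print(df)
```

This drops into the loop above in place of the scalar `value=file[0:10]`, keeping the same append-mode writing, so RAM usage stays at one file at a time.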

How can I sort and replace the rank of each item instead of its value before sorting in a merged final csv?

I have 30 different csv files, and each row begins with a date plus some similar features measured for each of the 30 items daily. The value of each feature is not important, but the rank it gains after sorting within each day is important. How can I get one merged CSV from the 30 separate CSVs, with the rank of each feature?
If your files are all the same format, you can loop over them and consolidate into a single DataFrame. From there you can manipulate it as needed:
import pandas as pd
import glob

path = r'C:\path_to_files\files'  # use your path
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
Other examples of the same thing: Import multiple csv files into pandas and concatenate into one DataFrame.

Concatenate a pandas dataframe to CSV file without reading the entire file

I have a quite large CSV file. I have a pandas DataFrame that has exactly the same columns as the CSV file.
I checked on Stack Overflow, and several answers suggest to read_csv the file, concatenate the read DataFrame with the current one, then write back to a CSV file.
But for a large file I think this is not the best way.
Can I concatenate a pandas DataFrame to an existing CSV file without reading the whole file?
Update: Example
import pandas as pd

df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv  # on Windows use the 'type' command; on Unix use 'cat'
Output:
,a,b
0,1,2
1,3,4

How do I combine multiple pandas dataframes into an HDF5 object under one key/group?

I am parsing data from a large CSV, sized 800 GB. For each line of data, I save it as a pandas DataFrame.
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse: create dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this into an HDF5 format, and query the h5 as if it was the entire csv file.
My approach so far has been:
import pandas as pd

store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4", ..., "COL55"]

readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse: create dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to save each dataframe df into the HDF5 under one key. However, this fails:
Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.
import pandas as pd

store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4", ..., "COL55"]

readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse: create dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])  # builds one big DataFrame
and now store into HDF5 format
store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
However, I don't think I have the RAM/storage to hold all the csv lines in total_df before saving to HDF5 format.
So, how do I append each "single-line" df into an HDF5 so that it ends up as one big dataframe (like the original csv)?
EDIT: Here's a concrete example of a csv file with different data types:
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....
Your code should work; can you try the following code:
import pandas as pd
import numpy as np

store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]

for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
store.close()
If the code works, then there is something wrong with your data.
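An alternative worth noting: rather than assembling one-row DataFrames with csv.reader, pandas can chunk the CSV itself, and each chunk can be passed straight to store.append. A sketch under the assumption that the file parses as a regular CSV — the small in-memory text below stands in for the 800 GB file:

```python
import io
import pandas as pd

# hypothetical in-memory CSV standing in for the huge file
csv_text = "COL1,COL2\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total_rows = 0
# chunksize makes read_csv yield DataFrames piece by piece,
# so only one chunk is held in memory at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # in the real script, each chunk would go to
    # store.append(hdf5_key, chunk, data_columns=csv_columns, index=False)
    total_rows += len(chunk)
print(total_rows)
```

This avoids both the per-line DataFrame overhead and the giant total_df, since chunks are appended to the store as they are read.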

Reading variable column and row structure to Pandas by column amount

I need to create a Pandas DataFrame from a large file with space-delimited values and a row structure that depends on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
On one line or several, line breaks are ignored.
End result looks like this, if number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.
Currently I split the file into one big list and split it into rows:
from itertools import izip_longest  # Python 2; on Python 3 use zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
(code is from itertools examples)
Problem is, I end up with multiple copies of the data in memory. With 500MB+ files this eats up the memory fast and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there an other way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re

filename = 'raw_data.csv'

# note: subclassing the built-in file type only works on Python 2
class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []

    def next(self):
        if not self.buffer:
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        if self.buffer:
            line = self.buffer.pop()
        return line

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
    print(len(df))
When I try using FileLike on a 5.8M file (consisting of 200000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import re

filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
    print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to separate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
df = pd.DataFrame(s.values.reshape(len(s) // n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s) // 3, 3))
Out[3]:
0 1 2
0 2008231 4891866 383842
1 2036693 4924388 375170
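On recent pandas the squeeze keyword and the float division in reshape no longer work as in the transcript above, so a simpler sketch is to split the raw text into whitespace tokens and reshape with numpy, which ignores line breaks for free. The raw string below reuses the question's sample values; n is assumed to come from the file's meta section:

```python
import numpy as np
import pandas as pd

# the six sample values, split across two lines as in the raw file
raw = "2008231.0 4891866.0 383842.0\n2036693.0 4924388.0 375170.0"
n = 3  # column count, taken from the file's meta section

# str.split with no argument splits on any whitespace, newlines included
values = np.array(raw.split(), dtype=float)
df = pd.DataFrame(values.reshape(-1, n))
print(df.shape)
```

For a real file, `raw` would be read with `open(filename).read()`; the reshape is a view over one contiguous array, so only a single copy of the data is held.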