How to read a header-less CSV with variable-length records using pandas

I have a CSV file with no header row and a variable number of fields per line.
Each record can have up to 398 fields, and I want to keep only the first 256 fields in my dataframe, since those are the only ones I need to process.
Below is a slim version of the file:
1,2,3,4,5,6
12,34,45,65
34,34,24
From the sample above I would like to keep only 3 fields (analogous to the 256 above) from each line while calling read_csv.
I tried the following:
import pandas as pd
df = pd.read_csv('sample.csv',header=None)
I get the following error, because pandas uses the first line to infer the number of columns:
File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10
The only solution I can think of is passing
names = ['column1','column2','column3','column4','column5','column6']
while creating the dataframe.
But for the real files, which can be up to 50 MB, I don't want to do that, as it takes a lot of memory, and I am running this on AWS Lambda, which would incur more cost. I have to process a large number of files daily.
My question is: can I create the dataframe with only the slimmer 256 fields while reading the CSV? Can that be my step one?
I am very new to pandas, so kindly bear with my ignorance. I looked for a solution for a long time but could not find one.

# only 3 columns
df = pd.read_csv('sample.csv', header=None, usecols=range(3))
print(df)
#     0   1   2
# 0   1   2   3
# 1  12  34  45
# 2  34  34  24
So just change the range value.
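For the real files the same call scales by widening the ranges. A minimal sketch under the question's numbers (the generated dummy names are an assumption; they help if a real file's first line is narrower than later ones, which may otherwise still raise the tokenizing error):

import pandas as pd

MAX_FIELDS = 398    # widest possible record, per the question
KEEP_FIELDS = 256   # number of leading fields to keep

# Explicit names tell the parser how wide a record may be, so a short first
# line no longer limits the expected field count; usecols then keeps only the
# first 256 columns while parsing, so the rest never enter the dataframe.
names = [f'column{i}' for i in range(MAX_FIELDS)]
df = pd.read_csv('sample.csv', header=None, names=names,
                 usecols=range(KEEP_FIELDS))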

Related

Improving fuzzy matching performance

I have two data frames; the first one has 200k records and the second one has 9k. I need to apply fuzzy matching for string matching across two columns. I dropped the duplicate values in both data frames, but there might still be similar strings, hence I wrote the code below. The idea is that I can manually go through the best two matches in the third column and see whether each is a reasonable match or not. The issue is that the code has been running for the last 2 days and I don't know how to speed this process up. I would appreciate it if you could share your thoughts on improving the speed and performance.
Thanks.
The mock data:
index  comp                            var
0      google inc-radio automation bu  apple
1      apple inc                       google
2      google inc                      microsoft
3      applebee                        google inc
4      microsoft                       appplebee
The code I wrote:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas
# empty lists for storing the
# matches later
mat1 = []
mat2 = []
list2 = df['comp'].tolist()
list1 = df['var'].tolist()
# taking the threshold as 90
threshold = 90
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
df['matches'] = mat1
There is a package, rapidfuzz, by @maxbachmann:
pip install rapidfuzz
Sample usage:
from rapidfuzz import process, utils
df['matches'] = df['var'].apply(lambda x: process.extract(x, df['comp'], limit=2))
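A self-contained sketch against the mock data above (column names come from the question; spelling out the WRatio scorer is an assumption, chosen because it mirrors fuzzywuzzy's default):

import pandas as pd
from rapidfuzz import process, fuzz

# mock data from the question
df = pd.DataFrame({
    'comp': ['google inc-radio automation bu', 'apple inc', 'google inc',
             'applebee', 'microsoft'],
    'var': ['apple', 'google', 'microsoft', 'google inc', 'appplebee'],
})

choices = df['comp'].tolist()

# two best matches per query string
df['matches'] = df['var'].apply(
    lambda x: process.extract(x, choices, scorer=fuzz.WRatio, limit=2)
)
print(df[['var', 'matches']])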

Using Dask to load many CSV files with different columns

I have many CSV files saved in AWS S3, all sharing the same first set of columns plus a lot of optional columns. I don't want to download them one by one and then use pd.concat, since that takes a lot of time and the result has to fit in memory. Instead, I'm trying to use Dask to load and sum up all of these files, where missing optional columns should be treated as zeros.
If all columns were the same, I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
df.groupby(["index"]).sum().compute()
but it doesn't work with files that don't have the same number of columns, since Dask assumes it can use the first file's columns for all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all headers in advance (for example by writing them out as I produce and save all of the small CSVs) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder whether this solution is actually faster/better than just using pandas directly. In general, I'd like to know the best (mainly fastest) way to tackle this issue.
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import pandas as pd
import dask.dataframe as dd
from dask import delayed

list_files = [...]  # create a list of files inside the s3 bucket
list_cols_to_keep = ['col1', 'col2']

@delayed
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[list_cols_to_keep]
    # add any other standardization routines, e.g. dtype conversion
    return df

ddf = dd.from_delayed([standard_csv(f) for f in list_files])
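Since the question wants missing optional columns treated as zeros, the delayed function could also reindex each frame to the full column list rather than dropping columns; a sketch under that assumption (all_cols and the column names are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask import delayed

all_cols = ['index', 'col1', 'col2']  # placeholder: the full set of expected columns
list_files = [...]                    # list of s3 paths, as above

@delayed
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    # columns absent from this particular file are created and filled with zeros
    return df.reindex(columns=all_cols, fill_value=0)

ddf = dd.from_delayed([standard_csv(f) for f in list_files])
result = ddf.groupby('index').sum().compute()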
I ended up giving up on Dask since it was too slow. Instead I used aws s3 sync to download the data and multiprocessing.Pool to read and concat the files:
# download:
def sync_outputs(out_path):
    local_dir_path = "/tmp/outputs/"
    safe_mkdir(os.path.dirname(local_dir_path))
    cmd = f'aws s3 sync {out_path} {local_dir_path} > /tmp/null'  # the redirect avoids prints
    os.system(cmd)
    return local_dir_path

# concat:
def read_csv(path):
    return pd.read_csv(path, index_col=0)

def read_csvs_parallel(local_paths):
    from multiprocessing import Pool
    import os
    with Pool(os.cpu_count()) as p:
        csvs = list(tqdm(p.imap(read_csv, local_paths),
                         desc='reading csvs', total=len(local_paths)))
    return csvs

# all together:
def concat_csvs_parallel(out_path):
    import glob
    local_dir = sync_outputs(out_path)
    local_paths = glob.glob(os.path.join(local_dir, "*.csv"))  # list the synced files
    csvs = read_csvs_parallel(local_paths)
    df = pd.concat(csvs)
    return df
aws s3 sync downloaded about 1000 files (~1 KB each) in about 30 seconds, and reading them with multiprocessing (8 cores) took 3 seconds; this was much faster than also downloading the files with multiprocessing (almost 2 minutes for 1000 files).

Importing Numbers using DataFrame

I'm trying to import numbers from an xlsx file using a pandas DataFrame, but I'm getting the numbers in a slightly different format.
Let's say the number is: 9582*****4
The number I get using this code is 9582*****4.0:
df = pd.read_excel("Contacts.xlsx")
for i in range(len(df)):
    print(df.iloc[i, 0])
It was working just fine till last night.
I guess you need to change the data type from float to int:
df = pd.read_excel("Contacts.xlsx")

df = df.astype(int)                    # for all columns
# or, for a specific column:
df = df.astype({"Column_name": int})

for i in range(len(df)):
    print(df.iloc[i, 0])
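Note that the trailing .0 usually means a blank cell forced the column to float, and astype(int) will then raise on those missing values. In that case pandas' nullable integer dtype is an option; a sketch assuming the numbers sit in the first column:

import pandas as pd

df = pd.read_excel("Contacts.xlsx")

# 'Int64' (capital I) is the nullable integer dtype, so missing cells stay <NA>
# instead of raising the error that plain int would
df = df.astype({df.columns[0]: 'Int64'})

for i in range(len(df)):
    print(df.iloc[i, 0])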

How to quickly read csv files and append them into single pandas data frame

I am trying to read 4684 CSV files from a folder; each file has 2000 rows, 102 columns, and a size of 418 kB. I am reading and appending them one by one using the code below.
for file in allFiles:
    df = pd.read_csv(file, index_col=None, header=None)
    df2 = df2.append(df)
This takes 4 to 5 hours to read all 4684 files and append them into one dataframe. Is there any way to make this process faster? I am using an i7 with 32 GB RAM.
Thanks
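Appending inside the loop re-copies the accumulated dataframe on every iteration, which dominates the runtime; a common alternative is to collect the pieces in a list and concatenate once at the end. A minimal sketch, assuming allFiles holds the 4684 paths as in the snippet above:

import pandas as pd

# read each file once and keep the pieces in a plain Python list
frames = [pd.read_csv(file, index_col=None, header=None) for file in allFiles]

# a single concat at the end avoids re-copying the growing dataframe
df2 = pd.concat(frames, ignore_index=True)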

Pandas reading csv into hdfstore thrashes, creates huge file

As a test, I'm trying to read a small 25 MB csv file using pandas.HDFStore:
store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()
It causes my computer to thrash and when it finally completes, file.h5 is 6.7 gigs. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe.
If I read the csv in without chunking and then add it to the store, I have no problems.
Update 1:
I'm running Anaconda, with Python 2.7.6, HDF5 1.8.9, numpy 1.8.0, PyTables 3.1.0, pandas 0.13.1, on Ubuntu 12.04.
The data is proprietary, so I can't post the chunk information online. I do have some mixed types. It still crashes if I try to read everything in as object.
Update 2:
I dropped all the columns with mixed types and I'm still getting the same issue. I have some very large text columns, if that makes any difference.
Update 3:
The problem seems to be loading the dataframe into the hdfstore. I drastically reduced the size of my file, but kept one of my very wide columns (1259 characters). Whereas the size of the csv file is 878.6kb, the size of the hdfstore is 53 megs. Is pytables unable to handle very wide columns? Is there a threshold above which I should truncate?
The wide object columns are definitely the problem. My solution has been to truncate the object columns while reading them in. If I truncate to a width of 20 characters, the h5 file is only about twice as large as a csv file. However, if I truncate to 100 characters, the h5 file is about 6 times larger.
I include my code below as an answer, but if anyone has any idea how to reduce this size disparity without having to truncate so much text, I'd be grateful.
def truncateCol(ser, width=100):
    if ser.dtype == np.object:
        ser = ser.str[:width] if ser.str.len().max() > width else ser
    return ser

store = pd.HDFStore(filepath, 'w')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t',
                         na_values="null", error_bad_lines=False):
    chunk = chunk.apply(truncateCol)
    store.append(table, chunk)
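If truncating that aggressively is undesirable, compressing the store and pinning the string width explicitly may help, since the table format reserves a fixed width for every row of an object column. A hedged sketch reusing filepath, f and table from the snippet above ('wide_col' and the 300-character cap are placeholders):

import pandas as pd

# blosc compression plus an explicit per-column string width; every row of an
# object column is stored at that fixed width, so compression recovers the padding
store = pd.HDFStore(filepath, mode='w', complevel=9, complib='blosc')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t', na_values="null"):
    store.append(table, chunk, min_itemsize={'wide_col': 300})
store.close()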