After loading a CSV file with pandas, the DataFrame has some wrong columns - pandas

I have a big CSV file, more than 5 GB in size, so I tried to load it in chunks with the code below.
import pandas as pd

reader = pd.read_csv('/path/to/csv', chunksize=10000, error_bad_lines=True, iterator=True)
for chunk in reader:
    with open('/path/to/save', 'a') as chunk_file:
        chunk.to_csv(chunk_file)
I saw warnings like:
Skipping line 8245: expected 1728 fields, saw 1729
I thought the saved file would be free of the dirty data, but it still contains some rows with the wrong columns.
I've set error_bad_lines, so I don't understand why this happens.
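For reference: in older pandas, error_bad_lines=True makes the parser raise on a malformed line rather than skip it; the "Skipping line …" warning corresponds to error_bad_lines=False combined with warn_bad_lines=True. Below is a minimal sketch of chunked reading that skips malformed lines, assuming pandas ≥ 1.3, where on_bad_lines replaces those flags; the paths are the placeholders from the question.

import pandas as pd

# Read the large CSV in chunks and drop lines with the wrong number of fields.
reader = pd.read_csv('/path/to/csv', chunksize=10000, on_bad_lines='skip')

with open('/path/to/save', 'w') as out:
    for i, chunk in enumerate(reader):
        # Write the header only for the first chunk so the output stays a valid CSV.
        chunk.to_csv(out, header=(i == 0), index=False)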


Ignore Last row in CSV file as part of BigQuery External table command

I have about 40-odd CSV files, comma delimited, in GCS; however, the last line of every file contains only a quote and a dot (”.).
So these files do not exactly conform to the CSV schema and have a data quality issue that I have to work around.
My aim is to create an external table referencing the GCS files and then be able to select the data.
example:
create or replace external table dataset.tableName
options (
  uris = ['gs://bucket_path/allCSVFILES_*.csv'],
  format = 'CSV',
  skip_leading_rows = 1,
  ignore_unknown_values = true
)
The external table gets created without any error. However, when I select the data, I run into this error:
"error message: CSV table references column position 16, but line starting at position:18628631 contains only 1 columns"
This is due to the quote and dot (”.) at the end of each file.
My question: is there any way in BigQuery to consume the data without the LAST LINE? Among the options we have skip_leading_rows to skip the header, but is there any way to skip the last row?
Currently my best option is to clean the files using sed/tail.
I have checked the create or replace external table options list (linked below) and have tried ignore_unknown_values, but I don't see any other option that will work.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
You can try the workaround below:
I loaded the file with pandas, removed the last record, and then loaded the DataFrame into BigQuery.
from google.cloud import bigquery
from google.cloud import storage
import pandas as pd

# Read the CSV straight from GCS (requires gcsfs).
df = pd.read_csv('gs://samplecsv.csv')

# Drop the last (malformed) record.
df.drop(df.tail(1).index, inplace=True)

# Load the cleaned DataFrame into BigQuery.
client = bigquery.Client()
dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('new_table')
client.load_table_from_dataframe(df, table_ref).result()
For more information you can refer to this link, which mentions the limitations for loading CSV files into BigQuery.
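If loading everything through pandas is not practical (for example, for very large files), another option in the same spirit is to strip the last line of each object directly in GCS before creating the external table. This is only a sketch under assumptions: it uses the google-cloud-storage client, the bucket name and prefix are hypothetical stand-ins for the URI pattern in the question, the exact trailing token may need adjusting, and it rewrites objects in place, so test on copies first.

from google.cloud import storage

client = storage.Client()

# Hypothetical bucket/prefix matching the URI pattern in the question.
for blob in client.list_blobs('bucket_path', prefix='allCSVFILES_'):
    lines = blob.download_as_text().splitlines()
    # Drop the trailing quote-and-dot line if present, then rewrite the object.
    if lines and lines[-1].strip() in ('".', '”.'):
        blob.upload_from_string('\n'.join(lines[:-1]) + '\n')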

How to write time to csv and read again as datetime64[ns, Europe/Berlin]?

Before writing to the CSV file, I have the time column as:
time datetime64[ns, Europe/Berlin]
When I read the df back from the CSV, I get:
time object
How can I write and read the time column so it keeps the same type it had before saving?
Before the writing process I have:
df = df.astype({'time': 'datetime64[ns]'})
mytz = get_localzone()  # from tzlocal
df['time'] = pd.to_datetime(df['time'], unit='ms').dt.tz_localize('UTC').dt.tz_convert(mytz)
How do I write to CSV?
df.to_csv('test.csv', index=False)
How do I read the CSV?
df = pd.read_csv('test.csv')
And the time column looks like this:
0 2021-09-20 00:00:00+02:00
1 2021-09-20 01:00:00+02:00
For this I would use pd.to_pickle(path) and pd.read_pickle(path), since CSV cannot really store anything other than strings and numbers. Pickle serializes the entire DataFrame and saves it as if you had dumped the Python object straight into a file, and restores it the same way.
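A minimal sketch of the round trip, using the column name from the question ('test.pkl' is just a placeholder file name):

import pandas as pd

# Save: the tz-aware dtype (datetime64[ns, Europe/Berlin]) is preserved as-is.
df.to_pickle('test.pkl')

# Load: no re-parsing or re-localizing needed.
df = pd.read_pickle('test.pkl')
print(df['time'].dtype)  # datetime64[ns, Europe/Berlin]

If you must stay with CSV, the timestamps come back as strings and have to be re-parsed (e.g. with pd.to_datetime) on every read.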

issue with reading csv file from AWS S3 with boto3

I have a csv file with the following columns:
Name Adress/1 Adress/2 City State
When I try to read this CSV file from local disk I have no issue.
But when I try to read it from S3 with the code below, I get an error when I use io.StringIO.
When I use io.BytesIO, each record displays as one column. Although the file is ',' separated, some columns contain '\n' or '\t', and I believe these are causing the issue.
I used AWS Wrangler with no issue, but my requirement is to read this CSV file with boto3.
import io
import pandas as pd
import boto3

s3 = boto3.resource('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
my_bucket = s3.Bucket(AWS_S3_BUCKET)
csv_obj = my_bucket.Object(key=key).get().get('Body').read().decode('utf16')
data = io.BytesIO(csv_obj)  # io.StringIO(csv_obj)
sdf = pd.read_csv(data, delimiter=sep, names=cols, header=None, skiprows=1)
print(sdf)
Any suggestions, please?
Try get_object():
obj = boto3.client('s3').get_object(Bucket=AWS_S3_BUCKET, Key=key)
data = io.StringIO(obj['Body'].read().decode('utf-8'))
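Putting the suggestion together, a minimal end-to-end sketch; it assumes the file is actually UTF-8 (swap in 'utf16' if the original encoding was correct) and reuses the placeholder names AWS_S3_BUCKET, key, sep and cols from the question:

import io

import boto3
import pandas as pd

obj = boto3.client('s3').get_object(Bucket=AWS_S3_BUCKET, Key=key)
data = io.StringIO(obj['Body'].read().decode('utf-8'))

# Fields containing '\n' or '\t' are handled by the CSV parser
# as long as the file quotes them properly.
sdf = pd.read_csv(data, delimiter=sep, names=cols, header=None, skiprows=1)
print(sdf.head())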

Exporting Multiple log files data to single Excel using Pandas

How do I export multiple dataframes to a single Excel file? I'm not talking about merging or combining; I just want a specific range of lines from multiple log files to be compiled into a single Excel sheet. I already wrote some code but I am stuck:
import pandas as pd
import glob
import os
from openpyxl.workbook import Workbook

file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))
for files in read_files:
    logs = pd.read_csv(files, header=None).loc[540:1060, :]
    print(LBS_logs)
    logs.to_excel("LBS.xlsx")
When I do this, I only get data from the first log.
I'd appreciate your recommendations. Thanks!
You are saving logs inside the loop, and that variable changes on each iteration, so the Excel file gets overwritten every time. What you want is to build a list of dataframes, combine them all, and then save that to Excel.
file_path = "C:/Users/HP/Desktop/Pandas/MISC/Log Source"
read_files = glob.glob(os.path.join(file_path, "*.log"))

dfs = []
for file in read_files:
    log = pd.read_csv(file, header=None).loc[540:1060, :]
    dfs.append(log)

# Concatenate all the per-file frames and write a single Excel file.
logs = pd.concat(dfs)
logs.to_excel("LBS.xlsx")
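If it is useful to know which log each row came from, a small variation of the same approach keeps that information as an extra index level (the 'source' and 'row' level names are just an illustration):

dfs = {os.path.basename(f): pd.read_csv(f, header=None).loc[540:1060, :]
       for f in read_files}

# Passing a dict to concat turns the keys into an extra index level,
# so every row is labelled with the file it came from.
logs = pd.concat(dfs, names=['source', 'row'])
logs.to_excel("LBS.xlsx")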

How to skip duplicate headers in multiple CSV files having identical columns and merge as one big data frame

I have copied 34 CSV files with identical columns into Google Colab and am trying to merge them into one big data frame. However, each CSV has a duplicate header row which needs to be skipped.
The actual header will be skipped anyway while concatenating, since my CSV files have identical columns, correct?
dfs = [pd.read_csv(path.join('/content/drive/My Drive/', x), skiprows=1) for x in os.listdir('/content/drive/My Drive/') if path.isfile(path.join('/content/drive/My Drive/', x))]
df = pd.concat(dfs)
The above code throws the error below.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
The code below works for sample files, but I need an efficient way to skip the duplicate headers and merge everything into one data frame. Please suggest.
df1=pd.read_csv("./Aug_0816.csv",skiprows=1)
df2=pd.read_csv("./Sep_0916.csv",skiprows=1)
df3=pd.read_csv("./Oct_1016.csv",skiprows=1)
df4=pd.read_csv("./Nov_1116.csv",skiprows=1)
df5=pd.read_csv("./Dec_1216.csv",skiprows=1)
dfs=[df1,df2,df3,df4,df5]
df=pd.concat(dfs)
Have you considered using glob from the standard library?
Try this
path = '/content/drive/My Drive/'
os.chdir(path)
allFiles = glob.glob("*.csv")
dfs = [pd.read_csv(f, header=None, error_bad_lines=False) for f in allFiles]
# or if you know the specific delimiter for your csv
# dfs = [pd.read_csv(f, header=None, delimiter='yourdelimiter') for f in allFiles]
df = pd.concat(dfs)
Try this, a generic script for concatenating any number of CSV files in a specific path with a common file name format:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

path = r"C:\Users\Jyotsna\Documents"
fmask = os.path.join(path, 'Detail**.csv')
df = get_merged_csv(glob.glob(fmask), index_col=None)
df.head()
If you want to skip some fixed rows and/or columns in each of the files before concatenating, edit that line accordingly:
return pd.concat([pd.read_csv(f, skiprows=4, usecols=range(9), **kwargs) for f in flist], ignore_index=True)
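As a side note on the UnicodeDecodeError in the question: that error usually means at least one file is not UTF-8 encoded, and passing an explicit encoding to read_csv avoids it. A hedged sketch; the 'latin-1' choice is an assumption, so substitute the real encoding of your files (for example 'cp1252') if you know it:

import glob
import os

import pandas as pd

path = '/content/drive/My Drive/'
files = glob.glob(os.path.join(path, '*.csv'))

# skiprows=1 drops the duplicate header row in each file;
# encoding='latin-1' avoids the UnicodeDecodeError for non-UTF-8 bytes.
df = pd.concat(
    (pd.read_csv(f, skiprows=1, encoding='latin-1') for f in files),
    ignore_index=True,
)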