Loading CSV file with pandas - Error Tokenizing

I am trying to load a CSV file:
import pandas as pd
dfc = pd.read_csv('data/Vehicles0515.csv', sep=',')
but I get the following error:
ParserError: Error tokenizing data. C error: Expected 22 fields in line 3004427, saw 23
I have read that I should include error_bad_lines=False,
but it doesn't solve the problem.
Thanks a lot

Sometimes the parser gets confused by the head of the CSV file.
Try this:
dfc = pd.read_csv('data/Vehicles0515.csv', header=None)
or
dfc = pd.read_csv('data/Vehicles0515.csv', skiprows=2)
Also, you don't need to provide a comma separator, since a comma is already the default value of sep in pandas' read_csv.
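If the file genuinely contains the odd row with extra fields, skipping those rows outright is still an option. Note that error_bad_lines was deprecated in pandas 1.3 in favor of on_bad_lines, which may be why setting it seemed to have no effect:
import pandas as pd

# pandas >= 1.3: skip malformed rows instead of raising ParserError
dfc = pd.read_csv('data/Vehicles0515.csv', on_bad_lines='skip')
# older versions use the deprecated equivalent:
# dfc = pd.read_csv('data/Vehicles0515.csv', error_bad_lines=False)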

Related

Google cloud blob: XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I want to import several XML files from a bucket on GCS and then parse them into a pandas DataFrame. I found the pandas.read_xml function to do this, which is great. Unfortunately
I keep getting the error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I checked the XML files and they look fine.
This is the code:
from google.cloud import storage
import pandas as pd

# importing the data
client = storage.Client()
bucket = client.get_bucket('bucketname')
df = pd.DataFrame()

# parsing the data into pandas df
for blob in bucket.list_blobs():
    print(blob)
    split = str(blob.name).split("/")
    country = split[0]
    data = pd.read_xml(blob.open(mode='rt', encoding='iso-8859-1', errors='ignore'), compression='gzip')
    df["country"] = country
    print(country)
    df.append(data)
When I print out the blob it gives me:
<Blob: textkernel, DE/daily/2020/2020-12-19/jobs.0.xml.gz, 1612169959288959>
Maybe it has something to do with the pandas function trying to read the filename and not the content? Does someone have an idea why this could be happening?
Thank you!
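One likely culprit, for what it's worth: the blobs are .xml.gz, but blob.open(mode='rt', encoding='iso-8859-1') decodes the raw gzip bytes as text before pandas ever sees them, so read_xml receives mojibake rather than XML (a gzip stream starts with byte 0x1f, not '<'). A minimal sketch that decompresses explicitly, assuming gzip-compressed XML and the same bucket layout; collecting the frames in a list and concatenating also avoids the df.append(data) call, whose result was never assigned:
import gzip
import io

import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucketname')

frames = []
for blob in bucket.list_blobs():
    country = str(blob.name).split("/")[0]
    raw = blob.download_as_bytes()  # compressed bytes, not text
    xml_text = gzip.decompress(raw).decode('iso-8859-1')
    data = pd.read_xml(io.StringIO(xml_text))
    data["country"] = country
    frames.append(data)

df = pd.concat(frames, ignore_index=True)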

pandas read_csv creating incorrect columns

I am using pandas to read a CSV file:
import pandas as pd
data = pd.read_csv('file_name.csv', "ISO-8859-1")
Output:
col1,col2,col3
"sample1","sample2","sample3"
but the DataFrame ends up with just 1 column instead of the 5 it is supposed to have. I checked the CSV and it's fine.
Any suggestions on why this could be happening would be useful.
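A likely cause, for what it's worth: read_csv's second positional argument is sep, not encoding (newer pandas rejects extra positional arguments here entirely), so the call above uses the string "ISO-8859-1" as the delimiter; since it never occurs in the data, each row lands in a single column. Passing the encoding by keyword avoids this:
import pandas as pd

# encoding must be passed by keyword; sep keeps its default ','
data = pd.read_csv('file_name.csv', encoding='ISO-8859-1')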

How do you read a txt file (from SQLCMD) into Pandas DataFrame?

I've searched Google but haven't found a way to parse SQL txt file output and import it as a Pandas DataFrame. I have, within the cmd line:
sqlcmd -S server_name -E -Q "select top 10 * from table_name" -o "test.txt"
This produces a text file, which isn't exactly the best format, since it has dashed lines and a comment saying (10 rows affected), but whatever.
Now, I do:
import numpy as np
import pandas as pd
df_test = pd.read_csv('test.txt', sep = ' ')
And it produces an error:
ParserError: Error tokenizing data. C error: Expected 10006 fields in line 3, saw 14963
Anyone know how to parse a SQL test file within Python?
Thanks!
Edit: This would be the first column in the txt file:
Add error handling to the read_csv call. Note that read_csv has no errors parameter; skipping malformed rows is done with on_bad_lines='skip' (pandas >= 1.3; older versions use error_bad_lines=False), and sep=r'\s+' collapses the runs of padding spaces in sqlcmd output:
df_test = pd.read_csv('test.txt', sep=r'\s+', on_bad_lines='skip')
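Alternatively, sqlcmd can emit delimited output in the first place (-s sets the column separator, -W strips the trailing padding), which leaves only the dashed underline and the row-count trailer to skip. A sketch, assuming the same query as above:
import pandas as pd

# File produced with explicit separators, e.g.:
#   sqlcmd -S server_name -E -Q "select top 10 * from table_name" -s "," -W -o "test.txt"
# skiprows=[1] drops the dashed underline sqlcmd prints below the header;
# skipfooter=2 drops the blank line and the "(10 rows affected)" trailer
# (skipfooter requires the python engine).
df_test = pd.read_csv('test.txt', sep=',', skiprows=[1],
                      skipfooter=2, engine='python')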

How to Read GZIP csv from S3 directly into pandas dataframe

I'm writing an airflow job to read a gzipped file from s3.
First I get the key for the object, which works fine
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
obj looks fine, something like this:
path/to/file/data_1.csv.gz
Now I want to read the contents into a pandas dataframe. I've tried a number of things but this is my current iteration:
import pandas as pd
df = pd.read_csv(obj['Body'], compression='gzip')
This returns the following error:
TypeError: 's3.Object' object is not subscriptable
What am I doing wrong? I feel like I need to do something with StringIO or BytesIO... I was able to read it in as bytes, but thought there was a more straightforward way to get to a dataframe.
Just in case it matters, one row of the data looks like this when I unzip and open in CSV:
9671211|ddc9979d5ff90a4714fec7290657c90f|2138|2018-01-30 00:00:12|2018-01-30 00:00:16.069048|42b32863522dbe52e963034bb0aa68b6|1909705|8803795|collect|\\N|0||0||0|
Figured it out:
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
df = pd.read_csv(obj.get()['Body'], compression='gzip', header = None, sep = '|')
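For reference, the BytesIO route hinted at in the question works too; a minimal sketch, assuming obj is the same boto3 s3.Object as above:
import io

import pandas as pd

# .get() returns a dict whose 'Body' is a streaming response; read it
# into memory and let read_csv handle the gzip decompression.
body = obj.get()['Body'].read()
df = pd.read_csv(io.BytesIO(body), compression='gzip', header=None, sep='|')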

How to make pandas read CSV input as a string, not as a URL

I'm trying to load a CSV (from an API response) into pandas, but keep getting errors:
"ValueError: stat: path too long for Windows" and "FileNotFoundError: [Errno 2] File b'"fwefwe","fwef..."
indicating that pandas interprets it as a URL or file path, not a string.
The code below causes the errors above.
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(fake_csv, encoding='utf8')
df
How do I force pandas to interpret my argument as a csv string?
You can do that using StringIO:
import io
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(io.StringIO(fake_csv), encoding='utf8', sep=',', lineterminator=';')
df
Result:
Out[30]:
  fwefwe fwefw fwefew
0      2     5      7
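One caveat with this approach: lineterminator must be a single character and is only supported by the C parser, so it works here because ';' cleanly separates the rows. For anything more complex, converting the row delimiter to newlines first is a safe fallback:
import io

import pandas as pd

fake_csv = '"fwefwe","fwefw","fwefew";"2","5","7"'

# Normalize the row delimiter to newlines, then wrap the string in a
# file-like object, which is what read_csv expects.
df = pd.read_csv(io.StringIO(fake_csv.replace(';', '\n')))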