I am reading a csv file that looks like this:
"column, column, column, column,"column, column
If I read it using sep=',' I only get three columns.
Any idea how to parse this type of file?
Use the quotechar parameter of read_csv from pandas. Setting it to a character that never appears in the file (here, a single quote) stops pandas from treating the embedded double quotes as quoting, so every comma is parsed as a delimiter:
df = pd.read_csv(PATH, quotechar="'")
print(df.columns.tolist())
['"column', ' column', ' column.1', ' column.2', '"column.1', ' column.3']
Please help.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you still need to convert the affected columns to a numeric (float) type afterwards, because for now they are of type object.
You can check for those cases with the command below:
df.dtypes
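For example, a minimal sketch of the full round trip, assuming a hypothetical column named 'value' that holds strings like '0,20':
df = df.replace(',', '.', regex=True)
df['value'] = pd.to_numeric(df['value'])  # hypothetical column name; object -> float64
print(df.dtypes)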
I am trying to read a CSV file as a dataframe in Azure Databricks.
All the header names in the CSV file (when I open it in Excel) are in the following format, e.g.:
"City_Name"ZYD_CABC2_EN:0TXTMD
Basically, I want to keep only the string within the quotes as my header (City_Name) and ignore the second part of the string (ZYD_CABC2_EN:0TXTMD).
sales_df = spark.read.format("csv").load(input_path + '/sales_2020.csv', inferSchema=True, header=True)
You can parse the column names after reading in the csv file, using regular expressions to extract the words between the quotes, and then using toDF to reassign all column names at once:
import re
# sales_df = spark.read.format("csv")...
sales_df = sales_df.toDF(*[re.search('"(.*)"', c).group(1) for c in sales_df.columns])
You can split the actual names on '"' to get the desired column names:
sales_df = sales_df.toDF(*[c.split('"')[1] for c in sales_df.columns])
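For instance, the header from the question splits like this, so index 1 is exactly the part between the quotes:
'"City_Name"ZYD_CABC2_EN:0TXTMD'.split('"')
# ['', 'City_Name', 'ZYD_CABC2_EN:0TXTMD']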
I have an obnoxious .txt file that is output from a late-1990s program for an Agilent instrument. I am trying to comma-separate and organize the single column of the text file into multiple columns in a pandas dataframe. After some reorganization, the txt file currently looks like the following; see link here:
Organized Text File
Each row is indexed in a pandas dataframe. The code used to reorganize the file and attempt to split it into multiple columns follows:
quantData = pd.read_csv(epaTemp, header = None)
trimmed_File = quantData.iloc[16:,]
trimmed_File = trimmed_File.drop([17,18,70,71,72], axis = 0)
print (trimmed_File)
###
splitFile = trimmed_File.apply(lambda x: pd.Series(str(x).split(',')))
print (splitFile)
The split function above did not get applied to all rows present in the txt file. It only applied split(',') to the first row rather than all of them:
0 16 Compound R... 1
dtype: object
I would like this split functionality to apply to all rows in my txt file so I can further organize my data. Thank you for the help.
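A likely cause: DataFrame.apply passes each whole column to the lambda, so str(x) stringifies the entire column at once rather than each cell. A minimal sketch of an alternative, assuming the raw lines live in column 0 of trimmed_File:
# split every row on commas, expanding into one column per field
splitFile = trimmed_File[0].str.split(',', expand=True)
print(splitFile)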
I am trying to open a CSV file in pandas using the read_csv function. My file has the following structure: a header row where each column name is enclosed in quotes, for example "header1";"header2";, and non-header rows containing int or string values without quotes, with only ; as the delimiter. The dataframe has the following structure:
"header1";"header2";"header3";
value1;value2;value3;
When I apply read_csv with df = pd.read_csv("filepath", sep=";", engine="python") I get ParserError: expected ';' after '"'. Help me solve it.
Try to specify column names as follows, and see if it resolves the issue:
col_names = ["header1", "header2", "header3"]
df = pd.read_csv(filepath, sep=";", names=col_names)
If this doesn't work, try adding quotechar='"' as well and see.
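A sketch combining both suggestions; note that passing names together with header=0 replaces the original header row instead of reading it as data:
col_names = ["header1", "header2", "header3"]
# index_col=False guards against the trailing ';' creating an implicit index column
df = pd.read_csv(filepath, sep=";", names=col_names, header=0, quotechar='"', index_col=False)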
I am trying to create a dataframe in pandas using a CSV that is semicolon-delimited, and uses commas for the thousands separator on numeric data. Is there a way to read this in so that the type of the column is float and not string?
Pass param thousands=',' to read_csv to read those values as thousands:
In [27]:
import pandas as pd
import io
t="""id;value
0;123,123
1;221,323,330
2;32,001"""
pd.read_csv(io.StringIO(t), thousands=r',', sep=';')
Out[27]:
id value
0 0 123123
1 1 221323330
2 2 32001
The answer to this question should be short:
df = pd.read_csv('filename.csv', thousands=',')
Take a look at the read_csv documentation; there is a keyword argument thousands that you can pass ',' into. Likewise, if you had European data containing '.' as the separator, you could do the same.
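For example, a minimal sketch for the European case, where '.' is the thousands separator and ',' is the decimal separator (filename.csv is a placeholder):
df = pd.read_csv('filename.csv', thousands='.', decimal=',')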