I am trying to open a CSV file in pandas using the read_csv function. My file has the following structure: a header row where each column name is wrapped in quotes, e.g. "header1";"header2"; the non-header values in the columns are int or string values without quotes, with ; as the only delimiter. The file looks like this:
"header1";"header2";"header3";
value1;value2;value3;
When I apply read_csv with df = pd.read_csv("filepath", sep=";", engine="python") I get ParserError: expected ';' after '"'. Please help me solve it.
Try specifying the column names explicitly and see if that resolves the issue:
col_names = ["header1", "header2", "header3"]
df = pd.read_csv(filepath, sep=";", names=col_names, header=0)  # header=0 replaces the quoted header row instead of keeping it as data
If this doesn't work, try adding quotechar='"' and see if that helps.
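For example (a minimal sketch; note that quotechar='"' is in fact pandas' default, so if the quotes are still mishandled you can disable quote processing entirely with quoting=csv.QUOTE_NONE):
import csv
import pandas as pd

df = pd.read_csv("filepath", sep=";", quotechar='"')
# if the error persists, treat the quotes as literal characters instead:
# df = pd.read_csv("filepath", sep=";", quoting=csv.QUOTE_NONE)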
Please help.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
Without regex=True, replace only matches cells whose entire value is ',', which is why your earlier call left the commas in place. Also keep in mind that you then need to convert the affected columns to a numeric type (float here, since values like 0.20 have decimal parts), because for now they are of type object.
You can check which columns are affected with:
df.dtypes
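Putting it together (a minimal sketch; 'price' is a hypothetical name standing in for one of the affected columns):
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/data.csv')
# substring replacement requires regex=True
df = df.replace(',', '.', regex=True)
df['price'] = df['price'].astype(float)  # 'price' is a placeholder column name
print(df.dtypes)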
Previously, to move data to a Redshift table we used the COPY command, which supports data conversion parameters such as BLANKSASNULL and EMPTYASNULL.
As our data contains both empty strings and NULL values, we used to convert both to NULL while moving the data to the Redshift table, as shown below.
Example code:
COPY Database.Table
FROM 's3://folder/file.csv'
IAM_ROLE 'arn:aws:iam::0000000:role/RedshiftCopyUnload'
DELIMITER ',' ESCAPE
REMOVEQUOTES
ACCEPTINVCHARS
EMPTYASNULL
BLANKSASNULL
NULL AS 'NULL'
DATEFORMAT 'auto';
Now we have to use the write_dynamic_frame.from_jdbc_conf method, and we are trying to replicate the COPY command's data conversion parameters (BLANKSASNULL and EMPTYASNULL), but we are unable to find the exact reference.
# Save data to Redshift
redshift_save_options = {
    "dbtable": "Database." + TableName,
    "database": "Schema"
}

from awsglue.dynamicframe import DynamicFrame

x = DynamicFrame.fromDF(input_data, glueContext, "dfx")
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=x,
    catalog_connection="Regshift-glue-connection",
    connection_options=redshift_save_options,
    redshift_tmp_dir="s3://project/RedshiftTempDirectory/")
Can someone help me solve this?
Any suggestion is appreciated. Thank you.
To replicate the functionality of BLANKSASNULL and EMPTYASNULL, replace blank and empty string values in the DataFrame (i.e. input_data) with None prior to converting it to a DynamicFrame.
Example:
from pyspark.sql.functions import col, trim, when

# replace empty and blank string values with None;
# trim() strips leading/trailing spaces, so space-only ("blank")
# strings compare equal to "" as well
input_data = input_data.select(
    [
        when(trim(col(c)) == "", None).otherwise(col(c)).alias(c)
        for c in input_data.columns
    ]
)
x = DynamicFrame.fromDF(input_data, glueContext, "dfx")
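With the empty and blank values mapped to None, the subsequent write_dynamic_frame.from_jdbc_conf call should then write them to Redshift as NULL, mirroring what EMPTYASNULL and BLANKSASNULL did for the COPY command.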
References:
PySpark Replace Empty Value With None/null on DataFrame
I am trying to read a CSV file as a DataFrame in Azure Databricks.
All the header names in the CSV file are in the following format (as seen when I open the file in Excel), e.g.
"City_Name"ZYD_CABC2_EN:0TXTMD
Basically, I want to keep only the string within the quotes as my header (City_Name) and ignore the rest of the string (ZYD_CABC2_EN:0TXTMD). This is how I read the file:
sales_df = spark.read.format("csv").load(input_path + '/sales_2020.csv', inferSchema = True, header=True)
You can parse the column names after reading in the csv file, using regular expressions to extract the words between the quotes, and then using toDF to reassign all column names at once:
import re
# sales_df = spark.read.format("csv")...
sales_df = sales_df.toDF(*[re.search('"(.*)"', c).group(1) for c in sales_df.columns])
Alternatively, you can split the actual names on " to get the desired column names:
sales_df = sales_df.toDF(*[c.split('"')[1] for c in sales_df.columns])
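For instance, applied to a single raw header value:
c = '"City_Name"ZYD_CABC2_EN:0TXTMD'
c.split('"')[1]  # -> 'City_Name'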
I am reading a csv file that looks like this:
"column, column, column, column,"column, column
If I read it using sep=',' I only get three columns.
Any idea how to parse this type of file?
Use quotechar in read_csv from pandas. By default quotechar is '"', so everything between the two double quotes is parsed as a single field (which is why you only get three columns); pointing quotechar at a character that does not occur in the data makes the double quotes ordinary characters:
df = pd.read_csv(PATH, quotechar="'")
print(df.columns.tolist())
['"column', ' column', ' column.1', ' column.2', '"column.1', ' column.3']
I am trying to create a fixed-width file output in Pandas. When using DataFrame.to_string, all the data has whitespace separating the values. How do I remove the whitespace between the data columns?
sql = """SELECT
FIELD_1,
FIELD_2,
.........
FROM
VIEW"""
db_connection_string = "your connection string"
df = pd.read_sql_query(sql=sql, con=db_connection_string)
df['field_1'] = df['field_1'].str.pad(width=10, side='right', fillchar='-')
df['field_2'] = df['field_2'].str.pad(width=10, side='right', fillchar='-')
print(df.to_string(header=False, index=False))
I expected the following:
field1----field2----
What I got was:
field1---- field2----
Please note the spaces between the columns. This is what I am trying to remove. The fields should be flush against one another and not have a whitespace separator.
I think the problem is that to_string adds a default column separator. A possible solution is to join all the columns together:
print(df.astype(str).apply(''.join, axis=1).to_string(header=False, index=False))
field1----field_2---
Or only some columns:
print((df['field_1'] + df['field_2']).to_string(header=False, index=False))
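An end-to-end sketch (the width of 10 and the '-' fill character are carried over from the question; the column names are placeholders):
# pad every column to a fixed width, then concatenate each row with no separator
padded = df.astype(str).apply(lambda s: s.str.pad(width=10, side='right', fillchar='-'))
print(padded.apply(''.join, axis=1).to_string(header=False, index=False))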