Help me, please.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the affected columns to a numeric type afterwards (e.g. float, since values like '0,20' are decimals), because for now they are of type object.
You can check the column types with the command below:
df.dtypes
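For completeness, a minimal sketch of the full chain, assuming the affected column is called 'value' (a placeholder, use your actual column names): replace the commas, then cast to a numeric type with pd.to_numeric.
import pandas as pd

# load as before; the comma-decimal columns come in as object (string)
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',')

# replace decimal commas in all string columns, then cast the affected column
df = df.replace(',', '.', regex=True)
df['value'] = pd.to_numeric(df['value'], errors='coerce')

print(df.dtypes)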
I have a csv file with 300 columns. Out of these 300 columns, I need only 3, so I defined a schema for them. But when I map the schema to the dataframe it shows only 3 columns, yet it incorrectly maps the schema onto the first 3 columns of the file. It does not match the csv column names with my schema StructFields. Please advise.
from pyspark.sql.types import *
dfschema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True)
])
df = spark.read.format("csv")\
.option("header","true")\
.schema(dfschema)\
.load("/FileStore/*/*")
df.show(5)
This is actually the expected behaviour of Spark's CSV-Reader.
If the columns in the csv file do not match the supplied schema, Spark treats the row as a corrupt record. The easiest way to see that is to add another column _corrupt_record with type string to the schema. You will see that all rows are stored in this column.
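For example, a quick diagnostic sketch (same schema as in the question, plus the extra _corrupt_record field):
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType

# the original schema plus a column that captures rows which
# do not match the supplied types
debug_schema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df_debug = spark.read.format("csv") \
    .option("header", "true") \
    .schema(debug_schema) \
    .load("/FileStore/*/*")

# rows that could not be parsed end up in the _corrupt_record column
df_debug.show(5, truncate=False)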
The easiest way to get the correct columns is to read the csv file without a schema (or, if feasible, with the complete schema) and then select the required columns. There is no performance penalty for reading the whole csv file, since (unlike formats such as Parquet) Spark cannot read selected columns from csv; the file is always read completely.
#read the csv file without infering the schema
df=spark.read.option("header","true").option("inferSchema", False).csv(<...>)
#all columns will now be of type string
df.printSchema()
#select the required columns and cast them to the appropriate type
df2 = df.selectExpr("cast(`Call Number` as int)", "cast(`Incident Number` as int)", ....)
#only the required columns with the correct type are contained in df2
df2.printSchema()
Previously, to move data to a Redshift table we used the COPY command, which supports data conversion parameters like BLANKSASNULL and EMPTYASNULL.
As our data contains both empty strings and null values, we used to convert both to null while moving to the Redshift table, as shown below.
Example code:
COPY Database.Table
FROM 's3://folder/file.csv'
IAM_ROLE 'arn:aws:iam::0000000:role/RedshiftCopyUnload'
DELIMITER ',' ESCAPE
REMOVEQUOTES
ACCEPTINVCHARS
EMPTYASNULL
BLANKSASNULL
NULL AS 'NULL'
DATEFORMAT 'auto';
Now we have to use the write_dynamic_frame.from_jdbc_conf method. We are trying to replicate the COPY command's data conversion parameters (BLANKSASNULL and EMPTYASNULL), but we are unable to find the exact reference.
# Save data to Redshift
redshift_save_options = {
    "dbtable": "Database." + TableName,
    "database": "Schema"
}
from awsglue.dynamicframe import DynamicFrame
x = DynamicFrame.fromDF(input_data, glueContext, "dfx")
glueContext.write_dynamic_frame.from_jdbc_conf(
frame = x,
catalog_connection = "Regshift-glue-connection",
connection_options = redshift_save_options,
redshift_tmp_dir = "s3://project/RedshiftTempDirectory/")
Can someone help me solve this?
Any suggestion is appreciated. Thank you.
To replicate the functionality of BLANKSASNULL and EMPTYASNULL, replace blank and empty string values in the DataFrame (i.e. input_data) prior to converting it to a DynamicFrame.
Example:
from pyspark.sql.functions import col, trim, when
# replace empty string values
# trim() handles "blank" strings (i.e. whitespace, new line characters, etc.)
input_data = input_data.select(
    [
        when(trim(col(c)) == "", None).otherwise(col(c)).alias(c)
        for c in input_data.columns
    ]
)
x = DynamicFrame.fromDF(input_data, glueContext, "dfx")
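The COPY command above also maps the literal string 'NULL' to null (NULL AS 'NULL'); if you need that behaviour as well, the same when/otherwise pattern covers it, for example:
from pyspark.sql.functions import col, trim, when

# in addition to blank/empty strings, also treat the literal string 'NULL' as null,
# mirroring the NULL AS 'NULL' option of the COPY command
input_data = input_data.select(
    [
        when((trim(col(c)) == "") | (col(c) == "NULL"), None)
        .otherwise(col(c))
        .alias(c)
        for c in input_data.columns
    ]
)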
References:
PySpark Replace Empty Value With None/null on DataFrame
I have a csv where the first column is the date and the 5th is the hour.
I would like to merge them in a single column with a specific format in order to write another csv file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem to me the correct way, for two reasons:
1) it adds the hour without considering the hour column;
2) it does not seem to use the pandas features.
Thanks for any kind of help,
Diedro
Using the + operator
You need to convert the data frame elements to strings before joining. You can also use different separators during the join, e.g. a dash, underscore, or space.
import pandas as pd

df = pd.DataFrame({'Last': ['something', 'you', 'want'],
                   'First': ['merge', 'with', 'this']})

print('Before Join')
print(df, '\n')

print('After join')
df['Name'] = df["First"].astype(str) + " " + df["Last"]
print(df)
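Applied to the columns from the question (a sketch, assuming dataR was read with a plain pd.read_csv(fnamecsv) and the HOUR column holds hours 0-23):
# concatenate the raw DATE and HOUR columns as strings, then parse the result
dataR['DATE_HOUR'] = pd.to_datetime(
    dataR['DATE'].astype(str) + ' ' + dataR['HOUR'].astype(str),
    format='%d/%m/%Y %H'
)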
You can use read_csv with the parse_dates parameter (a list containing both column names) and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert hours to timedeltas and add to datetimes later:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='H')
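If you then want to write the merged column back out in a specific format, a short sketch (the target format and output path are placeholders):
# format the combined datetime column and write a new csv
dataR['DATE'] = dataR['DATE'].dt.strftime('%Y-%m-%d %H:%M')
dataR.to_csv('output.csv', index=False)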
I am trying to open a csv file in pandas using the read_csv function. My file has the following structure: a header row where each column name is surrounded by quotes, for example "header1";"header2"; the non-header rows contain int or string values without quotes, with ';' as the only delimiter. The data has the following structure:
"header1";"header2";"header3";
value1;value2;value3;
When I apply read_csv with df = pd.read_csv("filepath", sep=";", engine="python"), I get ParserError: ';' expected after '"'. Help me solve it.
Try to specify column names as follows, and see if it resolves the issue:
col_names = ["header1", "header2", "header3"]
# header=0 so the quoted header row is not read in as a data row
df = pd.read_csv(filepath, sep=";", names=col_names, header=0)
If this doesn't work, try adding quotechar='"' and see if that helps.
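In other words, a minimal sketch that keeps the header row and tells pandas how the headers are quoted (the file path is a placeholder):
import pandas as pd

# quotechar tells the parser that the column names are wrapped in double quotes;
# the trailing ';' in each row creates an extra empty column, which is dropped here
df = pd.read_csv("filepath.csv", sep=";", quotechar='"')
df = df.dropna(axis=1, how="all")
print(df.head())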