PySpark dataframe write to ORC not allowing column names with hyphen

I am new to PySpark. I have a CSV file with hyphens in the column names. I could successfully read the file into a dataframe. However, while writing the df to an ORC file I get an error like the one below:
java.lang.IllegalArgumentException: Missing required char ':' at
'struct
When I renamed the columns by removing the hyphens, I could write the dataframe to ORC. But I need the column names to keep the hyphens, because I want to append this ORC to an existing ORC file whose column names contain hyphens.
Could someone please help me with this?
Any help would be greatly appreciated!!!

Use backticks ( ` ) to enclose the column name.
Like: `column-name`
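For instance, a minimal sketch of referencing such a column with backticks in a SQL-style expression (column-name is a stand-in for your actual hyphenated column, df for the dataframe read from the CSV):
# escape the hyphenated column with backticks so Spark does not parse the hyphen as a minus sign
df.selectExpr("`column-name`").show()
# the same escaping applies when casting or renaming
df.selectExpr("cast(`column-name` as string) as renamed_col").show()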
Alternatively, read the data into a dataframe and create a new empty dataframe with the desired structure:
from pyspark.sql.types import StructType, StructField, StringType

# read the existing ORC data
result = spark.read.orc(path)

# define the desired schema, including the hyphenated column name
schema = StructType([
    StructField('col-name', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

# create an empty dataframe with that schema (emptyRDD lives on the SparkContext)
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

# unionAll is a deprecated alias; union is the current name
df.union(result).show()
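If the goal is then to append the combined data to the existing ORC output, a hedged follow-up (whether the ORC writer accepts the hyphenated name still depends on your Spark/Hive version) could look like:
# write the combined frame back out, appending to the existing ORC location
df.union(result).write.mode("append").orc(path)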

Related

pandas can't replace commas with dots

Help me, please.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you need to convert the affected columns to a numeric type (float here, since values like '0,20' have decimals), because for now they are of type object.
You can check the column types with the command below:
df.dtypes
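For example, a minimal sketch of the full replace-then-cast pattern (the column name 'value' is a made-up placeholder; use your real column names):
import pandas as pd

# toy frame standing in for the real dataset; values use a decimal comma
df = pd.DataFrame({'value': ['0,20', '1,50', '3,75']})

# replace decimal commas with dots, then cast the affected column to float
df = df.replace(',', '.', regex=True)
df['value'] = df['value'].astype(float)

print(df.dtypes)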

Applying schema on pyspark dataframe

I have a csv file with 300 columns. Out of these 300 columns, I need only 3, so I defined a schema for them. But when I apply the schema to the dataframe, it shows only 3 columns yet incorrectly maps the schema onto the first 3 columns of the file. It is not matching the CSV column names with my schema StructFields. Please advise.
from pyspark.sql.types import *

dfschema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True)
])
df = spark.read.format("csv")\
.option("header","true")\
.schema(dfschema)\
.load("/FileStore/*/*")
df.show(5)
This is actually the expected behaviour of Spark's CSV-Reader.
If the columns in the csv file do not match the supplied schema, Spark treats the row as a corrupt record. The easiest way to see that is to add another column _corrupt_record with type string to the schema. You will see that all rows are stored in this column.
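A minimal sketch of that diagnostic, reusing the schema and placeholder path from the question:
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType

# same three columns as above, plus _corrupt_record to capture rows that do not fit the schema
debug_schema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df_debug = spark.read.format("csv") \
    .option("header", "true") \
    .schema(debug_schema) \
    .load("/FileStore/*/*")

# rows that could not be matched against the schema show up in the _corrupt_record column
df_debug.show(5, truncate=False)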
The easiest way to get the correct columns would be to read the csv file without a schema (or, if feasible, with the complete schema) and then select the required columns. There is no performance penalty for reading the whole csv file because (unlike formats such as parquet) Spark cannot read selected columns from csv; the file is always read completely.
# read the csv file without inferring the schema
df = spark.read.option("header", "true").option("inferSchema", "false").csv(<...>)

# all columns will now be of type string
df.printSchema()

# select the required columns and cast them to the appropriate type
df2 = df.selectExpr("cast(`Call Number` as int)", "cast(`Incident Number` as int)", ....)

# only the required columns with the correct type are contained in df2
df2.printSchema()

Unable to use BLANKSASNULL Data conversion parameter in write_dynamic_frame.from_catalog while moving data to Redshift table

Previously, to move data to a Redshift table we used the COPY command, which supports data conversion parameters like BLANKSASNULL and EMPTYASNULL.
As our data contains both empty strings and null values, we used to convert both to NULL while moving to the Redshift table, as shown below.
Example code :
COPY Database.Table
FROM 's3://folder/file.csv'
IAM_ROLE 'arn:aws:iam::0000000:role/RedshiftCopyUnload'
DELIMITER ',' ESCAPE
REMOVEQUOTES
ACCEPTINVCHARS
EMPTYASNULL
BLANKSASNULL
NULL AS 'NULL'
DATEFORMAT 'auto';
Now we have to use the write_dynamic_frame.from_jdbc_conf method, and we are trying to replicate the same COPY command data conversion parameters (BLANKSASNULL and EMPTYASNULL), but we are unable to find the exact reference.
# Save data to Redshift
redshift_save_options = {
    "dbtable": "Database." + TableName,
    "database": "Schema"
}

from awsglue.dynamicframe import DynamicFrame

x = DynamicFrame.fromDF(input_data, glueContext, "dfx")

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=x,
    catalog_connection="Regshift-glue-connection",
    connection_options=redshift_save_options,
    redshift_tmp_dir="s3://project/RedshiftTempDirectory/")
Can someone help me in solving this?
Any suggestion is appreciated. Thank you.
To replicate the functionality of BLANKSASNULL and EMPTYASNULL, replace blank and empty string values in the DataFrame (i.e. input_data) prior to converting it to a DynamicFrame.
Example:
from pyspark.sql.functions import col, when, trim

# replace empty-string values with null
# trim() handles "blank" strings (whitespace, new line characters, etc.)
input_data = input_data.select(
    [
        when(trim(col(c)) == "", None).otherwise(col(c)).alias(c)
        for c in input_data.columns
    ]
)
x = DynamicFrame.fromDF(input_data, glueContext, "dfx")
References:
PySpark Replace Empty Value With None/null on DataFrame
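As an alternative (this is an assumption to verify against the AWS Glue documentation for your Glue version), Glue's Redshift connection options document an "extracopyoptions" key whose value is appended to the COPY command Glue runs under the hood, which may let you pass the conversion parameters directly:
# hypothetical variant: forward COPY options via the connection options
# (verify that your Glue version supports "extracopyoptions" for Redshift targets)
redshift_save_options = {
    "dbtable": "Database." + TableName,
    "database": "Schema",
    "extracopyoptions": "EMPTYASNULL BLANKSASNULL"
}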

How to merge two columns into one date column with pandas?

I have a csv whose first column is the date and whose 5th column is the hour.
I would like to merge them into a single column with a specific format in order to write another csv file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and to convert the first entry to a date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem to me to be the correct way, for two reasons:
1) it adds the hour without considering the hour column;
2) it does not seem to use pandas' features.
Thanks for any kind of help,
Diedro
Using the + operator
You need to convert the dataframe elements into strings before joining them. You can also use different separators during the join, e.g. a dash, underscore, or space.
import pandas as pd

df = pd.DataFrame({'Last': ['something', 'you', 'want'],
                   'First': ['merge', 'with', 'this']})

print('Before join')
print(df, '\n')

print('After join')
df['Name'] = df["First"].astype(str) + " " + df["Last"]
print(df)
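A minimal sketch applying the same idea to the question's DATE and HOUR columns (assuming dataR was read with pd.read_csv as in the question and the hours fit the %H directive; the DATETIME name is arbitrary):
# concatenate the two columns as strings, then parse them into a single datetime column
dataR['DATETIME'] = pd.to_datetime(
    dataR['DATE'].astype(str) + ' ' + dataR['HOUR'].astype(str),
    format='%d/%m/%Y %H'
)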
You can use read_csv with the parse_dates parameter given a list of both column names, and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert hours to timedeltas and add to datetimes later:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='H')

Pandas read_csv wrong separator recognition

I am trying to open a csv file in pandas using the read_csv function. My file has the following structure: a row of headers where each column header's name is wrapped in quotes, for example "header1";"header2"; the non-header values in the columns contain int or string values without quotes, with only the ; delimiter. The dataframe has the following structure:
"header1";"header2";"header3";
value1;value2;value3;
When I apply read_csv with df = pd.read_csv("filepath", sep=";", engine="python"), I am getting ParseError: expected ';' after '"'. Help me solve it.
Try to specify column names as follows, and see if it resolves the issue:
col_names = ["header1", "header2", "header3"]
# header=0 skips the file's own quoted header row so it is not read in as data
df = pd.read_csv(filepath, sep=";", names=col_names, header=0)
If this doesn't work, try adding quotechar='"' and see if it helps.
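A minimal sketch of that variant (the file path is a placeholder; it assumes the file really quotes values with double quotes):
import pandas as pd

# quotechar tells the parser that field values are wrapped in double quotes
df = pd.read_csv("filepath", sep=";", quotechar='"', engine="python")
print(df.head())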