How do I specify data types for a Databricks Spark dataframe and avoid future problems?

I have a Databricks notebook that runs every 2-4 weeks. It reads in a small csv, performs ETL in Python, then truncates and loads into a Delta table.
This is what I am currently doing to avoid failures related to data types:
Python to replace all '-' with '0'
Python to drop rows with NaN or nan
spark_df = spark.createDataFrame(dfnew)
spark_df.write.saveAsTable("default.test_table", index=False, header=True)
This infers the data types automatically and is working right now.
But what if a data type cannot be inferred, or is inferred incorrectly? I am mostly concerned about doubles, ints, and bigints.
I tested casting, but it doesn't work on Databricks:
spark_df = spark.createDataFrame(dfnew.select(dfnew("Year").cast(IntegerType).as("Year")))
Is there a way to feed a DDL schema to a Spark dataframe on Databricks? Should I not use Spark?
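For reference, createDataFrame accepts a DDL-style schema string, so the types can be pinned instead of inferred. A minimal sketch, reusing the asker's spark session and pandas dataframe dfnew, and assuming the columns can be coerced to these types (the column names here are examples, not the real schema):

# hypothetical columns; adjust the DDL string to match the actual csv
ddl = "Year INT, Amount DOUBLE, RecordId BIGINT, Note STRING"
spark_df = spark.createDataFrame(dfnew, schema=ddl)
spark_df.write.mode("overwrite").saveAsTable("default.test_table")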

Related

Spark createDataFrame cannot infer schema - default data types?

I am creating a Spark dataframe in Databricks using createDataFrame and getting the error:
'Some of types cannot be determined after inferring'
I know I can specify the schema, but that does not help if I am creating the dataframe each time with source data from an API and they decide to restructure it.
Instead I would like to tell spark to use 'string' for any column where a data type cannot be inferred.
Is this possible?
This can be handled easily with schema evolution in the Delta format. Quick ref: https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
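A minimal sketch of what that schema evolution looks like in practice, assuming the target is a Delta table; spark_df stands for the dataframe built from the latest API pull and the table name is a placeholder:

# mergeSchema lets Delta add columns that appear in the incoming dataframe
# instead of failing on a schema mismatch
(spark_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("default.api_landing_table"))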

python pandas removes leading 0 while writing to csv

I am facing an issue which might be related to this question and others similar to it. I decided to create a separate question because I feel my problem has some additional things to consider. Here is what I am facing right now.
I have a dataframe in pandas that reads the data from SQL and shows up something like the following:
The screenshot shows that the values have a leading '0' and that the dtype of this column is 'object'.
When I run this SQL and export to csv on my Windows machine (Python 3.7, pandas 1.0.3), it works exactly as required and shows the correct output.
The problem occurs when I run it on my Linux machine (Python 3.5.2, pandas 0.24.2): it always removes the leading zeros while writing to CSV, so the csv looks like the following image:
I am not sure what I should change to get the desired result in both environments. I will appreciate any help.
Edit:
I confirmed that the dataframe read from SQL on Ubuntu also has the leading zeros:
If you can use xlsx files instead of csv, replace df.to_csv with df.to_excel and change the file extension to .xlsx.
With xlsx files you also get to store the types, so Excel will not assume the values are numbers (csv vs excel).
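A minimal sketch of that suggestion, with an explicit cast to string as a belt-and-braces step; the column name, values, and file name below are placeholders, and writing xlsx requires openpyxl:

import pandas as pd

# toy frame standing in for the SQL result; 'code' is a placeholder column
df = pd.DataFrame({'code': ['00123', '04567'], 'value': [10, 20]})

df['code'] = df['code'].astype(str)        # keep the column as text
df.to_excel('output.xlsx', index=False)    # needs openpyxl installed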

Python bulk insert to Teradata? Default is too slow

I was asked for a Python script to read a file, load into a dataframe, and write to a table in Teradata. It works but it takes 3-4 minutes to write a table that's 300 rows.
Now that's a tiny job for a data warehouse, and our Teradata works fine when handling massive data sets, but I find it's such a drag waiting 3 minutes for this script to run. I don't believe it's a system issue.
Is there a better way to load small to medium size tables into Teradata? If we did this in SQL Server it would take seconds but some other data we need to access is already there in Teradata.
The main code looks like this:
import pandas as pd
from sqlalchemy import create_engine
conn_string = 'teradata://' + user + ':' + passw + '@' + host + '/?authentication=LDAP'
eng = create_engine(conn_string)
# insert into table
df.to_sql(table_name, con=eng, if_exists='replace',
          schema='TEMP', index=False, dtype=df_datatypes)
The above works but when I added the method='multi' parameter to to_sql, I got the error:
CompileError: The 'teradata' dialect with current database version settings does not
support in-place multirow inserts.
Then I added chunksize=100 and I got this error:
DatabaseError: [HY000] [Teradata][ODBC Teradata Driver][Teradata Database](-9719)QVCI feature is disabled.
Is there a way to speed this up? Should I do something outside of Python altogether?
Pedro, if you are trying to load one row at a time through ODBC, you are approaching the problem like trying to fill a swimming pool through a straw.
Teradata is, and has always been, a parallel platform. You need to use the bulk loaders, or, if you are a Python fan, use teradataml (code below).
By the way, I loaded 728,000 records in 12 seconds on my laptop. When I built a Teradata system in Azure, I was able to load the same 728,000 records in 2 seconds using the same code below:
# -*- coding: utf-8 -*-
import pandas as pd
from teradataml import create_context, remove_context, copy_to_sql
con = create_context(host=<host name or IP address>, username=<username>, password=<password>, database='<whatever default database you want>')
colnames = ['storeid', 'custnumber', 'invoicenumber', 'amount']
dataDF = pd.read_csv('datafile.csv', delimiter=',', names=colnames, header=None)
copy_to_sql(dataDF, table_name='store', if_exists='replace')  # load the dataframe from your client machine into the Teradata system
remove_context()
Pedro, I can empathize with you. I work at an organization that has massive data warehouses in Teradata, and it's next to impossible to get it to work well. I developed a Python (3.9) script that uses the teradatasql package to interface Python/SQLAlchemy with Teradata, and it works reasonably well.
This script reflects a PostgreSQL database and, table by table, pulls the data from PostgreSQL, saves it as an object in Python, then uses the reflected PostgreSQL tables to create comparable versions in Teradata, and finally pushes the retrieved data into those tables.
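A rough sketch of that reflect-and-copy flow (not the author's actual script), assuming SQLAlchemy 1.4 with the teradatasqlalchemy dialect installed; the connection URLs are placeholders and column types may still need manual mapping between the two databases:

from sqlalchemy import MetaData, create_engine

pg_engine = create_engine("postgresql://user:password@pg-host/dbname")   # placeholder URL
td_engine = create_engine("teradatasql://user:password@td-host")         # placeholder URL

src_meta = MetaData()
src_meta.reflect(bind=pg_engine)                   # reflect every PostgreSQL table

dst_meta = MetaData()
for table in src_meta.sorted_tables:
    mirrored = table.to_metadata(dst_meta)         # copy the table definition
    mirrored.create(bind=td_engine, checkfirst=True)
    with pg_engine.connect() as src, td_engine.begin() as dst:
        rows = [dict(r) for r in src.execute(table.select()).mappings()]
        if rows:
            dst.execute(mirrored.insert(), rows)   # executemany-style insert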
I've quality checked this and here's what I get:
table reflection time in PostgreSQL: 0.95 seconds on average
longest PostgreSQL data retrieval: 4.22 seconds
shortest time to create mirrored tables in Teradata: 12.10 seconds
shortest time to load data to Teradata: 8.39 seconds
longest time to load data to Teradata: approx. 7,212 seconds
The point here is that Teradata is a miserable platform that's ill equipped for business-critical tasks. I've demonstrated to senior leadership the performance of Teradata against five other ODBC-connected databases, and Teradata was far and away the least capable.
Maybe things are better on version 17; however, we're on version 16.20.53.30, which also doesn't support in-place multirow inserts. I'd recommend your organization stop using Teradata and move to something more capable, like Snowflake. Snowflake has EXCELLENT Python support and a great CLI.

Rename whitespace in column name in Parquet file using spark sql

I want to show the content of a Parquet file using Spark SQL, but since the column names in the Parquet file contain spaces I am getting this error:
Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I have written the code below:
val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()
but I am still getting the same error.
Try renaming the column first instead of aliasing it:
val r3 = r2.withColumnRenamed("First Name", "FirstName")
r3.show()
For anyone still looking for an answer,
There is no optimised way to remove spaces from column names while dealing with parquet data.
What can be done is:
Change the column names at the source itself, i.e, while creating the parquet data itself.
OR
(not the optimised way; it won't work for huge datasets) Read the parquet file using pandas and rename the columns on the pandas dataframe. If required, write the dataframe back to parquet using pandas itself and then continue with Spark (see the sketch after this list).
PS: With the new pandas API on Spark, scheduled to arrive in PySpark 3.2, combining pandas with Spark might be much faster and better optimised when dealing with huge datasets.
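A rough sketch of that pandas fallback, assuming pyarrow (or fastparquet) is installed, the data fits in memory, and an existing SparkSession named spark; the file paths are placeholders:

import pandas as pd

pdf = pd.read_parquet("input_with_spaces.parquet")
pdf.columns = [c.replace(" ", "") for c in pdf.columns]   # strip spaces from the names
pdf.to_parquet("renamed.parquet", index=False)

# continue in Spark with the cleaned file
df = spark.read.parquet("renamed.parquet")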
For anybody struggling with this, the only thing that worked for me was:
# read once to get the schema, rename the fields, then re-read with the fixed schema
base_df = spark.read.parquet(filename)
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(base_df.schema).parquet(filename)
This is from this thread: Spark Dataframe validating column names for parquet writes (scala)
Alias, withColumnRenamed, and "as" in SQL select statements didn't work; PySpark would still use the old name whenever I tried to .show() the dataframe.

Pyspark dataframe join taking a very long time

I have two dataframes in PySpark that I loaded from a Hive database using two Spark SQL queries.
When I try to join the two dataframes using df1.join(df2, df1.id_1 == df2.id_2), it takes a long time.
Does Spark re-execute the SQL queries for df1 and df2 when I call the join?
The underlying database is Hive.
PySpark will be slower compared to Scala, as data serialization occurs between the Python process and the JVM and the work is done in Python.
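An illustrative sketch, not part of the original answer: caching the inputs avoids re-reading from Hive if the plan is evaluated more than once, and explain() shows the physical plan the join will actually run. The table and column names are examples, and an existing SparkSession named spark is assumed:

df1 = spark.sql("SELECT id_1, amount FROM db.table_1").cache()
df2 = spark.sql("SELECT id_2, region FROM db.table_2").cache()

joined = df1.join(df2, df1.id_1 == df2.id_2)
joined.explain()   # inspect the physical plan (sort-merge vs broadcast join)
joined.count()     # first action: materializes the cached inputs and runs the join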