PySpark column values are getting shifted automatically while creating a DataFrame

I am trying to create a PySpark DataFrame manually using the nested schema below:
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('source', StringType()),
        StructField('sourceids', ArrayType(IntegerType()))]))),
    StructField('first_name', StringType()),
    StructField('last_name', StringType()),
    StructField('kare_id', StringType()),
    StructField('match_key', ArrayType(StringType()))
])
I am using the code below to create a DataFrame with this schema:
row = [
    Row(fields=[Row(source='BCONNECTED', sourceids=[10, 202, 30]),
                Row(source='KP', sourceids=[20, 30, 40])],
        first_name='Christopher', last_name='Nolan', kare_id='kare1', match_key=['abc', 'abcd']),
    Row(fields=[Row(source='BCONNECTED', sourceids=[20, 304, 5, 6]),
                Row(source='KP', sourceids=[40, 50, 60])],
        first_name='Michael', last_name='Caine', kare_id='kare2', match_key=['ncnc', 'cncnc']),
]
content = spark.createDataFrame(sc.parallelize(row), schema=schema)
content.printSchema()
The schema is printed correctly, but when I call content.show() I can see that the values of the kare_id and last_name columns have been swapped.
+--------------------+-----------+---------+-------+-------------+
|              fields| first_name|last_name|kare_id|    match_key|
+--------------------+-----------+---------+-------+-------------+
|[[BCONNECTED, [10...|Christopher|    kare1|  Nolan|  [abc, abcd]|
|[[BCONNECTED, [20...|    Michael|    kare2|  Caine|[ncnc, cncnc]|
+--------------------+-----------+---------+-------+-------------+

PySpark sorts the fields of a Row created with named arguments by name, in lexicographic order (this is the behaviour in Spark versions before 3.0). The values in your data are therefore ordered as fields, first_name, kare_id, last_name, match_key.
Spark then pairs the schema's column names with those values by position, which produces the mismatch. The fix is to swap the schema entries for last_name and kare_id, as shown below:
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('source', StringType()),
        StructField('sourceids', ArrayType(IntegerType()))]))),
    StructField('first_name', StringType()),
    StructField('kare_id', StringType()),
    StructField('last_name', StringType()),
    StructField('match_key', ArrayType(StringType()))
])
From PySpark Docs on Row: "Row can be used to create a row object by using named arguments, the fields will be sorted by names."
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row
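To see why the values end up under the wrong names, here is a minimal plain-Python sketch that needs no Spark. It only imitates the sorting described above; sorted_row_values is a hypothetical helper, not part of PySpark, and only three of the five fields are used to keep it short.

```python
# Plain-Python imitation of a Row that sorts its keyword arguments by name.
# sorted_row_values is a made-up helper for illustration, not a PySpark API.
def sorted_row_values(**kwargs):
    """Return the values ordered by field name (lexicographic order)."""
    return [kwargs[name] for name in sorted(kwargs)]

values = sorted_row_values(first_name='Christopher', last_name='Nolan', kare_id='kare1')

# Field order of the schema in the question (last_name before kare_id).
question_schema = ['first_name', 'last_name', 'kare_id']
# Field order of the corrected schema (matches the sorted values).
fixed_schema = ['first_name', 'kare_id', 'last_name']

# Names and values are paired purely by position.
mismatched = dict(zip(question_schema, values))
matched = dict(zip(fixed_schema, values))

print(mismatched)  # {'first_name': 'Christopher', 'last_name': 'kare1', 'kare_id': 'Nolan'}
print(matched)     # {'first_name': 'Christopher', 'kare_id': 'kare1', 'last_name': 'Nolan'}
```

The first dictionary reproduces the swap seen in content.show(); the second shows that ordering the schema alphabetically lines the names up with the values again.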

First, you are effectively defining the schema twice. The Row objects in your RDD already carry the field names, so you do not need createDataFrame with an explicit schema. Instead you can do the following:
sc.parallelize(row).toDF().show()
If you still want to specify the schema explicitly, the schema and the data must be in the same order, and the schema in your question does not match the order of the data you are passing. The correct schema would be:
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('source', StringType()),
        StructField('sourceids', ArrayType(IntegerType()))]))),
    StructField('first_name', StringType()),
    StructField('kare_id', StringType()),
    StructField('last_name', StringType()),
    StructField('match_key', ArrayType(StringType()))
])
kare_id should come before last_name because that is the order in which the data is passed.

Related

replace .withColumn with a df.select

I am doing a basic transformation on my PySpark DataFrame, but I am using multiple .withColumn statements:
from pyspark.sql import functions as F
def trim_and_lower_col(col_name):
    # Replace blank strings with "unspecified"; otherwise trim and lower-case the value
    return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))
df = (
    source_df.withColumn("browser", trim_and_lower_col("browser"))
    .withColumn("browser_type", trim_and_lower_col("browser_type"))
    .withColumn("domains", trim_and_lower_col("domains"))
)
I read that chaining multiple withColumn calls isn't very efficient and that I should use df.select() instead.
I tried this:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
df = (
    source_df.select([trim_and_lower_col(col).alias(col) for col in cols_to_transform] + source_df.columns)
)
but it gives me a duplicate column error.
What else can I try?
The duplicate column error occurs because each transformed column appears twice in that list: once as the newly transformed column (through .alias) and once as the original column (by name, in source_df.columns). The following solution uses a single select statement, preserves the column order, and avoids the duplication:
df = (
    source_df.select([trim_and_lower_col(col).alias(col) if col in cols_to_transform else col for col in source_df.columns])
)
Chaining many .withColumn calls does pose a problem: the unresolved query plan can get very large and cause a StackOverflowError on the Spark driver during query plan optimisation. A good explanation of this problem is shared here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
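The name bookkeeping behind the two select statements can be checked without Spark. The sketch below only tracks output column names as strings; the "id" and "ts" columns are hypothetical stand-ins for whatever other columns source_df has.

```python
from collections import Counter

# Hypothetical column layout: "id" and "ts" are made-up untouched columns.
source_columns = ["id", "browser", "browser_type", "domains", "ts"]
cols_to_transform = ["browser", "browser_type", "domains"]

# Output names of the select in the question:
# the aliased transformed columns followed by every original column.
attempt_names = cols_to_transform + source_columns

# Output of the conditional comprehension: exactly one entry per original
# column, marked as transformed when it is in cols_to_transform.
fixed_plan = [
    ("transformed", c) if c in cols_to_transform else ("original", c)
    for c in source_columns
]
fixed_names = [name for _, name in fixed_plan]

duplicates = sorted(name for name, n in Counter(attempt_names).items() if n > 1)

print(duplicates)                     # ['browser', 'browser_type', 'domains']
print(fixed_names == source_columns)  # True
```

Every name in cols_to_transform shows up twice in the first list, which is what triggers the duplicate column error, while the conditional version yields each name once and in the original order.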
You are naming your new columns with .alias(col), which means they have the same name as the column they are derived from.
With .withColumn this is not a problem, because the existing column of that name is replaced. In your select, however, source_df.columns adds the originals again, so the result contains two columns with the same name and Spark does not know which one to pick.
You could fix it, for example, by giving the new columns a suffix:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
df = (
    source_df.select([trim_and_lower_col(col).alias(f"{col}_new") for col in cols_to_transform] + source_df.columns)
)
Another solution, which does pollute the DAG though, would be:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
for col in cols_to_transform:
    source_df = source_df.withColumn(col, trim_and_lower_col(col))
If you only have these few withColumn calls, keep using them. They are more readable, and therefore more maintainable and self-explanatory.
If you look into it, you'll see that the advice is to be careful with withColumn when you have something like 200 of them.
Using select also makes your code more error-prone, since it is harder to read.
If you do have many columns, I would define the list of columns to transform and the list of columns to keep, then do the select:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]
cols_to_keep = [c for c in source_df.columns if c not in cols_to_transform]
cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]
source_df.select(*cols_to_keep, *cols_transformed)
Note that this puts the transformed columns after the kept ones, so the column order matches the withColumn version only if the transformed columns were already the last columns of source_df.

how to refer values based on column names in python

I am trying to extract and read the data from a SQL query.
Below is the sample data from SQL Developer:
target_name   expected_instances  environment  system_name       hostname
--------------------------------------------------------------------------------------
ORAUAT_host1  1                   UAT          ORAUAT_host1_sys  host1.sample.net
ORAUAT_host2  1                   UAT          ORAUAT_host1_sys  host2.sample.net
Normally I pass the system_name to the query (which has a bind variable for system_name) and get the data as a list, but not the column names.
Is there a way in Python to retrieve the data along with the column names and reference values by column name, so that target_name[0] gives the value ORAUAT_host1? Please suggest. Thanks.
If what you want is to get the column names from the table you are querying, you can do something like the following.
My example writes the result to a CSV file:
import csv
import cx_Oracle
db = cx_Oracle.connect('user/pass@host:1521/service_name')
SQL = "select * from dual"
print(SQL)
cursor = db.cursor()
cursor.execute(SQL)
# cursor.description holds one 7-item sequence per result column; item 0 is the name
col_names = [col[0] for col in cursor.description]
# raw string so the backslash in the Windows path is not treated as an escape
with open(r"C:\dual.csv", "w", newline="") as f:
    writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(col_names)
    for row in cursor:
        writer.writerow(row)
The column names come from the description attribute of the cursor object:
Cursor.description
This read-only attribute is a sequence of 7-item sequences. Each of
these sequences contains information describing one result column:
(name, type, display_size, internal_size, precision, scale, null_ok).
This attribute will be None for operations that do not return rows or
if the cursor has not had an operation invoked via the execute()
method yet.
The type will be one of the database type constants defined at the
module level.
https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#
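To reference values by column name, as the question asks, the names from cursor.description can be zipped with each row. The sketch below is a minimal illustration that uses the standard-library sqlite3 module as a stand-in so it runs without an Oracle database; the targets table and its contents are a cut-down, assumed copy of the sample data. The description attribute is part of the Python DB-API, so the same pattern should carry over to a cx_Oracle cursor.

```python
import sqlite3

# Stand-in database: sqlite3 is used only so the sketch runs anywhere.
# The "targets" table is a hypothetical, cut-down copy of the sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE targets (target_name TEXT, environment TEXT, "
    "system_name TEXT, hostname TEXT)"
)
conn.executemany("INSERT INTO targets VALUES (?, ?, ?, ?)", [
    ("ORAUAT_host1", "UAT", "ORAUAT_host1_sys", "host1.sample.net"),
    ("ORAUAT_host2", "UAT", "ORAUAT_host1_sys", "host2.sample.net"),
])

cursor = conn.cursor()
# Named bind variable for system_name, as in the question.
cursor.execute(
    "SELECT * FROM targets WHERE system_name = :system_name ORDER BY target_name",
    {"system_name": "ORAUAT_host1_sys"},
)
col_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Row-oriented view: one dict per row, keyed by column name.
records = [dict(zip(col_names, row)) for row in rows]
# Column-oriented view: one list of values per column name.
columns = {name: [row[i] for row in rows] for i, name in enumerate(col_names)}

print(records[0]["hostname"])     # host1.sample.net
print(columns["target_name"][0])  # ORAUAT_host1
conn.close()
```

The column-oriented dictionary gives the target_name[0] style of access from the question, while the list of dicts is usually handier when processing one row at a time.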

Schema conflict when storing dataframes with datetime objects using load_table_from_dataframe()

I'm trying to load data from a pandas DataFrame into a BigQuery table. The DataFrame has a column of dtype datetime64[ns], and when I try to store the df using load_table_from_dataframe(), I get
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table [table name]. Field computation_triggered_time has changed type from DATETIME to TIMESTAMP.
The table has a schema which reads
CREATE TABLE `[table name]` (
...
computation_triggered_time DATETIME NOT NULL,
...
)
In the DataFrame, computation_triggered_time is a datetime64[ns] column. When I read the original DataFrame from CSV, I convert it from text to datetime like so:
df['computation_triggered_time'] = \
    pd.to_datetime(df['computation_triggered_time']).values.astype('datetime64[ms]')
Note:
The .values.astype('datetime64[ms]') part is necessary because load_table_from_dataframe() uses PyArrow to serialize the df and that fails if the data has nanosecond-precision. The error is something like
[...] Casting from timestamp[ns] to timestamp[ms] would lose data
This looks like a problem with Google's google-cloud-python package. Can you report the bug there? https://github.com/googleapis/google-cloud-python

SparkSQL Staging Table Row Count vs Hive Row count

I am attempting to extract data from Cassandra into a specific partitioned Hive table using Spark 2.1.1 on Hadoop 2.7. To do this, I load all the data from Cassandra into an RDD, which I transform into a DataFrame via rdd.toDF() and pass to the following function:
def writeToHive(ss: SparkSession, df: DataFrame): Unit = {
  df.createOrReplaceTempView(tablename)
  val cols = df.columns
  val schema = df.schema
  // logs 358
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${tablename}""").first().getLong(0).toString)
  val outdf = ss.sql(s"""INSERT INTO TABLE ${db}.${t} PARTITION (date="${destPartition}") SELECT * FROM ${tablename}""")
  // Have also tried the following lines below, but yielded the same results
  // var dfInput_1 = dfInput.withColumn("region", lit(s"${destPartition}"))
  // dfInput_1.write.mode("append").insertInto(s"${db}.${t}")
  // logs 358
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${tablename}""").first().getLong(0).toString)
  // logs 423
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${db}.${t} where date='${destPartition}'""").first().getLong(0).toString)
}
When looking in Cassandra, there are indeed 358 rows in the table. I saw this post on Hortonworks https://community.hortonworks.com/questions/51322/count-msmatch-while-using-the-parquet-file-in-spar.html but there doesn't seem to be a solution. I have tried setting spark.sql.hive.metastorePartitionPruning to true, but no changes were seen in the row counts.
Would love any feedback as to why there is a discrepancy between the row counts. Thanks!
EDIT: bad data coming in.... should've seen that coming
Sometimes the data contains non-ASCII characters, such as Japanese or Chinese text, that are not encoded as valid UTF-8. Check whether your data contains any such characters.
If this is the case, insert it in ORC format. The default format is text, which does not handle non-UTF-8 characters well.
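One way to run that check, sketched in Python under the assumption that you can get at the raw bytes of the incoming values (for example from an exported file). find_invalid_utf8 is a hypothetical helper name and the sample values are made up.

```python
def find_invalid_utf8(values):
    """Return the indexes of the byte strings that are not valid UTF-8."""
    bad = []
    for i, raw in enumerate(values):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(i)
    return bad

# Made-up sample values for illustration.
samples = [
    "plain ascii".encode("utf-8"),
    "日本語".encode("utf-8"),      # Japanese text, valid UTF-8
    "日本語".encode("shift_jis"),  # the same text in a legacy encoding
    b"\xff\xfe broken",            # bytes that can never occur in UTF-8
]

print(find_invalid_utf8(samples))  # [2, 3]
```

Note that the Japanese text is only flagged when it arrives in a legacy encoding; the same characters encoded as UTF-8 pass the check.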

PySpark and HIVE/Impala

I want to build a classification model in PySpark. The input to this model is the result of a SELECT query or a view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that to the model?
Yes, for this you need to use HiveContext with your SparkContext.
Here is an example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # sc is the existing SparkContext
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a DataFrame that carries the table's schema; check it with tableData.printSchema()
tableData.collect()  # collect() executes the query and returns all rows to the driver
Alternatively, you may refer to the Spark SQL programming guide:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html