How can we parse multiline using pyspark - hive

I have a test csv file with the content below:
"TVBQGEM461
2016-10-05 14:04:33 cvisser gefixt door company
"
I need to store this entire content in one single row. However, while processing with pyspark, it is getting split into 2 rows.
Below is the pyspark code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([ \
StructField("desc", StringType(), True)])
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', inferschema='true').load('/dev/landingzone/test.csv', schema = customSchema)
df.registerTempTable("temp")
sqlContext.sql("create table dev_core_source.test as select * from temp")
Data is getting loaded into the Hive table, but it is split into 2 rows instead of 1 row.
I have also tried some other options, like the one below for creating the data frame, but I am still facing the same issue.
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .option('inferschema', 'true') \
    .option('wholeFile', 'true') \
    .option('parserLib', 'UNIVOCITY') \
    .load('/dev/landingzone/test.csv', schema = customSchema)
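For reference, if you are on Spark 2.2 or later, the built-in CSV reader has a multiLine option that keeps quoted records spanning several lines in one row. A minimal sketch, assuming a SparkSession named spark and the same path and schema as above:
# Sketch assuming Spark 2.2+, where the built-in csv reader supports multiLine;
# quoted fields containing newlines are then kept in a single row.
df = (spark.read
      .schema(customSchema)
      .option('header', 'false')
      .option('multiLine', 'true')
      .csv('/dev/landingzone/test.csv'))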

Related

How to remove duplicate records from PySpark DataFrame based on a condition?

Assume that I have a PySpark DataFrame like below:
# Prepare Data
data = [('Italy', 'ITA'), \
('China', 'CHN'), \
('China', None), \
('France', 'FRA'), \
('Spain', None), \
('Taiwan', 'TWN'), \
('Taiwan', None)
]
# Create DataFrame
columns = ['Name', 'Code']
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)
As you can see, a few countries are repeated twice (China & Taiwan in the above example). I want to delete records that satisfy the following conditions:
The column 'Name' is repeated more than once
AND
The column 'Code' is Null.
Note that column 'Code' can be Null for countries which are not repeated, like Spain. I want to keep those records.
The expected output will be like:
Name      Code
'Italy'   'ITA'
'China'   'CHN'
'France'  'FRA'
'Spain'   Null
'Taiwan'  'TWN'
In fact, I want to have one record for every country. Any idea how to do that?
You can use Window.partitionBy to achieve your desired results:
from pyspark.sql import Window
import pyspark.sql.functions as f
df1 = df.select('Name', f.max('Code').over(Window.partitionBy('Name')).alias('Code')).distinct()
df1.show()
Output:
+------+----+
| Name|Code|
+------+----+
| China| CHN|
| Spain|null|
|France| FRA|
|Taiwan| TWN|
| Italy| ITA|
+------+----+
To get the non-null rows first, use the row_number window function to partition by the Name column and sort by the Code column. Since null is considered the smallest value in a Spark order by, descending order is used. Then take the first row of each group.
from pyspark.sql import functions as F
df = df.withColumn('rn', F.expr('row_number() over (partition by Name order by Code desc)')).filter('rn = 1').drop('rn')
Here is one approach, although note that PySpark's dropDuplicates has no keep parameter (that is pandas syntax), so which duplicate row survives is not guaranteed:
df = df.dropDuplicates(subset=["Name"])
There will almost certainly be a cleverer way to do this, but for the sake of a lesson, what if you:
make a new dataframe with just 'Name'
drop duplicates on that
delete records where Code is null from the initial table
do a left join between the new table and the old table to bring back 'Code' (a PySpark version of the same idea is sketched after the pandas code below)
I've added Australia with no country code just so you can see it works for that case as well.
import pandas as pd
data = [('Italy', 'ITA'), \
('China', 'CHN'), \
('China', None), \
('France', 'FRA'), \
('Spain', None), \
('Taiwan', 'TWN'), \
('Taiwan', None), \
('Australia', None)
]
# Create DataFrame
columns = ['Name', 'Code']
df = pd.DataFrame(data = data, columns = columns)
print(df)
# get unique country names
uq_countries = df['Name'].drop_duplicates().to_frame()
print(uq_countries)
# remove None
non_na_codes = df.dropna()
print(non_na_codes)
# combine
final = pd.merge(left=uq_countries, right=non_na_codes, on='Name', how='left')
print(final)
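The same left-join idea can be written directly in PySpark; a rough, untested sketch using the Spark df from the question:
# Same approach on the Spark DataFrame: unique names, non-null codes, then a
# left join so countries without any code still survive with a null Code.
uq_countries = df.select('Name').distinct()
non_na_codes = df.dropna(subset=['Code'])
final = uq_countries.join(non_na_codes, on='Name', how='left')
final.show()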

How to export a null field on BIGQUERY

When I export a JSON object through BigQuery and a field has a null value, that field disappears from the downloaded results.
An example of the exported query:
EXPORT DATA OPTIONS(
uri='gs://analytics-export/_*',
format='JSON',
overwrite=true) AS
SELECT NULL AS field1
Actual result is: {}
When the expected result is: {field1: null}
How can I force the export to include the null value, as shown in the expected result?
For this, you can use either of the following:
Select TO_JSON_STRING(NULL) as field1
Select 'null' as field1
The EXPORT DATA documentation has no reference to an option that includes null values in the output, so I think you can go to the feature request page and create a request for it. There are also similar observations on other projects suggesting that it is not supported yet; see details here.
There are many workarounds for this; let me show you two options:
Option 1: Call directly from python using bigquery client library
from google.cloud import bigquery
import json
client = bigquery.Client()
query = "select null as field1, null as field2"
query_job = client.query(query)
json_list = {}
for row in query_job:
    json_row = {'field1': row[0], 'field2': row[1]}
    json_list.update(json_row)
with open('test.json', 'w+') as file:
    file.write(json.dumps(json_list))
Option 2: Use Apache Beam (Dataflow) with Python and BigQuery to produce the desired output
import argparse
import re
import json
import apache_beam as beam
from apache_beam.io import BigQuerySource
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
def add_null_field(row, field):
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row

def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
         | beam.Map(add_null_field, field='skip')
         | beam.Map(json.dumps)
         | beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))

if __name__ == '__main__':
    run()
To run it:
python -m export --output gs://my_bucket_id/output/ \
--runner DataflowRunner \
--project my_project_id \
--region my_region \
--temp_location gs://my_bucket_id/tmp/
Note: just replace my_project_id, my_bucket_id and my_region with the appropriate values, and look in your Cloud Storage bucket for the output file.
Both options will produce you the output you are looking for:
{"field1": null, "field2": null}
Please let me know if it helps you and gives you the result you want to achieve.

TypeError: field Customer: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

SL No  Customer  Month      Amount
1      A1        12-Jan-04  495414.75
2      A1        3-Jan-04   245899.02
3      A1        15-Jan-04  259490.06
My DataFrame is shown above.
Code
import findspark
findspark.init('/home/mak/spark-3.0.0-preview2-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mak').getOrCreate()
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf3 = pd.read_csv('Repayment.csv')
df_repay = spark.createDataFrame(pdf3)
Only loading df_repay has an issue; the other data frames are loaded successfully. When I switched the above code to the code below, it worked successfully:
df4 = (spark.read.format("csv").options(header="true")
.load("Repayment.csv"))
Why is df_repay not loaded with spark.createDataFrame(pdf3), while similar data frames are loaded successfully?
pdf3 is a pandas dataframe and you are trying to convert a pandas dataframe to a Spark dataframe. Spark has to infer a single type per column, and mixed values in a column (for example strings alongside NaN floats) can trigger this merge error. If you want to stick to your code, use the code below, which converts your pandas dataframe to a Spark dataframe with an explicit schema:
from pyspark.sql.types import *
pdf3 = pd.read_csv('Repayment.csv')
#create an explicit schema for your dataframe
schema = StructType([StructField("Customer", StringType(), True),
                     StructField("Month", StringType(), True),   # dates arrive as strings like '12-Jan-04'
                     StructField("Amount", DoubleType(), True)])  # amounts such as 495414.75 are not integers
#create the spark dataframe using the schema (select only the columns the schema describes)
df_repay = spark.createDataFrame(pdf3[["Customer", "Month", "Amount"]], schema=schema)
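If you later need Month as a real date rather than a string, you can cast it afterwards; a small sketch where the 'd-MMM-yy' pattern is only an assumption based on values like '12-Jan-04':
from pyspark.sql import functions as F
# Assumed date pattern for values like '12-Jan-04'; adjust it if your CSV differs.
df_repay = df_repay.withColumn("Month", F.to_date("Month", "d-MMM-yy"))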

Multiplication of RDD row to all other rows in PySpark

I have an RDD of DenseVector objects and I want to:
Select one of these vectors (one row)
Perform a multiplication of this vector with all the other vector rows in order to compute a similarity (cosine)
Basically I am trying to perform a dot product between a vector and a matrix, starting from an RDD. For reference, the RDD contains TF-IDF values built with Spark ML, which produces a dataframe of SparseVectors; these have been mapped to DenseVectors in order to do the multiplication. The dataframe and corresponding RDD are called tfidf_df and tfidf_rdd respectively.
What I do, which works, is the following (full script with sample data):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, Tokenizer, CountVectorizer
from pyspark.mllib.linalg import DenseVector
import numpy as np
sc = SparkContext()
sqlc = SQLContext(sc)
spark_session = SparkSession(sc)
sentenceData = spark_session.createDataFrame([
    (0, "I go to school school is good"),
    (1, "I like school"),
    (2, "I also like cinema")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="tokens")
tokens_df = tokenizer.transform(sentenceData)
# TF feats
count_vectorizer = CountVectorizer(inputCol="tokens",
                                   outputCol="tf_features")
model = count_vectorizer.fit(tokens_df)
tf_df = model.transform(tokens_df)
print(model.vocabulary)
print(tf_df.rdd.take(5))
idf = IDF(inputCol="tf_features",
          outputCol="tf_idf_features",
          )
model = idf.fit(tf_df)
tfidf_df = model.transform(tf_df)
# Transform into RDD of dense vectors
tfidf_rdd = tfidf_df.select("tf_idf_features") \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray()))
print(tfidf_rdd.take(3))
# Select the test vector
test_label = 1
vec = tfidf_df.filter(tfidf_df.label == test_label) \
    .select('tf_idf_features') \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray())).collect()[0]
rddB = tfidf_rdd.map(lambda row: np.dot(row / np.linalg.norm(row),
                                        vec / np.linalg.norm(vec))) \
    .zipWithIndex()
# print('*** multiplication', rddB.take(20))
# Sort the similarities
sorted_rddB = rddB.sortByKey(False)
print(sorted_rddB.take(20))
The test vector has been selected as the one whose label is 1. The end result with similarities is (from the last print statement) [(1.0000000000000002, 1), (0.27105728525552131, 0), (0.1991208898963957, 2)] where indexing has been used to trace back to the original dataset.
This works fine but looks a bit clunky. I'm looking for best practices to perform a multiplication between a selected row of a dataframe (a vector) and all of the dataframe's vectors. I'm open to any suggestions about the workflow, specifically performance-related ones.
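One small tidy-up, not necessarily a best practice, is to normalize once and broadcast the query vector so the cosine is computed in a single pass; a rough sketch reusing vec, tfidf_rdd and sc from the script above:
import numpy as np
# Normalize the (small) query vector once and broadcast it to the executors.
unit = lambda v: v / np.linalg.norm(v)
query = sc.broadcast(unit(vec.toArray()))
# Cosine similarity of every row against the broadcast query vector, with indices.
cosine_rdd = (tfidf_rdd
              .map(lambda row: float(np.dot(unit(row.toArray()), query.value)))
              .zipWithIndex())
print(cosine_rdd.sortByKey(False).take(20))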

Spark sql pivot table generation

I have a spark dataframe that looks like this:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql.types import StringType, IntegerType, StructType, StructField,LongType
from pyspark.sql.functions import sum, mean
rdd = sc.parallelize([('retail', 'food'),
                      ('retail', 'food'),
                      ('retail', 'auto'),
                      ('retail', 'shoes'),
                      ('wholesale', 'healthsupply'),
                      ('wholesale', 'foodsupply'),
                      ('wholesale', 'foodsupply'),
                      ('retail', 'toy'),
                      ('retail', 'toy'),
                      ('wholesale', 'foodsupply')])
schema = StructType([StructField('division', StringType(), True),
StructField('category', StringType(), True)
])
df = sqlContext.createDataFrame(rdd, schema)
I want to generate a table like this: for each division, get the division name, the division's total record count, and the top 1 and top 2 categories within the division together with their record counts:
division   division_total  cat_top_1   top1_cnt  cat_top_2     top2_cnt
retail     5               food        2         toy           2
wholesale  4               foodsupply  3         healthsupply  1
Now I can generate cat_top_1 and cat_top_2 by using window functions in Spark, but I could not figure out how to pivot them into one row per division and also add a division_total column:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, asc, desc, row_number

df_by_div = df.groupby('division', 'category').count().sort(asc("division"), desc("count"))
windowSpec = Window.partitionBy("division").orderBy(col("count").desc())
df_list = df_by_div.withColumn("rn", row_number()
                               .over(windowSpec).cast('int')) \
    .where(col("rn") <= 2) \
    .orderBy("division", desc("count"))