PySpark and Hive/Impala

I want to build a classification model in PySpark. The input to this model is the result of a SELECT query or a view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that file to the model?

Yes. For this you need to create a HiveContext from your SparkContext.
Here is an example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # sc is the existing SparkContext
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a DataFrame that carries the table's schema; check it with tableData.printSchema()
tableData.collect()  # collect() executes the query and returns all rows to the driver
For more details, see the Spark SQL programming guide:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html
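On Spark 2.0 and later, the same idea is usually written with a SparkSession that has Hive support enabled, since HiveContext is deprecated there. The following is a minimal sketch, assuming a working Hive metastore configuration. The table name my_db.training_data, the column names, the application name, and the helpers build_query and load_training_frame are hypothetical and only serve as illustration.

```python
def build_query(table, columns):
    """Build a simple SELECT statement for the given table and columns."""
    return "SELECT {} FROM {}".format(", ".join(columns), table)


def load_training_frame(query):
    """Run `query` against Hive and return the result as a Spark DataFrame."""
    # Imported here so that build_query can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-query-to-model")  # hypothetical application name
        .enableHiveSupport()             # lets spark.sql() see Hive tables and views
        .getOrCreate()
    )
    return spark.sql(query)


# Hypothetical table and columns, for illustration only.
query = build_query("my_db.training_data", ["label", "feature_1", "feature_2"])
print(query)
# df = load_training_frame(query)  # requires a cluster with Hive configured
```

The returned DataFrame can be handed to the model-building code directly. For large tables it is probably better to keep working on the DataFrame than to call collect(), which pulls every row to the driver.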

Related

Pyspark not reading all months from hive database

I am trying to read data from Hive into PySpark in order to write CSV files. The following SQL query returns 5 months:
select distinct posting_date from my_table
When I read the data with PySpark, I only get 4 months:
sql_query = 'select * from my_table'
data = spark_session.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
I had the same problem in the past and solved it by using the deprecated API for running SQL:
sql_context = SQLContext(spark_session.sparkContext)
data = sql_context.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
The problem is that I have the same issue in my current project, and this time neither method solves it.
I also tried HiveContext instead of SQLContext, but with no luck.

Right way to implement pandas.read_sql with ClickHouse

I am trying to use the pandas.read_sql function with ClickHouse.
I created a ClickHouse table and filled it:
create table regions
(
date DateTime Default now(),
region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;
insert into regions (region) values ('Asia'), ('Europe')
Then the Python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:@localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As a result I expected a DataFrame with the columns date and region, but all I get is an empty DataFrame:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
Update: it turned out that using the clickhouse+native scheme in the URI solves the problem.
Can it be solved without +native?
There is an old issue about this: https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. It also contains a hint suggesting that FORMAT TabSeparatedWithNamesAndTypes be appended to the query, so the initial query would look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes
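The empty result is consistent with a reader that expects the first two lines of the response to hold the column names and the column types. The sketch below is a simplified model of that behaviour, not the driver's actual code. The function parse_names_and_types and the sample responses, including the timestamps, are assumptions made for illustration.

```python
def parse_names_and_types(response_text):
    """Parse tab-separated text whose first two lines are names and types."""
    lines = response_text.strip("\n").split("\n")
    names = lines[0].split("\t")
    types = lines[1].split("\t")
    rows = [line.split("\t") for line in lines[2:]]
    return names, types, rows


# Plain TabSeparated output: data rows only (timestamps are illustrative).
plain = "2021-01-08 09:24:33\tAsia\n2021-01-08 09:24:33\tEurope\n"

# TabSeparatedWithNamesAndTypes output: names, then types, then data rows.
with_header = (
    "date\tregion\n"
    "DateTime\tString\n"
    "2021-01-08 09:24:33\tAsia\n"
    "2021-01-08 09:24:33\tEurope\n"
)

names_plain, _, rows_plain = parse_names_and_types(plain)
names_ok, types_ok, rows_ok = parse_names_and_types(with_header)

print(names_plain, rows_plain)  # the first data row is mistaken for the header
print(names_ok, rows_ok)
```

With a two-row table, such a reader would take the first row as column names and the second as types, leaving no data rows. That would explain an empty DataFrame whose columns are a timestamp and Asia.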

Querying avro data files stored in Azure Data Lake directly with raw SQL from Databricks

I'm using Databricks notebooks to read Avro files stored in Azure Data Lake Storage Gen2. The Avro files are created by Event Hubs Capture and have a specific schema. From these files I have to extract only the Body field, which is where the data I'm interested in is actually stored.
I already implemented this in Python and it works as expected:
path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path) # 1
df1 = df0.select(df0.Body.cast('string')) # 2
rdd1 = df1.rdd.map(lambda x: x[0]) # 3
data = spark.read.json(rdd1) # 4
Now I need to translate this to raw SQL so that I can filter the data directly in the SQL query. Of the 4 steps above, steps 1 and 2 look as follows in SQL:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro")
WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro)
SELECT * FROM body_array
With this partial query I get the same as df1 above (step 2 with Python):
Body
[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},
{"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"},
{"id":"a123","group":"0","value":2.3,"timestamp":"2020-01-01T00:02:00.0000000"},
{"id":"a123","group":"0","value":1.8,"timestamp":"2020-01-01T00:03:00.0000000"}]
[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"},
{"id":"b123","group":"0","value":1.2,"timestamp":"2020-01-01T00:01:01.0000000"},
{"id":"b123","group":"0","value":2.1,"timestamp":"2020-01-01T00:02:01.0000000"},
{"id":"b123","group":"0","value":1.7,"timestamp":"2020-01-01T00:03:01.0000000"}]
...
I need to know how to express steps 3 and 4 in the SQL query, so that the strings are parsed into JSON objects and I finally get the desired DataFrame with the columns id, group, value and timestamp. Thanks.
One way I found to do this with raw SQL is as follows, using the from_json Spark SQL built-in function and the schema of the Body field:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro")
WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro),
data1 AS (SELECT from_json(Body, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') FROM body_array),
data2 AS (SELECT explode(*) FROM data1),
data3 AS (SELECT col.* FROM data2)
SELECT * FROM data3 WHERE id = "a123" --FILTERING BY CHANNEL ID
It runs faster than the Python code I posted in the question, presumably because from_json is given the schema of Body up front instead of inferring it from the data. My version of this approach in PySpark looks as follows:
path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path)
df1 = df0.selectExpr("cast(Body as string) as json_data")
df2 = df1.selectExpr("from_json(json_data, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') as parsed_json")
data = df2.selectExpr("explode(parsed_json) as json").select("json.*")
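To make the data flow of the query concrete, here is a plain-Python analogue of the parse, explode, select and filter steps. It is not Spark code and says nothing about performance. The two Body strings are shortened, illustrative versions of the sample output in the question.

```python
import json

# Two illustrative Body strings, shaped like the sample output in the question.
bodies = [
    '[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},'
    ' {"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"}]',
    '[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"}]',
]

# Analogue of from_json: parse each string into a list of records.
parsed = [json.loads(body) for body in bodies]

# Analogue of explode: emit one row per element of each list.
exploded = [record for records in parsed for record in records]

# Analogue of col.* plus WHERE id = "a123": keep only the matching records.
filtered = [record for record in exploded if record["id"] == "a123"]

for record in filtered:
    print(record["id"], record["group"], record["value"], record["timestamp"])
```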

Scio saveAsTypedBigQuery write to a partition for SCollection of Typed Big Query case class

I'm trying to write an SCollection to a partition in BigQuery using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is:
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.
How can I write to a partition? I don't see any option for specifying a partition in either saveAsTypedBigQuery method, so I was trying the legacy SQL table decorators.
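See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to create the table manually, because BigQuery IO cannot create a table and partition it.
Additionally, the "table decorators cannot be used" part of the message was a red herring. It was the alphanumeric requirement I was violating: ISO_LOCAL_DATE formats the date with hyphens (2017-06-21), while BASIC_ISO_DATE produces 20170621.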
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)
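The difference between the two formatters can be shown with a small Python analogue, where date.isoformat() plays the role of ISO_LOCAL_DATE and strftime("%Y%m%d") the role of BASIC_ISO_DATE. The helpers partition_table_spec and is_alphanumeric_partition are hypothetical, and the regular expression is only a stand-in for the rule quoted in the error message, not BigQuery's actual validation.

```python
from datetime import date
import re


def partition_table_spec(table, day, basic=True):
    """Return 'table$<date>' using either the basic or the extended ISO form."""
    suffix = day.strftime("%Y%m%d") if basic else day.isoformat()
    return "{}${}".format(table, suffix)


def is_alphanumeric_partition(spec):
    """Check that the part after '$' holds only letters, digits or underscores."""
    _, _, suffix = spec.partition("$")
    return re.fullmatch(r"[A-Za-z0-9_]+", suffix) is not None


day = date(2017, 6, 21)
extended_spec = partition_table_spec("test.test", day, basic=False)
basic_spec = partition_table_spec("test.test", day, basic=True)

print(extended_spec, is_alphanumeric_partition(extended_spec))
print(basic_spec, is_alphanumeric_partition(basic_spec))
```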

SparkSQL Staging Table Row Count vs Hive Row count

I am attempting to load data from Cassandra into a specific partition of a Hive table using Spark 2.1.1 on Hadoop 2.7. To do this, I read all the data from Cassandra into an RDD, convert it to a DataFrame via rdd.toDF(), and pass it to the following function:
def writeToHive(ss: SparkSession, df: DataFrame): Unit = {
  df.createOrReplaceTempView(tablename)
  val cols = df.columns
  val schema = df.schema
  // logs 358
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${tablename}""").first().getLong(0).toString)
  val outdf = ss.sql(s"""INSERT INTO TABLE ${db}.${t} PARTITION (date="${destPartition}") SELECT * FROM ${tablename}""")
  // Have also tried the following lines below, but yielded the same results
  // var dfInput_1 = dfInput.withColumn("region", lit(s"${destPartition}"))
  // dfInput_1.write.mode("append").insertInto(s"${db}.${t}")
  // logs 358
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${tablename}""").first().getLong(0).toString)
  // logs 423
  LOG.info(ss.sql(s"""SELECT COUNT(*) FROM ${db}.${t} where date='${destPartition}'""").first().getLong(0).toString)
}
When I look in Cassandra, there are indeed 358 rows in the table. I saw this post on Hortonworks, https://community.hortonworks.com/questions/51322/count-msmatch-while-using-the-parquet-file-in-spar.html, but it doesn't seem to offer a solution. I have tried setting spark.sql.hive.metastorePartitionPruning to true, but the row counts did not change.
I would love any feedback as to why there is a discrepancy between the row counts. Thanks!
EDIT: the cause was bad incoming data. I should have seen that coming.
Sometimes the data contains characters that are not valid UTF-8, for example Japanese or Chinese text stored in another encoding. Check whether your data contains any such characters.
If that is the case, insert the data in ORC format. The default storage format is text, and text does not support non-UTF-8 characters.
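One way to check for such values before writing is to try decoding the raw bytes as UTF-8. The following is a minimal sketch in plain Python. The helper find_invalid_utf8 and the sample values are hypothetical, and it assumes you can get at the raw bytes of the suspicious column.

```python
def find_invalid_utf8(values):
    """Return (index, raw bytes) for every value that is not valid UTF-8."""
    invalid = []
    for index, raw in enumerate(values):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            invalid.append((index, raw))
    return invalid


# Illustrative sample: the same Japanese word encoded as UTF-8 and as Shift_JIS.
samples = [
    "Tokyo".encode("utf-8"),
    "東京".encode("utf-8"),
    "東京".encode("shift_jis"),
]

bad = find_invalid_utf8(samples)
print(bad)
```

Only the Shift_JIS value is reported here. Japanese text that is already encoded as UTF-8 decodes without error.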