Below is the function I created to generate counts from a table, but in the query (string) I want to add a 'group by' on a column 'xyz'. How can I do this?
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
db = 'database'
schema = 'Schema'
def getCount(table):
    string = f"select count(*) as ct from {db}.{schema}.{table}"
    df = spark.read.format(snowflake_name)\
        .options(**sfOptions)\
        .option('query', string).load()
    return df
Well, one way would be to alter the f-string slightly:
string = f"select some_column, count(*) as ct from {db}.{schema}.{table} group by some_column"
The title almost says it already. I have a pyspark.sql.DataFrame with "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" columns. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15 min intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a Spark dataframe to play around with. input_spark_df represents the input Spark dataframe; the desired output looks like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import to_date, sum, avg, col

# Truncate the timestamp to a date, then group by "ID" and the daily "TIMESTAMP"
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.display()
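If you want to reproduce the pandas resample('1d').sum() behaviour exactly (both columns summed rather than the temperature averaged), a minimal variant of the aggregation would be:
daily_sums = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CONSUMPTION"),
        sum(col("TEMPERATURE")).alias("TEMPERATURE")
    ))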
I would like to aggregate the values of a JSON column in a Spark dataframe and in a Hive table.
e.g.
year, month, val (json)
2010 01 [{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv"}]
2010 01 [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]
2008 10 [{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"}]
2008 10 [{"a_id":"fvds"},{"a_id":"yjndf"},{"a_id":"yesva"}]
I need:
year, month, val (json), num (int)
2010 01 [{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv},{"a_id":"uktf"}, {"a_id":"ohcwa"}] 5
2008 10 [{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"},{"a_id":"yesva"}] 4
I need to remove the duplicates and also find the size of the aggregated JSON array (the number of "a_id" entries) in it.
The data is saved as a Hive table, so would it be better to work on it with PySpark SQL?
I would also like to know how to do it if the data is saved as a Spark dataframe.
I have tried:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType(
    [
        StructField('a_id', StringType(), True)
    ]
)
df.withColumn("val", from_json("val", schema))\
    .select(col('year'), col('month'), col('val.*'))\
    .show()
But all values in "val" are null.
Thanks!
UPDATE
My Hive version:
%sh
ls /databricks/hive | grep "hive"
spark--maven-trees--spark_1.4_hive_0.13
My code:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.types import *
from functools import reduce

def concate_elements(val):
    return reduce(lambda x, y: x + y, val)

flatten_array = F.udf(concate_elements, T.ArrayType(T.StringType()))
remove_duplicates = F.udf(lambda row: list(set(row)), T.ArrayType(T.StringType()))
#final results
df.select("year", "month", flatten_array("val").alias("flattenvalues")).withColumn("uniquevalues", remove_duplicates("flattenvalues")).withColumn("size", F.size("uniquevalues")).show()
Considered input data: JSON file json-input.json
{"year":"2010","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]}
{"year":"2011","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"uktf"},{"a_id":"sathya"}]}
Approach 1 - read data from Hive
1. Insert data into Hive
ADD JAR /home/sathya/Downloads/json-serde-1.3.7-jar-with-dependencies.jar
CREATE EXTERNAL TABLE json_table (
year string,
month string,
value array<struct<a_id:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
load data local inpath '/home/sathya/json-input.json' into table json_table;
select * from json_table;
OK
2010 01 [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]
2011 01 [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"uktf"},{"a_id":"sathya"}]
2. Read the data with Spark:
pyspark --jars /home/sathya/Downloads/json-serde-1.3.7-jar-with-dependencies.jar --driver-class-path /home/sathya/Downloads/json-serde-1.3.7-jar-with-dependencies.jar
df=spark.sql("select * from default.json_table")
df.show(truncate=False)
'''
+----+-----+----------------------------------+
|year|month|value |
+----+-----+----------------------------------+
|2010|01 |[[caes], [uktf], [ohcwa]] |
|2011|01 |[[caes], [uktf], [uktf], [sathya]]|
+----+-----+----------------------------------+
'''
#UDFs for concatenating the array elements & removing duplicates in an array
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import reduce

def concate_elements(val):
    return reduce(lambda x, y: x + y, val)

flatten_array = F.udf(concate_elements, T.ArrayType(T.StringType()))
remove_duplicates = F.udf(lambda row: list(set(row)), T.ArrayType(T.StringType()))
#final results
df.select("year", "month", flatten_array("value").alias("flattenvalues")).withColumn("uniquevalues", remove_duplicates("flattenvalues")).withColumn("size", F.size("uniquevalues")).show()
'''
+----+-----+--------------------------+--------------------+----+
|year|month|flattenvalues |uniquevalues |size|
+----+-----+--------------------------+--------------------+----+
|2010|01 |[caes, uktf, ohcwa] |[caes, uktf, ohcwa] |3 |
|2011|01 |[caes, uktf, uktf, sathya]|[caes, sathya, uktf]|3 |
+----+-----+--------------------------+--------------------+----+
'''
Approach 2 - direct read from input Json file json-input.json
{"year":"2010","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]}
{"year":"2011","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"uktf"},{"a_id":"sathya"}]}
Code for your scenario is:
import os
import logging
from pyspark.sql import SQLContext,SparkSession
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.sql import functions as F
import pyspark.sql.types as T
df=spark.read.json("file:///home/sathya/json-input.json")
df.show(truncate=False)
'''
+-----+----------------------------------+----+
|month|value |year|
+-----+----------------------------------+----+
|01 |[[caes], [uktf], [ohcwa]] |2010|
|01 |[[caes], [uktf], [uktf], [sathya]]|2011|
+-----+----------------------------------+----+
'''
#UDFs for concatenating the array elements & removing duplicates in an array
from pyspark.sql.functions import udf, size
from functools import reduce

def concate_elements(val):
    return reduce(lambda x, y: x + y, val)

flatten_array = F.udf(concate_elements, T.ArrayType(T.StringType()))
remove_duplicates = udf(lambda row: list(set(row)), ArrayType(StringType()))
#final results
df.select("year", "month", flatten_array("value").alias("flattenvalues")).withColumn("uniquevalues", remove_duplicates("flattenvalues")).withColumn("size", size("uniquevalues")).show()
'''
+----+-----+--------------------------+--------------------+----+
|year|month|flattenvalues |uniquevalues |size|
+----+-----+--------------------------+--------------------+----+
|2010|01 |[caes, uktf, ohcwa] |[caes, uktf, ohcwa] |3 |
|2011|01 |[caes, uktf, uktf, sathya]|[caes, sathya, uktf]|3 |
+----+-----+--------------------------+--------------------+----+
'''
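As an aside, on Spark 2.4+ you can get the same result without Python UDFs, since the a_id field can be pulled straight out of the array of structs and deduplicated with built-in functions. A sketch against the same df as above (no_udf_df is just an illustrative name):
no_udf_df = (df
    .withColumn("flattenvalues", F.col("value.a_id"))            # array<struct<a_id>> -> array<string>
    .withColumn("uniquevalues", F.array_distinct("flattenvalues"))
    .withColumn("size", F.size("uniquevalues")))
no_udf_df.show(truncate=False)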
Here is a solution that'll work in Databricks:
#Import libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
#Define schema
schema1 = StructType([
    StructField('year', IntegerType(), True),
    StructField('month', StringType(), True),
    StructField('val', ArrayType(StructType([
        StructField('a_id', StringType(), True)
    ])))
])
#Test data
rowsArr = [
    [2010, '01', [{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv"}]],
    [2010, '01', [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]],
    [2008, '10', [{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"}]],
    [2008, '10', [{"a_id":"fvds"},{"a_id":"yjndf"},{"a_id":"yesva"}]]
]
#Create dataframe
df1 = (spark
    .createDataFrame(rowsArr, schema=schema1)
)

#Create database
spark.sql('CREATE DATABASE IF NOT EXISTS testdb')

#Dump it into hive table
(df1
    .write
    .mode('overwrite')
    .options(schema=schema1)
    .saveAsTable('testdb.testtable')
)

#read from hive table
df_ht = (spark
    .sql('select * from testdb.testtable')
)

#Perform transformation
df2 = (df_ht
    .groupBy('year', 'month')
    .agg(array_distinct(flatten(collect_list('val'))).alias('val'))
    .withColumn('num', size('val'))
)
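To check that df2 has the shape asked for in the question (year, month, val, num), you can simply display it:
df2.show(truncate=False)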
SL No  Customer  Month      Amount
1      A1        12-Jan-04  495414.75
2      A1        3-Jan-04   245899.02
3      A1        15-Jan-04  259490.06
My dataframe is above.
Code:
import findspark
findspark.init('/home/mak/spark-3.0.0-preview2-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mak').getOrCreate()
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf3 = pd.read_csv('Repayment.csv')
df_repay = spark.createDataFrame(pdf3)
Only loading df_repay has an issue; the other dataframes load successfully. When I switched the above code to the code below, it worked:
df4 = (spark.read.format("csv").options(header="true")
.load("Repayment.csv"))
Why is df_repay not loaded with spark.createDataFrame(pdf3), while similar dataframes load successfully?
pdf3 is a pandas dataframe and you are trying to convert a pandas dataframe to a Spark dataframe. If you want to stick with your code, please use the code below, which converts your pandas dataframe to a Spark dataframe with an explicit schema.
from pyspark.sql.types import *

pdf3 = pd.read_csv('Repayment.csv')

#create schema for your dataframe (one field per CSV column in the sample above)
schema = StructType([StructField("SL No", IntegerType(), True),
                     StructField("Customer", StringType(), True),
                     StructField("Month", StringType(), True),   # pandas reads the date as a plain string
                     StructField("Amount", DoubleType(), True)])

#create spark dataframe using schema
df_repay = spark.createDataFrame(pdf3, schema=schema)
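If Month should end up as a real date rather than a string, one option (a sketch assuming the d-MMM-yy format shown in the sample, e.g. 12-Jan-04) is to convert it after creating the Spark dataframe:
from pyspark.sql.functions import to_date

# parse e.g. "12-Jan-04" into a DateType column
df_repay = df_repay.withColumn("Month", to_date("Month", "d-MMM-yy"))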
I'm facing the following problem and couldn't find an answer yet: when converting a pandas dataframe with integers to a PySpark dataframe using a schema that expects the data to be strings, the values change to "strange" strings, as in the example below. I've saved a lot of important data like that, and I wonder why that happened and whether it is possible to "decode" these symbols back to their integer forms. Thanks in advance!
import pandas as pd
from pyspark.sql.types import StructType, StructField,StringType
df = pd.DataFrame(data = {"a": [111,222, 333]})
schema = StructType([
    StructField("a", StringType(), True)
])
sparkdf = spark.createDataFrame(df, schema)
sparkdf.show()
Output:
+---+
|  a|
+---+
|  o|
|  Þ|
|  ō|
+---+
I cannot reproduce the problem on any recent version but the most likely reason is that you incorrectly defined the schema (in combination with enabled Arrow support).
Either cast the input:
df["a"] = df.a.astype("str")
or define the correct schema:
from pyspark.sql.types import LongType
schema = StructType([
    StructField("a", LongType(), True)
])
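For completeness, a quick sketch of both options end to end (reusing pd, spark and the df from the snippets above); with a matching LongType schema the original integers 111, 222, 333 come through intact:
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Option 1: cast to string in pandas first and keep the StringType schema
df_str = df.copy()
df_str["a"] = df_str.a.astype("str")
spark.createDataFrame(df_str, StructType([StructField("a", StringType(), True)])).show()

# Option 2: keep the integers and declare a matching LongType schema
spark.createDataFrame(df, StructType([StructField("a", LongType(), True)])).show()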