I am trying to define a SQL view on a PySpark DataFrame (Spark 2.0.0) and getting errors like "Table or View Not found". What I am doing:
1. Create an empty dataframe.
2. Load data from different locations into a temp dataframe.
3. Append the temp dataframe to the main dataframe (the empty one).
4. Define a SQL view on the dataframe (which was empty earlier).
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType, LongType,
                               IntegerType, DateType, TimestampType)

spark = SparkSession.builder.config(conf=SparkConf()).appName("mydailyjob").getOrCreate()
sc = spark.sparkContext
schema = StructType([StructField('vdna_id', StringType(), True),
StructField('miq_id', LongType(), True),
StructField('tags', IntegerType(), True),
StructField('dateserial', DateType(), True),
StructField('date_time', TimestampType(), True),
StructField('survey_id', StringType(), True),
StructField('ip', StringType(), True)])
brandsurvey_feed = spark.createDataFrame(sc.emptyRDD(), schema)
# load brandsurvey feed data for each date in date_list
for loc in all_loc:
    # load the file from this location
    bs_tmp = spark.read.csv(loc, schema=schema, sep='\t', header=True)
    brandsurvey_feed = brandsurvey_feed.union(bs_tmp)

brandsurvey_feed.createOrReplaceTempView("brandsurvey_feed")
spark.sql("select * from brandsurvey_feed").show()
Folks, I think I found the reason. If you create a SQL view on a dataframe with zero records and then access that view, you get the error "table or view does not exist". I would suggest checking that the dataframe is not empty before you define any SQL view on it.
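For example, a minimal sketch of that check, using the brandsurvey_feed dataframe from the question:

# head(1) returns an empty list when the dataframe has no rows
if len(brandsurvey_feed.head(1)) == 0:
    print("brandsurvey_feed is empty, not creating the view")
else:
    brandsurvey_feed.createOrReplaceTempView("brandsurvey_feed")
    spark.sql("select * from brandsurvey_feed").show()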
I have a source table with 3 columns. One of the columns contains JSON values. Some of the rows contain simple JSON, but some of the rows contain nested JSON, as in the source table in the image. I want the target table to look like the attached image. Could someone help with PySpark or SQL code to do this in Databricks?
This JSON does not have a fixed schema; it can vary in different ways, but ultimately it is JSON.
source and target tables
I am expecting PySpark code for the above question.
Here is the sample code used to achieve this.
%py
from pyspark.sql.functions import from_json, explode_outer
from pyspark.sql.types import MapType, StringType

df1 = spark.sql("select eventId, AppId, eventdata from tableA")
df1 = df1.withColumn("EventData", from_json(df1.eventdata, MapType(StringType(), StringType())))
df1 = df1.select(df1.eventId, df1.AppId, explode_outer(df1.EventData))
display(df1)
This resulted in the output shown at https://i.stack.imgur.com/Oftvr.png
Below is a sample json:
{
"brote":"AKA",
"qFilter":"{\"xfilters\":[{\"Molic\":\"or\",\"filters\":[{\"logic\":\"and\",\"field\":\"Name\",\"operator\":\"contains\",\"value\":\"*R-81110\"},{\"logic\":\"and\",\"field\":\"Title\",\"operator\":\"contains\",\"value\":\"*R-81110\"}]}],\"pSize\":200,\"page\":1,\"ignoreConfig\":false,\"relatedItemFilters\":[],\"entityType\":\"WAFADocuments\"}",
"config":"[\"PR_NMO\"]",
"title":"All Documents",
"selected":"PR_NMO",
"selectedCreateConfig":"PR_NMO",
"selectedQueryConfigs":[
"PR_CVO"
],
"selectedRoles":[
"RL_ZAC_Planner"
]
}
The requirement is hard to achieve as the schema of the nested values is not fixed. To do it with the sample you have given, you can use the following code:
from pyspark.sql.functions import from_json, explode_outer, concat, lit
from pyspark.sql.types import MapType, StringType, ArrayType, StructType, StructField

df1 = df.withColumn("EventData", from_json(df.eventdata, MapType(StringType(), StringType())))
df1 = df1.select(df1.eventId, df1.AppId, explode_outer(df1.EventData))
#df1.show()
df2 = df1.filter(df1.key == 'orders')
user_schema = ArrayType(
    StructType([
        StructField("id", StringType(), True),
        StructField("type", StringType(), True)
    ])
)
df3 = df2.withColumn("value", from_json("value", user_schema)).selectExpr("eventId", "AppId", "key", "inline(value)")
df3 = df3.melt(['eventId', 'AppId', 'key'], ['id', 'type'], 'sub_order', 'val')
req = df3.withColumn('key', concat(df3.key, lit('.'), df3.sub_order))
final_df = df1.filter(df1.key != 'orders').union(req.select('eventId', 'AppId', 'key', 'val'))
final_df.show()
This might not be possible if the schema is constantly changing.
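If a representative sample row is available, one possible workaround (just a sketch, assuming the JSON lives in a column named eventdata) is to derive a schema from that sample with schema_of_json and apply it to the whole column with from_json:

from pyspark.sql.functions import schema_of_json, from_json, lit, col

# pick one representative JSON string from the column (assumption: 'eventdata' holds the JSON)
sample_json = df.select('eventdata').first()[0]

# derive a DDL schema string from that sample row
ddl_schema = spark.range(1).select(schema_of_json(lit(sample_json)).alias('s')).first()['s']

# parse every row with the inferred schema and flatten the resulting struct
parsed = df.withColumn('parsed', from_json(col('eventdata'), ddl_schema))
parsed.select('eventId', 'AppId', 'parsed.*').show(truncate=False)

Rows whose JSON does not match the sampled structure will simply come back as nulls, so this only helps when the variations are limited.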
I am very new to machine learning, and as part of learning I was going through data preprocessing.
This is what my dataset looks like:
I wanted to predict delete_ind based on the data in the first three columns.
These are the steps I followed.
schema = StructType([StructField('id', StringType(), True), \
StructField('company_code', StringType(), True), \
StructField("cost", FloatType(), True), \
StructField('delete_ind', StringType(), True)
])
data = spark.sql('select id, company_code, cost, delete_ind from dbname.table limit 5000')
pdata = data.toPandas()
Selecting columns from Pandas Dataframe
x = pdata.iloc[:, 0:-1].values
y = pdata.iloc[:, -1].values
Adding default values in place of missing data
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 2:3])
One-hot encoding the first two columns of the dataset
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0:2])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
While trying to encode the first two columns of x, I am seeing the error below.
File "<command-1856943148058177>", line 1
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0:2])], remainder='passthrough')
^
SyntaxError: invalid syntax
Could anyone let me know what mistake I am making here?
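For reference, [0:2] is slice syntax, which is only valid inside indexing brackets, not inside a plain list literal, and that is what triggers the SyntaxError. A minimal corrected sketch, assuming columns 0 and 1 are the categorical ones to encode:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# pass the column indices as a list instead of a slice
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1])],
                       remainder='passthrough')
x = np.array(ct.fit_transform(x))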
I am new to PySpark. I have a CSV file with hyphens in the column names. I could successfully read the file into a dataframe. However, while writing the dataframe to an ORC file I get an error like the one below:
java.lang.IllegalArgumentException: Missing required char ':' at
'struct
When I renamed the columns by removing the hyphens, I could write the dataframe to ORC. But I need the column names to keep the hyphens, because I want to append this ORC to an existing ORC that has hyphens in its column names.
Could someone please help me with this?
Any help would be greatly appreciated!!!
Use backticks ( ` ) to enclose the column name, like: `column-name`.
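For example, a small sketch assuming a dataframe df that has a column named col-name:

# backticks are needed when the hyphenated name appears in a SQL expression
df.createOrReplaceTempView("t")
spark.sql("select `col-name` from t").show()

# the same applies to selectExpr
df.selectExpr("`col-name`").show()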
Read the data into a dataframe and create a new empty dataframe with the desired structure:
from pyspark.sql.types import *
result= spark.read.orc(path)
schema = StructType([
StructField('col-name', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.union(result).show()
How can I remove the u prefix from this dataframe? I am not able to load the data into my Hive table, which is partitioned on ability_id.
I always get an IllegalArgumentException because of the 'u'.
>>> schema = StructType([StructField("ability_id", StringType(), True),
...                      StructField("bid", StringType(), True),
...                      StructField("bidtime", StringType(), True),
...                      StructField("bidder", StringType(), True),
...                      StructField("bidderrate", StringType(), True),
...                      StructField("openbid", StringType(), True),
...                      StructField("price", StringType(), True)])
>>> df = sqlContext.createDataFrame(auction_data, schema)
>>> df.registerTempTable("auction")
>>> first_line = sqlContext.sql("select * from auction where ability_id=8211480551").collect()
>>> for i in first_line:
...     print i
Row(ability_id=u'8211480551', bid=u'52.99', bidtime=u'1.201505', bidder=u'hanna1104', bidderrate=u'94', openbid=u'49.99', price=u'311.6')
Row(ability_id=u'8211480551', bid=u'50.99', bidtime=u'1.203843', bidder=u'wrufai1', bidderrate=u'90', openbid=u'49.99', price=u'311.6')
sqlContext.sql(""" INSERT INTO TABLE dev_core_t1.PINO_KLANT_3 partition (abillity_id) SELECT bid,bidtime,bidder,bidderrate,openbid,price from temp """)
This issue is resolved; it seems any Spark version less than or equal to 2.0.x will not work.
It only works with Spark version 2.1.x or higher.
I have a Spark dataframe that looks like this:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql.types import StringType, IntegerType, StructType, StructField,LongType
from pyspark.sql.functions import sum, mean
rdd = sc.parallelize([('retail','food'),
('retail','food'),
('retail','auto'),
('retail','shoes'),
('wholesale','healthsupply'),
('wholesale','foodsupply'),
('wholesale','foodsupply'),
('retail','toy'),
('retail','toy'),
('wholesale','foodsupply')])
schema = StructType([StructField('division', StringType(), True),
StructField('category', StringType(), True)
])
df = sqlContext.createDataFrame(rdd, schema)
I want to generate a table like the one below: for each division, get the division name, the division's total record count, and the top 1 and top 2 categories within that division along with their record counts:
division    division_total  cat_top_1   top1_cnt  cat_top_2     top2_cnt
retail      5               food        2         toy           2
wholesale   4               foodsupply  3         healthsupply  1
I can generate cat_top_1 and cat_top_2 using window functions in Spark, but I don't know how to pivot them into a single row per division and also add the division_total column. I could not get it right:
df_by_div = df.groupby('division','revenue').sort(asc("division"),desc("count"))
windowSpec = Window().partitionBy("division").orderBy(col("count").desc())
df_list = df_by_div.withColumn("rn", rowNumber()\
.over(windowSpec).cast('int'))\
.where(col("rn")<=2)\
.orderBy("division",desc("count"))
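One possible way to get that shape (a sketch against the df defined above, not a tested answer) is to count records per division and category, keep the top two per division with a window, and then collapse them into one row per division with conditional aggregation; division_total comes from a separate groupBy that is joined back on:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# record count per division/category
cat_counts = df.groupBy('division', 'category').count()

# rank categories inside each division by their count (ties are broken arbitrarily here)
w = Window.partitionBy('division').orderBy(F.col('count').desc())
ranked = cat_counts.withColumn('rn', F.row_number().over(w)).where(F.col('rn') <= 2)

# total records per division
totals = df.groupBy('division').agg(F.count('*').alias('division_total'))

# collapse the two top ranks into one row per division
result = (ranked.join(totals, 'division')
          .groupBy('division', 'division_total')
          .agg(F.max(F.when(F.col('rn') == 1, F.col('category'))).alias('cat_top_1'),
               F.max(F.when(F.col('rn') == 1, F.col('count'))).alias('top1_cnt'),
               F.max(F.when(F.col('rn') == 2, F.col('category'))).alias('cat_top_2'),
               F.max(F.when(F.col('rn') == 2, F.col('count'))).alias('top2_cnt')))
result.show()

Note that food and toy both have a count of 2 in the sample, so which one lands in cat_top_1 depends on the tie-breaking order unless another column is added to the window ordering.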