I have a dataframe with the following schema:
root
|-- _id: long (nullable = true)
|-- student_info: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- lastname: string (nullable = true)
| |-- major: string (nullable = true)
| |-- honour_roll: boolean (nullable = true)
|-- school_name: string (nullable = true)
How can I get a list of columns under "student_info" only? I.e. ["firstname","lastname","major","honour_roll"]
All of the following return the struct's field names as a list; the .columns approach looks cleanest.
df.select("student_info.*").columns
df.schema["student_info"].dataType.names
df.schema["student_info"].dataType.fieldNames()
df.select("student_info.*").schema.names
df.select("student_info.*").schema.fieldNames()
How can I delete all letters from a String?
I've got a given String:
val stringData ="ABC123.456"
Output value:
val stringData ="123.456"
We can try regex replacement here:
val regex = """[A-Za-z]+""".toRegex()
val stringData = "ABC123.456"
val output = regex.replace(stringData, "")
println(output) // 123.456
Consider the code:
.withColumn("my_column",
aggregate(
col("input_column"),
map(),
(acc, c) => map_concat(acc, map(col("name"), col("other"))))))
This creates my_column with type map<string, struct<...>>. Is there a way to make it struct<string, struct<...>>?
P.S. similar question - How to convert array of struct into struct of struct in Spark?
The following converts a map column to a struct (map keys become struct field names).
val json_col = to_json($"col_map")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).alias("col_struct"))
In your case, it could look like this:
case class Strct(struct_name: String, struct_key: String)
val df = Seq(
Map("x" -> Strct("x", "val1"), "y" -> Strct("y", "val2"))
).toDF("map_of_structs")
df.printSchema()
// root
// |-- map_of_structs: map (nullable = true)
// | |-- key: string
// | |-- value: struct (valueContainsNull = true)
// | | |-- struct_name: string (nullable = true)
// | | |-- struct_key: string (nullable = true)
val json_col = to_json($"map_of_structs")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).alias("struct_of_structs"))
df2.printSchema()
// root
// |-- struct_of_structs: struct (nullable = true)
// | |-- x: struct (nullable = true)
// | | |-- struct_key: string (nullable = true)
// | | |-- struct_name: string (nullable = true)
// | |-- y: struct (nullable = true)
// | | |-- struct_key: string (nullable = true)
// | | |-- struct_name: string (nullable = true)
I have a dataframe with the following schema:
root
|-- docnumber: string (nullable = true)
|-- event: struct (nullable = false)
| |-- data: struct (nullable = true)
| | |-- codevent: int (nullable = true)
I need to add a column inside event.data so that the schema would be like:
root
|-- docnumber: string (nullable = true)
|-- event: struct (nullable = false)
| |-- data: struct (nullable = true)
| | |-- codevent: int (nullable = true)
| | |-- needtoaddit: int (nullable = true)
I tried
dataframe.withColumn("event.data.needtoaddit", lit("added"))
but it adds a top-level column literally named event.data.needtoaddit.
dataframe.withColumn(
  "event",
  struct(
    $"event.*",
    struct(
      lit("added").as("needtoaddit")
    ).as("data")
  )
)
but it creates an ambiguous column named event.data and again I have a problem.
How can I make it work?
You're kind of close. Try this code:
val df2 = df.withColumn(
  "event",
  struct(
    struct(
      $"event.data.*",
      lit("added").as("needtoaddit")
    ).as("data")
  )
)
Spark 3.1+
To add fields inside nested struct columns, use withField:
col("event").withField("data.needtoaddit", lit("added"))
Input:
val df = spark.createDataFrame(Seq(("1", 2)))
  .select(
    col("_1").as("docnumber"),
    struct(struct(col("_2").as("codevent")).as("data")).as("event")
  )
df.printSchema()
// root
// |-- docnumber: string (nullable = true)
// |-- event: struct (nullable = false)
// | |-- data: struct (nullable = false)
// | | |-- codevent: long (nullable = true)
Script:
val df2 = df.withColumn(
  "event",
  col("event").withField("data.needtoaddit", lit("added"))
)
df2.printSchema()
// root
// |-- docnumber: string (nullable = true)
// |-- event: struct (nullable = false)
// | |-- data: struct (nullable = false)
// | | |-- codevent: long (nullable = true)
// | | |-- needtoaddit: string (nullable = false)
I am trying to fetch application_number records from a Hive table and collect them as a list. Then I iterate over the list and, for each application_number, call a curl command.
Here is my sample code:
object th extends Serializable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("th").setMaster("local")
    conf.set("spark.debug.maxToStringFields", "10000000")
    val context = new SparkContext(conf)
    val sqlCotext = new SQLContext(context)
    val hiveContext = new HiveContext(context)
    import hiveContext.implicits._

    val list = hiveContext.sql("select application_number from tableA").collect().take(100)
    val l1 = context.parallelize(list)

    val stu1 = StructType(
      StructField("application_number", LongType, true) ::
      StructField("event_code", StringType, true) ::
      StructField("event_description", StringType, true) ::
      StructField("event_recorded_date", StringType, true) :: Nil)

    var initialDF1 = sqlCotext.createDataFrame(context.emptyRDD[Row], stu1)

    l1.repartition(10).foreachPartition(f => { f.foreach(f => {
      val schema = StructType(List(
        StructField("queryResults", StructType(
          List(StructField("searchResponse", StructType(
            List(StructField("response", StructType(
              List(StructField("docs", ArrayType(StructType(
                List(
                  StructField("transactions", ArrayType(StructType(
                    List(
                      StructField("code", StringType, nullable = true),
                      StructField("description", StringType, nullable = true),
                      StructField("recordDate", StringType, nullable = true)
                    )
                  )))
                )
              ))))
            )))
          )))
        ))
      ))

      val z = f.toString().replace("[", "").replace("]", "").replace(" ", "").replace("(", "").replace(")", "")
      if (z != null) {
        val cmd = Seq("curl", "-X", "POST", "--insecure", "--header", "Content-Type: application/json", "--header", "Accept: application/json", "-d", "{\"searchText\":\"" + z + "\",\"qf\":\"applId\"}", "https://ped.uspto.gov/api/queries") //cmd.!
        val r = cmd.!!
        val r1 = r.toString()
        val rdd = context.parallelize(Seq(r1))
        val dff = sqlCotext.read.schema(schema).json(rdd.toDS)
        val dfContent = dff.select(explode(dff("queryResults.searchResponse.response.docs.transactions"))).toDF("transaction")
        val a1 = dfContent.select("transaction.code").collect()
        val a2 = dfContent.select("transaction.description").collect()
        val a3 = dfContent.select("transaction.recordDate").collect()
        for (mmm1 <- a1; mm2 <- a2; mm3 <- a3) {
          val ress1 = mmm1.toString().replace("[", " ").replace("]", " ").replace("WrappedArray(", "").replace(")", "")
          val res2 = mm2.toString().replace("[", " ").replace("]", " ").replace("WrappedArray(", "").replace(")", "")
          val res3 = mm3.toString().replace("[", " ").replace("]", " ").replace("WrappedArray(", "").replace(")", "")
          initialDF1 = initialDF1.union(Seq((z, ress1, res2, res3)).toDF("application_number", "event_code", "event_description", "event_recorded_date"))
        }
      }
    })})

    initialDF1.registerTempTable("curlTH")
    hiveContext.sql("insert into table default.ipg_tableB select application_number,event_code,event_description,event_recorded_date from curlTH")
  }
}
I am getting a Task not serializable exception. Here is my error trace:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at newipg170103.th$.main(th.scala:58)
at newipg170103.th.main(th.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@1e592ef2)
- field (class: newipg170103.th$$anonfun$main$1, name: context$1, type: class org.apache.spark.SparkContext)
- object (class newipg170103.th$$anonfun$main$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 20 more
In Apache Spark it is not permitted to use SQLContext, SparkContext or SparkSession inside an action or transformation (map, foreach, mapPartitions, foreachPartition, and so on).
Therefore
l1.repartition(10).foreachPartition(f=>{f.foreach(f=>
...
val rdd = context.parallelize(Seq(r1))
val dff = sqlCotext.read.schema(schema).json(rdd.toDS)
)})
is not valid Spark code: the closure passed to foreachPartition captures context and sqlCotext, which is why the stack trace ends with NotSerializableException: org.apache.spark.SparkContext.
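One way to restructure it, sketched below reusing the asker's variable names (untested against the real endpoint): do the HTTP call inside the transformation using plain Scala only, collect the small result set back to the driver, and only there touch sqlCotext again.
import scala.sys.process._

// Driver side: application numbers as plain strings.
val numbers = hiveContext.sql("select application_number from tableA")
  .collect().take(100).map(_.get(0).toString)

// Executor side: plain Scala/Java only, no SparkContext or SQLContext in the closure.
val responses = context.parallelize(numbers, 10).map { appNo =>
  val cmd = Seq("curl", "-X", "POST", "--insecure",
    "--header", "Content-Type: application/json",
    "--header", "Accept: application/json",
    "-d", "{\"searchText\":\"" + appNo + "\",\"qf\":\"applId\"}",
    "https://ped.uspto.gov/api/queries")
  (appNo, cmd.!!) // (application_number, raw JSON response)
}.collect() // small result set, safe to bring back to the driver

// Back on the driver it is safe to use sqlCotext again, e.g.
// sqlCotext.read.schema(schema).json(context.parallelize(responses.map(_._2)))
// and then reshape the parsed JSON into the final table.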
I have created a Hive table like this:
CREATE EXTERNAL TABLE table_df (v1 String, v2 String, v3 String, v4 String, v5 String, v6 String, v7 String, v8 String, v9 String, v10 String, v11 String, v12 String, v13 String, v14 String, v15 String, v16 String, v17 String, v18 String, v19 String, v20 String, v21 String, v22 String, v23 String, v24 String, v25 String, v26 String, v27 String, v28 String, v29 String, v30 String, v31 String, v32 Double, v33 Int, v34 Int, v35 Int)
STORED AS PARQUET LOCATION '/data/test/table_df.parquet';
The Parquet file has the following schema:
root
|-- v1: string (nullable = true)
|-- v2: string (nullable = true)
|-- v3: string (nullable = true)
|-- v4: string (nullable = true)
|-- v5: string (nullable = true)
|-- v6: string (nullable = true)
|-- v7: string (nullable = true)
|-- v8: string (nullable = true)
|-- v9: string (nullable = true)
|-- v10: string (nullable = true)
|-- v11: string (nullable = true)
|-- v12: string (nullable = true)
|-- v13: string (nullable = true)
|-- v14: string (nullable = true)
|-- v15: string (nullable = true)
|-- v16: string (nullable = true)
|-- v17: string (nullable = true)
|-- v18: string (nullable = true)
|-- v19: string (nullable = true)
|-- v20: string (nullable = true)
|-- v21: string (nullable = true)
|-- v22: string (nullable = true)
|-- v23: string (nullable = true)
|-- v24: string (nullable = true)
|-- v25: string (nullable = true)
|-- v26: string (nullable = true)
|-- v27: string (nullable = true)
|-- v28: string (nullable = true)
|-- v29: string (nullable = true)
|-- v30: string (nullable = true)
|-- v31: string (nullable = true)
|-- v32: double (nullable = true)
|-- v33: integer (nullable = true)
|-- v34: integer (nullable = true)
|-- v35: integer (nullable = true)
The problem occurs when I run this query:
select * from table_df
and I get the following error message:
Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='b`a!2RA\xb7\x85\xb5u\xb5\x06\xe4,\x16', guid='\xcf\xbde\xc0\xc7%C\xe1\x9c\xf2\x10\x8d\xc1\xb2=\xec')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable', sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable:14:13', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:415', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:233', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:780', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:478', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:692', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1557', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1542', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745', '*java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable:16:2', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:164', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:1762', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:410', '*org.apache.hadoop.hive.ql.metadata.HiveException:java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable:23:7', 'org.apache.hadoop.hive.ql.exec.ListSinkOperator:process:ListSinkOperator.java:93', 'org.apache.hadoop.hive.ql.exec.Operator:forward:Operator.java:838', 'org.apache.hadoop.hive.ql.exec.SelectOperator:process:SelectOperator.java:88', 'org.apache.hadoop.hive.ql.exec.Operator:forward:Operator.java:838', 'org.apache.hadoop.hive.ql.exec.TableScanOperator:process:TableScanOperator.java:133', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:437', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:429', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:146', '*java.lang.UnsupportedOperationException:Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable:28:5', 'org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector:getPrimitiveJavaObject:ParquetStringInspector.java:77', 
'org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector:getPrimitiveJavaObject:ParquetStringInspector.java:28', 'org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils:copyToStandardObject:ObjectInspectorUtils.java:305', 'org.apache.hadoop.hive.serde2.SerDeUtils:toThriftPayload:SerDeUtils.java:168', 'org.apache.hadoop.hive.ql.exec.FetchFormatter$ThriftFormatter:convert:FetchFormatter.java:61', 'org.apache.hadoop.hive.ql.exec.ListSinkOperator:process:ListSinkOperator.java:90'], statusCode=3), results=None, hasMoreRows=None)
I have no problem with this request:
select v1 from table_df
Do you have any idea?
This happens because the files that this table points to do not have the same schema as the table definition. Check the table properties using 'show create table table_name'
and make sure the HDFS files that the table points to satisfy those properties, including the number of columns and the column types.
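For example, one quick way to compare the two (a sketch, assuming a spark-shell with Hive support and access to the same path) is to read the Parquet files directly and check the inferred schema against the table definition:
// Schema that Spark infers from the Parquet files themselves.
spark.read.parquet("/data/test/table_df.parquet").printSchema()

// Schema that the Hive metastore believes the table has.
spark.sql("SHOW CREATE TABLE table_df").show(false)

// A column stored as double in the files but declared as String in the table
// is the kind of mismatch that triggers
// "Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable".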