Want an explanation of the attributes usage in the Spark SQL example code from the official documentation - apache-spark-sql

In the Spark 2.0.1 SQL documentation,
I have three questions about the following code snippet:
1) What is attributes? Where can I find its documentation?
2) What does => mean?
3) I searched the RDD API but did not find a toDF function. Why can .toDF be called here?
// Create an RDD of Person objects from a text file, convert it to a Dataframe.
val peopleDF = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
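To make the snippet self-contained: attributes is just the name chosen for the lambda parameter (here the Array[String] produced by split), => is Scala's arrow separating a lambda's parameters from its body, and toDF is not defined on RDD itself; it is added by the implicit conversions brought in with import spark.implicits._ (pre-imported in spark-shell). A minimal sketch, assuming the Person case class from the guide:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PeopleExample").getOrCreate()

// The case class supplies the schema: its fields become the DataFrame columns.
case class Person(name: String, age: Int)

// Brings in the implicit conversions that add .toDF() to RDDs of case classes.
import spark.implicits._

val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))                                   // each line -> Array[String]
  .map((attributes: Array[String]) =>                  // "attributes" is only a parameter name
    Person(attributes(0), attributes(1).trim.toInt))   // "=>" separates parameters from the body
  .toDF()                                              // available because of spark.implicits._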

Related

PySpark map function - send n rows instead of one to build a list

I am using Spark 3.x in Python. I have some data (millions of rows) in CSV files that I have to index in Apache Solr.
I have installed the pysolr module for this purpose.
import pysolr

def index_module(row):
    ...
    solr_client = pysolr.Solr(SOLR_URI)
    solr_client.add(row)
    ...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")
df.toJSON().map(index_module).count()
The index_module function simply gets one row of the data frame as JSON and then indexes it in Solr via the pysolr module. pysolr supports indexing a list of documents instead of one. I have to update my logic so that instead of sending one document in each request, I send a list of documents. That will definitely improve the performance.
How can I achieve this in PySpark? Is there any alternative or better approach instead of map and toJSON?
Also, all of my work is done in transformation functions; I am using count just to start the job. Is there any alternative dummy function (of action type) in Spark to do the same?
Finally, I have to create the Solr object each time; is there any alternative for this?

Using Apache Beam JsonTimePartitioning to create time-partitioned tables in BigQuery

I have tried using the JsonTimePartitioning class in the Apache Beam Java SDK to write data to dynamic tables in BigQuery, but I get "cannot find symbol" for the class JsonTimePartitioning.
This is how I try to import the class:
import com.google.api.services.bigquery.model.JsonTimePartitioning;
And this is how I try to use it in my pipeline:
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withJsonTimePartitioningTo(new JsonTimePartitioning().setType("DAY")));
I can't seem to find JsonTimePartitioning anywhere.
Can you point to an example that you are trying to follow? The existing methods on BigQueryIO either accept an instance of TimePartitioning, or a value provider for a String that is actually a JSON-serialized instance of the same TimePartitioning. In fact, when calling the TimePartitioning version of the method, it still just gets serialized into a string internally. You can find an example of how it's used here:
Loading historical data into time-partitioned BigQuery tables
To load historical data into a time-partitioned BigQuery table, specify BigQueryIO.Write.withTimePartitioning(com.google.api.services.bigquery.model.TimePartitioning) with a field used for column-based partitioning. For example:
PCollection<Quote> quotes = ...;
quotes.apply(BigQueryIO.write()
.withSchema(schema)
.withFormatFunction(quote -> new TableRow()
.set("timestamp", quote.getTimestamp())
.set(..other columns..))
.to("my-project:my_dataset.my_table")
.withTimePartitioning(new TimePartitioning().setField("time")));

Converting map into tuple

I am loading an HBase table using Pig.
product = LOAD 'hbase://product' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:*', '-loadKey true') AS (id:bytearray, a:map[])
The relation product has tuples that contain a map. I want to convert the map data into tuples.
Here is a sample:
grunt>dump product;
06:177602927,[cloud_service#true,wvilnk#true,cmpgeo#true,cmplnk#true,webvid_standard#true,criteria_search#true,typeahead_search#true,aasgbr#true,lnkmin#false,aasdel#true,aasmcu#true,aasvia#true,lnkalt#false,aastlp#true,cmpeel#true,aasfsc#true,aasser#true,aasdhq#true,aasgbm#true,gboint#true,lnkupd#true,aasbig#true,webvid_basic#true,cmpelk#true]
06:177927527,[cloud_service#true,wvilnk#true,cmpgeo#true,cmplnk#true,webvid_standard#true,criteria_search#true,typeahead_search#true,aasgbr#false,lnkmin#false,aasdel#false,aasmcu#false,aasvia#false,lnkalt#false,aastlp#true,cmpeel#true,aasfsc#false,aasser#false,aasdhq#true,aasgbm#false,gboint#true,lnkupd#true,aasbig#false,webvid_basic#true,cmpelk#true,blake#true]
I want to convert each tuple into individual records like the ones below:
177602927,cloud_service,true
177602927,wvilnk,true
177602927,cmpgeo,true
177602927,cmplnk,true
I am pretty new to Pig and this is perhaps my first time doing something with Pig Latin. Any help is much appreciated.
I was able to find a fix for my problem.
I used a UDF called MapEntriesToBag, which converts all the maps into bags.
Here is my code.
>register /your/path/to/this/Jar/Pigitos-1.0-SNAPSHOT.jar
>DEFINE MapEntriesToBag pl.ceon.research.pigitos.pig.udf.MapEntriesToBag();
>product = LOAD 'hbase://product' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:*', '-loadKey true') AS (id:bytearray, a:map[])
>b = foreach product generate flatten(SUBSTRING($0,3,12)), flatten(MapEntriesToBag($1));
The UDF is available in the jar Pigitos-1.0-SNAPSHOT.jar. You can download this jar from here.
For more information you can refer to this link. It has more interesting UDFs related to the map data type.

Reading data from SQL Server using Spark SQL

Is it possible to read data from Microsoft SQL Server (and Oracle, MySQL, etc.) into an RDD in a Spark application? Or do we need to create an in-memory set and parallelize that into an RDD?
In Spark 1.4.0+ you can now use sqlContext.read.jdbc.
That will give you a DataFrame instead of an RDD of Row objects.
The equivalent of the JdbcRDD solution shown below would be
sqlContext.read.jdbc("jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;", "TABLE_NAME", "id", 1, 100000, 1000, new java.util.Properties)
It should pick up the schema of the table, but if you'd like to force it, you can use the schema method after read: sqlContext.read.schema(...insert schema here...).jdbc(...rest of the things...)
Note that you won't get an RDD of SomeClass here (which is nicer in my view). Instead you'll get a DataFrame of the relevant fields.
More information can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
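If you do need typed objects afterwards, one option (a sketch, assuming a SomeClass case class whose fields match the Col1/Col2/Col3 columns used in the JdbcRDD answer below) is to map the DataFrame back yourself:
// Partitioned JDBC read, as above, followed by a conversion of each Row
// into a domain object; getAs[T] looks the column up by name.
val df = sqlContext.read.jdbc(
  "jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;",
  "TABLE_NAME", "id", 1, 100000, 1000, new java.util.Properties)

val someClassRdd = df.rdd.map(r =>
  SomeClass(r.getAs[String]("Col1"), r.getAs[String]("Col2"), r.getAs[String]("Col3")))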
Found a solution to this from the mailing list: JdbcRDD can be used to accomplish this. I needed to get the MS SQL Server JDBC driver jar and add it to the lib for my project. I wanted to use integrated security, so I needed to put sqljdbc_auth.dll (available in the same download) in a location that java.library.path can see. Then the code looks like this:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD[SomeClass](sc,
  // a fresh connection is opened for each partition
  () => DriverManager.getConnection(
    "jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;"),
  // the two '?' placeholders are bound to each partition's key range
  "SELECT * FROM TABLE_NAME Where ? < X and X < ?",
  1, 100000, 1000,
  // map each ResultSet row to a domain object
  (r: ResultSet) => SomeClass(r.getString("Col1"), r.getString("Col2"), r.getString("Col3")))
This gives an RDD of SomeClass. The numeric parameters (1, 100000, 1000 above) are required; they are the lower bound, upper bound, and number of partitions. In other words, the source data needs to be partitionable by longs for this to work.

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I think it is not possible to query the month directly in Spark SQL, so I was thinking of writing a user defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best method of writing one?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
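For example, two chained filters (a sketch based on the where signature quoted above, reusing the Entry schema and adding a hypothetical predicate on name) would look like:
// Each where(...) returns another SchemaRDD, so one-argument UDFs compose by chaining.
val filtered = entries
  .where('when)(myDateFilter)
  .where('name)((n: String) => n.startsWith("f"))
  .select('name, 'when)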
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
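If you would rather stay in the DataFrame API than go through a SQL string, the same function can also be wrapped with org.apache.spark.sql.functions.udf (a sketch, assuming the same moviesDf and its releaseDate column):
import org.apache.spark.sql.functions.{col, udf}

// Wrap the plain Scala function as a column expression.
val convert2YearsUdf = udf(convert2Years _)

// Apply it column-wise instead of via spark.sql(...)
val yearsDf = moviesDf.select(convert2YearsUdf(col("releaseDate")).alias("year"))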
In PySpark 1.5 and above, we can easily achieve this with a built-in function.
The following is an example:
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66)]

Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])

from pyspark.sql.functions import *

Day_Material_reveneu_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")