Schema option in spark_read_parquet() - apache-spark-sql

I am pretty new to R and Spark. I want to read a parquet file with the following code. Does anyone know how to specify the schema there?
library(sparklyr)
sc <- spark_connect(master = "yarn",
                    app_name = "test")
df <- spark_read_parquet(sc,
                         "name",
                         "path/to/the/file",
                         repartition = 0,
                         schema = "?")
I looked at https://spark.rstudio.com/reference/spark_read_parquet/, but there is no detail or example there about how to set the schema in this function to optimize the read.

If you are only trying to read a parquet file, you don't need to supply a schema; it is just an optional argument. The following code should work.
df <- spark_read_parquet(sc,
                         "name",
                         "path/to/the/file",
                         repartition = 0,
                         schema = NULL)
If you do want to use a schema, there are many options, and choosing the right one depends on your data and what you are using it for. Still, try running your code without the schema option first to see whether that works for your data.
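For comparison only, since the sparklyr reference page is thin here: this is roughly what an explicit read schema looks like on the PySpark side, and sparklyr's schema argument appears to serve the same purpose of skipping schema inference. The column names and types below are invented for illustration.
# Rough PySpark sketch of supplying an explicit schema when reading parquet.
# Column names/types are invented for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("test").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

df = spark.read.schema(schema).parquet("path/to/the/file")
Note that parquet files already store their schema in the file footer, so an explicit schema mostly saves the footer scan/merge rather than changing the data you get back.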

Try
tbl_change_db(sc, "dbName")
and, if you are using RStudio, click the refresh button at the upper right of the pane.

Related

Multiple persists in the Spark Execution plan

I currently have some Spark code (PySpark) which loads data from S3 and applies several transformations to it. The current code is structured so that there are a few persists along the way, in the following format:
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df = df.transformation5
df = df.transformation6
df = df.transformation7
df.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
df = df.transformationN-1
df = df.transformationN
df.persist(MEMORY_AND_DISK)
When I do df.explain() at the very end of all transformations, as expected, there are multiple persists in the execution plan. Now when I run the following at the end of all these transformations:
print(df.count())
All transformations get triggered, including the persists. Since Spark will flow through the execution plan, it will execute all of these persists. Is there any way I can tell Spark to unpersist the (N-1)th persist when performing the Nth one, or is Spark smart enough to do this itself? My issue stems from the fact that later on in the program I run out of disk space, i.e., Spark errors out with the following error:
No space left on device
An easy solution is of course to increase the number of underlying instances. But my hypothesis is that the high number of persists eventually causes the disk to run out of space.
My question is: do these persists cause this issue? If they do, what is the best way/practice to structure the code so that the (N-1)th persist is unpersisted automatically?
I'm more experienced with Scala Spark, but it's definitely possible to unpersist a DataFrame.
In fact, the PySpark method on a DataFrame is also called unpersist. So in your example, you could do something like this (it's quite crude):
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df1 = df.transformation5
df1 = df1.transformation6
df1 = df1.transformation7
df.unpersist()
df1.persist(MEMORY_AND_DISK)
.
.
.
dfM = dfM-1.transformationN-2
dfM = dfM.transformationN-1
dfM = dfM.transformationN
dfM-1.unpersist()
dfM.persist(MEMORY_AND_DISK)
Now, the way this code looks raises some questions for me. It might be that you've mostly written it as pseudocode to be able to ask this question, but the following points might still help you further:
I only see transformations in there, and no actions. If this is the case, do you even need to persist?
Also, you only seem to have one source of data (the spark.read.csv bit); this also hints that you may not necessarily need to persist.
This is more of a point about style (and maybe an opinionated one, so don't worry if you don't agree). As I said at the beginning, I have no experience with PySpark, but the way I would write something similar in Scala Spark would be something like this:
df = spark.read.csv(s3path)
.transformation1
.transformation2
.transformation3
.transformation4
.persist(MEMORY_AND_DISK)
df = df.transformation5
.transformation6
.transformation7
.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
.transformationN-1
.transformationN
.persist(MEMORY_AND_DISK)
This is less verbose and, IMO, a little more "true" to what really happens: just a chain of transformations on the original DataFrame.
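To come back to the original ask of unpersisting the (N-1)th persist automatically: Spark generally won't release earlier cached stages for you while the DataFrames are still referenced, but a small helper that persists the new stage and releases the previous one gets most of the way there. This is only a sketch in PySpark, under the assumption that each persisted stage is needed solely to build the next one; the helper name is made up.
from pyspark import StorageLevel

# Sketch: persist the new stage, materialize it, then release the previous one.
# Assumes the previous persisted DataFrame is no longer needed once the next
# stage has been cached.
def persist_and_release(new_df, prev_df=None):
    new_df.persist(StorageLevel.MEMORY_AND_DISK)
    new_df.count()           # force materialization so the cache is populated
    if prev_df is not None:
        prev_df.unpersist()  # free the memory/disk held by the previous stage
    return new_df

# usage, with transformation1 ... as placeholders from the question:
# df1 = persist_and_release(df.transformation1().transformation2())
# df2 = persist_and_release(df1.transformation3().transformation4(), prev_df=df1)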

How to use arcpullr::get_spatial_layer() and arcpullr::get_layer_by_poly()

I couldn't figure this out through the package documentation https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My code returns the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url, geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
Any help explaining these errors or suggesting solutions would be very much appreciated. Thanks!
I didn't use the arcpullr package. Using leaflet.esri::addEsriFeatureLayer with a where clause works.
See the relevant code below as an example:
leaflet.esri::addEsriFeatureLayer(
  url = "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
  options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
)
You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I altered your example a bit to use a point instead of a polygon for the spatial query (since it's easier to create), but get_layer_by_poly works the same way with an sf polygon instead of a point. Also, the service you used requires a token, so I changed the URL to the USGS HU 6-digit basins layer instead.
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)
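For what it's worth, the get_layer_by_* helpers are wrappers around the ArcGIS REST query endpoint, so when one of them errors it can help to hit that endpoint directly and look at the raw response. A rough Python sketch of the equivalent point query against the USGS WBD service used above (parameter names follow the ArcGIS REST API; this is an illustration, not arcpullr's exact request):
# Query layer 3 of the USGS WBD MapServer for features intersecting a point.
# This approximates what a spatial-query wrapper sends; switch f to "json"
# if the server does not support GeoJSON output.
import requests

url = "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3/query"
params = {
    "geometry": "-90,45",                     # x,y (lon,lat) of the query point
    "geometryType": "esriGeometryPoint",
    "inSR": 4326,                             # WGS84 lon/lat
    "spatialRel": "esriSpatialRelIntersects",
    "outFields": "*",
    "returnGeometry": "true",
    "f": "geojson",
}
resp = requests.get(url, params=params)
features = resp.json().get("features", [])
print(len(features))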

Python: simplemysql wrapper. Weird (no) Data transfer

I should say that I'm just learning programming, and I've been trying to solve this (probably) simple problem for hours. I want to check whether a username already exists and, if so, print a message; otherwise, write the data for the new user into the DB. The script (check = db.getOne...) always returns "None", but read = db.getOne seems to read the data. What is the difference between them?
The query read = db.getOne("upers", ["logn"]) receives data, but there is no data saved in the DB. How can that be, and how can I avoid the "None" problem and actually find the user? Sorry if my question seems crazy and like a spaghetti question, but I'm new to Python (and Stack Overflow). Thanks in advance for any hints and help!
The script:
import MySQLdb
from simplemysql import SimpleMysql
# Documentation: http://nadh.in/code/simplemysql/

db = SimpleMysql(
    host = 'localhost',
    user = '1234',
    db = 'name',
    passwd = '1234',
    keep_alive = True
)

def createu(u, p, m):
    username = u
    password = p
    mail = m
    check = db.getOne("upers",
                      ["logn"],
                      ("logn = %s", [username])
                      )
    print check
    if check == None:
        db.insert("upers",
                  {"logn": username, "pw": password, "mail": mail}
                  )
        print "seems to work!"
    else:
        print "something is wrong!"
        check = 0

u = 'test'
p = 'lol'
m = '123#lol.de'
createu(u, p, m)

read = db.getOne("upers", ["logn"])
print read
The Output:
None
seems to work!
Row(logn=u'test')
Also, I can't imagine how the u finds its way into the output of the SQL query.
I have tried many different approaches to get a simple database up and running, and the simplemysql wrapper seems to fit really well for a beginner like me, so I hope someone can point me in the right direction.
Edit: it seems the data isn't even saved in the DB. When I log in to phpMyAdmin, it can't find any data in the DB. But how can it be that I am able to read the data with Python? Magic?! oO
Another important question for me: how can I extract the information from the printed read result, Row(logn=u'test')?
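Two hedged pointers that may help. First, the Row(...) object that getOne returns is a namedtuple, so you can read the field by attribute or by index; the u in u'test' is just Python 2's unicode string prefix, not something coming from MySQL. Second, a MySQL connection can see its own uncommitted rows, which would explain why your script reads the data while phpMyAdmin sees nothing; assuming your simplemysql version exposes commit() (the versions I have seen do), an explicit commit after the insert should make the row visible outside the script. Continuing from the script above, with db, username, password and mail as defined there:
check = db.getOne("upers", ["logn"], ("logn = %s", [username]))
if check is None:
    db.insert("upers", {"logn": username, "pw": password, "mail": mail})
    db.commit()              # assumption: without this the row may never reach the DB

read = db.getOne("upers", ["logn"])
if read is not None:
    print(read.logn)         # Row is a namedtuple, so fields are attributes
    print(read[0])           # ...or use positional indexing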

Reading data from SQL Server using Spark SQL

Is it possible to read data from Microsoft SQL Server (and Oracle, MySQL, etc.) into an RDD in a Spark application? Or do we need to create an in-memory set and parallelize that into an RDD?
In Spark 1.4.0+ you can now use sqlContext.read.jdbc
That will give you a DataFrame instead of an RDD of Row objects.
The equivalent to the solution you posted above would be
sqlContext.read.jdbc("jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;", "TABLE_NAME", "id", 1, 100000, 1000, new java.util.Properties)
It should pick up the schema of the table, but if you'd like to force it, you can use the schema method after read: sqlContext.read.schema(...insert schema here...).jdbc(...rest of the things...)
Note that you won't get an RDD of SomeClass here (which is nicer in my view). Instead you'll get a DataFrame of the relevant fields.
More information can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
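If you're working in PySpark rather than Scala, the same partitioned JDBC read looks roughly like this (connection string and table name reused from the answer above; the SQL Server JDBC driver jar still has to be on the classpath, and on Spark 1.x you'd go through sqlContext.read.jdbc instead of a SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# Partitioned read: rows are split into numPartitions ranges of `column`
# between lowerBound and upperBound, so that column must be numeric.
df = spark.read.jdbc(
    url="jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;",
    table="TABLE_NAME",
    column="id",
    lowerBound=1,
    upperBound=100000,
    numPartitions=1000,
    properties={"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"},
)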
Found a solution to this on the mailing list. JdbcRDD can be used to accomplish this. I needed to get the MS SQL Server JDBC driver jar and add it to the lib for my project. I wanted to use integrated security, so I needed to put sqljdbc_auth.dll (available in the same download) in a location that java.library.path can see. Then the code looks like this:
val rdd = new JdbcRDD[Email](sc,
() => {DriverManager.getConnection(
"jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;")},
"SELECT * FROM TABLE_NAME Where ? < X and X < ?",
1, 100000, 1000,
(r:ResultSet) => { SomeClass(r.getString("Col1"),
r.getString("Col2"), r.getString("Col3")) } )
This gives an RDD of SomeClass. The second, third, and fourth parameters are required and specify the lower bound, upper bound, and number of partitions. In other words, the source data needs to be partitionable by longs for this to work.

Django: copy data from one database to another

I have two SQLite .db files. I'd like to copy the contents of a column in a table from one db file to the other.
for example:
I have the model Information in db file called new.db:
class Information(models.Model):
    info_id = models.AutoField(primary_key = True)
    info_name = models.CharField(max_length = 50)
and the following information model in db file called old.db:
class Information(models.Model):
    info_id = models.AutoField(primary_key = True)
    info_type = models.CharField(max_length = 50)
    info_name = models.CharField(max_length = 50)
I'd like to copy all the data in column info_id and info_name from old.db to info_id and info_name in new.db.
I was thinking something like:
manage.py dbshell
then
INSERT INTO "new.Information" ("info_id", "info_name")
SELECT "info_id", "info_name"
FROM "old.Information";
This doesn't seem to be working. It says new.Information table does not exist... any ideas?
You'd need to switch the database URL in your settings file to db2 and run syncdb to create the new tables. After that, the easiest thing to do, IMO, would be to switch back to db1 and run ./manage.py dumpdata myapp > data.json, followed by another switch to db2, where you can run ./manage.py loaddata data.json.
Afterwards, you can drop the data you don't need from db2.
Edit: Another approach would be to use the ATTACH command from SQLite. First, I recommend you do the first step above (change the database settings and use syncdb to create the tables); then you can switch back and do this:
./manage.py dbshell
> ATTACH DATABASE 'new.db' AS newdb;
> INSERT INTO newdb.Information SELECT * FROM Information;
The file dumped from old.db contains the info_type field, which is not in the new Information model. This will make loaddata fail, since it checks every field loaded from the JSON file. You could comment out the info_type line in the old model before dumping.
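If you'd rather not edit the old model, another option is to strip the extra field from the dumped fixture before loading it. A rough sketch, assuming the app label is myapp and the fixture was dumped to data.json (both names are assumptions, so adjust them to your project):
# Drop the info_type field from every Information record in the dumped
# fixture so loaddata against the new model doesn't reject it.
# "myapp" and the file names are assumptions, adjust for your project.
import json

with open("data.json") as f:
    records = json.load(f)

for record in records:
    if record.get("model") == "myapp.information":
        record["fields"].pop("info_type", None)

with open("data_clean.json", "w") as f:
    json.dump(records, f, indent=2)
Then load the cleaned fixture with ./manage.py loaddata data_clean.json.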
The ATTACH approach mentioned by Alex is easier and works great, but it needs a tiny tweak:
INSERT INTO newdb.Information SELECT * FROM Information;
Note the missing parentheses around the SELECT; SQLite does not accept them. Ref: http://sqlite.org/lang_insert.html
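If you prefer to script the ATTACH route rather than typing it into dbshell, the same thing can be done from Python's sqlite3 module, selecting only the columns the two schemas share so the extra info_type column is not a problem. The table name myapp_information is an assumption based on Django's default <app_label>_<model_name> naming; check yours with .tables in the sqlite shell.
# Sketch of the ATTACH approach from Python, copying only the shared columns.
# "myapp_information" is an assumed table name, adjust to match your DB.
import sqlite3

conn = sqlite3.connect("old.db")
conn.execute("ATTACH DATABASE 'new.db' AS newdb")
conn.execute(
    "INSERT INTO newdb.myapp_information (info_id, info_name) "
    "SELECT info_id, info_name FROM myapp_information"
)
conn.commit()
conn.close()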
If you are performing a migration, have you tried South?