I am new to Spark and still learning it. Can someone help with the question below?
The quote in Spark: The Definitive Guide regarding DataFrame definition is: "In general, Spark will fail only at job execution time rather than DataFrame definition time - even if, for example, we point to a file that does not exist. This is due to lazy evaluation."
So I take spark.read.format().load() to be the DataFrame definition. On top of this defined DataFrame we apply transformations and actions; load is a read API, not a transformation, if I am not wrong.
I tried pointing load at a file that does not exist, thinking this was still just DataFrame definition, but I got the error below. According to the book it should not fail, right? I am surely missing something. Can someone help with this?
df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferschema', 'true') \
    .load('/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv')
Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Path does not exist: /spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv;'
Why does the DataFrame definition go to Hadoop metadata when evaluation is supposed to be lazy?
Up to this point the DataFrame is only being defined; all that has been instantiated is a reader object:
scala> spark.read.format("csv").option("header",true).option("inferschema",true)
res2: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@7aead157
When you actually call load, e.g.
res2.load("/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv")
and the file doesn't exist, that is execution time: the reader has to check the data source and then load the data from the CSV.
To build the DataFrame it checks the Hadoop metadata, i.e. it asks HDFS whether the file exists or not.
If it doesn't, you get:
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://203-249-241:8020/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv
In general:
1) Definition time: the RDD/DataFrame lineage is created but not executed.
2) Execution time: when load is executed (or an action is called), the work actually happens.
To conclude: any transformation ("definition time", in your terms) will not be executed until an action is called ("execution time", in your terms), but load itself has to resolve the path.
Spark uses lazy evaluation. However, that doesn't mean it can't verify whether the file exists while loading it.
Lazy evaluation applies to the DataFrame object, and in order to create that DataFrame object Spark first needs to check that the file exists.
Check the following code from DataFrameReader.load:
@scala.annotation.varargs
def load(paths: String*): DataFrame = {
  if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
    throw new AnalysisException("Hive data source can only be used with tables, you can not " +
      "read files of Hive data source directly.")
  }

  DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
    val catalogManager = sparkSession.sessionState.catalogManager
    val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
      source = provider, conf = sparkSession.sessionState.conf)
    val pathsOption = if (paths.isEmpty) {
      None
    } else {
      val objectMapper = new ObjectMapper()
      Some("paths" -> objectMapper.writeValueAsString(paths.toArray))
    }
    // ... (rest of the method omitted)
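To see where the boundary actually falls in PySpark, here is a small sketch (the paths and column name are made up for illustration, not taken from the question): configuring the reader never touches storage, load resolves the path immediately, and transformations on a successfully loaded DataFrame are deferred until an action runs.

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName('lazy-demo').getOrCreate()

# Configuring the reader does not touch HDFS at all.
reader = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true')

# load() resolves the path during analysis, so a missing file fails here,
# long before any Spark job is submitted.
try:
    df = reader.load('/tmp/does_not_exist.csv')   # hypothetical path
except AnalysisException as e:
    print('Failed at load time: {}'.format(e))

# With an existing file, transformations are only recorded in the plan...
df = reader.load('/tmp/existing_file.csv')        # hypothetical path
filtered = df.where(df['some_column'] > 0)        # nothing executed yet

# ...and work is done only when an action is called.
print(filtered.count())                           # a job runs here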
Related
I have submitted a similar question relating to saveAsTextFile, but I'm not sure if that question will have the same answer, as I now have a new error message:
I have compiled the following pyspark.sql code:
#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('Person_Person')
myresults = spark.sql("""SELECT
PersonType
,COUNT(PersonType) AS `Person Count`
FROM Person_Person
GROUP BY PersonType""")
myresults.collect()
result = myresults.collect()
result
result.saveAsTextFile("test")
However, I am getting the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-9e137ed161cc> in <module>()
----> 1 result.saveAsTextFile("test")
AttributeError: 'list' object has no attribute 'saveAsTextFile'
As I mentioned, I'm trying to send the results of my query to a text file with saveAsTextFile, but I am getting the above error.
Can someone shed some light on how to resolve this issue?
collect() returns all the records of the DataFrame as a list of Row objects, and you are calling saveAsTextFile on that result, which is a plain Python list.
A list doesn't have a saveAsTextFile method, so it throws an error:
result = myresults.collect()
result.saveAsTextFile("test")
To save the contents of the DataFrame to a file, you have two options:
Option 1: Convert the DataFrame into an RDD and call saveAsTextFile on it.
myresults.rdd.saveAsTextFile(OUTPUT_PATH)
Option 2: Use the DataFrameWriter. In this case, the DataFrame must have only one column, and it must be of string type; each row becomes a new line in the output file.
myresults.write.format("text").save(OUTPUT_PATH)
As your DataFrame has more than one column, proceed with option 1 (or collapse the columns into a single string column first; see the sketch below).
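For completeness, a minimal sketch of making option 2 work, assuming the PersonType and Person Count columns produced by the query above (the separator and output path are just illustrative):

from pyspark.sql import functions as F

# Collapse all columns into a single string column so the text writer accepts it.
single_col = myresults.select(
    F.concat_ws(',', F.col('PersonType'), F.col('Person Count')).alias('value')
)
single_col.write.format('text').save('person_counts_txt')  # hypothetical output path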
Also, by default Spark creates 200 partitions for shuffles, so 200 files will be created in the output path. If you have little data, configure the parameter below according to your data size.
spark.conf.set("spark.sql.shuffle.partitions", 5)  # 5 files will be written to the output folder
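Putting it together, the fix is to drop the collect() call and write the DataFrame out directly; a sketch (the output directory name is just an example):

spark.conf.set("spark.sql.shuffle.partitions", 5)

# myresults is the aggregated DataFrame from spark.sql(...) above.
# Option 1: write its underlying RDD as text files instead of collect()-ing it.
myresults.rdd.saveAsTextFile("person_counts")  # hypothetical output directory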
Hi, I am using gunicorn with nginx and a PostgreSQL database to run my web app. I recently changed my gunicorn command from
gunicorn run:app -w 4 -b 0.0.0.0:8080 --workers=1 --timeout=300
to
gunicorn run:app -w 4 -b 0.0.0.0:8080 --workers=2 --timeout=300
i.e. using 2 workers. Now I am getting error messages like:
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 194, in session_signal_after_commit
models_committed.send(session.app, changes=list(d.values()))
File "/usr/local/lib/python2.7/dist-packages/blinker/base.py", line 267, in send
for receiver in self.receivers_for(sender)]
File "/usr/local/lib/python2.7/dist-packages/flask_whooshalchemy.py", line 265, in _after_flush
with index.writer() as writer:
File "/usr/local/lib/python2.7/dist-packages/whoosh/index.py", line 464, in writer
return SegmentWriter(self, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/whoosh/writing.py", line 502, in __init__
raise LockError
LockError
I can't really do much with these error messages, but they seem to be linked to the Whoosh search index I have on the User table in my database model:
import sys
if sys.version_info >= (3, 0):
    enable_search = False
else:
    enable_search = True
    import flask.ext.whooshalchemy as whooshalchemy

class User(db.Model):
    __searchable__ = ['username', 'email', 'position', 'institute', 'id']  # these fields will be indexed by whoosh
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(100), index=True)
    ...

    def __repr__(self):
        return '<User %r>' % (self.username)

if enable_search:
    whooshalchemy.whoosh_index(app, User)
Any ideas how to investigate this? I thought Postgres allows parallel access, so lock errors should not happen. They did not happen when I used only 1 worker, so it is definitely caused by having multiple workers...
Any help is appreciated.
Thanks,
Carl
This has nothing to do with PostgreSQL. Whoosh holds file locks for writing and it's failing on the last line of this code...
class SegmentWriter(IndexWriter):
    def __init__(self, ix, poolclass=None, timeout=0.0, delay=0.1, _lk=True,
                 limitmb=128, docbase=0, codec=None, compound=True, **kwargs):
        # Lock the index
        self.writelock = None
        if _lk:
            self.writelock = ix.lock("WRITELOCK")
            if not try_for(self.writelock.acquire, timeout=timeout,
                           delay=delay):
                raise LockError
Note that the delay default here is 0.1 seconds, and if the writer does not get the lock in that time it fails. You increased your workers, so now you have contention on the lock. From the following docs:
https://whoosh.readthedocs.org/en/latest/threads.html
Locking
Only one thread/process can write to an index at a time. When
you open a writer, it locks the index. If you try to open a writer on
the same index in another thread/process, it will raise
whoosh.store.LockError.
In a multi-threaded or multi-process environment your code needs to be
aware that opening a writer may raise this exception if a writer is
already open. Whoosh includes a couple of example implementations
(whoosh.writing.AsyncWriter and whoosh.writing.BufferedWriter) of ways
to work around the write lock.
While the writer is open and during the commit, the index is still
available for reading. Existing readers are unaffected and new readers
can open the current index normally.
You can find examples on how to use Whoosh concurrently.
Buffered
https://whoosh.readthedocs.org/en/latest/api/writing.html#whoosh.writing.BufferedWriter
Async
https://whoosh.readthedocs.org/en/latest/api/writing.html#whoosh.writing.AsyncWriter
I'd try the buffered version first since batching writes is almost always faster.
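As a rough illustration of what the AsyncWriter workaround looks like with plain Whoosh (flask-whooshalchemy creates its writers internally, so wiring this in there takes more work; the index path and field values below are made up):

import whoosh.index
from whoosh.writing import AsyncWriter

# Open an existing index (hypothetical path).
ix = whoosh.index.open_dir('search_index')

# If the write lock is already held, AsyncWriter retries in a background
# thread instead of raising LockError immediately.
writer = AsyncWriter(ix)
writer.update_document(id=u'42', username=u'carl')  # hypothetical field values
writer.commit()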
I am getting the error below after executing the code that follows. Am I missing something in the installation? I am using Spark installed locally on my Mac, so I am checking whether I need to install additional libraries for this code to work and load data from BigQuery.
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-9d6701949cac> in <module>()
13 "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
14 "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
---> 15 conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.gson.JsonObject
import json
import pyspark

sc = pyspark.SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")

conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare"}

tableData = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
    conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
    int(x["word_count"]))).reduceByKey(lambda x, y: x + y)

print tableData.take(10)
The error "java.lang.ClassNotFoundException: com.google.gson.JsonObject" seems to hint that a library is missing.
Please try adding the gson jar to your path: http://search.maven.org/#artifactdetails|com.google.code.gson|gson|2.6.1|jar
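One way to make the jar visible to PySpark (the local path below is just an example; you could equally pass it with spark-submit --jars) is to set spark.jars before the SparkContext is created:

import pyspark

# Point Spark at the downloaded gson jar (hypothetical local path); the
# BigQuery connector jar can be added to the same comma-separated list.
conf = pyspark.SparkConf().set("spark.jars", "/path/to/gson-2.6.1.jar")
sc = pyspark.SparkContext(conf=conf)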
Highlighting something buried in the connector link in Felipe's response: the BigQuery connector used to be included by default in Cloud Dataproc, but was dropped starting at v1.3. The link shows you three ways to get it back.
I want a traceback from every query executed during a request, so I can find where they're coming from and reduce the count/complexity.
I'm using this excellent snippet of middleware to list and time queries, but I don't know where in the code they're coming from.
I've poked around in django/db/models/sql/compiler.py, but apart from getting a local copy of Django and editing that code I can't see how to latch on to queries. Is there a signal I can use? It seems like there isn't a signal on every query.
Is it possible to specify the default Manager?
(I know about django-toolbar, I'm hoping there's a solution without using it.)
An ugly but effective solution (it prints the trace on all queries and only requires one edit) is to add the following to the bottom of settings.py:
import django.db.backends.utils as bakutils
import traceback

bakutils.CursorDebugWrapper_orig = bakutils.CursorWrapper

def print_stack_in_project():
    stack = traceback.extract_stack()
    for path, lineno, func, line in stack:
        if 'lib/python' in path or 'settings.py' in path:
            continue
        print 'File "%s", line %d, in %s' % (path, lineno, func)
        print '    %s' % line

class CursorDebugWrapperLoud(bakutils.CursorDebugWrapper_orig):
    def execute(self, sql, params=None):
        try:
            return super(CursorDebugWrapperLoud, self).execute(sql, params)
        finally:
            print_stack_in_project()
            print sql
            print '\n\n\n'

    def executemany(self, sql, param_list):
        try:
            return super(CursorDebugWrapperLoud, self).executemany(sql, param_list)
        finally:
            print_stack_in_project()
            print sql
            print '\n\n\n'

bakutils.CursorDebugWrapper = CursorDebugWrapperLoud
Still not sure if there is a more elegant way of doing this?
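If you are on Django 2.0 or later, a less invasive option (not part of the original answer, and newer than the code above) is the connection.execute_wrapper() hook, which wraps every query on a connection without monkey patching; a minimal sketch:

import traceback

from django.db import connection

def trace_queries(execute, sql, params, many, context):
    # Print a short stack trace and the SQL for every query on this connection.
    traceback.print_stack(limit=10)
    print(sql)
    return execute(sql, params, many, context)

# Wrap everything executed inside the block, e.g. around a view or a request.
with connection.execute_wrapper(trace_queries):
    list(User.objects.all())  # hypothetical query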
Django debug toolbar will tell you what you want with spectacular awesomeness.
I need some help getting started with this trippinin API. If you have worked with this API, it would be very nice of you to help me get started. I don't understand what I should write in for day in data[....]:
import requests
import json

r = requests.get("http://api.v1.trippinin.com/City/London/Eat?day=monday&time=morning&limit=10&offset=2&KEY=58ffb98334528b72937ce3390c0de2b7")
data = r.json()

for day in data['city Name']:
    print (day['city Name']['weekday'] + ":")
The error:
Traceback (most recent call last):
File "C:\Users\Nux\Desktop\Kurs3\test.py", line 7, in <module>
for day in data['city Name']:
KeyError: 'city Name'
The error KeyError: 'X' means you are trying to access the key X in a dictionary, but it doesn't exist. In your case you're trying to access data['city Name']. Apparently, the information in data does not have the key city Name. That means either a) you aren't getting any data back, or b) the data isn't in the format you expected. In both cases you can validate (or invalidate) your assumptions by printing out the value of data.
To help debug this issue, add the following immediately after you assign a value to data:
print(data)
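For example, a sketch of how you might inspect the response before assuming its shape (the 'data' key used after the inspection is only a guess, since the actual structure of this API's response isn't shown here):

import json
import requests

r = requests.get("http://api.v1.trippinin.com/City/London/Eat",
                 params={"day": "monday", "time": "morning", "limit": 10,
                         "offset": 2, "KEY": "58ffb98334528b72937ce3390c0de2b7"})
data = r.json()

# Pretty-print the whole payload to see the real top-level keys.
print(json.dumps(data, indent=2))

# Then access keys defensively instead of guessing them.
for item in data.get('data', []):  # replace 'data' with the key you actually see
    print(item)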