I need som help to get started with this trippinin api, if you have worked with this api it would be very nice of you to just help me here to get started! I dont understand what I should write in for dayin data[....]:
import requests
import json
r = requests.get("http://api.v1.trippinin.com/City/London/Eat?day=monday&time=morning&limit=10& offset=2&KEY=58ffb98334528b72937ce3390c0de2b7")
data = r.json()
for day in data['city Name']:
print (day['city Name']['weekday'] + ":")
The error:
Traceback (most recent call last):
File "C:\Users\Nux\Desktop\Kurs3\test.py", line 7, in <module>
for day in data['city Name']:
KeyError: 'city Name'
The error KeyError: 'X' means you are trying to access the key X in a dictionary, but it doesn't exist. In your case you're trying to access data['city Name']. Apparently, the information in data does not have the key city Name. That means either a) you aren't getting any data back, or b) the data isn't in the format you expected. In both cases you can validate (or invalidate) your assumptions by printing out the value of data.
To help debug this issue, add the following immediately after you assign a value to data:
print(data)
Related
I'm trying to learn this command, but i can't do it right, can you suggest something?
df['Global_Sales'].max(index(), columns(), skipna=False, level=15, numeric_only=int)
TypeError Traceback (most recent call last)
<ipython-input-59-25017f1d27cb> in <module>
----> 1 df['Global_Sales'].max(index(), columns(), skipna=False, level=15, numeric_only=int)
TypeError: 'Series' object is not callable~
If you're trying to display the max value use:
print(df['Global_sales'].max(axis=None,skipna=False,level=15,numeric_only=None))
If you want to display the index of the max value use:
print(df['Global_sales'].idxmax(axis=1,skipna=False)
and maybe check pandas.series.max parameters here
https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html
I am new to spark and learning it. can someone help with below question
The quote in spark definitive regarding dataframe definition is "In general, Spark will fail only at job execution time rather than DataFrame definition timeāeven if,
for example, we point to a file that does not exist. This is due to lazy evaluation,"
so I guess spark.read.format().load() is dataframe definition. On top of this created dataframe we apply transformations and action and load is read API and not transformation if I am not wrong.
I tried to "file that does not exist" in load and I am thinking this is dataframe definition. but I got below error. according to the book it should not fail right?. I am surely missing something. can someone help on this?
df=spark.read.format('csv')
.option('header',
'true').option('inferschema', 'true')
.load('/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv')
Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Path does not exist: /spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv;'
why dataframe definition is referring Hadoop metadata when it is lazy evaluated?
Till here dataframe is defined and reader object instantiated.
scala> spark.read.format("csv").option("header",true).option("inferschema",true)
res2: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader#7aead157
when you actually say load.
res2.load('/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv') and the file doesnt exist...... is execution time.(that means it has to check the data source and then it has to load the data from csv)
To get dataframe its checking meta data of hadoop since it will check hdfs whether this file exist or not.
It doesnt then you are getting
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://203-249-241:8020/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv
In general
1) RDD/DataFrame lineage will be created and will not be executed is definition time.
2) when load is executed then it will be the execution time.
See the below flow to understand better.
Conclude : Any traformation (definition time in your way ) will not be executed until action is called (execution time in your way)
Spark is a lazy evolution. However, that doesn't mean It can't verify if file exist of not while loading it.
Lazy evolution happens on DataFrame object, and in order to create dataframe object they need to first check if file exist of not.
Check the following code.
#scala.annotation.varargs
def load(paths: String*): DataFrame = {
if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
throw new AnalysisException("Hive data source can only be used with tables, you can not " +
"read files of Hive data source directly.")
}
DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
val catalogManager = sparkSession.sessionState.catalogManager
val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
source = provider, conf = sparkSession.sessionState.conf)
val pathsOption = if (paths.isEmpty) {
None
} else {
val objectMapper = new ObjectMapper()
Some("paths" -> objectMapper.writeValueAsString(paths.toArray))
}
I have submitted a similar question relating to saveAsTextFile, but I'm not sure if one question will provide the same answer as I now have a new error messagae:
I have compiled the following pyspark.sql code:
#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('Person_Person')
myresults = spark.sql("""SELECT
PersonType
,COUNT(PersonType) AS `Person Count`
FROM Person_Person
GROUP BY PersonType""")
myresults.collect()
result = myresults.collect()
result
result.saveAsTextFile("test")
However, I am getting the following error:
Append ResultsClear Results
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-9e137ed161cc> in <module>()
----> 1 result.saveAsTextFile("test")
AttributeError: 'list' object has no attribute 'saveAsTextFile'
As I mentioned I'm trying to send the results of my query to a text file with the command saveAsTextFile, but I am getting the above error.
Can someone shed some light on how to resolve this issue?
Collect() returns all the records of the Dataframe as a list of type Row. And you are calling 'SaveAsTextFile' on the result which is a list.
List doesnt have the 'saveAsTextFile' function, so it's throwing an error.
result = myresults.collect()
result.saveAsTextFile("test")
To save the contents of the Dataframe to file, you have 2 options:
Convert the DataFrame into RDD and call 'saveAsTextFile' function on it.
myresults.rdd.saveAsTextFile(OUTPUT_PATH)
Using DataframeWriter. In this case, DataFrame must have only one column that is of string type. Each row becomes a new line in the output file.
myresults.write.format("text").save(OUTPUT_PATH)
As you have more than 1 column in Dataframe, proceed with Option:1.
Also by default, spark will create 200 Partitions for shuffle. so, 200 files will be created in the output path. If you less data, configure the below parameter according to your data size.
spark.conf.set("spark.sql.shuffle.partitions", 5) # 5 files will be written to output folder.
I am getting below error after executing below code. Am I missing something in the installation? I am using spark installed on my local mac and so I am checking to see if I need to install additional libraries for below code to work and load data from bigquery.
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-9d6701949cac> in <module>()
13 "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
14 "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
---> 15 conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.gson.JsonObject
import json
import pyspark
sc = pyspark.SparkContext()
hadoopConf=sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")
conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
"mapred.bq.input.project.id": "publicdata",
"mapred.bq.input.dataset.id":"samples",
"mapred.bq.input.table.id": "shakespeare" }
tableData = sc.newAPIHadoopRDD(
"com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
"org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
int(x["word_count"]))).reduceByKey(lambda x,y: x+y)
print tableData.take(10)
The error "java.lang.ClassNotFoundException: com.google.gson.JsonObject" seems to hint that a library is missing.
Please try adding the gson jar to your path: http://search.maven.org/#artifactdetails|com.google.code.gson|gson|2.6.1|jar
Highlighting something buried in the connector link in Felipe's response: the bq connector used to be included by default in Cloud Dataproc, but was dropped starting at v 1.3. The link shows you three ways to get it back.
I want a traceback from every query executed during a request, so I can find where they're coming from and reduce the count/complexity.
I'm using this excellent snippet of middleware to list and time queries, but I don't know where in the they're coming from.
I've poked around in django/db/models/sql/compiler.py but apparent form getting a local version of django and editing that code I can't see how to latch on to queries. Is there a signal I can use? it seems like there isn't a signal on every query.
Is it possible to specify the default Manager?
(I know about django-toolbar, I'm hoping there's a solution without using it.)
An ugly but effective solution (eg. it prints the trace on all queries and only requires one edit) is to add the following to the bottom of settings.py:
import django.db.backends.utils as bakutils
import traceback
bakutils.CursorDebugWrapper_orig = bakutils.CursorWrapper
def print_stack_in_project():
stack = traceback.extract_stack()
for path, lineno, func, line in stack:
if 'lib/python' in path or 'settings.py' in path:
continue
print 'File "%s", line %d, in %s' % (path, lineno, func)
print ' %s' % line
class CursorDebugWrapperLoud(bakutils.CursorDebugWrapper_orig):
def execute(self, sql, params=None):
try:
return super(CursorDebugWrapperLoud, self).execute(sql, params)
finally:
print_stack_in_project()
print sql
print '\n\n\n'
def executemany(self, sql, param_list):
try:
return super(CursorDebugWrapperLoud, self).executemany(sql, param_list)
finally:
print_stack_in_project()
print sql
print '\n\n\n'
bakutils.CursorDebugWrapper = CursorDebugWrapperLoud
Still not sure if there is a more elegant way of doing this?
Django debug toolbar will tell you what you want with spectacular awesomeness.