Unable to load bigquery data in local spark (on my mac) using pyspark - google-bigquery

I am getting the error below after executing the code below. Am I missing something in the installation? I am using Spark installed locally on my Mac, so I am checking whether I need to install additional libraries for the code below to work and load data from BigQuery.
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-9d6701949cac> in <module>()
13 "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
14 "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
---> 15 conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.gson.JsonObject
import json
import pyspark
sc = pyspark.SparkContext()
hadoopConf=sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")
conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
"mapred.bq.input.project.id": "publicdata",
"mapred.bq.input.dataset.id":"samples",
"mapred.bq.input.table.id": "shakespeare" }
tableData = sc.newAPIHadoopRDD(
"com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
"org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
int(x["word_count"]))).reduceByKey(lambda x,y: x+y)
print tableData.take(10)

The error "java.lang.ClassNotFoundException: com.google.gson.JsonObject" seems to hint that a library is missing.
Please try adding the gson jar to your path: http://search.maven.org/#artifactdetails|com.google.code.gson|gson|2.6.1|jar
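For a local PySpark setup, one way to do that is to put the jars on the driver/executor classpath via the spark.jars property before the context is created. This is only a sketch; the jar paths below are placeholders for wherever you downloaded gson and the BigQuery connector.
import pyspark

# Sketch: the paths below are placeholders, not real locations.
# spark.jars takes a comma-separated list of local jars to put on the classpath.
conf = (pyspark.SparkConf()
        .set("spark.jars",
             "/path/to/gson-2.6.1.jar,/path/to/bigquery-connector-hadoop2-latest.jar"))
sc = pyspark.SparkContext(conf=conf)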

Highlighting something buried in the connector link in Felipe's response: the BQ connector used to be included by default in Cloud Dataproc, but was dropped starting with v1.3. The link shows you three ways to get it back.

Related

Failed to connect to BigQuery with Python - ServiceUnavailable

Querying data from BigQuery had been working for me. Then I updated my Google packages (e.g. google-cloud-bigquery) and suddenly I could no longer download data. Unfortunately, I no longer know which version of the package I was using before. Now I'm using version 1.26.1 of google-cloud-bigquery.
Here is the code I was running:
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
KEY_FILE_LOCATION = "path_to_json"
PROJECT_ID = 'bigquery-123454'
credentials = service_account.Credentials.from_service_account_file(KEY_FILE_LOCATION)
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)
query_job = client.query("""
SELECT
x,
y
FROM
`bigquery-123454.624526435.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20200501' AND '20200502'
""")
results = query_job.result()
df = results.to_dataframe()
Except for the last line, df = results.to_dataframe(), the code works perfectly. Now I get a weird error which consists of three parts:
Part 1:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1596627109.629000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1596627109.629000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
Part 2:
ServiceUnavailable: 503 failed to connect to all addresses
Part 3:
RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x0000000010BD3C80>, table_reference {
project_id: "bigquery-123454"
dataset_id: "_a0003e6c1ab4h23rfaf0d9cf49ac0e90083ca349e"
table_id: "anon2d0jth_f891_40f5_8c63_76e21ab5b6f5"
}
requested_streams: 1
read_options {
}
format: ARROW
parent: "projects/bigquery-123454"
, metadata=[('x-goog-request-params', 'table_reference.project_id=bigquery-123454&table_reference.dataset_id=_a0003e6c1abanaw4egacf0d9cf49ac0e90083ca349e'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.30.0 gax/1.22.0 gapic/1.0.0')]), last exception: 503 failed to connect to all addresses
I don't have an explanation for this error. I don't think it has anything to do with me updating the packages.
I once had problems with the proxy, but those caused a different error.
My colleague said that the project "bigquery-123454" is still available in BigQuery.
Any ideas?
Thanks for your help in advance!
A 503 error occurs when there is a network issue. Try again after some time or retry the job.
You can read more about the error on the Google Cloud page.
I found the answer:
After downgrading the package "google-cloud-bigquery" from version 1.26.1 to 1.18.1 the code worked again! So the new package caused the errors.
I downgraded the package using pip install google-cloud-bigquery==1.18.1 --force-reinstall
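If you are unsure which version is actually active after the reinstall, a quick sanity check (just a sketch) is:
# Confirm the downgrade took effect in the environment your notebook/script uses.
from google.cloud import bigquery
print(bigquery.__version__)  # expect '1.18.1' after the --force-reinstall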

DataFrame definition is lazy evaluation

I am new to Spark and learning it. Can someone help with the question below?
The quote in Spark: The Definitive Guide regarding DataFrame definition is: "In general, Spark will fail only at job execution time rather than DataFrame definition time - even if, for example, we point to a file that does not exist. This is due to lazy evaluation."
So I guess spark.read.format().load() is the DataFrame definition. On top of this created DataFrame we apply transformations and actions, and load is a read API and not a transformation, if I am not wrong.
I tried a "file that does not exist" in load, thinking this is DataFrame definition, but I got the error below. According to the book it should not fail, right? I am surely missing something. Can someone help with this?
df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferschema', 'true') \
    .load('/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv')
Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Path does not exist: /spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv;'
Why is the DataFrame definition referring to Hadoop metadata when it is lazily evaluated?
Up to here the DataFrame is defined and the reader object is instantiated:
scala> spark.read.format("csv").option("header",true).option("inferschema",true)
res2: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@7aead157
When you actually call load:
res2.load('/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv') and the file doesn't exist... that is execution time (it has to check the data source and then load the data from the CSV).
To get the DataFrame, Spark checks the Hadoop metadata, since it has to ask HDFS whether this file exists or not.
If it doesn't, then you get
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://203-249-241:8020/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv
In general:
1) The RDD/DataFrame lineage is created but not executed; that is definition time.
2) When load is executed, that is execution time.
See the flow below to understand better.
Conclusion: any transformation (definition time, in your terms) will not be executed until an action is called (execution time, in your terms).
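To make the distinction concrete, here is a small PySpark sketch (the path and column names are hypothetical): load() touches the path eagerly, while transformations stay lazy until an action runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load() resolves the path (and infers the schema), so a missing file
# already fails here, even though no Spark job has run yet.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/existing-file.csv"))  # hypothetical path

# Transformations only extend the lineage; nothing executes yet.
filtered = df.where("Quantity > 10").select("InvoiceNo")  # hypothetical columns

# An action triggers actual job execution.
filtered.show(5)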
Spark uses lazy evaluation. However, that doesn't mean it can't verify whether the file exists while loading it.
Lazy evaluation happens on the DataFrame object, and in order to create the DataFrame object Spark first needs to check whether the file exists.
Check the following code.
@scala.annotation.varargs
def load(paths: String*): DataFrame = {
  if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
    throw new AnalysisException("Hive data source can only be used with tables, you can not " +
      "read files of Hive data source directly.")
  }
  DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
    val catalogManager = sparkSession.sessionState.catalogManager
    val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
      source = provider, conf = sparkSession.sessionState.conf)
    val pathsOption = if (paths.isEmpty) {
      None
    } else {
      val objectMapper = new ObjectMapper()
      Some("paths" -> objectMapper.writeValueAsString(paths.toArray))
    }

added spark-xml_2.11 file into databricks library and creating the dataframes

I have added spark-xml_2.11 to my Databricks library as a Maven coordinate:
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
and I am trying to load the data into a DataFrame like below:
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='catalog').load('smaple.xml')
and I am getting the error below:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
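For reference, a hedged sketch of the same read with the package supplied through Spark configuration rather than the library UI (the Maven coordinate is the one from the question; whether this fits your Databricks runtime is an assumption, since spark.jars.packages has to be set before the session starts):
from pyspark.sql import SparkSession

# Sketch only: pull the same Maven coordinate onto the classpath at session start.
spark = (SparkSession.builder
         .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.4.1")
         .getOrCreate())

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "catalog")
      .load("smaple.xml"))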

errors trying to read Access Database Tables into Pandas with PYODBC

I would like to perform the simple task of bringing table data from an MS Access database into pandas as a DataFrame. I had this working recently, and now I cannot figure out why it is no longer working. I remember that when initially troubleshooting the connection I had to install a new Microsoft database driver with the correct bitness, so I have revisited that and gone through a reinstallation of the driver. Below is my setup.
Record of install on Laptop:
OS: Windows 7 Professional 64-bit (verified 9/6/2017)
Access version: Access 2016 32bit (verified 9/6/2017)
Python version: Python 3.6.1 (64-bit) found using >Python -V (verified 9/11/2017)
the AccessDatabaseEngine needed will be based on the Python bitness above
Windows database engine driver installed with AccessDatabaseEngine_X64.exe from 2010 release using >AccessDatabaseEngine_X64.exe /passive (verified 9/11/2017)
I am running the following simple test code to try out the connection to a test database.
import pyodbc
import pandas as pd
[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]
returns:
['Microsoft Access Driver (*.mdb, *.accdb)']
Setting the connection string.
dbpath = r'Z:\1Users\myfiles\software\JupyterNotebookFiles\testDB.accdb'
conn_str = (r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};''DBQ=%s;' %(dbpath))
cnxn = pyodbc.connect(conn_str)
crsr = cnxn.cursor()
Verifying that I am connected to the db...
for table_info in crsr.tables(tableType='TABLE'):
    print(table_info.table_name)
returns:
TestTable1
Trying to connect to TestTable1 gives the error below.
dfTable = pd.read_sql_table(TestTable1, cnxn)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-a24de1550834> in <module>()
----> 1 dfTable = pd.read_sql_table(TestTable1, cnxn)
2 #dfQuery = pd.read_sql_query("SELECT FROM [TestQuery1]", cnxn)
NameError: name 'TestTable1' is not defined
Trying again with single quotes gives the error below.
dfTable = pd.read_sql_table('TestTable1', cnxn)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-15-1f89f9725f0a> in <module>()
----> 1 dfTable = pd.read_sql_table('TestTable1', cnxn)
2 #dfQuery = pd.read_sql_query("SELECT FROM [TestQuery1]", cnxn)
C:\Users\myfiles\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
250 con = _engine_builder(con)
251 if not _is_sqlalchemy_connectable(con):
--> 252 raise NotImplementedError("read_sql_table only supported for "
253 "SQLAlchemy connectable.")
254 import sqlalchemy
NotImplementedError: read_sql_table only supported for SQLAlchemy connectable.
I have tried going back to the driver issue and reinstalling a 32bit version without any luck.
Anybody have any ideas?
Per the docs of pandas.read_sql_table:
Given a table name and an SQLAlchemy connectable, returns a DataFrame. This function does not support DBAPI connections.
Since pyodbc is a DBAPI connection, use the query method pandas.read_sql, whose con argument does support DBAPI connections:
dfTable = pd.read_sql("SELECT * FROM TestTable1", cnxn)
Reading a db table with just the table name:
import pandas
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost/db_name')
df = pandas.read_sql_table("table_name", engine)

Changing java heapsizes for a websphere server using websphere and wsadminlib.py

I'm trying to call a command from wsadminlib.py to change the initialHeapSize and the maximumHeapSize in a script, but unfortunately my Jython (and general scripting) knowledge is still that of a total newbie.
I'm using the call
#Change Java Heap Size
setJvmProperty(nodeName,serverName,maximumHeapsize -2048 ,initialHeapSize -2048)
which should relate to this function in the wsadminlib.py library:
def setJvmProperty(nodename, servername, propertyname, value):
    """Set a particular JVM property for the named server
    Some useful examples:
        'maximumHeapSize': 512 ,
        'initialHeapSize':512,
        'verboseModeGarbageCollection':"true",
        'genericJvmArguments':"-Xgcpolicy:gencon -Xdump:heap:events=user -Xgc:noAdaptiveTenure,tenureAge=8,stdGlobalCompactToSatisfyAllocate -Xconcurrentlevel1 -Xtgc:parallel",
    """
    jvm = getServerJvm(nodename, servername)
    AdminConfig.modify(jvm, [[propertyname, value]])
But I'm met with this issue when I run the script:
WASX7017E: Exception received while running file "/etc/was-scripts/administrateservertest.py"; exception information: com.ibm.bsf.BSFException: exception from Jython:
Traceback (innermost last):
File "", line 14, in ?
NameError: maximumHeapsize
Any suggestions would be appreciated as I'm tearing my hair out trying to work this out
This was answered by a friend on Facebook:
I think you might need to make two calls, one for each property you
want to set. e.g.
setJvmProperty(nodeName,serverName,'maximumHeapsize',2048)
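A sketch of the two-call version, using the property names from the wsadminlib docstring quoted above (note the capital S in maximumHeapSize):
# One call per JVM property, matching the setJvmProperty signature above.
setJvmProperty(nodeName, serverName, 'initialHeapSize', 2048)
setJvmProperty(nodeName, serverName, 'maximumHeapSize', 2048)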
For others looking for a more specific answer, try this:
AdminConfig.modify(jvmId,[['genericJvmArguments',arguments],["maximumHeapSize", str(1536)]])