Issue running query in Google Colab + BigQuery - google-bigquery

I've followed the step-by-step guide here and inserted this snippet:
https://colab.research.google.com/notebook#snippetFileIds=%2Fv2%2Fexternal%2Fnotebooks%2Fsnippets%2Fbigquery.ipynb&snippetQuery=Using%20BigQuery%20with%20Pandas%20API
However, when I run the query, the following error appears:
TypeError                                 Traceback (most recent call last)
<ipython-input-22-b9e37aa67e26> in <module>()
      9   COUNT(*) as total
     10 FROM `bigquery-public-data.samples.gsod`
---> 11 ''', project_id=project_id).total[0]
     12 
     13 df = pd.io.gbq.read_gbq(f'''

8 frames
/usr/local/lib/python3.6/dist-packages/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays()

TypeError: from_arrays() takes at least 2 positional arguments (1 given)
I have tried with several databases, with no success.
Any ideas?

I have followed steps from Using BigQuery with Pandas API Colab and it works fine for me.
First you need to create a Cloud Platform project if you don't already have one, and then enable billing and the BigQuery API.
When running the first snippet of code, you need to click the link that appears in the console, copy the verification code, and paste it into the Enter verification code field:
from google.colab import auth
auth.authenticate_user()
Before running the second snippet of code, change the project_id field to the ID of the project you actually created in GCP:
import pandas as pd
# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'your Cloud Platform project ID'
sample_count = 2000
row_count = pd.io.gbq.read_gbq('''
  SELECT
    COUNT(*) as total
  FROM `bigquery-public-data.samples.gsod`
''', project_id=project_id).total[0]

df = pd.io.gbq.read_gbq(f'''
  SELECT
    *
  FROM
    `bigquery-public-data.samples.gsod`
  WHERE RAND() < {sample_count}/{row_count}
''', project_id=project_id)

print(f'Full dataset has {row_count} rows')
After that, you will get the row count of the full dataset printed as output.
I hope it helps you.

I fixed this issue by updating pyarrow to a newer version:
!pip install pyarrow==0.17.1
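As a quick sanity check after reinstalling (and restarting the runtime), you can compare the reported pyarrow.__version__ against the minimum that fixed it here. A minimal stdlib sketch of such a comparison; the helper name is mine, not from the answer:

```python
# Compare dotted version strings numerically ("0.17.1" > "0.4.0", unlike
# plain string comparison). Assumes purely numeric version components.
def version_at_least(version, minimum):
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(version) >= to_tuple(minimum)

print(version_at_least("0.17.1", "0.17.1"))  # True
print(version_at_least("0.14.0", "0.17.1"))  # False
```

In a notebook you would call it as version_at_least(pyarrow.__version__, "0.17.1") after the install.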

Related

Pandas diff-function: NotImplementedError

When I use the diff function in my snippet:
for customer_id, cus in tqdm(df.groupby(['customer_ID'])):
    # Get differences
    diff_df1 = cus[num_features].diff(1, axis=0).iloc[[-1]].values.astype(np.float32)
I get:
NotImplementedError
The exact same code ran without any error before (on Colab), whereas now I'm using an Azure DSVM via JupyterHub and I get this error.
I already found this
pandas pd.DataFrame.diff(axis=1) NotImplementationError
but the solution doesn't work for me, as I don't have any Date types. I also upgraded pandas, but it didn't change anything.
EDIT:
I have found that the error occurs when the datatype is 'int16' or 'int8'. Converting the dtypes to 'int64' solves it.
However I leave the question open in case someone can explain it or show a solution that works with int8/int16.
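A minimal sketch of the workaround described in the edit above, using a hypothetical single-column frame (the column name and values are mine, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the failing case: a numeric column stored as int16
cus = pd.DataFrame({"feature": np.array([10, 12, 15], dtype=np.int16)})

# Workaround: widen the narrow integer dtype before calling diff
cus["feature"] = cus["feature"].astype(np.int64)
diffs = cus["feature"].diff(1)
print(diffs.tolist())  # [nan, 2.0, 3.0]
```

The first element is NaN because there is no previous row to diff against, which is also why the result comes back as float64.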

Failed to connect to BigQuery with Python - ServiceUnavailable

Querying data from BigQuery had been working for me. Then I updated my Google packages (e.g. google-cloud-bigquery) and suddenly I could no longer download data. Unfortunately, I no longer know the old version of the package I was using. Now I'm using version 1.26.1 of google-cloud-bigquery.
Here is my code, which was running:
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd

KEY_FILE_LOCATION = "path_to_json"
PROJECT_ID = 'bigquery-123454'

credentials = service_account.Credentials.from_service_account_file(KEY_FILE_LOCATION)
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)

query_job = client.query("""
    SELECT
        x,
        y
    FROM
        `bigquery-123454.624526435.ga_sessions_*`
    WHERE
        _TABLE_SUFFIX BETWEEN '20200501' AND '20200502'
    """)

results = query_job.result()
df = results.to_dataframe()
Except for the last line, df = results.to_dataframe(), the code works perfectly. Now I get a weird error which consists of three parts:
Part 1:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1596627109.629000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1596627109.629000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
Part 2:
ServiceUnavailable: 503 failed to connect to all addresses
Part 3:
RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x0000000010BD3C80>, table_reference {
project_id: "bigquery-123454"
dataset_id: "_a0003e6c1ab4h23rfaf0d9cf49ac0e90083ca349e"
table_id: "anon2d0jth_f891_40f5_8c63_76e21ab5b6f5"
}
requested_streams: 1
read_options {
}
format: ARROW
parent: "projects/bigquery-123454"
, metadata=[('x-goog-request-params', 'table_reference.project_id=bigquery-123454&table_reference.dataset_id=_a0003e6c1abanaw4egacf0d9cf49ac0e90083ca349e'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.30.0 gax/1.22.0 gapic/1.0.0')]), last exception: 503 failed to connect to all addresses
I don't have an explanation for this error, and I don't think it has anything to do with me updating the packages.
I once had problems with the proxy, but those caused a different error.
My colleague said that the project "bigquery-123454" is still available in BigQuery.
Any ideas?
Thanks for your help in advance!
A 503 error occurs when there is a network issue. Try again after some time, or retry the job.
You can read more about the error on the Google Cloud page.
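The "retry the job" advice can be sketched as a small wrapper; the helper and its parameters are my own, not part of the BigQuery client library:

```python
import time

def run_with_retries(job, attempts=3, delay=1.0):
    """Call job(); on failure, wait and retry, since 503s are often transient."""
    for attempt in range(attempts):
        try:
            return job()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)

# Usage with the query above would look something like:
# df = run_with_retries(lambda: client.query(sql).result().to_dataframe())
```

The google-api-core package also ships retry machinery that the client can use, but a plain loop like this is enough to tell a transient network blip apart from a persistent failure.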
I found the answer: after downgrading the package google-cloud-bigquery from version 1.26.1 to 1.18.1, the code worked again, so the new package version caused the errors.
I downgraded the package using pip install google-cloud-bigquery==1.18.1 --force-reinstall

Unable to SaveAsTextFile AttributeError: 'list' object has no attribute 'saveAsTextFile'

I have submitted a similar question relating to saveAsTextFile, but I'm not sure if one question will provide the same answer, as I now have a new error message:
I have compiled the following pyspark.sql code:
#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('Person_Person')
myresults = spark.sql("""SELECT
                             PersonType
                             ,COUNT(PersonType) AS `Person Count`
                         FROM Person_Person
                         GROUP BY PersonType""")

myresults.collect()
result = myresults.collect()
result
result.saveAsTextFile("test")
However, I am getting the following error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-9e137ed161cc> in <module>()
----> 1 result.saveAsTextFile("test")

AttributeError: 'list' object has no attribute 'saveAsTextFile'
As I mentioned I'm trying to send the results of my query to a text file with the command saveAsTextFile, but I am getting the above error.
Can someone shed some light on how to resolve this issue?
collect() returns all the records of the DataFrame as a list of Row objects, and you are calling saveAsTextFile on that result, which is a list.
Lists don't have a saveAsTextFile method, so it throws an error.
result = myresults.collect()
result.saveAsTextFile("test")
To save the contents of the DataFrame to a file, you have two options:
Option 1: Convert the DataFrame into an RDD and call saveAsTextFile on it:
myresults.rdd.saveAsTextFile(OUTPUT_PATH)
Option 2: Use DataFrameWriter. In this case the DataFrame must have only one column, of string type; each row becomes a new line in the output file:
myresults.write.format("text").save(OUTPUT_PATH)
As you have more than one column in the DataFrame, proceed with option 1.
Also, by default Spark creates 200 partitions for shuffles, so 200 files will be created in the output path. If you have less data, configure the parameter below according to your data size:
spark.conf.set("spark.sql.shuffle.partitions", 5)  # 5 files will be written to the output folder

errors trying to read Access Database Tables into Pandas with PYODBC

I would like to perform the simple task of bringing table data from an MS Access database into a Pandas DataFrame. I had this working great recently, and now I cannot figure out why it no longer works. I remember that when initially troubleshooting the connection I needed to install a new Microsoft database driver with the correct bitness, so I have revisited that and reinstalled the driver. Below is my setup.
Record of install on Laptop:
OS: Windows 7 Professional 64-bit (verified 9/6/2017)
Access version: Access 2016 32bit (verified 9/6/2017)
Python version: Python 3.6.1 (64-bit) found using >Python -V (verified 9/11/2017)
the AccessDatabaseEngine needed will be based on the Python bitness above
Windows database engine driver installed with AccessDatabaseEngine_X64.exe from 2010 release using >AccessDatabaseEngine_X64.exe /passive (verified 9/11/2017)
I am running the following simple test code to try out the connection to a test database.
import pyodbc
import pandas as pd
[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]
returns:
['Microsoft Access Driver (*.mdb, *.accdb)']
Setting the connection string.
dbpath = r'Z:\1Users\myfiles\software\JupyterNotebookFiles\testDB.accdb'
conn_str = (r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};''DBQ=%s;' %(dbpath))
cnxn = pyodbc.connect(conn_str)
crsr = cnxn.cursor()
Verifying that I am connected to the db...
for table_info in crsr.tables(tableType='TABLE'):
    print(table_info.table_name)
returns:
TestTable1
Trying to connect to TestTable1 gives the error below.
dfTable = pd.read_sql_table(TestTable1, cnxn)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-a24de1550834> in <module>()
----> 1 dfTable = pd.read_sql_table(TestTable1, cnxn)
      2 #dfQuery = pd.read_sql_query("SELECT FROM [TestQuery1]", cnxn)

NameError: name 'TestTable1' is not defined
Trying again with single quotes gives the error below.
dfTable = pd.read_sql_table('TestTable1', cnxn)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-15-1f89f9725f0a> in <module>()
----> 1 dfTable = pd.read_sql_table('TestTable1', cnxn)
      2 #dfQuery = pd.read_sql_query("SELECT FROM [TestQuery1]", cnxn)

C:\Users\myfiles\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
    250     con = _engine_builder(con)
    251     if not _is_sqlalchemy_connectable(con):
--> 252         raise NotImplementedError("read_sql_table only supported for "
    253                                   "SQLAlchemy connectable.")
    254     import sqlalchemy

NotImplementedError: read_sql_table only supported for SQLAlchemy connectable.
I have tried going back to the driver issue and reinstalling a 32bit version without any luck.
Anybody have any ideas?
Per the docs of pandas.read_sql_table:
Given a table name and an SQLAlchemy connectable, returns a DataFrame.
This function does not support DBAPI connections.
Since pyodbc is a DBAPI driver, use the query method, pandas.read_sql, whose con argument does support DBAPI connections:
dfTable = pd.read_sql("SELECT * FROM TestTable1", cnxn)
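To see the distinction without an Access database on hand, here is a minimal sketch using sqlite3 (also a plain DBAPI driver) in place of pyodbc; the table name mirrors the question, the columns and rows are made up:

```python
import sqlite3
import pandas as pd

# sqlite3 connections are DBAPI objects, just like pyodbc's
cnxn = sqlite3.connect(":memory:")
cnxn.execute("CREATE TABLE TestTable1 (id INTEGER, name TEXT)")
cnxn.executemany("INSERT INTO TestTable1 VALUES (?, ?)", [(1, "a"), (2, "b")])

# pd.read_sql accepts the raw DBAPI connection; pd.read_sql_table would
# raise NotImplementedError here, exactly as in the traceback above
dfTable = pd.read_sql("SELECT * FROM TestTable1", cnxn)
print(len(dfTable))  # 2
```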
Reading a db table with just the table name:
import pandas
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost/db_name')
df = pandas.read_sql_table("table_name", engine)

Unable to load bigquery data in local spark (on my mac) using pyspark

I am getting the error below after executing the code below it. Am I missing something in the installation? I am using Spark installed locally on my Mac, so I am checking whether I need to install additional libraries for the code to work and load data from BigQuery.
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-8-9d6701949cac> in <module>()
     13     "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
     14     "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
---> 15     conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.gson.JsonObject
import json
import pyspark

sc = pyspark.SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")

conf = {"mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare"}

tableData = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf).map(lambda k: json.loads(k[1])) \
              .map(lambda x: (x["word"], int(x["word_count"]))) \
              .reduceByKey(lambda x, y: x + y)

print(tableData.take(10))
The error "java.lang.ClassNotFoundException: com.google.gson.JsonObject" hints that a library is missing.
Try adding the gson jar to your classpath: http://search.maven.org/#artifactdetails|com.google.code.gson|gson|2.6.1|jar
Highlighting something buried in the connector link in Felipe's response: the BigQuery connector used to be included by default in Cloud Dataproc, but was dropped starting at v1.3. The link shows you three ways to get it back.