BigQuery, None until set from the server property - google-bigquery

I am trying to get the number of rows in a BigQuery table using the num_rows property, but I get None as the result. When I checked the documentation, the code says :returns: the row count (None until set from the server). When will the server set the number of rows in a table, or do I need to perform some operation before reading this property?
Below is my code:
from google.cloud import bigquery

bqclient = bigquery.Client.from_service_account_json('service_account.json')
datasets = list(bqclient.list_datasets())
for dataset in datasets:
    for table in bqclient.list_dataset_tables(dataset):
        print(table.num_rows)

list_dataset_tables() returns partial table resources that don't include row counts, so num_rows stays None on those objects; fetching the full table metadata with get_table() fills it in. Try this instead:
for dataset in datasets:
    for table in bqclient.list_dataset_tables(dataset):
        print("Table {} has {} rows".format(table.table_id,
                                            bqclient.get_table(table).num_rows))

Related

DolphinDB: chunks distribution of a dfs table in a cluster

How to get the distribution of all the chunks of a dfs table in a cluster with DolphinDB? I've tried getChunksMeta but it only returned the chunk information.
Use the DolphinDB function getTabletsMeta() to view the chunk metadata on the data nodes; the output includes the data node where each chunk is located. Then wrap it in a query function:
def chunkDistribution(dbName, tbName){
    return select count(*) from pnodeRun(getTabletsMeta{"/"+substr(dbName,6)+"/%",tbName,true,-1}) group by node
}

dbName = "dfs://testDB"
tbName = "testTable"
chunkDistribution(dbName, tbName)

How can I write a DataFrame to a specific partition of a date-partitioned BQ table using to_gbq()

I have a DataFrame which I want to write to a date-partitioned BQ table. I am using the to_gbq() method to do this. I am able to replace or append the existing table, but I can't write to a specific partition of the table using to_gbq().
Since to_gbq() doesn't support this yet, I created a code snippet for doing it with the BigQuery API client.
Assuming you have an existing date-partitioned table that was created like this (you don't need to pre-create it, more details later):
CREATE TABLE
  your_dataset.your_table (transaction_id INT64, transaction_date DATE)
PARTITION BY
  transaction_date
and you have a DataFrame like this:
import pandas
import datetime

records = [
    {"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 3, "transaction_date": datetime.date(2021, 10, 21)},
]
df = pandas.DataFrame(records)
Here's how to write to a specific partition:
from google.cloud import bigquery

client = bigquery.Client(project='your_project')

job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    # This is needed if the table doesn't exist, but won't hurt otherwise:
    time_partitioning=bigquery.table.TimePartitioning(type_="DAY"),
)

# Include the target partition in the table id:
table_id = "your_project.your_dataset.your_table$20211021"

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)  # Make an API request.
job.result()  # Wait for the job to finish.
The important part is the $... suffix in the table id. It tells the API to only update a specific partition. If your data contains records that belong to a different partition, the operation will fail.
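If you'd rather not hard-code the date, here is a small sketch (assuming every row in df shares the same transaction_date, as in the example records above) for deriving the $YYYYMMDD decorator from the DataFrame itself:
partition_suffix = df["transaction_date"].iloc[0].strftime("%Y%m%d")  # e.g. "20211021"
table_id = "your_project.your_dataset.your_table${}".format(partition_suffix)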
I believe writing to partitioned tables is not supported by to_gbq() yet.
You can check recent issues here: https://github.com/pydata/pandas-gbq/issues/43.
I would recommend using the Google BigQuery API client library instead: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
You can upload a DataFrame to a BigQuery table with it too:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
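For reference, a minimal sketch along the lines of that linked sample (the table id is a placeholder, default credentials are assumed, and df is the DataFrame from the answer above):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your_project.your_dataset.your_table"  # placeholder

# Load the DataFrame into the table and wait for the load job to complete.
job = client.load_table_from_dataframe(df, table_id)
job.result()

table = client.get_table(table_id)
print("Loaded {} rows into {}".format(table.num_rows, table_id))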

pandas read sql query improvement

So I downloaded some data from a database which conveniently has a sequential ID column. I saved the max ID for each table I am querying to a small text file, which I read into memory (the max_ids dataframe).
I was trying to create a query that says: give me all of the data where Idcol > max_id for that table. I was getting errors that Series are mutable, so I could not use them as a parameter. The code below ended up working, but it was literally just a guess-and-check process: converting to an int and then a string basically extracted the actual value from the dataframe.
Is this the correct way to accomplish what I am trying to do before I replicate it for about 32 different tables? I want to always be able to grab only the latest data from these tables, which I then process in pandas and eventually consolidate and export to another database.
df= pd.read_sql_query('SELECT * FROM table WHERE Idcol > %s;', engine, params={'max_id', str(int(max_ids['table_max']))})
Can I also make the table name more dynamic? I need to go through a list of tables. The database is MS SQL, and I am using pymssql and sqlalchemy.
Here is an example of the output of max_ids['table_max']:
Out[11]:
0    1900564174
Name: max_id, dtype: int64
Assuming that your max_ids DF looks as follows:
In [24]: max_ids
Out[24]:
   table  table_max
0  tab_a      33333
1  tab_b     555555
2  tab_c   66666666
you can do it this way:
qry = 'SELECT * FROM {} WHERE Idcol > :max_id'

for i, r in max_ids.iterrows():
    print('Executing: [%s], max_id: %s' % (qry.format(r['table']), r['table_max']))
    pd.read_sql_query(qry.format(r['table']), engine, params={'max_id': r['table_max']})
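Since the question mentions consolidating the results afterwards, one way to keep each table's data around (a sketch reusing the qry, engine and max_ids objects from above; the dict keys are just the table names) is:
import pandas as pd

frames = {}
for i, r in max_ids.iterrows():
    frames[r['table']] = pd.read_sql_query(qry.format(r['table']), engine,
                                           params={'max_id': int(r['table_max'])})

# frames['tab_a'] now holds only the rows with Idcol greater than its stored max id,
# and pd.concat(frames.values()) would stack them if the tables share a schema.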

Fetch time series from Google BigQuery

I am trying to fetch a list of prices from Google BigQuery using the following query:
query_request = service.jobs()
query_data = {
    'query': (
        '''
        SELECT
          open
        FROM
          timeseries.price_2015
        ''')
}
query_response = query_request.query(
    projectId=project_id,
    body=query_data).execute()
The table contains 370,000 records, but the query loads only the first 100,000. I guess I am hitting some limit? Can you tell me how I can fetch all records for the 'price' column?
The number of rows returned is limited by the lesser of the maximum page size and the maxResults property. See more in Paging Through list Results.
Consider using Jobs: getQueryResults or Tabledata: list; you can call those APIs in a loop, passing the pageToken from the previous response into the next call and collecting the whole result set on the client side.
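A rough sketch of that loop using the same discovery-based service object as the question (assuming the job completes synchronously; the field names follow the BigQuery v2 REST responses):
rows = list(query_response.get('rows', []))
job_ref = query_response['jobReference']
page_token = query_response.get('pageToken')

while page_token:
    page = service.jobs().getQueryResults(
        projectId=job_ref['projectId'],
        jobId=job_ref['jobId'],
        pageToken=page_token).execute()
    rows.extend(page.get('rows', []))
    page_token = page.get('pageToken')

print(len(rows))  # should now cover all ~370000 records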

Put Array into Table Column

I'm trying to store information in a PyTables subclass. I have my class Record and subclass Data; Data will have many rows for every row of Record. I don't want to use a loop with row.append() because it seems like it would be horribly slow. Can I just create an array and drop it into the Data.v column? How?
import tables as tbs
import numpy as np

class Record(tbs.IsDescription):
    filename = tbs.StringCol(255)
    timestamp = tbs.Time32Col()

class Data(tbs.IsDescription):
    v = tbs.Int32Col(dflt=None)

...

row = table.row
for each in importdata:
    row['filename'] = each['filename']
    row['timestamp'] = each['timestamp']
    # ???? I want to do something like this
    row.Data = tbs.Array('v', each['v'], shape=np.shape(each['v']))
    row.append()
OK, when I read about nested tables I was thinking about relational data in a one-to-many situation. That isn't possible with nested tables. Instead, I just created a separate table and stored row references, using the table's nrows attribute to get the current row of my Data table. This works for me because for every entry in Record I can calculate the number of rows that will be stored in Data; I just need to know the starting row. I'm not going to modify, insert, or remove any rows in the future, so my references don't change. Anyone considering this technique should understand the significant limitations it brings.
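A minimal sketch of that separate-table approach, under the assumption that Record gains two extra columns (the names data_start and data_len are hypothetical) to hold the starting row and the row count, with importdata as in the question:
import tables as tbs

class Record(tbs.IsDescription):
    filename = tbs.StringCol(255)
    timestamp = tbs.Time32Col()
    data_start = tbs.Int64Col()  # hypothetical: first row of this entry in the Data table
    data_len = tbs.Int64Col()    # hypothetical: number of rows this entry stored in Data

class Data(tbs.IsDescription):
    v = tbs.Int32Col()

h5 = tbs.open_file('records.h5', mode='w')
rec_table = h5.create_table('/', 'records', Record)
data_table = h5.create_table('/', 'data', Data)

rec_row = rec_table.row
for each in importdata:
    rec_row['filename'] = each['filename']
    rec_row['timestamp'] = each['timestamp']
    rec_row['data_start'] = data_table.nrows      # reference taken before appending
    rec_row['data_len'] = len(each['v'])
    data_table.append([(v,) for v in each['v']])  # bulk-append all values for this entry
    rec_row.append()

rec_table.flush()
data_table.flush()
h5.close()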
Nested columns use the '/' separator in the column key. So I think that you simply need to change the line:
row.Data = tbs.Array('v', each['v'], shape=np.shape(each['v']))
to the following:
row['Data/v'] = each['v']
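For this to work, Data has to be declared as a nested description inside Record rather than as a standalone table. A minimal self-contained sketch of that layout (the file name and values are made up for illustration):
import tables as tbs

class Record(tbs.IsDescription):
    filename = tbs.StringCol(255)
    timestamp = tbs.Time32Col()

    class Data(tbs.IsDescription):  # nested description, one value per Record row
        v = tbs.Int32Col()

h5 = tbs.open_file('nested.h5', mode='w')
table = h5.create_table('/', 'records', Record)

row = table.row
row['filename'] = 'example.dat'
row['timestamp'] = 0
row['Data/v'] = 42  # nested column accessed with the '/' separator
row.append()

table.flush()
h5.close()
Note that a nested column still holds a fixed shape per row, so it does not by itself give you the one-to-many layout the first answer describes.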