Why is the following BigQuery insertion failing? - google-bigquery

Hello, I am trying to insert one row into a table. I successfully created the table as follows:
schema = [{'name': 'foo', 'type': 'STRING', 'mode': 'nullable'},{'name': 'bar', 'type': 'FLOAT', 'mode': 'nullable'}]
created = client.create_table(dataset='api_data_set_course_33', table='insert_test_333', schema=schema)
print('Creation Result ',created)
However, when I push the row I get False:
rows = [{'id': 'NzAzYmRiY', 'one': 'uno', 'two': 'dos'}]
inserted = client.push_rows('api_data_set_course_33','insert_test_333', rows, 'id')
print('Insertion Result ',inserted)
So I have no idea what is wrong; I would really appreciate any help with this task.
This is the API that I am testing:
https://github.com/tylertreat/BigQuery-Python
This is my complete code:
schema = [{'name': 'foo', 'type': 'STRING', 'mode': 'nullable'},{'name': 'bar', 'type': 'FLOAT', 'mode': 'nullable'}]
created = client.create_table(dataset='api_data_set_course_33', table='insert_test_333', schema=schema)
print('Creation Result ',created)
rows = [{'id': 'NzAzYmRiY', 'one': 'uno', 'two': 'dos'}]
inserted = client.push_rows('api_data_set_course_33','insert_test_333', rows, 'id')
print('Insertion Result ',inserted)
Output:
Creation Result True
Insertion Result False
After feedback I tried:
>>> client = get_client(project_id, service_account=service_account, private_key_file=key, readonly=False)
>>> schema = [{'name': 'foo', 'type': 'STRING', 'mode': 'nullable'},{'name': 'bar', 'type': 'FLOAT', 'mode': 'nullable'}]
>>> rows = [{'id': 'NzAzYmRiY', 'foo': 'uno', 'bar': 'dos'}]
>>> inserted = client.push_rows('api_data_set_course_33','insert_test_333', rows, 'id')
>>> print(inserted)
False
and also:
>>> rows = [{'id': 'NzAzYmRiY', 'foo': 'uno', 'bar': 45}]
>>> inserted = client.push_rows('api_data_set_course_33','insert_test_333', rows, 'id')
>>> print(inserted)
False
However, I only got False.

Your row field names don't match your schema field names. Try this instead:
rows = [{'id': 'NzAzYmRiY', 'foo': 'uno', 'bar': 'dos'}]
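Note that the value types have to match the schema as well as the names: bar is declared FLOAT, so a string like 'dos' is not a valid value for it. A minimal sketch, assuming the same authenticated client and table as above, with a numeric bar:
# Row keys and value types both match the table schema;
# 'id' is passed as the insert-id key, as in the question's push_rows call.
rows = [{'id': 'NzAzYmRiY', 'foo': 'uno', 'bar': 45.0}]
inserted = client.push_rows('api_data_set_course_33', 'insert_test_333', rows, 'id')
print('Insertion Result ', inserted)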

Related

Data ingestion with dataflow write to bq file error

I'm trying to ingest a CSV file into BigQuery using Apache Beam and Dataflow. Here's my code:
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

gcs_bucket_name = "gs://bck-fr-fichiers-manuel-dev/de_par_categorie_et_code_rome/"
target_table_annonce = 'fr-parisraw-dev-8ef8:pole_emploi.de_par_categorie_et_code_rome'

table_schema_annonce = {'fields': [
    {'name': 'cd_metier_rome', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'lb_metier_rome', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'cd_departement', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'lb_departement', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'nb_demandeur', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'mois', 'type': 'STRING', 'mode': 'NULLABLE'}
]}

# DoFn
class PrepareBqRowDoFn(beam.DoFn):
    def process(self, element, *args, **kwargs):
        logging.basicConfig(level=logging.INFO)
        DOFN_LOGGER = logging.getLogger("PREPAREBQROWDOFN_LOGGER")
        import csv
        from datetime import datetime, timedelta
        import re
        # element = re.sub(r'(?=[^"]+)¤(?=[^"]+)', '', element)
        line = csv.reader(element.splitlines(), quotechar='"',
                          delimiter=';', quoting=csv.QUOTE_ALL, skipinitialspace=True)
        for row in line:
            try:
                bq_row = {"cd_metier_rome": row[0],
                          "lb_metier_rome": row[1],
                          "cd_departement": row[2],
                          "lb_departement": row[3],
                          "nb_demandeur": row[4],
                          "mois": row[5]
                          }
                yield bq_row
            except IndexError:
                DOFN_LOGGER.info("Error Row : " + element)

def run():
    pipeline = beam.Pipeline(options=PipelineOptions())
    file_patterns = ['de_par_*.csv']
    for file_pattern in file_patterns:
        csv_lines = pipeline | 'Read File From GCS {}'.format(file_pattern) >> beam.io.ReadFromText(
            gcs_bucket_name + file_pattern)
        bq_row = csv_lines | 'Create Row {}'.format(file_pattern) >> beam.ParDo(PrepareBqRowDoFn())
        bq_row | 'Write to BQ {}'.format(file_pattern) >> beam.io.Write(beam.io.WriteToBigQuery(
            target_table_annonce,
            schema=table_schema_annonce,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    pipeline.run()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
The generated pipeline looks like this (screenshots omitted):
At each step I can see the rows being processed by Dataflow:
Step 1 (Read File From GCS de_par_*.csv):
Step 2 (Create Row de_par_*.csv):
But at the final step 3 (Write to BQ de_par_*.csv):
I get 0 lines
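As an aside, beam.io.WriteToBigQuery is itself a PTransform, so the beam.io.Write wrapper is unnecessary in recent Beam releases; a minimal sketch of the direct form, reusing the question's DoFn, bucket, schema, and table variables:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: applies WriteToBigQuery directly as a PTransform.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | 'Read File From GCS' >> beam.io.ReadFromText(gcs_bucket_name + 'de_par_*.csv')
     | 'Create Row' >> beam.ParDo(PrepareBqRowDoFn())
     | 'Write to BQ' >> beam.io.WriteToBigQuery(
           target_table_annonce,
           schema=table_schema_annonce,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))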

Why doesn't pandas dataframe need full row values?

fields = ['name', 'type', 'age']
df = pd.DataFrame(columns=fields)
item1 = {'name': 'john', 'type': 'student', 'age': 21}
item2 = {'name': 'john', 'age': 21}
items = [item1, item2]
for item in items:
    df = df.append(item, ignore_index=True)
I had thought only item1 could be appended, not item2, since item2 has only 2 of the 3 fields. Is this normal?
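For what it's worth, append does not require every column: missing keys are simply filled with NaN. A minimal sketch of that behavior (DataFrame.append was removed in pandas 2.0, so this uses the pd.concat equivalent):
import pandas as pd

fields = ['name', 'type', 'age']
df = pd.DataFrame(columns=fields)
item1 = {'name': 'john', 'type': 'student', 'age': 21}
item2 = {'name': 'john', 'age': 21}  # no 'type' key

# Missing columns become NaN in the resulting row; nothing is "required".
df = pd.concat([df, pd.DataFrame([item1, item2])], ignore_index=True)
print(df)
# Approximate output:
#    name     type  age
# 0  john  student   21
# 1  john      NaN   21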

How to parse a nested column in a df column?

Is there a smart, pythonic way to parse a nested column in a pandas DataFrame like this one into 3 different columns? For example, the column could look like this:
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]
And the expected result should be these 3 columns:
amount frequency freq_unit
1 2 month
3 1 month
That's just level 1. There is also a level 2: what if the elements in the list still have the same names (amount, frequency and freq_unit) but the order can change? Could the code in the answer deal with this?
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'freq_unit', 'value': 'month'}, {'name': 'frequency', 'value': 1}]
Code to reproduce the data; I really look forward to seeing how the community solves this. Thank you.
data = {'col1': [
    [{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}],
    [{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}],
]}
df = pd.DataFrame(data)
A combination of list comprehension, itertools.chain, and collections.defaultdict could help out here:
from itertools import chain
from collections import defaultdict

data = defaultdict(list)

# Collect (name, value) pairs from every entry, then flatten them
# into a single stream and accumulate per name.
phase1 = [[(item["name"], item["value"]) for item in entry]
          for entry in df.col1]
phase1 = chain.from_iterable(phase1)
for key, value in phase1:
    data[key].append(value)

pd.DataFrame(data)

   amount  frequency freq_unit
0       1          2     month
1       3          1     month
The above is verbose: @piRSquared's comment is much simpler, with a list comprehension:
pd.DataFrame([{x["name"]: x["value"] for x in lst} for lst in df.col1])
Another idea, though rather unnecessary, is to use a list comprehension combined with Pandas' string methods:
outcome = [(df.col1.str[num].str["value"]
              .rename(df.col1.str[num].str["name"][0]))
           for num in range(df.col1.str.len()[0])]
pd.concat(outcome, axis='columns')
@piRSquared's solution is the simplest, in my opinion.
You can write a function that will parse each cell in your Series and return a properly formatted Series and use apply to tuck the iteration away:
>>> def custom_parser(record):
... clean_record = {rec["name"]: rec["value"] for rec in record}
... return pd.Series(clean_record)
>>> df["col1"].apply(custom_parser)
   amount  frequency freq_unit
0       1          2     month
1       3          1     month
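One caveat worth noting for the level-2 concern: the positional df.col1.str[num] approach above would break if the element order changed, but the name-keyed approaches would not, since they map each 'name' to its 'value' regardless of position. A quick check with the reordered data:
data2 = {'col1': [
    [{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}],
    [{'name': 'amount', 'value': 3}, {'name': 'freq_unit', 'value': 'month'}, {'name': 'frequency', 'value': 1}],
]}
df2 = pd.DataFrame(data2)
# Same result as before, despite the shuffled second row.
print(pd.DataFrame([{x["name"]: x["value"] for x in lst} for lst in df2.col1]))
#    amount  frequency freq_unit
# 0       1          2     month
# 1       3          1     month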

Pandas extract value from a key-value pair

I have a DataFrame with output as shown below, and I am trying to extract specific text.
id,value
101,*sample value as shown below*
I am trying to extract the value corresponding to 'key' in this text.
Expected output:
id,key,id_new
101,Ticket-123,1001
Given below is what the data looks like:
{
    'fields': {
        'status': {
            'statusCategory': {
                'colorName': 'yellow',
                'name': 'In Progress',
                'key': 'indeterminate',
                'id': 4
            },
            'description': '',
            'id': '11000',
            'name': 'In Progress'
        },
        'summary': 'Sample Text'
    },
    'key': 'Ticket-123',
    'id': '1001'
}
Use Series.str.get:
df['key'] = df['value'].str.get('key')
df['id_new'] = df['value'].str.get('id')
print (df)
    id                                              value         key id_new
0  101  {'fields': {'status': {'statusCategory': {'col...  Ticket-123   1001
Tested DataFrame:
v = {
    'fields': {
        'status': {
            'statusCategory': {
                'colorName': 'yellow',
                'name': 'In Progress',
                'key': 'indeterminate',
                'id': 4
            },
            'description': '',
            'id': '11000',
            'name': 'In Progress'
        },
        'summary': 'Sample Text'
    },
    'key': 'Ticket-123',
    'id': '1001'
}
df = pd.DataFrame({'id': 101, 'value': [v]})
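As a usage note, .str.get simply performs element-wise indexing, so chaining it walks nested dicts as well; a small sketch pulling a deeper field from the same tested DataFrame:
# Chained element-wise lookups into the nested 'fields' dict.
df['status'] = df['value'].str.get('fields').str.get('status').str.get('name')
print(df[['id', 'key', 'id_new', 'status']])
#     id         key id_new       status
# 0  101  Ticket-123   1001  In Progress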

pandas same attribute comparison

I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
                   {'name': 'a', 'label': 'true', 'score': 8},
                   {'name': 'c', 'label': 'false', 'score': 10},
                   {'name': 'c', 'label': 'true', 'score': 4},
                   {'name': 'd', 'label': 'false', 'score': 10},
                   {'name': 'd', 'label': 'true', 'score': 6},
                   ])
I want to return the names whose "false"-label score is at least double their "true"-label score. In my example, it should return only the name "c".
First you can pivot the data, look at the ratio, and filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).gt(2)]
output:
label  false  true
name
c         10     4
If you only want the label, you can do:
new_df.index[new_df['false'].div(new_df['true']).gt(2)].values
which gives
array(['c'], dtype=object)
Update: since your data is the result of orig_df.groupby().count(), you could instead compute, on the raw frame:
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the rows with values <= 1/3 (a "false" score at least double the "true" score is equivalent to a 'true' fraction of at most 1/3).
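A minimal sketch of that update, assuming orig_df is the raw one-row-per-observation frame the counts came from (hypothetical data):
import pandas as pd

# Hypothetical raw observations: 10 'false' and 4 'true' rows for name 'c'.
orig_df = pd.DataFrame({
    'name': ['c'] * 14,
    'label': ['false'] * 10 + ['true'] * 4,
})

# Fraction of 'true' rows per name; false >= 2 * true  <=>  fraction <= 1/3.
frac_true = orig_df['label'].eq('true').groupby(orig_df['name']).mean()
print(frac_true[frac_true <= 1/3].index.tolist())  # ['c']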