How can I write a query to insert array values from a python dictionary in BigQuery?

I have a python dictionary that looks like this:
{
'id': 123,
'categories': [
{'category': 'fruit', 'values': ['apple', 'banana']},
{'category': 'animal', 'values': ['cat']},
{'category': 'plant', 'values': []}
]
}
I am trying to insert those values into a BigQuery table via the API using Python; I just need to format the above into an "INSERT table VALUES" query. The table needs to have the fields: id, categories.category, categories.values.
I need categories to basically be an array with the category and each category's corresponding values. The table is supposed to look sort of like this in the end - except I need it to be just one row per id, with the corresponding category fields nested and having the proper field name:
SELECT 123 as id, (["fruit"], ["apple", "banana"]) as category
UNION ALL (SELECT 123 as id, (["animal"], ["cat"]) as category)
UNION ALL (SELECT 123 as id, (["plant"], ["tree", "bush", "rose"]) as category)
I'm not really sure how to format the "INSERT" query to get the desired result. Can anyone help?
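For reference, I imagine the statement would need to look something like the sketch below (the table name is just a placeholder, and I'm not sure the syntax for the nested/repeated fields is right):
-- hedged sketch; `my_dataset.my_table` is a placeholder
INSERT INTO `my_dataset.my_table` (id, categories)
VALUES (
  123,
  [
    STRUCT('fruit' AS category, ['apple', 'banana'] AS `values`),
    STRUCT('animal' AS category, ['cat'] AS `values`),
    STRUCT('plant' AS category, ARRAY<STRING>[] AS `values`)
  ]
)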

If you want to load a dictionary to BigQuery using Python, you first have to prepare your data. I chose to convert the Python dictionary to a .json file and then load it to BigQuery using the Python API. However, according to the documentation, BigQuery has some limitations regarding loading nested .json data, among them:
Your .json file must be newline delimited, which means that each object must be on its own line in the file.
BigQuery does not support maps or dictionaries in JSON. Thus, you have to wrap your data in [], as you can see here.
For this reason, some modifications have to be made to the file before you can load it into BigQuery. I have created two scripts: the first converts the Python dict to a .json file, and the second formats that file as newline-delimited JSON and loads it into BigQuery.
First, convert the Python dict to a .json file. Notice that you have to wrap the data in []:
import json

py_dict = [{
    'id': 123,
    'categories': [
        {'category': 'fruit', 'values': ['apple', 'banana']},
        {'category': 'animal', 'values': ['cat']},
        {'category': 'plant', 'values': []}
    ]
}]

# write the dict (wrapped in a list) to a regular .json file
with open('json_data.json', 'w') as out_file:
    json.dump(py_dict, out_file)
Second, convert the .json file to newline-delimited JSON and load it into BigQuery:
import json
from google.cloud import bigquery

# reformat the file as newline-delimited JSON: one JSON object per line
with open('json_data.json', 'r') as read_file:
    data = json.load(read_file)

result = [json.dumps(record) for record in data]

with open('nd-processed.json', 'w') as obj:
    for i in result:
        obj.write(i + '\n')

# load the newline-delimited file into BigQuery
client = bigquery.Client()
dataset_id = 'sample'
table_id = 'json_mytable'

dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

with open('nd-processed.json', 'rb') as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for table load to complete.
print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))
Then, in the BigQuery UI, you can query your table as follows:
SELECT id, categories
FROM `test-proj-261014.sample.json_mytable`, UNNEST(categories) AS categories
And the output (result screenshot omitted) shows one row per unnested category, with its id, category and values.

You can use the query below, with your dictionary text embedded into it:
#standardSQL
WITH data AS (
  SELECT '''
{
'id': 123,
'categories': [
{'category': 'fruit', 'values': ['apple', 'banana']},
{'category': 'animal', 'values': ['cat']},
{'category': 'plant', 'values': ['tree', 'bush', 'rose']}
]
}
''' dict
)
SELECT
  JSON_EXTRACT_SCALAR(dict, '$.id') AS id,
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(cat, '$.category') AS category,
      ARRAY(
        SELECT TRIM(val, '"')
        FROM UNNEST(JSON_EXTRACT_ARRAY(cat, '$.values')) val
      ) AS `values`
    FROM UNNEST(JSON_EXTRACT_ARRAY(dict, '$.categories')) cat
  ) AS categories
FROM data
which produces the result below:
Row | id  | categories.category | categories.values
----|-----|---------------------|------------------
1   | 123 | fruit               | apple
    |     |                     | banana
    |     | animal              | cat
    |     | plant               | tree
    |     |                     | bush
    |     |                     | rose
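If the end goal is to actually write that row into a table rather than just select it, a hedged variation (the target table `mydataset.mytable` is a placeholder, assumed to have schema id INT64, categories ARRAY<STRUCT<category STRING, `values` ARRAY<STRING>>>) would be to prepend an INSERT clause and cast id, since JSON_EXTRACT_SCALAR returns a STRING:
INSERT INTO `mydataset.mytable` (id, categories)
WITH data AS (
  SELECT '''
{
'id': 123,
'categories': [
{'category': 'fruit', 'values': ['apple', 'banana']},
{'category': 'animal', 'values': ['cat']},
{'category': 'plant', 'values': ['tree', 'bush', 'rose']}
]
}
''' dict
)
SELECT
  CAST(JSON_EXTRACT_SCALAR(dict, '$.id') AS INT64) AS id,
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(cat, '$.category') AS category,
      ARRAY(
        SELECT TRIM(val, '"')
        FROM UNNEST(JSON_EXTRACT_ARRAY(cat, '$.values')) val
      ) AS `values`
    FROM UNNEST(JSON_EXTRACT_ARRAY(dict, '$.categories')) cat
  ) AS categories
FROM data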

Related

Pandas – Extracting a phrase in a dict column

I have specific text in one column that I want to extract, and I'm wondering if I could extract a specific part of each row in that column and add it to a new column.
From this:
|studios|
|-------|
|[{'mal_id': 14, 'name': 'Sunrise'}]|
|[{'mal_id': 34, 'name': 'Hal Film Maker'}]|
|[{'mal_id': 18, 'name': 'Toei Animation'}]|
|[]|
|[{'mal_id': 455, 'name': 'Palm Studio'}]|
To this:
|studios|
|-------|
|Sunrise|
|Hal Film Maker|
|Toei Animation|
|[]|
|Palm Studio|
You can use .str to access indexes/keys from the lists/dicts of items in a column, and use a combination of pipe and where to fall back to the original values where the result from .str returns NaN:
df['studios'] = df['studios'].str[0].str['name'].pipe(lambda x: x.where(x.notna(), df['studios']))
Note: you may need to convert the items in df['studios'] to actual objects, in case they're just strings that look like objects. To do that, run this before you run the above code:
import ast
df['studios'] = df['studios'].apply(ast.literal_eval)
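Put together as a self-contained sketch (the sample frame below is made up to mirror the question's data):
import pandas as pd

# hypothetical sample frame mirroring the question
df = pd.DataFrame({'studios': [
    [{'mal_id': 14, 'name': 'Sunrise'}],
    [{'mal_id': 34, 'name': 'Hal Film Maker'}],
    [],
    [{'mal_id': 455, 'name': 'Palm Studio'}],
]})

# take the first dict of each list, grab its 'name',
# and fall back to the original value (the empty list) where that yields NaN
df['studios'] = (
    df['studios'].str[0].str['name']
    .pipe(lambda x: x.where(x.notna(), df['studios']))
)

print(df)
# studios now holds: Sunrise, Hal Film Maker, [], Palm Studio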

Pandas to sqlalchemy when you have many different CSVs with different column names

I'm attempting to load many CSVs from different clients that all contain the same types of data, but with different column names. For example:
Source | Medium | Date
Src | Med | Conversion Date
Came From | Format | DateTime
All of these columns should be considered the same. So Source, Src, and Came From all need to go into a database column "Source." They could be named anything for different CSVs and be in any order, so some mapping needs to occur each time a different client is created.
Pandas has a to_sql function, but this requires you to manually input the column names and I don't want a ton of different tables, because I need to display the same table for each client later.
One solution I could implement is to have the interface require the administrator to manually select the columns and match them to the appropriate "master" column name. Then on the backend, just rename those columns before running to_sql.
Is there any other way that would be more efficient to perform this? Perhaps iterating through the dataframe and handling things row by row?
I think the best way is to create a table of relations (alias -> target column), or a config file. Here is just an example, but I think you can understand my approach:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, String, Integer

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base(bind=engine)


class ClientAlias(Base):
    # table for dynamic aliases
    __tablename__ = 'client_alias'
    id = Column(Integer, primary_key=True)
    alias = Column(String)
    target = Column(String)


class FinalTable(Base):
    # result table with standardized columns - for all clients
    __tablename__ = 'final_table'
    id = Column(Integer, primary_key=True)
    client_id = Column(Integer)
    source = Column(String)
    medium = Column(String)


Base.metadata.create_all(engine)


def prepare_aliases():
    """
    insert default mapping:
    Src -> source, Came From -> source, Med -> medium, etc...
    """
    for target, aliases in (
        ('source', ('Source', 'Src', 'Came From'), ),
        ('medium', ('Medium', 'Med', 'Format'), ),
    ):
        for alias in aliases:
            session.add(ClientAlias(target=target, alias=alias))
    session.commit()


# insert a few records with client column aliases
prepare_aliases()

# example processing
dfs = (
    # first client with specific columns
    pd.DataFrame.from_dict({
        'client_id': (1, 1, ),
        'Source': ('Source11', 'Source12'),
        'Medium': ('Medium11', 'Medium12'),
    }),
    # second client with specific columns
    pd.DataFrame.from_dict({
        'client_id': (2, 2, ),
        'Src': ('Source12', 'Source22'),
        'Med': ('Medium12', 'Medium22'),
    }),
    # one more client with specific columns
    pd.DataFrame.from_dict({
        'client_id': (3, 3, ),
        'Came From': ('Source13', 'Source23'),
        'Format': ('Medium13', 'Medium23'),
    }),
    # etc...
)

# create columns map {Src -> source, Came From -> source, etc...}
columns = {c.alias: c.target for c in session.query(ClientAlias).all()}
for df in dfs:
    df.rename(columns=columns, inplace=True)

# union and insert into final table
df = pd.concat(dfs, sort=False, ignore_index=True)
df.to_sql(
    con=engine,
    name=FinalTable.__tablename__,
    index=False,
    if_exists='append'
)
So you can add a new record into client_alias (or into the config file) whenever you get a new client or something changes, and everything keeps working without code changes or redeployment. Anyway, this is just an example - you can customize it as you wish.
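For instance, onboarding a new client whose CSV uses a different header could be as simple as this (the "Referrer" alias is hypothetical):
# hypothetical new client that calls its source column "Referrer"
session.add(ClientAlias(target='source', alias='Referrer'))
session.commit()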

Issues using mergeDynamicFrame on AWS Glue

I need to do a merge between two DynamicFrames on Glue.
I tried to use the mergeDynamicFrame function, but I keep getting the same error:
AnalysisException: "cannot resolve 'id' given input columns: [];;\n'Project ['id]\n+- LogicalRDD false\n"
Right now, I have 2 DynamicFrames:
df_1(id, col1, salary_src) and df_2(id, name, salary)
I want to merge df_2 into df_1 by the "id" column.
df_1 = glueContext.create_dynamic_frame.from_catalog(......)
df_2 = glueContext.create_dynamic_frame.from_catalog(....)
merged_frame = df_1.mergeDynamicFrame(df_2, ["id"])
applymapping1 = ApplyMapping.apply(frame = merged_frame, mappings = [("id", "long", "id", "long"), ("col1", "string", "name", "string"), ("salary_src", "long", "salary", "long")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(....)
As a test, I tried to pass a column from both DFs (salary and salary_src), and the error was:
AnalysisException: "cannot resolve 'salary_src' given input columns: [id, name, salary];;\n'Project [salary#2, 'salary_src]\n+- LogicalRDD [id#0, name#1, salary#2], false\n"
In this case, it seems to recognize the columns from df_2 (id, name, salary), but if I pass just one of the columns, or even all 3, it keeps failing.
It doesn't appear to be a mergeDynamicFrame issue.
Based on the information you provided, it looks like df_1, df_2 or both are not reading data correctly and are returning an empty DynamicFrame, which is why you have an empty list of input columns: "input columns: []".
If you are reading from S3, you must crawl your data before you can use glueContext.create_dynamic_frame.from_catalog.
You can also include df_1.show() or df_1.printSchema() after you create your DynamicFrame as a troubleshooting step, to make sure you are reading your data correctly before merging; see the sketch below.
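A minimal sketch of that troubleshooting step (the database and table names are placeholders for whatever your crawler created):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# placeholder database/table names
df_1 = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table_1")
df_1.printSchema()   # should list id, col1, salary_src
print(df_1.count())  # should be > 0; an empty frame would explain "input columns: []"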

Generate a pyarrow schema in the format of a list of pa.fields?

Is there a way for me to generate a pyarrow schema in this format from a pandas DF? I have some files which have hundreds of columns so I can't type it out manually.
fields = [
pa.field('id', pa.int64()),
pa.field('date', pa.timestamp('ns')),
pa.field('name', pa.string()),
pa.field('status', pa.dictionary(pa.int8(), pa.string(), ordered=False)),
]
I'd like to save it in a file and then refer to it explicitly when I save data with to_parquet.
I tried to use schema = pa.Schema.from_pandas(df) but when I print out schema it is in a different format (I can't save it as a list of data type tuples like the fields example above).
Ideally, I would take a pandas dtype dictionary and then remap it into the fields list above. Is that possible?
schema = {
'id': 'int64',
'date': 'datetime64[ns]',
'name': 'object',
'status': 'category',
}
Otherwise, I will make the dtype schema, print it out and paste it into a file, make any required corrections, and then do a df = df.astype(schema) before saving the file to Parquet. However, I know I can run into issues with fully null columns in a partition or object columns with mixed data types.
I really don't understand why pa.Schema.from_pandas(df) doesn't work for you.
As far as I understand, what you need is this:
schema = pa.Schema.from_pandas(df)
fields = []
for col_name, col_type in zip(schema.names, schema.types):
    fields.append(pa.field(col_name, col_type))
or using list comprehension:
schema = pa.Schema.from_pandas(df)
fields = [pa.field(col_name, col_type) for col_name, col_type in zip(schema.names, schema.types)]
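If you also want to persist that schema in a file and reuse it when writing Parquet, a minimal sketch (file names are just examples) could serialize it with pyarrow's IPC format and pass it explicitly when writing:
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.Schema.from_pandas(df)

# save the schema to a file (IPC-serialized) and read it back later
with open('my_schema.arrow', 'wb') as f:
    f.write(schema.serialize())
with open('my_schema.arrow', 'rb') as f:
    schema = pa.ipc.read_schema(f)

# write with the explicit schema instead of relying on per-partition inference
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, 'data.parquet')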

How to create multi index dataframe in pandas?

How to create multi-index data frames in pandas?
First option
One possible solution is to create the MultiIndex from tuples:
Start by defining the source tuples:
tpl = [('', 'Material'), ('', 'Brand'),
('abcd.com', 'Stock'), ('abcd.com', 'Sales'), ('abcd.com', 'Leftover'),
('xyz.com', 'Stock'), ('xyz.com', 'Sales'), ('xyz.com', 'Leftover')]
Each tuple contains respective column name at consecutive levels.
Then create the MultiIndex:
cols = pd.MultiIndex.from_tuples(tpl)
And now you can create your DataFrame, calling e.g.:
df = pd.DataFrame(<your_data>, columns=cols)
Another option
The second choice is to create the MultiIndex from arrays:
The source data is a list containing, for each level, a list of the column names at that level:
arr = [[ '', '', 'abcd.com', 'abcd.com', 'abcd.com', 'xyz.com', 'xyz.com', 'xyz.com' ],
[ 'Material', 'Brand', 'Stock', 'Sales', 'Leftover', 'Stock', 'Sales', 'Leftover']]
Then, to create the MultiIndex, call:
ind = pd.MultiIndex.from_arrays(arr)
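And then, just as in the first option, pass it as the columns argument. A minimal sketch with made-up data, to show the resulting two-level column layout:
import pandas as pd

arr = [['', '', 'abcd.com', 'abcd.com', 'abcd.com', 'xyz.com', 'xyz.com', 'xyz.com'],
       ['Material', 'Brand', 'Stock', 'Sales', 'Leftover', 'Stock', 'Sales', 'Leftover']]
ind = pd.MultiIndex.from_arrays(arr)

# one sample row of made-up values
df = pd.DataFrame([['Steel', 'Acme', 10, 4, 6, 7, 3, 4]], columns=ind)
print(df)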