Pandas: Rename duplicate columns in same dataframe and merge

I have the following:
mylist = ["name", "state", "name", "city"]
newlist = ["name1", "state", "name2", "city"]
I have renamed the duplicates. I would now like to merge name1 and name2 and rename the result to name.

The DataFrame rename method works well for this and is clean:
rename accepts a dictionary whose key: value pairs map the old column name to the new one.
Example:
new_df = "read in data source"
col_map = {"name1": "name",
"state": "state",
"name2": "name",
"city": "city"
}
new_df = new_df.rename(columns=col_map)
This should work fine; note that rename returns a new DataFrame unless you pass inplace=True.
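If the goal is a single name column rather than two columns that both end up called "name", one way is to coalesce the two columns first; the sketch below uses a hypothetical sample frame, with name1 taking precedence where both columns are filled:
import pandas as pd

# hypothetical frame with the renamed duplicate columns
df = pd.DataFrame({"name1": ["a", None], "state": ["NY", "CA"],
                   "name2": [None, "b"], "city": ["x", "y"]})

# coalesce name1 and name2 (name1 wins where both are filled), then drop the helpers
df["name"] = df["name1"].combine_first(df["name2"])
df = df.drop(columns=["name1", "name2"])
print(df)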

Related

Insert into table using data from JSON value in PostgreSQL

I am working on a sensitive migration. The scenario is as follows:
I have a new table that I need to populate with data
There is an existing table with a column (type = json) that contains an array of objects such as:
[
{
"id": 0,
"name": "custom-field-0",
"label": "When is the deadline for a response?",
"type": "Date",
"options": "",
"value": "2020-10-02",
"index": 1
},
{
"id": 1,
"name": "custom-field-1",
"label": "What territory does this relate to?",
"type": "Dropdown",
"options": "UK, DE, SE, DK, BE, NL, IT, FR, ES, AT, CH, NO, US, SG, Other",
"value": " DE",
"index": 2
}
]
I need to essentially map the values in this column to my new table. I have worked with JSON data in PostgreSQL before, where I was dealing with a single object in the JSON, but never with arrays of objects and at this kind of scale.
So, just to summarise: how do you iterate over every row, and every object in the array, and insert that data into a new table?
EDIT
I have been experimenting with some functions, and I found a couple that seem promising: json_array_elements_text and json_array_elements, as they allowed me to add multiple rows to the new table from this array of objects.
However, my issue is that I need to map certain values to the new table.
INSERT INTO form_field_value ("name", "label", "inputType", "options", "form", "workspace")
SELECT <<HERE IS WHERE I NEED TO EXTRACT VALUES FROM THE JSON ARRAY>>, task.form, task.workspace
FROM task;
EDIT 2
I have been playing around some more with the above functions, but reached a slight issue.
INSERT INTO form_field_value ("name", "label", "inputType", "options", "form", "workspace")
SELECT cf ->> 'name',
(cf ->> 'label')
...
FROM jsonb_array_elements(task."customFields") AS t(cf);
My issue lies in the FROM clause: customFields is the array of objects, but I also need to get the form and workspace attributes from this table too. Plus I am pretty sure that the FROM clause would not work anyway, as it will probably complain about task."customFields" not being specified or something.
Here is the select statement that uses json_array_elements and a lateral join in the from clause to flatten the data.
select j ->> 'name' as "name", j ->> 'label' as "label",
j ->> 'type' as "inputType", j ->> 'options' as "options", form, workspace
from task
cross join lateral json_array_elements("customFields") as l(j);
The from clause can be less verbose
from task, json_array_elements("customFields") as l(j)
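Putting that together with the insert from the question, the full statement might look something like this (the column list is taken from the question):
INSERT INTO form_field_value ("name", "label", "inputType", "options", "form", "workspace")
SELECT j ->> 'name', j ->> 'label', j ->> 'type', j ->> 'options', form, workspace
FROM task
CROSS JOIN LATERAL json_array_elements("customFields") AS l(j);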
You can try to use json_to_recordset:
select * from json_to_recordset('
[
{
"id": 0,
"name": "custom-field-0",
"label": "When is the deadline for a response?",
"type": "Date",
"options": "",
"value": "2020-10-02",
"index": 1
},
{
"id": 1,
"name": "custom-field-1",
"label": "What territory does this relate to?",
"type": "Dropdown",
"options": "UK, DE, SE, DK, BE, NL, IT, FR, ES, AT, CH, NO, US, SG, Other",
"value": " DE",
"index": 2
}
]
') as x(id int, name text,label text,type text,options text,value text,index int)
For inserting the records you can use SQL like this:
INSERT INTO form_field_value ("name", "label", "inputType", "options", "form", "workspace")
SELECT x.name, x.label, x.type, x.options, task.form, task.workspace
FROM
task,
json_to_recordset(task."customFields") AS
x (id int, name text, label text, type text, options text, value text, index int)

How to filter entities with nested arrays with CosmosDB

I have an entity like this:
{
"id": "xxxx",
"attributes": [{
"name": "name-01",
"value": "value-01"
}, {
"name": "name-02",
"value": "value-02"
}
]
}
Our "questions" to data usually: Give me entities with attribute or attribute with particular value;
in SQL it was written like as:
select *
from c
where
and array_contains(c.attributes, { "name": "name-01", "value": "value-01" }, true)
and array_contains(c.attributes, { "name": "name-02", "value": "value-02" }, true)
but I would like to extend the model to allow suggested values in each attribute, transforming an entity to:
{
"id": "xxxx",
"attributes": [{
"name": "name_01",
"value": "value-01",
"suggestions": ["a", "b", "c"]
}, {
"name": "name_02",
"value": "value-02",
"suggestions": ["a", "d", "e"]
}
]
}
With that structure I would like to ask: give me all entities that have the specified attribute and whose value equals "XYZ" or whose suggestions array contains "XYZ";
In the general scenario, if the value is always also added into the suggestions array, the ask would be "give me all entities that have the specified attribute and whose suggestions contain XYZ".
N.B. I would also like to make queries such as: give me all entities that have ALL of the specified attributes, with a suggestions constraint on each.
Please suggest how to write such queries, or how to restructure the entities in Cosmos DB;
P.S. We could technically switch from the SQL API to another protocol if that makes such queries easier;
This should be doable using ARRAY_CONTAINS along with iterating the attributes array.
Give me items with value "value-01" or suggestion "f":
SELECT DISTINCT VALUE(c)
FROM c JOIN attr IN c.attributes
WHERE attr["value"] = "value-01" OR ARRAY_CONTAINS(attr.suggestions, "f")
Give me items with value "value-01" or both suggestions "a" and "f":
SELECT DISTINCT VALUE(c)
FROM c JOIN attr IN c.attributes
WHERE attr["value"] = "value-01" OR
ARRAY_CONTAINS(attr.suggestions, "a") AND ARRAY_CONTAINS(attr.suggestions, "f")
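For the N.B. case (every specified attribute must be present, each with its own suggestions constraint), one sketch is to use correlated EXISTS subqueries; the attribute names and suggestion values below are only examples:
SELECT VALUE c
FROM c
WHERE EXISTS(SELECT VALUE a FROM a IN c.attributes
             WHERE a.name = "name_01" AND ARRAY_CONTAINS(a.suggestions, "a"))
AND EXISTS(SELECT VALUE a FROM a IN c.attributes
           WHERE a.name = "name_02" AND ARRAY_CONTAINS(a.suggestions, "d"))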

How to generate JSON array from multiple rows, then return with values of another table

I am trying to build a query which combines rows of one table into a JSON array; I then want that array to be part of the returned row.
I know how to do a simple query like
SELECT *
FROM public.template
WHERE id=1
And I have worked out how to produce the JSON array that I want
SELECT array_to_json(array_agg(to_json(fields)))
FROM (
SELECT id, name, format, data
FROM public.field
WHERE template_id = 1
) fields
However, I cannot work out how to combine the two, so that the result is a number of fields from public.template with the output of the second query being one of the returned fields.
I am using PostgreSQL 9.6.6
Edit: as requested, here is more information: a definition of the field and template tables and a sample of each query's output.
Currently, I have a JSONB column on the template table which I am using to store an array of fields, but I want to move fields to their own table so that I can more easily enforce a schema on them.
Template table contains:
id
name
data
organisation_id
But I would like to remove data and replace it with the field table which contains:
id
name
format
data
template_id
At the moment the output of the first query is:
{
"id": 1,
"name": "Test Template",
"data": [
{
"id": "1",
"data": null,
"name": "Assigned User",
"format": "String"
},
{
"id": "2",
"data": null,
"name": "Office",
"format": "String"
},
{
"id": "3",
"data": null,
"name": "Department",
"format": "String"
}
],
"id_organisation": 1
}
This output is what I would like to recreate using one query and both tables. The second query outputs this, but I do not know how to merge it into a single query:
[{
"id": 1,
"name": "Assigned User",
"format": "String",
"data": null
},{
"id": 2,
"name": "Office",
"format": "String",
"data": null
},{
"id": 3,
"name": "Department",
"format": "String",
"data": null
}]
The feature you're looking for is jsonb concatenation, using the || operator. It's available since PostgreSQL 9.5.
SELECT to_jsonb(template) || jsonb_build_object('data',
       (SELECT jsonb_agg(to_jsonb(field)) FROM field WHERE field.template_id = template.id))
FROM template;
Sorry for poorly phrasing what I was trying to achieve. After hours of Googling I have worked it out, and it was a lot simpler than I thought.
SELECT template.id, template.name, field_data.data
FROM public.template, (
SELECT array_to_json(array_agg(to_json(fields))) AS data
FROM (
SELECT id, name, format, data
FROM public.field
WHERE template_id = 1
) fields
) AS field_data
WHERE template.id = 1
I wanted the result of the subquery to be a column in the output rather than compiling the entire output table as JSON.
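A correlated-subquery variant (a sketch over the same template and field tables) keeps the aggregation tied to each template row instead of hard-coding the template_id in a derived table:
SELECT t.id, t.name,
       (SELECT json_agg(to_json(f))
        FROM public.field f
        WHERE f.template_id = t.id) AS data
FROM public.template t
WHERE t.id = 1;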

No schema when creating Sheets-based external table from command line

Having found that, for one specific Sheets document I was trying to reference as an external table, the heading row was being included in the data when executing queries*, I decided to drop the table and recreate it using a definition file which would definitely expose the options from the docs. It didn't seem to work: no schema is created, despite being defined in the file.
I've recreated the issue with a simple sheet with 3 columns and a frozen header row and the following test.def file:
{
"autodetect": false,
"schema": {
"fields": [
{"name": "c1", "type": "STRING", "mode": "nullable"},
{"name": "c2", "type": "STRING", "mode": "nullable"},
{"name": "c3", "type": "STRING", "mode": "nullable"},
]
},
"sourceFormat": "GOOGLE_SHEETS",
"sourceUris": [
"https://docs.google.com/spreadsheets/..."
],
"googleSheetsOptions": {
"skipLeadingRows": 1
}
}
and then I try to create the table using:
bq mk myproject:mydataset.mytable < test.def
the table is created but no schema is present - what am I doing wrong?
* this issue remains, but I cannot identify why 95% of the time the table is created OK and the first/header row is correctly excluded from the data returned by a query, yet in one specific case, created the same way as all the others, the header row is returned in the data ...
Odd :(
OK so the correct syntax is:
bq mk --external_table_definition=myfile.def project:dataset.table
This also allows you to tell Google to skip leading rows on the sheet (at the time of writing this is not possible from the BigQuery UI).
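With the test.def file from the question, the command would be something like the following (project, dataset, and table names are placeholders):
bq mk --external_table_definition=test.def myproject:mydataset.mytable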

BigQuery: Create column of JSON datatype

I am trying to load json with the following schema into BigQuery:
{
key_a:value_a,
key_b:{
key_c:value_c,
key_d:value_d
},
key_e:{
key_f:value_f,
key_g:value_g
}
}
The keys under key_e are dynamic, i.e. in one response key_e will contain key_f and key_g, and for another response it will instead contain key_h and key_i. New keys can be created at any time, so I cannot create a record with nullable fields for all possible keys.
Instead I want to create a column with a JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e as a column with STRING datatype, but its value is analysed as JSON and so the load fails.
How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?
Having your JSON as a single string column inside BigQuery is definitely an option. If you have a large volume of data this can end up with a high query price, as all your data will end up in one column, and the actual querying logic can become quite messy.
If you have the luxury of slightly changing your "design", I would recommend considering the one below, where you can employ REPEATED mode.
Table schema:
[
{ "name": "key_a",
"type": "STRING" },
{ "name": "key_b",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{ "name": "key",
"type": "STRING"},
{ "name": "value",
"type": "STRING"}
]
},
{ "name": "key_e",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{ "name": "key",
"type": "STRING"},
{ "name": "value",
"type": "STRING"}
]
}
]
Example of JSON to load
{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}
Please note: it should be newline-delimited JSON, so each row must be on one line.
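To load rows in that shape, something along these lines should work (the dataset, table, and file names here are placeholders):
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable data.json schema.json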
You can't do this directly with BigQuery, but you can make it work in two passes:
(1) Import your JSON data as a CSV file with a single string column.
(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.
Example
I'll start with some JSON:
{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}
Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn -- 'þ') or it tripped over the default ','.
Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.
Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:
a (INTEGER)
b (STRING)
c (STRING -- packed JSON)
We can do this with a UDF:
// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
'extractAndRepack', // Name of the function exported to SQL
['blob'], // Names of input columns
[{'name': 'a', 'type': 'integer'}, // Output schema
{'name': 'b', 'type': 'string'},
{'name': 'c', 'type': 'string'}],
function (row, emit) {
var parsed = JSON.parse(row.blob);
var repacked = JSON.stringify(parsed.c);
emit({a: parsed.a, b: parsed.b, c: repacked});
}
);
And a corresponding query:
SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)
Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.
Row a b c
1 0 zero {"woodchuck":"a"}
2 1 one {"chipmunk":"b"}
3 2 two {"squirrel":"c"}
4 3 three {"chinchilla":"d"}
5 4 four {"capybara":"e"}
6 5 five {"housemouse":"f"}
7 6 six {"molerat":"g"}
8 7 seven {"marmot":"h"}
9 8 eight {"badger":"i"}
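Once the data is repacked like this, the packed c column can be queried with JSON_EXTRACT, for example (the target table name here is hypothetical):
SELECT a, b, JSON_EXTRACT(c, '$.woodchuck') AS woodchuck
FROM JsonAnyKey.repacked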
One way to do it is to load this file as CSV instead of JSON (and quote the values or eliminate newlines in the middle); then it will become a single STRING column inside BigQuery.
P.S. You are right that having a native JSON data type would have made this scenario much more natural, and the BigQuery team is well aware of it.