This may be a strange question, but I was trying to create an empty DataFrame with the following code. I want the columns to be in the order I wrote them, but in the output they appear in a different order. Is there an intuitive reason why this is happening?
import pandas as pd

user_df = pd.DataFrame(columns={'NAME',
                                'AGE',
                                'EMAIL',
                                'PASSWORD',
                                'FAVORITE_TEAM'
                                })
user_df
Output:
PASSWORD EMAIL AGE NAME FAVORITE_TEAM
The reason is that you passed a set ({}), and sets have no defined order.
From the docs:
A set object is an unordered collection of distinct hashable objects.
With a set, the column order is arbitrary:
user_df = pd.DataFrame(columns={'NAME',
                                'AGE',
                                'EMAIL',
                                'PASSWORD',
                                'FAVORITE_TEAM'
                                })
print(user_df)
Empty DataFrame
Columns: [AGE, FAVORITE_TEAM, EMAIL, NAME, PASSWORD]
Index: []
If you use a list ([]) instead, the column order is preserved:
user_df = pd.DataFrame(columns=['NAME',
                                'AGE',
                                'EMAIL',
                                'PASSWORD',
                                'FAVORITE_TEAM'
                                ])
print(user_df)
Empty DataFrame
Columns: [NAME, AGE, EMAIL, PASSWORD, FAVORITE_TEAM]
Index: []
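If you reached for a set only to avoid duplicate column names, note that a plain dict preserves insertion order (Python 3.7+), so dict.fromkeys() can deduplicate while keeping your order. A small sketch (the duplicate 'EMAIL' is added just for illustration):
import pandas as pd

# Hypothetical column list that may contain duplicates.
raw_columns = ['NAME', 'AGE', 'EMAIL', 'PASSWORD', 'FAVORITE_TEAM', 'EMAIL']

# dict.fromkeys() drops duplicates while keeping insertion order.
ordered_unique = list(dict.fromkeys(raw_columns))

user_df = pd.DataFrame(columns=ordered_unique)
print(user_df.columns.tolist())
# ['NAME', 'AGE', 'EMAIL', 'PASSWORD', 'FAVORITE_TEAM']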
I am new to BigQuery and come from an AWS background.
I have a bucket with no structure, just files named YYYY-MM-DD-<SOME_ID>.csv.gzip.
The goal is to import these into BigQuery and then create another dataset with a subset table of the imported data. The subset should contain last month's data, exclude some rows with a WHERE statement, and exclude some columns.
There seem to be many alternatives using different operators. What would be the best practice here?
BigQueryCreateEmptyDatasetOperator(...)
BigQueryCreateEmptyTableOperator(...)
BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
I also found
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
GCSToBigQueryOperator(...)
When is this preferred?
This is my current code:
create_new_dataset_A = BigQueryCreateEmptyDatasetOperator(
    dataset_id=DATASET_NAME_A,
    project_id=PROJECT_ID,
    gcp_conn_id='_my_gcp_conn_',
    task_id='create_new_dataset_A')

load_csv = GCSToBigQueryOperator(
    bucket='cloud-samples-data',
    compression="GZIP",
    create_disposition="CREATE_IF_NEEDED",
    destination_project_dataset_table=f"{PROJECT_ID}.{DATASET_NAME_A}.{TABLE_NAME}",
    source_format="CSV",
    source_objects=['202*'],
    task_id='load_csv',
    write_disposition='WRITE_APPEND',
    schema_fields=[
        {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
)

create_new_dataset_B = BigQueryCreateEmptyDatasetOperator(
    dataset_id=DATASET_NAME_B,
    project_id=PROJECT_ID,
    gcp_conn_id='_my_gcp_conn_',
    task_id='create_new_dataset_B')
populate_new_dataset_B = BigQueryExecuteQueryOperator(...) / BigQueryInsertJobOperator / BigQueryUpsertTableOperator
Alternatives below:
populate_new_dataset_B = BigQueryExecuteQueryOperator(
    task_id='load_from_table_a_to_table_b',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    sql=f'''
        INSERT `{PROJECT_ID}.{DATASET_NAME_A}.D_EXCHANGE_RATE`
        SELECT col_x, col_y  # skip some columns from table_a
        FROM `{PROJECT_ID}.{DATASET_NAME_A}.S_EXCHANGE_RATE`
        WHERE col_x IS NOT NULL
    ''',
)
Does it keep track of which rows it has already loaded, given write_disposition='WRITE_APPEND'?
Does GCSToBigQueryOperator keep track of metadata, or will it load duplicates?
populate_new_dataset_B = BigQueryInsertJobOperator(
    task_id="load_from_table_a_to_table_b",
    configuration={
        "query": {
            "query": "{% include 'sql-file.sql' %}",
            "use_legacy_sql": False,
        }
    },
    dag=dag,
)
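As far as I understand, the destination table and write mode can also be set inside the job configuration itself rather than as operator arguments. Here is a minimal sketch of what I mean, reusing the names from my snippets above (the destination table name D_EXCHANGE_RATE is just taken from my earlier example, and I point it at DATASET_NAME_B since that is the new dataset):
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

populate_new_dataset_B = BigQueryInsertJobOperator(
    task_id="load_from_table_a_to_table_b",
    configuration={
        "query": {
            "query": f"""
                SELECT col_x, col_y  -- keep only the columns I need
                FROM `{PROJECT_ID}.{DATASET_NAME_A}.S_EXCHANGE_RATE`
                WHERE col_x IS NOT NULL  -- drop unwanted rows
            """,
            "useLegacySql": False,
            # Destination and write mode are part of the BigQuery query job
            # configuration instead of operator keyword arguments.
            "destinationTable": {
                "projectId": PROJECT_ID,
                "datasetId": DATASET_NAME_B,
                "tableId": "D_EXCHANGE_RATE",
            },
            "writeDisposition": "WRITE_APPEND",
            "createDisposition": "CREATE_IF_NEEDED",
        }
    },
    dag=dag,
)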
Is BigQueryInsertJobOperator more for scheduled ETL jobs? Example: https://github.com/simonbreton/Capstone-project/blob/a6563576fa63b248a24d4a1bba70af10f527f6b4/airflow/dags/sql/fact_query.sql.
Here they do not use write_disposition='WRITE_APPEND'; they use a WHERE statement instead. Why, and when is one preferred over the other?
I don't understand the last operator. When should it be used?
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html#howto-operator-bigqueryupserttableoperator
Which operator to use for populate_new_dataset_B?
Appreciate all help.
I would like to store an array of strings in my Postgres database. I have the code below, but it is not working as I want:
@Column({ type: 'text', array: true, nullable: true })
names: string[] = [];
I got the following error:
PostgreSQL said: malformed array literal: "["james"]"
Detail: "[" must introduce explicitly-specified array dimensions.
Anything I might be doing wrong?
I was able to resolve this with:
@Column('simple-array', { nullable: true })
city: string[];
This should work for an array:
@Column('text', { array: true })
names: string[];
I can't find anything about how to do this type of query in FaunaDB. I need to select only specific fields from a document, not all fields. I can select one field using the Select function, like below:
serverClient.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('products')), {
      size: 12,
    }),
    q.Lambda('X', q.Select(['data', 'title'], q.Get(q.Var('X'))))
  )
)
Forget the selectAll function, it's deprecated.
You can also return an object literal like this:
serverClient.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('products')), {
      size: 12,
    }),
    q.Lambda(
      'X',
      {
        title: q.Select(['data', 'title'], q.Get(q.Var('X'))),
        otherField: q.Select(['data', 'other'], q.Get(q.Var('X')))
      }
    )
  )
)
Also, you are missing the closing and opening quotation marks in your question at ['data, title']; it should be ['data', 'title'].
One way to achieve this would be to create an index that returns the values required. For example, if using the shell:
CreateIndex({
  name: "<name of index>",
  source: Collection("products"),
  values: [
    { field: ["data", "title"] },
    { field: ["data", "<another field name>"] }
  ]
})
Then querying that index would return you the fields defined in the values of the index.
Map(
  Paginate(
    Match(Index("<name of index>"))
  ),
  Lambda("product", Var("product"))
)
Although these examples are to be used in the shell, they can easily be used in code by adding a q. in front of each built-in function.
I have a morris.js graph. I have a table with 3 columns: id with values such as 1, 2, 3, usernames with values such as sophie, nick, Paul, and timesloggedin with values such as 69, 58, 4.
I created a chart that has the ids on the x-axis and the timesloggedin on the y-axis.
What I want is, instead of displaying the id number at the bottom of the chart under the bars, to show their usernames. You can see the chart here:
http://kleanthisg.work/chartsnew2.php
CODE:
http://kleanthisg.work/CODE.TXT
table: (screenshot not included)
user list: (screenshot not included)
You need to provide the username in the data and set it as the xkey:
Morris.Bar({
    element: 'chart_line_1',
    data: [{ id: '1', timesloggedin: 65, username: 'Paul' },
           { id: '5', timesloggedin: 10, username: 'John' },
           { id: '7', timesloggedin: 4, username: 'Steve' }],
    xkey: 'username',
    ykeys: ['timesloggedin'],
    labels: ['timesloggedin'],
    hideHover: 'auto',
});
I have a table with the following structure (schema screenshot omitted) and the following data in it:
[
  {
    "addresses": [
      { "city": "New York" },
      { "city": "San Francisco" }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:25 UTC"
  },
  {
    "addresses": [
      { "city": "New York" },
      { "city": "San Francisco" }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:16 UTC"
  }
]
What I'd like to do is recreate the same table (same structure) but with only the latest version of each row. In this example, let's say I'd like to group everything by name and take the row with the most recent createdAt.
I tried to do something like Google Big Query SQL - Get Most Recent Column Value, but I couldn't get it to work with RECORD and REPEATED fields.
I really hoped someone from the Google team would answer this question, as it is a very frequent topic/problem asked here on SO. BigQuery is definitely not friendly enough when it comes to writing nested/repeated results back to BigQuery from a BigQuery query.
So I will provide the workaround I found a relatively long time ago. I do not like it, but (and that is why I hoped for an answer from the Google team) it works. I hope you will be able to adapt it to your particular scenario.
So, based on your example, assume you have a table as below,
and you expect to get the most recent records based on the createdAt column, so the result will look like this:
The code below does this:
SELECT name, age, createdAt, addresses.city
FROM JS(
  ( // input table
    SELECT name, age, createdAt, NEST(city) AS addresses
    FROM (
      SELECT name, age, createdAt, addresses.city
      FROM (
        SELECT
          name, age, createdAt, addresses.city,
          MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
        FROM yourTable
      )
      WHERE createdAt = lastAt
    )
    GROUP BY name, age, createdAt
  ),
  name, age, createdAt, addresses, // input columns
  "[ // output schema
    {'name': 'name', 'type': 'STRING'},
    {'name': 'age', 'type': 'INTEGER'},
    {'name': 'createdAt', 'type': 'INTEGER'},
    {'name': 'addresses', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'city', 'type': 'STRING'}
     ]
    }
  ]",
  "function(row, emit) { // function
    var c = [];
    for (var i = 0; i < row.addresses.length; i++) {
      c.push({city: row.addresses[i]});
    };
    emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
  }"
)
The way the above code works: it implicitly flattens the original records; finds the rows that belong to the most recent records (partitioned by name and age); and assembles those rows back into their respective records. The final step is processing with a JS UDF to build a proper schema that can actually be written back to a BigQuery table as nested/repeated rather than flattened.
The last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
Please note: in this example there is only one nested field inside the addresses record, so the NEST() function worked. In scenarios where you have more than one field inside, the above approach still works, but you need to concatenate those fields to put them inside NEST(), and then do the extra splitting of those fields inside the JS function, etc.
You can see examples in the answers below:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
I hope this is a good foundation for you to experiment with and make your case work!
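These days, with standard SQL, the same "latest row per name" logic can usually be expressed without the JS() workaround by aggregating whole rows with ARRAY_AGG(... ORDER BY ... LIMIT 1), which keeps the nested/repeated addresses field intact. A minimal sketch using the Python client; the project, dataset, and table names are placeholders:
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project id

# Standard SQL: for each name, keep the whole row (including the repeated
# addresses field) that has the latest createdAt.
sql = """
    SELECT AS VALUE ARRAY_AGG(t ORDER BY createdAt DESC LIMIT 1)[OFFSET(0)]
    FROM `your-project.your_dataset.yourTable` AS t
    GROUP BY name
"""

# Write the result to a new table with the same nested/repeated schema.
job_config = bigquery.QueryJobConfig(
    destination="your-project.your_dataset.yourTable_latest",  # placeholder
    write_disposition="WRITE_TRUNCATE",
)
client.query(sql, job_config=job_config).result()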