What is a column family in GCP Bigtable and how is data stored in Bigtable?

In Cassandra, column families are just tables (see "What's the difference between creating a table and creating a columnfamily in Cassandra?"), but it seems like "column family" refers to something else in GCP Bigtable:
https://cloud.google.com/bigtable/docs/schema-design
What exactly is a column family in GCP Bigtable? Bigtable is a key-value store, right? And how does Bigtable store its data?
        column1              column2
row1    row1_column1_value   row1_column2_value
row2    row2_column1_value   row2_column2_value

Is it stored row by row, as

rowKey1:column1_value:column2_value
rowKey2:column1_value2:column2_value2

or column by column, as

rowKey1:column1_value
rowKey2:column1_value2
rowKey1:column2_value
rowKey2:column2_value2

A column family in Cloud Bigtable is a set of columns that are related to one another and/or typically used together. Grouping columns this way helps organize the data and lets you limit what you pull back when reading.
The Bigtable documentation illustrates this with a sample table structure (shown here with example values); to further illustrate it, here is the corresponding table as printed by the cbt tool:
----------------------------------------
r1
  cf1:c1                        # 2021/12/20-06:27:45.349000
    "val1"
  cf1:c2                        # 2021/12/20-06:29:15.517000
    "val3"
  cf2:c2                        # 2021/12/20-06:48:09.685000
    "val5"
----------------------------------------
r2
  cf1:c1                        # 2021/12/20-06:28:33.973000
    "val2"
  cf1:c2                        # 2021/12/20-06:29:29.219000
    "val4"
  cf2:c1                        # 2021/12/20-06:49:24.112000
    "val6"
Additionally, you may try the quickstart guide to get familiar with Bigtable.
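To see in code how a column family limits what gets pulled back, here is a minimal sketch using the Python client library (the project, instance, and table ids are placeholders; the row key and family names are the ones from the cbt listing above):

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="your-project")
instance = client.instance("your-instance")
table = instance.table("your-table")

# Ask only for cells in the cf1 column family of row r1; cf2 is never returned.
row = table.read_row(b"r1", filter_=row_filters.FamilyNameRegexFilter("cf1"))

if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            # Each cell carries a value and a timestamp; Bigtable can keep several versions per cell.
            print(family, qualifier.decode(), cells[0].value.decode(), cells[0].timestamp)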

Related

Flattening array column and creating Index column for array elements at the same time -- Azure Data Factory

I have a dataset that in a simple representation looks like:
Col1    Col2
1       [A,B]
2       [C]
I want to denormalize the data and create another column while flattening, which would be the index of the elements in the array. The desired result set would look like:
Col1    Col2    Col3
1       A       1
1       B       2
2       C       1
I was able to achieve the requirement using the mapindex, keyvalues and mapassociation expression functions.
Somehow I feel this is not the right way to do it and that there must be a better, easier approach. I read the Microsoft documentation and couldn't find one.
Can someone help/guide me to a better solution?
Edit 1:
Source is Azure Blob Storage. I only have access to ADF. The data is a complex XML document. All transformations are to be performed with ADF only.
Edit 2:
Target is SAP BW, but I don't have control over it; I can only write to it.
You can use the Flatten transformation to flatten the array values and the Window transformation to get the row number, partitioned by Col1.
Flatten transformation: Unroll by the array column (Col2).
Window transformation: Connect the output of the Flatten transformation to the Window transformation.
Set a partition column in the Over clause.
Set a sort column to define the data ordering.
In the Window columns setting, define the aggregation rowNumber() to get the index value based on Col1.
Output of Window transformation:
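For comparison only, here is the same flatten-plus-row-number logic expressed in pandas; this is just a sketch of what the two transformations do (it is not something that runs inside ADF), using the column names from the question:

import pandas as pd

df = pd.DataFrame({"Col1": [1, 2], "Col2": [["A", "B"], ["C"]]})

# Flatten: one output row per array element (the "Unroll by" step on Col2).
flat = df.explode("Col2").reset_index(drop=True)

# Window: rowNumber() partitioned by Col1 becomes a 1-based counter per group.
flat["Col3"] = flat.groupby("Col1").cumcount() + 1

print(flat)  # rows: (1, A, 1), (1, B, 2), (2, C, 1)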

How can I write a DataFrame to a specific partition of a date-partitioned BQ table using to_gbq()?

I have a dataframe which I want to write to a date-partitioned BQ table. I am using the to_gbq() method to do this. I am able to replace or append the existing table, but I can't write to a specific partition of the table using to_gbq().
Since to_gbq() doesn't support this yet, I created a code snippet for doing it with the BigQuery API client.
Assuming you have an existing date-partitioned table that was created like this (you don't need to pre-create it, more details later):
CREATE TABLE
  your_dataset.your_table (transaction_id INT64, transaction_date DATE)
PARTITION BY
  transaction_date
and you have a DataFrame like this:
import pandas
import datetime

records = [
    {"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 3, "transaction_date": datetime.date(2021, 10, 21)},
]
df = pandas.DataFrame(records)
here's how to write to a specific partition:
from google.cloud import bigquery

client = bigquery.Client(project='your_project')

job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    # This is needed if the table doesn't exist, but won't hurt otherwise:
    time_partitioning=bigquery.table.TimePartitioning(type_="DAY"),
)

# Include the target partition in the table id:
table_id = "your_project.your_dataset.your_table$20211021"

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)  # Make an API request
job.result()  # Wait for the job to finish
The important part is the $... suffix in the table id: it tells the API to only update that specific partition. If your data contains records that belong to a different partition, the operation will fail.
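If the target date comes from your data rather than being hard-coded, you can build the decorator from it; for DAY partitioning the partition id is just the date formatted as YYYYMMDD (a small sketch reusing the example date and placeholder names from above):

import datetime

target_date = datetime.date(2021, 10, 21)
table_id = "your_project.your_dataset.your_table$" + target_date.strftime("%Y%m%d")
print(table_id)  # your_project.your_dataset.your_table$20211021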
I believe that writing to a specific partition is not supported by to_gbq() yet.
You can check the related issue here: https://github.com/pydata/pandas-gbq/issues/43.
I would recommend using the Google BigQuery API client library instead: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
You can upload a dataframe to a BigQuery table with it, too:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
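That sample boils down to something like the following sketch (project, dataset, and table names are placeholders); note that without a "$YYYYMMDD" suffix on the table id, the load targets the table as a whole rather than a single partition:

import pandas
from google.cloud import bigquery

client = bigquery.Client(project="your_project")
df = pandas.DataFrame({"transaction_id": [1, 2, 3]})

# Load the whole DataFrame into the table; the schema is inferred from the DataFrame.
job = client.load_table_from_dataframe(df, "your_project.your_dataset.your_table")
job.result()  # Wait for the load job to complete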

How to get a repeatable sample using Presto SQL?

I am trying to get a sample of data from a large table and want to make sure this can be repeated later on. Other SQL dialects allow repeatable sampling, either by setting a seed using set.seed(integer) or with a repeatable (integer) command. However, this is not working for me in Presto. Is such a command not available yet? Thanks.
One solution is to simulate the sampling by adding a column (or creating a view) with random content (such as a UUID) and then selecting rows by filtering on that column (for example, UUIDs ending with '1'). You can tune the condition to get the sample size you need.
By design, the result is random and also repeatable across multiple runs.
If you are using Presto 0.263 or higher you can use key_sampling_percent to reproducibly generate a double between 0.0 and 1.0 from a varchar.
For example, to reproducibly sample 20% of records in table using the id column:
select
id
from table
where key_sampling_percent(id) < 0.2
If you are using an older version of Presto (e.g. AWS Athena), you can use what's in the source code for key_sampling_percent:
select
id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero because of the negative exponent.
select id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
You may create a simple intermediate table with selected ids:
CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);
This table will contain only the sampled ids and will be ready to use downstream in your analysis by joining it with the data of interest.

pandas read sql query improvement

So I downloaded some data from a database which conveniently has a sequential ID column. I saved the max ID for each table I am querying to a small text file, which I read into memory (the max_ids dataframe).
I was trying to create a query that says: give me all of the data where Idcol > max_id for that table. I was getting errors that Series are mutable and therefore can't be used as a parameter. The code below ended up working, but it was literally just a guess-and-check process: I turned the value into an int and then a string, which basically extracted the actual value from the dataframe.
Is this the correct way to accomplish what I am trying to do before I replicate it for about 32 different tables? I want to always be able to grab only the latest data from these tables, which I then process in pandas and eventually consolidate and export to another database.
df= pd.read_sql_query('SELECT * FROM table WHERE Idcol > %s;', engine, params={'max_id', str(int(max_ids['table_max']))})
Can I also make the table name dynamic? I need to go through a list of tables. The database is MS SQL and I am using pymssql and SQLAlchemy.
Here is an example of the output when I run max_ids['table_max']:
Out[11]:
0    1900564174
Name: max_id, dtype: int64
Assuming that your max_ids DF looks like the following:

In [24]: max_ids
Out[24]:
   table  table_max
0  tab_a      33333
1  tab_b     555555
2  tab_c   66666666
You can do it this way:

qry = 'SELECT * FROM {} WHERE Idcol > :max_id'

for i, r in max_ids.iterrows():
    print('Executing: [%s], max_id: %s' % (qry.format(r['table']), r['table_max']))
    pd.read_sql_query(qry.format(r['table']), engine, params={'max_id': r['table_max']})
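If the named :max_id placeholder is not bound correctly by your driver, one option is to make the binding explicit with sqlalchemy.text(); the sketch below builds on the loop above and also keeps each result around, keyed by table name (the query, column, and variable names are the ones from this thread):

import pandas as pd
import sqlalchemy as sa

latest = {}  # table name -> dataframe of rows newer than the stored max id
for i, r in max_ids.iterrows():
    stmt = sa.text('SELECT * FROM {} WHERE Idcol > :max_id'.format(r['table']))
    latest[r['table']] = pd.read_sql_query(stmt, engine, params={'max_id': int(r['table_max'])})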

Speeding up PostgreSQL query where data is between two dates

I have a large table (> 50m rows) which has some data with an ID and timestamp:
id, timestamp, data1, ..., dataN
...with a multi-column index on (id, timestamp).
I need to query the table to select all rows with a certain ID where the timestamp is between two dates, which I am currently doing using:
SELECT * FROM mytable WHERE id = x AND timestamp BETWEEN y AND z
This currently takes over 2 minutes on a high end machine (2x 3Ghz dual-core Xeons w/HT, 16GB RAM, 2x 1TB drives in RAID 0) and I'd really like to speed it up.
I have found this tip which recommends using a spatial index, but the example it gives is for IP addresses. However, the speed increase (436s to 3s) is impressive.
How can I use this with timestamps?
That tip is only suitable when you have two columns A and B and use queries like:
where 'a' between A and B
That's not:
where A between 'a' and 'b'
Using an index on date(column) rather than on the column itself could speed it up a little bit.
Could you EXPLAIN the query for us? Then we'd know how the database executes it. And what about the configuration: what are the settings for shared_buffers and work_mem? When did you (or your system) last run VACUUM and ANALYZE? And lastly, what OS and PostgreSQL version are you using?
You can create wonderful indexes, but without proper settings the database can't use them very efficiently.
Make sure the index is on TableID + TableTimestamp, and that you run a query like:
SELECT
    ....
FROM YourTable
WHERE TableID = ..YourID..
  AND TableTimestamp >= ..startrange..
  AND TableTimestamp <= ..endrange..
If you apply functions to the table's TableTimestamp column in the WHERE clause, you will not be able to make full use of the index.
If you are already doing all of this, then your hardware might not be up to the task.
If you are using version 8.2 or later, you should try a row-value comparison:
WHERE (TableID, TableTimestamp) >= (..YourID.., ..startrange..)
  AND (TableID, TableTimestamp) <= (..YourID.., ..endrange..)