Getting table creation time in BigQuery - google-bigquery

How do you get the creation time for a table in the dataset?
bq show my_project:my_dataset.my_table
gives you
Table my_project:my_dataset.my_table
   Last modified         Schema        Total Rows   Total Bytes   Expiration
 ----------------- ------------------ ------------ ------------- ------------
  16 Oct 14:47:41   |- field1: string   3            69
                    |- field2: string
                    |- field3: string
We can use the "Last Modified" date but its missing the year!. Also there needs to be a cryptic log applied to parse the date out.
Is this meta information available through any other specific 'bq' based commands?
I am looking to use this information to determine a appropriate table decorator that can be used on the table since it seems like if the decorator is going back 4 hours (on recurring basis) and the table/partition has existed for only 3hrs the query errors out.
Ideally it would be nice if the decorator usage defaults the time window to "now - table creation time" if the specified window was larger than "now-table creation time".

FWIW this information is available in the API, which the bq tool calls under the covers: https://developers.google.com/bigquery/docs/reference/v2/tables#resource

If you use bq --format=prettyjson (or --format=json) you can get the information easily:
$ bq --format=prettyjson show publicdata:samples.wikipedia
{
"creationTime": "1335916132870", ...
}
This is the exact value to use in the table decorator.
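If you plug that value into a shell pipeline, you can also implement the clamping behaviour the question asks for. A minimal sketch, assuming jq is installed and using a placeholder table name:
# Clamp a 4-hour range decorator to the table's creation time (legacy SQL).
CREATED_MS=$(bq --format=json show my_project:my_dataset.my_table | jq -r '.creationTime')
FOUR_HOURS_AGO_MS=$(( ( $(date +%s) - 4*3600 ) * 1000 ))
START_MS=$(( CREATED_MS > FOUR_HOURS_AGO_MS ? CREATED_MS : FOUR_HOURS_AGO_MS ))
bq query "SELECT COUNT(*) FROM [my_project:my_dataset.my_table@${START_MS}-]"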
While I'm not sure that I like the idea of having a 'really low start value' be interpreted as table creation time, I've got other options:
1. Table#0 means the table at creation time.
2. Table#0 means the table at the earliest time at which a snapshot is available.
I'm leaning towards #2, since snapshots can only go back 7 days in time.

Related

After importing a metric into Victoria Metrics, the metric is repeated for 5 minutes. What controls this behavior?

I am writing some software that will be pushing data to Victoria Metrics, as below:
curl -d 'foo{bar="baz"} 30' -X POST 'http://[Victoria]/insert/0/prometheus/api/v1/import/prometheus'
I noticed that if I push a single metric like this, it shows up not as a single data point but rather repeatedly, as if it were being scraped every 15 seconds, either until I push a new value for that metric or until 5 minutes pass.
What setting/mechanism is causing this 5-minute repeat period?
Pushing data with a timestamp does not change this; the metric still gets repeated for 5 minutes after that time, or until a new value arrives.
I don't necessarily need to alter this behavior, just trying to understand why it's happening.
How do you query the database?
I guess this behaviour is due to the range query concept and ephemeral datapoints; check this out:
https://docs.victoriametrics.com/keyConcepts.html#range-query
The interval between datapoints depends on the step parameter, which is 5 minutes when omitted.
If you want to receive only the real datapoints, use the export functions:
https://docs.victoriametrics.com/#how-to-export-time-series
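For example, exporting just the raw samples of the series pushed above (the address is a placeholder):
curl 'http://<victoria-metrics-addr>/api/v1/export' -d 'match[]=foo{bar="baz"}'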
The VictoriaMetrics TSDB has ephemeral datapoints, which fill gaps with the closest sample to the left of the requested timestamp.
So if you make the instant request:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z"
The time range in which VictoriaMetrics will try to locate a missing data sample is equal to 5m by default and can be overridden via the step parameter.
step - optional max lookback window for searching for raw samples when executing the query. If step is skipped, then it is set to 5m (5 minutes) by default.
GET | POST /api/v1/query?query=...&time=...&step=...
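For example, repeating the instant query above with a one-minute lookback window instead of the default 5m:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z&step=1m"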
You can read more in the key concepts part of the documentation: https://docs.victoriametrics.com/keyConcepts.html
There you can also find information about range queries and other TSDB concepts.

bq decorator behavior against disk and streaming buffer

I am trying to utilize BigQuery's decorators, but there are some behaviors I want to confirm.
After some experimenting, we found that the query result for the same absolute interval is not the same depending on when the query is executed. If I query a live-streaming table over a very recent hour interval by, say, batch-issuing 60 queries, each with a granularity of one minute, I see a very even distribution of data output across the queries. However, if I query the same hour interval after, say, 2 hours, the output becomes very skewed: I see many minutes with output size 0 and then a spike in one minute interval that contains almost all the data of that hour.
For example, for querying the same table with absolute value from timestamp 8:00AM to 9:00AM
If I execute the query at, say, 9:10AM, I get a distribution of output rows like:
8:01AM - 8:02AM: 12
8:02AM - 8:03AM: 9
8:03AM - 8:04AM: 10
8:04AM - 8:05AM: 22
8:05AM - 8:06AM: 15
…
If I execute the query at, say, 11:00AM, I get a number of output rows like:
8:01AM - 8:02AM: 0
8:02AM - 8:03AM: 0
8:03AM - 8:04AM: 0
8:04AM - 8:05AM: 0
8:05AM - 8:06AM: 0
…
8:20AM - 8:21AM: 123
…
I assume that the difference is caused by whether the data is in the streaming buffer or on disk. However, this undermines the idempotency of querying a given range of the same table and adds a lot of complexity to using it. Therefore I want to have some expected behaviors clarified.
1. Is the difference in this query result really caused by whether the data resides in the streaming buffer or on disk?
2. Assuming the difference is because of (1): when data is flushed from the buffer to disk, will that data be reallocated to a snapshot with a future timestamp, or might it be reallocated to a past timestamp? This relates to whether it is possible to miss any streaming data when using the decorator.
3. When exposed to query results, is it guaranteed that the reallocation of data is atomic? Namely, for the same row, will it only ever be versioned with one server timestamp?
4. Assuming the scenario where data is reallocated to a snapshot with a future timestamp: is it possible for BQ to provide a transactional read across a group of queries? Say I am batching multiple queries of a table, each covering a unique minute interval, while buffer flushing happens in the background. Is it possible that the same data will appear in more than one query output? This relates to whether it is possible to get duplicate data when using the decorator.
EDIT:
Some additional observations: I found that after some time, the query result stabilizes, i.e. the result of the same query no longer changes over time as I execute it. I assume this is because the data "snapshot" of that time range has been finalized. So is it possible for me to know how often BigQuery flushes data from the buffer and how often the data gets snapshotted (or whatever mechanism determines the query result of a bq decorator)? Namely, is there a guaranteed time cutoff after which the output of a bq decorator is finalized?
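For reference, each of the per-minute queries described above is a legacy-SQL range decorator along these lines (a sketch with a hypothetical table name and illustrative millisecond timestamps):
# Range decorator covering a single minute, with both endpoints given as milliseconds since the epoch.
bq query "SELECT COUNT(*) FROM [my_project:my_dataset.my_table@1478592240000-1478592300000]"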

What InfluxDB schema is suitable for these measurements?

I have data about the status of my server collected over the years: temperatures, fan speeds, CPU load, SMART data. They are stored in a SQLite database under various tables, each one specific to a type of data.
I'm switching to InfluxDB for easier graphing (Grafana) and future expansion: the data will include values from another server and also UPS data (voltages, battery, ...).
I read the guidelines about schemas in InfluxDB, but I'm still confused because I have no experience on the topic. I found another question about a schema recommendation, but I cannot apply that one to my case.
How should I approach the problem, and how should I design an appropriate schema for the time series? What should I put in tags and what in fields? Should I use a single "measurement" series, or should I create multiple ones?
These are the data I am starting with:
CREATE TABLE "case_readings"(date, sensor_id INTEGER, sensor_name TEXT, Current_Reading)
CREATE TABLE cpu_load(date, load1 REAL, load2 REAL, load3 REAL)
CREATE TABLE smart_readings(date, disk_serial TEXT, disk_long_name TEXT, smart_id INTEGER, value)
Examples of actual data:
case_readings:
"1478897100" "4" "01-Inlet Ambient" "20.0"
"1478897100" "25" "Power Supply 1" "0x0"
cpu_load:
"1376003998" "0.4" "0.37" "0.36"
smart_readings:
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "194" "26 (Min/Max 16/76)"
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "195" "0/174553172"
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "196" "0"
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "230" "100"
This is my idea for an InfluxDB schema. I use uppercase to indicate the actual value, and spaces only when a string actually contains spaces:
case_readings,server=SERVER_NAME,sensor_id=SENSOR_ID "sensor name"=CURRENT_READING DATE
cpu_readings,server=SERVER_NAME load1=LOAD1 load2=LOAD2 load3=LOAD3 DATE
smart_readings,server=SERVER_NAME,disk=SERIAL,disk="DISK LONG NAME" smart_id=VALUE DATE
I found the schema used by an official Telegraf plugin for the same IPMI readings I have:
ipmi_sensor,server=10.20.2.203,unit=degrees_c,name=ambient_temp \
status=1i,value=20 1458488465012559455
I will convert my old data into that format; I have all the required fields stored in my old SQLite DB. I will modify the plugin to save the name of the server instead of the IP, which here at home is more volatile than the name itself. I will also probably reduce the precision of the timestamps to simple milliseconds or seconds.
Using that one as an example, I understand that the one I proposed for CPU readings could be improved:
cpu,server=SERVER_NAME,name=load1 value=LOAD1 DATE
cpu,server=SERVER_NAME,name=load2 value=LOAD2 DATE
cpu,server=SERVER_NAME,name=load3 value=LOAD3 DATE
However, I am still considering the one I proposed, without indexing the individual values:
cpu,server=SERVER_NAME load1=LOAD1 load2=LOAD2 load3=LOAD3 DATE
For SMART data my proposal was also not optimal, so I will use:
smart_readings,server=SERVER_NAME,serial=SERIAL,name=DISK_LONG_NAME,\
smart_id=SMART_ID,smart_description=SMART_DESCRIPTION \
value=VALUE value_raw=VALUE_RAW DATE
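For concreteness, here is roughly how a couple of the sample rows above would look in line protocol under these schemas, written via the 1.x HTTP API (the server name and database name are placeholders, and precision=s matches the second-resolution timestamps):
# Write two points: one case reading (field key with an escaped space) and one CPU load sample.
curl -XPOST 'http://localhost:8086/write?db=servers&precision=s' --data-binary \
'case_readings,server=myserver,sensor_id=4 01-Inlet\ Ambient=20.0 1478897100
cpu,server=myserver load1=0.4,load2=0.37,load3=0.36 1376003998'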

Export Bigquery Logs

I want to analyze the activity on BigQuery during the past month.
I went to the cloud console and the (very inconvenient) log viewer. I set up exports to BigQuery, and now I can run queries on the logs and analyze the activity. There is even a very convenient guide here: https://cloud.google.com/bigquery/audit-logs.
However, all this only helps with data collected from now on. I need to analyze the past month.
Is there a way to export existing logs (rather than new ones) to BigQuery (or to a flat file and then load them into BQ)?
Thanks
While you cannot "backstream" the BigQuery's logs of the past, there is something you can still do, depending on what kind of information you're looking for. If you need information about query jobs (jobs stats, config etc), you can call Jobs: list method of BigQuery API to list all jobs in your project. The data is preserved there for 6 months and if you're project owner, you can list the jobs of all users, regardless who actually ran it.
If you don't want to code anything, you can even use API Explorer to call the method and save the output as json file and then load it back into BigQuery's table.
Sample code to list jobs with BigQuery API. It requires some modification but it should be fairly easy to get it done.
You can use the Jobs: list API to collect job info and upload it to GBQ.
Since it is in GBQ, you can analyze it any way you want using the power of BigQuery.
You can either flatten the result or use the original; I recommend using the original, as it is less of a headache since no transformation is needed before loading into GBQ (you literally upload whatever you got from the API). Of course, all of this goes into a simple app/script that you still have to write.
Note: make sure you use the full value for the projection parameter.
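As a rough sketch (PROJECT_ID is a placeholder, authentication is via gcloud, and subsequent pages are fetched with the pageToken parameter):
# List up to 1000 jobs from all users with the full job configuration, and save the page to a file.
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs?projection=full&allUsers=true&maxResults=1000" \
  > jobs_page_1.json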
I was facing the same problem when I found an article which describes how to inspect BigQuery using INFORMATION_SCHEMA, without any script or Jobs: list as mentioned in the other answers.
I was able to run it and got this working.
# Monitor Query costs in BigQuery; standard-sql; 2020-06-21
# @see http://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
DECLARE timezone STRING DEFAULT "Europe/Berlin";
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
  DATE(creation_time, timezone) AS creation_date,
  FORMAT_TIMESTAMP("%F %H:%M:%S", creation_time, timezone) AS query_time,
  job_id,
  ROUND(total_bytes_processed / gb_divisor, 2) AS bytes_processed_in_gb,
  IF(cache_hit != true, ROUND(total_bytes_processed * cost_factor, 4), 0) AS cost_in_dollar,
  project_id,
  user_email
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
ORDER BY
  bytes_processed_in_gb DESC
Credits: https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/

How should I partition data in s3 for use with hadoop hive?

I have an S3 bucket containing about 300 GB of log files in no particular order.
I want to partition this data for use in hadoop-hive using a date-time stamp, so that log lines related to a particular day are clumped together in the same S3 'folder'. For example, log entries for January 1st would be in files matching the following naming:
s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3
etc
What would be the best way for me to transform the data? Am I best just running a single script that reads each file in turn and outputs the data to the right S3 location?
I'm sure there's a good way to do this using hadoop, could someone tell me what that is?
What I've tried:
I tried using hadoop-streaming by passing in a mapper that collected all log entries for each date and then wrote those directly to S3, returning nothing for the reducer, but that seemed to create duplicates. (Using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4 million.)
Does anyone have any ideas how best to approach this?
If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
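For instance, a map-only streaming job with speculative execution turned off might be launched roughly like this (the jar location, bucket paths, and mapper script are placeholders):
# -D options are generic Hadoop options and must come before the streaming options;
# mapred.reduce.tasks=0 makes the job map-only.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks=0 \
  -input s3://bucket1/logs/ \
  -output s3://bucket1/partitions/ \
  -mapper partition_mapper.py \
  -file partition_mapper.py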
Why not create an external table over this data, then use hive to create the new table?
create table partitioned (some_field string, `timestamp` string) partitioned by (created_date string);
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table partitioned partition(created_date)
select some_field, `timestamp`, to_date(`timestamp`) from orig_external_table;
In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.