The official BigQuery documentation says the update frequency is weekly, but what exactly does that mean? When I query the dataset for my own GitHub repos, only repos I created or updated before roughly August 2015 show up; everything newer is nowhere to be found.
Do I need a paid plan to get access to the latest data? What am I doing wrong?
For reference, this is the simple query I ran to get all files that contain my GitHub username:
SELECT
  *
FROM
  [bigquery-public-data:github_repos.files]
WHERE
  repo_name CONTAINS 'my_github_name'
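In case it helps, here is the same search written in standard SQL instead of legacy SQL (same public dataset; `LIKE '%…%'` plays the role of `CONTAINS`, and selecting only `repo_name` keeps the scan cheaper than `SELECT *`):

```sql
-- Standard SQL equivalent of the legacy query above.
-- Selecting only repo_name avoids scanning the large content columns.
SELECT DISTINCT repo_name
FROM `bigquery-public-data.github_repos.files`
WHERE repo_name LIKE '%my_github_name%'
```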
We have Inventory set up for GCP Compute Engine VMs and run export commands daily to create project-level asset tables (one per project) in BigQuery under the my-monitoring project.
Then we can query the VMs, for example for installed packages. Here is a sample query to check whether the package "xxx" exists on the VMs deployed in test-project:
SELECT
  pack.value.installed_package.apt_package,
  os_inventory.update_time
FROM
  `my-monitoring.InventoryLogs.test-project_compute_googleapis_com_Instance`,
  UNNEST(os_inventory.items) AS pack
WHERE
  pack.value.installed_package.apt_package.package_name = 'xxx'
The problem is that it always reports the package xxx as existing if it was ever installed; even if I remove the package later, this query still shows it as present.
As I understand it, it is showing old logs. I looked at some other VMs, and the value of os_inventory.update_time is very old. Does anyone know what this value represents and when it refreshes? I was expecting it to be updated every time we run the export assets command. Any ideas how to query the inventory table for the latest package values?
Or, alternatively, is there any other way to query for the existence of a particular package on all VMs across all projects?
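One possible sketch for the "latest values only" part, assuming repeated exports append rows rather than overwrite them, and assuming each exported row carries the asset `name` as a per-instance key (the table and column names are copied from the query above; `name` is an assumption about the export schema):

```sql
-- Restrict to each instance's newest inventory snapshot before
-- checking for the package. `name` as the instance key is an assumption.
SELECT package_name, update_time
FROM (
  SELECT
    name,
    pack.value.installed_package.apt_package.package_name AS package_name,
    os_inventory.update_time AS update_time,
    MAX(os_inventory.update_time) OVER (PARTITION BY name) AS latest_time
  FROM
    `my-monitoring.InventoryLogs.test-project_compute_googleapis_com_Instance`,
    UNNEST(os_inventory.items) AS pack
)
WHERE update_time = latest_time
  AND package_name = 'xxx'
```

The point of filtering on `update_time = latest_time` in the outer query, rather than filtering on the package name first, is that a removed package still matches old snapshot rows; only rows from the newest snapshot tell you what is currently installed.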
I can find the repository "SchemaStore/json-validator" in the open GitHub data collection, queryable with BigQuery.
The repository "SchemaStore/schemastore" has the same Apache 2.0 license, but does not seem to be in the open data collection.
I tried:
SELECT distinct repo_name
FROM `bigquery-public-data.github_repos.files`
WHERE repo_name like 'SchemaStore/%';
This only finds "SchemaStore/json-validator", but not "SchemaStore/schemastore".
Not all GitHub projects are included in the BigQuery copy.
They need to have an appropriate license, and SchemaStore/json-validator has one.
But there are more criteria.
One is project relevance: here SchemaStore/json-validator has only 8 stars and no activity in the last 4 years.
I'm not sure when the next refresh of the list of repos to index will happen, but this project probably won't be included, unless there's a good reason it should be.
Documentation says that the GitHub data collection contains all the code from GitHub:
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
But I can't find my code in it:
SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE repo_name LIKE 'Everettss/%';
results in: Query returned zero records.
Here is an example of one of my repos: https://github.com/Everettss/isomorphic-pwa
EDIT
After Felipe Hoffa's answer, I've added a LICENSE to my repo, so my example may no longer be valid.
The linked sample project is not part of the BigQuery dataset, because the linked project is not open source.
What I mean by this: for a project to be open source, at a minimum it needs to have a LICENSE file, and GitHub needs to be able to recognize that license as one of the already approved open source licenses.
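One way to check whether (and with which recognized license) a given owner's repositories made it into the dataset is the licenses table of the same public dataset; the `Everettss/%` pattern here is just the example owner from the question:

```sql
-- Lists the included repos under one owner, together with the
-- open source license the GitHub snapshot recognized for each.
SELECT repo_name, license
FROM `bigquery-public-data.github_repos.licenses`
WHERE repo_name LIKE 'Everettss/%'
```

If a repo does not appear here, its license was not recognized (or the snapshot predates the license being added), which matches the behavior described above.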
I am using the Google Cloud Logging web UI to export Google Compute Engine logs to a BigQuery dataset. According to the docs, you can even create the BigQuery dataset from this web UI (it simply asks you to give the dataset a name). It also automatically sets up the correct permissions on the dataset.
It seems to save the export configuration without errors, but a couple of hours have passed and I don't see any tables created in the dataset. According to the docs, exporting the logs will stream them to BigQuery and create tables with the following name template:
my_bq_dataset.compute_googleapis_com_activity_log_YYYYMMDD
https://cloud.google.com/logging/docs/export/using_exported_logs#log_entries_in_google_bigquery
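Once the export does start producing tables, the daily shards matching that template can be queried together with a table wildcard; the dataset name comes from the template above, while the selected columns and the date bounds are only illustrative:

```sql
-- Query all daily activity-log shards at once; _TABLE_SUFFIX holds
-- the YYYYMMDD part of each matched table name.
SELECT timestamp, severity
FROM `my_bq_dataset.compute_googleapis_com_activity_log_*`
WHERE _TABLE_SUFFIX BETWEEN '20160101' AND '20160131'
```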
I can't think of anything else that might be wrong. I am the owner of the project and the dataset is created in the correct project (I only have one project).
I also tried exporting the logs to a Google Cloud Storage bucket, still with no luck. I set the permissions correctly using gsutil, as described here:
https://cloud.google.com/logging/docs/export/configure_export#setting_product_name_short_permissions_for_writing_exported_logs
And finally, I made sure that the 'source' I am trying to export actually has some log entries.
Thanks for the help!
Have you ingested any log entries since configuring the export? Cloud Logging only exports entries to BigQuery or Cloud Storage that arrive after the export configuration is set up. See https://cloud.google.com/logging/docs/export/using_exported_logs#exported_logs_availability.
You might not have given edit permission to 'cloud-logs@google.com' in the BigQuery console. Refer to this.
I am using the Java API quickstart program to upload CSV files into BigQuery tables. I uploaded more than a thousand files, but in one of the tables the rows from one file are duplicated in BigQuery. I went through the logs but found only one entry for its upload. I also went through many similar questions where @jordan mentioned that the bug is fixed. Can there be any reason for this behavior? I have yet to try the suggested solution of setting a job ID, but I just cannot find any reason on my side for the duplicate entries.
Since BigQuery is append-only by design, you need to accept that there will be some duplicates in your system, and write your queries in a way that selects the most recent version of each row.
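A common pattern for that, sketched here with hypothetical table and column names (`id` as the business key and `ingest_time` as the recency column are assumptions; substitute your own), is to number the versions of each row and keep only the newest:

```sql
-- Deduplicate by keeping only the most recent version of each row,
-- where rows sharing an `id` are versions of the same record.
SELECT * EXCEPT(rn)
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ingest_time DESC) AS rn
  FROM `my_project.my_dataset.my_table` AS t
)
WHERE rn = 1
```

This can live in a view, so consumers always query the deduplicated shape while the underlying table keeps accumulating appends.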