Does the BigQuery GitHub dataset contain all the data?


The documentation says that the GitHub data collection contains all the code from GitHub:
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
But I can't find my code in it:
SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE repo_name LIKE 'Everettss/%';
returns: Query returned zero records.
Here is an example of one of my repos: https://github.com/Everettss/isomorphic-pwa
EDIT
After Felipe Hoffa's answer, I added a LICENSE file to my repo, so my example may no longer be valid.

The linked sample project is not part of the BigQuery dataset, because the linked project is not open source.
What I mean by this: for a project to count as open source, it needs at a minimum to have a LICENSE file, and GitHub needs to be able to recognize that license as one of the already approved open source licenses.

Related

Why are some Apache-licensed repositories not in GitHub open dataset?

I find the repository "SchemaStore/json-validator" in the open GitHub data collection, queryable with BigQuery.
The repository "SchemaStore/schemastore" has the same Apache 2.0 license, but does not seem to be in the open data collection.
I tried:
SELECT distinct repo_name
FROM `bigquery-public-data.github_repos.files`
WHERE repo_name like 'SchemaStore/%';
This only finds "SchemaStore/json-validator", but not "SchemaStore/schemastore".

Not all GitHub projects are included in the BigQuery copy.
They need to have an appropriate license, and SchemaStore/json-validator has one.
But there are more criteria.
One is project relevance - we see here that SchemaStore/json-validator has only 8 stars, and no activity during the last 4 years.
I'm not sure when the list of repos to index will next be refreshed, but this project probably won't be included - unless there's a good reason it should be.

How up-to-date is the GitHub BigQuery dataset really?

The official BigQuery documentation says the update frequency is weekly, but what exactly does that mean? When I query the dataset for my own GitHub repos, only repos I created or updated before about August 2015 show up. Everything newer is nowhere to be found.
Do I need a paid plan to get access to the latest data? What am I doing wrong?
For reference, this is the simple query I ran to get all files whose repo name contains my GitHub username:
SELECT
*
FROM
[bigquery-public-data:github_repos.files]
WHERE
repo_name CONTAINS 'my_github_name'

BigQuery - 1 file duplicate out of many using Java API

I am using the Java API quickstart program to upload CSV files into BigQuery tables. I have uploaded more than a thousand files, but in one of the tables the rows of one file are duplicated. I went through the logs but found only one entry for its upload. I also went through many similar questions, where @jordan mentioned that the bug is fixed. What could cause this behavior? I have yet to try the suggested solution of setting a job ID, but I cannot find any cause on my side for the duplicate entries.

Since BigQuery is append-only by design, you need to accept that there will be some duplicates in your system, and write your queries in such a way that they select the most recent version of each row.
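A minimal sketch of the "select the most recent version" pattern follows. It is an illustration under assumptions, not the answerer's own query: SQLite stands in for BigQuery so the example runs locally, and the table `uploads` with columns `record_id`, `payload` and `loaded_at` is hypothetical. It also assumes each row carries a key and a load timestamp. The same `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)` construct is available in BigQuery standard SQL.

```python
import sqlite3

# Keep one row per record_id: the one with the newest loaded_at.
# Table and column names are hypothetical.
DEDUP_QUERY = """
SELECT record_id, payload, loaded_at
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY record_id
           ORDER BY loaded_at DESC
         ) AS rn
  FROM uploads
)
WHERE rn = 1
ORDER BY record_id
"""


def latest_rows(rows):
    """Load rows into an in-memory table and return the deduplicated view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE uploads (record_id INTEGER, payload TEXT, loaded_at TEXT)"
    )
    conn.executemany("INSERT INTO uploads VALUES (?, ?, ?)", rows)
    result = conn.execute(DEDUP_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        (1, "a", "2017-01-01"),   # uploaded twice by accident
        (1, "a", "2017-01-01"),
        (2, "b", "2017-01-01"),   # superseded by a later version
        (2, "b2", "2017-01-02"),
    ]
    for row in latest_rows(sample):
        print(row)
```

Because the duplicates are filtered at query time, the stored table can stay append-only; the cost is that every reader has to go through the deduplicating query (or a view that wraps it).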

TFS2010 database size

We've been using TFS since around 2009 when we installed TFS2008. We upgraded to TFS2010 at some point and we've been using it for source control, work item management, builds etc.
Our TFSVersionControl.mdf file is 287,120,000 KB (273GB). We ran some queries and found that our tbl_BuildInformationField table is massive. It has 1,358,430,452 rows, which take up 150,988,624 KB (143GB). We have multiple active products with multiple active builds, with more than one solution per build, and the solutions aren't free of warning messages.
My questions:
Is it possible to stop MSBuild from spamming the tbl_BuildInformationField table so much? I.e. only write errors and general build information and not all the warnings for every project?
Is there a way to purge or clean up old data from this table?
Is 273GB for 4 years of TFS use an average size?
Is 143GB for tbl_BuildInformationField a "normal" size?

The table holds the values and output of the build process. Note that the build retention policy doesn't actually delete the build object. Like everything else in TFS, the object is only marked as deleted, and only its public visibility and drop location are cleared.
I would suggest that if you have retained the same build definitions for a very long time (when a build definition is deleted, the related objects get removed as well), you query for build information, including deleted builds, using the TFS API; the same API will also allow you to remove them for good. Deleting the build definitions will probably not work and will fail with a timeout error.
You can consult the following:
http://blogs.msdn.com/b/adamroot/archive/2009/06/12/working-with-deleted-build-data-in-team-foundation-server-2010-beta-1.aspx

Finding which files were "FIXED" and how many times between two specific dates by using Trac?

I need to find out which files were fixed or changed due to a bug, and how many times, between two specific dates in an open source project that uses Trac. I selected the WebKit project for that purpose (https://trac.webkit.org/). However, it can be any open source project.
What can I do for that? How do I start? Do I have to use a version control system like svn or git for integration? I am kind of a newbie with these bug-tracking and issue-tracking systems.

I'm not certain I exactly understand your question, but...
If you browse to the directory containing the files you care about in the Trac site, then click on Revision Log, you will get a list of changesets that affected that directory. You can select the revisions that span the timeframe of interest and then View changes and you will get a summary of the changes, and depending on the size of the changes and the particular Trac configuration, you may get the diffs on that page as well.
Now, that won't tell you how many times those files were changed, just the net changes.
It also won't tell you which bugs those changes were for.
If you really need to filter on what bug, you'll have to determine how that information is tracked by the particular project; and some might not track it directly. The project might include a #123 in the commit message. If you can rely on that, you could use svn log --xml {2009-11-01}:{2009-12-01} ... to get an xml version of the commit log which you could then parse and filter based on the presence of the bug's ticket number in the commit message. From that, you should have a list of the revisions that you care about.
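The parse-and-filter step could be sketched roughly as below. It assumes the log was produced with the verbose flag as well (`svn log --xml -v ...`), since changed paths only appear in the XML with `-v`, and it assumes the project really does put ticket numbers such as `#123` in commit messages. The function name `count_changed_paths` and the sample log are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def count_changed_paths(svn_log_xml, ticket_pattern=r"#\d+"):
    """Count how often each path was touched by commits whose message
    matches ticket_pattern (by default, any '#<number>' reference)."""
    root = ET.fromstring(svn_log_xml)
    counts = Counter()
    for entry in root.iter("logentry"):
        msg = entry.findtext("msg") or ""
        if not re.search(ticket_pattern, msg):
            continue  # commit does not mention a ticket
        for path in entry.iter("path"):
            counts[path.text] += 1
    return counts


# Hand-written sample in the shape of `svn log --xml -v` output.
SAMPLE_LOG = """<log>
<logentry revision="101">
<author>alice</author>
<date>2009-11-03T10:00:00.000000Z</date>
<paths>
<path action="M">/trunk/a.c</path>
<path action="M">/trunk/b.c</path>
</paths>
<msg>Fix crash, see #123</msg>
</logentry>
<logentry revision="102">
<author>bob</author>
<date>2009-11-10T10:00:00.000000Z</date>
<paths>
<path action="M">/trunk/a.c</path>
</paths>
<msg>Refactor, no ticket</msg>
</logentry>
<logentry revision="103">
<author>alice</author>
<date>2009-11-20T10:00:00.000000Z</date>
<paths>
<path action="M">/trunk/a.c</path>
</paths>
<msg>Follow-up for #123</msg>
</logentry>
</log>"""

if __name__ == "__main__":
    for path, n in count_changed_paths(SAMPLE_LOG, r"#123\b").most_common():
        print(n, path)
```

Restricting the date range is left to the `svn log` revision argument shown above, so the script only has to deal with the ticket filter and the per-file counts.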
