Why are some Apache-licensed repositories not in GitHub open dataset? - google-bigquery

I find the repository "SchemaStore/json-validator" in the open GitHub data collection, queryable with BigQuery.
The repository "SchemaStore/schemastore" has the same Apache 2.0 license, but does not seem to be in the open data collection.
I tried:
SELECT distinct repo_name
FROM `bigquery-public-data.github_repos.files`
WHERE repo_name like 'SchemaStore/%';
This only finds "SchemaStore/json-validator", but not "SchemaStore/schemastore".

Not all GitHub projects are included in the BigQuery copy.
They need to have an appropriate license, and SchemaStore/json-validator has one.
But there are more criteria.
One is project relevance - we see here that SchemaStore/json-validator has only 8 stars and no activity during the last 4 years.
I'm not sure when the list of repos to index will next be refreshed, but this project probably won't be included - unless there's a good reason it should be.

Related

Does the BigQuery GitHub dataset contain all the data?

The documentation says that the GitHub data collection contains all the code from GitHub:
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
But I can't find my code in it:
SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE repo_name LIKE 'Everettss/%';
This results in: Query returned zero records.
Here is an example of one of my repos: https://github.com/Everettss/isomorphic-pwa
EDIT
After Felipe Hoffa's answer, I've added a LICENCE file to my repo, so my example may no longer be valid.
The linked sample project is not part of the BigQuery dataset, because the linked project is not open source.
What I mean by this: for a project to be open source, at a minimum it needs to have a LICENSE file, and GitHub needs to be able to recognize that license as one of the already approved open source licenses.
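One way to see which license, if any, was recognized for the repositories that did make it into the dataset is to query the licenses table that sits next to the files table. The following is only a sketch: the helper names are made up for illustration, and it assumes the `bigquery-public-data.github_repos.licenses` table with its `repo_name` and `license` columns and a Standard SQL query with a named parameter.

```python
def owner_prefix(owner):
    """Turn a GitHub owner name into a repo_name prefix such as 'Everettss/'."""
    return owner.strip("/") + "/"


def build_license_query():
    """Return a Standard SQL query listing recognized licenses per repository.

    The @owner_prefix placeholder is a named query parameter; its value
    has to be supplied by whatever client actually runs the query.
    """
    return (
        "SELECT repo_name, license\n"
        "FROM `bigquery-public-data.github_repos.licenses`\n"
        "WHERE STARTS_WITH(repo_name, @owner_prefix)\n"
        "ORDER BY repo_name"
    )


if __name__ == "__main__":
    print(build_license_query())
    print("owner_prefix =", owner_prefix("Everettss"))
```

If a repository does not appear in the result at all, it was not picked up for the dataset, which is consistent with the explanation above that a recognized license is a minimum requirement.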

How up-to-date is the GitHub BigQuery dataset really?

The official BigQuery documentation says the update frequency is weekly, but what exactly does that mean? When I query the dataset for my own GitHub repos, only repos I created or updated before roughly August 2015 show up. Everything newer is nowhere to be found.
Do I need a paid plan to get access to the latest data? What am I doing wrong?
For reference, this is the simple query I ran to get all files whose repository name contains my GitHub username:
SELECT
*
FROM
[bigquery-public-data:github_repos.files]
WHERE
repo_name CONTAINS 'my_github_name'

Migrate from YouTrack to JIRA

After using YouTrack for quite a while, my organization is considering a move to JIRA (for many reasons). However, JIRA doesn't seem to include a YouTrack importer/migration out of the box (though there seem to be plenty of importers/migrations the other way around).
Has anyone migrated from YouTrack to JIRA and have any experience with this?
Edit:
To anyone who might have this problem later, my final solution ended up something like this:
transfer all "basic" data by hand (user accounts, basic project setup etc.)
write a small C# program using the Atlassian SDK and the YouTrack SDK that transfers issues from one to the other (creating empty placeholder issues where issues were missing because someone had deleted them in YouTrack, in order to keep the numbering).
This approach worked well enough, and I managed to transfer pretty much all data without losing anything very important (though of course all timestamps are messed up now, but we saw that as an acceptable loss).
It is important to know that YouTrack handles issues moved from one project to another in a slightly counter-intuitive way (they still show up in their first project even after they're moved away from there, but they have an issue id from their new project - a slight wtf when I ran into that the first time).
Also, while the Atlassian SDK did allow me to "spoof" the creator of an issue (that is, being logged in as user A and creating an issue while telling the system that it's actually user B who is creating this issue), it does not allow you to do this with comments. So in order to transfer those properly I had to loop through the comments, log in as the corresponding new user and post each comment.
Also, attachments from YouTrack were a bit annoying to download, so I ended up having to download those "by hand". :/
But all in all, it was relatively pain-free. Some assembly required, some final touch-ups required, but it was all done within a couple of days.
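The placeholder trick described above (creating empty issues so that numbering stays aligned) can be sketched roughly as follows. This is an illustration in Python rather than the C# program mentioned, the function name is hypothetical, and it assumes the target tracker hands out issue numbers sequentially starting at 1.

```python
def plan_issue_creation(source_numbers):
    """Return (number, is_placeholder) pairs covering 1..max(source_numbers).

    Numbers missing from the source (for example, deleted issues) are
    marked as placeholders so the target tracker's sequential numbering
    ends up matching the source numbering.
    """
    existing = set(source_numbers)
    if not existing:
        return []
    return [(n, n not in existing) for n in range(1, max(existing) + 1)]


if __name__ == "__main__":
    # Issue 3 was deleted in the source tracker, so it becomes a placeholder.
    for number, is_placeholder in plan_issue_creation([1, 2, 4]):
        kind = "placeholder" if is_placeholder else "real issue"
        print(number, kind)
```

Creating the issues strictly in this order is what keeps the numbers aligned; the placeholders can be closed or deleted afterwards if the target tracker does not reuse numbers.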
I had the same problem. After a discussion with a JIM (JIRA Importers) developer, I used the YouTrack REST API and a Python script to generate JSON files. Then I used the JIM JSON import.
With this solution you can import almost all fields from YouTrack - the standard ones as well as files with descriptions, links between issues and projects, and so on...
I don't know if I can push it to GitHub; I have to ask my boss, since I did it during my work hours. But of course you can ask me if you want.
The easiest approach is probably to export the data from YouTrack into CSV and use the JIRA CSV importer. You may have to modify some of the data to fit the format the CSV importer expects.
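A rough sketch of the "modify some of the data" step might look like this. All column names and both date formats are hypothetical placeholders, not the actual YouTrack export layout or a format JIRA requires; adjust them to what your export contains and to what you configure in the importer.

```python
import csv
import io
from datetime import datetime

# Hypothetical mapping from source export columns to target column names.
COLUMN_MAP = {
    "Issue Id": "External Id",
    "Summary": "Summary",
    "Created": "Date Created",
}
SOURCE_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format in the export
TARGET_DATE_FORMAT = "%d/%m/%Y %H:%M"     # assumed format set in the importer


def reshape_export(source_csv_text):
    """Rename columns and reformat the creation date of an exported CSV."""
    reader = csv.DictReader(io.StringIO(source_csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(COLUMN_MAP.values()))
    writer.writeheader()
    for row in reader:
        converted = {target: row[source] for source, target in COLUMN_MAP.items()}
        created = datetime.strptime(converted["Date Created"], SOURCE_DATE_FORMAT)
        converted["Date Created"] = created.strftime(TARGET_DATE_FORMAT)
        writer.writerow(converted)
    return out.getvalue()


if __name__ == "__main__":
    sample = "Issue Id,Summary,Created\nPROJ-1,Crash on load,2015-08-01T12:30:00\n"
    print(reshape_export(sample))
```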

How to prevent Trac from showing some commits in the Timeline?

I'm trying to configure a Trac server we are using in my team, in order to avoid an undesired behaviour. We are mainly developing free and open-source software in the team, but we sometimes need to be able to build our early prototypes as completely private.
Because of our first constraint, we want our timeline to be visible to anonymous users. But because of the second constraint, we want some commits to be completely hidden from the external world, i.e. we don't want anybody other than us to be able to read the message and content of some commits in the timeline.
Unfortunately, I've been unable to configure Trac to achieve this behaviour until now. I can't find a configuration that would let me manage the Timeline content with enough precision.
Consequently, I would like to know if such a configuration is possible with Trac.
For information, I'm using Trac 0.12.2. The installed plugins are :
Trac 0.12.2
TracAccountManager 0.2.1dev-r7731
TracNav 4.1
The only permission I can see that is related to Timeline is TIMELINE_VIEW.
EDIT :
I forgot to mention something. We don't want to lose the private commits, and we want them to be displayed for registered users. Consequently, removing them from the database is not a solution for us.
EDIT 2 :
Ideally, we would like commit messages to be displayed according to the right to read the content of our Subversion repository. The idea is that, if a commit is made on a part someone can't access, this person is not supposed to be able to read the message of the commit either.
EDIT 3 :
If we have a look at the Trac configuration file, we can already find :
permission_policies = AuthzSourcePolicy, DefaultPermissionPolicy, LegacyAttachmentPolicy
and the authz_file variable is properly set too. Moreover, the private folders of the svn repositories can't be accessed by anonymous users.
You should set up authz checking for both your Subversion repository and your Trac installation. You can use the same permission file for both. For Subversion, see Path-based authorization in the SVN book. For Trac, enable and configure the trac.versioncontrol.svn_authz.AuthzSourcePolicy component.
This will allow you to have a very fine-grained control over who can access which part of the repository. Note that the implementation of AuthzSourcePolicy in Trac 0.12.2 has a few bugs that will be fixed in 0.12.3.
There are two ways of going about this :
1) You can directly edit the plugins that are running in Trac, and add a module that helps you to filter these out at the code level (i.e. you can edit the behavior of the script to, say, only include commits which do not contain certain keywords). The timeline script is here (trac 2.4) : /usr/local/lib/python2.4/site-packages/trac/Timeline.py (here is an online diff snapshot of the source code : http://trac.edgewall.org/attachment/ticket/890/Timeline.py.diff)
2) You can remove the commits entirely - Trac commits are derived from the SQLite database (the schema is here http://trac.edgewall.org/wiki/TracDev/DatabaseSchema).
Of course, there also might be some fancy tools out there that provide a nice interface for editing the way the timeline looks.
Finally - temporarily, you can remove the timeline/roadmap entirely from the trac.ini file : http://www.gossamer-threads.com/lists/trac/users/28079
I confess that I have virtually no experience with the repository part of Trac, even less with using a repository with a variety of permissions across its contents.
On the subject: Configuration is certainly not enough, see rblank's answer. While I've never seen the code for that functionality, I was wrong to suggest it doesn't exist. Because it is a central place and developed/supported in Trac core, this is definitely the way to go.

Finding which files were "FIXED" and how many times between two specific dates by using Trac?

I need to find out which files were fixed or changed due to a bug, and how many times, between two specific dates in an open source project which uses Trac. I selected the WebKit project for that purpose (https://trac.webkit.org/). However, it can be any open source project.
What can I do for that? How do I start? Do I have to use version control systems like svn or git for integration? I am fairly new to these bug-tracking and issue-tracking systems.
I'm not certain I exactly understand your question, but...
If you browse to the directory containing the files you care about in the Trac site, then click on Revision Log, you will get a list of changesets that affected that directory. You can select the revisions that span the timeframe of interest and then View changes and you will get a summary of the changes, and depending on the size of the changes and the particular Trac configuration, you may get the diffs on that page as well.
Now, that won't tell you how many times those files were changed, just the net changes.
It also won't tell you which bugs those changes were for.
If you really need to filter on what bug, you'll have to determine how that information is tracked by the particular project; and some might not track it directly. The project might include a #123 in the commit message. If you can rely on that, you could use svn log --xml {2009-11-01}:{2009-12-01} ... to get an xml version of the commit log which you could then parse and filter based on the presence of the bug's ticket number in the commit message. From that, you should have a list of the revisions that you care about.
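The parsing step could be sketched like this. It assumes the log was produced with the verbose flag as well (svn log -v --xml ...), so that each log entry lists its changed paths, and that the project references tickets as #123 in commit messages; the function name and the sample log are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample in the shape of `svn log -v --xml` output.
SAMPLE_LOG = """<log>
<logentry revision="101">
<author>alice</author>
<date>2009-11-03T10:00:00.000000Z</date>
<paths>
<path action="M">/trunk/a.c</path>
<path action="M">/trunk/b.c</path>
</paths>
<msg>Fix crash on load, see #123</msg>
</logentry>
<logentry revision="102">
<author>bob</author>
<date>2009-11-10T09:00:00.000000Z</date>
<paths>
<path action="M">/trunk/a.c</path>
</paths>
<msg>Reformat comments</msg>
</logentry>
<logentry revision="103">
<author>alice</author>
<date>2009-11-20T15:30:00.000000Z</date>
<paths>
<path action="M">/trunk/a.c</path>
</paths>
<msg>Follow-up for #456</msg>
</logentry>
</log>"""


def count_files_for_tickets(svn_log_xml, ticket_pattern=r"#\d+"):
    """Count, per path, the commits whose message references a ticket."""
    root = ET.fromstring(svn_log_xml)
    counts = Counter()
    for entry in root.iter("logentry"):
        message = entry.findtext("msg") or ""
        if not re.search(ticket_pattern, message):
            continue  # commit does not mention a ticket; skip it
        for path in entry.iter("path"):
            counts[path.text] += 1
    return counts


if __name__ == "__main__":
    for path, times in count_files_for_tickets(SAMPLE_LOG).most_common():
        print(times, path)
```

The date range is handled by svn itself through the revision range, so the script only has to filter on the commit message. If the project marks bug fixes differently, the ticket_pattern argument would need to change accordingly.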