How to search github projects ordered by number of commits? - google-bigquery

I was thinking of trying out BigQuery and GithubArchive, but I'm not sure how to compose a query that would let me search for a term in code or project and order the results by number of commits descending.
Thanks for any tips

The GithubArchive data loaded into BigQuery doesn't have a copy of the source code, so searching for a term in code wouldn't be possible. But if you want to search for a term in the repository description and then pick the top repositories by number of commits, here is an example of how to do it (the term is "SQL" in this example):
select count(*) c, repository_url, repository_description
from [githubarchive:github.timeline]
where type = 'PushEvent' and repository_description contains 'SQL'
group by 2, 3
order by c desc
limit 10
This results in
14925 https://github.com/danberindei/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
9377 https://github.com/postgres/postgres Mirror of the official PostgreSQL GIT repository. Note that this is just a *mirror* - we don't work with pull requests on github. To contribute, please see http://wiki.postgresql.org/wiki/Submitting_a_Patch
4876 https://github.com/galderz/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
4747 https://github.com/triAGENS/ArangoDB ArangoDB is a multi-purpose, open-source database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript/Ruby extensions. Use ACID transaction if you require them. Scale horizontally and vertically with a few mouse clicks.
3590 https://github.com/webnotes/erpnext Open Source, web-based ERP based on Python, Javascript and MySQL.
3489 https://github.com/anistor/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
3263 https://github.com/youtube/vitess vitess provides servers and tools which facilitate scaling of MySQL databases for large scale web services.
3071 https://github.com/infinispan/infinispan Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.
2631 https://github.com/theory/sqitch Simple SQL change management
2358 https://github.com/zzzeek/sqlalchemy Mirror of SQLAlchemy

SELECT COUNT(1) c, repository_url, repository_description
FROM [githubarchive:github.timeline]
WHERE type = 'PushEvent'
AND REGEXP_MATCH(repository_description, r'(?i)SQL')
GROUP BY 2, 3
ORDER BY c DESC
LIMIT 10
BigQuery supports regular expressions, so you can greatly improve or narrow down your search results with the flexibility of using a search pattern instead of a plain search term.
The references below can help you further:
BigQuery Regular expression functions
re2 Syntax
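For example (a minimal sketch, assuming you only want whole-word matches), a word-boundary pattern would exclude descriptions that merely contain "SQL" inside another word such as "NoSQL", "MySQL" or "PostgreSQL":
SELECT COUNT(1) c, repository_url, repository_description
FROM [githubarchive:github.timeline]
WHERE type = 'PushEvent'
AND REGEXP_MATCH(repository_description, r'(?i)\bSQL\b')
GROUP BY 2, 3
ORDER BY c DESC
LIMIT 10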


Hybrid Query Example in AgensGraph

I am using AgensGraph but I don't know how to write a hybrid query; any examples of hybrid queries in AgensGraph would help a lot.
In AgensGraph you can write hybrid queries in two ways:
Let's say you create the following:
CREATE GRAPH AG;
CREATE VLABEL dev;
CREATE (:dev {name: 'someone', year: 2015});
CREATE (:dev {name: 'somebody', year: 2016});
CREATE TABLE history (year, event)
AS VALUES (1996, 'PostgreSQL'), (2016, 'AgensGraph');
1- Cypher in SQL
Syntax:
SELECT [column_name]
FROM ({table_name|SQL-query|CYPHERquery})
WHERE [column_name operator value];
Example:
SELECT n->>'name' as name
FROM history, (MATCH (n:dev) RETURN n) as dev
WHERE history.year > (n->>'year')::int;
Result:
name
----
someone
(1 row)
2- SQL in Cypher
Syntax:
MATCH [table_name]
WHERE (column_name operator {value|SQLquery|CYPHERquery})
RETURN [column_name];
Example:
MATCH (n:dev)
WHERE n.year < (SELECT year FROM history WHERE event =
'AgensGraph')
RETURN properties(n) AS n;
Result:
n
----
{"name": "someone", "year": 2015}
(1 row)
You can find more information here
I found more info on the hybrid query language in these slides. Every other bit of information I have been able to find is just the same example that Eya posted, in different places.
I agree that more information about hybrid queries in AgensGraph would be great, as it seems like a killer feature of the software.
Let's assume that we have a network management system and we are keeping our network topology in the graph part of AgensGraph (graph format) and our time-series data (such as date and time information regarding specific devices) in the relational part of AgensGraph (table format). So, in this case, we know that we have a graph and tables, and if we want, we can write a hybrid query to fetch data from both models.
In our graph, we have different devices that are connected to each other, such as a modem, IoT sensors, etc. For each of these devices, we also have related information stored in tables, such as download speed, upload speed, or CPU usage.
In the following hybrid queries, our goal is to collect information about specific devices by querying both the graph and the tables simultaneously.
Cypher in SQL
In this hybrid query, we are looking for modem devices that are having issues and whose abnormality type is 2 (2 indicates that the device is having issues with its download and upload speed); after we find those devices, our goal is to return their id, download speed, and upload speed to investigate the issue. As you can see in the following query, our inner query is Cypher and our outer query is SQL.
SELECT id, sysdnbps, sysupbps
FROM public.modemrdb
WHERE to_jsonb(id) IN
  (SELECT id
   FROM (MATCH (m:modem) WHERE m.abnormaltype = 2 RETURN m.name) AS s(id));
SQL in Cypher
In this hybrid query, we are looking for modem devices whose CPU usage is more than 80 (outside the threshold range), which indicates there is an issue with these devices; after we find those devices, our goal is to return those modems and any IoT devices that are connected to them. As you can see in the following example, our inner query is SQL and our outer query is Cypher.
MATCH p=(n:modem)-[r*1..2]->(iot)
WHERE n.name in
(SELECT to_jsonb(id)
FROM public.modemrdb
WHERE syscpuusage >= 80)
RETURN p;
This can be another example of a hybrid query.

How much storage capacity does my dataset or table consume?

I have multiple datasets, each with hundreds of tables, in Google BigQuery. I'd like to remove some old, legacy data and I am looking for the most convenient way to know how much storage space each of my datasets and tables is occupying, so I can make an educated decision on which datasets/tables I may remove.
I tried to use bq command-line tool but couldn't find a way to display table storage and entire dataset storage related information.
You can access metadata about the tables in a dataset by using the __TABLES__ meta-table. For example:
select * from [publicdata:samples.__TABLES__]
returns
project_id dataset_id table_id creation_time last_modified_time row_count size_bytes type
publicdata samples github_nested 1348782587310 1348782587310 2541639 1694950811 1
publicdata samples github_timeline 1335915950690 1335915950690 6219749 3801936185 1
publicdata samples gsod 1335916040125 1440625349328 14420316 17290009238 1
publicdata samples natality 1335916045005 1440625330604 37826763 23562717384 1
publicdata samples shakespeare 1335916045099 1440625429551 164656 6432064 1
publicdata samples trigrams 1335916127449 1445684180324 68051509 277168458677 1
publicdata samples wikipedia 1335916132870 1445689914564 13797035 38324173849 1
More documentation here: https://cloud.google.com/bigquery/querying-data
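If you want per-dataset totals rather than per-table numbers, you can aggregate size_bytes. Below is a minimal sketch assuming legacy SQL and placeholder project/dataset names (in legacy SQL, listing several __TABLES__ meta-tables in the FROM clause behaves like a UNION ALL):
SELECT dataset_id,
  ROUND(SUM(size_bytes)/POW(1024, 3), 2) AS total_gb,
  SUM(row_count) AS total_rows,
  COUNT(*) AS tables
FROM [myproject:dataset1.__TABLES__], [myproject:dataset2.__TABLES__]
GROUP BY 1
ORDER BY total_gb DESC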
Below is an example of how to combine the use of metadata (as in the answer by #Moshapasumansky) with visualization (as in the recommendation by #DoITInternational), all without leaving the BigQuery Web UI, but you will need the BigQuery Mate Chrome Extension.
Assuming you have the extension, follow the steps below:
Step 1 - Run Query against tables metadata in publicdata:samples dataset
SELECT
table_id,
DATE(TIMESTAMP(creation_time/1000)) AS Created,
DATE(TIMESTAMP(last_modified_time/1000)) AS Modified,
row_count AS Rows,
ROUND(size_bytes/POW(1024, 3)) AS GB
FROM [publicdata:samples.__TABLES__]
Step 2 - Move to JSON View
Step 3 - Expand Result Panel by Clicking on + Button
This is for two reasons:
To bring up to 500 records into the result panel at a time (which should cover your case, as you mentioned you have hundreds of tables), versus the relatively limited number of rows at a time currently supported by the native UI
To free up more real estate for the chart
Step 4 - Close Query Editor (optional) – more real estate for the chart
Step 5 - Click Show Pivot to bring up the Pivot/Chart tool with data from the result, and then design your pivot chart the way you like
It might not be the best way, but at least it allows you to do what you want here without leaving the web UI. In some cases it can be the preferred option, I think.
Rather than using the BigQuery API (the Tables: get method specifically) and looking at numBytes in the response, I suggest using BQdu, the BigQuery Disk Usage web application. It will scan your project for datasets and tables and display a nice visualization showing how much storage each table (or entire dataset) is consuming.

Is Bigtable (or BigQuery) the right platform for correlation analysis of logs?

I'm faced with the challenge of analysing different system logfiles based on the following requirements:
several hundred systems
millions of logs every day in different formats
Besides many other objectives, my biggest challenge is real-time correlation analysis of all incoming logs against all current system logs and also against partially historical log events.
Currently we're focusing on MongoDB, ElasticSearch, Hadoop, ... to meet this challenge.
On the other hand, I've read some interesting things about Google Bigtable and BigQuery.
So my question is: are Bigtable and/or BigQuery worth looking at in order to do this real-time analysis?
I have no experience with these two products, so I'm hoping for some tips on whether these Google solutions could be an alternative for my requirements.
THX & BR
bdriven
EDIT:
Too broad. You need to show the actual analysis you need to make. BigQuery will be much, much cheaper than homemade with NoSQL.
Our goal is, to develop a system, which is able to generate warnings based on current log events (or a combination of different log events) and their past interactions on other systems behavior.
Therefore we have to be able to do fast correlation analysis for current events against huge amounts of unstructured historical data.
I know that this requirement description is probably not the most specific one, but we're right at the beginning of this project.
So my goal with this question is to get some arguments for our next team meeting on whether we should consider taking a closer look at Bigtable / BigQuery or not.
One of my favorite features of BigQuery is its ability to run correlations.
Here's a correlations with BigQuery tutorial I wrote a couple years ago: http://nbviewer.ipython.org/gist/fhoffa/6459195
For example, to rank and find the most correlated airports in terms of flight delays:
SELECT a.departure_state, b.departure_state, CORR(a.avg, b.avg) corr, COUNT(*) c
FROM (
  SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
  FROM [bigquery-samples:airline_ontime_data.flights]
  GROUP BY 1, 2 HAVING c > 5
) a
JOIN (
  SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
  FROM [bigquery-samples:airline_ontime_data.flights]
  GROUP BY 1, 2 HAVING c > 5
) b
ON a.date = b.date
WHERE a.departure_state < b.departure_state
GROUP EACH BY 1, 2
HAVING c > 5
ORDER BY corr DESC;
Try it yourself in the next 5 minutes! A quick getting started tutorial: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/

Microsoft Decision Trees: support cases for a specific node

I'm using Microsoft Decision Trees in Microsoft Analysis Services Data Mining, and I need to show the historical data (the support cases from the training data used to train the decision tree) for a given leaf node in my mining model. Is there a way to access those records directly based on the NodeID using a DMX query, or is the only way to get the NODE_DESCRIPTION for the node, replace "not =" with "<>", and execute a query against my live database with that as my WHERE clause?
Courtesy of rok1 on the MSDN forums: http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/e6502263-a2b9-4fa1-b60b-04414e3efd29
SELECT * FROM [ModelName].Cases
WHERE IsTrainingCase()
AND IsInNode('0') -- your intended node
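If you first need to find the NodeID (NODE_UNIQUE_NAME) and NODE_DESCRIPTION of the leaf you are interested in, you can run a content query against the model; a minimal sketch, using the same placeholder model name:
SELECT NODE_UNIQUE_NAME, NODE_DESCRIPTION, NODE_SUPPORT
FROM [ModelName].CONTENT
The NODE_UNIQUE_NAME returned here is the value you would pass to IsInNode().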

Specification Pattern vs Specific Hibernate Query

My question is when to use a specification pattern, and when to use specific SQL query.
I understand that the specification pattern needs to load the whole collection and post-filter it using a concrete specification, but I don't understand its advantage over a specific SQL query.
// Compose two specifications into one
CarColorSpecification cc = new CarColorSpecification(RED);
CarAgeSpecification ca = new CarAgeSpecification(OLDER, 5);
ISpecification finalSpec = cc.And(ca);

// Load the whole collection, then post-filter it in memory
List<Car> res = new List<Car>();
List<Car> carColl = service.getCars();
foreach (Car c in carColl) {
    if (finalSpec.isSatisfiedBy(c)) {
        res.Add(c);
    }
}
And the same in SQL / Hibernate
FROM Car c WHERE c.color = RED AND c.age > 5
I think it depends on the volume of data to process.
The SQL version will run quickly provided the table is appropriately indexed for the columns in question and its size, and it will transfer a smaller volume of data between the DB server and the app server if they're different. It may, however, impose a higher load on the SQL box in terms of CPU and disk I/O usage, and in many environments, the DB server is the most expensive component to scale.
So yes, it depends a great deal on the size of the data.
The Repository is used to abstract the persistence implementation away from your domain classes. In short, SQL/HQL should only exist within your repositories.
If you're dealing with high data volumes, create a new method on your Repository and call that method from your Specification.
I think a good compromise would be to combine the specification pattern with Hibernate to generate HQL queries (maybe LINQ for Java ^^?).
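For example, instead of loading all cars and filtering in memory, a composed specification could be translated into a parameterized HQL query; a minimal sketch, with illustrative parameter names bound from the CarColorSpecification and CarAgeSpecification values:
FROM Car c WHERE c.color = :color AND c.age > :minAge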