I'm wanting to use GDELT to obtain a CSV containing a count of the number of articles containing a specific theme for all countries for a given number of years. Please note it's not quite the same as the events database, I'm specifically interested in the themes). There are some tutorials on the GDELT website (here), but they appear to be a little out of date regarding the regexp syntax and I'm not super familiar with SQL. Ideally, my output should look something like the following :
Year Country Count
2001 Afghanistan 34234
2002 Afghanistan 11864
...
2001 Zambia 939
2002 Zambia 864
From my understanding and the tutorial, this code counts the number of articles by theme and day of the year rather than year (taken from here). It's close to what I want, but not quite there.
select date(_partitiontime) date, count(theme) occurences
from `gdelt-bq.gdeltv2.gkg_partitioned`, unnest(split(themes,';')) as theme
where _partitiontime >= "2020-11-01 00:00:00" and _partitiontime < "2020-11-07 00:00:00"
and lower(theme) like "%bitcoin%"
group by date
-- order by date
I think I need to
Add something to return and group by V2Locations at the country level
Parse the data to get only the year
but am not sure how to do it. Any help would be greatly appreciated.
Thanks!
Related
I am working with crimes in Boston dataset, with each crime as nodes in Neo4j. I want to query and display the most committed crimes each year to get a result like this:
Year
offense_code_group
count(offense_code_group)
2015
Aggravated Assault
5827
2016
Larceny From Motor Vehicle
11534
2017
Verbal Disputes
12049
2018
Investigate Person
8724
Note: The result is just an example of how I want it to group and look, not the actual result I'll get after querying the dataset.
But this is the best I have been able to do so far:
query and output in neo4j desktop
I know there is no GROUP BY clause in Neo4j and I have tried using collect(), but I can't get it to work.
How about something like this:
MATCH (c:Crime)
WITH c.year AS year, c.offense_code_group AS code, count(c.offense_code_group) AS count
ORDER BY year, count DESC
WITH year, COLLECT(code) AS codes, COLLECT(count) AS counts
RETURN year, codes[0], counts[0]
I am trying to construct an economic indicator based on all events with specific cameo codes from gdelt database.
So the idea is to collect data from 1990 to till date and see how economic cooperation varied based on news appearances of certain words. CAMEO codes 0211, 0311, 061, 1011 and 1211 in specific.
My query is how to extract this data for these specific cameo codes. If you can direct me to any source, it would be of great help.
One person suggested me that try using bigquery. I honestly don't know how to navigate the google bigquery page till now (I tried my best probably being from a non-tech background, it was a bit overwhelming for me). If any of you can help with one Cameo code data extraction example then I can play around with other events.
Edit: I am editing to show the progress I have made and the issues I am facing while running the query.
SELECT
*
FROM
[gdelt-bq:full.events]
WHERE
Year >= 1979
AND EventCode IN ('0211', '0311','061', '1011', '1211')
AND Actor1CountryCode != Actor2CountryCode
This query will process 228 GB when run and also excludes the cases where both the country codes are null. It has over 2 million rows and I cant download this as a csv file from bigquery platform.
The part where I need help is the following,
is there any way that I can get the total number of events for each event code which satisfies the following conditions
Actor1Countrycode and Actor2CountryCode should be different except when they are null
Count for each event code every month which satisfies the above condition.
PS: You can run the code given by Ben P in the answer below to see the number and type of columns in the database.
Edit2: There is another query that I am trying to write where in the AvgTone of an event with a specified code is greater than the average of AvgTone of all events in that particular month. Any leads on how to write this would be really helpful. Suppose, I add a WHERE clause wherein the AvgTone is greater than the average of AvgTone of all events for that particular period (MonthYear in this case). My doubt is how to write this in a query format.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211',
'0311',
'061')
AND Actor1CountryCode != Actor2CountryCode
AND AvgTone > (SELECT AVG(AvgTone) FROM [gdelt-bq:full.events] GROUP BY MonthYear ORDER BY MonthYear)
GROUP BY
MonthYear
ORDER BY
MonthYear
Error: ELEMENT can only be applied to result with 0 or 1 row.
Can someone help me with the above query? Thanks.
The GDELT database is available in BigQuery.
Here is a link to their available datasets, your first step would to identify which contains the information you are interested in:
https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/
Then this section of the site contains sample queries, which you can use as a starting point and try to tweak to your needs (note that these examples appear to me mostly in Legacy SQL, I would suggest you use them as a guide and rewrite then in Standard SQL):
https://blog.gdeltproject.org/a-compilation-of-gdelt-bigquery-demos/
If you have any specific SQL/BigQuery questions after you have done this I would recommend you come back with fresh questions and share examples of your working code, details what you have already tried and the results you expect to see.
Having had a quick look, and I must say i am not familiar with the dataset, but this may be a simple query that can start you on your way:
-- first we select all columns from the event dataset, which seems
-- to be the one you want, containing cameo codes
SELECT * FROM `gdelt-bq.full.events`
-- then we add a filter to only look at events in or after 1990
WHERE Year >= 1990
-- and another filter to look at only the specific camera
--codes you provided (I think EventCode is the correct column here,
AND EventCode IN ('0211','0311','061','1011','1211')
-- finally, we add a limit to our query, so we don't bring back ALL
-- the results while testing, once we are happy with our query, we'd remove this!
LIMIT 100
Finally, the GDELT tag right here on StackOverflow contains some really great content.
Hope that helps, GDELT looks like a fascinating project!
I finally figured out a way to extract data from GDELT using bigquery. Although the query is very simple, my lack of SQL knowledge made it difficult. Thanks to Ben who provided the initial help. Following are the queries which satisfy the conditions given in the question.
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode IS NULL
AND Actor2CountryCode IS NULL
GROUP BY
MonthYear
ORDER BY
MonthYear
SELECT
MonthYear,
COUNT(*)
FROM
[gdelt-bq:full.events]
WHERE
EventCode IN ('0211', '0311','061')
AND Actor1CountryCode != Actor2CountryCode
GROUP BY
MonthYear
ORDER BY
MonthYear
I am very new to SQL and have my data in an Access database (~50k rows) with the following structure
State Year Date Price
CA 2012 1/2/13 5.00
NY 2013 1/2/13 6.00
NY 2013 1/7/13 7.00
A (State, Year) pair, though held in different columns here, represent a vintage (like a wine). So we talk about how the price of "CA 2012" moves throughout the year.
Because some of our data is entered manually into this database, there is opportunity for error. We would like to write a query that flags any suspicious entries for further review.
I have read many different questions and threads on the subject but have not found anything that addresses my main concern of how to find local outliers - the price can move up and down so prices that may be okay for some date range may be an outlier earlier in the year
Update: I chunked my data into buckets of months so finding local outliers might be easier as a result of that. I'm still looking for good outlier detection methods I can implement in SQL.
Sometimes simple is best- No need for an intro to statistics yet. I would recommend starting with simple grouping. Within that function you can Average, get the minimum, the Maximum and other useful bits of data. Here are a couple of examples to get you started:
SELECT Table1.State, Table1.Yr, Count(Table1.Price) AS CountOfPrice, Min(Table1.Price) AS MinOfPrice, Max(Table1.Price) AS MaxOfPrice, Avg(Table1.Price) AS AvgOfPrice
FROM Table1
GROUP BY Table1.State, Table1.Yr;
Or (in case you want month data included)
SELECT Table1.State, Table1.Yr, Month([Dt]) AS Mnth, Count(Table1.Price) AS CountOfPrice, Min(Table1.Price) AS MinOfPrice, Max(Table1.Price) AS MaxOfPrice
FROM Table1
GROUP BY Table1.State, Table1.Yr, Month([Dt]);
Obviously you'll need to modify the table and field names (Just so you know though- 'Year' and 'Date' are both reserved words and best not used for field names.)
currently I'm in midst of developing crystal report template. Any help is much appreciated.
What I want to achieve is to count based on date.
I give example.
I got 3 params:
start period(date): eg: 1/5/2013
end period: eg: 5/5/2013(date)
how many working days(number): eg: 5
and the table is like this
name working date
nameA 2/5/2013
2/5/2013
3/5/2013
nameB 2/5/2013
4/5/2013
5/5/2013
I want it to count how many working days for each name.
Eg result: nameA: 2 working days, nameB: 3 working days.
Please help. I'm new to programming..:( Couldn't think any programming design for this..
You need this query,
select name, count(distinct working_date)
from table where working_date between ? and ?
but most of all, you need to learn SQL. It is impossible to write reports without some knowledge of SQL.
Wondering if anyone can help with the code for this.
I want to query the data and get 2 entries, one for YTD previous year and one for this year YTD.
Only way I know how to do this is as 2 separate queries with where clauses.. I would prefer to not have to run the query twice.
One column called DatePeriod and populated with 2011 YTD and 2012YTD, would be even better if I could get it to do 2011YTD, 2012YTD, 2011Total, 2012Total... though guessing this is 4 queries.
Thanks
EDIT:
In response to help clear a few things up:
This is being coded in MS SQL.
The data looks like so: (very basic example)
Date | Call_Volume
1/1/2012 | 4
What I would like is to have the Call_Volume summed up, I have queries that group it by week, and others that do it by month. I could pull all the dailies in and do this in Excel but the table has millions of rows so always best to reduce the size of my output.
I currently group by Week/Month and Year and union all so its 1 output. But that means I have 3 queries accessing the same table, large pain, very slow not efficient and that is fine but now I also need a YTD so its either 1 more query or if I could find a way to add it to the yearly query that would ideal:
So
DatePeriod | Sum_Calls
2011 Total | 40
2011 YTD | 12
2012 Total | 45
2012 YTD | 15
Hope this makes any sense.
SQL is built to do operations on rows, not columns (you select columns, of course, but aggregate operations are all on rows).
The most standard approach to this is something like:
SELECT SUM(your_table.sales), YEAR(your_table.sale_date)
FROM your_table
GROUP BY YEAR(your_table.sale_date)
Now you'll get one row for each year on record, with no limit to how many years you can process. If you're already grouping by another field, that's fine; you'll then get one row for each year in each of those groups.
Your program can then iterate over the rows and organize/render them however you like.
If you absolutely, positively must have columns instead, you'll be stuck with something like this:
SELECT SUM(IF(YEAR(date) = 2011, sales, 0)) AS total_2011,
SUM(IF(YEAR(date) = 2012, total_2012, 0)) AS total_2012
FROM your_table
If you're building the query programmatically you can add as many of those column criteria as you need, but I wouldn't count on this running very efficiently.
(These examples are written with some MySQL-specific functions. Corresponding functions exist for other engines but the syntax would be a little different.)