My goal is to get the monthly number of protests in Mexico reported between the years 2004 and 2020. I am using Google BigQuery to get this data from the GDELT database.
My problem is that I am getting different results when running the same query on different tables.
select
GlobalEventID
,MonthYear
,ActionGeo_Long
,ActionGeo_Lat
from
gdelt-bq.full.events_partitioned -- Returns 34650 records
--gdelt-bq.gdeltv2.events_partitioned -- Returns 93551 records
where
_PARTITIONTIME >= TIMESTAMP('2004-01-01')
and _PARTITIONTIME <= TIMESTAMP('2020-12-31')
and EventRootCode = '14'
and ActionGeo_CountryCode = 'MX'
;
Can you tell me which table I should use and why the query results differ from each other?
According to the GDELT documentation, gdeltv2 contains more events and is more up to date for recent years. However, it may not have been fully backpopulated all the way to 1979.
This query shows that only 20340 of the 93563 event IDs exist in both tables, so for such a large time range you may get the best results by using the v1 table before 2015 and the v2 table from 2015 onward.
SELECT COUNT(*)
FROM gdelt-bq.gdeltv2.events_partitioned g2
JOIN gdelt-bq.full.events_partitioned g1 ON g1.GlobalEventID = g2.GlobalEventID
WHERE g2._PARTITIONTIME >= TIMESTAMP('2004-01-01')
AND g2._PARTITIONTIME <= TIMESTAMP('2020-12-31')
AND g2.EventRootCode = '14'
AND g2.ActionGeo_CountryCode = 'MX'
AND g1._PARTITIONTIME >= TIMESTAMP('2004-01-01')
AND g1._PARTITIONTIME <= TIMESTAMP('2020-12-31')
AND g1.EventRootCode = '14'
AND g1.ActionGeo_CountryCode = 'MX'
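If you do combine the two tables with a date cutoff, a minimal sketch of the approach, using in-memory SQLite stand-in tables (the table names, columns, and rows here are made up for illustration, not real GDELT data):

```python
import sqlite3

# Sketch: take a v1-style table before 2015 and a v2-style table from
# 2015 onward, then count events per month. Stand-in data only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE v1_events (GlobalEventID INT, MonthYear TEXT);
CREATE TABLE v2_events (GlobalEventID INT, MonthYear TEXT);
INSERT INTO v1_events VALUES (1, '201404'), (2, '201501');
INSERT INTO v2_events VALUES (2, '201501'), (3, '201503');
""")
rows = con.execute("""
SELECT MonthYear, COUNT(*) AS n
FROM (
    SELECT GlobalEventID, MonthYear FROM v1_events WHERE MonthYear < '201500'
    UNION ALL
    SELECT GlobalEventID, MonthYear FROM v2_events WHERE MonthYear >= '201500'
)
GROUP BY MonthYear
ORDER BY MonthYear
""").fetchall()
print(rows)  # event 2 appears once, counted from the v2 side only
```

The cutoff on each branch keeps an event that exists in both tables from being counted twice.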
I just started using the SSIS tool and I need quick help to load data quarterly
Here's my scenario:
I came up with a query (source database: DB2) that extracts data from 2010-01-01 to 2021-12-31 (12 years of data). However, the data volume is too high (around 300 M rows), so I would like to split the source query to load the data quarter by quarter.
I tried year by year, and I am still getting more data than my SSIS server can handle.
I have created a year loop to loop through, and inside it a script task followed by a data flow task.
For example,
select * from tab1 where start_date >= '2010-01-01' and end_Date <= '2010-12-31'
I would like to loop this as four loads, one for each quarter:
select * from tab1 where start_date >= '2010-01-01' and end_Date <= '2010-03-31'
select * from tab1 where start_date >= '2010-04-01' and end_Date <= '2010-06-30'
select * from tab1 where start_date >= '2010-07-01' and end_Date <= '2010-09-30'
select * from tab1 where start_date >= '2010-10-01' and end_Date <= '2010-12-31'
Year-wise works perfectly fine; however, I can't work out how to load the data quarter-wise.
I want to pass each quarter's dates to the source query as parameters, so overall I need to loop 48 times (2010 to 2021 = 12 years * 4 quarters).
Any help is greatly appreciated.
I can send screenshots of what I have created for the year loop which is working perfectly fine.
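As a sketch of the parameter generation, here is one hypothetical way to produce the 48 quarter ranges in Python (the function name and structure are my own for illustration, not part of SSIS):

```python
from datetime import date, timedelta

def quarter_ranges(first_year, last_year):
    """Yield an inclusive (start, end) date pair for every calendar quarter."""
    for year in range(first_year, last_year + 1):
        for q in range(4):
            start = date(year, 3 * q + 1, 1)
            # End of quarter: the day before the next quarter begins.
            if q < 3:
                next_start = date(year, 3 * q + 4, 1)
            else:
                next_start = date(year + 1, 1, 1)
            yield start, next_start - timedelta(days=1)

ranges = list(quarter_ranges(2010, 2021))
print(len(ranges))  # 48 loads in total
print(ranges[0])    # first quarter: 2010-01-01 through 2010-03-31
```

Each pair would then be bound to the start_date/end_Date parameters of the DB2 source query on each loop iteration.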
I think the solution is to use the OFFSET FETCH clause to iterate over the data. Why loop over the data quarter by quarter when iterating by a fixed number of rows is more predictable (each iteration handles the same amount of data)? A step-by-step guide is provided in the following article:
SQL OFFSET FETCH Feature: Loading Large Volumes of Data Using Limited Resources with SSIS
One thing worth mentioning is that the article handles a SQL Server source, while you are using DB2, so you should take into consideration any syntax differences when using the OFFSET FETCH clause:
Getting top n to n rows from db2
Example: Using the OFFSET clause with a cursor
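For illustration, a minimal sketch of the row-count chunking idea in Python against an in-memory SQLite table (SQLite spells the clause LIMIT/OFFSET, while DB2 and SQL Server use OFFSET ... ROWS FETCH NEXT ... ROWS ONLY; the table and data here are made up):

```python
import sqlite3

# Build a small stand-in table with 10 rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab1 (id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO tab1 (payload) VALUES (?)",
                [(f"row {i}",) for i in range(10)])

chunk_size = 4
offset = 0
chunks = []
while True:
    # Each iteration fetches at most chunk_size rows, in a stable order.
    rows = con.execute(
        "SELECT id, payload FROM tab1 ORDER BY id LIMIT ? OFFSET ?",
        (chunk_size, offset)).fetchall()
    if not rows:
        break
    chunks.append(rows)
    offset += chunk_size

print([len(c) for c in chunks])  # [4, 4, 2]
```

The ORDER BY on a stable key matters: without it, paging by offset can skip or repeat rows between iterations.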
I'm doing some counts to validate a table XXX. I designed 2 queries to count people younger than 18 years.
The query I'm using is the following:
select count(distinct user.id) from user
left join sometable on sometable.id = user.someTableId
left join anotherTable on sometable.anotherTableId = anotherTable.id
where (sometable.id = 'x' or user.public = true)
AND (DATE_PART('year', age(current_date, user.birthdate)) >= 0 and DATE_PART('year', age(current_date, user.birthdate)) <= 18);
This query gives a count of 5000 (a made-up figure),
but this query, which is supposed to do the same:
select count(distinct user.id) from user
left join sometable on sometable.id = user.someTableId
left join anotherTable on sometable.anotherTableId = anotherTable.id
where (sometable.id = 'x' or user.public = true)
and (user.birthdate between '2002-08-26' and current_date)
SIDE NOTE: date '2002-08-26' is because today is 2020-08-26, so I subtracted 18 years from today's date.
is giving me a different count from the first one. (This last one gives the correct count, since it matches what I have in another NoSQL database.)
I would like to know what the difference between the queries is, or why the counts are different.
Thanks in advance.
In your first query, you are including everyone who has not yet turned 19.
In your second query, you are excluding a bunch of 18-year-olds who were born prior to 2002-08-26. For example, someone born on 2002-04-12 is still 18 years old; she won't turn 19 until 2021-04-12.
The easiest way to write this in Postgres is the following, which gives the same results as your first query:
where extract(year from age(now(), birthdate)) <= 18
If you really want to use the format of your 2nd query, then change your line to:
where (birthdate between '2001-08-27' and current_date)
I am trying to work out how many field engineers work over 48 hours a week, averaged over a 17-week period (by law you cannot average more than 48 hours a week over a 17-week period).
I managed to run the query for one engineer, but when I run it without an engineer filter the query is very slow.
I need to get the count of engineers working over 48 hours and the count under 48 hours, and then get the average time worked per week.
Note: I am doing a UNION on SPICEMEISTER & SMARTMEISTER because they are our old and new databases.
• How many field engineers go over the 48 hours
• How many field engineers are under the 48 hours
• What is the average time worked per week for engineers
SELECT DS_Date
,TechPersNo
FROM
(
SELECT DISTINCT
SMDS.EPL_DAT as DS_Date
,EN.pers_no as TechPersNo
FROM
[SpiceMeister].[FS_OTBE].[EngPayrollNumbers] EN
INNER JOIN
[SmartMeister].[Main].[PlusDailyKopf] SMDS
ON RIGHT(CAST(EN.[TechnicianID] AS CHAR(10)),5) = SMDS.PRPO_TECHNIKERNR
WHERE
SMDS.EPL_DAT >= '2017-01-01'
and
SMDS.EPL_DAT < '2018-03-01'
UNION ALL
SELECT DISTINCT
SPDS.DailySummaryDate as DS_Date
,EN.pers_no as TechPersNo
FROM
[SpiceMeister].[FS_OTBE].[EngPayrollNumbers] EN
INNER JOIN
[SpiceMeister].[FS_DS_BO].[DailySummaryHeader] SPDS
ON EN.TechnicianID = SPDS.TechnicianID
WHERE
SPDS.DailySummaryDate >= '2018-03-01'
) as Techa
where TechPersNo = 850009
) Tech
cross APPLY
-- ... (rest of the query truncated)
Filtered to a single engineer like this, it returns results fast.
The slowness is definitely due to the use of cross apply with a correlated subquery. This forces the computation on a per-row basis and prevents SQL Server from optimizing anything.
This seems more like it should be a 'group by' query, but I can see why you had trouble writing it, on account of the complex cumulative calculation: you need output by person and by date, but the average involves not the date in question but a date range ending on that date.
What I would do first is write a common query to capture your base data from the two datasets. That's what I do in the 'dailySummaries' common table expression below. Then I would join dailySummaries onto itself, matching by employee and selecting the date range required. From that, I would group by employee and date, aggregating over the date range.
with
dailySummaries as (
select techPersNo = en.pers_no,
ds_date = smds.epl_dat,
dtDif = datediff(minute, smds.abfahrt_zeit, smds.rueck_zeit)
from spiceMeister.fs_otbe.engPayrollNumbers en
join smartMeister.main.plusDailyKopf smds
on right(cast(en.technicianid as char(10)),5) = smds.prpo_technikernr
where smds.epl_dat < '2018-03-01'
union all
select techPersNo = en.pers_no,
dailySummaryDate,
datediff(minute,
iif(spds.leaveHome < spds.workStart, spds.leaveHome, spds.workStart),
iif(spds.arrivehome > spds.workEnd, spds.arrivehome, spds.workEnd)
)
from spiceMeister.fs_otbe.engPayrollNumbers en
join spiceMeister.fs_ds_bo.dailySummaryHeader spds
on en.TechnicianID = spds.TechnicianID
where spds.DailySummaryDate >= '2018-03-01'
)
select ds.ds_date,
ds.techPersNo,
AvgHr = convert(real, sum(dsPrev.dtDif)) / (60*17)
from dailySummaries ds
left join dailySummaries dsPrev
on ds.techPersNo = dsPrev.techPersNo
and dsPrev.ds_date between dateadd(day,-118, ds.ds_date) and ds.ds_date
where ds.ds_date >= '2017-01-01'
group by ds_date,
techPersNo
order by ds_date
I may have gotten a thing or two wrong in translation, but you get the idea.
In the future, post a more minimal example. The union of the two datasets from the separate databases is not central to the problem you were asking about, a lot of the date filtering isn't core to the question, and the special casting in your join logic is not important. These things cloud the real issue.
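For illustration only, the windowing idea (sum the minutes in the 119-day window ending on each date, then divide by 60 minutes and 17 weeks) can be sketched in plain Python with made-up sample data:

```python
from datetime import date, timedelta

# Stand-in shift data: (pers_no, work date) -> minutes worked that day.
shifts = {
    (850009, date(2017, 1, 2)): 480,
    (850009, date(2017, 1, 9)): 600,
    (850009, date(2017, 6, 1)): 300,  # falls outside the window below
}

def avg_hours_per_week(pers_no, as_of, data):
    """Average weekly hours over the 17-week (119-day) window ending as_of."""
    window_start = as_of - timedelta(days=118)  # 119 days inclusive
    minutes = sum(m for (p, d), m in data.items()
                  if p == pers_no and window_start <= d <= as_of)
    return minutes / 60 / 17

print(round(avg_hours_per_week(850009, date(2017, 1, 9), shifts), 3))
```

This mirrors what the self-join does set-wise: for each (engineer, date) row, gather all of that engineer's rows in the trailing window and aggregate them.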
Thank you for taking the time to read this; it is probably a very basic question. Most of the search results I found went a bit more in depth than the INNER JOIN operator.
Basically my question is this: I have a shipping table and a receiving table with dates for when each item was shipped or received. In the shipping table (tbl_shipping) the date column is labeled trans_out_date, and in the receiving table (tbl_receiving) the date column is labeled trans_in_date.
I can view transactions from either table in a user-entered form, but I want to populate a table with information pulled from both tables where the criteria are met. I.e., if the receiving table has 10 transactions in April and 5 in June, and the shipping table has 15 transactions in April and 10 in June, then when the user wants to see all transactions in June, it will populate the 15 transactions that occurred in June.
As of right now, I can pull only from 1 table with
SELECT *
FROM tbl_shipping
WHERE trans_out_date >= 'from_date'
AND trans_out_date <= 'to_date'
Would this be the appropriate syntax for what I am looking to achieve?
SELECT *
FROM tbl_shipping
INNER JOIN tbl_receiving ON tbl_shipping.trans_out_date = tbl_receiving.trans_in_date
WHERE
tbl_shipping.trans_out_date >= 'from_date'
AND tbl_shipping.trans_out_date <= 'to_date'
Thank you again in advance for reading this.
You appear to want union all rather than a join:
SELECT s.item, s.trans_out_date as dte, 'shipped' as which
FROM tbl_shipping S
WHERE s.trans_out_date >= ? AND
s.trans_out_date <= ?
UNION ALL
SELECT r.item, r.trans_in_date as dte, 'received' as which
FROM tbl_receiving r
WHERE r.trans_in_date >= ? AND
r.trans_in_date <= ?
ORDER BY dte;
Notes:
A JOIN can cause problems due to data that goes missing (because dates don't line up) or data that gets duplicated (because there are multiple dates).
The ? is for a parameter. If you are calling this from an application, use parameters!
You can include additional columns for more information in the result set.
This may not be the exact result format you want. If not, ask another question with sample data and desired results.
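A runnable sketch of the UNION ALL approach against in-memory SQLite stand-in tables (the sample rows are made up):

```python
import sqlite3

# Two stand-in tables mirroring tbl_shipping and tbl_receiving.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl_shipping (item TEXT, trans_out_date TEXT);
CREATE TABLE tbl_receiving (item TEXT, trans_in_date TEXT);
INSERT INTO tbl_shipping VALUES ('widget', '2020-06-03'), ('gadget', '2020-04-10');
INSERT INTO tbl_receiving VALUES ('widget', '2020-06-15');
""")
# One row per transaction in the window; 'which' records the source table.
rows = con.execute("""
SELECT item, trans_out_date AS dte, 'shipped' AS which
FROM tbl_shipping WHERE trans_out_date BETWEEN ? AND ?
UNION ALL
SELECT item, trans_in_date AS dte, 'received' AS which
FROM tbl_receiving WHERE trans_in_date BETWEEN ? AND ?
ORDER BY dte
""", ('2020-06-01', '2020-06-30') * 2).fetchall()
print(rows)
```

Only the June rows come back, and each appears exactly once, which is what the join version could not guarantee.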
I have a query in which I need to get the third month after a given reporting date using SQL and then use it as part of the query. I am able to get all the months, but I specifically need the third month; how would I go about doing that? I know this is fairly easy to do in other languages, but is it possible in SQL?
SELECT REPORTING_MONTH, COUNT(*)
FROM database1 AS fb
JOIN (
--derived core set
SELECT service_no, subscription_id
FROM database2
WHERE REPORTING_MONTH = '2015-04-01' -- this is the reporting month
) AS c
ON fb.SERVICE_NO = c.service_no
AND fb.subscription_id = c.subscription_id
AND fb.REPORTING_MONTH = '2015-07-01' -- THIS SHOULD BE THE THIRD MONTH
AND fb.ACTIVE_BASE_IND_NEW = 1
GROUP BY 1
ORDER BY 1
For example, if the reporting month is '2015-04-01', I need the variable month to then be '2015-07-01' to be used as part of the query.
You don't specify the database you are using. A typical approach would be:
SELECT REPORTING_MONTH, COUNT(*)
FROM database1 fb JOIN
database2 c
ON fb.SERVICE_NO = c.service_no AND
c.REPORTING_MONTH = '2015-04-01' AND
fb.subscription_id = c.subscription_id AND
fb.REPORTING_MONTH = c.reporting_month + interval '3 month' AND
fb.ACTIVE_BASE_IND_NEW = 1
GROUP BY 1
ORDER BY 1;
The exact syntax for + interval '3 month' varies by database.
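If you end up computing the third month outside the database instead, the arithmetic can be sketched in Python (the add_months helper is hypothetical, written here for illustration, not a library function):

```python
from datetime import date

def add_months(d, n):
    """Shift a date forward n months, carrying into the next year as needed.
    Safe for first-of-month dates; day-of-month overflow is not handled."""
    month_index = d.month - 1 + n
    return date(d.year + month_index // 12, month_index % 12 + 1, d.day)

reporting_month = date(2015, 4, 1)
print(add_months(reporting_month, 3).isoformat())  # 2015-07-01
```

The computed value could then be bound as a parameter in place of the hard-coded '2015-07-01'.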
If the field REPORTING_MONTH is text, then you might have to use SUBSTRING (SQL Server) or MID (others).
If it's a proper date field, then perhaps DATEPART(month, fb.REPORTING_MONTH) = 3 will work?
My SQL is a bit rusty, but try those functions.