Cross-sectional to panel data (Stata) "repeated time values within panel" - dataframe

I am relatively new to Stata. I currently have a Reddit dataset in cross-sectional format, with each row representing a single Reddit post by a username; some usernames post several times per day, while others post only once or twice in the entire dataset.
* Example generated by -dataex-. For more info, type help dataex
clear
input float id str36 username int date
6 "(crash )" 19013
end
format %td date
I am interested in running a Heckman selection model, so I am trying to convert the data into a panel format. I created an ID variable per username as shown below:
egen id = group(username)
Then I ran this to declare the data as a panel, following the guideline here:
xtset id date
And I am receiving the following error: "repeated time values within panel". I am not sure how to solve this, because I believe that in my case repeated dates are not problematic: it is typical for social media users to post several times within the same day, which is my time unit in this dataset.
If I run the same code without the date variable, it works without any errors, but my understanding is that I need to use both variables for a panel format.

You could use a timestamp to handle this. There is usually one available in session data. Just make sure to store it as a double:
. clear
. input byte id int date double ts
id date ts
1. 1 0 0
2. 1 0 1000
3. 1 0 2000
4. end
. format %td date
. format %tc ts
. list, clean noobs
id date ts
1 01jan1960 01jan1960 00:00:00
1 01jan1960 01jan1960 00:00:01
1 01jan1960 01jan1960 00:00:02
. xtset id ts
Panel variable: id (strongly balanced)
Time variable: ts, 01jan1960 00:00:00 to 01jan1960 00:00:02, but with gaps
Delta: .001 seconds
. xtset id date
repeated time values within panel
r(451);
Alternatively, collapse to user x date level if your analysis permits it.

What Stata says is correct, and you confirm it. xtset with an identifier and a time variable will only work if each (identifier, time) combination occurs at most once. The only work-arounds are to combine or omit observations to match that required pattern -- or to xtset in terms of the identifier alone.
You are clearly right about the data -- repeated posts from individual users on the same day are a fact -- but Stata's rules for panel data aren't negotiable. Put more positively, what you are missing out on is applying models that don't make sense for your data structure anyway.
There isn't a Stata issue here, unless it is misunderstanding what Stata requires or wishing that it behaved differently.
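The "combine or omit" options can be illustrated outside Stata too. A minimal sketch in plain Python, on synthetic data: give each post a within-(user, day) sequence number -- the analogue of `bysort id date: gen seq = _n` in Stata -- so that the combined key is unique, the property xtset demands of its (panel, time) pair.

```python
from collections import defaultdict

# Synthetic (user_id, stata_day) pairs with repeats, like the Reddit data.
posts = [(1, 19013), (1, 19013), (1, 19013), (1, 19014), (2, 19013)]

# Assign a within-(user, day) sequence number to each post.
counter = defaultdict(int)
with_seq = []
for user, day in posts:
    counter[(user, day)] += 1
    with_seq.append((user, day, counter[(user, day)]))

# (user, day) repeats, but (user, day, seq) is unique.
print(with_seq)
```

The sequence number (or a timestamp built from it, as in the answer above) then serves as a valid, unique time value within each panel.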

Related

Histogram for time periods using SQLite (regular buckets 1h wide)

I have a series of visitor data and would like to represent its timespans as a histogram the same way Google Maps does.
My table has two columns, firstSeen and lastSeen. Both contain (I think) so-called Unix timestamps, such as 1581981627248.0 or 1581981629641.0; their type is REAL.
And I'm a little bit lost, I have to say; I'm not too experienced with SQL. Calculating the average stay time and so on was easy, but this is kind of weird.
I could easily do something like the following query:
SELECT
strftime('%H', datetime(round(firstseen / 1000), 'unixepoch', 'weekday 1')) as 'Hour'
FROM visitors;
This is already restricted to a single day (Monday), which I guess is okay.
Then I could count and group by the hour. But wouldn't this alone be wrong? I would only be counting firstSeen, but what about a visitor who was present from 08:59 to 09:00? As I understand it, their presence would have to be counted twice in that case: once for the 08:00-08:59 slot and once for the 09:00-09:59 slot. I also see another problem with empty slots: would it be possible to include these? Would that even make sense?
I hope it's clear what I'd like to accomplish and someone can point me in the right direction.
Edit:
Added an MRE:
CREATE TABLE `detections_2` (
`firstseen` REAL NOT NULL,
`lastseen` REAL NOT NULL
);
INSERT INTO detections_2(
firstSeen,
lastSeen
)
VALUES
(1581892607,1581892644),
(1581892607,1581892694),
(1581892607,1581892703),
(1581892607,1581892629),
(1581892607,1581892619),
(1581892607,1581892683),
(1581892607,1581892702),
(1581892607,1581892651),
(1581892607,1581892697),
(1581892607,1581892654),
(1581892607,1581892680),
(1581892607,1581892619),
(1581892607,1581892700),
(1581892607,1581892716),
(1581892607,1581892700),
(1581892607,1581892643),
(1581892607,1581892720),
(1581892607,1581892647),
(1581892607,1581892726),
(1581892607,1581892679),
(1581892607,1581892665),
(1581892607,1581892701),
(1581892607,1581892659),
(1581892607,1581892725),
(1581892607,1581892662),
(1581892607,1581892661),
(1581979007,1581979037),
(1581979007,1581979054),
(1581979007,1581979038),
(1581979007,1581979100),
(1581979007,1581979027),
(1581979007,1581979080),
(1581979007,1581979034),
(1581979007,1581979119),
(1581979007,1581979027),
(1581979007,1581979093),
(1581979007,1581979068),
(1581979007,1581979061),
(1581979007,1581979115),
(1581979007,1581979126),
(1581979007,1581979106),
(1581979007,1581979114),
(1581979007,1581979078),
(1581979007,1581979078),
(1581979007,1581979056),
(1581979007,1581979117),
(1581979007,1581979040),
(1581979007,1581979057),
(1581979007,1581979068),
(1581979007,1581979103);
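The overlap counting described above can be sketched with a recursive CTE (SQLite 3.8.3 or later), here driven from Python's sqlite3 on tiny synthetic rows rather than the real data: a visitor is counted in every hour bucket that their [firstseen, lastseen] span overlaps, and empty slots come out as 0. (If your timestamps are milliseconds, as values like 1581981627248.0 suggest, divide by 1000 first, as in the strftime query above; the MRE values look like seconds.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE detections_2 (firstseen REAL NOT NULL, lastseen REAL NOT NULL);
-- tiny synthetic rows (seconds since epoch): one visit spanning two hour
-- buckets, one short visit, one later visit
INSERT INTO detections_2 VALUES (0, 3700), (100, 200), (10800, 10900);
""")

rows = con.execute("""
WITH RECURSIVE
bounds(lo, hi) AS (         -- first bucket start and last timestamp
    SELECT CAST(MIN(firstseen) / 3600 AS INT) * 3600, MAX(lastseen)
    FROM detections_2
),
buckets(start) AS (         -- one row per hour between lo and hi
    SELECT lo FROM bounds
    UNION ALL
    SELECT start + 3600 FROM buckets, bounds WHERE start + 3600 <= hi
)
SELECT strftime('%H:00', b.start, 'unixepoch') AS hour,
       COUNT(d.firstseen)                      AS visitors
FROM buckets b
LEFT JOIN detections_2 d    -- overlap test: visit intersects the bucket
       ON d.firstseen < b.start + 3600 AND d.lastseen >= b.start
GROUP BY b.start
ORDER BY b.start
""").fetchall()

print(rows)
```

The LEFT JOIN keeps empty buckets in the result with a count of 0, and a visit that straddles an hour boundary is counted once in each bucket it touches.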

Using IF THEN in Access 2010 Query

I'm not very knowledgeable in coding of Access queries, so I hope someone can help with this issue.
I have a query (using the query builder) that has a field named RetrainInterval from table tblProcedures (this returns a number like 1, 3, 6, 12, etc.: the interval in months at which the particular document has to be retrained) and another field named Training/Qualification Date from table tblTrainingRecords.
I want the query to look at the RetrainInterval for a given record (record field is ClassID in tblProcedures) and then look at the Training/Qualification Date and calculate if that record should be in the query.
In a module I would do this:
IF RetrainInterval = 1 Then
    DateAdd("m", 1, [Training/Qualification Date])   <add to query if <= today() + 30>
ElseIf RetrainInterval = 3 Then
    DateAdd("m", 3, [Training/Qualification Date])   <add to query if <= today() + 30>
ElseIf ...
How can I translate this into something that would work in a query? My end goal is to generate a report showing which document class numbers are due within a specified time interval (say I enter 30 in a form textbox to mean any required training due within 30 days of the query), where all of the calculations are based on the last training date (stored in the training records table). I also want to make sure that I do not get multiple rows for the same class number, since there will be multiple training entries for each class; I just want the minimum (earliest) last training date. I hope I explained it well enough; it's hard to put into words without posting the entire database.
UPDATE
I think I have simplified this a bit after getting some rest. Here are two images: one of the current query, and one of the resulting report. I have been able to refine this a bit, but now my problem is that I only want each Class to show once on the report, not twice, even though I have multiple retrain due dates (because everything is looking at the table that holds the employee training data, which has multiple trainings for each Class number). I would like to show only one date, the oldest. Hope that makes sense.
Query - http://postimg.org/image/cpcn998zx/
Report - http://postimg.org/image/krl5945l9/
When RetrainInterval = 1, you add 1 month to [Training/Qualification Date].
When RetrainInterval = 3, you add 3 months to [Training/Qualification Date].
And so on.
The pattern appears to be that RetrainInterval is simply the number of months to add. If that is true, use RetrainInterval directly in your DateAdd() expression and don't bother with IF THEN:
DateAdd("m", RetrainInterval, [Training/Qualification Date])
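If that holds, the whole filter reduces to one expression plus a Min() grouping. A hedged sketch in plain Python on made-up records -- `add_months` is a hypothetical stand-in for Access's DateAdd("m", ...), and the record layout only mimics the fields named in the question:

```python
import datetime

def add_months(d, n):
    """Shift a date by n calendar months, clamping the day (31 Jan + 1m -> 28/29 Feb)."""
    month_index = d.month - 1 + n
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = [31, 29 if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0) else 28,
                31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1]
    return datetime.date(year, month, min(d.day, last_day))

records = [  # (ClassID, RetrainInterval in months, training date) -- made up
    ("A-1", 1, datetime.date(2013, 1, 15)),
    ("A-1", 3, datetime.date(2012, 11, 1)),
    ("B-2", 12, datetime.date(2012, 3, 1)),
]

today = datetime.date(2013, 2, 1)
window = 30  # days, like the form textbox

due = {}
for class_id, interval, trained in records:
    due_date = add_months(trained, interval)   # DateAdd("m", RetrainInterval, ...)
    if due_date <= today + datetime.timedelta(days=window):
        # keep only the earliest due date per class (the Min() grouping)
        due[class_id] = min(due.get(class_id, due_date), due_date)

print(sorted(due.items()))
```

In the Access query the same two ideas would be the DateAdd() expression as a criterion and a Totals query grouping on ClassID with Min() on the date.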
You cannot do that in a query. Been there, cursed that!
You can use IIf(2 > x, 1, 0), which returns 1 if the condition is true and 0 if it is false.
But you cannot return a criterion that way, as in IIf(2 > x, Cell > 2, Cell > 0); that is not possible. It will just return 0 if you try, I think, and it will not always give an error.
You have to use criteria!
I would do something like this picture:
I hope you can follow; if not, let me know.

Add all column values that are equal to row value SSRS

I have a table of records, each with its own creation date and closure date. I am trying to create a graph that shows the number of open and closed records on each day. So far, I've set up a binary matrix that displays a 1 if the record was open in a certain month and 0 otherwise. So, if I wanted to find the total for a certain week, I could just use a RunningValue to sum all the rows for a certain column. Unfortunately, I cannot seem to find a way to graph open and closed records on the same bar graph. So far, I've created a column in the query that has the number of the closing week. I assumed that I could just add these up where they equal the current week, but this doesn't seem to work. I used the following expression (the comparison is strange because I thought it might have something to do with comparing the two values to each other); obviously this is just me testing:
'=CINT(Fields!Ident_Week.Value) & " / " & Fields!Close_Week.Value & " = " & SUM(IIF(CINT(Fields!Ident_Week.Value)/CINT(Fields!Close_Week.Value)=1,1,0))'
I'm tempted now (embarrassingly so) to just create 52 variables and assign the values that way, but I thought I'd ask here first. What do you think the best way is to find the closed records created in a certain week? I'm using SSRS 2008 R2.
A sample of my dataset is below (only relevant information is displayed):
Ident_Week  Closed_Week  Ident_Date  Closed_Date  Jan  Feb  ...  Dec
1           3            1/1/13      1/15/13      1    0         0
I think you may be over-complicating the dataset a little.
Try using UNPIVOT as below:
http://sqlfiddle.com/#!3/b6270c/6
You should be able to do what you need with this. Let me know if you need any further explanation.
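A hedged sketch of the same unpivot idea in plain Python, with hypothetical week values: instead of maintaining 52 binary columns, treat each record as one "opened" event and one "closed" event, then take a running difference to get the open count per week.

```python
records = [  # (Ident_Week, Close_Week) -- made-up sample rows
    (1, 3),
    (1, 2),
    (2, 3),
]

# Unpivot: count opens and closes per week.
opened = {}
closed = {}
for ident_week, close_week in records:
    opened[ident_week] = opened.get(ident_week, 0) + 1
    closed[close_week] = closed.get(close_week, 0) + 1

# Records still open during a given week: running opens minus running closes.
open_during = {}
running = 0
for week in range(1, 4):
    running += opened.get(week, 0) - closed.get(week, 0)
    open_during[week] = running

print(opened, closed, open_during)
```

The per-week opened/closed pairs are exactly what a bar chart with two series needs, with no RunningValue gymnastics over a binary matrix.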

How to check data integrity within an SQL table?

I have a table for logging access data of a lab. The table structure is like this:
create table accesslog
(
userid int not null,
direction int not null,
accesstime datetime not null
);
This lab has only one gate, which is under access control, so users must first "enter" the lab before they can "leave". In my original design, I made the "direction" field a flag that is either 1 (for entering the lab) or -1 (for leaving the lab), so that I can use queries like:
SELECT SUM(direction) FROM accesslog;
to get the total user count within the lab. Theoretically, this worked, since "direction" will always follow the pattern 1 => -1 => 1 => -1 for any given userid.
But soon I found that log messages can be lost on the transmission path from the lab gate to the server, dropped either by a busy network or by hardware glitches. Of course I could harden the path with sequence numbers, ACKs, retransmissions, hardware redundancy, etc., but in the end I might still get something like this:
userid direction accesstime
-------------------------------------
1 1 2013/01/03 08:30
1 -1 2013/01/03 09:20
1 1 2013/01/03 10:10
1 -1 2013/01/03 10:50
1 -1 2013/01/03 13:40
1 1 2013/01/03 18:00
This is a recent log for user "1". It's clear that I lost one log message, for that user entering the lab sometime between 10:50 and 13:40. At the time of this query he is still in the lab, so there is no exit log after 2013/01/03 18:00 yet; that part is correct.
My question is: is there any way to find this kind of data inconsistency with a SQL command? There are 5000 users in my system and the lab operates 24 hours a day, so there is no "magic time" at which the lab is guaranteed to be empty. It would be horrible if I had to write code checking the continuity of the "direction" field line by line, user by user.
I know it's not possible to "fix" the log with the correct data. I just want to know "Oh, I have a data inconsistency issue for userid = 1" so that I can add a marked amending record to correct the final statistics.
Any advice would be appreciated, even changing the table structure would be OK.
Thanks.
Edit: sorry, I didn't mention the details.
Currently I'm using a mixed SQL solution. The table shown above is in MySQL, and it holds only the last 24 hours of logs, as the "real time" status for fast browsing.
Every day at 03:00 a pre-scheduled process, written in C++ on POSIX, is launched. It calculates the statistics, adds the daily summary to an Oracle DB via a TCP socket with a proprietary protocol, and then removes the old data from MySQL.
The Oracle part is not handled by me and I can do nothing about it. I just want to make sure that the final statistics for each day are correct.
The data size is about 200,000 records per day -- I know it sounds crazy, but it's true.
You didn't state your DBMS, so this is ANSI SQL (which works on most modern DBMS).
select userid,
direction,
accesstime,
case
when lag(direction) over (partition by userid order by accesstime) = direction then 'wrong'
else 'correct'
end as status
from accesslog
where userid = 1
For each row in accesslog you'll get a "status" column which indicates whether the row breaks the rule.
You can filter out those that are invalid using:
select *
from (
select userid,
direction,
accesstime,
case
when lag(direction) over (partition by userid order by accesstime) = direction then 'wrong'
else 'correct'
end as status
from accesslog
where userid = 1
) t
where status = 'wrong'
I don't think there is a way to enforce this kind of rule using constraints in the database (although I have the feeling that PostgreSQL's exclusion constraints could help here)
Why not use SUM() with a WHERE clause to filter by user?
If you get anything other than 0 or 1, then you surely have a problem.
OK, I figured it out. Thanks for the idea provided by a_horse_with_no_name.
My final solution is this query:
SELECT userid, COUNT(*), SUM(direction * rule) FROM (
SELECT userid, direction, #inout := #inout * -1 AS rule
FROM accesslog l, (SELECT #inout := -1) r
ORDER by userid, accesstime
) g GROUP by userid;
First I create a pattern with #inout that yields 1 => -1 => 1 => -1 down the rows, in the "rule" column. Then I compare the direction field with the rule column by taking their product.
It's OK even if there is an odd number of records for certain users, since each user is supposed to follow either the identical or the reversed pattern of "rule". So the sum of the products for a user should equal either COUNT(*) or -1 * COUNT(*).
By comparing SUM() and COUNT(), I know exactly which userids have gone wrong.
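The lag() approach can also be checked directly against the sample log; a sketch using Python's bundled sqlite3 (window functions need SQLite 3.25 or later):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accesslog (userid INT NOT NULL, direction INT NOT NULL,
                        accesstime TEXT NOT NULL);
INSERT INTO accesslog VALUES
 (1,  1, '2013-01-03 08:30'),
 (1, -1, '2013-01-03 09:20'),
 (1,  1, '2013-01-03 10:10'),
 (1, -1, '2013-01-03 10:50'),
 (1, -1, '2013-01-03 13:40'),  -- a -1 after a -1: an "enter" was lost
 (1,  1, '2013-01-03 18:00');
""")

# A row is inconsistent when its direction equals the previous direction
# for the same user -- the lag() rule from the answer above.
bad = con.execute("""
SELECT userid, direction, accesstime
FROM (
    SELECT userid, direction, accesstime,
           LAG(direction) OVER (PARTITION BY userid ORDER BY accesstime) AS prev
    FROM accesslog
)
WHERE prev = direction
""").fetchall()

print(bad)
```

On the sample data this flags exactly the 13:40 row, the first record after the lost "enter" message.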

Group by run when there is no run number in data (was Show how changing the length of a production run affects time-to-build)

It would seem that there is a much simpler way to state the problem. Please see Edit 2, following the sample table.
I have a number of different products on a production line, and I have the date that each product entered production. Each product has two identifiers: item number and serial number. I have the total number of labour hours for each product, by item number and by serial number (i.e. I can tell you how many hours went into each object that was manufactured, and what the average build time is for each kind of object).
I want to determine how (if) varying the length of production runs affects the average time it takes to build a product (item number). A production run is the sequential production of multiple serial numbers for a single item number. We have historical records going back several years with production runs varying in length from 1 to 30.
I think to achieve this, I need to be able to assign 'run id'. To me, that means building a query that sorts by start date and calculates a new unique value at each change in item number. If I knew how to do that, I could solve the rest of the problem on my own.
So that suggests a series of related questions:
Am I thinking about this the right way?
If I am on the right track, how do I generate those run id values? Calculate and store is an option, although I have a (misguided?) preference for direct queries. I know exactly how I would generate the run numbers in Excel, but I have a (misguided?) preference to do this in the database.
If I'm not on the right track, where might I find that track? :)
Edit:
Table structure (simplified) with sample data:
AutoID  Item    Serial  StartDate   Hours  RunID (proposed calculation)
1       Legend  1234    2010-06-06  10     1
3       Legend  1235    2010-06-07   9     1
2       Legend  1237    2010-06-08   8     1
4       Apex    1236    2010-06-09  12     2
5       Apex    1240    2010-06-10  11     2
6       Legend  1239    2010-06-11  10     3
7       Legend  1238    2010-06-12   8     3
I have shown that start date, serial, and autoID are mutually unrelated. I have shown the expectation that labour goes down as run length increases (but this is a 'fact' only via received wisdom, not data analysis). I have shown what I envision as the heart of the solution: a RunID that reflects sequential builds of a single item. I know that if I could get that runID, I could group by run to get counts, averages, totals, max, min, etc. In addition, I could do something like dividing each unit's hours by the hours of the first unit in its run to get the percentage change from the start of the run. At that point I could graph the trends associated with different run lengths, either globally across all items or on a per-item basis. (At least I think I could do all that. I might have to muck about a bit, but I think I could get it done.)
Edit 2: This problem would appear to be: how do I get the 'starting' member (earliest start date) of each run when I don't already have a runID? (The runID shown in the sample table does not exist and I was originally suggesting that being able to calculate runID was a potentially viable solution.)
AutoID Item
1 Legend
4 Apex
6 Legend
I'm assuming that having learned how to find the first member of each run that I would then be able to use what I've learned to find the last member of each run and then use those two results to get all other members of each run.
Edit 3: my version of a query that uses the AutoID of the first item in a run as the RunID for all units in a run. This was built entirely from samples and direction provided by Simon, who has the accepted answer. Using this as the basis for grouping by run, I can produce a variety of run statistics.
SELECT first_product_of_run.AutoID AS runID, run_sibling.AutoID AS itemID, run_sibling.Item, run_sibling.Serial, run_sibling.StartDate, run_sibling.Hours
FROM (SELECT first_of_run.AutoID, first_of_run.Item, first_of_run.Serial, first_of_run.StartDate, first_of_run.Hours
      FROM dbo.production AS first_of_run LEFT OUTER JOIN
           dbo.production AS earlier_in_run ON first_of_run.AutoID - 1 = earlier_in_run.AutoID AND
               first_of_run.Item = earlier_in_run.Item
      WHERE (earlier_in_run.AutoID IS NULL)) AS first_product_of_run LEFT OUTER JOIN
     dbo.production AS run_sibling ON first_product_of_run.Item = run_sibling.Item AND
         first_product_of_run.AutoID <= run_sibling.AutoID AND
         first_product_of_run.StartDate <= run_sibling.StartDate LEFT OUTER JOIN
     dbo.production AS product_between ON first_product_of_run.Item <> product_between.Item AND
         first_product_of_run.StartDate < product_between.StartDate AND
         product_between.StartDate < run_sibling.StartDate
WHERE (product_between.AutoID IS NULL)
Could you describe your table structure some more? If the "date that each product entered production" is a full timestamp, or if there is a sequential identifier across products, you can write queries to identify the first and last products of a run. From that, you can assign IDs to the runs or calculate their lengths.
Edit:
Once you've identified 1,4, and 6 as the start of a run, you can use this query to find the other IDs in the run:
select first_product_of_run.AutoID, run_sibling.AutoID
from first_product_of_run
left join production run_sibling on first_product_of_run.Item = run_sibling.Item
and first_product_of_run.AutoID <> run_sibling.AutoID
and first_product_of_run.StartDate < run_sibling.StartDate
left join production product_between on first_product_of_run.Item <> product_between.Item
and first_product_of_run.StartDate < product_between.StartDate
and product_between.StartDate < run_sibling.StartDate
where product_between.AutoID is null
first_product_of_run can be a temp table, table variable, or sub-query that you used to find the start of a run. The key is the where product_between.AutoID is null: it restricts the results to pairs where no different item was produced between them.
Edit 2, here's how to get the first of each run:
select first_of_run.AutoID
from
(
select product.AutoID, product.Item, MAX(previous_product.StartDate) as PreviousDate
from production product
left join production previous_product on product.AutoID <> previous_product.AutoID
and product.StartDate > previous_product.StartDate
group by product.AutoID, product.Item
) first_of_run
left join production earlier_in_run
on first_of_run.PreviousDate = earlier_in_run.StartDate
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
It's not pretty, and it will break if StartDate is not unique. The query could be simplified by adding a sequential, gap-free, unique identifier; in fact, that step will probably be necessary if StartDate is not unique. Here's how it would look:
select first_of_run.AutoID
from production first_of_run
left join production earlier_in_run
on (first_of_run.Sequence - 1) = earlier_in_run.Sequence
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
Using outer joins to find where things aren't still twists my brain, but it's a very powerful technique.
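On engines with window functions (SQL Server 2012+, SQLite 3.25+), the proposed RunID can also be computed without self-joins: flag each row whose Item differs from the previous row's Item (ordered by StartDate), then take a running sum of the flags. A sketch using Python's sqlite3 on the sample table from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE production (AutoID INT, Item TEXT, StartDate TEXT, Hours INT);
INSERT INTO production VALUES
 (1, 'Legend', '2010-06-06', 10),
 (3, 'Legend', '2010-06-07',  9),
 (2, 'Legend', '2010-06-08',  8),
 (4, 'Apex',   '2010-06-09', 12),
 (5, 'Apex',   '2010-06-10', 11),
 (6, 'Legend', '2010-06-11', 10),
 (7, 'Legend', '2010-06-12',  8);
""")

rows = con.execute("""
SELECT AutoID, Item,
       SUM(is_new) OVER (ORDER BY StartDate) AS RunID   -- running sum of flags
FROM (
    SELECT AutoID, Item, StartDate,
           -- 1 at each change of Item in StartDate order (NULL lag counts as a change)
           CASE WHEN Item = LAG(Item) OVER (ORDER BY StartDate)
                THEN 0 ELSE 1 END AS is_new
    FROM production
)
ORDER BY StartDate
""").fetchall()

print(rows)
```

On the sample data this reproduces the proposed RunID column (1,1,1,2,2,3,3), and grouping by it then gives run lengths, averages, and the other statistics directly.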