I am trying to create a dummy variable to identify the next five observations after a selection of cutoffs. The first method in the code below works, but it looks a bit messy and I'd like to be able to adjust the number of observations I'm creating dummies for without typing out the same expression 30 times (usually a sign I'm doing something the hard way).
Every time I put a macro into the indexing, i.e.
[_n-`i']
I get the following error:
_= invalid name
r(198);
I'd be very grateful for some advice.
sysuse auto.dta, replace
global cutoffs 3299 4424 5104 5788 10371
This works:
sort price
gen A=0
foreach x in $cutoffs {
replace A=1 if price==`x'
replace A=1 if price[_n-1]==`x'
replace A=1 if price[_n-2]==`x'
replace A=1 if price[_n-3]==`x'
replace A=1 if price[_n-4]==`x'
replace A=1 if price[_n-5]==`x'
}
This doesn't:
foreach x in $cutoffs {
forval `i' = 0/25 {
replace A=1 if price[_n-`i']==`x'
}
}
Any advice as to why?
In Stata terms no loops are needed here at all, except those tacit in generate and replace. You want to set a counter going each time immediately after you hit a cutoff, and then identify counter values between 1 and 5. Here's some technique:
sysuse auto.dta, clear
global cutoffs 3299,4424,5104,5788,10371
sort price
gen counter = 0 if inlist(price, $cutoffs)
replace counter = counter[_n-1] + 1 if missing(counter)
gen wanted = inrange(counter, 1, 5)
list price counter wanted
+---------------------------+
| price counter wanted |
|---------------------------|
1. | 3,291 . 0 |
2. | 3,299 0 0 |
3. | 3,667 1 1 |
4. | 3,748 2 1 |
5. | 3,798 3 1 |
|---------------------------|
6. | 3,799 4 1 |
7. | 3,829 5 1 |
8. | 3,895 6 0 |
9. | 3,955 7 0 |
10. | 3,984 8 0 |
|---------------------------|
11. | 3,995 9 0 |
12. | 4,010 10 0 |
13. | 4,060 11 0 |
14. | 4,082 12 0 |
15. | 4,099 13 0 |
|---------------------------|
16. | 4,172 14 0 |
17. | 4,181 15 0 |
18. | 4,187 16 0 |
19. | 4,195 17 0 |
20. | 4,296 18 0 |
|---------------------------|
21. | 4,389 19 0 |
22. | 4,424 0 0 |
23. | 4,425 1 1 |
24. | 4,453 2 1 |
25. | 4,482 3 1 |
|---------------------------|
26. | 4,499 4 1 |
27. | 4,504 5 1 |
28. | 4,516 6 0 |
29. | 4,589 7 0 |
30. | 4,647 8 0 |
|---------------------------|
31. | 4,697 9 0 |
32. | 4,723 10 0 |
33. | 4,733 11 0 |
34. | 4,749 12 0 |
35. | 4,816 13 0 |
|---------------------------|
36. | 4,890 14 0 |
37. | 4,934 15 0 |
38. | 5,079 16 0 |
39. | 5,104 0 0 |
40. | 5,172 1 1 |
|---------------------------|
41. | 5,189 2 1 |
42. | 5,222 3 1 |
43. | 5,379 4 1 |
44. | 5,397 5 1 |
45. | 5,705 6 0 |
|---------------------------|
46. | 5,719 7 0 |
47. | 5,788 0 0 |
48. | 5,798 1 1 |
49. | 5,799 2 1 |
50. | 5,886 3 1 |
|---------------------------|
51. | 5,899 4 1 |
52. | 6,165 5 1 |
53. | 6,229 6 0 |
54. | 6,295 7 0 |
55. | 6,303 8 0 |
|---------------------------|
56. | 6,342 9 0 |
57. | 6,486 10 0 |
58. | 6,850 11 0 |
59. | 7,140 12 0 |
60. | 7,827 13 0 |
|---------------------------|
61. | 8,129 14 0 |
62. | 8,814 15 0 |
63. | 9,690 16 0 |
64. | 9,735 17 0 |
65. | 10,371 0 0 |
|---------------------------|
66. | 10,372 1 1 |
67. | 11,385 2 1 |
68. | 11,497 3 1 |
69. | 11,995 4 1 |
70. | 12,990 5 1 |
|---------------------------|
71. | 13,466 6 0 |
72. | 13,594 7 0 |
73. | 14,500 8 0 |
74. | 15,906 9 0 |
+---------------------------+
In fact, your text says "the next five observations after", but your code flags not only those observations but the cutoff observation too. If you want the cutoff observation included as well, use inrange(counter, 0, 5).
Understanding the principles explained here is crucial for this question.
For inrange() and inlist() see their help entries and/or this paper.
So, what did you do wrong?
This line
forval `i' = 0/25 {
is illegal unless you have previously defined the local macro i (and rather odd style even then). You probably meant
forval i = 0/25 {
although where the 25 comes from, given your problem statement, is unclear to me. The error message isn't especially helpful, but Stata is struggling to make sense of code with a hole in it, given that the local macro implied by your code is not defined.
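For readers who think more easily in Python, the counter technique above can be sketched with pandas. This is only an analogue of the Stata logic, not a translation of the auto dataset; the prices and cutoffs below are illustrative.

```python
import pandas as pd

# Illustrative sorted prices, with two cutoffs present
prices = [3291, 3299, 3667, 3748, 3798, 3799, 3829, 3895, 4424, 4425]
cutoffs = {3299, 4424}

df = pd.DataFrame({"price": prices})

# A new group starts at every cutoff row; cumcount measures the
# distance from the most recent cutoff, mirroring the Stata counter
is_cut = df["price"].isin(cutoffs)
grp = is_cut.cumsum()
df["counter"] = df.groupby(grp).cumcount().astype(float)
# Rows before the first cutoff have no preceding cutoff
df.loc[grp == 0, "counter"] = float("nan")
df["wanted"] = df["counter"].between(1, 5).astype(int)
```

As in the Stata version, `between(1, 5)` flags the five observations after each cutoff; use `between(0, 5)` to include the cutoff row itself.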
I am trying to summarize sales data by month, sales region, and type. The problem is that the results change when I try to group by year.
My simplified query is as follows:
SELECT
DAB700.DATUM,DAB000.X_REGION,DAB700.BELEG_ART, // the date, sales region, order type
// calculate the number of orders per month
COUNT (DISTINCT CASE WHEN MONTH(DAB700.DATUM) = 1 THEN DAB700.BELEG_NR END) as jan,
COUNT (DISTINCT CASE WHEN MONTH(DAB700.DATUM) = 2 THEN DAB700.BELEG_NR END) as feb,
COUNT (DISTINCT CASE WHEN MONTH(DAB700.DATUM) = 3 THEN DAB700.BELEG_NR END) as mar
FROM "DAB700.ADT" DAB700
left join "DAB050.ADT" DAB050 on DAB700.BELEG_NR = DAB050.ANUMMER // join to table 050, to pull in order info
left join "DF030000.DBF" DAB000 on DAB050.KDNR = DAB000.KDNR // join table 000 to table 050, to pull in customer info
left join "DAB055.ADT" DAB055 on DAB050.ANUMMER = left (DAB055.APNUMMER,6)// join table 055 to table 050, to pull in product info
WHERE (DAB700.BELEG_ART = 10 OR DAB700.BELEG_ART = 20) AND (DAB700.DATUM>={d '2021-01-01'}) AND (DAB700.DATUM<={d '2021-01-11'}) AND DAB055.ARTNR <> '999999' AND DAB055.ARTNR <> '999996' AND DAB055.TERMIN <> 'KW.22.22' AND DAB055.TERMIN <> 'KW.99.99' AND DAB050.AUF_ART = 0
group by DAB700.DATUM,DAB000.X_REGION,DAB700.BELEG_ART
This returns the following data, which is correct (manually checked):
| DATUM | X_REGION | BELEG_ART | jan | feb | mar |
|------------|----------|-----------|-----|-----|-----|
| 04.01.2021 | 1 | 10 | 3 | 0 | 0 |
| 04.01.2021 | 3 | 10 | 2 | 0 | 0 |
| 04.01.2021 | 4 | 10 | 1 | 0 | 0 |
| 04.01.2021 | 4 | 20 | 1 | 0 | 0 |
| 04.01.2021 | 6 | 20 | 2 | 0 | 0 |
| 05.01.2021 | 1 | 10 | 1 | 0 | 0 |
and so on....
The total number of records for Jan is 117 (correct).
Now I want to summarize the data in one row (for example, data grouped by region and type),
so I change my code so that I have:
SELECT
YEAR(DAB700.DATUM),
and
group by YEAR(DAB700.DATUM)
the rest of the code stays the same.
Now my results are:
| EXPR | X_REGION | BELEG_ART | jan | feb | mar |
|------|----------|-----------|-----|-----|-----|
| 2021 | 1 | 10 | 16 | 0 | 0 |
| 2021 | 1 | 20 | 16 | 0 | 0 |
| 2021 | 2 | 10 | 19 | 0 | 0 |
| 2021 | 2 | 20 | 22 | 0 | 0 |
| 2021 | 3 | 10 | 12 | 0 | 0 |
| 2021 | 3 | 20 | 6 | 0 | 0 |
Visually it is correct. But, the total count for January is now 116. A difference of 1. What am I doing wrong?
How can I keep the results from the first code - but have it presented as per the 2nd set?
You count distinct BELEG_NR. This is what makes the difference. Let's look at an example. Let's say your table contains four rows:
| DATUM      | X_REGION | BELEG_ART | BELEG_NR |
|------------|----------|-----------|----------|
| 04.01.2021 | 1        | 10        | 100      |
| 04.01.2021 | 1        | 10        | 200      |
| 05.01.2021 | 1        | 10        | 100      |
| 05.01.2021 | 1        | 10        | 300      |
That gives you per day, region and belegart:
| DATUM      | X_REGION | BELEG_ART | COUNT(DISTINCT BELEG_NR) |
|------------|----------|-----------|--------------------------|
| 04.01.2021 | 1        | 10        | 2                        |
| 05.01.2021 | 1        | 10        | 2                        |
and per year, region, and belegart:

| YEAR | X_REGION | BELEG_ART | COUNT(DISTINCT BELEG_NR) |
|------|----------|-----------|--------------------------|
| 2021 | 1        | 10        | 3                        |
The BELEG_NR 100 never appears more than once per day, so every instance gets counted. But it appears twice for the year, so it gets counted once instead of twice.
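The effect is easy to reproduce with a few rows in SQLite from Python. This is a sketch using the four example rows above; table and column names are taken from the example, not from the original database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (datum TEXT, x_region INT, beleg_art INT, beleg_nr INT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    ("2021-01-04", 1, 10, 100),
    ("2021-01-04", 1, 10, 200),
    ("2021-01-05", 1, 10, 100),
    ("2021-01-05", 1, 10, 300),
])

# Per day: beleg_nr 100 is distinct within each day, so it counts on both days
per_day = con.execute(
    "SELECT datum, COUNT(DISTINCT beleg_nr) FROM t GROUP BY datum"
).fetchall()

# Per year: beleg_nr 100 collapses to one distinct value for the whole year
per_year = con.execute(
    "SELECT strftime('%Y', datum), COUNT(DISTINCT beleg_nr) "
    "FROM t GROUP BY strftime('%Y', datum)"
).fetchall()
```

The per-day counts sum to 4, while the per-year distinct count is 3: exactly the off-by-one the question observes, scaled down.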
I would like to create a new column based on various conditions
Let's say I have a df where column A can equal any of the following: ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other'], column B has numeric values from 0-30.
I'm trying to get column C to be 'Moderate' if A = 'Single' or 'Multiple', and if it equals anything else, to consider the values in column B. If column A != 'Single' or 'Multiple', column C will equal Moderate if 3 < B > 19 and 'High' if B>=19.
I have tried various loop combinations but I can't seem to get it. Any help?
trial = []
for x in df['A']:
if x == 'Single' or x == 'Multiple':
trial.append('Moderate')
elif x != 'Single' or x != 'Multiple':
if df['B']>19:
trial.append('Test')
df['trials'] = trial
Thank you kindly,
Denisse
It would be good if you provided some sample data. But with some that I created, you can see how to apply a function to each row of your DataFrame.
Data
import pandas as pd

valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
           'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})
| | A | B |
|---:|:-----------|----:|
| 0 | Single | 0 |
| 1 | Multiple | 10 |
| 2 | Commercial | 20 |
| 3 | Domestic | 25 |
| 4 | Other | 30 |
| 5 | Single | 25 |
| 6 | Multiple | 15 |
| 7 | Commercial | 10 |
| 8 | Domestic | 5 |
| 9 | Other | 3 |
Function to apply
You don't specify what happens if column B is less than or equal to 3, so I suppose that C will be 'Low'. Adapt the function as you need. Also, there may be a typo in your question where you say '3 < B > 19'; I changed it to '3 < B < 19'.
def my_function(x):
if x['A'] in ['Single', 'Multiple']:
return 'Moderate'
else:
if x['B'] <= 3:
return 'Low'
elif 3 < x['B'] < 19:
return 'Moderate'
else:
return 'High'
New column
With the DataFrame and the new function, you can apply it to each row using the apply method with the argument axis=1:
df['C'] = df.apply(my_function, axis=1)
| | A | B | C |
|---:|:-----------|----:|:---------|
| 0 | Single | 0 | Moderate |
| 1 | Multiple | 10 | Moderate |
| 2 | Commercial | 20 | High |
| 3 | Domestic | 25 | High |
| 4 | Other | 30 | High |
| 5 | Single | 25 | Moderate |
| 6 | Multiple | 15 | Moderate |
| 7 | Commercial | 10 | Moderate |
| 8 | Domestic | 5 | Moderate |
| 9 | Other | 3 | Low |
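As a usage note, the same logic can also be vectorized with numpy.select, which avoids the per-row apply and scales better to large frames. This is a sketch on a smaller sample with the same categories and thresholds as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Single", "Multiple", "Commercial", "Domestic", "Other"],
    "B": [0, 10, 20, 5, 3],
})

# Conditions are checked in order; the first match wins,
# and anything left over (B <= 3) falls through to the default
conditions = [
    df["A"].isin(["Single", "Multiple"]),
    df["B"] >= 19,
    df["B"] > 3,
]
df["C"] = np.select(conditions, ["Moderate", "High", "Moderate"], default="Low")
```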
I am trying to implement a MUX (Multiplexor) gate in the nand2tetris course. I first tried myself, and I got an error. But no matter what I changed I always got the error. So I tried checking some code online, and this is what most people use:
CHIP Mux {
IN a, b, sel;
OUT out;
PARTS:
Not(in=sel, out=nsel);
And(a=sel, b=b, out=c1);
And(a=nsel, b=a, out=c2);
Or(a=c1, b=c2, out=out);
}
But even when I try this code I still get the following error:
What I get as a truth table:
| a | b | sel | out |
|---|---|-----|-----|
| 0 | 0 | 0   | 0   |
| 0 | 0 | 1   | 0   |
| 0 | 1 | 0   | 0   |
| 0 | 1 | 1   | 0   |
What I should get:
| a | b | sel | out |
|---|---|-----|-----|
| 0 | 0 | 0   | 0   |
| 0 | 0 | 1   | 0   |
| 0 | 1 | 0   | 0   |
| 0 | 1 | 1   | 1   |
| 1 | 0 | 0   | 1   |
| 1 | 0 | 1   | 0   |
| 1 | 1 | 0   | 1   |
| 1 | 1 | 1   | 1   |
I have the newest software suite as of 2020-01-13.
From what can be seen your input pins are:
a = 0
b = 1
sel = 1
Your internal pins are:
nsel = 1
c1 = 1
c2 = 0
All as expected so far.
Expected out = 1 in this case and you get out = 0. Test script stops at this point because of failure.
Now there might be two reasons for that:
1) You didn't load the correct Mux.hdl, because if you computed Or(c1, c2) you would get 1, which is correct. If you placed an And gate in place of the Or, that would explain the failure.
2) Your implementation of Or.hdl is incorrect. Mux uses your version of the Or gate if such a file is present in the same directory.
So first verify your code in the Hardware Simulator, then verify your implementation of Or.hdl. The latter you can do by temporarily removing Or.hdl from the project directory; the Hardware Simulator will then load the built-in version of the Or gate.
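To convince yourself the gate logic itself is right, the Mux equation out = (sel AND b) OR (NOT sel AND a) can be checked exhaustively in a few lines of Python. This is a sketch independent of the HDL tooling, with the same internal signal names as the chip above:

```python
from itertools import product

def mux(a, b, sel):
    # Same structure as the HDL: one Not, two Ands, one Or
    nsel = 1 - sel
    c1 = sel & b
    c2 = nsel & a
    return c1 | c2

# Exhaustive truth table: a Mux outputs a when sel == 0 and b when sel == 1
table = [(a, b, sel, mux(a, b, sel)) for a, b, sel in product((0, 1), repeat=3)]
```

If this logic checks out but the simulator still fails, the problem is in the loaded files, not the equation, which is the answer's point.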
I'm trying to solve the bus routing problem in postgresql which requires visibility of previous and next rows. Here is my solution.
Step 1) Have one edges table which represents all the edges (the source and target represent vertices (bus stops):
postgres=# select id, source, target, cost from busedges;
id | source | target | cost
----+--------+--------+------
1 | 1 | 2 | 1
2 | 2 | 3 | 1
3 | 3 | 4 | 1
4 | 4 | 5 | 1
5 | 1 | 7 | 1
6 | 7 | 8 | 1
7 | 1 | 6 | 1
8 | 6 | 8 | 1
9 | 9 | 10 | 1
10 | 10 | 11 | 1
11 | 11 | 12 | 1
12 | 12 | 13 | 1
13 | 9 | 15 | 1
14 | 15 | 16 | 1
15 | 9 | 14 | 1
16 | 14 | 16 | 1
Step 2) Have a table which represents bus details like from time, to time, edge etc.
NOTE: I have used integer format for "from" and "to" column for faster results as I can do an integer query, but I can replace it with any better format if available.
postgres=# select id, "busedgeId", "busId", "from", "to" from busedgetimes;
id | busedgeId | busId | from | to
----+-----------+-------+-------+-------
18 | 1 | 1 | 33000 | 33300
19 | 2 | 1 | 33300 | 33600
20 | 3 | 2 | 33900 | 34200
21 | 4 | 2 | 34200 | 34800
22 | 1 | 3 | 36000 | 36300
23 | 2 | 3 | 36600 | 37200
24 | 3 | 4 | 38400 | 38700
25 | 4 | 4 | 38700 | 39540
Step 3) Use dijkstra algorithm to find the nearest path.
Step 4) Get the upcoming buses from the busedgetimes table in the earliest first order for the nearest path detected by dijkstra algorithm.
Problem: I am finding it difficult to make the query for the Step 4.
For example: if I get the path as edges 2, 3, 4 to travel from source vertex 2 to target vertex 5 in the above records, getting the first bus for the first edge is not hard, as I can simply query with from < 'expected departure' order by from desc. But for the second edge, the from condition requires the to time of the first result row. The query also needs a filter on edge ids.
How can I achieve this in a single query?
I am not sure if I understood your problem correctly, but getting values from other rows can be done with window functions (https://www.postgresql.org/docs/current/static/tutorial-window.html):
demo: db<>fiddle
SELECT
id,
lag("to") OVER (ORDER BY id) as prev_to,
"from",
"to",
lead("from") OVER (ORDER BY id) as next_from
FROM bustimes;
The lag function moves the value of the previous row into the current one. The lead function does the same with the next row. So you are able to calculate a difference between last arrival and current departure or something like that.
Result:
| id | prev_to | from  | to    | next_from |
|----|---------|-------|-------|-----------|
| 18 |         | 33000 | 33300 | 33300     |
| 19 | 33300   | 33300 | 33600 | 33900     |
| 20 | 33600   | 33900 | 34200 | 34200     |
| 21 | 34200   | 34200 | 34800 | 36000     |
| 22 | 34800   | 36000 | 36300 |           |
Please notice that "from" and "to" are reserved words in PostgreSQL. It would be better to choose other names.
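The same lag/lead behaviour can be sketched quickly from Python with SQLite, which supports these window functions from version 3.25 on. The table name and columns follow the answer; "from" and "to" are quoted because they are reserved words:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE bustimes (id INT, "from" INT, "to" INT)')
con.executemany("INSERT INTO bustimes VALUES (?, ?, ?)", [
    (18, 33000, 33300),
    (19, 33300, 33600),
    (20, 33900, 34200),
])

# lag() pulls the previous row's arrival, lead() the next row's departure
rows = con.execute('''
    SELECT id,
           lag("to")    OVER (ORDER BY id) AS prev_to,
           "from",
           "to",
           lead("from") OVER (ORDER BY id) AS next_from
    FROM bustimes
''').fetchall()
```

The first row's prev_to and the last row's next_from come back as NULL, matching the blanks in the result table above.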
Yesterday I asked this question: SQL: How to add values according to index columns but I found out that my problem is a bit more complicated:
I have an array like this
id | value | position | relates_to_position | type
19 | 100   | 2        | NULL                | 1
19 | 50    | 6        | NULL                | 2
19 | 20    | 7        | 6                   | 3
20 | 30    | 3        | NULL                | 2
20 | 10    | 4        | 3                   | 3
From this I need to create the resulting table, which adds all the lines where the relates_to_position value matches the position value, but only for lines sharing the same id!
The resulting table should be
id | value | position | type
19 | 100   | 2        | 1
19 | 70    | 6        | 2
20 | 40    | 3        | 2
I am using Oracle 11. There is only one level of recursion, meaning a line would not refer to a line which has the relates_to_position field set.
I think the following query will do this:
select id, coalesce(relates_to_position, position) as position,
sum(value) as value, min(type) as type
from t
group by id, coalesce(relates_to_position, position);
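A quick way to sanity-check the query is to run it against the sample rows; here the check is done in SQLite from Python, and COALESCE, SUM, MIN, and GROUP BY behave the same way in Oracle 11:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE t (id INT, value INT, position INT, "
    "relates_to_position INT, type INT)"
)
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", [
    (19, 100, 2, None, 1),
    (19, 50, 6, None, 2),
    (19, 20, 7, 6, 3),
    (20, 30, 3, None, 2),
    (20, 10, 4, 3, 3),
])

# Fold each row onto the position it relates to (or its own position),
# then aggregate within each (id, position) pair
rows = con.execute("""
    SELECT id,
           COALESCE(relates_to_position, position) AS position,
           SUM(value) AS value,
           MIN(type)  AS type
    FROM t
    GROUP BY id, COALESCE(relates_to_position, position)
    ORDER BY id, position
""").fetchall()
```

The output matches the expected table in the question: (19, 2, 100, 1), (19, 6, 70, 2), (20, 3, 40, 2).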