Left recursion elimination questions - grammar

I'm working on removing left recursion from three grammars:
1. A->Ab | aC
B->BaBB | BA
C->bC | BA
2. T->Txxy | TaabT | TTa
3. A-> BA | Baa
B-> Ab | Abb
I've tried to do it, but I'm not sure about my answers.
I have no idea how to do the first one, and I think the third one will fail. Is that right?
How can I fix it?
Could someone please explain it in detail?

I have written up a solution. I didn't have time to type out the whole thing, and I wasn't sure whether this question would stay available, since it may be closed as off-topic.
Check the solution for the first grammar here: first grammar solution
For the second grammar, either the grammar is incomplete or left recursion can't be meaningfully removed: every production of T begins with T, and there is neither a null production nor a production consisting only of terminals. Every derivation recurses forever, so T never derives a string of terminals.
For the third grammar, we can do
A-> BA | Baa
B-> Ab | Abb
Substitute B's productions into A:
A-> AbA | Abaa | AbbA | Abbaa
Now every production is again left-recursive, so no derivation can ever terminate. You need either a null production or a production consisting only of terminals.
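As a rough illustration of the point above, here is a minimal Python sketch of the textbook rule for immediate left recursion (A -> Aα | β becomes A -> βA', A' -> αA' | ε). The function name eliminate_immediate and the list-of-symbols encoding are assumptions made for this example, not part of the original solution.

```python
def eliminate_immediate(nt, productions):
    """Remove immediate left recursion from one nonterminal.

    productions is a list of right-hand sides, each a list of symbols.
    An empty list stands for the null (epsilon) production.
    """
    # alpha parts: what follows the leading nt in left-recursive productions
    recursive = [p[1:] for p in productions if p and p[0] == nt]
    # beta parts: productions that do not start with nt
    others = [p for p in productions if not p or p[0] != nt]
    if not recursive:
        return {nt: productions}
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],
    }


# A -> Ab | aC from the first grammar
grammar_a = eliminate_immediate("A", [["A", "b"], ["a", "C"]])

# T -> Txxy | TaabT | TTa from the second grammar
grammar_t = eliminate_immediate(
    "T",
    [["T", "x", "x", "y"], ["T", "a", "a", "b", "T"], ["T", "T", "a"]],
)

print(grammar_a)
print(grammar_t)
```

For A the rule has a β to work with (aC). For T there is none, so T ends up with no productions at all, which is the mechanical counterpart of saying the grammar can never terminate.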

How to use regex extract to grab text between specific characters?

I'm currently trying out different kinds of formulas for REGEXP_EXTRACT, playing around with it to understand it fully. Below is an example of the data that I'm using and the current code that I'm using to grab what I need. (Please critique my code if it can be written better, as I'm still learning REGEXP_EXTRACT.)
Sample_Data
AAAA;BBBB;CCCC;A1=1234;DDDD;EEEE
FFFF;GGGG;A1=2345;A2=4567,2345;RRRR;KKKK
SSSS;TTTT;UUUU;VVVV;A1=3456;GGGG;UUUU
UUUU:WWWW;QQQQ;IIII;A1=9876;A2=7654,7890;UUUU
The current code that I have is:
SELECT
REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A1=(\d*)') AS A1,
REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*)(?:;)') AS A2,
SPLIT(REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*\,\d*)(?:;)'), ",")[offset(1)] AS A2_v1
FROM
db.Sample
The output that I get is:
A1 | A2 | A2_v1
1234 | NULL | NULL
2345 | 4567 | 2345
3456 | NULL | NULL
9876 | 7654 | 7890
The output is what I would expect. However, I have a few questions about it. As you can see in output row 2:
2345 | 4567 | 2345
2345 appears twice. Is there a way to make it show 2345 only once, something like:
2345 | 4567 | NULL
My thought process is to use a CASE WHEN that checks whether the REGEXP_EXTRACT results match and, if they do, returns NULL instead. Is there a better way of doing this, or would this be the best approach?
My second question is: let's say we have the following sample data:
AAAA;GGGG;DDDD;A1=1234;A2=7890,1234,3456;DDDD
BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,8976;WWWW;GGGG
CCCC;EEEE;A1=6789;A2=34567,8901,3456,12345;TTTT
With the current formulas that I have, it would work to get A1 and a part of A2 only. But, how would I convert the formula to be able to pick up all digits separated by ,? The end result that I'm looking for is the following:
A1 | A2 | A2_v1 | A2_v2 | A2_v3
1234 | 7890 | 1234 | 3456 | NULL
2345 | 8907 | 1234 | 4567 | 8976
6789 | 34567| 8901 | 3456 | 12345
How would I make this work properly? Would it be a variation of the:
SPLIT(REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*\,\d*)(?:;)'), ",")[offset(1)] AS A2_v1
And have a different offset? OR is there a different kind of formula that would be capable of doing this?
Any help would be much appreciated!!
To avoid repeating the numbers, I think your idea of CASE ... WHEN is a good approach. Here, the IF conditional can be used as a shorthand. Making the original query a subquery makes it easier to compare the values.
For A2, REGEXP_EXTRACT does not accept more than one capturing group, so all the digits can be captured by making the regex more permissive. For example, the regex:
'A2=([\d,]*)'
will also match expressions like A2=1,2,3,4,5, which may or may not be acceptable in your scenario. The regex can be tightened to match exactly what you're looking for; however, it will need to be much longer, or it will need more than one capturing group. Example:
'A2=((\d{4},?)+)'
This regex matches one or more sequences of four digits, each followed by zero or one commas. To use it, you can call REGEXP_REPLACE instead, keeping the desired part and removing everything else. However, this approach seems to complicate things more than it simplifies them.
Finally, since the number of values in the array may vary, I suggest using SAFE_OFFSET to access the array values, as it returns NULL instead of raising an index-out-of-range error.
You can use the below SQL query as a reference:
SELECT
  A1,
  IF(A2[SAFE_OFFSET(0)] = A1, NULL, A2[SAFE_OFFSET(0)]) AS A2,
  IF(A2[SAFE_OFFSET(1)] = A1, NULL, A2[SAFE_OFFSET(1)]) AS A2_v1,
  IF(A2[SAFE_OFFSET(2)] = A1, NULL, A2[SAFE_OFFSET(2)]) AS A2_v2,
  IF(A2[SAFE_OFFSET(3)] = A1, NULL, A2[SAFE_OFFSET(3)]) AS A2_v3
FROM (
  SELECT
    REGEXP_EXTRACT(Sample_Data, r'A1=(\d{4})') AS A1,
    SPLIT(REGEXP_EXTRACT(Sample_Data, r'A2=([\d,]*)'), ",") AS A2
  FROM (
    SELECT 'BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,2345;WWWW;GGGG' AS Sample_Data
    UNION ALL
    SELECT 'CCCC;EEEE;A1=6789;TTTT' AS Sample_Data
  )
)
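To see the logic of the query above outside BigQuery, here is a small Python sketch of the same three steps: extract A1, split the A2 list, and read positions safely while blanking values equal to A1. Python's re module is only a stand-in for REGEXP_EXTRACT, SPLIT and SAFE_OFFSET, and the function name extract_fields and its width parameter are assumptions made for this example.

```python
import re


def extract_fields(sample, width=4):
    """Return [A1, A2, A2_v1, ...] for one semicolon-separated record."""
    a1_match = re.search(r"(?:^|;)A1=(\d+)", sample)
    a2_match = re.search(r"(?:^|;)A2=([\d,]*)", sample)
    a1 = a1_match.group(1) if a1_match else None
    # Equivalent of SPLIT(...): an empty list when A2 is missing
    a2 = a2_match.group(1).split(",") if a2_match else []

    def safe_offset(i):
        # Equivalent of SAFE_OFFSET: None instead of an index error
        if i >= len(a2):
            return None
        # Equivalent of IF(value = A1, NULL, value)
        return None if a2[i] == a1 else a2[i]

    return [a1] + [safe_offset(i) for i in range(width)]


row_with_a2 = extract_fields(
    "BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,2345;WWWW;GGGG"
)
row_without_a2 = extract_fields("CCCC;EEEE;A1=6789;TTTT")
print(row_with_a2)
print(row_without_a2)
```

In the first record the last A2 value equals A1, so it comes back as None; the second record has no A2 at all, so every A2 position is None.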

0 results in MS Access totals query (w. COUNT) after applying criteria

A query I am working on is showing some rather interesting behaviour that I haven't been able to debug so far.
This is the query before it gets buggy:
QryCount
SELECT EmpId, [Open/Close], Count([Open/Close]) AS Occurences, Attribute1, Market, Tier, Attribute2, MtSWeek
FROM qrySource
WHERE (Venue="NewYork") AND (Type="TypeA")
GROUP BY EmpId, [Open/Close], Attribute1, Market, Tier, Attribute2, MtSWeek;
The query gives precisely the results that I would expect it to:
#01542 | Open | 5 | Call | English | Tier1 | Complain | 01/01/2017
#01542 | Closed | 2 | Call | English | Tier2 | ProdInfo | 01/01/2017
#01542 | Open | 7 | Mail | English | Tier1 | ProdInfo | 08/01/2017
etc...
However, it returns more records than are needed at a subsequent step, thereby creating Cartesian products.
qrySource.[Open/Close] is a string field with possible values (you guessed it) "Open", "Closed" and null, and it is actually supplied by a mapping table when qrySource is built (not sure, but maybe this helps).
Now, the error comes in when I try to limit qryCount to records where Open/Close = "Open".
I tried both WHERE and HAVING, to no avail. The query returns 0 records, which is not what I would like to see.
I thought it might be because "open" is a reserved word, but even changing it to "D_open" in the source table didn't fix the issue.
Also tried to filter for the desired records in a subsequent query
SELECT *
FROM QryCount
WHERE [Open/Close] ="D_Open"
But nothing, still 0 records found.
I suspect it might be related to some inherent property of the COUNT function, but I'm not sure. Any help would be appreciated.
Thank you to everyone who participated, and apologies for providing insufficient/confusing information. I reckon the question could have been drafted better.
Anyhow, I found that the problem was apparently caused by the "/" in the Open/Close field name. As soon as I removed it from the field name in the original mapping table, the query performed as expected.

Is there a way to transpose data in Hive?

Can data in Hive be transposed? As in, the rows become columns and columns are the rows? If there is no function straight up, is there a way to do it in a couple of steps?
I have a table like this:
| ID | Names | Proc1 | Proc2 | Proc3 |
| 1 | A1 | x | b | f |
| 2 | B1 | y | c | g |
| 3 | C1 | z | d | h |
| 4 | D1 | a | e | i |
I want it to be like this:
| A1 | B1 | C1 | D1 |
| x | y | z | a |
| b | c | d | e |
| f | g | h | i |
I have been looking up other related questions and they all mention using lateral views and explode, but is there a way to selectively choose columns for lateral(ly) view(ing) and explod(ing)?
Also, what might be the rough process to achieve what I would like to do? Please help me out. Thanks!
Edit: I have been reading this link: https://cwiki.apache.org/Hive/languagemanual-lateralview.html and it shows me half of what I want to achieve. The first example in the link is basically what I'd like except that I don't want the rows to repeat and want them as column names. Any ideas on how to get the data to a form such that if I do an explode, it would result in my desired output, or the other way, ie, explode first to lead to another step that would then lead to my desired output table. Thanks again!
I don't know of a way to do this out of the box in Hive, sorry. You get close with explode etc., but I don't think it can get the job done.
Conceptually, I think it's hard to do a transpose without knowing in advance what the columns of the destination table are going to be. This is true in particular for Hive, because the metadata about the columns (how many there are, their types, their names, etc.) is stored in a database, the metastore. And it's true in general, because not knowing the columns beforehand would require holding data in memory (ok, sure, with spills), and users would need to be careful not to overflow memory (just like with dynamic partitioning in Hive).
In any case, long story short, if you know the columns of the destination table beforehand, life is good. There isn't a set command in Hive per se, to the best of my knowledge, but you could use a bunch of if clauses and case statements (ugly, I know, but that's how I have done it in the past) in the select clause to transpose the data. Something along the lines of SQL - How to transpose?
Do let me know how it goes!
As Mark pointed out, there's no easy way to do this in Hive, since PIVOT isn't available in Hive, and you may also run into issues with the case/when 'trick' because you have multiple value columns (proc1, proc2, proc3).
For testing purposes, you may try a different approach:
select v, o1, o2, o3 from (
select k,
v,
LEAD(v,3) OVER() as o1,
LEAD(v,6) OVER() as o2,
LEAD(v,9) OVER() as o3
from (select transform(name,proc1,proc2,proc3) using 'python strm.py' AS (k, v)
from input_table) q1
) q2 where k = 'A1';
where strm.py is:
import sys

# Emit each proc column of an input row as its own (name, value) row.
for line in sys.stdin:
    line = line.strip()
    name, proc1, proc2, proc3 = line.split('\t')
    print('%s\t%s' % (name, proc1))
    print('%s\t%s' % (name, proc2))
    print('%s\t%s' % (name, proc3))
The trick here is to use a Python script in the map phase that emits each column of a row as a distinct row. Then every third row (since we have 3 proc columns) contributes to the same resulting row, which we build by peeking forward with LEAD.
Although this query does the job, its drawback is that as the input grows, the query has to peek further and further ahead (another 3 rows for each additional input row), which may hurt performance. Still, you may evaluate it for testing purposes.
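To make the unpivot-then-LEAD idea easier to follow, here is a plain Python simulation of what the query and strm.py do together on the sample table. The function names unpivot, lead and transpose_via_lead are made up for this sketch, and it assumes the rows stay in input order. Hive does not guarantee any order for OVER() without an ORDER BY, so the real query relies on the same assumption.

```python
def unpivot(rows):
    """Mimic strm.py: one (name, value) pair per proc column."""
    for name, *procs in rows:
        for value in procs:
            yield name, value


def lead(values, index, offset):
    """Mimic LEAD(v, offset): the value offset rows ahead, or None."""
    target = index + offset
    return values[target] if target < len(values) else None


def transpose_via_lead(rows):
    long_rows = list(unpivot(rows))
    values = [value for _, value in long_rows]
    width = len(rows[0]) - 1        # number of proc columns (3 here)
    first_key = rows[0][0]          # 'A1' in the WHERE clause
    result = []
    for index, (key, value) in enumerate(long_rows):
        if key == first_key:
            ahead = [lead(values, index, width * n) for n in range(1, len(rows))]
            result.append([value] + ahead)
    return result


table = [
    ("A1", "x", "b", "f"),
    ("B1", "y", "c", "g"),
    ("C1", "z", "d", "h"),
    ("D1", "a", "e", "i"),
]
transposed = transpose_via_lead(table)
for row in transposed:
    print(" | ".join(row))
```

Each output row starts from one of A1's values and picks up the values 3, 6 and 9 positions later, which is why the number of LEAD columns has to match the number of input rows.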

Which of the following sentences satisfy the grammar?

I'm kinda new to RDP/Pairwise Disjoint Test and this is just a sample problem. I already have the answer and I would just like to verify if this is correct.
Grammar:
<GU> ::= du<GU>bi<MI> | <HO> | ru
<MI> ::= ra | fa | <HO>
<HO>::= bi<HO> | bi
Solution:
will generate a string of "bi" OR one "bi"
will generate one "ra" OR one "fa" OR (string of "bi" OR one "bi")
So will generate
du <GU> bi {ra | fa | {bi's | bi} } | {bi's | bi} | ru
Here are the sentences that can be produced by the grammar:
a. dudurubifabira
b. dubibibira
c. dubirubirurafa
d. dududubibibifabirabibibi
e. dududubibifarabirabibi
My answer is "b" and "d".
Am I correct?
It looks like (a) can also be generated by the grammar:
<GU>
-> du<GU>bi<MI>
-> dudu<GU>bi<MI>bi<MI>
-> dudurubi<MI>bi<MI>
-> dudurubifabi<MI>
-> dudurubifabira
Otherwise, your end result seems to be correct. I'd be careful about saying that "bi" will generate something, though, since it's a terminal.
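One way to double-check answers like these is a small backtracking recognizer. The Python sketch below encodes the three rules directly; the function names gu, mi, ho and accepts are chosen for this example only, and each generator yields every position at which its nonterminal could end.

```python
def ho(s, i):
    # <HO> ::= bi<HO> | bi
    if s.startswith("bi", i):
        yield i + 2
        yield from ho(s, i + 2)


def mi(s, i):
    # <MI> ::= ra | fa | <HO>
    for terminal in ("ra", "fa"):
        if s.startswith(terminal, i):
            yield i + 2
    yield from ho(s, i)


def gu(s, i):
    # <GU> ::= du<GU>bi<MI> | <HO> | ru
    if s.startswith("du", i):
        for j in gu(s, i + 2):
            if s.startswith("bi", j):
                yield from mi(s, j + 2)
    yield from ho(s, i)
    if s.startswith("ru", i):
        yield i + 2


def accepts(s):
    """True if the whole string can be derived from <GU>."""
    return any(end == len(s) for end in gu(s, 0))


candidates = {
    "a": "dudurubifabira",
    "b": "dubibibira",
    "c": "dubirubirurafa",
    "d": "dududubibibifabirabibibi",
    "e": "dududubibifarabirabibi",
}
accepted = [label for label, text in candidates.items() if accepts(text)]
print(accepted)
```

The recognizer tries every way of splitting a run of "bi" between <GU>, the literal bi and <MI>, which is exactly where hand derivations of (b) and (d) are easy to get wrong.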

How can I optimize this query...?

I have two tables, one for routes and one for airports.
Routes contains just over 9000 rows and I have indexed every column.
Airports only 2000 rows and I have also indexed every column.
When I run this query it can take up to 35 seconds to return 300 rows:
SELECT routes_build.*, a1.name as origin_name, a2.name as destination_name FROM routes_build
LEFT JOIN airports a1 ON a1.IATA = routes_build.origin
LEFT JOIN airports a2 ON a2.IATA = routes_build.destination
WHERE routes_build.carrier = "Carrier Name"
Running it with "DESCRIBE" I get the following info, but I'm not 100% sure what it's telling me.
id | Select Type | Table | Type | possible_keys | Key | Key_len | ref | rows | Extra
--------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | routes_build | ref | carrier,carrier_2 | carrier | 678 | const | 26 | Using where
--------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | a1 | ALL | NULL | NULL | NULL | NULL | 5389 |
--------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | a2 | ALL | NULL | NULL | NULL | NULL | 5389 |
--------------------------------------------------------------------------------------------------------------------------------------
The only alternative I can think of is to run two separate queries and join the results in PHP, although I can't believe a query like this could kill a MySQL server. So, as usual, I suspect I'm doing something stupid. SQL is my number 1 weakness.
Personally, I would start by removing the left joins and replacing them with inner joins as each route must have a start and end point.
It's telling you that it's not using an index for joining on the airports table. See how the "rows" column is so huge, 5000-odd? That's how many rows it has to read to answer your query.
I don't know why, as you have claimed you have indexed every column. What is IATA? Is it Unique? I believe if mysql decides the index is inefficient it may ignore it.
EDIT: if IATA is a unique string, maybe try indexing half of it only? (You can select how many characters to index) That may give mysql an index it can use.
SELECT routes_build.*, a1.name as origin_name, a2.name as destination_name
FROM routes_build
LEFT JOIN
airports a1
ON a1.IATA = routes_build.origin
LEFT JOIN
airports a2
ON a2.IATA = routes_build.destination
WHERE routes_build.carrier = "Carrier Name"
From your EXPLAIN output I can see that no index on airports.IATA is being used.
You should create one for the query to run fast.
The name also suggests that it should be a UNIQUE index, since IATA codes are unique.
Update:
Please post your table definition. Issue this query to show it:
SHOW CREATE TABLE airports
Also, I should note that your FULLTEXT index on IATA is useless unless you have set ft_min_word_len in the MySQL configuration to 3 or less.
By default, it's 4.
IATA codes are 3 characters long, and MySQL doesn't search for such short words using FULLTEXT with default settings.
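As a rough way to see the effect of a plain unique index on the join column, the Python sketch below builds two empty tables and asks the database for its plan. SQLite stands in for MySQL here only because it ships with Python; the column lists, the index name idx_airports_iata and the helper plan_details are assumptions for this example, and MySQL's EXPLAIN output is laid out differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE airports (IATA TEXT, name TEXT);
    CREATE TABLE routes_build (carrier TEXT, origin TEXT, destination TEXT);
    -- An ordinary (B-tree) unique index, not a FULLTEXT one
    CREATE UNIQUE INDEX idx_airports_iata ON airports (IATA);
    """
)

QUERY = """
    SELECT routes_build.*, a1.name AS origin_name, a2.name AS destination_name
    FROM routes_build
    LEFT JOIN airports a1 ON a1.IATA = routes_build.origin
    LEFT JOIN airports a2 ON a2.IATA = routes_build.destination
    WHERE routes_build.carrier = ?
"""


def plan_details(connection, sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


details = plan_details(conn, QUERY, ("Carrier Name",))
for line in details:
    print(line)
```

With the index in place, both airport lookups should be reported as index searches rather than full scans, which is the change you are looking for in the a1 and a2 rows of the MySQL output.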
After you implement Martin Robins's excellent advice (i.e. remove every instance of the word LEFT from your query), try giving routes_build a compound index on carrier, origin, and destination.
It really depends on what information you're trying to get to. You probably don't need to join airports twice and you probably don't need to use left joins. Also, if you can search on a numeric field rather than a text field, that would speed things up as well.
So what are you trying to fetch?