How to enter an infinite loop of expect operations with Tcl? - iteration

How can I intentionally enter an infinite loop with the telnet server? (Not literally infinite, but enough to "cycle" through a sequence with the server.)
Whenever presented with a "WEATHER UNDERGROUND MAIN MENU" or "CITY FORECAST MENU" menu, I'd like to enter selection "1", at least for a few iterations.
However, the expect calls are strictly linear as it stands.
Can I create a sort of "list" of "triggers" so that whenever a line is read, the script iterates through the list and acts on the first match?
Currently the dict of cities has only a single entry, but the notion would be to iterate through a list of, say, five cities.
Automated result (I simply select connect method 1):
thufir@dur:~/NetBeansProjects/spawnTelnet/telnet$
thufir@dur:~/NetBeansProjects/spawnTelnet/telnet$ tclsh main.tcl
locations
---------
1 nyc
connect to wunderground with:
-----------------------------
1) noControlFlow
2) connect
connection method: 1
connecting with 1
spawn telnet rainmaker.wunderground.com
getting weather for nyc
Trying 35.160.169.47...
Connected to rainmaker.wunderground.com.
Escape character is '^]'.
------------------------------------------------------------------------------
* Welcome to THE WEATHER UNDERGROUND telnet service! *
------------------------------------------------------------------------------
* *
* National Weather Service information provided by Alden Electronics, Inc. *
* and updated each minute as reports come in over our data feed. *
* *
* **Note: If you cannot get past this opening screen, you must use a *
* different version of the "telnet" program--some of the ones for IBM *
* compatible PC's have a bug that prevents proper connection. *
* *
* comments: jmasters@wunderground.com *
------------------------------------------------------------------------------
Press Return to continue:
Press Return for menu
or enter 3 letter forecast city code--
nyc
WEATHER UNDERGROUND MAIN MENU
******************************
1) U.S. forecasts and climate data
2) Canadian forecasts
3) Current weather observations
4) Ski conditions
5) Long-range forecasts
6) Latest earthquake reports
7) Severe weather
8) Hurricane advisories
9) Weather summary for the past month
10) International data
11) Marine forecasts and observations
12) Ultraviolet light forecast
X) Exit program
C) Change scrolling to screen
H) Help and information for new users
?) Answers to all your questions
Selection:1
Not a valid option. Type a number 1 to 12.
WEATHER UNDERGROUND MAIN MENU
******************************
1) U.S. forecasts and climate data
2) Canadian forecasts
3) Current weather observations
4) Ski conditions
5) Long-range forecasts
6) Latest earthquake reports
7) Severe weather
8) Hurricane advisories
9) Weather summary for the past month
10) International data
11) Marine forecasts and observations
12) Ultraviolet light forecast
X) Exit program
C) Change scrolling to screen
H) Help and information for new users
?) Answers to all your questions
Selection:
CITY FORECAST MENU
---------------------------------------------------
1) Print forecast for selected city
2) Print climatic data for selected city
3) Display 3-letter city codes for a selected state
4) Display all 2-letter state codes
M) Return to main menu
X) Exit program
?) Help
Selection:1
Enter 3-letter city code: nyc
Weather Conditions at 06:51 AM EDT on 11 May 2020 for New York JFK, NY.
Temp(F) Humidity(%) Wind(mph) Pressure(in) Weather
========================================================================
49 93% SSE at 6 29.88 Mostly Cloudy
Forecast for New York, NY
527 am EDT Mon may 11 2020
.Today...Mostly cloudy. A slight chance of showers early, then a
chance of showers. Isolated thunderstorms this afternoon. Highs
in the lower 60s. Southwest winds 5 to 10 mph with gusts up to
20 mph, increasing to west 15 to 20 mph with gusts up to 30 mph
this afternoon. Chance of rain 50 percent.
.Tonight...Partly cloudy with a slight chance of showers with
isolated thunderstorms in the evening, then mostly clear after
midnight. Lows in the lower 40s. Northwest winds 15 to 20 mph
with gusts up to 30 mph. Chance of rain 20 percent.
.Tuesday...Sunny. Highs in the upper 50s. Northwest winds 15 to
20 mph with gusts up to 30 mph.
.Tuesday night...Partly cloudy in the evening, then clearing.
Lows in the lower 40s. Northwest winds 10 to 15 mph. Gusts up to
25 mph in the evening.
.Wednesday...Sunny. Highs in the lower 60s. West winds 5 to
Press Return to continue, M to return to menu, X to exit: thufir@dur:~/NetBeansProjects/spawnTelnet/telnet$
code:
package provide weather 1.0
package require Expect
namespace eval ::wunderground {
}
#works
proc ::wunderground::noControlFlow {city} {
    variable telnet [spawn telnet rainmaker.wunderground.com]
    puts "getting weather for $city"

    expect "Press Return to continue:"
    send "\r"

    expect "Press Return for menu"  ;# the server prompt has no trailing colon
    send "\r"

    #assuming actually a dictionary of cities
    expect "or enter 3 letter forecast city code--"
    send "$city\r"

    expect "WEATHER UNDERGROUND MAIN MENU"
    send "1\r"

    expect "CITY FORECAST MENU"
    send "1\r"

    expect "Enter 3-letter city code:"
    send "$city\r"

    expect "Press Return to continue, M to return to menu, X to exit:"
    send "M\r"
}
main:
lappend auto_path /home/thufir/NetBeansProjects/spawnTelnet/telnet/weather
package require weather 1.0
package require locations 1.0
set cities [cities::dictionary]
puts "locations\n---------"
dict for {k v} $cities {puts $k\t$v}
#################
puts "\n\n\nconnect to wunderground with:"
puts "-----------------------------"
puts "1)\tnoControlFlow"
puts "2)\tconnect\n\n"
puts -nonewline "connection method: "
flush stdout
gets stdin prompt
puts "connecting with $prompt"
if {$prompt == 1} {
    wunderground::noControlFlow "nyc"
} else {
    wunderground::connect "nyc"
}
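For the looping itself: expect can take several pattern/action pairs in a single call and keep matching via exp_continue, which is exactly the "list of triggers, first match wins" idea. The dispatch logic, separated from Expect, can be sketched like this (a Python sketch with hypothetical trigger names, not part of the script above):

```python
import re

# Ordered list of (pattern, response) "triggers": the first match wins,
# mirroring a multi-pattern expect block that loops with exp_continue.
TRIGGERS = [
    (re.compile(r"WEATHER UNDERGROUND MAIN MENU"), "1"),
    (re.compile(r"CITY FORECAST MENU"), "1"),
    (re.compile(r"Enter 3-letter city code:"), "nyc"),
    (re.compile(r"Press Return"), ""),
]

def dispatch(line, triggers=TRIGGERS):
    """Return the response for the first matching trigger, or None."""
    for pattern, response in triggers:
        if pattern.search(line):
            return response
    return None

def run(lines, max_iterations=20):
    """Cycle through server output, answering each recognized prompt."""
    sent = []
    for i, line in enumerate(lines):
        if i >= max_iterations:  # guard: "not literally infinite"
            break
        response = dispatch(line)
        if response is not None:
            sent.append(response)
    return sent
```

In Expect proper the same shape is one `expect { pattern { send ...; exp_continue } ... }` block, with a counter to break out after a few iterations.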

Related

Using Google big query sql split the string in a column to multiple columns without breaking words

Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words
Regular expressions are a good way to accomplish this task. However, BigQuery is still quite limited in its support for regular expressions. Therefore, I would suggest solving this with a UDF and JavaScript. A JavaScript solution can be found here:
https://www.tutorialspoint.com/how-to-split-sentence-into-blocks-of-fixed-length-without-breaking-words-in-javascript
Adapting this solution to BigQuery:
The function string_split expects the chunk size and the text to be split. It returns an array with the chunks. A chunk can be up to two characters longer than the given size value because of the two anchoring non-space characters.
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
#length(split_text)
FROM
(
SELECT
text,string_split(20,text) as split_text
FROM (
SELECT "Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words" AS text
UNION ALL SELECT "This is a short text. And can be splitted as well."
)
)
#, unnest(split_text) as split_text #
Please uncomment the two lines to split the text from the array into single rows.
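As a side note, the chunking regex behaves the same outside BigQuery; here is a quick sanity check of the pattern in Python (illustration only, not part of the BigQuery solution):

```python
import re

def string_split(size, text):
    # Same idea as the JS UDF: a chunk starts and ends on a non-space
    # character, is at most `size` + 2 characters long, and must end at
    # a word boundary (followed by a space or end of string).
    pattern = r'\S.{3,' + str(size) + r'}\S(?= |$)'
    return re.findall(pattern, text)

chunks = string_split(20, "This is a short text. And can be splitted as well.")
# Every chunk stays within size + 2 characters and no word is cut in half.
assert all(len(c) <= 22 for c in chunks)
```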
It also works for larger datasets and took less than two minutes:
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
length(split_text)
FROM
(
SELECT
text,string_split(40,text) as split_text
FROM (
SELECT abstract as text from `bigquery-public-data.breathe.jama`
)
)
, unnest(split_text) as split_text #
order by 3 desc
Consider below approach
create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
var arr = [];
var part = '';
for (i = 0; i < parts.length; i++) {
if (part.length + parts[i].length < max_len){part += parts[i]}
else {arr.push(part); part = parts[i];}
}
arr.push(part);
return arr;
""";
select * from (
select id, offset, part
from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))
if applied to dummy data as below with split size = 50
output is
Non-regexp Approach
DECLARE LONG_SENTENCE DEFAULT "It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.";
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
bin = 0;
a.reduce((c, v) => {
if (c + Number(v) > 264) { bin += 1; return Number(v); }
else return c += Number(v);
}, 0);
return bin;
""";
WITH splits AS (
SELECT w, cumsumbin(ARRAY_AGG(LENGTH(w) + 1) OVER (ORDER BY o)) AS bin
FROM UNNEST(SPLIT(LONG_SENTENCE, ' ')) w WITH OFFSET o
)
SELECT * FROM (
SELECT bin, STRING_AGG(w, ' ') AS segment
FROM splits
GROUP BY 1
) PIVOT (ANY_VALUE(segment) AS segment FOR bin IN (0, 1, 2, 3))
;
Query results:
segment_0
segment_1
segment_2
segment_3
It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
its spuds when your potato comes with a side of potatoes.
Length of each segment
segment_0
segment_1
segment_2
segment_3
261
262
261
57
Regexp Approach
[note] the expression below (.{1,264}\b) is simple, but the word boundary \b does not include a period (.), so the result can have some errors. You can see the last period (.) in segment_3 is missing. But under certain circumstances this might still be useful, I think.
SELECT * FROM (
SELECT *
FROM UNNEST(REGEXP_EXTRACT_ALL(LONG_SENTENCE, r'(.{1,264}\b)')) segment WITH OFFSET o
) PIVOT (ANY_VALUE(segment) segment FOR o IN (0, 1, 2, 3));
Query results:
segment_0
segment_1
segment_2
segment_3
It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red
clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos
tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about
its spuds when your potato comes with a side of potatoes
Length of each segment
segment_0
segment_1
segment_2
segment_3
261
262
261
56

Bloomberg drawing out multiple securities

The Bloomberg Excel formula =BDH() only retrieves the prices for one security. If I want other securities, I need to repeat the formula, which is no issue as I've written a script for that.
The problem comes when the dates of the securities don't match up, either due to trading days or contract expiry.
E.g. =BDH(name_of_commod,"PX_LAST","19/12/2015","2/5/2017") for two separate tickers produces:
QWV8 Comdty QWZ8 Comdty
#NAME? 495.2 #NAME? 479.7
14/2/2017 496.7 18/4/2017 462.2
15/2/2017 494.4 19/4/2017 457.1
16/2/2017 495.3 20/4/2017 456.6
17/2/2017 495 21/4/2017 457
20/2/2017 498.7 24/4/2017 454.9
21/2/2017 498.4 25/4/2017 453.5
22/2/2017 498.1 26/4/2017 445
23/2/2017 491.6 27/4/2017 439.9
24/2/2017 489.5 28/4/2017 450
27/2/2017 481.6 2/5/2017 448.4
The mismatch here is because QWZ8 is not available until 18 Apr, which rather screws over my calculations, as I've got about a hundred other securities in the data set.
Is there a way to output bloomberg data such that all the dates align to the same row?
Like such:
QWV8 Comdty QWZ8 Comdty
18/4/2017 461.3 18/4/2017 462.2
19/4/2017 456.2 19/4/2017 457.1
20/4/2017 455.7 20/4/2017 456.6
21/4/2017 456.1 21/4/2017 457
24/4/2017 454 24/4/2017 454.9
25/4/2017 452.6 25/4/2017 453.5
26/4/2017 444 26/4/2017 445
27/4/2017 438.9 27/4/2017 439.9
28/4/2017 449 28/4/2017 450
2/5/2017 447.4 2/5/2017 448.4
You can use overrides to specify how missing dates are handled. For example:
=BDH(name_of_commod,"PX_LAST","19/12/2015","2/5/2017","Days=W,Fill=N")
will have one datapoint for each workday and if no data was available for a date will leave the "price" cell blank.
The possible values for Days are:
N, W or Weekdays - All weekdays
C, A or All - All calendar days
T, Trading- Omits all non-trading days.
and for Fill:
C, P, or Previous - Carries over the last available data.
N, E, or Error - Returns an error message.
B or Blank - Returns a blank.
NA - Returns Excel's #N/A when no data is available.
PNA - Carries the previous value over, and returns #N/A when no previous value is available.
Any other value the client enters will be used literally as a filler.
You can find a more exhaustive list of valid overrides in the help for the function. (In Excel, go to the cell with the formula and click "More Functions..." and "Help on this function")
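Alternatively, if you'd rather pull the raw series and align them yourself, the post-processing step is just an intersection on dates. A sketch in Python (the names and numbers are made up for illustration; this is not a Bloomberg API call):

```python
def align_on_common_dates(*series):
    """Keep only dates present in every series, mirroring what the
    Days/Fill overrides achieve inside the BDH call itself.
    Each series is a mapping of date -> price."""
    common = set(series[0])
    for s in series[1:]:
        common &= set(s)
    # One row per common date, with one price column per security.
    return {d: tuple(s[d] for s in series) for d in sorted(common)}

# Hypothetical excerpts of two pulled series (date -> PX_LAST):
qwv8 = {"18/4": 461.3, "19/4": 456.2, "20/4": 455.7}
qwz8 = {"18/4": 462.2, "19/4": 457.1, "21/4": 457.0}
aligned = align_on_common_dates(qwv8, qwz8)
```

Each key of `aligned` is a date both securities traded on, so every row lines up, at the cost of dropping dates where either series is missing.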

How to find multiple subsets of numbers that are approximately equal to a given value?

I am using VBA to get data from an Excel 2013 spreadsheet. I have a couple of years' experience in computer science from a while back, using VBA and Java, but I'm by no means an expert.
I have a column of numbers, ranging from 20 to 60 in total. Each of those numbers represents 'minutes' and can range from 3 to 500 (normally 60 to 300). Each number has an identifier called a 'load number' (such as N03, N22, etc.) and a date/time. All of these values are attributed to a 'load' that needs to be picked. 'Pickers' are the ones that have the loads or minutes assigned to them. They can only pick so many minutes per given day, which ranges from 400-600 (8 hour shift = 400 minutes).
What I need to do is assign sets of loads that are equal to an approximate amount of total minutes (set number w/ threshold) to two groups of pickers (The groups are AM and PM, each have 3-5 pickers). Once one load is assigned to a picker, it can't be assigned to another UNLESS the loads for a given day have too many minutes and all the pickers can't be assigned an approximate amount of minutes.
Example: Out of 8 pickers, 6 can be assigned loads totaling between 380-420 minutes, but 2 can't be assigned between 380-420 because of the remaining loads.
In the case of the given example, for the remaining 2 pickers, a total of 760 - 840 minutes can be assigned to BOTH of them.
Loads also need to be assigned based on their date/time. If pickers are picking loads due on the same day, the earliest loads need to be assigned to the AM group of pickers and, accordingly, the latest to the PM group of pickers. If all loads to be assigned are for the next day, they can be assigned to anyone as long as the earliest loads are prioritized.
Example: The AM shift starts at 5AM w/ 5 pickers. There are three loads of 200 minutes (4 hours, actual) due at 9AM on the same day.
The three loads should be assigned to three different pickers, so the loads can be done on time. They would be marked as the #1 load, so each picker knows to do it first
Example: Another load is due at 9AM on the same day. It is 400 minutes though.
2 pickers can be assigned to this load as their #1 pick and 200 minutes would be assigned to both of them.
Once the loads are assigned to the pickers, the results will be displayed in a separate spreadsheet with each row having: AM/PM, Picker's name, Load number #'s 1-10 w/ load number and minutes to pick and the total minutes.
Example: PICKER | AM | Toby | 029-N10 (268), 030-N05 (93), 030-N04 (111) | 472 TOTAL
Any help / pointers on this problem would be appreciated. I've looked at similar questions posted on here and elsewhere, but couldn't find any that would give me enough to go on to start working on a solution. It's not too bad assigning loads manually, but it gets complex once there are over 30 loads and 4,000 minutes total, especially when most of them are larger. It would just be much easier having a program assign everything and save 1-2 hours in the process every day.
Edit:
The data, in Excel, is structured into 8 columns and up to 50 rows. Each row represents a 'load' and has only 3 useful cells. I got all the information into three arrays, which can be used to display the info for any load by using the same element (1-50) for each array.
Dim LoadNumbers(1 To 50) As String
Dim LoadTimes(1 To 50) As Double
Dim LoadMinutes(1 To 50) As Double
Dim C As Integer

C = 1
Do While C <= 50   ' <= so the 50th row is read too
    LoadNumbers(C) = Cells(C, 2)
    LoadTimes(C) = Cells(C, 5) * 24
    LoadMinutes(C) = Cells(C, 7)
    C = C + 1
Loop
For example:
LoadNumbers(5) & " @ " & LoadTimes(5) & " Hours PST @ " & LoadMinutes(5) & " Minutes"
Will return:
039-N06 @ 9.5 Hours PST @ 67.4 Minutes (9.5 hours = 9:30AM)
The LoadTimes and LoadMinutes arrays are the ones I need to assign loads with. I will have another two cells where users input the desired minutes (M) to be assigned and the threshold (T). I then need a VBA script to assign M-T to M+T minutes to each picker.
Here's what the values in LoadMinutes look like:
141.8
96
73.7
32.2
67.4
106.1
21.3
14.2
141.6
49.5
68.6
200.6
72
174.9
223.1
161.8
76.6
235.5
76.2
134.9
236.7
166.3
170.7
134.6
63.9
352.9
136.2
146.3
243.2
There are 29 loads @ 3,818 minutes total.
Let's say the minutes need to be between 430 and 470. Out of those 29 loads, I need to assign sets of different numbers adding up to between 430 and 470, based on their time. The times in LoadTimes range from 7 to 20 (7AM to 8PM).
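Carving the loads into per-picker sets of roughly M-T to M+T minutes is a bin-packing-style problem, and a first-fit greedy over the loads sorted largest-first is a common starting heuristic. A sketch of just that core step in Python rather than VBA (hypothetical names; it deliberately ignores the AM/PM and due-time rules described above):

```python
def assign_loads(minutes, target, threshold):
    """Greedy first-fit: place each load into the first picker set that
    stays at or below target + threshold; open a new set when none fits."""
    bins = []  # each bin is a list of load minutes for one picker
    for m in sorted(minutes, reverse=True):  # big loads first packs tighter
        for b in bins:
            if sum(b) + m <= target + threshold:
                b.append(m)
                break
        else:
            bins.append([m])
    # Sets inside the target window are "full"; the rest need manual handling
    # (the question's case where some pickers can't get a full assignment).
    full = [b for b in bins if sum(b) >= target - threshold]
    leftover = [b for b in bins if sum(b) < target - threshold]
    return full, leftover

loads = [141.8, 96, 73.7, 32.2, 67.4, 106.1, 21.3, 14.2, 141.6, 49.5]
full, leftover = assign_loads(loads, target=450, threshold=20)
```

The same loop translates line-for-line into VBA over the LoadMinutes array; sorting by LoadTimes instead of size would prioritize the earliest loads for the AM group.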

Google Spreadsheet with SQL query - finding best combination

I have a google spreadsheet for my gaming information. It contains 2 sheets - one for monster information, another for team.
Monster information sheet contains the attack value, defend value, and the mana cost of monsters. It's almost like a database of monsters that I can summon.
Team sheet does the following:
Asks for the amount of mana I currently have.
Computes a list of up to 5 monsters that I can summon (it can be less than 5).
Each monster has its own mana cost, so the total mana cost mustn't exceed the amount of mana I gave in point 1.
The tabulated list should give me the team that has the highest combined attack value. It does not matter how many monsters are summoned, but each monster cannot be summoned twice.
I have been thinking of using query() function so that I can make use of SQL statements. (so that I can hopefully retrieve the tabulated list directly)
Sample: Monster Info
A B C D
1 Monster Attack Defense Cost
2 MonA 1200 1200 35
3 MonB 1400 1300 50
... ...
Sample: Team
A B C D
1 Mana 120
2
3 Attack Team
4 Monster Attack Cost Total Attack
5 MonB 1400 50 1400
6 MonA 1200 35 2600
7 ... ...
I have these formula in "Team" sheet
A5: =query('Monster Info'!$A$:$D,"SELECT A,B,D ORDER BY B DESC LIMIT 5")
B5: =CONTINUE(A5, 1, 2)
C5: =CONTINUE(A5, 1, 3)
D5: =C5
A6: =CONTINUE(A5, 2, 1)
B6: =CONTINUE(A5, 2, 2)
C6: =CONTINUE(A5, 2, 3)
D6: =D5+C6
That only gets the 5 best attack monsters, regardless of the mana cost consideration. How do I do that such that it takes consideration of both attack value and mana cost value? There is another problem shown in the example below:
Example: (simplified version, without defense value etc)
Monster Attack Cost
MonA 1400 50
MonB 1200 35
MonC 1100 30
MonD 900 25
MonE 500 20
MonF 400 15
MonG 350 10
MonH 250 5
If I have 160 mana, then the obvious team is A+B+C+D+E (5100 Attack).
If I have 150 mana, it becomes A+B+C+D+G (4950 Attack).
If I have 140 mana, it becomes A+B+C+D (4600 Attack).
If I have 130 mana, it becomes B+C+D+E+F (4100 Attack using 125 mana) or A+B+C+F (4100 Attack using all 130 mana).
If I have 120 mana, it becomes B+C+D+E+G (4050 Attack).
If I have 110 mana, it becomes B+C+D+F+H (3850 Attack).
As you can see, there isn't really a pattern within the results.
Any expert willing to share their insights on this?
I've played with the problem for an hour and I only have a workaround. Your problem seems to be a standard linear programming task, which can easily be solved by "Solver" software. There used to be a Solver in Google Spreadsheets, but unfortunately it was removed from the newest version. If you are not insisting on a Google solution, you should try one of the spreadsheet applications that support a Solver.
I tried MS Office (it has a Solver add-in, installation guide: http://office.microsoft.com/en-001/excel-help/load-the-solver-add-in-HP010342660.aspx).
Before you run the solver, you should prepare your original dataset a bit, with helper columns and cells.
Add a new column next to the "Cost" column (let's assume it is column "D"), and in each of its rows put either 0 or 1. This column will tell you whether a monster is selected for the attack team.
Add two more columns ("E" and "F" respectively). These columns will be the products of the Attack and of the Cost, respectively, with column D. So you should write the function =b2*d2 into the E2 cell, and =c2*d2 into the F2 cell. This way, if a monster is selected (which column D tells, remember), the corresponding E and F cells will be non-zero values; otherwise they will be 0.
Create a SUM row under the last row, and create a summarizing function for the D,E,F columns respectively. So in my spreadsheet D10 cell gets its value like this: =sum(d2:d9), and so on.
I created a spreadsheet to show these steps: https://docs.google.com/spreadsheets/d/1_7XRlupEEwat3CthSSz8h_yJ44MysK9hMsj0ijPEn18/edit?usp=sharing
Remember to copy this worksheet to an MS Office worksheet, before you start the Solver.
Now, you are ready to start the Solver. (Data menu, Solver in MS Office). You can see a video here on using the Solver: https://www.youtube.com/watch?v=Oyc0k9kiD7o
It's not as hard as it looks; for this case I'll describe what to write where:
Set Objective: you should select the "E10" cell, as that represents the sum of all the attack points.
Check the "Max" radio button, as we would like to maximize the value of the attacks.
By Changing Variable Cells: select the "d2:d9" range, as those cells represent whether a monster is selected or not. The solver will try to adjust these values (0 or 1) in order to maximise the total attack.
Subject to the Constraints: here we should add some constraints. Click on the Add button, and then:
First we should ensure that d2:d9 are all binary values. So "Cell reference" should be "d2:d9", and from the dropdown menu select "bin" as binary.
Another constraint is that the number of selected monsters should not exceed 5. So select the cell holding the sum of the selected monsters (D10), then "<=" and the value "5".
Finally, we cannot use more mana than we have, so select the cell in which you store the sum of used mana (F10), add "<=", and the whole amount of mana we can spend (in my case it's in the I2 cell).
Done. It should work, in my case it worked at least.
Hope it helps anyway.

Which algorithm I can use to find common adjacent words/ pattern recognition?

I have a big table in my database with a lot of words from various texts in the text order. I want to find the number of times/frequency that some set of words appears together.
Example: Suppose I have these 4 words in many texts: United | States | of | America. I will get as result:
United States: 50
United States of: 45
United States of America: 40
(This is only an example with 4 words; there can be sets with fewer or more than 4.)
Is there some algorithm that can do this, or something similar?
Edit: Some R or SQL code showing how to do it is welcome. I need a practical example of what I need to do.
Table Structure
I have two tables. Token has id and text. The text is UNIQUE, and each entry in this table represents a different word.
TextBlockHasToken is the table that keeps the text order. Each row represents a word in a text.
It has textblockid, which is the block of text the token belongs to; sentence, which is the sentence of the token; position, which is the token's position inside the sentence; and tokenid, which is the reference to the Token table.
It is called an N-gram; in your case a 4-gram. It can indeed be obtained as the by-product of a Markov chain, but you could also use a sliding window (size 4) to walk through the (linear) text while updating a 4-dimensional "histogram".
UPDATE 2011-11-22:
A Markov chain is a way to model the probability of switching to a new state, given the current state. This is the stochastic equivalent of a "state machine". In the natural language case, the "state" is formed by the "previous N words", which implies that you consider the prior probability (before the previous N words) as equal to one. Computer people will most likely use a tree for implementing Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words to follow are the probabilities of the current node's offspring. But every time we choose a new child node, we actually shift down the tree and "forget" the root node; our window is only N words wide, which translates to N levels deep into the tree.
You can easily see that if you are walking a Markov chain/tree like this, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word P(w2 | w1), etc. So, when processing the corpus you build a Markov tree ( := update the frequencies in the nodes); at the end of the ride you can estimate the probability of a given choice of word by freq(word) / SUM(freq(siblings)). For a word 5 levels deep in the tree, this is the probability of the word given the previous 4 words. If you want the N-gram probabilities, you want the product of all the probabilities in the path from the root to the last word.
This is a typical use case for Markov chains. Estimate the Markov model from your textbase and find high probabilites in the transition table. Since these indicate probabilities that one word will follow another, phrases will show up as high transition probabilites.
By counting the number of times the phrase-start word showed up in the texts, you can also derive absolute numbers.
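The sliding-window counting itself is straightforward to express directly; as an illustration (in Python rather than the R and SQL used elsewhere in this thread), counting every n-gram up to length 4:

```python
from collections import Counter
import re

def ngram_counts(text, max_n=4):
    """Slide a window of width 1..max_n over the token stream and
    count each n-gram, as in the 'sliding window' description above."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

counts = ngram_counts(
    "the united states of america and the united states again")
```

In SQL the same window can be built by self-joining TextBlockHasToken on position, position+1, position+2, position+3 (within the same textblockid and sentence) and grouping on the joined token texts.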
Here is a small snippet that calculates all combinations/n-grams of a text for a given set of words. In order to work for larger datasets it uses the hash library, though it is probably still pretty slow...
require(hash)

get.ngrams <- function(text, target.words) {
    text <- tolower(text)
    split.text <- strsplit(text, "\\W+")[[1]]
    ngrams <- hash()
    current.ngram <- ""
    for (i in seq_along(split.text)) {
        word <- split.text[i]
        word_i <- i
        while (word %in% target.words) {
            if (current.ngram == "") {
                current.ngram <- word
            } else {
                current.ngram <- paste(current.ngram, word)
            }
            if (has.key(current.ngram, ngrams)) {
                ngrams[[current.ngram]] <- ngrams[[current.ngram]] + 1
            } else {
                ngrams[[current.ngram]] <- 1
            }
            word_i <- word_i + 1
            word <- split.text[word_i]
        }
        current.ngram <- ""
    }
    ngrams
}
So the following input ...
some.text <- "He states that he loves the United States of America,
and I agree it is nice in the United States."
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)
... would result in the following hash:
>usa.ngrams
<hash> containing 10 key-value pair(s).
america : 1
of : 1
of america : 1
states : 3
states of : 1
states of america : 1
united : 2
united states : 2
united states of : 1
united states of america : 1
Notice that this function is case insensitive and registers any permutation of the target words, e.g.:
some.text <- "States of united America are states"
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)
...results in:
>usa.ngrams
<hash> containing 10 key-value pair(s).
america : 1
of : 1
of united : 1
of united america : 1
states : 2
states of : 1
states of united : 1
states of united america : 1
united : 1
united america : 1
I'm not sure if it's of help to you, but here is a little Python program I wrote about a year ago that counts N-grams (well, only mono-, bi-, and trigrams). (It also calculates the entropy of each N-gram.) I used it to count those N-grams in a large text.
Link