Detect predefined topics in a text - text-mining

I would like to find, in a text corpus of long, boring documents, allusions to some predefined topics (say I am interested in two topics: "Remuneration" and "Work conditions").
For example, I want to find where in my corpus (the specific paragraph) problems with "remuneration" are raised.
To accomplish that, I first thought of a deterministic approach: building a big dictionary and using regular expressions to flag those words in the corpus. It is a very basic idea, but I do not know how to build the dictionary efficiently (I need a lot of words from the lexical field of remuneration). Do you know of a website in French that could help me build this dictionary?
Can you perhaps think of a cleverer approach based on a machine learning algorithm that could perform this task? (I know about topic modelling, but the difference here is that I am focusing on predetermined topics like "Remuneration".) I need a simple approach :)

The dictionary approach is a very basic one, but it could work. You can build the dictionary iteratively:
Suppose you want a dictionary of terms related to "work conditions".
Start with a seed, a small number of terms that may be related, with high probability, to work conditions.
Use this dictionary to run through the corpus and find relevant documents.
Now go through the relevant documents and find terms with a high TF-IDF value (terms that are frequent in those documents but rare in the rest of the corpus). These terms can be assumed to refer to the subject of "work conditions" as well.
Add the new terms you found to the dictionary.
Now you can run again through the corpus and find additional relevant documents.
You can repeat the above process a preconfigured number of times, or until no new terms are found.
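The loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not a tuned implementation: the function names (`tokenize`, `expand_dictionary`), the parameters `rounds` and `top_k`, and the scoring (term count in the matching documents multiplied by a plain log IDF over the whole corpus) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased runs of letters; accented characters are kept for French text.
    return re.findall(r"[^\W\d_]+", text.lower())


def expand_dictionary(docs, seed, rounds=3, top_k=2):
    """Grow a seed term set with terms that are distinctive of the matching documents."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    terms = set(seed)
    for _ in range(rounds):
        # Documents that contain at least one dictionary term.
        relevant = [toks for toks in tokenized if terms & set(toks)]
        tf = Counter(t for toks in relevant for t in toks)
        scores = {
            t: count * math.log(n_docs / df[t])
            for t, count in tf.items()
            if t not in terms
        }
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        new_terms = [t for t, score in ranked[:top_k] if score > 0]
        if not new_terms:
            break  # nothing new was found, so stop early
        terms.update(new_terms)
    return terms
```

On a real corpus the dictionary tends to drift towards unrelated vocabulary after a few rounds, so it is probably worth reviewing the candidate terms by hand before accepting them.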

A full treatment of such "topic analysis" problems is well beyond the scope of a Stack Overflow Q&A; there are many books and papers on the subject.
For a small starter project: collect a number of articles, some that focus on your topic(s) and some on other topics. Label each document according to whether or not it covers each of your chosen topics. Calculate the term frequency-inverse document frequency (TF-IDF) weight of each term in each sample article, which gives one vector of term weights per document. (You'll probably want to eliminate extremely common or ambiguous "stop words" from the analysis and do "stemming" as well. You can also scan for common sequences of two or more words.) This then defines a set of "positive" and "negative" examples for each defined topic.
If there's only a single topic of interest, you can then use a cosine-similarity function to determine which sample article is most like your new / input text sample. For multiple topics, you'll probably want to do something like Principal Component Analysis from the original text samples to identify which words and word combinations are most representative of each topic.
The quality of the classification will depend largely on the number of example texts you have to train the model and how much they differ.
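As a rough sketch of the cosine-similarity step, the following standard-library Python assigns a new text the label of its most similar labelled sample. The sample texts, the labels and the function names are invented for illustration, and the smoothed IDF formula is just one common choice; stop-word removal and stemming are left out to keep the example short.

```python
import math
import re
from collections import Counter

# Invented labelled samples; a real project would need many more per topic.
EXAMPLES = [
    ("salary and bonus were cut", "remuneration"),
    ("the office is noisy and cold", "work_conditions"),
]


def tokenize(text):
    return re.findall(r"[^\W\d_]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(text, examples):
    """Return (label, similarity) of the labelled sample closest to `text`."""
    token_lists = [tokenize(sample) for sample, _ in examples]
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    # Smoothed inverse document frequency, so no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [
        {t: c * idf[t] for t, c in Counter(toks).items()} for toks in token_lists
    ]
    # Terms never seen in the samples carry no information and are dropped.
    query = {t: c * idf[t] for t, c in Counter(tokenize(text)).items() if t in idf}
    scores = [cosine(query, v) for v in vectors]
    best = max(range(n), key=lambda i: scores[i])
    return examples[best][1], scores[best]
```

A similarity close to zero means the new text shares almost no vocabulary with any sample, so in practice a threshold below which the text is left unlabelled is probably sensible.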

If you're asking how to solve this in code, why not write a program (in your language of choice) that finds the paragraphs containing the word or the allusion you are looking for?
For example, I would do it like this in JavaScript:
// longText is a long text that includes 4 paragraphs in total
const longText = `
In 1893, the first running, gasoline-powered American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts. The first public run of the Duryea Motor Wagon took place on 21 September 1893, on Taylor Street in Metro Center Springfield.[32][33] The Studebaker Automobile Company, subsidiary of a long-established wagon and coach manufacturer, started to build cars in 1897[34]: p.66  and commenced sales of electric vehicles in 1902 and gasoline vehicles in 1904.[35]
In Britain, there had been several attempts to build steam cars with varying degrees of success, with Thomas Rickett even attempting a production run in 1860.[36] Santler from Malvern is recognized by the Veteran Car Club of Great Britain as having made the first gasoline-powered car in the country in 1894,[37] followed by Frederick William Lanchester in 1895, but these were both one-offs.[37] The first production vehicles in Great Britain came from the Daimler Company, a company founded by Harry J. Lawson in 1896, after purchasing the right to use the name of the engines. Lawson's company made its first car in 1897, and they bore the name Daimler.[37]
In 1892, German engineer Rudolf Diesel was granted a patent for a "New Rational Combustion Engine". In 1897, he built the first diesel engine.[1] Steam-, electric-, and gasoline-powered vehicles competed for decades, with gasoline internal combustion engines achieving dominance in the 1910s. Although various pistonless rotary engine designs have attempted to compete with the conventional piston and crankshaft design, only Mazda's version of the Wankel engine has had more than very limited success.
All in all, it is estimated that over 100,000 patents created the modern automobile and motorcycle.
`
document.querySelector('.searchbox').addEventListener('submit', (e)=> { e.preventDefault(); search() })
function search(){
const allusion = document.querySelector('#searchbox').value.trim().toLowerCase()
const output = document.querySelector('#results-body ol')
output.innerHTML = "" // reset the output
if (!allusion) return // an empty query would otherwise match every paragraph
const paragraphs = longText.split('\n').filter(item => item != "")
const included = paragraphs.filter((paragraph) => paragraph.toLowerCase().includes(allusion))
let foundIn = included.map(paragraph => `<div class="result-row"> <li>${paragraph.toLowerCase()}</li>
</div>`)
foundIn = foundIn.map(el => el.replaceAll(allusion, `<span class="highlight">${allusion}</span>`))
output.insertAdjacentHTML('afterbegin', foundIn.join('\n'))
}
.container{
padding : 5px;
border: .2px solid black;
}
.searchbox{
padding-bottom: 5px
}
.searchbox input {
width: 90%
}
.result-row{
padding-bottom: 5px;
}
.highlight{
background: yellow;
}
h3 span {
font-size: 14px;
font-style: italic;
}
<div class="container">
<form class="searchbox">
<h3>Give me a hint: <span>ex: car, gasoline, company</span> </h3>
<input id="searchbox" type="text" placeholder="allusion word, ex: car, gasoline, company">
<button type="submit"> find </button>
</form>
<div id="results-body">
<ol></ol>
</div>
</div>

Related

React native handling html p tags

I have a screen that displays article information that's been pulled from a WordPress API call, which returns JSON (inclusive of all its lovely HTML tags).
<Text style={styles.summary}>{htmlRegex(item.content.rendered)}{"\n"}{Moment(item.date, "YYYYMMDD").fromNow()}</Text>
I have a function that strips out all of the HTML tags, tidies up any unicode, etc...
function htmlRegex(string) {
string = string.replace(/<\/?[^>]+(>|$)/g, "")
string = string.replace(/…/g,"...")
let changeencode = entities.decode(string);
return changeencode;
}
The challenge is that the tags returned in the content appear to be causing odd line-spacing issues.
content.rendered contains:
rendered: "
<figure class="wp-block-image size-large"><img data-attachment-id="655" data-permalink="https://derbyfutsal.com/derby-futsal-club-women-name-change-june20/" data-orig-file="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png" data-orig-size="1024,512" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}" data-image-title="derby-futsal-club-women-name-change-june20" data-image-description="" data-medium-file="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=300" data-large-file="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=730" src="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=1024" alt="" class="wp-image-655" srcset="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png 1024w, https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=150 150w, https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=300 300w, https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=768 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
<p>Derby Futsal Club Ladies’ team are renamed Derby Futsal Club Women.</p>
<p>The change in name reflects Derby Futsal’s work in developing all aspects of futsal on and off the court.</p>
<p>It reflects the way the league (FA National Futsal Women’s Super Series), the players, the fans and the management refer to the game.</p>
<p>Hannah Roberts, Derby Futsal Club Women captain, believes “the change from Ladies to Women’s is a subtle but important one. Many professional sports teams have moved towards ‘Women’s’ in the last five years in order to stay modern and in touch, and as a forward-thinking club it’s important for Derby Futsal to do the same. We’re making so many strides in our community work and marketing, and this name change is another step forward to the future for the club”.</p>
<p>Derby Futsal Club Women first team coach, Matt Hardy feels this name change signifies evolution for the team; “the future of the women’s game both at Derby and nationally is looking bright. So it’s only right that we have a name that is modern, and inline with the national game”. </p>
<p>This news follows similar moves in professional football. Chelsea, Manchester City and Arsenal have all renamed their women’s team recently. It is something Professor Kath Woodward from the Open University, an expert on sociology and sport agrees with, “the use of ladies suggests a physical frailty and need for protection”.</p>
<p>Alex Scott, former Arsenal Women captain, adds: “the term ‘Women’s’ delineates between men and women without as many stereotypes or preconceived notions and it is in keeping with modern-day thinking on equality”.</p>
<p></p>
",
My question is: how do you handle the tags so that the white space from the line returns is manageable?
Put this in your CSS:
p {
margin: 0;
padding: 0;
}
And just replace 0 with whatever suits (0.5rem, 20px, whatever floats your boat really).

Generating similar named entities/compound nouns

I have been trying to create distractors (false answers) for multiple choice questions. Using word vectors, I was able to get decent results for single-word nouns.
When dealing with compound nouns (such as "car park" or "Donald Trump"), my best attempt was to compute similar words for each part of the compound and combine them. The results are very entertaining:
Car park -> vehicle campground | automobile zoo
Fire engine -> flame horsepower | fired motor
Donald Trump -> Richard Jeopardy | Jeffrey Gamble
Barrack Obama -> Obamas McCain | Auschwitz Clinton
Unfortunately, these are not very convincing. Especially in case of named entities, I want to produce other named entities, which appear in similar contexts; e.g:
Fire engine -> Fire truck | Fireman
Donald Trump -> Barrack Obama | Hillary Clinton
Niagara Falls -> American Falls | Horseshoe Falls
Does anyone have any suggestions for how this could be achieved? Is there a way to generate similar named entities/noun chunks?
I managed to get some good distractors by searching for the named entities on Wikipedia, then extracting similar entities from the summary. However, I'd prefer to find a solution using just spaCy.
If you haven't seen it yet, you might want to check out sense2vec, which allows learning context-sensitive vectors by including the part-of-speech tags or entity labels. Quick usage example of the spaCy extension:
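import spacy
from sense2vec import Sense2VecComponent

# Assumes the sense2vec 1.0.0a API on spaCy 2.x, which this snippet appears to target
nlp = spacy.load('en')
s2v = Sense2VecComponent('/path/to/reddit_vectors-1.1.0')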
nlp.add_pipe(s2v)
doc = nlp(u"A sentence about natural language processing.")
most_similar = doc[3]._.s2v_most_similar(3)
# [(('natural language processing', 'NOUN'), 1.0),
# (('machine learning', 'NOUN'), 0.8986966609954834),
# (('computer vision', 'NOUN'), 0.8636297583580017)]
See here for the interactive demo using a sense2vec model trained on Reddit comments. Using this model, "car park" returns things like "parking lot" and "parking garage", and "Donald Trump" gives you "Sarah Palin", "Mitt Romney" and "Barack Obama". For ambiguous entities, you can also include the entity label: for example, "Niagara Falls|GPE" will show terms similar to the geopolitical entity (GPE), i.e. the city as opposed to the actual waterfalls. The results obviously depend on what was present in the data, so for even more specific similarities, you could also experiment with training your own sense2vec vectors.

Understanding Themes in Google BigQuery GDELT GKG 2.0

I'm using Google BigQuery to analyze the GDELT GKG 2.0 dataset and would like to better understand how to query based on themes (or V2Themes). The docs mention a 'Category List' spreadsheet, but so far I've been unsuccessful in finding that list.
The following awesome blog post mentions that you can use the World Bank Taxonomy, among others, to narrow down your search. My objective is to find all items that mention "droughts / too little water", all items that mention "floods / too much water" and all items that mention "poor quality / too dirty water" that have a geographical match on a sub-country level.
So far I've been able to get a list of distinct themes, but it is not exhaustive and I don't understand its hierarchy / structure.
SELECT
DISTINCT theme
FROM (
SELECT
GKGRECORDID,
locations,
REGEXP_EXTRACT(themes,r'(^.[^,]+)') AS theme,
CAST(REGEXP_EXTRACT(locations,r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){2}([^#]*)') AS location_countrycode,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){3}([^#]*)') AS location_adm1code,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){4}([^#]*)') AS location_adm2code,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){5}([^#]*)') AS location_latitude,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){6}([^#]*)') AS location_longitude,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){7}([^#]*)') AS location_featureid,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){8}([^#]*)') AS location_characteroffset,
DocumentIdentifier
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`,
UNNEST(SPLIT(V2Locations,';')) AS locations,
UNNEST(SPLIT(V2Themes,';')) AS themes
WHERE
_PARTITIONTIME >= "2018-08-20 00:00:00"
AND _PARTITIONTIME < "2018-08-21 00:00:00" )
WHERE
(location_type = 5
OR location_type = 4
OR location_type = 2) --WorldState, WorldCity or US State
ORDER BY
theme
And a list of water related themes I've been able to find so far (sample, not exhaustive):
CRISISLEX_C06_WATER_SANITATION
ENV_WATERWAYS
HUMAN_RIGHTS_ABUSES_WATERBOARD
HUMAN_RIGHTS_ABUSES_WATERBOARDED
HUMAN_RIGHTS_ABUSES_WATERBOARDING
NATURAL_DISASTER_FLOODWATER
NATURAL_DISASTER_FLOODWATERS
NATURAL_DISASTER_FLOOD_WATER
NATURAL_DISASTER_FLOOD_WATERS
NATURAL_DISASTER_HIGH_WATER
NATURAL_DISASTER_HIGH_WATERS
NATURAL_DISASTER_WATER_LEVEL
TAX_AIDGROUPS_WATERAID
TAX_DISEASE_WATERBORNE_DISEASE
TAX_DISEASE_WATERBORNE_DISEASES
TAX_FNCACT_WATERBOY
TAX_FNCACT_WATERMAN
TAX_FNCACT_WATERMEN
TAX_FNCACT_WATER_BOY
TAX_WEAPONS_WATER_CANNON
TAX_WEAPONS_WATER_CANNONS
TAX_WORLDBIRDS_WATERFOWL
TAX_WORLDMAMMALS_WATER_BUFFALO
UNGP_CLEAN_WATER_SANITATION
WATER_SECURITY
WB_1000_WATER_MANAGEMENT_STRUCTURES
WB_1021_WATER_LAW
WB_1063_WATER_ALLOCATION_AND_WATER_SUPPLY
WB_1064_WATER_DEMAND_MANAGEMENT
WB_1199_WATER_SUPPLY_AND_SANITATION
WB_1215_WATER_QUALITY_STANDARDS
WB_137_WATER
WB_138_WATER_SUPPLY
WB_139_SANITATION_AND_WASTEWATER
WB_140_AGRICULTURAL_WATER_MANAGEMENT
WB_141_WATER_RESOURCES_MANAGEMENT
WB_143_RURAL_WATER
WB_144_URBAN_WATER
WB_1462_WATER_SANITATION_AND_HYGIENE
WB_149_WASTEWATER_TREATMENT_AND_DISPOSAL
WB_150_WASTEWATER_REUSE
WB_155_WATERSHED_MANAGEMENT
WB_156_GROUNDWATER_MANAGEMENT
WB_159_TRANSBOUNDARY_WATER
WB_1729_URBAN_WATER_FINANCIAL_SUSTAINABILITY
WB_1731_NON_REVENUE_WATER
WB_1778_FRESHWATER_ECOSYSTEMS
WB_1790_INTERNATIONAL_WATERWAYS
WB_1798_WATER_POLLUTION
WB_1805_WATERWAYS
WB_1998_WATER_ECONOMICS
WB_2008_WATER_TREATMENT
WB_2009_WATER_QUALITY_MONITORING
WB_2971_WATER_PRICING
WB_2981_DRINKING_WATER_QUALITY_STANDARDS
WB_2992_FRESHWATER_FISHERIES
WB_427_WATER_ALLOCATION_AND_WATER_ECONOMICS
While this link is provided as a theme listing:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx
...it is far from complete (perhaps just the original theme list?). I just pulled a single day's worth of GKG, and there are tons of themes not on the list of 283 themes in that spreadsheet.
GKG documentation located at https://blog.gdeltproject.org/world-bank-group-topical-taxonomy-now-in-gkg/ points to a World Bank Taxonomy located at http://pubdocs.worldbank.org/en/275841490966525495/Theme-Taxonomy-and-definitions.pdf. The GKG post implies this World Bank taxonomy has been rolled into the GKG theme list.
This is presented as a complete listing of World Bank Taxonomy themes. Unfortunately, I've found numerous World Bank themes in GKG that aren't in this publication. The union of these two lists represents a portion of GKG themes, but it definitely isn't all of them.
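Since neither published list is complete, one option is to build the theme list from the data itself. The Python sketch below assumes, as the query in the question does, that a V2Themes field is a semicolon-separated list of `THEME,offset` entries. The function names and the sample offsets are invented for illustration; only the theme codes come from the list above.

```python
from collections import Counter


def theme_names(v2themes):
    """Return the theme codes in one V2Themes field, dropping the character offsets."""
    names = []
    for entry in v2themes.split(";"):
        name = entry.split(",", 1)[0].strip()
        if name:  # skip the empty piece left by a trailing semicolon
            names.append(name)
    return names


def count_matching_themes(fields, keywords):
    """Count, per theme code, the records whose themes contain any of the keywords."""
    counts = Counter()
    for field in fields:
        # set() so that a theme repeated within one record is counted once
        for name in set(theme_names(field)):
            if any(keyword in name for keyword in keywords):
                counts[name] += 1
    return counts


# Two made-up V2Themes values using theme codes from the list above.
sample_fields = [
    "NATURAL_DISASTER_FLOODWATER,120;WB_137_WATER,300;WB_137_WATER,450;",
    "WB_137_WATER,15;TAX_FNCACT_WATERMAN,88",
]
flood_or_world_bank = count_matching_themes(sample_fields, ["FLOOD", "WB_"])
```

Grouping the resulting codes by their leading prefix (`WB_`, `TAX_`, `NATURAL_DISASTER_`, and so on) is a cheap way to see which taxonomy each theme probably comes from.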
Here is the list of GKG Themes:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx
If anyone needs this, I have added a list of all themes in GKG v1 for the period 1/1/2017 to 31/12/2020 that are present in at least 10 articles on a given day: Themes.parquet
It consists of 17639 unique themes with the count per day.
The complete figures for that 4-year dataset are 36 713 385 unique actors, 50 845 unique themes and 26 389 528 unique organizations. These numbers are not filtered for different spellings of the same entity, so Donald Trump and Donald J. Trump count as two separate actors.
The best GDELT GKG Themes list I could find is here, as described in this blog post.
I put it into a CSV file, which I find slightly easier to work with, and put that file here.

Issue regarding the Attribute Names

Xml Document
I am having a problem with the XML attribute names coming from SharePoint: the XML document contains attribute names such as ows_Description0 and ows_Long_x0020_Desc.
<z:row ows_LinkFilename="Aerospace Energy.jpg"
ows_Title="Aerospace"
ows_ContentType="Image"
ows__ModerationStatus="0"
ows_PreviewOnForm="Aerospace Energy.jpg"
ows_ThumbnailOnForm="Technology Experience/Aerospace Energy.jpg"
ows_Modified="2011-12-07 12:02:34"
ows_Editor="1073741823;#System Account"
ows_Description0="Honeywell's SmartPath® Ground-Based Augmentation System (GBAS), which offers airports improved efficiency and capacity, greater navigational accuracy, and fewer weather-related delays."
ows_ID="28"
ows_Created="2011-12-02 11:26:01"
ows_Author="1073741823;#System Account"
ows_FileSizeDisplay="6091"
ows_Mode="Energy"
ows_Solution="Business"
ows_Long_x0020_Desc="Honeywell's SmartTraffic™ and IntuVue® 3-D Weather Radar technologies make the skies safer and enable pilots to more efficiently route flights. SmartTraffic ."
ows_Brief_x0020_Desc="Honeywell's Required Navigation Performance (RNP) capabilities enable aircraft to fly more precise approaches through tight corridors and congested airports, leading to fewer delays."
ows_Tags="True"
ows__Level="1"
ows_UniqueId="28;#{928FDA3E-94FA-47A5-A9AD-B5D98C12C18C}"
ows_FSObjType="28;#0"
ows_Created_x0020_Date="28;#2011-12-02 11:26:01"
ows_ProgId="28;#"
ows_FileRef="28;#Technology Experience/Aerospace Energy.jpg"
ows_DocIcon="jpg"
ows_MetaInfo="28;#Solution:SW|Business vti_thumbnailexists:BW|true vti_parserversion:SR|14.0.0.4762 Category:SW|Enter Choice #1 Description0:LW|Honeywell's SmartPath® Ground-Based Augmentation System (GBAS), which offers airports improved efficiency and capacity, greater navigational accuracy, and fewer weather-related delays. vti_stickycachedpluggableparserprops:VX|wic_XResolution Subject vti_lastheight vti_title vti_lastwidth wic_YResolution oisimg_imageparsedversion vti_lastwidth:IW|294 vti_author:SR|SHAREPOINT\\system vti_previewexists:BW|true vti_modifiedby:SR|SHAREPOINT\\system Long Desc:LW|Honeywell's SmartTraffic™ and IntuVue® 3-D Weather Radar technologies make the skies safer and enable pilots to more efficiently route flights. SmartTraffic . Keywords:LW| vti_foldersubfolderitemcount:IR|0 vti_lastheight:IW|172 ContentTypeId:SW|0x0101009148F5A04DDD49CBA7127AADA5FB792B00AADE34325A8B49CDA8BB4DB53328F21400623D4FCEEB2ADC4EA8269BF873F0BB6F _Author:SW| vti_title:SW|Aerospace wic_System_Copyright:SW| Mode:SW|Energy Tags:SW|True wic_YResolution:DW|96.0000000000000 oisimg_imageparsedversion:IW|4 Brief Desc:LW|Honeywell's Required Navigation Performance (RNP) capabilities enable aircraft to fly more precise approaches through tight corridors and congested airports, leading to fewer delays. _Comments:LW| wic_XResolution:DW|96.0000000000000 Subject:SW|Aerospace vti_folderitemcount:IR|0"
ows_Last_x0020_Modified="28;#2011-12-07 12:02:34"
ows_owshiddenversion="6"
ows_FileLeafRef="28;#Aerospace Energy.jpg"
ows_PermMask="0x7fffffffffffffff"
xmlns:z="#RowsetSchema" />
Could you please suggest a solution for this?
When SharePoint returns data as XML, it always uses this format:
Field names are prefixed with ows_.
Internal field names are used, not display names.
Internal field names in SharePoint encode special characters as escaped Unicode code points,
e.g. if you create a field named 'Field Name' from the SharePoint UI,
SharePoint creates the internal name 'Field_x0020_Name',
where 0020 is the hexadecimal Unicode code point of the space character.
If fields are created by code or by a feature, however, you can specify your own internal and display names.
So if you are parsing such XML, your code has to account for these rules.
SharePoint does not add the _x0020_ escape sequence to a field's internal name unless there is a space in the display name when the field is created from the UI.
Also, once a field is created, changing its display name has no effect on its internal name.
So if you create a field 'Long Desc' from the UI and later rename it to 'LongDesc', the internal name will still be Long_x0020_Desc.
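When parsing such XML, the rules above can be undone with a small helper. This is a minimal Python sketch; the function name `decode_internal_name` is hypothetical, and it assumes every escape has the `_xHHHH_` form (four hexadecimal digits) seen in the sample.

```python
import re

# Matches SharePoint-style escapes such as _x0020_ (four hexadecimal digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_internal_name(attribute_name):
    """Strip the ows_ prefix and turn _xHHHH_ escapes back into characters."""
    name = attribute_name
    if name.startswith("ows_"):
        name = name[len("ows_"):]
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), name)
```

For instance, `decode_internal_name("ows_Long_x0020_Desc")` gives `"Long Desc"`. Keep in mind that this recovers the name the field had when it was created, which may differ from its current display name.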

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain that can extract information such as institute, location and course from free text.
Currently I am doing it with Lucene. The steps are as follows:
Index all the data related to institutes, courses and locations.
Make shingles of the free text, search for each shingle in the location, course and institute index directories, and then try to work out which part of the text represents a location, a course, etc.
With this approach I am missing a lot of cases; for example, B.tech can be written as btech, b-tech or b.tech.
Is there anything available that can handle all of this? I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. Among its many features, the two most frequently used are rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible variants like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here is a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is a "Token" annotation whose "kind" feature equals "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or similar that records the relation between these forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer / tokenizer, because words like B.tech can easily be split into 2 different words (i.e. B and tech).
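The link-table idea can be illustrated in Python, independently of Lucene. In this sketch every written form is reduced to a normalized key (lowercase, punctuation removed) and looked up in a small alias table. The table contents and the function names are made up for the example, and the token pattern deliberately keeps dots and hyphens inside a token so that "B.tech" is not split in two.

```python
import re

# Hypothetical link table: normalized key -> canonical course name.
COURSE_ALIASES = {
    "btech": "B.Tech",
    "mtech": "M.Tech",
    "mba": "MBA",
}


def normalize_term(term):
    # "B.Tech", "b-tech" and "btech" all collapse to "btech".
    return re.sub(r"[^a-z0-9]", "", term.lower())


def find_courses(text):
    """Return the canonical course names mentioned in `text`, in order of appearance."""
    found = []
    # A token starts with a letter and may contain dots and hyphens.
    for token in re.findall(r"[A-Za-z][A-Za-z.\-]*", text):
        canonical = COURSE_ALIASES.get(normalize_term(token))
        if canonical is not None:
            found.append(canonical)
    return found
```

Applying the same normalization at indexing time and at query time is what keeps the two sides comparable.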
You may want to check UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has addons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))))
;
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
Chunkers.regexp("Token", "\\w+"),
Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
new GraphExpChunker("Address",
seq(
opt(streetAddress),
opt(Postoffice),
City,
StateLike,
Postcode,
Country
)
).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
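To get a feel for which variants a fuzzy query could reach, here is a plain Python implementation of the Levenshtein distance. It only illustrates the metric and is not how Lucene implements fuzzy matching internally.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]
```

Both "b.tech" and "b-tech" are one edit away from "btech", just as "foam" and "roams" are one edit away from "roam". Whether a fuzzy query actually sees them as single terms still depends on the analyzer not splitting them first.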