UniData - record count of all files / tables

Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables (files) are heavily populated, and which ones are not?

You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g. using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see what fields it contains:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which tell you what columns are in the file. V- or I-types are calculations.
https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful with profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.
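Finally, for the record-count part of your question: assuming the standard UniQuery COUNT verb, you can get an exact record count for any one file with
:COUNT CUSTOMERS
Run that against each file from the VOC listing (a saved paragraph or a short UniBasic program can loop over the list) and note which files come back with zero or very few records. Like your information-schema trick, it tells you what's in use; unlike it, COUNT actually scans the file, so it can take a while on very large ones.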


Apache Lucene: Creating an index between strings and doing intelligent searching

My problem is as follows: Let's say I have three files. A, B, and C. Each of these files contains 100-150M strings (one per line). Each string is in the format of a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
We can think of file A as corresponding to circuit signals in a hardware modeling language, file B to circuit signals in a schematic netlist, and file C to circuit signals in a layout (for manufacturing).
Now a signal will have a mapping between File A <-> File B <-> File C. For example in this case, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me, and I don't expect the matcher to figure this out for me.
Now say I search for '/arbiter/par0/unit1/sigA'. In this case, the matcher should return a direct match from file A, since the string is found there. For files B and C a direct match is not possible, so it should return the best possible matches (by edit distance?). In this example it could give /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of giving the full string to search for, I could also give something like *par0*unit1*sigA, and it should give me all the possible matches from files A/B/C.
I am looking for solutions, and came across Apache Lucene. However, I am not totally sure if this would work. I am going through the docs to get some idea.
My main requirements are the following:
There will be 3 text files with full paths to signals. (I can adjust the format to make it more compact if that helps build the index more quickly.)
Building the index should be fairly fast (a couple of hours at most). The files above are static (no modifications).
Searching should be comprehensive. It is OK if it takes ~1 s per search, but the matching should support direct match, regex/wildcard match, and edit-distance matching. The main challenge is that each file can have 100-150 million signals.
Can someone tell me if such a use case can be easily addressed by Lucene? What would be the correct way to go about building an index and doing quick searches? I would like to write some proof-of-concept code and test the performance. Thanks.
I think, based on your requirements, the best first step would be a PoC with a given test set of entries. Based on that you can evaluate whether the indexing time you are targeting is achievable. Because you only use static information it is easier: you don't have to care about topics like NRT (near-real-time search).
Personally I have never used Lucene for such a big data set, but I think Lucene is able to handle it.
How I would do it:
- Read tutorials and best practices about Lucene, indexing, and searching, and understand how it works.
- Define a data set for indexing, let's say 1,000 lines from each file.
- Define your Lucene document structure. This is really important, because your searches will be built on it. Take care with analyzer tasks such as tokenization, if and how it is needed; if you need full-text search, use a TextField.
- Write code for simple indexing.
- Run small tests with indexing and inspect your index with Luke.
- Write code for simple searching.
- Define queries and your expected results, execute the searches, and check the results.
- Try to structure your code: separate indexing and searching, so it will be easier to refactor. (A rough sketch of both parts follows below.)
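To make that concrete, here is a rough, minimal sketch (not a definitive implementation) of both parts in Java. It assumes a recent Lucene release on the classpath; the field names ("path", "tokens") and the input file "a.txt" are made-up placeholders:
import java.nio.file.*;
import java.util.stream.Stream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class SignalIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("signal-index"));

        // Indexing: one Lucene document per signal path.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
             Stream<String> lines = Files.lines(Paths.get("a.txt"))) {
            lines.forEach(line -> {
                try {
                    Document doc = new Document();
                    doc.add(new StringField("path", line, Field.Store.YES));                  // exact value -> direct/wildcard match
                    doc.add(new TextField("tokens", line.replace('/', ' '), Field.Store.NO)); // analyzed copy -> token/fuzzy match
                    writer.addDocument(doc);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }

        // Searching: direct, wildcard, and fuzzy (edit-distance) queries.
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query direct   = new TermQuery(new Term("path", "/arbiter/par0/unit1/sigA"));
            Query wildcard = new WildcardQuery(new Term("path", "*par0*unit1*sigA"));
            Query fuzzy    = new FuzzyQuery(new Term("tokens", "arbiter_sch"), 2); // max edit distance 2
            for (Query q : new Query[]{direct, wildcard, fuzzy}) {
                for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                    System.out.println(q + " -> " + searcher.doc(hit.doc).get("path"));
                }
            }
        }
    }
}
Index all three files the same way (perhaps with an extra field marking the source as A/B/C), run it against a 1,000-line sample first, and measure. Note that FuzzyQuery is limited to an edit distance of 2 per term, so for whole-path similarity you may need to combine token-level fuzzy matches and rank the results yourself.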

Dictionary API used for stressed syllables

This might end up being a very general question, but hopefully it will be useful to others as well.
I want to be able to request a word that has x syllables, with the stress on syllable y. I've found plenty of APIs that return both of these pieces of information, such as Wordnik, but I'm not sure how to approach the search aspect. The URL to get the syllables is
GET /word.json/{word}/hyphenation
but I won't know the word ahead of time to make this request. They also have this:
GET /words.json/randomWords
which returns a list of random words.
Is there a way to achieve what I want with this API without asking for random words over and over and checking if they meet my needs? That just seems like it would be really slow and push me over my usage limits.
Do I need to build my own data structure with the words and syllables to query locally?
I doubt you'll find this kind of specialized query on any of the big dictionary APIs. You'll need to download an English dictionary and create your own data structure to do this kind of thing.
The Moby Project has a hyphenated dictionary with about 185,000 words in it. There are many other dictionary projects available. A good place to start looking is http://www.dicts.info/dictionaries.php.
Once you've downloaded the dictionary, you'll need to preprocess it to build your data structure. You should be able to construct a dictionary or hash map that is indexed by (syllables, emphasis) and whose value is a list of words. So you'd have an entry with key (4, 2) (4-syllable words with emphasis on the 2nd syllable) and a list of all such words.
To query it, then, you'd just pack the query into a structure and look up that key in the hash map. Then pick a random word from the resulting list.
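For example, here is a minimal sketch of that structure in Java. It assumes you have already parsed your downloaded dictionary into (word, syllable count, stressed syllable) entries; the words added in main are just made-up sample data:
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

public class SyllableIndex {
    private final Map<String, List<String>> byPattern = new HashMap<>();

    private static String key(int syllables, int stressed) {
        return syllables + "-" + stressed;
    }

    public void addWord(String word, int syllables, int stressed) {
        byPattern.computeIfAbsent(key(syllables, stressed), k -> new ArrayList<>()).add(word);
    }

    // Return a random word with the requested shape, or null if none is known.
    public String randomWord(int syllables, int stressed) {
        List<String> words = byPattern.getOrDefault(key(syllables, stressed), List.of());
        if (words.isEmpty()) return null;
        return words.get(ThreadLocalRandom.current().nextInt(words.size()));
    }

    public static void main(String[] args) {
        SyllableIndex index = new SyllableIndex();
        index.addWord("elephant", 3, 1);  // e-le-phant, stress on syllable 1
        index.addWord("banana", 3, 2);    // ba-na-na, stress on syllable 2
        index.addWord("umbrella", 3, 2);
        System.out.println(index.randomWord(3, 2)); // banana or umbrella
    }
}
Any hash map in any language works the same way; the key point is that the expensive work happens once at preprocessing time, and each query is then a constant-time lookup plus a random pick.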

Manipulation of Large Files in R

I have 15 files of data, each around 4.5GB. Each file is a month's worth of data for around 17,000 customers. All together, the data represents information on 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each denoting a month, I have 17,000 files, one per customer, with all of that customer's data. I wrote a script to do this:
library(data.table) #for fread
#the variable 'files' is a vector of locations of the 15 month files
exists = NULL #this vector keeps track of customers who already have a file created for them
for (w in 1:15){ #for each of the 15 month files
  month = fread(files[w], select = c(2,3,6,16)) #read in the data I want
  custlist = unique(month$CustomerID) #a list of all customers in this month file
  for (i in 1:length(custlist)){ #for each customer in this month file
    curcust = custlist[i] #the current customer
    newchunk = subset(month, CustomerID == curcust) #all the data for this customer
    filename = sprintf("cust%s", curcust) #what the filename for this customer will be (or already is)
    if (curcust %in% exists){ #if a file has already been created for this customer, open it, add to it, and write it back
      custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv", filename))) #read in the existing file
      custfile$V1 = NULL #remove the extra column that fread adds
      custfile = rbind(custfile, newchunk) #combine the existing data with the new data
      write.csv(custfile, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
    } else { #if it has not been created, write newchunk to a csv
      write.csv(newchunk, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
      exists = rbind(exists, curcust, deparse.level = 0) #add this customer to the list of existing files
    }
  }
}
The script works (at least, I'm pretty sure). The problem is that it is incredibly slow. At the rate I'm going, it's going to take a week or more to finish, and I don't have that time. Do any of you know a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.
Like @Dominic Comtois, I would also recommend using SQL.
R can handle fairly big data - there is a nice benchmark of 2 billion rows in which it beats Python - but because R runs mostly in memory, you need a good machine to make it work. Still, your case doesn't need to load more than one 4.5GB file at a time, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.
You can use R to load the data into a SQL database and later query it from the database.
If you don't know SQL, you may want to use a simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency; the RSQLite package ships with the database engine embedded.
library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db")
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE)
  cat("data for", monthfile, "loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data I would recommend to do below
setDT(df)
dbDisconnect(conn)
That's all. You get to use SQL without much of the overhead usually associated with databases.
If you prefer to go with the approach from your post, I think you can dramatically speed it up by doing the write.csv by groups while aggregating in data.table.
library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  fread(monthfile)[, write.csv(.SD, file=paste0(CustomerID, ".csv"), append=TRUE), by=CustomerID]
  cat("data for", monthfile, "written to csv\n")
}
So you utilize the fast unique from data.table and perform the subsetting while grouping, which is also ultra fast. Below is a working example of the approach.
library(data.table)
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]
Update 2016-12-05:
Starting from data.table 1.9.8+, you can replace write.csv with fwrite; see the example in this answer.
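For reference, a minimal sketch of the fwrite variant (assuming data.table 1.9.8 or later), mirroring the toy example above:
library(data.table)
data.table(a=1:4, b=5:6)[, fwrite(.SD, file=paste0(b, ".csv")), by=b]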
I think you already have your answer. But to reinforce it, see the official documentation, R Data Import/Export, which states:
In general, statistical systems like R are not particularly well suited to manipulations of large-scale data. Some other systems are better than R at this, and part of the thrust of this manual is to suggest that rather than duplicating functionality in R we can make another system do the work! (For example Therneau & Grambsch (2000) commented that they preferred to do data manipulation in SAS and then use package survival in S for the analysis.) Database manipulation systems are often very suitable for manipulating and extracting data: several packages to interact with DBMSs are discussed here.
So clearly storage of massive data is not R's primary strength, yet it provides interfaces to several tools specialized for this. In my own work, the lightweight SQLite solution is enough, even if it's a matter of preference, to some extent. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.
You should find SQLite's documentation pretty smooth to follow. If you have enough programming experience, doing a tutorial or two should get you going pretty quickly on the SQL front. I don't see anything overly complicated going on in your code, so the most common & basic queries such as CREATE TABLE, SELECT ... WHERE will likely meet all your needs.
Edit
Another advantage of using a DBMS that I didn't mention is that you can have views, which make other ways of organizing the data easily accessible, so to speak. By creating views, you can go back to the "by month" view of the data without having to rewrite any table or duplicate any data.
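For example, a rough SQLite sketch (hypothetical: it assumes the loading loop also stored which month file each row came from, say in a monthfile column of mytablename):
CREATE VIEW month1_view AS
SELECT * FROM mytablename WHERE monthfile = 'month1';

-- reading a "month" back is now just:
SELECT * FROM month1_view WHERE customerid = 1;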

SQL Most effective way to store every word in a document separately

Here's my situation (or see TLDR at bottom): I'm trying to make a system that will search for user entered words through several documents and return the documents that contain those words. The user(s) will be searching through thousands of documents, each of which will be 10 - 100+ pages long, and stored on a webserver.
The solution I have right now is to store each unique word in a table with an ID (only maybe 120 000 relevant words in the English language), and then in a separate table store the word id, the document it is in, and the number of times it appears in that document.
E.g: Document foo's text is
abc abc def
and document bar's text is
abc def ghi
The Documents table will have:
id | name
1  | 'foo'
2  | 'bar'
The Words table:
id | word
1  | 'abc'
2  | 'def'
3  | 'ghi'
The Word Document table:
word id | doc id | occurrences
1       | 1      | 2
1       | 2      | 1
2       | 1      | 1
2       | 2      | 1
3       | 2      | 1
As you can see, when you have thousands of documents and each has thousands of unique words, the Word Document table blows up very quickly and takes way too long to search through.
TL;DR My question is this:
How can I store searchable data from large documents in an SQL database, while retaining the ability to use my own search algorithm (I am aware SQL has one built in for .doc and PDF files) based on custom factors (such as occurrences, among others), without an outright massive table linking each word to a document and its properties in that document?
Sorry for the long read and thanks for any help!
Rather than building your own search engine using SQL Server, have you considered using a C# .NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net
Good question. I would piggyback on the existing SQL Server solution (full-text indexing). It has an integrated indexing engine which optimises considerably better than your own code probably could (or the developers at Microsoft were lazy, or they just got a dime to build it :-)
Please see the SQL Server full-text indexing background. You could query views such as sys.fulltext_index_fragments or use the stored procedures.
Of course, piggybacking on an existing solution has some drawbacks:
You need to have a license for the solution.
When your needs can no longer be served by it, you will have to program it all yourself.
But if you let SQL Server do the indexing, you can build your own solution more easily and in less time.
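A rough T-SQL sketch of what piggybacking on full-text indexing could look like (the table, column, and index names are hypothetical, and the full-text feature must be installed):
-- hypothetical schema: Documents(id INT PRIMARY KEY, name NVARCHAR(200), content NVARCHAR(MAX))
CREATE FULLTEXT CATALOG DocsCatalog;
CREATE FULLTEXT INDEX ON Documents(content)
    KEY INDEX PK_Documents ON DocsCatalog;

-- find documents containing a word, ranked by relevance
SELECT d.id, d.name, k.RANK
FROM Documents d
JOIN CONTAINSTABLE(Documents, content, 'abc') k ON k.[KEY] = d.id
ORDER BY k.RANK DESC;
You could then blend the built-in RANK with your own custom factors (occurrence counts and so on) in the ORDER BY.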
Your question strikes me as being naive. In the first place... you are begging the question. You are giving a flawed solution to your own problem... and then explaining why it can't work. Your question would be much better if you simply described what your objective is... and then got out of the way so that people smarter than you could tell you HOW to accomplish that objective.
Just off hand... the database sounds like a really dumb idea to me. People have been grepping text with command line tools in UNIX-like environments for a long time. Either something already exists that will solve your problem or else a decent perl script will "fake" it for you-- depending on your real world constraints, of course.
Depending on what your problem actually is, I suspect that this could get into some really interesting computer science questions-- indexing, Bayesian filtering, and who knows what else. I suspect, however, that you're making a very basic task more complicated than it needs to be.
TL;DR My answer is this:
Why wouldn't you just write a script to go through a directory... and then use regexes to count the occurrences of the word in each file that is found there? (A sketch of that idea is below.)
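A small sketch of that idea, here in Java rather than Perl. It assumes the documents are plain-text files under a docs/ directory and that "abc" is the search term (Java 11+ for Files.readString):
import java.io.IOException;
import java.nio.file.*;
import java.util.regex.*;
import java.util.stream.Stream;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Pattern word = Pattern.compile("\\babc\\b"); // the search term
        try (Stream<Path> paths = Files.walk(Paths.get("docs"))) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    Matcher m = word.matcher(Files.readString(p));
                    long hits = 0;
                    while (m.find()) hits++;
                    if (hits > 0) System.out.println(p + ": " + hits); // file and occurrence count
                } catch (IOException e) {
                    // skip files that cannot be read as text
                }
            });
        }
    }
}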

Model clause in Oracle

I have recently become inclined towards Oracle, and the more I look into it, the more it attracts me.
I have recently come across the MODEL clause, but to be honest I don't understand its behaviour. Can anyone explain it to me with some examples?
Thanks in advance
Some examples of MODEL are given here.
Personally, I've looked at MODEL several times and have not yet succeeded in finding a use case for it. While it at first appears useful, there are a lot of places where only literals work (rather than binds or variables), which restricts its flexibility. For example, for inter-row calculations you can't readily refer to the 'previous' or 'next' row, but have to be able to absolutely identify it by its attributes. So you can't say 'take the value of the row with the same date in the previous month', but can only code a specific date.
It might be used (internally) by some analytical tools. But as an end-user tool, I never 'got' it. I have previously recommended that, if you ever find a problem you think can be solved by the application of the MODEL clause, go and have a lie down until the feeling wears off.
I think the MODEL clause is quite simple to understand when you slowly read the official whitepaper. In my opinion, the whitepaper explains the MODEL clause nicely, step by step, adding one feature at a time to the examples and leaving the most advanced features to the official documentation.
From the whitepaper, I also find it easy to understand when to actually use the MODEL clause. In some cases it is a lot simpler to do "Excel-spreadsheet-like" operations using MODEL rather than, for instance, using window functions, CONNECT BY, or subquery factoring. Think about Excel. Whenever you want to define a complex rule set for Excel columns, use the MODEL clause. Example Excel spreadsheet rules:
A10 = A9 + A8
B10 = A10 * 5
C10 = MAX(A1:A9)
D10 = C10 / A10
In other words, MODEL is a very powerful SQL spreadsheet!
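To give a feel for the mapping, here is a rough sketch (not taken from the whitepaper) of how those four rules might look in a MODEL clause, against a hypothetical table t(rn, a, b, c, d) where rn plays the role of the spreadsheet row number:
SELECT rn, a, b, c, d
FROM   t
MODEL
  DIMENSION BY (rn)
  MEASURES (a, b, c, d)
  RULES (
    a[10] = a[9] + a[8],                 -- A10 = A9 + A8
    b[10] = a[10] * 5,                   -- B10 = A10 * 5
    c[10] = MAX(a)[rn BETWEEN 1 AND 9],  -- C10 = MAX(A1:A9)
    d[10] = c[10] / a[10]                -- D10 = C10 / A10
  )
ORDER BY rn;
Each rule assigns one cell, addressed by its dimension value, much like a spreadsheet cell reference; by default the rules are applied in the order they are listed.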
The best explanation is in the official white paper. It uses the SH demo schema and you really need it installed.
http://www.oracle.com/technetwork/middleware/bi-foundation/10gr1-twp-bi-dw-sqlmodel-131067.pdf
I don't think they do a very good job of explaining this. It basically lets you load data into an array and then loop through the array using straight SQL, instead of having to write procedural logic. A lot of the terms are based on spreadsheet terms (they are used in the Excel help), so if you haven't come across them in Excel, this can be confusing.
They should have drawn a picture for each of the queries, shown the array that gets created, and then shown how you loop through the array. The syntax looks to be based on Excel syntax; I'm not sure whether this is common to all spreadsheet tools or not.
It has uses. Bin fitting is the most common; see the 2nd example. That is basically a complex GROUP BY where you are grouping by a range, but that range can change, which requires procedural logic. The example gives 3 ways to do it, one of which is the MODEL clause.
http://www.oracle.com/technetwork/issue-archive/2012/12-mar/o22asktom-1518271.html
I think people (often managers) who do complex spreadsheet calculations may have an easier time seeing uses for this and getting the lingo.