TI-84 syntax error relating to a single list conversion to a matrix

:N-remainder(dim(L1),N→ dim(L2)
:Fill(23,L2
:augment(L1,L2→L1
:{1,1→dim([A]
:For(x,1,dim(L1)/N
:augment([A],List▶matr(seq(L1(I),I,Nx-N+1,Nx),[B]
:End
I get a syntax error when running this TI-Basic code and I cannot figure out why (it happens somewhere when the list is being converted to a matrix). Basically, this code is supposed to take a list L1, append 23 until dim(L1) is a multiple of N, and then create a matrix with N rows and -int(-dim(L1)/N) columns.
Example:
Let N=3 and
L1 = {9,12,15,22,5,9,14,4,9,1,14,7,9,18,12,19}
dim(L1) = 16, which is not a multiple of 3 (18 is, so add 23 to L1 twice)
L1 = {9,12,15,22,5,9,14,4,9,1,14,7,9,18,12,19,23,23}
dim(L1) = 18 which is a multiple of 3
Create a 3x6 matrix with Col1 = {9,12,15}, Col2 = {22,5,9}, ..., Col6 = {19,23,23}
Read the full conversation here: http://tibasicdev.wikidot.com/forum/t-1039272/comments/show?from=activities#post-2131820

There are at least two issues with your code:
(1) For the augment command, both matrices must have the same number of rows. In your program, matrix [A] is set to dimension {1,1} (why?), but the columns you want to append have N rows. So you'll get a "dimension error".
(2) The List▶matr command doesn't return a matrix (actually it doesn't return anything); it stores its result in the matrix named as its last argument. So you can't use it as the second parameter of the augment command. Instead you must run it first and then use something like augment([A],[B])→[C].
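To make the intended result concrete, here is a minimal sketch of the same pad-and-reshape logic in Python rather than TI-Basic. The function name `pad_and_reshape` and the `filler` parameter are made up for illustration. Unlike the program in the question, this version appends nothing when the length is already a multiple of N.

```python
def pad_and_reshape(values, n, filler=23):
    """Pad values with filler to a multiple of n, then build an n-row matrix
    whose columns are consecutive chunks of n elements."""
    # -len % n is the number of fillers needed (0 if already a multiple of n).
    padded = list(values) + [filler] * (-len(values) % n)
    # Each chunk of n consecutive elements becomes one column.
    columns = [padded[i:i + n] for i in range(0, len(padded), n)]
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]


L1 = [9, 12, 15, 22, 5, 9, 14, 4, 9, 1, 14, 7, 9, 18, 12, 19]
matrix = pad_and_reshape(L1, 3)
first_column = [row[0] for row in matrix]
last_column = [row[-1] for row in matrix]
```

With the example list and N=3, `matrix` has 3 rows and 6 columns, `first_column` is the first three list elements, and `last_column` ends with the two appended 23s.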


Why can't I read all of the values in the matrix in scilab?

I am trying to read a CSV file, and my code is as follows:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 3 4]); //reads number of clusters and features
data=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%f",'double',[],[],[3 1 19 4]); //reads the values
numft=param(1,1);//save number of features
numcl=param(2,1);//save number of clusters
data_pts=0;
data_pts = max(size(data, "r"));//checks how many number of rows
disp(data(numft-3:data_pts,:));//print all data points (I added -3 otherwise it displays only 15 rows)
disp(numft);//print features
disp(data_pts);//print number of rows
disp(param);
endfunction
Below are the values that I am trying to read:
features,4,,
clusters,3,,
5.1,3.5,1.4,0.2
4.9,3,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5,3.6,1.4,0.2
7,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,1.5
5.5,2.3,4,1.3
6.5,2.8,4.6,1.5
5.7,2.8,4.5,1.3
6.3,3.3,6,2.5
5.8,2.7,5.1,1.9
7.1,3,5.9,2.1
6.3,2.9,5.6,1.8
6.5,3,5.8,2.2
7.6,3,6.6,2.1
I do not know why the code only displays 15 rows instead of 17. The only time it displays the correct matrix is when I put -3 in numft, but with that, the number of columns would be 1. I am so confused. Is there a better way to read the values?
In the csvRead call in the first line of your script, the boundaries of the region to read are incorrect. The range argument has the form [R1 C1 R2 C2] (upper-left and lower-right cell of the block), so [1 2 3 4] spans rows 1 to 3 and columns 2 to 4. It should be corrected like this:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 2 2]);
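The effect of the range bounds can be sketched in Python (not Scilab). The helper `read_range` is hypothetical and only mimics a 1-based, inclusive [R1 C1 R2 C2] block selection. The sample text is the first four lines of the file from the question.

```python
import csv
import io

# First four lines of the file shown in the question.
CSV_TEXT = """features,4,,
clusters,3,,
5.1,3.5,1.4,0.2
4.9,3,1.4,0.2
"""


def read_range(text, r1, c1, r2, c2):
    """Return the cells in rows r1..r2 and columns c1..c2 (1-based, inclusive)."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[c1 - 1:c2] for row in rows[r1 - 1:r2]]


# Rows 1-2, column 2 only: the two header parameters.
param = [[int(cell) for cell in row] for row in read_range(CSV_TEXT, 1, 2, 2, 2)]
# Rows 3-4, columns 1-4: the numeric data (rows 3-19 in the full file).
data = [[float(cell) for cell in row] for row in read_range(CSV_TEXT, 3, 1, 4, 4)]

numft, numcl = param[0][0], param[1][0]
```

Restricting the first read to column 2 of rows 1 and 2 keeps the parameter block separate from the numeric rows that start at row 3.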

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index A B C D ..... Z
Date/Time 1 0 0 0,35 ... 1
Date/Time 0,75 1 1 1 1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example, the whole A column compared with the whole B column over the whole time range).
I am expecting an output like this:
(source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with each other and plot, but that way the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
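One thing that may be worth checking: `hierarchy.linkage` treats every row of its input as one observation, so passing the frame as-is clusters time points. A minimal sketch of clustering the columns instead, by transposing, is below. The random array and the labels A to E are stand-ins for the real data, and this does not claim to explain the empty picture.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real frame: 50 time points (rows), 5 series (columns).
X = rng.random((50, 5))
labels = ["A", "B", "C", "D", "E"]

# linkage() clusters rows, so transpose to make each column one observation.
Z = hierarchy.linkage(X.T, method="ward")

# no_plot=True computes the tree layout without drawing anything.
tree = hierarchy.dendrogram(Z, labels=labels, no_plot=True)
leaf_order = tree["ivl"]  # column labels in dendrogram order
```

With a pandas DataFrame `df`, the equivalent input would be `df.to_numpy().T` with `labels=list(df.columns)`, assuming all values are numeric.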

Apply function with pandas dataframe - POS tagger computation time

I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply the way I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
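from collections import Counter

def tagger_nouns(x):
    # st is the Stanford POS tagger instance, created elsewhere
    list_of_lists = st.tag(x.split())
    flat = [y for z in list_of_lists for y in z]
    # keep only the tag from each (word, tag) pair
    Parts_of_speech = [row[1] for row in flat]
    c = Counter(Parts_of_speech)
    nouns = c['NN']+c['NNS']+c['NNP']+c['NNPS']
    return nouns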
I'm using the Stanford tagger, but I have a big problem with computation time, and I'm using the left 3 words model. I'm noticing that it's calling the .jar file again and again (java keeps opening and closing in the task manager) and maybe that's unavoidable, but it's really taking far too long to run. Any way I can speed it up?
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2
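On the computation-time part of the question: if the tagger starts a Java process on every call, one option is to tag all rows in a single batch and count afterwards. The sketch below uses a made-up `fake_tag_sents` in place of a real batch tagger, so the tags it produces are only placeholders. As far as I know, NLTK's Stanford wrapper offers a `tag_sents` method for this, but treat that as an assumption to verify.

```python
from collections import Counter

NOUN_TAGS = ("NN", "NNS", "NNP", "NNPS")


def count_nouns(tagged_sentence):
    """Count noun tags in one sentence given as (word, tag) pairs."""
    counts = Counter(tag for _word, tag in tagged_sentence)
    return sum(counts[tag] for tag in NOUN_TAGS)


def fake_tag_sents(sentences):
    """Hypothetical stand-in for a batch tagger: one call tags every sentence."""
    lookup = {"cat": "NN", "cats": "NNS", "two": "CD"}
    return [[(word, lookup.get(word, "XX")) for word in sent] for sent in sentences]


strings = ["cat", "two cats"]              # e.g. df['string'].tolist()
tokenized = [s.split() for s in strings]
tagged = fake_tag_sents(tokenized)         # single call for all rows
num_nouns = [count_nouns(sent) for sent in tagged]
```

The resulting list can be assigned back as a column, for example `df['num_nouns'] = num_nouns`, since it has one entry per row in the original order.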

PIG Item Count and Histogram

This is a two part problem:
PART 1:
I am using the Cloudera Pig editor to transform my data. The data set is derived from the US Patents Citations data set. The first column is the "Cited" patent. The remaining values are the patents that cite the first patent.
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
I need to create Pig code that will count the number of citations that the first patent has. So, I need the output to be:
3858241 5
3858242 4
3858243 7
3858244 6
3858245 7
3858246 3
3858247 5
PART 2:
I need to create a histogram of the output from problem 1 using a PIG script.
Any help would be greatly appreciated.
Thanks
This script should work:
X = LOAD 'pigpatient.txt' using PigStorage(' ') AS (pid:int,str:chararray);
X1 = FOREACH X GENERATE pid,STRSPLIT(str, ',') AS (y:tuple());
X2 = FOREACH X1 GENERATE pid,SIZE(y) as numofcitan;
dump X2;
X3 = group X2 by numofcitan;
Histograms = foreach X3 GENERATE group as numofcitan,COUNT(X2.pid);
dump Histograms;
input:
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
Result:
(3858241,5)
(3858242,4)
(3858243,7)
(3858244,6)
(3858245,7)
(3858246,3)
(3858247,5)
Histogram output:
Number of citations, number of patents
(3,1)
(4,1)
(5,2)
(6,1)
(7,2)
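The two steps of the script (count per patent, then group by count) can be cross-checked with a short Python sketch. This is only a check of the expected numbers on the sample input, not a replacement for the Pig script.

```python
from collections import Counter

SAMPLE = """3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
"""

# Part 1: number of citing patents per cited patent.
citation_counts = {}
for line in SAMPLE.splitlines():
    cited, citing = line.split(" ", 1)
    citation_counts[int(cited)] = len(citing.split(","))

# Part 2: how many patents share each citation count.
histogram = sorted(Counter(citation_counts.values()).items())
```

Here `citation_counts` corresponds to the X2 relation and `histogram` to the Histograms relation in the script above.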
@Sravan K Reddy's answer is good enough to be a solution, but it is essential to know what a histogram is.
A histogram is a frequency distribution of a dataset and gives statistical information about the data. The most commonly used histogram types are equi-width and equi-depth (also called equi-height or height-balanced).
In database tools such as Oracle, the equi-depth histogram is preferred.
@Sravan K Reddy intends to create an equi-width histogram of patent citations. However, in order to create a histogram, the data must be sorted. That is vital for histogram construction.
If you want to create a histogram of your big data, read this paper and check Apache Pig Scripts.
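The difference between the two histogram types can be illustrated with a small Python sketch on the citation counts from the sample. The function names are made up, and the equi-depth split is deliberately simplified: it cuts the sorted data into buckets of near-equal size and may place equal values in neighbouring buckets.

```python
def equi_width_edges(values, bins):
    """Bucket boundaries that split the value range into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def equi_depth_buckets(values, bins):
    """Split the SORTED values so each bucket holds about the same number of items."""
    ordered = sorted(values)  # sorting is what makes equal-depth cuts possible
    n = len(ordered)
    return [ordered[i * n // bins:(i + 1) * n // bins] for i in range(bins)]


counts = [5, 4, 7, 6, 7, 3, 5]  # citations per patent from the sample
edges = equi_width_edges(counts, 2)
buckets = equi_depth_buckets(counts, 2)
```

Equi-width fixes the interval size and lets the bucket populations vary, while equi-depth fixes the population and lets the interval boundaries vary.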

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, in which case the matrix would have 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row now is the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but mostly to store tabular data where each column had a name. In this case each column is just one point of the data matrix. I guess I could just name them col1 to col1000? Or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I have to type in col1 to ... col1000 manually, that doesn't sound very smart. This is where I am mostly stuck. Some code snippet would help me.
Then, I need to dump the text files into the SQLite database? Again, unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for dumping this into the database...
Finally,
How to fast retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But the id1, id2, id30 are hardcoded in the code and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m), row.names = TRUE) # write; row names go into the row_names column
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set-based, and the order in which rows are stored is not guaranteed. We have used order by row_names to get the rows out in a specific order. If that is not good enough, then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.
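The two points the question was stuck on (not typing col1 to col1000 by hand, and selecting a changing set of IDs) come down to building the SQL text programmatically. Here is a minimal sketch using Python's built-in sqlite3 module rather than R. The table and column names follow the question, and the sizes are shrunk to 6 IDs and 5 columns for illustration.

```python
import sqlite3

n_cols = 5  # the question has 1,000; kept small here
ids = [f"id{i}" for i in range(1, 7)]
# Made-up values: the row for id(r+1) holds r*10 + 0 .. r*10 + 4.
rows = [[float(r * 10 + c) for c in range(n_cols)] for r in range(len(ids))]

con = sqlite3.connect(":memory:")  # pass a file name instead to persist the data

# Generate "col1 REAL, col2 REAL, ..." instead of typing it by hand.
col_defs = ", ".join(f"col{i} REAL" for i in range(1, n_cols + 1))
con.execute(f"CREATE TABLE mymatrixdata (myid TEXT PRIMARY KEY, {col_defs})")

# One placeholder per value: the id plus n_cols measurements.
placeholders = ", ".join("?" for _ in range(n_cols + 1))
con.executemany(
    f"INSERT INTO mymatrixdata VALUES ({placeholders})",
    [[row_id, *values] for row_id, values in zip(ids, rows)],
)


def fetch_rows(con, wanted_ids):
    """Fetch only the rows whose id is in wanted_ids, ordered by id."""
    marks = ", ".join("?" for _ in wanted_ids)
    query = f"SELECT * FROM mymatrixdata WHERE myid IN ({marks}) ORDER BY myid"
    return con.execute(query, list(wanted_ids)).fetchall()


sub = fetch_rows(con, ["id1", "id3"])
```

The PRIMARY KEY on myid plays the role of the unique index in the R answer, so lookups by ID do not scan the whole table. With 1,001 values per row, older SQLite builds may hit their limit on bound parameters per statement, so that is worth checking before scaling this up.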