I have to reshape my dataset from wide to long. I have 500 variables whose names end in years from 2007 to 2016 and are recorded as abcd2016 and so on. I needed a procedure that would let me reshape without writing out all the variable names, so I ran:
unab vars : *2016
local stubs16 : subinstr local vars "2016" "", all
unab vars : *2015
local stubs15 : subinstr local vars "2015" "", all
and so on, then:
reshape long `stubs16' `stubs15' `stubs14' `stubs13' `stubs12' `stubs11' `stubs10' `stubs09' `stubs08' `stubs07', i(id) j(year)
but I get the error
invalid syntax
r(198);
Why? Can you help me to fix it?
The idea is to specify only the stubs when reshaping to long format. To that end, you need to remove the year part from the variable names and store the unique stubs in a local that you can pass to reshape:
/* (1) Fake Data */
clear
set obs 100
gen id = _n
foreach s in stub stump head {
    forvalues t = 2008(1)2018 {
        gen `s'`t' = rnormal()
    }
}
/* (2) Get a list of stubs and reshape */
/* Get a list of variables that contain 20, which is stored in r(varlist) */
ds *20*
/* remove the year part */
local prefixes = ustrregexra("`r(varlist)'","20[0-9][0-9]","")
/* remove duplicates from list */
local prefixes: list uniq prefixes
reshape long `prefixes', i(id) j(t)
This will store the numeric suffix in a variable called t.
I've made a small program that is supposed to read data in a range, feed it into an object-oriented program, and then return the full data set. The issue is that when I run the file, it only returns data for the third procedure.
I tried printing the other procedure sets, but I don't know how to do that. I'm thinking this will only work if I make the procedure variables specific instead of generic, i.e. instead of one set of procedure_name variables for all of them, separate variables for procedures 1, 2, and 3.
for i in range(3):
    procedure_name = ('Physical Exam')
    date_of = ("Nov 6th 2022")
    doctor = ('Dr. Irvine')
    charge = ('$ 250.00')
    procedure_name = ('X-ray')
    date_of = ("Nov 6th 2022")
    doctor = ('Dr. Jamison')
    charge = ('$ 500.00')
    procedure_name = ('Blood test')
    date_of = ("Nov 6th 2022")
    doctor = ('Dr. Smith')
    charge = ('$ 200.00')
    procedure = HW6_RODRIGUEZ_1.Procedure(procedure_name, date_of, doctor, charge)
    print(f'Procedure {i+1}')
    print(procedure)
    print(i, end=" ")

if __name__ == '__main__':
    main()
So, I think you may have misunderstood some things when it comes to variables, OOP and looping.
When you define a variable, that variable is set to the last value it is assigned. So if you have the following code:
a = 1
a = 2
a = 3
The final value of the variable 'a' will be 3, as that is the last value it is assigned.
As for loops, whatever you have written inside a for loop will be repeated a specified number of times. This means that if you want to write a loop that prints "hello" 5 times, you'd write the following:
for i in range(5):
    print("hello")
What your loop is essentially doing is overwriting the same three variables three times over, so by the time the Procedure object is created, only the last set of values remains; it never creates three separate objects.
When it comes to creating an object that you assign variables to, you first need to write the code for your class. Your class can have attributes like the variables you've stated. It could look something like this:
class procedure:
    def __init__(self, procedure_name, date_of, doctor, charge):
        self.procedure_name = procedure_name
        self.date_of = date_of
        self.doctor = doctor
        self.charge = charge
Now, to set up a procedure object, you just call procedure with the desired values as arguments and assign the result to a variable, like so:
new = procedure('X-ray','Nov 6th 2022','Dr. Jamison','$ 500.00')
And to access an attribute, you just write the object's name followed by .attribute. For example, using the object I just set up:
print(new.doctor)
Would output 'Dr. Jamison'.
If you want to store a bunch of them, I would recommend keeping them in a list or a dictionary, depending on how you want to look them up; see the sketch below.
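For example, here is a minimal sketch (using the small procedure class defined above, not your HW6_RODRIGUEZ_1 module, which I can't see) that builds all three objects in a list instead of overwriting the same variables, and then prints each one:
# assumes the procedure class defined above is available

# raw data for the three procedures, taken from your snippet
records = [
    ('Physical Exam', 'Nov 6th 2022', 'Dr. Irvine',  '$ 250.00'),
    ('X-ray',         'Nov 6th 2022', 'Dr. Jamison', '$ 500.00'),
    ('Blood test',    'Nov 6th 2022', 'Dr. Smith',   '$ 200.00'),
]

# build one object per record and keep them all in a list
procedures = [procedure(*rec) for rec in records]

# print every procedure, not just the last one
for i, p in enumerate(procedures, start=1):
    print(f'Procedure {i}')
    print(p.procedure_name, p.date_of, p.doctor, p.charge)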
I hope this helps! If you are new to programming, I would recommend starting with some simpler exercises, such as a program that prints the nursery rhyme '10 Green Bottles' using loops, or maybe a quiz.
Best of luck.
In Octave, I am reading very large text files from disk and parsing them. The function textread() does just what I want except for the way it is implemented. Looking at the source, textread.m pulls the entire text file into memory before attempting to parse lines. If the text file is large, it fills all my free RAM (16 GB) with text and then starts saving back to disk (virtual memory), before parsing. If I wait long enough, textread() will complete, but it takes almost forever.
Notice that after parsing into a matrix of floating point values, the same data fit into memory quite easily. So I'm using textread() in an intermediate zone, where there is enough memory for the floats, but not enough memory for the same data as text.
All of that is preparation for my question, which is about strread(). The data in my text files looks like this
0.0647148 -2.0072535 0.5644875 8.6954257
0.1294296 -8.4689583 0.6567095 144.3090450
0.1941444 -9.2658037 -1.0228742 173.8027785
0.2588593 -6.5483359 -1.5767574 90.7337329
0.3235741 -0.7646807 -0.5320896 1.7357120
... and so on. There are no header lines or comments in the file.
I wrote a function that reads the file line by line; notice the two ways I'm attempting to use strread() to parse a line of data.
function dest = readPowerSpectrumFile(filename, dest)
  % read enough lines to fill destination array
  [rows, cols] = size(dest);
  fid = fopen(filename, 'r');
  for line = 1 : rows
    lstr = fgetl(fid);
    % this line works, but is very brittle
    [dest(line, 1), dest(line, 2), dest(line, 3), dest(line, 4)] = strread(lstr, "%f %f %f %f");
    % This line doesn't work. Or anything similar I can think of.
    % dest(line, 1:4) = strread(lstr, "%f %f %f %f");
  endfor
  fclose(fid);
endfunction
Is there an elegant way of having strread return parsed values to an array? Otherwise I'll have to write a new function any time I change the number of columns.
Thanks
The format you describe is a matrix of floating-point values. In this case you can just use load:
d = load ("yourfile");
which is much faster than any of the other functions mentioned. You can have a look at the implementation it uses in libinterp/corefcn/ls-mat-ascii.cc: read_mat_ascii_data.
If you feed fprintf more values than are in its format specification, it will reapply the print statement until it's used them up:
>> fprintf("%d %d \n", 1:6)
1 2
3 4
5 6
It appears this also works with strread. If you specify only one value to read, but there are multiple on the current line, it will keep reading them and add them to a column vector. All we need to do is to assign those values to the correct row of dest:
function dest = readPowerSpectrumFile(filename, dest)
  % read enough lines to fill destination array
  [rows, cols] = size(dest);
  fid = fopen(filename, 'r');
  for line = 1 : rows
    lstr = fgetl(fid);
    % read all values from current line into column vector
    % and store values into row of dest
    dest(line,:) = strread(lstr, "%f");
    % this will also work since values are assumed to be numeric by default:
    % dest(line,:) = strread(lstr);
  endfor
  fclose(fid);
endfunction
Output:
readPowerSpectrumFile(filename, zeros(5,4))
ans =
6.4715e-02 -2.0073e+00 5.6449e-01 8.6954e+00
1.2943e-01 -8.4690e+00 6.5671e-01 1.4431e+02
1.9414e-01 -9.2658e+00 -1.0229e+00 1.7380e+02
2.5886e-01 -6.5483e+00 -1.5768e+00 9.0734e+01
3.2357e-01 -7.6468e-01 -5.3209e-01 1.7357e+00
I am trying to parse a CSV file that contains a date string (format "2018-03-30 09:30:05").
It should be turned into one-hot encoded features in the form of day / hour / minute / second.
One obvious way to do this is to use pandas and store the result in a separate file or an HDF store.
But in order to simplify the workflow (and leverage the GPU), I would like to do this directly in TensorFlow.
Assuming the date string is at position -2, I thought something like tf.int32(tf.substr(row[-2],0,4)) should work to get the year, but it returns TypeError: 'DType' object is not callable.
with tf.python_io.TFRecordWriter("train_sample_sorted.tfrecords") as tf_writer:
    i = 0
    for row in myArray:
        i += 1
        if i % 10000 == 0:
            print(row[-2])
        # timefeatures = int(row[-2][0:4])                  ## TypeError: Value must be iterable
        # timefeatures = tf.int32(tf.substr(row[-2], 0, 4)) ## TypeError: 'DType' object is not callable
        features, label = row[:-2], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["timefeatures"].float_list.value.extend(timefeatures)
        example.features.feature["label"].int64_list.value.append(label)
        tf_writer.write(example.SerializeToString())
What is the best practice for handling date strings as input features? Is there a way around pre-processing?
Thanks
The first version, int(row[-2][0:4]), fails for two reasons: indexing cannot be used to slice the characters of a string tensor, and even if it could, you cannot convert the result to an int like that.
The second version, tf.int32(tf.substr(row[-2], 0, 4)), is almost there: it does the substring extraction fine, but to convert strings to numbers you have to use tf.string_to_number; you cannot simply cast a string tensor to a number like that.
Without access to the data you use I couldn't test it, but this should work:
tf.string_to_number( tf.substr( row[ -2 ], 0, 4 ), out_type = tf.int32 )
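If you want all of the time features rather than just the year, the same two ops can be applied at fixed offsets. The following is only a sketch (TF 1.x graph-mode API, as in your snippet): date_str is a hypothetical stand-in for row[-2], and the character offsets assume the exact "YYYY-MM-DD hh:mm:ss" layout you showed.
import tensorflow as tf

date_str = tf.constant("2018-03-30 09:30:05")  # stand-in for row[-2]

def part_to_int(s, pos, length):
    # slice a fixed-width piece of the string tensor and convert it to int32
    return tf.string_to_number(tf.substr(s, pos, length), out_type=tf.int32)

year   = part_to_int(date_str, 0, 4)
month  = part_to_int(date_str, 5, 2)
day    = part_to_int(date_str, 8, 2)
hour   = part_to_int(date_str, 11, 2)
minute = part_to_int(date_str, 14, 2)
second = part_to_int(date_str, 17, 2)

with tf.Session() as sess:
    print(sess.run([year, month, day, hour, minute, second]))
    # expected: [2018, 3, 30, 9, 30, 5]
Keep in mind these are graph ops, so you only get concrete numbers after running them in a session; for writing TFRecords in a plain Python loop it may still be simpler to split the string in Python before building the Example.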
I want to write scalars that have some pre-generated values into a file. Below is a sample that closely resembles what I am trying to accomplish, but the scalars are not writing any output. I tried to dereference the scalars, as can be seen in the code, without success.
scalar Sc1b = 11
scalar Sc2b = 22
scalar Sc3b = 33
scalar Sc4b = 44
scalar Sc5b = 55
scalar Sc6b = 66
scalar Sc7b = 77
scalar Sc8b = 88
file open myfile using "C:/mytable.txt", write replace
forvalues i=1/8 {
    forvalues q=1/8 {
        display `i', `q', `Sc`i'b', ("`Sc`i'b'"), ("`Sc("`i'")b'")
        file write myfile ("`i'") _tab ("`q'") _tab `Sc`i'b' _tab ("`Sc`q'b'") _tab ("`Sc("`q'")b'") _n
    }
}
file close myfile
You don't need to dereference scalars here. They don't have temporary names; you assigned them permanent names, so there are no aliases to peel off. I am guessing that the multiple versions of code for writing the scalar were guesses at the correct code and that you only need each scalar once. I also removed the rather specific Windows reference for the sake of those on other platforms.
scalar Sc1b = 11
scalar Sc2b = 22
scalar Sc3b = 33
scalar Sc4b = 44
scalar Sc5b = 55
scalar Sc6b = 66
scalar Sc7b = 77
scalar Sc8b = 88
file open myfile using "mytable.txt", write replace
forvalues i=1/8 {
    forvalues q=1/8 {
        display `i', `q', Sc`i'b
        file write myfile ("`i'") _tab ("`q'") _tab (Sc`i'b) _n
    }
}
file close myfile
Note, however, that this code assumes that there are no variables with the same name or whose names abbreviate to the same name as your scalars. Scalars and variables share the same namespace. If necessary, disambiguate using scalar().
The error is in the function below. I'm trying to generate two measures of entropy (the latter removes all events with frequency < 5).
My error:
ERROR 1200: Cannot expand macro 'TOTUPLE'. Reason: Macro must be defined before expansion.
This is weird, because TOTUPLE is a built-in function, and other Pig scripts use TOTUPLE with no problems.
Code:
define dual_entropies (search, field) returns entropies {
    summary = summary_total($search, $field);
    entr1 = count_sum_entropy(summary, $field);
    summary = filter summary by events >= 5L;
    entr2 = count_sum_entropy(summary, $field);
    $entropies = TOTUPLE(entr1, entr2);
};
Note that entr1 and entr2 are both single numbers, not vectors of numbers; I suspect that's part of the issue.
I ran into similar confusion. I'm not sure whether it's true in general, but Pig only accepted TOTUPLE for me when it was part of a FOREACH operation. I worked around this with a GROUP ... ALL, which returns a bag with a single tuple in it, followed by a FOREACH ... GENERATE, such as:
B = group A ALL;
C = foreach B generate 'x', 2, TOTUPLE('a', 'b', 'c');
dump C;
...
(x,2,(a,b,c))
Perhaps this will help