Table aware parsing of a string field - sql

I have a table of videos with a field, filename, and some of these videos are split in multiple parts with the starting frame number of the video part appended to the end of the filename separated by a '_'.
I want to get the integer that represents the starting frame for each filename, for example:
movie.avi : frame=0
movie_500.avi: frame=500
For the two files above, I can get it with a regular expression on my table:
SELECT coalesce(substring(filename FROM '_(\d{2,7})\.avi$')::int, 0) FROM table;
However, how should I deal with the case where the movie's own filename ends in a number? Say I have the two files:
anothermovie_100.avi: frame = 100 (WRONG!)
anothermovie_100_500.avi: frame = 500
My select statement above gives the wrong starting frame for the first file. I want the query to work out, from the table itself, that anothermovie_100.avi has frame=0, because another filename in the same table starts with anothermovie_100 and ends in a further run of digits.
So basically for a table with the four above-mentioned rows, I would like my select statement to give me this:
movie.avi: frame=0
movie_500.avi: frame=500
anothermovie_100.avi: frame=0
anothermovie_100_500.avi: frame=500
So the query has to somehow know whether the filename (without its extension) is contained entirely in another filename of the same table. If it is, the query must return frame 0 and not the trailing digits of the filename converted to an integer.

I think the issue here is modeling the data - you should keep a reference to which movie each file belongs to.
Otherwise, your data may be ambiguous. Assume you have the files movie.avi and movie_500_500.avi. How would you tell (regardless of SQL syntax, just in plain English) whether movie_500.avi is in fact frame 500 of movie.avi or frame 0 of movie_500_500.avi?
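To make the heuristic from the question concrete, here is a minimal Python sketch (not SQL, and not a recommendation over fixing the data model). It assumes every name has a single extension and that a "part" suffix is an underscore followed by 2 to 7 digits, as in the question's regular expression. The function name starting_frames is made up for illustration.

```python
import re

# Assumption: a part suffix is "_" followed by 2-7 digits, as in the question's regex.
PART_SUFFIX = re.compile(r"_(\d{2,7})$")


def starting_frames(filenames):
    """Map each filename to its starting frame using the question's heuristic."""
    # Strip the extension once so the comparisons work on bare stems.
    stems = {name: name.rsplit(".", 1)[0] for name in filenames}
    frames = {}
    for name, stem in stems.items():
        match = PART_SUFFIX.search(stem)
        # If some other row is "<this stem>_<digits>", this row is treated as
        # the first part of its own movie, so its trailing digits are ignored.
        part_pattern = re.escape(stem) + r"_\d{2,7}"
        has_later_parts = any(
            re.fullmatch(part_pattern, other) for other in stems.values()
        )
        if match is None or has_later_parts:
            frames[name] = 0
        else:
            frames[name] = int(match.group(1))
    return frames


if __name__ == "__main__":
    table = [
        "movie.avi",
        "movie_500.avi",
        "anothermovie_100.avi",
        "anothermovie_100_500.avi",
    ]
    for name, frame in starting_frames(table).items():
        print(f"{name}: frame={frame}")
```

With the files movie.avi, movie_500.avi and movie_500_500.avi, this sketch silently decides that movie_500.avi is frame 0 of a movie called movie_500, which is exactly the ambiguity described above: the heuristic picks one reading, but nothing in the data confirms it.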

Related

Stata: How to use column value as file name in loop

I am working with 350 datasets. I want to automate naming the final datasets with values from the dataset.
For example, suppose ID is abc and year is 2010. The dataset has two columns holding those values, and I want to pull them out and use them in the file name, which in this case would be abc_2010.dta.
So basically I want to do
foreach file in `files' {
**calculation codes**
** construct the file name as three digit ID_year.dta **
}
I have already done the calculation part. I need some help with the naming of the files.
If I understand what you are trying to do, I believe you should be able to do this:
foreach file in `files' {
**calculation codes**
** construct the file name as three digit ID_year.dta **
local fname:di "`=id[1]'_`=year[1]'"
save `fname', replace
}
Note that this assumes that, after the calculations in the current iteration of the loop through files, the value of id in the first row holds the three-digit code and the value of year in the first row holds the year.
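For readers more familiar with Python than Stata, the same idea (take the ID and year from the first row and join them with an underscore) might look like the sketch below. The function output_name and the dictionary-per-row layout are assumptions made for illustration only, not part of the Stata answer.

```python
def output_name(rows, id_key="id", year_key="year"):
    """Build '<id>_<year>.dta' from the first row, mirroring id[1] and year[1]."""
    first = rows[0]  # like the Stata answer, this trusts the first row
    return f"{first[id_key]}_{first[year_key]}.dta"


if __name__ == "__main__":
    rows = [{"id": "abc", "year": 2010}, {"id": "abc", "year": 2010}]
    print(output_name(rows))
```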

Keeping table formatting in Sage with multiple tables

As the title suggests, I am trying to keep proper table formatting in Sage while displaying multiple tables (this is strictly a formatting question, so no knowledge of the math involved is necessary). Currently, I am using the following code:
my_table2 = table([column1, column2], frame = True)
my_table1 = table([in_the_cone, lengths_in_cone], frame = True)
result_table1 = my_table1.transpose()
result_table2 = my_table2.transpose()
result_table1
result_table2
With this, I receive no output for table1 and the following output for table2:
I want both tables to look this way, but having no output for the first table is no good. So I tried changing the bottom two lines to:
result_table1, result_table2
While this does display both tables, the formatting now looks like:
Is there a way I can display both tables at the same time with the first formatting?
It would have been nice for you to include a full minimal working example, but in any case it does depend a little on the output.
Basically, in a notebook or other "cell", only the last return value prints to the screen in some fashion (sometimes via a "hook" as in your case). But if you use the comma, that implicitly creates a "tuple" which is then printed as a tuple, so you lose that "hook" to display things with math modes (since a tuple doesn't have that).
In this case, the (newish) canonical way to achieve what you want is
pretty_print(result_table1)
pretty_print(result_table2)
though you may want to put print("\n") in between so they don't end up right on top of each other.
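The difference between showing one object and showing a tuple can be illustrated with a toy model in plain Python. This is not Sage's actual display machinery: the Table class, its pretty() method and the display() function are all hypothetical stand-ins, used only to show why wrapping objects in a tuple bypasses a per-object display hook.

```python
class Table:
    """Toy stand-in for a table object with a plain repr and a nicer rendering."""

    def __init__(self, rows):
        self.rows = rows

    def __repr__(self):
        return f"Table({self.rows!r})"

    def pretty(self):
        # Hypothetical "rich" rendering that a display hook could pick up.
        return "\n".join(" | ".join(str(cell) for cell in row) for row in self.rows)


def display(value):
    """Hypothetical display hook: use pretty() when the value offers it."""
    if hasattr(value, "pretty"):
        return value.pretty()
    # A tuple has no pretty(), so it falls back to repr(), which in turn
    # calls repr() on each element and never reaches their pretty().
    return repr(value)


if __name__ == "__main__":
    t1 = Table([[1, 2], [3, 4]])
    t2 = Table([[5, 6]])
    print(display(t1))        # rendered through the hook
    print(display((t1, t2)))  # rendered as a plain tuple repr
```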

Filtering rows in Pentaho

I have a dataset with columns containing numbers. However, some of the rows in that column have missing data. Instead of numbers, a dash (-) is placed in the cell.
What I want to happen is to separate those rows with a dash and output them to a separate excel file. Those without the dash, should output to a csv file.
I tried the "filter rows" but it gives me an error:
Unexpected conversion error while converting value [constant String] to a Number
constant String : couldn't convert String to number
constant String : couldn't convert String to number : non-numeric character found at position 1 for value [-]
My condition is if
Column1 CONTAINS - (String)
You can try to convert the field to a Number in the Select values step and handle the error: if a value cannot be converted to a number, that means it is a dash (-).
You can convert missing value indicators (like a dash or any other string) to null in Text-File-Input - see field option "Null if". That way you still can use the metadata detection feature and will not trip over a dash arriving in a Number field.
With CSV-File-Input you should stick to the String datatype until a Null-If step has cleansed the values, so you can change the datatype to Number in a Select-Values step.
If you must preserve the dash character, don't use metadata detection (as it suggests datatype Number) or use more rows to sample (so a field with a dash is encountered) or just revert the datatype to String again before saving and running the transformation.
My solution relies on a 'Replace in String' step placed first. I replaced the dash with something numeric that can easily be distinguished from the rest of the numbers (I used 9999) and carried on with the rest of my process.
In filter rows, I had no problems anymore with the data type because both my variables and condition contained numbers, therefore, it no longer had to convert anything.
After filter rows, I added the 'Null-if' to remove the arbitrary 9999 that I used just to have something to replace the dash.
After that, the separation was made just as I hoped it would be.
Thanks to @marabu for the Null-if idea.
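Outside Pentaho, the logic these answers describe (keep the field as a string, route rows whose value is the dash marker to one output, and convert the rest to numbers) can be sketched in Python as follows. The function split_rows and the dictionary-per-row layout are illustrative assumptions, and the sketch turns the dash into None directly instead of going through a 9999 placeholder.

```python
def split_rows(rows, column, missing="-"):
    """Split rows into (rows with the missing marker, rows with numeric values)."""
    with_missing, numeric = [], []
    for row in rows:
        value = row[column].strip()  # the field is still a string at this point
        if value == missing:
            # Equivalent of "Null if": the marker becomes a null value.
            with_missing.append({**row, column: None})
        else:
            # Only convert once we know the value is not the marker.
            numeric.append({**row, column: float(value)})
    return with_missing, numeric


if __name__ == "__main__":
    data = [
        {"id": "a", "Column1": "12"},
        {"id": "b", "Column1": "-"},
        {"id": "c", "Column1": "3.5"},
    ]
    dashes, numbers = split_rows(data, "Column1")
    print("to Excel:", dashes)
    print("to CSV:", numbers)
```

Comparing strings before any numeric conversion avoids the conversion error from the question, because the dash is never handed to the number parser.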

How can I get the pos of row dynamically ? QlikView

I want to delete the rest of a loaded csv file based on the occurrence of a string.
Remove(Row, RowCnd(Interval, Pos(Top, findMeThePositionOfaGivenString('TeddyBear')),
Pos(Bottom, 1), Select(1, 0))
Or just any approach to dynamically delete a range of rows!
If you're doing this during the data import stage then I would recommend
load yourstuff
from yourfile
where index(givenstring,'Teddybear')=0;
Index will return the position of the string in the larger string.
E.g. index('ABC','BC')=2, so index()=0 means the string does not exist in the searched text. Be careful with capitalisation, as the comparison is case-sensitive; use upper or lower on both sides to remove that kind of confusion.
I hope I understood your request.
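The behaviour of index described above (1-based position, 0 when the substring is absent, case-sensitive) can be mimicked in Python to show two possible readings of the question. The helper names index, drop_matching and keep_until are made up for this sketch: drop_matching corresponds to the where clause in the answer, while keep_until is an assumed alternative that discards everything from the first matching row onwards.

```python
from itertools import takewhile


def index(text, substring):
    """1-based position of substring in text, or 0 if it does not occur."""
    return text.find(substring) + 1  # str.find returns -1 when absent


def drop_matching(rows, needle):
    """Keep only rows that do not contain needle (case-insensitive)."""
    return [row for row in rows if index(row.lower(), needle.lower()) == 0]


def keep_until(rows, needle):
    """Keep rows up to, but not including, the first row containing needle."""
    return list(takewhile(lambda row: index(row.lower(), needle.lower()) == 0, rows))


if __name__ == "__main__":
    rows = ["apple", "TeddyBear", "pear", "teddybear again", "plum"]
    print(drop_matching(rows, "Teddybear"))
    print(keep_until(rows, "Teddybear"))
```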

Comparing NSString to NSTextView Range prior to Appending

Coding in Objective-C, I'm appending text to a NSTextView object named subCap in my code like so:
[[[_subCAP textStorage] mutableString] appendString:[NSString stringWithFormat:@"%@", subcapLine]];
subcapLine will have two timecode values such as: "01:00:00:00 01:00:01:00" separated by a single space, then a newline (\n) character, then a string like "ONC314_001_001" followed by two newline chars (\n\n).
The end result will create a list similar to:
01:00:00:00 01:00:01:00
ONC314_001_001
01:00:01:00 01:00:02:00
ONC314_001_002
01:00:02:00 01:00:03:00
ONC314_001_003
etc, etc, etc.
It's a sub caption file for placing text (the ONC314 lines) at appropriate times in a video file, as indicated by the timecodes.
However, I've determined that there is an odd set of circumstances where a timecode pair could be the same as the previous timecode pair, and if that happens, I want to skip appending that line.
So, my question is, given that the timecodes are always 11 chars apiece, separated by a space, can anybody think of a way I can easily grab the prior TC pair and compare it to my current pair in the subcapLine I'm preparing to append? The problem is the text of the sub caption could be random lengths. In my example they're the same, but that isn't always the case.
If I need to check prior to compiling my subcapLine, I can do that too, but I just thought it might be more slick to use a range of some sort to grab the prior pair of TCs from the last-written line in the NSTextView object and compare (again, using a range?) against the TCs in the line I'm about to append?
Thoughts and suggestions much appreciated.
Chris Conlee
When you add a timecode, store the length of the text view's string just before you append, so that you have the offset of the timecode you are about to add.
Then, before adding a new timecode, you can use the previously stored offset to extract the substring and do a string comparison to see whether the timecodes are identical.
This should allow you to always have an offset to the previous timecode regardless of the length of the subtitles.
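The offset bookkeeping suggested here can be sketched in Python (chosen only to keep the example short; the same steps map onto NSString length and substringWithRange: in Objective-C). The class CaptionBuffer is hypothetical, and the constant 23 assumes two 11-character timecodes separated by one space, as stated in the question.

```python
TC_PAIR_LEN = 23  # assumption: 11 chars + 1 space + 11 chars


class CaptionBuffer:
    """Accumulates caption entries and skips an entry that repeats the last timecode pair."""

    def __init__(self):
        self.text = ""
        self.last_offset = None  # where the most recently appended entry starts

    def append(self, subcap_line):
        """Append subcap_line unless its timecode pair equals the previous one."""
        new_pair = subcap_line[:TC_PAIR_LEN]
        if self.last_offset is not None:
            start = self.last_offset
            previous_pair = self.text[start:start + TC_PAIR_LEN]
            if previous_pair == new_pair:
                return False  # duplicate timecodes: skip this entry
        # Record the offset before appending, so caption length never matters.
        self.last_offset = len(self.text)
        self.text += subcap_line
        return True


if __name__ == "__main__":
    buf = CaptionBuffer()
    print(buf.append("01:00:00:00 01:00:01:00\nONC314_001_001\n\n"))
    print(buf.append("01:00:00:00 01:00:01:00\nONC314_001_001\n\n"))
    print(buf.append("01:00:01:00 01:00:02:00\nONC314_001_002\n\n"))
```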