gnuplot: Spurious data points in plots when using index

I'm trying to use gnuplot 4.6 patchlevel 6 to visualize some data from a file test.dat which looks like this:
#Pkg 1
type min max avg
small 1 10 5
medium 5 15 7
large 10 20 15
#Pkg 2
small 3 9 5
medium 5 13 6
large 11 17 13
(Note that the values are actually separated by tabs, even though they show as spaces here.)
My gnuplot commands are:
reset
set datafile separator "\t"
plot 'test.dat' index 0 using 2:xticlabels(1) title col, '' using 3 title col, '' using 4 title col
This works fine as long as there is only a single data block in test.dat. When I add the second block, spurious data points appear. Why is that, and how can it be fixed?
Just for the record: running stats on the file yields only the expected results. It reports two data blocks for the full file, and correct values (for min, max, and sum) when I specify one of the two using index.

As mentioned in the comment to the question, one has to explicitly repeat the index 0 specification in all parts of the plot command:
plot 'test.dat' index 0 using 2, '' index 0 using 3, ...
Otherwise '' refers to all blocks in the data file.
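Applied to the command from the question, the fixed version reads (the same plot, just with index 0 repeated for every part):
plot 'test.dat' index 0 using 2:xticlabels(1) title col, \
     '' index 0 using 3 title col, \
     '' index 0 using 4 title col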


Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting (in this case) in a list of DataFrames of the form:
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks is important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround, adapted to the above example:
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 > 10 using 'loc[]'
subset = df.loc[df['col2'] > 10]
block_start = 0
# loop through all items in the subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i] - subset.index[i-1] > 1:
        # ... the current block ends here
        next_block_start = i
        # extract the corresponding block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        # the next_block_start index is now the new block's starting index
        block_start = next_block_start
# close and add the last block
blocks.append(subset[block_start:])
Edit: I previously referred by mistake to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum will generate a step function, which we set to zero (via the mask column) wherever this is not a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames:
import numpy as np

grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
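Putting it all together on the example data, here is a minimal runnable sketch (the groupby at the end is just one way to materialize the blocks, equivalent to the dict loop above):
import pandas as pd

df = pd.DataFrame({'col1': [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
                   'col2': [11, 15, 1, 2, 2, 16, 17, 13, 4, 3]})

# label each coherent run of col2 > 10 with its own group number; 0 = not in any block
s = (df['col2'] > 10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s

# collect the blocks, skipping group 0 (rows that failed the condition)
blocks = [g for key, g in df.groupby(s) if key != 0]
for b in blocks:
    print(b, end='\n\n')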

Unable to identify strange whitespace character in MSSQL table

We have a process that reads an XML file into our database and inserts any rows that aren't already present in another table into that table.
This process also has a trigger to write to an audit table and a nightly snapshot is also held in another table.
In the XML holding table a field looks like 1234567890123456 but it exists on our live table as 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6. Those spaces will not be removed by any combination of REPLACE functions. We have tried all CHAR values and it does not recognise the character. The audit table and nightly snapshot, however, contain the correct values.
Similarly, if we run a comparison such as SELECT CASE WHEN '1234567890123456' = '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 ' THEN 1 ELSE 0 END, it returns 1, so they match. However, LEN('1234567890123456') is 16 and LEN('1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 ') is 32.
We have run some queries to loop through the characters in the field and output the ASCII and Unicode values for each character. The digits return the correct ASCII/Unicode values, but this stray whitespace character does not return a value.
An example of the incorrectly displayed one is 0x35000000320000003800000036000000380000003300000039000000370000003800000037000000330000003000000035000000340000003000000033000000 and a correct one is 0x3500320038003600380033003200300030003000360033003600380036003000. Both were added by the same means on the same day. One has the extra bytes, the other is fine.
How can we identify this character and get rid of it? Is there a reason this would have been inserted originally? How can we avoid this in future?
Data entry
It looks like some null (i.e. Char(0)) characters have got into the data.
If the data was supposed to be ASCII when it was entered, but UTF-16 data was sent instead, then it could be:
Entered character codes: 48 00
Sent to database: 48 00 00 00
To avoid that, remove disallowed characters as the first step in processing the input, say by using a regex to replace [\x00-\x1F] with an empty string.
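For example, if the feed is pre-processed in Python before the insert (an assumption; the same idea works in any language with regex support):
import re

def strip_control_chars(value: str) -> str:
    # remove ASCII control characters, including NUL, before the value reaches the database
    return re.sub(r'[\x00-\x1F]', '', value)

print(strip_control_chars('5\x002\x008'))  # prints '528'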
Data repair
Search for entries which have a Char(0) in them to confirm that they can be found that way.
If so, replace the Char(0) with an empty string.
If that doesn't work, you could convert the data to the format '0x35000000320000003800000036000000380000003300000039000000370000003800000037000000330000003000000035000000340000003000000033000000', replace '000000' with '00', and then convert back.
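A sketch of both repair steps in T-SQL; LiveTable and CardNumber are hypothetical placeholder names, and you should test on a copy first:
-- confirm the bad rows can be found by searching for an embedded NUL
SELECT *
FROM LiveTable
WHERE CardNumber LIKE '%' + NCHAR(0) + '%';

-- first attempt: strip the NULs directly
UPDATE LiveTable
SET CardNumber = REPLACE(CardNumber, NCHAR(0), N'');

-- fallback: round-trip through the hex representation and collapse the extra bytes;
-- this assumes the good data never legitimately contains three consecutive zero bytes,
-- which holds for digit-only values like the ones above
UPDATE LiveTable
SET CardNumber = CONVERT(NVARCHAR(4000),
        CONVERT(VARBINARY(MAX),
            REPLACE(CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), CardNumber), 1),
                    '000000', '00'),
            1))
WHERE CardNumber LIKE '%' + NCHAR(0) + '%';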

Processing loading table data

I have a text file "celldata.txt" containing a very simple table of data.
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
The problem comes when I try to access the data at a certain row and column.
My approach has been to load the file using loadTable:
Table table;
int numCols;
int numRows;

void setup() {
  size(200, 200);
  table = loadTable("celldata.txt", "tsv");
  numRows = table.getRowCount();
  numCols = table.getColumnCount();
}

void draw() {
  background(255);
  fill(0);
  text(numRows + " " + numCols, 100, 100); // check the number of rows and columns
  println(table.getFloat(0, 0));
}
Question 1: When I do this, it says the number of rows is 5 and the number of columns is just 1. Why is it not 5 x 4?
Question 2: Why is table.getFloat(0,0) "NaN" instead of the first element of the data?
I want to use a much bigger matrix later and access certain elements (of type double) with something like getFloat(i,j) and be able to loop through all elements.
Using the same example data as above, can someone please help me understand what is wrong with my code and how to access the text file's data? Should I be using another method than loadTable?
You've told Processing that the file contains tab separated values (by using the "tsv" option), but your file contains space separated values.
Since your file does not contain any tabs, it reads each entire row as a single value. So the 0,0 position of your table is 1 2 3 4, which isn't a number, hence the NaN. This is also why it thinks your table only has one column.
You should modify your celldata.txt file to actually be separated by tabs instead of spaces (the tabs below may display as spaces):
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
You could also separate them by commas and then use the "csv" option.
If you're still having trouble, you can see what Processing is reading in by adding saveTable(table, "data/new.csv"); to the end of your setup() function and then looking at that file. It will be a list of values separated by commas, so you can see exactly where Processing thinks the cells of the table are.
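Once the file parses correctly, looping over every cell is straightforward; here is a small sketch using the same Table API:
// iterate over all cells and read each one as a float
for (int r = 0; r < table.getRowCount(); r++) {
  for (int c = 0; c < table.getColumnCount(); c++) {
    float v = table.getFloat(r, c);
    println(r + ", " + c + " = " + v);
  }
}
Note that Processing's Table stores floats rather than doubles; if you really need double precision, you would have to parse the cell strings yourself, e.g. with Double.parseDouble(table.getString(r, c)).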

How to plot a linegraph in SPSS with respect to the data?

Hi all,
Above you'll see a line graph plotted with SPSS. I want to improve this line graph according to its data, since some elements are not presented correctly:
(1) I deliberately adjusted the scaling of the Y-axis from -1 to 10 in order to make the breaks (i.e. missing values) in the line graph visible. Otherwise you would not notice the breaks, as they would overlap with the bottom line of the graph. Is it possible to keep the breaks visible, but with a scaling of 0 to 10 (in SPSS)? > SOLVED
(2) On the X-axis, points 14 and 15 are missing, hence the break. However, the line graph shows an upward trend just after point 13 and a downward trend just before point 16. Is it possible to adjust the line graph (in SPSS) to remove these (interpolation) trends?
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Time_Period_Hours
MEAN(MT)[name="MEAN_MT"] MISSING=VARIABLEWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Time_Period_Hours=col(source(s), name("Time_Period_Hours"), unit.category())
DATA: MEAN_MT=col(source(s), name("MEAN_MT"))
GUIDE: axis(dim(2), delta(1))
SCALE: linear(dim(2), min(-0.5), max(9))
ELEMENT: line(position(Time_Period_Hours*MEAN_MT))
ELEMENT: point(position(Time_Period_Hours*MEAN_MT), color(color.black),
size(size."3px"))
END GPL.
Here is an example; for the line element you need to specify the option missing.gap(). I thought just deleting missing.wings() from the default code would work, but maybe it is an internal default. You may want to consider changing Time_Period_Hours to a scale variable and doing the aggregation outside of GGRAPH. Also, making the Y-axis scale in your example go all the way up to 9 seems a bit superfluous.
DATA LIST FREE / Time_Period_Hours MT.
BEGIN DATA
1 1
2 0
3 0
4 0
5 1
6 0
7 0
8 0
9 0
10 0
11 .
12 0
13 0
14 .
15 .
16 1
17 0
18 0
19 0
20 .
21 0
END DATA.
FORMATS Time_Period_Hours MT (F2.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Time_Period_Hours
MEAN(MT)[name="MEAN_MT"] MISSING=VARIABLEWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Time_Period_Hours=col(source(s), name("Time_Period_Hours"), unit.category())
DATA: MEAN_MT=col(source(s), name("MEAN_MT"))
GUIDE: axis(dim(2), delta(1))
SCALE: linear(dim(2), min(-0.5), max(9))
ELEMENT: line(position(Time_Period_Hours*MEAN_MT), missing.gap())
ELEMENT: point(position(Time_Period_Hours*MEAN_MT), color(color.black),
size(size."3px"))
END GPL.

Header and repeating time information removal from a GPS TEC rinex file

I have a RINEX file, shown here (an image showing the first part of the file):
http://imageshack.us/photo/my-images/593/65961409.jpg
The data (an AOPR RINEX file) is downloaded from this site after entering a year and a day:
http://www.naic.edu/aisr/GPSTEC/gpstec.html
I want to open this file as a matrix in MATLAB for further processing. The header ends at line 42, the time information is on line 43, and then the data starts. But the time information appears again after some rows (say at line 64) and should be discarded, as should the header. Also, the last column wraps onto a second line below the first column; it should be moved back to the last column. In total there are 55700 rows. Kindly help me with this.
I suspect the last column appearing on the line below it is just an artifact of how large the window of your text reader is...
For the rest, I think a trial-and-error loop is in order here:
fid = fopen('test.txt','r');
C = {};
while ~feof(fid)
    % read lines with the dictated format
    D = textscan(fid, '%d %d %d %d');
    % this will fail on header lines, empty lines, etc.
    if isempty(D{1})
        % in those cases, advance the file pointer by one line
        fgetl(fid);
    else
        % if that's not the case, save the lines thus read
        C = [C; D]; %#ok
    end
end
fclose(fid);
% post-process: concatenate all sub-arrays into one
C = arrayfun(@(ii) cat(1, C{:,ii}), 1:size(C,2), 'UniformOutput', false);
This works, at least with my test.txt:
header
random
garbage
1 2 3 4
4 5 6 7
4 6 7 8
more random garbage
2 5 6 7
5 6 7 8
8 6 3 7
> I suspect the last column appearing on the line below it is just an artifact of how large the window of your text reader is...
> For the rest, I think a trial-and-error loop is in order here
Dear Rody, I don't have any MATLAB background and am just a beginner. It is actually a RINEX file, with 2780 epochs and 6 observables for 30 satellites. Decoding it in MATLAB is tough; that is the problem. You can read a sample code at
http://web.ics.purdue.edu/~tdauterm/EAS591/Lab7/read_rinexo.m
But the problem is that there are six observables, while the m-file reads only 5, and not in the correct order. I need C1 P2 L1 L2 S1 S2... but the code at the link gives L1 L2 C1 P1 P2. :( Can you just correct that? It would be a great help.