I'm trying to format the output from proc report in plain text. I have a variable by which I group observations and which spans two lines. This causes an apparent line break and separates the grouped observations. Is there any neat way to fix this?
The following minimal example illustrates the problem.
EDIT: The program should preferably also work for only one observation in some subjects.
Program:
* Toy data ;
data mydata;
length subj $ 20;
input subj $ val val2;
datalines;
ID001|M 7.1 5.2
ID001|M 7.1 4.9
ID001|M 7.1 5.3
ID001|M 7.1 5.6
ID001|M 7.1 5.7
ID020|F 7.1 3.2
ID020|F 7.3 2.9
ID020|F 7.2 0.9
ID300|M 7.2 1.2
ID300|M 7.2 1.8
;
run;
* Create report ;
ods listing;
proc report data=mydata headline headskip split='|';
column(subj val val2);
define subj / order flow 'Subject ID|Sex';
define val / 'Value 1';
define val2 / 'Value 2';
break after subj / skip;
run;
ods _all_ close;
Output:
Subject ID
Sex                     Value 1    Value 2
------------------------------------------
ID001                       7.1        5.2
M
                            7.1        4.9
                            7.1        5.3
                            7.1        5.6
                            7.1        5.7
ID020                       7.1        3.2
F
                            7.3        2.9
                            7.2        0.9
ID300                       7.2        1.2
M
                            7.2        1.8
Desired output:
Subject ID
Sex                     Value 1    Value 2
------------------------------------------
ID001                       7.1        5.2
M                           7.1        4.9
                            7.1        5.3
                            7.1        5.6
                            7.1        5.7
ID020                       7.1        3.2
F                           7.3        2.9
                            7.2        0.9
ID300                       7.2        1.2
M                           7.2        1.8
Alternative desired output:
Subject ID
Sex                     Value 1    Value 2
------------------------------------------
ID001
M                           7.1        5.2
                            7.1        4.9
                            7.1        5.3
                            7.1        5.6
                            7.1        5.7
ID020
F                           7.1        3.2
                            7.3        2.9
                            7.2        0.9
ID300
M                           7.2        1.2
                            7.2        1.8
Or something similar to these that visually separates the groups clearly.
I indented gender, but you can remove that. Make sure you have 2 or more obs per subject.
data mydata;
length subj $ 20;
input subj $ val val2;
length sex $3;
sex = ' '||scan(subj,-1);
subj = scan(subj,1);
datalines;
ID001|M 7.1 5.2
ID001|M 7.1 4.9
ID001|M 7.1 5.3
ID001|M 7.1 5.6
ID001|M 7.1 5.7
ID020|F 7.1 3.2
ID020|F 7.3 2.9
ID020|F 7.2 0.9
ID300|M 7.2 1.2
ID300|M 7.2 1.8
;;;;
run;
proc print;
run;
* Create report ;
*ods listing;
proc report data=mydata headline headskip split='|' list
/* showall*/
;
column(subj sex stub val val2);
define subj / order noprint;
define sex / order noprint;
define stub / computed width=10 'Subject' ' Gender';
define val / 'Value 1';
define val2 / 'Value 2';
break after subj / skip;
compute before subj;
xsubj = subj;
endcomp;
compute before sex;
j = 0;
xsex = sex;
endcomp;
compute stub / char length=20;
j + 1;
if j eq 1 then stub = xsubj;
else if j eq 2 then stub = xsex;
else stub = ' ';
endcomp;
run;
Just make sure all subjects have two or more obs by adding one if needed.
data addonemaybe / view=addonemaybe;
set mydata;
by subj;
output;
if first.subj and last.subj then do;
call missing(of val:);
output;
end;
run;
Here is the structure of my dataframe
plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3  25.0       4.0         25.0       7.0
I would like to add ADO_incr_y - ADO_incr_x lines, which means in this case the result would be :
plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3  25.0       4.0         25.0       5.0
3ABP3  25.0       5.0         25.0       6.0
3ABP3  25.0       6.0         25.0       7.0
Is there a pandas/Pythonic way to do that?
I was thinking of something like:
reps = [ val2-val1 for val2, val1 in zip(df_insert["ADO_incr_y"],df_insert["ADO_incr_x"]) ]
df_insert.loc[np.repeat(df_insert.index.values, reps)]
But I don't get the incremental progression:
4 -> 5, 5 -> 6, 6 -> 7
How can I get the index inside the list comprehension?
You can repeat the data, then modify with groupby.cumcount():
repeats = df['ADO_incr_y'].sub(df['ADO_incr_x']).astype(int)
out = df.reindex(df.index.repeat(repeats))
out['ADO_incr_x'] += out.groupby(level=0).cumcount()
out['ADO_incr_y'] = out['ADO_incr_x'] + 1
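For reference, here is a self-contained sketch of the repeat-then-cumcount idea, run on the single sample row from the question. The final reset_index call is an addition for tidier output and is not part of the answer above.

```python
import pandas as pd

# Sample row from the question.
df = pd.DataFrame({
    "plan": ["3ABP3"],
    "ADO_ver_x": [25.0],
    "ADO_incr_x": [4.0],
    "ADO_ver_y": [25.0],
    "ADO_incr_y": [7.0],
})

# Number of copies of each row: the gap between the two increments.
repeats = df["ADO_incr_y"].sub(df["ADO_incr_x"]).astype(int)

# Repeat each row; the copies keep the index label of their source row.
out = df.reindex(df.index.repeat(repeats))

# cumcount() numbers the copies 0, 1, 2, ... within each source row.
step = out.groupby(level=0).cumcount()
out["ADO_incr_x"] = out["ADO_incr_x"] + step
out["ADO_incr_y"] = out["ADO_incr_x"] + 1

# Added for readability only: replace the repeated labels with 0..n-1.
out = out.reset_index(drop=True)
print(out)
```

Because the copies share the index label of the row they came from, groupby(level=0) treats each original row as its own group, which is what lets cumcount() restart at 0 for every plan.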
Pretend I have this table on a server:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
iris$id = 1:nrow(iris)
dbWriteTable(con, "iris", iris)
I want to select some random rows from this dataset. Suppose I create an R variable that contains the random rows that I want to select:
rows_to_select = sample.int(10, 5, replace = TRUE)
[1] 1 1 8 8 7
I then tried to select these rows from my table - but this "rows_to_select" variable is not being recognized for some reason:
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (rows_to_select) limit 100;")
Error: no such column: rows_to_select
This code works fine if I manually specify which rows I want (e.g. I want the first row, and the fifth row selected twice):
#works - but does not return the 5th row twice
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (1,5,5) limit 100;")
Does anyone know how to fix this?
Thank you!
In general, a query string that merely mentions rows_to_select is not going to reach out of the SQLite environment, "invade" the (completely different) R environment, and look for a variable there. (For that matter, why doesn't select a.* ... find dplyr::select?) This is the case for both pragmatic and security reasons (though mostly pragmatic).
You may want to consider parameterized queries instead of constructing query strings manually. Besides guarding against malicious SQL injection (e.g., XKCD's Exploits of a Mom aka "Little Bobby Tables"), they also protect against malformed strings or Unicode-vs-ANSI mistakes, even if it's only one data analyst running the query. DBI supports parameterized queries.
Long story short, try this:
set.seed(42)
rows_to_select = sample.int(10, 5, replace = TRUE)
rows_to_select
# [1] 1 5 1 9 10
qmarks <- paste(rep("?", length(rows_to_select)), collapse = ",")
qmarks
# [1] "?,?,?,?,?"
DBI::dbGetQuery(con, paste(
"select a.*
from (select *, row_number() over (order by id) as rnum from iris) a
where a.rnum in (", qmarks, ") limit 100;"),
params = as.list(rows_to_select))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id rnum
# 1 5.1 3.5 1.4 0.2 setosa 1 1
# 2 5.0 3.6 1.4 0.2 setosa 5 5
# 3 4.4 2.9 1.4 0.2 setosa 9 9
# 4 4.9 3.1 1.5 0.1 setosa 10 10
In this case it is rather trivial, but if you have a more complicated query where you use question marks ("bindings") at different places in the query, the order must align perfectly with the elements of the list assigned to the params= argument of dbGetQuery.
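The same ordering rule applies to positional placeholders in other database interfaces. Here is a minimal sketch using Python's built-in sqlite3 module rather than DBI. The cut-down iris table (only id and species columns) and the extra species filter are made up for illustration and are not part of the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical stand-in for the iris table: ten rows, two columns.
con.execute("create table iris (id integer, species text)")
con.executemany(
    "insert into iris values (?, ?)",
    [(i, f"row{i}") for i in range(1, 11)],
)

rows_to_select = [1, 5, 1, 9, 10]

# One "?" per value; only placeholders are pasted into the SQL, never the values.
qmarks = ",".join("?" * len(rows_to_select))
sql = (
    f"select id, species from iris "
    f"where id in ({qmarks}) and species <> ? order by id"
)

# The binding for "species <> ?" appears after the IN list in the SQL text,
# so its value must also come last in the parameter sequence.
params = rows_to_select + ["row9"]
result = con.execute(sql, params).fetchall()
print(result)
```

Swapping the order in params would bind "row9" into the IN list and an id into the species comparison, which is exactly the kind of silent mismatch the paragraph above warns about.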
Alternative: insert a temp table with your candidate values, then left-join against it.
dbWriteTable(con, "mytemp", data.frame(rnum = rows_to_select), temporary = TRUE)
DBI::dbGetQuery(con,
"select i.* from mytemp m left join iris i on i.id=m.rnum")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
# 1 5.1 3.5 1.4 0.2 setosa 1
# 2 5.0 3.6 1.4 0.2 setosa 5
# 3 5.1 3.5 1.4 0.2 setosa 1
# 4 4.4 2.9 1.4 0.2 setosa 9
# 5 4.9 3.1 1.5 0.1 setosa 10
DBI::dbExecute(con, "drop table mytemp")
# [1] 0
Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1]+0.1,1))
    x1.append(round(x0[-1]+1,1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)
x0 and x1 represent the starting and ending positions of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If the difference is larger than 0.1, the character starts a new string. I need to produce a dataframe of strings and their respective x0 and x1, which I currently do by looping through the dataframe using .iterrows():
string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0']-x1[-1],1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there any other faster way to achieve this?
You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)
Output
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a True or False column, where every time a True appears it means a new group needs to be created. The cumsum will take care of assigning the same id to each group.
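To see the boolean-then-cumsum step in isolation, here is a tiny sketch on a made-up column of gaps (the values are illustrative, not taken from the dataset above):

```python
import pandas as pd

# Hypothetical gaps between each character and the one before it.
gap = pd.Series([0.0, 0.1, 0.1, 1.2, 0.1, 1.2])

# True marks a character that starts a new string; the small tolerance
# keeps floating-point noise around 0.1 from triggering a false start.
starts_new = gap > 0.1 + 1e-5

# Each True bumps the running total, so rows share an id until the next True.
group_id = starts_new.cumsum()
print(group_id.tolist())  # [0, 0, 0, 1, 1, 2]
```

The resulting ids can be passed straight to groupby, as in the answer.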
As suggested by @ShubhamSharma, you could do:
# flag rows whose gap to the previous character exceeds 0.1 (rounding avoids floating-point problems)
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column
grouper = same.cumsum()
The other part remains the same.
I'm trying to translate Python problems into SQL code.
I would like to assign titles in a new column according to the grade.
For example:
A for the top 0.9% of the column
B for next 15% of the column
C for next 25% of the column
D for next 30% of the column
E for next 13% of the column
F for rest of the column
There is this column:
Grades
2.3
3
2
3.3
3.5
3.6
3.2
2.1
2.3
3.7
3.3
3.1
4.4
4.3
1.4
4.5
3.5
I don't know how sqlite can work with this since it doesn't have a function like quantile that languages like R have.
Something I tried, which is not even close, is this:
SELECT x
FROM MyTable
ORDER BY x
LIMIT 1
OFFSET (SELECT COUNT(*)
FROM MyTable) / 2
to get at half of the column.
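One possible direction, offered only as a sketch: SQLite 3.25 and later has window functions, and percent_rank() gives each row's relative position in the sorted column. The example below drives SQLite from Python's sqlite3 module. The table name MyTable is borrowed from the attempt above, the Grades column and the Title alias are assumptions, and treating the bands as cumulative cut-offs on percent_rank() (with strict upper bounds) is one interpretation of the requirement among several.

```python
import sqlite3

grades = [2.3, 3, 2, 3.3, 3.5, 3.6, 3.2, 2.1, 2.3, 3.7,
          3.3, 3.1, 4.4, 4.3, 1.4, 4.5, 3.5]

con = sqlite3.connect(":memory:")
con.execute("create table MyTable (Grades real)")
con.executemany("insert into MyTable values (?)", [(g,) for g in grades])

# percent_rank() is 0 for the highest grade and 1 for the lowest, so the
# bands become cumulative cut-offs: 0.9%, 15.9%, 40.9%, 70.9%, 83.9%.
sql = """
select Grades,
       case
         when pr < 0.009 then 'A'
         when pr < 0.159 then 'B'
         when pr < 0.409 then 'C'
         when pr < 0.709 then 'D'
         when pr < 0.839 then 'E'
         else 'F'
       end as Title
from (select Grades,
             percent_rank() over (order by Grades desc) as pr
      from MyTable)
order by Grades desc
"""
rows = con.execute(sql).fetchall()
for grade, title in rows:
    print(grade, title)
```

With only 17 rows, the 0.9% band can hold nothing but the single top grade; cume_dist() could be swapped in for percent_rank() if ties should be handled differently.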
I have a dataset and I am trying to pass the contents of a specific column into the SQL where statement.
For example, assuming iris is my dataset
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
I want to pass the contents of the Species column {setosa, setosa, setosa, ..., setosa} into the WHERE clause of my SQL query:
sqlQuery(abcd, paste("Select * from TestTableName1
where DESCRIPTION
IN (values of Species column from r dataframe)"))
I need help here.
Your question is really about string manipulation (it's incidental that your string will eventually be passed to sqlQuery), and the answer is that you paste it together, or use sprintf if you're feeling fancy:
vals = paste(paste0('"', levels(iris$Species), '"'), collapse = ", ")
vals
## [1] "\"setosa\", \"versicolor\", \"virginica\""
vals.paren = paste0("(", vals, ")")
qry = paste("select * from table where description in ", vals.paren)
qry
## [1] "select * from table where description in  (\"setosa\", \"versicolor\", \"virginica\")"
# sprintf makes the bounding parentheses cleaner
qry = sprintf("select * from table where description in (%s)", vals)
qry
## [1] "select * from table where description in (\"setosa\", \"versicolor\", \"virginica\")"
Prefacing any function call with fn$ from the gsubfn package enables string interpolation on its arguments. See ?fn for more info. This is often used with sqldf in the sqldf package, but it can be used with any function, as we show here. In particular, inserting $variable into a string argument of the function call substitutes the value of that variable into that string:
library(gsubfn)
lvls <- toString(shQuote(levels(iris$Species)))
fn$sqlQuery(abcd, "select * from TestTableName1 where DESCRIPTION in ($lvls)")
or if we want to examine the string first:
sql <- fn$identity("select * from TestTableName1 where DESCRIPTION in ($lvls)")
cat(sql, "\n") # look at sql string
sqlQuery(abcd, sql)
The output from the cat statement is:
select * from TestTableName1 where DESCRIPTION in ("setosa", "versicolor", "virginica")