I have the following .txt file:
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
....
I am trying to input it into my SAS program, so that it would look something like this:
Mark Country Type Count Price
1 Mark1 Country1 type1 1 1.50
2 Mark1 Country1 type2 5 21.00
3 Mark1 Country1 type3 NA NA
4 Mark2 Country2 type1 2 197.50
5 Mark2 Country2 type2 2 201.00
6 Mark2 Country2 type3 1 312.50
Or maybe something else, but I need it to be possible to print a two-way report:
Country1 Country2
Type1 ... ...
Type2 ... ...
Type3 ... ...
But the question is how to read that kind of txt file:
read and separate Mark1[Country1] into two columns, Mark and Country;
retain Mark and Country, read the info for each Type (+somehow ignoring type1=, maybe using formats), and input it into a table.
Maybe there is a way to use some kind of input template to achieve that, or nested queries.
You have 3 name/value pairs, but the pairs are split between two rows: an unusual text file requiring creative input. The INPUT statement has a line-control feature, #, for reading relative future rows within the implicit DATA step loop.
Example (Proc REPORT)
Read the mark and country from the current row (relative row #1), the counts from relative row #2 using #2, and the prices from relative row #3. After the name/value inputs are made for a given mark and country, perform an array-based pivot, transposing two variables (count and price) at a time into a categorical (type) data form.
Proc REPORT produces a 'two-way' listing. The listing is actually a summary report (cells under count and price are a default SUM aggregate), but each cell has only one contributing value so the SUM is the original individual value.
data have(keep=Mark Country Type Count Price);
  attrib mark country length=$10;
  infile cards delimiter='[ ]' missover;
  input mark country;
  input #2 @'type1=' count_1 @'type2=' count_2 @'type3=' count_3;
  input #3 @'type1=' price_1 @'type2=' price_2 @'type3=' price_3;
  array counts count_:;
  array prices price_:;
  do _i_ = 1 to dim(counts);
    Type = cats('type',_i_);
    Count = counts(_i_);
    Price = prices(_i_);
    output;
  end;
datalines;
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
;
ods html file='twoway.html';
proc report data=have;
  column type country,(count price);
  define type / group;
  define country / ' ' across;
run;
ods html close;
Combined aggregation
proc means nway data=have noprint;
  class type country;
  var count price;
  output out=stats max(price)=price_max sum(count)=count_sum;
run;

data cells;
  set stats;
  if not missing(price_max) then
    cell = cats(price_max,'(',count_sum,')');
run;

proc transpose data=cells out=twoway(drop=_name_);
  by type;
  id country;
  var cell;
run;

proc print noobs data=twoway;
run;
You can specify the name of a variable with the DLM= option on the INFILE statement. That way you can change the delimiter depending on the type of line being read.
It looks like you have three lines per group. The first one has the MARK and COUNTRY values, the second has a list of COUNT values, and the third has a list of PRICE values. So something like this should work.
data want ;
  length dlm $2 ;
  length Mark $8 Country $20 rectype $8 recno 8 type $10 value1 8 value2 $8 ;
  infile cards dlm=dlm truncover ;
  dlm='[]';
  input mark country ;
  dlm='= ';
  do rectype='Count','Price';
    do recno=1 by 1 until(type=' ');
      input type value1 @;
      if rectype='Price' then input value2 @;
      if type ne ' ' then output;
    end;
    input;
  end;
cards;
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
;
Results:
Obs Mark Country rectype recno type value1 value2
1 Mark1 Country1 Count 1 type1 1.0
2 Mark1 Country1 Count 2 type2 5.0
3 Mark1 Country1 Price 1 type1 1.5 EUR
4 Mark1 Country1 Price 2 type2 21.0 EUR
5 Mark2 Country2 Count 1 type1 2.0
6 Mark2 Country2 Count 2 type2 1.0
7 Mark2 Country2 Count 3 type3 1.0
8 Mark2 Country2 Price 1 type1 197.5 EUR
9 Mark2 Country2 Price 2 type2 201.0 EUR
10 Mark2 Country2 Price 3 type3 312.5 EUR
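For readers outside SAS, the same three-lines-per-group parsing idea can be sketched in Python. This is not part of either answer above; the regexes and the long-format tuple layout are my own assumptions:

```python
import re

# The example file, inlined for a self-contained sketch.
text = """Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR"""

rows = []
lines = text.splitlines()
# Walk the file in groups of three lines: header, counts, prices.
for i in range(0, len(lines), 3):
    mark, country = re.match(r"(\w+)\[(\w+)\]", lines[i]).groups()
    counts = dict(re.findall(r"(type\d+)=\s*([\d.]+)", lines[i + 1]))
    prices = dict(re.findall(r"(type\d+)=\s*([\d.]+)", lines[i + 2]))
    # Emit one long-format row per type; missing types become None (the NA
    # cells in the desired output).
    for t in ("type1", "type2", "type3"):
        rows.append((mark, country, t, counts.get(t), prices.get(t)))
```

The `rows` list then matches the desired long-format table, with `None` where the SAS answers produce missing values.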
I'm new to SQL and I'm not sure what I want to do is possible.
Basically I want to count the number of empty fields present in a row and add it as a column at the end, like so:
Original data:

ID    Code  Name   Age
1111  aaa   name1  23
1111  bbb   name1  23
2222  cccc
3333  fdfd         34
3333  rrrr
Result:

ID    Code  Name   Age  Empty Fields
1111  aaa   name1  23   0
1111  bbb   name1  23   0
2222  cccc              2
3333  fdfd         34   1
3333  rrrr              2
After that I want to concatenate the Code field for each duplicate ID and delete the line with the higher Empty Fields count, like so:
First:

ID    Code       Name   Age  Empty Fields
1111  aaa,bbb    name1  23   0
1111  aaa,bbb    name1  23   0
2222  cccc                   2
3333  fdfd,rrrr         34   1
3333  rrrr,fdfd              2
End Result:

ID    Code       Name   Age  Empty Fields
1111  aaa,bbb    name1  23   0
2222  cccc                   2
3333  fdfd,rrrr         34   1
This is an inefficient method, but since you've laid out your steps I assume there's some reason for this logic.
Count the number of missing values using CMISS, which counts across character and numeric variables. Set the variable to 0 initially so it's not included in the missing count. Use _all_ to refer to all variables in the data set.
*calculate number missing;
data step1;
  set have;
  *set N_Missing to 0 so it is not counted as missing;
  N_Missing = 0;
  *count missing;
  N_Missing = cmiss(of _all_);
run;
Combine CODE into one field, using a data step.
*combine CODE into one field;
proc sort data=step1;
  by id code;
run;

data step2;
  set step1;
  by ID code;
  length Code_Agg $80.;
  retain Code_Agg;
  if first.ID then Code_Agg = Code;
  else Code_Agg = catx(", ", Code_Agg, Code);
  if last.ID then output;
  keep ID Code_Agg;
run;
Merge output from #1 and #2 to get the middle table. You may need to sort the data ahead of time.
*merge results with prior table;
data step3;
  merge step1 step2;
  by ID;
run;
Keep only the record of interest using NODUPKEY in PROC SORT, which keeps only unique records. Note the double sort: the first sort puts the record with the fewest missing fields first within each ID, so NODUPKEY keeps it.
proc sort data=step3;
  by ID N_Missing;
run;

proc sort data=step3 out=final nodupkey;
  by ID;
run;
I have 2 CSV files like so:
sheet1.csv only contains headers
Account_1 Amount_1 Currency_1 Country_1 Date_1
sheet2.csv contains headers and data
Account Currency Amount Date Country
1 GBP 117.89 20/02/2021 UK
2 GBP 129.39 15/02/2021 UK
How can I use pandas to map the data from sheet2 to sheet1? I want the data to have the new column names, in the same exact order.
First arrange the columns of sheet2 in the same order as sheet1:
sheet2 = sheet2[["Account", "Amount", "Currency", "Country", "Date"]]
This will rearrange sheet2's columns; then assign sheet1's headers:
sheet2.columns = sheet1.columns
Final output of sheet2.head() will be
Account_1 Amount_1 Currency_1 Country_1 Date_1
1 117.89 GBP UK 20/02/2021
2 129.39 GBP UK 15/02/2021
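A self-contained version of this recipe, with small hypothetical frames standing in for the two CSVs (in practice they would come from `pd.read_csv`):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
sheet1 = pd.DataFrame(columns=["Account_1", "Amount_1", "Currency_1",
                               "Country_1", "Date_1"])
sheet2 = pd.DataFrame({
    "Account": [1, 2],
    "Currency": ["GBP", "GBP"],
    "Amount": [117.89, 129.39],
    "Date": ["20/02/2021", "15/02/2021"],
    "Country": ["UK", "UK"],
})

# Reorder sheet2's columns to match sheet1's layout,
# then take over sheet1's headers.
sheet2 = sheet2[["Account", "Amount", "Currency", "Country", "Date"]]
sheet2.columns = sheet1.columns
```

Assigning to `.columns` only relabels the headers; the underlying data is untouched, which is why the reorder must happen first.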
In Stata, say that I have these data:
sysuse auto2, clear
gen name = substr(make, 1,3)
encode name, gen(name2)
I run this regression, which importantly uses i.:
reg price i.name2 trunk weight turn
The output takes the form of:
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
name2 |
Aud | 4853.048 1254.083 3.87 0.000 2331.545 7374.551
BMW | 5742.124 1560.161 3.68 0.001 2605.211 8879.037
Bui | 1351.065 946.733 1.43 0.160 -552.4696 3254.599
Cad | 7740.865 1168.332 6.63 0.000 5391.776 10089.95
Che | 62.35577 946.1153 0.07 0.948 -1839.937 1964.648
....
I then go to the estimation results:
matrix list e(b)
which produces:
e(b)[1,27]
1b. 2. 3. 4. 5. 6. 7.
name2 name2 name2 name2 name2 name2 name2
y1 0 4853.0482 5742.1237 1351.0647 7740.8653 62.355771 2676.3971
8. 9. 10. 11. 12. 13. 14.
name2 name2 name2 name2 name2 name2 name2
y1 943.4266 1964.8242 1776.4058 2711.4324 6386.7936
....
My question is how can I retrieve the variable labels from the name2 variable after the regression is run? What I want is what is displayed in the initial output: Aud, BMW, Bui, etc. I do not want what is stored in the e(b) matrix: 1b. name2, 2. name2, 3. name2, etc. Is there a way to get what I want stored in e(b), or is it stored elsewhere in other estimation results? Can estout/esttab do this? I would like to get the results stored in a matrix.
You can store e(b) in a matrix, get the levels of your variable with the levelsof command, and rename the column names:
sysuse auto2, clear
gen name = substr(make, 1,3)
encode name, gen(name2)
reg price i.name2 trunk weight turn
mat A = e(b)
levelsof name, local(names)
local colnames "`names' trunk weight turn _cons"
matrix colnames A = `colnames'
matrix list A
I know I can get counts of how many individual entries are in each unique group of records with the following.
LIST CUSTOMER BREAK-ON CITY TOTAL EVAL "1" COL.HDG "Customer Count" TOTAL CUR_BALANCE BY CITY
And I end up with something like this.
Cust...... City...... Customer Count Currently Owes
6 Arvada 1 4.54
********** -------------- --------------
Arvada 1 4.54
190 Boulder 1 0.00
1 Boulder 1 13.65
********** -------------- --------------
Boulder 2 13.65
...
============== ==============
TOTAL 29 85.28
29 records listed
Which becomes this, after we suppress the details and focus on the groups themselves.
City...... Customer Count Currently Owes
Arvada 1 4.54
Boulder 2 13.65
Chicago 3 4.50
Denver 6 0.00
...
============== ==============
TOTAL 29 85.28
29 records listed
But can I get a count of how many unique groupings are in the same report? Something like this.
City...... Customer Count Currently Owes City Count
Arvada 1 4.54 1
Boulder 2 13.65 1
Chicago 3 4.50 1
Denver 6 0.00 1
...
============== ============== ==========
TOTAL 29 85.28 17
29 records listed
Essentially, I want the unique-value count integrated into the other report so that I don't have to create an extra report just for something so simple.
SELECT CUSTOMER SAVING UNIQUE CITY
17 records selected to list 0.
I swear that this should be easier. I see various # variables in the documentation that hint at the possibility of doing this easily, but I have never been able to get one of them to work.
If your data is structured in such a way that your ID is what you would be grouping by, the data you want is stored in a value-delimited field, and you don't want to include or exclude anything, you can use something like the following.
In UniVerse, using the CUSTOMER table in the demo HS.SALES account installed on many systems, you can do this. The CUSTID is the record #ID, and attribute 13 is where the PRICE is stored in a value-delimited array.
LIST CUSTOMER BREAK-ON CUSTID TOTAL EVAL "DCOUNT(#RECORD<13>,#VM)" TOTAL PRICE AS P.PRICE BY CUSTID DET.SUP
Which outputs this.
DCOUNT(#RECORD<13>,#
Customer ID VM)................. P.PRICE
1 1 $4,200
2 3 $19,500
3 1 $4,250
4 1 $16,500
5 2 $3,800
6 0 $0
7 2 $5,480
8 2 $12,900
9 0 $0
10 3 $10,390
11 0 $0
12 0 $0
==================== =======
15 $77,020
That is a little juice for a lot of squeeze, but I hope you find it useful.
Good Luck!
Since the system variable #NB is set only on total lines, this lets your counter count the number of TOTAL lines, which occur once per unique city, excluding the grand total.
LIST CUSTOMER BREAK-ON CITY TOTAL EVAL "IF #NB < 127 THEN 1 ELSE 0" COL.HDG "Customer Count" TOTAL CUR_BALANCE BY CITY
I don't have a system to try this on, but this is my understanding of the variable.
I'm working with a dataset of items with different values and I would like a SQL query to calculate the total USD value of the dataset.
Example Dataset:
id | type | numOrdered
0 | apple | 1
1 | orange | 3
2 | apple | 10
3 | apple | 5
4 | orange | 2
5 | apple | 1
Consider this dataset of fruit orders. Let's say apples are worth $1 and oranges are worth $2. I would like to know how much total USD in fruit orders we have.
I'd like to perform the same operation as this example Javascript function, but using SQL:
let sum = 0;
for (let fruitOrder of fruitOrders) {
  if (fruitOrder.type == "orange") {
    sum += fruitOrder.numOrdered * 2;
  } else {
    sum += fruitOrder.numOrdered * 1;
  }
}
return sum;
So the correct answer for this dataset would be $27 USD total since there are 17 apples worth $1 and 5 oranges worth $2.
I know how to break it down into two distinct queries giving me the numbers I want, split by type:
SELECT
  sum("public"."fruitOrders"."numOrdered" * 2) AS "sum"
FROM "public"."fruitOrders"
WHERE "public"."fruitOrders"."type" = 'orange';
which would return $10, the total USD value of oranges
SELECT
  sum("public"."fruitOrders"."numOrdered") AS "sum"
FROM "public"."fruitOrders"
WHERE "public"."fruitOrders"."type" = 'apple';
which would return $17, the total USD value of apples
I just don't know how to sum those numbers together in SQL to get $27, the total USD value of the dataset.
If you want the values 1 and 2 hardcoded, then you can use a CASE expression with SUM():
SELECT
  sum(case type
        when 'apple' then 1
        when 'orange' then 2
      end * "numOrdered"
  ) AS "sum"
FROM "public"."fruitOrders"
Result:
| sum |
| --- |
| 27 |
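If you want to sanity-check the arithmetic, here is a small in-memory SQLite sketch (SQLite standing in for the Postgres-style database in the question, with the example data inlined):

```python
import sqlite3

# Build the example fruitOrders table in memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fruitOrders (id INTEGER, type TEXT, numOrdered INTEGER)")
con.executemany(
    "INSERT INTO fruitOrders VALUES (?, ?, ?)",
    [(0, "apple", 1), (1, "orange", 3), (2, "apple", 10),
     (3, "apple", 5), (4, "orange", 2), (5, "apple", 1)],
)

# The CASE-based SUM: price each row by type, then total everything.
total, = con.execute("""
    SELECT SUM(CASE type WHEN 'apple' THEN 1
                         WHEN 'orange' THEN 2 END * numOrdered)
    FROM fruitOrders
""").fetchone()
print(total)  # 17 apples * $1 + 5 oranges * $2 = 27
```

The single query replaces both per-type queries and the manual addition.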