I have a string variable called country with a value which can be for example Afghanistan2008, but it can also be Brasil2012. I would like to create two new variables, one being the country part and one the year part .
Because there are always numbers at the end of the string, I do know the position the string should be split at from the right side but not from the left side.
Could I use something like:
gen(substr("country",-4,.))
If not, could anyone tell me how to split an entire column of such variables into a country and a year variable? I would also like to keep the original variable.
You can use a regular expression:
clear
set obs 2
generate string = ""
replace string = "Afghanistan2008" in 1
replace string = "Brasil2012" in 2
generate country = regexs(0) if regex(string, "[a-zA-Z]+")
generate year = regexs(1) + regexs(2) if regex(string, "(19|20)([0-9][0-9])")
list
+--------------------------------------+
| string country year |
|--------------------------------------|
1. | Afghanistan2008 Afghanistan 2008 |
2. | Brasil2012 Brasil 2012 |
+--------------------------------------+
Type help regex in Stata's command prompt for more information.
Alternatively you could do the following:
generate len = length(string) - 3
generate country2 = substr(string, 1, len - 1)
generate year2 = substr(string, len, .)
list country2 year2
+---------------------+
| country2 year2 |
|---------------------|
1. | Afghanistan 2008 |
2. | Brasil 2012 |
+---------------------+
For my specific situation the following makes a new year variable:
gen spyear = real(substr(country,-4,.))
I took the other part from #PearlySpencer:
generate len = length(country) - 3
generate spcountry = substr(country, 1, len - 1)
which creates an excess column to be removed.
EDIT (Nick Cox) This can be simplified to
gen spyear = real(substr(country, -4, 4))
gen spcountry = substr(country, 1, length(country) - 4)
showing that
There is no need to create a variable containing the string length.
The puzzling split 4 = 3 + 1 is not needed either.
Related
I'm trying to find a column (I do know the name of the column) base on a value. For example in this dataframe below, I'd like to know which row that has a column contains yellow for Category = A . The thing is I don't know the column name (colour) in advance so I couldn't do select * where Category = 'A' and colour = 'yellow' How can I scan the columns and achieve this? Many thanks for your help.
+--------+-----------+-------------+
|Category|colour |. name. |
+--------+-----------+-------------+
|A. | blue.| Elmo|
|A | yellow | Alex|
|B | desc | Erin|
+--------+-----------+-------------+
You can loop that check through the list of column names. You also can wrap this loop in a function for the readable purpose. Please note that this check per column would happen in sequence.
from pyspark.sql import functions as F
cols = df.columns
for c in cols:
cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
if cnt > 0:
print(c)
string row1
string row2
Is it possible to reduce rows to 1 row?
Rows should be joined with a comma.
As a result I expect
string row1, string row2
One of the workaround could able to solve the above issue,
To concatenate we can use this for e.g | extend New_Column = strcat(tagname,",", tagvalue) with comma between two string.
For example we have tested in our environment with tag name and tag value
resourcecontainers
| where type =~ 'microsoft.resources/subscriptions'
| extend tagname = tostring(bag_keys(tags)[0])
| extend tagvalue = tostring(tags[tagname])
| extend New_Column = strcat(tagname,",", tagvalue) // for concate two rows into one with comma between two string
Here is the sample output for reference:
For more information please refer this SO THREAD
UPDATE: To cancat the rows we tried with the example of code as stated in the given SO THREAD which is suggested by #Yoni L.
| summarize result = strcat_array(make_list(word), ",")
Sample output for reference:
thanks for the tips. in your links I found:
| summarize result = strcat_array(make_list(name_s), ",")
I am trying to rename around 100 dummy variables with the values from a separate variable.
I have a variable products, which stores information on what products a company sells and have generated a dummy variable for each product using:
tab products, gen(productid)
However, the variables are named productid1, productid2 and so on. I would like these variables to take the values of the variable products instead.
Is there a way to do this in Stata without renaming each variable individually?
Edit:
Here is an example of the data that will be used. There will be duplications in the product column.
And then I have run the tab command to create a dummy variable for each product to produce the following table.
sort product
tab product, gen(productid)
I noticed it updates the labels to show what each variable represents.
What I would like to do is to assign the value to be the name of the variable such as commercial to replace productid1 and so on.
Using your example data:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tabulate product, generate(productid)
list, abbreviate(10)
sort product
levelsof product, local(new) clean
tokenize `new'
ds productid*
local i 0
foreach var of varlist `r(varlist)' {
local ++i
rename `var' ``i''
}
Produces the desired output:
list, abbreviate(10)
+---------------------------------------------------------------------------+
| companyid product Commercial CreditCard EMFunds P2P Retail |
|---------------------------------------------------------------------------|
1. | 3 Commercial 1 0 0 0 0 |
2. | 5 CreditCard 0 1 0 0 0 |
3. | 4 CreditCard 0 1 0 0 0 |
4. | 6 EMFunds 0 0 1 0 0 |
5. | 1 P2P 0 0 0 1 0 |
6. | 2 Retail 0 0 0 0 1 |
+---------------------------------------------------------------------------+
Arbitrary strings might not be legal Stata variable names. This will happen if they (a) are too long; (b) start with any character other than a letter or an underscore; (c) contain characters other than letters, numeric digits and underscores; or (d) are identical to existing variable names. You might be better off making the strings into variable labels, where only an 80 character limit bites.
This code loops over the variables and does its best:
gen long obs = _n
foreach v of var productid? productid?? productid??? {
su obs if `v' == 1, meanonly
local tryit = product[r(min)]
capture rename `v' `=strtoname("`tryit'")'
}
Note: code not tested.
EDIT: Here is a test. I added code for variable labels. The data example and code show that repeated values and values that could not be variable names are accommodated.
clear
input str13 products
"one"
"two"
"one"
"three"
"four"
"five"
"six something"
end
tab products, gen(productsid)
gen long obs = _n
foreach v of var productsid*{
su obs if `v' == 1, meanonly
local value = products[r(min)]
local tryit = strtoname("`value'")
capture rename `v' `tryit'
if _rc == 0 capture label var `tryit' "`value'"
else label var `v' "`value'"
}
drop obs
describe
Contains data
obs: 7
vars: 7
size: 133
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
products str13 %13s
five byte %8.0g five
four byte %8.0g four
one byte %8.0g one
six_something byte %8.0g six something
three byte %8.0g three
two byte %8.0g two
-------------------------------------------------------------------------------
Another solution is to use the extended macro function
local varlabel:variable label
The tested code is:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tab product, gen(product_id)
* get the list of product id variables
ds product_id*
* loop through the product id variables and change the
variable name to its label
foreach var of varlist `r(varlist)' {
local varlabel: variable label `var'
display "`varlabel'"
local pos = strpos("`varlabel'","==")+2
local varlabel = substr("`varlabel'",`pos',.)
display "`varlabel'"
rename `var' `varlabel'
}
I am trying to match a string which includes -,.$/ ( and might include other special characters which I don't know yet( with a regex . I have to match first 28 characters in the string
The String is -->
Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION COMMON SHARE STOCK CERTIFICATE NO. 323248 987,837 SHARES PAR VAL $1.00 NOT ADMINISTERED XX XX, XXXSFHIGSKF/XXXX PURPOSES ONLY
The regex I am using is ((([\w-,.$\/]+)\s){28}).*
Is there a better way to match special characters ?
Also I get an error if the string length is less than 28. What can I do to include the range so that the regex works even if the string is less than 28 characters
the code looks something like this
Select regexp_extract(Txn_Desc,'((([\w-,.$;!#\/%)^#<>&*(]+)\s){1,28}).*',1) as Transaction_Short_Desc,Txn_Desc
from Table x
It seems you are looking for 28 tokens.
Try
(\S+\s+){0,28}
or
([^ ]+ +){0,28}
This is the result for 8 tokens:
Received - Data Migration 1. Units, of UNITED
| | | | | | | |
1 2 3 4 5 6 7 8
I have crosstab which has row columns indicating different classes, and then peoples names across the top.
| | Required | Person 1 | Person 2 | Person 3 |
| Class 1 | 8 6 | 1 6 | 3 6 | 4 6 |
| Class 2 | 6 2 | 3 2 | 2 2 | 1 2 |
Each field contains 2 values The first value is the number of hours spent in the class, the second field is the number of hours required for certification.
The Required field id my grand total summary.
In the cross tab expert the fields are defined as follows.
Rows:
Command.descr -> a field containing the class names
Columns:
Command.fullname -> a field containing students full names
Summarized Fields:
Sum of Command.evlength -> summation of all time spent in a given course
Max of #required -> this formula returns the number of required hours based on the course name
I am trying to highlight the field Sum of Command.evlength if it is greater than or equal to the value of Max of #required.
My solution was to perform background formatting. Right-Click on the Sum of Command.evlength field, select Format Field. Click the borders tab, check Background, and enter a formula.
The formula I was using is:
if CurrentFieldValue >= {#required} then color(152, 251, 152) else crNoColor
This is not the correct formula. My crosstab has been placed in the footer, which causes {#required} to contain the last value in the grid which in the above example is 2.
From my research I thought I would have to use GridRowColumnValue(row or column name) to access the value of {#required} in the crosstab, but I could not come up with the correct string to represent it.
Does anyone have a way for me to correctly perform this comparison?
Frustratingly I don't think you can use the highlighting expert to compare to a dynamic value. You could swap the columns round then add the following formulas:
To the max_of_required background colour:
whileprintingrecords;
global numbervar required_hrs := currentfieldvalue;
crNoColor;
To the sum_of_command.evlength background colour:
whileprintingrecords;
global numbervar required_hrs;
if currentfieldvalue >= required_hrs then
crRed
else
crNoColor;
I think there are a few other ways but i'm not as confident with those so start here.