My dataset contains a year, ID, and binary value variable.
ID Year Value
1 2000 0
1 2001 0
1 2002 1
1 2003 1
1 2004 1
1 2005 1
Using Stata, I would like to create a new variable "YearValue" that holds, for each ID, the value of "Year" in the first year in which "Value" is 1.
ID Year Value YearValue
1 2000 0 2002
1 2001 0 2002
1 2002 1 2002
1 2003 1 2002
1 2004 1 2002
1 2005 1 2002
Thank you for your help!
egen wanted = min(cond(Value == 1, Year, .)), by(ID)
See https://www.stata-journal.com/article.html?article=dm0055 (especially Section 9) for this technique in context.
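For readers who work in pandas rather than Stata, the same idea (blank out Year wherever Value is not 1, then take the minimum within each ID) can be sketched as follows. This is only a rough equivalent, not part of the Stata answer; the DataFrame is rebuilt here from the question's example, and the name year_if_one is made up for illustration.

```python
import pandas as pd

# Example data rebuilt from the question (assumed to live in a DataFrame).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'Year': [2000, 2001, 2002, 2003, 2004, 2005],
    'Value': [0, 0, 1, 1, 1, 1],
})

# Keep Year only where Value == 1; other rows become missing (NaN),
# which mirrors cond(Value == 1, Year, .) in the Stata one-liner.
year_if_one = df['Year'].where(df['Value'].eq(1))

# Minimum over each ID, ignoring missing values, broadcast back to every row.
df['YearValue'] = year_if_one.groupby(df['ID']).transform('min')

print(df)
```

Because the intermediate series contains missing values, YearValue comes back as a float column; it can be converted to an integer type afterwards if every ID has at least one row with Value equal to 1.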
Related
Assuming I'm dealing with this dataframe:
ID  Qualified  Year  Amount A  Amount B
1   No         2020  0         150
1   No         2019  0         100
1   Yes        2019  10        15
1   No         2018  0         100
1   Yes        2018  10        150
2   Yes        2020  0         200
2   No         2017  0         100
... ...        ...   ...
My desired output should be like this:
ID  Qualified  Year  Amount A  Amount B
1   No         2020  0         150
1   Partial    2019  10        115
1   Partial    2018  10        250
2   Yes        2020  0         200
2   No         2017  0         100
... ...        ...   ...
As you can see, when a year within an ID contains both Yes and No in the Qualified column, those rows are merged into one: Qualified becomes Partial, and Amount A and Amount B are each summed.
I don't know how to approach this. Could anyone suggest a method?
You can use groupby() together with agg() to perform this operation.
agg() accepts not only the common aggregation functions (such as sum, mean, etc.) but also custom-defined functions.
I would do as follows:
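def agg_qualify(x):
    # Distinct Qualified values within one (ID, Year) group
    values = x.unique()
    # More than one distinct value means both 'Yes' and 'No' are present
    if len(values) > 1:
        return 'Partial'
    return values[0]

df.groupby(['ID', 'Year']).agg({
    'Qualified': agg_qualify,
    'Amount A': 'sum',
    'Amount B': 'sum',
}).reset_index()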
Output:
ID Year Qualified Amount A Amount B
0 1 2018 Partial 10 250.0
1 1 2019 Partial 10 115.0
2 1 2020 No 0 150.0
3 2 2020 Yes 0 200.0
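To try the approach end to end, here is a self-contained sketch built on the seven rows shown in the question (the rows hidden behind "..." are not reconstructed). It assumes that "Partial" should appear only when a group holds more than one distinct Qualified value, so it counts distinct values rather than rows.

```python
import pandas as pd

# The visible rows of the question's DataFrame.
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 2, 2],
    'Qualified': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
    'Year': [2020, 2019, 2019, 2018, 2018, 2020, 2017],
    'Amount A': [0, 0, 10, 0, 10, 0, 0],
    'Amount B': [150, 100, 15, 100, 150, 200, 100],
})

def agg_qualify(x):
    # 'Partial' only if the group mixes different Qualified values.
    values = x.unique()
    return 'Partial' if len(values) > 1 else values[0]

result = df.groupby(['ID', 'Year']).agg({
    'Qualified': agg_qualify,
    'Amount A': 'sum',
    'Amount B': 'sum',
}).reset_index()

print(result)
```

The result is sorted by ID and Year because groupby() sorts its keys by default; pass sort=False if the original row order matters.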
Below is the data that I have, which has 3 columns:
ID: Member ID
Company: Company name
Year: Year of joining the company
import pandas as pd
import numpy as np
data = {'ID':[1,1,1,2,2,3,3,3,3,3,4,4,4],
        'Company':['Google','Microsoft','LinkedIn','Youtube','Google','Google','Microsoft','Youtube','Google','Microsoft','Microsoft','Google','LinkedIn'],
        'Year':[2001,2004,2009,2001,2009,1999,2000,2003,2006,2010,2010,2012,2020]}
FullData = pd.DataFrame(data)
FullData:
ID Company Year
1 Google 2001
1 Microsoft 2004
1 LinkedIn 2009
2 Youtube 2001
2 Google 2009
3 Google 1999
3 Microsoft 2000
3 Youtube 2003
3 Google 2006
3 Microsoft 2010
4 Microsoft 2010
4 Google 2012
4 LinkedIn 2020
Below I have grouped the data by ID and ranked it according to the Year
FullData['Rank'] = FullData.groupby('ID')['Year'].rank(method='first').astype(int)
FullData
ID Company Year Rank
1 Google 2001 1
1 Microsoft 2004 2
1 LinkedIn 2009 3
2 Youtube 2001 1
2 Google 2009 2
3 Google 1999 1
3 Microsoft 2000 2
3 Youtube 2003 3
3 Google 2006 4
3 Microsoft 2010 5
4 Microsoft 2010 1
4 Google 2012 2
4 LinkedIn 2020 3
Now I need to get only the member IDs who joined Microsoft right after Google. That is, within each ID, I need only the records where the company is Google or Microsoft and the Google rank is immediately followed by the Microsoft rank (accepted output: Google at rank 1 and Microsoft at rank 2, or Google at rank 4 and Microsoft at rank 5, and so on).
Below is the sample of desired output
ID Company Year Rank
1 Google 2001 1
1 Microsoft 2004 2
3 Google 1999 1
3 Microsoft 2000 2
3 Google 2006 4
3 Microsoft 2010 5
Or, alternatively, the count of unique IDs:
Count of unique IDs/members who worked for Google prior to Microsoft = 2
Any help is appreciated. Thanks a million in advance.
Use boolean indexing:
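def myfunc(df):
    # Google row whose next row (within the ID) is Microsoft
    m1 = (df['Company'].eq('Google') & df['Company'].shift(-1).eq('Microsoft'))
    # The two rows must also have consecutive ranks
    m2 = (df['Rank'].eq(df['Rank'].shift(-1) - 1))
    # Keep each matching Google row and the Microsoft row right after it
    return df[(m1 & m2) | (m1.shift() & m2.shift())]

out = FullData[FullData['Company'].isin(['Google', 'Microsoft'])] \
    .groupby('ID').apply(myfunc).droplevel(0)
print(out)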
# Output:
ID Company Year Rank
0 1 Google 2001 1
1 1 Microsoft 2004 2
5 3 Google 1999 1
6 3 Microsoft 2000 2
8 3 Google 2006 4
9 3 Microsoft 2010 5
For the unique count, use out['ID'].nunique()
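As an alternative sketch, the same selection can be written without groupby().apply() by shifting the Company column within each ID. It assumes rows are ordered by Year within each ID, so "right after" means the next row of the same member; the names next_company and prev_company are introduced here for illustration only.

```python
import pandas as pd

data = {
    'ID': [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4],
    'Company': ['Google', 'Microsoft', 'LinkedIn', 'Youtube', 'Google',
                'Google', 'Microsoft', 'Youtube', 'Google', 'Microsoft',
                'Microsoft', 'Google', 'LinkedIn'],
    'Year': [2001, 2004, 2009, 2001, 2009, 1999, 2000, 2003, 2006, 2010,
             2010, 2012, 2020],
}
df = pd.DataFrame(data).sort_values(['ID', 'Year'])

# Company on the next / previous row of the same member (NaN at the edges).
next_company = df.groupby('ID')['Company'].shift(-1)
prev_company = df.groupby('ID')['Company'].shift(1)

# A Google row directly followed by Microsoft, or the Microsoft row itself.
google_then_ms = df['Company'].eq('Google') & next_company.eq('Microsoft')
ms_after_google = df['Company'].eq('Microsoft') & prev_company.eq('Google')

out = df[google_then_ms | ms_after_google]
print(out)
print(out['ID'].nunique())
```

Working on adjacent rows directly makes the Rank column unnecessary for this check, since two rows that are neighbours after sorting by Year always have consecutive ranks.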
I have a table with the following format:
ID Estation Y M D H N Nh h Cl
1 78357 2017 5 1 1 0 0 -9001 0
2 78357 2017 5 1 2 0 0 -9001 0
3 78357 2017 5 1 3 1 1 750 5
I want to convert the data in this table to the following format:
ID Estation Y M D H Var Value
1 78357 2017 5 1 1 N 0
2 78357 2017 5 1 2 N 0
3 78357 2017 5 1 3 N 1
4 78357 2017 5 1 1 Nh 0
5 78357 2017 5 1 2 Nh 0
6 78357 2017 5 1 3 Nh 1
7 78357 2017 5 1 1 h -9001
8 78357 2017 5 1 2 h -9001
9 78357 2017 5 1 3 h 750
10 78357 2017 5 1 1 Cl 0
11 78357 2017 5 1 2 Cl 0
12 78357 2017 5 1 3 Cl 5
Because of the number of records I need to convert from one format to the other, I would like to do it in Google Refine. Does anyone have an idea how to do it?
You can do this in Google Refine (now called OpenRefine) using the Transpose option.
In the 'N' column click the drop down menu and choose "Transpose -> Transpose cells across columns into rows"
In the screen shown choose "N" as the "From Column" and "(last column)" as the "To Column"
Choose to Transpose into Two New Columns. Call the Key column "Var" and the Value column "Value"
Check the box that says "Fill down in other columns"
Click Transpose
This should give you the various variables and values in a single column with multiple rows.
Sorting into the order you give in your example may be challenging. If you sort the Var column in reverse alphabetical order, the result is close although not quite the same; I'm not sure how important this is to you.
Remember that in OpenRefine you have to choose Reorder Rows Permanently to commit the new sort order.
You may have to transform the ID column to renumber it with unique IDs. You can do this with the GREL expression rowIndex+1 once you have the sort order correct.
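For comparison, the same wide-to-long reshaping can be sketched outside OpenRefine with pandas. This is not part of the OpenRefine workflow above; it assumes the table has been loaded into a DataFrame with the question's column names, and the names wide and long_df are hypothetical.

```python
import pandas as pd

# The three sample rows from the question.
wide = pd.DataFrame({
    'ID': [1, 2, 3],
    'Estation': [78357, 78357, 78357],
    'Y': [2017, 2017, 2017],
    'M': [5, 5, 5],
    'D': [1, 1, 1],
    'H': [1, 2, 3],
    'N': [0, 0, 1],
    'Nh': [0, 0, 1],
    'h': [-9001, -9001, 750],
    'Cl': [0, 0, 5],
})

# Turn the N, Nh, h and Cl columns into (Var, Value) pairs.
# Rows come out grouped by variable, in the order given in value_vars.
long_df = wide.drop(columns='ID').melt(
    id_vars=['Estation', 'Y', 'M', 'D', 'H'],
    value_vars=['N', 'Nh', 'h', 'Cl'],
    var_name='Var',
    value_name='Value',
)

# Renumber the rows with a fresh unique ID starting at 1.
long_df.insert(0, 'ID', range(1, len(long_df) + 1))

print(long_df)
```

Here the row order follows value_vars, so the output already matches the order in the question's example without a separate sort.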
This question is posted following a suggestion in this thread.
I'm using SQLite/Database browser and my data looks like this:
data.csv
company year value
A 2000 15
A 2001 12
A 2002 20
B 2000 25
B 2001 20
B 2002 10
C 2000 18
C 2001 14
C 2002 22
etc..............
What I want to do is get all companies which have a value of <= 20 for all years in the data set. Using the data above, this means I want the query to return:
result.csv
company year value
A 2000 15
A 2001 12
A 2002 20
Thus excluding company C due to value > 20 in 2002 and company B for value > 20 in 2000.
You want all companies whose maximum value is no larger than 20:
SELECT *
FROM Data
WHERE company IN (SELECT company
                  FROM Data
                  GROUP BY company
                  HAVING max(value) <= 20)
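The query can be checked against the sample data with Python's built-in sqlite3 module. This is a minimal sketch: the in-memory table and its column types are assumptions made for the test, and an ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

# Build an in-memory table holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (company TEXT, year INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?)",
    [
        ("A", 2000, 15), ("A", 2001, 12), ("A", 2002, 20),
        ("B", 2000, 25), ("B", 2001, 20), ("B", 2002, 10),
        ("C", 2000, 18), ("C", 2001, 14), ("C", 2002, 22),
    ],
)

# Keep every row of the companies whose largest value is at most 20.
rows = conn.execute(
    """
    SELECT *
    FROM Data
    WHERE company IN (SELECT company
                      FROM Data
                      GROUP BY company
                      HAVING max(value) <= 20)
    ORDER BY company, year
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Only company A survives: B is dropped because of its value 25 in 2000, and C because of its value 22 in 2002.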
Not sure if there are better solutions, but I think this will work:
select company
     , sum(case when value <= 20 then 1 else 0 end) s
     , count(*) c
  from data
 where year in (2000, 2001, 2002)
 group
    by company
having s = c
It checks whether the number of rows with a value of at most 20 equals the total number of rows for the company.
I have recently written a script in T-SQL which uses dynamic SQL to generate a table. The output of the script varies depending on when it is run. The output is something like this:
Group 2010 2011 2012 2013
A 1 2 3 2
B 4 3 3 4
C 4 3 1 1
However, each year another year is added onto the table, meaning the table size varies.
e.g.
Group 2010 2011 2012 2013 2014
A 1 2 3 2 2
B 4 3 3 4 2
C 4 3 1 1 3
I need to be able to access the data in this table via Access to generate some reports, so I require some sort of view or function to get the data.
What is the best way of doing this?
If you have to use this output in a report, then you have to fix the column names in SQL, as below.
Group year4 year3 year2 year1
A 1 2 3 2
B 4 3 3 4
C 4 3 1 1
In the reporting tool you can then map year1 = current year, year2 = current year - 1, and so on.
Update 2
Using this method you can easily design your report.
Group year5 year4 year3 year2 year1
A 1 2 3 2 2
B 4 3 3 4 2
C 4 3 1 1 3
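The mapping from the fixed column names back to calendar years is simple arithmetic. The sketch below illustrates it in Python; the function name relative_year_labels is hypothetical, and in practice the same calculation would be done in the reporting tool's own expression language.

```python
def relative_year_labels(current_year, n_years):
    """Map fixed column names to calendar years.

    year1 is the current year, year2 the year before it, and so on.
    """
    return {
        f"year{i}": current_year - (i - 1)
        for i in range(n_years, 0, -1)
    }

# With four year columns and 2013 as the current year:
labels = relative_year_labels(2013, 4)
print(labels)
```

With this scheme the column names in the SQL output never change, and only the labels shown in the report shift as each new year is added.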