self join data to get distinct years - pandas

I have this dataframe that I need to join in order to find the academic years.
import pandas as pd

df11 = pd.read_csv('https://s3.amazonaws.com/todel1623/myso.csv')
df11.course_id.value_counts()
274 3
285 2
260 1
I can use a self-join and get the respective years without any problem.
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
But the expected count in this case is
274 6
285 4
260 2
This is because even though there are 3 years for id 274, the course duration is only 24 months. And even though there is only 1 record for 260, since the duration is 24 months it should return 2 records (one for the current year and one for current_year + 1), with the rest of the column values staying the same for that group.
Can I write a loop over the dataframe, something like this?
for row in df:
    for i in range(int(row.duration_inmonths / 12)):
        row.year = row.year + i
        df.append(row)
In the following case, the first record should be 2017 and not 2018.
myl = list()
for row in df11.values:
    for i in range(int(row[15] / 12)):
        row[5] = row[5] + i
        myl.append(row)
myl[:2]
[array([383, 1102, 'C-43049', 'M.B.A./M.M.S.', 'Un-Aided', 2018, 80000,
8000, 900, 312, 89212, 2018, 12, 260, 95, 24, 1102.0,
'M.B.A./M.M.S.'], dtype=object),
array([383, 1102, 'C-43049', 'M.B.A./M.M.S.', 'Un-Aided', 2018, 80000,
8000, 900, 312, 89212, 2018, 12, 260, 95, 24, 1102.0,
'M.B.A./M.M.S.'], dtype=object)]

The numpy array does not seem to get appended to the list with the changed values. It worked when I converted it to a list:
myl.append(row.tolist())
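
In case a vectorized alternative helps: the loop above appends the same numpy array object each time and keeps mutating it in place, which is why the values only come out right after row.tolist() (which takes a copy). Below is a minimal loop-free sketch, assuming the columns really are named year and duration_inmonths (as in the pseudocode above) and that the duration is always a whole number of years:

import pandas as pd

df11 = pd.read_csv('https://s3.amazonaws.com/todel1623/myso.csv')

# one output row per year of the course duration
n_years = (df11['duration_inmonths'] // 12).astype(int)
expanded = df11.loc[df11.index.repeat(n_years)].copy()

# add 0, 1, ... to the starting year within each original record
expanded['year'] = expanded['year'] + expanded.groupby(level=0).cumcount()
expanded = expanded.reset_index(drop=True)

expanded.course_id.value_counts()
# expected: 274 -> 6, 285 -> 4, 260 -> 2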

Related

Getting col % from a base size

I'm trying to get an output for a multi-response table in col%. I can get a % from the column total but not from a fixed base. How do I do it? For example:
Past week used (Seg A, Seg B, Seg C) =
Olive Oil: 80, 100, 150
Sunflower Oil: 35, 95, 105
Coconut Oil: 109, 209, 15
Segment sizes A=120, B=250, C=165
I need col% by each segment
So Seg A should be calculated as
Olive Oil= 80/120; Sunflower Oil=35/120 & Coconut Oil=109/120
Similarly for Seg B & Seg C.
I'm using tidyr and dplyr to generate my outputs.
Any advice will be much appreciated.
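
The outputs here are built with tidyr/dplyr, but purely to illustrate the arithmetic (each count divided by its segment's fixed base rather than by the column total), here is a minimal sketch in pandas using the numbers above; the same element-wise division carries over to dplyr:

import pandas as pd

# "past week used" counts by segment, from the example above
counts = pd.DataFrame(
    {'Seg A': [80, 35, 109],
     'Seg B': [100, 95, 209],
     'Seg C': [150, 105, 15]},
    index=['Olive Oil', 'Sunflower Oil', 'Coconut Oil'])

# fixed base per segment (the segment sizes), not the column totals
base = pd.Series({'Seg A': 120, 'Seg B': 250, 'Seg C': 165})

col_pct = counts.div(base, axis=1) * 100
print(col_pct.round(1))   # e.g. Olive Oil in Seg A: 80/120 = 66.7%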

Redshift - Breaking number into 10 parts and finding which part does a number fall into

I am trying to break a given number down into 10 equal parts and then, for a row of numbers, see which of the 10 parts each number falls into.
ref_number, number_to_check
70, 34
70, 44
70, 14
70, 24
In the above data set, I would like to break 70 into 10 equal parts (in this case 7, 14, 21, and so on up to 70). Next I would like to see which "part" the value in column "number_to_check" falls into.
Output expected:
ref_number, number_to_check, part
70, 34, 5
70, 44, 7
70, 14, 2
70, 24, 4
You want arithmetic. If I understand correctly:
select ceiling(number_to_check * 10.0 / ref_number)
Here is a db<>fiddle (the fiddle happens to use Postgres).
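
As a quick sanity check of that arithmetic outside Redshift (a small Python sketch, not Redshift SQL), the same expression reproduces the expected parts for the sample rows:

import math

rows = [(70, 34), (70, 44), (70, 14), (70, 24)]
for ref_number, number_to_check in rows:
    # ceiling(number_to_check * 10 / ref_number) maps the value onto one of 10 equal parts
    print(ref_number, number_to_check, math.ceil(number_to_check * 10.0 / ref_number))
# parts printed: 5, 7, 2, 4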

Postgis: How do I select every second point from LINESTRING?

In DBeaver I have a table containing some GPS coordinates stored as Postgis LINESTRING format.
My questions is: If I have, say, this info:
LINESTRING(20 20, 30 30, 40 40, 50 50, 60 60, 70 70)
which built-in ST function can I use to get every N-th element in that LINESTRING? For example, if I choose 2, I would get:
LINESTRING(20 20, 40 40, 60 60)
, if 3:
LINESTRING(20 20, 50 50)
and so on.
I've tried ST_Simplify and ST_PointN, but that's not exactly what I need, because I still want it to stay a LINESTRING, just with fewer points (lower resolution).
Any ideas?
Thanks :-)
Welcome to SO. Have you tried using ST_DumpPoints and applying a modulo % over the vertices' path? e.g. every second record:
WITH j AS (
SELECT
ST_DumpPoints('LINESTRING(20 20, 30 30, 40 40, 50 50, 60 60, 70 70)') AS point
)
SELECT ST_AsText(ST_MakeLine((point).geom)) FROM j
WHERE (point).path[1] % 2 = 0;
st_astext
-------------------------------
LINESTRING(30 30,50 50,70 70)
(1 row)
Further reading:
ST_MakeLine
CTE
ST_Simplify should return a linestring unless the simplification results in an invalid geometry for a linestring, i.e., fewer than 2 vertices. If you always want a linestring to be returned, consider ST_SimplifyPreserveTopology. It ensures that at least two vertices are returned in the linestring.
https://postgis.net/docs/ST_SimplifyPreserveTopology.html

How to compute the difference in monthly income for the same id

The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) during the period of a few years:
import pandas as pd

data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month': [1, 1, 2, 3, 2, 3],
        'year': [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
You can notice that shop_id=11 has only one entry in 2011 (january) and shop_id=15 has a few entries in 2015 (january, february, march). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014.
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue: this feature shows the change in revenue from the previous month, for each shop.
I would like to offer some explanation of how some of the values in diff_revenue were generated:
The value in the first cell is 0, because there is no previous information for shop_id=11;
The 2nd cell is also 0, for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500, because the change from this shop's last entry (january 2015) to the current cell's revenue (february 2015) is 500 Trumps;
The 5th cell is 1000, because the change from this shop's last entry (january 2011) to the current cell's revenue (february 2014) was 1000 Trumps.
I'm no expert in Pandas and was wondering if the Pandas gods knew a better way. The DataFrame I have to work with is quite large (over 1M observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (can undo this later)
Perform a groupby on 'shop_id' to do group level operations
Take the absolute value, since you don't want to distinguish between positive and negative
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
   shop_id  month  year  revenue  diff_revenue
0       11      1  2011    11000             0
1       15      1  2015     5000             0
2       15      2  2015     4500           500
3       15      3  2015     5500          1000
4       11      2  2014    10000          1000
5       11      3  2014     8000          2000
Edit
A little less straightforward solution, but maybe slightly more performant:
# sort the values so they're in chronological order within each shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()

Dask aggregate value into fixed range with start and end time?

In dask, or even pandas, how would you go about grouping a dask dataframe that has 3 columns of time / level / spread into a set of fixed ranges by time?
Time only moves in one direction, like a loop counting up. So the end result would be the start time and end time, together with the high of level, the low of level, the first value of level and the last value of level over each fixed range. Example:
12:00:00, 10, 1
12:00:01, 11, 1
12:00:02, 12, 1
12:00:03, 11, 1
12:00:04, 9, 1
12:00:05, 6, 1
12:00:06, 10, 1
12:00:07, 14, 1
12:00:08, 11, 1
12:00:09, 7, 1
12:00:10, 13, 1
12:00:11, 8, 1
This is for a fixed level range of 7: within each bin, the level can not span more than 7 in total distance from high to low. It does not matter that the first bin happens to cover 8 seconds and the second only 2; what matters is that the distance from high to low never goes past 7, the fixed bin size. The first bin could just as well have covered 5 seconds instead of 8, and the next one 200 instead of 2. So the first few rows of the result in dask would look something like this.
First Time, Last Time, High Level, Low Level, First Level, Last Level, Spread
12:00:00, 12:00:07, 13, 6, 10, 13, 1
12:00:07, 12:00:09, 14, 7, 13, 7, 1
12:00:09, X, 13, 7, X, X, X
How could this be aggregated in dask with a fixed window on level, moving forward in time and starting a new bin each time the level moves above or below the current bin's allowed high/low range of X?
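
Since each bin boundary depends on everything seen so far, this is an inherently sequential rule and does not map neatly onto dask's partition-wise operations; as a starting point, here is a plain pandas sketch of the rule as I read it (a new bin starts whenever the high-low span of level would exceed the fixed range). The column names and the exact boundary handling are assumptions, and the bin edges it produces may not line up exactly with the expected output above:

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'time': ['12:00:00', '12:00:01', '12:00:02', '12:00:03', '12:00:04', '12:00:05',
             '12:00:06', '12:00:07', '12:00:08', '12:00:09', '12:00:10', '12:00:11'],
    'level': [10, 11, 12, 11, 9, 6, 10, 14, 11, 7, 13, 8],
    'spread': [1] * 12,
})

def bin_by_level_range(levels, max_range=7):
    # assign a bin id that increases whenever the running high-low span of
    # 'level' within the current bin would exceed max_range
    bin_ids, bin_id, lo, hi = [], 0, None, None
    for level in levels:
        lo = level if lo is None else min(lo, level)
        hi = level if hi is None else max(hi, level)
        if hi - lo > max_range:        # this row breaks the range: it opens a new bin
            bin_id += 1
            lo = hi = level
        bin_ids.append(bin_id)
    return bin_ids

df['bin'] = bin_by_level_range(df['level'])
summary = df.groupby('bin').agg(
    first_time=('time', 'first'),
    last_time=('time', 'last'),
    high_level=('level', 'max'),
    low_level=('level', 'min'),
    first_level=('level', 'first'),
    last_level=('level', 'last'),
    spread=('spread', 'last'),
)
print(summary)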