Is there a way to count the individual instances of an event per year? - apache-pig

I am working with Apache Pig to get an understanding of working with large databases. The specific problem is: I need to count the number of days per year, for all years listed in the dataset, on which the recorded maximum temperature was above 80 degrees.
The data is set up in the following manner.
Date, Max Temp
1919-06-03, 36
1919-11-26, 91
1927-09-23, 61
This repeats every day for about 200 years.
Currently, I know that to make this more manageable I will be using the SPLIT operator to divide the data set based on the temperature being above 80 degrees (SPLIT needs at least two output relations, hence the OTHERWISE branch):
SPLIT data INTO max_above_80 IF max_t > 80, max_below_80 OTHERWISE;
I also figured that if I could extract the year from the date, I could group by it after splitting and then count. However, I could not find a method to pull the year portion out of the date.
In the end I need the output to give each year and the number of occurrences for that year, such as the following:
(1993, 21)
(1994, 7)
(1995, 13)

Use FILTER, then extract the year, group by year, and count the occurrences.
B = FILTER A BY max_t > 80;
-- if Date is loaded as a chararray, parse it with ToDate before calling GetYear
C = FOREACH B GENERATE Date, GetYear(ToDate(Date, 'yyyy-MM-dd')) AS Year, max_t;
D = GROUP C BY Year;
E = FOREACH D GENERATE FLATTEN(group) AS Year, COUNT(C.max_t) AS Occurrences;
DUMP E;
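For comparison, the same count is straightforward in pandas; this is a minimal sketch, not part of the original answer, assuming the data lives in a headerless CSV with the two columns shown above (the file name temps.csv is made up):
import pandas as pd

# skipinitialspace handles the space after the comma in the sample rows
df = pd.read_csv("temps.csv", names=["date", "max_t"],
                 parse_dates=["date"], skipinitialspace=True)

# keep days above 80 degrees, then count occurrences per year
hot_days = df[df["max_t"] > 80]
per_year = hot_days.groupby(hot_days["date"].dt.year).size()
print(per_year)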

Related

Python dataframe grouping

I'm trying to compute the average movie ratings for the following time intervals during which the movies were released: (a) 1970 to 1979, (b) 1980 to 1989, etc., and I wonder what I did wrong here, since I'm new to data science.
EDIT
Since the dataset has no year column, I extract the release year embedded in the title column and assign it as a new column:
year = df['title'].str.findall(r'\((\d{4})\)').str.get(0)
year_df = df.assign(year = year.values)
Because there are some strings in the column, I convert the entire "year" column to int. Then I use groupby to group the years into 10-year intervals:
year_df['year'] = year_df['year'].astype(int)
year_df = year_df.groupby(year_df.year // 10 * 10)
After that, I want to assign each year group a 10-year interval label:
year_desc = {
    1910: "1910 – 1919", 1920: "1920 – 1929", 1930: "1930 – 1939",
    1940: "1940 – 1949", 1950: "1950 – 1959", 1960: "1960 – 1969",
    1970: "1970 – 1979", 1980: "1980 – 1989", 1990: "1990 – 1999",
    2000: "2000 – 2009",
}
year_df['year'] = [year_desc[x] for x in year_df['year']]
When I run my code after trying to assign the year group, I get an error stating:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
UPDATE:
I tried to follow @ozacha's suggestion and I'm still experiencing an error, but this time it is:
'SeriesGroupBy' object has no attribute 'map'
Ad 1) Your year_df already has a year column, so there is no need to recreate it using df.assign(). .assign() is an alternative way of (re)defining columns in a dataframe.
Ad 2) Not sure what your test_group is, so it is difficult to tell where the error comes from. However, I believe this is what you want – using pd.Series.map:
year_df = ...
year_df['year'] = year_df['year'].astype(int)
year_desc = {...}
year_df['year_group'] = year_df['year'].map(year_desc)
Alternatively, you can also generate year groups dynamically:
year_df['year_group'] = year_df['year'].map(lambda year: f"{year} – {year + 9}")
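To make this concrete, here is a minimal, self-contained sketch (with made-up movie titles standing in for the asker's data) showing the extraction and the mapping end to end:
import pandas as pd

# made-up titles; the real data has the same "(YYYY)" pattern
df = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1986)"]})

# same extraction as in the question, then convert to int
df["year"] = df["title"].str.findall(r"\((\d{4})\)").str.get(0).astype(int)

# decade label via pd.Series.map with a lambda, as suggested above
df["year_group"] = df["year"].map(lambda y: f"{y // 10 * 10} – {y // 10 * 10 + 9}")
print(df[["year", "year_group"]])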

How can I optimize my for loop in order to be able to run it on a 320,000-row DataFrame?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses at a given match, taking all previous matches into account.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it's a computation-time problem.
I also tried to first use groupby("clubid") and then run my for loop within every group, but I got lost.
Something else that bothers me: I have at least two lines with the exact same date/hour, because there are at least two identical dates for one match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "date":     [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
    "othercol": ["a", "b", "b", "b", "b", "b", "b", "b",
                 "b", "b", "b", "b", "b", "b", "b"],
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
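If you actually need the per-match running totals that the original loop computes (rather than one total per club), here is a vectorized sketch in the same spirit, assuming the date column fully orders each club's matches (ties on dates may count slightly differently than the loop's >= comparison):
# sort so cumulative counts follow match order within each club
dat = dat.sort_values(["clubid", "date"])

grp = dat.groupby("clubid")["win_bool"]
dat["NW_tot"] = grp.cumsum()                        # wins up to and including this match
dat["NL_tot"] = grp.cumcount() + 1 - dat["NW_tot"]  # matches played so far minus wins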

Pandas sequential count of members within a group and the sum

If I want to have a sequential count within a group I can do something like
df['GID'] = df.groupby(['G_COL1','G_COL2']).cumcount()
I cannot, however, figure out how to generate a column that contains the total number of values within the group. So if a group had three members, df['GID'] would contain 0, 1, and 2, and df['COUNT'] would contain the value 3 for each of the three members.
df["count_zeros"] = pd.DataFrame((df["GID"]==0)).cumsum()
df["COUNT"] = df.groupby("count_zeros").transform(lambda x: len(x))["GID"]
I think the above gives what you want. The GID column restarts from zero whenever a new group begins, so the cumulative sum of those zeros labels each group, and transform with len then returns each group's size.
As Scott Boston commented,
df["COUNT"] = df.groupby("count_zeros")['GID'].transform('count')
works and looks great :)
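Note that you can also skip the count_zeros helper entirely and ask the original grouping for its size directly; a minimal sketch with made-up column values:
import pandas as pd

df = pd.DataFrame({
    "G_COL1": ["x", "x", "x", "y", "y"],
    "G_COL2": [1, 1, 1, 2, 2],
})

g = df.groupby(["G_COL1", "G_COL2"])
df["GID"] = g.cumcount()                     # 0, 1, 2 within the first group
df["COUNT"] = g["G_COL2"].transform("size")  # group size repeated for every member
print(df)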

Combine data of multiple rows from multiple tables in a single row and show multiple rows of data based on input

I have a problem making a data table that incorporates data from two other data tables, depending on the input in the input sheet.
These are my sheets:
sheet 1) Data table 1
sheet 2) Data table 2
sheet 3) Input sheet:
In this sheet one fills in the origin, destination, and month.
sheet 4) Output sheet:
Row(s) with characteristics that are a combination of the data in data table 1 and data table 2: 1 column for each characteristic in the row:
(General; Month; Origin; feature 1; feature 2; month max; month min; Transit point; feature 1; feature 2; feature 3; month max; month min; Destination; feature 1; feature 2; month max; month min;) => feature 3 of origin and destination don't have to be incorporated in the output!
Depending on the month, origin, and destination filled in on the input sheet, the output has to list all the possible rows (routes) with that origin and destination, plus the temperatures in that month at the origin, transit point, and destination.
I have tried VLOOKUP(MATCH), but that only helps for one row, not if I want to list all possible rows.
I don't think this problem is that difficult, but I am really a rookie in Excel. Maybe it could work with a simple macro.
I'm a little unclear about some of your question, but perhaps you could adapt this solution to work for you?
http://thinketg.com/how-to-return-multiple-match-values-in-excel-using-index-match-or-vlookup/
I think this is what you want.
ColA  ColB
a     1
b     2
c     3
a     4
b     5
c     6
a     7
b     8
      9
      10
      11
      7
      8
      9
      9
      16
      17
      18
      19
      20
In Cell E1, enter c (this is the value you are looking up).
In Cell F1, enter the function below and hit Ctrl+Shift+Enter.
=IF(ROWS(B$1:B1)<=COUNTIF($A$1:$A$20,$E$1),INDEX($B$1:$B$20,SMALL(IF($A$1:$A$20=$E$1,ROW($A$1:$A$20)-ROW($E$1)+1),ROWS(B$1:B1))),"")
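For readers who find the array formula opaque, here is the same multi-match lookup expressed in pandas; a sketch for illustration only, not part of the original Excel answer:
import pandas as pd

# the example table from the answer (rows 9-20 left out for brevity)
df = pd.DataFrame({"ColA": ["a", "b", "c", "a", "b", "c", "a", "b"],
                   "ColB": [1, 2, 3, 4, 5, 6, 7, 8]})

# every ColB value whose ColA equals the lookup key, i.e. the same result
# the INDEX/SMALL/IF array formula returns one row at a time
matches = df.loc[df["ColA"] == "c", "ColB"].tolist()
print(matches)  # [3, 6]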

Looping through variables in SPSS

I'm looking for a way to loop through variables (e.g. week01 to week52) and count the number of times the value changes across them. For example:
week01 to week18 may be coded as 1
week19 to week40 may be coded as 4
and week41 to week52 may be coded as 3
That would be 2 transitions within the data.
How could I go about writing code that finds this information? I'm rather new to this, and some help getting me in the right direction would be much appreciated.
You can use the DO REPEAT command to loop through variable lists. Below is an example that uses this command to pair each week with the week before it and increment a count variable whenever the two variables differ.
data list fixed / observation (A1).
begin data
1
2
3
4
5
end data.
*making random data.
vector week(52).
do repeat week = week1 to week52.
compute week = RND(RV.UNIFORM(0.5,4.4)).
end repeat.
execute.
*initialize count to zero.
compute count = 0.
do repeat week_after = week2 to week52 / week_before = week1 to week51.
if week_after <> week_before count = count + 1.
end repeat.
execute.
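For what it's worth, the same transition count is a one-liner in pandas; a hypothetical sketch with made-up data, in case the dataset ever ends up in Python:
import numpy as np
import pandas as pd

# mimic the SPSS example: 5 cases x 52 weekly codes drawn from 1-4
rng = np.random.default_rng(0)
weeks = pd.DataFrame(rng.integers(1, 5, size=(5, 52)),
                     columns=[f"week{i}" for i in range(1, 53)])

# compare each week with the previous one; the first column has no
# predecessor, so drop it before summing the changes per row
transitions = weeks.ne(weeks.shift(axis=1)).iloc[:, 1:].sum(axis=1)
print(transitions)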