Not able to split chararray field containing spaces and tabs between the words. Help me with the command using Apache Pig? - apache-pig

Sample.txt File
2017-01-01 10:21:59 THURSDAY -39 3 Pick up a bus - Travel for two hours
2017-02-01 12:45:19 FRIDAY -55 8 Pick up a train - Travel for one hour
2017-03-01 11:35:49 SUNDAY -55 8 Pick up a train - Travel for one hour
.
.
.
When I ran the suggested command, the line was split into three fields.
But when I run the operation below, it does not work as expected.
A = LOAD 'Sample.txt' USING PigStorage() as (line:chararray);
B = foreach A generate STRSPLIT(line, ' ', 3);
C = foreach B generate $2;
split C into buslog if $0 matches '.*bus*.', trainlog if $0 matches '.*train*.';
Note: A dump of C gives the result below.
THURSDAY -39 3 Pick up a bus - Travel for two hours
FRIDAY -55 8 Pick up a train - Travel for one hour
SUNDAY -55 8 Pick up a train - Travel for one hour
Requirement: From the above result, I want to split the train and bus records into two relations, but it is not working as expected.

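The syntax is .*string.* with .* on both sides of the string. Your pattern '.*bus*.' instead means "bu", then zero or more "s", then exactly one more character. Because matches has to cover the whole field, that pattern fails on these lines.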
split C into buslog if $0 matches '.*bus.*', trainlog if $0 matches '.*train.*';
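The difference between the two patterns can be checked outside Pig. The following is a minimal Python sketch, assuming that re.fullmatch behaves like Pig's matches for these simple patterns (both require the whole string to match). The helper name split_by_pattern is hypothetical, and the records are copied from the dump of C above.

```python
import re

# Values of $0 in relation C, copied from the dump in the question.
records = [
    "THURSDAY -39 3 Pick up a bus - Travel for two hours",
    "FRIDAY -55 8 Pick up a train - Travel for one hour",
    "SUNDAY -55 8 Pick up a train - Travel for one hour",
]

def split_by_pattern(rows, pattern):
    """Keep the rows that the pattern matches in full (hypothetical helper)."""
    return [row for row in rows if re.fullmatch(pattern, row)]

# Original patterns: "bu", zero or more "s", then exactly one trailing character.
print(split_by_pattern(records, r".*bus*."))    # []
print(split_by_pattern(records, r".*train*."))  # []

# Corrected patterns: the word may appear anywhere in the line.
buslog = split_by_pattern(records, r".*bus.*")
trainlog = split_by_pattern(records, r".*train.*")
print(len(buslog), len(trainlog))  # 1 2
```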

Related

Pandas create graph from Date and Time while them being in different columns

My data looks like this:
Creation Day Time St1 Time St2
0 28.01.2022 14:18:00 15:12:00
1 28.01.2022 14:35:00 16:01:00
2 29.01.2022 00:07:00 03:04:00
3 30.01.2022 17:03:00 22:12:00
It represents parts being at a given station. What I now need is something that counts how many rows have the same day and hour, i.e. how many parts were at the same station during a given hour.
Here, 2 were at Station 1 on the 28th in the 14-15 timespan.
In the end I want a bar graph that shows production speed. Later in the project I also want to highlight parts that haven't moved for more than 2 hours.
Is it practical to create a datetime object for every Station (I have 5 in total)? Or is there a much simpler way to do this?
FYI I import this data from an excel sheet
I found the solution. As they are just strings, I can concatenate them and parse the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
    df["Creation Day"] + ' ' + df["Time St1"],
    format='%d.%m.%Y %H:%M:%S'
)
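Once the column holds datetimes, the per-hour count asked for in the question can be obtained with a groupby. This is only a sketch for Station 1 using the four sample rows. The helper frame buckets and its column names day and hour are assumptions made for illustration.

```python
import pandas as pd

# Sample rows from the question (Station 1 only).
df = pd.DataFrame({
    "Creation Day": ["28.01.2022", "28.01.2022", "29.01.2022", "30.01.2022"],
    "Time St1": ["14:18:00", "14:35:00", "00:07:00", "17:03:00"],
})

# Combine the date and time strings into one datetime Series.
st1 = pd.to_datetime(df["Creation Day"] + " " + df["Time St1"],
                     format="%d.%m.%Y %H:%M:%S")

# Count rows per (day, hour) bucket; 'day' and 'hour' are hypothetical names.
buckets = pd.DataFrame({"day": st1.dt.strftime("%Y-%m-%d"), "hour": st1.dt.hour})
counts = buckets.groupby(["day", "hour"]).size()
print(counts)
```

The resulting Series is indexed by (day, hour), so it can be passed to a bar plot directly. Repeating the same steps for each of the other station columns would give one count Series per station.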

Adding columns to pandas dataframe based on function

This extends an earlier post which, in my view, was not fully resolved by using shift. That command only moves a column by rows, which is not what we want when mapping an entire column to another, for instance.
I wish to map several columns to a new one, and add it as a column to the existing dataframe.
So the input would be something like:
DATE (MM/DD/YYYY)  HOUR-MST  AWS # 50m [m/s]  AWS # 80m [m/s]  AMPLC (2-80m)
1/1/2009           1         8.4262           9.1207           0.1354
1/1/2009           2         9.4238           9.9013           0.1312
1/1/2009           3         5.0494           5.3328           0.0892
1/1/2009           4         4.5928           4.8802           0.1014
Say the dataframe is named wind. Then, using the equations below (here AWS is the wind speed):
wind['new_column1'] = wind['AWS # 50m [m/s]']**3 * wind['AMPLC (2-80m)']
wind['new_column2'] = wind['AWS # 80m [m/s]']**3 * wind['AMPLC (2-80m)']
And the output would be:
DATE (MM/DD/YYYY)  HOUR-MST  AWS # 50m [m/s]  AWS # 80m [m/s]  AMPLC (2-80m)  new_column1  new_column2
1/1/2009           1         8.4262           9.1207           0.1354         xxx1         xxx2
1/1/2009           2         9.4238           9.9013           0.1312         xxx1         xxx2
1/1/2009           3         5.0494           5.3328           0.0892         xxx1         xxx2
1/1/2009           4         4.5928           4.8802           0.1014         xxx1         xxx2
I would also use the radius as another parameter, so I'd prefer a for loop over a range of values taken from np.arange(10, 20, 2), multiplying each one in as a coefficient.
My idea is something like the following:
count = 0
for rad in range(rad_range):
    wind['new_column1'+count] = wind['AWS # 50m [m/s]']**3 * wind['AMPLC (2-80m)'] * rad**2
    count += 1
May I get some suggestion please? Thank you in advance!
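One possible way to write that loop is sketched below. It iterates over np.arange(10, 20, 2) directly instead of range(rad_range), and builds each column name with an f-string, since concatenating a string and an integer raises a TypeError. The naming scheme new_column1_r10, new_column1_r12 and so on is an assumption for illustration, as is the small sample frame built from the rows above.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; only the columns used in the formula.
wind = pd.DataFrame({
    "AWS # 50m [m/s]": [8.4262, 9.4238, 5.0494, 4.5928],
    "AWS # 80m [m/s]": [9.1207, 9.9013, 5.3328, 4.8802],
    "AMPLC (2-80m)": [0.1354, 0.1312, 0.0892, 0.1014],
})

for rad in np.arange(10, 20, 2):  # radii 10, 12, 14, 16, 18
    # Hypothetical naming scheme: one pair of columns per radius.
    wind[f"new_column1_r{rad}"] = (
        wind["AWS # 50m [m/s]"] ** 3 * wind["AMPLC (2-80m)"] * rad ** 2
    )
    wind[f"new_column2_r{rad}"] = (
        wind["AWS # 80m [m/s]"] ** 3 * wind["AMPLC (2-80m)"] * rad ** 2
    )

print(wind.columns.tolist())
```

Each assignment works on whole columns at once, so no row-by-row loop or shift is needed.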

Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in 5-digit format: ddd + hm
The ddd part counts days starting from 1 Jan 2009. Since the data was collected over a two-year period from then, its [min, max] would be [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so each 24-hour day has up to 48 readings. So the [min, max] for hm is [1, 48].
Following is the excerpt of the daycode.csv file, which maps the ddd part of the daycode to its date and the hm part to its time.
I believe I agreed not to share the dataset, which is from ISSDA, so I will just say that a daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting this code together, which of course won't work at this point.
import pandas as pd
import matplotlib.pyplot as plt

consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8", names=['meter', 'daycode', 'val'])
df1 = pd.read_csv("data/daycode.csv", encoding="cp1252", names=['code', 'print'])
# take a copy so the column assignment below does not act on a view of consume
test = consume[consume['meter'] == 1048].copy()
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (thousands of them) have been observed for the full period, but 730 x 48 is a large combination to lay out in Excel by hand. To be honest, it is not an elegant solution, but I tried by dragging and it doesn't quite work.
If I could read the first 3 digits of the column values and match them with one column of another file, match the last 2 digits with another column, and then combine them: is there a way?
For the last two steps, you can do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
To join the two dataframes:
df3 = df.merge(df2, left_on=['first_3_digits', 'last_2_digits'], right_on=['col1_df2', 'col2_df2'], how='left')
Note that both new columns hold strings, so the matching columns in df2 must hold strings too.
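As an alternative to a lookup file, the daycode could be converted arithmetically. The sketch below rests on two assumptions that are not confirmed by the dataset description: that day 1 is 1 Jan 2009, and that slot 1 starts at 00:00 with each slot lasting 30 minutes. The function name daycode_to_datetime is hypothetical.

```python
from datetime import datetime, timedelta

START = datetime(2009, 1, 1)  # assumption: day code 1 is 1 Jan 2009

def daycode_to_datetime(code):
    """Convert a 5-digit daycode (ddd + hm) to the start of its half-hour slot."""
    text = str(code).zfill(5)  # restore leading zeros if the code was read as an int
    day, slot = int(text[:3]), int(text[-2:])
    # assumption: slot 1 starts at 00:00 and slot 48 starts at 23:30
    return START + timedelta(days=day - 1, minutes=30 * (slot - 1))

print(daycode_to_datetime("63317"))  # 2010-09-25 08:00:00
```

With pandas, such a function could be applied to the whole column through Series.map, which would give real datetimes for the x-axis of the plot.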

Find average of numbers from a specific line

I have a text file with 2 columns of numbers.
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
I would like to find the average of the 2nd column data from the 6th line. That is ( (7+8+9+10+11+12+13+14)/8 = 10.5 )
I found the post "Scripts for computing the average of a list of numbers in a data file"
and used the following:
awk '{s+=$2}END{print "ave:",s/NR}' fileName
but I get the average of the entire second column.
Any hints?
This one-liner should do:
awk -v s=6 'NR<s{next} {c++; t+=$2} END{printf "%.2f (%d samples)\n", t/c, c}' file
This awk script has three pattern/action pairs. The first is responsible for skipping every line before line s. The second executes on every remaining line (from line s onwards); it increments a counter and adds column 2 to a running total. The third runs after all data have been processed, and prints your results.
The script below should also do the job:
awk 'NR>=6{avg+=$2}END{printf "Average of field 2 starting from 6th line %.1f\n",avg/(NR-5)}' file
Output
Average of field 2 starting from 6th line 10.5
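For comparison, the same logic (skip the lines before the start line, then sum and count the rest) can be written in Python. This is a sketch rather than a replacement for the awk one-liners. The function name average_from_line and its parameters are hypothetical, and the sample data is the file content from the question.

```python
def average_from_line(lines, start=6, column=2):
    """Average one whitespace-separated column, beginning at a 1-based line number."""
    values = [
        float(line.split()[column - 1])
        for number, line in enumerate(lines, start=1)
        if number >= start and line.strip()  # skip early lines and blank lines
    ]
    return sum(values) / len(values)

sample = """10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14""".splitlines()

print(average_from_line(sample))  # 10.5
```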

Calculating Weekly Returns from Daily Time Series of Prices

I want to calculate weekly returns of a mutual fund from a time series of daily prices. My data looks like this:
A         B     C      D         E
DATE      WEEK  W.DAY  MF.PRICE  WEEKLY RETURN
02/01/12  1     1      2,7587
03/01/12  1     2      2,7667
04/01/12  1     3      2,7892
05/01/12  1     4      2,7666
06/01/12  1     5      2,7391    -0,007
09/01/12  2     1      2,7288
10/01/12  2     2      2,6707
11/01/12  2     3      2,7044
12/01/12  2     4      2,7183
13/01/12  2     5      2,7619    0,012
16/01/12  3     1      2,7470
17/01/12  3     2      2,7878
18/01/12  3     3      2,8156
19/01/12  3     4      2,8310
20/01/12  3     5      2,8760    0,047
The date is (dd/mm/yy) format and "," is decimal separator. This would be done by using this formula: (Price for last weekday - Price for first weekday)/(Price for first weekday). For example the return for the first week is (2,7391 - 2,7587)/2,7587 = -0,007 and for the second is (2,7619 - 2,7288)/2,7288 = 0,012.
The problem is that the list goes on for a year, and some weeks have fewer than five working days due to holidays or other reasons. So I can't simply copy and paste the formula above. I added the two extra columns for week number and weekday using the WEEKNUM and WEEKDAY functions, thinking they might help. I want to automate this with a formula or with VBA, and I am hoping to get a table like this:
WEEK RETURN
1 -0,007
2 0,012
3 0,047
.
.
.
As I said, some weeks have fewer than five weekdays, and some start with weekday 2 or end with weekday 3, etc. So I'm thinking of a way to tell Excel to "find the prices that correspond to the max and min weekday of each week and apply the formula (Price for last weekday - Price for first weekday)/(Price for first weekday)".
Sorry for the long post, I tried to be as clear as possible. I would appreciate any help! (I have 5 separate worksheets for consecutive years, each with daily prices of 20 mutual funds.)
To do it in one formula:
=(INDEX(D:D,AGGREGATE(15,6,ROW($D$2:$D$16)/(($C$2:$C$16=AGGREGATE(14,6,$C$2:$C$16/($B$2:$B$16=G2),1))*($B$2:$B$16=G2)),1))-INDEX(D:D,MATCH(G2,B:B,0)))/INDEX(D:D,MATCH(G2,B:B,0))
You may need to change all the , to ; per your local settings.
I would solve it using some lookup formulas to get the values for each week and then do a simple calculation for each week.
Resulting table:
H     I           J           K          L         M
week  first       last        first val  last val  return
1     02.01.2012  06.01.2012  2,7587     2,7391    -0,007
2     09.01.2012  13.01.2012  2,7288     2,7619    0,012
3     16.01.2012  20.01.2012  2,747      2,876     0,047
Formula in column I:
=MINIFS($A:$A;$B:$B;$H2)
Formula in column J:
=MAXIFS($A:$A;$B:$B;$H2)
Formula in column K:
=VLOOKUP($I2;$A:$D;4;FALSE)
Formula in column L:
=VLOOKUP($J2;$A:$D;4;FALSE)
Formula in column M:
=(L2-K2)/K2
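The first/last lookup per week can also be sketched outside Excel to check the expected numbers. The Python example below assumes the rows are already sorted by date, so the first and last price seen for each week number are the prices of its first and last trading day, however many days the week has. The function name weekly_returns is hypothetical, and the prices are the sample data with decimal points instead of commas.

```python
# (week number, price) pairs from the question, in date order.
rows = [
    (1, 2.7587), (1, 2.7667), (1, 2.7892), (1, 2.7666), (1, 2.7391),
    (2, 2.7288), (2, 2.6707), (2, 2.7044), (2, 2.7183), (2, 2.7619),
    (3, 2.7470), (3, 2.7878), (3, 2.8156), (3, 2.8310), (3, 2.8760),
]

def weekly_returns(rows):
    """Return {week: (last price - first price) / first price}."""
    first, last = {}, {}
    for week, price in rows:
        first.setdefault(week, price)  # keep only the first price seen per week
        last[week] = price             # overwrite until the last price of the week
    return {week: (last[week] - first[week]) / first[week] for week in first}

for week, ret in weekly_returns(rows).items():
    print(week, round(ret, 3))
# 1 -0.007
# 2 0.012
# 3 0.047
```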