Convert data into a specific format in Apache Pig - apache-pig

I want to convert data into a specific format in Apache Pig so that I can use a reporting tool on top of it.
For example:
10:00,abc
10:00,cde
10:01,abc
10:01,abc
10:02,def
10:03,efg
The output should be in the following format:
abc cde def efg
10:00 1 1 0 0
10:01 2 0 0 0
10:02 0 0 1 0
10:03 0 0 0 1
The main problem here is that a value can occur multiple times in a row, depending on the different values available in the sample csv file, up to a total of 120.
Any suggestions to tackle this are more than welcome.
Thanks
Gagan

Try something like the following:
A = load 'data' using PigStorage(',') as (key:chararray, value:chararray);
-- emit a 0/1 flag per value of interest; add one expression per distinct value
B = foreach A generate key, (value=='abc'?1:0) as abc, (value=='cde'?1:0) as cde, (value=='efg'?1:0) as efg;
C = group B by key;
-- SUM adds up the flags; COUNT would count every row in the group, zeros included
D = foreach C generate group as key, SUM(B.abc) as abc, SUM(B.cde) as cde, SUM(B.efg) as efg;
That should give you the number of occurrences of each value for each key.
EDIT: I just noticed the "120" part of the question. If the counts must not go above 120, add the following:
E = foreach D generate key, (abc>120?'OVER 120':(chararray)abc) as abc, (cde>120?'OVER 120':(chararray)cde) as cde, (efg>120?'OVER 120':(chararray)efg) as efg;
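With up to 120 distinct values, writing one expression per value by hand gets unwieldy. As a rough illustration of the target shape only (this is not a Pig solution), the pivot can be sketched in Python. The function name pivot_counts and the in-memory rows list are assumptions made for this example.

```python
from collections import Counter

# Sample rows from the question: (time, value)
rows = [
    ("10:00", "abc"), ("10:00", "cde"),
    ("10:01", "abc"), ("10:01", "abc"),
    ("10:02", "def"), ("10:03", "efg"),
]

def pivot_counts(rows):
    """Count each (time, value) pair and lay the counts out as a grid."""
    counts = Counter(rows)
    times = sorted({time for time, _ in rows})
    values = sorted({value for _, value in rows})  # columns are discovered, not hard-coded
    table = [[counts[(time, value)] for value in values] for time in times]
    return times, values, table

times, values, table = pivot_counts(rows)
print(" ".join(values))
for time, counts_row in zip(times, table):
    print(time, " ".join(str(count) for count in counts_row))
```

The columns come from the data itself, so the same code handles any number of distinct values.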

Related

Join cells from Pandas data frame and save as one string or .txt file

I am trying to extract data from a data frame in Pandas and merge the results into one string (or .txt file)
Data Frame:
NUM LETTER
0 4 Z
1 5 U
2 6 A
3 7 P
4 1 B
5 4 P
6 5 L
7 6 T
8 7 V
9 1 E
Script so far:
import pandas as pd
data = pd.read_csv("TEST.csv")
# keep only the rows whose LETTER matches one of A, E, L, P
fdata = data[data["LETTER"].str.contains("A|E|L|P")]
ffdata = fdata.LETTER.to_string()
print(ffdata)
Running the script on TEST.csv gives me this result:
LETTER
2 A
3 P
5 P
6 L
9 E
Next, I want to join the values of the filtered cells into one string:
--> "APPLE", optionally saving it as a .txt file to use later.
How do I proceed from here? I was thinking about iterating over the data frame and use join, but I have no idea how to implement this. Any clues?
Based on @Frodnar's answer, this is the code that works:
data = pd.read_csv("TEST.csv")
fdata = data[data["LETTER"].str.contains("A|E|L|P")]
ffdata = ''.join(fdata.LETTER.to_list())
print(ffdata)
which gives the output 'APPLE'
Thank you for your help!!
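For the optional .txt step, a minimal sketch follows. It assumes the data from the question is built in memory rather than read from TEST.csv, uses isin() instead of a regex match, and writes to a hypothetical file name (letters.txt) in the system temp directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sample frame mirroring TEST.csv from the question
data = pd.DataFrame({"NUM": [4, 5, 6, 7, 1, 4, 5, 6, 7, 1],
                     "LETTER": list("ZUAPBPLTVE")})

# isin() matches whole cell values, so no regex is involved
wanted = ["A", "E", "L", "P"]
letters = data.loc[data["LETTER"].isin(wanted), "LETTER"]

# Joining a Series of strings concatenates its values in row order
word = "".join(letters)
print(word)

# Hypothetical output location; any writable path works
out_path = Path(tempfile.gettempdir()) / "letters.txt"
out_path.write_text(word, encoding="utf-8")
```

Reading the file back later with out_path.read_text(encoding="utf-8") returns the same string.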

Reorder rows of pandas DataFrame according to a known list of values

I can think of 2 ways of doing this:
Apply df.query to match each row, then collect the index of each result
Set the column domain to be the index, and then reorder based on the index (but this would lose the index which I want, so may be trickier)
However I'm not sure these are good solutions (I may be missing something obvious)
Here's an example set up:
domain_vals = list("ABCDEF")
df_domain_vals = list("DECAFB")
df_num_vals = [0,5,10,15,20,25]
df = pd.DataFrame.from_dict({"domain": df_domain_vals, "num": df_num_vals})
This gives df:
domain num
0 D 0
1 E 5
2 C 10
3 A 15
4 F 20
5 B 25
1: Use df.query on each row
So I want to reorder the rows so that the column domain follows the order of the values in domain_vals.
A possible way to do this is to repeatedly use df.query but this seems like an un-Pythonic (un-panda-ese?) solution:
>>> pd.concat([df.query(f"domain == '{d}'") for d in domain_vals])
domain num
3 A 15
5 B 25
2 C 10
0 D 0
1 E 5
4 F 20
2: Setting the column domain as the index
reorder = df.domain.apply(lambda x: domain_vals.index(x))
df_reorder = df.set_index(reorder)
df_reorder.sort_index(inplace=True)
df_reorder.index.name = None
Again this gives
>>> df_reorder
domain num
0 A 15
1 B 25
2 C 10
3 D 0
4 E 5
5 F 20
Can anyone suggest something better (in the sense of "less of a hack")? I understand that my solution works, I just don't think that calling pandas.concat along with a list comprehension is the right approach here.
Having said that, it's shorter than the 2nd option, so I presume there must be some equally simple way I can do this with pandas methods I've overlooked?
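Another way is merge. The left frame must be built from domain_vals (the desired order); note that the merged result gets a fresh 0..n-1 index rather than keeping the original one:
(pd.DataFrame({'domain': domain_vals})
.merge(df, on='domain', how='left')
)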
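If the original index has to survive the reordering, one possible sketch is to rank each row by its position in domain_vals and reuse the sorted index. The names rank, positions and df_reordered are introduced here for illustration only, and every value in the domain column is assumed to appear in domain_vals.

```python
import pandas as pd

domain_vals = list("ABCDEF")
df = pd.DataFrame({"domain": list("DECAFB"), "num": [0, 5, 10, 15, 20, 25]})

# Position of each value in the desired order, e.g. {"A": 0, "B": 1, ...}
rank = {value: position for position, value in enumerate(domain_vals)}

# Rank every row, sort the ranks (stable sort), and reuse the resulting index order
positions = df["domain"].map(rank).sort_values(kind="mergesort")
df_reordered = df.loc[positions.index]
print(df_reordered)
```

This keeps the original row labels (3, 5, 2, 0, 1, 4 for the sample data), which the merge approach does not.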

SQL: processing every bit without running the query repeatedly

I have a column that uses bits to record the status of every mission. The position of a bit identifies the mission, while 1/0 indicates whether that mission was successful; the bits are logically independent even though they are stored together.
For instance, 1010 (stored as a decimal number) means a user finished the 2nd and 4th missions successfully, counting from the right. The table looks like:
uid status
a 1100
b 1111
c 1001
d 0100
e 0011
Now I need to calculate, for every mission, how many users passed it. E.g. for mission 1 it's 0+1+1+0+1 = 3, while for mission 2 it's 0+1+0+0+1 = 2.
I can use the formula FLOOR(status%POWER(10,n)/POWER(10,n-1)) to get the bit of every mission for every user, but this means I need to run my query n times, and the status is now 64 bits long...
Is there any elegant way to do this in one query? Any help is appreciated....
The obvious approach is to normalise your data:
uid mission status
a 1 0
a 2 0
a 3 1
a 4 1
b 1 1
b 2 1
b 3 1
b 4 1
c 1 1
c 2 0
c 3 0
c 4 1
d 1 0
d 2 0
d 3 1
d 4 0
e 1 1
e 2 1
e 3 0
e 4 0
Alternatively, you can store a bitwise integer (or just do what you're currently doing) and process the data in your application code (e.g. a bit of PHP)...
uid status
a 12
b 15
c 9
d 4
e 3
<?php
$input = 15; // value comes from a query
$missions = array(1,2,3,4); // not really necessary in this particular instance
for( $i=0; $i<4; $i++ ) {
    $intbit = pow(2,$i);
    if( $input & $intbit ) {
        echo $missions[$i] . ' ';
    }
}
?>
Outputs '1 2 3 4'
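The same application-side idea can be extended to the per-mission totals the question asks for. Below is a minimal Python sketch, assuming the statuses have already been fetched as the bitwise integers shown above; the function name count_per_mission is hypothetical.

```python
# Bitwise statuses from the table above (a=12 is binary 1100, and so on)
statuses = {"a": 12, "b": 15, "c": 9, "d": 4, "e": 3}
NUM_MISSIONS = 4

def count_per_mission(values, num_missions):
    """Return a list where entry i is the number of users who passed mission i+1."""
    counts = [0] * num_missions
    for value in values:
        for bit in range(num_missions):
            # shift the wanted bit to the lowest position and mask it out
            counts[bit] += (value >> bit) & 1
    return counts

counts = count_per_mission(statuses.values(), NUM_MISSIONS)
print(counts)  # [3, 2, 3, 3]
```

One pass over the rows is enough; raising NUM_MISSIONS to 64 does not require any additional queries.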
Just convert the value to a string, remove the '0's, and calculate the length. Assuming that the value really is a decimal:
select length(replace(cast(status as char), '0', '')) as num_missions
from t;
Here is a db<>fiddle using MySQL. Note that the conversion to a string might look a little different in Hive, but the idea is the same.
If it is stored as an integer, you can use the bin() function to convert an integer to a string. This is supported in both Hive and MySQL (the original tags on the question).
Bit fiddling in databases is usually a bad idea and suggests a poor data model. Your data should have one row per user and mission. Attempts at optimizing by stuffing things into bits may work sometimes in some programming languages, but rarely in SQL.

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get from, e.g., Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
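Selecting a single row with df.loc[label] yields a pandas Series, so Obs_1 and Obs_2 should both be Series indexed by A-E; type(Obs_1) confirms this. A minimal sketch of combining them follows. The numbers are made up for illustration, and the index labels are plain strings here, which is an assumption about the original frame.

```python
import pandas as pd

# Hypothetical data; the real frame from the question is not available
df = pd.DataFrame(
    {"A": [100, 101, 103, 104], "B": [50, 52, 53, 55], "C": [10, 10.5, 10.8, 11],
     "D": [200, 198, 202, 205], "E": [5, 5.1, 5.3, 5.2]},
    index=pd.Index(["1999-01-01", "1999-01-02", "1999-01-03", "1999-01-04"], name="Date"),
)

obs_1 = df.loc["1999-01-03"] / df.loc["1999-01-02"] - 1
obs_2 = df.loc["1999-01-04"] / df.loc["1999-01-03"] - 1
print(type(obs_1))  # a pandas Series, one entry per column label

# Put the two Series side by side; `keys` supplies the column names
df_1 = pd.concat([obs_1, obs_2], axis=1, keys=["Obs_1", "Obs_2"])
print(df_1)

# Pearson correlation between the two columns
corr = df_1["Obs_1"].corr(df_1["Obs_2"])
print(corr)
```

df_1.corr() would give the full correlation matrix instead of a single number.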

Check if value is already in a query field to change the value of another

I'll clarify this: I have a result set with the twist that the two PKs (A and B) can be the same across rows while field C differs.
Example:
A B C D
> 14 20 1 null
> 14 20 2 1
> 15 20 2 0
As you can see, D field has a null and a 0.
What I have to do is to change D's null value to 1 whenever A fields are the same, and there's more than 1 record with those, not touching the 0's in D.
I tried initially with NVLs and DECODEs, like this:
DECODE(migr.A,NULL,(NVL(C,1)),D) AS D
but I'm not getting all the records, only the D-1's.
I really don't want to relate to an extra table/step for validation, as my query result can be easily over 1 million records, but if that's the best, I'm ok.
Many thanks.
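To pin down the rule being asked for (not the SQL itself), here is a small pandas sketch on the example rows: D is set to 1 only where it is null and the row's A value occurs in more than one row. The frame name rows and the mask name shared_a are assumptions for this example.

```python
import pandas as pd

# Example rows from the question; NaN stands in for the SQL null
rows = pd.DataFrame({
    "A": [14, 14, 15],
    "B": [20, 20, 20],
    "C": [1, 2, 2],
    "D": [float("nan"), 1, 0],
})

# True for every row whose A value appears in more than one row
shared_a = rows.duplicated(subset="A", keep=False)

# Fill only the nulls in those rows; existing 0s and 1s are left alone
rows.loc[shared_a & rows["D"].isna(), "D"] = 1
print(rows)
```

In SQL, the "more than one row with the same A" test is the part that an analytic count partitioned by A would typically express, without joining to an extra table.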