Hive column transpose

I have 5 columns in a table in Hive, as shown below:
5314|43045.9637152778 0 app_3 1 app_other
5314|43045.9637152778 0 app_9 1 app_other
5314|43045.9637152778 0 app_18 1 app_other
5314|43045.9637152778 0 app_2 1 app_other
5314|43045.9637152778 0 app_12 1 app_other
I would like to convert the above data into the following output with headers:
id app_3 app_9 app_18 app_2 app_12 app_other
5314|43045.9637152778 0 0 0 0 0 1
I want the col3 value as the column name and the respective col2 value as the value of that column. Could you please help me with how to do this?
Thanks.
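A common approach in Hive is conditional aggregation: group by the id column and build one CASE expression per output column. A minimal sketch, assuming the table is named src and its columns are named col1 through col5 as described (both are placeholder names):
SELECT col1 AS id,
       MAX(CASE WHEN col3 = 'app_3'  THEN col2 END) AS app_3,
       MAX(CASE WHEN col3 = 'app_9'  THEN col2 END) AS app_9,
       MAX(CASE WHEN col3 = 'app_18' THEN col2 END) AS app_18,
       MAX(CASE WHEN col3 = 'app_2'  THEN col2 END) AS app_2,
       MAX(CASE WHEN col3 = 'app_12' THEN col2 END) AS app_12,
       MAX(CASE WHEN col5 = 'app_other' THEN col4 END) AS app_other
FROM src
GROUP BY col1;
Hive has no dynamic pivot, so each output column must be spelled out; if the set of app names is open-ended, the query usually has to be generated outside Hive.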

How to make a new 0 and 1 column?

I have a pandas data frame and I want to make a new column with 0s and 1s:
if col1 is zero and col2 is positive, set the new column to 1.
if col1 is zero and col2 is negative, set the new column to 0.
if col1 is 1 and col2 is positive, set the new column to 0.
if col1 is 1 and col2 is negative, set the new column to 1.
col1 col2
0 2
0 -4
1 -2
1 5
1 9
new_column
1
0
1
0
0
You can determine if col2 is positive and get the absolute difference with col1 (booleans behave like 0/1):
df['new_column'] = df['col1'].sub(df['col2'].gt(0)).abs()
Or compare the two directly; you want them to be different:
df['new_column'] = df['col1'].ne(df['col2'].gt(0)).astype(int)
output:
col1 col2 new_column
0 0 2 1
1 0 -4 0
2 1 -2 1
3 1 5 0
4 1 9 0
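The rule is effectively "output 1 exactly when col1 disagrees with the sign of col2", so an equivalent spelling with numpy (a sketch, using the same df as above) is:
import numpy as np
df['new_column'] = np.where(df['col1'].eq(df['col2'].gt(0)), 0, 1)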

Convert duplicate rows into one with different values

I'm trying to find a solution to rearrange my data frame. Currently more than half of the rows are duplicates of a single object, and I would like to combine them into one. A fraction of my dataset is shown below:
#NAME Sample1 Sample2 Sample3 sample4 Sample5
AAC(6')-Ib7 5 0 0 0 0
AAC(6')-Ib7 0 3 0 0 25
AAC(6')-Ib7 0 0 0 0 0
AAC(6')-Ib7 0 0 0 10 0
AAC(6')-Ib7 0 0 0 0 0
And I would like to have the output:
#NAME Sample1 Sample2 Sample3 sample4 Sample5
AAC(6')-Ib7 5 3 0 10 25
Can you give me any tips on how I can rearrange it?
My original dataset has more than 7000 rows, but most are duplicates (it should have around 800 unique rows). Do I have to do it for each value separately?
I will appreciate any help!
Thank you.
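One way to collapse the duplicates in pandas, assuming the table is already loaded into a DataFrame df with '#NAME' as a regular column (a sketch, not tested against the full dataset):
import pandas as pd
# Collapse duplicate rows by taking the column-wise maximum per name;
# since at most one duplicate row is non-zero per sample, max() recovers it.
combined = df.groupby('#NAME', as_index=False).max()
If the same sample column can be non-zero in several duplicate rows and the values should be added up, use .sum() instead of .max(). This runs once over all 7000+ rows, so there is no need to handle each value separately.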

Unpivot with counts in Hive (column transpose)

I have a table with data that needs to be unpivoted and aggregated into counts.
Source table:
primary_id sys_1 sys_2 sys3_ sy5 sys100
newa889 0 1 0 1 0
den7899 1 1 1 1 0
geo8988 1 1 1 1 0
atla8766 0 1 0 1 1
chic7898 0 1 0 0 1
Desired output:
sys_name count(primary_id) flag_0_or_1
sys_1 129999 0
sys_1 544545 1
sys_2 23333 0
sys_2 23322323 1
sys3_ 332233 0
sys3_ 323232 1
sy5 32332 0
sy5 32323 1
I'm looking to transpose the data and get the counts of 0s and 1s from each sys_ column.
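One way to unpivot in Hive is to explode a map literal that pairs each column name with its value, then aggregate. A sketch, assuming the source table is named src (a placeholder) and listing each sys_ column by hand:
SELECT sys_name,
       COUNT(primary_id) AS cnt,
       flag AS flag_0_or_1
FROM src
LATERAL VIEW EXPLODE(MAP(
    'sys_1',  sys_1,
    'sys_2',  sys_2,
    'sys3_',  sys3_,
    'sy5',    sy5,
    'sys100', sys100)) t AS sys_name, flag
GROUP BY sys_name, flag;
As with the pivot case, Hive has no dynamic unpivot, so every sys_ column has to be enumerated in the map.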

Pandas: create new column applying groupby values

I have a DF:
Col1 Col2 Label
0 0 5345
1 0 7574
2 0 3445
0 1 2126
1 1 4653
2 1 9566
So I'm trying to group by Col1 and Col2 to get an index value based on the Label column, like this:
df_gb = df.groupby(['Col1','Col2'])['Label'].agg(['sum', 'count'])
df_gb['sum_count'] = df_gb['sum'] / df_gb['count']
sum_count_total = df_gb['sum_count'].sum()
index = df_gb['sum_count'] / 10
Col2 Col1
0 0 2.996036
1 3.030063
2 3.038579
1 0 2.925314
1 2.951295
2 2.956083
2 0 2.875549
1 2.899254
2 2.905063
Everything so far is as I expected. But now I would like to assign this 'index' groupby result back to my original df based on those two groupby columns. If it were only one column, this works with the map() function, but not when I want to assign index values based on two columns.
df_index = df.copy()
df_index['index'] = df.groupby([]).apply(index)
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I tried with agg() and transform() but without success. Any ideas on how to proceed?
Thanks in advance.
Hristo.
I believe you need join:
a = df.join(index.rename('new'), on=['Col1','Col2'])
print (a)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
Or GroupBy.transform:
df['new']=df.groupby(['Col1','Col2'])['Label'].transform(lambda x: x.sum() / x.count()) / 10
print (df)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
And if there are no NaNs in the Label column, use the solution from Zero's suggestion, thank you:
df.groupby(['Col1','Col2'])['Label'].transform('mean') / 10
If you need to count only non-NaN values with count, use the solution with transform.

Calculating a ratio value within a line which contains binary numbers "0" & "1"

I have a data file which contains more than 2000 lines and 45001 columns.
The first column is actually a "string" which explains the data type.
Starting from column #2, up to column #45001, the data is represented as
"1"
or
"0"
For example, the pattern of data in a line is
(0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0)
The total number of data points is 25. Within this data line, there are 5 sub-groups made up of only "1"s, e.g. (11 111 1111 1 111). The "0"s in between the sub-groups act as delimiters. The total of all "1"s is 13.
I would like to calculate the ratio of
(total of all "1"s / total of number of sub-groups made only by "1"s)
That is
(13/5).
I tried this code for calculating the total of all "1"s:
awk -F '0' '{print NF}' < inputfile.in
This gives the value 13.
But I don't know how to go further from here to calculate the ratio that I want.
I don't know how to find the number of sub-groups within each line, because the number of occurrences of "1"s and "0"s is random.
I wish to get some kind help to sort out this problem.
I appreciate any help in advance.
It is not clear to me from the description what the format of the input file is. Assume the input looks like:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
To count up the number of ones and the number of groups of ones and take their ratio:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; print s1/s2}' file
2.6
Update: Handling all zeros
Suppose one of the lines in the file has all zeros:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
For the second line, both sums are zero, which would lead to a divide-by-zero error. We can avoid that by adding an if statement which prints the ratio if one exists, or 0/0 if it doesn't:
if (s2>0)print s1/s2; else print s1"/"s2
The complete code is now:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; if (s2>0)print s1/s2; else print s1"/"s2}' file
2.6
0/0
How it works
The code uses three variables. f is a flag which is true (1) if we are currently in a group of ones and false (0) otherwise. s1 is the number of ones on the line. s2 is the number of groups of ones on the line.
f=0;s1=0;s2=0
At the beginning of each line, we initialize the variables.
for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}
We loop over each field on the line starting with field 2. If the field contains a 1, we increment counter s1. If the field is 1 and is the start of a new group, we increment s2.
if (s2>0)print s1/s2; else print s1"/"s2
If we encountered at least one 1, we print the ratio s1/s2. Otherwise, we print 0/0.
Here is an awk that does what you need:
cat file
data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
data 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
data 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BMR_10#O24-BMR_6#O13-H13 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
data 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
awk '{$1="";$0="0 "$0" 0";t=split($0,b,"1")-1;gsub(/ +/,"");n=split($0,a,"[^1]+")-2;print (n?t/n:0)}' file
2.6
0
25
11
5.5
3
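For readability, the same one-liner can be expanded with comments (functionally identical, assuming the labelled input format shown above):
awk '{
    $1 = ""                        # drop the label column
    $0 = "0 " $0 " 0"              # pad with zeros so every run of 1s is enclosed
    t = split($0, b, "1") - 1      # count of "1" characters = total number of ones
    gsub(/ +/, "")                 # strip spaces so runs of 1s become contiguous
    n = split($0, a, "[^1]+") - 2  # fields between non-1 runs = number of sub-groups
    print (n ? t/n : 0)            # ratio, or 0 when the line has no ones
}' file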