A different merge - pandas

So I have two tables, and these are samples of them:
df1:

Element          Range       Family
Ae_aag2/0013F    5-2500      Chuviridae
Ae_aag2/0014F    300-2100    Flaviviridae

df2:

Element    Range        Family
0012F      30-720       Chuviridae
0013F      23-1200      Chuviridae
0013F      1300-2610    Xinmoviridae
And I need to join the tables with the following logic:
Element_df1      Element_df2                        Family_df1    Family_df2
Ae_aag2/0013F    "0013F:23-1200,0013F:1300-2610"    Chuviridae    "Chuviridae,Xinmoviridae"
I need the rows that share a common Element between the two dataframes combined into one line, keeping the Element and the Family from both the first and the second dataframe. If three elements are common between the two dfs, the three should be joined into one single line.
I tried using merge in pandas, but it gives me two lines instead of the one I need.
I searched and didn't find a way to customize how the two dataframes are merged. I tried using groupby afterwards, but that kind of made it worse :(
Unfortunately I don't have much experience working with pandas, so please be kind; I'm new to the subject.

Use:
df1.drop(columns='Range').merge(
    # collapse df2 per element: concatenate "Element:Range" strings and the Families
    df2.assign(group=lambda d: d['Element'],
               Element=lambda d: d['Element'] + ':' + d['Range'])
       .groupby('group')[['Element', 'Family']].agg(','.join),
    # match df1's "Ae_aag2/0013F" to df2's "0013F" by taking the part after the "/"
    left_on=df1['Element'].str.extract('/(.*)$', expand=False),
    right_index=True, suffixes=('_df1', '_df2')
)  # .drop(columns='key_0')  # uncomment to remove the key
Output:
   key_0    Element_df1  Family_df1                    Element_df2               Family_df2
0  0013F  Ae_aag2/0013F  Chuviridae  0013F:23-1200,0013F:1300-2610  Chuviridae,Xinmoviridae
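
In case you want to reproduce this, a minimal sketch of the two sample frames rebuilt from the tables in the question (names and values taken from there):

import pandas as pd

# sample frames rebuilt from the question's tables
df1 = pd.DataFrame({'Element': ['Ae_aag2/0013F', 'Ae_aag2/0014F'],
                    'Range': ['5-2500', '300-2100'],
                    'Family': ['Chuviridae', 'Flaviviridae']})
df2 = pd.DataFrame({'Element': ['0012F', '0013F', '0013F'],
                    'Range': ['30-720', '23-1200', '1300-2610'],
                    'Family': ['Chuviridae', 'Chuviridae', 'Xinmoviridae']})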

Related

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the % of null values each column possesses in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
But I did not understand how this code works; can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements along a given axis, either by row or by column. In this particular case, xyz.drop(..., 1) means you're dropping along axis 1, i.e., columns.
xyz.loc[:, ... ].columns
will return the column names (an Index, which behaves like a list) resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction is counting the number of nulls in each column and normalizing by the number of rows, effectively computing the percentage of NaN in each column. The amount is then rounded to 2 decimal places, and finally you get True if the percentage of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you're first producing a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you apply Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, using .columns, you recover the names of those columns, which are then passed to the .drop instruction.
Hopefully this clears things up!
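A minimal sketch of those steps on a tiny made-up frame, so you can see each intermediate result (the data here is hypothetical):

import numpy as np
import pandas as pd

# hypothetical frame: 'ghi' is 75% null, 'abc' is 25% null
xyz = pd.DataFrame({'abc': [1, 2, np.nan, 4],
                    'ghi': [np.nan, np.nan, np.nan, 1]})

pct_null = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)
print(pct_null > 70)                      # abc: False, ghi: True  <- Boolean mask per column
print(xyz.loc[:, pct_null > 70].columns)  # Index(['ghi'])          <- columns to drop
print(xyz.drop(xyz.loc[:, pct_null > 70].columns, axis=1))  # only 'abc' remains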
If that code is hard to understand, you can just use dropna with thresh, since pandas already covers this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))
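Note that thresh is the minimum number of non-NA values a column must have to be kept, so keeping at least ~30% non-null values corresponds to dropping columns that are more than ~70% null. A quick sketch on made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'abc': [1, 2, np.nan, 4, 5],                  # 20% null -> kept
                   'ghi': [np.nan, np.nan, np.nan, np.nan, 1]})  # 80% null -> dropped

print(df.dropna(axis=1, thresh=round(len(df) * 0.3)).columns.tolist())  # ['abc']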

Using pandas to join on multiple soft keys and multiple hard keys with different names

Is it possible to use pandas to join on multiple soft keys, e.g. when we allow a tolerance range for a match, and on multiple hard keys that are named differently in the two tables?
It seems that pandas.merge_asof only allows joining on one soft key, and it does not allow specifying hard key names separately for the left and right tables (in case they are named differently and renaming isn't easy to process).
Consider the following two datasets
table1:
soft keys: sk1, sk2
hard keys: x, y
sk1,sk2,x,y,val1
10,100,10,15,1
20,200,20,25,2
30,300,10,10,3
table2:
soft keys: sk1,sk2
hard keys: k1,k2
sk1,sk2,k1,k2,val2,x,y
15,110,10,15,3,1,1
23,230,20,25,5,2,2
34,330,10,10,-1,3,3
I would need something equivalent to
soft_merge(t1, t2, left_by=["x","y"], right_by=["k1","k2"], on=[sk1, sk2], tolerance=[5,15])
to get the following output (showing only the val columns, for clarity):
val1 | val2
1 | 3
I understand that instead of left_by and right_by for the hard keys we could just use by and rename the columns, but that might not be easy to support in a system where other components rely on the old names. Is there a clean and nice way of achieving this without multiple rounds of renaming?
But the problem of joining on multiple soft keys still remains unclear...
Implement the tolerances after an exact merge:
m = df1.merge(df2, left_on=["x","y"], right_on=["k1","k2"])
mask = (m.sk1_x - m.sk1_y).abs().le(5) & (m.sk2_x - m.sk2_y).abs().le(15)
m.loc[mask, ['val1', 'val2']]
#    val1  val2
# 0     1     3
This doesn't ensure a 1:1 merge, and will give all combinations that achieve that tolerance. If you need the "nearest" match you will need to specify some distance formula and keep only the closest. Here I use the total absolute distance. Assuming val1 is a unique key:
m['dist'] = (m.sk1_x - m.sk1_y).abs() + (m.sk2_x - m.sk2_y).abs()
m.sort_values('dist').loc[mask].drop_duplicates('val1')
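For completeness, a minimal sketch that rebuilds the sample tables from the question (dropping table2's trailing x/y columns, which aren't needed for the join) and applies the approach above:

import pandas as pd

# sample tables from the question (table2's trailing x/y columns omitted)
df1 = pd.DataFrame({'sk1': [10, 20, 30], 'sk2': [100, 200, 300],
                    'x': [10, 20, 10], 'y': [15, 25, 10], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'sk1': [15, 23, 34], 'sk2': [110, 230, 330],
                    'k1': [10, 20, 10], 'k2': [15, 25, 10], 'val2': [3, 5, -1]})

m = df1.merge(df2, left_on=['x', 'y'], right_on=['k1', 'k2'])
mask = (m.sk1_x - m.sk1_y).abs().le(5) & (m.sk2_x - m.sk2_y).abs().le(15)
print(m.loc[mask, ['val1', 'val2']])
#    val1  val2
# 0     1     3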

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe has columns including U, V, R, I, NED and z, and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then z_pred = 10**((mag-c)/m) in another new column (c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row-wise is usually done with numpy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: I'm assuming here that when any of the columns contains -999, the value is not calculated and -999 is returned instead.
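Extending the same pattern to the z_pred column from the question, a small sketch with made-up data and placeholder values for c and m (both hypothetical here):

import numpy as np
import pandas as pd

# made-up magnitudes; -999 marks missing values, c and m are placeholders
df = pd.DataFrame({'U': [18.1, -999, 17.3],
                   'V': [17.5, 16.9, -999],
                   'R': [16.8, 16.2, 15.9],
                   'I': [16.1, 15.7, 15.2]})
c, m = 0.5, 2.0

valid = ~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1)
df['mag'] = np.where(valid, (df['U'] - df['V']) / (df['R'] - df['I']), -999)
df['z_pred'] = np.where(valid, 10 ** ((df['mag'] - c) / m), -999)
print(df[['mag', 'z_pred']])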

Organizing data (pandas dataframe)

I have data in the following form:
product/productId B000EVS4TY
1 product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2 product/price unknown
3 review/userId A2SRVDDDOQ8QJL
4 review/profileName MJ23447
5 review/helpfulness 2/4
6 review/score 4.0
7 review/time 1206576000
8 review/summary Delicious cookie mix
9 review/text I thought it was funny that I bought this pro...
10 product/productId B0000DF3IX
11 product/title Paprika Hungarian Sweet
12 product/price unknown
13 review/userId A244MHL2UN2EYL
14 review/profileName P. J. Whiting "book cook"
15 review/helpfulness 0/0
16 review/score 5.0
17 review/time 1127088000
I want to convert it to a dataframe such that the entries in the 1st column
product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text
are the column headers with the values arranged corresponding to each header in the table.
I still had a tiny doubt about your file, but since both my suggestions are quite similar, I will try to address both the scenarios you might have.
In case your file doesn't actually have the line numbers inside of it, this should do it:
filepath = "./untitled.txt" # you need to change this to your file path
column_separator="\s{3,}" # we'll use a regex, I explain some caveats of this below...
# engine='python' surpresses a warning by pandas
# header=None is that so all lines are considered 'data'
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)
df = df.set_index(0) # this takes column '0' and uses it as the dataframe index
df = df.T # this makes the data look like you were asking (goes from multiple rows+1column to multiple columns+1 row)
df = df.reset_index(drop=True) # this is just so the first row starts at index '0' instead of '1'
# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
If you do have line numbers, then we just need to make some small adjustments:
filepath = "./untitled1.txt"
column_separator="\s{3,}"
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df.set_index(1).T.reset_index(drop=True) #I did all the 3 steps in 1 line, for brevity
In this last case, I would advise you to change the export so that line numbers appear on all lines (in the example you provided, the numbering starts at the second line; this might be an option in how you handle headers when exporting the data in whatever tool you are using).
Regarding the regex, the caveat is that "\s{3,}" looks for any block of 3 or more consecutive whitespace characters to determine the column separator. The problem here is that we depend a bit on the data to find the columns. For instance, if 3 consecutive spaces happen to appear inside any of the values, pandas will raise an exception, since that line will have one more column than the others. One solution to this could be increasing it to another 'appropriate' number, but then we still depend on the data (for instance, with more than 3, in your example, "review/text" would have enough spaces for two columns to be identified).
edit after realising what you meant by "stacked"
Whatever "line-number scenario" you have, you'll need to make sure you always have the same number of columns for all registers and reshape the continuous dataframe with something similar to this:
number_of_columns = 10  # you'll need to make sure all "registers" have the same number of columns, otherwise this will break
new_shape = (-1, number_of_columns)  # this tuple means "however many lines there are" by 10 columns
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
                        columns=df.columns.tolist()[:number_of_columns])
Again, make sure that all registers have the same number of columns (for instance, a file with just the data you provided, assuming 10 columns per register, wouldn't work, since the second register is missing the last two fields). Also, this solution assumes every register repeats the same column names in the same order.
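To make the "stacked" case concrete, here is a small self-contained sketch that uses an in-memory frame instead of a file and just 2 fields per register (both assumptions, only to keep it short):

import pandas as pd

# column 0 holds the repeating field names, column 1 the values
raw = pd.DataFrame({0: ['product/productId', 'product/title',
                        'product/productId', 'product/title'],
                    1: ['B000EVS4TY', 'Arrowhead Mills Cookie Mix',
                        'B0000DF3IX', 'Paprika Hungarian Sweet']})

df = raw.set_index(0).T              # 1 row, one column per stacked field
number_of_columns = 2
new_shape = (-1, number_of_columns)  # "however many registers" by 2 columns
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
                        columns=df.columns.tolist()[:number_of_columns])
print(final_df)                      # one row per register, field names as headers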

Dividing values from 2 different datasets

I am trying to divide 2 different fields from 2 different datasets. I'm also using a lookup in the statement, but for some reason it does the lookup part of the expression and not the division part. Any ideas?
=IIF(Fields!PACKSHORT_DESC.Value = "EA",(LOOKUP(TRIM(Fields!PRODUCT_CODE.value), TRIM(Fields!item.value),Fields!tcost.value,"Cost")/Fields!NO_OF_EACHES.Value),(LOOKUP(TRIM(Fields!PRODUCT_CODE.value), TRIM(Fields!item.value),Fields!tcost.value,"Cost")))
Get it to output the two numbers you are trying to divide first, to see whether they are pulling through correctly; assign them names and then divide those instead.