Pandas: add two data frames by common variable

I have two dataframes A and B which have the same column names:
["x","0","1","2"]
I wish to add the numeric columns ["0","1","2"] of the two frames together, matched on the string column "x". What is the best way to do this?

You'll want to set the common column as the index on both frames and then add them element-wise:
A.set_index('x').add(B.set_index('x'), fill_value=0)
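A minimal runnable sketch, with made-up values for illustration:
import pandas as pd

A = pd.DataFrame({'x': ['a', 'b'], '0': [1, 2], '1': [3, 4], '2': [5, 6]})
B = pd.DataFrame({'x': ['a', 'b'], '0': [10, 20], '1': [30, 40], '2': [50, 60]})

# Align both frames on 'x', then add the numeric columns element-wise;
# fill_value=0 keeps rows whose 'x' appears in only one frame
result = A.set_index('x').add(B.set_index('x'), fill_value=0)
print(result)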


Proper way to join data based on conditions

I want to add a new column (name: conc) to a dataframe "table", using the values in its columns (plate, ab) to look up the corresponding numeric value in the dataframe "concs".
Below is what I mean, with the dataframe "exp" used to show what I expect the result to look like.
What is the proper way to do this? Is it some multiple-condition join, or do I need to reshape the concs dataframe somehow?
Use DataFrame.melt followed by a left join to create the new column concs; rows with no match get NaN:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, on=['plate', 'ab'], how='left')
The solution can be simplified: if both DataFrames share the column names 'plate' and 'ab' and you want to merge on both, you can omit the on parameter:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
First melt the concs dataframe and then merge with table:
out = (concs.melt(id_vars=['plate'],
                  value_vars=concs.columns.drop('plate').tolist(),
                  var_name='ab')
            .merge(table, on=['plate', 'ab'])
            .rename(columns={'value': 'concs'}))
or just make good use of the parameters of melt, as in jezrael's answer:
out = (concs.melt(id_vars=['plate'],
                  value_name='concs',
                  var_name='ab')
            .merge(table, on=['plate', 'ab']))
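For concreteness, a minimal sketch with made-up data; the plate names, the antibody columns ab1/ab2, and the extra sample column are hypothetical:
import pandas as pd

# Wide frame: one row per plate, one column per antibody
concs = pd.DataFrame({'plate': ['P1', 'P2'], 'ab1': [0.1, 0.2], 'ab2': [0.3, 0.4]})
# Long frame: one row per measurement
table = pd.DataFrame({'plate': ['P1', 'P1', 'P2'],
                      'ab': ['ab1', 'ab2', 'ab1'],
                      'sample': ['s1', 's2', 's3']})

# Reshape concs to long, then left-join; (P2, ab2) has no match, so sample is NaN
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
print(exp)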

Julia select a group from a grouped Data Frame

I have the following DataFrame df with 3 columns A, B, and C. I grouped it by C:
dfByC = groupby(df, [:C])
How can I select a group from dfByC for a certain value of C?
Do:
dfByC[(the_value_you_have,)]
or
dfByC[(C=the_value_you_have,)]
or
dfByC[Dict(:C => the_value_you_have)]
In essence, you can do such a selection by passing a Tuple, a NamedTuple, or a dictionary.
The reason you are not allowed to just write dfByC[the_value_you_have] is that you can also index a GroupedDataFrame by an integer, which returns the group at that position, so some wrapper is needed to disambiguate. Also, if you group by multiple columns, you need a wrapper to keep the key values together.
Group selection by grouping-variable value is also fast, so you can safely write code that performs millions of such lookups and it will be efficient.

Is there a pandas function to get the variable names in a column?

I'm thinking of a hypothetical dataframe (df) with around 50 columns and 30,000 rows, and one hypothetical column, e.g. Toy = ['Ball','Doll','Horse',...,'Sheriff', etc.].
Now I only have the name of the column (Toy), and I want to know which values appear inside the column, without duplicates.
I'm thinking of an output like that of the .describe() function:
df['Toy'].describe()
but with more info, because now I'm getting only this output
count 30904
unique 7
top "Doll"
freq 16562
Name: Toy, dtype: object
In other words, how do I get the 7 values in this column? I was thinking of something like copying the column and deleting duplicated values, but I'm pretty sure there is a shorter way. Do you know the right code, or should I use another library?
Thank you so much!
You can use the unique() function to list all the unique values in a column. In your case, to list the unique values of the column Toy in the dataframe df, the syntax would look like
df["Toy"].unique()
You can also use .drop_duplicates(), which returns a pandas Series:
df['Toy'].drop_duplicates()
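A quick runnable sketch with made-up data, showing both calls:
import pandas as pd

df = pd.DataFrame({'Toy': ['Ball', 'Doll', 'Horse', 'Doll', 'Ball']})
print(df['Toy'].unique())           # numpy array: ['Ball' 'Doll' 'Horse']
print(df['Toy'].drop_duplicates())  # pandas Series keeping the original index
print(df['Toy'].nunique())          # 3, the count of distinct values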

Can we sort multiple data frames by comparing the values of a common column?

I have two csv files with some data, and I would like to combine them and sort the data based on one common column.
Here are the data1.csv and data2.csv files:
data3.csv is the output file where I need the data to be combined and sorted, as below:
How can I achieve this?
Here's what I think you want to do:
I created two dataframes with simple types; assume the first column is like your timestamp:
import pandas as pd

df1 = pd.DataFrame([[1, 1], [2, 2], [7, 10], [8, 15]], columns=['timestamp', 'A'])
df2 = pd.DataFrame([[1, 5], [4, 7], [6, 9], [7, 11]], columns=['timestamp', 'B'])
c = df1.merge(df2, how='outer', on='timestamp')
print(c)
The outer merge causes each contributing DataFrame to be fully present in the output even if not matched to the other DataFrame.
The result is that you end up with a DataFrame with a timestamp column and the dependent data from each of the source DataFrames.
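Note that merge() defaults to sort=False, so row order is not guaranteed for an outer join; if the combined output must be sorted on the common column, sort it explicitly before saving:
# Order by the common column and write the combined file
c = c.sort_values('timestamp').reset_index(drop=True)
c.to_csv('data3.csv', index=False)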
Caveats:
You have repeating timestamps in your second sample, which I assume is because you do not show enough resolution. You would not want true duplicate records for this merge solution, since it assumes timestamps are unique.
I have not repeated the timestamp column a second time, but it is easy to add another timestamp column based on whether column A or B is notnull(), if you really need two timestamp columns. merge() also has an indicator option that records the source of each row, if you do not want to rely on columns A and B.
In the post you show two output columns named "timestamp". Generally you would not output two columns with the same name, since they would be distinguished only by position, which is not a property you should rely upon.

Add a column using another column as its index [duplicate]

Is it possible to merge only some columns? I have a DataFrame df1 with columns x, y, z, and df2 with columns x, a, b, c, d, e, f, etc.
I want to merge the two DataFrames on x, but I only want to merge columns df2.a, df2.b - not the entire DataFrame.
The result would be a DataFrame with x, y, z, a, b.
I could merge then delete the unwanted columns, but it seems like there is a better method.
You want to use TWO brackets, so if you are doing a VLOOKUP sort of action:
df = pd.merge(df,df2[['Key_Column','Target_Column']],on='Key_Column', how='left')
This will give you everything in the original df + add that one corresponding column in df2 that you want to join.
You could merge the sub-DataFrame (with just those columns):
df2[list('xab')] # df2 but only with columns x, a, and b
df1.merge(df2[list('xab')])
If you want to drop column(s) from the target data frame, but the column(s) are required for the join, you can do the following:
df1 = df1.merge(df2[['a', 'b', 'key1']], how='left',
                left_on='key2', right_on='key1').drop(columns=['key1'])
The .drop(columns=['key1']) part will prevent 'key1' from being kept in the resulting data frame, despite it being required for the join in the first place.
You can use .iloc to select specific columns (with all rows) and merge on that selection. An example is below:
pandas.merge(dataframe1, dataframe2.iloc[:, 0:5], how='left', on='key')
In this example, you are merging dataframe1 and dataframe2 with a left join on 'key'. For dataframe2 you have used .iloc, which lets you specify rows and columns positionally. Using : selects all rows, while 0:5 selects the first 5 columns. You could use .loc to select by name instead, but if you're dealing with long column names, .iloc may be better.
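Note that the positional slice must still include the join key. A minimal sketch with made-up frames and column names:
import pandas as pd

# 'key' is the first column of dataframe2, so it survives the 0:5 slice
dataframe1 = pd.DataFrame({'key': [1, 2], 'y': ['a', 'b']})
dataframe2 = pd.DataFrame({'key': [1, 2], 'c0': [1, 2], 'c1': [3, 4],
                           'c2': [5, 6], 'c3': [7, 8], 'c4': [9, 10]})

merged = pd.merge(dataframe1, dataframe2.iloc[:, 0:5], how='left', on='key')
print(merged)  # columns: key, y, c0, c1, c2, c3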
This is how to merge selected columns from two tables.
If table_1 contains columns t1_a, t1_b, t1_c, ..., id, ..., t1_z,
and table_2 contains columns t2_a, t2_b, t2_c, ..., id, ..., t2_z,
and only t1_a, id, and t2_a are required in the final table, then:
mergedCSV = table_1[['t1_a', 'id']].merge(table_2[['t2_a', 'id']], on='id', how='left')
# save resulting output file
mergedCSV.to_csv('output.csv', index=False)
Slight extension of the accepted answer for multi-character column names, using inner join by default:
df1 = df1.merge(df2[["Key_Column", "Target_Column1", "Target_Column2"]])
This assumes that Key_Column is the only column both dataframes have in common.
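Putting it together, a minimal runnable sketch with made-up data matching the question's column names:
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3], 'y': ['p', 'q', 'r'], 'z': [10, 20, 30]})
df2 = pd.DataFrame({'x': [1, 2, 4], 'a': [0.1, 0.2, 0.4], 'b': [7, 8, 9],
                    'c': [0, 0, 0]})  # 'c' stands in for the unwanted columns

# Keep only the key plus the columns of interest before merging
out = df1.merge(df2[['x', 'a', 'b']], on='x', how='left')
print(out)  # columns: x, y, z, a, b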