Compare columns in two dataframes to set values in one from another - pandas

I need to take values from a larger database, coor, and add them to an existing column in a smaller one, xshooter, where the entries match. I'm using RA columns in both to find the matching instances. I need a way to rewrite the for loops using pandas functions to reduce runtime.
# Nested loops: for every pair of rows, copy SimbadName into star_id where the RA values match
for j in coor.index:
    for i in xshooter_master_tabl.index:
        if coor['RA_ICRS'][j] == xshooter_master_tabl['ra'][i]:
            xshooter_master_tabl['star_id'][i] = coor['SimbadName'][j]
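One way to vectorize this (a minimal sketch, assuming the RA values match exactly and each RA_ICRS value appears only once in coor) is to build a lookup Series indexed by RA and use Series.map:

# Lookup table: RA value -> Simbad name (assumes RA_ICRS is unique in coor)
ra_to_name = coor.set_index('RA_ICRS')['SimbadName']

# Map each xshooter RA onto the lookup; rows without a match keep their old star_id
matched = xshooter_master_tabl['ra'].map(ra_to_name)
xshooter_master_tabl['star_id'] = matched.fillna(xshooter_master_tabl['star_id'])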

Related

Copy values from specific rows in a dataframe to the same columns in other rows in the same dataframe

I have a dataframe, say it has columns "A" through "Z", and several thousand rows. I have also two same-length lists, each with dataframe indices. The first list represents the indices of rows that are to be written to, and the second list represents in order the indices of rows to be read from. At the same time I only want to copy specific columns, say "N" through "Z", which are not all of the same dtype (i.e., some floating point numbers, some booleans, some timestamps).
Is there a way to do the copy without resorting to for loops?
I've read up on joins, merges, concats, merge_asof, use of the .loc attribute, etc. None of them seems to quite address what I'm looking for, and neither have any of the searches I've done on Stack Overflow.
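For what it's worth, here is a minimal sketch of one loop-free approach, assuming write_idx and read_idx are the two equal-length lists of index labels (the names are placeholders): re-label the source rows with the destination index so pandas aligns them row for row, then assign back with .loc.

cols = df.loc[:, 'N':'Z'].columns        # the columns to copy

src = df.loc[read_idx, cols].copy()      # rows to read from, in order
src.index = write_idx                    # re-label so each source row lines up with its target row
df.loc[write_idx, cols] = src            # aligned assignment; each column keeps its own dtype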

Slice dataframe according to unique values into many smaller dataframes

I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written a code which basically performs an analysis using the x and y values for a subset of this data (but the specifics are unimportant for this).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on 1 subset of titles (i.e. one set of rows with the same title) at a time. For this I've been making a smaller dataframe out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller datasets I need to do this analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller dataframes automatically, by slicing the data by each unique 'title' value? In all the help I've found, it seems like you need to specify the 'title' to make a subset. I want to subset all of them, and I don't want to have to list all the title names to do it.
I've searched quite a lot and haven't found anything, however I am a beginner so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use Pandas groupby
For example:
df_dict = {title: group for title, group in df.copy().groupby('title', sort=False)}
This creates a dictionary of DataFrames, each containing all the columns but only the rows belonging to one unique value of 'title'.
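You can then run the analysis on each subset by looping over the dictionary; analyse here is a placeholder for whatever you do with the x and y values:

for title, subset in df_dict.items():
    analyse(subset['x'], subset['y'])    # placeholder for your per-title analysis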

Finding the latest version of each row using Dask with Parquet files and partition_on?

How can I make sure that I am able to retain the latest version of a row (based on unique constraints) with Dask, using Parquet files and partition_on?
The most basic use case is that I want to query a database for all rows where updated_at > yesterday and partition the data based on the created_at_date (meaning that there can be multiple dates which have been updated, and these files already exist most likely).
└───year=2019
    └───month=2019-01
            2019-01-01.parquet
            2019-01-02.parquet
So I want to be able to combine my new results from the latest query and the old results on disk, and then retain the latest version of each row (id column).
I currently have Airflow operators handling the following logic with Pandas and it achieves my goal. I was hoping to accomplish the same thing with Dask without so much custom code though:
Partition the data based on specified columns and save a file for each partition (a common example would be using the date or month column to create files such as 2019-01-01.parquet or 2019-12.parquet).
Example:
df_dict = {k: v for k, v in df.groupby(partition_columns)}
Loop through each partition and check if the file name exists. If there is already a file with the same name, read that file as a separate dataframe and concat the two dataframes
Example:
part = df_dict[partition]
part = pd.concat([part, existing], sort=False, ignore_index=True, axis='index')
Sort the dataframes and drop duplicates based on a list of specified columns (the unique constraints, typically sorted by a file_modified_timestamp or updated_at column so that the latest version of each row is retained)
Example:
part = part.sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraints, keep='last')
The end result is that my partitioned file (2019-01-01.parquet) has now been updated with the latest values.
I can't think of a way to use the existing parquet methods of a dataframe to do what you are after, but assuming your dask dataframe is reasonably partitioned, you could do the exact same set of steps within a map_partitions call. This means you pass the constituent pandas dataframes to the function, which acts on them. So long as the data in each partition is non-overlapping, you will do ok.
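A minimal sketch of that idea, using the (hypothetical) sort_columns and unique_constraints lists from the question and assuming ddf is the Dask dataframe holding the new query results; for simplicity the target path is passed in directly rather than derived from each partition's created_at_date:

import os
import pandas as pd

def merge_with_existing(part, path, sort_columns, unique_constraints):
    # Same pandas steps as before, applied to one partition: read the existing
    # file if there is one, concat, sort, and keep the latest version of each row.
    if os.path.exists(path):
        existing = pd.read_parquet(path)
        part = pd.concat([part, existing], sort=False, ignore_index=True)
    return (part.sort_values(sort_columns, ascending=True)
                .drop_duplicates(unique_constraints, keep='last'))

# Extra arguments are forwarded to the function; meta describes the output schema.
deduped = ddf.map_partitions(merge_with_existing, path, sort_columns,
                             unique_constraints, meta=ddf._meta)
deduped.to_parquet('output', partition_on=['created_at_date'])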

Pandas, turn all the data frame to unique categorical values

I am relatively new to Pandas and to Python, and I am trying to find out how to turn all the content of a Pandas DataFrame (all fields are strings) into categorical codes.
All the values from rows and columns have to be treated as a big unique data set before turning them to categorical numbers.
So far I was able to write the following piece of code
for col_name in X.columns:
    if X[col_name].dtype == 'object':
        X[col_name] = X[col_name].astype('category')
        X[col_name] = X[col_name].cat.codes
that works on a data frame X of multiple columns. It takes the strings and turns them to unique numbers.
What I am not sure about is that my for loop works per column, so I don't know whether the codes it assigns are unique per column or across the whole data frame (the latter is the desired behaviour).
Can you please provide advice on how I can turn my code to provide unique numbers considering all the values of the data frame?
I would like to thank you in advance for your help.
Regards
Alex
Use DataFrame.stack to get one MultiIndex Series of all the values, convert it to categorical codes, and reshape back with Series.unstack:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
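A quick illustration on a toy frame, showing that the same string gets the same code even when it appears in different columns:

import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': ['y', 'z']})
cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
print(df)
#    a  b
# 0  0  1
# 1  1  2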

Looping through columns to conduct data manipulations in a data frame

One struggle I have with Python Pandas is repeating the same coding scheme for a large number of columns. For example, the code below creates a new column age_b in a data frame called data. How do I easily loop through a long list (hundreds or even thousands) of numeric columns, do the exact same thing, and have the newly created column names be the existing name with a prefix or suffix string such as "_b"?
labels = [1,2,3,4,5]
data['age_b'] = pd.cut(data['age'], bins=5, labels=labels)
In general, I have many simple data frame column manipulations or calculations, and it's easy to write the code. However, so often I want to repeat the same process for dozens of columns, and that's when I get bogged down, because most functions or manipulations work for one column but are not easily repeatable across many columns. It would be nice if someone could suggest a looping code "structure". Thanks!
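A minimal sketch of one looping structure, assuming data is the dataframe from the question, the goal is to bin every numeric column the same way, and each result is stored under the original name plus a "_b" suffix (the column selection and suffix are just examples):

import pandas as pd

labels = [1, 2, 3, 4, 5]

# Apply the same transformation to every numeric column, writing each result
# to a new column named "<original>_b".
for col in data.select_dtypes('number').columns:
    data[f'{col}_b'] = pd.cut(data[col], bins=5, labels=labels)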