How can I rename the data of a pandas dataframe column? - pandas

To be specific: I have a pandas dataframe column called id whose values are 10, 20, 30, etc.
How can I replace the column's values with 1, 2, 3, 4, etc. in ascending order instead of 10, 20, 30?
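A minimal sketch of two possible approaches. The sample data and the value column are made up for illustration. Which option fits depends on whether repeated ids should share the same new number.

```python
import pandas as pd

# Hypothetical sample data; only the column name "id" comes from the question.
df = pd.DataFrame({"id": [10, 20, 30, 20], "value": ["a", "b", "c", "d"]})

# Option 1: number the rows 1..n in their current order.
sequential = df.assign(id=range(1, len(df) + 1))

# Option 2: dense rank, so equal ids get the same number and
# the numbering follows the ascending order of the old values.
ranked = df.assign(id=df["id"].rank(method="dense").astype(int))
```

Option 1 ignores the old values entirely, while option 2 maps 10 to 1, 20 to 2, 30 to 3 wherever they appear.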

Related

Creating Dynamic names for columns in a dataframe

Suppose I have three columns named a, b and c in a dataframe. I want to prepend or append a keyword to each of the column names. How can I do this in DataFrames.jl?

How to remove rows in a dataframe whose column values are not in a list

I have a dataframe with several different possible values for a particular column. I also have a set that has the column values of rows that I actually care about. I want to update the dataframe so that it drops all rows whose column values are not found in the list I made. How would I do this?
If I understand your question correctly, then for a given column col you could do something like this:
df = df.loc[df[col].isin(your_list)]
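As a rough illustration of the line above, here is a self-contained version. The data, the column name city and the contents of your_list are invented for the example.

```python
import pandas as pd

# Hypothetical data; col and your_list mirror the names used in the answer.
df = pd.DataFrame({"city": ["Oslo", "Rome", "Lima", "Rome"], "n": [1, 2, 3, 4]})
col = "city"
your_list = ["Rome", "Lima"]

# Keep only the rows whose value in col appears in your_list.
filtered = df.loc[df[col].isin(your_list)]

# The opposite selection: negate the boolean mask with ~.
dropped = df.loc[~df[col].isin(your_list)]
```

isin builds a boolean mask with one entry per row, and .loc keeps the rows where the mask is True.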

Pyspark partition data by a column and write parquet

I need to write parquet files to separate S3 keys by the values in a column. The column city has thousands of values. Iterating with a for loop, filtering the dataframe by each column value and then writing parquet is very slow. Is there any way to partition the dataframe by the column city and write the parquet files?
What I am currently doing:
for city in cities:
    print(city)
    spark_df.filter(spark_df.city == city).write.mode('overwrite').parquet(f'reporting/date={date_string}/city={city}')
The partitionBy function solves the issue (this assumes the dataframe also has a date column):
spark_df.write.partitionBy('date', 'city').parquet('reporting')

Pandas and SQLAlchemy: renaming columns during join

I have table A and table B. Both have a column id and a column name.
When I use pd.read_sql() to convert the result of a SQLAlchemy query to a pandas DataFrame, the resulting DataFrame has two columns named id and two columns named name.
The join is executed on the id column, so even though there are two id columns, there is no ambiguity: both columns contain the same values, and I can simply drop one of them.
The two columns named name are a problem because they are not identical: column name of table A holds the name of an entity A, while column name of table B holds the name of an entity B. At this point I can't know for sure which of the two DataFrame columns comes from table A and which from table B. Is there any way to solve this by, for instance, adding a prefix to the column names? More generally, is there any way to still use the convenient pd.read_sql() in this situation?
my_dataframe = pd.read_sql(
    session.query(TableA, TableB)
    .join(TableB)
    .statement,
    session.bind)
Note: in this question I am trying to simplify the structure of a more complex preexisting Postgres database. Therefore, it won't be possible to alter the structure of the database.
The solution was actually really simple, but you have to rename every field individually:
my_dataframe = pd.read_sql(
    session.query(TableA.field1.label('my_new_name1'),
                  TableA.field2.label('my_new_name2'),
                  TableB.field1.label('my_other_name2'))
    .join(TableB)
    .statement,
    session.bind)
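In the generated SQL, label() becomes a column alias (AS ...). The sketch below shows that same aliasing idea using raw SQL and the standard-library sqlite3 module. It is a stand-in for the SQLAlchemy session, not the setup from the question, and the table names, rows and alias names are all made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with two hypothetical tables that both
# have an id column and a name column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE table_b (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO table_a VALUES (1, 'alpha'), (2, 'beta');
INSERT INTO table_b VALUES (1, 'one'), (2, 'two');
""")

# Alias each selected column so the resulting DataFrame has unique,
# self-describing column names.
query = """
SELECT a.id AS id, a.name AS a_name, b.name AS b_name
FROM table_a AS a
JOIN table_b AS b ON a.id = b.id
ORDER BY a.id
"""
joined = pd.read_sql(query, conn)
conn.close()
```

Because every selected column is aliased, there is exactly one id column and each name column shows which table it came from.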

Change order of columns when some them are variable (Pandas)

I want to change the order of my columns (Excel file), but some of the column names are variable (they change every day).
Can we assign numbers to the columns?
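One way to refer to columns by number rather than by name is positional indexing with iloc. The sketch below assumes the column positions stay fixed even when the names change. The frame and its column names are invented here, and in practice the data would come from something like pd.read_excel.

```python
import pandas as pd

# Hypothetical frame standing in for the Excel data; the date-like
# column names represent the ones that change every day.
df = pd.DataFrame({
    "date_2024_01_01": [1, 2],
    "fixed": [3, 4],
    "date_2024_01_02": [5, 6],
})

# Desired order given as 0-based column positions, not names.
new_order = [1, 0, 2]
reordered = df.iloc[:, new_order]
```

Since only positions are used, the same new_order list keeps working when the variable column names change.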