Create a fixed and dynamic header for a python dataframe - pandas

I am reading a gzip file and converting it to a DataFrame with the method below:
df = pd.read_csv(file.gz, compression='gzip', header=0, sep=',', quotechar='"', error_bad_lines=False)
This populates the first row of data as the column header. Because the data in the gzip file varies every time, the column header changes as well. There is also no fixed column count; it differs from file to file, as below.
File 1
01-10-2019 Samsung Owned
-----------------------------
01-10-2019 Samsung Owned
03-10-2019 Motorolla Sold
File 2
SAMSUNG Walmart DHL 300$ Sold Alaska
--------------------------------------------------
SAMSUNG Walmart DHL 300$ Sold Alaska
Sony Motorolla Fedex 250$ Sold Chicago
To do some data manipulation it would be great if I had fixed column names such as 1, 2, 3 based on the number of columns the dataframe has, like:
File 1
1 2 3
-----------------------------
01-10-2019 Samsung Owned
03-10-2019 Sony Sold
File 2
1 2 3 4 5 6
--------------------------------------------------
SAMSUNG Walmart DHL 300$ Sold Alaska
Sony Motorolla Fedex 250$ Sold Chicago

If I understood you correctly, you don't want to read the header from the csv file.
That can be done using header=None.
If your csv file contains a header row that you want to ignore, you can also add skiprows=1.
df = pd.read_csv(file.gz, compression='gzip', header=None, sep=',', quotechar='"', error_bad_lines=False)
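If you also want the columns labelled 1, 2, 3, … (as in your example) rather than pandas' default 0, 1, 2, …, a minimal sketch along these lines should work; the file name is just the one from your question:
import pandas as pd

# Read without treating the first row as a header; pandas assigns
# integer column labels 0, 1, 2, ... automatically.
df = pd.read_csv("file.gz", compression="gzip", header=None, sep=",", quotechar='"')

# Shift the default labels so the columns are named 1..N instead of 0..N-1
df.columns = range(1, len(df.columns) + 1)

print(df.head())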

Related

Choosing companies from a dataframe with monthly returns based on a company list in another dataframe

I'm currently writing my master's thesis and I would like to calculate the portfolio returns for a list of companies. Therefore, I want to select, from the dataframe with the monthly returns, the companies that are listed in another dataframe.
The dataframe with the returns looks like this:
The second dataframe, with the company names by which I want to select, looks like this:
I tried something like this:
mrt.loc[mrt['Company Name'] == SL1['Company Name']], which just gives me the Error 'Company Name'. I checked and the spelling should be correct.
I also tried this:
mrt.loc[mrt == SL0['Company Name']]
This gives me a list of companies, but I also need the monthly returns from the dataframe mrt.
So, as a recap: I want the rows from mrt that match the company names in the dataframe SL0, and afterwards I need to do the same with other dataframes like SL0 but of different lengths.
Could someone help? Thank you very much and have a nice day.
isin() would resolve this issue.
df
###
Company Name value1
0 Apple 1
1 Google 2
2 Microsoft 3
3 Facebook 4
4 Tesla 5
5 Amazon 6
6 Alphabet 7
7 Oracle 8
8 IBM 9
9 Facebook 10
df2
###
value2 value3 Company Name
0 51 11 Apple
1 52 22 Google
2 35 33 Microsoft
3 54 44 Facebook
Selecting
df[df['Company Name'].isin(df2['Company Name'])]
###
Company Name value1
0 Apple 1
1 Google 2
2 Microsoft 3
3 Facebook 4
9 Facebook 10
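For the monthly-returns case, a small self-contained sketch could look like the following; the helper name and the toy data are mine, and it assumes both frames have a 'Company Name' column:
import pandas as pd

def filter_by_companies(returns_df, companies_df, col='Company Name'):
    # Keep only the rows of returns_df whose company appears in companies_df
    return returns_df[returns_df[col].isin(companies_df[col])]

# Toy stand-ins for mrt (monthly returns) and SL0 (company list)
mrt = pd.DataFrame({
    'Company Name': ['Apple', 'Google', 'Tesla', 'Amazon'],
    '2020-01': [0.02, -0.01, 0.05, 0.03],
    '2020-02': [0.01, 0.04, -0.02, 0.00],
})
SL0 = pd.DataFrame({'Company Name': ['Apple', 'Amazon']})

# Returns the Apple and Amazon rows of mrt, monthly returns included;
# the same call works for SL1, SL2, ... of any length.
print(filter_by_companies(mrt, SL0))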

How do you copy data from one dataframe to another

I am having a difficult time getting the correct data from a reference csv file to the one I am working on.
I have a csv file that has over 6 million rows and 19 columns. It looks something like this:
[screenshot of the dataframe]
For each row there is a brand and a model of a car amongst other information.
I want to add to this file the fuel consumption per 100km traveled and the type of fuel that is used.
I have another csv file that has the fuel consumption of every model of car; it looks something like this:
[screenshot of the second dataframe]
What I ultimately want to do is add the matching values of columns G, H, I and J from the second file to the first one.
Because of the size of the file I was wondering if there is another way to do it other than with a "for" or a "while" loop?
EDIT :
For example...
The first df would look something like this:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:
ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may have many times the same brand and model for different ID's. The order is completely random.
Thank you for providing updates. I was able to put something together that should help you:
# Drop these two columns from the first dataframe because you will get them
# from df1 (your second table) once the two are joined
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)
# This joins the first and second dataframes on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
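A minimal end-to-end sketch of that merge, with toy data in place of the 6-million-row file and a left join so cars missing from the reference file keep NaN in the fuel columns:
import pandas as pd

# First file: car records without the fuel figures
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Brand': ['Toyota', 'Honda', 'GMC', 'Toyota'],
    'Model': ['Rav4', 'Civic', 'Sierra', 'Rav4'],
    'Other_columns': ['a', 'b', 'c', 'd'],
})

# Second file: one row per brand/model with the fuel figures
df1 = pd.DataFrame({
    'Brand': ['Toyota', 'Toyota', 'GMC', 'Honda'],
    'Model': ['Corrola', 'Rav4', 'Sierra', 'Civic'],
    'Fuel_consu_1': [100, 80, 91, 112],
    'Fuel_consu_2': [120, 84, 105, 125],
})

# Left join keeps every row of df, even models missing from df1
df_merge = pd.merge(df, df1, on=['Brand', 'Model'], how='left')
print(df_merge)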

Pandas Display Format on a specific column

So I want to display a single column with a currency format. Basically with a dollar sign, thousand comma separators, and two decimal places.
Input:
Invoice Name Amount Tax
0001 Best Buy 1324 .08
0002 Target 1238593.1 .12
0003 Walmart 10.32 .55
Output:
Invoice Name Amount Tax
0001 Best Buy $1,324.00 .08
0002 Target $1,238,593.10 .12
0003 Walmart $10.32 .55
Note: I still want to be able to do calculations on it, so it would only be a display feature.
If you just want to format it for printing, you can try:
df.apply(lambda x: [f'${y:,.2f}' for y in x] if x.name == 'Amount' else x)
which creates a new dataframe that looks like:
  Invoice      Name         Amount   Tax
0       1  Best Buy      $1,324.00  0.08
1       2    Target  $1,238,593.10  0.12
2       3   Walmart         $10.32  0.55
You can simply add this line (before printing your data frame of course):
pd.options.display.float_format = '${:,.2f}'.format
It will print the float columns in the data frame like this:
$12,500.00
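If you only want the Amount column formatted, and want to keep the raw numbers available for calculations, the pandas Styler can format a single column for display only; a minimal sketch (the Styler renders in Jupyter, or via to_html() in recent pandas versions):
import pandas as pd

df = pd.DataFrame({
    'Invoice': ['0001', '0002', '0003'],
    'Name': ['Best Buy', 'Target', 'Walmart'],
    'Amount': [1324, 1238593.1, 10.32],
    'Tax': [0.08, 0.12, 0.55],
})

# The Styler only changes how the frame is displayed;
# df itself keeps plain floats, so calculations still work.
styled = df.style.format({'Amount': '${:,.2f}'})

print(df['Amount'].sum())  # calculations use the unformatted values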

pandas group by and sum with values being displayed

I need to group by two columns and sum the third one. My data looks like this:
site industry spent
Auto Cars 1000
Auto Fashion 200
Auto Housing 100
Auto Housing 300
Magazine Cars 100
Magazine Fashion 200
Magazine Housing 300
Magazine Housing 500
My code:
df.groupby(by=['site', 'industry'])['spent'].sum()
The output is:
spent
site industry
Auto Cars 1000
Fashion 200
Housing 400
Magazine Cars 100
Fashion 200
Housing 800
When I convert it to CSV I only get one column, spent. My desired output has the same format as the original data, only with spent summed, and I need to see all the values in columns.
Try this, using as_index=False:
df = df.groupby(by=['site', 'industry'], as_index=False).sum()
print(df)
site industry spent
0 Auto Cars 1000
1 Auto Fashion 200
2 Auto Housing 400
3 Magazine Cars 100
4 Magazine Fashion 200
5 Magazine Housing 800
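Because as_index=False keeps site and industry as regular columns, writing the result to CSV should produce all three columns; a minimal sketch with the data from the question (the output file name is just an example):
import pandas as pd

df = pd.DataFrame({
    'site': ['Auto', 'Auto', 'Auto', 'Auto',
             'Magazine', 'Magazine', 'Magazine', 'Magazine'],
    'industry': ['Cars', 'Fashion', 'Housing', 'Housing',
                 'Cars', 'Fashion', 'Housing', 'Housing'],
    'spent': [1000, 200, 100, 300, 100, 200, 300, 500],
})

# as_index=False keeps the group keys as columns, so the CSV has site, industry and spent
summed = df.groupby(by=['site', 'industry'], as_index=False)['spent'].sum()
summed.to_csv('spent_by_site_industry.csv', index=False)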

Concatenate multiple row and sum on a specific column

I want to concatenate multiple rows into a single row. I managed to concatenate the rows; however, when I try to apply a sum on a specific column, it gives me an error: TypeError: can only concatenate str (not "float") to str
Item Sum Brand Type User ID
ABC 5 High Zinc John 20A
CDD 3 Low Iron Bail 10B
ABC 10 High Zinc John 20A
CDD 200 Low Iron Bail 10B
Below is my code:
df = df.groupby(['ID','User','Type','Brand']).agg({'Item':''.join, 'Sum':'sum'}).reset_index()
Desired Output:
Item Sum Brand Type User ID
ABC 15 High Zinc John 20A
CDD 203 Low Iron Bail 10B
Thank You in advance!
df = df.pivot_table(index=['Brand', 'Type', 'User', 'ID'],values=['Sum'], columns=['Item'], aggfunc=sum).stack().reset_index()
Brand Type User ID Item Sum
0 High Zinc John 20A ABC 15.0
1 Low Iron Bail 10B CDD 203.0
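That TypeError usually means the Sum column is not fully numeric (for example strings mixed with floats), so summing tries to add a float to a string. Converting the column with pd.to_numeric first also lets a groupby/agg version work; a minimal sketch with the data from the question, using 'first' for Item since the desired output keeps a single item name rather than a concatenation:
import pandas as pd

df = pd.DataFrame({
    'Item': ['ABC', 'CDD', 'ABC', 'CDD'],
    'Sum': ['5', '3', '10', '200'],   # Sum read in as text rather than numbers
    'Brand': ['High', 'Low', 'High', 'Low'],
    'Type': ['Zinc', 'Iron', 'Zinc', 'Iron'],
    'User': ['John', 'Bail', 'John', 'Bail'],
    'ID': ['20A', '10B', '20A', '10B'],
})

# Convert Sum to numbers before aggregating
df['Sum'] = pd.to_numeric(df['Sum'])

# 'first' keeps one Item per group, 'sum' adds the numeric values
out = (df.groupby(['ID', 'User', 'Type', 'Brand'], as_index=False)
         .agg({'Item': 'first', 'Sum': 'sum'}))
print(out)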