I have an Excel file with a format like this:
| No | Date | Time | Name | ID | Serial | Total |
| 1 |2021-03-01| 11.45 | AB | 124535 | 5215635 | 50 |
I'm trying to convert the Excel file to a pandas DataFrame using the code below:
pd.read_excel(r'path', header=0)
pandas reads the Excel file successfully; however, I noticed a strange result in the Time column.
The resulting DataFrame looks like this:
| No | Date | Time | Name | ID | Serial | Total |
| 1.0 |2021-03-01| 11.4 | AB | 124535 | 5215635.0 | 50.0 |
The Time column is missing a digit. Is my method of reading the Excel file incorrect?
read_excel is interpreting your dot-separated time as a float, which is to be expected.
I suggest telling read_excel to read this column as a string and converting it to datetime afterwards:
df = pd.read_excel(r'path', header=0, converters={'Time': str})
df['Time'] = pd.to_datetime(df.Time, format="%H.%M")
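For reference, here is the parsing step in isolation, with a couple of literal strings standing in for what the str converter hands back (no Excel file required):

```python
import pandas as pd

# Strings as the converter above would deliver them: "hours.minutes".
df = pd.DataFrame({"Time": ["11.45", "09.05"]})

# Parse the dot-separated values; the date part defaults to 1900-01-01,
# so keep only the time component if the time of day is all you need.
df["Time"] = pd.to_datetime(df["Time"], format="%H.%M").dt.time
print(df["Time"].tolist())
```

The Time column now holds datetime.time objects (11:45:00 and 09:05:00) instead of truncated floats.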
I have the following DataFrame in Python Pandas:
df.head(3)
+===+===========+======+=======+
| | year-month| cat | count |
+===+===========+======+=======+
| 0 | 2016-01 | 1 | 14 |
+---+-----------+------+-------+
| 1 | 2016-02 | 1 | 22 |
+---+-----------+------+-------+
| 2 | 2016-01 | 2 | 10 |
+---+-----------+------+-------+
year-month is a combination of year and month, dating back about 8 years.
cat is an integer from 1 to 10.
count is an integer.
I now want to plot count vs. year-month with matplotlib, one line for each cat. How can this be done?
The easiest way is seaborn:
import seaborn as sns
sns.lineplot(x='year-month', y='count', hue='cat', data=df)
Note: it may also help to convert year-month to a real datetime type before plotting (matplotlib cannot plot Period objects directly), e.g.
df['year-month'] = pd.to_datetime(df['year-month'], format='%Y-%m')
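If you'd rather stay with plain matplotlib, grouping by cat and drawing one line per group works as well. A minimal sketch with made-up data in the question's shape:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up data with the question's columns.
df = pd.DataFrame({
    "year-month": ["2016-01", "2016-02", "2016-01", "2016-02"],
    "cat": [1, 1, 2, 2],
    "count": [14, 22, 10, 12],
})
df["year-month"] = pd.to_datetime(df["year-month"], format="%Y-%m")

# One line per category.
fig, ax = plt.subplots()
for cat, grp in df.groupby("cat"):
    ax.plot(grp["year-month"], grp["count"], label=f"cat {cat}")
ax.set_xlabel("year-month")
ax.set_ylabel("count")
ax.legend()
```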
I have a 750k-row DataFrame df with 15 columns and a pd.Timestamp index called ts.
I process real-time data down to milliseconds in near real time.
Now I would like to attach some statistical data, kept in df_stats at a 1-minute time resolution, as new columns on the big df.
$ df
+----------------+---+---------+
| ts | A | new_col |
+----------------+---+---------+
| 11:33:11.31234 | 1 | 81 |
+----------------+---+---------+
| 11:33:11.64257 | 2 | 81 |
+----------------+---+---------+
| 11:34:10.12345 | 3 | 60 |
+----------------+---+---------+
$ df_stats
+----------------+----------------+
| ts | new_col_source |
+----------------+----------------+
| 11:33:00.00000 | 81 |
+----------------+----------------+
| 11:34:00.00000 | 60 |
+----------------+----------------+
Currently I have the code below, but it is inefficient because it needs to iterate over the complete data set.
I am wondering whether there isn't an easier solution using pd.cut, bins, or pd.Grouper, or something else to merge the time buckets on the two indexes?
df_stats['ts_timeonly'] = df_stats.index.map(lambda x: x.replace(second=0, microsecond=0))
df['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df = df.merge(df_stats, on='ts_timeonly', how='left', sort=True, suffixes=['', '_hist']).set_index('ts')
Let us try reindex with method='nearest':
df_stats = df_stats.set_index('ts').reindex(df['ts'], method='nearest')
df_stats.index = df.index
df = pd.concat([df, df_stats], axis=1)
Or
df = pd.merge_asof(df, df_stats, on='ts', direction='nearest')
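To see how merge_asof lines the two frames up, here is a self-contained sketch with toy timestamps shaped like the question's (direction='backward' reproduces the floor-to-minute semantics of the original merge):

```python
import pandas as pd

# Millisecond ticks, as in the big df.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2021-01-01 11:33:11.312",
                          "2021-01-01 11:33:11.642",
                          "2021-01-01 11:34:10.123"]),
    "A": [1, 2, 3],
})

# 1-minute statistics, as in df_stats.
df_stats = pd.DataFrame({
    "ts": pd.to_datetime(["2021-01-01 11:33:00", "2021-01-01 11:34:00"]),
    "new_col_source": [81, 60],
})

# Both frames must be sorted on the key; direction='backward' picks the
# most recent stats row at or before each tick.
out = pd.merge_asof(df, df_stats, on="ts", direction="backward")
print(out["new_col_source"].tolist())  # → [81, 81, 60]
```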
I am trying to convert a one-hot-encoded DataFrame into a two-column frame.
Is there any way I can iterate over rows and columns and, for the cells holding a 1, fill in the column name?
problem dataframe:
+------------------+-----+-----+
| sentence | lor | sor |
+------------------+-----+-----+
| sam lived here | 0 | 1 |
+------------------+-----+-----+
| drack lived here | 1 | 0 |
+------------------+-----+-----+
Solution dataframe:
+------------------+------+
| sentence | tags |
+------------------+------+
| sam lived here | sor |
+------------------+------+
| drack lived here | lor |
+------------------+------+
You can segregate the rows having a 1 in each column and replace that 1 with the column's name:
lor_df = df.loc[df["lor"].eq(1), "lor"].replace(1, "lor")
sor_df = df.loc[df["sor"].eq(1), "sor"].replace(1, "sor")
After this, concatenate the individual results using pandas.concat, then drop the columns that are no longer required.
df["tags"] = pd.concat([lor_df, sor_df], sort=False)
df.drop(columns=["lor", "sor"], inplace=True)
To ensure unique values we can use pandas.DataFrame.drop_duplicates
df.drop_duplicates(inplace=True)
print(df)
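If every row is guaranteed to contain exactly one 1, DataFrame.idxmax over the one-hot columns gives the same result without per-column handling. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "sentence": ["sam lived here", "drack lived here"],
    "lor": [0, 1],
    "sor": [1, 0],
})

# idxmax(axis=1) returns, per row, the name of the column holding the
# maximum value, i.e. the column containing the 1.
df["tags"] = df[["lor", "sor"]].idxmax(axis=1)
df = df.drop(columns=["lor", "sor"])
print(df)
```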
In Jupyter Notebook, runs of whitespace in a DataFrame are collapsed in the display. But sometimes that is not what you want, e.g.
df = pd.DataFrame({'A': ['a  b', 'c'], 'B': [1, 2]})
df
The result I get:
| | A | B |
|---|-----|---|
| 0 | a b | 1 |
| 1 | c | 2 |
But I want:
| | A | B |
|---|-------|---|
| 0 | a  b  | 1 |
| 1 | c | 2 |
Is it possible? Thanks
It's actually HTML: pandas dutifully writes all the spaces into the HTML markup (the front-end format used by Jupyter Notebook), but HTML, by default, collapses runs of adjacent whitespace into one. Use the Styler object to change this:
df.style.set_properties(**{'white-space': 'pre'})
Unfortunately, you can't change the default render style of a DataFrame yet, but you can write a small wrapper function:
def print_df(df):
    return df.style.set_properties(**{'white-space': 'pre'})
print_df(df)
I am trying to use pandas to process a series of XLS files. The code I am currently using looks like:
with pandas.ExcelFile(data_file) as xls:
data_frame = pandas.read_excel(xls, header=[0, 1], skiprows=2, index_col=None)
And the format of the XLS file looks like
+---------------------------------------------------------------------------+
| REPORT |
+---------------------------------------------------------------------------+
| Unit: 1000000 USD |
+---------------------------------------------------------------------------+
| | | | | Balance |
+ ID + Branch + Customer ID + Customer Name +--------------------------+
| | | | | Daily | Monthly | Yearly |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 1 | Company A | 10 | 5 | 2 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 222222 | Branch2 | 2 | Company B | 20 | 25 | 20 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 3 | Company C | 30 | 35 | 40 |
+--------+---------+-------------+---------------+-------+---------+--------+
Even though I explicitly passed index_col=None, pandas still takes the ID column as the index. I am wondering what the right way is to make the row numbers the index.
pandas currently doesn't support parsing a MultiIndex columns without also parsing a row index. Related issue here - it probably could be supported, but this gets tricky to define in a non-ambiguous way.
It's a hack, but the easiest way to work around this right now is to add a blank column on the left side of the data, then read it in like this:
pd.read_excel('file.xlsx', header=[0,1], skiprows=2).reset_index(drop=True)
Edit:
If you can't / don't want to modify the files, a couple options are:
If the data has a known / common header, use pd.read_excel(..., skiprows=4, header=None) and assign the columns yourself, as suggested by @ayhan.
If you need to parse the header, use pd.read_excel(..., skiprows=2, header=0), then munge the second level of labels into a MultiIndex. This will probably mess up the dtypes, so you may also need to do some typecasting (pd.to_numeric) as well.
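To illustrate the second option, here is a sketch of the munging step. The intermediate frame is built inline to stand in for what pd.read_excel(..., skiprows=2, header=0) would return for the report above:

```python
import pandas as pd

# Stand-in for the read_excel result: the first header row became the
# columns (with "Unnamed: N" placeholders under the merged "Balance"
# cell), and the Daily/Monthly/Yearly row landed in the data as row 0.
raw = pd.DataFrame(
    [[None, None, None, None, "Daily", "Monthly", "Yearly"],
     [111111, "Branch1", 1, "Company A", 10, 5, 2],
     [222222, "Branch2", 2, "Company B", 20, 25, 20]],
    columns=["ID", "Branch", "Customer ID", "Customer Name",
             "Balance", "Unnamed: 5", "Unnamed: 6"],
)

# Forward-fill the top level so "Balance" spans its three sub-columns,
# and take the sub-level labels from data row 0.
top = pd.Series([None if str(c).startswith("Unnamed") else c
                 for c in raw.columns]).ffill()
sub = raw.iloc[0].fillna("").tolist()
raw.columns = pd.MultiIndex.from_arrays([top, sub])

df = raw.iloc[1:].reset_index(drop=True)
print(df[("Balance", "Daily")].tolist())  # → [10, 20]
```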