Let pandas use 0-based row number as index when reading Excel files - pandas

I am trying to use pandas to process a series of XLS files. The code I am currently using looks like:
with pandas.ExcelFile(data_file) as xls:
    data_frame = pandas.read_excel(xls, header=[0, 1], skiprows=2, index_col=None)
And the format of the XLS file looks like this:
+---------------------------------------------------------------------------+
| REPORT |
+---------------------------------------------------------------------------+
| Unit: 1000000 USD |
+---------------------------------------------------------------------------+
| | | | | Balance |
+ ID + Branch + Customer ID + Customer Name +--------------------------+
| | | | | Daily | Monthly | Yearly |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 1 | Company A | 10 | 5 | 2 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 222222 | Branch2 | 2 | Company B | 20 | 25 | 20 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 | 3 | Company C | 30 | 35 | 40 |
+--------+---------+-------------+---------------+-------+---------+--------+
Even though I explicitly passed index_col=None, pandas still takes the ID column as the index. I am wondering what the right way is to make the row numbers the index.

pandas currently doesn't support parsing MultiIndex columns without also parsing a row index. Related issue here - it probably could be supported, but this gets tricky to define in a non-ambiguous way.
It's a hack, but the easiest way to work around this right now is to add a blank column on the left side of the data, then read it in like this:
pd.read_excel('file.xlsx', header=[0,1], skiprows=2).reset_index(drop=True)
Edit:
If you can't / don't want to modify the files, a couple options are:
If the data has a known / common header, use pd.read_excel(..., skiprows=4, header=None) and assign the columns yourself, as suggested by @ayhan.
If you need to parse the header, use pd.read_excel(..., skiprows=2, header=0), then munge the second level of labels into a MultiIndex. This will probably mess up the dtypes, so you may need to do some typecasting (pd.to_numeric) as well.
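A rough sketch of that second option, assuming the layout shown in the question (two junk rows, then the two header rows, then the data); the file name is a placeholder:

import pandas as pd

df = pd.read_excel('report.xls', skiprows=2, header=0)

# The second header row (Daily / Monthly / Yearly) is now the first data row.
second_level = df.iloc[0].fillna('')
df = df.iloc[1:].reset_index(drop=True)

# Fill the spanned 'Balance' label across the unnamed columns and build the MultiIndex.
top_level = ['' if str(c).startswith('Unnamed') else str(c) for c in df.columns]
for i in range(1, len(top_level)):
    if top_level[i] == '':
        top_level[i] = top_level[i - 1]
df.columns = pd.MultiIndex.from_arrays([top_level, second_level])

# Everything came in as object dtype, so cast the balance columns back to numbers.
balance_cols = [c for c in df.columns if c[0] == 'Balance']
df[balance_cols] = df[balance_cols].apply(pd.to_numeric)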

Related

Druid generate missing records

I have a data table in Druid that has missing rows, and I want to fill them by generating the missing timestamps and carrying forward the value of the preceding row.
This is the table in druid :
| __time | distance |
|--------------------------|----------|
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
I want to add the missing minutes, either by filling them in on the Druid storage side or by querying for them directly in Druid, without going through another module.
The final result I want would look like this:
| __time | distance |
|--------------------------|----------|
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:43:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:45:00.000Z | 1360 |
| 2022-05-05T08:46:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
Thank you in advance!
A Druid timeseries query will produce a densely populated timeline at a given time granularity, like the one you want for every minute. But its current functionality either skips empty time buckets or assigns them a value of zero.
Other gap-filling functions like the LVCF (last value carried forward) behavior you describe seem like a great enhancement. You can join the Apache Druid community and create an issue that describes this request. That's a great way to start a conversation about requirements and how it might be achieved.
And/or you could also add the functionality yourself and submit a PR. We're always looking for more members in the Apache Druid community.
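In the meantime, if post-processing outside Druid is acceptable, the gap filling is easy to do client-side. A minimal sketch in Python/pandas, assuming you can pull the sparse result (e.g. via the SQL API or pydruid) into a DataFrame:

import pandas as pd

# Sparse result as returned by Druid (hard-coded here for illustration).
rows = [
    ('2022-05-05T08:41:00.000Z', 1337),
    ('2022-05-05T08:42:00.000Z', 1350),
    ('2022-05-05T08:44:00.000Z', 1360),
    ('2022-05-05T08:47:00.000Z', 1377),
    ('2022-05-05T08:48:00.000Z', 1400),
]
df = pd.DataFrame(rows, columns=['__time', 'distance'])
df['__time'] = pd.to_datetime(df['__time'])

# Reindex onto a dense 1-minute grid and carry the last value forward (LVCF).
dense = df.set_index('__time').resample('1min').ffill().reset_index()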

SQL Query to look up one table against another

I have two Excel tables: the first one is the data table and the second one is a lookup table. Here is how they are structured:
Data Table
+----------+-------------+----------+----------+
| Category | Subcategory | Division | Business |
+----------+-------------+----------+----------+
| A | Red | Home | Q |
| B | Blue | Office | R |
| C | Green | City | S |
| D | Yellow | State | T |
| D | Red | State | T |
| D | Green | Office | Q |
+----------+-------------+----------+----------+
Lookup Table
+----------+-------------+----------+----------+--------------+
| Category | Subcategory | Division | Business | LookUp Value |
+----------+-------------+----------+----------+--------------+
| 0 | 0 | 0 | Q | ABC |
| B | 0 | Office | 0 | DEF |
| C | Green | 0 | 0 | MNO |
| D | 0 | State | T | RST |
+----------+-------------+----------+----------+--------------+
So I want to add the lookup value column to the data table based on the criteria given in the lookup table. E.g., for the first row in the lookup table, I don't want to look up on Category, Subcategory, or Division, but if the Business is Q, then I want to populate the lookup value as ABC. Similarly, for the second row I don't want to consider the Subcategory and Business, but if the Category is "B" and the Division is "Office", I want it to populate DEF. So the result should look like this:
Final Resulting Data Table
+----------+-------------+----------+----------+--------------+
| Category | Subcategory | Division | Business | LookUp Value |
+----------+-------------+----------+----------+--------------+
| A | Red | Home | Q | ABC |
| B | Blue | Office | R | DEF |
| C | Green | City | S | MNO |
| D | Yellow | State | T | RST |
| D | Red | State | T | RST |
| D | Green | Office | Q | ABC |
+----------+-------------+----------+----------+--------------+
I am very new to SQL, and the actual data set is very complex, with multiple lookup values based on different criteria. If you think any other scripting language would work better, I am open to that too. My data is currently in Excel.
If the data is so complex, you should first consider whether you want to put it in a (relational) database (like MS Access, MySQL, etc.) instead of in a spreadsheet (like MS Excel).
Both kinds of programs are used for structured data handling, but databases focus primarily on efficient data storage and data integrity (including guarding type safety, required fields, unique fields, required references between various datasets/tables, etc.), while spreadsheets focus primarily on data analysis and calculations.
Relational databases support Structured Query Language (SQL) to let clients query their data. Spreadsheets normally do not use or support SQL (as far as I know).
It is possible to let MS Excel import or reference data in an external data source (like a relational database) to perform analysis and calculations on it.
The other way around is (sometimes) possible too: linking spreadsheet worksheets as external tables inside a relational database system so that, within certain limits, the data can be queried using SQL. But using a database to store the data and a spreadsheet (as a database client) to perform analysis on the data in the database would be a more logical design in my opinion.
However, creating such an integrated solution using multiple MS Office applications and/or external databases can be a complex challenge, especially when you are just starting to learn about them.
To be honest, I am not experienced with designing MS Office based solutions, so I cannot guide you around any pitfalls. I do hope that this answer helps you a little with finding the right way to go here...
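That said, since the data currently lives in Excel and you mention being open to other scripting languages, here is a minimal pandas sketch of the wildcard lookup described above, assuming 0 means "match anything" and that the first matching lookup row wins; the file names are placeholders:

import pandas as pd

data = pd.read_excel('data.xlsx')      # Category, Subcategory, Division, Business
lookup = pd.read_excel('lookup.xlsx')  # same columns plus 'LookUp Value'

keys = ['Category', 'Subcategory', 'Division', 'Business']

def find_value(row):
    for _, rule in lookup.iterrows():
        # A rule matches when every non-wildcard field equals the data row's field.
        if all(rule[k] in (0, '0') or rule[k] == row[k] for k in keys):
            return rule['LookUp Value']
    return None  # no rule matched

data['LookUp Value'] = data.apply(find_value, axis=1)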

How to import Excel table with double headers into oracle database

I have an Excel table I am trying to transfer over to an Oracle database. The problem is that the table has headers that overlap, and I'm not sure if there is a way to import this nicely into an Oracle database.
+-----+-----------+-----------+
| | 2018-01-01| 2018-01-02|
|Item +-----+-----+-----+-----+
| | RMB | USD | RMB | USD |
+-----+-----+-----+-----+-----+
| | | | | |
+-----+-----+-----+-----+-----+
| | | | | |
+-----+-----+-----+-----+-----+
| | | | | |
+-----+-----+-----+-----+-----+
| | | | | |
+-----+-----+-----+-----+-----+
The top headers are just the dates for the month, and then their respective data for that date. Is there a way to nicely transfer this to an Oracle table?
EDIT: Date field is an actual date such as 02/19/2018.
If you pre-create a table (as I do), then you can start loading from the 3rd line (i.e. skip the first two), putting every Excel column into the appropriate Oracle table column.
Alternatively (and obviously), rename the column headers so that the file doesn't have two header levels.
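If you end up scripting the load, a rough pandas sketch of flattening the double header first might look like this (assuming the layout shown above, with Item in the first column and date/currency pairs after it; the file name and target column names are just placeholders):

import pandas as pd

df = pd.read_excel('report.xlsx', header=[0, 1], index_col=0)
df.index.name = 'item'

# Unpivot the (date, currency) column pairs into rows; the result maps onto a
# pre-created Oracle table with columns ITEM / PRICE_DATE / CURRENCY / AMOUNT.
long = df.stack([0, 1]).reset_index()
long.columns = ['item', 'price_date', 'currency', 'amount']
long['price_date'] = pd.to_datetime(long['price_date'])

# 'long' can then be loaded row by row, e.g. with cx_Oracle executemany().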

ALV display one column like two

Is it possible with cl_gui_alv_grid to make two columns with the same header?
Suppose I want to display data like this :
| Tuesday | Wednesday | Thursday |
|---------------|---------------|---------------|
| Po | Delivery | Po | Delivery | Po | Delivery |
|----|----------|----|----------|----|----------|
| 7 | 245.00 | 4 | 309.00 | 12 | 774.00 |
| 4 | 105.00 | 2 | 88.00 | 3 | 160.00 |
| 10 | 760.00 | 5 | 291.00 | 20 | 1836.00 |
...
I can think of two solutions for this, but I don't know if either is possible.
First solution: make two levels of field catalog, with three columns in the first level and six in the second.
Second solution: make a field catalog with three columns and concatenate the two values under each column.
Thanks.
There is a workaround described on a German site, which involves inheriting from cl_gui_alv_grid and overriding some crucial methods so that it can merge cells. The source is a well-known and appreciated German ABAP page, but it is in German; hopefully a translation engine can render it properly for you. It looks like a step in the right direction, although it seems you have to fix all columns for it (or at least those with merged cells).
Please refer to this and tell me if it helped:
Merge cells of alv-grid

Django field widget doesn't show appropriate attribute

I'm using Django and this is a question on how to organize your models, or equivalentely, organize tables in SQL.
At the moment I have a table where each row contains a primary key, a "value" (a float multiple of 0.01) and an "amount" (integer). This is how I need this data.
However, I need to serve it differently: I need to sum the "amount"s over rows with the same "value".
Example, my table is
| id | value | amount |
| 1 | 1.2 | 10 |
| 2 | 1.2 | 27 |
| 3 | 1.2 | 4 |
| 4 | 1.3 | 21 |
| 5 | 1.3 | 1 |
| 6 | 1.4 | 5 |
| 7 | 1.4 | 9 |
For my app I need to serve this as
| value | amount |
| 1.2 | 41 |
| 1.3 | 22 |
| 1.4 | 14 |
Now my question is: what is the best way to do this? Should I generate the second table from the first every time I need to serve it? Or should I add a new model to my app that gets updated every time my current model gets updated, containing redundant information but getting the job done faster?
EDIT:
qb = Order.objects.filter(
    models.Q(status='B') | models.Q(status='K')
).filter(
    side='L', market__pk=self.pk
).order_by(
    '-value'
).values('value').annotate(amount_sum=Sum('amount'))
The output is
[{'amount_sum': 22, 'value': Decimal('1.3')}, {'amount_sum': 41, 'value': Decimal('1.2')}]
from django.db.models import Sum
MyTable.objects.values('value').annotate(amount_sum=Sum('amount'))
This will return a QuerySet of dictionaries, each containing value and amount_sum. You can name amount_sum whatever you like.
Django doc for Sum