How to transpose cell data by section in OpenRefine?

I have a data table that looks like this:
Name | Date-Freq | Date-Amount | Date-Freq | Date-Amount
A | 4 | 3000 | 8 | 9000
B | 5 | 4000 | 9 | 7000
C | 6 | 5000 | 10 | 8000
and I want it to look like this:
Name | Date | Freq | Amount
A | July 2014 | 4 | 3000
A | Aug 2014 | 8 | 9000
B | July 2014 | 5 | 4000
B | Aug 2014 | 9 | 7000
C | July 2014 | 6 | 5000
C | Aug 2014 | 10 | 8000
What is the best way to do something like this? Should I just create two new columns?

Yes, you can do this in OpenRefine, but it takes several steps, including faceting and adding new columns.
You will need a combination of the following:
1. Always work in Records mode (not Rows mode) for any kind of merging work.
2. Transpose cells across columns into rows (into one MERGE column, with the column name prepended).
3. Move your Name column to the beginning.
4. Fill down on your Name column (it is now blank in some cells after the transpose; you may need to fill down again after any later transpose or merge).
5. Use a custom text facet with value.startsWith("Amount") etc.
6. Use "Add column based on this column" to create new columns from the MERGE column.
7. Move columns as necessary to do step 2 again.
8. Repeat steps as necessary.
Here is an example OpenRefine project showing the beginning of what happens after the first pass through the steps above (use Undo/Redo to step through it; it doesn't show the facets, though):
OpenRefine Project with Transpose across cells into one column
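
For comparison, the target layout can be sketched outside OpenRefine in a few lines of Python. This is only an illustration of the reshape, not what OpenRefine does internally; the function name `unpivot` is made up, and the date labels "July 2014" and "Aug 2014" are taken from the desired output in the question, since the header row does not contain them.

```python
def unpivot(rows, dates, width=2):
    """Turn one wide row per name into one row per (name, date).

    Each row is (name, v1, v2, ..., vn); every block of `width` values
    belongs to the date at the same position in `dates`.
    """
    out = []
    for name, *values in rows:
        for i, date in enumerate(dates):
            chunk = values[i * width:(i + 1) * width]
            out.append((name, date, *chunk))
    return out


# Sample data from the question: Name, then (Freq, Amount) per date.
wide = [
    ("A", 4, 3000, 8, 9000),
    ("B", 5, 4000, 9, 7000),
    ("C", 6, 5000, 10, 8000),
]
long_rows = unpivot(wide, ["July 2014", "Aug 2014"])

for row in long_rows:
    print(" | ".join(str(cell) for cell in row))
```

Each output row holds Name, Date, Freq and Amount, which is the same shape the transpose, fill-down and new-column steps are working towards.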

SQL Server: create rows with info from multiple tables having the same column name

I am doing an integration with a customer's ERP. The database follows a convention: columns that share a name across tables must have the same data type.
With this premise, I would like to write a SQL statement or a stored procedure that pulls data from several source tables in a given order, always matching on column names, into 2 target tables. As it is highly probable that the ERP vendor will add new columns without notifying my department, I need the columns to be obtained dynamically.
All this is to generate a single record in a table (in this case, the head data of a purchase to a supplier), and several rows in another table (the items of the purchase).
My idea is to have an auxiliary table where I put the information coming from my system, and then, execute that SQL/procedure to consolidate the information into the ERP purchase tables.
Let's take an example.
My tables would have information similar to this
(Purchase header)
ExternalOrderId | SupplierCode | PurchaseDate | PurchaseStatus | FiscalYear | Series
--------------------------------------------------------------------------------------------
ABCD | 00001 | 2021-12-11 12:00:00 | DRAFT | 2021 | S
(Purchase items)
ExternalOrderId | ArticleCode | ItemOrder | Units
--------------------------------------------------
ABCD | 1234 | 1 | 2
ABCD | 2345 | 5 | 4
ABCD | 3456 | 10 | 10
ABCD | 1234 | 15 | 3 (very important, same article can be repeated multiple times in one purchase)
.....
ABCD | 9999 | 100 | 10
A very important step is to take the fiscal year, series and number from a table of counters. The counter should be incremented after the process.
Example of table "Counters" (note that there may be several numbers for one type, depending on the series and the fiscal year):
Type | FiscalYear | Series | LastNumber
----------------------------------------------------
SupplierPurchase | 2021 | S | 26
SupplierPurchase | 2021 | A | 60
SupplierPurchase | 2021 | B | 15
SaleOrder | 2021 | S | 19
SaleOrder | 2021 | X | 200
Table "Accounting data".
SupplierCode | AdditionalColumn1 | AdditionalColumn2 | AdditionalColumn3
-------------------------------------------------------------------------
00001 | AC1A | AC2A | AC3A
Table "Company data".
SupplierCode | AdditionalColumn2 | AdditionalColumn3 | AdditionalColumn4
-------------------------------------------------------------------------
00001 | AC2B | AC3B | AC4B
Table "Supplier data".
SupplierCode | AdditionalColumn3 | AdditionalColumn5
-----------------------------------------------------
00001 | AC3C | AC5C
In this case the result should be something like this: for columns with the same name, the data coming from the last table read should be kept. For example, AdditionalColumn1 will have the value from the first table (AC1A) because it is the only table with that column name, and AdditionalColumn3 will have the value from the last one (AC3C).
The final result should look something like this:
Purchase Header
FiscalYear | Series | Number | SupplierCode | AdditionalColumn1 | AdditionalColumn2 | AdditionalColumn3 | AdditionalColumn4 | AdditionalColumn5 | PurchaseStatus | PurchaseDate | ExternalPurchaseID
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2021 | S | 27 | 00001 | AC1A | AC2B | AC3C | AC4B | AC5C | DRAFT | 2021-12-11 12:00:00 | ABCD
Note that the purchase number is 27, because in the counters table the last number used for the series "S" was 26. After creating this row, the counter must be set to 27.
In the case of the purchase items, it would be the same, obtaining the data from:
The purchase header created in the previous step.
Data from the Articles table
Data from another table with additional information about the articles.
The data from the purchase items table that I generated earlier.
But in this case, instead of being a single record, it will be a record for each item that I reflect in my auxiliary table, matching the info by the item's "ArticleCode".
I could do all this through programmed code, but I would like to abstract from the programming language and include all this in the database logic, to make a very fast, transactional process that can be retried in case of failure. Besides, as I said, they will be dynamic columns, since the ERP provider will be able to create new columns. In this way, I will not have to worry about having to escape the information of possible unicode characters and I will be sure that the data types are respected at all times.
It would be nice if I could get a boolean flag set on my auxiliary table to indicate that the purchase has been consolidated correctly.
Thanks in advance
EDIT
As @JeroenMostert said in one response, this question is too vague. The purpose of my question is to know how to take the column names of a table A (obtained, for example, from INFORMATION_SCHEMA.COLUMNS) and use them in a query, but only those that also exist in a table B, and to do this several times with several tables so that I can generate the header of the purchase. I would then use the same process (and the resulting data) to generate the purchase rows.
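
The column-intersection idea in the edit above can be sketched as follows. This is a minimal illustration under several assumptions, not a SQL Server solution: it uses Python's built-in sqlite3 module, `PRAGMA table_info` stands in for a query against INFORMATION_SCHEMA.COLUMNS, and the table names (AccountingData, CompanyData, SupplierData, PurchaseHeader) and both function names are hypothetical. The sample values are the ones from the question.

```python
import sqlite3


def column_names(conn, table):
    """Column names of a table (SQLite stand-in for INFORMATION_SCHEMA.COLUMNS)."""
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def merge_shared_columns(conn, target, sources, key, key_value):
    """Collect values for columns present in both a source table and the target.

    Sources are read in order, so a later table overwrites an earlier one.
    """
    target_cols = set(column_names(conn, target))
    merged = {}
    for table in sources:
        shared = [c for c in column_names(conn, table) if c in target_cols]
        if not shared:
            continue
        col_list = ", ".join(f'"{c}"' for c in shared)
        row = conn.execute(
            f'SELECT {col_list} FROM "{table}" WHERE "{key}" = ?', (key_value,)
        ).fetchone()
        if row is not None:
            merged.update(zip(shared, row))
    return merged


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AccountingData (SupplierCode TEXT, AdditionalColumn1 TEXT,
                             AdditionalColumn2 TEXT, AdditionalColumn3 TEXT);
CREATE TABLE CompanyData    (SupplierCode TEXT, AdditionalColumn2 TEXT,
                             AdditionalColumn3 TEXT, AdditionalColumn4 TEXT);
CREATE TABLE SupplierData   (SupplierCode TEXT, AdditionalColumn3 TEXT,
                             AdditionalColumn5 TEXT);
CREATE TABLE PurchaseHeader (FiscalYear INTEGER, Series TEXT, Number INTEGER,
                             SupplierCode TEXT, AdditionalColumn1 TEXT,
                             AdditionalColumn2 TEXT, AdditionalColumn3 TEXT,
                             AdditionalColumn4 TEXT, AdditionalColumn5 TEXT,
                             PurchaseStatus TEXT, PurchaseDate TEXT,
                             ExternalPurchaseID TEXT);
INSERT INTO AccountingData VALUES ('00001', 'AC1A', 'AC2A', 'AC3A');
INSERT INTO CompanyData    VALUES ('00001', 'AC2B', 'AC3B', 'AC4B');
INSERT INTO SupplierData   VALUES ('00001', 'AC3C', 'AC5C');
""")

header = merge_shared_columns(
    conn,
    "PurchaseHeader",
    ["AccountingData", "CompanyData", "SupplierData"],
    key="SupplierCode",
    key_value="00001",
)
print(header)
```

The same shape should carry over to T-SQL by building the column list from INFORMATION_SCHEMA.COLUMNS, quoting each name with QUOTENAME, and running the generated statement with sp_executesql inside a transaction. The counter increment and the purchase items are not covered by this sketch.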

Select Value based on Multiple Value Range in SQL

I have multiple criteria for giving incentives to my employees, for example as shown in the image below.
The grid table is dynamic; it keeps changing based on business conditions.
I have a table of employee IDs for which I have calculated the Resolution % and the Normalization %. Now I need to assign each employee an incentive % based on the above grid, using a SQL query.
Output table in which I need to update the incentives:
I assume the grid table is also stored as a database table (so you can update it):
+-----------------+---------------+--------------------+------------------+-----------+
| INCENTIVES |
+-----------------+---------------+--------------------+------------------+-----------+
| from_resolution | to_resolution | from_normalization | to_normalization | incentive |
+-----------------+---------------+--------------------+------------------+-----------+
| 0 | 70 | 0 | 5 | 9 |
| 0 | 70 | 5 | 10 | 11 |
| 0 | 70 | 10 | 100 | 13 |
| 70 | 75 | 0 | 5 | 10 |
... I hope you get the idea
+-----------------+---------------+--------------------+------------------+-----------+
And the update query can be:
update employee E
   set E.incentive = (select I.incentive
                        from incentives I
                       where E.resolution >= I.from_resolution
                         and E.resolution < I.to_resolution
                         and E.normalization >= I.from_normalization
                         and E.normalization < I.to_normalization)
UPDATE: the TO values are exclusive, that is, they are not part of the range. Setting each TO value equal to the FROM value of the next range ensures that all values (including floating-point ones) are covered. Thanks to Gordon.
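
The half-open range lookup (FROM inclusive, TO exclusive) can be checked with a small Python sketch. The grid rows are the sample rows from the table above, with the fourth range starting at 70 so that it is contiguous with the previous one, as the update note describes. The names `GRID` and `incentive_for` are illustrative only.

```python
# (from_resolution, to_resolution, from_normalization, to_normalization, incentive)
GRID = [
    (0, 70, 0, 5, 9),
    (0, 70, 5, 10, 11),
    (0, 70, 10, 100, 13),
    (70, 75, 0, 5, 10),
]


def incentive_for(resolution, normalization, grid=GRID):
    """Return the incentive of the first grid row whose half-open ranges
    [from, to) contain both values, or None if no row matches."""
    for from_res, to_res, from_norm, to_norm, incentive in grid:
        if from_res <= resolution < to_res and from_norm <= normalization < to_norm:
            return incentive
    return None


print(incentive_for(65, 4.99))  # falls in the 0-5 normalization range
print(incentive_for(65, 5))     # a boundary value belongs to the next range
print(incentive_for(70, 3))     # resolution 70 belongs to the 70-75 range
```

Because every boundary value belongs to exactly one range, no value can match two rows and fractional values such as 69.5 cannot fall into a gap between ranges.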

How can I get a record to be counted in multiple columns of a Crosstab Query?

Background information:
My company requires employees to maintain at least one certification (cert) on a position. There are a total of 17 different certifications that an employee can get.
An employee can hold multiple certs. But on any one day they can only "sit" one of the positions that they are certified in. Most employees primarily sit the highest level position that they hold a cert in, but can sit a lower level position if there are manning shortages in that position and if they hold that particular cert (some employees come to us holding the higher level certs but none of the lower ones because they let them expire).
Multiple employees can hold the same cert.
Around 90% of employees are on contract, meaning they have a set termination date. Contracts can be extended but for the sake of this Access database, and the report to be generated, we're presuming that the termination date is set in stone.
My boss (and boss' boss) are wanting to put together a manning projection report so that they don't get caught off guard should we start running low on employees certified in any one position.
Example of what they want:
Let's say you have three employees:
Employee1 has certs in position1, position2, and position3 but he primarily sits as position3 and his contract expires June 2020.
Employee2 has certs in position1 and position2 but primarily sits as position2 and her contract expires in February 2022.
Employee3 is new and arrived August 2019 and is in training to get position1, maximum allowed training time for initial cert is 3 months, so presumably he should have his position1 cert by December 2019 and his contract expires August 2025.
Let's say my boss wants to project out 12 months with the starting month being November 2019 (he'll only be able to select a starting month-year that is equal to or later than the current month-year). The charts below, which are generated in subreports, should be what gets generated off of the above employee information.
All Certifications Chart
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Cert | Nov 19 | Dec 19 | Jan 20 | Feb 20 | Mar 20 | Apr 20 | May 20 | Jun 20 | Jul 20 | Aug 20 | Sep 20 | Oct 20 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Position1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
| Position2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
| Position3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Primary Certifications Chart
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Cert | Nov 19 | Dec 19 | Jan 20 | Feb 20 | Mar 20 | Apr 20 | May 20 | Jun 20 | Jul 20 | Aug 20 | Sep 20 | Oct 20 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| Position1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Position2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Position3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Now I already have a solution in place but it's extremely inefficient and involves a query for each cell (2 Charts X 12 Months X 17 positions = 408 Queries when a report is generated). I'm hoping to do something more efficient with a crosstab query.
The tables are set up as such (only listing relevant fields):
Emp_table
ID (autoNum)
contractStarted (Date)
contractEnd (Date)
Cert_individual
ID (autoNum)
certID (num, many->one relationship to cert_table.ID)
EmpID (num, many->one relationship to Emp_table.ID)
date_cert_received (date)
primary (yes/no)
cert_table
ID (autoNum)
cert_name (short text)
Obviously I'd need a couple of INNER JOINs to get everything together. I tried using the format from this website for my crosstab query, but it only adds an individual cert to the count for the month-year in which the employee received it, not for every month in which the employee will hold the cert.
So my question is:
Is there a way in SQL or VBA to get a cert counted across multiple columns (month-years) based off of when the employee received the cert and when their contract is scheduled to terminate?
As far as I know, the main problem in getting the crosstab query is that it can only generate columns with data that you already have.
A solution for you to get the monthly columns would be to have a side table with the 12 dates and then use the Cartesian product to generate the monthly data for each of your records in your certification table. This "date" table can be updated and maintained to match the months that you require in your report with a query.
For example, if you have a table named TempDates :
And a table with Employees with the following data :
You can generate the cartesian product with a query that I named QryCertsDates :
SELECT Employees.*, TempDates.* FROM TempDates, Employees;
Which lets you attach all the wanted dates with your original date from the table Employees in order to obtain data similar to below :
Now you can generate your crosstab query pivoting on the month and year and filtering the dates with the WHERE criteria such as :
TRANSFORM Count(QryCertsDates.Cert) AS CountOfCert
SELECT QryCertsDates.Cert
FROM QryCertsDates
WHERE (((CDate([Yr] & "-" & [Mo])) Between CDate([Start]) And CDate([Expire])))
GROUP BY QryCertsDates.Cert
PIVOT CDate([Yr] & "-" & Format([Mo],"00"));
You will end up ultimately with something like this :
You can do the same thing to get your second table/report as well. I don't know your database structure, so you will most likely need to do some adaptation. The other possible way that you can achieve a similar result would be to fill in a table using VBA.
However, this might be the easier solution to implement. Good luck!
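
To make the Cartesian-product idea concrete, here is a small Python sketch that reproduces the "All Certifications" chart from the question. It rests on assumptions: the cert dates for Employee1 and Employee2 are not given in the question, so January 2019 is a hypothetical placeholder, and a cert is counted from the month it is received up to, but not including, the month the contract ends, because that is what the chart in the question shows for June 2020. The function names are made up.

```python
from collections import Counter
from itertools import product


def month_range(start_year, start_month, count):
    """Return `count` consecutive (year, month) tuples."""
    months = []
    year, month = start_year, start_month
    for _ in range(count):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def project(certs, months):
    """Pair every cert with every month, keep the pairs where the cert is
    held, and count them per cert and month."""
    counts = Counter()
    for (cert, received, contract_end), month in product(certs, months):
        if received <= month < contract_end:
            counts[(cert, month)] += 1
    names = sorted({cert for cert, _, _ in certs})
    return {name: [counts[(name, m)] for m in months] for name in names}


# (cert, month received, month the contract ends)
certs = [
    ("Position1", (2019, 1), (2020, 6)),   # Employee1, received date assumed
    ("Position2", (2019, 1), (2020, 6)),
    ("Position3", (2019, 1), (2020, 6)),
    ("Position1", (2019, 1), (2022, 2)),   # Employee2, received date assumed
    ("Position2", (2019, 1), (2022, 2)),
    ("Position1", (2019, 12), (2025, 8)),  # Employee3, cert expected Dec 2019
]

months = month_range(2019, 11, 12)
projection = project(certs, months)
for cert, row in projection.items():
    print(cert, row)
```

The crosstab query does the same thing in one pass: the cross join supplies every (cert, month) pair, the WHERE clause keeps the pairs in which the cert is held, and PIVOT spreads the counts across month columns. Note that Between is inclusive on both ends, so the end date may need adjusting if the final month should not be counted.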

Combine column x to n in OpenRefine

I have a table with an unknown number of columns, and I need to combine all columns after a certain point. Consider the following:
| A | B | C | D | E |
|----|----|---|---|---|
| 24 | 25 | 7 | | |
| 12 | 3 | 4 | | |
| 5 | 5 | 5 | 5 | |
Columns A-C are known, and the information in them correct. But column D to N (an unknown number of columns starting with D) needs to be combined as they are all parts of the same string. How can I combine an unknown number of columns in OpenRefine?
As some columns may have empty cells (the string may be of various lengths) I also need to disregard empty cells.
There is a two step approach to this that should work for you.
From the first column you want to merge (Col D in this case) choose Transpose->Transpose cells across columns into rows
You will be asked to set some options. You'll want to choose 'From Column' D and 'To Column' N. Then choose to transpose into One Column, assign a name to that column, and make sure the option to 'Ignore Blank Cells' is checked (it should be checked by default). Then click Transpose.
You'll get the values that were previously in cols D-N appearing in rows. e.g.
| A | B | C | D | E | F |
|----|----|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
Transposes to:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 4 |
| | | | 5 |
| | | | 6 |
You can then use the dropdown menu from the head of the 'new' column to choose
Edit cells->Join multi-value cells
You'll be asked what character you want to use to separate the values in the joined cell. In your use case you can probably delete the separator and combine the cells without any joining character. This will give you:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 456 |
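
The net effect of the two steps (transpose while ignoring blanks, then join with an empty separator) can be sketched in Python. This is only an illustration of the result, not OpenRefine's implementation, and `combine_from` is a made-up name.

```python
def combine_from(row, start):
    """Keep the first `start` cells and join all later non-empty cells
    into one string, with no separator."""
    merged = "".join(str(cell) for cell in row[start:] if cell not in (None, ""))
    return list(row[:start]) + [merged]


combined = combine_from([1, 2, 3, 4, 5, 6], 3)
print(combined)

# Rows from the question: blank trailing cells are ignored.
print(combine_from([24, 25, 7, "", ""], 3))
print(combine_from([5, 5, 5, 5, ""], 3))
```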

Return multiple rows before and after the match row based on time span in Excel/VBA

I have the following kind of data:
+---------------+-------------------------+----
| time | item | line index number
+---------------+-------------------------+----
| 05:00:00 | | 1
| 05:00:01 | MatchingValue | 2
| 05:15:00 | | 3
| 06:00:00 | B | 4
| 06:01:00 | | 5
| 06:45:00 | | 6
| 07:00:00 | MatchingValue | 7
| 07:15:00 | | 8
| 08:00:00 | | 9
| 09:00:00 | | 10
+---------------+-------------------------+
What I am trying to do is extract the rows before and after each matching row (where item == "MatchingValue"), together with the matching row itself. The returned rows should be within 15 minutes of the time of a row where item == "MatchingValue".
For example, if I was searching "MatchingValue" in the 2nd column, I would like to get the results of rows 1, 2, 3 and 6, 7, 8.
I know that one can return rows 2 and 7 at the same time by using an array formula (e.g. INDEX and MATCH), but I really don't know how to write an array formula for my own question.
I appreciate any assistance.
Easiest way is to add a helper column and filter your data in place or just use a pivot table to get only the data you need.
Formula in your helper column: =OR(B2="MatchingValue",COUNTIFS(B:B,"MatchingValue",A:A,">=" & A2-1/(24*4),A:A,"<=" & A2+1/(24*4))>0)
Here 1/(24*4) is 15 minutes expressed as a fraction of a day.
Of course, you could also write an array formula to collect your data in a new range, but given your already complex criteria and the variable number of results, that formula would be really complex.
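
The logic of the helper column can be cross-checked with a short Python sketch: a row is kept when at least one matching row lies within 15 minutes of it, bounds included. The sample rows are taken from the question, and `rows_near_matches` is an illustrative name.

```python
from datetime import datetime, timedelta


def rows_near_matches(rows, target, window=timedelta(minutes=15)):
    """Return the 1-based indexes of rows whose time is within `window`
    (inclusive) of any row whose item equals `target`."""
    parsed = [(datetime.strptime(t, "%H:%M:%S"), item) for t, item in rows]
    match_times = [t for t, item in parsed if item == target]
    return [
        index
        for index, (t, _) in enumerate(parsed, start=1)
        if any(abs(t - m) <= window for m in match_times)
    ]


rows = [
    ("05:00:00", ""),
    ("05:00:01", "MatchingValue"),
    ("05:15:00", ""),
    ("06:00:00", "B"),
    ("06:01:00", ""),
    ("06:45:00", ""),
    ("07:00:00", "MatchingValue"),
    ("07:15:00", ""),
    ("08:00:00", ""),
    ("09:00:00", ""),
]

selected = rows_near_matches(rows, "MatchingValue")
print(selected)
```

This returns rows 1, 2, 3 and 6, 7, 8, the result asked for in the question. Row 6 (06:45:00) is exactly 15 minutes before the 07:00:00 match and is included only because the comparison is inclusive, as it is in the COUNTIFS criteria.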