Single Column name splitting to multiple columns with data - pandas

I am analyzing the inverter data from a power plant. There are more than 10 inverters, and each inverter has 3 parameters that need to be analyzed: the energy generated per interval, the AC power P_AC and the DC power P_DC. The inverters are numbered as 17.02 or 22.03, etc. The data is taken at a time step of 5 minutes. After downloading the data as a csv file, there is only 1 column in the file. The column name contains the numbers of all the inverters and their parameter names separated by ';'. Likewise, the data at each time step sits in 1 single cell, separated by ';'. I want to analyse all the parameters of all the inverters, and I want to make sure that each parameter of every inverter ends up in a separate column. Can somebody help me segregate this? Also, I want to ensure that the columns are sorted in increasing order of inverter numbering. I am attaching the link to the actual csv file - https://drive.google.com/file/d/1Rp54DEarzFUGm2oU5Bfkl3karbUYYwcd/view?usp=sharing
https://drive.google.com/file/d/12InL3N-ZMMODGWVUYn_8nTwPgAQtSBzq/view?usp=sharing
In the data frame above, you can see that every column has a project code - 'SM10046 Akadyr Ext', then the inverter number 'INV 17.02', then the name of the parameter 'Energy generated per interval [kWh]' and lastly the code of the parameter 'E_INT'. I want the project code removed so that only the inverter number and parameter code are present in the column name. Also, all the inverters should come in serial order.
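Since the header and every data cell use ';' as the separator, a minimal pandas sketch of the requested split, rename and sort could look like this (the filename and the exact header text are assumptions based on the description above):

import re
import pandas as pd

# Reading with sep=';' already splits the single column into many.
# 'inverter_data.csv' is a hypothetical filename.
df = pd.read_csv("inverter_data.csv", sep=";")

def clean(col):
    # Assumed header layout, e.g.
    # 'SM10046 Akadyr Ext INV 17.02 Energy generated per interval [kWh] E_INT'
    # -> 'INV 17.02 E_INT' (drop the project code and the long parameter name).
    m = re.search(r"INV (\d+\.\d+).*?(\w+)$", col)
    return f"INV {m.group(1)} {m.group(2)}" if m else col

df.columns = [clean(c) for c in df.columns]

# Sort inverter columns numerically (17.02 before 22.03), keeping any
# non-inverter column such as the timestamp first.
def sort_key(col):
    m = re.search(r"INV (\d+)\.(\d+)", col)
    return (int(m.group(1)), int(m.group(2)), col) if m else (-1, -1, col)

df = df[sorted(df.columns, key=sort_key)]

Reading with sep=';' does the actual splitting; the rest is just cleaning the headers and reordering the columns.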

Essentially you have a multitude of columns, and from your description, you need to sort/analyze data from each plant?
If you need permanent storage of the data, I would use SQLite or similar, and convert each plant's readings into rows, with a key column holding the plant ID.
Like this:
2020-07-28 13:33:09;A1;A2;A3;B1;B2;B3
turned into something like this (now in a database, 5 fields per record):
2020-07-28 13:33:09;A;A1;A2;A3
2020-07-28 13:33:09;B;B1;B2;B3
My go-to tool for this would be a scripting language like AutoIt3, Perl or Python, which makes splitting lines and connecting to SQLite trivial.
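As a minimal Python sketch of that normalization (the plant names, the three-values-per-plant layout and the filenames are assumptions for illustration):

import sqlite3

# 'plants.db' and 'data.csv' are placeholder names.
conn = sqlite3.connect("plants.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings "
             "(ts TEXT, plant TEXT, v1 TEXT, v2 TEXT, v3 TEXT)")

plants = ["A", "B"]  # hypothetical plant IDs, in column order
with open("data.csv") as f:
    for line in f:
        fields = line.strip().split(";")
        ts, values = fields[0], fields[1:]
        # Each plant owns the next three values on the line.
        for i, plant in enumerate(plants):
            conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
                         (ts, plant, *values[3*i:3*i+3]))
conn.commit()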
If you just need real-time sorting/reporting etc., AWK is a perfect tool for this, since you can create sorted arrays very easily (Perl/Python are of course alternatives here as well).
It would be useful if you provided an actual (trivial) example of what you expect the output to be.

Related

Separating columns ( array of arrays) - Advanced SQL looping

I tried using a name that more accurately describes my question, but the message said I am limited to 150 characters.
Looking for assistance from someone who has advanced SQL skills. Ideally I want to do it in SQL to let the computer do the work; too much manual manipulation is rife with the possibility of mistakes.
I've already searched for users groups within Google. All emails are being returned saying the email does not exist anymore.
What I am using appears to be a proprietary version of Dremel SQL / Google SQL, however, someone experienced in Dremel SQL will probably be able to guide me in the right direction.
BACKGROUND INFO:
Pulling a column that is an array column which holds another array (a notes column). I think maybe an array of arrays?
I have not figured a way to do what I am trying to do with Google or Dremel SQL yet.
So for now, I am doing it the hard way.
As originally pulled, the data looks like this: [{array of arrays}, {array of arrays}, {array of arrays}, etc., repeating].
More specifically: [{4 or more text fields, which could also hold numbers, separated by commas}, {another set of fields}, {another set of fields}...]
I.E. (this is all in just one column of data and hundreds of rows)
[
{"created":"1540236216969","notes": blah... blah... blah", "original_text_length":534, "User_email":"someone#emailaddress.com","user_shortname":"someone"},
{"created":"1540236216969","notes": blah... blah... blah", "original_text_length":1224, "User_email":"someone#emailaddress.com","user_shortname":"someone"},
{"created":"1540236216969","notes": blah... blah... blah", "original_text_length":1664, "User_email":"someone#emailaddress.com","user_shortname":"someone"}
...
]
The number of these differs for each row pulled, and each row has a specific ID #.
A typical row of data is:
ID #, start_date, end_date, some other fields, notes_(the array field)
WHAT I AM DOING NOW is:
pulling the data with SQL,
exporting to Google Sheets,
making separate tabs for the different array columns,
copying the notes column (the array column holding arrays) to a separate tab on Google Sheets, then
splitting text to columns using the first curly brace "{" as the separator.
Here is where my dilemma is.
Once pulled, I need to split all of those columns again to separate the individual elements in each array. I am unable to Split Text to Columns again with all of them highlighted. I can Split Text to Columns one column at a time, but that will really be a pain if I have to do it individually for each column and every row (hundreds of rows). I need to find a way to automate this.
I will also need to change each of the Unix dates to calendar dates within each array, PLUS add rows to the spreadsheet depending on the number of columns from the first split. The columns are different for each row depending on how many notes have been added.
OR... do it with SQL (which appears to be a proprietary type of SQL, similar to NoSQL but not the same). I have tried the syntaxes for IBM SQL, Oracle SQL, SQL Server, and others found online, but none work.
OR... do it with a looping function within Google Sheets.
Possibly re-add it to the database as a new table once both sets of arrays are completely split up.
END RESULT
ID#, date 1, date 2, first created date (right now a Unix date), first note, first other field, etc...
Then add a new row with:
same ID# from above, date 1 from the row above, date 2 from the row above, the next (2nd) created date (right now a Unix date), 2nd note, 2nd other field, etc...
Add a new row...
3rd set of notes etc.
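In BigQuery-style SQL this kind of flattening is usually done with UNNEST over the array column, but if the export has to stay as it is, a minimal Python sketch of the same flattening could look like this (the filenames, and any column names not shown in the sample above, are assumptions):

import csv
import json
from datetime import datetime, timezone

# 'export.csv' and the id/start_date/end_date column names are assumptions.
with open("export.csv", newline="") as src, \
     open("flattened.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["id", "start_date", "end_date",
                     "created", "note", "user_email"])
    for row in reader:
        # The notes column holds a JSON array of objects.
        for note in json.loads(row["notes"]):
            # 'created' is a Unix timestamp in milliseconds.
            created = datetime.fromtimestamp(
                int(note["created"]) / 1000, tz=timezone.utc)
            writer.writerow([row["id"], row["start_date"], row["end_date"],
                             created.isoformat(), note["notes"],
                             note["User_email"]])

Each input row then produces one output row per note, with the ID and dates repeated, which matches the END RESULT layout above.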

How can I map each specific row value to an ID in Pentaho?

I’m new to Pentaho and I’m currently having an issue with mapping specific row values to an ID.
I have a data file with around 30 columns, one of which is for currencies (USD, GBP, AUD, etc).
The main objective is to have the user select up to 8 (minimum of 1) currencies and map them to a corresponding ID 1-8. All other currencies not in the specified 8 will be mapped with an ID of 9.
The final step is to then output the original data set, along with the IDs.
I’m pretty sure I’m making this way harder than it should be, but here is what I have at the moment.
I have created a job where the first step is to set the variables for my 8 currencies, selectionOne -> AUD, selectionTwo -> GBP, …, selectionEight -> JPY.
I then have a transformation to read the data from the file and use the copy rows to result step.
Following that I have a second job called for-each which is my loop for checking the current currency in the row.
Within this job I have two transformations, one called set-current, one called map-currencies.
set-current simply uses the get rows from result step (to grab the data from the first transformation). I then use the set variable step to set the current currency to the value in the currency field. This works fine, as each pass through the loop changes the current variable to the correct value.
Map-currencies is where I’m having the most issues.
The goal is to use the filter row step to compare the current currency against the original 8 selected currencies, and then using the value mapper step to map it to an ID, before outputting the csv file.
The main issue here, is that I can’t use my original variables in the filter or value mapper.
So, what I’ve done here is use the get variables step to retrieve the variables and name them: one, two, three, …, eight. This allows me to bypass the filtering issue, but they don’t seem to work for the value mapper, which is the all-important step.
The second issue is that when the file is output it only outputs one value (because of the loop), selecting the append option works, but this could be a problem if the job is run more than once.
However, the priority here is the mapping issue.
I understand that this is rather long, and perhaps a tad confusing, but I will greatly appreciate any help on this, even if it’s an entirely new approach 😊.
Like I said, I’m probably making it harder than it should be.
Thanks for your time.
Edit for AlainD
Input example
Output example
This should be doable in a single transformation using the Stream Lookup step.
Text File Input is your main file, Property input reads your property file into Key and Value columns. You could use a normal text file with two columns instead of the property input.
Below are the settings of the Stream lookup. Note the default value "9" for records that are not found in the lookup stream.
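Purely to illustrate the lookup-with-default logic that the Stream Lookup step implements, the equivalent in Python is a dictionary lookup with a fallback (the currency codes here are hypothetical examples):

selected = ["AUD", "GBP", "USD", "EUR", "JPY", "CAD", "CHF", "NZD"]
currency_to_id = {code: i + 1 for i, code in enumerate(selected)}

def map_currency(code):
    # Anything outside the selected eight falls back to ID 9.
    return currency_to_id.get(code, 9)

print(map_currency("GBP"))  # 2
print(map_currency("SEK"))  # 9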

Dynamically creating a pivot table using fuzzy matching

So, I'm constantly being given data in new and different formats. I'm on a crusade to get my workplace to standardize data for easy use, and if I manage to convince the powers that be to standardize data, this problem becomes entirely moot. Until then, I have the following problem:
I get data in a variety of ways. Sometimes my gross sales are called total sales. Sometimes gross sales before discounts, total sales before discounts, Gross_Sales, etc. Discounts, deductions, exempt amounts, etc. form another column. So on and so forth. I'd like to be able to do the following:
1) Figure out what columns I want,
2) Turn those columns into a pivot table.
For part 1, I have two options, and I'm wondering if there are any more. The first is to use Microsoft's fuzzy-matching add-in to help me match; I'd have a separate tab dedicated to fuzzy matching each column I need. The second is to just generate a long list of all the variants, and to test each one until I find a hit, assign it, and move on to testing the next one.
The second part is turning all of this into a pivot table - the resources I have so far are https://www.thespreadsheetguru.com/blog/2014/9/27/vba-guide-excel-pivot-tables and How to Create a Pivot Table in VBA
Is there a better method? Is there another way?
Edit: Slightly better method - grab the data columns, place them into a table, and pivot everything off of that table - it removes the need to re-create pivot tables; I just need to move the data over.
Having the same problem, I use a mix of your two methods.
My data consists of a bunch of logs for rejected x-ray images, and the reject reason is a free text field. My solution was to create a table where the first column contains my desired output categories, and then each subsequent column contains a different variation of it.
For example, a row might look like this (column one, the output, is the first entry):
Positioning, POS, Positioning Error, Patient Positioning
Note that these are all fairly different from each other. Here is where the fuzzy matching comes in - it is used to capture all the smaller differences and misspellings around those other columns. When the fuzzy matching section decides a given reason matches a column's entry, it is replaced with the appropriate desired output reason from column 1 of the table. In my example, a reason of 'Possitioning Err' [sic] would match column 3 (Positioning Error) and then get converted to Positioning.
Then wash, rinse, repeat over the rest of your data as needed. This approach was super useful and fairly flexible in helping standardize my data. It was also computationally more expensive, but you'd only need to run the matching portion once, I guess.
As for the actual mechanics of going about doing this - I use Excel 2010, so there is no built-in functionality. I run the fuzzy matching code on a temporary worksheet until the best percentage matches are found, and then overwrite the actual source data afterwards.
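For readers outside VBA, a minimal Python sketch of the same variant-table-plus-fuzzy-matching approach (the categories, the variant lists beyond the Positioning row above, and the 0.8 cutoff are illustrative and would need tuning):

import difflib

# Categories and variants here are illustrative examples.
variants = {
    "Positioning": ["Positioning", "POS", "Positioning Error",
                    "Patient Positioning"],
    "Exposure": ["Exposure", "EXP", "Over Exposure", "Under Exposure"],
}
# Flatten the table into variant -> desired output category.
lookup = {v: cat for cat, vs in variants.items() for v in vs}

def standardize(reason):
    match = difflib.get_close_matches(reason, list(lookup), n=1, cutoff=0.8)
    return lookup[match[0]] if match else reason

print(standardize("Possitioning Err"))  # -> Positioning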

Summing different parts of a column in SQL

I have a database extract in Excel and want to create a custom value in Tableau using their calculated field feature, which I believe is SQL based.
Basically I have a large number of feeds, which all show up a different number of times in a column. For example:
feed 1
feed 1
feed 2
feed 3
feed 4
feed 4
feed 4
And I want to have a sum for feed 1, feed 2, and feed 4. But in my actual DB there are about 100 feeds, all with a different number of appearances. I'm having trouble finding a good way to do this, if there even is one. Any help or direction would be appreciated!
I'm assuming that your list is a single column and you need a count of the number of occurrences of each feed. For the sake of example, since column and table names were not supplied, let's call them colname and tablename.
SELECT colname, COUNT(*) AS Ct FROM tablename GROUP BY colname
It would be easier to give an exact answer if you posted a small simplified subset of your spreadsheet. But assuming you have a column called "feed_name", which takes on values like "feed 1", "feed 2", etc. depending on the row, then the feed_name column should be a discrete dimension in Tableau.
Then just put the feed_name pill on a shelf, say the row shelf. And put the "Number of Records" field on another shelf, say the column shelf.
You don't need to write SQL to do this (or most tasks) in Tableau. It helps to understand SQL concepts, and it's very helpful to drop down to the SQL level when needed to solve tricky issues. But for most situations, you can just interactively explore the data by moving fields around and writing some simple calculations - and let Tableau take care of generating the SQL necessary to retrieve the data needed to build the visualization you requested.
Tableau supports SQL and some NoSQL data sources, along with some cubes too. It does that quite well and in multiple ways. You can just work more quickly and efficiently by using Tableau's visual-based manipulations in most cases, and then drop to the lower level of detail when needed. It just takes getting used to how Tableau operates.

SSIS - Column names as Variable/Changed Programmatically

I'm hoping someone might be able to help me out with this one - I have 24 files in CSV format, they all have the same layout and need to be joined onto some pre-existing data. Each file has a single column that needs to be joined onto the rest of the data, but those columns all have the same names in the original files. I need the columns automatically renamed to the filename as part of the join.
The final name of the column needs to be: Filename - data from another column.
My current approach is to use a foreach container and use the variable generated by the container to name the column, but there's nowhere I can input that value in the join, and even if I did, it'd mess up the output mappings, because the column names would be different.
Does anyone have any thoughts about how to get around these issues? Whoever has an idea will be saving my neck!
EDIT In case some more detail helps with this... SSIS version is 2008 and there are only a few hundred rows per file. It's basically a one time task to collect a full billing history from several bills which are issued monthly.
The source data has three columns, the product number, the product type and the cost.
The destination needs to have 24*3 columns, each of which has a monthly cost for a given product category. There are three product categories and 24 bills (in separate files), hence 24*3.
So hopefully I'm being a bit clearer - all I really need to know is how to change the name of a column using a variable passed in from the foreach file container.
I think the easiest approach is to create a temporary database (aka a staging DB), load the data from the files into it, and define stored procedures where you can pass params (e.g. file names) to build your own logic...
Cheers, Mario
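For comparison, outside SSIS the rename-by-filename join described in the question is a few lines of pandas (the file pattern and the column names are assumptions based on the edit above):

import glob
import os
import pandas as pd

# 'bills/*.csv' and the product_number/product_type/cost column names
# are assumptions.
merged = None
for path in sorted(glob.glob("bills/*.csv")):
    month = os.path.splitext(os.path.basename(path))[0]
    df = pd.read_csv(path)
    # Rename the shared cost column so it carries the source filename.
    df = df.rename(columns={"cost": f"{month} - cost"})
    merged = df if merged is None else merged.merge(
        df, on=["product_number", "product_type"], how="outer")

merged.to_csv("billing_history.csv", index=False)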