I have multiple CSV files, each with 18 columns of sensor data from a production cycle, ordered by time. Each CSV file represents one product (a smartphone) whose production was either successful (1) or unsuccessful (0). I converted each CSV to a DataFrame and collected them in a dictionary. The CSV files have different numbers of rows.
My question is whether I have to compress each DataFrame into one single row, with the result of either 1 or 0 at the end, in order to compare different machine learning algorithms (like multiple logistic regression). For my algorithm, each input is a DataFrame and the output is a label. Simply concatenating all the rows side by side into one single row would create feature vectors of different lengths.
For example: I have 7 CSV files converted into 7 DataFrames and combined into one DataFrame with 7 rows (a single row for every original DataFrame).
If I have to compress one DataFrame to a single row, could you tell me how to do so?
Or is it possible to tell the algorithm that it has to consider every row of a whole DataFrame (30,000 rows)?
Thank you very much!
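One common way to handle variable-length inputs like this is to compress each DataFrame into a fixed-length vector of per-column summary statistics, so every product yields a row of the same length no matter how many time steps it has. A minimal sketch, assuming each dictionary entry is a DataFrame of shape (n_rows, 18) with numeric columns (the names and data below are made up):

```python
import numpy as np
import pandas as pd

def compress(df):
    """Collapse a (n_rows, 18) sensor DataFrame into one fixed-length row
    of per-column summary statistics (mean, std, min, max)."""
    stats = [df.mean(), df.std(), df.min(), df.max()]
    return pd.concat(stats).to_numpy()  # length = 18 * 4 = 72, regardless of n_rows

# Hypothetical example: two "products" with different numbers of rows
frames = {
    "product_a": pd.DataFrame(np.random.rand(100, 18)),
    "product_b": pd.DataFrame(np.random.rand(250, 18)),
}
X = np.vstack([compress(df) for df in frames.values()])
print(X.shape)  # (2, 72) – same feature length for every product
```

Which statistics are informative depends on the sensors; the point is only that each DataFrame maps to one fixed-length row that a standard classifier can consume.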
I'm trying to extract 10 columns from a CSV file with pandas in Google Colab, so I'm using this line: firstten = data.iloc[:, 0:10]  # first ten columns of data frame with all rows
After that I used firstten.count() to count the number of values in each column. What I want to know is this: if one of those columns has a value that occurs 80 times and the counts in the other columns are lower, I want to find the highest such count across the ten columns. I used max() but it didn't work, so some help with this please. I don't need to know which column it is, just the number. Also keep in mind that in some pieces of data a count like 80 will be repeated (say, 4 times), so the solution should handle that. I also thought about using the sort function, but maybe there is another option.
Thanks for your help. This is the output that I would like to get:
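If I've read the question correctly, the highest per-value count across the ten columns can be obtained by taking value_counts() per column and then the maximum of those maxima. A sketch with made-up stand-in data:

```python
import pandas as pd

# Hypothetical data standing in for the real CSV
data = pd.DataFrame({
    "a": ["x", "x", "x", "y"],
    "b": ["p", "q", "r", "s"],
})
firstten = data.iloc[:, 0:10]  # first ten columns of data frame with all rows

# Highest number of occurrences of any single value, per column ...
max_per_column = firstten.apply(lambda col: col.value_counts().max())
# ... and the overall highest count across the ten columns
highest = max_per_column.max()
print(highest)  # 3 ("x" appears three times in column "a")
```

Because value_counts() counts occurrences of each distinct value, ties (the same count appearing several times) are handled naturally; max() just returns that shared maximum once.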
I am analyzing the inverter data from a power plant. There are more than 10 inverters, and each inverter has 3 parameters that need to be analyzed. The parameters are Energy generated per interval, AC Power P_AC and DC Power P_DC. The inverters are numbered as 17.02 or 22.03 etc. The data is taken at a time step of 5 minutes. After downloading the data in a CSV file, there is only 1 column in the file. The column name contains the numbers of all the inverters and their parameter names separated by a ';'. Also, the data at each time step is in 1 single cell separated by ';'. I want to analyse all the parameters of all the inverters, and I want to make sure that each parameter of every inverter comes in a separate column. Can somebody help me to segregate this? Also, I want to ensure that the columns are sorted in increasing order of inverter numbering. I am attaching the link to the actual CSV file - https://drive.google.com/file/d/1Rp54DEarzFUGm2oU5Bfkl3karbUYYwcd/view?usp=sharing
https://drive.google.com/file/d/12InL3N-ZMMODGWVUYn_8nTwPgAQtSBzq/view?usp=sharing
In the data frame above, you can see that every column has a project code - 'SM10046 Akadyr Ext', then the inverter number 'INV 17.02', then the name of the parameter 'Energy generated per interval [kWh]', and lastly the code of the parameter 'E_INT'. I want the project code to be removed so that only the inverter number and parameter code are present in the column name. Also, all the inverters should come in serial order.
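Since everything sits in one column separated by ';', pandas can do the splitting directly via read_csv's sep parameter, after which the column names can be cleaned and sorted. A sketch with a tiny two-line stand-in for the real file (the column names below are assumptions based on the description):

```python
import io
import pandas as pd

# Two-line stand-in for the downloaded file; the real one has many more
# columns and rows
raw = (
    "Timestamp;SM10046 Akadyr Ext INV 22.03 E_INT;SM10046 Akadyr Ext INV 17.02 E_INT\n"
    "2020-07-28 13:33:09;1.5;2.5\n"
)
df = pd.read_csv(io.StringIO(raw), sep=";")  # split the single column on ';'

# Drop the project code, keeping only inverter number and parameter code
df.columns = [c.replace("SM10046 Akadyr Ext ", "") for c in df.columns]

# Sort the data columns by inverter number (keep the timestamp first)
ordered = ["Timestamp"] + sorted(c for c in df.columns if c != "Timestamp")
df = df[ordered]
print(list(df.columns))  # ['Timestamp', 'INV 17.02 E_INT', 'INV 22.03 E_INT']
```

Plain lexicographic sorting works here because the inverter numbers share the same zero-padded format (17.02, 22.03, ...).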
Essentially you have a multitude of columns, and from your description, you need to sort/analyze data from each plant?
If you need permanent storage of the data, I would use SQLite or similar, and convert each plant into a row with a key holding the plant ID.
Like this:
2020-07-28 13:33:09;A1;A2;A3;B1;B2;B3
turned into something like this (now in a database, 5 fields per record):
2020-07-28 13:33:09;A;A1;A2;A3
2020-07-28 13:33:09;B;B1;B2;B3
My go-to tool for this would be a scripting language like AutoIt3, Perl or Python, which makes splitting lines and connecting to SQLite trivial.
If you just need real-time sorting/reporting etc., AWK is a perfect tool for this, since you can create sorted arrays very easily (Perl/Python are of course alternatives as well).
It could be useful if you provided an actual (trivial) example of what you expect the output to be.
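The wide-record-to-per-plant-rows conversion described above is a few lines in Python with the standard sqlite3 module. A sketch, assuming the layout from the example (timestamp followed by three values per plant; the plant IDs A and B and the table layout are assumptions):

```python
import sqlite3

# One wide record: timestamp;A1;A2;A3;B1;B2;B3  ->  one row per plant
line = "2020-07-28 13:33:09;A1;A2;A3;B1;B2;B3"
fields = line.split(";")
timestamp, values = fields[0], fields[1:]

con = sqlite3.connect(":memory:")  # use a file path for permanent storage
con.execute("CREATE TABLE readings (ts TEXT, plant TEXT, v1 TEXT, v2 TEXT, v3 TEXT)")

# Three values per plant; the plant key makes per-plant queries trivial
for i, plant in enumerate(["A", "B"]):
    con.execute("INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
                [timestamp, plant] + values[i * 3:(i + 1) * 3])

rows = con.execute("SELECT * FROM readings ORDER BY plant").fetchall()
print(rows)
# [('2020-07-28 13:33:09', 'A', 'A1', 'A2', 'A3'),
#  ('2020-07-28 13:33:09', 'B', 'B1', 'B2', 'B3')]
```

From there, sorting/analysis per plant is a WHERE or ORDER BY clause rather than string juggling.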
I was wondering if anyone could advise me on a problem I am having in R.
I am working on a drug trial of sorts, with data from 3 experimental conditions.
For every participant, six data points were collected, meaning each participant is represented on one row, with 6 columns of data points.
However, I am now trying to create a 3-level longitudinal model. In order to do so I need to reformat my data so that instead of each participant having one row with 6 columns, there will be 6 rows per participant with two columns: one indicating the round of data collection (1-6), and the other giving the actual value.
Is there a straightforward way to do this? I am hoping I won't be stuck manually reformatting my data, as that would cost me a lot of time that I'd rather invest in analysing the results.
I will also need to combine all three data frames into one main data frame to run the model, but I assume this will be an easy step once I figure out the first problem.
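This is a standard wide-to-long reshape; in R, tidyr's pivot_longer (or reshape2's melt) does exactly this. For illustration, here is the same reshape sketched in Python/pandas with made-up data, since the structure is identical:

```python
import pandas as pd

# Wide format: one row per participant, six measurement columns (made-up data)
wide = pd.DataFrame({
    "participant": [1, 2],
    "t1": [10, 20], "t2": [11, 21], "t3": [12, 22],
    "t4": [13, 23], "t5": [14, 24], "t6": [15, 25],
})

# Long format: six rows per participant, one column for the round, one for the value
long = wide.melt(id_vars="participant", var_name="round", value_name="value")
print(long.shape)  # (12, 3)
```

Stacking the three experimental conditions afterwards is then a simple row-wise concatenation with a condition identifier column.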
I have a need to print a large table across multiple pages which contains both header rows and a "header" column. Representative of what I would like to achieve is:
https://github.com/EricG-Personal/table_print/blob/master/table.png
I do not want the contents of any cell to be clipped, split between pages, or auto-scaled to be smaller. Each page should have the appropriate header rows and each page should have the appropriate header column (the ID column).
The only aspect not depicted is that some of the cells would contain image data.
Can I achieve this with pandas?
What possible solutions do I have when attempting to print a large dataframe?
Pandas has no such capabilities; it wasn't designed for that in the first place.
I'd suggest converting your DataFrame to an Excel sheet and printing it using MS Excel. To the best of my knowledge, it has all that you need.
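The export itself is one call; the repeating header rows/column across pages would then be configured in Excel's own page-setup (print titles) settings. A minimal sketch (assumes the openpyxl package is installed, which pandas uses for .xlsx output):

```python
import pandas as pd

# Tiny stand-in for the real table; the ID column plays the "header" column role
df = pd.DataFrame({"ID": [1, 2], "A": ["x", "y"], "B": ["u", "v"]})

# Write to an .xlsx file; page setup (repeated rows/columns when printing,
# no cell clipping or scaling) is then handled in Excel itself
df.to_excel("table.xlsx", index=False)
```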
I am working on an experiment and have parts of it built using some Coder and some Builder, but am stuck at one spot. The experiment presents two different lists of words to people (list A and list B), and each word in its respective list is randomly paired with a number at the beginning of the experiment. PsychoPy shows both the word and the number to participants, and it is important that after they are randomly paired, the word-number pairs are then yoked for the remainder of the experiment. I have used Coder to randomize the pairing and construct a conditions file for the two word lists on the fly. Builder then uses these constructed conditions files to present the stimuli (words and numbers).
This is the part where I'm stuck. After the two word lists and their paired numbers are presented, I need to present a subset of both lists A and B as well as a third list of word-number pairs that was not previously presented. So, for example, a person might see something like this during the experiment:
First presentation:
List A:
frank - 1, susan - 3
List B:
shoe - 2, dingy - 1
Second presentation:
frank - 1, shoe - 2, hotel - 4
The beginning of the experiment is where Coder is used to create the word and number lists as well as write the two lists' conditions files. That code is below:
import random
import csv

# List A words and points, shuffled independently of each other
studylista = ["shoe", "bear", "balls", "dingy"]
pointslista = [1, 2, 3, 4]
listaRand = random.sample(studylista, len(studylista))
listapointsRand = random.sample(pointslista, len(pointslista))

# Write the list A conditions file (text mode with newline='' in Python 3)
with open('WordsandPointslista.csv', 'w', newline='') as w:
    writer = csv.writer(w)
    writer.writerow(['studylista', 'pointslista'])
    for i in range(len(listaRand)):
        writer.writerow([listaRand[i], listapointsRand[i]])

# List B words and points, shuffled independently of each other
studylistb = ["frank", "robert", "daniel", "susan"]
pointslistb = [1, 2, 3, 4]
listbRand = random.sample(studylistb, len(studylistb))
listbpointsRand = random.sample(pointslistb, len(pointslistb))

# Write the list B conditions file
with open('WordsandPointslistb.csv', 'w', newline='') as w:
    writer = csv.writer(w)
    writer.writerow(['studylistb', 'pointslistb'])
    for i in range(len(listbRand)):
        writer.writerow([listbRand[i], listbpointsRand[i]])
I need a random subset of the two previously presented lists along with an additional list that has not been presented to be seen all together by the participant. The previous word-number pairings for the already seen lists also need to be preserved. I cannot seem to discover how to do this.
I currently have the two word-number lists presented in separate routines with loops around each one. I am trying to figure out how to create a third routine that will show only some of the previously seen word-number pairs along with some new word-number pairs.
Thanks.
In your code above, don't create two separate two-column CSV files, but combine them into a single four-column file. This file can be used in multiple loops. If I understand your design correctly, it would be used first in a loop to present all the 'A' word/number pairs, and then again in a second loop to present all the 'B' word/number pairs. Lastly, use it in a final loop to present just a subset of the 'A' & 'B' pairs. This subsetting is applied via the "Selected rows" field in the loop dialog. Randomisation is optional in the first two loops, as you have already shuffled the rows, but would likely be necessary in the third loop to avoid presenting rows in the same order as in the first two loops.
Then there is the question of how to handle the third set of word/number pairs. The easiest thing to do would be to simply create them at the same time as the A & B sets and stick them in the same CSV file. But in this case, you would need the same number of words and numbers, some of which wouldn't be presented due to only running through a subset in the final loop. The alternative is to have a second code component prior to the third loop which reads in the existing file, shuffles the rows, subsets it, and then adds the new columns. i.e. doing a lot of the things which the Builder loop would otherwise do for you, but allowing you not to 'waste' words, if that is important to you.
Lastly, I also simplified your code above. Builder already imports the numpy.random.shuffle function, which is simpler than doing all the sampling and so on, and then you don't need to import the standard random library.
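The combined four-column file described above could be written like this (a sketch using the word lists from the question; Builder environments already have shuffle available, but it is imported explicitly here so the snippet runs standalone):

```python
import csv
from numpy.random import shuffle

studylista = ["shoe", "bear", "balls", "dingy"]
pointslista = [1, 2, 3, 4]
studylistb = ["frank", "robert", "daniel", "susan"]
pointslistb = [1, 2, 3, 4]

# Shuffle each list in place instead of using random.sample()
for lst in (studylista, pointslista, studylistb, pointslistb):
    shuffle(lst)

# One four-column conditions file, usable by several Builder loops
with open("WordsandPoints.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["studylista", "pointslista", "studylistb", "pointslistb"])
    for row in zip(studylista, pointslista, studylistb, pointslistb):
        writer.writerow(row)
```

Each Builder loop then points at this one file, and the third loop restricts itself via the "Selected rows" field as described above.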