Looping through columns to conduct data manipulations in a data frame - pandas

One struggle I have with Python pandas is repeating the same coding scheme for a large number of columns. For example, the code below creates a new column age_b in a data frame called data. How do I easily loop through a long list (hundreds or even thousands) of numeric columns, do the exact same thing to each, with the newly created column names being the existing name plus a prefix or suffix string such as "_b"?
labels = [1, 2, 3, 4, 5]
data['age_b'] = pd.cut(data['age'], bins=5, labels=labels)
In general, I have many simple data frame column manipulations or calculations, and it's easy to write the code for one column. However, I so often want to repeat the same process for dozens of columns, and that's when I get bogged down, because most functions or manipulations work for one column but are not easily repeatable across many columns. It would be nice if someone could suggest a looping code "structure". Thanks!
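For reference, a minimal sketch of the kind of looping structure being asked for, assuming the target columns can be picked out with select_dtypes and that the same bins and labels apply to every column (data is the data frame from the question):
import pandas as pd

labels = [1, 2, 3, 4, 5]
# loop over every numeric column and add a binned copy with a "_b" suffix
for col in data.select_dtypes(include='number').columns:
    data[col + '_b'] = pd.cut(data[col], bins=5, labels=labels)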

Related

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, and the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")
#this column explosion fails because obj_array exceeds 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader (presentation layer) that emits records in the way that you deem reasonable? (Likely the best solution if you want to leave the files as is.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could also use SQL tricks with window functions/lag to try to expand and fill the rows.
You could do file level cleaning/formatting to make the data more manageable for the out of the box tools to work with.
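If pre-splitting the file is acceptable, a rough sketch of that file-level approach follows; it assumes the file is one JSON object whose obj_array field holds the payload, that one machine has enough memory to parse it once in plain Python, and the /dbfs path and chunk size are only placeholders:
import json

# parse the whole document once outside Spark (needs enough local/driver memory)
with open('/dbfs/mnt/datalake/jsonfiles/filename.json') as f:
    doc = json.load(f)

records = doc['obj_array']
chunk_size = 50000  # tune so each chunk stays well under the 2GB column limit
for i in range(0, len(records), chunk_size):
    chunk = dict(doc, obj_array=records[i:i + chunk_size])
    with open(f'/dbfs/mnt/datalake/jsonfiles/chunks/filename_{i}.json', 'w') as out:
        json.dump(chunk, out)
The chunk files can then be read with the same spark.read.json() call, and explode(obj_array) works on each smaller obj_array.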

Slice dataframe according to unique values into many smaller dataframes

I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written code which basically performs an analysis using the x and y values for a subset of this data (but the specifics are unimportant here).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on 1 subset of titles (i.e. one set of rows with the same title) at a time. For this I've been making a smaller dataframe out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller datasets I need to do this analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller dataframes automatically, by slicing the data on the unique 'title' values? In all the help I've found, it seems like you need to specify the 'title' to make a subset. I want to subset all of it, and I don't want to have to list every title name to do it.
I've searched quite a lot and haven't found anything, however I am a beginner so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use pandas groupby.
For example:
df_dict = {title: group for title, group in df.groupby('title', sort=False)}
This creates a dictionary of DataFrames, each containing all the columns and only the rows pertaining to one unique value of title.
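From there, the per-title analysis can run in a loop; a minimal usage sketch, where run_analysis is just a placeholder for the existing analysis code:
for title, subset in df_dict.items():
    # subset is a DataFrame holding only the rows for this title
    run_analysis(subset['x'], subset['y'])  # placeholder for the existing analysis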

Organising CSV file data in Python

I am quite a beginner with Python, but I have a programming-related project to work on, so I would really like to ask for some help. I didn't find many simple solutions for organising the data in such a way that I could do some analysis with it.
First, I have multiple csv-files, which I read in as DataFrame objects. In the end, I need to analyse them all together (right now the files are kept in a list of DataFrames, but later on I will probably need them as one DataFrame object).
However, I have a problem with organising and separating the data. There are thousands of rows in one column; a part of it is shown below:
CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
EN025140855608477018TC2L;11/03/2020;3;0 033;R
EN025140855608477018TC2L;11/03/2020;4;0 085;R
EN025140855608477018TC2L;11/03/2020;5;0 019;R
...
EN025140855608477018TC2L;11/04/2020;20;0 786;R
EN025140855608477018TC2L;11/04/2020;21;0 288;R
EN025140855608477018TC2L;11/04/2020;22;0 198;R
EN025140855608477018TC2L;11/04/2020;23;0 728;R
EN025140855608477018TC2L;11/04/2020;24;0 275;R
The numbers with the huge space inside them should be merged together, for example 0 057 should become 0.057; this value represents "Cons" (actually the most important piece of information).
I need to be able to split the data into 5 columns in order to proceed with the analysis. However, it should be a universal tool for different csv-files, without knowing in advance which symbols they contain. The structure of the content and the heading is always the same, though.
I would be happy if anyone could recommend a way to work with this kind of data.
Sounds like what you are trying to do is convert the Cons column so that the spaces become a dot.
df = pd.read_csv("file.txt", sep=";")
df['Cons'] = df['Cons'].str.replace(r"\s+", ".", regex=True)
df['Cons'].head()
Output:
0 0.057
1 0.078
2 0.033
3 0.085
4 0.019
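If the cleaned values are then needed as numbers rather than strings (likely for the analysis, though that is an assumption), a short follow-up could be:
# convert the cleaned "0.057"-style strings to floats for numeric work
df['Cons'] = pd.to_numeric(df['Cons'])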

How to map column-wise data in a flowfile in NiFi?

I have a CSV file with the following structure:
Alfreds,Centro,Ernst,Island,Bacchus
Germany,Mexico,Austria,UK,Canada
01,02,03,04,05
Now I have to move that data into a database like below.
Name,City,ID
Alfreds,Germany,01
Centro,Mexico,02
Ernst,Austria,03
Island,UK,04
Bacchus,Canada,05
I tried to map those columns but I wasn't able to extract the data column-wise.
My input data is column-wise, but I need to insert it row-wise into SQL Server.
Can anyone suggest a way to transpose the column-wise data into rows for SQL Server?
Thanks
There is no existing Apache NiFi processor that performs column transposition. One of the problems is that this is difficult to do in a streaming manner, the way most NiFi components are designed, because a naïve implementation needs to hold the entire contents of the flowfile in active memory at the same time.
I would recommend using an ExecuteScript processor to do this (here's a 6 line Python example). Be careful doing this because you can easily end up overflowing your heap if it is not set properly/you read unexpectedly large files into memory.
You could write a custom processor which performs a streaming transpose operation by iterating over each of n rows and reading up to your delimiter, storing a byte counter per row, combining the n elements as a single output row, and repeating the process starting from the respective byte counter of each row. (Given m columns, this is O(m * n)).
Another solution would be splitting the CSV input into individual rows using the SplitText processor, using an ExecuteScript or custom processor to transpose a single row into a single column, and then using a custom merge operation (either extend the existing MergeContent processor or write a script to do this) which laterally concatenates the incoming columns into a reconstructed matrix. (O(n) + O(n) + O(m) => O(2n + m) but the individual transposition operations can be performed in parallel so with x threads it's O(n + n/x + m)).
Any of these approaches will require some level of custom development. If you are really hesitant to pursue that, you could try using ExecuteStreamCommand and one of the many bash solutions to do the transposition on the command-line.
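For illustration only (this is not the linked example), a small standalone Python sketch of the transposition itself, assuming the whole CSV fits comfortably in memory; it reads stdin and writes stdout, so it would also suit the ExecuteStreamCommand route:
import csv
import sys

# read every row, then zip the rows together so columns become rows
rows = list(csv.reader(sys.stdin))
csv.writer(sys.stdout).writerows(zip(*rows))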
@Andy,
It is also possible in NiFi without using ExecuteScript.
I extracted the 3 input rows as input.1, input.2, input.3 with ExtractText. Then I counted the number of columns in "input.1" using anyDelineatedValue in the Expression Language and stored that in a "TotalCount" attribute.
Initially I set "Count" to 1.
Using a loop, I get the column at position "Count", then increment "Count" and check it in RouteOnAttribute with
"le(TotalCount)"
and then form the insert query with the "Count" attribute.
It worked well for me; it could be useful for someone.

Excel VBA using SUMPRODUCT and COUNTIFS - issue of speed

I have an issue of speed. (Apologies for the long post…). I am using Excel 2013 and 2016 for Windows.
I have a workbook that performs 10,000+ calculations on a 200,000 cell table (1000 rows x 200 columns).
Each calculation returns an integer (e.g. count of filtered rows) or more usually a percentage (e.g. sum of value of filtered rows divided by sum of value of rows). The structure of the calculation is variations of the SUMPRODUCT(COUNTIFS()) idea, along the lines of:
=IF($B6=0,
0,
SUMPRODUCT(COUNTIFS(
Data[CompanyName],
CompanyName,
Data[CurrentYear],
TeamYear,
INDIRECT(VLOOKUP(TeamYear&"R2",RealProgress,2,FALSE)),
"<>"&"",
Data[High Stage],
NonDom[NonDom]
))
/$B6
)
Explaining above:
The pair Data[CompanyName] and CompanyName is the column in the table and the condition value for the first filter.
The pair Data[CurrentYear] and TeamYear are the same as above and constitute the second filter.
The third pair looks up an intermediary table and returns the name of a column; the condition ("<>"&"") is 'not blank', i.e. it returns all rows that have a value in this column.
Finally, the fourth pair is similar to the third, but returns the rows whose Data[High Stage] value matches one of the values in NonDom[NonDom].
Lastly, the four filters are joined together with AND statements.
It is important to note that across all the calculations the same principle is applied of using SUMPRODUCT(COUNTIFS()) – however there are many variations on this theme.
At present, using Calculate on a select range of sheets (rather than the slower calculation of the whole workbook) yields a calculation time of around 30-40 seconds. Not bad, and tolerable as calculations aren't performed all the time.
Unfortunately, the model is to be extended and could now approach 20,000 rows rather than 1,000 rows. Calculation performance is directly linked to the number of rows or cells, so I expect performance to plummet!
The obvious solution [1] is to use arrays, ideally passing an array, held in memory, to the formula in the cell and then processing it along with the filters and their conditions (the lookup filters being arrays too).
The alternative solution [2] is to write a UDF using arrays, but reading around the internet the opinion is that UDFs are much slower than native Excel functions.
Three questions:
Is solution [1] possible, and the best way of doing this, and if so how would I construct it?
If solution [1] is not possible or not the best way, does anyone have any thoughts on how much quicker solution [2] might be compared with my current solution?
Are there other better solutions out there? I know about Power BI Desktop, PowerPivot and PowerQuery – however this is a commercial application for use by non-Excel users and needs to be presented in the current Excel ‘grid’ form of rows and columns.
Thanks so much for reading!
Addendum: I'm going to try running an array calculation for each sheet on the Worksheet.Activate event and see if there's some time savings.
Writing data to arrays is normally a good idea when looking to increase speed. It is done like this:
Dim myTable As ListObject
Dim myArray As Variant
'Point the Table variable at the table on the active sheet
Set myTable = ActiveSheet.ListObjects("Table1")
'Read the table body into a 2-D Variant array held in memory
myArray = myTable.DataBodyRange