I have a dataframe which consists of 4 rows and more than 20 columns (dates). The dataframe comes from a table that I read and convert into a dataframe. The SUM row contains the sum of the values per date.
+----+-----+-----+
|PR |date1|date2|......
+----+-----+-----+
| a | 30 | 17 |......
| b | 30 | 12 |......
| SUM| 60 | 29 |......
+----+-----+-----+
I created this dataframe after submitting a question here. Since the table is constantly being populated with new data, I want the new data to be added to that dataframe.
I am coding in PySpark and the script is the following:
from pyspark.sql import functions as F

if df.filter(df.PR.like('SUM')):
    print("**********")
    print("SUM FOUND")
    df = df.union(df.select(df.where(df.index == 'SUM').select('PR'),
                            *[F.sum(F.col(c)).alias(c) for c in df.columns if c != 'PR']))
else:
    df = df.union(df.select(F.lit("SUM").alias("PR"),
                            *[F.sum(F.col(c)).alias(c) for c in df.columns if c != 'PR']))
What I want to achieve is that, for every new date, a new column is created and its SUM is filled in, without adding new rows. Unfortunately I am getting the error: AttributeError: 'DataFrame' object has no attribute 'index'
Any help/hint? Should I follow a different approach?
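One possible direction (a rough sketch, not the original poster's code, and it assumes every column other than PR is a date column to be summed): instead of checking whether a SUM row already exists, drop any existing SUM row and recompute it over whatever date columns are currently present, so newly added dates are picked up automatically:

from pyspark.sql import functions as F

# Keep only the data rows; any previously computed SUM row is discarded.
data_df = df.filter(df.PR != 'SUM')

# Recompute the totals for every date column currently in the dataframe.
sum_row = data_df.select(
    F.lit('SUM').alias('PR'),
    *[F.sum(F.col(c)).alias(c) for c in data_df.columns if c != 'PR'])

df = data_df.union(sum_row)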
I have two dataframes with different row counts.
df1 has the problems and count
problems | count
broken, torn | 10
torn | 15
worn-out, broken | 25
df2 has the order_id and problems
order_id | problems
123 | broken
594 | torn
811 | worn-out, broken
I need to remove all rows from df1 whose problems do not match the individual problems listed in df2, and I want to keep df1's count.
The final df1 data frame would look like this:
problems | count
broken | 10
torn | 15
worn-out, broken | 25
I have only done this for columns in the same dataframe before. Not sure how to deal with multiple data frames.
Can someone please help?
Try this to merge the two dataframes together (both problems columns are split on ', ' and turned into frozensets, so rows are matched on the set of individual problems rather than on the raw strings):

import pandas as pd

merged = pd.merge(
    df1.assign(problems=df1['problems'].str.split(', ').map(frozenset)),
    df2.assign(problems=df2['problems'].str.split(', ').map(frozenset)),
    on='problems')
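If you want to keep df1's original problems strings and count column untouched in the result, one option (a sketch, assuming the df1/df2 names and the ', ' separator from the question) is to merge on a helper key instead of overwriting problems:

import pandas as pd

# Frozensets make "broken, torn" and "torn, broken" compare equal, while
# df1's original 'problems' strings and 'count' values stay as they are.
key1 = df1['problems'].str.split(', ').map(frozenset)
key2 = df2['problems'].str.split(', ').map(frozenset)

filtered = (df1.assign(key=key1)
               .merge(pd.DataFrame({'key': key2}).drop_duplicates(), on='key')
               .drop(columns='key'))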
I am trying to add an Array of values as a new column to the DataFrame.
Ex:
Let's assume there is an Array(4, 5, 10) and a dataframe
+----------+-----+
| name | age |
+----------+-----+
| John | 32 |
| Elizabeth| 28 |
| Eric | 41 |
+----------+-----+
My requirement is to add the above array as a new column to the dataframe. My expected output is as follows:
+----------+-----+------+
| name | age | rank |
+----------+-----+------+
| John | 32 | 4 |
| Elizabeth| 28 | 5 |
| Eric | 41 | 10 |
+----------+-----+------+
I am trying to see if I can achieve this using rdd and zipWithIndex.
df.rdd.zipWithIndex.map(_.swap).join(array_rdd.zipWithIndex.map(_.swap))
This is resulting in something of this sort.
(0,([John, 32],4))
I want to convert the above RDD back to the required dataframe. Let me know how to achieve this.
Are there any alternatives available for achieving the desired result other than using rdd and zipWithIndex? What is the best way to do it?
PS:
Context for better understanding:
I am using the Xpress optimization suite to solve a mathematical problem. Xpress takes its inputs as Arrays and also outputs the result as an Array. I get the input as a DataFrame, extract columns as Arrays (using collect) and pass them to Xpress. Xpress outputs an Array[Double] as the solution. I want to add this solution back to the dataframe as a column, where every value in the solution array corresponds to the row of the dataframe at the same index, i.e. the value at index 'n' of the output Array corresponds to the 'n'th row of the dataframe.
After the join just map the results to what you are looking for.
You can convert this back to a dataframe after joining the RDDs.
import spark.implicits._  // already in scope in spark-shell; needed in a standalone app for toDF

val originalDF = Seq(("John", 32), ("Elizabeth", 28), ("Eric", 41)).toDF("name", "age")
val rank = Array(4, 5, 10)
// convert the array to a Seq first so it can become a single-column dataframe
val rankDF = rank.toSeq.toDF("rank")
val joined = originalDF.rdd.zipWithIndex.map(_.swap).join(rankDF.rdd.zipWithIndex.map(_.swap))
val finalRDD = joined.map { case (k, v) => (k, v._1.getString(0), v._1.getInt(1), v._2.getInt(0)) }
val finalDF = finalRDD.toDF("id", "name", "age", "rank")
finalDF.show()
/*
+---+---------+---+----+
| id| name|age|rank|
+---+---------+---+----+
| 0| John| 32| 4|
| 1|Elizabeth| 28| 5|
| 2| Eric| 41| 10|
+---+---------+---+----+
*/
The only alternative way that I can think of is to use the org.apache.spark.sql.functions.row_number() window function. This essentially achieves the same thing by adding an increasing, consecutive row number to the dataframe.
The drawback with this is the large amount of data shuffled into one partition, since we need unique, consecutive row numbers for all rows in the dataframe. If your data is very large this can lead to an out-of-memory issue. (Note: this may not be applicable in your case, since you mentioned you are doing a collect on the data and have not mentioned any memory issues so far.)
The approach of converting to an rdd and using zipWithIndex is an acceptable solution, but in general converting from a dataframe to an rdd is not recommended because of the performance penalty of working with an RDD instead of a dataframe.
Let's say I have a dataframe
df = pd.DataFrame({'A': [6, 5, 9, 6, 2]})
I also have an array/series
ser = pd.Series([5,6,7])
How can I insert this series into the existing df as a new column, but starting at a specific index, while "padding" the missing indexes with NaN (I think pandas does this automatically)?
I.e., pseudocode:
insert ser into df at index 2 as column 'B'
Example output
  |  A  |  B
--+-----+-----
1 |  6  | NaN
2 |  5  | 5
3 |  9  | 6
4 |  6  | 7
5 |  2  | NaN
Assuming that the start index value is in a startInd variable:
startInd = 2
use the following code:
df['B'] = pd.Series(data=ser.values,
                    index=df.index[df.index >= startInd].to_series().iloc[:ser.size])
Details:
- df.index[df.index >= startInd] - returns the fragment of df.index starting from the "start value" (for now, up to the end).
- .to_series() - converts it to a Series (in order to be able to "slice" it using iloc, in a moment).
- .iloc[:ser.size] - takes only as many values as needed.
- index=... - uses what we got in the previous step as the index of the created Series.
- pd.Series(data=ser.values, ... - creates a Series: the source of the data, which will be saved in a new column of df.
- df['B'] = - saves the above data in a new column (only rows with index values matching the above index get values; the other rows are set to NaN).
There is a subtle but unavoidable difference from your expected result:
As some values are NaN, the type of the new column is coerced to float.
So the result is:
A B
1 6 NaN
2 5 5.0
3 9 6.0
4 6 7.0
5 2 NaN
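A shorter alternative (just a sketch, assuming the default integer RangeIndex from the example; with a non-unique or unsorted index a label slice like this would not be safe) is to assign through .loc, which creates the new column and leaves the rows outside the slice as NaN:

import pandas as pd

df = pd.DataFrame({'A': [6, 5, 9, 6, 2]})
ser = pd.Series([5, 6, 7])
startInd = 2

# .loc label slices include both endpoints, so this targets exactly ser.size rows;
# rows outside the slice get NaN automatically (and the column becomes float).
df.loc[startInd:startInd + ser.size - 1, 'B'] = ser.values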
How can we merge two dataframes and form a new one using conditions? For example:
if a row is present in dataframe B, use the row from dataframe B; otherwise use the row from dataframe A.
DataFrame A
+-----+-------------------+--------+------+
| Name| LastTime|Duration|Status|
+-----+-------------------+--------+------+
| Bob|2015-04-23 12:33:00| 1|logout|
|Alice|2015-04-20 12:33:00| 5| login|
+-----+-------------------+--------+------+
DataFrame B
+-----+-------------------+--------+------+
| Name| LastTime|Duration|Status|
+-----+-------------------+--------+------+
| Bob|2015-04-24 00:33:00| 1|login |
+-----+-------------------+--------+------+
I want to form a new dataframe using the whole data in DataFrame A, but updating rows with the data in B:
+-----+-------------------+--------+------+
| Name| LastTime|Duration|Status|
+-----+-------------------+--------+------+
| Bob|2015-04-24 00:33:00| 1|login |
|Alice|2015-04-20 12:33:00| 5| login|
+-----+-------------------+--------+------+
I tried a full outer join as
val joined = df.as("a").join(df.as("b")).where($"a.name" === $"b.name","outer")
But it resulted in one row with duplicate columns. How can I ignore the row in the first table if there is a corresponding row present in the second?
import org.apache.spark.sql.functions.{coalesce, col}

// Outer join on Name, then prefer B's (dfb) values wherever B has a matching row;
// coalesce picks the first non-null value, falling back to A's (dfa) values.
val combined_df = dfa.join(dfb, Seq("Name"), "outer")
  .select(
    col("Name"),
    coalesce(dfb("LastTime"), dfa("LastTime")).alias("LastTime"),
    coalesce(dfb("Duration"), dfa("Duration")).alias("Duration"),
    coalesce(dfb("Status"), dfa("Status")).alias("Status"))
Some illustrative data in a DataFrame (MultiIndex) format:
|entity| year |value|
+------+------+-----+
| a | 1999 | 2 |
| | 2004 | 5 |
| b | 2003 | 3 |
| | 2007 | 2 |
| | 2014 | 7 |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, not sure how to approach it in pandas.
A function can be applied to a groupby with the apply method; the passed function in this case is linregress. Please see below:
In [4]: x = pd.DataFrame({'entity':['a','a','b','b','b'],
'year':[1999,2004,2003,2007,2014],
'value':[2,5,3,2,7]})
In [5]: x
Out[5]:
entity value year
0 a 2 1999
1 a 5 2004
2 b 3 2003
3 b 2 2007
4 b 7 2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
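If you prefer the result as a regular dataframe with a named column rather than a Series, a small follow-up sketch (not part of the original answer):

slopes = (x.groupby('entity')
           .apply(lambda v: linregress(v.year, v.value)[0])
           .rename('slope')
           .reset_index())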
You can do this via the iterator ability of the groupby object. It seems easier to do it by dropping the current index and then grouping by 'entity'.
A list comprehension is then an easy way to quickly work through all the groups in the iterator. Or use a dict comprehension to get the labels in the same place (you can then stick the dict into a pd.DataFrame easily).
import pandas as pd
import scipy.stats

# This is your data
test = pd.DataFrame({'entity': ['a', 'a', 'b', 'b', 'b'],
                     'year': [1999, 2004, 2003, 2007, 2014],
                     'value': [2, 5, 3, 2, 7]}).set_index(['entity', 'year'])

# This creates the groups
grouped = test.reset_index().groupby('entity')

# Process groups with a list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in grouped]

# Process groups with a dict comprehension (keeps the entity labels as keys)
slopes = {name: [scipy.stats.linregress(group.year, group.value)[0]] for name, group in grouped}
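As mentioned above, the dict version drops straight into a dataframe; a minimal sketch:

# index=['slope'] labels the single row, and .T flips it so each entity
# becomes a row with a 'slope' column.
slopes_df = pd.DataFrame(slopes, index=['slope']).T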