How can a list be stored in a cell of a pandas DataFrame? - pandas

I have some dataframes (df, tmp_df and final_df) and I want to enter two columns of tmp_df into two different cells of final_df as lists. My code and dataframes are shown below; however, the loop part is not working correctly. Other questions on Stack Overflow and elsewhere answer this when a dictionary of the lists is available from the beginning of the program, but here tmp_df changes during the for loop: at each iteration the suitable prices are calculated and the most relevant rows are found, and these must be placed into the corresponding cells of final_df.
import pandas as pd
df = pd.read_csv('myfile.csv')
tmp_df = pd.DataFrame()
final_df = pd.DataFrame()
tmp_df = df[df['Type'] == True]
cnt = 0
for c in tmp_df['Category']:
    #################
    # Apply some calculations and call different methods to do some changes
    # on the Price column of tmp_df (tmp_sub is presumably derived from
    # tmp_df here for the current category).
    #################
    final_df.at[cnt, 'Data'] = list(set(tmp_sub['Data']))
    # four targets but only two values; this raises
    # "ValueError: not enough values to unpack (expected 4, got 2)"
    final_df['Category'], final_df['Acceptable'], final_df['Rank'], final_df['Price'] = \
        tmp_df['Rank'], list(tmp_sub['Price'])
    cnt += 1
df:
| Data | Category | Acceptable | Rank | Price |
| ------- | -------- | ---------- | ---- | ----- |
| 30275 | A | Yes | 1 | 52787 |
| 35881 | C | No | 2 | 14804 |
| 28129 | C | Yes | 3 | 180543|
| 30274 | D | No | 2 | 8066 |
| 30351 | D | Yes | 3 | 273478|
| 35886 | A | Yes | 2 | 10808 |
| 39900 | C | Yes | 1 | 21893 |
| 35887 | A | No | 2 | 2244 |
| 35883 | A | Yes | 1 | 10066 |
| 35856 | D | Yes | 3 | 19011 |
| 35986 | C | No | 2 | 6895 |
| 30350 | D | No | 3 | 5243 |
| 28129 | C | Yes | 1 | 112859|
| 31571 | C | Yes | 1 | 20701 |
tmp_df:
| Data | Category | Acceptable | Rank | Price |
| ------- | -------- | ---------- | ---- | ----- |
| 30275 | A | Yes | 1 | 52787 |
| 38129 | C | Yes | 3 | 180543|
| 30351 | D | Yes | 3 | 273478|
| 35886 | A | Yes | 2 | 10808 |
| 39900 | C | Yes | 1 | 21893 |
| 35883 | A | Yes | 1 | 10066 |
| 35856 | D | Yes | 3 | 19011 |
| 28129 | C | Yes | 1 | 112859|
| 31571 | C | Yes | 1 | 20701 |
The prices in the final dataframe (final_df) are changed by the calculations over tmp_df. Now, what should I do if I want the following result?
final_df:
| Data | Category | Acceptable | Rank | Price |
| ------- | -------- | ---------- | ---- | ----- |
| [30275,35886,35883] | A | Yes | [1,2]| 195543|
| [28129,39900,38129,31571] | C | Yes | [1,3]| 210089|
| [30351,35856] | D | Yes | 3 | 113859|

You can aggregate Data with list, and use another aggregation function for Price, e.g. sum, mean...:
# custom aggregation function for Price
def func(x):
    return x.sum()
d = {'Data': list, 'Rank': lambda x: list(set(x)), 'Price': func}
final_df = (tmp_df.groupby(['Category','Acceptable'], as_index=False)
                  .agg(d)
                  .reindex(tmp_df.columns, axis=1))
Or use a built-in aggregation such as 'max' for Price; this variant produces the output below:
d = {'Data': list, 'Rank': lambda x: list(set(x)), 'Price': 'max'}
final_df = (tmp_df.groupby(['Category','Acceptable'], as_index=False)
                  .agg(d)
                  .reindex(tmp_df.columns, axis=1))
print(final_df)
Data Category Acceptable Rank Price
0 [30275, 35886, 35883] A Yes [1, 2] 52787
1 [38129, 39900, 28129, 31571] C Yes [1, 3] 180543
2 [30351, 35856] D Yes [3] 273478
Solution with custom function:
def func1(x):
    return x.sum()

def f(x):
    a = list(x['Data'])
    b = list(set(x['Rank']))
    c = func1(x['Price'])
    return pd.Series({'Data': a, 'Rank': b, 'Price': c})

final_df = (tmp_df.groupby(['Category','Acceptable'])
                  .apply(f)
                  .reset_index()
                  .reindex(tmp_df.columns, axis=1))
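As a quick sanity check on the tmp_df shown above (before any in-loop price adjustments), the sum variant would give 52787 + 10808 + 10066 = 73661 as the Price for category A; the prices in the question's desired final_df differ because of the calculations applied inside the loop.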

Related

Pandas Pivot-Table Containing List

I'd like to create a pivot table with the counts of values in a list column, filtered by another column, but I'm not sure how to use a pandas pivot table (or function) with a list.
Here's an example of what I'd like to do:
| Col1 | Col2 |
| --- | ----------- |
| A | ["e", "f"] |
| B | ["g", "f"] |
| C | ["g", "h"] |
| A | ["e", "g"] |
| B | ["g", "f"] |
| C | ["g", "e"] |
Ideal Pivot Table
| 1 | 2 | count |
| --- | --- | ----- |
| A | e | 2 |
| | f | 1 |
| | g | 1 |
| B | g | 2 |
| | f | 2 |
| C | g | 2 |
| | h | 1 |
| | e | 1 |
I cannot use a list to make a pivot table and am struggling to figure out how to modify the data or find a different method. Any help would be much appreciated!
Try this:
cols = ['Col1','Col2']
df.explode('Col2').groupby(cols).size()
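For reference, a minimal runnable sketch on the example data (my illustration, not part of the original answer; groupby returns the groups sorted, so the row order differs slightly from the ideal table):
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Col2': [['e', 'f'], ['g', 'f'], ['g', 'h'],
             ['e', 'g'], ['g', 'f'], ['g', 'e']],
})

cols = ['Col1', 'Col2']
# explode turns each list element into its own row,
# then size() counts the (Col1, value) pairs
print(df.explode('Col2').groupby(cols).size())
which prints:
Col1  Col2
A     e       2
      f       1
      g       1
B     f       2
      g       2
C     e       1
      g       2
      h       1
dtype: int64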

Count values less than in another dataframe based on values in existing dataframe

I have two python pandas dataframes, in simplified form they look like this:
DF1
+---------+------+
| Date 1 | Item |
+---------+------+
| 1991-08 | A |
| 1992-08 | A |
| 1997-02 | B |
| 1998-03 | C |
| 1999-02 | D |
| 1999-02 | D |
+---------+------+
DF2
+---------+------+
| Date 2 | Item |
+---------+------+
| 1993-08 | A |
| 1993-09 | B |
| 1997-01 | C |
| 1999-03 | D |
| 2000-02 | E |
| 2001-03 | F |
+---------+------+
I want to count, for each Item in DF2, how many times that Item appears in DF1 with a date (Date 1) earlier than its date in DF2 (Date 2).
Desired Output
+---------+------+-------+
| Date 2 | Item | Count |
+---------+------+-------+
| 1993-08 | A | 2 |
| 1993-09 | B | 0 |
| 1997-01 | C | 0 |
| 1999-03 | D | 2 |
| 2000-02 | E | 0 |
| 2001-03 | F | 0 |
+---------+------+-------+
Appreciate any comment and feedback, thanks in advance
Let's merge on Item (a many-to-many join, i.e. a cartesian product within each item) and filter, then use value_counts and map back to your dataframe:
df_c = df1.merge(df2, on='Item')
df_c = df_c[df_c['Date 1'] < df_c['Date 2']]
df2['Count'] = df2['Item'].map(df_c['Item'].value_counts()).fillna(0)
print(df2)
Output:
Date 2 Item Count
0 1993-08 A 2.0
1 1993-09 B 0.0 # Note, I get no counts for B
2 1997-01 C 0.0
3 1999-03 D 2.0
4 2000-02 E 0.0
5 2001-03 F 0.0
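One caveat worth adding (my note, not part of the original answer): zero-padded YYYY-MM strings happen to compare correctly as plain strings, which is why the < filter works here. For anything less regular, parse the dates first:
import pandas as pd

# hypothetical df1/df2 as in the question; parsing makes the comparison
# robust regardless of the string format
df1['Date 1'] = pd.to_datetime(df1['Date 1'], format='%Y-%m')
df2['Date 2'] = pd.to_datetime(df2['Date 2'], format='%Y-%m')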

Deterministic Finite Automata on JFLAP

I have a DFA problem and I need to use JFLAP to create a diagram for the automaton. I have successfully done a simpler problem, but I just can't figure out how to solve this one:
"A DFA that receives sequences of "1" and "2" values, accepting only sequences that result in 4. Any other combinations that result in more than or less than 4 are to be rejected."
The alphabet is {1,2} and as far as I know these are the possible combinations that will be accepted:
1111, 22, 121, 112, 211
Any help will be very much appreciated. Thank you.
A DFA for this finite language could look a lot like this:
State     on 1     on 2
q         q1       q2
q1        q11      q21
q2        q21      q22
q11       q111     q211
q21       q211     qDead
q111      q1111    qDead
q22 *     qDead    qDead
q211 *    qDead    qDead
q1111 *   qDead    qDead
qDead     qDead    qDead

q is the start state, the states marked * (q22, q211, q1111) accept, and qDead is the trap state for every string whose running sum exceeds 4.
Another approach would be just to remember the current sum:
State    on 1     on 2
q0       q1       q2
q1       q2       q3
q2       q3       q4
q3       q4       qDead
q4 *     qDead    qDead
qDead    qDead    qDead

Each state qN records that the digits read so far sum to N; q0 is the start state and q4 (marked *) is the only accepting state.
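A minimal Python sketch of this second machine (my own illustration, not part of the original answer), driving the transition table directly:
# transition table for the "remember the current sum" DFA above
delta = {
    ('q0', '1'): 'q1', ('q0', '2'): 'q2',
    ('q1', '1'): 'q2', ('q1', '2'): 'q3',
    ('q2', '1'): 'q3', ('q2', '2'): 'q4',
    ('q3', '1'): 'q4', ('q3', '2'): 'qDead',
    ('q4', '1'): 'qDead', ('q4', '2'): 'qDead',
    ('qDead', '1'): 'qDead', ('qDead', '2'): 'qDead',
}

def accepts(word):
    state = 'q0'
    for symbol in word:
        state = delta[(state, symbol)]
    return state == 'q4'

# the five strings the question expects to accept, plus two that are rejected
for w in ['1111', '22', '121', '112', '211', '2', '1212']:
    print(w, accepts(w))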

Spark DataFrame: Ignore columns with empty IDs in groupBy

I have a dataframe e.g. with this structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1 | 123 | 1 | | | A1 | B1 | | ... <- only P1_x columns filled
1 | 123 | 2 | | | A2 | B2 | | ... <- only P1_x filled
1 | 123 | 3 | | | A3 | B3 | | ... <- only P1_x filled
1 | 123 | | 1 | | | | A4 | ... <- only P2_x filled
1 | 123 | | 2 | | | | A5 | ... <- only P2_x filled
1 | 123 | | | 1 | | | | ... <- only P3_x filled
I need to combine rows that have the same ID, Date and Px_ID values, ignoring empty values in the Px_ID columns when comparing the key columns.
In the end I need a dataframe like this:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1 | 123 | 1 | 1 | 1 | A1 | B1 | A4 | ...
1 | 123 | 2 | 2 | | A2 | B2 | A5 | ...
1 | 123 | 3 | | | A3 | B3 | | ...
Is this possible and how? Thank you!
I found a solution for this problem: since the non-relevant x_ID columns are empty, one possible way is to create a new column combined_ID that contains a concatenation of all x_ID column values (it will contain only one value, since only one x_ID is non-empty in each row):
// assumes the empty cells are empty strings; concat() returns null if any
// input is null (concat_ws("", xIdArray : _*) would tolerate nulls)
val xIdArray = Seq("P1_ID", "P2_ID", "P3_ID").map(col)
myDF = myDF.withColumn("combined_ID", concat(xIdArray : _*))
This changes the DF to following structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ... | combined_ID
===========================================================================
1 | 123 | 1 | | | A1 | B1 | | ... | 1
1 | 123 | 2 | | | A2 | B2 | | ... | 2
1 | 123 | 3 | | | A3 | B3 | | ... | 3
1 | 123 | | 1 | | | | A4 | ... | 1
1 | 123 | | 2 | | | | A5 | ... | 2
1 | 123 | | | 1 | | | | ... | 1
Now I can simply group my DF by ID, Date and combined_ID, and aggregate all the relevant columns with e.g. the max function to get the values of the non-empty cells:
val groupByColumns : Seq[String] = Seq("ID", "Date", "combined_ID")
val aggExprs = Seq("P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A" /* ... */)
  .map(c => max(c).as(c))
myDF = myDF.groupBy(groupByColumns.head, groupByColumns.tail : _*)
           .agg(aggExprs.head, aggExprs.tail : _*)
Result:
ID | Date | combined_ID | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
===========================================================================
1 | 123 | 1 | 1 | 1 | 1 | A1 | B1 | A4 | ...
1 | 123 | 2 | 2 | 2 | | A2 | B2 | A5 | ...
1 | 123 | 3 | 3 | | | A3 | B3 | | ...
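The same idea in PySpark, as a minimal sketch (my translation, not part of the original answer; my_df stands for a DataFrame shaped like the one above):
from pyspark.sql import functions as F

id_cols = ["P1_ID", "P2_ID", "P3_ID"]
value_cols = ["P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A"]

# concat_ws skips nulls, so it works whether the empty cells are nulls
# or empty strings
my_df = my_df.withColumn("combined_ID", F.concat_ws("", *id_cols))

# max ignores nulls, so each group keeps the one non-empty value per column
my_df = (my_df.groupBy("ID", "Date", "combined_ID")
              .agg(*[F.max(c).alias(c) for c in value_cols]))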

How to use Count for specific condition

How can I count and show how many Opportunities have Stage 3 but don't have Stage 2?
+-------+-------+
| OppID | Stage |
+-------+-------+
| ABC | 1 |
| ABC | 2 |
| ABC | 3 |
| ABC | 4 |
| CDF | 3 |
| CDF | 4 |
| EFG | 1 |
| EFG | 2 |
| EFG | 3 |
| HIJ | 2 |
| HIJ | 3 |
| LMI | 1 |
| LMI | 2 |
| LMI | 4 |
+-------+-------+
The count result is 1
+-------+-------+
| OppID | Stage |
+-------+-------+
| CDF | 3 |
| CDF | 4 |
+-------+-------+
You could use NOT EXISTS together with COUNT(DISTINCT ...) as follows:
SELECT COUNT(DISTINCT OppID)
FROM tbl AS t1
WHERE NOT EXISTS (SELECT 1 FROM tbl AS t2
                  WHERE t1.OppID = t2.OppID AND t2.Stage = 2)
  AND t1.Stage = 3
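For anyone doing this in pandas rather than SQL, a minimal sketch of the same logic (my own illustration; it assumes the table lives in a DataFrame df with columns OppID and Stage):
import pandas as pd

df = pd.DataFrame({
    'OppID': ['ABC'] * 4 + ['CDF'] * 2 + ['EFG'] * 3 + ['HIJ'] * 2 + ['LMI'] * 3,
    'Stage': [1, 2, 3, 4, 3, 4, 1, 2, 3, 2, 3, 1, 2, 4],
})

# collect each opportunity's set of stages, then keep those with 3 but not 2
stages = df.groupby('OppID')['Stage'].agg(set)
qualifying = stages[stages.apply(lambda s: 3 in s and 2 not in s)]
print(len(qualifying), list(qualifying.index))  # 1 ['CDF']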