Let's assume we have a DataFrame df with N rows:
| multi-index        | ordinary columns   |
| I_A, I_B, I_C, I_D | C_A, C_B, C_C, C_D |
How can we extract all N values of the I_B index level? df.index gives us all the combinations of I_A...I_D, but that is not what we need. Of course, we could iterate over it, but that would be slow; surely there must be an easier, more straightforward way?
Thank you for your time.
UPDATE
E.g., we have df generated by:
data = {
    "animal": ["cat", "dog", "parrot", "hamster"],
    "size": ["big", "big", "small", "small"],
    "feet": [4, 4, 2, 4]
}
multi = pd.DataFrame(data)
multi.set_index(["size", "feet"], inplace=True)
and which is:
              animal
size  feet
big   4          cat
      4          dog
small 2       parrot
      4      hamster
Its index is:
MultiIndex([( 'big', 4),
( 'big', 4),
('small', 2),
('small', 4)],
names=['size', 'feet'])
from which we would like to get all sizes:
['big', 'big', 'small', 'small']
How can we do that?
I think you're looking for MultiIndex.get_level_values:
multi.index.get_level_values('size')
Output: Index(['big', 'big', 'small', 'small'], dtype='object', name='size')
Or as list:
multi.index.get_level_values('size').to_list()
Output: ['big', 'big', 'small', 'small']
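If you only need the distinct sizes rather than one value per row, get_level_values pairs naturally with unique. A small sketch using the same multi frame from the question:

```python
import pandas as pd

# rebuild the example frame from the question
data = {
    "animal": ["cat", "dog", "parrot", "hamster"],
    "size": ["big", "big", "small", "small"],
    "feet": [4, 4, 2, 4],
}
multi = pd.DataFrame(data).set_index(["size", "feet"])

# all values of the 'size' level, one per row
sizes = multi.index.get_level_values("size").to_list()

# distinct values only, in order of first appearance
unique_sizes = multi.index.get_level_values("size").unique().to_list()
```

Here `sizes` is `['big', 'big', 'small', 'small']` and `unique_sizes` is `['big', 'small']`.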
I have the following dataframe:
+---+---+
| A | B |
+---+---+
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | f |
| 2 | g |
| 3 | j |
+---+---+
I need it in a DataFrame/RDD format like:
(1, [a, b, c])
(2, [f, g])
(3, [j])
I'm new to Spark and was wondering if this operation can be performed by a single function.
I tried using flatMap, but I don't think I'm using it correctly.
You can group by "A" and then use an aggregate function, for example collect_set or collect_list:
import pyspark.sql.functions as F
df = [
    {"A": 1, "B": "a"},
    {"A": 1, "B": "b"},
    {"A": 1, "B": "c"},
    {"A": 2, "B": "f"},
    {"A": 2, "B": "g"},
    {"A": 3, "B": "j"}
]
df = spark.createDataFrame(df)
df.groupBy("A").agg(F.collect_set(F.col("B"))).show()
Output
+---+--------------+
| A|collect_set(B)|
+---+--------------+
| 1| [c, b, a]|
| 2| [g, f]|
| 3| [j]|
+---+--------------+
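Note that collect_set deduplicates and does not guarantee order (the output above shows [c, b, a]), while collect_list keeps every value in encounter order within the group. For readers more at home in pandas, the same grouping can be sketched there too; this is a comparison I'm adding, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3],
    "B": ["a", "b", "c", "f", "g", "j"],
})

# rough pandas analogue of Spark's groupBy("A").agg(collect_list("B")):
# gather each group's B values into a Python list
grouped = df.groupby("A")["B"].agg(list).reset_index(name="B_list")
```

This yields one row per key with `B_list` values `["a", "b", "c"]`, `["f", "g"]`, `["j"]`.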
First step, create sample data.
#
# 1 - Create sample dataframe + view
#
# array of tuples - data
dat1 = [
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "f"),
    (2, "g"),
    (3, "j")
]
# array of names - columns
col1 = ["A", "B"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# register a temporary view
df1.createOrReplaceTempView("sample_data")
Second step, query the temporary view.
%sql
select * from sample_data
%sql
select A, collect_list(B) as B_LIST from sample_data group by A
Last step, write code to execute Spark SQL to create dataframe that you want.
df2 = spark.sql("select A, collect_list(B) as B_LIST from sample_data group by A")
display(df2)
In summary, you can use the DataFrame methods to produce the same output. However, the Spark SQL version reads cleanly and may be easier to follow.
There is a dataframe:
d = {
    'date': ['2020-02-01', '2020-02-01', '2020-02-01', '2020-02-01',
             '2020-02-02', '2020-02-02', '2020-02-02'],
    'type': ['Bird', 'Dog', 'Cat', 'Bird', 'Dog', 'Cat', 'Bird'],
    'weight': [1, 2, 3, 4, 5, 6, 7]
}
df = pd.DataFrame(d)
I would like to split the "type" column by its values and get the columns Bird, Dog, Cat. The values in these columns must be the average weight of the birds, dogs, etc. on the same date.
To get something like that.
date bird dog cat
2020-02-01 ... ... ...
2020-02-02 ... ... ...
I started to try groupby but can't figure it out. Maybe split the dataframe by the value in the "type" column and merge the obtained dataframes again?
Use pivot_table with a mean aggregation to combine the values that share the same index/column pair:
out = df.pivot_table(index='date', columns='type', values='weight', aggfunc='mean') \
        .rename_axis(columns=None).reset_index()
print(out)
# Output:
date Bird Cat Dog
0 2020-02-01 2.5 3.0 2.0
1 2020-02-02 7.0 6.0 5.0
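For this shape of problem, pivot_table with aggfunc='mean' is equivalent to a groupby on both keys followed by unstack; a sketch on the same data, in case the groupby route the asker started with is preferred:

```python
import pandas as pd

d = {
    'date': ['2020-02-01', '2020-02-01', '2020-02-01', '2020-02-01',
             '2020-02-02', '2020-02-02', '2020-02-02'],
    'type': ['Bird', 'Dog', 'Cat', 'Bird', 'Dog', 'Cat', 'Bird'],
    'weight': [1, 2, 3, 4, 5, 6, 7],
}
df = pd.DataFrame(d)

# mean weight per (date, type), then move 'type' into the columns
out = (df.groupby(['date', 'type'])['weight'].mean()
         .unstack()
         .rename_axis(columns=None)
         .reset_index())
```

This produces the same table as the pivot_table answer: columns date, Bird, Cat, Dog with the per-date mean weights.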
columns: datetime | clientid | amounts | new_column_to_be_implemented (rolling mean of the previous amounts, but only over rows with the same clientid)

datetime | clientid | amounts | new column
day 1    | 2        | 50      | (na)
day 2    | 2        | 60      | 50
day 3    | 1        | 45      | (na)
day 4    | 2        | 45      | 110
day 5    | 3        | 90      | (na)
day 6    | 3        | 10      | 90
day 7    | 2        | 10      | 105

so this takes, for example, the mean of the last two amounts with the same clientid.
I know it is possible to add a list and append/pop values to remember them, but is there a better way in pandas?
Please make sure to follow the guidelines described in How to make good reproducible pandas examples when asking pandas-related questions; it helps a lot with reproducibility.
The key element of the answer is the pairing of the groupby and rolling methods: groupby groups all the records with the same clientid, and rolling selects the correct number of records for the mean calculation.
import pandas as pd
import numpy as np
# setting up the dataframe
data = [
    ['day 1', 2, 50],
    ['day 2', 2, 60],
    ['day 3', 1, 45],
    ['day 4', 2, 45],
    ['day 5', 3, 90],
    ['day 6', 3, 10],
    ['day 7', 2, 10]
]
columns = ['date', 'clientid', 'amounts']
df = pd.DataFrame(data=data, columns=columns)
# rolling mean per client; the result carries a (clientid, original index) MultiIndex
rolling_mean = df.groupby('clientid').rolling(2)['amounts'].mean()
# keep only the original row index so the values align back to df
rolling_mean.index = rolling_mean.index.get_level_values(1)
df['client_rolling_mean'] = rolling_mean
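A side note on the expected numbers in the question: 50, 110 and 105 look like sums of the previous two amounts rather than means, and they exclude the current row. If that is the intent, shift each client's series before the rolling aggregation; a sketch under that assumption:

```python
import pandas as pd

data = [
    ['day 1', 2, 50], ['day 2', 2, 60], ['day 3', 1, 45],
    ['day 4', 2, 45], ['day 5', 3, 90], ['day 6', 3, 10],
    ['day 7', 2, 10],
]
df = pd.DataFrame(data, columns=['date', 'clientid', 'amounts'])

# shift(1) excludes the current row; rolling(2, min_periods=1) then
# aggregates up to the two previous amounts of the same client
df['prev_sum'] = (df.groupby('clientid')['amounts']
                    .transform(lambda s: s.shift(1).rolling(2, min_periods=1).sum()))
```

This reproduces the question's expected column exactly: NaN, 50, NaN, 110, NaN, 90, 105.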
I am new to spark and I am trying to do the following, using Pyspark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids by doing the following:
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas, which I'm more familiar with, I would now loop through the dataframe for each distinct id and gather all its rows, but I'm sure this is far from optimal.
Can you give me any ideas? groupby comes to mind, but I'm not sure how to approach it.
You can use groupBy and aggregate with collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
rows = [r[0] for r in df2.select('number').orderBy('number').collect()]
result = [[tuple(j) for j in i] for i in rows]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a numpy array, you can do
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
[4, 4]],
[[1, 1],
[2, 2]]])
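Since the asker mentioned being more at home in pandas: the same grouping can be sketched there without any explicit loop over distinct ids. This is an illustrative comparison, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b'],
    'number1': [1, 2, 3, 4],
    'number2': [1, 2, 3, 4],
})

# one list of (number1, number2) tuples per id
tuples_per_id = {key: list(zip(g['number1'], g['number2']))
                 for key, g in df.groupby('id')}
```

This gives `{'a': [(1, 1), (2, 2)], 'b': [(3, 3), (4, 4)]}`, matching the two lists in the question.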
I have a Pandas DataFrame extracted from a PDF with tabula-py.
The PDF is like this:
+--------------+--------+-------+
| name | letter | value |
+--------------+--------+-------+
| A short name | a | 1 |
+-------------------------------+
| Another | b | 2 |
+-------------------------------+
| A very large | c | 3 |
| name | | |
+-------------------------------+
| other one | d | 4 |
+-------------------------------+
| My name is | e | 5 |
| big | | |
+--------------+--------+-------+
As you can see, A very large name has a line break and, as the original PDF does not have borders, a row with ['name', NaN, NaN] and another with ['A very large', 'c', 3] are created in the DataFrame, when I want only a single one with content ['A very large name', 'c', 3].
The same happens with My name is big.
As this happens for several rows, what I'm trying to achieve is to concatenate the content of the name cell with the previous one when the rest of the cells in the row are NaN, then delete the NaN rows.
But any other strategy that obtain the same result is welcome.
import pandas as pd
import numpy as np
data = {
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
}
df = pd.DataFrame(data)
data_expected = {
    "name": ["A short name", "Another", "A very large name", "other one", "My name is big"],
    "letter": ["a", "b", "c", "d", "e"],
    "value": [1, 2, 3, 4, 5],
}
df_expected = pd.DataFrame(data_expected)
I'm trying code like this, but it is not working:
# Doesn't work, and not very `pandastonic`
nan_indexes = df[df.iloc[:, 1:].isna().all(axis='columns')].index
df.loc[nan_indexes - 1, "name"] = df.loc[nan_indexes - 1, "name"].str.cat(df.loc[nan_indexes, "name"], ' ')
# remove NaN rows
You can try groupby.agg with ' '.join or 'first', depending on the column. The groups are created by checking where letter and value are notna and taking a cumsum:
print(df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
        .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'})
      )
name letter value
1 A short name a 1.0
2 Another b 2.0
3 A very large name c 3.0
4 other one d 4.0
5 My name is big e 5.0
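For reference, the asker's own shift-based idea can also be made to work, under the assumption (true for this data) that each wrapped name spans at most two rows: mark the continuation rows, append their name to the row directly above, then drop them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A short name", "Another", "A very large", "name",
             "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.nan, "d", "e", np.nan],
    "value": [1, 2, 3, np.nan, 4, 5, np.nan],
})

# rows where everything but 'name' is NaN are continuations of the row above
cont = df[["letter", "value"]].isna().all(axis=1)

# append each continuation's name to the row directly above it
df.loc[cont.shift(-1, fill_value=False), "name"] += " " + df.loc[cont, "name"].values

# drop the continuation rows
df = df[~cont].reset_index(drop=True)
```

This reproduces df_expected for the sample data; the groupby/cumsum answer above is more general, since it also handles names wrapped over three or more lines.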