Creating a Weighted Graph from a Julia DataFrame

Given the following DataFrame:
| a | b | c | d |
|---|---|---|---|
| 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 |
How does one efficiently construct a weighted graph such that:
- the nodes correspond to the column names;
- two vertices are connected if they both have a 1 in the same row of the DataFrame (e.g. 'a' is connected to 'c' in the first row);
- the weight equals the number of rows in which the two vertices are connected (e.g. edge 'a'-'c' has weight 2, while 'c'-'d' has weight 1)?
Here is how to construct this graph manually using SimpleWeightedGraphs.jl and GraphPlot.jl:
using SimpleWeightedGraphs, GraphPlot

g = SimpleWeightedGraph(4)   # one vertex per column: a, b, c, d
add_edge!(g, 1, 3, 2)        # a-c, weight 2
add_edge!(g, 1, 4, 2)        # a-d, weight 2
add_edge!(g, 2, 4, 1)        # b-d, weight 1
add_edge!(g, 3, 4, 1)        # c-d, weight 1
nodes = ["a", "b", "c", "d"]
gplot(g, nodelabel=nodes, edgelinewidth=[2, 2, 1, 1])

Something like this should work, assuming df is your data frame:
using DataFrames, SimpleWeightedGraphs, GraphPlot, LinearAlgebra

function gengraph(df)
    g = SimpleWeightedGraph(ncol(df))
    ew = Int[]                        # edge weights, in the order the edges are added
    for i in 1:ncol(df), j in i+1:ncol(df)
        w = dot(df[!, i], df[!, j])   # number of rows where columns i and j are both 1
        if w > 0
            push!(ew, w)
            add_edge!(g, i, j, w)
        end
    end
    gplot(g, nodelabel=names(df), edgelinewidth=ew)
end
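A quick check with the example data from the question (a sketch; it assumes the packages above are installed):
df = DataFrame(a=[1, 1, 1, 0], b=[0, 0, 0, 1], c=[1, 1, 0, 0], d=[0, 1, 1, 1])
gengraph(df)   # edges a-c (2), a-d (2), b-d (1), c-d (1), as in the manual version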

Related

Join/Add data to MultiIndex dataframe in pandas

I have some measurement data from different dust analyses:
- two locations: MC174 and MC042
- two fractions: PM2.5 and PM10
- several analyte results (Cl, Na, K, ...)
I created a multi-column dataframe like this:
|        MC174        |        MC042        |
|  PM2.5   |   PM10   |  PM2.5   |   PM10   |
| Cl | Na | K | Cl | Na | K | Cl | Na | K | Cl | Na | K |
import pandas as pd

location = ['MC174', 'MC042']
fraction = ['PM10', 'PM2.5']
value = ['date', 'Cl', 'NO3', 'SO4', 'Na', 'NH4', 'K', 'Mg', 'Ca', 'masse', 'OC_R', 'E_CR', 'OC_T', 'EC_T']
midx = pd.MultiIndex.from_product([location, fraction, value], names=['location', 'fraction', 'value'])
df = pd.DataFrame(columns=midx)
df
and I prepared four DataFrames with matching columns, one for each location/fraction combination:
| date       | Cl  | Na  | K   |
|------------|-----|-----|-----|
| 01-01-2021 | 3.1 | 4.3 | 1.0 |
| ...        | ... | ... | ... |
| 31-12-2021 | 4.9 | 3.8 | 0.8 |
Now I want to fill the large dataframe with the data from the four location/fraction combinations:
DF1 -> MainDF[MC174][PM10]
DF2 -> MainDF[MC174][PM2.5]
and so on...
My goal is to have one dataframe with the dates of the year in its index, the multilevel column structure I described at the top, and all the data inside it.
I tried:
main_df['MC174']['PM10'].append(data_MC174_PM10)
pd.concat([main_df['MC174']['PM10'], data_MC174_PM10],axis=0)
main_df.loc[:,['MC174'],['PM10']] = data_MC174_PM10
but the dataframe is never filled.
Thanks in advance!
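One approach that should work here (a sketch; apart from data_MC174_PM10, the names of the four prepared DataFrames are assumptions based on the question) is to build the large frame in one go with pd.concat, using tuple keys for the location and fraction levels:
import pandas as pd

# Each small DataFrame is assumed to have a 'date' column plus the value columns.
# Setting 'date' as the index aligns the rows; the tuple keys become the outer
# column levels, and the original columns become the 'value' level.
main_df = pd.concat(
    {
        ('MC174', 'PM10'): data_MC174_PM10.set_index('date'),
        ('MC174', 'PM2.5'): data_MC174_PM25.set_index('date'),
        ('MC042', 'PM10'): data_MC042_PM10.set_index('date'),
        ('MC042', 'PM2.5'): data_MC042_PM25.set_index('date'),
    },
    axis=1,
    names=['location', 'fraction', 'value'],
)
This avoids assigning into an empty frame, which is why the .loc and append attempts in the question never fill anything: the empty main_df has no index rows to align with.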

pandas outliers with and without calculations

I'm contemplating making decisions on outliers in a dataset with over 300 features. I'd like to analyse the frame without removing the data hastily. I have a frame:
| | A | B | C | D | E |
|---:|----:|----:|-----:|----:|----:|
| 0 | 100 | 99 | 1000 | 300 | 250 |
| 1 | 665 | 6 | 9 | 1 | 9 |
| 2 | 7 | 665 | 4 | 9 | 1 |
| 3 | 1 | 3 | 4 | 3 | 6 |
| 4 | 1 | 9 | 1 | 665 | 5 |
| 5 | 3 | 4 | 6 | 1 | 9 |
| 6 | 5 | 9 | 1 | 3 | 2 |
| 7 | 1 | 665 | 3 | 2 | 3 |
| 8 | 2 | 665 | 9 | 1 | 0 |
| 9 | 5 | 0 | 7 | 6 | 5 |
| 10 | 0 | 3 | 3 | 7 | 3 |
| 11 | 6 | 3 | 0 | 3 | 6 |
| 12 | 6 | 6 | 5 | 1 | 5 |
I have coded some introspection to be saved in another frame called _outliers:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = (Q3 - Q1)
min_ = (Q1 - (1.5 * IQR))
max_ = (Q3 + (1.5 * IQR))
# Counts outliers in columns
_outliers = ((df.le(min_)) | (df.ge(max_))).sum().to_frame(name="outliers")
# Gives percentage of data that outliers represent in the column
_outliers["percent"] = (_outliers['outliers'] / _outliers['outliers'].sum()) * 100
# Shows max value in the column
_outliers["max_val"] = df[_outliers.index].max()
# Shows min value in the column
_outliers["min_val"] = df[_outliers.index].min()
# Shows median value in the column
_outliers["median"] = df[_outliers.index].median()
# Shows mean value in the column
_outliers["mean"] = df[_outliers.index].mean()
That yields:
| | outliers | percent | max_val | min_val | median | mean |
|:---|-----------:|----------:|----------:|----------:|---------:|---------:|
| A | 2 | 22.2222 | 665 | 0 | 5 | 61.6923 |
| B | 3 | 33.3333 | 665 | 0 | 6 | 164.385 |
| C | 1 | 11.1111 | 1000 | 0 | 4 | 80.9231 |
| D | 2 | 22.2222 | 665 | 1 | 3 | 77.0769 |
| E | 1 | 11.1111 | 250 | 0 | 5 | 23.3846 |
I would like to measure the impact of the outliers on each column by calculating the mean and the median without them, but without actually removing them from the data. I suppose the best way is to add "~" to the outlier filter, but I get lost in the code... This should benefit a lot of people, since a search on removing outliers yields a lot of results. Aside from why the outliers sneaked into the data in the first place, I just don't think the removal decision should be made without considering the potential impact. Feel free to add other considerations (skewness, sigma, n, etc.).
As always, I'm grateful to this community!
EDIT: I added variance and its square root, the standard deviation, with and without outliers. In some fields you might want to keep outliers and go into ML directly. At least, by inspecting your data beforehand, you'll know how much they contribute to your results. Used with nlargest() on the outliers column, you get a quick view of which features contain the most. You could use this as a basis for filtering features by setting thresholds on variance or mean. Thanks to the contributors, I have a powerful analytics tool now. Hope it can be useful to others.
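As a minimal sketch of the "~" idea from the question, reusing the min_ and max_ bounds computed above (the answers below give fuller approaches):
# Boolean mask of outliers, same shape as df
outlier_mask = df.le(min_) | df.ge(max_)
# df[~outlier_mask] keeps non-outliers and leaves NaN elsewhere;
# mean() and median() skip NaN by default
_outliers["mean_no_outliers"] = df[~outlier_mask].mean()
_outliers["median_no_outliers"] = df[~outlier_mask].median()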
Take advantage of the apply method of DataFrame.
Series generator
Just define the robust mean you want by creating a method that consumes a Series and returns a scalar, then apply it to your DataFrame.
For the IQR mean, here is a simple snippet:
import pandas as pd

def irq_agg(x, factor=1.5, aggregate=pd.Series.mean):
    # Aggregate only the values inside the IQR fences [q1 - factor*IQR, q3 + factor*IQR]
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    return aggregate(x[(q1 - factor*(q3 - q1) < x) & (x < q3 + factor*(q3 - q1))])

data.apply(irq_agg)
# A     3.363636
# B    14.200000
# C     4.333333
# D     3.363636
# E     4.500000
# dtype: float64
The same can be done to filter based on percentiles (two-sided version):
def quantile_agg(x, alpha=0.05, aggregate=pd.Series.mean):
    # Aggregate only the values strictly between the alpha/2 and 1 - alpha/2 quantiles
    return aggregate(x[(x.quantile(alpha/2) < x) & (x < x.quantile(1 - alpha/2))])

data.apply(quantile_agg, alpha=0.01)
# A    12.454545
# B    15.777778
# C     4.727273
# D    41.625000
# E     4.909091
# dtype: float64
Frame generator
Even better, create a function that returns a Series; apply will then produce a DataFrame. This lets us compute a bunch of different means and medians at once and compare them. We can also reuse the Series generator methods defined above:
def analyze(x, alpha=0.05, factor=1.5):
    return pd.Series({
        "p_mean": quantile_agg(x, alpha=alpha),
        "p_median": quantile_agg(x, alpha=alpha, aggregate=pd.Series.median),
        "irq_mean": irq_agg(x, factor=factor),
        "irq_median": irq_agg(x, factor=factor, aggregate=pd.Series.median),
        "standard": x[((x - x.mean())/x.std()).abs() < 1].mean(),
        "mean": x.mean(),
        "median": x.median(),
    })
data.apply(analyze).T
# p_mean p_median irq_mean irq_median standard mean median
# A 12.454545 5.0 3.363636 3.0 11.416667 61.692308 5.0
# B 15.777778 6.0 14.200000 5.0 14.200000 164.384615 6.0
# C 4.727273 4.0 4.333333 4.0 4.333333 80.923077 4.0
# D 41.625000 4.5 3.363636 3.0 3.363636 77.076923 3.0
# E 4.909091 5.0 4.500000 5.0 4.500000 23.384615 5.0
Now you can filter out outliers in several ways and compute relevant aggregates, such as the mean or median, on what remains.
No comment on whether this is an appropriate method to filter out your outliers. The code below should do what you asked:
import numpy as np
import pandas as pd

q1, q3 = df.quantile([0.25, 0.75]).to_numpy()
delta = (q3 - q1) * 1.5
min_val, max_val = q1 - delta, q3 + delta
outliers = (df < min_val) | (max_val < df)
result = pd.concat(
    [
        pd.DataFrame(
            {
                "outliers": outliers.sum(),
                # share of each column's rows that are outliers
                "percent": outliers.sum() / len(df) * 100,
                "max_val": max_val,
                "min_val": min_val,
            }
        ),
        df.agg(["median", "mean"]).T,
        df.mask(outliers, np.nan).agg(["median", "mean"]).T.add_suffix("_no_outliers"),
    ],
    axis=1,
)
Result:
outliers percent max_val min_val median mean median_no_outliers mean_no_outliers
A 2 15.384615 13.5 -6.5 5.0 61.692308 3.0 3.363636
B 3 23.076923 243.0 -141.0 6.0 164.384615 5.0 14.200000
C 1 7.692308 13.0 -3.0 4.0 80.923077 4.0 4.333333
D 2 15.384615 16.0 -8.0 3.0 77.076923 3.0 3.363636
E 1 7.692308 10.5 -1.5 5.0 23.384615 5.0 4.500000

pandas cumcount in pyspark

I'm currently attempting to convert a script I made from pandas to pyspark. I have a dataframe that contains data in the form:
index | letter
------|-------
0 | a
1 | a
2 | b
3 | c
4 | a
5 | a
6 | b
I want to create the following dataframe in which the occurrence count for each instance of a letter is stored, for example the first time we see "a" its occurrence count is 0, second time 1, third time 2:
index | letter | occurrence
------|--------|-----------
0 | a | 0
1 | a | 1
2 | b | 0
3 | c | 0
4 | a | 2
5 | a | 3
6 | b | 1
I can achieve this in pandas using:
df['occurrence'] = df.groupby('letter').cumcount()
How would I go about doing this in pyspark? I cannot find an existing method that is similar.
The feature you're looking for is called window functions:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# row_number() is 1-based while pandas cumcount() is 0-based, so subtract 1;
# orderBy("index") reproduces the original row order within each letter
df.withColumn(
    "occurrence",
    row_number().over(Window.partitionBy("letter").orderBy("index")) - 1,
)
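A quick end-to-end check of the idea above (a sketch; it assumes a local SparkSession):
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "a"), (1, "a"), (2, "b"), (3, "c"), (4, "a"), (5, "a"), (6, "b")],
    ["index", "letter"],
)
w = Window.partitionBy("letter").orderBy("index")
df.withColumn("occurrence", row_number().over(w) - 1).orderBy("index").show()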

pandas: how to get each customer's probability with predict_proba

I am using xgboost with objective='binary:logistic' to calculate, for each customer, the probability that he/she will make the spend.
Using predict_proba in sklearn prints two probabilities, for class 0 and class 1, like:
[[0.56651809 0.43348191]
[0.15598162 0.84401838]
[0.86852502 0.13147498]]
How do I attach each customer's ID with pandas to get something like:
+----+------------+------------+
| ID | prob_0 | prob_1 |
+----+------------+------------+
| 1 | 0.56651809 | 0.43348191 |
| 2 | 0.15598162 | 0.84401838 |
| 3 | 0.86852502 | 0.13147498 |
+----+------------+------------+
You can use the pandas DataFrame() constructor to build that table:
import pandas as pd

list_data = [[0.56651809, 0.43348191], [0.15598162, 0.84401838], [0.86852502, 0.13147498]]
columns = ['prob_0', 'prob_1']
index = [1, 2, 3]
pd.DataFrame(data=list_data, columns=columns, index=index)
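In practice you would not hardcode the probabilities; a sketch of building the frame directly from predict_proba's output, where model, X, and customer_ids are hypothetical names for your fitted classifier, feature matrix, and ID column:
import pandas as pd

# model, X and customer_ids are placeholders, not names from the question
proba = model.predict_proba(X)  # shape: (n_customers, 2)
result = pd.DataFrame(proba, columns=['prob_0', 'prob_1'], index=customer_ids)
result.index.name = 'ID'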

SQLite - Complex Query

This is what I want to get.
Art|CANTIDAD1|CANTIDAD2|CANTIDAD1CARGA1 |CANTIDAD2CARGA1 |CANTIDAD1CARGA2 | CANTIDAD2CARGA2
----------------------------------------------------------------------------------------------
001| 7 | 0 | 4 | 0 | 3 | 0
002| 0 | 2 | 0 | 1 | 0 | 1
003| 2 | 0 | 2 | 0 | 0 | 0
004| 3 | 0 | 1 | 0 | 2 | 0
005| 2 | 0 | 0 | 0 | 2 | 0
006| 0 | 1 | 0 | 0 | 0 | 1
I get CANTIDAD1 and CANTIDAD2 with this query; they are the sums of the amounts matching the WHERE clause:
SELECT
SUM(D.NCANTIDAD1) AS NTOTCANTIDAD1,
SUM(D.NCANTIDAD2) AS NTOTCANTIDAD2
FROM
CABPEDIDOS C,
DETPEDIDOS D,
ARTICULOS A
WHERE
C.DFECHAALBARAN IS NULL
AND C.CSERIE = D.CSERIE
AND C.NPEDIDO = D.NPEDIDO
AND D.NFABRICANTE = A.NFABRICANTE
AND D.CARTICULO = A.CARTICULO
GROUP BY
D.NFABRICANTE, D.CARTICULO, A.CNOMBRE
CANTIDAD1CARGA1 and CANTIDAD2CARGA1 are quantities that are in the database (d.cantidad1 and d.cantidad2 are the real names; I have to sum all of them to get CANTIDAD1 and CANTIDAD2), but I need the quantities broken down by the corresponding C.CARGA:
(CANTIDAD1 = CANTIDAD1CARGA1 + CANTIDAD1CARGA2)
How can I get these values?
** C.NCARGA can have more than one value; I need to get every CANTIDAD1CARGA'x' and CANTIDAD2CARGA'x'.
I don't care if I have to use two queries:
- one for CANTIDAD1 and CANTIDAD2
- another for CANTIDAD1CARGA1, CANTIDAD2CARGA1, CANTIDAD1CARGA2... etc.
I have a feeling I'm not really understanding the question, but it seems like you just need:
SELECT CANTIDAD1CARGA1 + CANTIDAD1CARGA2 AS CANTIDAD1,
CANTIDAD2CARGA1 + CANTIDAD2CARGA2 AS CANTIDAD2,
CANTIDAD1CARGA1, CANTIDAD2CARGA1, CANTIDAD1CARGA2, CANTIDAD2CARGA2
FROM ...
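If the per-CARGA columns have to be derived from the detail rows rather than read directly, one way is conditional aggregation on top of the original query. A sketch, assuming C.NCARGA takes the values 1 and 2 as in the desired output (more values would mean more CASE columns):
SELECT
    SUM(D.NCANTIDAD1) AS CANTIDAD1,
    SUM(D.NCANTIDAD2) AS CANTIDAD2,
    -- sum only the rows belonging to each CARGA
    SUM(CASE WHEN C.NCARGA = 1 THEN D.NCANTIDAD1 ELSE 0 END) AS CANTIDAD1CARGA1,
    SUM(CASE WHEN C.NCARGA = 1 THEN D.NCANTIDAD2 ELSE 0 END) AS CANTIDAD2CARGA1,
    SUM(CASE WHEN C.NCARGA = 2 THEN D.NCANTIDAD1 ELSE 0 END) AS CANTIDAD1CARGA2,
    SUM(CASE WHEN C.NCARGA = 2 THEN D.NCANTIDAD2 ELSE 0 END) AS CANTIDAD2CARGA2
FROM
    CABPEDIDOS C,
    DETPEDIDOS D,
    ARTICULOS A
WHERE
    C.DFECHAALBARAN IS NULL
    AND C.CSERIE = D.CSERIE
    AND C.NPEDIDO = D.NPEDIDO
    AND D.NFABRICANTE = A.NFABRICANTE
    AND D.CARTICULO = A.CARTICULO
GROUP BY
    D.NFABRICANTE, D.CARTICULO, A.CNOMBRE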