SyntaxError defining schema for SparkSQL DataFrame

My PySpark console is telling me that I have invalid syntax on the line following my for loop. The console doesn't execute the for loop until it reaches the schema = StructType(fields) line, where it raises the SyntaxError, but the for loop looks fine to me...
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
lines = sc.textFile('file:///home/w205/hospital_compare/surveys_responses.csv')
parts = lines.map(lambda l: l.split(','))
surveys_responses = parts.map(lambda p: (p[0:33]))
schemaString = 'Provider Number, Hospital Name, Address, City, State, ZIP Code, County Name, Communication with Nurses Achievement Points, Communication with Nurses Improvement Points, Communication with Nurses Dimension Score, Communication with Doctors Achievement Points, Communication with Doctors Improvement Points, Communication with Doctors Dimension Score, Responsiveness of Hospital Staff Achievement Points, Responsiveness of Hospital Staff Improvement Points, Responsiveness of Hospital Staff Dimension Score, Pain Management Achievement Points, Pain Management Improvement Points, Pain Management Dimension Score, Communication about Medicines Achievement Points, Communication about Medicines Improvement Points, Communication about Medicines Dimension Score, Cleanliness and Quietness of Hospital Environment Achievement Points, Cleanliness and Quietness of Hospital Environment Improvement Points, Cleanliness and Quietness of Hospital Environment Dimension Score, Discharge Information Achievement Points, Discharge Information Improvement Points, Discharge Information Dimension Score, Overall Rating of Hospital Achievement Points, Overall Rating of Hospital Improvement Points, Overall Rating of Hospital Dimension Score, HCAHPS Base Score, HCAHPS Consistency Score'
fields = []
for field_name in schemaString.split(", "):
    if field_name != ("HCAHPS Base Score" | "HCAHPS Consistency Score"):
        fields.append(StructField(field_name, StringType(), True))
    else:
        fields.append(StructField(field_name, IntegerType(), True))
schema = StructType(fields)

The | in the != condition is the problem: | is a bitwise OR, not a way to compare field_name against either string. Use not in with a tuple of the two field names instead:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
lines = sc.textFile('file:///home/w205/hospital_compare/surveys_responses.csv')
parts = lines.map(lambda l: l.split(','))
surveys_responses = parts.map(lambda p: (p[0:33]))
schemaString = 'Provider Number, Hospital Name, Address, City, State, ZIP Code, County Name, Communication with Nurses Achievement Points, Communication with Nurses Improvement Points, Communication with Nurses Dimension Score, Communication with Doctors Achievement Points, Communication with Doctors Improvement Points, Communication with Doctors Dimension Score, Responsiveness of Hospital Staff Achievement Points, Responsiveness of Hospital Staff Improvement Points, Responsiveness of Hospital Staff Dimension Score, Pain Management Achievement Points, Pain Management Improvement Points, Pain Management Dimension Score, Communication about Medicines Achievement Points, Communication about Medicines Improvement Points, Communication about Medicines Dimension Score, Cleanliness and Quietness of Hospital Environment Achievement Points, Cleanliness and Quietness of Hospital Environment Improvement Points, Cleanliness and Quietness of Hospital Environment Dimension Score, Discharge Information Achievement Points, Discharge Information Improvement Points, Discharge Information Dimension Score, Overall Rating of Hospital Achievement Points, Overall Rating of Hospital Improvement Points, Overall Rating of Hospital Dimension Score, HCAHPS Base Score, HCAHPS Consistency Score'
fields = []
for field_name in schemaString.split(", "):
    if field_name not in ("HCAHPS Base Score", "HCAHPS Consistency Score"):
        fields.append(StructField(field_name, StringType(), True))
    else:
        fields.append(StructField(field_name, IntegerType(), True))
schema = StructType(fields)
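If it helps, here is a minimal sketch of applying the corrected schema to the surveys_responses RDD built above; it assumes the last two CSV columns parse cleanly as integers (they are declared IntegerType), so adjust the casts if your data contains blanks:
# Cast the last two fields to int so they match the IntegerType columns in the schema
typed = surveys_responses.map(lambda p: tuple(p[:31]) + (int(p[31]), int(p[32])))
surveys_df = sqlContext.createDataFrame(typed, schema)
surveys_df.printSchema()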


Trying to find the highest improvement in score for a given city

I tried grouping it by city, date, and score, then sorting by city and score, and finally applying linear regression to calculate the highest slopes. Code below:
from scipy.stats import linregress
result = data.groupby(['facility_city', 'score', 'activity_date'], as_index=False).sum()
result = result.sort_values(by=['facility_city', 'activity_date'], ascending=True)
print(result)
result = result.apply(lambda v: linregress(v['activity_date'], v['score'])[0])
print(result)
An example of the data is
"activity_date","employee_id","facility_address","facility_city","facility_id","facility_name","facility_state","facility_zip","grade","owner_id","owner_name","pe_description","program_element_pe","program_name","program_status","record_id","score","serial_number","service_code","service_description"
2017-05-09,"EE0000593","17660 CHATSWORTH ST","GRANADA HILLS","FA0175397","HOVIK'S FAMOUS MEAT & DELI","CA","91344","A","OW0181955","JOHN'S FAMOUS MEAT & DELI INC.","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"HOVIK'S FAMOUS MEAT & DELI","ACTIVE","PR0168541",98,"DAHDRUQZO",1,"ROUTINE INSPECTION"
2017-04-10,"EE0000126","3615 PACIFIC COAST HWY","TORRANCE","FA0242138","SHAKEY'S PIZZA","CA","90505","A","OW0237843","SCO, LLC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"SHAKEY'S PIZZA","ACTIVE","PR0190290",94,"DAL3SBUE0",1,"ROUTINE INSPECTION"
2017-04-04,"EE0000593","17515 CHATSWORTH ST","GRANADA HILLS","FA0007801","BAITH AL HALAL","CA","91344","A","OW0031150","SABIR MOHAMMAD SHAHID","FOOD MKT RETAIL (25-1,999 SF) HIGH RISK",1612,"BAITH AL HALAL","INACTIVE","PR0036723",95,"DAL2PIKJU",1,"ROUTINE INSPECTION"
2017-08-15,"EE0000971","44455 VALLEY CENTRAL WAY","LANCASTER","FA0013858","FOOD 4 LESS #306","CA","93536","A","OW0012108","FOOD 4 LESS, INC.","RESTAURANT (0-30) SEATS HIGH RISK",1632,"FOOD 4 LESS DELI/BAKERY#306","ACTIVE","PR0039905",98,"DA0ZMAJXZ",1,"ROUTINE INSPECTION"
2016-09-26,"EE0000145","11700 SOUTH ST","ARTESIA","FA0179671","PHO LITTLE SAIGON","CA","90701","A","OW0185167","PHO SOUTH ST INC","RESTAURANT (61-150) SEATS HIGH RISK",1638,"PHO LITTLE SAIGON","ACTIVE","PR0173311",96,"DA41DBXA2",1,"ROUTINE INSPECTION"
2016-05-11,"EE0000720","1309 S HOOVER ST","LOS ANGELES","FA0179745","HAPPY TACOS TO GO","CA","90006-4903","A","OW0185239","MAT L. MORA","RESTAURANT (0-30) SEATS HIGH RISK",1632,"HAPPY TACOS TO GO","INACTIVE","PR0173403",96,"DAURQTTVR",1,"ROUTINE INSPECTION"
2017-02-28,"EE0000741","4959 PATATA ST","CUDAHY","FA0012590","EL POTRERO CLUB","CA","90201","B","OW0036634","TSAY, SHYR JIN","RESTAURANT (151 + ) SEATS HIGH RISK",1641,"EL POTRERO CLUB","ACTIVE","PR0041708",87,"DAUNXDSVP",1,"ROUTINE INSPECTION"
I am getting a KeyError: 'activity_date' when I try to apply the linear regression. Any tips would be welcome.
You need to group by again before using apply the way you did. Also, linear regression does not work with dates directly; you'll need to convert them to numerical values first (see this post).
Change your last result assignment to:
result = (result.groupby('facility_city')
                .apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
          )
Your full code should look like:
import pandas as pd
import datetime as dt
from scipy.stats import linregress
# To make sure activity_date is a Timestamp:
data['activity_date'] = pd.to_datetime(data.activity_date)
# Sort cities and dates
result = data.sort_values(['facility_city', 'activity_date'])
# Calculate linear regressions and retrieve their slopes
result = (result.groupby('facility_city')
                .apply(lambda x: linregress(x.activity_date.map(dt.datetime.toordinal), x.score)[0])
          )
# Show highest slopes first
result.sort_values(ascending=False)
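If you only need the single city with the steepest improvement, a small follow-up (result here is the slope Series computed above):
# The Series is indexed by facility_city, so idxmax gives the city with the highest slope
print(result.idxmax(), result.max())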

How to calculate tfidf score from a column of dataframe and extract words with a minimum score threshold

I have taken a column of a dataset which has a text description for each row. I am trying to find words with a tf-idf score greater than some value n, but the code gives me a matrix of scores. How do I sort and filter the scores and see the corresponding words?
tempdataFrame = wineData.loc[wineData.variety == 'Shiraz', 'description'].reset_index()
tempdataFrame['description'] = tempdataFrame['description'].apply(lambda x: str.lower(x))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
score = tfidf.fit_transform(tempdataFrame['description'])
Sample Data:
description
This tremendous 100% varietal wine hails from Oakville and was aged over
three years in oak. Juicy red-cherry fruit and a compelling hint of caramel
greet the palate, framed by elegant, fine tannins and a subtle minty tone in
the background. Balanced and rewarding from start to finish, it has years
ahead of it to develop further nuance. Enjoy 2022–2030.
In the absence of a full data frame column of wine descriptions, the sample data you provided is split into three sentences to create a data frame with a single column named 'Description' and three rows. That column is then passed to the tf-idf vectorizer, a new data frame containing the features and their scores is built, and the results are filtered using pandas.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
doc = ['This tremendous 100% varietal wine hails from Oakville and was aged over \
three years in oak.', 'Juicy red-cherry fruit and a compelling hint of caramel \
greet the palate, framed by elegant, fine tannins and a subtle minty tone in \
the background.', 'Balanced and rewarding from start to finish, it has years \
ahead of it to develop further nuance. Enjoy 2022–2030.']
df_1 = pd.DataFrame({'Description': doc})
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
score = tfidf.fit_transform(df_1['Description'])
# New data frame containing the tfidf features and their scores
df = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names())
# Filter the tokens with tfidf score greater than 0.3
tokens_above_threshold = df.max()[df.max() > 0.3].sort_values(ascending=False)
tokens_above_threshold
Out[29]:
wine 0.341426
oak 0.341426
aged 0.341426
varietal 0.341426
hails 0.341426
100 0.341426
oakville 0.341426
tremendous 0.341426
nuance 0.307461
rewarding 0.307461
start 0.307461
enjoy 0.307461
develop 0.307461
balanced 0.307461
ahead 0.307461
2030 0.307461
2022 0.307461
finish 0.307461
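If you also want the qualifying words per description rather than across the whole column, a small follow-up sketch (same df and 0.3 threshold as above):
threshold = 0.3
for i, row in df.iterrows():
    # Keep only this description's tokens whose tf-idf score exceeds the threshold
    top = row[row > threshold].sort_values(ascending=False)
    print('Description {}: {}'.format(i, list(top.index)))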

Comparing compressed distributions per cohort

How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, one per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare distributions:
within a cohort 0 vs. 1 of target_y_n
over multiple cohorts
in a way which is visually still understandable and not only a mess.
Edit
For a single cohort, Plotting pre aggregated data in python could be the answer, but how can multiple cohorts be compared (not just in a loop), given that this leads to too many plots to compare?
I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
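To get baz into a flat frame that the snippets below can use directly (columns value, count, and target_y_n), here is a small preparatory sketch; it starts from the example frame in your question, and treating bar as the cohort label is my assumption:
import pandas as pd

raw = pd.DataFrame({'foo': [1, 2],
                    'bar': ['first', 'second'],
                    'baz': [{'target_y_n': 0, 'value': 0.5, 'count': 1000},
                            {'target_y_n': 1, 'value': 1, 'count': 10000}]})
# Expand the nested dicts into their own columns and keep bar as the cohort label
df = pd.concat([raw[['bar']].rename(columns={'bar': 'cohort'}),
                pd.DataFrame(raw['baz'].tolist())], axis=1)
print(df)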
import seaborn as sns
import matplotlib.pyplot as plt
# Bar and box plots of the value/count pairs, split by target_y_n
sns.catplot('value', 'count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)
sns.catplot('value', 'count', data=df, kind='box', hue='target_y_n', dodge=False)
# Plain matplotlib bars, one series per target_y_n
plt.bar(df[df['target_y_n'] == 0]['value'], df[df['target_y_n'] == 0]['count'], width=1)
plt.bar(df[df['target_y_n'] == 1]['value'], df[df['target_y_n'] == 1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])
# Or a single seaborn barplot with hue
sns.barplot('value', 'count', data=df, hue='target_y_n', dodge=False, ci=None)
Finally try to have a look at the FacetGrid class to extend your comparison (see here).
g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
In your case you would have something like:
g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
And a qqplot option:
from scipy import stats
def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)
g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')

Creating similar samples based on three different categorical variables

I am trying to do an analysis where I create two similar samples based on three different attributes. I want to create these samples first and then run the analysis to see which of the two samples is better. The categorical variables are sales_group, age_group, and country, and I want to build both samples such that the proportions of country, age, and sales are similar in each.
For example: Sample A and B have following variables in it:
Id Country Age Sales
The proportion of Country in Sample A is:
USA- 58%
UK- 22%
India-8%
France- 6%
Germany- 6%
The proportion of country in Sample B is:
India- 42%
UK- 36%
USA-12%
France-3%
Germany- 5%
The same goes for the other categorical variables, age_group and sales_group.
Thanks in advance for any help.
You do not need to establish a special procedure for sampling, as the sample proportion is an unbiased estimate of the population proportion. If you have, say, more than 1000 observations and your samples contain more than, say, 30 rows, the estimates will be quite exact (Central Limit Theorem).
You can see it in the simulation below:
set.seed(123)
n <- 10000 # Amount of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))
s <- 100 # Amount of sampled rows
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]
table(sampleA$sales_group)
# A B C D
# 23 22 32 23
table(sampleB$sales_group)
# A B C D
# 25 22 28 25
DISCLAIMER: However, if some proportions are very small or very large and your samples are too small, you will need more advanced techniques such as Laplace smoothing.
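For reference, roughly the same simulation in pandas (a sketch; the column names simply mirror the R example above):
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
n = 10000  # rows in the source data frame
df = pd.DataFrame({'sales_group': rng.choice(list('ABCD'), n),
                   'age_group': rng.choice(['old', 'young'], n),
                   'country': rng.choice(['USA', 'UK', 'India', 'France', 'Germany'], n),
                   'amount': np.abs(100 * rng.standard_normal(n))})
s = 100  # rows per sample
sample_a = df.sample(s, random_state=1)
sample_b = df.sample(s, random_state=2)
# The group proportions in each random sample track the population proportions
print(sample_a['sales_group'].value_counts())
print(sample_b['sales_group'].value_counts())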

Testing a cube - is slicing by each dimension in turn sufficient?

One for the mathematicians.
Say I have two cubes, or dimensionally-modelled datasets A and B.
To prove that they're identical, is it sufficient to slice each of them by every dimension in turn, and verify that the totals for each member are identical?
A simple example: dimensions Country (England and Scotland), Gender (Male and Female) and Married (Yes or No). Measure CountPeople.
If I slice CountPeople by Country, comparing the results from A and B, then by Gender, then by Married, and find identical results, have I proved that every cell in A and B is identical?
I think that I have, but I'm not sure.
No, slicing on each dimension in turn is not sufficient to prove that the cubes are identical at cell level. It probably will be close enough most of the time, but it's not mathematically guaranteed.
We can prove this with a fairly simple example with just Gender and Country dimensions. Imagine we have the following data at cell level:
(Male, England): 100, (Female, Scotland): 100
If we slice separately by Gender or Country we get:
Male: 100, Female: 100
England: 100, Scotland: 100.
Now if all of those males move to Scotland and all the females move to England, we'll have different data at cell level:
(Male, Scotland): 100, (Female, England): 100
But the data reported by either single dimension will be the same:
Male: 100, Female: 100
England: 100, Scotland: 100
This is a fairly trivial example, but the same possibility exists for non-trivial data, so to be 100% sure two cubes are identical, you would need to validate at cell level.
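If it helps to see that concretely, here is a quick numeric check of the counterexample with pandas (illustrative frames only, not your actual cubes):
import pandas as pd

cube_a = pd.DataFrame({'Gender': ['Male', 'Female'],
                       'Country': ['England', 'Scotland'],
                       'CountPeople': [100, 100]})
cube_b = pd.DataFrame({'Gender': ['Male', 'Female'],
                       'Country': ['Scotland', 'England'],
                       'CountPeople': [100, 100]})
# Every single-dimension slice agrees between the two cubes...
for dim in ['Gender', 'Country']:
    print(cube_a.groupby(dim)['CountPeople'].sum().equals(
          cube_b.groupby(dim)['CountPeople'].sum()))        # True, True
# ...but the cell-level data differ.
print(cube_a.set_index(['Gender', 'Country'])['CountPeople'].equals(
      cube_b.set_index(['Gender', 'Country'])['CountPeople']))  # False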