I am starting out with pandas in a Jupyter notebook. In the error message there is a ^ below the = operator, but I cannot see the problem. What's missing? Thanks!
import pandas as pd
data2 = ([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s = pd.Series(data2)
print(s.shape)
This is the error:
File "<ipython-input-30-57c99bd7e494>", line 4
data2 = ([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
^
SyntaxError: invalid syntax
The proper way to do this is to use separate variables for the data and the index:
import pandas as pd
data2 = [1,2,3,4]
index = ['a','b','c','d']
s = pd.Series(data2,index)
print(s.shape)
Or, as ayhan points out, you could unpack a dictionary with **:
data2 = dict(data=[1,2,3,4], index=['a','b','c','d'])
s = pd.Series(**data2)
print(s.shape)
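You can also skip the extra variables and pass the index directly as a keyword argument to pd.Series, which is effectively what the ** unpacking above does:
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(s.shape)  # (4,)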
Assume that I have a PySpark DataFrame. Some of the cells contain only special characters.
Sample dataset:
import pandas as pd
data = {'ID': [1, 2, 3, 4, 5, 6],
'col_1': ['A', '?', '<', ' ?', None, 'A?'],
'col_2': ['B', ' ', '', '?>', 'B', '\B']
}
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)
I want to count the number of rows which contain only special characters (except blank cells). Values like 'A?' and '\B' and blank cells are not counted.
The expected output will be:
{'ID': 0, 'col_1': 3, 'col_2': 1}
Is there any way to do that?
Taking your sample dataset, this should do:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']}
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)

res = {}
for col_name in df.columns:
    # a cell counts only if it contains a special (non-alphanumeric,
    # non-whitespace) character and no alphanumeric character at all
    df = df.withColumn('matched',
                       when(col(col_name).rlike(r'[^A-Za-z0-9\s]') &
                            ~col(col_name).rlike('[A-Za-z0-9]'),
                            True).otherwise(False))
    res[col_name] = df.select('ID').where(df.matched).count()
print(res)
print(res)
The trick is to combine two regular-expression conditions: the cell must contain at least one special (non-alphanumeric, non-whitespace) character and must not contain any alphanumeric character.
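To see how the two conditions interact on the sample values, here is a small plain-Python sketch of the same logic (using the re module instead of Spark, purely for illustration):
import re

def only_special(value):
    # True when the cell has at least one special (non-alphanumeric,
    # non-whitespace) character and no alphanumeric character at all
    if value is None:
        return False
    has_special = bool(re.search(r'[^A-Za-z0-9\s]', value))
    has_alnum = bool(re.search('[A-Za-z0-9]', value))
    return has_special and not has_alnum

for v in ['A', '?', '<', ' ?', None, 'A?', 'B', ' ', '', '?>', '\\B']:
    print(repr(v), only_special(v))
# '?', '<', ' ?' and '?>' come out True; 'A?', '\B', blanks and None do not,
# matching the expected counts {'col_1': 3, 'col_2': 1}.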
We have a Spark DataFrame. We are trying to find the values with the top 20 frequencies in a column.
Ex) [1, 1, 1, 2, 2, 4]
In the above list,
1 occurs 3 times
2 occurs 2 times
4 occurs 1 time
We are finding this using pandas, and then creating a UDF in Spark and applying it there.
This works for smaller datasets, but when the datasets are very tall (20M rows) we sometimes face memory issues.
from pyspark.sql import functions as F
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("unit_testing_spark").getOrCreate()
spark.sparkContext.setLogLevel("WARN")
def find_freq_values(col_list):
    if len(col_list) == 0:
        return []
    df = pd.DataFrame(col_list, columns=["value"])
    df = df[['value']].groupby(['value'])['value'] \
        .count() \
        .reset_index(name='count') \
        .sort_values(['count'], ascending=False) \
        .head(20)
    res = df.to_dict(orient='records')
    for curr_data in res:
        curr_value = curr_data["value"]
        ldt = str(type(curr_value)).lower()
        if "time" in ldt or "date" in ldt:
            curr_data["value"] = str(curr_value)
    return res

s = find_freq_values([1, 1, 1, 2, 2, 4])
print(s)
# Output: [{'value': 1, 'count': 3}, {'value': 2, 'count': 2}, {'value': 4, 'count': 1}]
column_data = ["col_1"]
column_header = tuple(column_data)
data = [[1], [1], [1], [2], [2], [4]]
df = spark.createDataFrame(data, column_header)
find_freq_udf = F.udf(find_freq_values, ArrayType(MapType(StringType(), StringType(), True)))
freq_res_df = df.select(*[find_freq_udf(F.collect_list(c)).alias(c) for c in df.columns])
freq_res = freq_res_df.collect()[0].asDict()
print(freq_res)
# Output: {'col_1': [{'count': '3', 'value': '1'}, {'count': '2', 'value': '2'}, {'count': '1', 'value': '4'}]}
Error message:
"An error occurred while calling o514.collectToPython.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 66.0 failed 10 times, most recent failure: Lost task 0.9 in stage 66.0 (TID 379) (w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container from a bad node: container_1658234225716_0001_01_000011 on host: w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal. Exit status: 143.
To process multiple columns in parallel we are using the select statement shown above.
How can this be optimized to avoid the memory issue?
One way would be to compute the frequencies in Spark only, without pandas or a UDF. Here is one way to do it:
from pyspark.sql.functions import count, col

x = [1, 1, 1, 2, 2, 4]  # input data
df = spark.createDataFrame([(v,) for v in x], ['ID'])  # one-column DataFrame
df = df.groupBy('ID')
df = df.agg(count('ID').alias('id_count'))
df = df.orderBy(col('id_count').desc())  # most frequent values first
df = df.limit(20)
df.show()
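If several columns need the same treatment, a sketch along these lines (assuming df is the multi-column DataFrame built in the question) keeps all the counting in Spark and only collects the small top-20 result per column:
from pyspark.sql import functions as F

# For each column, count values in Spark, keep the 20 most frequent,
# and collect only that small per-column result to the driver.
freq_res = {}
for c in df.columns:
    top20 = (df.groupBy(c)
               .count()
               .orderBy(F.col('count').desc())
               .limit(20))
    freq_res[c] = [row.asDict() for row in top20.collect()]
print(freq_res)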
from pandas import Series
data = [1000, 2000, 3000]
index = ['a', 'b', 'c']
s = Series(data = data, index = index)
s.to_excel('/content/drive/MyDrive/Colab Notebooks/data.xlsx')
This does not work and it says:
[Errno 2] No such file or directory:
But the directory does exist and the file name is the same!
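A frequent cause of this error in Colab is that Google Drive is not mounted in the current runtime, so the /content/drive path does not exist yet. A minimal sketch of mounting it first (assuming the "Colab Notebooks" folder already exists in your Drive):
from google.colab import drive
drive.mount('/content/drive')  # makes /content/drive/MyDrive/... visible to the runtime

from pandas import Series
s = Series(data=[1000, 2000, 3000], index=['a', 'b', 'c'])
s.to_excel('/content/drive/MyDrive/Colab Notebooks/data.xlsx')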
import pandas as pd

data = {
    'X': [3, 2, 0, 1],
    'Y': [0, 3, 7, 2]
}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
df.style.set_properties(**{
    'font-family': 'Courier New'
})
df
The index column is displayed in bold; is it possible to change the font of the index column?
You must use table_styles. In this example I manage to set "font-weight": "normal" for the index and column headers:
Let's define some test data:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4],
'B':[5,4,3,1]})
We define style customization to use:
styles = [
dict(selector="th", props=[("font-weight","normal"),
("text-align", "center")])]
We pass the style variable as the argument for set_table_styles():
html = (df.style.set_table_styles(styles))
html
The output is a rendered table in which the index and column headers are no longer bold and are centered.
Feel free to read the pandas Styling documentation for more details.
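Applied back to the DataFrame from the question, a sketch like the following would un-bold just the index while keeping the Courier New body font. The "th.row_heading" selector (the CSS class pandas assigns to index cells) is my assumption for targeting only the index rather than all headers:
index_styles = [dict(selector="th.row_heading",          # index header cells only
                     props=[("font-weight", "normal"),
                            ("font-family", "Courier New")])]

(df.style
   .set_properties(**{'font-family': 'Courier New'})  # body cells
   .set_table_styles(index_styles))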
I'm new to Python and to working with APIs.
My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime
threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal()-3).strftime("%F")
import http.client
conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
# "CF" : "SG.UdhzjmjYR**.-",
}
df = []  # list that collects the stats rows for each key

for name, value in keys.items():
    headers = {'authorization': "Bearer " + value}
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c = d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c)  # Load data row into df
#1
df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name':'Category'}, inplace=True)
df_metric =pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace = True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1,'BrandId',99)
sendgrid.rename(columns={
'blocks':'Blocks',
'bounce_drops' : 'BounceDrops',
'bounces': 'Bounces',
'clicks':'Clicks',
'deferred':'Deferred',
'delivered':'Delivered',
'invalid_emails': 'InvalidEmails',
'opens':'Opens',
'processed':'Processed',
'requests':'Requests',
'spam_report_drops' : 'SpamReportDrops',
'spam_reports' : 'SpamReports',
'unique_clicks' : 'UniqueClicks',
'unique_opens' : 'UniqueOpens',
'unsubscribe_drops' : 'UnsubscribeDrops',
'unsubscribes': 'Unsubscribes'
},
inplace=True)
When I run this however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know this happens because there are no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle these exceptions in my code? This is going to run as a daily report, and it will break if the error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your DataFrame.
You believe the error occurred because of "stats": []. That is also not quite true: if any of the lists were empty, the error would instead be ValueError: arrays must all be same length.
I have recreated the problem below to give you an idea of how to overcome it.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: You can see there is no column called 'D' in the DataFrame, so recheck your columns.
Now add 'D' and see what happens:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs to contain the same number of items as 'A', 'B', and 'C'.
Overcoming both problems:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
print(df)
Output:
      0     1     2
A     1     4     5
B     4     5     6
C     a     b     c
D  None  None  None
You can see the columns are now represented as rows. You can then use loc to select each of those rows:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df = df.loc[['A']]  # use loc to select the row for column 'A'
print(df)
Output:
   0  1  2
A  1  4  5
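To tie this back to the daily report, a simple guard can also skip a key whenever the API returns an empty stats list, so the later column selection never fails. This is a sketch using the names from the question; the helper stats_rows_or_none is hypothetical, not part of the original code:
import json

def stats_rows_or_none(raw_bytes, name, report_date):
    """Return the 'stats' list from a SendGrid response, or None when it is empty."""
    d = json.loads(raw_bytes.decode("utf-8"))
    c = d.get('stats', [])
    if not c:  # e.g. {"date": "2020-02-16", "stats": []}
        print("No stats for {} on {}; skipping.".format(name, report_date))
        return None
    return c

# usage inside the question's loop:
# c = stats_rows_or_none(data, name, threedaysago)
# if c is None:
#     continue
# df.append(c)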