How to convert LinkedIn API country geo codes into country names

Is there any file or mapping or lookup table that maps the country codes to their corresponding country names?
One option is to maintain your own mapping and apply it as a PySpark UDF:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Dictionary that maps country codes to their corresponding country names
country_map = {
    "us": "United States",
    "ca": "Canada",
    "mx": "Mexico",
    # Add more country mappings as needed
}
# PySpark UDF that maps a country code to its country name
@udf(returnType=StringType())
def get_country_name(country_code):
    return country_map.get((country_code or "").lower(), "Unknown")
# Create a PySpark DataFrame with the organizationalEntityFollowerStatistics geo codes
spark = SparkSession.builder.appName("CountryCodeDemo").getOrCreate()
data = [("us", 100), ("ca", 200), ("mx", 300)]
df = spark.createDataFrame(data, ["country_code", "follower_count"])
# Add a column with the corresponding country name for each country code
df = df.withColumn("country_name", get_country_name(df["country_code"]))
# Display the follower count and country name for each row
df.show()
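If you'd rather avoid a Python UDF, the same lookup can be done with Spark's built-in create_map over literal key/value pairs, which keeps the work inside the JVM. A minimal sketch, reusing the country_map dictionary and df from above:
from itertools import chain
from pyspark.sql.functions import create_map, lit, lower, coalesce
# Flatten the dict into alternating key/value literals for create_map
mapping_expr = create_map([lit(x) for x in chain(*country_map.items())])
# Look up the lowercased code; fall back to "Unknown" when there is no match
df = df.withColumn(
    "country_name",
    coalesce(mapping_expr[lower(df["country_code"])], lit("Unknown")),
)
df.show()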

Related

Convert a column to a list of previous columns in a Dataframe

I would like to create a column in the form of a list of values from two previous columns, such as a location column made up of the longitude and latitude columns.
This is what the DataFrame looks like:
You can create a new column based on other columns using zip, as follows:
import pandas as pd
df = pd.DataFrame({
'admin_port': ['NORTH SHIELDS', 'OBAN'],
'longitude': [-1.447104, -5.473469],
'latitude': [55.008766, 54.415695],
})
df['new'] = pd.Series(list(zip(df['longitude'].values, df['latitude'].values)))
print(df)
admin_port longitude latitude new
0 NORTH SHIELDS -1.447104 55.008766 (-1.447104, 55.008766)
1 OBAN -5.473469 54.415695 (-5.473469, 54.415695)
For your information, you can see how to use zip() here: https://www.w3schools.com/python/ref_func_zip.asp
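As a side note, the same column can be built by applying tuple row-wise; this is usually slower than the zip approach on large frames, but it reads clearly. A minimal sketch against the same df:
df['new'] = df[['longitude', 'latitude']].apply(tuple, axis=1)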

Dictionary type data sorting

I have this type of data
{"id":"colvera","gg_unique_id":"colvera","gg_unique_prospect_account_id":"cobra-hq-enterprises","completeness_score":100,"full_name":"chris olvera","first_name":"chris","last_name":"olvera","linkedin_url":"linkedin.com/in/colvera","linkedin_username":"colvera","facebook_url":null,"twitter_url":null,"email":"colvera#cobrahq.com","mobile_phone":null,"industry":"information technology and services","title":"independent business owner","company_name":"cobra hq enterprises","domain":"cobrahq.com","website":"cobrahq.com","employee_count":"1-10","company_linkedin_url":"linkedin.com/company/cobra-hq-enterprises","company_linkedin_username":"cobra-hq-enterprises","company_location":"raymore, missouri, united states","company_city":"raymore","company_state":"missouri","company_country":"united states"
I want to set "id", "gg_unique_id", etc. as column names and the values as rows. How can I do that?
I'm trying the following code, but nothing happens:
import pandas as pd
import numpy as np
data = pd.read_csv("1k_sample_data.txt")
data.info()
df = pd.DataFrame.from_dict(data)
df
I am new to this type of data; any help would be appreciated.
Looks like you have data in JSON format. Try:
df = pd.read_json("1k_sample_data.txt", lines=True)
print(df)
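A minimal, self-contained sketch of why lines=True matters here, assuming the file holds one JSON object per line (JSON Lines); the records below are hypothetical:
import io
import pandas as pd
# Two hypothetical records, one JSON object per line
raw = io.StringIO(
    '{"id": "colvera", "full_name": "chris olvera"}\n'
    '{"id": "jdoe", "full_name": "jane doe"}\n'
)
df = pd.read_json(raw, lines=True)
print(df)  # each key becomes a column, each line becomes a row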

Restructuring DataFrames (stored in a dictionary)

I have football data stored in a dictionary by league for different seasons. So, for example, I have the results from one league for the seasons 2017-2020 in one DataFrame stored in the dictionary. Now I need to create new DataFrames by season, so that I have all the results from 2019 in one DataFrame. What is the best way to do this?
Thank you!
Assuming you are using openfootball as your source:
- use the GitHub API to get all files in the repo
- a function to normalize the JSON
- it is then simple to generate either a concatenated DF or a dict of all the results
import requests
import pandas as pd
# normalize football scores data into a dataframe
def structuredf(res):
    js = res.json()
    if "rounds" not in js:
        return (pd.json_normalize(js["matches"])
                .pipe(lambda d: d.join(d["score.ft"].apply(pd.Series).rename(columns={0: "home", 1: "away"})))
                .drop(columns="score.ft")
                .rename(columns={"round": "name"})
                .assign(seasonname=js["name"], url=res.url)
                )
    df = (pd.json_normalize(pd.json_normalize(js["rounds"])
                            .explode("matches").to_dict("records"))
          .assign(seasonname=js["name"], url=res.url)
          .pipe(lambda d: d.join(d["matches.score.ft"].apply(pd.Series).rename(columns={0: "home", 1: "away"})))
          .drop(columns="matches.score.ft")
          .pipe(lambda d: d.rename(columns={c: c.split(".")[-1] for c in d.columns}))
          )
    return df
# get a listing of all the data files that we're interested in
res = requests.get("https://api.github.com/repos/openfootball/football.json/git/trees/master?recursive=1")
dfm = pd.DataFrame(res.json()["tree"])
# concat into one dataframe
df = pd.concat([structuredf(res)
                for p in dfm.loc[dfm["path"].str.contains(r"/en\.[0-9]+\.json"), "path"].items()
                for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p[1]}")]])
# dictionary of dataframes keyed by season name
d = {res.json()["name"]: structuredf(res)
     for p in dfm.loc[dfm["path"].str.contains(r"/en\.[0-9]+\.json"), "path"].items()
     for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p[1]}")]}
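To get back to the original question (all results from one season in one DataFrame), you can filter the concatenated frame on the seasonname column assigned above; the exact season labels depend on the repo's data, so treat this as a sketch:
# e.g. every match from a season whose name mentions 2019 (the label format is an assumption)
df_2019 = df[df["seasonname"].str.contains("2019")]
print(df_2019.head())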

How to extract the unique values and their counts from a column and store them in a DataFrame with an index key

I am new to pandas. I have a simple question:
how do I extract the unique values and their counts from a column and store them in a DataFrame with an index key?
I have tried:
df = df1['Genre'].value_counts()
and I am getting a Series, but I don't know how to convert it to a DataFrame object.
A pandas Series has a .to_frame() method. Try it:
df = df1['Genre'].value_counts().to_frame()
And if you want to "switch" the rows to columns:
df = df1['Genre'].value_counts().to_frame().T
Update: Full example if you want them as columns:
import pandas as pd
import numpy as np
np.random.seed(400)  # to reproduce the random values
df1 = pd.DataFrame({
    'Genre': np.random.choice(['Comedy', 'Drama', 'Thriller'], size=10)
})
df = df1['Genre'].value_counts().to_frame().T
print(df)
Returns:
Thriller Comedy Drama
Genre 5 3 2
Or try:
df = pd.DataFrame(df1['Genre'].value_counts())
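If you want the counts as two ordinary columns rather than an index, a common idiom is the following; the column names here are my own choice:
df = df1['Genre'].value_counts().rename_axis('Genre').reset_index(name='count')
This yields a DataFrame with a 'Genre' column and a 'count' column, one row per unique value.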

groupby all results without re-sorting

Sort in groupby does not work the way I thought it would.
In the following example, I do not want to group the "USA" rows together, because there is a row of "Russia" between them.
from io import StringIO
import pandas as pd
myst = """india, 905034 , 19:44
USA, 905094 , 19:33
Russia, 905154 , 21:56
USA, 345345, 45:55
USA, 34535, 65:45
"""
u_cols = ['country', 'index', 'current_tm']
myf = StringIO(myst)
df = pd.read_csv(myf, sep=',', names=u_cols)
When I use groupby I get the following:
df.groupby('country', sort=False).size()
country
india 1
USA 3
Russia 1
dtype: int64
Is there any way I can get results like this instead?
country
india 1
USA 1
Russia 1
USA 2
You could try this bit of code instead of a direct groupby:
country = []  # initialising lists
count = []
# Group on a counter that increases by 1 every time the value in the country column changes
for i, g in df.groupby([(df.country != df.country.shift()).cumsum()]):
    country.append(g.country.tolist()[0])  # add the name of the country to the list
    count.append(len(g.country.tolist()))  # add the number of consecutive rows for that country
pd.DataFrame(data={'country': country, 'count': count})  # bind the lists into a dataframe
The expression df.groupby([(df.country != df.country.shift()).cumsum()]) groups on a counter that assigns a new number (cumulatively) at every change of country in the country column.
In the for loop, i represents the unique cumulative number assigned to each run of a country, and g holds the corresponding full row(s) from your original dataframe.
g.country.tolist() outputs a list of the country names for each run (i.e. each value of i):
['india']
['USA']
['Russia']
['USA', 'USA']
for your given data.
Therefore, the first item is the name of the country and the length represents the number of appearances. This info can then be recorded in lists and put together into a dataframe to give the output you require.
You could also use list comprehensions rather than the for loop:
cumulative_df = df.groupby([(df.country != df.country.shift()).cumsum()])  # groupby on the change counter
country = [g.country.tolist()[0] for i, g in cumulative_df]  # country name for each run
count = [len(g.country.tolist()) for i, g in cumulative_df]  # row count for each run
Reference: Pandas DataFrame: How to groupby consecutive values
Using the trick given in @user2285236's comment:
df['Group'] = (df.country != df.country.shift()).cumsum()
df.groupby(['country', 'Group'], sort=False).size()
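The MultiIndex result can then be flattened back into the requested shape; a small sketch, where the 'count' column name is my own choice:
out = (df.groupby(['country', 'Group'], sort=False).size()
       .reset_index(name='count')
       .drop(columns='Group'))
print(out)  # prints india 1, USA 1, Russia 1, USA 2: one row per consecutive run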