Day 20: Star Wars and strings

The .str functions in pandas

.str.strip()

import numpy as np
import pandas as pd

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])

s.str.strip()

s.str.strip('123.!? \n\t')

s.str.strip('1234.!? \n\t')
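The last two calls differ only by the character '4'. Here is a minimal sketch of why that matters, using the same series as above: the argument to .str.strip() is treated as a set of characters to remove from both ends, not as a substring. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])

# No argument: only leading and trailing whitespace is removed.
whitespace_only = s.str.strip()

# '4' is not in this character set, so stripping stops at the start of '4. Beat?\t'.
without_four = s.str.strip('123.!? \n\t')

# With '4' in the set, the leading '4. ' is removed as well.
with_four = s.str.strip('1234.!? \n\t')

print(without_four.tolist())
print(with_four.tolist())
```

Missing values pass through every call unchanged.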

.str.replace()

s.str.replace('Ant.', 'Man')
s.str.replace('a', 8)  # raises TypeError: the replacement must be a string or callable
s.str.replace('a', '8')
s.str.replace('a', '8', case = False)
s.str.replace('a|e', '8', case = False, regex = True)

s.str.replace(r'\d', '', regex = True)
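Whether the pattern is read as literal text or as a regular expression changes the result, and since pandas 2.0 the default for .str.replace() is regex=False. The sketch below uses a small made-up series, `demo` (not part of the lesson data), to show the difference.

```python
import pandas as pd

# Hypothetical series chosen so that literal and regex matching disagree.
demo = pd.Series(['1. Ant.', 'Ants', 'Bee 22'])

# Literal match: only the exact text 'Ant.' is replaced.
literal = demo.str.replace('Ant.', 'Man', regex=False)

# Regex match: '.' matches any character, so 'Ants' is replaced too.
pattern = demo.str.replace('Ant.', 'Man', regex=True)

# Alternation and character classes only work when regex=True.
either_vowel = demo.str.replace('a|e', '8', case=False, regex=True)
no_digits = demo.str.replace(r'\d', '', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Passing regex explicitly keeps the behavior the same across pandas versions.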

.str.split()

Let’s split these series into multiple columns.

s2 = pd.Series(['1-20', '21-50', '51-80', '81-100', np.nan])
s3 = pd.Series(
    [
        "this is a regular sentence",
        "https://docs.python.org/3/tutorial/index.html",
        np.nan
    ]
)
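One possible way to do the split, sketched with the two series above: expand = True returns a DataFrame with one column per piece, while the default returns a list in each cell. The names `ranges`, `words`, and `url_parts` are only for illustration.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['1-20', '21-50', '51-80', '81-100', np.nan])
s3 = pd.Series(
    [
        "this is a regular sentence",
        "https://docs.python.org/3/tutorial/index.html",
        np.nan
    ]
)

# Split on '-' and spread the pieces across columns 0 and 1.
ranges = s2.str.split('-', expand=True)

# Split on spaces; rows with fewer pieces are padded with missing values.
words = s3.str.split(' ', expand=True)

# Without expand, each cell holds a list of pieces.
url_parts = s3.str.split('/')

print(ranges)
print(words)
print(url_parts.iloc[1])
```

The missing row stays missing in every column of the expanded result.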

.str.cat()

two_columns = s2.str.split("-", expand = True).rename(
   columns = {0: 'minimum', 1: 'maximum'})

two_columns.fillna("").agg("__".join, axis = 1)

two_columns.minimum.str.cat(two_columns.maximum, sep = "__")
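The two approaches above treat missing values differently. A small sketch, assuming a cut-down version of two_columns built by hand: .str.cat() leaves a row missing if either piece is missing, unless na_rep is supplied.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for two_columns, with one fully missing row.
two_columns = pd.DataFrame({
    'minimum': ['1', '21', np.nan],
    'maximum': ['20', '50', np.nan],
})

# Default: the missing row stays missing.
default_cat = two_columns.minimum.str.cat(two_columns.maximum, sep='__')

# na_rep substitutes a string for missing values before joining.
filled_cat = two_columns.minimum.str.cat(two_columns.maximum, sep='__', na_rep='')

# Filling first and joining row by row gives the same strings as na_rep=''.
agg_join = two_columns.fillna('').agg('__'.join, axis=1)

print(default_cat.tolist())
print(filled_cat.tolist())
```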

Cleaning our data

Creating column names

Let’s look at the column names and figure out what we have.

  • Where are the column names located?
  • Why are they stored like that?
  • How could we shorten them?

Now what do we want?

Run the code below and tell me what we have.

url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'

dat_cols = pd.read_csv(url, encoding = "ISO-8859-1", nrows = 1).melt()
dat = pd.read_csv(url, encoding = "ISO-8859-1", skiprows = 2, header = None)
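To see the same two-step read without downloading anything, here is a sketch on a tiny made-up CSV that imitates the survey layout (a question row, an answer-option row, then data). The text in `csv_text` and the names `toy_cols` and `toy_dat` are assumptions for illustration, not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey file: two header rows, then two responses.
csv_text = (
    "RespondentID,Which films have you seen?,,Gender\n"
    ",Episode I,Episode II,Response\n"
    "1,Episode I,,Male\n"
    "2,,Episode II,Female\n"
)

# Read the question row as the header plus the one answer-option row below it.
header = pd.read_csv(io.StringIO(csv_text), nrows=1)

# melt() stacks that single row into 'variable' (question) and 'value' (option) columns.
toy_cols = header.melt()

# Read only the responses, with no header, so the columns are numbered 0..3.
toy_dat = pd.read_csv(io.StringIO(csv_text), skiprows=2, header=None)

print(toy_cols)
print(toy_dat)
```

Blank header cells show up as 'Unnamed: N', which is why the question text appears only once per group of columns.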

Creating new columns

# %%
# 'Which of the following Star Wars films have you seen? Please select all that apply.' as 'seen'
# 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' as 'rank'
# 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.' as 'view'
# 'Do you consider yourself to be a fan of the Star Trek franchise?' as 'star_trek_fan'
# 'Do you consider yourself to be a fan of the Expanded Universe\\?\x8cæ' as 'expanded_fan'
# 'Are you familiar with the Expanded Universe?' as 'know_expanded'
# 'Have you seen any of the 6 films in the Star Wars franchise?' as 'seen_any'
# 'Do you consider yourself to be a fan of the Star Wars film franchise?' as 'star_wars_fans'
# 'Which character shot first?' as 'shot_first'
# The code snippet below shows four example replacements; the other questions still need to be added.

# this is not complete.
variables_replace = {
    'Which of the following Star Wars films have you seen\\? Please select all that apply\\.':'seen',
    'Do you consider yourself to be a fan of the Expanded Universe\\?\x8cæ':'expanded_fan',
    r'Unnamed: \d{1,2}':np.nan,
    ' ':'_',
}
# Only one of the three replacements is shown here. The other two:
# 'Response' is replaced with ''
# ' ' is replaced with '_'
values_replace = {
    'Star Wars: Episode ':'',
}

dat_cols_use = (dat_cols
    .assign(
        value_replace = lambda x:  x.value.str.strip().replace(values_replace, regex=True),
        variable_replace = lambda x: x.variable.str.strip().replace(variables_replace, regex=True)
    )
    .ffill()
    .fillna(value = "")
    .assign(column_names = lambda x: x.variable_replace.str.cat(x.value_replace, sep = "__").str.strip('__').str.lower())
    )
dat_cols_use

dat.columns = dat_cols_use.column_names.to_list()
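The key idea in the pipeline is forward-filling the question text over the 'Unnamed' columns before gluing question and answer option together. Below is a simplified sketch on a hand-built table; it swaps the dict-based regex replace for .where() and an exact-match replace, and the data and names (`toy_cols`, `question`, `answer`, `names`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical melted header: question text in 'variable', answer option in 'value'.
toy_cols = pd.DataFrame({
    'variable': ['RespondentID', 'Which films have you seen?', 'Unnamed: 2', 'Gender'],
    'value': [np.nan, 'Episode I', 'Episode II', 'Response'],
})

# Blank out the 'Unnamed' placeholders, then carry the last real question forward.
question = (toy_cols.variable
    .where(~toy_cols.variable.str.startswith('Unnamed'))
    .ffill()
    .replace({'Which films have you seen?': 'seen'})
    .str.replace(' ', '_', regex=False))

# Clean the answer options: no missing values, no 'Response', no spaces.
answer = (toy_cols.value
    .fillna('')
    .str.replace('Response', '', regex=False)
    .str.replace(' ', '_', regex=False))

# Join the two parts, drop stray underscores at the ends, and lowercase.
names = question.str.cat(answer, sep='__').str.strip('_').str.lower()

print(names.tolist())
```

Because .str.strip() works on a set of characters, stripping '_' here removes the trailing '__' left behind when the answer part is empty.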