Day 20: Star Wars and strings

The `.str` functions in pandas

.strip: Strip white space
.replace: replace one string of characters with another.
.split: Separate a character string into two values.
.cat: Combine strings in series

`.str.strip()`

s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', '4. Beat?\t', np.nan])

s.str.strip()

s.str.strip('123.!? \n\t')

s.str.strip('1234.!? \n\t')

`.str.replace()`

s.str.replace('Ant.', 'Man')
s.str.replace('a', 8)
s.str.replace('a', '8')
s.str.replace('a', '8', case = False)
s.str.replace('a|e', '8', case = False)

s.str.replace('\d', '', case = False)

`.str.split()`

Let’s split these series into multiple columns.

s2 = pd.Series(['1-20', '21-50', '51-80', '81-100', np.nan])
s3 = pd.Series(
    [
        "this is a regular sentence",
        "https://docs.python.org/3/tutorial/index.html",
        np.nan
    ]
)

`.str.cat()`

two_columns = s2.str.split("-", expand = True).rename(
   columns = {0: 'minimum', 1: 'maximum'})

two_columns.fillna("").agg("__".join, axis = 1)

two_columns.minimum.str.cat(two_columns.maximum, sep = "__")

Cleaning our data

Creating column names

Let’s look at the column names and figure out what we have.

Where are the column names located?
Why are they stored like that?
How could we shorten them?

Now what do we want?

Run the below code and tell me what we have.

url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'

dat_cols = pd.read_csv(url, encoding = "ISO-8859-1", nrows = 1).melt()
dat = pd.read_csv(url, skiprows =2, header = None )

Creating new columns

# %%
# Which of the following Star Wars films have you seen? Please select all that apply.' as 'seen'
# Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' as 'rank'
# Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.' as 'view'
# Do you consider yourself to be a fan of the Star Trek franchise?' as 'star_trek_fan'
# Do you consider yourself to be a fan of the Expanded Universe\\?\x8cæ' as 'expanded_fan'
# Are you familiar with the Expanded Universe?' as 'know_expanded'
# Have you seen any of the 6 films in the Star Wars franchise?' as 'seen_any'
# Do you consider yourself to be a fan of the Star Wars film franchise?' as 'star_wars_fans'
# Which character shot first?' as 'shot_first'
# see the code snippet for the other four replaclements.  
# the four examples.  Should fix the other questions.

# this is not complete.
variables_replace = {
    'Which of the following Star Wars films have you seen\\? Please select all that apply\\.':'seen',
    'Do you consider yourself to be a fan of the Expanded Universe\\?\x8cæ':'expanded_fan',
    'Unnamed: \d{1,2}':np.nan,
    ' ':'_',
}
# one example.  My code has three.
# 'Response' is replaced with '' 
# ' ' is replaced with '_'
values_replace = {
    'Star Wars: Episode ':'',
}

dat_cols_use = (dat_cols
    .assign(
        value_replace = lambda x:  x.value.str.strip().replace(values_replace, regex=True),
        variable_replace = lambda x: x.variable.str.strip().replace(variables_replace, regex=True)
    )
    .fillna(method = 'ffill')
    .fillna(value = "")
    .assign(column_names = lambda x: x.variable_replace.str.cat(x.value_replace, sep = "__").str.strip('__').str.lower())
    )
dat_cols_use

dat.columns = dat_cols_use.column_names.to_list()

Day 20: Star Wars and strings

The .str functions in pandas

.str.strip()

.str.replace()

.str.split()

.str.cat()

Cleaning our data

Creating column names

Let’s look at the column names and figure out what we have.

Now what do we want?

Creating new columns

The `.str` functions in pandas

`.str.strip()`

`.str.replace()`

`.str.split()`

`.str.cat()`