Python String Methods
Python’s str class comes with various built-in methods to manipulate strings. Here are some common examples:
mystring = 'hello'
print(mystring.capitalize()) # Capitalizes the first letter: 'Hello'
print(mystring.isdigit()) # Checks if the string contains only digits: False
To explore all available string methods:
help(str)
Working with Text in Pandas
Pandas exposes the same kind of text methods on its Series objects through the .str accessor, which applies a method to every element of the Series in a single call. Refer to the official Pandas documentation for advanced string indexing and regular expressions.
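As a rough illustration of the string indexing and regular-expression support mentioned above, here is a minimal sketch. The codes Series and its values are hypothetical sample data, not part of the original examples.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
codes = pd.Series(["AB-123", "CD-456", "EF-78x"])

# String indexing: slice every element through the .str accessor
prefixes = codes.str[:2]

# Regex extraction: capture the digits that follow the hyphen;
# expand=False returns a Series because there is a single capture group
digits = codes.str.extract(r"-(\d+)", expand=False)

# Regex test: str.match anchors at the start of each string,
# and the trailing $ requires the pattern to reach the end
is_valid = codes.str.match(r"[A-Z]{2}-\d{3}$")

print(prefixes.tolist())  # ['AB', 'CD', 'EF']
print(digits.tolist())    # ['123', '456', '78']
print(is_valid.tolist())  # [True, True, False]
```

The last element fails the validity check because only two digits follow the hyphen, even though the extraction still finds them.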
Applying Text Methods to a Pandas Series
import pandas as pd
names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])
print(names.str.capitalize()) # Capitalizes the first letter of each string
print(names.str.isdigit()) # Checks if each element contains only digits
Splitting, Extracting, and Expanding Strings
tech_finance = ['GOOG,AAPL,AMZN', 'JPM,BAC,GS']
tickers = pd.Series(tech_finance)
# Splitting strings into lists
print(tickers.str.split(','))
# Extracting the first ticker from each string
print(tickers.str.split(',').str[0])
# Expanding the split strings into multiple columns
print(tickers.str.split(',', expand=True))
Cleaning and Editing Strings
Messy string data can often be cleaned using a combination of str methods:
messy_names = pd.Series(["andrew ", "bo;bo", " claire "])
# Removing semicolons and extra spaces
cleaned = messy_names.str.replace(";", "").str.strip()
print(cleaned)
# Capitalizing the cleaned strings
final_names = cleaned.str.capitalize()
print(final_names)
Using apply() for Custom Cleaning Logic
For more complex cleaning tasks, use apply() with a custom function:
def cleanup(name):
    name = name.replace(";", "")  # Remove semicolons
    name = name.strip()           # Trim leading and trailing whitespace
    name = name.capitalize()      # Capitalize the first letter
    return name
cleaned_names = messy_names.apply(cleanup)
print(cleaned_names)
Comparing Performance: .str vs apply() vs np.vectorize
When handling large datasets, performance matters. Here’s an example using the timeit module to measure the execution time of different approaches:
Setup Code
# timeit runs this setup string once before timing each statement
setup = '''
import pandas as pd
import numpy as np
messy_names = pd.Series(["andrew ", "bo;bo", " claire "])
def cleanup(name):
    name = name.replace(";", "")
    name = name.strip()
    name = name.capitalize()
    return name
'''
Measuring Performance
import timeit
stmt_pandas_str = '''
messy_names.str.replace(";", "").str.strip().str.capitalize()
'''
stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''
stmt_pandas_vectorize = '''
np.vectorize(cleanup)(messy_names)
'''
print(timeit.timeit(setup=setup, stmt=stmt_pandas_str, number=10000))
print(timeit.timeit(setup=setup, stmt=stmt_pandas_apply, number=10000))
print(timeit.timeit(setup=setup, stmt=stmt_pandas_vectorize, number=10000))
Key Takeaway
.str methods are convenient for most use cases.
apply() offers flexibility for custom logic but may be slower.
np.vectorize() provides a balance of performance and flexibility, often outperforming apply().
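Timings on a three-element Series are dominated by call overhead, so it may help to repeat the comparison on a larger input. The sketch below assumes a hypothetical big_names Series built by repeating the sample names. The helper names with_str, with_apply, and with_vectorize are illustrative only. It first checks that the three approaches agree, then times each one. Results will vary with hardware and library versions, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger dataset: the three sample names repeated 10,000 times
big_names = pd.Series(["andrew ", "bo;bo", " claire "] * 10_000)

def cleanup(name):
    """Remove semicolons, trim whitespace, and capitalize one string."""
    return name.replace(";", "").strip().capitalize()

def with_str(s):
    # Chained .str methods; regex=False treats ";" as a literal character
    return s.str.replace(";", "", regex=False).str.strip().str.capitalize()

def with_apply(s):
    # Calls cleanup once per element
    return s.apply(cleanup)

def with_vectorize(s):
    # np.vectorize returns a NumPy array, so wrap it back into a Series
    return pd.Series(np.vectorize(cleanup)(s), index=s.index)

# Confirm all three approaches agree before comparing their timings
expected = with_str(big_names).tolist()
assert with_apply(big_names).tolist() == expected
assert with_vectorize(big_names).tolist() == expected

for label, func in [(".str", with_str),
                    ("apply", with_apply),
                    ("np.vectorize", with_vectorize)]:
    seconds = timeit.timeit(lambda: func(big_names), number=10)
    print(f"{label:>12}: {seconds:.4f} s for 10 runs")
```

Passing a lambda to timeit.timeit avoids the setup string entirely, which keeps the functions and data visible to the rest of the script.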