What you should know about working with text in Python

Python grows more popular every year. Its clear syntax and rich libraries make programming fast and effective. The language keeps evolving, so users regularly get new versions and new libraries. All of this makes Python a popular first language for people who have not dealt with computer science before.

Standard functions

Thanks to Python's "batteries included" approach, you have access to numerous functions that are ready to use out of the box. Let's start from the beginning.

Concatenation

The easiest way to concatenate strings is the + operator.

>>> print("Hello" + " " + "world" + "!")
Hello world!

You can also use the multiplication operator *, which repeats the string the specified number of times.

>>> print("Hello\n" * 3)
Hello
Hello
Hello
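Both operators can be combined in ordinary code. The snippet below is only a small sketch; the variable names and values are made up for illustration. It also shows one thing worth remembering: + joins strings only, so a number has to be converted first.

```python
# Hypothetical values, chosen only for illustration.
name = "world"
attempts = 3

greeting = "Hello" + " " + name + "!"

# + only joins strings, so a number has to be converted with str() first;
# "Attempt " + attempts would raise TypeError.
label = "Attempt " + str(attempts)

# Repetition gives a separator exactly as wide as the greeting.
separator = "-" * len(greeting)

print(greeting)
print(separator)
print(label)
```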

Characters in a string

Since a string is made up of characters, we may want to do something with them, for example extract or count them.

First, let's count the characters in a string, that is, determine its length.

>>> len("Hello world")
11

Now let’s find out what the third character of the word “Poznan” is:

>>> word = "Poznan"
>>> print(word[2]) # indexing starts from zero!
z

But what if we want to receive a part of a word?

>>> print(word[:3])
Poz
>>> print(word[1:3])
oz
>>> print(word[1:])
oznan

As you can see, extracting individual characters and pieces of a string is extremely easy. A string can be indexed like a Python list in which each character is one element. The index in square brackets selects the desired character (indexing starts from zero). A range of characters works the same way: the number before the colon is the first index of the range, and the number after the colon is the index at which the range stops, which is itself not included. An omitted number means "start from the beginning of the string" or "go to the end". What's more, just as with lists, indexes may be negative, in which case they are counted from the end.

>>> word = 'Poznan'
>>> word[-1]
'n'
>>> word[:-1]
'Pozna'
>>> word[-3:]
'nan'
>>> word[-3:-1]
'na'

You can add a third parameter, which defines the step used when taking a slice of a list (or string).

>>> word[1:5]
'ozna'
>>> word[1:5:2]
'on'
>>> word[::2]
'Pza'
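A few edge cases of slicing are worth trying out yourself. The sketch below reuses the word from the examples above; the remaining variable names are mine. It shows a negative step, slice bounds that run past the end of the string, and the difference between slicing and plain indexing when the index does not exist.

```python
word = "Poznan"

reversed_word = word[::-1]   # a negative step walks the string backwards
clipped = word[2:100]        # slice bounds past the end are simply clipped
empty = word[4:2]            # start after stop (with a positive step) gives ""

# A single index, unlike a slice, must exist.
try:
    word[100]
    index_failed = False
except IndexError:
    index_failed = True

print(reversed_word, clipped, repr(empty), index_failed)
```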

Similarities to lists

I mentioned that strings can be indexed like lists, but that doesn’t mean they are the same thing: unlike lists, strings are immutable. Nevertheless, representing a string as a list is not a problem. All you have to do is call the list() function.

>>> name = "merixstudio" # a variable called str would shadow the built-in type
>>> list(name)
['m', 'e', 'r', 'i', 'x', 's', 't', 'u', 'd', 'i', 'o']

What if we reverse this process?

>>> "".join(['m', 'e', 'r', 'i', 'x', 's', 't', 'u', 'd', 'i', 'o'])
'merixstudio'

We call join() on the string that should separate the elements of the list (it can be empty, as shown above). All elements of the list must be strings. This is useful when we want to display list elements separated by commas.

>>> ", ".join(['apple', 'pear', 'banana', 'orange'])
'apple, pear, banana, orange'

The opposite of join() is split(), which creates a list from a string by cutting it at the given separator.

>>> "apple, pear, banana, orange".split(", ")
['apple', 'pear', 'banana', 'orange']
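split() and join() have a few behaviours that the examples above don't show. The sketch below uses made-up sample strings. It covers split() without arguments, empty fields, the optional limit on the number of cuts, and joining values that are not strings.

```python
# Without arguments, split() cuts on any run of whitespace
# and ignores leading and trailing whitespace.
words = "  apple   pear\tbanana \n".split()

# With an explicit separator, empty fields are kept.
fields = "apple,,banana".split(",")

# The optional second argument limits the number of cuts.
key, value = "path=/usr/bin:/bin=x".split("=", 1)

# join() accepts strings only, so other values are converted first.
numbers = ", ".join(str(n) for n in [1, 2, 3])

print(words, fields, key, value, numbers)
```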

String transformations

In Python, there are also methods that return a modified copy of the string, for instance:

  • changing letter case;

>>> "merixstudio".upper()
'MERIXSTUDIO'
>>> "merixstudio".capitalize()
'Merixstudio'
>>> "MERIX".lower()
'merix'

  • positioning the string within a field of a specified width: centered or aligned to the left / right;

>>> "merix".center(15)
'     merix     '
>>> "merix".ljust(15)
'merix          '
>>> "merix".rjust(15)
'          merix'
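These three methods also accept an optional fill character as a second argument, which makes them handy for simple text reports. A minimal sketch, with made-up data and field widths:

```python
# Made-up data: (product name, quantity).
rows = [("apple", 3), ("banana", 12)]

# Names padded with dots to 10 characters, quantities right-aligned to 4.
report = [name.ljust(10, ".") + str(qty).rjust(4) for name, qty in rows]

# The fill character works for center() too.
banner = "merix".center(15, "*")

print(banner)
for line in report:
    print(line)
```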

  • whitespace control: removing it from the beginning and/or end of the string.

>>> " merix ".strip()
'merix'
>>> " merix ".lstrip()
'merix '
>>> " merix ".rstrip()
' merix'

There are also methods that provide some information about the string.

>>> "merixstudio".isdigit()
False
>>> "merixstudio".islower()
True
>>> "merixstudio".isupper()
False
>>> "merixstudio".count('i')
2

The isdigit() method checks whether the string consists only of digits, islower() / isupper() check whether all of its letters are lowercase / uppercase, and count() counts the occurrences of the given substring, here the letter "i".
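These checks have some edge cases that are easy to trip over. The sketch below uses sample values of my own to show how signs, decimal points, empty strings and strings without letters are treated, and that count() counts occurrences that do not overlap.

```python
# isdigit() accepts digits only: no sign, no decimal point, no empty string.
year_ok = "2024".isdigit()
negative_ok = "-5".isdigit()
decimal_ok = "3.14".isdigit()
empty_ok = "".isdigit()

# islower() looks only at letters, but needs at least one of them.
mixed_lower = "abc123".islower()
digits_lower = "123".islower()

# count() works with longer substrings and does not count overlaps.
double_s = "Mississippi".count("ss")
overlapping = "aaaa".count("aa")

print(year_ok, negative_ok, decimal_ok, empty_ok)
print(mixed_lower, digits_lower, double_s, overlapping)
```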

Searching

At a basic level, searching is possible without regular expressions. The simplest way is the in operator:

>>> 'm' in 'merixstudio'
True
>>> 'a' in 'merixstudio'
False

The find() method, on the other hand, returns the index at which the first occurrence of the searched substring starts. If it doesn’t find anything, it returns -1.

>>> 'merixstudio'.find('m')
0
>>> 'merixstudio'.find('x')
4
>>> 'merixstudio'.find('mer')
0
>>> 'merixstudio'.find('a')
-1
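Since find() stops at the first occurrence, getting every position requires calling it again with a start index, which it accepts as an optional second argument. The helper below is a hypothetical sketch (find_all is my own name, not a built-in). It also shows index(), the sibling method that raises an exception instead of returning -1.

```python
def find_all(text, sub):
    """Hypothetical helper: return the start index of every occurrence of sub."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Continue searching one character after the previous hit.
        start = text.find(sub, start + 1)
    return positions

all_i = find_all("merixstudio", "i")
no_a = find_all("merixstudio", "a")

# index() behaves like find(), but raises ValueError when nothing is found.
try:
    "merixstudio".index("a")
    index_raised = False
except ValueError:
    index_raised = True

print(all_i, no_a, index_raised)
```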

The replace() method allows simple text transformations by replacing all occurrences of a specified substring with a new string.

>>> "I like cats.".replace("cats", "dogs")
'I like dogs.'
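Two details are worth adding here: replace() takes an optional third argument that limits the number of replacements, and it never changes the original string, because strings are immutable. A short sketch with a sample sentence of my own:

```python
text = "cats chase cats"

all_replaced = text.replace("cats", "dogs")
first_only = text.replace("cats", "dogs", 1)  # replace at most one occurrence

# The original string is untouched; replace() returned new strings.
print(text)
print(all_replaced)
print(first_only)
```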

Regular expressions

Regular expressions are used in Python, as in other programming languages, to find pieces of strings that match a given pattern. They are very useful for all kinds of text searches, including more complicated ones, as well as for transforming text by removing or replacing specific fragments, splitting it into parts, and checking whether it conforms to a pattern, which is handy for example when validating form fields. Because the construction of regular expressions is quite a broad topic, I will limit myself to describing the basic symbols.

In a nutshell, regular expressions are used to construct patterns by which we search the text. They include:

  • alphabetic, numeric and special characters;
  • alternatives separated by the | character, for example 'good (morning|evening)';
  • square brackets [abc], which match exactly one of the characters listed inside them, for example 'be[ae]r', or any one character that is not listed when ^ is the first character inside the brackets, for example 'c[^a]t';
  • the dot ., which matches any character except a new line (preceded by a backslash, \., it becomes an ordinary dot);
  • quantifiers:
    • * (Kleene star) - matches 0 or more repetitions,
    • + - matches one or more repetitions,
    • ? - matches 0 or 1 repetition.

The * and + quantifiers are greedy by default, which means that they match the longest possible substring. We can make them lazy by appending the ? character, which gives *? or +?.

  • anchors:
    • ^ - matches the beginning of a string,
    • $ - matches the end of a string;
  • special sequences:
    • \d - a digit from 0 to 9,
    • \w - any letter, digit or underscore _,
    • \t - tab character,
    • \r - carriage return,
    • \n - new line,
    • \s - any whitespace character;
  • braces (written without spaces inside):
    • {x} - exactly x repetitions,
    • {x,y} - from x to y repetitions,
    • {x,} - at least x repetitions.
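Some of the symbols above are easier to grasp in action. The sketch below relies on the re module and its findall function, both described later in the article; the sample strings are made up. It shows character sets, a negated set, braces, and the difference between a greedy and a lazy quantifier.

```python
import re

# Square brackets: exactly one of the listed characters.
one_of = re.findall(r"be[ae]r", "beer bear bier")

# Negated set: any single character except "a".
none_of = re.findall(r"c[^a]t", "cat cot cut")

# Braces: runs of two to four digits (longer runs are cut into pieces).
counted = re.findall(r"\d{2,4}", "7 42 2024 123456")

# Greedy + takes as much as it can, lazy +? as little as it can.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<.+>", html)
lazy = re.findall(r"<.+?>", html)

print(one_of, none_of, counted)
print(greedy)
print(lazy)
```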

“re” module

We can take advantage of regular expressions in Python by importing the re module. Let’s learn some of its basic functions.

The match function checks for the pattern only at the beginning of a string.

>>> import re
>>> result = re.match(r'Hello', 'Hello there. What a wonderful day.')
>>> result.group()
'Hello'
>>> re.match(r'Hello', 'What a wonderful day. Hello!') # no match at the beginning, returns None

In the example above I’m searching for the word "Hello". To get the matched text I’m using the group() method, about which you will learn more later in the article.
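Because match returns None when the pattern is not found, calling group() directly on a failed match raises AttributeError. In real code it is safer to check the result first. A minimal sketch, where leading_hello is a hypothetical helper of mine:

```python
import re

def leading_hello(text):
    """Hypothetical helper: return the matched text, or None if there is no match."""
    result = re.match(r"Hello", text)
    if result is None:  # match() found nothing at the beginning of text
        return None
    return result.group()

found = leading_hello("Hello there. What a wonderful day.")
missing = leading_hello("What a wonderful day. Hello!")

print(found, missing)
```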

The search function checks for a pattern in the whole string.

>>> result = re.search(r'Hello', 'What a wonderful day. Hello!')
>>> result.group()
'Hello'

We can mark the groups to which we want to refer by enclosing parts of the pattern in round brackets. By default group() refers to index 0, which stores the whole matched substring. Other indices refer to the corresponding bracketed groups. The groups() method returns all subgroups gathered in a tuple (one of Python's data structures).

>>> result = re.search(r'(.+) there\.\sWhat a (.+) day', 'Hello there. What a wonderful day.')
>>> result.group()
'Hello there. What a wonderful day'
>>> result.group(0)
'Hello there. What a wonderful day'
>>> result.group(1)
'Hello'
>>> result.group(2)
'wonderful'
>>> result.groups()
('Hello', 'wonderful')
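Since groups() returns a tuple, its elements can be unpacked straight into variables. The sketch below pulls the parts of a date out of a sentence; the sentence and the date in it are invented for the example.

```python
import re

# Invented sample text containing a date in YYYY-MM-DD form.
result = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2016-03-14.")

whole = result.group(0)             # the entire matched fragment
year, month, day = result.groups()  # one string per bracketed group

print(whole, year, month, day)
```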

match and search can be used with flags, such as:

  • re.IGNORECASE / re.I - case-insensitive matching;
  • re.MULTILINE / re.M - ^ and $ match at the start and end of each line instead of only at the beginning and end of the entire string;
  • re.DOTALL / re.S - the dot matches any character, including a new line.

>>> re.search(r'hello', 'Hello there. What a wonderful day.') # no match, returns None
>>> result = re.search(r'hello', 'Hello there. What a wonderful day.', re.I)
>>> result
<re.Match object; span=(0, 5), match='Hello'>
>>> result.group()
'Hello'
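The other two flags can be tried out in the same way. The sketch below uses a two-line sample string of my own and findall, which is described later in the article. The last line shows that several flags can be combined with the | operator.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ matches only at the very beginning of the string.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, the dot stops at the new line, so nothing is found.
across_default = re.search(r"first.+second", text)
across_dotall = re.search(r"first.+second", text, re.S).group()

# Flags can be combined with |.
combined = re.findall(r"^SECOND", text, re.I | re.M)

print(starts_default, starts_multiline)
print(across_default, repr(across_dotall), combined)
```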

You can also refer to an already matched group inside the pattern itself, here with \1:

>>> result = re.search(r'(cat)(.+)(\1)', "I like my cat, but my cat doesn't like me.")
>>> result.groups()
('cat', ', but my ', 'cat')
>>> result.group()
'cat, but my cat'

And finally, to get a list of all fragments that match the pattern, use findall.

>>> result = re.findall(r'[^\w](\w{3})[^\w]', "I like my cat, but my cat doesn't like me.")
>>> result
['cat', 'but', 'cat']

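This pattern finds words that are exactly three letters long, provided each one is surrounded by non-word characters. Because those surrounding characters are consumed by the match, a three-letter word at the very start or end of the string, or one directly following another match, is not reported.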
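If you need every three-letter word, one option is the \b symbol, which is not on the list of basic symbols above. It matches a word boundary without consuming any character. The comparison below is only a sketch on a sample sentence of my own.

```python
import re

sentence = "The cat sat on the mat"

# The pattern from the article: the surrounding characters are consumed,
# so words at the edges and words sharing a space with a previous hit are lost.
consuming = re.findall(r"[^\w](\w{3})[^\w]", sentence)

# \b only asserts a word boundary and consumes nothing.
bounded = re.findall(r"\b\w{3}\b", sentence)

print(consuming)
print(bounded)
```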

Another (and the last) function I want to present is sub. It replaces matched fragments with a new string, similarly to the previously mentioned replace(). However, thanks to regular expressions, sub is far more capable.

>>> re.sub(r'<.+?>', '', '<div>Do you like Python?</div>')
'Do you like Python?'
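A few more things sub can do, sketched on sample strings of my own: the replacement may refer to groups with \1, \2 and so on, the optional count argument limits the number of replacements, and a greedy pattern can remove far more than intended.

```python
import re

# Groups can be reused in the replacement, here to reorder a date.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3.\2.\1", "2016-03-14")

# Collapse every run of whitespace into a single space.
squeezed = re.sub(r"\s+", " ", "too   many\t\tspaces")

# count limits how many matches are replaced.
first_only = re.sub(r"cat", "dog", "cat cat cat", count=1)

# With a greedy quantifier the whole string is one match, so everything goes.
greedy_strip = re.sub(r"<.+>", "", "<div>Do you like Python?</div>")

print(swapped, squeezed, first_only, repr(greedy_strip))
```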

These are just some of the features that Python's re module offers. Check out the Python documentation if you want to read about its other capabilities or are interested in a more extensive description of regular expressions.

As we can see, Python is a great tool for text processing. I hope I was able to show how well it handles text and that I have encouraged you to explore this wonderful language.
