trivago tech tips - #18 Python Edition
At trivago we have two big use cases for Python: backend software engineering on the one hand, and data science with machine learning on the other. Although the line between the two groups is blurred, the five tips that follow are more about software development. Enjoy!
#1 Basic data structure sets
If you want to find the common elements of two lists, consider converting them to sets first and using their powerful operations.
hotel_traits = ["beach", "party", "pet friendly"]
desired_traits = ["beach", "hiking"]
print(f"fulfilled traits: {set(hotel_traits) & set(desired_traits)}")
A set is a very useful data structure in Python. Use it when the order doesn’t matter, and an element doesn’t need to be included multiple times.
You can use it to check for membership in O(1) on average and to find common elements or differences with other sets. For an in-depth blog post about sets, visit realpython.com.
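As a small illustration of the operations mentioned above, here is a minimal sketch that reuses the two lists from the example. The names `fulfilled`, `missing`, `extras` and `has_beach` are only chosen for this illustration.

```python
hotel_traits = ["beach", "party", "pet friendly"]
desired_traits = ["beach", "hiking"]

hotel_set = set(hotel_traits)
desired_set = set(desired_traits)

# Intersection: traits the guest wants and the hotel offers.
fulfilled = hotel_set & desired_set
# Difference: traits the guest wants but the hotel lacks.
missing = desired_set - hotel_set
# Difference the other way round: traits the hotel offers on top.
extras = hotel_set - desired_set
# Membership test, O(1) on average.
has_beach = "beach" in hotel_set

print(fulfilled, missing, has_beach)
```

Keep in mind that sets are unordered, so the printed order of elements is not guaranteed.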
#2 Dict unpacking
When you instantiate an injected object in your function, it is a nice idea to allow for all possible arguments during construction. Python’s dict unpacking will help you in this case.
def get_hotels(traits, search_factory=default_search_builder, **kwargs):
    search_instance = search_factory(backend=config.backend, **kwargs)
    ...
    search_instance.start_search(traits, debug=kwargs.get("debug"))
    yield from search_instance.generate_results()
...
search_config = dict(traits=desired_traits, strategy="ab_test_42", debug=True)
get_hotels(**search_config)
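Dict unpacking is a very powerful feature. Unpacking a dictionary into a call with ** has been possible for a long time, and Python 3.5 extended it so that several dictionaries can be unpacked in one call or inside a dict literal. It allows you to carry function parameters in a dictionary, pick out specific parameters by name, and catch all others in a separate dictionary. That dictionary can then be passed on to calls further down the stack.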
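To make the "pick out some, catch the rest, pass it down" idea concrete, here is a minimal self-contained sketch. The functions `build_search` and `find_hotels` and the value `"hypothetical_backend"` are made up for this illustration and are not part of the code above.

```python
def build_search(backend, strategy="default", **options):
    # `backend` and `strategy` are matched by name;
    # anything else is collected in `options`.
    return {"backend": backend, "strategy": strategy, "options": options}


def find_hotels(traits, **kwargs):
    # `traits` is filtered out by name; the remaining keyword
    # arguments are passed down to the next call untouched.
    search = build_search(backend="hypothetical_backend", **kwargs)
    return traits, search


search_config = dict(traits=["beach"], strategy="ab_test_42", debug=True)
traits, search = find_hotels(**search_config)
print(traits, search)
```

Here `debug` travels through `find_hotels` without that function having to know about it.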
When you start using it, you will probably also start to encounter new errors from time to time. For example, when someone tries to call get_hotels with a backend parameter like this:
get_hotels(desired_traits, strategy="ab_test_42", backend=NewBackend, debug=True)
You will get this TypeError:
TypeError: default_search_builder() got multiple values for keyword argument 'backend'
If you want to allow the caller to pass in another backend, you need to preprocess the kwargs like this:
clean_kwargs = dict(backend=config.backend)
clean_kwargs.update(**kwargs)
search_instance = search_factory(**clean_kwargs)
You create a new dictionary with all default arguments, and update it with everything that was passed in earlier. Now you use this dictionary, unpack it and call your function.
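The defaults-then-overrides pattern can be tried in isolation. The sketch below assumes a hypothetical helper `merge_kwargs` and placeholder backend names; it also shows the equivalent dict literal form that Python 3.5 and later allow.

```python
def merge_kwargs(defaults, overrides):
    # Copy the defaults so the original dictionary is not modified,
    # then let the caller's values win.
    merged = dict(defaults)
    merged.update(overrides)
    return merged


defaults = {"backend": "config_backend"}
overrides = {"backend": "new_backend", "debug": True}

merged = merge_kwargs(defaults, overrides)
# Same result as a single expression: later entries override earlier ones.
same = {**defaults, **overrides}
print(merged)
```

Because the merged dictionary contains `backend` only once, unpacking it into a call can no longer trigger the "multiple values" TypeError.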
If you want to prohibit overriding the backend because it should be set only via the config, you can remove it manually. It is a good idea, though, to log that something had to be corrected. This helps others, and your future self, when the code is changed later on.
# pop with a default, so that a call without "backend" does not raise a KeyError
if (illegal_backend := kwargs.pop("backend", None)) is not None:
    log.warning(f"a backend {illegal_backend!r} was ignored")
search_instance = search_factory(backend=config.backend, **kwargs)
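A runnable sketch of the same idea follows. It passes a default to `pop`, so a call without `backend` does not raise a `KeyError`, and it avoids the walrus operator so it also runs on Python versions before 3.8. `CONFIG_BACKEND`, `fake_search_factory` and `build_search` are stand-ins invented for this example.

```python
import logging

log = logging.getLogger(__name__)

CONFIG_BACKEND = "config_backend"  # stand-in for config.backend


def fake_search_factory(backend, **kwargs):
    # Stand-in for the real factory: just echo what it received.
    return {"backend": backend, **kwargs}


def build_search(**kwargs):
    # Remove a caller-supplied backend, if any, and say so in the log.
    illegal_backend = kwargs.pop("backend", None)
    if illegal_backend is not None:
        log.warning("a backend %r was ignored", illegal_backend)
    return fake_search_factory(backend=CONFIG_BACKEND, **kwargs)


with_override = build_search(backend="new_backend", debug=True)
without_override = build_search(debug=True)
print(with_override, without_override)
```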
#3 defaultdict
Do you catch yourself checking for the existence of a key before modifying a value in a dictionary over and over again? Use a defaultdict and save two lines of code.
# instead of
error_statistics = {}
...
if error not in error_statistics:
    error_statistics[error] = 0
error_statistics[error] += 1
# do this
from collections import defaultdict
error_statistics = defaultdict(int)
...
error_statistics[error] += 1
The defaultdict calls the factory that was passed to it whenever you access the dictionary with a missing key, and stores the result under that key. Other than that, it works like a normal dictionary and offers all the same methods. For example, if you want to group a list of hotels by the type of validation error, you can use a defaultdict like this:
error_statistics = defaultdict(list)
for hotel in hotels:
    errors = validate(hotel)
    ...
    for error in errors:
        error_statistics[error].append(hotel)
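The grouping above can be tried end to end with a little made-up data. In the sketch below, the `validate` function, the hotel dictionaries and the error names are assumptions for illustration only.

```python
from collections import defaultdict


def validate(hotel):
    # Hypothetical validation: report which required fields are empty.
    errors = []
    if not hotel.get("address"):
        errors.append("MissingAddress")
    if not hotel.get("image"):
        errors.append("MissingImage")
    return errors


hotels = [
    {"name": "A", "address": "Main St 1", "image": None},
    {"name": "B", "address": None, "image": None},
    {"name": "C", "address": "Beach Rd 2", "image": "c.png"},
]

hotels_by_error = defaultdict(list)
for hotel in hotels:
    for error in validate(hotel):
        hotels_by_error[error].append(hotel["name"])

grouped = dict(hotels_by_error)
print(grouped)

# Pitfall: merely reading a missing key inserts it with the default value.
before = "InvalidName" in hotels_by_error
_ = hotels_by_error["InvalidName"]
after = "InvalidName" in hotels_by_error
```

If you only want to look up a key without creating it, use `in` or `.get()` instead of square brackets.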
#4 Counter
Often, when you use a defaultdict to count elements by key, you also want to know which key was counted the most. Here, too, you don’t have to reinvent the wheel: you can make use of the Counter from the Python standard library.
from collections import Counter
error_statistics = Counter()  # instantiate the Counter, don't just reference the class
for hotel in hotels:
    errors = validate(hotel)
    ...
    for error in errors:
        error_statistics.update([error])
for error, count in error_statistics.most_common(3):
    print(f"We have {count} {error} errors in our input")
This could produce an output like this:
We have 10 MissingAddress errors in our input
We have 4 InvalidName errors in our input
We have 2 MissingImage errors in our input
Be careful: the update method needs an iterable (or a mapping of counts) as input. If you want to update it one element at a time, as in our case, you need to pass a list with one element. Also be careful with strings: they are iterable too, so passing a single string does not raise an exception but counts each character separately, which is rarely what you want.
For a full list of the errors, you can call it like this: error_statistics.most_common(None).
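The string pitfall is easiest to see side by side with the intended usage. The following sketch uses made-up error names. Incrementing with `counts[error] += 1` is shown as an alternative to `update([error])` for single elements.

```python
from collections import Counter

errors = ["MissingAddress", "InvalidName", "MissingAddress"]

counts = Counter()
for error in errors:
    # Has the same effect as counts.update([error]).
    counts[error] += 1

# Pitfall: a bare string is iterated character by character.
char_counts = Counter()
char_counts.update("abca")

top = counts.most_common(1)
# Calling most_common() without an argument is the same as passing None.
everything = counts.most_common()
print(top, everything, char_counts)
```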
#5 Named Tuple
Let’s look at this example:
arguments = get_arguments()
main(*arguments)
This example has a code smell because it couples both functions together, and the reader can’t immediately see which arguments are passed. To improve readability, we can pass the parameters by name instead of by position:
main(
    desired_traits=arguments[0],
    backend=arguments[1],
    debug=arguments[2]
)
Now we can change the order of the parameters in the main function without having to adapt the get_arguments function. But get_arguments is still coupled to main: whenever we change the order of the elements it returns, or add new elements in the middle, we need to adapt the code.
One solution is to use dictionaries and dict unpacking from tip #2, but I want to show you a slightly more elegant solution today: named tuples. If we change the get_arguments function to return a named tuple, it is still compatible with all code that expects normal tuples, but it allows accessing members by name instead of by a cryptic index.
Named tuples have been around for a while. They got a revamp in Python 3.6, which added the class-based syntax with type annotations, and they are available in every maintained Python version.
from typing import NamedTuple, List
class Arguments(NamedTuple):
    desired_traits: List[str]
    backend: str
    debug: bool = False
...
main(
    desired_traits=arguments.desired_traits,
    backend=arguments.backend,
    debug=arguments.debug
)
Once you access the elements by name, you are free to add new elements without affecting the existing code. You have decoupled both functions.
FYI: There is also a namedtuple in the collections module, but the syntax to construct it is not as nice.
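For comparison, here is a minimal sketch of both variants side by side. The body of `get_arguments` and the value `"hypothetical_backend"` are assumptions for this example. The `defaults` parameter of `collections.namedtuple` requires Python 3.7 or newer.

```python
from collections import namedtuple
from typing import List, NamedTuple


class Arguments(NamedTuple):
    desired_traits: List[str]
    backend: str
    debug: bool = False


def get_arguments():
    # Hypothetical implementation, just to have something to return.
    return Arguments(desired_traits=["beach"], backend="hypothetical_backend")


arguments = get_arguments()

# Access by name, and still compatible with plain tuple usage.
backend_by_name = arguments.backend
backend_by_index = arguments[1]
traits, backend, debug = arguments

# The collections variant: same behaviour, but no type annotations,
# and the defaults are given as a separate list for the rightmost fields.
LegacyArguments = namedtuple(
    "LegacyArguments", ["desired_traits", "backend", "debug"], defaults=[False]
)
legacy = LegacyArguments(["beach"], "hypothetical_backend")
print(arguments, legacy)
```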
That’s a wrap on Python! Happy long weekend, everyone 🥳