perf: Optimize searching tags with DB indexes#1129
perf: Optimize searching tags with DB indexes#1129TheBobBobs wants to merge 8 commits intoTagStudioDev:mainfrom
Conversation
01bb83b to
30d9403
Compare
| if limit > 0 and not name: | ||
| query = query.limit(limit).order_by(func.lower(Tag.name)) |
There was a problem hiding this comment.
This causes the sorting to happen after truncating the results, which differs from the behaviour when searching, no?
There was a problem hiding this comment.
Yes this was not sorting by len(text) before truncating. The previous implementation only sorted priority results by len so I've updated sort_key to do that. Which will make this order_by statement and sort_key produce the same results when no query is provided.
| tags.sort(key=lambda t: sort_key(t[1])) | ||
| seen_ids = set() | ||
| tag_ids = [] | ||
| for row in tags: | ||
| id = row[0] | ||
| if id in seen_ids: | ||
| continue | ||
| tag_ids.append(id) | ||
| seen_ids.add(id) |
There was a problem hiding this comment.
Couldn't this be written as the following?
tags = dict(tags)
tag_ids = sorted(tags.keys(), key=lambda t: sort_key(tags[t]))
del tags # not sure if this is makes a diff, but `tags` could become quite large and triggering gc on it sooner can't hurtthis is both simpler code wise and should be faster by only sorting the deduplicated list
There was a problem hiding this comment.
This is so it will use the order from Tag.name or TagAlias.name depending on which comes first for each tag.
There was a problem hiding this comment.
in that case the following should work since dict.keys() maintains insertion order and dict deduplicates by key:
tags.sort(key=lambda t: sort_key(t[1]))
tag_ids = dict(tags).keys() # get the deduplicated list of idsThere was a problem hiding this comment.
other than that I think this is good to merge
src/tagstudio/qt/mixed/tag_search.py
Outdated
| direct_tags, ancestor_tags = self.lib.search_tags(name=query, limit=tag_limit) | ||
|
|
||
| if query and query.strip(): | ||
| for tag in raw_results: | ||
| if tag.name.lower().startswith(query_lower): | ||
| priority_results.add(tag) | ||
| all_results = [t for t in direct_tags if t.id not in self.exclude] | ||
| for tag in ancestor_tags: | ||
| if tag.id not in self.exclude: | ||
| all_results.append(tag) |
There was a problem hiding this comment.
This code previously handled self.exclude being None, is there a reason you removed that?
There was a problem hiding this comment.
Its type is list[int] and I couldn't find any code that could cause it to be None.
There was a problem hiding this comment.
good reason ^^, I missed that
src/tagstudio/qt/mixed/tag_search.py
Outdated
| all_results = [t for t in direct_tags if t.id not in self.exclude] | ||
| for tag in ancestor_tags: | ||
| if tag.id not in self.exclude: | ||
| all_results.append(tag) |
There was a problem hiding this comment.
| all_results = [t for t in direct_tags if t.id not in self.exclude] | |
| for tag in ancestor_tags: | |
| if tag.id not in self.exclude: | |
| all_results.append(tag) | |
| all_results = [t for t in direct_tags if t.id not in self.exclude] | |
| all_results += [t for t in ancestor_tags if t.id not in self.exclude] |
There was a problem hiding this comment.
Ended up doing this to avoid creating extra lists.
all_results.extend(t for t in ancestor_tags if t.id not in self.exclude)
|
Sorry for the delay on the review had to work on my thesis ^^ |
|
Sorry for my own inactivity on this, I haven't pulled this so far since I wanted to do some rigorous testing to ensure that no issues slipped though with query performance. I'll get back to this in the near future and let you know if I find anything quirky on my end |
Summary
Create indexes on columns commonly used in queries.
Tag.name, Tag.shorthandfinding tags by nameTagParent.child_idfetching tag hierarchiesTagEntry.entry_idfetching entries with their tagsChange query in
Library.search_tagsso it will use above indexes.Sort results before applying limit to prevent truncating tags that should be prioritized.
Tasks Completed