Implement BaseDateRange and XML/HTML cleaning utilities #1264

robertatakenaka · 2026-02-01T18:09:42Z

PR description

🟢 What was done?

Models:
Added the abstract class BaseDateRange to handle dates stored as strings (YYYY-MM-DD, YYYY-MM), following the SciELO standard.
Added the initial_date_isoformat and final_date_isoformat properties to the BaseHistory class to ease the transition.
Marked the BaseHistory class as legacy (deprecated).
Utils:
Implemented the remove_html_tags function to extract clean text from strings containing tags.
Implemented the clean_xml_tag_content function to standardize content coming from XML.
Added the has_only_alpha_or_space validation to ensure the integrity of alphabetic strings.
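
The description does not show how the ISO properties are implemented. As a rough illustration only, assuming the legacy model keeps real date objects and the properties simply expose them as ISO 8601 strings, the idea might look like the sketch below. LegacyHistory is a hypothetical stand-in written as a dataclass, not the project's Django model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LegacyHistory:
    """Hypothetical stand-in for BaseHistory (assumed to hold date objects)."""

    initial_date: Optional[date] = None
    final_date: Optional[date] = None

    @property
    def initial_date_isoformat(self) -> Optional[str]:
        # Assumption: the property returns YYYY-MM-DD, or None when unset.
        return self.initial_date.isoformat() if self.initial_date else None

    @property
    def final_date_isoformat(self) -> Optional[str]:
        return self.final_date.isoformat() if self.final_date else None


record = LegacyHistory(initial_date=date(2024, 5, 17))
print(record.initial_date_isoformat, record.final_date_isoformat)
```

Under this assumption, code that consumes string dates could read the same attribute shape from both the legacy and the new model during the transition.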

🎯 Motivation

Migrating the date range from DateField to CharField is needed to support the flexible precision (year only, or year and month) required by indexing standards, while the ISO properties keep compatibility with existing data.

🧪 How to test?

Check that the new BaseDateRange model accepts partial date strings.
Run the cleaning utilities on strings containing HTML (e.g. <p>Texto</p>) and confirm that only the text content is returned.
Check that clean_xml_tag_content returns None when it finds digits and assert_string=True.

Copilot

Pull request overview

This PR introduces a new abstract model for date ranges with flexible precision (SciELO standard) and adds utilities for cleaning and standardizing text coming from XML/HTML, to ease the migration from DateField to CharField while keeping compatibility through ISO properties.

Changes:

Adds BaseDateRange (dates as strings) and ISO properties on BaseHistory (marked as legacy).
Implements the cleaning utilities remove_html_tags and clean_xml_tag_content, plus the has_only_alpha_or_space validation.
Keeps the existing standardization utilities and extends support to content with tags.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
core/utils/standardizer.py	Adds helpers to extract text from HTML/XML and to validate alphabetic content before standardizing.
core/models.py	Introduces `BaseDateRange` for dates stored as strings and adds ISO properties to `BaseHistory` for a gradual transition.

Copilot · 2026-02-01T18:13:57Z

core/models.py

+class BaseDateRange(models.Model):
+    # Used to replace BaseHistory, which will be DEPRECATED
+    # Dates in YYYY-MM-DD, YYYY-MM or YYYY format, as adopted by SciELO
+    initial_date = models.CharField(_("Initial Date"), max_length=10, null=True, blank=True)
+    final_date = models.CharField(_("Final Date"), max_length=10, null=True, blank=True)


BaseDateRange stores dates as free-form CharFields but (per the comment/PR description) the accepted formats are constrained to YYYY, YYYY-MM, or YYYY-MM-DD. Without a validator (e.g., a RegexValidator) or normalization, invalid values can be persisted and later break consumers that assume SciELO-compliant date strings.
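
One way to picture the validation this comment asks for is the framework-independent check below. It is a minimal sketch, not the project's code: the name is_valid_partial_date and the pattern are assumptions made for illustration. In a Django model, a pattern like this could be wrapped in django.core.validators.RegexValidator and attached to both fields.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM or YYYY-MM-DD with ASCII digits only.
_PARTIAL_DATE_RE = re.compile(
    r"[0-9]{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?"
)


def is_valid_partial_date(value):
    """Hypothetical helper: True for a well-formed partial or full date string."""
    if not isinstance(value, str) or not _PARTIAL_DATE_RE.fullmatch(value):
        return False
    if len(value) == 10:
        # The regex alone would accept 2024-02-30, so check the calendar too.
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
    return True


for sample in ["2024", "2024-05", "2024-05-17", "2024-13", "2024-5", "2024-02-30"]:
    print(sample, is_valid_partial_date(sample))
```

The calendar check only applies to full dates, since a year or a year and month cannot be parsed into a date object on their own.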

Copilot · 2026-02-01T18:13:57Z

core/utils/standardizer.py

+    text = text.replace("<", "BREAKTAG<")
+    text = text.replace(">", ">BREAKTAG")
+    for part in text.split("BREAKTAG"):
+        if part.startswith("<") and part.endswith(">"):


remove_html_tags uses the literal sentinel string "BREAKTAG" to split the input. If the source text contains "BREAKTAG" naturally, it will be removed/altered by the split+join, producing incorrect output. Consider replacing this approach with a proper HTML/XML text extractor (e.g., Django's strip_tags or html.parser) or, at minimum, use a collision-safe strategy that cannot appear in user content.
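
To make the collision concrete: with the sentinel approach, an input such as "a BREAKTAG b" comes back as "a  b". The sketch below shows one parser-based alternative using html.parser from the standard library, which the comment mentions as an option. The names _TextCollector and extract_text are hypothetical, and this variant returns a string rather than a generator, so it is not a drop-in replacement for the function in the diff.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Hypothetical helper: return the text content of markup as one string."""
    if not markup:
        return ""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flushes any text still buffered by the parser
    return "".join(collector.chunks)


print(extract_text("<p>Texto</p>"))
print(extract_text("a BREAKTAG b"))
```

One behavioral difference to keep in mind is that the parser also decodes character references such as &amp;, which the string-splitting version leaves untouched.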

Copilot · 2026-02-01T18:13:58Z

core/utils/standardizer.py

+def remove_html_tags(text):
+    if not text:
+        return text


remove_html_tags returns text directly for falsy inputs (e.g. None), but yields parts (generator) for non-empty inputs. This inconsistent return type makes the function error-prone for callers. It should consistently return a string (recommended, given the name) or consistently be a generator (and just return early for falsy input).
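
A detail worth adding to this comment: because the function body contains yield, Python treats it as a generator function, so even the early return for falsy input hands the caller a generator (an empty one) rather than the original value. The sketch below copies the function from the diff to show this and adds a hypothetical wrapper, remove_html_tags_as_str, that always returns a string. Returning "" for None is a choice made for the sketch, not something the PR specifies.

```python
import types


def remove_html_tags(text):
    # Copied from the diff under review.
    if not text:
        return text
    text = text.replace("<", "BREAKTAG<")
    text = text.replace(">", ">BREAKTAG")
    for part in text.split("BREAKTAG"):
        if part.startswith("<") and part.endswith(">"):
            continue
        if part.startswith("<"):
            continue
        if part.endswith(">"):
            continue
        yield part


def remove_html_tags_as_str(text):
    """Hypothetical wrapper with a single, predictable return type."""
    if not text:
        return ""
    return "".join(remove_html_tags(text))


result = remove_html_tags(None)
print(isinstance(result, types.GeneratorType), list(result))
print(remove_html_tags_as_str("<p>Texto</p>"))
```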

Copilot · 2026-02-01T18:13:58Z

core/utils/standardizer.py

+    """ Checks whether the text content is valid as a string, that is,
+    it is not empty and contains no numbers. """


The docstring for has_only_alpha_or_space says the function only rejects empty strings and numbers, but the implementation rejects any token that is not purely alphabetic (e.g., punctuation, hyphens). Please align the docstring with the actual behavior (or adjust the logic if the intent is only to block digits).

Suggested change

-    """ Checks whether the text content is valid as a string, that is,
-    it is not empty and contains no numbers. """
+    """Checks whether the text content is valid as a string: the text is not
+    empty and each space-separated part contains only letters
+    (alphabetic characters), with no digits, punctuation or other symbols."""
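
The gap between the docstring and the behavior is easy to see by running the function from the diff on a few inputs. The function body below is copied from the PR; the sample strings are made up for illustration.

```python
def has_only_alpha_or_space(text):
    # Copied from the diff under review (docstring omitted).
    if not text:
        return False
    parts = text.split()
    for part in parts:
        if not part.isalpha():
            return False
    return True


samples = ["São Paulo", "Jean-Luc Picard", "Dr. Silva", "Volume 12", "", "   "]
for sample in samples:
    print(repr(sample), has_only_alpha_or_space(sample))
```

Hyphenated and abbreviated names are rejected along with digits, while accented letters pass because str.isalpha accepts any Unicode letter. A whitespace-only string returns True, since split() produces no parts to check.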

Copilot · 2026-02-01T18:13:58Z

core/utils/standardizer.py

+def remove_html_tags(text):
+    if not text:
+        return text
+    text = text.replace("<", "BREAKTAG<")
+    text = text.replace(">", ">BREAKTAG")
+    for part in text.split("BREAKTAG"):
+        if part.startswith("<") and part.endswith(">"):
+            continue
+        if part.startswith("<"):
+            continue
+        if part.endswith(">"):
+            continue
+        yield part
+
+
+def has_only_alpha_or_space(text):
+    """ Checks whether the text content is valid as a string, that is,
+    it is not empty and contains no numbers. """
+    if not text:
+        return False
+    parts = text.split()
+    for part in parts:
+        if not part.isalpha():
+            return False
+    return True
+
+
+def clean_xml_tag_content(text, assert_string=True):
+    if not text:
+        return text
+    text = "".join(remove_html_tags(text))
+    text_ = remove_extra_spaces(text)
+    if assert_string:
+        if has_only_alpha_or_space(text_):
+            return text_
+        else:
+            return None
+    return text_


New behaviors were added in this module (remove_html_tags, has_only_alpha_or_space, clean_xml_tag_content), but the existing core/tests_standardizer.py only covers the older helpers. Please add unit tests for the new functions (including cases with nested tags, adjacent tags, empty/None inputs, and assert_string=True/False).
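
The tests requested here could start from something like the sketch below. To keep it self-contained, the three functions are copied from the diff instead of being imported from core.utils.standardizer, and remove_extra_spaces, whose body is not shown in this PR, is replaced by a stand-in that assumes it collapses runs of whitespace. The test class name is hypothetical.

```python
import unittest


def remove_extra_spaces(text):
    # Stand-in: the real helper is not shown in the PR; assumed behavior.
    return " ".join(text.split())


def remove_html_tags(text):
    if not text:
        return text
    text = text.replace("<", "BREAKTAG<")
    text = text.replace(">", ">BREAKTAG")
    for part in text.split("BREAKTAG"):
        if part.startswith("<") or part.endswith(">"):
            continue  # same effect as the three checks in the diff
        yield part


def has_only_alpha_or_space(text):
    if not text:
        return False
    return all(part.isalpha() for part in text.split())


def clean_xml_tag_content(text, assert_string=True):
    if not text:
        return text
    text_ = remove_extra_spaces("".join(remove_html_tags(text)))
    if assert_string:
        return text_ if has_only_alpha_or_space(text_) else None
    return text_


class CleanXmlTagContentTests(unittest.TestCase):
    def test_plain_tag(self):
        self.assertEqual(clean_xml_tag_content("<p>Texto</p>"), "Texto")

    def test_nested_tags(self):
        self.assertEqual(
            clean_xml_tag_content("<p><b>Texto</b> limpo</p>"), "Texto limpo"
        )

    def test_adjacent_tags(self):
        self.assertEqual(clean_xml_tag_content("<i>a</i><b>b</b>"), "ab")

    def test_empty_and_none_inputs(self):
        self.assertIsNone(clean_xml_tag_content(None))
        self.assertEqual(clean_xml_tag_content(""), "")

    def test_tags_without_text(self):
        self.assertIsNone(clean_xml_tag_content("<p></p>"))

    def test_digits_rejected_when_assert_string(self):
        self.assertIsNone(clean_xml_tag_content("<p>Volume 12</p>"))

    def test_digits_kept_without_assert_string(self):
        self.assertEqual(
            clean_xml_tag_content("<p>Volume 12</p>", assert_string=False),
            "Volume 12",
        )
```

Saved as a test module, this can be run with python -m unittest. In the repository the copies would be dropped in favor of importing the real functions, and the expectations that depend on remove_extra_spaces would need to be checked against its actual behavior.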

robertatakenaka added 2 commits February 1, 2026 15:07

feat(models): add BaseDateRange model and isoformat properties to Bas…

fea303d

…eHistory

feat(utils): add XML content cleaning and HTML tag removal utilities

c56dee4

Copilot AI review requested due to automatic review settings February 1, 2026 18:09

Copilot started reviewing on behalf of robertatakenaka February 1, 2026 18:09

Copilot AI reviewed Feb 1, 2026

robertatakenaka merged commit 815614e into scieloorg:main Feb 1, 2026
9 of 11 checks passed
