如何在PostgreSQL中存储文本_PostgreSQL

译者 | 赵青窕

审校 | 孙淑娟

基于JPA实体定义的DDL生成器是许多开发人员的日常任务。在大多数情况下，我们使用Hibernate内置生成器或JPA Buddy插件等工具，这些工具会简化我们的工作，但也有例外，当涉及到在数据库中存储大量数据时，情况会变得有点复杂。

用例：存储文档

假设需要在PostgreSQL数据库中存储一个非空的文档对象。JPA实体代码如下所示:

Java：
@Entity  
@Table(name = "document")  
public class Document {  
   @Id  
   @GeneratedValue(strategy = GenerationType.IDENTITY)  
   @Column(name = "id", nullable = false)  
   private Long id;  
 
   @Column(name = "date_created", nullable = false)  
   private LocalDateTime dateCreated;  
 
   @Column(name = "doc_txt")  
   private String docText;  
   
   //Getters and setters omitted for brevity  
}

问题是：如果我们需要存储非常长的文档文本怎么办?在Java中，字符串数据类型可以保存大约2Gb的文本数据，但是对于上面的模型，表的列(table column)大小将被限制为255个字符。那么，我们应该改变什么呢?

方法1：使用LOB存储

在关系数据库中，存在一种用于存储大量数据的特定数据类型:LOB(Large OBject)。一旦需要在数据库中存储大型文本，我们就可以开始定义LOB列。我们需要做的就是用@Lob注释标记docText属性。

Java：
@Lob   
@Column(name = "doc_txt")   
private String docText;

让我们使用Hibernate为表生成DDL，以映射“Document”实体。SQL将是:

SQL：
create table document ( 
    id int8 generated by default as identity, 
    date_created timestamp not null, 
    doc_txt oid, 
    primary key (id) 
);

可以看到，doc_text列的数据类型是oid。它是什么?在文档中有如下说明:

PostgreSQL提供了两种不同的方式来存储二进制数据。二进制数据可以使用数据类型BYTEA或使用Large Object特性存储在表中，该特性以特殊格式将二进制数据存储在单独的表中，并通过在表中存储OID类型的值来引用该表。

在我们的例子中，第二种方式是有效的。这个单独的表名为pg_largeobject，它存储分成“页”的数据，通常每个页2kb，如文档中所述。

因此，Hibernate将大型文本作为二进制数据存储在单独的表中。这是否意味着我们应该在选择数据时进行额外的连接或在保存数据时进行额外的插入操作?让我们启用SQL日志记录，创建Document实体并使用Spring Data JPA将其保存到数据库中。

Java：
Document doc = new Document();   
doc.setDateCreated(LocalDateTime.of(2020, 1, 1, 10, 10));   
doc.setDocText("This is the doc text");   
Document saved = documentRepository.save(doc);

Hibernate会在控制台中显示一个普通的SQL插入:

SQL：
insert  
into 
    document 
    (date_created, doc_txt)  
values 
    (?, ?)

现在，我们可以通过在控制台中执行以下SQL语句来检查数据是否被正确存储:

SQL:
select * from document

我们将看到上述命令的结果应该与下表类似:

id	data_created	doc_txt
1	2020-01-01 10:10:00	76388

我们在这个表中看不到文档文本，只是对大对象存储中的对象的一个引用。让我们检查pg_largeobject表:

SQL
select * from pg_largeobject where loid=76338

此时，就可以看到文档文本了。

loid	pageno	data
76388	0	This is the doc text

因此，Hibernate在幕后自动将数据保存到两个表中。现在，我们可以尝试使用Spring data JPA获取文档数据:

Java
documentRepository.findById(1L).ifPresent(d -> System.out.println(d.getDocText()));

我们可以在控制台中看到以下SQL语句:

SQL
select   
   document0_.id as id1_0_0_,   
   document0_.date_created as date_cre2_0_0_,   
   document0_.doc_txt as doc_txt3_0_0_    
from   
   document document0_    
where   
   document0_.id=?

输出应符合下面的预期：

Plain Text
This is the doc text

Hibernate从pg_largeobject表中选择数据。让我们尝试使用JPQL来执行相同的查询。为此，我们创建了一个附加的Spring Data JPA存储库方法并调用它:

Java
//repository   
@Query("select d from Document d where d.id = ?1")   
Optional<Document> findByIdIs(Long id); 
//... 
//invocation 
documentRepository.findByIdIs(1L).ifPresent(d -> System.out.println(d.getDocText()));

这种方式将无法完成我们的预期工作：

Plain text
org.springframework.orm.jpa.JpaSystemException: Unable to access lob stream   
…   
Caused by: org.hibernate.HibernateException: Unable to access lob stream   
…   
Caused by: org.postgresql.util.PSQLException: Large Objects may not be used in auto-commit mode.

Hibernate执行额外的数据库读取来获取LOB数据。在auto-commit模式下，此读取将在单独的事务中执行。PostgreSQL驱动程序显式地禁止它，如上面的错误消息所示。为了解决这个问题，我们需要在一个事务中执行这样的查询，或者禁用auto-commit模式。

“CrudRepository”中的Spring Data JPA方法，如findById()和findAll()，默认情况下在一个事务中执行。这就是为什么在第一个例子中一切正常的原因。当我们使用Spring Data JPA查询方法或JPQL查询时，我们必须像下面的示例那样显式地使用@Transactional。

Java
@Transactional 
@Query("select d from Document d where d.id = ?1")   
Optional<Document> findByIdIs(Long id); 

@Transactional 
List<Document> findByDateCreatedIsBefore(LocalDateTime dateCreated);

但是禁用auto-commit模式似乎比使用@Transactional注释更可取。例如，为了在Spring Boot中实现默认的连接池(HikariCP)，我们需要设置spring.datasource.hikari.auto-commit属性为false。

将文本存储在单独的表中可能会导致其他问题。让我们添加一个存储库方法，使用LIKE子句为docText字段选择文档:

Java
@Transactional   
List<Document> findByDocTextLike(String text);

该方法将生成以下查询:

SQL
select   
   document0_.id as id1_0_,   
   document0_.date_created as date_cre2_0_,   
   document0_.doc_txt as doc_txt3_0_    
from   
   document document0_    
where   
   document0_.doc_txt like ? escape ?

这个查询将会失败，错误如下:

Plain Text
org.springframework.dao.InvalidDataAccessResourceUsageException: could not extract ResultSet; SQL [n/a];    
…   
Caused by: org.postgresql.util.PSQLException: ERROR: function pg_catalog.like_escape(bigint, character varying) does not exist   
Hint: No function matches the given name and argument types. You might need to add explicit type casts.

Hibernate无法生成正确的SQL来处理LOB文本列的LIKE子句。对于这种情况，我们可以使用nativequery。在这个查询中，我们必须从LOB存储中获取文本数据，并将其转换为字符串格式。之后，我们可以在LIKE子句中使用它(不要忘记' @Transactional '):

Java
@Query(value = "select * from document d " +   
       "where convert_from(lo_get(doc_txt), 'UTF-8') like '%'||?1||'%'", nativeQuery = true)   
@Transactional   
List<Document> findByDocTextLike(String text);

现在一切都工作正常。请记住，nativequery可能与其他RDBMS不兼容，并且不会在运行时进行验证。只有在绝对必要的时候才使用。

总结：采用LOB来存储

在PostgreSQL中将大型文本存储为LOB对象的优点和缺点是什么呢?

优点:

PostgreSQL为LOB对象使用了优化的存储方式；
可以存储多达4Gb的文本。

缺点:

WHERE子句中的一些函数(LIKE、SUBSTRING等)在Hibernate中不能用于LOB文本列。对此，我们需要使用nativeQuery；
要使用JPQL或Spring Data JPA存储库查询方法获取文本，我们必须对存储库方法使用@Transactional注释或禁用auto-commit模式。

这里有一个问题：为什么不直接将文本数据存储在表中？我们也来讨论一下这种方式。

方式2：Column Re-Definition

PostgreSQL允许我们将长文本数据存储在特定数据类型(TEXT)的列中。我们可以在注释中指定列定义。

Java
@Column(name = "doc_txt", columnDefinition = "text")   
private String docText;

这使我们能够以“通常”的方式处理长文本。没有事务，native query和JPQL按照预期工作。与LOB类型相比，有一个限制是可存储的最长字符串约为1GB。

当长度小于4GB时，可以使用LOB，但对于大多数用例来说已经足够长了。

这里唯一的问题是硬编码的列定义(hardcoded column definition)。为了克服这个问题，我们可以在Hibernate 5中使用注释@Type和转换器org.hibernate.type.TextType。它比前面的列定义有一个优点：它不是特定于供应商的(vendor-specific)。

Java
@Type(type = "org.hibernate.type.TextType")   
@Column(name = "doc_txt")   
private String docText;

在Hibernate 6中，org.hibernate.type.TextType类被删除了。为了定义存储长文本的列，我们可以这样定义属性:

Java
@Column(name = "doc_txt", length = Length.LOB_DEFAULT)   
private String docText;

这将在数据库中给出以下列定义:doc_txt varchar(1048576)。它不是TEXT类型，但仍然可以在表中存储大约1Gb的文本。它是PostgreSQL中最大的字符串。

在Hibernate 6中，我们可以按照下面的方式定义docText属性来生成一个包含TEXT数据类型的列:

Java
@JdbcTypeCode(SqlTypes.LONG32VARCHAR)   
@Column(name = "doc_txt")   
private String docText;

不幸的是，自2022年6月开始，Hibernate 6无法从表中获取数据。它生成TEXT类型的正确表和列定义。将数据从doc_txt列提取到实体属性将会失败。错误文本如下所示:

Plain Text
Unknown wrap conversion requested: [B to java.lang.String : `org.hibernate.type.descriptor.java.StringJavaType` (java.lang.String)

因此，在text /VARCHAR列中存储长文本带来的问题较少。事务、LIKE条件等没有问题。唯一的缺点就是存储大小最多是1Gb。还有其他注意事项吗?

如果我们在数据库中使用TEXT列类型和@Lob注释，可能会出现问题。让我们看看它是如何工作的。首先，让我们创建一个表文档，并向其中插入一些数据:

SQL
create table document ( 
    id int8 generated by default as identity, 
    date_created timestamp not null, 
    doc_txt text, 
    primary key (id) 
);   
  
insert into document (id, date_created, doc_txt) values (1, '2021-10-10', 'This is the document text number 1');

我们将使用带有@Lob列的文档实体定义:

Java
@Entity   
@Table(name = "document")   
public class Document {   
   @Id   
   @GeneratedValue(strategy = GenerationType.IDENTITY)   
   @Column(name = "id", nullable = false)   
   private Long id;   
  
   @Column(name = "date_created", nullable = false)   
   private LocalDateTime dateCreated;   
  
   @Lob   
   @Column(name = "doc_txt")   
   private String docText;   
  
   //Getters and setters omitted for brevity   
}

文档获取的代码将是相同的:

Java
documentRepository.findById(1L).ifPresent(d -> System.out.println(d.getDocText()));

如果我们尝试执行这个repository方法，将会看到下面的结果:

Plain Text
java.lang.IllegalStateException: Failed to execute Application   
…  
Caused by: org.hibernate.exception.DataException: could not execute query   
…   
Caused by: org.postgresql.util.PSQLException: Bad value for type long: This is the document text number 1   
…

我们可以看到，Hibernate处理@Lob属性值作为对LOB对象数据的引用。数据库表中的TEXT列类型不影响此行为。

那保存数据呢？让我们先清空表，尝试保存带有@Lob字段的文档实体，并使用Spring Data JPA获取它。下面是对应的代码：

Java
//Saving 
Document doc = new Document();   
doc.setDateCreated(LocalDateTime.now());   
doc.setDocText("This is another text document");   
documentRepository.save(doc);  
... 
//Fetching 
documentRepository.findAll().forEach(d -> System.out.println(d.getDocText));   
... 
//Result 
This is another text document

因此，看起来带有@Lob属性的实体似乎可以处理TEXT列。在数据库表中，我们会看到熟悉的画面:

id	data_created	doc_txt
1	2022-06-16 15:28:26.751041	76388

loid	pageno	data
76388	0	This is another text document

如果我们使用SQL将文档数据插入到表中，然后选择数据，我们将得到以下结果:

SQL
insert into document (date_created, doc_txt) values ('2021-10-10', 'This is the document text'); 

select * from document;

id	data_created	doc_txt
1	2022-06-16 15:28:26.751041	76388
2	2021-10-10 00:00:00	This is the document text

现在我们不能使用Spring data JPA从数据库中选择数据。当选择第二行时，应用程序将因类型转换错误而崩溃。

让我们将@Type注释添加到属性中…

Java
@Lob   
@Type(type = "org.hibernate.type.TextType")   
@Column(name = "doc_txt")   
private String docText;

并尝试将文档的文本数据打印到应用程序控制台。

Java
documentRepository.findAll().forEach(d -> System.out.println(d.getDocText));

我们将会看到下面的输出信息：

Hibernate: select document0_.id as id1_0_, document0_.date_created as date_cre2_0_, document0_.text as text3_0_ from document document0_   
  
76388   
This is the document text