Batch inserting with JPA using non-auto-generated ids



I am trying to batch insert a few million entities. Batch inserting does work, but my program executes some extra JDBC statements in the background that I don't want.

List<IceCream> iceList = new ArrayList<IceCream>();
for (CSVRecord record : records) {
    if (flushCounter > 40000) {
        iceCreamRepository.saveAll(iceList);
        iceList = new ArrayList<IceCream>();
        flushCounter = 0;
    }
    flushCounter++;
    IceCream iceCream = new IceCream();
    int id = getIdFromCSV();
    iceCream.setId(id);
    iceCream.set...
    ...
    iceList.add(iceCream);
}

My repository:

public interface IceCreamRepository extends JpaRepository<IceCream, Long>
{
}

My entity:

@Entity
@Table(name="IceCream")
public class IceCream
{
    private static final long serialVersionUID = 1L;

    @OneToMany(targetEntity=entity.OtherEntity.class, mappedBy="IceCream")
    private Set<OtherEntity> otherEntitys = new HashSet<OtherEntity>();

    @Id
    private int id;

    @Basic
    @Column(name="import_tstamp")
    @Temporal(TemporalType.TIMESTAMP)
    private Date importTstamp;

    @Basic
    @Column(name="import_type", length=2147483647)
    private String importType;

    @Basic
    @Column(length=2147483647)
    private String text;

    ...
}

My JPA settings:

spring.jpa.properties.hibernate.batch_versioned_data: true
spring.jpa.properties.hibernate.order_updates: true
spring.jpa.properties.hibernate.order_inserts: true
spring.jpa.properties.hibernate.generate_statistics: true
spring.jpa.properties.hibernate.jdbc.format_sql: true
spring.jpa.properties.hibernate.jdbc.batch_size: 1000

The batch insert itself does work, but if I try to upload 100 entities, I get 33 JDBC statements that check the ids.

This is the output for 33 entities:

2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(2)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(3)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(4)]
2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(5)]

My program is trying to load the entities, even though I haven't inserted them yet, and I don't know why. It does this for 32 of the ids, i.e. for every id except the first one (0). After that output, one batch insert for all 33 entities is executed…

2020-03-25 09:25:50.334 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:11, Success:True, Type:Prepared, Batch:True, QuerySize:1, BatchSize:33, Query:["insert into iceCream(import_tstamp, import_type, text, id) values (?, ?, ?, ?)"], Params:[(2020-03-25 09:25:50.127,ice,teext,0),(2020-03-25 09:25:50.127,ice,teext,1),(2020-03-25 09:25:50.127,ice,teext,2)...]

After that I get this summary:

2020-03-25 09:25:50.359 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
2222222 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
21234400 nanoseconds spent preparing 33 JDBC statements;
40600005 nanoseconds spent executing 32 JDBC statements;
27859771 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
100978099 nanoseconds spent executing 1 flushes (flushing a total of 34 entities and 33 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

If I use only one entity, the output is:

2020-03-25 11:17:40.119 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
1375995 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
12024409 nanoseconds spent preparing 1 JDBC statements;
0 nanoseconds spent executing 0 JDBC statements;
5597005 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
38446070 nanoseconds spent executing 1 flushes (flushing a total of 1 entities and 1 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

For 2 entities it shows the following (my ids start at 0, so it only executes the JDBC select for the second entity):

2020-03-25 09:25:50.172 [scheduling-1] INFO  net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_,  ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 11:25:00.180 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
1446363 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
13101435 nanoseconds spent preparing 2 JDBC statements;
11427142 nanoseconds spent executing 1 JDBC statements;
3762785 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
22309803 nanoseconds spent executing 1 flushes (flushing a total of 2 entities and 2 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

The output for 3 is:

2020-03-25 11:47:00.277 [scheduling-1] INFO  org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
1010843 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
31706133 nanoseconds spent preparing 3 JDBC statements;
57180996 nanoseconds spent executing 2 JDBC statements;
3839505 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
23923340 nanoseconds spent executing 1 flushes (flushing a total of 3 entities and 3 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

So I have two questions:

  1. Why do I get all these JDBC statements when all I want is a single batch insert? (And how can I fix this?)

  2. I tried this with a few million entities, but I don't see any updates in the database until the program finishes. I do call iceCreamRepository.saveAll(iceList); every 40,000 rows, and I thought that would write the entities to the database. My RAM is not huge: I have a 10 GB data file and only 2 GB of RAM. If the program waits until the end to write all the data, why don't I run out of RAM?

The answer is going to be a bit complicated, but bear with me.

I do call iceCreamRepository.saveAll(iceList)

Based on the above, I assume you are using Spring Data with JPA.

Why do I get all these JDBC statements when all I want is a single batch insert? (And how can I fix this?)

The implementation of JpaRepository.saveAll() calls save() on every entity in the list, and save() is implemented as follows:

if (entityInformation.isNew(entity)) {
    em.persist(entity);
    return entity;
} else {
    return em.merge(entity);
}

The default implementation of EntityInformation "considers an entity to be new whenever EntityInformation.getId(Object) returns null". Since you set the ids yourself, your entities fall into the second branch of the if ... else ... statement.

In effect, Spring Data tells JPA to merge each entity with its existing version in the DB. JPA therefore needs to load that existing version first, and that is why you see all the extra select queries.

To fix this, do either of the following:

  • Make your entity implement Persistable and return true from isNew() (note that this may affect persistence logic elsewhere; see this link for details)
  • OR inject and interact with the EntityManager directly, calling persist() instead of merge()
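As a rough sketch of the first option: the entity implements Persistable, and a transient flag backs isNew(). The flag handling shown here (defaulting to true, flipped in lifecycle callbacks) is one common convention, not the only correct one.

```java
import org.springframework.data.domain.Persistable;

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.PostLoad;
import javax.persistence.PostPersist;
import javax.persistence.Table;
import javax.persistence.Transient;

@Entity
@Table(name = "IceCream")
public class IceCream implements Persistable<Integer> {

    @Id
    private int id;

    // Not a column; only tells Spring Data whether to persist() or merge().
    @Transient
    private boolean isNew = true;

    @Override
    public Integer getId() {
        return id;
    }

    // Returning true routes save() to em.persist() instead of em.merge(),
    // so Hibernate skips the select that looks for an existing row.
    @Override
    public boolean isNew() {
        return isNew;
    }

    // After the entity has been loaded or inserted it is no longer "new",
    // so later saves of the same instance correctly go through merge().
    @PostLoad
    @PostPersist
    void markNotNew() {
        this.isNew = false;
    }
}
```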

I tried this with a few million entities, but I don't see any updates in the database until the program finishes

For the queries to actually be executed, you need to call EntityManager.flush() after every batch (if you choose not to interact with the EntityManager directly, use JpaRepository.flush() instead).
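A sketch of the second option combined with periodic flushing. The chunk size of 1000 mirrors the hibernate.jdbc.batch_size setting above; getIdFromCSV and the field setters are placeholders from the question:

```java
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.transaction.annotation.Transactional;

public class IceCreamImporter {

    @PersistenceContext
    private EntityManager em;

    @Transactional
    public void importRecords(Iterable<CSVRecord> records) {
        int count = 0;
        for (CSVRecord record : records) {
            IceCream iceCream = new IceCream();
            iceCream.setId(getIdFromCSV(record)); // placeholder, as in the question
            // ... set the remaining fields ...
            em.persist(iceCream); // no select: persist() never looks for an existing row

            if (++count % 1000 == 0) {
                em.flush(); // push the batched inserts to the database now
                em.clear(); // detach the entities so the persistence context stays small
            }
        }
        em.flush(); // write the final partial chunk
    }
}
```

The em.clear() call is also what keeps memory bounded: without it, every persisted entity stays managed in the persistence context until the transaction ends.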

(As an aside: JPA has a lot of overhead for caching, conversion, etc., and is generally a poor choice for batch operations. If I were you, I would consider switching to Spring Batch with plain JDBC.)
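For comparison, a minimal plain-JDBC version using Spring's JdbcTemplate. The column order is taken from the insert statement in your log; the getters are assumed to exist on the entity, and error handling is omitted:

```java
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class IceCreamJdbcWriter {

    private final JdbcTemplate jdbcTemplate;

    public IceCreamJdbcWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Sends the whole list as JDBC batches of the given size:
    // no entity state tracking, no dirty checking, no extra selects.
    public void insertBatch(List<IceCream> batch) {
        jdbcTemplate.batchUpdate(
            "insert into iceCream (import_tstamp, import_type, text, id) values (?, ?, ?, ?)",
            batch,
            1000, // batch size, matching the Hibernate setting above
            (PreparedStatement ps, IceCream ice) -> {
                ps.setTimestamp(1, new Timestamp(ice.getImportTstamp().getTime()));
                ps.setString(2, ice.getImportType());
                ps.setString(3, ice.getText());
                ps.setInt(4, ice.getId());
            });
    }
}
```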
