I have an Apache Beam application running with the Spark runner on a YARN cluster. It reads multiple inputs, applies transforms, and produces two outputs: one in Parquet and the other in text file format.

In my transforms, one of the steps generates a UUID and assigns it to an attribute of my POJO, which gives me a PCollection. From that PCollection I apply further transforms to convert myPojo to a String and to a GenericRecord, and then use TextIO and ParquetIO to write to my storage.
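For reference, the pipeline shape is roughly the following (pseudocode; the transform and function names are placeholders, not my actual code):

```
pojos = inputs.apply(ParDo(AssignUuidFn))            // PCollection<MyPojo>, uuid set here
pojos.apply(MapElements(pojoToString))
     .apply(TextIO.write()...)                       // text output branch
pojos.apply(MapElements(pojoToGenericRecord))
     .apply(FileIO.write(ParquetIO.sink(schema))...) // parquet output branch
```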

Just now I observed a strange issue: in the output files, the UUID attribute is different between the Parquet data and the text data for the same record!

Since both outputs come from the same PCollection and are only written in different formats, I expected the data to be identical. Right?

The issue happens only with a large input volume. In my unit tests, I get the same value in both formats.

I suspect some kind of recomputation happens when the collection is written to the different IOs, but I can't confirm it. Can anyone help explain?
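To illustrate what I suspect is happening, here is a minimal stdlib-only sketch (no Beam; `RecomputeDemo` and `tagWithUuid` are hypothetical names I made up for this illustration). If the runner does not cache the intermediate collection, each sink re-evaluates the upstream lineage, so a non-deterministic step like `UUID.randomUUID()` runs once per sink and the two branches see different values:

```java
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

public class RecomputeDemo {
    // Non-deterministic "transform": tags each record with a fresh UUID,
    // the way a DoFn calling UUID.randomUUID() would.
    static List<String> tagWithUuid(List<String> records) {
        return records.stream()
                .map(r -> r + "," + UUID.randomUUID())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c");

        // Simulating two sinks that each re-run the lineage instead of
        // reading one cached result:
        List<String> textBranch = tagWithUuid(input);    // "TextIO" branch
        List<String> parquetBranch = tagWithUuid(input); // "ParquetIO" branch

        // Same base records, but the UUID part diverges between branches.
        System.out.println(textBranch.get(0).equals(parquetBranch.get(0)));
    }
}
```

In my unit tests the inputs are tiny, so this sketch only mirrors what I think happens at scale on the cluster, not something I have confirmed in the Spark runner itself.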


Anonymous Asked question May 13, 2021