Issues encountered when loading SAP OData into BigQuery with Cloud Data Fusion (CDF)



whistory 2022. 10. 14. 14:45

I want to store SAP data in BigQuery by way of OData.

I am using GCP's Data Fusion to load the data.

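Small datasets were fine, but with 10,000 rows or more,

the data came out differently every time I queried it.

The total row count is the same on every run,

but some rows are duplicated and others are missing.

Every time the job runs in Data Fusion, it fetches different data.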

(Only the data from the third run is correct.)

I asked the CDF team about it and got the following reply.

The Cloud Data Fusion (CDF) Odata plugin uses these OData Service query options internally to partition the data (data is split and extracted in parallel threads and also within the same thread is divided in packages): $count, $top, $skip The plugin cannot function without this basic functionality of the OData service.

Again, this statement is being added to the plugin User Guide and will be published soon: "Any custom OData service must have support for $top, $skip and $count in order to allow the plugin to partition the data for sequential and parallel extraction. In addition, if there is a requirement to use in plugin $filter, $expand or $select this support should also be added."

To test your OData service independently from CDF:
There are some sample OData v2 services listed here - https://www.odata.org/odata-services/ (use Postman to call them, as browsers don't work with OData).

SAP OData plugin uses $top, $skip & $count internally and if users wants to use "Advanced" sections then those query parameter support should be added accordingly.
Mandatory:
- $top
- $skip
- $count

In other words, CDF always relies on $top, $skip, and $count when it fetches and processes SAP OData, so the OData service has to support all three.

So I looked into $top, $skip, and $count.

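$count
Returns the number of records in a collection or, if the collection has a filter, the number of records that match the filter.

$skip
Specifies the number of items in the queried collection to skip in the result set.

$top
Specifies the number of items in the queried collection to include in the result. The value in the SOQL query's LIMIT clause does not always match the requested $top value, because the latter is modified as needed for client-driven paging and queryMore() calls.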

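$count returns the number of rows to be fetched, while $skip and $top select a window of those rows:

### skip 30 rows, then take the next 10
### if there are 100 rows in total, this returns rows 31 through 40
/odata?$skip=30&$top=10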
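
To make the $skip/$top window concrete, here is a minimal Python sketch that models the query as a list slice. The function name `window` and the row numbering 1..100 are assumptions for illustration only; a real OData service applies the same idea on the server side.

```python
def window(rows, skip, top):
    """Model $skip/$top: drop the first `skip` rows, then keep `top` rows."""
    return rows[skip:skip + top]

rows = list(range(1, 101))  # hypothetical row numbers 1..100
print(window(rows, 30, 10))  # [31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
```

Note that the window only means "rows 31 through 40" if the service returns rows in the same order on every call.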

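If that is how it works, I expect that CDF

uses $count to check the number of rows

and, when there is a lot of data, splits the extraction into requests like the ones below.

(They did not explain the exact logic, so this is a guess...)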

/odata?$skip=0&$top=100
/odata?$skip=100&$top=100
/odata?$skip=200&$top=100
/odata?$skip=300&$top=100
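
The guessed partitioning above could be sketched roughly as follows. This is not the plugin's actual code; `partition_requests`, the `/odata` path, and the fixed page size are assumptions used only to show how a total from $count might be turned into a series of $skip/$top requests.

```python
def partition_requests(base_url, total_rows, page_size):
    """Build one $skip/$top URL per page until total_rows is covered.

    total_rows stands in for the value a $count request would return.
    """
    return [
        f"{base_url}?$skip={skip}&$top={page_size}"
        for skip in range(0, total_rows, page_size)
    ]

for url in partition_requests("/odata", 400, 100):
    print(url)
```

Each URL is an independent request, so the pages only fit together without gaps or overlaps when every request sees the rows in the same order.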

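If the SAP OData service had simply applied the same sort order on every call, everything would have worked fine. But....

We were not in a position to modify the SAP OData service (the project deadline was close, and catching this issue so late was my mistake...),

and we could not just wait around for the customer's SAP engineers to fix it..

so we changed direction and developed the load with Cloud Functions instead.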

Conclusion:

To pull SAP OData with GCP Cloud Data Fusion,

the SAP OData service must return rows in a consistent, fully implemented sort order on every call.

Otherwise, rows will be duplicated and lost.
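
The symptom described in this post (same total row count, but duplicated and missing rows) can be reproduced with a small simulation. This is a sketch under stated assumptions: the "service" is just a Python list, and an unstable order is modeled by reshuffling the rows on every request, which is more extreme than what a real SAP service would do. All function names here are hypothetical.

```python
import random
from collections import Counter

def fetch_page(rows, skip, top, rng=None):
    """Simulate one $skip/$top request against a hypothetical service.

    If rng is given, the rows are reshuffled for this request, which
    models a service that has no stable sort order.
    """
    ordered = list(rows)
    if rng is not None:
        rng.shuffle(ordered)
    return ordered[skip:skip + top]

def extract(rows, page_size, rng=None):
    """Fetch every page in turn and concatenate the results."""
    out = []
    for skip in range(0, len(rows), page_size):
        out.extend(fetch_page(rows, skip, page_size, rng))
    return out

def summarize(result, expected):
    """Return (rows fetched, keys seen more than once, keys never seen)."""
    counts = Counter(result)
    duplicated = sum(1 for c in counts.values() if c > 1)
    missing = len(set(expected) - set(result))
    return len(result), duplicated, missing

keys = list(range(1, 1001))                        # hypothetical primary keys
stable = extract(keys, 100)                        # same order on every request
unstable = extract(keys, 100, random.Random(0))    # order changes per request

print(summarize(stable, keys))    # (1000, 0, 0)
print(summarize(unstable, keys))  # still 1000 rows, but with duplicates and gaps
```

With a stable order the pages tile the dataset exactly. With an unstable order each page is still full, so the total count matches, but the pages overlap in some keys and skip others.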