12/16/2023

Es deduplicator

Recently I have been working on real-time jobs, mainly involving data processing and visualization. While working through the pitfalls I have summarized a relatively simple real-time computation scheme for your reference.

[Figure: overall architecture of the scheme]

It mainly involves a few components: Kafka, Flink, Redis, Druid and ES. I believe everyone is familiar with these components, so I won't go into detail here.

Let's start from a simple requirement to illustrate how the various components work together. Suppose we have an e-commerce platform with huge daily traffic, most of it concentrated on the clothing and home-appliance pages, and we want to see the traffic trends of these two page types in real time (statistics over ten minutes) as an important indicator of the platform. The visualized data looks like this:

[Figure: traffic trend visualization]

Data collection

To calculate the number of visits, the precondition is to collect data. Assume that every time a user visits the platform we can obtain: the mobile phone number, the type of page visited, and the visit time. Generally I use event tracking: tracking points are written into a log, and then Flume or another component collects the tracking log into Kafka, with the number of partitions set according to the visit volume.
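To make the later steps concrete, here is a minimal sketch of that visit record as a Flink-friendly POJO. The class name, field names and the JSON shape in the comment are assumptions for illustration; the original scheme only says the record carries a phone number, a page type and a visit time.

```java
// One tracking-log line as it might arrive in Kafka (assumed JSON shape):
//   {"phone":"13800000000","pageType":"clothing","visitTime":1702706483000}
public class VisitEvent {
    public String phone;     // visitor's mobile phone number
    public String pageType;  // e.g. "clothing" or "home_appliance"
    public long visitTime;   // when the visit actually happened (epoch millis)

    public VisitEvent() {}   // Flink POJOs need a public no-arg constructor

    public VisitEvent(String phone, String pageType, long visitTime) {
        this.phone = phone;
        this.pageType = pageType;
        this.visitTime = visitTime;
    }
}
```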
Data processing

Data processing is the most critical stage. Here Flink is used to process the Kafka stream, and the work includes filtering, format conversion, grouping, deduplication, adding fields, aggregation and so on. For our requirement it can be divided into several stages:

1. Data filtering. Filtering throws away illegal data; for our requirement that means, for example, dropping records with an empty mobile phone number.

2. Data grouping. Grouping is a relatively important stage because it determines how the data is counted. When grouping, we generally group by the lowest dimension of the data to keep it flexible. If we group by page type first, data of the same page type ends up in the same group.

3. Windowing. The data window is an important feature of real-time processing: since we want statistical results, we must first divide the data stream into batches and then aggregate the data within each batch. Flink's windows are relatively rich, including time windows, tumbling windows and so on. Our requirement is the number of visits within 10 minutes, so we choose a time window here. To keep up with flexible requirements we also need an appropriate window length: what should we do if we later want the number of visits per minute? Generally the window duration should be the smallest granularity needed, so here we choose a 1-minute window.

When choosing a time window, we also need to pay attention to the time standard. There are two concepts to keep apart: event time, the time the event actually occurred, and processing time, the time the message is processed. If a user visits the platform at 7:01:23 but the tracking point passes through Flume and Kafka and only reaches Flink at 7:01:45, the event time is 7:01:23 while the processing time is 7:01:45. If we want to count visits accurately, we need to use event time, and it is worth noting that if event time is the standard, the messages in Kafka must carry a timestamp.

4. Aggregation. When the grouping and the window are set, the data can be aggregated: the grouped data can be reduced directly, or aggregated with count, sum, max or min. Note that if the requirement changes and mobile phone numbers need to be deduplicated, deduplication logic has to be added before aggregating: rely on an intermediate KV store such as Redis, or, if the volume is small, do it inside Flink itself. A sketch of the whole job is shown below.
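As a rough illustration of how these stages line up, here is a minimal Flink job in Java. Everything concrete in it is an assumption: it reuses the VisitEvent POJO above, the "visit-log" topic and broker address are made up, parse() is a hypothetical helper, and it uses the older FlinkKafkaConsumer connector for brevity (recent Flink versions would use KafkaSource). Treat it as a sketch of the scheme, not the author's actual code.

```java
import java.time.Duration;
import java.util.Properties;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class VisitCountJob {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Format conversion: hypothetical JSON-to-POJO helper. A record that fails
    // to parse comes back with an empty phone number and is dropped as illegal.
    static VisitEvent parse(String line) {
        try {
            return MAPPER.readValue(line, VisitEvent.class);
        } catch (Exception e) {
            return new VisitEvent("", "unknown", 0L);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // assumed address
        props.setProperty("group.id", "visit-count");

        // Consume the tracking log that Flume shipped into Kafka
        // ("visit-log" is an assumed topic name).
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer<>("visit-log", new SimpleStringSchema(), props));

        DataStream<VisitEvent> events = raw
                .map(VisitCountJob::parse).returns(VisitEvent.class)
                // Data filtering: drop records with an empty phone number.
                .filter(e -> e.phone != null && !e.phone.isEmpty())
                // Event time: take the timestamp carried in the message itself,
                // tolerating some delay from the Flume/Kafka hops.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<VisitEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((e, ts) -> e.visitTime));

        // Data grouping by the lowest dimension (page type), 1-minute tumbling
        // event-time windows, then a count aggregation per group and window.
        // A real job would sink these counts to Druid/ES instead of printing.
        events.keyBy(e -> e.pageType)
              .window(TumblingEventTimeWindows.of(Time.minutes(1)))
              .aggregate(new CountAgg())
              .print();

        env.execute("visits per page type per minute");
    }

    // Aggregation: count the events in each group-and-window batch.
    static class CountAgg implements AggregateFunction<VisitEvent, Long, Long> {
        public Long createAccumulator()         { return 0L; }
        public Long add(VisitEvent e, Long acc) { return acc + 1; }
        public Long getResult(Long acc)         { return acc; }
        public Long merge(Long a, Long b)       { return a + b; }
    }

    // Deduplication, in case the requirement changes to unique visitors:
    //   events.keyBy(e -> e.phone).filter(new DedupFilter())
    // With a small key space this state fits in Flink; at larger volume the
    // membership check would go to an external KV store such as Redis.
    static class DedupFilter extends RichFilterFunction<VisitEvent> {
        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("seen", Boolean.class));
        }

        @Override
        public boolean filter(VisitEvent e) throws Exception {
            if (seen.value() != null) return false; // this phone was counted
            seen.update(true);
            return true;
        }
    }
}
```

The DedupFilter at the end is the "small volume, keep it in Flink" variant: keying the stream by phone number gives each number its own piece of state, so a number passes the filter only once. At larger volume the same membership check would go to Redis instead, at the cost of an external lookup per record.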