Question
Internet technology has matured, and systems increasingly lean toward decentralized, distributed, and stream computing, which pushes a lot of work that used to be done on the database side over to the Java side. Today someone asked: if the database field has no index, how should we deduplicate by that field? Everyone agreed to do it in Java, but how?
Answer
It suddenly reminded me of an article I wrote earlier about deduplicating a list, so I dug it out and reread it. The approach there was to override the hashCode and equals methods of the objects in the list, throw them into a HashSet, and then take them back out. That is the textbook answer I wrote down when I first learned Java. In interviews, for example, people with three years of Java can recite the difference between Set and HashMap, yet cannot explain how it is implemented. In other words, beginners only memorize the characteristics; when you actually use something in a project, you have to verify it, because recitation proves nothing and only the result can be trusted. You need to know how HashSet actually removes duplicates for you. Think about it: could you deduplicate without a HashSet? The simplest, most direct way is to compare each element against the data already seen and append it to the tail of the result only if it is different. A HashSet merely speeds that check up.
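For the record, a minimal illustration of that idea (my own sketch; the article's real examples with User follow below): keep a history of what has been seen and append only new elements. The contains check here is a linear scan, and that scan is exactly the lookup a HashSet turns into a near-constant-time operation.

// Naive deduplication: compare every element against everything kept so far
List<String> history = new ArrayList<>();
for (String s : Arrays.asList("a", "b", "a")) {
    if (!history.contains(s)) {  // O(n) scan; HashSet.contains would be ~O(1)
        history.add(s);
    }
}
System.out.println(history); // [a, b]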
First, here is the User object to be deduplicated:
@Data
@Builder
@AllArgsConstructor
public class User {
    private Integer id;
    private String name;
}

List<User> users = Lists.newArrayList(
        new User(1, "a"),
        new User(1, "b"),
        new User(2, "b"),
        new User(1, "a"));

The goal is to get the users with distinct ids. To head off any argument, one ground rule: when two entries share an id, it does not matter which one is kept; any one will do.
Use the most intuitive method
The idea is to use an empty list to hold the elements already traversed.
@Test
public void dis1() {
    List<User> result = new LinkedList<>();
    for (User user : users) {
        boolean exists = result.stream()
                .anyMatch(u -> u.getId().equals(user.getId()));
        if (!exists) {
            result.add(user);
        }
    }
    System.out.println(result);
}

Use HashSet
Anyone who has memorized the feature list knows that HashSet deduplicates. But how does it deduplicate? The slightly deeper memorized answer is: by the hashCode and equals methods. And how does it use those two methods? Someone who has never read the source code cannot go any further, and the interview ends here.
In fact, HashSet is implemented on top of HashMap (before reading the source I had always intuitively assumed the opposite, that HashMap's keys were implemented with a HashSet). I won't expand on the details here; the constructor and the add method of HashSet make it clear enough.
public HashSet() {
    map = new HashMap<>();
}

/**
 * Obviously: if the element already exists, add returns false; if not, it returns true.
 */
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}

From this it is also clear that HashSet's deduplication is built on HashMap, and HashMap in turn relies entirely on the hashCode and equals methods. Now the picture is complete: if you want to use HashSet, you have to pay attention to those two methods.
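To make the add contract concrete, a trivial check (my own example, not from the article):

Set<String> set = new HashSet<>();
System.out.println(set.add("a")); // true: "a" was not yet in the backing HashMap
System.out.println(set.add("a")); // false: the duplicate "a" is rejected
System.out.println(set.size());   // 1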
In this problem we need to deduplicate by id, so id is the basis of comparison. The modifications are as follows:
@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (o == null || getClass() != o.getClass()) {
        return false;
    }
    User user = (User) o;
    return Objects.equals(id, user.id);
}

@Override
public int hashCode() {
    return Objects.hash(id);
}

// The core of the hash computation inside Arrays.hashCode:
// result = 31 * result + (element == null ? 0 : element.hashCode());

Objects.hash delegates to Arrays.hashCode, whose core line is shown above. Multiplying by 31 is the same as (x << 5) - x.
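A one-line sanity check of that identity (my own illustration, not from the article):

int x = 7;
// 31 * x == 32 * x - x == (x << 5) - x
System.out.println(31 * x == (x << 5) - x); // true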
The final implementation is as follows:
@Test
public void dis2() {
    Set<User> result = new HashSet<>(users);
    System.out.println(result);
}

Use Java Stream to deduplicate
Going back to the original question: the reason it came up is that moving deduplication from the database side to the Java side means the data volume can be fairly large, say 100,000 rows. For data of that size the Stream API is the most convenient tool, and Stream happens to provide a distinct function. So how should it be used?
users.parallelStream().distinct().forEach(System.out::println);
There is no lambda parameter here, which means no custom criterion can be supplied. Fortunately, the Javadoc states the deduplication criterion:
Returns a stream consisting of the distinct elements
(according to {@link Object#equals(Object)}) of this stream.

We also memorized the related rule: when equals returns true, the two hashCode values must be equal. Recited on its own this sounds slightly confusing, but once you understand how HashMap works it is no longer hard: HashMap first locates the bucket by hashCode, and only then compares with equals.
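To see why both methods matter, here is a small illustration of my own (a hypothetical BadUser class, not from the article) of what goes wrong when equals is overridden but hashCode is not:

class BadUser {
    final Integer id;
    BadUser(Integer id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadUser && Objects.equals(id, ((BadUser) o).id);
    }
    // hashCode() is NOT overridden, so "equal" instances usually hash to different buckets
}

Set<BadUser> set = new HashSet<>();
set.add(new BadUser(1));
set.add(new BadUser(1));
System.out.println(set.size()); // almost certainly 2: the duplicate is not detected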
Therefore, to deduplicate with distinct you must override hashCode and equals, unless the default identity-based behavior happens to be what you want.
So why is that? Click into the implementation and take a look.
<P_IN> Node<T> reduce(PipelineHelper<T> helper, Spliterator<P_IN> spliterator) {
    // If the stream is SORTED then it should also be ORDERED so the following will also
    // preserve the sort order
    TerminalOp<T, LinkedHashSet<T>> reduceOp
            = ReduceOps.<T, LinkedHashSet<T>>makeRef(LinkedHashSet::new,
                                                     LinkedHashSet::add,
                                                     LinkedHashSet::addAll);
    return Nodes.node(reduceOp.evaluateParallel(helper, spliterator));
}

Internally it is implemented as a reduction. The moment you see reduce, a way to implement distinctByKey yourself comes to mind: keep a set of what has already been seen, skip an element if it is present, and add it if it is not. That is, in essence, the same idea as the most straightforward method at the beginning.
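That supplier/accumulator/combiner triple is essentially the mutable-reduction form of Stream.collect, so the same effect can be sketched by hand (a rough equivalent for illustration, not the JDK's actual code path; it deduplicates by equals/hashCode, so it relies on the User overrides above):

// Collect into a LinkedHashSet: add() drops duplicates, insertion order is preserved
LinkedHashSet<User> distinct = users.stream()
        .collect(LinkedHashSet::new, LinkedHashSet::add, LinkedHashSet::addAll);
System.out.println(distinct);

The distinctByKey below applies the same keep-if-not-seen idea per key instead of per element, so it does not require touching equals and hashCode at all: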
@Test
public void dis3() {
    users.parallelStream()
         .filter(distinctByKey(User::getId))
         .forEach(System.out::println);
}

public static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    return t -> seen.add(keyExtractor.apply(t));
}

Of course, with a parallel stream the element that survives is not necessarily the first one; which duplicate is kept is effectively arbitrary.
Of the approaches found, this distinctByKey is the best: it is non-invasive. If you insist on using distinct, the only option is to override hashCode and equals, just as with the HashSet approach.
Summary
Whether you can really use these things only shows when you try them yourself. Otherwise, when you genuinely need them, you either cannot produce them on the spot or you use them at your own risk. And to use them with confidence, you also need to understand the rules and the underlying implementation. For example, how do the implementations of LinkedHashSet and HashSet differ?
Attached is the LinkedHashSet source code, which is short:
public class LinkedHashSet<E>
        extends HashSet<E>
        implements Set<E>, Cloneable, java.io.Serializable {

    private static final long serialVersionUID = -2851667679971038690L;

    public LinkedHashSet(int initialCapacity, float loadFactor) {
        super(initialCapacity, loadFactor, true);
    }

    public LinkedHashSet(int initialCapacity) {
        super(initialCapacity, .75f, true);
    }

    public LinkedHashSet() {
        super(16, .75f, true);
    }

    public LinkedHashSet(Collection<? extends E> c) {
        super(Math.max(2*c.size(), 11), .75f, true);
        addAll(c);
    }

    @Override
    public Spliterator<E> spliterator() {
        return Spliterators.spliterator(this, Spliterator.DISTINCT | Spliterator.ORDERED);
    }
}

Supplement:
Ways to remove duplicate elements from a List in Java
1. Loop over all elements in the list and delete the duplicates
public static List removeDuplicate(List list) {
    for (int i = 0; i < list.size() - 1; i++) {
        for (int j = list.size() - 1; j > i; j--) {
            if (list.get(j).equals(list.get(i))) {
                list.remove(j);
            }
        }
    }
    return list;
}

2. Remove duplicate elements with a HashSet
public static List removeDuplicate(List list) {
    HashSet h = new HashSet(list);
    list.clear();
    list.addAll(h);
    return list;
}

3. Delete duplicate elements in an ArrayList while keeping the order
// Delete duplicate elements in an ArrayList, keeping the order
public static void removeDuplicateWithOrder(List list) {
    Set set = new HashSet();
    List newList = new ArrayList();
    for (Iterator iter = list.iterator(); iter.hasNext();) {
        Object element = iter.next();
        if (set.add(element))
            newList.add(element);
    }
    list.clear();
    list.addAll(newList);
    System.out.println("remove duplicate " + list);
}

4. Iterate over the objects in the list, use list.contains(), and if an element is not already present, put it into a second list.
public static List removeDuplicate(List list) {
    List listTemp = new ArrayList();
    for (int i = 0; i < list.size(); i++) {
        if (!listTemp.contains(list.get(i))) {
            listTemp.add(list.get(i));
        }
    }
    return listTemp;
}